Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Intelligent inference scheduling with llm-d on Red Hat AI

Intelligent inference scheduling with llm-d

June 11, 2026
Edoardo Vacchi Madhu Goutham Reddy Ambati
Related topics:
AI inference
Related products:
Red Hat AIRed Hat AI InferenceRed Hat OpenShift AI

    If a 100-token prompt and a 100,000-token prompt both count as one request to your load balancer, then your GPUs are already doing work they don't need to do.

    Traditional round-robin balancing treats large language model (LLM) inference like stateless web traffic. But LLM requests are not stateless: They carry context, they benefit from cached computation, and they can vary in cost by orders of magnitude. A load balancer that ignores all of this leaves performance on the table.

    The open source llm-d project implements a high-performance distributed inference serving stack that provides a number of ready-made recipes to bring your workloads from zero to production.

    In this blog post, we show how llm-d routes each inference request to the GPU that already has the relevant data cached. It can also factor in queue depth and real-time load.

    In upstream benchmarks on shared-prefix workloads, it cut time-to-first-token (TTFT) by more than 99% and more than doubled throughput, without changing hardware. We'll show how the Red Hat stack packages this neatly into a single Kubernetes resource, so you get intelligent scheduling without wiring the plumbing yourself.

    Why round-robin breaks down for LLMs

    Traditional load balancers assume requests are fast, uniform, and stateless. LLM inference breaks all three.

    • Requests are slow: A single chat completion can run for 60 seconds while the model generates tokens one at a time. The GPU stays busy the entire time. Round-robin sends the next request to the next pod whether or not that pod is already saturated.
    • Requests are not uniform: A 100-token prompt and a 100,000-token prompt both look like "one request" but the longer one needs orders of magnitude more computation during the prefill phase. Without visibility into request size, the load balancer cannot spread work evenly.
    • Requests benefit from state: LLM engines like vLLM cache the key-value (KV) tensors (attention state) computed during prefill. If the next request shares the same prefix, the engine can skip that computation entirely. Round-robin scatters requests across all pods, so the same prefix gets recomputed from scratch on every replica.

    Suppose four replicas are serving a chatbot. Every user's request starts with the same system prompt. Round-robin computes that prompt four separate times. With intelligent scheduling, it gets computed once and reused (see figure 1).

    Round-robin compared to intelligent scheduling.
    Figure 1: Round-robin compared to intelligent scheduling.

    How llm-d routes requests

    llm-d fixes this by adding a scheduling layer called the endpoint picker (EPP). Before each request is forwarded, the EPP checks every model server pod to find the best target, starting with which pods already have this request's prefix cached.

    The EPP is built on Gateway API Inference Extension, a Kubernetes SIG project. It works with standard gateways like Istio or Envoy Gateway through the ext-proc protocol, so it plugs into infrastructure you may already have. Clients keep using the same inference API. The scheduling happens transparently.

    The routing decision follows a three-stage pipeline,  as illustrated in figure 2:

    The EPP scheduling pipeline: Filter, score, pick.
    Figure 2: The EPP scheduling pipeline: Filter, score, pick.
    1. Filter: Remove pods that cannot serve this request (unhealthy, out of memory or serving a different model).
    2. Score: Rate each remaining pod on configurable dimensions. By default, the scorer checks how much of this request's prefix each pod already has cached.
    3. Pick: Select the pod with the highest score. Ties are broken randomly.

    All three stages are pluggable. You can swap scorers, add new filters, and tune weights to match your workload.

    Prefix caching: where the biggest gains come from

    LLM inference is divided into two phases:

    1. Prefill: Compute-heavy, computes KV tensors for every token at once, and it emits the first token
    2. Decode: Memory bandwidth-bound, generates only one token at a time

    Prefill dominates time-to-first-token, especially for long prompts. For instance, the vLLM inference server stores the resulting KV tensors in GPU memory; this is called the KV cache. With prefix caching, vLLM goes one step further. When a new request shares the same token prefix as a previous one, it reuses those cached tensors across requests, skipping the expensive prefill for the shared portion.

    This works well on a single instance. But in a distributed deployment, standard routing scatters requests across pods, so related requests rarely land where the prefix is cached. llm-d's EPP closes that gap: it routes each request to the pod that already holds the relevant prefix, so vLLM's prefix caching actually fires.

    By default, the EPP tracks its own routing decisions: it hashes incoming prompts into blocks and compares them against what it has previously routed to each pod. This lightweight approach is fast enough to make a useful routing decision on every request.

    Three workload patterns benefit the most:

    • Shared system prompts: Every request starts with the same instructions. Cache-aware routing sends them to the same pod.
    • Multi-turn conversations: Each turn resends the full history. Only the latest user message is new.
    • Agentic tool loops: Tool-calling agents resend tool descriptions with every iteration. Cache-aware routing avoids recomputing them.

    To prevent a single pod from becoming a hotspot, the EPP has two safeguards. It periodically routes a fraction of requests away from the cache-warm pod to seed the prefix into additional caches. And if a cache-warm pod's predicted latency is much worse than a less-loaded pod, the EPP breaks affinity. A cache miss is cheaper than waiting behind a long queue.

    This approach adds no extra latency to the routing decision. For workloads with high cache churn, llm-d also offers a precise prefix cache mode that introspects each vLLM instance for actual KV cache entries. Precise mode is expected to become the default in a future release.

    Add load-aware scheduling

    The default profile scores pods using prefix-cache-scorer and queue-scorer,with prefix-cache weighted higher. For workloads that also need load balancing, you can add a load-aware-scorer to the profile. This scorer penalizes pods with longer request queues, so work spreads more evenly across the pool.

    When both scorers are active, the EPP combines their outputs into a single weighted score per pod. A chatbot profile might weight prefix-cache-scorer higher to favor cache reuse; a batch processing profile might weight load-aware-scorer higher to favor shorter queues. Additional scorers like session-affinity-scorer, active-request-scorer and context-length-aware are available for specialized use cases. See the scheduling architecture documentation for the full list.

    The llm-d team ran the same shared-prefix workload on the same hardware with two configurations: Standard Kubernetes Service routing and the llm-d inference scheduler.

    Setup

    8 vLLM replicas on 16 H100 GPUs (2 GPUs each, Tensor Parallelism = 2), Qwen/Qwen3-32B, shared-prefix synthetic workload (150 groups, 5 prompts per group) at 60 requests per second.

    • Time-to-first-token: 35 seconds to 120 milliseconds. Prefix cache hits skip the expensive prefill entirely (more than 99% reduction).
    • Throughput: More than doubled (+151%). GPUs spend their cycles generating new tokens rather than recomputing cached prefixes.
    • Request latency: Down 35%. The scheduler spreads load based on real-time pod state, so no single pod becomes a bottleneck.

    The more prefix sharing your workload has, the bigger the improvement. Workloads with mostly unique prompts still benefit from load-aware scheduling, but the prefix caching gains are smaller.

    The upstream llm-d guide includes a reproducible test harness so you can run these benchmarks on your own cluster. For more on how scorer choice affects different workload profiles, read the full details on the llm-d blog.

    What else llm-d provides

    Prefix-cache-aware and load-aware scheduling are the most common configurations. But for production deployments with tighter requirements, llm-d provides additional capabilities:

    • Predicted latency scheduling: An XGBoost model trained on live traffic predicts per-pod time-to-first-token (TTFT) and time-per-output-token (about 5% mean absolute percentage error (MAPE)). Requests can include service level objective (SLO) headers like x-slo-ttft-ms to route against latency targets. See Predicted latency-based scheduling for LLMs.
    • Flow control: Priority bands, fairness policies and backpressure for multi-tenant deployments. See the Flow control configuration guide.
    • Prefill/decode disaggregation: Split the compute-heavy prefill phase from the memory-bandwidth-bound decode phase onto different pod pools. See the Prefill/decode disaggregation deployment guide.

    The following sections walk through deploying llm-d on two platforms: Red Hat OpenShift AI 3.4 and Red Hat AI Inference on third-party managed Kubernetes services like Azure Kubernetes Services (AKS) or CoreWeave Kubernetes Services (CKS). The scheduling configuration is the same on both; only the installation path differs.

    Note: The benchmarks described previously used 16 H100 GPUs with Qwen3-32B. The following deployment examples use a smaller model (Qwen3-8B-FP8-dynamic) with fewer replicas to lower the hardware barrier. You will see the same scheduling behavior (prefix cache hits, intelligent pod selection) but the absolute numbers will differ.

    Deploy on Red Hat OpenShift AI

    Deploying llm-d upstream means orchestrating multiple helm charts for the inference gateway, the EPP, and the model servers, then wiring HTTPRoutes and configuring your gateway provider manually.

    Red Hat OpenShift AI 3.4 collapses all of that into a single resource. The LLMInferenceService controller handles the EPP deployment, gateway wiring, and model server lifecycle automatically. It plugs into OpenShift's built-in monitoring stack, so Prometheus scraping and dashboard access work without extra configuration.

    For architecture context on how KServe and llm-d work together, see Best of Both Worlds: Cloud-Native AI Inference at Scale using KServe and llm-d and Production-grade LLM inference at scale with KServe, llm-d and vLLM.

    Prerequisites

    Before deploying, your cluster needs:

    • OpenShift 4.19.9 or later with GPU worker nodes
    • Red Hat OpenShift AI 3.4 installed with model serving enabled (installation guide)
    • The openshift-ai-inference gateway created and the LeaderWorkerSet operator installed by a cluster admin (distributed inference setup)
    • User workload monitoring enabled (monitoring configuration)
    • A Hugging Face token with access to the model you want to serve (this example uses RedHatAI/Qwen3-8B-FP8-dynamic)

    Step 1: Create the inference service

    Save the following YAML to a file (for example, qwen3-8b.yaml):

    apiVersion: serving.kserve.io/v1alpha1
    kind: LLMInferenceService
    metadata:
      name: qwen3-8b
      namespace: my-inference-ns
    spec:
      replicas: 2
      model:
        uri: hf://RedHatAI/Qwen3-8B-FP8-dynamic
        name: RedHatAI/Qwen3-8B-FP8-dynamic
      template:
        containers:
          - name: main
            resources:
              limits:
                cpu: "4"
                memory: 32Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 16Gi
                nvidia.com/gpu: "1"
      router:
        route: {}
        gateway: {}
        scheduler: {}

    Then create the namespace and apply it:

    $ oc new-project my-inference-ns
    $ oc apply -f qwen3-8b.yaml

    Leaving route, gateway, and scheduler empty tells Red Hat OpenShift AI to use the operator-managed defaults. The controller creates the model server pods, deploys the EPP alongside them, and wires the HTTPRoute through the inference gateway.

    Step 2: Verify

    $ oc get llmisvc qwen3-8b -n my-inference-ns

    Wait for the Ready column to show True.

    Deploy on third-party managed Kubernetes services

    The same llm-d stack runs on managed Kubernetes platforms like Azure Kubernetes Service and CoreWeave without OpenShift. The scheduling configuration is identical. The deployment path is different: a single Helm chart replaces the Operator Lifecycle Manager (OLM) and the OpenShift operator lifecycle.

    Important: Distributed Inference with llm-d on managed Kubernetes is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

    For more information about the support scope of Red Hat Technology Preview features, see Technology Preview features support scope.

    The dedicated managed Kubernetes (xKS) deployment guide is planned for publication when the feature is generally available.

    Prerequisites

    Before deploying, your cluster needs:

    • Kubernetes 1.33+ on AKS or CoreWeave with GPU nodes
    • Helm 3.17+ with Open Container Initiative (OCI) support
    • Registry authentication for registry.redhat.io and quay.io with a pull secret file (~/pull-secret.json)
    • Red Hat AI Inference Server early access credentials from Red Hat

    Install Red Hat AI Inference :

    $ helm registry login registry.redhat.io
    $ helm upgrade rhaii oci://quay.io/rhoai/rhai-on-xks-chart \
        --install --create-namespace \
        --namespace rhaii \
        --set azure.enabled=true \
        --set-file imagePullSecret.dockerConfigJson=~/pull-secret.json

    For CoreWeave, replace --set azure.enabled=true with --set azure.enabled=false --set coreweave.enabled=true. The chart installs the Red Hat AI Inference Stack operator, KServe, cert-manager, Istio, and LeaderWorkerSet automatically in two phases. Deployment takes 5-10 minutes.

    After installation, configure the inference gateway following the procedure in your Red Hat AI Inference documentation.

    Step 1: Create the inference service

    The LLMInferenceService spec is the same as on OpenShift, with one difference: Managed Kubernetes has no cluster-wide pull secret, so you must specify imagePullSecrets explicitly. Save the following YAML to a file (for example, qwen3-8b.yaml):

    apiVersion: serving.kserve.io/v1alpha1
    kind: LLMInferenceService
    metadata:
      name: qwen3-8b
      namespace: my-inference-ns
    spec:
      replicas: 2
      model:
        uri: hf://RedHatAI/Qwen3-8B-FP8-dynamic
        name: RedHatAI/Qwen3-8B-FP8-dynamic
      router:
        scheduler:
          template:
            imagePullSecrets:
            - name: rhai-pull-secret
            containers:
            - name: main
            - name: tokenizer
        route: {}
        gateway: {}
      template:
        imagePullSecrets:
        - name: rhai-pull-secret
        containers:
        - name: main
          resources:
            limits:
              cpu: "4"
              memory: 32Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "2"
              memory: 16Gi
              nvidia.com/gpu: "1"

    Then create the namespace, the pull secret, and apply it:

    $ kubectl create namespace my-inference-ns
    $ kubectl create secret docker-registry rhai-pull-secret \
      --from-file=.dockerconfigjson=$HOME/pull-secret.json \
      -n my-inference-ns
    $ kubectl apply -f qwen3-8b.yaml
    

    The scheduler.template.containers stubs (main and tokenizer) are required when imagePullSecrets is set. Without them, the controller produces an empty containers list.

    Step 2: Verify

    $ kubectl get llmisvc -n my-inference-ns

    Wait for READY to show True.

    Send traffic and observe

    Now it's time to test the setup.

    Port-forward to the inference gateway

    On Red Hat OpenShift AI:

    $ oc port-forward svc/openshift-ai-inference-openshift-default \
      -n openshift-ingress 9090:443 &
    $ export TOKEN=$(oc whoami -t)
    $ export NO_PROXY=localhost,127.0.0.1

    On third-party managed Kubernetes services:

    $ kubectl port-forward svc/inference-gateway-istio \
      -n redhat-ods-applications 9090:80 &

    Send a shared-prefix workload

    Send a few requests that share the same system prompt. On OpenShift AI:

    $ for i in $(seq 1 5); do
      curl -sk https://localhost:9090/my-inference-ns/qwen3-8b/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer ${TOKEN}" \
        -d '{"model":"RedHatAI/Qwen3-8B-FP8-dynamic","messages":[{"role":"system","content":"You are a helpful assistant for Kubernetes operations."},{"role":"user","content":"How do I check pod status?"}],"max_tokens":100}' &
    done
    wait

    On third-party managed Kubernetes:

    $ for i in $(seq 1 5); do
      curl -s http://localhost:9090/my-inference-ns/qwen3-8b/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model":"RedHatAI/Qwen3-8B-FP8-dynamic","messages":[{"role":"system","content":"You are a helpful assistant for Kubernetes operations."},{"role":"user","content":"How do I check pod status?"}],"max_tokens":100}' &
    done
    wait

    The first request runs the full prefill for the system prompt. The four that follow land on the same pod with a cache hit, so they skip prefill and return a faster first token. You should notice the first response taking noticeably longer than the rest. That gap is the scheduling at work.

    Observe

    Key signals to watch:

    • Prefix cache hit rate: Confirms the EPP is routing to pods with cached prefix data.
    • Running requests per pod: Shows whether load is distributed. Uneven distribution may mean one pod is bottlenecked or scorer weights need tuning.
    • Time-to-first-token: The metric that should improve most compared to round-robin.

    On Red Hat OpenShift AI, these metrics are available in the dashboard and through Prometheus. On third-party managed Kubernetes, scrape them directly from Prometheus.

    Try it

    Adapt the LLMInferenceService example to your model and GPU profile. For complete instructions including authentication, GPU node configuration and gateway creation:

    • On OpenShift AI 3.4: Follow the distributed inference documentation.
    • On third-party managed Kubernetes: Follow the xKS documentation.
    • Upstream on Kubernetes: The optimized baseline guide walks through a full deployment with helmfile, 8 vLLM replicas, and a benchmark harness you can run yourself.

    llm-d supports NVIDIA, AMD, Intel XPU, Intel Gaudi, Google TPU, and CPU-only simulation deployments. The project is open source at github.com/llm-d/llm-d.

    Next steps

    Precise prefix caching is expected to become the default scheduling profile. Heterogeneous GPU support and deeper service level objectives (SLO) integration (request shedding, priority escalation) are in active development upstream. These improvements are expected to be included in future Red Hat AI releases as they mature. Follow the llm-d blog and GitHub releases for updates.

    That 100,000-token prompt does not have to mean a 35-second wait.

    Related Posts

    • Combining KServe and llm-d for optimized generative AI inference

    • Accelerate multi-turn LLM workloads on OpenShift AI with llm-d intelligent routing

    • Introduction to distributed inference with llm-d

    • llm-d: Kubernetes-native distributed inferencing

    Recent Posts

    • Add automated AI evaluations to your CI/CD pipeline

    • Configure input guardrails for an OpenShift AI voice agent

    • Intelligent inference scheduling with llm-d on Red Hat AI

    • What's new in Red Hat Ansible Automation Platform 2.7

    • Building and running Bazel applications on AutoSD: Toolchains, containers, and recommended practices

    What’s up next?

    Red Hat AI Inference Server

    Red Hat AI Inference

    Move larger models from code to production faster with an end-to-end...

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.