llm-d: Kubernetes-native distributed inferencing

Announcing the llm-d community

May 20, 2025
Robert Shaw, Clayton Coleman - Google, Carlos Costa - IBM
Related topics: Artificial intelligence, Summit 2025
Related products: Red Hat AI


    llm-d is a Kubernetes-native high-performance distributed LLM inference framework—a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

    With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).

    LLM inference goes distributed

    Scaling LLM inference is more complex than typical applications due to unique workload characteristics.

    Why standard scale out falls short

    Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing (Figure 1).

    Figure 1: Deploying a service to multiple vLLM instances.

    This simple pattern is very effective for most application workloads, which have the following characteristics:

    • Requests are short-lived and generally uniform in resource utilization.
    • Requests have generally uniform latency service level objectives (SLOs).
    • Each replica can process each request equally well.
    • Specializing variants and coordinating replicas to process a single request is not useful.

    LLM serving is unique

    The LLM inference workload, however, is unique with slow, non-uniform, expensive requests (Figure 2). This means that typical scale-out and load-balancing patterns fall short of optimal performance.

    Figure 2: Modern HTTP requests (fast, uniform, cheap) handled by a standard microservice load balancer at high QPS, versus LLM requests (slow, non-uniform, expensive) routed by an LLM-aware load balancer at much higher cost per request.

    Let’s take a look at each of these characteristics step by step.

    1. Requests are expensive with significant variance in resource utilization.

    Each LLM inference request has a different “shape” to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads, as illustrated in Figure 3:

    • RAG has long inputs—prompt and retrieved docs—and short generated outputs.
    • Reasoning has short or medium inputs and long generated outputs.
    Figure 3: Comparing the RAG pattern and the Thinking/Reasoning pattern, with prefill and decode stages.

    These differences in request times can lead to significant imbalances across instances, which are compounded as loaded instances get overwhelmed. Overloads lead to longer ITL (inter-token latency), which leads to more load, which leads to more ITL.
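
    To make the imbalance concrete, here is a small, self-contained Python sketch (not llm-d code; the request shapes and cost model are illustrative assumptions) that simulates round-robin versus load-aware routing of non-uniform requests across four replicas:

    ```python
    import random

    random.seed(0)

    # Illustrative request "shapes": (input_tokens, output_tokens). A mix of
    # RAG-like (long input, short output) and reasoning-like (long output) requests.
    requests = [(random.choice([20_000, 8_000, 500]), random.choice([100, 2_000]))
                for _ in range(64)]

    # Crude cost model (assumption): prefill cost scales with input tokens,
    # decode cost with output tokens (weighted more heavily per token).
    def cost(req):
        inp, out = req
        return inp + 10 * out

    def round_robin(reqs, n_replicas=4):
        load = [0] * n_replicas
        for i, r in enumerate(reqs):
            load[i % n_replicas] += cost(r)
        return load

    def least_loaded(reqs, n_replicas=4):
        load = [0] * n_replicas
        for r in reqs:
            idx = load.index(min(load))   # route to the least-loaded replica
            load[idx] += cost(r)
        return load

    for name, fn in [("round-robin", round_robin), ("least-loaded", least_loaded)]:
        load = fn(requests)
        print(f"{name}: max/min replica load = {max(load) / min(load):.2f}")
    ```

    With uniform requests the two strategies behave almost identically; once request cost varies widely, round-robin produces hot replicas, which is exactly the overload-and-ITL feedback loop described above.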

    2. Routing to specific replicas with cached prior computation can achieve orders of magnitude better latency.

    Many common LLM workloads have “multi-turn” request patterns, where the same prompt is sent iteratively to the same instance (Figure 4):

    • Agentic workflows (tool calls form an iterative request flow).
    • Code completion tasks (requests reuse the current codebase as context).
    Figure 4: The agentic pattern sequence, with prefill and decode stages for each pass.

    LLM inference servers like vLLM implement a method called “automatic prefix caching,” which enables “skipping” a significant amount of prefill computation when there is a cache hit (Figure 5). If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.

    Figure 5: Prefix caching: two prompts share a prefix that is computed once, and the KV cache is extended for each individual request.
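
    As a rough illustration of the mechanism (flags and defaults vary by vLLM version, and the model name and prompts are placeholders), automatic prefix caching can be enabled on a single vLLM instance like this:

    ```python
    from vllm import LLM, SamplingParams

    # Automatic prefix caching lets repeated prompt prefixes (for example, a shared
    # system prompt) reuse previously computed KV-cache blocks.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

    shared_prefix = "You are a helpful assistant. " * 50  # stand-in for a long system prompt
    params = SamplingParams(max_tokens=64)

    # The second prompt shares the prefix, so most of its prefill can be skipped,
    # but only if it lands on an instance that already holds that cache, which is
    # why prefix-aware routing matters at the cluster level.
    outputs = llm.generate(
        [shared_prefix + "What is the capital of New York?",
         shared_prefix + "What is the capital of California?"],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text)
    ```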

    3. Specializing and coordinating replicas to process a single request can lead to more throughput per GPU.

    Inference is split into two phases: prefill and decode. Prefill generates the first output token and runs in parallel over all the prompt tokens—this phase is compute bound. Decode generates tokens one at a time by doing a full pass over the model, making this phase memory bandwidth-bound.
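
    A back-of-the-envelope calculation shows why (the numbers are illustrative and use a dense-model approximation that ignores attention FLOPs and KV-cache reads):

    ```python
    # Rough arithmetic-intensity estimate for a dense ~70B-parameter model.
    # Assumptions: ~2 FLOPs per parameter per token and FP16 weights (2 bytes/param),
    # with decode re-reading all weights for every generated token.
    params = 70e9
    flops_per_token = 2 * params
    weight_bytes = 2 * params

    # Prefill: thousands of prompt tokens are processed in one batched pass,
    # so the weights are read once and reused across all of those tokens.
    prefill_tokens = 8_000
    prefill_intensity = flops_per_token * prefill_tokens / weight_bytes

    # Decode: one token per forward pass, so the weights are re-read per token.
    decode_intensity = flops_per_token / weight_bytes

    print(f"prefill: ~{prefill_intensity:,.0f} FLOPs per byte of weights read")  # compute-bound
    print(f"decode:  ~{decode_intensity:,.0f} FLOPs per byte of weights read")   # memory-bound
    ```

    Prefill keeps the arithmetic units busy, while decode is dominated by streaming weights (and KV cache) from memory, so the two phases benefit from different hardware configurations and scaling policies.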

    Standard LLM deployments perform the prefill and decode phases of inference within a single replica. Given that prefill and decode phases of inference have different resource requirements, co-locating these phases on the same replica leads to inefficient resource use, especially for long sequences.

    Disaggregation (e.g., DistServe) separates prefill and decode phases onto different variants, enabling independent optimization and scaling of each phase (Figure 6).

    • Google leverages disaggregated serving on TPUs to provide better first-token latency and simplify operational scaling.
    • DeepSeek released a discussion of the design of their inference system, which leverages aggressive disaggregation to achieve remarkable performance at scale.
    Figure 6: Disaggregation separates the prefill and decode phases: an API server routes prefill requests through a prefill (and expert-parallel) load balancer to the prefill service, and decode requests to a decode service.

    4. Production deployments often have a range of quality of service (QoS) requirements.

    Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples:

    • Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an “in the loop” experience. O(ms) latency tolerance.
    • Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance.
    • Latency tolerant: Video call and email summarization and “deep research” agents with daily or hourly usage patterns. O(minutes) latency tolerance.
    • Latency agnostic: Overnight batch processing workloads, meeting minute generation, and autonomous agents. O(hours) latency tolerance.

    Given the compute intensity (and, therefore, high costs) of LLMs, tight latency SLOs are substantially more expensive to achieve. This spectrum of latency requirements presents an opportunity to further optimize infrastructure efficiency—the more latency tolerant a workload is, the more we can optimize infrastructure efficiency amongst other workloads.
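
    One hypothetical way to express such tiers for scheduling purposes (the tier names, latency targets, and interface below are illustrative, not an llm-d API):

    ```python
    from dataclasses import dataclass

    @dataclass
    class QoSTier:
        name: str
        ttft_target_s: float   # target time-to-first-token in seconds
        preemptible: bool      # may be delayed in favor of tighter tiers

    # Hypothetical tier table mirroring the spectrum above.
    TIERS = {
        "interactive": QoSTier("interactive",    0.2, preemptible=False),  # code completion, search
        "chat":        QoSTier("chat",           2.0, preemptible=False),  # chat agents, drafting
        "tolerant":    QoSTier("tolerant",      60.0, preemptible=True),   # summarization, deep research
        "batch":       QoSTier("batch",       3600.0, preemptible=True),   # overnight batch workloads
    }

    def admit_order(pending: list[tuple[str, str]]) -> list[tuple[str, str]]:
        """Sort pending (request_id, tier) pairs so tighter SLOs are served first."""
        return sorted(pending, key=lambda p: TIERS[p[1]].ttft_target_s)

    print(admit_order([("r1", "batch"), ("r2", "interactive"), ("r3", "chat")]))
    ```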

    Why llm-d?

    To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its Open Source Week, the DeepSeek team published the design of its inference system, which aggressively leverages disaggregation and KV caching to achieve remarkable performance per dollar of compute.

    However, for most gen AI innovators, ML platform teams, and IT operations groups, these benefits remain out of reach. Building and operating a complex, monolithic system is time-consuming and challenging, especially given the rapid pace of innovation and enterprise deployments with tens or hundreds of models for divergent use cases. This complexity risks slower time to market, higher operational costs and sprawl, and difficulty adopting and experimenting with these optimizations.

    Our objective

    The objective of llm-d is to create a well-lit path for anyone to adopt the leading distributed inference optimizations within their existing deployment framework: Kubernetes.

    To achieve this goal, we have the following design principles for the project:

    • Operationalizability: Modular and resilient architecture with native integration into Kubernetes via Inference Gateway API.
    • Flexibility: Cross-platform (active work to support NVIDIA, Google TPU, AMD, and Intel), with extensible implementations of key composable layers of the stack.
    • Performance: Leverage distributed optimizations like disaggregation and prefix-aware routing to achieve the highest tokens/$ while meeting SLOs.

    Architecture

    To achieve this objective, we designed llm-d with a modular and layered architecture (Figure 7) on top of industry-standard open source technologies: vLLM, Kubernetes, and Inference Gateway.

    • vLLM. vLLM is the leading open source LLM inference engine, supporting a wide range of models (including Llama and DeepSeek) and hardware accelerators (including NVIDIA GPUs, Google TPUs, and AMD GPUs) with high performance.
    • Kubernetes (K8s). K8s is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. It is the industry standard for deploying and updating LLM inference engines across various hardware accelerators.
    • Inference Gateway (IGW). IGW is an official Kubernetes project that extends the Gateway API (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for “smart” load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.
    Figure 7: The llm-d architecture: a client sends a request to a Kubernetes Inference Gateway, which routes it to an Inference Pool of prefill and decode model variants with shared and independent prefix caching, managed by an Inference Scheduler and Variant Autoscaler.

    And our key new contributions:

    • vLLM Optimized Inference Scheduler: IGW defines a pattern for customizable “smart” load-balancing via the Endpoint Picker Protocol (EPP). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms needed to make “smart” scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated for out-of-the-box use by llm-d users. Advanced teams can also tweak or implement their own scorers and filters to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.

      For more details, see our Northstar: [PUBLIC] llm-d Scheduler Northstar
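
    To give a feel for the scorer/filter pattern, here is a simplified, hypothetical sketch (not the actual Endpoint Picker Protocol or llm-d scheduler code) of a prefix-cache- and load-aware scoring pass:

    ```python
    from dataclasses import dataclass

    @dataclass
    class Endpoint:
        name: str
        prefix_cache_hit_ratio: float  # estimated fraction of the prompt already cached here
        queue_depth: int               # requests queued on this vLLM replica
        kv_cache_utilization: float    # 0.0 - 1.0

    def filter_endpoints(endpoints: list[Endpoint]) -> list[Endpoint]:
        # Drop replicas whose KV cache is effectively saturated.
        return [e for e in endpoints if e.kv_cache_utilization < 0.95]

    def score(e: Endpoint, w_prefix: float = 2.0, w_load: float = 1.0) -> float:
        # Reward expected prefix-cache reuse, penalize queueing.
        return w_prefix * e.prefix_cache_hit_ratio - w_load * e.queue_depth

    def pick(endpoints: list[Endpoint]) -> Endpoint:
        candidates = filter_endpoints(endpoints) or endpoints
        return max(candidates, key=score)

    pool = [Endpoint("decode-0", 0.8, 3, 0.70),
            Endpoint("decode-1", 0.1, 1, 0.40),
            Endpoint("decode-2", 0.9, 6, 0.97)]
    print(pick(pool).name)
    ```

    A real scheduler combines many more signals (disaggregation role, SLO class, interconnect locality), but this filter-then-score structure is the extension point described above.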

    • Disaggregated serving with vLLM: llm-d leverages vLLM’s recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like NVIDIA NIXL. 

      In llm-d, we plan to support two “well-lit” paths for prefill/decode (P/D) disaggregation:

      • Latency optimized implementation using fast interconnects (IB, RDMA, ICI).
      • Throughput optimized implementation using data center networking.

      For more details, see our Northstar: [PUBLIC] llm-d Disaggregated Serving Northstar
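
    As a rough sketch of the vLLM building block involved (based on vLLM's experimental disaggregated-prefill support; connector names, role conventions, and config fields differ across vLLM versions, so treat this as illustrative rather than the llm-d deployment path):

    ```python
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    # A KV connector lets one vLLM instance hand its computed KV cache to another,
    # so prefill and decode can run on separate instances. Matching instances
    # would be launched with the same connector settings.
    kv_config = KVTransferConfig(
        kv_connector="NixlConnector",  # assumption: NIXL-backed connector name
        kv_role="kv_both",             # role naming differs per connector/version
    )

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        kv_transfer_config=kv_config,
    )

    # On a prefill instance, this call would compute the prompt's KV cache and
    # transfer it over the connector instead of generating the full response.
    print(llm.generate(["Summarize the llm-d announcement."],
                       SamplingParams(max_tokens=32))[0].outputs[0].text)
    ```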

    • Disaggregated prefix caching with vLLM: llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like LMCache. 

      In llm-d, we plan to support two “well-lit” paths for KV cache disaggregation:

      • Independent caching with basic offloading to host memory and disk, providing a zero operational cost mechanism that utilizes all system resources.
      • Shared caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.

      For more details, see our Northstar: [PUBLIC] llm-d Prefix Caching Northstar
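
    A hypothetical sketch of the lookup order those two paths imply (toy data structures only; this is not the llm-d, vLLM, or LMCache API):

    ```python
    from typing import Optional

    class KVCacheIndex:
        """Toy two-tier KV-block index: check local tiers (HBM, host memory, disk)
        first, then a shared global index recording which instance holds the blocks."""

        def __init__(self) -> None:
            self.local: dict[str, bytes] = {}        # prefix hash -> locally held KV blocks
            self.global_index: dict[str, str] = {}   # prefix hash -> owning instance

        def lookup(self, prefix_hash: str) -> Optional[str]:
            if prefix_hash in self.local:
                return "local"                        # independent caching path
            if prefix_hash in self.global_index:
                # Shared caching path: transfer the KV blocks from the owning instance.
                return f"remote:{self.global_index[prefix_hash]}"
            return None                               # cache miss: full prefill required

    index = KVCacheIndex()
    index.local["system-prompt-v1"] = b"..."
    index.global_index["big-doc-123"] = "decode-7"
    for h in ["system-prompt-v1", "big-doc-123", "never-seen"]:
        print(h, "->", index.lookup(h))
    ```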

    • Variant autoscaling over hardware, workload, and traffic: Accelerator hardware varies dramatically in compute, memory, and cost; workloads sharing the same models vary in their required quality of service; the distinct phases of LLM inference and large mixture-of-experts models vary in whether they are compute, memory, or network bound; and incoming traffic varies over time and by workload. Today, all of these decisions are made at deployment time, and almost all deployers struggle to safely enable autoscaling to reduce their costs.

      Drawing on extensive experience from end users and OSS collaborators like AIBrix, we plan to implement a traffic- and hardware-aware autoscaler that:

      • Measures the capacity of each model server instance.
      • Derives a load function that takes into account different request shapes and QoS.
      • Uses the recent traffic mix of QPS (queries per second), QoS, and shape distribution to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, and labels each instance with a grouping.
      • Reports load metrics per grouping so that Kubernetes horizontal pod autoscaling can match the hardware in use to the hardware needed without violating SLOs.

      For more details, see our Northstar: [PUBLIC] llm-d Autoscaling Northstar
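
    A deliberately simplified sketch of the kind of load function and grouping computation described above (all capacities, names, and numbers are hypothetical):

    ```python
    import math
    from dataclasses import dataclass

    @dataclass
    class RequestClass:
        qps: float
        avg_input_tokens: int
        avg_output_tokens: int
        latency_tolerant: bool

    # Hypothetical per-instance capacities, measured offline per hardware type.
    PREFILL_TOKENS_PER_S = 40_000   # prompt tokens/s one prefill instance can absorb
    DECODE_TOKENS_PER_S = 4_000     # output tokens/s one decode instance can sustain

    def required_instances(traffic: list[RequestClass]) -> dict[str, int]:
        """Translate a recent traffic mix into a rough prefill/decode instance count."""
        prefill_load = sum(t.qps * t.avg_input_tokens for t in traffic)
        decode_load = sum(t.qps * t.avg_output_tokens for t in traffic)
        return {
            "prefill": math.ceil(prefill_load / PREFILL_TOKENS_PER_S),
            "decode": math.ceil(decode_load / DECODE_TOKENS_PER_S),
        }

    mix = [
        RequestClass(qps=5, avg_input_tokens=12_000, avg_output_tokens=100, latency_tolerant=False),
        RequestClass(qps=1, avg_input_tokens=2_000, avg_output_tokens=1_500, latency_tolerant=True),
    ]
    print(required_instances(mix))  # {'prefill': 2, 'decode': 1}
    ```

    In practice, per-grouping load figures like these would be exported as metrics so Kubernetes horizontal pod autoscaling can act on them, as the last bullet above describes.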

    Example llm-d features

    llm-d integrates IGW and vLLM together, enabling a high performance distributed serving stack. Let’s discuss some of the example features enabled by llm-d.

    Prefix and KV cache-aware routing

    The first key collaboration between IGW and vLLM in llm-d was developing prefix-cache aware routing to complement the existing KV cache utilization aware load balancing in IGW.

    We conducted a series of experiments to evaluate the performance of the llm-d Inference Scheduler with prefix-aware routing on 2 NVIDIA 8xH100 nodes using the LMbenchmark in a long-input/short-output configuration designed to stress KV cache reuse and routing decision quality. See Table 1 and Figure 8.

     Scenario   Model                Configuration     ISL | OSL      Latency SLO
     S1         Llama 4 Scout FP8    TP2, 2 replicas   20,000 | 100   None
     S2         Llama 4 Scout FP8    TP2, 4 replicas   12,000 | 100   P95 TTFT <= 2s
     S3         Llama 3.1 70B FP16   TP2, 4 replicas   8,000 | 100    P95 TTFT <= 2s
     Table 1: Benchmark scenarios for evaluating the llm-d Inference Scheduler.
    Figure 8: TTFT vs. QPS for each configuration, comparing the baseline and llm-d against the SLO line.

    Key observations:

    • S1: At 4 QPS, llm-d achieves a mean TTFT approximately 3X lower than the baseline (lower is better).
    • S2: llm-d delivers ~50% higher QPS than the baseline while meeting SLO requirements (higher is better).
    • S3: llm-d sustains 2X the baseline QPS under SLO constraints (higher is better).

    These results show that llm-d’s cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLO requirements.

    Try it out with the base.yaml config in our quick start. And as a customization example, see the template for adding your own scheduler filter.

    P/D disaggregation

    We’ve completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D + EP<>DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.

    Try it out with the pd-nixl.yaml config in our quick start.

    Get started with llm-d

    llm-d brings together the performance of vLLM with the operationalizability of Kubernetes, creating a modular architecture for distributed LLM inference that targets high performance on the latest models and agentic architectures.

    We welcome AI engineers and researchers to join the llm-d community and contribute:

    • Check out our repository on GitHub.
    • Join our developer Slack.
    • Try out our quick starts to deploy llm-d on your Kubernetes cluster.

    Please join us. The future of AI is open.
