Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Optimizing distributed AI inference: Advanced deployment patterns

Prefill/decode disaggregation, KV cache strategy, and speculative decoding

June 24, 2026
Fatih E. Nar Yuchen Fama Greg Pereira Yuan Tang
Related topics:
AI inferenceArtificial intelligence
Related products:
Red Hat AIRed Hat AI Inference

    In Designing distributed AI inference: Core concepts and scaling dimensions, we established the groundwork for distributed inference: the prefill/decode split that shapes every deployment decision, and the five dimensions of parallelism—tensor, pipeline, expert, data, and context—that determine how a model maps onto hardware.

    This blog covers the three optimization levers that push past that baseline: prefill/decode disaggregation, key-value (KV) cache strategy, and speculative decoding. Each one trades operational complexity for a measurable improvement in cost, latency, or throughput. We'll walk through each lever starting with the decision rule (when does it pay off?), then the mechanism, and finally the production concerns that tend not to surface until you are running at scale.

    P/D disaggregation as a deployment pattern

    The previous article described disaggregation as a feature of llm-d, but in practice it is a deployment topology (the most consequential one we cover here), and we need to reason about it as such rather than check it off as a capability.

    Flow from a client through Envoy AI Gateway and llm-d scheduler to a compute-bound prefill pool, KV-transfer fabric, and decode pool.
    Figure 1: Separation of prefill and decode phases in a disaggregated inference architecture.

    When to disaggregate

    The decision rule is not based on model size; it depends on measurable observations within the system. Profile a baseline single-pool deployment to measure the ratio of prefill GPU-seconds to decode GPU-seconds on your real traffic. Then compare that to the ratio of decode-optimized to prefill-optimized GPU cost in your environment. If your traffic is prefill-heavy while you pay for decode-class hardware (because decode dominates wall-clock), or vice versa, the gap between those two ratios represents your available savings.

    Disaggregation tends to pay off for long-prompt retrieval-augmented generation (RAG) with short answers (prefill-heavy), high-concurrency chat with short prompts and long answers (decode-heavy), or any fleet large enough that the cost reduction justifies the operational complexity. In our benchmarks, this architecture reduces costs by 25% to 40% on chat- and RAG-shaped traffic. This block aligns with published disaggregation results: Splitwise reports approximately 20% lower cost at 1.4× throughput, and DistServe shows up to 7.4× higher goodput.

    Conversely, disaggregation does not pay off for single-node deployments where the network hop between prefill and decode workers exceeds the savings. It is also inefficient for a fleet small enough that two pools of one worker are worse than one pool of two workers.

    Sizing the two pools

    Prefill workers scale with the arrival rate of new prompts and with prompt-length distribution, while decode workers scale with concurrent active sessions, average token output sizing and target TPOT, and the two scale independently. A useful first cut for a chat workload with mean prompt 800 tokens and mean output 200 tokens at 5,000 concurrent sessions on Qwen3.5-35B-A3B works out to roughly one H100 of prefill capacity per ~30 requests/second of arrival rate, and roughly one decode GPU (L40S-class) per ~150 concurrent sessions, in our lab measurements. These numbers move with model and quantization, but the ratio (typically 1:3 to 1:5 prefill to decode workers for chat) is broadly stable across the workloads we have benchmarked.

    KV-transfer connectors and the data path

    Once the pools are split, the KV cache produced by prefill must reach decode workers, and vLLM exposes a KVConnector interface with three production-relevant implementations.

    Table 1: KV-cache connectors.
    ConnectorRecommended forTransportNotes
    NixlConnectorSingle-cluster, RDMA/NVLink availableNVIDIA NIXL over UCXDefault for high-performance PD; metadata server is a startup single point of failure (SPOF)
    LMCacheConnectorCross-instance cache sharing, HBM → DRAM → NVMe tieringNIXL under the hood, plus offload backendsAdds tiered KV cache + shared prefix index (LMCache, arXiv:2510.09665)
    MooncakeConnectorCluster-scale shared cache poolsRDMA-nativeRecommended when you want a separate KV-cache cluster that many vLLM instances pull from
    MooncakeStoreConnectorTiered cache offloading through use of a distributed master storeCache offloadingOffload tier behind MooncakeConnector; KV lands in a distributed master store rather than peer HBM

    Treat the KV-transfer fabric like any other production data path. Measure end-to-end latency, including queue time, alert on tail latency rather than the mean, and verify RDMA driver health on every node. NIXL's asynchronous send and receive operations provide the right primitive for production only when prefill workers do not block on decode-side acknowledgement.

    The disaggregated KV cache pool

    Instead of viewing the cluster's KV cache as per-worker memory plus a transfer protocol, treat it as a cluster-wide resource with its own scheduling concerns. LMCache implements this approach by tiering data across HBM, DRAM, and NVMe storage tiers while exposing a global prefix index. When two requests share a prefix (such as a system prompt, a few-shot prefix, or the first thousand tokens of a contract), they share KV blocks regardless of which instance generated them.

    The llm-d scheduler

    Cache-aware routing turns disaggregation from a feature into a functional deployment pattern, Instead of distributing requests via round-robin routing across decode workers, llm-d routes a request directly to the worker that contains the warmest KV state for the request's prefix. Published llm-d benchmarks report up to 57× faster time-to-first-token (TTFT) and two times the throughput of round-robin routing under high prefix reuse (using eight pods and 16 H100 GPUs).

    Our internal measurements are more conservative and serve as illustration rather than a guarantee. We observed a 25% improvement on default settings, a two- to three-fold increase in tokens per second per GPU when paired with prefix-cache-hit routing, and three- to five-fold cost-per-token reduction on chat-shaped workloads with high prefix reuse. While live production deployment numbers will vary from these lab-style figures, the overall performance improvements remain consistent across every workload we measured.

    Hybrid GPU-CPU prefill

    A new architectural pattern is emerging in accelerated computing environments. In this configuration, the CPU handles early prefill tasks like embedding lookups and attention preparation, while the GPU processes matrix-multiplication operations. This architecture is not yet a primary recommendation for production environments. However, designing your deployment with a separately scheduled prefill pool is advantageous, because you can change the worker type as the hardware matures without refactoring the underlying system.

    Failure modes that bite in production

    The NIXL metadata server is a single point of failure on startup. Deploy two metadata servers behind a TCP load balancer and verify failover capabilities before initiating the first canary rollout. Decode workers that lag during KV transfers create tail-latency bottlenecks for the entire fleet. Implementing admission control at the API gateway is more effective than retrying failed requests downstream. Canary rollouts for disaggregated fleets must update only one pool at a time. Configure automated rollback gates for both TTFT and time-per-output-token (TPOT) to catch a performance regression in either key performance indicator (KPI) before it affects the cluster.

    KV cache: Tiering, sharing, squeezing

    While PagedAttention solved the fragmentation problem inside a single GPU, the next challenges occur across multiple GPUs and across the cluster. These distributed environments require a different set of tools. The shared key-value (KV) cache hierarchy (Figure 2) illustrates how system components interact across different hardware layers to mitigate memory pressure.

    vLLM workers route requests via an llm-d scheduler to a tiered KV-cache hierarchy of HBM, pinned DRAM, NVMe, and a global prefix index.
    Figure 2: Shared KV cache hierarchy.

    Tiered hierarchy

    Now that we have covered how LMCache manages cache tiering as a cluster-wide resource, we need to consider when to enable this feature. Most enterprise workloads include inactive data prefixes that can fit into DRAM or NVMe storage but exceed your HBM capacity. The evaluation process is straightforward: if your prefix-cache hit rate increases by using a 10 times larger cache than you can afford in expensive onboard memory, implementing a tiering strategy provides a clear performance advantage.

    KV reuse versus prefix sharing

    Developers often confuse prefix sharing with KV cache reuse, but these concepts represent distinct operational concerns that require different configuration settings.

    Prefix sharing allows two separate requests that begin with the same tokens to share a single cache. In contrast, KV cache reuse retains the data from a single request across multiple turns of a live session. Multi-tenant chat platforms benefit from both capabilities. Shared system prompts utilize prefix sharing, while conversation history relies on KV cache reuse. Therefore, prefix sharing is a routing function that directs a new request to the specific worker node hosting the warm prefix. Cache reuse is a session-affinity function that keeps a conversation session linked to the same decode worker.

    Quantization

    Using an 8-bit floating-point (FP8) KV cache halves the memory footprint with a measurable but usually acceptable quality cost on most enterprise tasks. In contrast, a 4-bit floating-point (FP4) cache strategy is more aggressive. Currently, you should deploy FP4 formats only on workloads where you can run an evaluation pass against your own data. Red Hat's LLM Compressor produces quantized variants of Qwen models and validates them against your own evaluation set rather than a generic benchmark.

    Decode kernels

    vLLM's decode path is fast in 2026 due to a combination of multiple architectural advancements. Several open source projects contribute to these speed improvements, including FlashMLA from DeepSeek, ThunderMLA from Stanford, and the PyTorch FlexAttention decode path.

    Platform teams rarely tune these kernels directly. However, knowing which kernel your vLLM build uses helps you investigate unexpected decode regressions after a version upgrade. Managing multi-node fleets introduces challenges with tracking code origins and maintaining consistency across environments. Every prefill, decode, and draft-model replica must load identical, pinned kernel binaries.

    To achieve this consistency, a production runtime compiles kernels from source and includes a software bill of materials (SBOM) directly within the container image. This approach avoids pulling binaries from a public registry on the first request. We discuss this kernel supply-chain trade-off and the GPU Kernel Manager (GKM) bridge for heterogeneous fleets in the blog post What GPU kernels mean for your distributed inference.

    PagedAttention versus RadixAttention

    SGLang's RadixAttention takes a different architectural bet by organizing the cache as a prefix tree. With this design, the structure of conversation guides how the system removes old data. In contrast, PagedAttention treats the cache like virtual memory pages.

    Engineers often debate these architectural trade-offs: RadixAttention excels on workloads with deep, branching prefix trees, including agent workflows and structured prompting. PagedAttention performs better on workloads with irregular prefix patterns and highly varied sequence lengths. vLLM continues to use PagedAttention because it serves as a general-purpose runtime. Its page-table abstraction model scales efficiently across disaggregated architectures, where data naturally moves in pages.

    Speculative decoding

    Speculative decoding generates multiple candidate tokens simultaneously by using a lower-cost draft path. Next, the target model verifies these tokens. When the system accepts these draft tokens, the model emits multiple tokens during a single forward pass. The different categories of speculative decoding techniques that vLLM supports have vastly different operational costs. You can see this token verification loop in action in Figure 3.

    Four speculative decoding methods supported by vLLM: two-model draft-based, self-speculative, multi-token decoding, and native multi-token prediction.
    Figure 3: The speculative decoding token verification loop, where a draft model proposes candidate tokens for verification by the target model.

    Two-model, draft-based (EAGLE family)

    In this deployment strategy, a small draft model proposes tokens that the target then verifies. Benchmarks show that EAGLE-3 can achieve up to a six times speedup on dense models. The newer EAGLE 3.1 framework (released in May 2026) extends these performance gains into long-context workloads. It delivers up to two times the token acceptance length of EAGLE-3, making it an excellent starting choice for dense Qwen3.6 workloads.

    Single-model self-speculative

    In this strategy, the model drafts and verifies its own outputs. The system typically uses a subset of the model's own layers to generate the draft. This approach lowers setup costs because you do not have to train or host a separate draft model. However, this setup results in a more variable acceptance rate that fluctuates based on your workload.

    Multi-token decoding (Medusa heads)

    Adding multiple output heads to the target model with the Medusa architecture lets the engine predict several tokens per pass. This technique requires roughly half the engineering cost of EAGLE-3. However, token acceptance rates are correspondingly more modest, typically ranging from 0.55 to 0.70.

    Interleaved decode and constrained-decoding interactions

    Less a technique than a scheduling choice, vLLM interleaves spec-decoded sessions with normal decode steps to keep batches full. There is one important caveat: when the workload uses constrained decoding (such as JSON mode or tool calls with grammars), the acceptance rate of speculative decoding often collapses because the constraint mask invalidates speculative tokens. Be sure to measure performance before assuming speculative decoding helps in tool-calling traffic.

    Multi-token prediction (MTP)

    Some models, notably DeepSeek-V3, ship with MTP heads trained jointly with the main model, and acceptance rates can exceed 80% out of the box. This efficiency makes MTP a natural choice for any model that ships MTP-trained, but you cannot add it to other models without re-training them.

    Table 2: Speculative decoding options versus workloads.
    WorkloadAnchor modelRecommendedWhy
    Short conversational, denseQwen3.6-27BEAGLE 3.1Best accept-rate/cost ratio, current generation
    Long-context (>64k)Qwen3.5 dense or MoEEAGLE 3.1Long-context acceptance is the headline improvement
    MoE flagshipQwen3.5-397B-A17BNative MTP if trained; else EAGLE-3Active-param shape favors MTP-style heads
    Code completionQwen3.6 densen-gram / prompt-lookupRepetitive code structure makes hit rate high; no draft to host
    Strict memory budgetanyn-gramNo draft model to host on decode GPU
    Heavy tool-calling trafficanyDisable or testConstrained decoding interaction is severe

    Note

    The draft model occupies decode-worker HBM and adds verification overhead. On already-saturated, large-batch decode fleets, the performance gain from speculative decoding shrinks because the batch already amortizes the kernel launch. At the extreme, this overhead can result in a net loss. The clearest advantages occur at low-to-moderate concurrency, where each forward pass is underutilized.

    Putting inference optimization levers into practice

    These three levers—disaggregation, cache architecture, and speculative decoding—are where most cost and latency improvements live once the parallelism layout is set. However, they are still individual mechanisms. The real deployment question is how they work together for a specific traffic shape, and what to do if performance regresses in production.

    In the next blog post, we'll put the techniques from parts 1 and 2 into practice with concrete deployment blueprints for six traffic profiles. We also walk through inference troubleshooting recipes for the failures that recur most often at scale, and lay out a structured growth path from a single-node baseline to a multi-model AI grid.

    Related Posts

    • Designing distributed AI inference: Core concepts and scaling dimensions

    • llama.cpp vs. vLLM: Choosing the right local LLM inference engine

    • How speculative decoding delivers faster LLM inference

    • Intelligent inference scheduling with llm-d on Red Hat AI

    • How to prevent AI inference stack silent failures

    • What GPU kernels mean for your distributed inference

    Recent Posts

    • Optimizing distributed AI inference: Advanced deployment patterns

    • Beyond regex: Harvesting security logic with LLMs

    • Build a Red Hat Enterprise Linux EUS image with image-builder CLI

    • Connect EvalHub to protected production model servers

    • Building a custom Red Hat Enterprise Linux kernel for NVIDIA DGX Spark

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.