Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Designing distributed AI inference: Core concepts and scaling dimensions

Understanding prefill, decode, and 5D parallelism

June 22, 2026
Fatih E. Nar Yuchen Fama Greg Pereira Yuan Tang
Related topics:
AI inferenceArtificial intelligence
Related products:
Red Hat AIRed Hat AI Inference

    Choosing a model-serving engine like vLLM is a crucial first step for enterprise AI inference, but it's only the foundation. By itself, a runtime does not dictate how a service scales, operates, or balances cost against performance in production.

    Think of this blog post as the mental models you need before you touch a config file. We will establish the core vocabulary of inference, break down the inherent structural tension between token generation phases (prefill versus decode), and map out the dimensions of parallelism required to control your infrastructure's baseline limits.

    In this three-part series, we design the serving blueprint by mapping a model's needs, a hardware budget, and service-level objectives onto the deployment manifests that fit the actual workload. No single blueprint fits every case, because the workload is shaped by the business behind it:

    • How many requests arrive at once
    • How long the prompts and contexts run
    • The subject-matter domain
    • How much reasoning each request demands

    We map business needs to a few short-listed major KPIs: time to first token (TTFT), time per output token (TPOT, or inter-token latency), throughput in requests per second and tokens per second, GPU utilization, and key-value (KV) cache hit rate. These trade against each other constantly, so most deployment decisions are about which trade you can tolerate for which workload, and for what business benefit. The structural trade-offs of this framework are mapped out in Figure 1.

    Four connected quadrants detailing workload signature, KPI priority, runtime layout, and production controls for AI deployment.
    Figure 1: The distributed inference decision framework.

    Three key things have developed in the ecosystem since the blog post Why vLLM is the best choice for AI inference today was published:

    • The llm-d project's acceptance into the CNCF Sandbox gives it a clearer open-governance path and makes it easier to evaluate in environments where vendor neutrality, community stewardship, and long-term ecosystem alignment matter.
    • KV cache management has become one of the most active areas of innovation in distributed inference. Projects such as NIXL, LMCache, and Mooncake are competing and converging around different approaches to KV cache transfer, reuse, and offload. For platform teams, this means cache architecture is now a first-class design decision: it affects latency, network requirements, failure handling, observability, hardware affinity, and the sizing of prefill and decode pools.
    • Speculative decoding is moving from research technique to practical optimization path. EAGLE 3.1, a collaboration between the EAGLE team, the vLLM team, and TorchSpec, landed in late May 2026 with materially better long-context acceptance length than EAGLE-3. It makes speculative decoding more relevant for long-document workloads that were previously borderline, while leaving the core engineering question intact: whether the acceptance rate, memory overhead, batching behavior, and operational complexity justify enabling it for a given deployment.

    Note

    The high-value AI models we use for our examples throughout the article (based on our industry-specific assessments) are Alibaba's Qwen3.5 and Qwen3.6 families, in both dense (Qwen3.6-27B) and mixture-of-experts (Qwen3.5-35B-A3B and Qwen3.5-397B-A17B) variants. These models are useful anchors because they span a realistic enterprise serving spectrum: smaller dense models for general-purpose latency-sensitive workloads, mid-sized MoE models for cost-efficient throughput, and larger MoE models for high-capability workloads that stress routing, memory, and distributed execution.

    Prefill/decode: Two phases, two KPIs

    LLM inference is two workloads pretending to be one.

    Prefill processes the input prompt and populates the KV cache. Because the prompt tokens can be processed in parallel, this phase is dominated by dense matrix operations and tends to be compute-bound. It is the primary driver of time to first token (TTFT), especially for long prompts and retrieval-augmented generation workloads.

    Decode generates the response token by token, repeatedly attending over the accumulated KV cache as each new token is produced. This phase tends to be memory-bandwidth-bound: it stresses HBM capacity and bandwidth more than raw compute, and it is the primary driver of time per output token (TPOT), also called inter-token latency. Figure 2 details the contrasting dynamics and resource bottlenecks of these two foundational phases.

    Three columns detailing compute-bound prompt prefill processing, GPU resource interference, and sequential memory-bandwidth-bound token decode phase.
    Figure 2: Prefill versus decode.

    The two phases interfere whenever they share a GPU, because the batching strategy that helps one hurts the other. Aggressive batching accumulates enough prompt tokens to use GPU compute efficiently, but a large prefill batch can occupy the worker long enough to delay the decode steps queued behind it, raising TPOT and making streaming feel uneven. Prioritizing decode is the opposite failure: it protects inter-token latency but leaves prefill under-batched, wasting compute and raising the cost of long prompts.

    The business-level symptoms follow directly: a fleet tuned only for decode feels responsive in chat while burning capacity on prefill-heavy work, and a fleet tuned only for prefill posts healthy aggregate tokens per second while making interactive users wait too long for the first token. This prefill/decode split is a foundational lens for much of what follows, starting with disaggregation in section 4.

    That leads to an architectural fork. The first path is to run prefill and decode together on the same homogeneous workers and let the scheduler arbitrate between them. This is the default shape for single-GPU vLLM deployments, and it remains the right answer for many small and medium-sized fleets because it is simpler to operate and avoids KV cache transfer overhead.

    The second path is to split them across heterogeneous pools and route each request through both. The threshold for that split is not a model size or a token count, but a measurable imbalance between the two phases—one large enough that the savings from right-sizing each pool exceed the cost of moving the KV cache between them.

    Table 1: Prefill/decode attributes.
    PhaseBound byScales withRight-sized hardwareKPI it drives
    PrefillCompute (FLOPs)Prompt lengthH100/H200, MI300X, B100/B200/B300TTFT
    DecodeMemory bandwidthConcurrent active sessionsA100, L40S, MI300, older H100TPOT

    The 5D parallelism

    Modern LLM serving spans a five-dimensional design space, illustrated in Figure 3:

    • Tensor parallelism
    • Pipeline parallelism
    • Expert parallelism
    • Data parallelism
    • Context (sequence) parallelism, which often splits into prefill context parallel (PCP) and decode context parallel (DCP).

    While the previous blog post introduced the first four dimensions, context parallelism has become a first-class fifth dimension because Qwen3.5's native 262,000-token context makes it unavoidable for any serious long-context deployment.

    5 columns detailing matrix, layer, expert, replica, and sequence splits, followed by common distributed architectural combinations.
    Figure 3: 5D parallelism.

    Tensor parallelism

    Tensor parallelism (TP) splits each weight matrix across GPUs and runs an all-reduce per layer. This is latency-sensitive within a node and communication-heavy across nodes, so NVIDIA NVLink and NVSwitch are effectively the budget that decides how far TP can scale before the all-reduce dominates. For dense models, the general best practice is to try the quantized version first, then use the minimum TP that fits the model with adequate KV headroom, then scale out with DP. The exception is a stringent single-request TTFT service-level objective (SLO), which can justify higher TP. For Mixture of Experts (MoE) models, where DP ranks coordinate on the expert layers, the TP/DP split differs because it determines how the experts are distributed.

    Pipeline parallelism

    Pipeline parallelism (PP) splits the model by layer ranges and passes activations between stages. Because it exchanges activation tensors point-to-point at the stage boundaries rather than running an all-reduce every layer, it is far more tolerant of limited inter-node bandwidth than TP, and it can run over Ethernet-class fabrics where cross-node TP would stall.

    However, that tolerance is relative, not absolute: the handoffs still sit on the critical path, the activation volume grows with batch size, sequence length, and hidden dimension, and any added link latency widens the pipeline bubbles that form when one stage waits on its predecessor. Micro-batching exists to hide those bubbles by keeping every stage fed.

    All of this makes PP useful when a model is too large to fit one node and the inter-node fabric isn't fast enough for cross-node TP, but it is a last resort, not a default for large models: quantize first, then try EP+DP on one node, and reach for PP only when the weights genuinely don't fit.

    Expert parallelism

    Expert parallelism (EP) distributes the experts of an MoE model across GPUs. Because each token is routed only to its assigned expert subset, traffic becomes asymmetric and bursty, and one or two popular experts can become tail-latency anchors for the whole fleet under skewed routing. Expert Parallel Load Balancing (EPLB) can replicate hot experts and rebalance routing dynamically, but it is not free: under stable routing patterns the rebalancing overhead can exceed the benefit, so enable EPLB when monitoring confirms persistent imbalance rather than as a blanket default. EP has two further costs.

    • Couples DP ranks that would otherwise be independent, because the expert layers synchronize on every forward pass even when some ranks have fewer requests.
    • Requires all-to-all traffic that scales with the number of active experts and the batch size.

    Data parallelism

    Data parallelism (DP) runs full model replicas behind a load balancer for linear throughput scaling with no model-sharding complexity. It is a common starting point when the model fits one node and the bottleneck is concurrent request volume; KServe's ReplicaSet-based serving handles this case directly. DP alone does not set the ceiling, though: the parallelism configuration and routing strategy determine the real performance limit. Common pitfalls include:

    • Stacking more DP replicas without checking for API-server bottlenecks
    • Ignoring KV cache hit rate and MoE synchronization overhead
    • Underestimating multi-node network costs

    Context parallelism

    Context parallelism (CP) shards the sequence (context) dimension across GPUs, so that a single request's context no longer has to fit, or be computed, on one device. Long context stresses prefill and decode in different ways, and the two phases carry different SLOs, so CP is applied separately to each as PCP and DCP.

    Prefill context parallel

    Prefill context parallel (PCP) targets TTFT. During prefill, attention cost grows with the square of the prompt length, so a long prompt can dominate time to first token. PCP partitions the sequence across devices and lets each one compute attention over its own chunk in parallel, which splits the prefill compute across GPUs and brings TTFT down. The cost is hardware: PCP expands the world size and runs in its own communication domain, so the device count becomes TP × PCP. Reach for it when prefill compute is what dominates TTFT and the GPU budget allows.

    Decode context parallel

    Decode context parallel (DCP) targets throughput. During decode, the bottleneck is KV cache capacity: the more KV tokens you can hold, the larger the batch, and the higher the throughput. DCP shards the KV cache along the sequence dimension across the GPUs already in the tensor-parallel group, reusing the TP communication domain, so it needs no extra devices. 

    It is most valuable for models with few KV heads, such as Qwen3.5: under plain TP, when there are fewer KV heads than TP ranks, the KV cache is replicated across ranks and wastes HBM, and DCP removes that duplication and hands the freed memory back to the batch. Decode context parallel is supported in vLLM for both MLA and GQA models, and some attention backends can combine it with multi-token prediction (MTP) to accelerate decoding further.

    Because DCP costs no additional GPUs and directly reduces KV duplication, it is the one most deployments should reach first. PCP is a more resource-intensive option, which you can add when a long prefill exceeds your TTFT budget.

    Table 2: Prefill/decode AI model versus hardware versus parallelism options.
    ModelHardware/precisionSuggested layoutNotes
    Qwen3.6-27B (dense)1×H100, FP8TP=1~27 GB weights; single-GPU baseline
    Qwen3.6-27B (dense)8×H100, FP8DP=8TP=1 per replica
    Qwen3.5-35B-A3B (MoE)8×H100, FP8

    DP=8,

    --enable-expert-parallel

    EP=8; experts sharded
    Qwen3.5-35B-A3B (MoE)8×H100, BF16

    TP=2, DP=4,

    --enable-expert-parallel

    ~70 GB weights → TP=2 to fit; EP=8
    Qwen3.5-397B-A17B (MoE)16×H100 (2 nodes), FP8

    PP=2, EP=8,

    --enable-expert-parallel

    ~397 GB weights; layers split across the two nodes, experts split within each
    Any Qwen3.5 ≥128k decode+ DCP (on top of above)--decode-context-parallel-size 2Shards KV on existing TP GPUs; no new devices
    Any Qwen3.5 ≥128k prefill+ PCP (on top of above)--prefill-context-parallel-size 2Adds GPUs (TP × PCP)

    Note

    vLLM computes EP = TP × DP per pipeline stage, so EP= in the table is for convenience. The optimal configuration of different parallelism combinations depends on the model architecture and hardware. Expert parallelism can be latency-optimal for models like Qwen3 with limited key-value heads, whereas data parallelism maximizes throughput but can introduce latency issues due to dispatch bubbles during the prefill and decode stages.

    Understanding how these five dimensions of parallelism interact is the difference between a well-tuned enterprise fleet and a cluster that stumbles at every step, wasting costly compute. By scaling your tensor, pipeline, expert, data, and context parallelism based on your specific model size and hardware budget, you establish control over your infrastructure's baseline limits.

    However, actually configuring your execution layout is only the first step. Now that you have the essential mental models down, it's time to pull the practical engineering levers that directly control your bottom line and user experience. In our upcoming articles, we will look at the deployment patterns and performance updates that help you achieve an efficient deployment.

    Related Posts

    • llama.cpp vs. vLLM: Choosing the right local LLM inference engine

    • How speculative decoding delivers faster LLM inference

    • Intelligent inference scheduling with llm-d on Red Hat AI

    • Build a local voice agent with Red Hat OpenShift AI

    • Build modular AI pipelines with OpenShift AI and reusable components

    • Running AI inference on Rebellions ATOM NPU with Red Hat AI

    Recent Posts

    • Designing distributed AI inference: Core concepts and scaling dimensions

    • How to integrate CyberArk with Identity Management

    • Deploy MemPalace MCP Server on Red Hat OpenShift AI

    • Get ready for Ansible Automation Platform 2.7

    • Manage LLM evaluation workloads at scale with EvalHub and Kueue

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.