Accelerate multi-turn LLM workloads on OpenShift AI with llm-d intelligent routing

January 13, 2026
Philip Hayes
Related topics: Artificial intelligence, Hybrid Cloud, Observability
Related products: Red Hat AI Inference Server, Red Hat AI, Red Hat OpenShift AI

    Large language model deployments are moving beyond simple single-turn prompts into richer multi-turn conversational experiences: code assistants, document analysis workflows, long-context chatbots, and more. These workloads place new strains on inference servers, especially around prefix reuse, KV cache locality, and the tail latency experienced by end users.

    In this blog post, we walk through a hands-on demonstration comparing llm-d, Red Hat’s distributed LLM inference solution, with a traditional deployment of vLLM using naive load balancing. The goal is to show how llm-d’s prefix-aware intelligent routing delivers smoother performance, lower P95/P99 latency, and more efficient use of GPU resources.

    Why this matters

    vLLM has earned its reputation as one of the fastest LLM inference engines available. It’s easy to deploy, fast, and optimizes throughput through kernel fusion and prefix caching.

    But users don't just send isolated prompts; they engage in multi-turn conversations, where later requests depend on the KV cache built during earlier turns.

    In multi-replica deployments, something subtle but important happens: Round-robin routing scatters requests across replicas, destroying KV cache locality and causing severe tail latency spikes.

    llm-d addresses this problem directly with KV-cache-aware routing, ensuring follow-up turns land on replicas that already contain the relevant context.
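    To make the contrast concrete, here is a minimal, purely illustrative sketch of the two strategies. It is not llm-d's actual scheduler (in practice, llm-d scores candidate replicas on prefix-cache overlap and load rather than simple stickiness); it only shows why prefix-aware selection keeps a conversation on a warm replica while round-robin scatters it:

    # Illustrative sketch only -- not llm-d's implementation.
    from itertools import cycle

    REPLICAS = ["pod-a", "pod-b", "pod-c", "pod-d"]
    _round_robin = cycle(REPLICAS)

    # conversation_id -> replica that last served it (a stand-in for real prefix matching)
    _prefix_owner: dict[str, str] = {}

    def route_round_robin(conversation_id: str) -> str:
        """Naive routing: ignores history, so follow-up turns scatter across replicas."""
        return next(_round_robin)

    def route_prefix_aware(conversation_id: str) -> str:
        """Prefer the replica whose KV cache already holds this conversation's prefix."""
        replica = _prefix_owner.get(conversation_id) or next(_round_robin)
        _prefix_owner[conversation_id] = replica
        return replica

    # Ten turns of the same conversation:
    print([route_round_robin("chat-42") for _ in range(10)])   # bounces across all four pods
    print([route_prefix_aware("chat-42") for _ in range(10)])  # stays on one warm pod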

    Let’s walk through the benchmark.

    The multi-turn benchmarking tool

    Throughout this exercise, we'll use a benchmarking tool designed for conversational workloads. The tool uses real-world documents as seeds and automatically classifies them as CODE or TEXT. To mimic real usage, it simulates multi-turn conversation flows by running parallel workers with random delays and reusing model responses as input, which generates realistic KV cache access patterns.

    This conversational structure is exactly what exposes the limitations of naive routing and the advantages of llm-d.
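    As a rough sketch of the loop such a tool runs (the function names, turn counts, and delays below are hypothetical, not the actual tool's code), each worker drives one conversation, waits a random think time between turns, and feeds the previous model answer back into the next prompt:

    # Hypothetical sketch of a multi-turn benchmark worker; names and parameters are illustrative.
    import random
    import threading
    import time

    def run_conversation(send_chat, seed_document: str, turns: int = 10) -> list[float]:
        """Drive one simulated conversation and record the latency of each turn."""
        messages = [{"role": "user", "content": f"Summarize this document:\n{seed_document}"}]
        latencies = []
        for _ in range(turns):
            start = time.perf_counter()
            reply = send_chat(messages)            # call the model endpoint under test
            latencies.append(time.perf_counter() - start)
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "user", "content": "Go deeper on your previous answer."})
            time.sleep(random.uniform(0.5, 3.0))   # random think time between turns
        return latencies

    # Parallel workers, one conversation each, to mimic concurrent users:
    stub = lambda messages: "stubbed model response"   # replace with a real client call
    workers = [threading.Thread(target=run_conversation, args=(stub, f"doc-{i}"))
               for i in range(11)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()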

    We'll keep a Grafana dashboard open throughout the process to monitor TTFT, KV cache hit rate, GPU utilization, and throughput.
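    If you prefer to pull the same signals programmatically, the standard Prometheus HTTP API works as well. Note that the metric name in the query below is an assumption: vLLM exposes TTFT as a histogram under the vllm: prefix, but exact metric names vary by version, so verify them against what your deployment actually exports:

    # Optional: query P95 TTFT from Prometheus directly instead of the dashboard.
    # vllm:time_to_first_token_seconds is an assumption; check the metric names
    # your vLLM version exports and adjust the query accordingly.
    import requests

    PROM_URL = "https://<your-prometheus-route>"   # placeholder; use your Prometheus route or `oc port-forward`
    QUERY = ("histogram_quantile(0.95, "
             "sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))")

    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    print(resp.json()["data"]["result"])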

    Prerequisites

    • An OpenShift 4.20 cluster with GPU-enabled worker nodes
    • OpenShift AI 3.0 installed and configured
    • NVIDIA GPU Operator installed and functional
    • At least four NVIDIA L4 or A10 GPUs (or similar) for multi-replica testing

    Note

    The benchmarking figures produced for this article were generated on a single-node OpenShift cluster with 4x NVIDIA L4 GPUs. While these results are typical for this specific configuration, they are intended for demonstration purposes and do not represent production-grade infrastructure or enterprise-level averages. Performance outcomes depend on variables such as input types and replica counts.

    Step 0: Deploy monitoring stack

    Clone the demo repository and change into it (the kustomize paths below are relative to the repository root):

    git clone https://github.com/rh-aiservices-bu/rhaoi3-llm-d
    cd rhaoi3-llm-d

    Deploy the monitoring stack (Prometheus and Grafana):

    oc apply -k monitoring
    # Wait for Grafana to be ready
    oc wait --for=condition=ready pod -l app=grafana -n llm-d-monitoring --timeout=300s
    # Get Grafana URL
    export GRAFANA_URL=$(oc get route grafana-secure -n llm-d-monitoring -o jsonpath='{.spec.host}')
    echo "Grafana URL: https://$GRAFANA_URL"

    Step 1: Establish the baseline: vLLM performance with GuideLLM

    Before we compare routing strategies, it’s important to note that vLLM is an outstanding inference engine. It is fast, efficient, simple to configure, and delivers excellent throughput.

    Deploy vLLM (4 replicas):

    oc apply -k vllm
    oc wait --for=condition=ready pod \
      -l serving.kserve.io/inferenceservice=qwen-vllm \
      -n demo-llm --timeout=300s

    Run the GuideLLM micro-benchmark:

    oc apply -k guidellm/overlays/vllm
    oc logs -f job/vllm-guidellm-benchmark -n demo-llm

    In Grafana, monitor for stable TTFT, high throughput, and efficient tensor-parallel use across multiple GPUs.
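    If you want to sanity-check the endpoint directly while the benchmark runs, a quick request against vLLM's OpenAI-compatible API works. The URL and model name below are placeholders (take them from your cluster, for example via oc get inferenceservice qwen-vllm -n demo-llm), and add an Authorization header if your endpoint requires a token:

    # Minimal sanity check against the OpenAI-compatible endpoint served by vLLM.
    # BASE_URL and MODEL are placeholders for your own cluster's values.
    import requests

    BASE_URL = "https://<qwen-vllm-endpoint>"      # e.g. the URL reported by the InferenceService
    MODEL = "<served-model-name>"                  # the name listed under /v1/models

    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 32,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])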

    Key takeaways

    • vLLM excels at inference.
    • Prefix caching works extremely well, but only within a single replica.
    • Configuration is simple and predictable.

    But now we ask the critical question: How does vLLM behave when scaled horizontally, with real-world multi-turn conversations routed randomly across replicas?

    To answer this question, we will use a different benchmarking tool that replicates multi-turn chat conversations.

    Step 2: Identify the scaling limitations of naive load balancing in multi-turn chat

    With multiple replicas behind a naive round-robin strategy, each user conversation will land on different pods. This breaks cache locality and causes frequent KV cache misses.

    Run the multi-turn benchmark:

    oc apply -k benchmark-job/overlays/vllm
    oc logs -f job/vllm-multi-turn-benchmark -n demo-llm

    Once the benchmark completes, you will see a summary of results similar to the following:

    ================================================================================
    BENCHMARK SUMMARY
    ================================================================================
    
    Total time: 210.48s
    Total requests: 110
    Completed conversations: 11/11
    Requests per second: 0.52
    
    Time to First Token (TTFT):
      Min:         44.47 ms
      Max:        881.38 ms
      Mean:       208.96 ms
      P50:        119.00 ms
      P95:        727.26 ms
      P99:        814.92 ms
    
    Total Request Time:
      Min:       2347.15 ms
      Max:      10874.38 ms
      Mean:      6323.26 ms
      P50:       6138.65 ms
      P95:       9769.03 ms
    
    TTFT by Turn Number:
      Turn  1:     332.52 ms avg (11 requests)
      Turn  2:     305.08 ms avg (11 requests)
      Turn  3:     270.26 ms avg (11 requests)
      Turn  4:     203.46 ms avg (11 requests)
      Turn  5:     276.70 ms avg (11 requests)
      Turn  6:      95.63 ms avg (11 requests)
      Turn  7:     138.93 ms avg (11 requests)
      Turn  8:     152.36 ms avg (11 requests)
      Turn  9:     180.47 ms avg (11 requests)
      Turn 10:     134.21 ms avg (11 requests)
    
    TTFT by Document Type:
      CODE:       241.15 ms avg (50 requests)
      TEXT:       182.14 ms avg (60 requests)
    
    First Turn vs Subsequent Turns (Prefix Caching Indicator):
      First turn avg:      332.52 ms
      Later turns avg:     195.23 ms
      Speedup ratio:         1.70x

    Expected vLLM results (from our tests):

    Metric                  Value     Meaning
    P50 TTFT                123 ms    Good median performance
    P95 TTFT                727 ms    High tail latency, frustrated users
    P99 TTFT                814 ms    Spikes caused by KV cache misses
    Prefix cache speedup    1.70×     Inefficient cache re-use

    What these results tell us:

    • Tail latency increases dramatically. P95 TTFT is approximately 727 ms and P99 TTFT is approximately 814 ms.
    • Median latency remains acceptable (P50 is approximately 123 ms), but the variance makes the system feel unpredictable.
    • Prefix caching provides only a modest benefit across replicas, with a cache speedup of just 1.70×. This indicates poor reuse of previously processed prompts.

    Grafana observations (see Figure 1):

    • KV cache hit rate is approximately 62% (roughly two out of three requests).
    • TTFT shows jagged spikes, correlating with KV cache misses.
    • GPUs show imbalanced utilization across replicas.
    Figure 1: Grafana dashboard showing metrics from the benchmark against vLLM with naive load balancing: roughly 62% KV cache hit rate, jagged TTFT spikes, and imbalanced GPU utilization across the four pods.

    This is not a vLLM flaw; it’s an architecture problem in multi-replica deployments without routing intelligence.

    Clean up vLLM:

    oc delete job vllm-guidellm-benchmark vllm-multi-turn-benchmark -n demo-llm
    oc delete -k vllm

    Reset monitoring:

    oc delete pod -l app=prometheus -n llm-d-monitoring
    oc wait --for=condition=ready pod -l app=prometheus -n llm-d-monitoring --timeout=120s

    Step 3: Deploy llm-d and enable intelligent routing

    Now we redeploy using llm-d, which introduces prefix-aware scheduling: every request is routed to the replica most likely to contain relevant KV cache entries.

    This improves cache locality, reduces recompute, and stabilizes tail latency.
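    To see why that matters, here is a back-of-the-envelope illustration (the token counts are made up) of how much prompt recompute a warm replica avoids on a later turn:

    # Illustrative arithmetic only: prefill work with and without KV cache reuse,
    # assuming prefill cost scales roughly linearly with uncached prompt tokens.
    history_tokens = 6000    # tokens accumulated over earlier turns (hypothetical)
    new_turn_tokens = 200    # tokens added by the latest user message (hypothetical)

    uncached_on_hit = new_turn_tokens                    # warm replica: only the new turn
    uncached_on_miss = history_tokens + new_turn_tokens  # cold replica: the whole history again

    print(f"Prefill work on a cache hit : {uncached_on_hit} tokens")
    print(f"Prefill work on a cache miss: {uncached_on_miss} tokens")
    print(f"Recompute avoided by routing to the warm replica: "
          f"{uncached_on_miss / uncached_on_hit:.0f}x fewer prompt tokens")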

    Deploy llm-d:

    oc apply -k llm-d
    oc wait --for=condition=ready pod \
      -l app.kubernetes.io/name=qwen \
      -n demo-llm --timeout=300s

    Run the same benchmark:

    oc apply -k benchmark-job/overlays/llm-d
    oc logs -f job/llm-d-multi-turn-benchmark -n demo-llm

    Once the benchmark completes, you will see a summary of results similar to the following:

    ================================================================================
    BENCHMARK SUMMARY
    ================================================================================
    
    Total time: 224.37s
    Total requests: 110
    Completed conversations: 11/11
    Requests per second: 0.49
    
    Time to First Token (TTFT):
      Min:         51.86 ms
      Max:        767.41 ms
      Mean:       112.54 ms
      P50:         89.17 ms
      P95:        237.96 ms
      P99:        700.76 ms
    
    Total Request Time:
      Min:       2873.23 ms
      Max:       9567.29 ms
      Mean:      6768.38 ms
      P50:       6727.68 ms
      P95:       9126.74 ms
    
    TTFT by Turn Number:
      Turn  1:     343.35 ms avg (11 requests)
      Turn  2:      80.83 ms avg (11 requests)
      Turn  3:      77.98 ms avg (11 requests)
      Turn  4:      86.33 ms avg (11 requests)
      Turn  5:      77.61 ms avg (11 requests)
      Turn  6:      83.59 ms avg (11 requests)
      Turn  7:      86.99 ms avg (11 requests)
      Turn  8:      93.45 ms avg (11 requests)
      Turn  9:      91.43 ms avg (11 requests)
      Turn 10:     103.83 ms avg (11 requests)
    
    TTFT by Document Type:
      CODE:       116.38 ms avg (50 requests)
      TEXT:       109.34 ms avg (60 requests)
    
    First Turn vs Subsequent Turns (Prefix Caching Indicator):
      First turn avg:      343.35 ms
      Later turns avg:      86.89 ms
      Speedup ratio:         3.95x

    Expected llm-d results:

    Metric           Value     Interpretation
    P50 TTFT         92 ms     25% faster than vLLM
    P95 TTFT         237 ms    63% faster, huge tail latency win
    P99 TTFT         700 ms    Much more stable high-percentile behavior
    Cache speedup    3.95×     Extremely efficient cache reuse

    What these results tell us:

    • P50 TTFT of 92 ms (25% faster than vLLM) shows that llm-d improves everyday responsiveness for most users.
    • P95 TTFT of 237 ms (63% faster) is the most important result: This means the slowest reasonable responses—the ones users actually notice—are more than 2× faster.
    • P99 TTFT of approximately 700 ms shows llm-d keeps even extreme edge cases stable, preventing rare unpredictable stalls.

    Grafana observations (see Figure 2):

    • KV cache hit rate jumps to approximately 90% (a roughly 45% relative improvement over vLLM's 62%).
    • TTFT curve is smooth with no latency spikes.
    • GPU utilization across replicas is balanced and predictable.
    Figure 2: Grafana dashboard showing metrics from the llm-d benchmark: roughly 90% KV cache hit rate, smooth TTFT curves, and balanced GPU utilization across replicas.

    This demonstrates the core advantage of llm-d: routing is based on cached prefixes, not randomness.

    The following table provides a results comparison.

    Metric           vLLM      llm-d     Improvement
    P50 TTFT         123 ms    92 ms     25% faster
    P95 TTFT         745 ms    272 ms    63% faster
    P99 TTFT         841 ms    674 ms    20% faster
    Cache speedup    1.79×     3.84×     2.1× better

    Figure 3 illustrates the tail latency improvements.

    Figure 3: Comparison of TTFT between vLLM with naive load balancing and llm-d, showing the tail latency improvements at P50, P95, and P99.

    For users, these results mean fewer slow responses and more consistent performance during interactive sessions. You get better throughput for every dollar spent on GPUs, and the system stays responsive throughout multi-turn conversations. This is not a marginal optimization; it is the difference between a smooth product experience and a frustrating one.

    Why llm-d performs better

    The following comparison highlights the architectural factors that help llm-d deliver more consistent performance.

    Feature                   vLLM (round robin)         llm-d (intelligent routing)
    Routing                   Random                     Prefix-aware scoring
    Cache hits                ~25% per replica           ~90%+
    P95 latency               High variance              Consistent, predictable
    GPU utilization           Imbalanced                 Balanced via cache locality
    Multi-turn performance    Cache frequently missed    Cache reused efficiently

    The summary is simple: llm-d ensures the right request lands on the right GPU at the right time.

    Key takeaways

    • Tail latency matters. P95 and P99 represent your most frustrated users.
    • Cache locality is critical. Single-instance caching means little when routing is random.
    • Prefix-aware routing is required for large-context, multi-turn assistants.
    • No changes to your application. llm-d uses the same API, same model, and same OpenShift AI foundations, as the sketch below illustrates.
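    As a sketch of that last point (the hostnames and model name are placeholders, not values from this demo), the same OpenAI-style client works against either deployment; moving from the vLLM service to the llm-d gateway is just a change of base URL:

    # Same client code for both deployments; only the base URL differs.
    # Hostnames, model name, and API key handling are placeholders for your cluster.
    from openai import OpenAI

    # base_url="https://<qwen-vllm-endpoint>/v1"        # vLLM with naive load balancing
    client = OpenAI(
        base_url="https://<llm-d-gateway-endpoint>/v1",  # llm-d intelligent routing
        api_key="not-needed-if-unauthenticated",
    )

    reply = client.chat.completions.create(
        model="<served-model-name>",
        messages=[{"role": "user", "content": "Continue our earlier discussion."}],
    )
    print(reply.choices[0].message.content)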

    Conclusion

    vLLM remains an exceptional inference engine, and this benchmark validates that. But as soon as you scale horizontally and run multi-turn workloads, naive routing becomes the bottleneck.

    llm-d addresses these challenges through intelligent, KV-cache-aware routing. This approach increases cache hit rates, lowers tail latency, and ensures smoother performance under load. By using GPU resources more efficiently, the system provides a better experience for every user.

    If you're deploying multi-turn LLM workloads such as code assistants, enterprise chatbots, or document analysis applications, llm-d provides the performance foundation required for production at scale. Learn more about how to scale inference with llm-d.

