Master KV cache aware routing with llm-d for efficient AI inference

October 7, 2025
Christopher Nuland, Maroon Ayoub (IBM)
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    In the era of large-scale AI inference, ensuring efficiency across distributed environments is essential. As workloads grow, so does the need for more intelligent scheduling and memory reuse strategies. Enter llm-d, a Kubernetes-native framework for scalable, intelligent LLM inference. One of its most powerful capabilities is KV cache aware routing, which reduces latency and improves throughput by directing requests to pods that already hold relevant context in GPU memory.

    In this blog post, we'll cover:

    • What KV cache aware routing is and why it matters
    • How llm-d implements this feature with an External Processing Pod (EPP), the Gateway API Inference Extension, and intelligent routing
    • The key Kubernetes YAML assets that make it work
    • A test case that shows our latest 87.4% cache hit rate
    • Where to go to learn more and get started

    What is llm-d?

    llm-d is an open source project that uses cloud-native patterns to manage large-scale LLM inference. It is a collaborative effort by IBM, Google, Red Hat, and the broader AI infrastructure community. The project introduces:

    • Disaggregated prefill and decode workloads
    • Multi-model and multi-tenant isolation
    • Intelligent routing via an External Processing Pod
    • And, crucially, KV cache aware routing for memory-efficient, low-latency inference

    Stateless inference fails to reuse cache

    In traditional deployments, even if KV caches are enabled inside the model server (like vLLM), the gateway is unaware of the cache state. This leads to:

    • Round-robin routing or explicit sticky sessions
    • Frequent cache misses
    • Repeated computation for common prefixes
    • Unnecessary GPU memory use

    This breaks down under high concurrency or in workloads with large shared context (like retrieval-augmented generation, agentic workflows, and templated inputs).

    KV cache aware routing

    llm-d enables state-aware request scheduling by introducing the Gateway API Inference Extension (GAIE) with an EPP. This high-performance system, shown in Figure 1, makes intelligent routing decisions based on KV cache awareness. The key components include:

    • An External Processing Pod (EPP) for the GAIE that orchestrates intelligent pod scoring for optimal cache utilization
    • An in-memory caching system that tracks cache state across vLLM pods without external dependencies
    • A pod discovery and labeling system that automatically identifies and monitors decode service endpoints
    • A session-aware routing algorithm that maintains request consistency for optimal cache reuse
    • A prefix-aware scoring system that intelligently routes requests based on prompt similarity and cache warmth

    The result is an advanced scheduler that routes requests to pods most likely to have relevant cached content. This dramatically reduces inference times and GPU load.

    Figure 1: Complete KV cache aware routing architecture, showing the flow from client requests through EPP intelligent routing to decode pods with Gateway API Inference Extension coordination.

    Example deployment

    This section provides a practical guide to deploying and configuring KV cache aware routing with llm-d.

    Prerequisites

    To follow this guide, you should have:

    • Red Hat OpenShift or Kubernetes with GPU-enabled nodes and NVIDIA GPU Operator
    • Istio 1.27.0+ or KGateway installed (required for Envoy features)
    • Gateway API CRDs installed (standard + inference extension)
    • llm-d infrastructure installed using the official community Helm chart
    • A Hugging Face token (for downloading models)

    Official community Helm chart approach

    The implementation in this example uses the official llm-d community Helm chart, which automatically provisions:

    • Infrastructure Gateway: Gateway with proper configuration
    • EPP: External Processing Pod
    • Service discovery: Automatic discovery of decode services via label selectors

    InferencePool and InferenceModel (auto-created by Helm chart)

    The EPP automatically discovers and manages inference pools based on service labels. For automatic discovery, the decode service must have the following label:

    llm-d.ai/inferenceServing: "true"
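
    As a quick illustration of this label-based discovery (not part of llm-d itself), here is a minimal sketch using the Kubernetes Python client to list the decode services the EPP would pick up; the namespace is an assumption:

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (use load_incluster_config() inside a pod).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # List decode services that opt in to inference serving via the label above.
    services = v1.list_namespaced_service(
        namespace="llm-d",                                # assumed namespace
        label_selector="llm-d.ai/inferenceServing=true",
    )
    for svc in services.items:
        print(svc.metadata.name, svc.spec.cluster_ip)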

    KV cache indexer: Deep architecture dive

    The KV cache indexer is the core of llm-d's intelligent routing system. It maintains a global, near-real-time view of KV cache block locality across your vLLM-based decode and prefill pods.

    Core architecture components

    Component | Purpose | Implementation
    kvcache.Indexer | Main orchestrator handling scoring requests | Coordinates all internal modules
    kvevents.Pool | Ingests KV cache events from vLLM pods | Sharded ZMQ worker pool for real-time event processing
    kvblock.Index | Core data store mapping block hashes to pods | In-memory two-level LRU cache for sub-millisecond lookups
    tokenization.PrefixStore | Caches tokenized prompt prefixes | LRU cache avoiding expensive re-tokenization
    kvblock.TokenProcessor | Converts tokens into KV block keys | Chunking and hashing algorithm matching vLLM exactly
    kvblock.Scorer | Scores pods based on cache hit sequences | Longest consecutive prefix matching strategy
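
    The real data structures live in the llm-d KV Cache Manager; the following is only a simplified (single-level) Python sketch of the idea behind kvblock.Index, an in-memory, LRU-bounded map from KV block keys to the pods that currently hold those blocks:

    from collections import OrderedDict

    class KVBlockIndex:
        """Toy in-memory index: block key -> set of pod IDs, evicted LRU-first."""

        def __init__(self, max_blocks: int = 10_000):
            self.max_blocks = max_blocks
            self._index: OrderedDict = OrderedDict()   # block key -> set of pod IDs

        def add(self, block_key: str, pod_id: str) -> None:
            self._index.setdefault(block_key, set()).add(pod_id)
            self._index.move_to_end(block_key)         # mark block as most recently used
            while len(self._index) > self.max_blocks:
                self._index.popitem(last=False)        # evict the least recently used block

        def remove(self, block_key: str, pod_id: str) -> None:
            pods = self._index.get(block_key)
            if pods:
                pods.discard(pod_id)
                if not pods:
                    self._index.pop(block_key, None)

        def lookup(self, block_key: str) -> set:
            return self._index.get(block_key, set())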

    The read path: Intelligent pod scoring

    When a router needs to select the best pod for a new prompt, the read path finds the pod holding the longest sequence of relevant cached KV blocks:

    1. Token retrieval: Check the PrefixStore for the longest cached token sequence matching the prompt prefix.
    2. Key generation: Convert tokens into deterministic KV block keys that match vLLM's internal logic.
    3. Index lookup: Query the kvblock.Index to find which pods have the consecutive blocks.
    4. Scoring: Rank each pod based on consecutive matching blocks from the start of the prompt.
    5. Response: Return scored pod rankings to the router.

    Key insight: First-time prompts might return empty results while background tokenization occurs, but common prompts achieve sub-millisecond scoring.
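
    The actual scoring lives in kvblock.Scorer; the sketch below only illustrates the longest-consecutive-prefix idea on top of the toy index from the previous sketch, with block_keys_for_prompt() standing in as a hypothetical helper for the tokenize-and-chunk step:

    def score_pods(index: KVBlockIndex, block_keys: list) -> dict:
        """Count, per pod, how many consecutive blocks it holds from the start of the prompt."""
        scores = {}
        candidates = index.lookup(block_keys[0]) if block_keys else set()
        for pod in candidates:                  # only pods holding the first block can score
            run = 0
            for key in block_keys:
                if pod in index.lookup(key):
                    run += 1
                else:
                    break                       # stop at the first missing block
            scores[pod] = run
        return scores

    # Usage (block_keys_for_prompt is a hypothetical tokenize-and-chunk helper):
    #   scores = score_pods(index, block_keys_for_prompt(prompt))
    #   best_pod = max(scores, key=scores.get) if scores else None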

    The write path: Real-time cache tracking

    The write path keeps the index synchronized with actual vLLM pod cache states:

    1. Event publication: vLLM pods publish cache events (BlockStored, BlockRemoved) via ZMQ.
    2. Message reception: Events are parsed by topic, using the format kv@pod-id@model.
    3. Sharded processing: Pod ID is hashed to ensure ordered processing per pod.
    4. Event decoding: The worker decodes msgpack payloads containing event batches.
    5. Index updates: Apply cache changes to the in-memory kvblock.Index.
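
    For illustration only, here is a minimal Python sketch of that write path using pyzmq and msgpack against the toy index above; the endpoint, topic handling, and payload shape are assumptions, not the real llm-d event schema:

    import msgpack
    import zmq

    index = KVBlockIndex()

    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://vllm-events:5557")            # assumed event endpoint
    sub.setsockopt_string(zmq.SUBSCRIBE, "kv@")      # topics look like kv@<pod-id>@<model>

    while True:
        topic, payload = sub.recv_multipart()        # assumes topic and payload arrive as two frames
        _, pod_id, _model = topic.decode().split("@", 2)
        # Assumed payload shape: a msgpack-encoded batch of (event_type, block_key) pairs.
        for event_type, block_key in msgpack.unpackb(payload, raw=False):
            if event_type == "BlockStored":
                index.add(block_key, pod_id)
            elif event_type == "BlockRemoved":
                index.remove(block_key, pod_id)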

    Component configuration details

    EPP configuration (deployed via official Helm chart):

    # plugins.yaml configuration
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: llm-d-gaie-epp-config
      namespace: llm-d
    data:
      plugins-v2.yaml: |
        plugins:
          - name: "cache-aware-router"
            type: "external_processor"
            config:
              discovery:
                label_selector: "llm-d.ai/inferenceServing=true"
              cache:
                type: "in-memory-lru"
                max_size: 10000
              routing:
                algorithm: "prefix-aware"
                session_affinity: true
    

    vLLM prefix caching configuration

    Each vLLM pod is configured for optimal prefix caching performance:

    args:
      - "--enable-prefix-caching"             # Enable KV-cache prefix reuse
      - "--block-size=16"                     # Optimal block size for cache efficiency
      - "--gpu-memory-utilization=0.7"        # Reserve memory for cache storage
      - "--max-model-len=4096"                # Match expected prompt lengths
      - "--kv-cache-dtype=auto"               # Automatic cache data type optimization
    env:
      - name: CUDA_VISIBLE_DEVICES            # GPU assignment for cache isolation
        value: "0"
    
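    As a quick local sanity check outside of Kubernetes, the same prefix-caching settings can be exercised through vLLM's offline Python API; the model name below is only a placeholder:

    from vllm import LLM, SamplingParams

    # Offline engine configured with the same prefix-caching settings as the pod args above.
    llm = LLM(
        model="Qwen/Qwen2.5-0.5B-Instruct",          # placeholder model; substitute your own
        enable_prefix_caching=True,
        block_size=16,
        gpu_memory_utilization=0.7,
        max_model_len=4096,
    )

    shared_prefix = "You are a support assistant for ACME Corp. Answer concisely.\n\n"
    prompts = [
        shared_prefix + "How do I reset my password?",
        shared_prefix + "How do I update my billing address?",
    ]

    # The second prompt should reuse the KV blocks computed for the shared prefix.
    for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
        print(out.outputs[0].text.strip())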

    EnvoyFilter for EPP integration

    Enables the EPP to intercept and route requests based on intelligent pod scoring:

    name: envoy.filters.http.ext_proc
    typed_config:
      grpc_service:
        envoy_grpc:
          cluster_name: epp-ext-proc-cluster  # Cluster pointing to EPP service
      processing_mode:
        request_header_mode: SEND     # Send request headers for routing analysis
        response_header_mode: SEND    # Send response headers for session tracking
        request_body_mode: STREAMED   # Stream request bodies for prompt analysis
      failure_mode_allow: true        # Continue routing if EPP unavailable
      message_timeout: 30s            # Allow time for intelligent scoring

    Test case and results

    To validate that KV cache aware routing was functioning correctly, we designed a Tekton pipeline that simulated a typical usage pattern: multiple requests with shared prefixes, such as repeated user prompts or template-based documents.

    Monitored signals:

    • EPP logs for intelligent routing decisions
    • EPP and vLLM metrics for prefix cache hits
    • Grafana dashboard for comprehensive visibility

    Outstanding performance metrics:

    • Total queries: 4,776
    • Total cache hits: 4,176
    • Cache hit rate: 87.4% (improved from previous 86%)

    Traffic distribution analysis:

    • Primary pod: 4,772 queries (99.92% of traffic), 87.5% cache hit rate
    • Secondary pods: Only 4 queries total (0.08% failover)

    These results show that the combination of the Gateway API Inference Extension, in-memory cache tracking, and intelligent EPP routing concentrates requests on warm pods and keeps cache utilization high.

    Grafana dashboard monitoring

    To provide comprehensive observability into the KV cache aware routing performance, we used Grafana dashboards that visualize key metrics in real time (Figure 2).

    Figure 2: Grafana dashboard showing cache hit rates, request distribution, and system performance metrics during our latest 87.4% cache hit rate test.

    Key dashboard metrics displayed:

    • Cache hit rate timeline: Real-time visualization of cache effectiveness across all decode pods
    • Request distribution: Traffic routing patterns showing session affinity in action
    • Pod-level performance: Individual decode pod cache statistics and GPU utilization
    • Latency metrics: Response time improvements from cache hits versus cache misses
    • System health: Overall cluster performance and resource utilization

    The dashboard confirms our latest production results:

    • Session affinity concentrated 99.92% of requests to the primary warm pod (exceptional stickiness).
    • Cache hit rates achieved 87.4% overall, with 87.5% on the primary pod.
    • GPU memory utilization stayed optimal at 90%.
    • Response latencies showed significant improvement for cache-hit requests with sub-400 ms times.

    This visual monitoring validates that the KV cache aware routing system performs as designed and delivers measurable benefits in both efficiency and performance.

    TTFT impact: Measurable performance gains

    One of the most immediate and noticeable benefits of KV cache aware routing is the dramatic improvement in Time To First Token (TTFT). Here's how cache hits directly translate to faster inference.

    The following table compares baseline versus cache-aware performance.

    Scenario | Without cache routing | With KV cache routing | Improvement
    Cold inference | 2,850 ms TTFT | 2,850 ms TTFT | Baseline
    Warm cache hit | 2,850 ms TTFT (worst case) | 340 ms TTFT | 88% faster
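
    As a rough back-of-the-envelope check (our own arithmetic, not a measured result), blending the two rows above by the measured 87.4% hit rate gives an expected average TTFT of roughly 0.66 seconds:

    # Illustrative only: blend warm and cold TTFT by the measured hit rate.
    hit_rate = 0.874
    warm_ttft_ms, cold_ttft_ms = 340, 2850

    expected_ttft_ms = hit_rate * warm_ttft_ms + (1 - hit_rate) * cold_ttft_ms
    print(f"Expected TTFT at 87.4% hit rate: {expected_ttft_ms:.0f} ms")   # ~656 ms vs. 2,850 ms baseline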

    Why this matters

    The 87.4% cache hit rate translates into tangible business value, impacting several key areas.

    Cost savings

    A 70% reduction in compute time for repeated prompts means 70% fewer GPU-hours billed. For a cluster running 10 GPUs at $2 per hour around the clock (about $480 per day), a 70% reduction works out to roughly $336 saved per day on redundant computation.

    Additionally, cache hits use 90% less energy than full inference, significantly reducing cloud costs.

    User experience

    Users see sub-second response times for cached prompts versus 3–5 seconds for cold inference.

    This higher throughput also means you can support 3 times more concurrent users with the same hardware.

    Key use cases

    Enterprise use cases where this shines:

    • RAG pipelines: Document chunks get cached, making follow-up questions instant.
    • Customer support: Common queries hit the cache, and agents get faster responses.
    • Code generation: Template-based prompts reuse cached context.
    • Multi-tenant SaaS: Shared prompt patterns benefit all users.

    Conclusion

    KV cache aware routing with llm-d represents a significant leap forward in optimizing large language model inference. By intelligently directing requests to pods with existing KV cache content, llm-d reduces latency, improves throughput, and lowers operational costs. The demonstrated 87.4% cache hit rate and 88% faster TTFT for warm cache hits underscore the real-world impact of this technology.

    For enterprises using LLMs in demanding scenarios like RAG, customer support, and code generation, llm-d's KV cache aware routing provides a robust, scalable, and highly efficient solution for maximizing the value of their AI infrastructure.

    Learn more:

    • Project code and performance test on GitHub
    • llm-d KV Cache Manager Architecture
    • llm-d on GitHub
    • llm-d operator quick start
    • vLLM documentation
