Combining KServe and llm-d for optimized generative AI inference

The best of both worlds: Cloud-native AI inference at scale using KServe and llm-d

April 21, 2026
Ran Pollak, Yuan Tang
Related topics: AI inference, Artificial intelligence, Kubernetes, Platform engineering
Related products: Red Hat AI, Red Hat AI Inference Server

    Enterprises today seek to integrate generative AI capabilities into their applications. However, scaling large AI models introduces complexity, such as managing high-volume traffic from large language models (LLMs), optimizing inference performance, maintaining predictable latency, and controlling infrastructure costs.

    Platform engineering leaders require more than model deployment capabilities. They need a Kubernetes-native infrastructure that supports efficient GPU utilization and intelligent request routing. This foundation also enables distributed inference patterns, cost-aware autoscaling, and production-grade governance.

    This article demonstrates how two open source solutions, KServe and llm-d, can be combined to address these challenges. We explore the role of each solution, illustrate their integration architecture, and provide practical guidance for AI platform teams, with a focus on KServe's LLMInferenceService, introduced in KServe v0.16.

    KServe: Simplifying AI model deployment on Kubernetes

    KServe is a Kubernetes-based model serving platform that simplifies deploying and managing ML models, including LLMs, at scale. For platform engineers, KServe acts as the model serving control plane—the layer responsible for lifecycle, scaling, and operational governance.

    The high-level generative inference architecture, including the interaction between Envoy AI Gateway and the KServe control plane, is illustrated in Figure 1.

    Diagram of Envoy AI Gateway routing requests to KServe inference services with distributed inference workers and a shared KV cache.
    Figure 1: Generative inference architecture. Source: KServe documentation (Apache 2.0 License).

    Inference as a service: What happens when a request comes in

    Instead of thinking about InferenceService as a list of features, it's more useful to follow the life of a single request: A request enters the system—for example, a /v1/chat/completions call from an application using an OpenAI-compatible client. KServe immediately takes responsibility.

    First, KServe determines where the request should go. If no pods are running, it triggers scale-from-zero. If traffic increases, it scales horizontally based on real-time demand. Then, it routes the request through the appropriate revision of the service, whether that's a stable deployment or a canary rollout receiving partial traffic.
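    The lifecycle above can be expressed declaratively. The following is a minimal sketch of a KServe InferenceService that scales from zero and sends partial traffic to a canary revision; the service name, model, storage URI, and runtime values are illustrative assumptions, not taken from the article:

    ```yaml
    # Illustrative sketch: name, runtime, and storageUri are assumptions.
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: llama-chat                # hypothetical service name
    spec:
      predictor:
        minReplicas: 0                # allow scale-from-zero when idle
        maxReplicas: 8                # ceiling for horizontal scaling under load
        canaryTrafficPercent: 10      # route 10% of traffic to the latest revision
        model:
          modelFormat:
            name: huggingface
          storageUri: hf://meta-llama/Llama-3.1-8B-Instruct
    ```

    With this spec, KServe handles revision management and traffic splitting itself; promoting the canary is a matter of raising `canaryTrafficPercent` rather than rewiring routes by hand.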

    Before reaching the model, optional preprocessing steps might enrich or transform the request. Then, the system hands it off to the serving runtime, such as vLLM or TGI, where the model inference occurs. As tokens begin streaming back, KServe maintains the connection to ensure low-latency responses for the client. KServe continuously observes traffic patterns, adjusts scaling decisions, manages revisions, and maintains endpoint stability.

    From the developer's perspective, it feels simple: send a request, get a response. From the platform engineer's perspective, it's a fully orchestrated, adaptive system managing compute, traffic, and lifecycle in real time.

    LLMInferenceService in KServe

    KServe v0.16 introduces more generative AI capabilities, including LLMInferenceService, which is designed specifically for large language model workloads. The service provides OpenAI-compatible APIs, streaming token responses, and native integration with LLM runtimes. It is also built to handle high-concurrency workloads.
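    As a sketch, a minimal LLMInferenceService might look like the following. The field values (model URI, replica count) are illustrative assumptions based on KServe v1alpha1 conventions; consult the KServe v0.16 documentation for the authoritative schema:

    ```yaml
    # Illustrative sketch of an LLMInferenceService; values are assumptions.
    apiVersion: serving.kserve.io/v1alpha1
    kind: LLMInferenceService
    metadata:
      name: llama-llm                 # hypothetical name
    spec:
      model:
        uri: hf://meta-llama/Llama-3.1-8B-Instruct   # illustrative model
        name: meta-llama/Llama-3.1-8B-Instruct
      replicas: 2
      router:
        route: {}      # expose an OpenAI-compatible HTTP route
        gateway: {}    # attach to the cluster's inference gateway
        scheduler: {}  # enable intelligent, llm-d-style request scheduling
    ```

    Clients then call the service with a standard `/v1/chat/completions` request, so existing OpenAI-compatible tooling works unchanged.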

    The service connects to optimized runtimes such as vLLM and Hugging Face TGI to enable continuous batching, paged attention, and KV-cache reuse. This makes each individual pod highly efficient, but that efficiency has limits.

    KServe model inference platform architectural layers showing supported runtimes, gen AI and cloud-native integrations, and orchestration.
    Figure 2: KServe acts as a unified, Kubernetes-native inference platform.

    When KServe alone is not enough: The engineer's reality

    Initial KServe deployments often look successful: the model is live, autoscaling responds to requests, and GPUs are active. This stability is often tested, however, once the system encounters production-level traffic.

    You might notice inconsistent performance, where some requests are fast while others are unexpectedly slow. GPU utilization can appear high without being effective, and identical prompts might fail to benefit from cache reuse. These issues cause tail latency to become unpredictable under load.

    You realize that requests are being routed without awareness of where their data already exists. KV-cache reuse, a key optimization in LLM inference, becomes effectively random in a multi-replica setup. Then comes the next challenge: the prefill and decode phases, two fundamentally different workloads, compete for the same GPUs.

    Finally, scaling decisions are reactive, not intelligent. The system scales based on load, but not on how efficiently that load is being processed.

    At this point, the problem shifts. The challenge is no longer about deploying models; it's about orchestrating intelligence across the cluster. That's where llm-d comes in.

    Integrating KServe and llm-d: Why separation wins

    It might be tempting to ask: Why not just build all of this into KServe?

    The answer lies in the architecture. Keeping KServe and llm-d as separate layers is a deliberate design choice that enables composability. KServe focuses on what it does best: managing the model lifecycle, API exposure, and operational governance, such as autoscaling.

    In contrast, llm-d handles different operational concerns, including runtime-aware scheduling, cache locality optimization, and intelligence across pods and nodes. Merging these into a single monolithic system would couple features that evolve at different speeds.

    Instead, this layered approach gives platform teams:

    • Flexibility: Swap runtimes or schedulers independently.
    • Extensibility: Integrate future innovations without redesigning the stack.
    • Clarity: Each layer has a clear responsibility.

    From a platform engineering perspective, this is a win. Instead of a single tool trying to solve every problem, an effective system uses components that manage specific roles and work together through stable interfaces.

    The architectural flow for intelligent request routing and distributed prefix caching in Kubernetes is shown in Figure 3.

    Diagram showing a Client request routed through an Envoy-based Inference Gateway in Kubernetes to an Inference Pool with shared and independent prefix caching.
    Figure 3: llm-d augments Kubernetes-based inference by intelligently routing requests. Source: llm-d documentation (Apache 2.0 License). 

    KServe LLMInferenceService and llm-d: Responsibility separation

    To build an evolvable AI inference platform, you must separate these operational concerns. 

    KServe manages the model lifecycle and governance, while LLMInferenceService provides the generative API abstraction. Within the runtime, vLLM ensures execution, and llm-d provides cross-runtime routing and KV-cache awareness. Finally, Kubernetes orchestrates the underlying resources. 

    This separation is what enables a production-ready, scalable, and evolvable AI inference platform.

    Cost efficiency comparison: Naive versus optimized

    Serving LLMs at scale is more than a model problem; it is a distributed systems problem.

    Naive architectures can introduce cache locality loss, GPU imbalance, and duplicate computation. These issues lead to high tail latency and overprovisioned infrastructure.

    With KServe and llm-d, these inefficiencies are systematically removed through intelligent routing and phase-aware execution.

    Benchmark results: The before and after story

    Before introducing llm-d, the system behaved like many real-world deployments. Requests were distributed evenly—but blindly. Cache reuse was inconsistent. GPU utilization looked high, but effective throughput told a different story. In practice, this meant we were leaving a significant portion of performance unused.

    Once we introduced cache-aware routing and phase separation, system behavior improved. Requests began landing where their context already existed. Prefill and decode workloads stopped competing for the same resources. Schedulers began making decisions based on actual system state, not just traffic volume. These changes resulted in a measurable increase in efficiency.

    Key outcomes:

    • Up to 57 times improvement in Time to First Token (P90)
    • Double the token throughput
    • Approximately 50% reduction in tail latency
    • More consistent and predictable performance under load
    • Improved GPU utilization
    Table 1: Results are based on benchmarks published by the llm-d project. Each row lists the optimization area, the behavior of a naive architecture (round-robin load balancing), the optimized behavior (KServe + llm-d), and the source.

    • Cache locality: Naive routing sends requests randomly, so the KV cache is frequently missed; cache-aware routing preserves prefix locality. (Source: KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d)
    • Time to First Token (P90): Baseline latency under cache-blind scheduling; up to ~57× faster P90 TTFT in the benchmark. (Source: KV-Cache Wins You Can See)
    • Token throughput: ~4,400 tokens/sec in the baseline test cluster; ~8,730 tokens/sec, roughly a 2× improvement. (Source: KV-Cache Wins You Can See)
    • Throughput at scale: Degrades under multi-tenant load; sustained 4.5k–11k tokens/sec. (Source: llm-d 0.5: Sustaining Performance at Scale)
    • Tail latency (P95/P99): Higher tail latency due to stragglers and imbalance; ~50% tail latency reduction in reported tests. (Source: llm-d: Kubernetes-native distributed inferencing)
    • GPU utilization: Uneven utilization with possibly idle GPUs; improved effective utilization via routing intelligence. (Source: Well-lit Path: Intelligent Inference Scheduling)
    • Autoscaling control: Scaling reacts to load only; works with KServe autoscaling and routing intelligence. (Source: Autoscaling with Knative Pod Autoscaler)

    The model remained the same; the performance gains came from the orchestration and routing logic of the system.

    KV-cache-aware scheduling and disaggregated inference with llm-d

    As LLM deployments mature, scaling is no longer just about adding GPUs. It's about using them intelligently. Modern runtimes such as vLLM introduced prefix (KV) caching to reduce redundant computation, but without smart scheduling, much of that benefit is lost.

    This is where llm-d provides a different approach.

    Disaggregated inference (prefill and decode separation)

    LLM inference consists of two phases: prefill and decode. The prefill phase is compute-heavy; it processes the full prompt and builds the model's attention context. The decode phase is latency-sensitive and generates tokens step by step, where responsiveness impacts user experience.

    llm-d separates these phases across different GPU groups, assigning compute-optimized resources to prefill and latency-optimized resources to decode. With intelligent scheduling between them, workloads are aligned to the right hardware profile.

    This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token by eliminating resource contention between different workloads.
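    Under the LLMInferenceService API, this phase separation can be expressed by giving prefill its own worker pool alongside the decode workers. The sketch below is illustrative only; the field names follow KServe v1alpha1 conventions and the replica and GPU counts are assumptions:

    ```yaml
    # Illustrative disaggregated-serving sketch; values are assumptions.
    apiVersion: serving.kserve.io/v1alpha1
    kind: LLMInferenceService
    metadata:
      name: llama-disagg              # hypothetical name
    spec:
      model:
        uri: hf://meta-llama/Llama-3.1-8B-Instruct
      replicas: 4                     # decode workers: latency-optimized token generation
      template:
        containers:
        - name: main
          resources:
            limits:
              nvidia.com/gpu: "1"
      prefill:
        replicas: 2                   # prefill workers: compute-heavy prompt processing
        template:
          containers:
          - name: main
            resources:
              limits:
                nvidia.com/gpu: "1"
      router:
        scheduler: {}                 # llm-d scheduling routes between prefill and decode
    ```

    Because the two pools scale independently, prompt-heavy traffic can grow the prefill pool without overprovisioning decode capacity, and vice versa.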

    Intelligent inference scheduler

    llm-d's inference scheduler evaluates the following metrics:

    • GPU utilization
    • Queue depth
    • Cache residency
    • SLA constraints
    • Load distribution

    The system uses an intelligent scheduler to decrease serving latency and increase throughput. It achieves this through prefix-cache-aware routing, utilization-based load balancing, fairness and prioritization for multi-tenant serving, and predicted latency balancing.
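    These signals are combined by pluggable, weighted scorers in the scheduler's endpoint picker. The configuration below is a sketch in the style of the Gateway API Inference Extension plugin format; the specific plugin names and weights are assumptions chosen to illustrate how cache-locality and load signals can be balanced, not a verbatim llm-d configuration:

    ```yaml
    # Illustrative scorer configuration; plugin names and weights are assumptions.
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: prefix-cache-scorer           # favors pods already holding the prompt prefix
    - type: queue-scorer                  # penalizes pods with deep request queues
    - type: kv-cache-utilization-scorer   # penalizes pods with a nearly full KV cache
    - type: max-score-picker              # selects the highest-scoring endpoint
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: prefix-cache-scorer
        weight: 2                         # cache locality weighted highest
      - pluginRef: queue-scorer
        weight: 1
      - pluginRef: kv-cache-utilization-scorer
        weight: 1
      - pluginRef: max-score-picker
    ```

    Raising the prefix-cache weight biases routing toward reuse of existing context; raising the load-oriented weights biases it toward even utilization.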

    Conclusion

    Modern gen AI platforms require more than fast runtimes. They require cache locality awareness, phase-aware scheduling, and distributed intelligence within a composable, Kubernetes-native architecture. By combining KServe and llm-d, platform teams can move from serving models to operating efficient inference systems at scale.

    Explore the project documentation:

    • KServe
    • llm-d

    Engage with community resources and Slack channels to stay updated and contribute to ongoing developments.

    • KServe community
    • llm-d community
