
5 steps to triage vLLM performance

March 9, 2026
David Whyte-Gray, Thameem Abbas Ibrahim Bathusha, Michael Goin, Ashish Kamra
Related topics:
Artificial intelligence
Related products:
Red Hat AI Inference Server, Red Hat AI

    As enterprises move large language models (LLMs) from pilot to production, maintaining consistent inference performance becomes a primary operational challenge. While vLLM is efficient out of the box, users often encounter performance that differs from their expectations once they move beyond simple benchmarks.

    vLLM provides a comprehensive set of metrics that offer a window into the server's internal state, which can help determine the root cause of performance issues. This article provides a diagnostic workflow to help you improve the performance of your vLLM deployments.

    Before you start: Define what success looks like

    Before reviewing the diagnostic metrics, ask yourself: What are the performance objectives? The answer determines which metrics matter most and which remediation steps to prioritize. Your optimization strategy typically depends on one of the following workload profiles:

    • Throughput-sensitive (offline/batch): You're processing a large volume of requests for data extraction, summarization, or document classification. Individual request latency matters less than total job completion time. You are optimizing for requests per second.
    • Latency-sensitive (online/interactive): A human user is waiting for a streamed response. Think chatbot, code completion, real-time voice. Time to first token (TTFT) and inter-token latency (ITL) directly affect the user experience. Optimize for low latency at your expected concurrency.
    • Bursty (semi-online): Load is unpredictable, switching constantly from idle to spiking. You need flexible scaling and fast cold starts more than raw throughput or minimum latency.

    If you're running a batch pipeline and your response time is high, that may not be a problem at all. Conversely, if you're serving a chatbot, request latency might look fine at low concurrency but become unacceptable when your server is saturated. Framing your workload upfront will keep you focused on the right metrics in the steps that follow.

    Once you identify your performance goals, begin the triage process by isolating where the latency occurs.

    1. Isolate the symptom: TTFT vs. ITL

    Total end-to-end (E2E) latency is usually our first window into performance. It's a key metric for defining SLOs and understanding the overall user experience. While E2E latency can tell us if requests are slow, to solve the mystery of why they are slow, we need to look under the hood at the measurements of the two component parts of processing a request:

    • Time to First Token (TTFT): This is the time from when the server receives the request until the first token is generated. This includes any queuing delay (the time a request sits waiting due to server saturation) plus the prefill phase (where the engine processes the input prompt). On a saturated server, TTFT is often dominated by queuing time.
    • Inter-Token Latency (ITL): This measures the time between each subsequent token during the decode phase. ITL determines how "smooth" the text appears as it streams to the user. High ITL typically suggests hardware bandwidth limits or high concurrency within a batch.

    It's worth noting that these two phases interact. When a long prefill runs on a saturated server, it can temporarily starve the decode steps of other in-flight requests, causing ITL spikes for everyone in the batch. vLLM mitigates this with chunked prefill (enabled by default), which breaks long prefills into smaller chunks that interleave with decode steps. If you see periodic ITL spikes correlated with new long-prompt requests arriving, chunked prefill is working as intended. However, the spikes can indicate that your --max-num-batched-tokens value needs tuning.

    Latency metrics

    By looking at these three metrics, you can narrow down the bottleneck. For example, if E2E latency and TTFT are high but ITL is low, your server might be overloaded, causing requests to queue. It is also possible that the input prompts are very long, leading to long prefill times.

    These metrics are server-side and do not include network or ingress latency. If you're measuring latency from the client and it's significantly higher than what vLLM reports, the gap is likely in your network path, load balancer, or proxy layer.

    • E2E request latency: Total time for the full request to be processed by vLLM. Use the following Prometheus query:

      histogram_quantile(0.5, sum by(model_name, pod, le)(
          rate(vllm:e2e_request_latency_seconds_bucket{}[1m]))
      )
    • TTFT: Queuing time + prefill phase. Use the following Prometheus query:

      histogram_quantile(0.5, sum by(model_name, pod, le)(
        rate(vllm:time_to_first_token_seconds_bucket{}[1m]))
      )
    • ITL: Time between generated tokens (decode phase). Use the following Prometheus query:

      histogram_quantile(0.5, sum by(model_name, pod, le)(
        rate(vllm:inter_token_latency_seconds_bucket{}[1m]))
      )
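    The decision logic of this step can be sketched as a small helper. This is an illustrative sketch: the SLO thresholds below are assumptions, not vLLM defaults, and should be replaced with your own targets.

```python
# Sketch of the step-1 triage logic. The threshold values are
# illustrative assumptions, not vLLM defaults; substitute your own SLOs.

def triage_latency(ttft_s: float, itl_s: float,
                   ttft_slo: float = 1.0, itl_slo: float = 0.05) -> str:
    """Classify a latency symptom from median TTFT and ITL (seconds)."""
    high_ttft = ttft_s > ttft_slo
    high_itl = itl_s > itl_slo
    if high_ttft and not high_itl:
        return "queuing-or-prefill"  # check queue depth (step 2) and ISL (step 4)
    if high_itl and not high_ttft:
        return "decode-bound"        # check batch size and memory bandwidth
    if high_ttft and high_itl:
        return "saturated"           # server likely overloaded; see step 2
    return "healthy"

print(triage_latency(3.2, 0.02))  # → queuing-or-prefill
```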

    2. Detect server saturation

    To understand the latency metrics from the previous step, we can look at the server queue depth and running batch sizes. High TTFT could be caused by requests waiting in a long queue, and ITL generally scales with batch sizes. vLLM uses continuous batching to maximize throughput, but there is an inherent tradeoff: as the number of running requests increases, individual latency for each user typically rises as well.

    By monitoring the relationship between running and waiting requests, you can determine if your server is saturated or if a specific phase of inference is simply compute-bound.

    Server health metrics

    Monitor the balance between running and waiting requests to determine when your server reaches its capacity.

    • Running requests: Requests actively being processed on the GPU (batch size). Use the Prometheus metric vllm:num_requests_running.
    • Waiting requests: Requests sitting in the queue due to resource limits (queue depth). Use the Prometheus metric vllm:num_requests_waiting.

    If num_requests_waiting is consistently above zero, the time requests spend in the queue leads to higher TTFT. If it is zero but TTFT remains high, the delay is likely due to the time required for the prefill phase itself (processing long prompts). This suggests your vLLM server is compute-bound.

    Server health logs

    vLLM periodically logs a snapshot of the engine state, including the number of requests running and waiting, as well as key-value (KV) cache metrics. You can use these to check the server health.

    • Healthy server (well-utilized, but not overloaded): Zero requests are waiting, and the KV cache utilization is below 90%.

      Engine 000: Running: 39 reqs, Waiting: 0 reqs, GPU KV cache usage: 68.9%, Prefix cache hit rate: 29.7%
    • Saturated server: The KV cache is at nearly 100% capacity, forcing requests into the "Waiting" queue. These users will experience high TTFT regardless of their prompt size.

      Engine 000: Running: 60 reqs, Waiting: 21 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 32.2%

    A "healthy" batch size depends on your hardware and model choice. For a small model running on a large GPU, vLLM can process hundreds of requests in a batch with acceptable latency.
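    The two checks above can be automated by parsing the engine log line. The following sketch assumes the log format shown in the examples; the 90% KV cache threshold is an illustrative heuristic from this article, not a vLLM constant.

```python
import re

# Parse vLLM's periodic engine-state log line (format shown above) and
# apply the step-2 heuristics. The 90% cutoff is an illustrative
# assumption, not a vLLM constant.

LOG_RE = re.compile(
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)

def check_engine_log(line: str) -> str:
    m = LOG_RE.search(line)
    if not m:
        return "unrecognized"
    waiting, kv = int(m.group("waiting")), float(m.group("kv"))
    if waiting > 0 or kv > 90.0:
        return "saturated"
    return "healthy"

print(check_engine_log(
    "Engine 000: Running: 39 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage: 68.9%, Prefix cache hit rate: 29.7%"
))  # → healthy
```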

    3. Evaluate VRAM and KV cache health

    GPU memory is the primary constraint for how many requests a system can handle simultaneously. It is primarily used to store static model weights and the dynamic KV cache. Additionally, CUDA graphs and activations take up a smaller amount of GPU memory.

    If the model weights are too large for the hardware, there isn't enough KV cache space left to process requests concurrently. This "memory pressure" is the most common cause of the queued waiting requests we identified in step 2. Quantization can significantly alleviate this pressure; vLLM supports running models with quantized weights (for example, FP8, AWQ, or GPTQ), and quantizing the KV cache using --kv-cache-dtype fp8. These techniques shrink the memory footprint of weights and tokens, leaving more room for the engine to manage larger batches.

    To diagnose the severity of this memory pressure, monitor these key health indicators:

    • KV cache occupancy: The percentage of available KV cache currently in use. If this value is consistently near 100%, new requests are forced to wait. Use the Prometheus metric vllm:kv_cache_usage_perc.
    • Total preemptions: A cumulative count of requests stopped mid-generation to free up memory. This indicates severe memory thrashing and will result in significant latency spikes for users. If this counter is climbing, reduce --max-num-seqs to prevent the scheduler from admitting more requests than the KV cache can sustain. Use the Prometheus metric vllm:num_preemptions_total.
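    Because vllm:num_preemptions_total is a cumulative counter, "climbing" means its rate of change is positive. A minimal sketch of that delta check, using illustrative sample values:

```python
# Detecting the "climbing preemption counter" condition described above.
# Samples would come from scraping vllm:num_preemptions_total over time;
# the values here are illustrative.

def preemptions_per_min(samples: list[tuple[float, int]]) -> float:
    """Preemption rate from (unix_seconds, counter_value) samples."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) / ((t1 - t0) / 60)

rate = preemptions_per_min([(0, 120), (300, 180)])
print(f"{rate:.0f} preemptions/min")  # → 12 preemptions/min
```

    Any sustained nonzero rate is a signal to reduce --max-num-seqs or free up KV cache memory.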

    Startup capacity checks

    At startup, vLLM calculates exactly how much memory remains for the KV cache after loading the model and overhead. This helps you determine if your model is right-sized before traffic arrives.

    Example: Memory-constrained deployment (32B FP16 model on one H100 GPU): The weights and overhead consume almost the entire 80 GB GPU. With only 4.46 GiB left for the cache, the server will struggle with more than two users at a time.

    Available KV cache memory: 4.46 GiB
    Maximum concurrency for 10,000 tokens per request: 1.83x

    If your logs show very low available KV cache memory, consider using a quantized version of your model, a smaller model, or using more (or larger) GPUs to free up space for active requests.
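    You can sanity-check these startup numbers with back-of-the-envelope KV cache math. The sketch below assumes a 32B model with 64 layers, 8 KV heads, and head dimension 128; those architecture numbers are illustrative assumptions, so substitute your model's actual config.

```python
# Back-of-the-envelope check of the startup log above. The model
# geometry (64 layers, 8 KV heads, head dim 128) is an assumption
# for illustration; use your model's config.json values.

GiB = 1024 ** 3

layers, kv_heads, head_dim = 64, 8, 128
dtype_bytes = 2  # FP16 KV cache
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

available = 4.46 * GiB                    # from the startup log
tokens_that_fit = available / kv_bytes_per_token
concurrency = tokens_that_fit / 10_000    # 10,000-token requests

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")  # → 256 KiB per token
print(f"{concurrency:.2f}x concurrency")                 # → 1.83x concurrency
```

    Halving the per-token cost, for example with --kv-cache-dtype fp8, roughly doubles the concurrency estimate.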

    Prefix caching

    In many use cases, prompts start with the same prefix. For example, in a multi-turn chat, each new message includes the conversation history, so successive prompts share a long common prefix. vLLM's prefix caching (enabled by default) stores these shared segments in the KV cache so they don't have to be reprocessed.

    A high prefix cache hit rate, the ratio of cached prompt tokens hit to prompt tokens queried, means you are saving compute and significantly reducing TTFT. Calculate the hit rate using the following formula:

    rate(vllm:prefix_cache_hits_total[1m]) /
    rate(vllm:prefix_cache_queries_total[1m])

    4. Analyze request sequence lengths

    Latency metrics require the context of token counts to be meaningful. Input sequence length (ISL) and output sequence length (OSL) determine the memory footprint in the KV cache and the compute time required for the prefill and decode steps.

    Note that the prefill and decode steps are highly asymmetrical. On modern hardware, processing a 1,000-token prompt (prefill) is parallel and typically takes less than a second. However, generating 1,000 tokens (decode) happens sequentially and could take anywhere from 10 to 100+ seconds.

    • ISL and TTFT: Prefill time scales with the length of the input. If your prompts are very long, your TTFT will naturally be higher because the engine has more work to do before it can generate the first token.
    • OSL and ITL: Generation time scales linearly with the number of tokens produced. For example, at an ITL of 40 ms, generating 1,000 tokens takes 40 seconds, but 100 tokens takes only 4 seconds.
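    The points above imply a simple first-order model of end-to-end latency: E2E ≈ TTFT + (OSL − 1) × ITL. A minimal sketch with illustrative inputs:

```python
# First-order latency model implied by the bullets above:
# E2E ≈ TTFT + (OSL - 1) * ITL. The input values are illustrative.

def estimate_e2e(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Estimate end-to-end latency from TTFT, ITL, and output length."""
    return ttft_s + (output_tokens - 1) * itl_s

print(f"{estimate_e2e(0.5, 0.040, 1_000):.2f} s")  # → 40.46 s
print(f"{estimate_e2e(0.5, 0.040, 100):.2f} s")    # → 4.46 s
```

    If measured E2E latency is far above this estimate for your observed TTFT, ITL, and OSL percentiles, suspect queuing or client-side overhead.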

    You can investigate the sequence lengths using PromQL queries like the following:

    # 95th percentile prompt length
    histogram_quantile(0.95, rate(vllm:request_prompt_tokens_bucket[5m]))
    # 95th percentile output length
    histogram_quantile(0.95, rate(vllm:request_generation_tokens_bucket[5m]))
    # Median output length
    histogram_quantile(0.5, rate(vllm:request_generation_tokens_bucket[5m]))

    Note about retrieval-augmented generation (RAG)

    In vLLM deployments that are part of a RAG pipeline, a short user query is often expanded into a massive prompt containing context retrieved from a vector database. The compute required to process these large prompts can lead to high TTFT.

    Also note that the vector database lookup can add latency that is not accounted for in vLLM's TTFT metrics. To mitigate prefill cost in RAG workloads, structure your prompts so that the shared system prompt and static instructions come first; vLLM's prefix caching will then cache and reuse these common prefixes across requests, significantly reducing redundant compute.
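    The recommended prompt layout can be sketched as follows. The system prompt and template strings are placeholder assumptions; the point is the ordering, with static shared text first and per-request text last.

```python
# Prompt layout that maximizes prefix-cache reuse in a RAG pipeline:
# the static system prompt comes first, so every request shares the
# same cacheable leading tokens. Strings are placeholder assumptions.

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer using only the provided context."
)

def build_rag_prompt(retrieved_context: str, user_query: str) -> str:
    # Static, shared text first; per-request text (context, query) last.
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Context:\n{retrieved_context}\n\n"
        f"Question: {user_query}"
    )

a = build_rag_prompt("doc A", "What is X?")
b = build_rag_prompt("doc B", "What is Y?")
# Both prompts share the same leading tokens, which vLLM can cache.
assert a.startswith(SYSTEM_PROMPT) and b.startswith(SYSTEM_PROMPT)
```

    If the retrieved context came before the system prompt, no two requests would share a prefix and the cache hit rate would collapse.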

    5. Review distributed inference strategy

    When a model is too large for a single GPU, it must be split across multiple units. The most common method is tensor parallelism (TP), which shards the tensor computations within each layer (for example, GEMMs in attention or MLP) across GPUs to improve ITL. Splitting the model across multiple GPUs reduces the memory required on each GPU to store model weights. This frees up KV cache space, allowing for larger batches and higher throughput.

    Because these strategies require constant synchronization, they can introduce significant communication overhead. If your ITL is unexpectedly high, check your hardware topology with nvidia-smi topo -m. If your GPUs communicate over PCIe instead of a high-speed interconnect like NVLink, these synchronization penalties can severely degrade performance.

    Oftentimes, the most efficient way to scale is by running multiple independent replicas of the model (data parallelism) or with a hybrid approach of TP and replicas. This avoids the inter-GPU communication bottleneck during token generation and is typically the better path for increasing total system throughput. As a general rule, use the minimum TP degree that fits your model, then scale out with independent replicas for additional capacity. Replicas have zero inter-replica communication overhead and scale linearly.

    Remediation steps

    While vLLM offers many advanced settings, performance issues are rarely solved by tuning CLI parameters alone. Significant improvements usually require addressing the underlying hardware or memory constraints identified during your triage.

    Once you have identified the bottleneck, consider these high-impact optimization paths.

    Right-size your model

    Before tuning anything else, ask whether you need the model you're running. A fine-tuned 8B model will often match or exceed a general-purpose 70B model on a narrow task, at a fraction of the latency and cost. If your workload is well-defined (classification, extraction, summarization), a smaller task-specific model may eliminate your performance problem entirely.

    Scale your hardware

    If the system is consistently saturated (high num_requests_waiting), add more replicas to spread the load. If ITL is the bottleneck, moving to GPUs with higher memory bandwidth (e.g., L40S to H100) will directly speed up the decode phase.

    Use quantization

    Moving from FP16 to FP8 halves the model's VRAM footprint and improves ITL by reducing the data read during each forward pass. You can further double effective cache capacity with minimal quality loss by using --kv-cache-dtype fp8. Many model providers publish pre-quantized checkpoints on Hugging Face, such as the Red Hat AI collection, so you might not need to quantize yourself. Note: Ensure your hardware natively supports your chosen quantization format—for example, avoid FP8 on A100s—to prevent performance degradation.
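    The VRAM claim is simple arithmetic, sketched below with a 32B-parameter example (the numbers are illustrative and ignore quantization scales and other overhead):

```python
# Rough VRAM math behind the FP16 → FP8 weight-size claim above.
# Illustrative only: ignores quantization scales and runtime overhead.

params_b = 32            # 32B-parameter model
fp16_gb = params_b * 2   # 2 bytes per param → 64 GB of weights
fp8_gb = params_b * 1    # 1 byte per param  → 32 GB of weights
print(f"{fp16_gb - fp8_gb} GB freed for KV cache")  # → 32 GB freed for KV cache
```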

    Implement speculative decoding

    If ITL is your primary bottleneck and you're already on fast hardware with a quantized model, speculative decoding can generate multiple tokens per forward pass, effectively multiplying decode throughput without sacrificing output quality. vLLM supports several methods including native MTP, EAGLE3, draft models, and n-gram speculation. Red Hat provides several pre-trained speculators on Hugging Face for popular open models. You can even train your own aligned drafters with the Speculators library.

    Refine distribution

    If you are hitting interconnect overhead in multi-GPU deployments, consider reducing the tensor parallel (TP) degree. Two independent replicas at TP=2 will typically outperform one instance at TP=4, both in throughput and in scheduling flexibility. If your hardware lacks NVLink entirely, keeping TP=1 per GPU and running pure data or pipeline parallelism might be your best option.

    By analyzing metrics in the context of hardware limits and workload shape, you can move away from trial-and-error and ensure your LLM infrastructure remains performant as demand scales. To see these metrics in action, try deploying your model of choice on Red Hat OpenShift AI and monitoring the Prometheus metrics in the cluster observability dashboard.

    What's next

    Now that you have a handle on triaging a single vLLM instance, you are ready to look at the bigger picture. Performance tuning doesn't stop at a single GPU; in production environments, the next step is often scaling out. In upcoming posts, we will examine diagnostics and troubleshooting strategies for distributed inference using llm-d, which is designed for scale-out environments like Kubernetes and Red Hat OpenShift AI. Whether you are troubleshooting interconnect overhead across nodes or optimizing multi-replica queue depths for optimal load balancing, we will show you how to maintain a performant LLM infrastructure at scale.
