
5 steps to triage vLLM performance

March 9, 2026
David Whyte-Gray, Thameem Abbas Ibrahim Bathusha, Michael Goin, Ashish Kamra
Related topics:
Artificial intelligence
Related products:
Red Hat AI Inference Server, Red Hat AI

    As enterprises move large language models (LLMs) from pilot to production, maintaining consistent inference performance becomes a primary operational challenge. While vLLM is efficient out of the box, users often encounter performance that differs from their expectations once they move beyond simple benchmarks.

    vLLM provides a comprehensive set of metrics that offer a window into the server's internal state, which can help determine the root cause of performance issues. This article provides a diagnostic workflow to help you improve the performance of your vLLM deployments.

    Before you start: Define what success looks like

    Before reviewing the diagnostic metrics, ask yourself: What are the performance objectives? The answer determines which metrics matter most and which remediation steps to prioritize. Your optimization strategy typically depends on one of the following workload profiles:

    • Throughput-sensitive (offline/batch): You're processing a large volume of requests for data extraction, summarization, or document classification. Individual request latency matters less than total job completion time. You are optimizing for requests per second.
    • Latency-sensitive (online/interactive): A human user is waiting for a streamed response. Think chatbot, code completion, real-time voice. Time to first token (TTFT) and inter-token latency (ITL) directly affect the user experience. Optimize for low latency at your expected concurrency.
    • Bursty (semi-online): Load is unpredictable, switching constantly from idle to spiking. You need flexible scaling and fast cold starts more than raw throughput or minimum latency.

    If you're running a batch pipeline and your response time is high, that may not be a problem at all. Conversely, if you're serving a chatbot, request latency might look fine at low concurrency but become unacceptable when your server is saturated. Framing your workload upfront will keep you focused on the right metrics in the steps that follow.

    Once you identify your performance goals, begin the triage process by isolating where the latency occurs.

    1. Isolate the symptom: TTFT vs. ITL

    Total end-to-end (E2E) latency is usually our first window into performance. It's a key metric for defining SLOs and understanding the overall user experience. While E2E latency can tell us if requests are slow, to solve the mystery of why they are slow, we need to look under the hood at the measurements of the two component parts of processing a request:

    • Time to First Token (TTFT): This is the time from when the server receives the request until the first token is generated. This includes any queuing delay (the time a request sits waiting due to server saturation) plus the prefill phase (where the engine processes the input prompt). On a saturated server, TTFT is often dominated by queuing time.
    • Inter-Token Latency (ITL): This measures the time between each subsequent token during the decode phase. ITL determines how "smooth" the text appears as it streams to the user. High ITL typically suggests hardware bandwidth limits or high concurrency within a batch.

    It's worth noting that these two phases interact. When a long prefill runs on a saturated server, it can temporarily starve the decode steps of other in-flight requests, causing ITL spikes for everyone in the batch. vLLM mitigates this with chunked prefill (enabled by default), which breaks long prefills into smaller chunks that interleave with decode steps. If you see periodic ITL spikes correlated with new long-prompt requests arriving, chunked prefill is working as intended. However, the spikes can indicate that your --max-num-batched-tokens value needs tuning.

    Latency metrics

    By looking at these three metrics, you can narrow down the bottleneck. For example, if E2E latency and TTFT are high but ITL is low, your server might be overloaded, causing requests to queue. It is also possible that the input prompts are very long, leading to long prefill times.

    These metrics are server-side and do not include network or ingress latency. If you're measuring latency from the client and it's significantly higher than what vLLM reports, the gap is likely in your network path, load balancer, or proxy layer.

    • E2E request latency: Total time for the full request to be processed by vLLM. Use the following Prometheus query:

      histogram_quantile(0.5, sum by(model_name, pod, le)(
          rate(vllm:e2e_request_latency_seconds_bucket{}[1m]))
      )
    • TTFT: Queuing time + prefill phase. Use the following Prometheus query:

      histogram_quantile(0.5, sum by(model_name, pod, le)(
        rate(vllm:time_to_first_token_seconds_bucket{}[1m]))
      )
    • ITL: Time between generated tokens (decode phase). Use the following Prometheus query:

      histogram_quantile(0.5, sum by(model_name, pod, le)(
        rate(vllm:inter_token_latency_seconds_bucket{}[1m]))
      )
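    The decision logic of this step can be sketched as a small helper. This is an illustrative sketch: the SLO thresholds below are assumptions, not vLLM defaults, and should be replaced with your own targets.

```python
# Sketch of the step-1 triage logic. The threshold values are
# illustrative assumptions, not vLLM defaults; substitute your own SLOs.

def triage_latency(ttft_s: float, itl_s: float,
                   ttft_slo: float = 1.0, itl_slo: float = 0.05) -> str:
    """Classify a latency symptom from median TTFT and ITL (seconds)."""
    high_ttft = ttft_s > ttft_slo
    high_itl = itl_s > itl_slo
    if high_ttft and not high_itl:
        return "queuing-or-prefill"  # check queue depth (step 2) and ISL (step 4)
    if high_itl and not high_ttft:
        return "decode-bound"        # check batch size and memory bandwidth
    if high_ttft and high_itl:
        return "saturated"           # server likely overloaded; see step 2
    return "healthy"

print(triage_latency(3.2, 0.02))  # → queuing-or-prefill
```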

    2. Detect server saturation

    To understand the latency metrics from the previous step, we can look at the server queue depth and running batch sizes. High TTFT could be caused by requests waiting in a long queue, and ITL generally scales with batch sizes. vLLM uses continuous batching to maximize throughput, but there is an inherent tradeoff: as the number of running requests increases, individual latency for each user typically rises as well.

    By monitoring the relationship between running and waiting requests, you can determine if your server is saturated or if a specific phase of inference is simply compute-bound.

    Server health metrics

    Monitor the balance between running and waiting requests to determine when your server reaches its capacity.

    • Running requests: Requests actively being processed on the GPU (batch size). Use the Prometheus metric vllm:num_requests_running.
    • Waiting requests: Requests sitting in the queue due to resource limits (queue depth). Use the Prometheus metric vllm:num_requests_waiting.

    If num_requests_waiting is consistently above zero, the time requests spend in the queue leads to higher TTFT. If it is zero but TTFT remains high, the delay is likely due to the time required for the prefill phase itself (processing long prompts). This suggests your vLLM server is compute-bound.

    Server health logs

    vLLM periodically logs a snapshot of the engine state, including the number of requests running and waiting, as well as key-value (KV) cache metrics. You can use these to check the server health.

    • Healthy server (well-utilized, but not overloaded): Zero requests are waiting, and the KV cache utilization is below 90%.

      Engine 000: Running: 39 reqs, Waiting: 0 reqs, GPU KV cache usage: 68.9%, Prefix cache hit rate: 29.7%
    • Saturated server: The KV cache is at nearly 100% capacity, forcing requests into the "Waiting" queue. These users will experience high TTFT regardless of their prompt size.

      Engine 000: Running: 60 reqs, Waiting: 21 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 32.2%

    A "healthy" batch size depends on your hardware and model choice. For a small model running on a large GPU, vLLM can process hundreds of requests in a batch with acceptable latency.
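    The two checks above can be automated by parsing the engine log line. The following sketch assumes the log format shown in the examples; the 90% KV cache threshold is an illustrative heuristic from this article, not a vLLM constant.

```python
import re

# Parse vLLM's periodic engine-state log line (format shown above) and
# apply the step-2 heuristics. The 90% cutoff is an illustrative
# assumption, not a vLLM constant.

LOG_RE = re.compile(
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)

def check_engine_log(line: str) -> str:
    m = LOG_RE.search(line)
    if not m:
        return "unrecognized"
    waiting, kv = int(m.group("waiting")), float(m.group("kv"))
    if waiting > 0 or kv > 90.0:
        return "saturated"
    return "healthy"

print(check_engine_log(
    "Engine 000: Running: 39 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage: 68.9%, Prefix cache hit rate: 29.7%"
))  # → healthy
```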

    3. Evaluate VRAM and KV cache health

    GPU memory is the primary constraint for how many requests a system can handle simultaneously. It is primarily used to store static model weights and the dynamic KV cache. Additionally, CUDA graphs and activations take up a smaller amount of GPU memory.

    If the model weights are too large for the hardware, there isn't enough KV cache space left to process requests concurrently. This "memory pressure" is the most common cause of the queued waiting requests we identified in step 2. Quantization can significantly alleviate this pressure; vLLM supports running models with quantized weights (for example, FP8, AWQ, or GPTQ), and quantizing the KV cache using --kv-cache-dtype fp8. These techniques shrink the memory footprint of weights and tokens, leaving more room for the engine to manage larger batches.

    To diagnose the severity of this memory pressure, monitor these key health indicators:

    • KV cache occupancy: The percentage of available KV cache currently in use. If this value is consistently near 100%, new requests are forced to wait. Use the Prometheus metric vllm:kv_cache_usage_perc.
    • Total preemptions: A cumulative count of requests stopped mid-generation to free up memory. This indicates severe memory thrashing and will result in significant latency spikes for users. If this counter is climbing, reduce --max-num-seqs to prevent the scheduler from admitting more requests than the KV cache can sustain. Use the Prometheus metric vllm:num_preemptions_total.
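    Because vllm:num_preemptions_total is a cumulative counter, "climbing" means its rate of change is positive. A minimal sketch of that delta check, using illustrative sample values:

```python
# Detecting the "climbing preemption counter" condition described above.
# Samples would come from scraping vllm:num_preemptions_total over time;
# the values here are illustrative.

def preemptions_per_min(samples: list[tuple[float, int]]) -> float:
    """Preemption rate from (unix_seconds, counter_value) samples."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) / ((t1 - t0) / 60)

rate = preemptions_per_min([(0, 120), (300, 180)])
print(f"{rate:.0f} preemptions/min")  # → 12 preemptions/min
```

    Any sustained nonzero rate is a signal to reduce --max-num-seqs or free up KV cache memory.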

    Startup capacity checks

    At startup, vLLM calculates exactly how much memory remains for the KV cache after loading the model and overhead. This helps you determine if your model is right-sized before traffic arrives.

    Example: Memory-constrained deployment (32B FP16 model on one H100 GPU): The weights and overhead consume almost the entire 80 GB GPU. With only 4.46 GiB left for the cache, the server will struggle with more than two users at a time.

    Available KV cache memory: 4.46 GiB
    Maximum concurrency for 10,000 tokens per request: 1.83x

    If your logs show very low available KV cache memory, consider using a quantized version of your model, a smaller model, or using more (or larger) GPUs to free up space for active requests.
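    You can sanity-check these startup numbers with back-of-the-envelope KV cache math. The sketch below assumes a 32B model with 64 layers, 8 KV heads, and head dimension 128; those architecture numbers are illustrative assumptions, so substitute your model's actual config.

```python
# Back-of-the-envelope check of the startup log above. The model
# geometry (64 layers, 8 KV heads, head dim 128) is an assumption
# for illustration; use your model's config.json values.

GiB = 1024 ** 3

layers, kv_heads, head_dim = 64, 8, 128
dtype_bytes = 2  # FP16 KV cache
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

available = 4.46 * GiB                    # from the startup log
tokens_that_fit = available / kv_bytes_per_token
concurrency = tokens_that_fit / 10_000    # 10,000-token requests

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")  # → 256 KiB per token
print(f"{concurrency:.2f}x concurrency")                 # → 1.83x concurrency
```

    Halving the per-token cost, for example with --kv-cache-dtype fp8, roughly doubles the concurrency estimate.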

    Prefix caching

    In many use cases, prompts start with the same prefix. For example, in a multi-turn chat, each new message includes the conversation history, so successive prompts share a long common prefix. vLLM's prefix caching (enabled by default) stores these shared segments in the KV cache so they don't have to be reprocessed.

    A high prefix cache hit rate, the ratio of cached prompt tokens hit to prompt tokens queried, means you are saving compute and significantly reducing TTFT. Calculate the hit rate using the following formula:

    rate(vllm:prefix_cache_hits_total[1m]) /
    rate(vllm:prefix_cache_queries_total[1m])

    4. Analyze request sequence lengths

    Latency metrics require the context of token counts to be meaningful. Input sequence length (ISL) and output sequence length (OSL) determine the memory footprint in the KV cache and the compute time required for the prefill and decode steps.

    Note that the prefill and decode steps are highly asymmetrical. On modern hardware, processing a 1,000-token prompt (prefill) is parallel and typically takes less than a second. However, generating 1,000 tokens (decode) happens sequentially and could take anywhere from 10 to 100+ seconds.

    • ISL and TTFT: Prefill time scales with the length of the input. If your prompts are very long, your TTFT will naturally be higher because the engine has more work to do before it can generate the first token.
    • OSL and ITL: Generation time scales linearly with the number of tokens produced. For example, at an ITL of 40 ms, generating 1,000 tokens takes 40 seconds, but 100 tokens takes only 4 seconds.
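    The points above imply a simple first-order model of end-to-end latency: E2E ≈ TTFT + (OSL − 1) × ITL. A minimal sketch with illustrative inputs:

```python
# First-order latency model implied by the bullets above:
# E2E ≈ TTFT + (OSL - 1) * ITL. The input values are illustrative.

def estimate_e2e(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Estimate end-to-end latency from TTFT, ITL, and output length."""
    return ttft_s + (output_tokens - 1) * itl_s

print(f"{estimate_e2e(0.5, 0.040, 1_000):.2f} s")  # → 40.46 s
print(f"{estimate_e2e(0.5, 0.040, 100):.2f} s")    # → 4.46 s
```

    If measured E2E latency is far above this estimate for your observed TTFT, ITL, and OSL percentiles, suspect queuing or client-side overhead.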

    You can investigate the sequence lengths using PromQL queries like the following:

    # 95th percentile prompt length
    histogram_quantile(0.95, rate(vllm:request_prompt_tokens_bucket[5m]))
    # 95th percentile output length
    histogram_quantile(0.95, rate(vllm:request_generation_tokens_bucket[5m]))
    # Median output length
    histogram_quantile(0.5, rate(vllm:request_generation_tokens_bucket[5m]))

    Note about retrieval-augmented generation (RAG)

    In vLLM deployments that are part of a RAG pipeline, a short user query is often expanded into a massive prompt containing context retrieved from a vector database. The compute required to process these large prompts can lead to high TTFT.

    Also note that the vector database lookup can add latency that is not accounted for in vLLM's TTFT metrics. To mitigate prefill cost in RAG workloads, structure your prompts so that the shared system prompt and static instructions come first; vLLM's prefix caching will then cache and reuse these common prefixes across requests, significantly reducing redundant compute.
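    The recommended prompt layout can be sketched as follows. The system prompt and template strings are placeholder assumptions; the point is the ordering, with static shared text first and per-request text last.

```python
# Prompt layout that maximizes prefix-cache reuse in a RAG pipeline:
# the static system prompt comes first, so every request shares the
# same cacheable leading tokens. Strings are placeholder assumptions.

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer using only the provided context."
)

def build_rag_prompt(retrieved_context: str, user_query: str) -> str:
    # Static, shared text first; per-request text (context, query) last.
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Context:\n{retrieved_context}\n\n"
        f"Question: {user_query}"
    )

a = build_rag_prompt("doc A", "What is X?")
b = build_rag_prompt("doc B", "What is Y?")
# Both prompts share the same leading tokens, which vLLM can cache.
assert a.startswith(SYSTEM_PROMPT) and b.startswith(SYSTEM_PROMPT)
```

    If the retrieved context came before the system prompt, no two requests would share a prefix and the cache hit rate would collapse.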

    5. Review distributed inference strategy

    When a model is too large for a single GPU, it must be split across multiple units. The most common method is tensor parallelism (TP), which shards the tensor computations within each layer (for example, GEMMs in attention or MLP) across GPUs to improve ITL. Splitting the model across multiple GPUs reduces the memory required on each GPU to store model weights. This frees up KV cache space, allowing for larger batches and higher throughput.

    Because these strategies require constant synchronization, they can introduce significant communication overhead. If your ITL is unexpectedly high, check your hardware topology with nvidia-smi topo -m. If your GPUs communicate over PCIe instead of a high-speed interconnect like NVLink, these synchronization penalties can severely degrade performance.

    Oftentimes, the most efficient way to scale is by running multiple independent replicas of the model (data parallelism) or with a hybrid approach of TP and replicas. This avoids the inter-GPU communication bottleneck during token generation and is typically the better path for increasing total system throughput. As a general rule, use the minimum TP degree that fits your model, then scale out with independent replicas for additional capacity. Replicas have zero inter-replica communication overhead and scale linearly.

    Remediation steps

    While vLLM offers many advanced settings, performance issues are rarely solved by tuning CLI parameters alone. Significant improvements usually require addressing the underlying hardware or memory constraints identified during your triage.

    Once you have identified the bottleneck, consider these high-impact optimization paths.

    Right-size your model

    Before tuning anything else, ask whether you need the model you're running. A fine-tuned 8B model will often match or exceed a general-purpose 70B model on a narrow task, at a fraction of the latency and cost. If your workload is well-defined (classification, extraction, summarization), a smaller task-specific model may eliminate your performance problem entirely.

    Scale your hardware

    If the system is consistently saturated (high num_requests_waiting), add more replicas to spread the load. If ITL is the bottleneck, moving to GPUs with higher memory bandwidth (e.g., L40S to H100) will directly speed up the decode phase.

    Use quantization

    Moving from FP16 to FP8 halves the model's VRAM footprint and improves ITL by reducing the data read during each forward pass. You can further double effective cache capacity with minimal quality loss by using --kv-cache-dtype fp8. Many model providers publish pre-quantized checkpoints on Hugging Face, such as the Red Hat AI collection, so you might not need to quantize yourself. Note: Ensure your hardware natively supports your chosen quantization format—for example, avoid FP8 on A100s—to prevent performance degradation.
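    The VRAM claim is simple arithmetic, sketched below with a 32B-parameter example (the numbers are illustrative and ignore quantization scales and other overhead):

```python
# Rough VRAM math behind the FP16 → FP8 weight-size claim above.
# Illustrative only: ignores quantization scales and runtime overhead.

params_b = 32            # 32B-parameter model
fp16_gb = params_b * 2   # 2 bytes per param → 64 GB of weights
fp8_gb = params_b * 1    # 1 byte per param  → 32 GB of weights
print(f"{fp16_gb - fp8_gb} GB freed for KV cache")  # → 32 GB freed for KV cache
```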

    Implement speculative decoding

    If ITL is your primary bottleneck and you're already on fast hardware with a quantized model, speculative decoding can generate multiple tokens per forward pass, effectively multiplying decode throughput without sacrificing output quality. vLLM supports several methods including native MTP, EAGLE3, draft models, and n-gram speculation. Red Hat provides several pre-trained speculators on Hugging Face for popular open models. You can even train your own aligned drafters with the Speculators library.

    Refine distribution

    If you are hitting interconnect overhead in multi-GPU deployments, consider reducing the tensor parallel (TP) degree. Two independent replicas at TP=2 will typically outperform one instance at TP=4, both in throughput and in scheduling flexibility. If your hardware lacks NVLink entirely, keeping TP=1 per GPU and running pure data or pipeline parallelism might be your best option.

    By analyzing metrics in the context of hardware limits and workload shape, you can move away from trial-and-error and ensure your LLM infrastructure remains performant as demand scales. To see these metrics in action, try deploying your model of choice on Red Hat OpenShift AI and monitoring the Prometheus metrics in the cluster observability dashboard.

    What's next

    Now that you have a handle on triaging a single vLLM instance, you are ready to look at the bigger picture. Performance tuning doesn't stop at a single GPU; in production environments, the next step is often scaling out. In upcoming posts, we will examine diagnostics and troubleshooting strategies for distributed inference using llm-d, which is designed for scale-out environments like Kubernetes and Red Hat OpenShift AI. Whether you are troubleshooting interconnect overhead across nodes or optimizing multi-replica queue depths for optimal load balancing, we will show you how to maintain a performant LLM infrastructure at scale.
