Performance tuning large language model (LLM) serving frameworks like vLLM is rarely about a single magic flag or configuration. Instead, it's an iterative process that balances hardware constraints, workload characteristics, and user experience goals such as latency and throughput.
This article walks through practical tuning recommendations with a focus on designing meaningful benchmarks and extracting the most performance from vLLM.
Start with a representative test dataset
The test dataset is often the most overlooked aspect of performance tuning. Performance for vLLM and llm-d depends on the shape and behavior of incoming requests. Synthetic or overly simplistic benchmarks can lead to misleading conclusions.
While you might be tempted to use artificial traffic, accurate performance optimization requires a dataset that mirrors real-world usage patterns.
Using GuideLLM for realistic benchmarking
Tools like GuideLLM help you transition from synthetic load testing to production reality. GuideLLM benchmarks LLM serving stacks using structured, repeatable workloads that reflect how applications interact with models in practice.
With GuideLLM, you can:
- Define realistic request shapes, including varying prompt and output lengths.
- Control concurrency patterns to observe how performance changes under increasing load.
- Capture key metrics such as throughput, time to first token (TTFT), and end-to-end latency.
- Create custom test datasets to model real-world use cases more accurately.
Use GuideLLM to standardize your benchmarking process. This ensures that performance comparisons between tuning configurations or deployment topologies are fair and repeatable.
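As a rough illustration of what "realistic request shapes" means in practice, the sketch below generates a dataset of prompt and output token counts drawn from a distribution. The normal-distribution parameters and the field names are placeholders, not GuideLLM's actual format; measure your production traffic and substitute the observed distribution.

```python
import json
import random

def make_dataset(n_requests, prompt_mean, prompt_stddev, output_mean, output_stddev, seed=0):
    """Generate request shapes (token counts) that mimic a traffic profile.

    The distribution parameters here are illustrative placeholders; replace
    them with values measured from your real workload.
    """
    rng = random.Random(seed)  # fixed seed keeps benchmark runs repeatable
    dataset = []
    for _ in range(n_requests):
        dataset.append({
            "prompt_tokens": max(1, int(rng.gauss(prompt_mean, prompt_stddev))),
            "output_tokens": max(1, int(rng.gauss(output_mean, output_stddev))),
        })
    return dataset

shapes = make_dataset(1000, prompt_mean=512, prompt_stddev=128,
                      output_mean=256, output_stddev=64)
print(json.dumps(shapes[0]))
```

Fixing the random seed matters: two tuning configurations can only be compared fairly if they receive the same sequence of request shapes.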
Key dataset considerations
GuideLLM allows you to easily configure parameters and generate a test dataset. You can also build your own dataset manually.
When building or validating a test dataset, whether manually or with a tool like GuideLLM, consider these factors:
- Input and output shapes: Token counts, prompt variability, and response size directly impact KV cache utilization and scheduling behavior.
- Repeated text: Common text, such as reused prompts and tool context, significantly affects performance because vLLM optimizes these recurring items.
Future GuideLLM releases will support multi-turn requests. The tool will capture chat history from the LLM and submit follow-up questions to better simulate real-world chat use cases.
A representative dataset and a consistent benchmarking tool like GuideLLM ensure that tuning decisions translate into meaningful, real-world performance.
Recommended approach
Start with datasets generated by GuideLLM to create a repeatable test:
- Configure input and output lengths that represent your workloads.
- Test a variety of concurrencies and identify saturation points.
- Capture metrics like throughput, TTFT, and end-to-end latency.
- Determine your service-level objectives (SLOs) for P95 and P99 for those metrics.
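To make the P95/P99 SLO step concrete, here is a minimal nearest-rank percentile sketch over collected latency samples. The sample values are made up for illustration; in practice you would feed in the TTFT or end-to-end latency series your benchmark tool reports.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; adequate for SLO reporting on large samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical TTFT samples in milliseconds from a benchmark run
ttft_ms = [120, 95, 110, 400, 130, 105, 98, 115, 102, 370]
p95 = percentile(ttft_ms, 95)
p99 = percentile(ttft_ms, 99)
```

Note how a few slow outliers dominate the tail percentiles even when the median looks healthy; this is why SLOs are usually stated at P95 or P99 rather than the average.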
Later, you can capture real-world prompts as your application matures to create a custom test dataset. Consider multi-turn requests or tool calling. These significantly impact performance when vLLM reuses existing KV cache values.
Identify the optimal GPU-to-replica ratio
When deploying vLLM on fixed hardware, choosing the optimal number of GPUs per replica is a critical decision.
For example, with two nodes that each have eight NVIDIA H100 GPUs, you could deploy:
- Two vLLM replicas using eight GPUs each
- Four vLLM replicas using four GPUs each
No single configuration works for every scenario. The optimal choice depends on model size, available KV cache memory, and request shapes and concurrency patterns.
Recommended approach
Begin by identifying the smallest number of GPUs required to load the model with a sufficient KV cache. Use that minimum count to deploy the maximum number of replicas on your hardware, then run performance tests at various concurrency levels. Finally, increase the GPUs per replica while reducing the number of replicas and repeat the tests to find the optimal balance.
For example, with eight GPUs you can test these configurations:
- Four replicas using two GPUs each
- Two replicas using four GPUs each
- One replica using eight GPUs
Comparing these configurations helps you find the best balance between parallelism, memory availability, and scheduling efficiency.
Also consider other factors, such as the need for high availability or flexible hardware utilization. For example, two replicas with four GPUs each might be easier to schedule than one replica with eight GPUs. This setup also provides redundancy if an instance fails.
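The test matrix above can be enumerated mechanically. The sketch below lists every (replicas, GPUs-per-replica) layout that uses all available GPUs, stepping the per-replica GPU count in powers of two, which is the usual granularity for tensor parallelism (the exact valid sizes depend on the model's attention head count).

```python
def replica_layouts(total_gpus, min_gpus_per_replica):
    """Enumerate (replicas, gpus_per_replica) layouts that use every GPU.

    Steps gpus_per_replica in powers of two, the common granularity for
    tensor-parallel sizes; validate against your model's head count.
    """
    layouts = []
    gpus_per_replica = min_gpus_per_replica
    while gpus_per_replica <= total_gpus:
        if total_gpus % gpus_per_replica == 0:
            layouts.append((total_gpus // gpus_per_replica, gpus_per_replica))
        gpus_per_replica *= 2
    return layouts

print(replica_layouts(8, 2))  # [(4, 2), (2, 4), (1, 8)]
```

Each layout in the list is one benchmark run in your comparison; the smallest per-replica GPU count is the minimum needed to load the model with a sufficient KV cache.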
Maximize GPU memory for the KV cache
The vLLM framework limits the GPU memory available to the model and KV cache using the --gpu-memory-utilization parameter, which defaults to 0.9 (90%).
At startup, vLLM allocates approximately 90% of VRAM for the model and KV cache, while reserving the remaining 10% for CUDA graphs and runtime overhead. This reserved memory often remains unused, especially with smaller models or multi-GPU replicas. On an NVIDIA H100 GPU, this unused portion can reach 8 GB.
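The arithmetic behind that figure is simple enough to sketch. The model-weight size below is a made-up example; the point is to see how much VRAM the default 0.9 setting leaves idle, and how much a higher setting reclaims for the KV cache.

```python
def kv_cache_budget_gb(vram_gb, utilization, model_weights_gb):
    """Approximate KV cache budget: utilized VRAM minus model weights.

    Ignores CUDA graph and activation overhead inside the utilized fraction,
    so treat the result as an upper bound.
    """
    return vram_gb * utilization - model_weights_gb

H100_VRAM_GB = 80
reserved = H100_VRAM_GB * (1 - 0.90)  # ~8 GB left idle at the default setting

# Hypothetical 16 GB of model weights; raising utilization 0.90 -> 0.95
# reclaims ~4 GB more VRAM for the KV cache on this GPU
extra = (kv_cache_budget_gb(H100_VRAM_GB, 0.95, 16)
         - kv_cache_budget_gb(H100_VRAM_GB, 0.90, 16))
```

On an 80 GB H100, the default 10% reservation accounts for the roughly 8 GB of unused memory mentioned above.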
Reclaim underutilized memory
You can often increase this value to reclaim more memory for the KV cache:
--gpu-memory-utilization=0.95

A larger KV cache allows vLLM to support more concurrent tokens and requests, which increases throughput. However, setting this value too high can crash the vLLM pod.
Recommended approach
Gradually increase the value until the system fails to start or becomes unstable under heavy load. Then, decrease the value slightly to establish a safe operating margin.
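That search loop can be automated. The sketch below walks candidate utilization values from highest to lowest and keeps the first one that passes a health check, minus a small margin. The `starts_ok` callable is a placeholder for your own launch-and-probe routine (for example, starting the pod and hitting the server's health endpoint); the stub used here is purely illustrative.

```python
def safe_utilization(candidates, starts_ok, margin=0.02):
    """Return the highest --gpu-memory-utilization value that starts cleanly,
    backed off by a safety margin.

    `starts_ok(value)` is a placeholder for a real launch-and-health-check
    routine; it must return True if vLLM starts and stays stable under load.
    """
    for value in sorted(candidates, reverse=True):
        if starts_ok(value):
            return round(value - margin, 2)
    raise RuntimeError("no candidate value started cleanly")

# Stubbed check: pretend anything above 0.96 fails to start
print(safe_utilization([0.90, 0.92, 0.94, 0.96, 0.98], lambda v: v <= 0.96))  # 0.94
```

The margin matters because a value that merely starts is not necessarily stable under heavy load; back off before declaring a setting safe.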
Reduce memory pressure with a quantized KV cache
vLLM supports KV cache quantization through the --kv-cache-dtype parameter. By default, the system uses the model data type. For example:
--kv-cache-dtype=fp8

Using a lower-precision data type reduces the memory required per token. This can significantly increase the number of concurrent requests the system can handle.
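The per-token saving follows directly from the KV cache layout: key and value tensors for every layer, sized by the KV head count and head dimension. The model dimensions below are a hypothetical GQA configuration, not any specific model.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Per-token KV cache size: key + value tensors across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 32-layer model with 8 KV heads of dimension 128 (GQA)
fp16 = kv_bytes_per_token(32, 8, 128, 2)  # 2-byte dtype
fp8 = kv_bytes_per_token(32, 8, 128, 1)   # 1-byte dtype: half the footprint

# Tokens that fit in a hypothetical 40 GB KV cache budget
tokens_fp16 = int(40e9 // fp16)
tokens_fp8 = int(40e9 // fp8)  # roughly double the concurrent token capacity
```

Halving the bytes per token roughly doubles the number of concurrent tokens, and therefore requests, the same KV cache budget can hold.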
Trade-offs to consider
Lower precision can impact response quality, though the level of impact varies by model and use case. Always pair KV cache quantization with automated evaluation to ensure your response quality remains acceptable.
Recommended approach
Start with the lowest KV cache precision that your hardware and model support while maintaining acceptable quality, such as fp8.
Next, validate the response quality with automated evaluation tests. Roll back if you observe quality regressions.
Maintain throughput at high concurrency
As concurrency increases, vLLM eventually reaches a point where throughput plateaus and latency degrades. This is a natural result of GPU saturation and scheduling contention.
To manage this, use the --max-num-seqs parameter.
How --max-num-seqs works
The --max-num-seqs parameter limits the number of active requests processed simultaneously and queues any requests that exceed that limit. This keeps throughput near optimal levels for most requests, though it increases the time to first token (TTFT) and end-to-end latency for the queued requests.
This configuration protects system throughput even if it increases latency for requests that exceed the limit.
Recommended approach
Begin by establishing baseline throughput, time to first token (TTFT), and latency across various concurrency levels. Identify the point where throughput plateaus and latency begins to degrade, then use that level as a starting point for the --max-num-seqs parameter. You can then adjust the value based on whether you prefer throughput stability or balanced latency. Continue to monitor TTFT and P95 or P99 latency to ensure you meet your service-level objectives (SLOs). Finally, retune the system whenever your request shapes or model versions change.
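Identifying the plateau from a concurrency sweep can also be done programmatically. The sketch below scans (concurrency, throughput) pairs and returns the last concurrency level that still delivered a meaningful relative throughput gain; the sweep data and the 5% gain threshold are illustrative assumptions.

```python
def find_saturation(sweep, min_gain=0.05):
    """Given (concurrency, throughput) pairs from a load sweep, return the
    concurrency beyond which throughput stops improving by at least
    `min_gain` relative gain -- a starting point for --max-num-seqs.
    """
    sweep = sorted(sweep)
    for (c_prev, t_prev), (_c_next, t_next) in zip(sweep, sweep[1:]):
        if t_next < t_prev * (1 + min_gain):
            return c_prev  # throughput plateaued; stop here
    return sweep[-1][0]  # no plateau observed within the sweep

# Hypothetical sweep: tokens/s plateaus around concurrency 32
sweep = [(8, 900), (16, 1700), (32, 3000), (64, 3100), (128, 3050)]
print(find_saturation(sweep))  # 32
```

Treat the returned value as a starting point only: rerun the sweep whenever request shapes or model versions change, since the plateau moves with the workload.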
Final thoughts
Tuning vLLM is an iterative process that relies on realistic workloads and careful measurement. Combine representative datasets with systematic experiments for GPU layouts, memory utilization, KV cache precision, and concurrency limits to improve performance on your hardware.
The key is not to optimize in isolation, but to continuously validate tuning decisions against user expectations and application requirements.