Practical strategies for vLLM performance tuning

Balancing hardware, workload, and user experience

March 3, 2026
Trevor Royer
Related topics: Artificial intelligence, Data science
Related products: Red Hat AI Inference Server, Red Hat AI, Red Hat OpenShift AI

    Performance tuning large language model (LLM) serving frameworks like vLLM is rarely about a single magic flag or configuration. Instead, it's an iterative process that balances hardware constraints, workload characteristics, and user experience goals such as latency and throughput.

    This article walks through practical tuning recommendations with a focus on designing meaningful benchmarks and extracting the most performance from vLLM.

    Start with a representative test dataset

    The test dataset is often the most overlooked aspect of performance tuning. Performance for vLLM and llm-d depends on the shape and behavior of incoming requests. Synthetic or overly simplistic benchmarks can lead to misleading conclusions.

    While you might be tempted to use artificial traffic, accurate performance optimization requires a dataset that mirrors real-world usage patterns.

    Using GuideLLM for realistic benchmarking

    Tools like GuideLLM help you transition from synthetic load testing to production reality. GuideLLM benchmarks LLM serving stacks using structured, repeatable workloads that reflect how applications interact with models in practice.

    With GuideLLM, you can:

    • Define realistic request shapes, including varying prompt and output lengths.
    • Control concurrency patterns to observe how performance changes under increasing load.
    • Capture key metrics such as throughput, time to first token (TTFT), and end-to-end latency.
    • Create custom test datasets to model real-world use cases more accurately.

    Use GuideLLM to standardize your benchmarking process. This ensures that performance comparisons between tuning configurations or deployment topologies are fair and repeatable.
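As a starting point, you can also generate a custom dataset yourself. The sketch below writes a JSONL file with varied prompt lengths and a shared prefix, to exercise different KV cache footprints and prefix reuse. The field names (`prompt`, `output_tokens`) are assumptions for illustration; check the GuideLLM documentation for the schema it expects.

```python
import json
import random

def build_dataset(path, n=100, shared_prefix="You are a helpful assistant. ", seed=0):
    """Write a JSONL test dataset with varied prompt lengths and a shared prefix.

    Field names are hypothetical; adapt them to your benchmarking tool's schema.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Vary the prompt body length to cover a realistic spread of input shapes.
        body = " ".join(f"token{j}" for j in range(rng.randint(50, 500)))
        rows.append({
            "prompt": shared_prefix + body,
            "output_tokens": rng.choice([64, 128, 256]),
        })
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return rows

rows = build_dataset("test_dataset.jsonl", n=100)
```

Because the generator is seeded, every benchmark run replays the same request shapes, which keeps comparisons between configurations fair.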

    Key dataset considerations

    GuideLLM allows you to easily configure parameters and generate a test dataset. You can also build your own dataset manually.

    When building or validating a test dataset, whether manually or with a tool like GuideLLM, consider these factors:

    • Input and output shapes: Token counts, prompt variability, and response size directly impact KV cache utilization and scheduling behavior.
    • Repeated text: Common text, such as reused system prompts and tool context, significantly affects performance because vLLM can reuse KV cache entries for recurring prefixes (automatic prefix caching).

    Future GuideLLM releases will support multi-turn requests. The tool will capture chat history from the LLM and submit follow-up questions to better simulate real-world chat use cases.

    A representative dataset and a consistent benchmarking tool like GuideLLM ensure that tuning decisions translate into meaningful, real-world performance.

    Recommended approach

    Start with datasets generated by GuideLLM to create a repeatable test:

    • Configure input and output lengths that represent your workloads.
    • Test a variety of concurrencies and identify saturation points.
    • Capture metrics like throughput, TTFT, and end-to-end latency.
    • Determine your service-level objectives (SLOs) for P95 and P99 for those metrics.

    Later, you can capture real-world prompts as your application matures to create a custom test dataset. Consider multi-turn requests or tool calling. These significantly impact performance when vLLM reuses existing KV cache values.
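Checking measured latencies against P95 and P99 SLOs is a small, mechanical step that is easy to script. A minimal sketch, using a nearest-rank percentile (your benchmarking tool may use a different interpolation method):

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    k = max(0, -(-len(ordered) * p // 100) - 1)  # ceil(n * p / 100) - 1, clamped to 0
    return ordered[int(k)]

def check_slos(latencies_ms, slos):
    """Compare measured percentiles against SLO targets; return pass/fail per SLO."""
    return {name: percentile(latencies_ms, p) <= target
            for name, (p, target) in slos.items()}

# Hypothetical end-to-end latency samples (ms) from one benchmark run.
latencies = [120, 135, 150, 160, 180, 200, 240, 300, 450, 900]
result = check_slos(latencies, {"p95": (95, 500), "p99": (99, 1000)})
```

Running the same check after every tuning change makes SLO regressions visible immediately rather than after deployment.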

    Identify the optimal GPU-to-replica ratio

    When deploying vLLM on fixed hardware, choosing the optimal number of GPUs per replica is a critical decision.

    For example, with two nodes that each have eight NVIDIA H100 GPUs, you could deploy:

    • Two vLLM replicas using eight GPUs each
    • Four vLLM replicas using four GPUs each

    No single configuration works for every scenario. The optimal choice depends on model size, available KV cache memory, and request shapes and concurrency patterns.

    Recommended approach

    Begin by identifying the smallest number of GPUs required to load the model with a sufficient KV cache. Use that minimum count to deploy the maximum number of replicas on your hardware, then run performance tests at various concurrency levels. Finally, increase the GPUs per replica while reducing the number of replicas and repeat the tests to find the optimal balance.

    For example, with eight GPUs you can test these configurations:

    • Four replicas using two GPUs each
    • Two replicas using four GPUs each
    • One replica using eight GPUs

    Comparing these configurations helps you find the best balance between parallelism, memory availability, and scheduling efficiency.

    Also consider other factors, such as the need for high availability or flexible hardware utilization. For example, two replicas with four GPUs each might be easier to schedule than one replica with eight GPUs. This setup also provides redundancy if an instance fails.
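The candidate layouts for a fixed GPU budget can be enumerated mechanically. A small sketch, assuming power-of-two tensor-parallel sizes starting from the minimum GPU count that loads the model with a usable KV cache:

```python
def replica_layouts(total_gpus, min_gpus_per_replica):
    """Enumerate (gpus_per_replica, replicas) layouts to benchmark,
    doubling GPUs per replica from the minimum that fits the model."""
    layouts = []
    g = min_gpus_per_replica
    while g <= total_gpus:
        layouts.append((g, total_gpus // g))
        g *= 2  # tensor-parallel sizes are typically powers of two
    return layouts

# Example from the text: 8 GPUs, model fits with sufficient KV cache on 2 GPUs.
print(replica_layouts(8, 2))  # [(2, 4), (4, 2), (8, 1)]
```

Each layout then gets the same benchmark sweep, so the comparison isolates the GPU-to-replica ratio from every other variable.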

    Maximize GPU memory for the KV cache

    The vLLM framework limits the GPU memory available to the model and KV cache using the --gpu-memory-utilization parameter, which defaults to 0.9 (90%).

    At startup, vLLM allocates approximately 90% of VRAM for the model and KV cache, while reserving the remaining 10% for CUDA graphs and runtime overhead. This reserved memory often remains unused, especially with smaller models or multi-GPU replicas. On an NVIDIA H100 GPU, this unused portion can reach 8 GB.

    Reclaim underutilized memory

    You can often increase this value to reclaim more memory for the KV cache:

    --gpu-memory-utilization=0.95

    A larger KV cache allows vLLM to support more concurrent tokens and requests, which increases throughput. However, setting this value too high can crash the vLLM pod.
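The arithmetic behind the gain is simple: model weights are a fixed cost per GPU, so every additional fraction of VRAM you allow goes to the KV cache. A sketch with an assumed per-GPU weight footprint:

```python
def kv_budget_gain_gb(vram_gb, old_util, new_util, model_gb_per_gpu):
    """Extra GPU memory (GB) available to the KV cache per GPU when raising
    --gpu-memory-utilization. Model weights are fixed, so the whole gain
    goes to the KV cache."""
    old_budget = vram_gb * old_util - model_gb_per_gpu
    new_budget = vram_gb * new_util - model_gb_per_gpu
    return new_budget - old_budget

# H100 with 80 GB VRAM: raising 0.90 -> 0.95 frees 4 GB per GPU for the KV cache.
# model_gb_per_gpu=16 is a hypothetical weight footprint; it cancels out of the gain.
gain = kv_budget_gain_gb(80, 0.90, 0.95, model_gb_per_gpu=16)
```

Note the gain is independent of model size; what the model footprint determines is whether the resulting total KV cache budget is large enough for your target concurrency.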

    Recommended approach

    Gradually increase the value until the system fails to start or becomes unstable under heavy load. Then, decrease the value slightly to establish a safe operating margin.

    Reduce memory pressure with a quantized KV cache

    vLLM supports KV cache quantization through the --kv-cache-dtype parameter. By default, the system uses the model data type. For example:

    --kv-cache-dtype=fp8

    Using a lower-precision data type reduces the memory required per token. This can significantly increase the number of concurrent requests the system can handle.
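You can estimate the effect from the model architecture: the KV cache stores a key and a value vector per token, per layer, per KV head. A back-of-the-envelope sketch using a hypothetical Llama-3-8B-like shape (32 layers, 8 KV heads with grouped-query attention, head dimension 128):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Approximate KV cache footprint per token: K and V tensors across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical model shape; substitute your model's config values.
fp16 = kv_bytes_per_token(32, 8, 128, dtype_bytes=2)  # 131072 bytes/token
fp8 = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)   # 65536 bytes/token
```

Halving the bytes per token roughly doubles the number of tokens, and therefore concurrent requests, the same KV cache budget can hold.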

    Trade-offs to consider

    Lower precision can impact response quality, though the level of impact varies by model and use case. Always pair KV cache quantization with automated evaluation to ensure your response quality remains acceptable.

    Recommended approach

    Start with the lowest KV cache precision your hardware and model support that maintains acceptable quality—for example, fp8.

    Next, validate the response quality with automated evaluation tests. Roll back if you observe quality regressions.

    Maintain throughput at high concurrency

    As concurrency increases, vLLM eventually reaches a point where throughput plateaus and latency degrades. This is a natural result of GPU saturation and scheduling contention.

    To manage this, use the --max-num-seqs parameter.

    How --max-num-seqs works

    The --max-num-seqs parameter limits the number of active requests processed simultaneously and queues any requests that exceed that limit. This keeps throughput near optimal levels for most requests, though it increases the time to first token (TTFT) and end-to-end latency for the queued requests.

    This configuration protects system throughput even if it increases latency for requests that exceed the limit.

    Recommended approach

    Begin by establishing baseline throughput, time to first token (TTFT), and latency across various concurrency levels. Identify the point where throughput plateaus and latency begins to degrade, then use that level as a starting point for the --max-num-seqs parameter. You can then adjust the value based on whether you prefer throughput stability or balanced latency. Continue to monitor TTFT and P95 or P99 latency to ensure you meet your service-level objectives (SLOs). Finally, retune the system whenever your request shapes or model versions change.
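Identifying the plateau from sweep results can be automated with a simple heuristic. A sketch, assuming you have (concurrency, throughput) pairs from your benchmark runs and treat a relative gain below 5% as saturation (the threshold is an arbitrary choice to tune):

```python
def find_saturation(results, min_gain=0.05):
    """Given (concurrency, throughput) pairs sorted by concurrency, return the
    last concurrency level before relative throughput gain drops below min_gain.
    Use it as a starting point for --max-num-seqs."""
    for (c_prev, t_prev), (_, t) in zip(results, results[1:]):
        if t_prev > 0 and (t - t_prev) / t_prev < min_gain:
            return c_prev
    return results[-1][0]  # no plateau observed; sweep higher concurrencies

# Hypothetical sweep: throughput (tokens/s) flattens after concurrency 32.
sweep = [(8, 900), (16, 1700), (32, 3000), (64, 3100), (128, 3050)]
print(find_saturation(sweep))  # 32
```

Treat the result as a starting point only; confirm with TTFT and P95/P99 latency at that level before fixing --max-num-seqs.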

    Final thoughts

    Tuning vLLM is an iterative process that relies on realistic workloads and careful measurement. Combine representative datasets with systematic experiments for GPU layouts, memory utilization, KV cache precision, and concurrency limits to improve performance on your hardware.

    The key is not to optimize in isolation, but to continuously validate tuning decisions against user expectations and application requirements.
