
Practical strategies for vLLM performance tuning

Balancing hardware, workload, and user experience

March 3, 2026
Trevor Royer
Related topics:
Artificial intelligence, Data science
Related products:
Red Hat AI Inference, Red Hat AI, Red Hat OpenShift AI

    Performance tuning large language model (LLM) serving frameworks like vLLM is rarely about a single magic flag or configuration. Instead, it's an iterative process that balances hardware constraints, workload characteristics, and user experience goals such as latency and throughput.

    This article walks through practical tuning recommendations with a focus on designing meaningful benchmarks and extracting the most performance from vLLM.

    Start with a representative test dataset

    The test dataset is often the most overlooked aspect of performance tuning. Performance for vLLM and llm-d depends on the shape and behavior of incoming requests. Synthetic or overly simplistic benchmarks can lead to misleading conclusions.

    While you might be tempted to use artificial traffic, accurate performance optimization requires a dataset that mirrors real-world usage patterns.

    Using GuideLLM for realistic benchmarking

    Tools like GuideLLM help you transition from synthetic load testing to production reality. GuideLLM benchmarks LLM serving stacks using structured, repeatable workloads that reflect how applications interact with models in practice.

    With GuideLLM, you can:

    • Define realistic request shapes, including varying prompt and output lengths.
    • Control concurrency patterns to observe how performance changes under increasing load.
    • Capture key metrics such as throughput, time to first token (TTFT), and end-to-end latency.
    • Create custom test datasets to model real-world use cases more accurately.

    Use GuideLLM to standardize your benchmarking process. This ensures that performance comparisons between tuning configurations or deployment topologies are fair and repeatable.
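
    As a sketch, a benchmark run might look like the following. The target URL, model name, and token counts are placeholders, and flag names reflect recent GuideLLM releases; check `guidellm benchmark --help` for your installed version:

```shell
# Sweep request rates against a local vLLM endpoint using synthetic data
# shaped like your workload (512 prompt tokens, 256 output tokens).
guidellm benchmark \
  --target "http://localhost:8000" \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --rate-type sweep \
  --max-seconds 60 \
  --data "prompt_tokens=512,output_tokens=256"
```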

    Key dataset considerations

    GuideLLM allows you to easily configure parameters and generate a test dataset. You can also build your own dataset manually.

    When building or validating a test dataset, whether manually or with a tool like GuideLLM, consider these factors:

    • Input and output shapes: Token counts, prompt variability, and response size directly impact KV cache utilization and scheduling behavior.
    • Repeated text: Common text, such as reused prompts and tool context, significantly affects performance because vLLM optimizes these recurring items.

    Future GuideLLM releases will support multi-turn requests. The tool will capture chat history from the LLM and submit follow-up questions to better simulate real-world chat use cases.

    A representative dataset and a consistent benchmarking tool like GuideLLM ensure that tuning decisions translate into meaningful, real-world performance.

    Recommended approach

    Start with datasets generated by GuideLLM to create a repeatable test:

    • Configure input and output lengths that represent your workloads.
    • Test a variety of concurrencies and identify saturation points.
    • Capture metrics like throughput, TTFT, and end-to-end latency.
    • Determine your service-level objectives (SLOs) for P95 and P99 for those metrics.

    As your application matures, capture real-world prompts to create a custom test dataset. Consider multi-turn requests and tool calling; these significantly impact performance when vLLM reuses existing KV cache values.
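
    Once you have per-request measurements, checking them against P95/P99 SLOs is a small calculation. A minimal sketch, using a nearest-rank percentile and made-up TTFT samples rather than real benchmark output:

```python
# Sketch: deriving P95/P99 checks from per-request benchmark samples.
# The TTFT values below are illustrative placeholders, not measurements.

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    k = max(0, -(-len(ordered) * p // 100) - 1)  # ceil(n * p / 100) - 1
    return ordered[int(k)]

ttft_ms = [120, 135, 140, 150, 155, 160, 180, 210, 260, 400]
p95 = percentile(ttft_ms, 95)
p99 = percentile(ttft_ms, 99)

# Compare against the SLO you committed to, e.g. P95 TTFT under 300 ms.
print(p95, p99, p95 <= 300)
```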

    Identify the optimal GPU-to-replica ratio

    When deploying vLLM on fixed hardware, choosing the optimal number of GPUs per replica is a critical decision.

    For example, with two nodes that each have eight NVIDIA H100 GPUs, you could deploy:

    • Two vLLM replicas using eight GPUs each
    • Four vLLM replicas using four GPUs each

    No single configuration works for every scenario. The optimal choice depends on model size, available KV cache memory, and request shapes and concurrency patterns.

    Recommended approach

    Begin by identifying the smallest number of GPUs required to load the model with a sufficient KV cache. Use that minimum count to deploy the maximum number of replicas on your hardware, then run performance tests at various concurrency levels. Finally, increase the GPUs per replica while reducing the number of replicas and repeat the tests to find the optimal balance.

    For example, with eight GPUs you can test these configurations:

    • Four replicas using two GPUs each
    • Two replicas using four GPUs each
    • One replica using eight GPUs

    Comparing these configurations helps you find the best balance between parallelism, memory availability, and scheduling efficiency.

    Also consider other factors, such as the need for high availability or flexible hardware utilization. For example, two replicas with four GPUs each might be easier to schedule than one replica with eight GPUs. This setup also provides redundancy if an instance fails.
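
    The layout search above can be sketched as a simple enumeration. This assumes GPUs-per-replica values that are powers of two and evenly divide the pool, a common constraint for tensor parallelism; the minimum of two GPUs is an illustrative assumption:

```python
# Sketch: enumerating candidate GPU-to-replica layouts for a fixed pool,
# starting from the smallest count that fits the model plus KV cache.

def candidate_layouts(total_gpus, min_gpus_per_replica):
    layouts = []
    tp = min_gpus_per_replica
    while tp <= total_gpus:
        if total_gpus % tp == 0:
            layouts.append({"gpus_per_replica": tp, "replicas": total_gpus // tp})
        tp *= 2
    return layouts

# Eight GPUs, model fits on two: benchmark each layout and compare.
for layout in candidate_layouts(8, 2):
    print(layout)
```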

    Maximize GPU memory for the KV cache

    The vLLM framework limits the GPU memory available to the model and KV cache using the --gpu-memory-utilization parameter, which defaults to 0.9 (90%).

    At startup, vLLM allocates approximately 90% of VRAM for the model and KV cache, while reserving the remaining 10% for CUDA graphs and runtime overhead. This reserved memory often remains unused, especially with smaller models or multi-GPU replicas. On an NVIDIA H100 GPU, this unused portion can reach 8 GB.

    Reclaim underutilized memory

    You can often increase this value to reclaim more memory for the KV cache:

    --gpu-memory-utilization=0.95

    A larger KV cache allows vLLM to support more concurrent tokens and requests, which increases throughput. However, setting this value too high can crash the vLLM pod.

    Recommended approach

    Gradually increase the value until the system fails to start or becomes unstable under heavy load. Then, decrease the value slightly to establish a safe operating margin.
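
    A rough back-of-the-envelope estimate shows what raising the utilization buys you. The VRAM and model sizes below are illustrative (an 80 GB H100 and an 8B-class model in fp16), not measured values:

```python
# Sketch: estimating KV cache headroom gained by raising
# --gpu-memory-utilization from 0.90 to 0.95.

def kv_cache_gib(vram_gib, utilization, model_gib):
    """GPU memory left for the KV cache after the model weights load."""
    return vram_gib * utilization - model_gib

vram, model = 80.0, 16.0  # illustrative: one H100, ~8B model in fp16
baseline = kv_cache_gib(vram, 0.90, model)  # ~56 GiB
raised = kv_cache_gib(vram, 0.95, model)    # ~60 GiB
print(f"extra KV cache: {raised - baseline:.1f} GiB")
```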

    Reduce memory pressure with a quantized KV cache

    vLLM supports KV cache quantization through the --kv-cache-dtype parameter. By default, the system uses the model data type. For example:

    --kv-cache-dtype=fp8

    Using a lower-precision data type reduces the memory required per token. This can significantly increase the number of concurrent requests the system can handle.

    Trade-offs to consider

    Lower precision can impact response quality, though the level of impact varies by model and use case. Always pair KV cache quantization with automated evaluation to ensure your response quality remains acceptable.

    Recommended approach

    Start with the lowest KV cache precision your hardware and model support that maintains acceptable quality, such as fp8.

    Next, validate the response quality with automated evaluation tests. Roll back if you observe quality regressions.
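
    The memory saving is easy to quantify. A minimal sketch of per-token KV cache size, using config values that mimic an 8B-class model (substitute your model's own layer count, KV heads, and head dimension):

```python
# Sketch: per-token KV cache size, showing why fp8 roughly doubles the
# number of tokens that fit in the same memory.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # Keys and values are each stored per layer: hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

cfg = dict(layers=32, kv_heads=8, head_dim=128)  # illustrative 8B-class config
fp16 = kv_bytes_per_token(**cfg, dtype_bytes=2)
fp8 = kv_bytes_per_token(**cfg, dtype_bytes=1)
print(fp16, fp8, fp16 // fp8)  # fp8 halves per-token memory
```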

    Maintain throughput at high concurrency

    As concurrency increases, vLLM eventually reaches a point where throughput plateaus and latency degrades. This is a natural result of GPU saturation and scheduling contention.

    To manage this, use the --max-num-seqs parameter.

    How --max-num-seqs works

    The --max-num-seqs parameter limits the number of active requests processed simultaneously and queues any requests that exceed that limit. This keeps throughput near optimal levels for most requests, though it increases the time to first token (TTFT) and end-to-end latency for the queued requests.

    This configuration protects system throughput even if it increases latency for requests that exceed the limit.

    Recommended approach

    • Establish baseline throughput, time to first token (TTFT), and latency across various concurrency levels.
    • Identify the point where throughput plateaus and latency begins to degrade, then use that level as a starting point for the --max-num-seqs parameter.
    • Adjust the value based on whether you prefer throughput stability or balanced latency.
    • Monitor TTFT and P95 or P99 latency to ensure you meet your service-level objectives (SLOs).
    • Retune the system whenever your request shapes or model versions change.
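
    Finding the plateau in a concurrency sweep can be automated. A sketch, where the (concurrency, tokens/s) pairs and the 5% gain threshold are made-up assumptions to be replaced with your own GuideLLM results:

```python
# Sketch: picking a starting --max-num-seqs from a concurrency sweep.

def saturation_point(sweep, min_gain=0.05):
    """Return the first concurrency level after which throughput gains
    drop below min_gain (5% by default); it marks the plateau."""
    for (c_prev, t_prev), (c, t) in zip(sweep, sweep[1:]):
        if t_prev > 0 and (t - t_prev) / t_prev < min_gain:
            return c_prev
    return sweep[-1][0]

# Illustrative (concurrency, throughput tokens/s) benchmark results.
sweep = [(8, 900), (16, 1700), (32, 3000), (64, 3100), (128, 3120)]
print(saturation_point(sweep))  # throughput plateaus near concurrency 32
```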

    Final thoughts

    Tuning vLLM is an iterative process that relies on realistic workloads and careful measurement. Combine representative datasets with systematic experiments for GPU layouts, memory utilization, KV cache precision, and concurrency limits to improve performance on your hardware.

    The key is not to optimize in isolation, but to continuously validate tuning decisions against user expectations and application requirements.
