Ollama vs. vLLM: A deep dive into performance benchmarking

August 8, 2025
Harshith Umesh
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI


    Key takeaways

    • Ollama and vLLM serve different purposes, and that's a good thing for the AI community: Ollama is ideal for local development and prototyping, while vLLM is built for high-performance production deployments.
    • vLLM outperforms Ollama at scale: vLLM delivers significantly higher throughput (a peak of 793 TPS compared to Ollama's 41 TPS) and lower P99 latency (80 ms vs. 673 ms at peak throughput), and it maintains this advantage across all concurrency levels (1-256 concurrent users), even when Ollama is tuned for parallelism.
    • Ollama prioritizes simplicity, vLLM prioritizes scalability: Ollama keeps things lightweight for single users, while vLLM dynamically scales to handle large, concurrent workloads efficiently.

    In Ollama or vLLM? How to choose the right LLM serving tool for your use case, we introduced Ollama as a good tool for local development and vLLM as the go-to solution for high-performance production serving. We argued that while Ollama prioritizes ease of use for individual developers, vLLM is engineered for the scalability and throughput that enterprise applications demand.

    Now, it's time to move from theory to practice. In this follow-up article, we'll put these two inference engines to the test in a head-to-head performance benchmark. Using concrete data, we’ll demonstrate how each inference server behaves under pressure and provide clear evidence to help you choose the right tool for your deployment needs.

    The benchmarking setup

    To ensure a true "apples-to-apples" comparison, we created a controlled testing environment on OpenShift using default arguments and original model weights without compression. The goal was to measure how each server performed while handling an increasing number of simultaneous users.

    • Hardware and software:
      • GPU: Single NVIDIA A100-PCIE-40GB GPU
      • NVIDIA driver version: 550.144.03
      • CUDA version: 12.4
      • Platform: OpenShift version 4.17.15
      • Models: meta-llama/Llama-3.1-8B-instruct for vLLM and llama3.1:8b-instruct-fp16 for Ollama
      • vLLM version: 0.9.1
      • Ollama version: 0.9.2
      • Python version: 3.13.5
    • Benchmarking tool:

      We used GuideLLM (version 0.2.1) to conduct our performance tests. GuideLLM is a benchmarking tool specifically designed to measure the performance of LLM inference servers. Refer to the article GuideLLM: Evaluate LLM deployments for real-world inference or this video to learn more about GuideLLM. We ran it as a container from within the same OpenShift cluster to ensure tests were conducted on the same network as our vLLM and Ollama services. We used its concurrency feature to simulate multiple simultaneous users by sending requests concurrently at various rates. (A simplified sketch of this kind of concurrency test appears after the metrics definitions below.)

    • Methodology:
      • We used a fixed dataset of prompt-response pairs to ensure that every request was identical for both servers, eliminating variability from synthetic data generation. Each entry specified the exact prompt to send to the model, along with its pre-calculated prompt token count and expected output token count.
      • We simulated multiple simultaneous users by running concurrent requests, with concurrency levels tested from 1 up to 256. The concurrency level represents a fixed number of "virtual users" that continuously send requests. For example, the "64 concurrency" rate maintains a constant 64 active requests on the server. Each test runs for 300 seconds.
    • Key performance metrics:
      • Requests Per Second (RPS): The average number of requests the system can successfully complete each second. Higher is better.
      • Output Tokens Per Second (TPS): The total number of tokens generated per second, measuring the server's total generative capacity. Higher is better.
      • Time to First Token (TTFT): How long it takes from sending a request to receiving the first piece of the response (token). This measures initial responsiveness. Lower is better.
      • Inter-token Latency (ITL): The average time between each subsequent token in a response, measuring the text generation speed. Lower is better.

    For TTFT and ITL, we used P99 (99th percentile) as the measure. P99 means that 99% of requests had a TTFT/ITL at or below this value, making it a good measure of "worst-case" responsiveness.
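
    To make the methodology concrete, here is a minimal, hypothetical Python sketch of the same idea: a fixed pool of "virtual users" streams requests against an OpenAI-compatible endpoint, records TTFT and ITL, and reports P99 values. It is not GuideLLM itself; the endpoint URL, model name, prompt, and the assumption of one token per streamed chunk are illustrative placeholders.

```python
# Minimal concurrency-test sketch (NOT GuideLLM): N virtual users stream
# completions from an OpenAI-compatible server and record per-request TTFT
# and per-token ITL. URL, model, and prompt are placeholder assumptions.
import asyncio
import time

import httpx
import numpy as np

BASE_URL = "http://llm-server:8000/v1"        # assumed endpoint (vLLM or Ollama)
MODEL = "meta-llama/Llama-3.1-8B-Instruct"    # assumed model name
CONCURRENCY = 64                              # fixed number of virtual users
DURATION_S = 300                              # test length, as in the article


async def virtual_user(client, ttfts, itls, stop_at):
    payload = {"model": MODEL, "prompt": "Explain Kubernetes in one paragraph.",
               "max_tokens": 128, "stream": True}
    while time.monotonic() < stop_at:
        start, prev = time.monotonic(), None
        async with client.stream("POST", f"{BASE_URL}/completions", json=payload) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data:") or "[DONE]" in line:
                    continue
                now = time.monotonic()
                if prev is None:
                    ttfts.append(now - start)   # first streamed chunk -> TTFT
                else:
                    itls.append(now - prev)     # gap between chunks -> ~ITL
                prev = now


async def main():
    ttfts, itls = [], []
    stop_at = time.monotonic() + DURATION_S
    async with httpx.AsyncClient(timeout=None) as client:
        await asyncio.gather(*(virtual_user(client, ttfts, itls, stop_at)
                               for _ in range(CONCURRENCY)))
    print(f"requests started:  {len(ttfts)}  (~{len(ttfts) / DURATION_S:.2f} RPS)")
    print(f"output tokens/sec: ~{(len(ttfts) + len(itls)) / DURATION_S:.0f} TPS")
    print(f"P99 TTFT: {np.percentile(ttfts, 99) * 1000:.0f} ms")
    print(f"P99 ITL:  {np.percentile(itls, 99) * 1000:.0f} ms")


asyncio.run(main())
```

    The real GuideLLM run handles request accounting and token counting far more carefully; this sketch only shows what "a fixed number of virtual users that continuously send requests" means in practice.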

    Comparison 1: Default settings showdown

    First, we compared vLLM and Ollama using their standard, out-of-the-box configurations. By default, Ollama is configured to handle a maximum of four requests in parallel, as it's primarily designed for single-user scenarios.
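
    To see why a hard parallelism cap matters, consider a toy back-of-the-envelope calculation (the service time and queueing behavior are simplifying assumptions, not measurements of Ollama): requests beyond the cap simply queue, so the wait before the first token grows with the number of concurrent users.

```python
# Toy queueing illustration, not a measurement of Ollama: with a hard cap on
# parallel requests (Ollama defaults to 4), extra concurrent requests wait in
# a queue, so time-to-first-token grows with queue depth. Numbers are made up.
SERVICE_TIME_S = 3.0   # assumed time to finish one generation
PARALLEL_CAP = 4       # Ollama's default number of parallel requests

for users in (4, 16, 64, 256):
    queued = max(0, users - PARALLEL_CAP)
    # A newly arriving request waits roughly queued / cap service "waves".
    wait_s = (queued / PARALLEL_CAP) * SERVICE_TIME_S
    print(f"{users:>3} users -> ~{wait_s:5.1f} s of queueing before processing starts")
```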

    Throughput (RPS and TPS)

    The difference in throughput was immediate and stark.

    • vLLM's throughput (both RPS and TPS) scaled impressively as concurrency increased, handling a much heavier user load.
    • Ollama's performance remained flat, quickly hitting its maximum capacity due to the default cap on parallel requests.

    As seen in the graphs (Figures 1 and 2), vLLM's peak performance is several times higher than Ollama's default configuration, demonstrating its superior ability to manage many concurrent user requests.

    Figure 1: vLLM's throughput (TPS) scales with concurrency, while Ollama's throughput remains flat.
    Figure 2: vLLM's throughput (RPS) scales with concurrency, while Ollama's throughput remains flat.

    vLLM's throughput scales with user load, while Ollama's performance remains flat.

    Responsiveness (TTFT and ITL)

    Here, we see an interesting trade-off between how the two engines handle load.

    • Time to First Token: vLLM consistently delivered a much lower TTFT, meaning users get a faster initial response, even under heavy load. Ollama's TTFT rose dramatically with more users because incoming requests had to wait in a queue before being processed (Figure 3).

      "Worst-case" (P99) Time To First Token shows vLLM is significantly more responsive under load.

    Figure 3: vLLM has consistently lower TTFT, while Ollama's TTFT rises sharply.
    • Inter-token Latency: At very high concurrency (above 16), vLLM's ITL began to rise, while Ollama's remained stable and low. This is because Ollama throttles requests, keeping its active workload small and predictable at the expense of making many users wait (high TTFT). In contrast, vLLM processes a much larger batch of requests at once to maximize overall throughput, which can slightly increase the time to generate each individual token within that large batch (Figure 4). The toy calculation after the figure makes this trade-off concrete.
    Figure 4: At higher concurrencies, vLLM's ITL begins to rise, while Ollama's remains stable and low.
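
    The trade-off can be illustrated with a toy cost model (the cost function and constants are invented for illustration and describe neither engine): if one decode step emits one token for every request in the batch, growing the batch raises each request's ITL a little while raising aggregate TPS a lot.

```python
# Toy batching trade-off, not a model of vLLM or Ollama: assume a decode step
# emits one token per active request and gets modestly slower as the batch
# grows. The cost function and constants are invented for illustration.
def step_time_ms(batch_size: int) -> float:
    return 10 + 0.5 * batch_size   # hypothetical per-step cost in milliseconds

for batch in (4, 32, 256):
    itl_ms = step_time_ms(batch)            # each request sees one token per step
    total_tps = batch / (itl_ms / 1000)     # tokens per second across the whole batch
    print(f"batch={batch:>3}  per-request ITL ~{itl_ms:5.1f} ms  aggregate ~{total_tps:6.0f} TPS")
```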

    Even when tuned for parallelism, Ollama's throughput can't keep up with vLLM. 

    Comparison 2: Tuned Ollama versus vLLM

    Recognizing that Ollama's default settings aren't meant for high-concurrency workloads, we tuned it for maximum performance. We set its parallel request limit to 32 (OLLAMA_NUM_PARALLEL=32), the highest stable value for our NVIDIA A100 GPU. vLLM has no equivalent fixed limit on parallel requests; its batching scales dynamically with the incoming load.

    We show results only up to a test concurrency of 64, as the server is oversaturated beyond that point.

    Throughput (RPS and TPS)

    Even after tuning, vLLM remained the clear leader, as shown in Figures 5 and 6.

    • vLLM's throughput continued to scale almost linearly, showcasing its dynamic and efficient scheduling.
    • Ollama, despite the tuning, saw its performance plateau and was unable to match vLLM's capacity at any concurrency level.
    Figure 5: vLLM's throughput (TPS) scales with concurrency, while Ollama's throughput plateaus at a much lower level.
    Figure 6: vLLM's throughput (RPS) scales with concurrency, while Ollama's throughput plateaus at a much lower level.

    Responsiveness (TTFT and ITL)

    Tuning Ollama for higher parallelism revealed significant stability challenges under load.

    • Time to First Token: vLLM's TTFT remained extremely low and stable. In contrast, Ollama's TTFT still increased sharply, as juggling more requests meant each new one had to wait longer to begin processing (Figure 7).
    • Inter-token Latency: This is where the difference becomes most apparent. vLLM's token generation speed stayed fast and fluid across all loads. Ollama's ITL, however, became extremely erratic, with massive spikes at higher concurrency. This indicates significant performance degradation and potential "head-of-line blocking," where a single stalled request can slow down an entire batch (Figure 8).
    Figure 7: vLLM consistently maintains a lower TTFT across all concurrency levels, while Ollama's TTFT increases significantly with higher concurrency.
    Figure 8: vLLM's ITL remains stable at all concurrencies, while Ollama's ITL rises sharply and becomes erratic at higher concurrency.

    vLLM maintains stable generation speed, while tuned Ollama becomes erratic under load.

    The right tool for the job

    This benchmark data definitively confirms our initial guidance:

    • Ollama excels in its intended role: a simple, accessible tool for local development, prototyping, and single-user applications. Its strength lies in its ease of use, not its ability to handle high-concurrency production traffic, where it struggles even when tuned.
    • vLLM is unequivocally the superior choice for production deployment. It is built for performance, delivering significantly higher throughput and lower latency under heavy load. Its dynamic batching and efficient resource management make it the ideal engine for scalable, enterprise-grade AI applications.

    Ultimately, the choice depends on where you are in your development journey. For developers experimenting locally, Ollama is a fantastic starting point. But for teams moving toward production, this performance data confirms that vLLM is the powerful, scalable, and efficient foundation needed to serve LLMs reliably at scale.

    Discover how Red Hat AI Inference Server, powered by vLLM, enables fast, cost-effective AI inference.
