Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

llama.cpp vs. vLLM: Choosing the right local LLM inference engine

June 15, 2026
Cedric Clyburn
Related topics:
AI inferenceArtificial intelligence
Related products:
Red Hat AI Inference

    Everyone wants to run local large language models (LLMs) right now, and for good reason. Running your own GPT-style models means no API bills creeping up month over month, no rate limits from a model vendor, and full data privacy by default. Whether you're building retrieval-augmented generation (RAG) pipelines, spinning up AI agents, or using an AI code assistant, two inference engines keep coming up in most conversations: llama.cpp and vLLM.

    They solve the same core problem—running open-weight models yourself—but they approach it from different angles. By the end of this article, you'll know exactly when to use each one, and we'll back it up with benchmarks.

    The evolution of open-weight models: Why local inference exists

    The need for local inference engines traces back to 2023. When Meta released Llama 2 as one of the first commercially viable open-weight model families, it shifted the landscape of open-access AI. Sure, ChatGPT and other hosted services already existed, but Llama 2 was different. You could download it. You could own it.

    One small problem, though: you probably couldn't actually run it.

    Llama 2 shipped in 7B, 13B, and 70B parameter sizes. Even the 7B model required significant GPU memory in its native precision. If you didn't have a dedicated NVIDIA GPU (and most developers don't), running these sizes was difficult. That gap between you can download this and you can run this is where llama.cpp and vLLM emerged, each solving a different side of the problem.

    llama.cpp: Efficient AI inference on consumer hardware

    The llama.cpp project started with a simple question: Instead of asking How do we get bigger GPUs?, what if we made it possible to run optimized models on the hardware we already have?

    The project started as a lightweight, dependency-free way to run Llama models (in C++, hence the name), but has since become the primary way to run LLMs on consumer hardware thanks to a few key optimizations.

    Reducing compute requirements with quantization

    Quantization compresses model weights from their original 16-bit or 32-bit floating-point representations down to 4-bit or even 2-bit integers. In practice, this means a model that originally required about 30 GB of memory can shrink to around 4 GB, which is small enough to fit in your laptop's RAM.

    Hugging Face model page for Gemma 4 listing GGUF file sizes for various quantization levels ranging from 2-bit to 16-bit.
    Figure 1: Released in 16-bit format, the model takes up approximately 60 GB. Compression to 4-bit gives us a model that only takes up around 18 GB of space.

    The tradeoff is some loss in output quality, but modern quantization techniques preserve surprisingly good performance for most use cases. For example, our FP8 and INT8 benchmarks on DeepSeek retained near-perfect accuracy. Both llama.cpp and vLLM benefit from static quantization, whereas vLLM supports activation quantization to handle the dynamic computations generated for each prompt.

    The GGUF format for standardized model packaging

    The llama.cpp project also introduced GGUF (GPT-Generated Unified Format), a single-file format that bundles model weights and all associated metadata: tokenizer configuration, architecture details, and quantization parameters into one portable unit. This makes loading and swapping models fast, and it's become the de facto standard for local model distribution. If you've ever browsed Hugging Face for quantized models, you've seen GGUF files everywhere.

    GGUF file structure mapping byte allocations for magic number, version, tensor counts, metadata key-value pairs, and tensor info.
    Figure 2: Instead of just safetensors, GGUF is a binary format that's optimized for quick loading and saving of models.

    CPU-first inference and its downstream effects

    The vast majority of personal computers don't have a dedicated GPU. llama.cpp was designed to run efficiently on CPUs, with optional GPU acceleration when available. This single decision made local LLMs accessible to millions of developers, and its impact extends far beyond the project itself. For example, if you've used Ollama or LM Studio, llama.cpp's underlying engine enables these tools.

    vLLM: High-throughput AI inference at scale

    Running one model on a laptop is great. But what happens when 10 users hit your AI model at the same time? Or when you deploy to Kubernetes, where workloads are distributed across nodes and regions?

    vLLM takes the local LLM idea and scales it up, focusing on production inference with hardware accelerators like NVIDIA GPUs, Google tensor processing units (TPUs), AMD GPUs, Intel accelerators, and more. It's an LLM serving engine, just like llama.cpp, but it's built around solving problems with LLM inference: specifically, managing the key-value (KV) cache (an LLM's short-term memory) from requests and resolving issues with GPU underutilization. To support this, vLLM includes several features designed for deploying LLMs at scale.

    Maximizing throughput with continuous batching

    With traditional predictive AI models, such as BERT) or YOLO, we had fixed inputs, like a single image, and fixed outputs, like It's a cat. Instead of processing each request individually or batching 10 and waiting, vLLM uses continuous batching to process incoming requests per token across the batch. If 10 requests arrive at roughly the same time, vLLM interleaves their token generation rather than making nine of them wait.

    Token execution grids contrasting static batching, which leaves GPU slots idle after sequence completion, with continuous batching, which inserts new requests immediately.
    Figure 3: Static batching wastes GPU slots waiting, whereas continuous batching fills them dynamically as requests finish.

    Efficient KV cache management with PagedAttention

    Every time a model processes a prompt, it computes a KV cache. This cache is a set of intermediate calculations that the model references when generating each subsequent token, which saves compute later on. However, these caches grow quickly. For longer conversations or complex prompts, a single request's KV cache can consume dozens of gigabytes of GPU memory.

    Token execution grids contrasting static batching, which leaves GPU slots idle after sequence completion, with continuous batching, which inserts new requests immediately.
    Figure 4: The KV cache grows with each request's token embeddings, and can consume a large part of GPU memory.

    With many hardware accelerators offering only 10, 40, or 80 GB of VRAM, loading the model weights and maintaining KV caches for concurrent users is a serious memory challenge. vLLM's PagedAttention mechanism manages KV cache memory the way an operating system manages virtual memory: allocating and freeing blocks dynamically to maximize GPU utilization.

    Scaling performance with speculative decoding and disaggregation

    Both llama.cpp and vLLM benefit from quantization, as well as speculative decoding, where you use a small, fast “draft” model to generate candidate tokens, then verify them with the large model in a single forward pass. When the draft model guesses correctly (which it often does for predictable tokens), you get multiple tokens for the cost of one verification step.

    Speculative decoding workflow mapping sequential token generation in a small model to parallel token verification in a large model.
    Figure 6: A small draft model proposes candidate tokens, then the large model verifies them in one parallel pass.

    However, so far we've been talking about single-node deployments, which will eventually encounter challenges with memory limits, throughput bottlenecks, and latency under load. In some cases, just adding n + 1 inference engine servers won't be enough.

    That's why the llm-d project exists: to separate the prefill (processing the prompt) and decode (generating tokens) stages of inference across different hardware. This split allows each stage to be optimized and scaled independently on Kubernetes.

    Benchmarking llama.cpp vs. vLLM

    To put concrete numbers behind these architectural differences, we benchmarked both engines using GuideLLM, an open source toolkit for evaluating LLM inference performance. The test ran Llama 3.1 8B at full precision (16-bit) on a single NVIDIA H200 GPU with concurrency levels from 1 to 64 simultaneous users. For full methodology and additional metrics, see the detailed benchmark comparison.

    Throughput: How many users can we serve?

    At a single concurrent request, both engines produce tokens at a comparable rate. The difference emerges as concurrency increases.

    Token throughput comparison between vLLM and Llama.cpp from 1 to 64 concurrent requests.
    Figure 7: vLLM's output throughput scales with concurrency. At 64 simultaneous users, it generated roughly 44 times more tokens per second than llama.cpp.

    Responsiveness: Time to first token under load

    The time to first token (TTFT) metric shows how long a user waits before the first token of a response arrives, which is especially relevant for interactive applications.

    P99 time to first token comparison between vLLM and Llama.cpp across 1 to 64 concurrent requests.
    Figure 8: vLLM's P99 TTFT remains low and stable across all concurrency levels, while llama.cpp's TTFT grows exponentially. At 64 concurrent users, it takes more than three minutes before receiving the first token.

    This is a consequence of llama.cpp's sequential queuing model: requests are processed one at a time, so later arrivals wait in line. For a single user, both engines respond immediately.

    Should you choose llama.cpp or vLLM?

    Both engines serve models through an OpenAI-compatible API endpoint. That means whether you're building RAG, AI agents, or any application that talks to an LLM, swapping between llama.cpp and vLLM (or a hosted API like OpenAI) is essentially a URL change. No code rewrite required.

    Python script code highlighting the MODEL_ENDPOINT environment variable configured to a localhost URL.
    Figure 9: Developers can easily switch between a proprietary and local model just by replacing one endpoint with localhost.

    Choose llama.cpp (or Ollama and LM Studio) when you are prototyping on your laptop or workstation, have a consumer-grade or no GPU, want to quickly test different models by swapping GGUF files, and need offline inference in situations like factory floors and Internet of Things (IoT).

    Note

    Explore our side-by-side comparison of Ollama and vLLM to see exactly how they match up under pressure.

    On the other hand, choose vLLM when you need to serve multiple concurrent users, have access to data center GPUs (such as the A100 or H100), need to meet specific latency service-level agreements (SLAs), and want features like disaggregated serving with llm-d.

    Choosing the right engine for your AI journey

    The typical path many teams follow is starting with a paid API (such as OpenAI or Anthropic) for quick prototyping, watching the bill grow, and then moving to llama.cpp on a local machine for development. When it's time to deploy to users, they switch to vLLM on GPU infrastructure.

    To get started, the llama-cli tool (for experimentation) and llama-server tool (for serving) support llama.cpp workflows, while the vLLM Recipes provide guides and commands for running vLLM models. You can also explore the Red Hat AI collection of optimized models, which are ready for deployment. Balancing these architectural tradeoffs allows you to select the precise infrastructure required to sustain your local LLM deployment.

    Related Posts

    • vLLM or llama.cpp: Choosing the right LLM inference engine for your use case

    • Reach native speed with MacOS llama.cpp container inference

    • Introduction to distributed inference with llm-d

    • Master KV cache aware routing with llm-d for efficient AI inference

    • Getting started with llm-d for distributed AI inference

    • Practical strategies for vLLM performance tuning

    Recent Posts

    • MPI-powered gradient synchronization in PyTorch distributed training

    • llama.cpp vs. vLLM: Choosing the right local LLM inference engine

    • How speculative decoding delivers faster LLM inference

    • What's New in Red Hat Developer Hub 1.10?

    • Model-as-a-Service: How to run your own private AI API

    What’s up next?

    Try Red Hat AI Inference to deploy high-throughput, distributed LLMs on your own infrastructure using an enterprise runtime powered by vLLM and llm-d.

    Try Red Hat AI Inference
    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.