How speculative decoding delivers faster LLM inference

Using the tortoise and the hare fable to accelerate production inference

June 12, 2026
Sawyer Bowerman
Related topics:
AI inference, Artificial intelligence
Related products:
Red Hat AI

    Remember the classic fable where the hare races ahead while the tortoise plods along steadily? In the end, slow and steady wins the race.

    I'm about to blow your mind here—but what if they worked together?

    What if the hare could sprint ahead and make educated guesses about the terrain, while the tortoise validated the entire path in a single glance based on what the hare told it? Competition? Out the window. Turtle soup? Forget it. The collaboration is what makes them so efficient. That's speculative decoding in a nutshell, and it's one of the most underutilized optimizations in production large language model (LLM) deployments.

    If you're serving LLMs in production without speculative decoding, you could be leaving up to roughly three times the throughput on the table for code generation, structured outputs, and other predictable, non-creative workloads.

    The problem: One token at a time

    Typically, LLMs generate text autoregressively. That's a fancy way of saying they produce one token at a time. Each token requires a full forward pass through billions—potentially trillions—of parameters.

    Each step in this painstakingly slow process is sequential. You can't generate token 50 until you've generated token 49. Each forward pass through a 70B parameter model takes time, even on high-end GPUs. You wouldn't go to the grocery store, get one item, take it home, and then go back for another. This is the fundamental bottleneck in LLM inference.

    The solution: Let the hare run ahead

    Sequential generation is inefficient, but not all is lost. Speculative decoding breaks the one-token-at-a-time constraint by using two models:

    • The hare (speculator model): This is a small, fast model (0.5 to 2B parameters) that generates three to five tokens speculatively. It is occasionally incorrect, but it is fast.
    • The tortoise (target model): This is the production model (7B, 70B, or larger) that verifies the tokens from the speculator model in a single parallel forward pass.

    Here's the upside: When the hare guesses correctly, you receive three to five tokens for the price of one forward pass. When it's wrong, you only lose a few microseconds rejecting the incorrect tokens. The final output is identical to the original process. Speculative decoding simply gets you there faster.

    Additionally, speculative decoding is lossless in nature. Unlike quantization, which results in a slight loss in model accuracy as the model becomes more compressed, speculative decoding preserves full accuracy. In the worst case, it slows time to first token (TTFT) by a negligible margin.
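The lossless claim rests on how the verifier accepts or rejects each draft token when sampling. The following is a minimal sketch of the standard speculative sampling rule from the research literature, written for a single token position. The names `verify_token` and `empirical_distribution`, the toy distributions, and the dictionary representation are illustrative assumptions, not vLLM's internal code.

```python
import random
from collections import Counter
from typing import Dict

Dist = Dict[str, float]  # token -> probability


def verify_token(token: str, p_target: Dist, q_draft: Dist, rng: random.Random) -> str:
    """Return the token to emit for one position, given the speculator's draft."""
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the draft token was sampled from q, so q > 0
    # Accept the draft with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return token
    # Otherwise resample from the leftover mass max(0, p - q).
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    tokens = list(residual)
    return rng.choices(tokens, weights=[residual[t] for t in tokens], k=1)[0]


def empirical_distribution(p_target: Dist, q_draft: Dist, n: int, seed: int = 0) -> Dist:
    """Draft from q, verify against p, and return the observed token frequencies."""
    rng = random.Random(seed)
    draft_tokens = list(q_draft)
    draft_weights = [q_draft[t] for t in draft_tokens]
    counts: Counter = Counter()
    for _ in range(n):
        draft = rng.choices(draft_tokens, weights=draft_weights, k=1)[0]
        counts[verify_token(draft, p_target, q_draft, rng)] += 1
    return {t: counts[t] / n for t in p_target}


target = {"a": 0.7, "b": 0.2, "c": 0.1}  # toy verifier distribution
draft = {"a": 0.4, "b": 0.5, "c": 0.1}   # toy speculator distribution
print(empirical_distribution(target, draft, n=20_000))
```

Even though the toy speculator prefers "b", the observed frequencies should land close to the verifier's distribution, which is the sense in which the technique preserves output quality.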

    Here's the downside: Creating and running your own speculators requires training and fine-tuning a smaller model on the same dataset as the main model. This is easier said than done. To get started, you can visit the vLLM Speculators GitHub repository to create your own, or the Red Hat AI Hugging Face repository to download a pretrained model.

    How it actually works

    Let's walk through a concrete example. You're generating code, and the next logical tokens are for i in range(10):.

    Step 1: The hare sprints ahead

    The speculator model (for example, a 1B parameter model) generates four tokens speculatively:

    for → i → in → range

    Step 2: The tortoise validates in parallel

    The verifier model (for example, a 70B parameter model) performs a single forward pass to evaluate all four draft tokens simultaneously:

    • for: Correct
    • i: Correct
    • in: Correct
    • range: Correct

    The system accepts all four tokens. You have generated four tokens with a single forward pass through the large model.

    Step 3: What happens when the hare is incorrect?

    Consider a scenario where the speculator model predicts the following:

    for → loop → in → range

    The verifier model validates the tokens:

    • for: Correct
    • loop: Incorrect (should be i)

    The system accepts the tokens before the first mistake (here, for) and rejects the mistaken token along with everything after it. The verifier model supplies the correct token (i), and the process continues from there.

    The key insight: Incorrect guesses have a negligible cost. You perform the same forward pass either way. When the hare is correct, which occurs 50 to 80% of the time for predictable tasks, the system achieves significant speed improvements.
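The walk-through above can be sketched as a toy draft-then-verify round with greedy decoding. This is a simplified illustration under stated assumptions: `speculative_step`, `target_next`, `good_draft`, and `bad_draft` are hypothetical names, the "models" are lookup functions, and the verifier is called in a loop here even though a real engine scores all draft positions in one batched forward pass. Real implementations also typically emit one extra token when every draft is accepted; that is left out to match the example.

```python
from typing import Callable, List, Sequence

Token = str
NextToken = Callable[[Sequence[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The speculator drafts k tokens, one after another.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The verifier checks each draft position in order.
    accepted: List[Token] = []
    for guess in draft:
        correct = target_next(prefix + accepted)
        if guess != correct:
            # First mismatch: keep the verifier's token, drop the rest.
            accepted.append(correct)
            return accepted
        accepted.append(guess)
    return accepted


TARGET = ["for", "i", "in", "range", "(", "10", ")", ":"]


def target_next(prefix: Sequence[Token]) -> Token:
    return TARGET[len(prefix)]


def good_draft(prefix: Sequence[Token]) -> Token:
    return TARGET[len(prefix)]


def bad_draft(prefix: Sequence[Token]) -> Token:
    # Guesses "loop" instead of "i" at the second position.
    return "loop" if len(prefix) == 1 else TARGET[len(prefix)]


print(speculative_step([], good_draft, target_next, k=4))  # ['for', 'i', 'in', 'range']
print(speculative_step([], bad_draft, target_next, k=4))   # ['for', 'i']
```

In both cases the appended tokens are exactly what the verifier would have produced on its own; only the number of tokens gained per round changes.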

    When the hare wins

    Speculative decoding is most effective in low-concurrency, interactive serving scenarios where you process a smaller number of requests at a time. Notable environments include:

    • Single-user interactive sessions (chatbots, coding assistants)
    • Low-latency API endpoints (serving individual requests)
    • Real-time applications (where response speed is crucial)

    At small batch sizes, your GPU has idle compute capacity between sequential token generations. Speculative decoding puts that idle capacity to work drafting and verifying extra tokens. The following examples demonstrate where speculative decoding generally performs well:

    • Code generation: Programming languages have syntax rules. After def function_name(, the subsequent tokens are highly constrained. Small speculator models learn these patterns well.
    • Structured outputs: When you generate JSON, XML, or API responses, the format is predictable. Keys, brackets, and common patterns repeat constantly.
    • Repetitive tasks: Summarization with standard formats, Q&A with consistent structure, or template-based generation all benefit from speculative decoding.

    When the hare loses

    Speculative decoding is not a universal solution. It loses effectiveness in high-concurrency, offline batch scenarios. Examples include large batch processing and offline bulk inference (for example, processing datasets overnight).

    Why is this the case? At larger batch sizes, the GPU is fully saturated because it is processing multiple requests in parallel. This means there isn't any headroom for the speculator model to sit comfortably in. In these scenarios, the speculator model becomes counterproductive, as it adds computational overhead without providing a performance benefit.

    The technique is also less effective for highly creative outputs. This includes poetry, fiction, or marketing copy where token choices are often novel. The speculator model's acceptance rate decreases because it is less likely to predict creative tokens accurately. Running the speculator model in these cases can waste compute resources and increase your time to first token (TTFT) for minimal gain.

    Poorly aligned speculator models

    If your speculator model was trained on different data from your verifier model, acceptance rates collapse as well. You need domain alignment between the hare and the tortoise.

    You can mitigate this by using the vLLM speculators project. If a model doesn't have an associated speculator model for speculative decoding, you can train one.

    How to implement speculative decoding

    You can follow these steps to implement speculative decoding. This guide focuses on vLLM, as it is a production-ready implementation.

    Step 1: Choose your speculator model (the hare)

    The speculator model must meet the following criteria:

    • Size: 10 to 50 times smaller than the verifier model (for example, 2B for a 70B target).
    • Domain alignment: Trained on similar data to the verifier model.
    • Speed: Optimized for speed rather than accuracy.

    A recommended starting point is the Red Hat AI speculator models. These models serve as speculators for popular verifiers, such as Gemma, Qwen, Llama, and Mistral. These models are trained to predict the output of their corresponding flagship models.

    You can find available speculator models on the Red Hat AI Hugging Face repository (Figure 1).

    Hugging Face repository page for Red Hat AI with a list of speculator models for Gemma, Qwen, and Llama.
    Figure 1: List of speculator models in the Red Hat AI Hugging Face repository.

    If a verifier model does not have an associated speculator model, you can train one using the vLLM Speculators project.

    Step 2: Deploy with vLLM

    The vLLM engine supports speculative decoding through a single configuration flag. The following examples show the full implementation:

    Python API:

    from vllm import LLM, SamplingParams

    # Initialize with speculative decoding. Recent vLLM releases take the
    # speculator settings as a single speculative_config dictionary.
    llm = LLM(
        model="RedHatAI/gemma-4-31B-it-FP8-Dynamic",
        speculative_config={
            "method": "eagle3",
            "model": "RedHatAI/gemma-4-31B-it-speculator.eagle3",
            "num_speculative_tokens": 5,
        },
        gpu_memory_utilization=0.9,
        dtype="auto",
    )

    # Use it like normal vLLM
    sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
    outputs = llm.generate(["Write a Python function to parse JSON"], sampling_params)
    for output in outputs:
        print(output.outputs[0].text)

    OpenAI-compatible Server:

    vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
      --speculative-config '{"method": "eagle3", "model": "RedHatAI/gemma-4-31B-it-speculator.eagle3", "num_speculative_tokens": 5}' \
      --gpu-memory-utilization 0.9 \
      --dtype auto

    You can then call the server using the OpenAI Python library:

    from openai import OpenAI
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="EMPTY"
    )
    response = client.chat.completions.create(
        model="RedHatAI/gemma-4-31B-it-FP8-Dynamic",
        messages=[{"role": "user", "content": "Generate SQL for user analytics"}]
    )

    Docker deployment:

    docker run --gpus all \
      -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      vllm/vllm-openai:latest \
      --model RedHatAI/gemma-4-31B-it-FP8-Dynamic \
      --speculative-config '{"method": "eagle3", "model": "RedHatAI/gemma-4-31B-it-speculator.eagle3", "num_speculative_tokens": 5}'

    Step 3: Tune the configuration

    The performance of speculative decoding depends on your specific workload and hardware. To optimize your deployment, focus on these three areas:

    Number of speculative tokens

    Begin with a value of four to five tokens. Selecting too many tokens wastes processing time on drafts that end up rejected. Conversely, selecting too few tokens might not result in a significant performance improvement. If you see consistently high acceptance rates, you can increase the number of speculative tokens to improve throughput.

    # Conservative (safer for unpredictable tasks)
    num_speculative_tokens=3
    # Aggressive (best for code/structured output)
    num_speculative_tokens=10
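To get a feel for the diminishing returns of longer drafts, here is a rough sketch that uses the closed-form estimate from the speculative decoding literature. It assumes every draft token is accepted independently with the same probability `alpha`, counts the one token the verifier itself contributes each round, and ignores the speculator's own runtime, so treat the numbers as an upper bound rather than a prediction. The function name is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier forward pass with k draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for k in (3, 5, 10):
    print(k, round(expected_tokens_per_pass(0.7, k), 2))
```

Under these assumptions, going from 5 to 10 draft tokens at a 70% per-token acceptance probability adds far less than going from 0 to 5, which is why large values only pay off when acceptance is very high.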

    Monitor the acceptance rate

    The acceptance rate is your golden metric for performance. Track the percentage of speculator tokens that the verifier model successfully validates.

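    # The OpenAI-compatible server exposes Prometheus metrics at /metrics.
    # The offline LLM class turns stats collection off by default, so enable it here.
    llm = LLM(
        model="...",
        speculative_config={"model": "...", "num_speculative_tokens": 5},
        disable_log_stats=False,
    )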

    Target acceptance rates:

    • 60 to 80%: You're in the sweet spot, a two- to three-times speed improvement.
    • Below 50%: Your speculator model might be poorly aligned with the target model, or the workload might be too creative for effective prediction.
    • Above 85%: Consider increasing the num_speculative_tokens value to improve performance further.
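As a small sketch of how these bands could be applied to raw counters: the function and argument names below are hypothetical, and the handling of rates that fall between the listed bands (50 to 60% and 80 to 85%) is an assumption, since the guidance above does not cover them.

```python
def acceptance_rate(accepted_tokens: int, draft_tokens: int) -> float:
    """Fraction of drafted tokens that the verifier accepted."""
    if draft_tokens <= 0:
        raise ValueError("no draft tokens recorded yet")
    return accepted_tokens / draft_tokens


def interpret(rate: float) -> str:
    """Map an acceptance rate onto the target bands listed above."""
    if rate > 0.85:
        return "high: consider increasing num_speculative_tokens"
    if 0.60 <= rate <= 0.80:
        return "sweet spot"
    if rate < 0.50:
        return "low: check speculator alignment or workload predictability"
    # Assumption: rates between the documented bands are simply flagged.
    return "between bands: keep monitoring"


rate = acceptance_rate(accepted_tokens=7_200, draft_tokens=10_000)
print(f"{rate:.0%} -> {interpret(rate)}")
```

In practice, the two counters would come from whatever metrics your serving engine exposes, sampled over the same time window.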

    Batch size

    Speculative decoding works best with smaller batch sizes of one to eight. At large batch sizes (32 or more), your GPU is already saturated, and the performance benefit diminishes.

    Step 4: Measure the impact

    Track these metrics before and after:

    • Tokens per second (TPS): This should increase substantially for most workloads.
    • Time to first token (TTFT): This might increase slightly due to speculator model overhead.
    • Time per output token (TPOT): This should decrease significantly.
    • Cost per 1,000 tokens: This should also decrease substantially.
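The latency metrics above can be derived from per-token arrival times, for example when consuming a streaming response. This is a minimal sketch using common definitions (TPOT as the average gap between output tokens after the first one). `LatencyStats` and `latency_stats` are hypothetical names, and the timestamps are made-up sample values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyStats:
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # average time per output token after the first
    tokens_per_s: float  # end-to-end tokens per second for the request


def latency_stats(request_start: float, token_times: Sequence[float]) -> LatencyStats:
    """Compute TTFT, TPOT, and TPS from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    n = len(token_times)
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    if total <= 0:
        raise ValueError("timestamps must be later than request_start")
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return LatencyStats(ttft_s=ttft, tpot_s=tpot, tokens_per_s=n / total)


# Sample values: first token after 0.5 s, then one token every 0.1 s.
stats = latency_stats(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(stats)
```

Collecting these numbers for the same prompts before and after enabling speculative decoding keeps the comparison like-for-like.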

    If you do not see a performance increase of at least roughly 1.5 times, your workload might not be predictable enough, or your speculator model might not be well aligned with the verifier.

    The following comparison demonstrates the performance difference between standard inference and inference with an aggressive speculator:

    Standard inference (left):

    vllm serve Qwen/Qwen3.5-9B --max-num-batched-tokens 32768

    Speculative decoding (right):

    vllm serve Qwen/Qwen3.5-9B --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-9B-DFlash", "num_speculative_tokens": 15}' --max-num-batched-tokens 32768

    Both configurations run on a single H100 GPU.

    In standard inference (left), the Qwen3.5-9B model achieves a throughput of approximately 145 tokens per second. With speculative decoding (right), the same model reaches approximately 424 tokens per second, a roughly 2.9 times increase. Although the TTFT is slightly higher, speculative decoding comfortably outpaces standard inference over the length of a response.

    The hidden benefit: Cost reduction

    Speculative decoding does more than speed up processing; it also cuts costs.

    When running on cloud-based GPUs, the cost per token decreases significantly:

    • Standard inference: 100 tokens per second at $5 per hour is 360,000 tokens per hour, or about $0.0139 per 1,000 tokens.
    • Speculative decoding (2.5 times faster): 250 tokens per second at $5 per hour is 900,000 tokens per hour, or about $0.0056 per 1,000 tokens.

    That's a 60% cost reduction using the same hardware. Alternatively, you can serve 2.5 times more users on the same GPU cluster.

    For a production deployment serving 10 million tokens per day:

    • Standard cost: about $139 per day
    • Optimized cost: about $56 per day
    • Annual savings: about $30,400

    Boom!
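The cost arithmetic can be sketched in a few lines, assuming the GPU is billed by the hour and runs at a steady throughput for the whole hour. The function name and the inputs are illustrative; substitute your own hourly price and measured tokens per second.

```python
def cost_per_1k_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """Cost in USD to generate 1,000 tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1000


baseline = cost_per_1k_tokens(gpu_usd_per_hour=5.0, tokens_per_second=100)
speculative = cost_per_1k_tokens(gpu_usd_per_hour=5.0, tokens_per_second=250)
reduction = 1 - speculative / baseline

print(f"baseline:    ${baseline:.4f} per 1,000 tokens")
print(f"speculative: ${speculative:.4f} per 1,000 tokens")
print(f"reduction:   {reduction:.0%}")
```

Because the hourly price is fixed, the relative saving depends only on the speedup: a 2.5 times throughput gain always works out to a 60% lower cost per token.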

    Both the tortoise and the hare win

    In Aesop's fable, the tortoise wins by being steady and reliable, while the hare loses because of overconfidence.

    In speculative decoding, these two models collaborate. The hare sprints ahead with educated guesses, while the tortoise validates the results in parallel to make sure they are accurate. Together, they deliver faster inference with no loss in quality.

    This optimization is available without additional licensing or infrastructure costs. Implementing this technique requires a configuration change rather than a model replacement. You can continue to use the verifier model you trust while generating tokens more quickly.

    The action plan

    If you serve LLMs in production and your workload involves:

    • Code generation
    • Structured outputs (JSON, SQL, API responses)
    • Template-based generation
    • Predictable patterns

    Then you should use speculative decoding. Here are the next steps:

    1. Identify your workload type: Is it predictable or creative?
    2. Choose a speculator model: Check the Red Hat AI Hugging Face repository for speculator models or train your own.
    3. Enable speculative decoding: Implement a configuration change in vLLM, TensorRT-LLM, or another supported engine.
    4. Measure the acceptance rate: Aim for a target of 60 to 80% for predictable workloads.
    5. Monitor cost savings: Expect a 50 to 60% reduction in cost per token.

    Default configurations are built for demos, not production. Your GPU can do a whole lot more than you're currently asking of it. Give it a try.

    Want to learn more about LLM optimization? Visit the Red Hat AI Hugging Face repository for more than 600 pre-optimized models and speculator models ready for production.
