Beyond the next token: Why diffusion LLMs are changing the game

Diffusion LLMs: A revolutionary approach to language models

April 27, 2026
Alon Kellner, Tomer Garber
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    If you've spent time deploying traditional large language models (LLMs), you've likely wrestled with the classic tradeoff between accuracy and performance. Typically, you're forced into a rigid architectural choice: Do you deploy a massive, slow model for deep reasoning, or a small, lightning-fast one for everyday chat? Often, teams end up gluing these models together with complex semantic routers. What if we didn't have to choose? Diffusion LLMs offer a way out of this trap, alongside a host of other potential benefits.

    Why the shift and how it works

    Traditional auto-regressive (AR) models use causal next-token prediction. They predict each token one by one, moving strictly from left to right. It works, but it's rigid.
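
    To make that concrete, here's a deliberately tiny Python sketch of left-to-right decoding. The toy_next_token function below is a stand-in for a real model's forward pass, not any library API; the point is simply that each token is chosen from the prefix alone and, once emitted, is never revised:

        # Toy stand-in for a real forward pass: maps the last token to the next one.
        def toy_next_token(prefix):
            vocab = {"The": "quick", "quick": "brown", "brown": "fox", "fox": "jumps",
                     "jumps": "over", "over": "the", "the": "lazy", "lazy": "dog", "dog": "<eos>"}
            return vocab.get(prefix[-1], "<eos>")

        def autoregressive_decode(prompt, max_new_tokens=10):
            tokens = list(prompt)
            for _ in range(max_new_tokens):
                nxt = toy_next_token(tokens)   # one forward pass per generated token
                if nxt == "<eos>":
                    break
                tokens.append(nxt)             # committed tokens are never revisited
            return tokens

        print(" ".join(autoregressive_decode(["The"])))
        # The quick brown fox jumps over the lazy dog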

    A diffusion LLM (dLLM) flips the script, offering a refreshing, dynamic alternative. Instead of guessing one word at a time, a dLLM drafts a sequence of text and then refines it, using two techniques:

    • Bidirectional attention: Unlike AR models that can only look at the past, a dLLM looks at both the past and the future tokens it is drafting. This creates incredibly coherent, context-aware output.
    • Iterative refinement: Think of it like a game of fill-in-the-blank. Instead of writing a sentence word by word, the model drafts a full sentence with blanks, like "The quick [MASK] fox jumps over the [MASK] dog." In the next step, it looks at the whole sentence to fill in "brown" and "lazy." It loops back to enhance the quality and logic of the draft with each pass, as the toy sketch after this list illustrates.
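
    Here's an equally tiny, purely illustrative Python sketch of that fill-in-the-blank loop. The toy_denoiser function stands in for a real bidirectional model, and its confidence scores are random placeholders; what matters is that every pass sees the whole draft and commits only its most confident blanks:

        import random

        MASK = "[MASK]"

        # Toy stand-in for one bidirectional forward pass: for every masked position,
        # propose a word and a confidence score, conditioned on the entire draft.
        def toy_denoiser(draft):
            answers = {2: "brown", 7: "lazy"}
            return [(i, answers.get(i, "the"), random.random())
                    for i, tok in enumerate(draft) if tok == MASK]

        def diffusion_decode(template, steps=4):
            draft = list(template)
            for _ in range(steps):
                proposals = toy_denoiser(draft)
                if not proposals:
                    break                                  # nothing left to fill in
                # Commit only the most confident half this pass; the next pass sees
                # those words as context (to the left AND the right) and fills in the rest.
                proposals.sort(key=lambda p: p[2], reverse=True)
                for i, word, _ in proposals[:max(1, len(proposals) // 2)]:
                    draft[i] = word
            return draft

        sentence = ["The", "quick", MASK, "fox", "jumps", "over", "the", MASK, "dog"]
        print(" ".join(diffusion_decode(sentence)))
        # The quick brown fox jumps over the lazy dog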

    Figure 1 shows the progression from the standard causal mask of auto-regressive models, to the full bidirectional mask of early diffusion LLMs, to the blockwise causal mask used by modern block diffusion LLMs.

    Figure 1: A technical diagram comparing attention mechanisms, showing the progression from the standard causal mask of auto-regressive models to the full bidirectional mask of early diffusion LLMs, and finally the intra-block bidirectional blockwise causal mask used in modern block diffusion LLMs. (Source: Tian, Y. et al., "From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs")

    The runtime tradeoff

    So why is this architecture so different? The magic is a beautifully simple, dynamic runtime tradeoff. With standard AR models, the compute spent on a token is baked into the architecture. If you want more intelligence, you need a bigger model or complex, latency-heavy test-time compute loops.

    A dLLM fundamentally rewrites these rules. Because a dLLM drafts multiple tokens simultaneously and refines them over multiple "steps" (similar to how an AI image slowly comes into focus), you can tune quality against latency on the fly.

    Need instant, real-time speed for a voice assistant? Run the exact same model with fewer refinement steps.

    Need deep, complex reasoning for a coding task? Turn up the steps to let the model keep refining its draft.

    There's no need to swap models, maintain multiple endpoints, or build complex routing logic. You deploy one single model and flex its performance to meet the exact needs of the moment. It's elegant, efficient, and incredibly powerful.
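
    As an illustration of what that knob could look like in practice, here's a hypothetical Python sketch (not a real serving API): one loaded model, with the number of refinement steps chosen per request class:

        from dataclasses import dataclass

        @dataclass
        class DecodeProfile:
            steps: int            # refinement passes: more = higher quality, higher latency
            block_size: int = 32

        # One deployed dLLM, several quality/latency profiles chosen per request.
        PROFILES = {
            "voice-assistant": DecodeProfile(steps=4),    # favor instant responses
            "code-reasoning": DecodeProfile(steps=32),    # favor deep refinement
        }

        def handle_request(prompt, workload):
            profile = PROFILES[workload]
            # A real server would call the single loaded model with profile.steps
            # refinement iterations; here we only show the knob being set.
            return f"generate(prompt={prompt!r}, steps={profile.steps}, block_size={profile.block_size})"

        print(handle_request("Set a timer for five minutes", "voice-assistant"))
        print(handle_request("Refactor this function for clarity", "code-reasoning"))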

    Figure 2 shows example real-world benchmark results for one dLLM, WeDLM-8B.

    Figure 2: A scatter plot and radar chart showing WeDLM-8B achieving higher accuracy and faster inference speeds than both auto-regressive baselines and previous diffusion LLMs across multiple benchmarks. (Source: Liu et al., "WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference")

    The evolution of diffusion LLMs

    This architecture is rapidly evolving across three distinct generations:

    • First generation (the pioneers): Early models like LLaDA 1.0 and Dream used a full-context approach. They were groundbreaking, but because they tried to refine the entire sequence at once and could not use standard KV caching, they were computationally heavy and slow.
    • Second generation (the pragmatists): Models like SDAR, LLaDA 2.0, and Fast-dLLM2 got smarter by shifting to a blocked causal context. By operating on smaller chunks (typically 8 to 64 tokens at a time) and enabling block-wise KV caching, they drastically improved inference speed and made dLLMs practical for real-world use (see the sketch after this list).
    • Third generation (the frontier): New models and innovations are popping up every day: LLaDA 2.1 with token editing and WeDLM with stream decoding, to name just two. And that's just the beginning for this self-reinventing third generation.
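
    To illustrate the blocked causal idea behind that second generation, here's a conceptual Python sketch, again with hypothetical helper names rather than a real library API. The sequence grows block by block: each block is drafted all at once and refined bidirectionally for a few passes, while finished blocks are frozen so their key/value activations can be cached:

        MASK = "[MASK]"

        # Hypothetical stand-in for intra-block bidirectional refinement: each pass fills
        # in a few more positions, conditioned on the frozen context and on every other
        # position in the block.
        def refine_block(context, block, steps):
            out = list(block)
            for step in range(steps):
                for i, tok in enumerate(out):
                    if tok == MASK and i % steps == step:   # toy unmasking schedule
                        out[i] = f"w{len(context) + i}"
            return out

        def block_diffusion_decode(prompt, num_blocks=3, block_size=8, steps_per_block=4):
            sequence = list(prompt)              # completed tokens; their KV pairs can be cached
            for _ in range(num_blocks):
                block = [MASK] * block_size      # draft the whole next block at once
                block = refine_block(sequence, block, steps_per_block)
                sequence.extend(block)           # freeze the block; earlier blocks are never revisited
            return sequence

        print(block_diffusion_decode(["Once", "upon", "a", "time"]))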

    The need for speed and scale

    The demand is undeniable, and the open source community is moving fast. Popular models like LLaDA2.0-mini, LLaDA2.1-flash, Fast_dLLM_v2_1-1.5B, and SDAR-1.7B are experiencing a surge in downloads on Hugging Face, with some reaching 150,000 monthly downloads.

    We're seeing trending dLLMs like the open source LLaDA 2.X hitting over 800 output tokens per second (TPS), while Mercury 2 is crossing the 1,000+ TPS threshold. Looking at the data, models like WeDLM and LLaDA 2.1 are outperforming baseline vLLM auto-regressive models, as shown in Figure 3.

    Figure 3: Bar chart showing Mercury 2 at the top with 877 output tokens per second, significantly higher than models like Nemotron 3, GPT-4o, and Gemini 3.1 Flash. (Source: Artificial Analysis)

    Building the future at Red Hat

    At Red Hat, we believe in taking brave leaps forward along with the rest of the open source community. We aren't just watching this dLLM trend happen; we're actively exploring this promising architecture to build the infrastructure that supports it. The journey into diffusion LLMs is just beginning, and it's full of brilliant possibilities. Stay tuned for more updates, and as always, keep building with curiosity!

    You can also leverage the no-cost 60-day Red Hat AI trial to test leading and emerging models and experience the benefits they can unlock for your unique business use cases.
