Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Beyond the next token: Why diffusion LLMs are changing the game

Diffusion LLMs: A revolutionary approach to language models

April 27, 2026
Alon Kellner Tomer Garber
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    If you've spent time deploying traditional large language models (LLMs), you've likely wrestled with the classic tradeoff between accuracy and performance. Typically, we're forced to make a rigid architectural choice: Do you deploy a massive, slow model for deep reasoning, or a small, lightning-fast one for everyday chat? Often, we end up gluing these models together with complex semantic routers. What if we didn't have to choose? Diffusion LLMs offer a way out of this trap, alongside a host of other potential benefits.

    Why the shift and how it works

    Traditional auto-regressive (AR) models use causal next token prediction. They predict the next token one by one, moving strictly from left to right. It works, but it's rigid.

    A diffusion LLM (dLLM) flips the script, offering a refreshing, dynamic alternative. Instead of guessing one word at a time, a dLLM drafts a sequence of text and then refines it, using two techniques:

    • Bidirectional attention: Unlike AR models that can only look at the past, a dLLM looks at both the past and the future tokens it is drafting. This creates incredibly coherent, context-aware output.
    • Iterative refinement: Think of it like a game of fill-in-the-blank. Instead of writing a sentence word-by-word, the model drafts a full sentence with blanks—like "The quick [MASK] fox jumps over the [MASK] dog." In the next step, it looks at the whole sentence to perfectly fill in "brown" and "lazy." It loops back to enhance the quality and logic of the draft with each pass.

    Figure 1 shows the progression from a standard causal mask of auto-regressive models to the full bidirectional mask of early diffusion LLMs.

    A technical diagram comparing attention mechanisms, showing the progression from the standard causal mask of auto-regressive models to the full bidirectional mask of early diffusion LLMs, and finally the intra-block bidirectional blockwise causal mask used in modern block diffusion LLMs. (Source: Tian, Y. et al., "From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs")
    A technical diagram comparing attention mechanisms, showing the progression from the standard causal mask of auto-regressive models to the full bidirectional mask of early diffusion LLMs, and finally the intra-block bidirectional blockwise causal mask used in modern block diffusion LLMs. (Source: Tian, Y. et al., "From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs")

    The runtime tradeoff

    So why is this architecture so different? The magic is a beautifully simple, dynamic runtime tradeoff. With standard AR models, the compute spent on a token is baked into the architecture. If you want more intelligence, you need a bigger model or complex, latency-heavy test-time compute loops.

    A dLLM fundamentally rewrites these rules. Because a dLLM drafts multiple tokens simultaneously and refines them over multiple "steps" (similar to how an AI image slowly comes into focus), you can tune quality against latency on the fly.

    Need instant, real-time speed for a voice assistant? Run the exact same model at fewer steps.

    Need deep, complex reasoning for a coding task? Turn the steps up to let the model refine its draft.

    There's no need to swap models, maintain multiple endpoints, or build complex routing logic. You deploy one single model and flex its performance to meet the exact needs of the moment. It's elegant, efficient, and incredibly powerful.

    Figure 2 shows some example real-world statistics for a dLLM.

    A scatter plot and radar chart showing WeDLM-8B achieving higher accuracy and faster inference speeds than both auto-regressive baselines and previous diffusion LLMs across multiple benchmarks. (Source: Liu et al., "WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference")
    Figure 2: A scatter plot and radar chart showing WeDLM-8B achieving higher accuracy and faster inference speeds than both auto-regressive baselines and previous diffusion LLMs across multiple benchmarks. (Source: Liu et al., "WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference")

    The evolution of diffusion LLMs

    This architecture is rapidly evolving across three distinct generations:

    • First generation (the pioneers): Early models like LLaDA 1.0, and Dream used a full context approach. They were groundbreaking, but because they tried to refine the entire sequence at once and could not use standard KV caching, they were computationally heavy and slow.
    • Second generation (the pragmatists): Models like SDAR, LLaDA 2.0, and Fast-dLLM2 got smarter by shifting to a blocked causal context. By operating on smaller chunks (typically 8 to 64 tokens at a time) and enabling block-wise KV caching, they drastically improved inference speed and made dLLMs practical for real-world use.
    • Third generation (the frontier): New models and innovations are popping up every day. LLaDA 2.1 with token editing, and WeDLM with stream decoding, just to name two. That's just the beginning for this self-reinventing third generation.

    The need for speed and scale

    The demand is undeniable, and the open source community is moving fast. Popular models like LLaDA2.0-mini, LLaDA2.1-flash, Fast_dLLM_v2_1-1.5B, and SDAR-1.7B are experiencing a surge in downloads on Hugging Face, with some reaching 150,000 monthly downloads.

    We're seeing trending dLLMs like the open source LLaDA 2.X hitting over 800 output tokens per second (TPS), while the Mercury 2 is crossing the 1000+ TPS threshold. When looking at the data, models like WeDLM and LLaDA 2.1 are outperforming baseline vLLM auto-regressive models, as shown in Figure 3.

    Bar chart showing Mercury 2 at the top with 877 output tokens per second, significantly higher than models like Nemotron 3, GPT-4o, and Gemini 3.1 Flash. (Source: Artificial Analysis)
    Figure 3: Bar chart showing Mercury 2 at the top with 877 output tokens per second, significantly higher than models like Nemotron 3, GPT-4o, and Gemini 3.1 Flash. (Source: Artificial Analysis)

    Building the future at Red Hat

    At Red Hat, we believe in taking brave leaps forward along with the rest of the open source community. We aren't just watching this dLLM trend happen, we're actively exploring this promising architecture to build the infrastructure that supports it. The journey into diffusion LLMs is just beginning, and it's full of brilliant possibilities. Stay tuned for more updates, and as always, keep building with curiosity!

    You can also leverage the no-cost 60-day Red Hat AI trial to test leading and emerging models and experience the benefits they can unlock for your unique business use cases.

    Related Posts

    • pip install vllm: The iceberg under a single command

    • Run Model-as-a-Service for multiple LLMs on OpenShift

    • Estimate GPU memory for LLM fine-tuning with Red Hat AI

    • Practical strategies for vLLM performance tuning

    • Making LLMs boring: From chatbots to semantic processors

    Recent Posts

    • Every layer counts: Defense in depth for AI agents with Red Hat AI

    • Fun in the RUN instruction: Why container builds with distroless images can surprise you

    • Trusted software factory: Building trust in the agentic AI era

    • Build a zero trust AI pipeline with OpenShift and RHEL CVMs

    • Red Hat Hardened Images: Top 5 benefits for software developers

    What’s up next?

    Learning Path AI sparkles and a tiny red hat on a dark background

    Get started with consuming GPU-hosted large language models on Developer Sandbox

    Learn the many ways you can interact with GPU-hosted large language models...
    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.