Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Inside the vLLM-Omni architecture: Serving Qwen3-Omni

July 1, 2026
Isaac Tigges
Related topics:
AI inferenceArtificial intelligence
Related products:
Red Hat AIRed Hat AI Inference

    Serving a large language model (LLM) is a well-worn path by now: tokens in, tokens out, and vLLM has spent years making that fast and efficient. Multimodal inputs already fit that path. For example, a vision-language model encodes an image into embeddings, merges them with the text, and relies on autoregressive text generation underneath.

    Multimodal outputs introduce a different structure. A model like Qwen3-Omni takes in text, images, audio, and video and can talk back. To do that, it isn't one model at all; it's a pipeline of stages: encoders, an autoregressive language model, and generation stages. Serving it well means serving that whole pipeline, which is a different problem than serving a single decoder, and it's the problem vLLM-Omni exists to solve.

    I wanted to see what that actually looks like under load, so I built an insurance claim triage demo on a single hardware accelerator and instrumented it to show the engine working stage by stage. This post walks through the demo, separates what the model does from what the engine does, and then digs into how vLLM-Omni serves the pipeline. We can map out this high-level architecture using Figure 1, layering in the technical components as we go.

    Comparison of traditional text serving to the vLLM-Omni multi-stage architecture, outlining key features and benchmarked performance efficiency.
    Figure 1: vLLM-Omni reframes serving as a diverse pipeline, with any-modality inputs flowing through encoders into the autoregressive core and back out through modality generators.

    The demo

    A claims analyst uploads a vehicle damage video and a customer voice statement. Each claim fires two concurrent requests: one for a text-only structured report for the adjuster (visible damage, statement summary, a consistency check, severity, and a recommended action), and one for a short spoken summary for the customer. Qwen3-Omni's speech is generated from its own text reasoning, so different content for the adjuster and the customer means sending two requests rather than one.

    The hook is the consistency check. When the audio matches the video, the claim auto-approves. When the same video is paired with a voice statement describing a different incident, the report flags it and routes the claim to review.

    That cross-modal reasoning is the Qwen3-Omni model, not vLLM-Omni. The Thinker's audio and vision encoders project into the same embedding space as the text tokens. Everything is concatenated into one sequence, and self-attention runs over all of it. Any token attends to any other token regardless of modality. That shared-sequence attention is why the model notices a spoken "front-end collision" and doesn't match visible rear-end damage. You'd get it from the model on plain Hugging Face Transformers. 

    vLLM-Omni doesn't change how the model reasons; it changes how efficiently you serve it. The rest of this post is about the serving, and that starts with the stages.

    The stage graph

    For Qwen3-Omni, the pipeline consists of three autoregressive stages plus encoders, which are mapped out sequentially in Figure 2:

    • Thinker (~30 billion MoE): Reasoning and text, with integrated audio and SigLIP2 vision encoders. For a text-only response, it's the only stage that runs.
    • Talker (~3 billion MoE): Generates audio codec codes from the Thinker's hidden states. This stage runs only for audio output.
    • Code2Wav (causal ConvNet vocoder): Turns codec codes into a waveform.
    Multi-stage request processing workflow mapping Qwen3-Omni architectural layers to corresponding vLLM-Omni execution components and output processors.
    Figure 2: A Qwen3-Omni request flows through three stages (Thinker, Talker, and Code2Wav), with the OmniConnector moving data between them before the result reaches the API.

    vLLM-Omni decomposes inference into a graph of these stages. The edges are functions that transform and route data between them, and a per-model YAML declares each stage's device, GPU memory fraction, and input source. Recent versions split this configuration into a  --deploy-config schema that also captures the communication topology between stages for disaggregated deployments, with the original --stage-configs-path kept for compatibility. A few things follow from that.

    One endpoint

    vllm serve <model> --omni brings up the whole graph behind a single OpenAI-compatible API. Requests use the same schema vanilla vLLM uses for images, extended to video and audio plus a modalities parameter. The cross-stage orchestration is invisible to the caller.

    It inherits vLLM

    The autoregressive stages reuse PagedAttention, continuous batching, CUDA graphs, and the scheduler. After tokenization, multimodal placeholder tokens are replaced by encoder embeddings during prefill. The KV cache stores keys and values for all tokens regardless of modality, keeping the scheduler modality-agnostic. 

    Prefix caching carries over as well: vLLM-Omni reuses vLLM's KV cache prefix caching and adds a separate cache for the hidden-state tensors passed between stages. So, a repeated prefix skips recompute on the autoregressive stages, just as it would for a plain LLM. It's supported on autoregressive stages with a single KV cache group, and you toggle it per stage with --enable-prefix-caching in the stage config (see the vLLM-Omni prefix caching documentation).

    Disaggregated resources

    Each stage gets its own GPU memory fraction and can be placed and scaled independently. In this demo, the adjuster request is text-only. While its Talker and Code2Wav stages never run, they still initialize and hold memory. The container passes --stage-overrides, reserving 15% of the card for the Talker and 10% for Code2Wav. That per-stage budgeting is the foundation for scaling the bottleneck stage instead of duplicating the whole model.

    Shared-memory transport

    Stages talk through the OmniConnector, which splits a lightweight control plane from a heavy data plane. The default same-node connector moves payloads through /dev/shm and passes only a small handle through the control plane; RDMA and TCP connectors handle cross-node deployments. The connector only moves data between stages; a separate router handles load balancing, so when you run multiple replicas of a bottleneck stage it spreads requests across them while the other stages are untouched.

    Pipeline execution

    With async chunking, the stages overlap instead of running strictly in sequence, and audio streams out as the Talker generates each chunk, as illustrated in Figure 3.

    Sequential end-to-end generation compared to overlapping streaming generation across 3 pipeline stages to lower first packet latency.
    Figure 3: With async chunking the stages overlap instead of running end to end, so the first audio packet streams out far sooner.

    The demo's customer-callback request runs the full Thinker to Talker to Code2Wav path, and a live timeline tracks each stage from real timestamps in the streaming response: the Thinker from request-sent to last text token, the Talker from the first text token (async chunking lets it start before the Thinker finishes) to the first audio byte, Code2Wav from first audio byte to completion. You can watch the Talker start before the Thinker is done and the vocoder start before the Talker is done. This overlap is the key serving advantage for audio output.

    Benchmarking performance against Hugging Face Transformers

    The right baseline for a serving engine is the same model served in a simpler way. The vLLM-Omni team's published Qwen3-TTS benchmarks do exactly that, the same Qwen3-TTS-1.7B model on both sides, which isolates what the dedicated omni engine adds rather than the model.

    The comparison uses the exact same model weights and core reasoning, which shows the difference is entirely in serving. As shown in Figure 4, a real-time factor below 1.0 means the audio output is faster than playback, which contrasts with the performance of Hugging Face Transformers (at 2.64).

    Four performance charts comparing latency, real-time factor, and throughput scaling metrics between vLLM-Omni and Hugging Face Transformers.
    Figure 4: Published by the vLLM-Omni team, these Qwen3-TTS benchmarks run the same model on both sides, with vLLM-Omni reaching a real-time factor of 0.17 against Hugging Face Transformers at 2.64.

    Not just this model

    vLLM was built around autoregressive generation: one token at a time, causal mask, KV cache. Diffusion models break all three. A diffusion transformer (Qwen-Image, Flux, Wan, Stable Diffusion 3) refines a full output over fixed denoising steps with full attention and fixed sequence length. vLLM-Omni implements a native diffusion path alongside the autoregressive one, on the same stage and model-runner abstractions, so one engine serves both. Figure 5 highlights how an omni model can mix these components, using an autoregressive stage for understanding and a diffusion stage for generation.

    Categorized layout detailing vLLM-Omni support across 4 model architecture domains: omni multimodal, TTS audio, image diffusion, and video.
    Figure 5: vLLM-Omni supports more than 30 omni and diffusion architectures, the clearest sign it's serving infrastructure rather than a single-model runner.

    The setup

    I was running this on a single B200, which is 192 GB of HBM3e at 8 TB/s on the Blackwell architecture. Qwen3-Omni's weights are only about 60 GB at BF16, so the model fit with plenty of room and capacity was never the constraint. What mattered was bandwidth: decode is memory-bandwidth-bound, and since only about 3 billion of the 30 billion parameters are active per token, the rate you can move weights and KV cache through memory is what sets generation speed. The 8 TB/s is really why I reached for Blackwell, not the spare memory.

    Review the following resources if you want to try it out yourself.

    • Docker Hub
    • vLLM-Omni docs
    • Demo: Inside the vLLM-Omni Architecture: Serving Qwen3-Omni

    Related Posts

    • Run Qwen3-Next on vLLM with Red Hat AI: A step-by-step guide

    • Optimizing distributed AI inference: Advanced deployment patterns

    • Designing distributed AI inference: Core concepts and scaling dimensions

    • llama.cpp vs. vLLM: Choosing the right local LLM inference engine

    • How speculative decoding delivers faster LLM inference

    • LLM Compressor 0.8.0: Extended support for Qwen3 and more

    Recent Posts

    • Build a multi-agent supervisor pattern on OpenShift

    • Inside the vLLM-Omni architecture: Serving Qwen3-Omni

    • Demystify the terminology of OpenShift hosted control planes

    • Scale document ingestion with Docling and Ray on OpenShift AI

    • Deploy secure agentic AI: Protocols and performance tuning

    What’s up next?

    applied ai for devs tile card

    Applied AI for Enterprise Java Development

    Alex Soto Bueno +2
    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.