vLLM V1: Accelerating multimodal inference for large language models

How vLLM V1 drives enhanced support for multimodal LLMs

February 27, 2025
Michael Goin, Addie Stevens, Roger Wang (Senior Machine Learning Engineer, ML Platform at Roblox)
Related topics:
Artificial intelligence, Data science, Open source
Related products:
Red Hat AI

    This blog recaps the February 6th vLLM Office Hour, where host Michael Goin was joined by Roger Wang, a vLLM committer from Roblox, to discuss the new multimodal capabilities in vLLM V1.

    In the AI space, efficient inference isn’t just about speed; it’s about flexibility, scalability, and the ability to seamlessly handle diverse data modalities—beyond just text. vLLM has emerged as the open source standard for serving language model inference, supporting models from Hugging Face and more across a wide array of hardware. With robust support for GPUs, TPUs, and even CPUs, vLLM is paving the way for next-generation multimodal applications.

    In this article, we dive into the innovations behind vLLM V1 (V1 Alpha), which addresses the challenges of multimodal inference encountered in V0. We’ll explore the design decisions that enhance performance, from encoder caching to optimized data processing, and share benchmark results that highlight the improvements. Finally, we’ll outline our vision for future work to further push the boundaries of efficient, scalable AI.

    About vLLM

    vLLM is the go-to open source model serving framework for LM inference. Its design emphasizes:

    • Speed and ease of use: vLLM works out-of-the-box with models from Hugging Face and supports dozens of key models.
    • Hardware versatility: Built on PyTorch, vLLM isn’t limited to NVIDIA GPUs. It extends support to AMD GPUs, Google TPUs, AWS Accelerators, Intel accelerators, and even CPUs.
    • Beyond text-only models: Today’s applications demand multimodal capabilities. vLLM now supports not only text but also images, audio, and video inputs—enabling tasks like document parsing, object recognition, video understanding, and computer use.
    • Advanced inference optimizations: With features like quantization, chunked prefill, and prefix caching, vLLM is continually optimized for both high-throughput and low-latency inference.

    Learn more: Meet vLLM: For faster, more efficient LLM inference and serving

    Overview of large multimodal models

    Modern large multimodal models typically leverage a decoder-only language model (LM) backbone paired with an encoder for non-text modalities. In practice, when you provide an image or audio clip, it’s first transformed into embeddings by a dedicated encoder. These embeddings are then merged with text embeddings and fed into the decoder LM.

    For example:

    • LLaVA: Uses CLIP to encode images into embeddings before merging them with text (see Figure 1).
    • Qwen2-audio: Uses a Whisper audio encoder to process audio inputs, which are then merged with text embeddings for decoding.
    Figure 1: LLaVA architecture.
    Source: https://encord.com/blog/llava-large-language-vision-assistant/

    vLLM’s flexible architecture now supports this diverse range of inputs, setting the stage for richer, more capable multimodal applications.
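    The merge step described above can be sketched in a few lines. This is a toy illustration, not vLLM's implementation: the placeholder ID, the `embed_text` helper, and the 4-dimensional vectors are all assumptions made for the example. It shows encoder outputs being written into the positions held by placeholder tokens before the sequence reaches the decoder.

```python
from typing import List

# Hypothetical placeholder ID standing in for a token such as "<image>".
IMAGE_TOKEN_ID = -1


def embed_text(token_id: int, dim: int = 4) -> List[float]:
    """Toy text embedding: a deterministic vector derived from the token ID."""
    return [float(token_id)] * dim


def merge_embeddings(
    token_ids: List[int], image_embeddings: List[List[float]]
) -> List[List[float]]:
    """Replace each placeholder position with the next row of encoder output."""
    num_placeholders = sum(1 for t in token_ids if t == IMAGE_TOKEN_ID)
    if num_placeholders != len(image_embeddings):
        raise ValueError(
            f"{num_placeholders} placeholder tokens but "
            f"{len(image_embeddings)} image embeddings"
        )
    image_rows = iter(image_embeddings)
    merged = []
    for token_id in token_ids:
        if token_id == IMAGE_TOKEN_ID:
            merged.append(next(image_rows))  # encoder output for this position
        else:
            merged.append(embed_text(token_id))  # ordinary text embedding
    return merged


# One text token, three image placeholder positions, one text token.
prompt = [101, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 102]
image_embeddings = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
merged = merge_embeddings(prompt, image_embeddings)
```

    The explicit count check matters: the number of placeholder positions and the number of encoder outputs must agree, which is the constraint that chunked prefill runs into in the next section.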

    What went wrong in vLLM V0

    While vLLM V0 set the foundation, it wasn’t without limitations, especially when dealing with multimodal inputs.

    Chunked prefill challenges

    Chunked prefill allows prompts to be partially prefilled so that long requests don’t block the entire decoding process of existing requests. For example, with three incoming requests (R1, R2, R3), R1 and R2 might be fully prefilled, while only a portion of R3 is prefilled initially. This staggered approach, illustrated in Figure 2, keeps latency in check.

    Figure 2: A simplified diagram of chunked prefill running 3 prompts (R1, R2, R3) under a 10-token budget, illustrating staggered prefill and embedding challenges. 
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1

    However, multimodal embeddings are continuous by nature: unlike text, they cannot be split into discrete tokens and produced incrementally. If an image produces 10 embeddings but a prefill chunk reserves only 2 token positions for it, the shapes do not match. Early designs assumed the embeddings could be merged directly into the text embeddings, which proved problematic.
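    A minimal sketch of the scheduling idea follows, assuming hypothetical prompt lengths of 3, 5, and 12 tokens for R1, R2, and R3 (the figure's actual lengths are not reproduced here) and ignoring the decode tokens a real scheduler would also count against the budget. The last helper restates the mismatch from the example above: 2 reserved positions cannot hold 10 image embeddings.

```python
from typing import Dict, List, Tuple


def schedule_step(remaining: Dict[str, int], budget: int) -> List[Tuple[str, int]]:
    """Hand out prefill tokens to requests in arrival order until the budget is spent."""
    scheduled = []
    for req_id, left in remaining.items():
        if budget == 0:
            break
        if left == 0:
            continue  # this request is already fully prefilled
        take = min(left, budget)
        scheduled.append((req_id, take))
        remaining[req_id] = left - take
        budget -= take
    return scheduled


def naive_merge_is_possible(reserved_positions: int, image_embeddings: int) -> bool:
    """A direct merge only works if the chunk has room for every image embedding."""
    return reserved_positions == image_embeddings


# Hypothetical prompt lengths (in tokens) for the three requests.
remaining = {"R1": 3, "R2": 5, "R3": 12}

step1 = schedule_step(remaining, budget=10)  # R1 and R2 fully, R3 partially
step2 = schedule_step(remaining, budget=10)  # the rest of R3

# R3 received only 2 positions in step 1; an image needing 10 embeddings does not fit.
fits = naive_merge_is_possible(reserved_positions=2, image_embeddings=10)
```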

    Prefix caching limitations

    In V0, prefix caching was based solely on token IDs. For multimodal inputs, where placeholder tokens (e.g., <image>) are identical across requests, this led to cache collisions. Different images sharing the same placeholder would mistakenly trigger cached results, compromising correctness.

    Innovations in vLLM V1

    vLLM V1 introduces several key improvements to overcome these challenges.

    1. Encoder cache and encoder-aware scheduler

    The challenge: Repeatedly regenerating multimodal embeddings for every prefill operation can be inefficient, especially when a single image may generate thousands of embeddings (e.g., Pixtral produces 4096 embeddings for a single 1024x1024 image).

    The V1 solution:

    • Encoder cache: Multimodal embeddings are computed once and stored directly on the GPU.
    • Encoder-aware scheduler: The scheduler tracks the positions of multimodal embeddings within each request. When merging with text embeddings, it retrieves cached data, eliminating redundant encoder execution. See Figure 3.
    Figure 3: Flowchart illustrating how the encoder cache and scheduler work together to streamline multimodal inference.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1

    Why a GPU cache? 

    Transferring tensors to and from CPU memory is often more expensive than re-executing the encoder. Keeping the cache on the GPU minimizes latency.
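    To make the interaction between the encoder cache and the scheduler concrete, here is a simplified sketch. The class and function names (`ImageSpan`, `EncoderCache`, `embeddings_for_chunk`) are hypothetical, the "encoder" is a toy function, and plain Python lists stand in for GPU tensors. The idea it illustrates is the one described above: run the encoder once per image, then serve each prefill chunk only the slice of embeddings that falls inside it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ImageSpan:
    start: int   # prompt index of the first placeholder token
    length: int  # number of embeddings the encoder produces for this image


def run_encoder(span: ImageSpan) -> List[List[float]]:
    """Toy encoder: one 1-dimensional embedding per placeholder position."""
    return [[float(i)] for i in range(span.length)]


class EncoderCache:
    """Stores full encoder outputs keyed by (request ID, input index)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, int], List[List[float]]] = {}
        self.encoder_runs = 0

    def get_or_compute(self, req_id: str, input_idx: int, span: ImageSpan):
        key = (req_id, input_idx)
        if key not in self._store:
            self.encoder_runs += 1  # the encoder runs only on a cache miss
            self._store[key] = run_encoder(span)
        return self._store[key]


def embeddings_for_chunk(
    cache: EncoderCache,
    req_id: str,
    spans: List[ImageSpan],
    chunk_start: int,
    chunk_end: int,
) -> List[Tuple[int, List[float]]]:
    """Return (prompt position, embedding) pairs inside [chunk_start, chunk_end)."""
    out = []
    for idx, span in enumerate(spans):
        lo = max(chunk_start, span.start)
        hi = min(chunk_end, span.start + span.length)
        if lo >= hi:
            continue  # this chunk does not overlap the image
        full = cache.get_or_compute(req_id, idx, span)
        for pos in range(lo, hi):
            out.append((pos, full[pos - span.start]))
    return out


cache = EncoderCache()
spans = [ImageSpan(start=2, length=10)]  # image occupies prompt positions 2..11

first = embeddings_for_chunk(cache, "R3", spans, chunk_start=0, chunk_end=4)
second = embeddings_for_chunk(cache, "R3", spans, chunk_start=4, chunk_end=14)
```

    In this sketch the first chunk receives the embeddings for positions 2 and 3, the second chunk receives the remaining eight, and `cache.encoder_runs` stays at 1 across both calls.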

    2. Enhanced prefix caching with metadata

    To address the shortcomings of token-ID–based caching, V1 incorporates additional metadata, such as hashes of images or audio chunks, into the caching mechanism (Figure 4). This ensures that even if placeholder tokens are identical, the underlying multimodal content is correctly distinguished.

    Figure 4: Schematic showing how metadata enhances prefix caching for multimodal data.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1
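    The collision and its fix can be illustrated with a small hashing sketch. Everything here is an assumption made for the example (the placeholder ID, the block size of 4, SHA-256 as the hash, and the function names); it is not vLLM's actual hashing code. With `use_metadata=False`, two prompts with different images hash identically; mixing a content hash of the image into every block that contains placeholders keeps them apart.

```python
import hashlib
from typing import List, Sequence

IMAGE_TOKEN_ID = -1  # hypothetical placeholder ID for "<image>"


def content_hash(raw: bytes) -> str:
    """Hash of the raw multimodal input (for example, image bytes)."""
    return hashlib.sha256(raw).hexdigest()


def block_hash(parent: str, token_ids: Sequence[int], extra_keys: Sequence[str]) -> str:
    """Hash a block from its parent's hash, its token IDs, and any extra metadata."""
    payload = repr((parent, tuple(token_ids), tuple(extra_keys))).encode()
    return hashlib.sha256(payload).hexdigest()


def prompt_block_hashes(
    token_ids: List[int],
    image_bytes: bytes,
    block_size: int = 4,
    use_metadata: bool = True,
) -> List[str]:
    """Chain block hashes across the prompt, optionally including the image hash."""
    image_key = content_hash(image_bytes)
    hashes, parent = [], ""
    for i in range(0, len(token_ids), block_size):
        block = token_ids[i : i + block_size]
        has_image = IMAGE_TOKEN_ID in block
        extra = (image_key,) if (use_metadata and has_image) else ()
        parent = block_hash(parent, block, extra)
        hashes.append(parent)
    return hashes


# Same token IDs, different images.
tokens = [101, -1, -1, -1, -1, 102, 103, 104]
cat, dog = b"cat image bytes", b"dog image bytes"

ids_only_cat = prompt_block_hashes(tokens, cat, use_metadata=False)
ids_only_dog = prompt_block_hashes(tokens, dog, use_metadata=False)
with_meta_cat = prompt_block_hashes(tokens, cat)
with_meta_dog = prompt_block_hashes(tokens, dog)
```

    Here `ids_only_cat == ids_only_dog` (the V0-style collision), while `with_meta_cat` and `with_meta_dog` differ. Repeating the same image still yields the same hashes, so legitimate cache hits are preserved.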

    3. Optimized multimodal data processing

    In V0, converting raw data (e.g., PIL images) to tensors was a blocking CPU operation, often stalling GPU kernels. V1 tackles this by decoupling the processes:

    • Process 0 (CPU): Handles input processing and raw data conversion.
    • Process 1 (GPU): Executes the forward pass independently.

    This asynchronous pipeline ensures that heavy CPU operations do not block GPU performance, leading to significant latency reductions.
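    The producer/consumer shape of this pipeline can be sketched with a queue. Note the simplification: V1 uses separate processes, whereas this example uses two threads and a `queue.Queue` purely to show the hand-off pattern, and `preprocess` and `forward` are toy stand-ins for input conversion and the model's forward pass.

```python
import queue
import threading
from typing import List


def preprocess(raw: bytes) -> List[float]:
    """Stand-in for CPU-heavy conversion of raw data into model features."""
    return [b / 255.0 for b in raw]


def forward(features: List[float]) -> float:
    """Stand-in for the model's forward pass."""
    return sum(features)


def input_worker(raw_inputs: List[bytes], out_q: queue.Queue) -> None:
    """Plays the role of Process 0: converts inputs and hands them off."""
    for raw in raw_inputs:
        out_q.put(preprocess(raw))
    out_q.put(None)  # sentinel: no more work


def engine_worker(in_q: queue.Queue, results: List[float]) -> None:
    """Plays the role of Process 1: consumes ready features as they arrive."""
    while True:
        features = in_q.get()
        if features is None:
            break
        results.append(forward(features))


raw_inputs = [bytes([255, 0]), bytes([255, 255]), bytes([0, 0])]
ready_q: queue.Queue = queue.Queue(maxsize=8)
results: List[float] = []

producer = threading.Thread(target=input_worker, args=(raw_inputs, ready_q))
consumer = threading.Thread(target=engine_worker, args=(ready_q, results))
producer.start()
consumer.start()
producer.join()
consumer.join()
```

    Because the consumer only ever blocks on the queue, it can work on one input while the producer is still converting the next, which is the decoupling the two-process design is after.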

    4. Multimodal feature caching

    Beyond prefix caching, V1 introduces feature caching for raw data conversion:

    • Dual mirror caches: Both the CPU and GPU processes maintain mirrored caches in CPU memory, minimizing data transfers.
    • Efficient hashing: Using consistent hashes for raw data allows the system to skip redundant conversions, improving throughput in both online and offline scenarios.
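    One way to picture the mirrored caches is sketched below. The class names, the use of SHA-256, and the message shape `(key, payload)` are assumptions for illustration only. The front-end side remembers which content hashes it has already converted and sent; on a repeat it skips the conversion and sends just the hash, and the engine side looks the features up in its own copy.

```python
import hashlib
from typing import Dict, List, Optional, Set, Tuple


def convert(raw: bytes) -> List[float]:
    """Stand-in for the expensive raw-data-to-feature conversion."""
    return [b / 255.0 for b in raw]


class FrontendCache:
    """Input-processing side: tracks hashes the engine side already holds."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.conversions = 0

    def prepare(self, raw: bytes) -> Tuple[str, Optional[List[float]]]:
        key = hashlib.sha256(raw).hexdigest()
        if key in self.seen:
            return key, None  # cache hit: send only the hash
        self.seen.add(key)
        self.conversions += 1
        return key, convert(raw)


class EngineCache:
    """Engine side: mirrors the front end by storing features under the same hash."""

    def __init__(self) -> None:
        self.features: Dict[str, List[float]] = {}

    def resolve(self, key: str, payload: Optional[List[float]]) -> List[float]:
        if payload is not None:
            self.features[key] = payload
        return self.features[key]


frontend, engine = FrontendCache(), EngineCache()
image = bytes([0, 255, 51])

first_msg = frontend.prepare(image)   # miss: carries converted features
second_msg = frontend.prepare(image)  # hit: carries only the hash

first_features = engine.resolve(*first_msg)
second_features = engine.resolve(*second_msg)
```

    This sketch omits eviction. For the scheme to stay correct, the two caches would need to evict entries consistently, so that a hash the front end treats as cached is always still present on the engine side.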

    Benchmark results

    vLLM V1’s improvements have been validated across two key scenarios: online serving and offline inference.

    Online serving

    Using the Qwen2-VL 7B model on the VisionArena dataset—a real-world vision QA benchmark—vLLM V1 demonstrates:

    • Low latency at high QPS: While differences are subtle at low QPS, at higher QPS V1 significantly outperforms V0, as shown in Figure 5.
    • Competitive edge: When compared with other open source alternatives, V1 maintains superior performance in high QPS regimes.
    Figure 5: Latency vs. QPS comparison for vLLM V0, V1, and a leading open source alternative.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1

    Offline inference

    For offline inference, we benchmarked the Molmo-72B model on the MMMU Pro Vision dataset, running on 4xH100s. Figure 6 shows the following results:

    • Throughput gains: vLLM V1, even without caching, shows around a 40% performance boost over V0. 
    • Caching benefits: With both prefix and feature caching enabled, scenarios with repeated requests (up to 100% repeat) experience dramatic throughput improvements. Even for unique prompts, the overhead is minimal compared to the benefits.
    Figure 6: Throughput improvements in offline inference under varying request repeat conditions.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1

    Conclusion

    vLLM V1 marks a pivotal upgrade in serving large, multimodal language models. By addressing the challenges of chunked prefill, enhancing caching mechanisms, and optimizing data processing pipelines, V1 delivers lower latency, higher throughput, and robust performance across diverse hardware platforms.

    Neural Magic (now part of Red Hat) is proud to be a top commercial contributor to vLLM, driving these innovations forward and empowering the community with open, efficient, and scalable AI solutions. We invite you to explore vLLM V1, experiment with our open source tools, and join us in shaping the future of multimodal inference.

    For more information and to get started with vLLM, visit the vLLM GitHub repository. For more on the support vLLM provides for multimodal models, see the vLLM documentation.

    Feel free to reach out with questions or share your feedback on the vLLM Slack workspace as we continue to evolve vLLM!

    Last updated: March 31, 2025
