Enable 3.5 times faster vision language models with quantization

April 1, 2025
Shubhra Pandit, Megan Flynn, Alexandre Marques, Mark Kurtz, Eldar Kurtić

Related topics: Artificial intelligence, Open source
Related products: Red Hat AI


    A compressed summary

    • Open source models: Quantized versions of Pixtral (12B, Large), Qwen2-VL (72B), and Qwen2.5-VL (3B, 7B, 72B) are now available.
    • Competitive accuracy recovery: FP8 and INT8 quantized versions recover >99% of baseline accuracy across five vision benchmarks, while INT4 shows a modest drop on the smaller models.
    • Performant deployments: With vLLM, up to 3.5 times higher throughput and up to 3.2 times more requests per second in server scenarios.

    Vision-language models (VLMs), such as the Pixtral and Qwen-VL series, are trained to generate text from combined image and text inputs. By pairing the expanded input types with the capabilities of large language models, they enable accurate and promising new use cases such as content moderation, image captioning and tagging, visual question answering, and document extraction and analysis. The extra modality, though, makes VLMs even more computationally demanding, requiring more processing power and memory than the already demanding language-only architectures.

    Quantized vision-language models 

    Utilizing the latest vision-language support within LLM Compressor, we created quantized, deployment-ready versions of Pixtral (12B, Large), Qwen2-VL (72B), and Qwen2.5-VL (3B, 7B, 72B) for off-the-shelf deployments with vLLM (Figure 1). On average, the models recovered >99% accuracy at 8-bit precision and ~98% at 4-bit across vision tasks, while running inference up to 4.3 times faster in throughput scenarios or serving 2.2 times more requests per second in low-latency server scenarios.
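
    To make "deployment-ready" concrete, the sketch below serves one of the quantized checkpoints with vLLM's OpenAI-compatible server and sends it a single image-plus-text request. The model ID, endpoint, and image URL are placeholders rather than commands taken from the model cards; consult the collection's model cards for the exact deployment commands.

    # Minimal sketch: query a quantized VLM served by vLLM's OpenAI-compatible
    # endpoint, assuming the server was started with something like
    #   vllm serve <quantized-model-id>
    # The model ID and image URL below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="<quantized-model-id>",  # placeholder: any model from the collection
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }],
        max_tokens=128,
        temperature=0.0,
    )
    print(response.choices[0].message.content)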

    Specifically, three versions of each model were created, thoroughly evaluated, and run through numerous inference scenarios:

    • FP W8A8: 8-bit floating-point weights and activations supporting server and throughput scenarios for the latest Ada Lovelace and Hopper GPUs.
    • INT W8A8: 8-bit integer weights and activations supporting server and throughput scenarios for Ampere and older GPUs.
    • INT W4A16: 4-bit integer weights with activations kept at the baseline 16 bits for single-stream and low queries per second scenarios.
    Figure 1: Accuracy across various vision and text-based datasets for all quantized variants of the Pixtral-Large model.

    The models, recipes, evaluations, and benchmarks are open source and available in our Hugging Face model collection. The model cards include the complete commands for deployment and for replicating the results outlined in the rest of this post. Additionally, check out our previous article in this series, which walks you through quantizing your own multimodal models.
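
    For orientation, the quantization flow looks roughly like the sketch below, which assumes LLM Compressor's one-shot API with an FP8 dynamic scheme; imports and class names vary slightly across releases and model families, and INT8/INT4 variants additionally require calibration data. The model ID, ignore patterns, and save path are illustrative placeholders; the published recipes in the collection are the authoritative reference.

    # Illustrative sketch only: one-shot FP8 quantization of a VLM with LLM Compressor.
    # Model ID, ignore patterns, and output path are placeholders; see the published
    # recipes in the Hugging Face collection for the exact per-model configuration.
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model_id = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder
    model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    # Quantize the language model's Linear layers to FP8 with dynamic activation
    # scales; keep the vision tower and lm_head at their original precision.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["re:.*lm_head", "re:visual.*"],  # module patterns vary by architecture
    )

    oneshot(model=model, recipe=recipe)

    save_dir = model_id.split("/")[-1] + "-FP8-dynamic"
    model.save_pretrained(save_dir, save_compressed=True)
    processor.save_pretrained(save_dir)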

    Accuracy recovery across evaluations

    We evaluated the quantized versions against their baselines across a range of current, widely used benchmarks. Specifically, we used mistral-evals for visual question answering, visual reasoning, chart and graph interpretation, and a few standard language tasks. The complete list, provided below, gives a comprehensive view of each model's performance:

    Vision benchmarks

    • MMMU: Measures the model's ability to effectively handle visual and linguistic inputs.
    • ChartQA: Focuses on interpreting visual charts and graphs to generate correct textual answers.
    • DocVQA: Assesses the model's performance on document-based visual question answering.
    • VQAv2: Evaluates question answering accuracy over diverse visual inputs.
    • MathVista: Evaluates mathematical reasoning in visual contexts, incorporating 31 diverse datasets.

    Text benchmarks

    • MMLU: Measures the model's capability to answer various questions spanning reasoning, math, and general knowledge.
    • MGSM: Focuses on grade-school level math in a conversational setup across multiple languages.

    Figure 2 shows the average performance of various quantized multimodal models across the aforementioned vision and text benchmarks:

    • FP W8A8 delivers near-lossless performance, on par with BF16 across all models.
    • INT W8A8 recovers ~99% of the original accuracy, remaining very competitive with FP W8A8.
    • INT W4A16 performs reasonably well for most models, with >96% recovery. The smaller 3B and 7B models fared worse, with >93% and ~92% recovery, respectively. Note that most of the drop comes from the text evaluations; for vision tasks alone, recovery jumps to ~98%.
    Figure 2: Average accuracy for quantized models on well-established vision (MMMU, MathVista, VQAv2, DocVQA, and ChartQA) and text (MMLU, MGSM) benchmarks.
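
    To make the recovery metric explicit, the snippet below shows one common convention: the quantized score divided by the baseline score for each benchmark, averaged across benchmarks. The scores are made-up placeholders, not results from this post, and the exact aggregation behind the figures may differ.

    # Illustrative only: one common way to report accuracy "recovery" of a
    # quantized model relative to its BF16 baseline. Scores are placeholders.
    baseline = {"MMMU": 0.62, "ChartQA": 0.80, "DocVQA": 0.91, "VQAv2": 0.78, "MathVista": 0.58}
    quantized = {"MMMU": 0.61, "ChartQA": 0.79, "DocVQA": 0.90, "VQAv2": 0.78, "MathVista": 0.57}

    per_task_recovery = {k: quantized[k] / baseline[k] for k in baseline}
    mean_recovery = sum(per_task_recovery.values()) / len(per_task_recovery)
    print(f"Average recovery: {mean_recovery:.1%}")  # prints "Average recovery: 98.9%"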

    Inference performance across use cases

    To ensure realistic benchmarks and measurements of inference performance, we designed a set of vision-language workloads capturing a broad range of use cases. We benchmarked server deployments in both low-latency and high-throughput scenarios. Each workload is defined by the number of prompt tokens and the input image size; the image size matters because the models use different image processors, which produce different numbers of input tokens for the same image. The workloads include the following (a sketch of constructing one such request appears after the list):

    • Document Visual Question Answering (DocVQA) (1680W × 2240H pixels, 64 prompt tokens, 128 generated tokens)
    • Visual Reasoning (640W × 480H pixels, 128 prompt tokens, 128 generated tokens)
    • Image Captioning (480W × 360H pixels, 0 prompt tokens, 128 generated tokens)
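
    As a rough illustration, the sketch below builds one DocVQA-style request matching the definition above (1680 × 2240 image, ~64 prompt tokens, 128 generated tokens) against a running vLLM endpoint. The endpoint, model ID, prompt, and blank image are placeholders, not the actual benchmark harness.

    # Rough sketch: one synthetic DocVQA-style request against a vLLM
    # OpenAI-compatible server. All names below are placeholders.
    import base64
    import io

    from openai import OpenAI
    from PIL import Image

    def blank_image_data_url(width: int, height: int) -> str:
        """Create a blank RGB image of the given size and return it as a data URL."""
        buf = io.BytesIO()
        Image.new("RGB", (width, height), "white").save(buf, format="PNG")
        return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    prompt = " ".join(["question"] * 64)  # crude stand-in for a 64-token text prompt

    response = client.chat.completions.create(
        model="<quantized-model-id>",  # placeholder
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": blank_image_data_url(1680, 2240)}},
            ],
        }],
        max_tokens=128,
    )
    print(response.usage)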

    How quantization impacts different models

    Figure 3 presents the average inference speedups across all workloads (Visual Reasoning, DocVQA, Image Captioning) and GPU platforms (A6000, A100, H100). Key insights include:

    • Larger models benefit most from quantization, with Qwen2/2.5-VL-72B and Pixtral-Large achieving up to 3.5 times speedups.
    • Smaller models, like Qwen2.5-VL-3B, show more modest gains (1.1–1.5 times) due to lower memory and compute demands.
    • Qwen2/2.5-VL models gain the most from INT W4A16, suggesting they are more memory-bound than compute-bound in most deployment scenarios.
    Figure 3: Inference performance (relative speedup) for baseline and quantized multimodal models deployed with vLLM across workloads (Visual Reasoning, Document Visual Question Answering, and Image Captioning) and GPU hardware (A6000, A100, H100). Left: low-latency deployment. Right: high-throughput deployment.

    How quantization impacts different workloads

    Figure 4 provides a detailed breakdown of Pixtral-Large speedups across different workloads, reinforcing that:

    • Pixtral-Large benefits most from W8A8 overall.
    • For lighter workloads (Visual Reasoning, Image Captioning), INT W4A16 performs comparably to W8A8.
    • For more compute-intensive tasks (DocVQA), W8A8 delivers the highest speedups.
    Figure 4: Inference performance (relative speedup) for the baseline and quantized Pixtral-Large model deployed with vLLM across workloads (Visual Reasoning, Document Visual Question Answering, and Image Captioning) and GPU hardware (A100, H100). Left: low-latency deployment. Right: high-throughput deployment.

    How GPU size impacts performance

    Figure 5 highlights Pixtral-12B, analyzing how quantization improves inference speed across different GPU architectures on the DocVQA workload:

    • Low-latency deployments benefit more from INT W4A16 (1.3–1.7 times speedup).
    • High-throughput deployments favor W8A8 formats (1.3–1.5 times speedup).
    • Lower-tier GPUs (e.g., A6000) experience greater gains from quantization, as workloads can overload memory and reduce BF16 request throughput. INT W4A16 and W8A8 alleviate these bottlenecks, enabling higher efficiency and serving more requests in parallel.
    Figure 5: Inference performance (requests per second) of the Pixtral-12B model on vLLM for a high-resolution workload (Document Visual Question Answering: 1680×2240) across A6000, A100, and H100 GPUs. Left: low-latency performance. Right: multi-stream (high-throughput) performance. W8A8 refers to INT W8A8 for A6000x1 and A100x1 and FP W8A8 for H100x1.

    Figure 6 summarizes the performance gains for Pixtral-Large on the most intensive DocVQA workload across different GPUs:

    • A100: INT W4A16 enables >2.2x faster requests for low-latency scenarios, while INT W8A8 provides 1.9x higher throughput.
    • H100: FP W8A8 enables 1.9x faster requests for low-latency workloads and provides 3.4x higher throughput.
    Figure 6: Inference performance (requests per second) for the baseline and quantized Pixtral-Large model in the Document Visual Question Answering deployment scenario (1680W × 2240H, 64 prompt tokens, 128 generated tokens) with vLLM across various hardware platforms. Left: low-latency deployment. Right: high-throughput deployment. W8A8 corresponds to INT W8A8 for A100x4 and FP W8A8 for H100x4.

    Optimizing for real-world deployment

    Most practical deployments fall between the low-latency and high-throughput extremes, requiring efficiency while maintaining response-time constraints. Figure 7 examines how quantization impacts token generation speed under increasing load (a minimal measurement sketch follows the figure):

    • INT W4A16 provides the lowest response times, maintaining minimal inter-token latency at low to medium request rates.
    • INT W8A8 balances speed and efficiency, reducing latency over BF16 while supporting higher request rates, making it a strong choice for high-speed, multi-stream inference.
    Figure 7: Asynchronous inference performance (request rate vs. inter-token latency) for the baseline (BF16) and quantized Pixtral-12B models (INT W8A8, INT W4A16) deployed with vLLM on the high-resolution Document Visual Question Answering (1680W × 2240H pixels) workload.
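
    The sketch below shows one way to measure inter-token latency for a single streamed request against a vLLM OpenAI-compatible endpoint; a real load test would issue many such requests concurrently at a controlled request rate. The endpoint, model ID, and image URL are placeholders.

    # Illustrative sketch: mean inter-token latency for one streamed request.
    # Endpoint, model ID, and image URL are placeholders.
    import time

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="<quantized-model-id>",  # placeholder
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the totals from this document."},
                {"type": "image_url", "image_url": {"url": "https://example.com/doc.png"}},
            ],
        }],
        max_tokens=128,
        stream=True,
    )

    # Record the arrival time of each streamed content chunk.
    timestamps = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            timestamps.append(time.perf_counter())

    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if gaps:
        print(f"Mean inter-token latency: {1000 * sum(gaps) / len(gaps):.1f} ms")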

    Get started with quantized VLMs

    The future of vision-language models (VLMs) lies in balancing performance, efficiency, and accessibility. By leveraging quantization, we enable faster inference, lower costs, and scalable AI deployment without compromising capability.

    Ready to explore? Check out our fully open source quantized models, including Qwen2, Qwen2.5, Pixtral, and more on our Hugging Face VLM collection.

    Driving AI efficiency with quantization

    We’re excited to see how the community applies these quantized VLMs—whether for efficient deployment, advancing quantization techniques, or scaling AI for broader applications. These models provide a strong foundation for real-world AI innovation.

    As part of Red Hat, Neural Magic remains committed to open, efficient AI, providing cutting-edge quantization tools like LLM Compressor and optimized inference solutions via vLLM. Explore, deploy, or customize our models to suit your needs.

    Get in touch to learn more about enterprise-grade AI solutions or contribute to the open source ecosystem today!

    Last updated: April 2, 2025

