Enable 3.5 times faster vision language models with quantization

April 1, 2025
Shubhra Pandit, Megan Flynn, Alexandre Marques, Mark Kurtz, Eldar Kurtić

Related topics: Artificial intelligence, Open source
Related products: Red Hat AI


    A compressed summary

    • Open source models: Quantized versions of Pixtral (12B, Large), Qwen2-VL (72B), and Qwen2.5-VL (3B, 7B, 72B) are now available.
    • Competitive accuracy recovery: FP8 and INT8 quantized versions recover >99% of baseline accuracy across five vision benchmarks, while INT4 shows a modest drop on the smaller models.
    • Performant deployments: With vLLM, up to 3.5 times higher throughput and up to 3.2 times more requests per second in server scenarios.

    Vision-language models (VLMs), such as the Pixtral and Qwen-VL series, are trained to generate text from combined image and text inputs. By pairing the expanded input types with the capabilities of large language models, they enable accurate and promising new use cases such as content moderation, image captioning and tagging, visual question answering, and document extraction and analysis. The extra modality, though, makes VLMs even more computationally demanding, requiring more processing power and memory than the already demanding language-only architectures.

    Quantized vision-language models 

    Utilizing the latest vision-language support within LLM Compressor, we created quantized, deployment-ready versions of Pixtral (12B, Large), Qwen2-VL (72B), and Qwen2.5-VL (3B, 7B, 72B) for off-the-shelf deployments with vLLM (Figure 1). On average, the models recovered >99% accuracy at 8-bit precision and ~98% at 4-bit across vision tasks, while running inference up to 4.3 times faster in throughput scenarios or serving 2.2 times more requests per second in low-latency server scenarios.
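
    To make "deployment-ready" concrete, the sketch below serves one of the quantized checkpoints with vLLM's OpenAI-compatible server and sends it a single image-plus-text request. The model ID, endpoint, and image URL are placeholders rather than commands taken from the model cards; consult the collection's model cards for the exact deployment commands.

    # Minimal sketch: query a quantized VLM served by vLLM's OpenAI-compatible
    # endpoint, assuming the server was started with something like
    #   vllm serve <quantized-model-id>
    # The model ID and image URL below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="<quantized-model-id>",  # placeholder: any model from the collection
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }],
        max_tokens=128,
        temperature=0.0,
    )
    print(response.choices[0].message.content)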

    Specifically, three versions of each model were created, thoroughly evaluated, and run through numerous inference scenarios:

    • FP W8A8: 8-bit floating-point weights and activations supporting server and throughput scenarios for the latest Ada Lovelace and Hopper GPUs.
    • INT W8A8: 8-bit integer weights and activations supporting server and throughput scenarios for Ampere and older GPUs.
    • INT W4A16: 4-bit integer weights with activations kept at the baseline 16 bits for single-stream and low queries per second scenarios.
    Figure 1: Accuracy across various vision and text-based datasets for all quantized variants of the Pixtral-Large model.

    The models, recipes, evaluations, and benchmarks are open source and available in our Hugging Face model collection. The model cards include the complete commands for deployment and for replicating the results outlined in the rest of this post. Additionally, check out our previous article in this series, which walks you through quantizing your own multimodal models.
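
    For orientation, the quantization flow looks roughly like the sketch below, which assumes LLM Compressor's one-shot API with an FP8 dynamic scheme; imports and class names vary slightly across releases and model families, and INT8/INT4 variants additionally require calibration data. The model ID, ignore patterns, and save path are illustrative placeholders; the published recipes in the collection are the authoritative reference.

    # Illustrative sketch only: one-shot FP8 quantization of a VLM with LLM Compressor.
    # Model ID, ignore patterns, and output path are placeholders; see the published
    # recipes in the Hugging Face collection for the exact per-model configuration.
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model_id = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder
    model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    # Quantize the language model's Linear layers to FP8 with dynamic activation
    # scales; keep the vision tower and lm_head at their original precision.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["re:.*lm_head", "re:visual.*"],  # module patterns vary by architecture
    )

    oneshot(model=model, recipe=recipe)

    save_dir = model_id.split("/")[-1] + "-FP8-dynamic"
    model.save_pretrained(save_dir, save_compressed=True)
    processor.save_pretrained(save_dir)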

    Accuracy recovery across evaluations

    We evaluated the quantized versions against their baselines across a range of current, widely used benchmarks. Specifically, we used mistral-evals for visual question answering, visual reasoning, chart and graph interpretation, and a few standard language tasks. The complete list, provided below, gives a comprehensive view of each model's performance:

    Vision benchmarks

    • MMMU: Measures the model's ability to effectively handle visual and linguistic inputs.
    • ChartQA: Focuses on interpreting visual charts and graphs to generate correct textual answers.
    • DocVQA: Assesses the model's performance on document-based visual question answering.
    • VQAv2: Evaluates question answering accuracy over diverse visual inputs.
    • MathVista: Evaluates mathematical reasoning in visual contexts, incorporating 31 diverse datasets.

    Text benchmarks

    • MMLU: Measures the model's capability to answer various questions spanning reasoning, math, and general knowledge.
    • MGSM: Focuses on grade-school level math in a conversational setup across multiple languages.

    Figure 2 shows the average performance of various quantized multimodal models across the aforementioned vision and text benchmarks:

    • FP W8A8 delivers near-lossless performance, on par with BF16 across all models.
    • INT W8A8 recovers ~99% of the original accuracy, remaining very competitive with FP W8A8.
    • INT W4A16 performs reasonably well for most models, with >96% recovery. The smaller 3B and 7B models fared worse, with >93% and ~92% recovery, respectively. Note that most of the drop comes from the text evaluations; for vision tasks alone, recovery jumps to ~98%.
    Figure 2: Average accuracy for quantized models on well-established vision (MMMU, MathVista, VQAv2, DocVQA, and ChartQA) and text (MMLU, MGSM) benchmarks.
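
    To make the recovery metric explicit, the snippet below shows one common convention: the quantized score divided by the baseline score for each benchmark, averaged across benchmarks. The scores are made-up placeholders, not results from this post, and the exact aggregation behind the figures may differ.

    # Illustrative only: one common way to report accuracy "recovery" of a
    # quantized model relative to its BF16 baseline. Scores are placeholders.
    baseline = {"MMMU": 0.62, "ChartQA": 0.80, "DocVQA": 0.91, "VQAv2": 0.78, "MathVista": 0.58}
    quantized = {"MMMU": 0.61, "ChartQA": 0.79, "DocVQA": 0.90, "VQAv2": 0.78, "MathVista": 0.57}

    per_task_recovery = {k: quantized[k] / baseline[k] for k in baseline}
    mean_recovery = sum(per_task_recovery.values()) / len(per_task_recovery)
    print(f"Average recovery: {mean_recovery:.1%}")  # prints "Average recovery: 98.9%"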

    Inference performance across use cases

    To ensure realistic benchmarks and measurements of inference performance, we designed a set of vision-language workloads capturing a broad range of use cases. We benchmarked server deployments in both low-latency and high-throughput scenarios. Each workload is defined by the number of prompt tokens and the input image size; the image size matters because the models use different image processors, which produce different numbers of input tokens for the same image. The workloads include the following (a sketch of constructing one such request appears after the list):

    • Document Visual Question Answering (DocVQA) (1680W × 2240H pixels, 64 prompt tokens, 128 generated tokens)
    • Visual Reasoning (640W × 480H pixels, 128 prompt tokens, 128 generated tokens)
    • Image Captioning (480W × 360H pixels, 0 prompt tokens, 128 generated tokens)
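
    As a rough illustration, the sketch below builds one DocVQA-style request matching the definition above (1680 × 2240 image, ~64 prompt tokens, 128 generated tokens) against a running vLLM endpoint. The endpoint, model ID, prompt, and blank image are placeholders, not the actual benchmark harness.

    # Rough sketch: one synthetic DocVQA-style request against a vLLM
    # OpenAI-compatible server. All names below are placeholders.
    import base64
    import io

    from openai import OpenAI
    from PIL import Image

    def blank_image_data_url(width: int, height: int) -> str:
        """Create a blank RGB image of the given size and return it as a data URL."""
        buf = io.BytesIO()
        Image.new("RGB", (width, height), "white").save(buf, format="PNG")
        return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    prompt = " ".join(["question"] * 64)  # crude stand-in for a 64-token text prompt

    response = client.chat.completions.create(
        model="<quantized-model-id>",  # placeholder
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": blank_image_data_url(1680, 2240)}},
            ],
        }],
        max_tokens=128,
    )
    print(response.usage)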

    How quantization impacts different models

    Figure 3 presents the average inference speedups across all workloads (Visual Reasoning, DocVQA, Image Captioning) and GPU platforms (A6000, A100, H100). Key insights include:

    • Larger models benefit most from quantization, with Qwen2/2.5-VL-72B and Pixtral-Large achieving up to 3.5 times speedups.
    • Smaller models, like Qwen2.5-VL-3B, show more modest gains (1.1–1.5 times) due to lower memory and compute demands.
    • Qwen2/2.5-VL models gain the most from INT W4A16, suggesting they are more memory-bound than compute-bound in most deployment scenarios.
    Figure 3: Inference performance (relative speedup) for baseline and quantized multimodal models deployed with vLLM across workloads (Visual Reasoning, Document Visual Question Answering, and Image Captioning) and GPU hardware (A6000, A100, H100). Left: low-latency deployment. Right: high-throughput deployment.

    How quantization impacts different workloads

    Figure 4 provides a detailed breakdown of Pixtral-Large speedups across different workloads, reinforcing that:

    • Pixtral-Large benefits most from W8A8 overall.
    • For lighter workloads (Visual Reasoning, Image Captioning), INT W4A16 performs comparably to W8A8.
    • For more compute-intensive tasks (DocVQA), W8A8 delivers the highest speedups.
    Figure 4: Inference performance (relative speedup) for the baseline and quantized Pixtral-Large model deployed with vLLM across workloads (Visual Reasoning, Document Visual Question Answering, and Image Captioning) and GPU hardware (A100, H100). Left: low-latency deployment. Right: high-throughput deployment.

    How GPU size impacts performance

    Figure 5 highlights Pixtral-12B, analyzing how quantization improves inference speed across different GPU architectures on the DocVQA workload:

    • Low-latency deployments benefit more from INT W4A16 (1.3–1.7 times speedup).
    • High-throughput deployments favor W8A8 formats (1.3–1.5 times speedup).
    • Lower-tier GPUs (e.g., A6000) experience greater gains from quantization, as workloads can overload memory and reduce BF16 request throughput. INT W4A16 and W8A8 alleviate these bottlenecks, enabling higher efficiency and serving more requests in parallel.
    Figure 5: Inference performance (requests per second) of the Pixtral-12B model on vLLM for a high-resolution workload (Document Visual Question Answering: 1680×2240) across A6000, A100, and H100 GPUs. Left: low-latency performance. Right: multi-stream (high-throughput) performance. W8A8 refers to INT W8A8 for A6000x1 and A100x1 and FP W8A8 for H100x1.

    Figure 6 summarizes the performance gains for Pixtral-Large on the most intensive DocVQA workload across different GPUs:

    • A100: INT W4A16 enables >2.2x faster requests for low-latency scenarios, while INT W8A8 provides 1.9x higher throughput.
    • H100: FP W8A8 enables 1.9x faster requests for low-latency workloads and provides 3.4x higher throughput.
    Figure 6: Inference performance (requests per second) for the baseline and quantized Pixtral-Large model in the Document Visual Question Answering deployment scenario (1680W × 2240H, 64 prompt tokens, 128 generated tokens) with vLLM across various hardware platforms. Left: low-latency deployment. Right: high-throughput deployment. W8A8 corresponds to INT W8A8 for A100x4 and FP W8A8 for H100x4.

    Optimizing for real-world deployment

    Most practical deployments fall between the low-latency and high-throughput extremes, requiring efficiency while maintaining response-time constraints. Figure 7 examines how quantization impacts token generation speed under increasing load (a minimal measurement sketch follows the figure):

    • INT W4A16 provides the lowest response times, maintaining minimal inter-token latency at low to medium request rates.
    • INT W8A8 balances speed and efficiency, reducing latency over BF16 while supporting higher request rates, making it a strong choice for high-speed, multi-stream inference.
    Figure 7: Asynchronous inference performance (request rate vs. inter-token latency) for the baseline (BF16) and quantized Pixtral-12B models (INT W8A8, INT W4A16) deployed with vLLM on the high-resolution Document Visual Question Answering (1680W × 2240H pixels) workload.
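
    The sketch below shows one way to measure inter-token latency for a single streamed request against a vLLM OpenAI-compatible endpoint; a real load test would issue many such requests concurrently at a controlled request rate. The endpoint, model ID, and image URL are placeholders.

    # Illustrative sketch: mean inter-token latency for one streamed request.
    # Endpoint, model ID, and image URL are placeholders.
    import time

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="<quantized-model-id>",  # placeholder
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the totals from this document."},
                {"type": "image_url", "image_url": {"url": "https://example.com/doc.png"}},
            ],
        }],
        max_tokens=128,
        stream=True,
    )

    # Record the arrival time of each streamed content chunk.
    timestamps = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            timestamps.append(time.perf_counter())

    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if gaps:
        print(f"Mean inter-token latency: {1000 * sum(gaps) / len(gaps):.1f} ms")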

    Get started with quantized VLMs

    The future of vision-language models (VLMs) lies in balancing performance, efficiency, and accessibility. By leveraging quantization, we enable faster inference, lower costs, and scalable AI deployment without compromising capability.

    Ready to explore? Check out our fully open source quantized models, including Qwen2, Qwen2.5, Pixtral, and more on our Hugging Face VLM collection.

    Driving AI efficiency with quantization

    We’re excited to see how the community applies these quantized VLMs—whether for efficient deployment, advancing quantization techniques, or scaling AI for broader applications. These models provide a strong foundation for real-world AI innovation.

    As part of Red Hat, Neural Magic remains committed to open, efficient AI, providing cutting-edge quantization tools like LLM Compressor and optimized inference solutions via vLLM. Explore, deploy, or customize our models to suit your needs.

    Get in touch to learn more about enterprise-grade AI solutions or contribute to the open source ecosystem today!

    Last updated: April 2, 2025

