How well do quantized models handle long-context tasks?

The 4-bit summary

4-bit and 8-bit quantized LLMs excel at long-context tasks, recovering over 99% of the unquantized models’ accuracy at sequence lengths from 4K to 64K.
INT4 models show limitations at 128K sequence lengths, though even their unquantized counterparts struggle at this length.
Results are consistent across LLM sizes and diverse long-context evaluation tasks.
All models, results, and techniques are open-sourced on Hugging Face and GitHub.

In our recent research blog, “We ran over half a million evaluations on quantized LLMs: Here's what we found,” we demonstrated that quantized large language models (LLMs) can rival their full-precision counterparts in accuracy across diverse benchmarks, covering academic and real-world evaluations.

However, the community raised an important question: how well do these models perform in long-context scenarios? With the growing demand for efficient processing of extended sequences through retrieval augmented generation (RAG), agentic pipelines, and reasoning models, this question couldn't be ignored.

To address it, we ran nearly 200K long-context evaluations, pushing quantized models to their limits. The results? Even in this challenging setup, quantized LLMs prove remarkably resilient, matching unquantized models in accuracy while improving inference efficiency.

The framework

To rigorously test quantized models in long-context scenarios, we use RULER, NVIDIA’s benchmark introduced in the paper “RULER: What’s the Real Context Size of Your Long-Context Language Models?” The benchmark generates synthetic examples with configurable sequence lengths and task complexities, providing a robust evaluation framework.

Many LLMs struggle with RULER, showing significant performance degradation as sequence length increases—even though they achieve near-perfect scores on more straightforward needle-in-a-haystack tasks. To assess this challenge, we follow the default setup from the paper, evaluating models across four categories: retrieval, multi-hop tracing, aggregation, and question-answering, at sequence lengths of 4K, 8K, 16K, 32K, 64K, and 128K.
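To make the idea of a synthetic, length-configurable task concrete, here is a minimal sketch of a needle-in-a-haystack retrieval example. It is not RULER's generator: the helper names `make_needle_example` and `score`, the filler sentences, and the substring-match scoring rule are all assumptions made for illustration, and length is counted in words rather than tokens.

```python
import random


def make_needle_example(num_words, seed=0):
    """Build one synthetic retrieval prompt of roughly num_words words.

    Hypothetical helper for illustration only; not taken from RULER.
    """
    rng = random.Random(seed)
    filler = ["The", "grass", "is", "green.", "The", "sky", "is", "blue."]
    key = f"key-{rng.randint(1000, 9999)}"
    value = str(rng.randint(100000, 999999))
    needle = f"The special magic number for {key} is {value}."

    # Repeat the filler text until the haystack reaches the requested length.
    words = [filler[i % len(filler)] for i in range(num_words)]

    # Hide the needle at a random depth inside the haystack.
    position = rng.randint(0, len(words))
    words.insert(position, needle)

    question = f"What is the special magic number for {key}?"
    prompt = " ".join(words) + "\n" + question
    return prompt, value


def score(prediction, answer):
    """Assumed scoring rule: 1.0 if the answer string appears in the output."""
    return 1.0 if answer in prediction else 0.0


# Longer haystacks are produced simply by raising num_words.
prompt, answer = make_needle_example(num_words=200, seed=1)
```

A real long-context benchmark would measure length in tokens with the model's tokenizer and would cover harder categories (multi-hop tracing, aggregation, question-answering) than this single retrieval case.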

For models, we evaluate Neural Magic’s state-of-the-art quantized Llama-3.1-Instruct models at the 8B and 70B scales, using three different quantization formats: FP W8A8 (FP8 weights and activations), INT W8A8 (INT8 weights and activations), and INT W4A16 (INT4 weights only, with activations kept at 16-bit precision). For deeper insights into these formats and their impact on inference performance, see our research paper “Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization.
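As a rough intuition for what "INT4 weights only" means, the sketch below quantizes one small group of weights to 4-bit integers with a shared scale and then dequantizes them. It uses plain symmetric round-to-nearest, which is an assumption chosen for simplicity and not necessarily the algorithm used to produce the evaluated models; the function names and the example weights are hypothetical.

```python
def quantize_int4_group(weights):
    """Symmetric round-to-nearest quantization of one weight group to INT4.

    Illustrative only; production pipelines typically use more careful methods.
    """
    max_abs = max(abs(w) for w in weights)
    # Signed INT4 covers [-8, 7]; map the largest magnitude to 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate weights before they are used in 16-bit matmuls."""
    return [q * scale for q in quantized]


# Hypothetical weight group, made up for this example.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
q, scale = quantize_int4_group(weights)
restored = dequantize(q, scale)

# With round-to-nearest and no clipping, the error is at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as a 4-bit integer plus one shared scale per group, and the rounding error is bounded by half the step size. The W8A8 formats additionally quantize activations, which this sketch does not cover.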

The results

Figures 1 and 2 show the average score of the baseline and quantized Llama 3.1 8B and 70B Instruct models on the RULER benchmark across various sequence lengths. On average, the 8B model recovers 99.2% of the unquantized model’s accuracy, while the 70B model achieves 98.6% accuracy recovery.

Across all sequence lengths, most quantization formats maintain over 99.5% accuracy recovery, with one exception: INT W4A16 at 128K length, where accuracy recovery drops to 85% (8B) and 88% (70B). However, it is important to note that at this extreme length, even unquantized models perform poorly (average scores below 65 for both sizes). As a result, accuracy recovery at 128K becomes inherently noisy, making it difficult to draw definitive conclusions about quantization’s impact at this scale.

According to RULER’s evaluation criteria, models with such low accuracy are considered unsuitable for use at 128K sequence lengths—a limitation stemming from model architecture and training, rather than quantization itself.
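Accuracy recovery, as used above, is the quantized model's score expressed as a percentage of the unquantized baseline's score. The sketch below computes it and shows why the ratio becomes less stable when the baseline itself is low. The scores and the 2-point wobble are hypothetical values chosen for illustration, not measurements from this study, and `recovery_range` is an assumed helper, not part of any benchmark.

```python
def accuracy_recovery(quantized_score, baseline_score):
    """Quantized score as a percentage of the unquantized baseline score."""
    return 100.0 * quantized_score / baseline_score


def recovery_range(quantized_score, baseline_score, wobble):
    """Range of recovery if each score independently shifts by +/- wobble."""
    low = accuracy_recovery(quantized_score - wobble, baseline_score + wobble)
    high = accuracy_recovery(quantized_score + wobble, baseline_score - wobble)
    return low, high


# Hypothetical scores: a strong baseline versus a weak one.
strong_low, strong_high = recovery_range(94.5, 95.0, wobble=2.0)
weak_low, weak_high = recovery_range(51.0, 60.0, wobble=2.0)

strong_width = strong_high - strong_low
weak_width = weak_high - weak_low

print(f"strong baseline: {strong_low:.1f}% to {strong_high:.1f}%")
print(f"weak baseline:   {weak_low:.1f}% to {weak_high:.1f}%")
```

The same absolute fluctuation in raw scores produces a wider spread in recovery when the baseline is low, which is one reason to read the 128K recovery numbers with caution.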

Figure 1: Accuracy of baseline and quantized Llama 3.1 8B Instruct models across varying sequence lengths on the RULER benchmark.

Figure 2: Accuracy of baseline and quantized Llama 3.1 70B Instruct models across varying sequence lengths on the RULER benchmark.

Takeaways

Our findings show that quantized LLMs perform exceptionally well in long-context tasks. Across RULER’s benchmarks, quantized models consistently recover over 99% of the unquantized model’s accuracy, with a few exceptions at the extremes where even the unquantized models struggle.

These results align with our previous research, showing that carefully quantized models remain highly competitive with their unquantized counterparts across various academic and real-world benchmarks. Together, these studies debunk the misconception that quantization inherently compromises performance. Instead, with proper engineering, quantized models maintain strong accuracy while offering significant efficiency gains, making them an essential tool for scaling LLMs in real-world applications.

Get started with efficient AI

Neural Magic, now part of Red Hat, is committed to advancing open, efficient AI. Our state-of-the-art quantized models, benchmarks, and tools like LLM Compressor are fully open-sourced, enabling faster inference, lower costs, and production-ready performance. Explore our models on Hugging Face, deploy them with vLLM, or customize them with LLM Compressor to unlock tailored optimizations.

Last updated: September 23, 2025
