How well do quantized models handle long-context tasks?

Pushing the limits of accurate quantization

February 3, 2024
Eldar Kurtić, Mark Kurtz, Alexandre Marques, Dan Alistarh

    The 4-bit summary

    • 4-bit and 8-bit quantized LLMs excel in long-context tasks, retaining over 99% accuracy across 4K to 64K sequence lengths.
    • INT4 models show limitations at 128K sequence lengths, though even their unquantized counterparts struggled at this length.
    • Results are consistent across LLM sizes and diverse long-context evaluation tasks.
    • All models, results, and techniques are open-sourced on Hugging Face and GitHub.

    In our recent research blog, “We ran over half a million evaluations on quantized LLMs: Here's what we found,” we demonstrated that quantized large language models (LLMs) can rival their full-precision counterparts in accuracy across diverse benchmarks, covering both academic and real-world evaluations.

    However, the community raised an important question: how well do these models perform in long-context scenarios? With the growing demand for efficient processing of extended sequences through retrieval-augmented generation (RAG), agentic pipelines, and reasoning models, this question couldn't be ignored.

    To address it, we ran nearly 200K long-context evaluations, pushing quantized models to their limits. The results? Even in this challenging setup, quantized LLMs prove remarkably resilient, matching unquantized models in accuracy while improving inference efficiency.

    The framework

    To rigorously test quantized models in long-context scenarios, we use RULER, NVIDIA’s benchmark introduced in the paper “RULER: What’s the Real Context Size of Your Long-Context Language Models?” The benchmark generates synthetic examples with configurable sequence lengths and task complexities, providing a robust evaluation framework.

    Many LLMs struggle with RULER, showing significant performance degradation as sequence length increases, even though they achieve near-perfect scores on more straightforward needle-in-a-haystack tasks. To measure this degradation, we follow the default setup from the paper and evaluate models across four task categories (retrieval, multi-hop tracing, aggregation, and question answering) at sequence lengths of 4K, 8K, 16K, 32K, 64K, and 128K.
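
    To make the idea of a configurable synthetic example concrete, here is a toy sketch of a single retrieval-style item. It is not RULER's generator: the function names, the filler sentence, and the substring-based scoring are all assumptions made for illustration. The only point is that the context length is a parameter, so the same task can be produced at any size.

```python
import random


def make_retrieval_example(num_filler_sentences, seed=0):
    """Build one toy retrieval item: a 'needle' fact hidden in filler text.

    Hypothetical helper for illustration; not part of RULER.
    """
    rng = random.Random(seed)
    key = f"key-{rng.randint(1000, 9999)}"
    value = str(rng.randint(100000, 999999))
    needle = f"The special number for {key} is {value}."
    filler = "The grass is green and the sky is blue."

    # More filler sentences means a longer context around the same needle.
    sentences = [filler] * num_filler_sentences
    sentences.insert(rng.randint(0, num_filler_sentences), needle)

    return {
        "context": " ".join(sentences),
        "question": f"What is the special number for {key}?",
        "answer": value,
    }


def score(prediction, answer):
    """Return 1.0 if the expected answer appears in the model output."""
    return 1.0 if answer in prediction else 0.0


example = make_retrieval_example(num_filler_sentences=200)
print(example["question"])
print(len(example["context"].split()), "words of context")
```

    Increasing `num_filler_sentences` lengthens the context without changing the question, which is the property that lets a benchmark of this kind report scores per sequence length.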

    For models, we evaluate Neural Magic’s state-of-the-art quantized Llama-3.1-Instruct models at the 8B and 70B scales, using three different quantization formats: FP W8A8 (FP8 weights and activations), INT W8A8 (INT8 weights and activations), and INT W4A16 (INT4 weights with 16-bit activations, that is, weight-only quantization). For deeper insights into these formats and their impact on inference performance, see our research paper “Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization.
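
    For intuition about why the bit width matters, the sketch below applies plain round-to-nearest symmetric quantization to a handful of made-up weights at 8 and 4 bits. This is a deliberately simplified stand-in, not the procedure used to produce the evaluated models, and it covers only the integer formats (FP8 is a floating-point format and is not modeled here). All names and values are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization of a list of floats.

    Returns (integer codes, scale). Illustrative only.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the symmetric range [-qmax, qmax] after rounding.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between original and restored values."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


weights = [0.82, -0.41, 0.07, -0.93, 0.30, 0.55]  # made-up values
err_int8 = max_abs_error(weights, 8)
err_int4 = max_abs_error(weights, 4)
print(f"INT8 max error: {err_int8:.4f}")
print(f"INT4 max error: {err_int4:.4f}")
```

    With only 15 representable levels, the 4-bit grid is much coarser than the 8-bit one, so the rounding error on these sample weights is larger. That is why 4-bit weight quantization is generally the harder case for preserving accuracy.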

    The results

    Figures 1 and 2 show the average score of the baseline and quantized Llama 3.1 8B and 70B Instruct models on the RULER benchmark across various sequence lengths. On average, the 8B model recovers 99.2% of the unquantized model’s accuracy, while the 70B model achieves 98.6% accuracy recovery. 

    Across all sequence lengths, most quantization formats maintain over 99.5% accuracy recovery, with one exception: INT W4A16 at 128K length, where accuracy recovery drops to 85% (8B) and 88% (70B). However, it is important to note that at this extreme length, even unquantized models perform poorly (average scores below 65 for both sizes). As a result, accuracy recovery at 128K becomes inherently noisy, making it difficult to draw definitive conclusions about quantization’s impact at this scale.

    According to RULER’s evaluation criteria, models with such low accuracy are considered unsuitable for use at 128K sequence lengths—a limitation stemming from model architecture and training, rather than quantization itself.
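
    The sensitivity of accuracy recovery to a weak baseline can be seen with a little arithmetic. The sketch below assumes recovery is defined as the quantized score divided by the unquantized score, expressed as a percentage; the function name and the scores are hypothetical and are not taken from the figures in this post.

```python
def accuracy_recovery(quantized_score, baseline_score):
    """Quantized score as a percentage of the unquantized baseline score."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * quantized_score / baseline_score


# Hypothetical scores, NOT results from this study.
# Both cases lose the same 3 points in absolute terms.
strong_baseline = accuracy_recovery(quantized_score=92.0, baseline_score=95.0)
weak_baseline = accuracy_recovery(quantized_score=57.0, baseline_score=60.0)

print(f"strong baseline: {strong_baseline:.1f}% recovery")
print(f"weak baseline:   {weak_baseline:.1f}% recovery")
```

    The same absolute drop produces a lower recovery figure when the baseline itself is low, so small fluctuations in score translate into larger swings in recovery at lengths where the unquantized model already struggles.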

    Figure 1: Accuracy of baseline and quantized Llama 3.1 8B Instruct models across varying sequence lengths on the RULER benchmark.
    Figure 2: Accuracy of baseline and quantized Llama 3.1 70B Instruct models across varying sequence lengths on the RULER benchmark.

    Takeaways

    Our findings show that quantized LLMs perform exceptionally well in long-context tasks. Across RULER’s task categories, quantized models consistently recover over 99% of the unquantized model’s accuracy, with a few exceptions at the extremes where even the unquantized models struggle.

    These results align with our previous research, showing that carefully quantized models remain highly competitive with their unquantized counterparts across various academic and real-world benchmarks. Together, these studies debunk the misconception that quantization inherently compromises performance. Instead, with proper engineering, quantized models maintain strong accuracy while offering significant efficiency gains, making them an essential tool for scaling LLMs in real-world applications.

    Get started with efficient AI

    Neural Magic, now part of Red Hat, is committed to advancing open, efficient AI. Our state-of-the-art quantized models, benchmarks, and tools like LLM Compressor are fully open-sourced, enabling faster inference, lower costs, and production-ready performance. Explore our models on Hugging Face, deploy them with vLLM, or customize them with LLM Compressor to unlock tailored optimizations.

    Last updated: September 23, 2025
