
Deployment-ready reasoning with quantized DeepSeek-R1 models

Introducing new quantized reasoning models built on the DeepSeek-R1-Distill model suite

March 3, 2025
Eldar Kurtić, Alexandre Marques, Mark Kurtz, Dan Alistarh
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI


    The 4-bit breakdown

    • State-of-the-art, open source quantized reasoning models built on the DeepSeek-R1-Distill suite are now available.
    • FP8 and INT8 quantized versions achieve near-perfect accuracy recovery across all tested reasoning benchmarks and model sizes—except for the smallest INT8 1.5B model, which reaches 97%.
    • INT4 models recover 97%+ accuracy for 7B and larger models, with the 1.5B model maintaining ~94%.
    • With vLLM 0.7.2, we validated performance across many common inference scenarios and GPU hardware platforms, observing up to 4X better inference performance.

    In recent research, including We Ran Over Half a Million Evaluations on Quantized LLMs and How Well Do Quantized Models Handle Long-Context Tasks?, we’ve shown that quantized large language models (LLMs) rival their full-precision counterparts in accuracy across diverse benchmarks, spanning academic tasks, real-world use cases, and long-context evaluations, while delivering significant speed and cost benefits.

    With the rise of reasoning-focused models, like DeepSeek’s R1 series, a new challenge emerges: Can quantization preserve accuracy in complex reasoning scenarios requiring chain-of-thought, thinking tokens, and long-context comprehension?

    To answer this, we quantized and open-sourced the entire DeepSeek-R1-Distill model suite in three widely used formats (FP W8A8, INT W8A8, and INT W4A16), adhering to the best practices outlined in our recent paper, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization.

    Our evaluations across leading reasoning benchmarks confirm that with state-of-the-art (SOTA) quantization techniques, LLMs retain competitive reasoning accuracy while unlocking significant inference speedups. Figures 1 and 2 illustrate these improvements.

    Figure 1: Pass@1 score and standard deviation for quantized models on the popular reasoning benchmarks.
    Figure 2: Inference performance (in requests per second) for baseline and quantized DeepSeek-R1-Distill-Llama-70B chat-based deployment scenarios (512 prompt tokens, 256 generated tokens) with vLLM across various hardware platforms. Left: single-stream deployment (low latency). Right: maximum throughput multi-stream deployment.

    Want to get started right away? 

    The quantized DeepSeek-R1-Distill models, including Llama-8B, Llama-70B, Qwen-1.5B, Qwen-7B, Qwen-14B, and Qwen-32B, are now available as a Hugging Face collection with full evaluations, benchmarks, and setup instructions. Check them out now, or keep reading for deeper insights and key takeaways!
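
    For example, a minimal sketch of running one of these checkpoints with vLLM’s offline API, using DeepSeek’s recommended decoding settings, looks like the following. The repository ID is illustrative; use the exact names from the collection.

        # Minimal sketch: load a quantized checkpoint with vLLM's offline API.
        # The model ID below is illustrative; use the exact repository names
        # from the Hugging Face collection linked above.
        from vllm import LLM, SamplingParams

        llm = LLM(model="neuralmagic/DeepSeek-R1-Distill-Llama-8B-quantized.w4a16")
        params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)

        outputs = llm.generate(
            ["What is the sum of the first 100 positive integers? Think step by step."],
            params,
        )
        print(outputs[0].outputs[0].text)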

    Rigorous evals, real insights

    We quantized the DeepSeek-R1-Distill models using the LLM-Compressor library, which provides a simple, easy-to-use interface for SOTA model compression. The resulting models are optimized for high-performance inference with the popular vLLM inference and serving library.
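
    As a rough sketch of what this looks like (the exact recipes we shipped are included in each model card, and import paths can vary between LLM Compressor versions), a one-shot FP8 W8A8 quantization run is roughly:

        # Rough sketch of one-shot FP8 (W8A8) quantization with llmcompressor.
        # Import paths and scheme names may differ across library versions;
        # the released model cards contain the recipes actually used.
        from llmcompressor.transformers import oneshot
        from llmcompressor.modifiers.quantization import QuantizationModifier

        recipe = QuantizationModifier(
            targets="Linear",       # quantize all Linear layers...
            scheme="FP8_DYNAMIC",   # ...to FP8 weights with dynamic FP8 activations
            ignore=["lm_head"],     # keep the output head in higher precision
        )

        oneshot(
            model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
            recipe=recipe,
            output_dir="DeepSeek-R1-Distill-Llama-8B-FP8-dynamic",
        )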

    Get an introduction to vLLM on the Red Hat Blog: Unleash the full potential of LLMs: Optimize for performance with vLLM

    Reasoning benchmarks

    To rigorously evaluate their reasoning capabilities, we leveraged LightEval, Hugging Face’s lightweight LLM evaluation framework, running on vLLM for fast and scalable evals. Following DeepSeek’s recommendations for text generation, we used sampling with a temperature of 0.6 and top-p of 0.95, generating 20 responses per query to estimate the pass@1 score (a minimal sketch of this estimation follows the benchmark list below). Repeated sampling was important for estimating an accurate average, given the high variance across these relatively small datasets. We tested across three leading reasoning benchmarks:

    • AIME 2024: 30 expert-level math problems from the American Invitational Mathematics Examination (AIME).
    • MATH-500: 500 challenging problems curated from OpenAI’s MATH benchmark.
    • GPQA-Diamond: A set of challenging, expert-validated multiple-choice questions spanning biology, physics, and chemistry.
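
    The sketch below illustrates the pass@1 estimation described above, using vLLM’s n-sampling to draw 20 completions per problem; the is_correct() checker is a hypothetical stand-in for each benchmark’s answer-matching logic.

        # Sketch of the pass@1 estimate: sample 20 completions per problem at
        # temperature 0.6 / top-p 0.95 and average per-problem correctness.
        # is_correct() is a hypothetical stand-in for benchmark-specific matching.
        from vllm import LLM, SamplingParams

        def estimate_pass_at_1(llm, problems, is_correct, n_samples=20):
            params = SamplingParams(temperature=0.6, top_p=0.95,
                                    n=n_samples, max_tokens=4096)
            outputs = llm.generate([p["prompt"] for p in problems], params)
            scores = []
            for problem, out in zip(problems, outputs):
                hits = sum(is_correct(c.text, problem["answer"]) for c in out.outputs)
                scores.append(hits / n_samples)
            return sum(scores) / len(scores)   # average pass@1 across problems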

    Figure 3 presents the average pass@1 scores of the various quantized DeepSeek-R1-Distill models across the reasoning benchmarks:

    • FP W8A8 (8-bit floating-point weights and activations) demonstrates near-lossless accuracy, matching BF16.
    • INT W8A8 (8-bit integer weights and activations) closely follows, recovering ~99% of the original accuracy.
    • INT W4A16 (4-bit weight-only integer quantization) exhibits a slight drop on AIME and GPQA-Diamond, while performing strongly on MATH-500: the Qwen-1.5B model, the smallest in the suite, accounts for most of this drop.
    Figure 3: The average pass@1 score of quantized DeepSeek-R1-Distill models across the popular reasoning benchmarks (AIME 2024, MATH-500, and GPQA Diamond).

    General benchmarks

    To ensure generalization beyond reasoning tasks, we also evaluated all models on the standard Open LLM Leaderboard V1 benchmark, including MMLU, ARC-Challenge, HellaSwag, Winogrande, GSM8k, and TruthfulQA.

    On the Open LLM Leaderboard V1 benchmark, Figure 4 shows that quantized models consistently achieve over 99% accuracy recovery, with only one outlier: the Llama-8B model at INT W4A16, which experiences a modest drop to 97.33%. Despite being optimized for complex reasoning, the original and quantized models perform strongly on standard academic benchmarks.

    Figure 4: Average score of quantized DeepSeek-R1-Distill models on the Open LLM Leaderboard V1 benchmark.

    These results align with our recent paper, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization:

    • FP W8A8 consistently delivers lossless compression, maintaining full accuracy while accelerating inference on Hopper and Ada Lovelace GPUs.
    • INT W8A8 closely matches FP W8A8 performance, making it an effective alternative for Ampere and older devices.
    • INT W4A16 performs competitively on larger models but shows some accuracy loss on smaller ones.

    These results confirm that state-of-the-art quantization techniques preserve reasoning accuracy while enabling more efficient deployment. However, accuracy is only one side of the equation—real-world usability depends on inference speed, latency, and hardware efficiency. In the next section, we dive into how these quantized models perform in deployment, benchmarking their inference speed across different hardware configurations and use cases.

    Inference performance in vLLM

    To assess deployment performance, we benchmarked the DeepSeek-R1-Distill models across multiple hardware configurations and workloads using vLLM 0.7.2, focusing on latency, throughput, and server scenarios. More information on the workloads considered here can be found in the model cards of the quantized models.
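
    As a rough illustration of the two regimes reported below (not the actual benchmark harness), single-stream and maximum-throughput behavior can be compared with vLLM’s offline API; the model ID, tensor-parallel setting, and prompts here are placeholders.

        # Rough illustration of the two deployment regimes: single-stream
        # (one request at a time, latency-bound) vs. maximum-throughput batching.
        # Model ID, tensor_parallel_size, and prompts are placeholders rather
        # than the actual benchmark configuration.
        import time
        from vllm import LLM, SamplingParams

        llm = LLM(model="neuralmagic/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16",
                  tensor_parallel_size=2)
        params = SamplingParams(max_tokens=256)             # chat-style generation length
        prompts = ["<roughly 512-token chat prompt>"] * 64  # placeholder workload

        start = time.perf_counter()
        for p in prompts:                  # single-stream: one request at a time
            llm.generate([p], params)
        single_rps = len(prompts) / (time.perf_counter() - start)

        start = time.perf_counter()
        llm.generate(prompts, params)      # multi-stream: batch everything at once
        batched_rps = len(prompts) / (time.perf_counter() - start)

        print(f"single-stream: {single_rps:.2f} req/s, max throughput: {batched_rps:.2f} req/s")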

    Figure 2, shown at the beginning of this post, summarizes the overall results, showcasing the performance gains from quantization for the DeepSeek-R1-Distill-Llama-70B model in a chat use case (512 prompt tokens, 256 generated tokens) across various GPU hardware platforms:

    • A6000: INT W4A16 enables 2.1X faster requests for single stream while INT W8A8 provides 1.5X better throughput.
    • A100: INT W4A16 enables 1.7X faster requests for single stream while INT W8A8 provides 2X better throughput.
    • H100: FP W8A8 enables 1.4X faster requests for single stream and provides 4.3X better throughput.

    Figure 5 extends these insights by presenting average inference speedups across all workloads—including chat, instruction following, summarization, retrieval-augmented generation (RAG), and coding—on A6000, A100, and H100 GPUs.

    • Single-stream (low-latency) deployments: W4A16 delivers the highest speedups, up to 1.9X over the baseline. This trend is especially pronounced for medium and large models, where memory and compute optimizations have a more substantial impact.
    • High-throughput multi-stream scenarios: W8A8 (FP and INT formats) achieve the best performance gains, particularly for larger models, averaging 1.3-1.7X better throughput across the tested workloads.
    • The DeepSeek-R1-Distill-Qwen-1.5B model sees minimal speedup, as its BF16 variant is already lightweight relative to the compute and memory capacity of the tested GPUs.
    Figure 5: Inference performance (in relative speedup) for baseline and quantized DeepSeek-R1-Distill models deployed with vLLM across various scenario workloads (chat, instruction following, summarization, RAG, and coding) and various GPU hardware (A6000, A100, H100). Left: single-stream deployment (low latency). Right: maximum throughput multi-stream deployment.

    Conclusion

    Our results demonstrate that quantized reasoning LLMs perform strongly on the most challenging benchmarks while delivering substantial inference speedups. Whether optimizing for low-latency applications or high-throughput scaling, these models provide an efficient, deployment-ready solution without sacrificing reasoning accuracy.

    All models are fully open-sourced in our Hugging Face model collection, complete with LLM-Compressor recipes to reproduce and fine-tune the quantization process.

    Get started with efficient AI

    Neural Magic, now part of Red Hat, is committed to advancing open, efficient AI. Our state-of-the-art quantized models, benchmarks, and tools like LLM Compressor are fully open-sourced, enabling faster inference, lower costs, and production-ready performance. Explore our models on Hugging Face, deploy them with vLLM, or customize them with LLM Compressor to unlock tailored optimizations.

    Contact us to learn more about enterprise-grade AI solutions or contribute to the open source ecosystem today!
