Advancing low‑bit quantization for LLMs: AutoRound x LLM Compressor

Achieve faster, more efficient LLM serving without sacrificing accuracy

December 9, 2025
Intel Neural Compressor Team, Red Hat AI Model Optimization Team
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    AutoRound, a state‑of‑the‑art post‑training quantization (PTQ) algorithm developed by Intel, is now integrated into LLM Compressor. This collaboration delivers:

    • Higher accuracy for low bit-width quantization
    • Lightweight tuning (hundreds of steps instead of thousands)
    • Zero additional inference overhead
    • Seamless compatibility with compressed tensors and direct serving in vLLM
    • A streamlined workflow that lets you quantize and serve models with just a few lines of code

    Broader quantization schemes and model coverage are coming next. Try it now and help shape what we build.

    What is AutoRound?

    AutoRound is a post-training quantization (PTQ) algorithm: it reduces the size of large language models (LLMs) and vision-language models (VLMs) after training, without retraining the model weights. It introduces three trainable parameters per quantized tensor: v (a rounding offset) and α and β (learned controls on the clipping range). By processing decoder layers sequentially and applying signed gradient descent, AutoRound jointly optimizes rounding and clipping to minimize block‑wise output reconstruction error.
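    To make the roles of v, α, and β concrete, here is a minimal pure-Python sketch of a quantize-dequantize step for one weight group. It is an illustration under stated assumptions, not AutoRound's implementation: the function names are hypothetical, the asymmetric scale and zero-point formula is one common formulation, and the offset v is set by hand here, whereas AutoRound learns it with signed gradient descent.

```python
def quantize_dequantize(weights, v, alpha=1.0, beta=1.0, bits=4):
    """Asymmetric quantize-dequantize of one weight group (illustrative only).

    v     : per-weight rounding offsets, expected to stay in [-0.5, 0.5]
    alpha : multiplier on the group's maximum (clipping control)
    beta  : multiplier on the group's minimum (clipping control)
    """
    qmax = 2 ** bits - 1
    w_max = max(weights) * alpha
    w_min = min(weights) * beta
    scale = (w_max - w_min) / qmax          # assumes w_max > w_min
    zero_point = round(-w_min / scale)
    result = []
    for w, offset in zip(weights, v):
        q = round(w / scale + offset) + zero_point
        q = min(max(q, 0), qmax)            # clip to the integer grid
        result.append(scale * (q - zero_point))
    return result


def output_error(weights, dequantized, x):
    """Absolute error of the dot product w.x after quantization."""
    exact = sum(w * xi for w, xi in zip(weights, x))
    approx = sum(w * xi for w, xi in zip(dequantized, x))
    return abs(exact - approx)


# Hypothetical toy data: two identical weights that both see an input of 1.0.
weights = [-1.0, 0.26, 0.26, 2.0]
x = [0.0, 1.0, 1.0, 0.0]

# v = 0 everywhere is plain round-to-nearest.
rtn = quantize_dequantize(weights, v=[0.0, 0.0, 0.0, 0.0])
# Nudging one weight upward makes that weight less accurate on its own,
# but lets the two rounding errors partially cancel in the output.
tuned = quantize_dequantize(weights, v=[0.0, 0.3, 0.0, 0.0])

err_rtn = output_error(weights, rtn, x)
err_tuned = output_error(weights, tuned, x)
print(err_rtn, err_tuned)
```

    In this toy case the hand-picked offset lowers the output error even though one individual weight is rounded "the wrong way", which is the intuition behind optimizing rounding against output reconstruction error rather than per-weight error.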

    Core strengths:

    • Superior accuracy, especially at very low bit‑widths
    • Supports multiple data types: W4A16, MXFP8, MXFP4, FP8, NVFP4, with more coming soon
    • Mixed‑bit, layer‑wise precision search for flexible accuracy–efficiency trade‑offs
    • Works with both large language models (LLMs) and vision-language models (VLMs)

    AutoRound creates quantized models in low-bit formats that accelerate inference on Intel Xeon processors, Intel Gaudi AI accelerators, Intel Data Center GPUs, and Intel Arc B-Series Graphics, as well as on other GPUs (for example, CUDA-based devices).

    Looking forward, Intel is adding native support for FP8, MXFP8, and MXFP4 formats to its next-generation Intel Data Center GPU codenamed Crescent Island. Models quantized with AutoRound will naturally scale to take advantage of these data types across the Intel AI hardware portfolio. This creates a consistent path from algorithmic innovation to real-world deployment.

    For more details, refer to the AutoRound paper (EMNLP 2024) and the GitHub repository intel/auto-round.

    Why integrate into LLM Compressor?

    LLM Compressor already provides a unified, modular system for compression techniques such as quantization and pruning. Integrating AutoRound into this ecosystem:

    • Aligns with the existing modifier architecture (for example, GPTQModifier)
    • Reuses the sequential calibration and layer‑onloading infrastructure
    • Enables future interoperability with richer multi‑modifier recipes
    • Produces quantized models that are ready for vLLM serving, enabling a clean workflow from compression to deployment

    Integration overview

    We completed the first stage of integration by introducing the new AutoRoundModifier into LLM Compressor, enabling production of wNa16 (for example, W4A16) compressed models that load in vLLM, as implemented in PR #1994. With a straightforward configuration—just specify your model and calibration data—you can quickly generate high‑quality low‑bit checkpoints. This initial stage supports quantizing a range of dense LLMs, including the Llama and Qwen model families, and demonstrates robust compatibility for practical deployment.
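    As a rough illustration of what a W4A16 checkpoint saves, the following back-of-the-envelope sketch estimates weight storage for an 8-billion-parameter model. The numbers are assumptions rather than measurements: a round parameter count of 8e9, a group size of 128 (as suggested by the "G128" in the checkpoint name used later), one 16-bit scale per group, and no accounting for zero points or for layers left unquantized, such as lm_head. The helper name is hypothetical.

```python
def weight_bytes(num_params, weight_bits, group_size=None, scale_bits=16):
    """Approximate storage for weights, plus one scale per group if grouped."""
    total_bits = num_params * weight_bits
    if group_size is not None:
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8


params = 8e9  # assumed round figure, not an exact parameter count
bf16 = weight_bytes(params, 16)
w4 = weight_bytes(params, 4, group_size=128)

print(f"16-bit weights:        {bf16 / 1e9:.2f} GB")
print(f"4-bit, group size 128: {w4 / 1e9:.2f} GB")
print(f"Approximate reduction: {bf16 / w4:.2f}x")
```

    Under these assumptions the per-group scales add only a few percent on top of the 4-bit weights, so the reduction stays close to 4x.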

    Try it now: Quick start

    This quick start walks you through the process from installation to evaluating the quantized model's performance.

    1. Install

    Start by cloning the repository and installing the necessary Python package.

    git clone https://github.com/vllm-project/llm-compressor.git
    cd llm-compressor
    pip install -e .

    2. Load model and tokenizer

    Load the model and tokenizer from the Hugging Face Model Hub, specifying the desired model ID.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    MODEL_ID = "Qwen/Qwen3-8B"
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    3. Prepare calibration data

    Set up the calibration dataset: a small, unlabeled set of samples used to tune the quantization parameters.

    from auto_round.calib_dataset import get_dataset
    NUM_CALIBRATION_SAMPLES = 128
    MAX_SEQUENCE_LENGTH = 2048
    ds = get_dataset(tokenizer=tokenizer,
                     seqlen=MAX_SEQUENCE_LENGTH,
                     nsamples=NUM_CALIBRATION_SAMPLES)

    4. Run quantization using AutoRound

    AutoRound quantization can run on a variety of devices, including CPUs and GPUs. Quantization and serving do not have to happen on the same device. For example, you can quantize on a workstation with a GPU and later deploy on an AI PC.

    from llmcompressor import oneshot
    from llmcompressor.modifiers.autoround import AutoRoundModifier
    recipe = AutoRoundModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=["lm_head"],
        iters=200,
    )
    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
        shuffle_calibration_samples=False,
    )
    SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16-G128-AutoRound"
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)

    In practice, 128 calibration samples and about 200 iterations are often enough to reach stable convergence. Increase the number of samples or iterations if you are targeting very low bit-widths or tighter accuracy targets.
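    Before serving, it can be useful to confirm that the saved checkpoint actually records its quantization settings. The sketch below assumes the checkpoint's config.json carries a quantization_config entry, as Hugging Face-style quantized checkpoints typically do; the helper name is hypothetical, and the exact contents of that entry are not specified here.

```python
import json
from pathlib import Path


def read_quantization_config(save_dir):
    """Return the quantization_config entry from config.json, or None if absent."""
    config_path = Path(save_dir) / "config.json"
    with config_path.open() as f:
        config = json.load(f)
    return config.get("quantization_config")


# Only inspect the directory if the quantization step has already produced it.
save_dir = Path("Qwen3-8B-W4A16-G128-AutoRound")
if (save_dir / "config.json").exists():
    print(read_quantization_config(save_dir))
```

    A result of None would suggest the model was saved without compression metadata and is worth investigating before deployment.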

    5. Serve in vLLM

    Once quantization is complete, the same compressed model can be served on different hardware, independent of the device used for tuning. For example, you can serve the quantized Qwen3‑8B‑W4A16‑G128‑AutoRound model on a single Intel Arc Pro B60 GPU:

    vllm serve Qwen3-8B-W4A16-G128-AutoRound \
        --dtype=bfloat16 \
        --enforce-eager \
        --gpu-memory-utilization 0.8 \
        --max-num-batched-tokens 8192

    Note

    Install vLLM from PR 29484 to serve this model. When serving on XPU, you must run vLLM with the --enforce-eager flag.
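    Once the server is running, you can send it a test prompt through vLLM's OpenAI-compatible completions endpoint. This is a minimal sketch using only the Python standard library; it assumes the server listens on the default http://localhost:8000 and that the model is registered under the name passed to vllm serve. The helper names and the sample prompt are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt,
                             model="Qwen3-8B-W4A16-G128-AutoRound",
                             base_url="http://localhost:8000",
                             max_tokens=64):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Requires a running server, so the call is left commented out:
# print(complete("What is 12 * 7?"))
```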

    6. Evaluate (Example: GSM8K with lm_eval)

    Finally, you can evaluate the quantized model's performance on a benchmark dataset using the lm_eval command-line utility.

    lm_eval --model vllm \
      --model_args pretrained="./Qwen3-8B-W4A16-G128-AutoRound,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enforce_eager=True" \
      --tasks gsm8k \
      --num_fewshot 5 \
      --limit 1000 \
      --seed 42 \
      --batch_size 128

    |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
    |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
    |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.911|±  | 0.009|
    |     |       |strict-match    |     5|exact_match|↑  |0.911|±  | 0.009|
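
    To read the Stderr column, you can turn the reported value and standard error into an approximate confidence interval. The sketch below assumes a normal approximation with z = 1.96 for a 95% interval; the helper name is hypothetical, and the inputs are the figures from the table above.

```python
def confidence_interval(value, stderr, z=1.96):
    """Approximate confidence interval under a normal approximation."""
    margin = z * stderr
    return value - margin, value + margin


low, high = confidence_interval(0.911, 0.009)
print(f"GSM8K exact_match: 0.911, approx. 95% CI [{low:.3f}, {high:.3f}]")
```

    When you compare against an unquantized baseline evaluated with the same settings, overlapping intervals suggest the difference may be within sampling noise for this 1,000-example subset.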

    Conclusion and future plans

    With this first integration, AutoRound and LLM Compressor already provide a practical, production‑oriented path to low‑bit LLMs: W4A16 quantization is supported end‑to‑end, the workflow is simple to configure, and dense models such as Llama and Qwen are supported. The setup is robust, streamlined, and ready for practical deployment.

    Looking ahead, we plan to extend support to additional schemes such as FP8, MXFP4, MXFP8, and NVFP4, add automatic mixed‑bit search for fine‑grained per‑layer optimization, and cover more model families, including Mixture‑of‑Experts (MoE) models. We also aim to deepen interoperability with other algorithms in LLM Compressor, which will allow AutoRound to be combined into richer multi‑modifier recipes that serve both community use cases and Intel production workloads.

    If you’d like to influence which formats, models, and workflows we prioritize next, join the discussion in RFC #1968 and share your benchmarks or deployment requirements, or bring your feedback to the Intel community so we can align the roadmap with real‑world needs.

    Acknowledgements

    We wish to acknowledge the contributions of the LLM Compressor community. Specifically, we thank Kyle Sayers, Dipika Sikka, Brian Dellabetta, Charles Hernandez, and Robert Shaw for their invaluable feedback on the early proposal and their diligent review of the pull requests.

    Related RFCs and PRs

    • llm-compressor#1968
    • llm-compressor#1994
    • llm-compressor#2055
    • llm-compressor#2062
    • auto-round#993
    • auto-round#1053
    • auto-round#1055
    • auto-round#1072
    • vllm#29484
