LLM Compressor 0.9.0: Attention quantization, MXFP4 support, and more

Explore new features including attention and KV cache quantization, model-free PTQ, AutoRound, and MXFP4

January 16, 2026
Kyle Sayers, Charles Hernandez, Dipika Sikka, Brian Dellabetta
Related topics:
Artificial intelligence
Related products:
Red Hat AI Inference Server, Red Hat AI

    The LLM Compressor 0.9.0 release introduces a new set of features and strengthens existing ones. These include attention quantization support with extended KV cache quantization functionality, a new model_free_ptq pathway, a new AutoRound quantization modifier, and experimental support for MXFP4 quantization.

    This release also builds on existing compression functionality by adding batched calibration support for improved runtime. The AWQModifier was also updated to work with quantization schemes beyond W4A16.

    LLM Compressor 0.9.0 includes the following updates, explored in detail below:

    • Refactored and expanded attention and KV cache quantization
    • Quantize any model to FP8 using model_free_ptq
    • New AutoRoundModifier
    • Experimental MXFP4 support
    • Batched calibration support
    • AWQ updates and other enhancements

    Refactored and expanded attention and KV cache quantization

    A major part of this release involved refactoring KV cache quantization. The previous logic was limited to FP8 per-tensor quantization. This section covers the new capabilities and design details behind the refactor.

    In LLM Compressor 0.9.0, you can:

    • Apply arbitrary KV cache quantization to models (FP4, INT8, FP8, per channel, etc.)
    • Apply arbitrary attention quantization to models (FP4, INT8, FP8, per channel, etc.)
    • Apply SpinQuant-style R3 rotations to attention for accuracy improvement
    • Run KV cache/attention quantized models using the compressed-tensors integration with Hugging Face (a minimal recipe sketch follows the note below)

    Important note

    vLLM does not support all KV cache and attention quantization schemes. Refer to the vLLM quantized KV cache documentation for the latest support information.
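
    The following is a minimal sketch of applying FP8 quantization to both the linear layers and the KV cache through oneshot, assuming the QuantizationModifier and oneshot interfaces carried over from previous releases; the model and dataset names are placeholders, and the attention examples linked at the end of this section contain the canonical recipes, including Q/K/V quantization and R3 rotations.

    # Hedged sketch: FP8 linear quantization plus an FP8 per-tensor KV cache scheme.
    # The checkpoint and calibration dataset below are illustrative choices only.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",          # FP8 weights with dynamic FP8 activations
        ignore=["lm_head"],
        kv_cache_scheme={              # quantize cached keys/values as FP8, per tensor
            "num_bits": 8,
            "type": "float",
            "strategy": "tensor",
            "dynamic": False,
            "symmetric": True,
        },
    )

    oneshot(
        model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder checkpoint
        dataset="open_platypus",                   # calibration data for static KV cache scales
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )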

    Figure 1: Llama Attention module components.

    Core to the redesign of KV cache and attention quantization was a shift in perspective: rather than treating Q/K/V quantization as separate implementations of calibration and quantization, the new design treats Q/K/V as locations where calibration and quantization can be applied.

    This manifests in newly added hook functions:

    • register_query_hook
    • register_key_hook
    • register_value_hook

    Two new submodules, QuantizedKVCache and QuantizedAttentionImpl, manage these hooks. QuantizedKVCache acts as a wrapper around Hugging Face's DynamicCache class, while QuantizedAttentionImpl inserts itself using Hugging Face's AttentionInterface. These classes work with existing Hugging Face abstractions. Moving these definitions from LLM Compressor to compressed-tensors lets QKV quantized models run directly with Hugging Face model definitions.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load an attention/KV cache quantized checkpoint with the stock Hugging Face APIs
    model = AutoModelForCausalLM.from_pretrained("nm-testing/Llama-3.2-1B-Instruct-attention-fp8-head")
    tokenizer = AutoTokenizer.from_pretrained("nm-testing/Llama-3.2-1B-Instruct-attention-fp8-head")

    # Tokenize to PyTorch tensors before generating
    model.generate(**tokenizer("Attention quantization is awesome", return_tensors="pt"))

    For more information and examples, see the following links:

    • LLM Compressor: [Quantization] Attention/ KV cache refactor
    • compressed-tensors: [Transform] Attention/Cache transforms
    • LLM Compressor's Attention examples

    Quantize any model to FP8 using model_free_ptq

    Using LLM Compressor usually involves calling the oneshot function with a transformers model definition supplied by the user. However, if a model definition is not available, you can now use model_free_ptq to quantize models directly from safetensors files without loading the model through transformers. This pathway is:

    • Ideal for models that lack a Python model class (such as community or experimental releases), very large models like Kimi-K2, or cases where oneshot fails.
    • Capable of data-free quantization, such as FP8 and NVFP4A16, enabling lightweight quantization without calibration datasets.
    • Used in practice for models like Mistral Large 3.

    LLM Compressor Observers are directly applied to the model checkpoint, removing the requirement for a model definition. This addition makes it easier to bring compressed models into production regardless of their source format.

    Example: Quantize Kimi-K2-Thinking to FP8 Block.

    from llmcompressor import model_free_ptq
    MODEL_ID = "unsloth/Kimi-K2-Thinking-BF16"
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-BLOCK"
    # Apply FP8-Block to the model
    # Once quantized, the model is saved
    # using compressed-tensors to the SAVE_DIR.
    # Unlike oneshot, we pass in the model stub directly 
    model_free_ptq(
        model_stub=MODEL_ID,
        save_directory=SAVE_DIR,
        scheme="FP8_BLOCK",
        ignore=[
            "re:.*gate$",
            "lm_head",
            "re:.*kv_a_proj_with_mqa$",
            "re:.*q_a_proj$",
            "model.embed_tokens",
        ],
        max_workers=15,
        device="cuda:0",
    )

    See the LLM Compressor examples for scripts that quantize other models using model_free_ptq.

    Introducing the AutoRoundModifier

    LLM Compressor now supports Intel's AutoRound algorithm, an advanced quantization method designed for high-accuracy, low-bit compression, through the new AutoRoundModifier. AutoRound quantized models are fully compatible with compressed-tensors and can be served directly using vLLM.

    The technique introduces three trainable parameters—V, α, and β—that optimize rounding decisions and clipping ranges during quantization (see Figure 2). It processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. By combining efficient post-training quantization with adaptive parameter tuning, AutoRound delivers robust compression for large language models while maintaining inference performance.

    Figure 2: AutoRound quantization process for LLMs.
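
    As a rough illustration, the following sketch applies AutoRound through the standard oneshot flow; the import path, argument names, and model/dataset choices are assumptions, so use the examples folder referenced below as the authoritative starting point.

    # Hedged sketch: AutoRound-tuned low-bit weight quantization via oneshot.
    # The import path and arguments are assumed to mirror other modifiers.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.autoround import AutoRoundModifier

    recipe = AutoRoundModifier(
        targets="Linear",
        scheme="W4A16",      # low-bit weight-only scheme whose rounding is tuned by AutoRound
        ignore=["lm_head"],
    )

    oneshot(
        model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder checkpoint
        dataset="open_platypus",                   # calibration data for block-wise reconstruction
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=128,
    )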

    For examples and comparisons with other PTQ algorithms, see the examples folder. Additional details can also be found in the AutoRound x LLM Compressor collaboration blog.

    Experimental MXFP4 support

    LLM Compressor now includes experimental support for MXFP4 quantization. This update enables MXFP4 weight quantization as well as calibration and compression of MXFP4 scaling factors. MXFP4 quantization can be applied using the new MXFP4 preset quantization scheme. Once quantized, models can be compressed and serialized with the MXFP4PackedCompressor, which packs both weights and scales as uint8 tensors. This approach reduces storage overhead while preserving the data required for dequantization at inference time. MXFP4 support is currently experimental and pending validation with vLLM.
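
    Since the MXFP4 pathway is experimental, the following is only a sketch of what applying the new MXFP4 preset might look like; the preset name comes from this release, while the checkpoint, save path, and data-free weight-only flow are assumptions to be checked against the experimental examples.

    # Hedged sketch: data-free MXFP4 weight quantization using the new preset scheme.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(
        targets="Linear",
        scheme="MXFP4",      # group-size-32 FP4 weights with uint8-packed scales
        ignore=["lm_head"],
    )

    # Weight-only quantization observes the weights directly, so no calibration
    # dataset is passed here.
    oneshot(
        model="meta-llama/Llama-3.2-1B-Instruct",     # placeholder checkpoint
        recipe=recipe,
        output_dir="Llama-3.2-1B-Instruct-MXFP4",     # placeholder save path
    )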

    This release also expands the QuantizationArgs API to offer more control over how quantization parameters are stored, specifically for advanced and mixed-precision schemes.

    Two new fields have been introduced. 

    scale_dtype controls how quantization scales are serialized:

    • When set to None, scales are stored using the default dense data type.
    • When explicitly specified, scales are compressed using the provided data type.
    • For example, NVFP4 stores scales in float8_e4m3fn, while MXFP4 stores scales as uint8.
    • The chosen data type is persisted in the model configuration, ensuring correctness during loading and inference.

    zp_dtype specifies the data type used to store zero-point values:

    • Set to None for symmetric quantization schemes.
    • Required for asymmetric quantization, where zero-points must be explicitly saved.
    • This allows consistent handling of asymmetric models while minimizing storage overhead.

    Example: NVFP4 weights.

    "weights": {
      "actorder": null,
      "block_structure": null,
      "dynamic": false,
      "group_size": 16,
      "num_bits": 4,
      "observer": "static_minmax",
      "observer_kwargs": {},
      "scale_dtype": "torch.float8_e4m3fn",
      "strategy": "tensor_group",
      "symmetric": true,
      "type": "float",
      "zp_dtype": null
    }

    Example: MXFP4 weights.

    "weights": {
      "actorder": null,
      "block_structure": null,
      "dynamic": false,
      "group_size": 32,
      "num_bits": 4,
      "observer": "minmax",
      "observer_kwargs": {},
      "scale_dtype": "torch.uint8",
      "strategy": "group",
      "symmetric": true,
      "type": "float",
      "zp_dtype": null
    }

    Together, these additions make quantization behavior more explicit, configurable, and extensible—laying the groundwork for supporting new low-precision formats while maintaining compatibility with existing runtimes and tooling. Experimental examples can be found under the experimental folder.

    Batched calibration support

    As a precursor to LLM Compressor's next release, which will focus on performance improvements, version 0.9.0 introduces batched calibration to speed up calibration and quantization.

    Batching samples is known to yield higher throughput during inference, primarily due to better cache locality during computation. The same principle applies during model calibration, as well as during optimization for algorithms such as AWQ and AutoRound, which perform their own forward passes through the model.

    Previously, batching samples was difficult because large batch sizes lead to large model outputs, which can exceed available VRAM.

    (batch_size = 64) * (seq_len = 2048) * (vocab_size = 128K) ≈ 16 GB of extra VRAM

    This issue is specific to the LM head, whose output logits dominate memory use at large batch sizes. Because LLM Compressor algorithms do not require these outputs, we allocate them on the PyTorch meta device instead of VRAM to save memory.
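
    As an illustration of why this works, tensors created on PyTorch's meta device carry only shape and dtype metadata and allocate no memory, so routing the unused LM head outputs there removes the logits' VRAM cost entirely.

    # Illustrative only: a "logits-sized" tensor on the meta device allocates no memory.
    import torch

    logits = torch.empty((64, 2048, 128_000), dtype=torch.bfloat16, device="meta")
    print(logits.shape, logits.device)  # torch.Size([64, 2048, 128000]) meta
    print(f"{logits.nelement() / 1e9:.1f}B logit values, zero bytes allocated")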

    Using batch_size=32 can yield a 3x speedup on large models for algorithms that rely on many onloaded forward passes, such as AWQ, and a smaller ~15% speedup for algorithms like GPTQ.

    AWQ updates and other improvements

    LLM Compressor 0.9.0 includes a series of improvements to the AWQModifier.

    Generalized scheme support

    The AWQModifier now supports additional quantization schemes beyond W4A16. The previous implementation used one-off quantization logic that limited supported configurations. By adopting existing LLM Compressor abstractions, the code is simpler and supports new quantization schemes including INT8, FP8, and mixed schemes within a single model.
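
    As a rough sketch, applying AWQ with a non-W4A16 scheme now looks the same as any other modifier-based recipe; the scheme name and the model/dataset below are illustrative assumptions, and the AWQ examples contain tested configurations.

    # Hedged sketch: AWQ smoothing applied ahead of an INT8 W8A8 scheme.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQModifier

    recipe = AWQModifier(
        targets="Linear",
        scheme="W8A8",       # INT8 weights and activations rather than the old W4A16-only path
        ignore=["lm_head"],
    )

    oneshot(
        model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder checkpoint
        dataset="open_platypus",                   # AWQ performs its own calibration forward passes
        recipe=recipe,
        num_calibration_samples=256,
    )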

    Matching improvements

    The AWQModifier and SmoothQuantModifier previously relied on suboptimal layer-mapping logic: SmoothQuant did not support MoE models, AWQ mishandled skip-layer behavior, and both struggled to resolve mappings within certain parent modules. This release reworks the mapping logic to address these issues.

    Other updates and improvements

    We also updated several other components, including observers, activation quantization, and expert model calibration.

    Expanded observer types

    We simplified our observer tools and added several new types:

    • memoryless_minmax: Real-time min/max, recommended for PTQ weight quantization.
    • static_minmax: Absolute min/max across all observations, recommended for PTQ activation quantization. This is now the default for NVFP4 activations.
    • memoryless_mse: Reduces quantization error by minimizing MSE per observation; also recommended for PTQ weights. Future releases will default to memoryless_minmax for weights and static_minmax for activations.

    New activation quantization support

    In addition to dynamic-per-token activation quantization, LLM Compressor now supports group and channel activation quantization.
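
    A minimal sketch of per-group activation quantization through config_groups follows; the field names mirror the weight configuration shown earlier, while the bit widths, group size, and model/dataset choices are illustrative assumptions.

    # Hedged sketch: group-wise quantization for both weights and input activations.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(
        ignore=["lm_head"],
        config_groups={
            "group_0": {
                "targets": ["Linear"],
                "weights": {
                    "num_bits": 4,
                    "type": "int",
                    "strategy": "group",
                    "group_size": 128,
                    "symmetric": True,
                },
                "input_activations": {     # group activation quantization, new in 0.9.0
                    "num_bits": 8,
                    "type": "int",
                    "strategy": "group",
                    "group_size": 128,
                    "symmetric": True,
                    "dynamic": False,      # static scales calibrated from the dataset below
                },
            },
        },
    )

    oneshot(
        model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder checkpoint
        dataset="open_platypus",                   # calibration data for static activation scales
        recipe=recipe,
        num_calibration_samples=512,
    )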

    Updated MoE calibration support

    We improved MoE calibration with a new MoECalibrationModule and an updated calibration context so that expert layers calibrate correctly. This update guarantees all experts receive data during forward passes and allows for the quantization of Qwen3 VL and Qwen3 MoE models using data-dependent schemes such as NVFP4, W4A16, and static activation quantization.

    Conclusion

    The LLM Compressor 0.9.0 release advances quantization capabilities by introducing a refactored attention and KV cache system. It also includes the model_free_ptq pathway for quantizing models without transformer definitions and the AutoRoundModifier for high-accuracy compression.

    These updates, along with experimental MXFP4 support, batched calibration for up to 3x faster runtimes, and generalized AWQ support, expand the tools available for tuning large-scale models. With better support for complex architectures like Kimi-K2 and Qwen3 MoE, this release offers the flexibility and performance needed for modern LLM deployment.

    Explore the latest features, updated documentation, and experimental examples in the LLM Compressor repository.
