LLM Compressor 0.8.0: Extended support for Qwen3 and more

This release brings Qwen3-VL and Qwen3-Next support, improved accuracy, and greater flexibility

October 7, 2025
Dipika Sikka, Kyle Sayers, Brian Dellabetta
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    The LLM Compressor 0.8.0 release introduces significant enhancements to quantization workflows, extended support for Qwen3 models, and improved accuracy recovery. This release features five notable additions that we'll explore in detail.

    1. Multiple modifiers during oneshot

    LLM Compressor now supports applying multiple modifiers within a single oneshot compression run. This lets practitioners target different modifiers at specific submodules (for example, combining AWQ and GPTQ in one recipe) while running calibration over the dataset only once. The feature enables richer support for non-uniform quantization, giving users the flexibility to account for varying sensitivity across layers and more options for post-training quantization (PTQ) experimentation.

    Example: Non-uniform quantization with multiple modifiers.

    from transformers import AutoModelForCausalLM
    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQMapping, AWQModifier
    from llmcompressor.modifiers.quantization import GPTQModifier

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

    # Calibration settings (values follow the defaults used in LLM Compressor examples).
    NUM_CALIBRATION_SAMPLES = 512
    MAX_SEQUENCE_LENGTH = 2048

    # Configure the quantization algorithms to run:
    #   * quantize self_attn layers to W8A8 with GPTQ
    #   * quantize mlp layers to W4A16 with AWQ
    #     (only include AWQ mappings pertaining to the targeted mlp layers)
    recipe = [
        GPTQModifier(
            targets=r"re:.*self_attn\.(k_proj|q_proj|v_proj|o_proj)$",
            scheme="W8A8",
        ),
        AWQModifier(
            targets=r"re:.*mlp\.(down_proj|gate_proj|up_proj)$",
            mappings=[AWQMapping("re:.*up_proj", ["re:.*down_proj"])],
            scheme="W4A16",
        ),
    ]

    oneshot(
        model=model,
        dataset="HuggingFaceH4/ultrachat_200k",
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
        pipeline="sequential",
    )

    With the example above, users can apply multiple quantization schemes using both AWQ and GPTQ, producing a mixed-precision model that is directly runnable in vLLM. The resulting quantization_config contains one config group per scheme:

     "quantization_config": {
        "config_groups": {
            "group_0": {
                "format": "int-quantized",
                "input_activations": {
                    "actorder": null,
                    "block_structure": null,
                    "dynamic": true,
                    "group_size": null,
                    "num_bits": 8,
                    "observer": null,
                    "observer_kwargs": {},
                    "strategy": "token",
                    "symmetric": true,
                    "type": "int"
                },
                "output_activations": null,
                "targets": [
                    "re:.*self_attn\\.(k],
                "weights": {
                    "actorder": null,
                    "block_structure": null,
                    "dynamic": false,
                    "group_size": null,
                    "num_bits": 8,
                    "observer": "minmax",
                    "observer_kwargs": {},
                    "strategy": "channel",
                    "symmetric": true,
                    "type": "int"
                }
             },
            "group_1": {
                "format": "pack-quantized",
                "input_activations": null,
                "output_activations": null,
                "targets": [
                    "re:.*mlp\\.(down],
                "weights": {
                    "actorder": null,
                    "block_structure": null,
                    "dynamic": false,
                    "group_size": 128,
                    "num_bits": 4,
                    "observer": "minmax",
                    "observer_kwargs": {},
                    "strategy": "group",
                    "symmetric": true,
                    "type": "int"
                }
            }
        },
        "format": "mixed-precision",
    }

    For further details on non-uniform quantization support, see the examples in LLM Compressor.
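
    To sanity-check the mixed-precision checkpoint locally, it can be saved and then loaded directly in vLLM. The following is a minimal sketch, not taken from the release notes: the save directory and prompt are placeholders, and the save_compressed flag and tokenizer handling follow the conventions used in other LLM Compressor examples.

    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    # Save the compressed model and tokenizer produced by the oneshot run above.
    SAVE_DIR = "Meta-Llama-3-8B-Instruct-W8A8-W4A16-mixed"  # placeholder path
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.save_pretrained(SAVE_DIR)

    # Load the checkpoint with vLLM and generate from a sample prompt.
    llm = LLM(model=SAVE_DIR)
    outputs = llm.generate(["What is non-uniform quantization?"], SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)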

    2. Transforms update: Configurable transforms with variable rotation sizes

    Transform-based modifiers (SpinQuantModifier, QuIPModifier) now support a configurable transform_block_size, giving finer control over the Hadamard rotations applied to the model. The transform_block_size determines the size of each Hadamard block, removing the requirement for full-sized rotations. Practitioners can align Hadamard block sizes with quantization group sizes, improving both efficiency and accuracy, since smaller Hadamards incur less runtime cost.

    Example of using Hadamard size 128 with the QuIPModifier.

    from transformers import AutoModelForCausalLM
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.modifiers.transform import QuIPModifier
    # Select model and load it.
    # NOTE: because the datafree pipeline is being used in this
    # example, you can use additional GPUs to support larger models
    MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    # Configure the quantization algorithm to run.
    #   * apply quip transforms to model in order to make quantization easier
    #   * quantize the weights to 4 bit with a group size 128
    recipe = [
        QuIPModifier(
            rotations=["v", "u"],
            transform_block_size=128,
            transform_type="hadamard",
        ),
        QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    ]
    # Apply algorithms.
    oneshot(model=model, recipe=recipe, pipeline="datafree")

    The model produced by the above example will apply online rotations (that is, rotations during runtime). Previously, these online rotations were applied with a dense GEMM. As of the vLLM v0.11 prerelease, you can efficiently apply both full-sized and variable-sized non-random Hadamard rotations using the hadacore kernels. With this update, latency is essentially unchanged compared to the quantized model with no online rotations, as shown in the following table.

    Quantized model latency (in seconds). Latency metrics were computed using benchmarks/latency.py.

    Base W4A16    Hadacore    GEMM
    0.4402        0.4489      1.2917

    3. Transforms update: R4 support

    This release extends the SpinQuant-style transforms available through the SpinQuantModifier by adding R4 support. The R4 transform is applied to the down_proj layer, which can further improve accuracy recovery.

    Example recipe applying R4 transforms, along with the already supported R1 and R2.

    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.modifiers.transform import SpinQuantModifier
    recipe = [
        SpinQuantModifier(
            rotations=["R1", "R2", "R4"],
            transform_block_size=128,
            transform_type="hadamard",
        ),
        QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    ]
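
    The recipe can be applied the same way as the QuIP example in the previous section. Here is a minimal sketch, assuming a Llama 3.1 8B Instruct checkpoint and the data-free pipeline (the model ID is illustrative; any supported causal LM works):

    from transformers import AutoModelForCausalLM
    from llmcompressor import oneshot

    MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model choice
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    # Hadamard rotations are applied without calibration data,
    # so the data-free pipeline is used, mirroring the QuIP example above.
    oneshot(model=model, recipe=recipe, pipeline="datafree")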

    4. Quantization support for Qwen3 models

    LLM Compressor 0.8.0 has added support for Qwen3-Next and Qwen3-VL MoE models.

    For the Qwen3-Next model, the Qwen3NextSparseMoeBlock is temporarily updated to ensure that all experts see data during oneshot, allowing all quantization scales to be properly calibrated while preserving gated activations for accuracy. Further details can be found in the NVFP4 and FP8 examples.

    This release also adds FP8 quantization support for Qwen3 Vision-Language MoE models. LLM Compressor updates the model's Qwen3VLMoeTextSparseMoeBlock blocks with linearized MoE layers that can be quantized and are runnable in vLLM. See the example for further details. 

    You can see the FP8 block-quantized model on the RedHatAI Hub with full model evaluations on the OpenLLM V1 metrics, where the model achieves an average recovery score of over 99%. Support for calibration pathways requiring data will be added shortly for this model.
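
    For orientation, a simple data-free FP8 run on a Qwen3-Next checkpoint follows the same oneshot pattern as the other examples. The sketch below is illustrative only: the model ID and ignore patterns are assumptions, and the official NVFP4 and FP8 examples in the repository use calibration data so that every expert is exercised, so refer to them for the exact recipes.

    from transformers import AutoModelForCausalLM
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # placeholder model ID
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    # Dynamic FP8 for weights and activations needs no calibration data.
    # Ignoring lm_head and the MoE router gates is an assumption, not the official recipe.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head", r"re:.*mlp\.gate$"],
    )

    oneshot(model=model, recipe=recipe)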

    5. Improved accuracy for GPTQ W4A16 schemes

    The GPTQModifier now defaults to "weight" activation ordering for W4A16 quantization. Weight-based activation ordering has been shown to improve accuracy recovery by up to two points without introducing additional runtime cost. Benchmarks are available in vllm/pull/8135.

    Example model with default weight activation ordering in the "actorder" field.

    "weights": {
         "actorder": "weight",
         "block_structure": null,
         "dynamic": false,
         "group_size": 128,
         "num_bits": 4,
         "observer": "minmax",
         "observer_kwargs": {},
         "strategy": "group",
         "symmetric": true,
         "type": "int"
    }
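
    To override the default, the activation ordering can be set explicitly on the modifier. This is a minimal sketch, assuming the actorder argument is exposed on GPTQModifier (set it to another supported mode to opt out of the new default):

    from llmcompressor.modifiers.quantization import GPTQModifier

    # "weight" is the 0.8.0 default; it is written out explicitly here for clarity.
    recipe = GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=["lm_head"],
        actorder="weight",
    )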

    Conclusion

    The LLM Compressor 0.8.0 release brings substantial advancements to quantization, including support for multiple modifiers during oneshot, configurable transforms with variable rotation sizes, R4 support for SpinQuant-style transforms, and extended quantization support for Qwen3-Next and Qwen3-VL MoE models. These updates, along with improved accuracy for GPTQ W4A16 schemes, enhance the flexibility, efficiency, and accuracy of LLM compression workflows, paving the way for more optimized and performant models.

    Explore the latest models, recipes, and examples in the LLM Compressor repository, or experiment with quantization techniques to tailor performance to your needs.
