
LLM Compressor v0.10: Faster compression with distributed GPTQ

March 18, 2026
Kyle Sayers, Charles Hernandez, Dipika Sikka
Related topics: Artificial intelligence
Related products: Red Hat AI Inference Server, Red Hat AI

    LLM Compressor v0.10 is here, and it brings a faster way to compress large language models (LLMs). This release introduces distributed quantization, better memory management, and advanced quantization formats that make it easier to compress massive models.

    Key highlights of this release include:

    • Distributed GPTQ: Multi-GPU support delivering up to a 3.8x speedup on 4 GPUs, plus improved accuracy via better Hessian numerics.
    • Compressed-tensors offloading: Compress models that exceed your available memory capacity.
    • GPTQ FP4 microscale support: Use both NVFP4 and MXFP4 quantization schemes.
    • Accelerate to compressed-tensors migration: Improve performance and distributed capabilities.

    Distributed GPTQ: Parallelize compression across multiple GPUs

    A major feature in LLM Compressor 0.10 is the introduction of fully distributed GPTQ functionality. This experimental feature lets you use multiple GPUs for quantization. It significantly cuts down compression time without hurting accuracy.

    How distributed GPTQ works

    The distributed implementation follows a four-step process to efficiently parallelize GPTQ compression across multiple GPU ranks:

    1. Module assignment: Each neural network module is assigned to a specific rank for compression. Load balancing algorithms ensure even workload distribution across ranks. This prevents one process from handling more work than another.
    2. Hessian reduction: Each rank asynchronously transmits its partial Hessian information to the rank assigned to that module, so that each rank accumulates everything needed to compress its assigned modules.
    3. Compression: Each rank independently compresses its assigned modules using the accumulated Hessian information. This parallelization is where the primary speedup occurs.
    4. Broadcasting: Once compression completes, the quantized weight values are broadcast to all ranks, ensuring model consistency across the distributed system.
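The load balancing in step 1 can be sketched as a greedy assignment: place the most expensive modules first, always onto the currently least-loaded rank. This is an illustrative toy, not LLM Compressor's actual scheduler; the module names and cost model are hypothetical.

```python
import heapq

def assign_modules(module_costs: dict, num_ranks: int) -> dict:
    """Greedily assign modules to ranks so per-rank total cost stays balanced."""
    # Min-heap of (accumulated_cost, rank): pop always yields the lightest rank
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = {}
    # Place the most expensive modules first (standard greedy heuristic)
    for name, cost in sorted(module_costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignment[name] = rank
        heapq.heappush(heap, (load + cost, rank))
    return assignment

costs = {"layers.0.q_proj": 4, "layers.0.k_proj": 1,
         "layers.1.q_proj": 4, "layers.1.k_proj": 1}
print(assign_modules(costs, 2))
```

With these toy costs, each rank ends up with a total cost of 5, so neither process handles more work than the other.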

    This architecture uses asynchronous operations for inter-rank data transfer, which improves performance compared to synchronous approaches.

    Performance improvements

    The distributed implementation delivers substantial speedups across various model sizes. Benchmarking on Qwen3-30B-A3B demonstrates clear scaling characteristics:

    • Single GPU: 3.9 hours
    • 2 GPUs (distributed): 1.95 hours (~2x speedup)
    • 4 GPUs (distributed): 1 hour (~3.8x speedup)

    Similar scaling patterns were observed on smaller models such as Meta-Llama-3-8B-Instruct, confirming the architecture's effectiveness across model sizes. A few non-parallelized operations currently prevent fully linear speedups; parallel implementations of these operations are planned for the next LLM Compressor release.
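As a back-of-envelope estimate (my own calculation, not from the release notes), fitting Amdahl's law to the measured 3.8x speedup on 4 GPUs suggests how small the non-parallelized fraction already is:

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Amdahl's law: speedup on n workers given a serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def serial_fraction(speedup: float, n: int) -> float:
    """Invert Amdahl's law: solve speedup = 1/(s + (1-s)/n) for s."""
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)

s = serial_fraction(3.8, 4)
print(f"estimated serial fraction: {s:.1%}")  # about 1.8%
# Projection only, assuming the Amdahl model holds at larger scale:
print(f"Amdahl-projected speedup on 8 GPUs: {amdahl_speedup(s, 8):.1f}x")
```

The roughly 2% serial fraction implied by the benchmark is what the planned parallel implementations would eliminate.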

    Accuracy improvements through better numerics

    In addition to distributed scaling, the expanded GPTQ implementation received a numerical optimization that yielded a +4% accuracy improvement on GSM8K benchmarks for Meta-Llama-3-8B-Instruct without any configuration changes.

    The improvement stems from fixing floating-point error accumulation in the Hessian calculation. The original implementation maintained a running average of the Hessian, rescaling it by the total sample count on every update. Compared to accumulating a running sum and dividing once at the end, this repeated multiplication and division accumulated a significant amount of floating-point error.

    # Before: rescaling the running average on every batch accumulated floating-point error
    H_avg = H_avg * num_samples / (num_samples + new_samples) + H_new / (num_samples + new_samples)  # called repeatedly
    # After: accumulate a running sum...
    H_sum += H_new
    # ...and divide once at the end
    H_avg = H_sum / num_samples

    This change is both subtle and significant. It's subtle because the percent difference for any element is usually below ~10⁻³%. However, the resulting accuracy boost was so large that we initially suspected a bug in our distributed GPTQ implementation. This simple fix improved GSM8K evaluation scores from (0.67, 0.66) to (0.71, 0.71)—a meaningful accuracy boost that benefits all GPTQ users, whether they're using distributed compression or not.
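The two update rules can be compared directly in a toy reproduction (this is illustrative, not the actual Hessian code; the function names and the float32-rounding helper are mine). In exact arithmetic the formulas agree; under simulated float32 rounding they diverge:

```python
import random
import struct

def f32(x: float) -> float:
    """Round a Python float to float32 precision (simulates fp32 accumulation)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def running_mean(batches, rnd=lambda x: x):
    """Before: rescale the running average on every batch."""
    avg, seen = None, 0
    for batch in batches:
        new = len(batch)
        bs = batch[0]
        for v in batch[1:]:
            bs = rnd(bs + v)
        if seen == 0:
            avg = rnd(bs / new)
        else:
            avg = rnd(rnd(avg * seen / (seen + new)) + rnd(bs / (seen + new)))
        seen += new
    return avg

def sum_then_divide(batches, rnd=lambda x: x):
    """After: accumulate a sum, divide once at the end."""
    total, count = 0, 0
    for batch in batches:
        for v in batch:
            total = rnd(total + v)
        count += len(batch)
    return rnd(total / count)

random.seed(0)
batches = [[random.uniform(0.0, 1.0) for _ in range(8)] for _ in range(2000)]
exact = sum(v for b in batches for v in b) / (8 * 2000)
print(f"running-average result: {running_mean(batches, f32):.10f}")
print(f"sum-then-divide result: {sum_then_divide(batches, f32):.10f}")
print(f"float64 reference:      {exact:.10f}")
```

With exact inputs (e.g., `fractions.Fraction` values and the identity rounding function) both functions return the same mean, confirming the change is purely numerical. The magnitude of the discrepancy in GPTQ itself depends on the scale of the Hessian entries.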

    Set up distributed compression

    Implementing distributed GPTQ requires a few key steps to properly initialize the distributed environment and manage model loading across ranks.

    Step 1: Initialize the distributed context

    Begin by initializing the distributed environment before loading your model:

    from compressed_tensors.offload import init_dist
    # Initialize distributed training context
    init_dist()

    Step 2: Load model with offloading support

    Replace standard model loading with the offload context manager to load models across ranks efficiently:

    from transformers import AutoModelForCausalLM
    from compressed_tensors.offload import load_offloaded_model
    with load_offloaded_model():
        model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct",
            dtype="auto",
            device_map="auto_offload"
        )

    The device_map="auto_offload" setting automatically manages memory across devices, falling back to CPU and disk as needed. This prevents inefficient weight replication across ranks in distributed setups. More details on this below.

    Step 3: Partition calibration data

    To maximize efficiency, distribute the calibration dataset across GPU ranks to avoid redundant data loading. The get_rank_partition utility automatically detects the current rank and splits the dataset accordingly:

    from datasets import load_dataset
    from llmcompressor.datasets.utils import get_rank_partition
    DATASET_ID = "HuggingFaceH4/ultrachat_200k"
    DATASET_SPLIT = "train_sft"
    NUM_CALIBRATION_SAMPLES = 512
    # Partition dataset across distributed ranks
    # Each rank gets a unique subset of the calibration data
    ds = load_dataset(
        DATASET_ID,
        split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
    )

    The distributed infrastructure also includes intelligent detection of when all ranks share identical datasets, optimizing data handling by avoiding unnecessary partitioning in those cases. Additionally, device assignment has been updated to ensure each process correctly utilizes its dedicated GPU based on its rank, preventing conflicts in multi-GPU setups.

    Step 4: Run compression with torchrun

    Execute your compression script using PyTorch's distributed launcher:

    torchrun --nproc_per_node=4 compress_llama.py

    The compression workload automatically distributes across GPUs, maintaining full accuracy while reducing wall-clock time. For complete implementation examples, refer to llama_ddp_example.py in the LLM Compressor repository.

    Custom compressed-tensors offloading

    LLM Compressor 0.10 replaces the Hugging Face accelerate library with a custom compressed-tensors offloading system. This system lets you compress models that are larger than your available system memory. The new architecture addresses critical limitations in distributed support, memory efficiency, and model compatibility.

    Why the migration?

    Accelerate's offloading had several limitations:

    • Incompatible with models that directly access module.weight attributes (common in compression code).
    • Limited distributed support and unsupported disk offloading configurations.
    • Inefficient memory management—loads entire modules when only specific parameters are needed.
    • Fragile handling of transform modules in non-standard layer configurations.

    The new system provides fine-grained parameter-level offloading, native distributed coordination with shared memory support, and lazy loading to prevent out-of-memory errors.
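The lazy, parameter-level idea can be sketched in a few lines (hypothetical class and method names; the real mechanism lives in compressed_tensors.offload and operates on torch parameters):

```python
class LazyParamCache:
    """Materialize individual parameters on first access, not whole modules."""

    def __init__(self, loaders):
        # loaders: parameter name -> zero-arg callable that reads it from disk
        self._loaders = loaders
        self._live = {}

    def get(self, name):
        if name not in self._live:            # load on first access only
            self._live[name] = self._loaders[name]()
        return self._live[name]

    def evict(self, name):
        self._live.pop(name, None)            # free memory once compressed

loads = []
cache = LazyParamCache({
    "q_proj.weight": lambda: loads.append("q") or [[1.0, 2.0]],
})
cache.get("q_proj.weight")
cache.get("q_proj.weight")
print(loads)  # loaded exactly once: ['q']
```

Because only the parameters a compression step actually touches are resident, peak memory tracks the working set rather than the full module, which is what prevents the out-of-memory errors described above.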

    How to use model offloading in LLM Compressor

    Model offloading allows users to compress very large models without requiring the entire model to be loaded into GPU memory. To use LLM Compressor’s custom model offloading for distributed workloads or disk offloading, simply wrap your existing load function with the load_offloaded_model context.

    with load_offloaded_model():
        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu")

    For more information on how and when to load models using this context, see the Model Loading Guide.

    Key recommendation: Use device_map="auto" for the basic pipeline and device_map="auto_offload" for the sequential pipeline. For more information, see the LLM Compressor documentation.

    How does offloading work with LLM Compressor?

    LLM Compressor supports three types of offloads: Device offloads, CPU offloads, and disk offloads. For more information about how offload caches integrate with torch modules, see the Compressed Tensors Offloading documentation.

    Device offloads are used when a model (or part of a model) can fit in GPU memory. This offload type is typically only used when dispatching a model for sample generation, or when using the "basic" pipeline. This offload type is implemented using the newly added DeviceCache and DistributedDeviceCache classes.

    CPU offloads are used when a model cannot fit into GPU memory and must instead be stored on the CPU. In distributed workflows, separate processes must be able to reference the same CPU memory. This is achieved using shared memory allocated by the operating system, typically under the /dev/shm directory. Once shared memory has been allocated, a pointer to it (represented by a file handle) is broadcast across ranks and used to construct tensor objects that reference the same backing shared memory. For more information, see DistributedCPUCache.
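The handle-sharing mechanism can be demonstrated with the standard library (a toy single-process illustration; the real system shares torch tensor storage across ranks):

```python
from multiprocessing import shared_memory
import struct

# "Rank 0": allocate OS shared memory (backed by /dev/shm on Linux) and
# write four float32 weight values into it.
shm = shared_memory.SharedMemory(create=True, size=4 * 4)
struct.pack_into("4f", shm.buf, 0, 1.0, 2.0, 3.0, 4.0)

# In a real distributed run this handle would be broadcast to other ranks;
# here we simply reuse it within the same process.
handle = shm.name

# "Rank 1": attach to the same backing memory via the handle and read it.
peer = shared_memory.SharedMemory(name=handle)
values = struct.unpack_from("4f", peer.buf, 0)
print(values)  # (1.0, 2.0, 3.0, 4.0)

# Clean up: detach both views, then release the segment.
peer.close()
shm.close()
shm.unlink()
```

The key property is that no data is copied when a rank attaches: both views reference the same physical pages, which is what keeps CPU offloading cheap in multi-process runs.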

    Disk offloads are used when a model (or part of a model) cannot fit into either GPU or CPU memory. In this case, the system creates a safetensors file to store the parameter data. To avoid unnecessary data movement, compressed-tensors uses a copy-on-write (CoW) policy for updating model weights. When a model is first loaded, a symlink is created that points to the original model checkpoint files. When a model weight is updated, that symlink is destroyed and replaced with a real safetensors file containing the updated data. This technique allows distributed processes to asynchronously update weight data without costly sync operations that can significantly degrade throughput. Symlinks and distributed disk offloading are implemented by the DiskCache and DistributedDiskCache classes.
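The copy-on-write policy can be sketched with plain files (toy contents and a hypothetical helper name; the real implementation works on safetensors shards):

```python
import os
import tempfile

root = tempfile.mkdtemp()
checkpoint = os.path.join(root, "model-00001.safetensors")
with open(checkpoint, "w") as f:
    f.write("original weights")

# Load: create a symlink to the checkpoint shard instead of copying it
offloaded = os.path.join(root, "offload", "model-00001.safetensors")
os.makedirs(os.path.dirname(offloaded))
os.symlink(checkpoint, offloaded)

def write_updated(path: str, data: str) -> None:
    """Copy-on-write: drop the symlink, then write a real file in its place."""
    if os.path.islink(path):
        os.unlink(path)
    with open(path, "w") as f:
        f.write(data)

write_updated(offloaded, "quantized weights")
print(os.path.islink(offloaded))   # False: the symlink became a real file
print(open(offloaded).read())      # quantized weights
print(open(checkpoint).read())     # original weights (untouched)
```

Until the first write, no shard data is duplicated; after it, each process owns its updated file independently, which is why no cross-rank synchronization is needed.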

    Disk offloading for very large models

    For models that exceed available CPU memory, disk offloading streams weights from disk on-demand during compression. This is particularly valuable for compressing 70B+ parameter models on consumer hardware.

    Example using disk offloading:

    from transformers import AutoModelForCausalLM
    from compressed_tensors.offload import load_offloaded_model
    from llmcompressor import oneshot
    # Load with disk offloading
    with load_offloaded_model():
        model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-2-70b-hf",
            dtype="auto",
            device_map="auto_offload",
            offload_folder="./offload_cache"
        )
    # Compression proceeds normally
    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512
    )
    # Save compressed model
    model.save_pretrained("./llama-2-70b-compressed", save_compressed=True)

    How disk offloading works:

    • Weight files are stored in safetensors format in the specified offload_folder.
    • Tensors are lazily loaded from disk only when needed for computation.
    • The system maintains a global index mapping tensors to their disk locations.
    • In distributed setups, ranks broadcast file paths in order to coordinate and avoid redundant disk writes.
    • Automatic conversion to accelerate format during save ensures compatibility with the broader Hugging Face ecosystem.
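A toy version of the global index in the bullets above (illustrative structure only; the real index maps tensors into safetensors shards):

```python
import json
import os
import tempfile

folder = tempfile.mkdtemp()  # stand-in for the offload_folder

# Write two "shards" and record each tensor's location in a global index
index = {}
for name, data in {"layer.0.weight": [1.0, 2.0], "layer.1.weight": [3.0]}.items():
    path = os.path.join(folder, name + ".json")  # stand-in for a safetensors file
    with open(path, "w") as f:
        json.dump(data, f)
    index[name] = path

with open(os.path.join(folder, "index.json"), "w") as f:
    json.dump(index, f)

def load_tensor(name: str):
    """Resolve a tensor through the global index and read it from disk lazily."""
    with open(index[name]) as f:
        return json.load(f)

print(load_tensor("layer.1.weight"))  # [3.0]
```

In a distributed run, only the small index (file paths) needs to be broadcast between ranks; the tensor bytes themselves stay on disk until a rank actually requests them.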

    GPTQ FP4 microscale support

    Building on the FP4 quantization capabilities introduced in previous releases, LLM Compressor 0.10 adds GPTQ support for NVFP4 and MXFP4 microscale quantization schemes, potentially leading to improved model accuracy recovery.

    The 0.10 release includes specific accuracy improvements for MXFP4 weight scale generation, resulting in improved model quality after quantization. 
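To make the scheme concrete, here is a toy MXFP4-style quantizer (my own sketch based on the OCP Microscaling format description, not LLM Compressor's kernels): each block shares one power-of-two (E8M0) scale derived from the block maximum, and each element snaps to the FP4 (E2M1) grid. Real MXFP4 uses blocks of 32 values; the example uses 4 for brevity.

```python
import math

# Representable FP4 (E2M1) magnitudes; the maximum value is 6.0
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block to a shared power-of-two scale plus FP4 values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale: power of two chosen so amax lands near FP4's top bin
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    q = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)  # clamp to the FP4 range
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
        q.append(math.copysign(nearest, v))
    return scale, q

scale, q = quantize_block([0.1, -0.8, 2.5, 3.9])
print(scale, q)                      # shared scale and FP4 codes
print([scale * v for v in q])        # dequantized approximation
```

The per-block scale is what lets 4-bit elements track widely varying weight magnitudes; GPTQ then chooses the codes to minimize the layer-wise reconstruction error rather than rounding naively as this sketch does.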

    For MXFP4 weight-only quantization, you can use the simplified scheme-based approach:

    from llmcompressor import oneshot
    from llmcompressor.modifiers.gptq import GPTQModifier
    # Configure MXFP4 quantization using built-in scheme
    recipe = GPTQModifier(
        targets="Linear",
        scheme="MXFP4A16",
        ignore=["lm_head"]
    )
    # Apply quantization
    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )
    # Save compressed model
    model.save_pretrained("./llama-3-8b-mxfp4-gptq", save_compressed=True)

    Conclusion

    LLM Compressor 0.10 represents a significant step forward in making model compression more efficient, scalable, and accessible. The distributed GPTQ capabilities enable faster iteration cycles, while enhanced offloading support opens up compression workflows for larger models on more modest hardware.

    Try these new capabilities in your own compression workflows. The combination of distributed compression, intelligent offloading, and advanced quantization formats makes this release particularly valuable for teams working with increasingly large language models.

    Explore more resources:

    • LLM Compressor 0.9.0: Attention quantization, MXFP4 support, and more
    • Big model support documentation
    • Distributed oneshot compression guide
    • LLM Compressor GitHub repository
