LLM Compressor v0.10 is here, and it brings a faster way to compress large language models (LLMs). This release introduces distributed quantization, better memory management, and advanced quantization formats that make it easier to compress massive models.
Key highlights of this release include:
- Distributed GPTQ: Multi-GPU support for up to 3.8x speedup on 4 GPUs and accuracy improvement via better Hessian numerics.
- Compressed-tensors offloading: Compress models that exceed your available memory capacity.
- GPTQ FP4 microscale support: Use both NVFP4 and MXFP4 quantization schemes.
- Accelerate to compressed-tensors migration: Improve performance and distributed capabilities.
Distributed GPTQ: Parallelize compression across multiple GPUs
A major feature in LLM Compressor 0.10 is the introduction of fully distributed GPTQ functionality. This experimental feature lets you use multiple GPUs for quantization. It significantly cuts down compression time without hurting accuracy.
How distributed GPTQ works
The distributed implementation follows a four-step process to efficiently parallelize GPTQ compression across multiple GPU ranks:
- Module assignment: Each neural network module is assigned to a specific rank for compression. Load balancing algorithms ensure even workload distribution across ranks. This prevents one process from handling more work than another.
- Hessian reduction: Each rank asynchronously transmits its partial Hessian information to the rank assigned to that module, so that each rank accumulates everything needed to compress its assigned modules.
- Compression: Each rank independently compresses its assigned modules using the accumulated Hessian information. This parallelization is where the primary speedup occurs.
- Broadcasting: Once compression completes, the quantized weight values are broadcast to all ranks, ensuring model consistency across the distributed system.
This architecture uses asynchronous operations for inter-rank data transfer, which improves performance compared to synchronous approaches.
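As an illustration of the module-assignment step, a greedy longest-processing-time balancer keeps per-rank workloads even. This is only a sketch under the assumption that per-module compression costs can be estimated; it is not LLM Compressor's actual assignment code:

```python
import heapq

def assign_modules(module_costs: dict[str, float], world_size: int) -> dict[str, int]:
    """Greedy LPT assignment: give each module, largest first, to the
    currently least-loaded rank so no rank ends up with outsized work."""
    heap = [(0.0, rank) for rank in range(world_size)]  # (current load, rank)
    heapq.heapify(heap)
    assignment: dict[str, int] = {}
    for name, cost in sorted(module_costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignment[name] = rank
        heapq.heappush(heap, (load + cost, rank))
    return assignment

# Hypothetical per-module cost estimates (e.g., proportional to weight count)
costs = {"q_proj": 4.0, "k_proj": 1.0, "v_proj": 1.0, "o_proj": 2.0}
assignment = assign_modules(costs, world_size=2)
```

Here both ranks end up with a total cost of 4.0, so neither process sits idle while the other compresses.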
Performance improvements
The distributed implementation delivers substantial speedups across various model sizes. Benchmarking on Qwen3-30B-A3B demonstrates clear scaling characteristics:
- Single GPU: 3.9 hours
- 2 GPUs (distributed): 1.95 hours (~2x speedup)
- 4 GPUs (distributed): 1 hour (~3.8x speedup)
Similar scaling patterns were observed on smaller models like Meta-Llama-3-8B-Instruct, confirming the architecture's effectiveness across different model sizes. Non-parallelized operations currently prevent fully linear speedups; parallel implementations of these operations are planned for the next LLM Compressor release.
Accuracy improvements through better numerics
In addition to distributed scaling, the expanded GPTQ implementation received a numerical optimization that yielded a +4% accuracy improvement on GSM8K benchmarks for Meta-Llama-3-8B-Instruct without any configuration changes.
The improvement stems from fixing floating-point error accumulation in the Hessian calculation. The original implementation maintained a running average of the Hessian across all samples seen, which meant rescaling the accumulator by the sample count on every batch. Repeating that multiplication and division accumulated significantly more floating-point error than keeping a running sum and dividing once at the end.
# Before: Iterative multiplication and division caused floating-point error accumulation
H_avg = H_avg*num_samples/(new_samples+num_samples) + H_new /(new_samples+num_samples) # Called repeatedly
# After: Iterative Sum, final divide
# Accumulate:
H_sum += H_new
# Single division at the end:
H_avg = H_sum / num_samples

This change is both subtle and significant. It's subtle because the percent difference for any element is usually below ~10⁻³%. However, the resulting accuracy boost was so large that we initially suspected a bug in our distributed GPTQ implementation. This simple fix improved GSM8K evaluation scores from (0.67, 0.66) to (0.71, 0.71), a meaningful accuracy boost that benefits all GPTQ users, whether they're using distributed compression or not.
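The effect is easy to reproduce outside of GPTQ. The following standalone sketch (synthetic float32 data, not real Hessians) compares both accumulation strategies against a float64 reference:

```python
import numpy as np

rng = np.random.default_rng(0)
batches = rng.standard_normal((2000, 16, 16)).astype(np.float32)

# Old behavior: rescale the running average on every batch
H_avg = np.zeros((16, 16), dtype=np.float32)
num_samples = 0
for H_new in batches:
    H_avg = H_avg * num_samples / (num_samples + 1) + H_new / (num_samples + 1)
    num_samples += 1

# New behavior: accumulate a sum, divide once at the end
H_sum = np.zeros((16, 16), dtype=np.float32)
for H_new in batches:
    H_sum += H_new
H_fixed = H_sum / num_samples

# Float64 reference average
H_ref = batches.astype(np.float64).mean(axis=0)

err_running = np.abs(H_avg - H_ref).max()
err_fixed = np.abs(H_fixed - H_ref).max()
print(f"running-average error: {err_running:.2e}, sum-then-divide error: {err_fixed:.2e}")
```

Both errors are tiny in absolute terms, which is exactly why the accuracy impact was surprising: across thousands of calibration samples and millions of Hessian entries, the small per-element differences compound.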
Set up distributed compression
Implementing distributed GPTQ requires a few key steps to properly initialize the distributed environment and manage model loading across ranks.
Step 1: Initialize the distributed context
Begin by initializing the distributed environment before loading your model:
from compressed_tensors.offload import init_dist
# Initialize the distributed context
init_dist()

Step 2: Load model with offloading support
Replace standard model loading with the offload context manager to load models across ranks efficiently:
from transformers import AutoModelForCausalLM
from compressed_tensors.offload import load_offloaded_model
with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        dtype="auto",
        device_map="auto_offload"
    )
The device_map="auto_offload" setting automatically manages memory across devices, falling back to CPU and disk as needed. This prevents inefficient weight replication across ranks in distributed setups. More details on this below.

Step 3: Partition calibration data
To maximize efficiency, distribute the calibration dataset across GPU ranks to avoid redundant data loading. The get_rank_partition utility automatically detects the current rank and splits the dataset accordingly:
from datasets import load_dataset
from llmcompressor.datasets.utils import get_rank_partition
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
# Partition dataset across distributed ranks
# Each rank gets a unique subset of the calibration data
ds = load_dataset(
    DATASET_ID,
    split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
)
The distributed infrastructure also detects when all ranks share identical datasets and avoids unnecessary partitioning in those cases. Additionally, device assignment has been updated so that each process uses the GPU corresponding to its rank, preventing conflicts in multi-GPU setups.

Step 4: Run compression with torchrun
Execute your compression script using PyTorch's distributed launcher:
torchrun --nproc_per_node=4 compress_llama.py
The compression workload automatically distributes across GPUs, maintaining full accuracy while reducing wall-clock time. For complete implementation examples, refer to llama_ddp_example.py in the LLM Compressor repository.

Custom compressed-tensors offloading
LLM Compressor 0.10 replaces the Hugging Face accelerate library with a custom compressed-tensors offloading system. This system lets you compress models that are larger than your available system memory. The new architecture addresses critical limitations in distributed support, memory efficiency, and model compatibility.
Why the migration?
Accelerate's offloading had several limitations:
- Incompatible with models that directly access module.weight attributes (common in compression code).
- Limited distributed support and unsupported disk offloading configurations.
- Inefficient memory management—loads entire modules when only specific parameters are needed.
- Fragile handling of transform modules in non-standard layer configurations.
The new system provides fine-grained parameter-level offloading, native distributed coordination with shared memory support, and lazy loading to prevent out-of-memory errors.
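The lazy-loading idea can be pictured with a minimal stand-in class: a parameter holds only its file path until first access, so nothing is resident in memory before it is actually needed. This is an illustrative sketch, not the compressed-tensors implementation:

```python
import os
import tempfile
import numpy as np

class LazyParam:
    """Hold only a file path until the tensor is first accessed,
    then read it from disk and cache it."""
    def __init__(self, path: str):
        self.path = path
        self._value = None

    @property
    def value(self) -> np.ndarray:
        if self._value is None:          # first access triggers the disk read
            self._value = np.load(self.path)
        return self._value

# Demo: save a weight to disk, then load it lazily
arr = np.arange(6, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "w.npy")
np.save(path, arr)

p = LazyParam(path)
# No disk I/O has happened yet; it occurs on the first .value access
loaded = p.value
```

Parameter-level granularity means each weight can be fetched (and released) independently, rather than pulling in a whole module's worth of tensors at once.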
How to use model offloading in LLM Compressor
Model offloading allows users to compress very large models without requiring the entire model to be loaded into GPU memory. To use LLM Compressor’s custom model offloading for distributed workloads or disk offloading, simply wrap your existing load function with the load_offloaded_model context.
with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu")

For more information on how and when to load models using this context, see the Model Loading Guide.
Key recommendation: Use device_map="auto" for the basic pipeline and device_map="auto_offload" for the sequential pipeline. For more information, see the LLM Compressor documentation.
How does offloading work with LLM Compressor?
LLM Compressor supports three types of offloads: Device offloads, CPU offloads, and disk offloads. For more information about how offload caches integrate with torch modules, see the Compressed Tensors Offloading documentation.
Device offloads are used when a model (or part of a model) can fit in GPU memory. This offload type is typically only used when dispatching a model for sample generation, or when using the "basic" pipeline. This offload type is implemented using the newly added DeviceCache and DistributedDeviceCache classes.
CPU offloads are used when a model cannot fit into GPU memory and must instead be stored on the CPU. In distributed workflows, separate processes must be able to reference shared CPU memory. This is achieved using shared memory allocated by the operating system, typically under the /dev/shm directory. Once shared memory has been allocated, the pointer to that shared CPU memory (represented by a file handle) is broadcast across ranks and used to construct tensor objects that reference the same backing shared memory. For more information, see DistributedCPUCache.
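Python's standard library can illustrate the shared-CPU-memory mechanism: one process allocates an OS-backed segment, and another attaches to it by name, so both reference the same bytes. For simplicity this sketch attaches within a single process; the real system broadcasts the handle across ranks:

```python
import numpy as np
from multiprocessing import shared_memory

# "Rank 0": allocate OS shared memory (backed by /dev/shm on Linux)
shm = shared_memory.SharedMemory(create=True, size=4 * np.float32().nbytes)
weights = np.ndarray((4,), dtype=np.float32, buffer=shm.buf)
weights[:] = [1.0, 2.0, 3.0, 4.0]

# "Another rank": attach by name; in the real system the name/handle is
# broadcast across ranks, and both views reference the same backing memory
peer = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray((4,), dtype=np.float32, buffer=peer.buf)
view[0] = 42.0

seen_by_rank0 = float(weights[0])  # the peer's write is visible immediately

# Release the numpy views before closing the shared memory segment
del weights, view
peer.close()
shm.close()
shm.unlink()
```

Because both views alias the same physical memory, no copies or explicit synchronization messages are needed for one process to see another's updates.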
Disk offloads are used when a model (or part of a model) cannot fit into either GPU or CPU memory. In this case, the system creates a safetensors file to store the parameter data. To avoid unnecessary data movement, compressed-tensors uses a copy-on-write (CoW) policy for updating model weights. When a model is first loaded, a symlink is created that points to the original model checkpoint files. When a model weight is updated, that symlink is destroyed and replaced with a real safetensors file containing the updated data. This technique allows distributed processes to asynchronously update weight data without costly sync operations that can significantly degrade throughput. Symlink handling and distributed offloading are implemented by the DiskCache and DistributedDiskCache classes.
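The copy-on-write policy can be sketched with plain files standing in for safetensors shards: loading creates only a symlink, and the first write replaces it with a real file while leaving the original checkpoint untouched. Note this illustrative sketch requires an OS with symlink support:

```python
import os
import tempfile

def cow_link(checkpoint_file: str, offload_dir: str) -> str:
    """First load: create a symlink to the original checkpoint (no data copied)."""
    dst = os.path.join(offload_dir, os.path.basename(checkpoint_file))
    os.symlink(os.path.abspath(checkpoint_file), dst)
    return dst

def cow_write(path: str, data: bytes) -> None:
    """On update: replace the symlink with a real file; the original is untouched."""
    if os.path.islink(path):
        os.unlink(path)  # drop the link only; the checkpoint it pointed to survives
    with open(path, "wb") as f:
        f.write(data)

# Demo with plain files standing in for safetensors shards
src_dir, dst_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
original = os.path.join(src_dir, "shard-00001.bin")
with open(original, "wb") as f:
    f.write(b"original weights")

link = cow_link(original, dst_dir)
before_update = os.path.islink(link)      # still just a pointer, no copy made

cow_write(link, b"updated weights")
after_update = os.path.islink(link)       # now a real file with new data
with open(original, "rb") as f:
    original_bytes = f.read()             # unchanged
with open(link, "rb") as f:
    updated_bytes = f.read()
```

Because each rank only ever replaces its own symlinks, updates need no cross-process locking, which is what keeps the asynchronous writes cheap.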
Disk offloading for very large models
For models that exceed available CPU memory, disk offloading streams weights from disk on-demand during compression. This is particularly valuable for compressing 70B+ parameter models on consumer hardware.
Example using disk offloading:
from transformers import AutoModelForCausalLM
from compressed_tensors.offload import load_offloaded_model
from llmcompressor import oneshot
# Load with disk offloading
with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf",
        dtype="auto",
        device_map="auto_offload",
        offload_folder="./offload_cache"
    )

# Compression proceeds normally
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512
)
# Save compressed model
model.save_pretrained("./llama-2-70b-compressed", save_compressed=True)

How disk offloading works:
- Weight files are stored in safetensors format in the specified offload_folder.
- Tensors are lazily loaded from disk only when needed for computation.
- The system maintains a global index mapping tensors to their disk locations.
- In distributed setups, ranks broadcast file paths in order to coordinate and avoid redundant disk writes.
- Automatic conversion to accelerate format during save ensures compatibility with the broader Hugging Face ecosystem.
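The global index can be pictured as a simple name-to-file mapping, similar in spirit to Hugging Face's model.safetensors.index.json. This is an illustrative sketch, not the actual compressed-tensors data structure:

```python
import json

class TensorIndex:
    """Map each tensor name to the file that stores it (illustrative sketch)."""
    def __init__(self):
        self.weight_map: dict[str, str] = {}

    def register(self, name: str, filename: str) -> None:
        self.weight_map[name] = filename

    def lookup(self, name: str) -> str:
        return self.weight_map[name]

    def to_json(self) -> str:
        # Same shape as a Hugging Face sharded-checkpoint index file
        return json.dumps({"weight_map": self.weight_map}, indent=2)

index = TensorIndex()
index.register("model.layers.0.self_attn.q_proj.weight", "shard-00001.safetensors")
location = index.lookup("model.layers.0.self_attn.q_proj.weight")
```

Keeping the index global is what lets distributed ranks agree on where a tensor lives without each rank writing its own redundant copy.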
GPTQ FP4 microscale support
Building on the FP4 quantization capabilities introduced in previous releases, LLM Compressor 0.10 adds GPTQ support for NVFP4 and MXFP4 microscale quantization schemes, potentially leading to improved model accuracy recovery.
The 0.10 release also includes specific accuracy improvements to MXFP4 weight scale generation, resulting in better model quality after quantization.
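To make the microscale idea concrete, the sketch below quantizes one block of weights to FP4 (E2M1) values with a single shared power-of-two scale, as in the MX format. It is a simplified illustration (round-to-nearest on the FP4 grid), not LLM Compressor's scale-generation code:

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mx_quantize_block(block: np.ndarray) -> tuple[int, np.ndarray]:
    """Quantize one block: a shared power-of-two scale plus FP4 elements."""
    amax = np.abs(block).max()
    # Choose the scale exponent so the largest magnitude lands near FP4's max (6.0)
    exp = 0 if amax == 0 else int(np.floor(np.log2(amax))) - 2
    scaled = block / 2.0 ** exp
    # Round each scaled element to the nearest representable FP4 value (sign kept)
    candidates = np.sign(scaled)[:, None] * FP4_GRID
    idx = np.abs(scaled[:, None] - candidates).argmin(axis=1)
    q = candidates[np.arange(len(scaled)), idx]
    return exp, q

def mx_dequantize(exp: int, q: np.ndarray) -> np.ndarray:
    return q * 2.0 ** exp

# Values already on the FP4 grid round-trip exactly
exp, q = mx_quantize_block(np.array([1.5, -3.0, 6.0, 0.0]))
restored = mx_dequantize(exp, q)
```

Because the shared scale is a power of two, dequantization is just an exponent shift, which is what makes microscale formats cheap to decode on hardware.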
For MXFP4 weight-only quantization, you can use the simplified scheme-based approach:
from llmcompressor import oneshot
from llmcompressor.modifiers.gptq import GPTQModifier
# Configure MXFP4 quantization using built-in scheme
recipe = GPTQModifier(
    targets="Linear",
    scheme="MXFP4A16",
    ignore=["lm_head"]
)
# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
# Save compressed model
model.save_pretrained("./llama-3-8b-mxfp4-gptq", save_compressed=True)

Conclusion
LLM Compressor 0.10 represents a significant step forward in making model compression more efficient, scalable, and accessible. The distributed GPTQ capabilities enable faster iteration cycles, while enhanced offloading support opens up compression workflows for larger models on more modest hardware.
Try these new capabilities in your own compression workflows. The combination of distributed compression, intelligent offloading, and advanced quantization formats makes this release particularly valuable for teams working with increasingly large language models.
Explore more resources: