LLM Compressor is here: Faster inference with vLLM

Announcing LLM Compressor

August 14, 2024
Robert Shaw, Mark Kurtz, Sara Adkins, Benjamin Fineran
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    LLM Compressor, a unified library for creating compressed models for faster inference with vLLM, is now available. Neural Magic's research team has used it to create our latest compressed models, including fully quantized and accurate versions of Llama 3.1. With its first 0.1 release, we are excited to open the toolkit to the community so you can compress your own models!

    Figure 1: The LLM Compressor architecture flow.

    In recent months, the high-performance computing team at Neural Magic has brought performant inference for various quantization schemes to vLLM, including custom Marlin kernels for weight-only quantization and custom CUTLASS kernels for INT8 and FP8 activation quantization.

    However, before today, creating quantized checkpoints required navigating a fragmented ecosystem of bespoke compression libraries such as AutoGPTQ, AutoAWQ, AutoFP8, etc. We built LLM Compressor from the ground up as a single library for applying the latest compression best practices, including GPTQ, SmoothQuant, SparseGPT, and RTN, with many more actively being added. It works natively with Hugging Face models for seamless ease of use in the open source ecosystem, and vLLM supports directly loading checkpoints from LLM Compressor for accelerated inference.

    Using LLM Compressor, you can create compressed, accurate versions of your models, including:

    • Activation and weight quantization for up to 3X faster server/throughput deployments. This includes FP8 models using RTN for NVIDIA's Ada Lovelace and Hopper GPUs, and INT8 models using SmoothQuant and GPTQ for NVIDIA's Turing and Ampere GPUs.
    • Weight quantization for up to 4X faster latency with INT4 weight-only models using GPTQ for NVIDIA's Ampere GPUs and newer.
    • Weight pruning for up to 1.5X faster general performance with 2:4, 50% sparse models utilizing SparseGPT for NVIDIA's Ampere GPUs and newer.

    Enabling activation quantization in vLLM

    LLM Compressor's flexibility enables a critical new feature in vLLM: activation quantization.

    The open source compression ecosystem has so far focused mainly on weight-only quantization, through libraries such as AutoGPTQ and AutoAWQ. Weight-only quantization enables smaller models and faster latency, but with 16-bit activations, the compute runs through the same 16-bit tensor cores as the unquantized model. This leads to slower inference for compute-heavy workloads due to the penalty of dequantizing the weights. Activation quantization, where the inputs to each layer are quantized as well as the weights, enables the matrix multiplies to run on the faster INT8 or FP8 tensor cores, doubling the performance for compute-bound inference.
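    To make the mechanics concrete, here is a minimal sketch (illustrative code, not LLM Compressor internals) of the W8A8 scheme discussed below: dynamic per-token symmetric INT8 quantization of activations combined with static per-channel INT8 weights, so the matrix multiply can run on INT8 tensor cores and be rescaled afterwards.

    import torch

    def quantize_activations_per_token(x: torch.Tensor):
        # Dynamic, symmetric INT8: one scale per token (row), computed at runtime.
        scales = x.abs().amax(dim=-1, keepdim=True) / 127.0
        return torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8), scales

    def quantize_weights_per_channel(w: torch.Tensor):
        # Static, symmetric INT8: one scale per output channel, computed ahead of time.
        scales = w.abs().amax(dim=-1, keepdim=True) / 127.0
        return torch.clamp(torch.round(w / scales), -128, 127).to(torch.int8), scales

    x = torch.randn(4, 512)     # 4 tokens, hidden size 512
    w = torch.randn(1024, 512)  # linear layer: 512 -> 1024
    x_q, x_s = quantize_activations_per_token(x)
    w_q, w_s = quantize_weights_per_channel(w)

    # Emulate the INT8 GEMM (INT32 accumulate), then rescale back to floating point.
    y = (x_q.float() @ w_q.float().t()) * (x_s * w_s.t())
    ref = x @ w.t()
    err = ((y - ref).abs().max() / ref.abs().max()).item()
    print(f"max relative error vs. FP32: {err:.3%}")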

    Weight-only quantization often fails to deliver speed improvements in production serving deployments. These environments typically result in compute-bound workloads that see minimal benefit from weight-only quantization. Activation quantization, however, offers a substantial performance boost in these high-compute scenarios while still delivering faster inference at lower queries per second (QPS).
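    A rough roofline-style back-of-envelope makes the distinction concrete. The peak numbers below are approximate A100-80GB specs and purely illustrative; the point is only that weight-only quantization helps while weight reads dominate (low batch) and stops helping once the math dominates (high batch), whereas W8A8 also doubles the usable tensor-core throughput.

    # Illustrative roofline estimate for one 8192x8192 linear layer on an A100-80GB.
    # Peak numbers are approximate; real kernels will not hit them exactly.
    PEAK_FP16_TFLOPS = 312   # FP16 tensor-core peak
    PEAK_INT8_TOPS = 624     # INT8 tensor-core peak (~2x FP16)
    HBM_TB_PER_S = 2.0       # memory bandwidth

    def layer_time_ms(tokens, bytes_per_weight, int8_cores, in_f=8192, out_f=8192):
        compute_s = (2 * tokens * in_f * out_f) / ((PEAK_INT8_TOPS if int8_cores else PEAK_FP16_TFLOPS) * 1e12)
        memory_s = (in_f * out_f * bytes_per_weight) / (HBM_TB_PER_S * 1e12)
        return 1e3 * max(compute_s, memory_s)  # whichever bound dominates

    for tokens in (1, 2048):  # low-QPS decode vs. heavily batched serving
        print(
            f"tokens={tokens}: "
            f"w16a16={layer_time_ms(tokens, 2.0, False):.3f} ms, "
            f"w4a16={layer_time_ms(tokens, 0.5, False):.3f} ms, "
            f"w8a8={layer_time_ms(tokens, 1.0, True):.3f} ms"
        )
    # tokens=1: memory-bound, so w4a16 is ~4x faster than w16a16.
    # tokens=2048: compute-bound, so w4a16 matches w16a16 while w8a8 is ~2x faster.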

    Figure 2 demonstrates a 1.6X speedup at 5 QPS for the INT8 weight and activation quantized model (w8a8) compared to the 16-bit baseline (w16a16), while the 4-bit weight quantized model (w4a16) shows little improvement.

    Figure 2: This chart demonstrates a 1.6X speedup at 5 QPS for the INT8 weight and activation quantized model (w8a8) compared to the 16-bit baseline (w16a16), while the 4-bit weight quantized model (w4a16) shows little improvement.

    Activation quantization performance in vLLM

    Let’s take Llama 3.1 70B running in vLLM on a 4xA100 GPU setup as an example to see whether this analysis holds up.

    We will compare the serving latency of three variants of Llama 3.1 70B:

    • Unquantized FP16 (w16a16): meta-llama/Meta-Llama-3.1-70B-Instruct
    • Weight and activation quantization to INT8 (w8a8): neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8
    • Weight-only quantization to INT4 (w4a16): neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16

    Figure 3 illustrates the average time to generate each new token (TPOT) across different server loads, measured in queries per second (QPS). Additionally, a deployment constraint of 5 seconds is set for the time to generate the first token (TTFT) to ensure the serving application maintains reasonable initial response times.

    Figure 3: The average time to generate each new token (TPOT) across different server loads, measured in queries per second (QPS). Full replication instructions for the benchmark are available in the appendix.
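    For reference, TTFT and TPOT can be computed per request roughly as follows (an illustrative formulation, not the exact logic in vLLM's benchmark_serving.py):

    def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
        # token_times are the wall-clock arrival times of each generated token.
        ttft = token_times[0] - request_start                               # time to first token
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)  # average gap between output tokens
        return ttft, tpot

    # Example: first token after 0.8 s, then one token every 50 ms for 150 tokens.
    times = [0.8 + 0.05 * i for i in range(150)]
    print(ttft_and_tpot(0.0, times))  # approximately (0.8, 0.05)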

    At low QPS, weight-only quantization offers improved latency relative to an unquantized model. However, as the server load increases and becomes compute-bound, the performance of the weight-only model levels off, matching the unquantized model. In contrast, the activation quantized model performs better under high load, supporting more queries per second before the system becomes overloaded and TTFT exceeds our limits for a responsive application.

    For a 70B model on an A100 system, we see that the W8A8 model achieves similar latency performance with just 2 GPUs compared to the unquantized model running with 4, meaning similar latency guarantees with half the resources!

    Figure 4: Llama 3.1 70B time per output token, comparing w16a16 on 4 GPUs and w8a8 on 2 GPUs. Full replication instructions for the benchmark are available in the appendix.
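    A quick, illustrative weight-memory calculation shows why halving the GPU count is plausible: the FP16 weights alone need more than one 80 GB card and leave little headroom on two, while the INT8 weights fit comfortably on two; in addition, the INT8 tensor cores on two GPUs offer roughly the same peak math throughput as FP16 on four.

    # Illustrative weight-memory arithmetic for Llama 3.1 70B (KV cache and activations excluded).
    params_b = 70           # billions of parameters
    gpu_memory_gb = 80      # A100-80GB
    for scheme, bytes_per_param in (("w16a16", 2.0), ("w8a8", 1.0), ("w4a16", 0.5)):
        weights_gb = params_b * bytes_per_param
        min_gpus = -(-int(weights_gb) // gpu_memory_gb)  # ceiling division
        print(f"{scheme}: ~{weights_gb:.0f} GB of weights -> at least {min_gpus} GPU(s) before KV cache")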

    Activation quantization accuracy

    vLLM’s CUTLASS kernels for activation quantization offer flexible support for various schemes, allowing for a high degree of customization, including any combination of:

    • Per-tensor or per-channel quantization for weights
    • Per-tensor or per-token quantization for activations
    • Symmetric or asymmetric quantized activations (for INT8); a short sketch of this distinction follows the list
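    The earlier sketch showed per-token and per-channel scales; the snippet below illustrates the remaining choice, symmetric versus asymmetric INT8, and why a zero-point matters for skewed activation distributions (illustrative code only, not the CUTLASS kernels themselves):

    import torch

    def symmetric_int8(x: torch.Tensor):
        # One scale, zero-point fixed at 0; the quantization grid is centered on zero.
        scale = x.abs().max() / 127.0
        return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8), scale

    def asymmetric_int8(x: torch.Tensor):
        # Scale plus zero-point, so a skewed range can use all 256 INT8 levels.
        lo, hi = x.min(), x.max()
        scale = (hi - lo) / 255.0
        zero_point = torch.round(-lo / scale) - 128
        q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127).to(torch.int8)
        return q, scale, zero_point

    x = torch.relu(torch.randn(1024, 1024))   # skewed, non-negative activations
    print(int(symmetric_int8(x)[0].min()), int(symmetric_int8(x)[0].max()))    # ~0..127: half the grid unused
    print(int(asymmetric_int8(x)[0].min()), int(asymmetric_int8(x)[0].max()))  # ~-128..127: full INT8 range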

    This flexibility in vLLM, combined with LLM Compressor's advanced algorithms such as GPTQ and SmoothQuant, ensures that model accuracy is maintained even after quantization. As the model card for neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 shows, there is a negligible drop with static per-channel weight scales and dynamic per-token activation scales compared to the FP16 baseline on the Open LLM Leaderboard (Table 1).

    Table 1: Open LLM Leaderboard evaluation scores.

    Benchmark                           | Meta-Llama-3.1-70B-Instruct | Meta-Llama-3.1-70B-Instruct-quantized.w8a8 (this model) | Recovery
    MMLU (5-shot)                       | 83.88                       | 83.65                                                   | 99.7%
    MMLU (CoT, 0-shot)                  | 85.74                       | 85.41                                                   | 99.6%
    ARC Challenge (0-shot)              | 93.26                       | 93.26                                                   | 100.0%
    GSM-8K (CoT, 8-shot, strict-match)  | 93.10                       | 93.25                                                   | 100.2%
    Hellaswag (10-shot)                 | 86.40                       | 86.28                                                   | 99.9%
    Winogrande (5-shot)                 | 85.00                       | 85.00                                                   | 100.0%
    TruthfulQA (0-shot, mc2)            | 59.83                       | 60.88                                                   | 101.8%
    Average                             | 83.89                       | 83.96                                                   | 100.2%

    This combination of fine-grained quantization and sophisticated algorithms enables users to achieve faster inference without compromising on the precision and reliability of their models.

    Try LLM Compressor

    The following snippet is a minimal example of quantizing meta-llama/Meta-Llama-3.1-8B-Instruct with INT8 weights and activations.

    Install LLM Compressor via PyPI

    LLM Compressor is available for installation via PyPI:

    pip install llmcompressor

    Apply quantization with LLM Compressor

    Quantization is applied by selecting an algorithm and calling the oneshot API, which applies the selections in a post-training setting.

    In this case, we apply SmoothQuant to make the activations easier to quantize and GPTQ to apply the weight and activation quantization. We apply these algorithms to all linear layers of the network using the built-in open_platypus dataset (note: see the examples for how to use your own calibration set).

    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
    from llmcompressor.transformers import oneshot
    # Select quantization algorithm. In this case, we:
    #   * apply SmoothQuant to make the activations easier to quantize
    #   * quantize the weights to int8 with GPTQ (static per channel)
    #   * quantize the activations to int8 (dynamic per token)
    recipe = [
        SmoothQuantModifier(smoothing_strength=0.8),
        GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
    ]
    # Apply quantization using the built in open_platypus dataset.
    oneshot(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        dataset="open_platypus",
        recipe=recipe,
        output_dir="Meta-Llama-3.1-8B-Instruct-INT8",
        max_seq_length=2048,
        num_calibration_samples=512,
    )

    Inference compressed models with vLLM

    The resulting model is ready to be loaded and run in vLLM out-of-the-box:

    from vllm import LLM
    model = LLM("./Meta-Llama-3.1-8B-Instruct-INT8")
    output = model.generate("My name is")
    print("Output:", output[0].outputs[0].text)
    # Output: Jakob Schmid.  I live in the Republic of South Moluccas

    Under the hood, vLLM understands how to load and run the compressed model by reading the quantization configuration saved in the model's config.json next to the weight files. Check out some of our more detailed examples to try out other quantization flows:

    • FP8 activation quantization with PTQ
    • INT8 activation quantization with GPTQ and SmoothQuant
    • INT4 weight-only quantization with GPTQ
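    As a further illustration, FP8 post-training quantization can be expressed as a single-modifier recipe. The sketch below mirrors the FP8 example linked above; treat the QuantizationModifier scheme string and defaults as assumptions and check the linked example for the exact API in your installed version. Because the dynamic FP8 scheme computes activation scales at runtime, no calibration dataset is needed.

    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.transformers import oneshot

    # FP8 weight and (dynamic) activation quantization via simple round-to-nearest PTQ.
    recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

    oneshot(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        recipe=recipe,
        output_dir="Meta-Llama-3.1-8B-Instruct-FP8",
    )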

    LLM Compressor roadmap

    We have a robust roadmap planned to expand support for model compression in LLM Compressor. Our roadmap is prioritized across the following initiatives:

    • Expand model support: Mixture of Experts (MoE) and vision-language models
    • Expand algorithm and scheme support: AWQ, additional quantized floating point formats (FP8 and FP4), and KV cache quantization
    • Support for non-NVIDIA hardware: We are actively collaborating with AMD, Google, and Intel teams to support models created by LLM Compressor on non-NVIDIA hardware devices.
    • Tools for creating non-uniform quantization schemes
    • 2:4 sparsity: Sparse foundation models, sparse fine-tuning from sparse foundational models, combining sparsity and quantization
    • Expand support for training aware methods: Quantization-Aware Training (QAT) and Low-Rank Adaptation (LoRA)

    If you have any feature requests, large or small, please comment on our Roadmap Issue in GitHub.

    Final thoughts

    At Neural Magic, we believe the future of AI is open, and we are on a mission to bring the power of open source models and vLLM to every enterprise on the planet.

    We offer nm-vllm, an enterprise distribution of vLLM, with:

    • Stable builds with bug fixes and selected model backporting
    • Enterprise support with SLAs for production deployments of vLLM
    • Tools and access to our teams for applying model optimizations via LLM Compressor
    • Pre-optimized model registry
    • Kubernetes reference architectures for production deployments

    Appendix: Benchmark details

    We used the following three model stubs:

    • meta-llama/Meta-Llama-3.1-70B-Instruct
    • neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8
    • neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16

    Models were deployed with the following command on A100-80GB-SXM4 (vllm==0.5.4):

    MODEL=MODEL_STUB_TO_BENCHMARK \
    vllm serve $MODEL \
    --enable-chunked-prefill \
    --disable-log-requests \
    --tensor-parallel-size 4

    We ran the following bash script in vllm-project/vllm to generate the data:

    MODEL=MODEL_STUB_TO_BENCHMARK
    TOTAL_SECONDS=120
    QPS_RATES=("1" "3" "5" "7" "9")
    for QPS in ${QPS_RATES[@]}; do
        NUM_PROMPTS=$((TOTAL_SECONDS * QPS))
        echo "===== RUNNING NUM_PROMPTS = $NUM_PROMPTS QPS = $QPS ====="
        python3 benchmarks/benchmark_serving.py \
            --model $MODEL \
            --dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150 \
            --dataset-path benchmarks/sonnet.txt \
            --num-prompts $NUM_PROMPTS --request-rate $QPS
    done
    Last updated: March 25, 2025

