LLM Compressor is here: Faster inference with vLLM

August 14, 2024
Robert Shaw, Mark Kurtz, Sara Adkins, Benjamin Fineran

Related topics: Artificial intelligence
Related products: Red Hat AI


LLM Compressor, a unified library for creating compressed models for faster inference with vLLM, is now available. Neural Magic's research team has used it to create our latest compressed models, including fully quantized and accurate versions of Llama 3.1. With that, we are excited to open the toolkit to the community with its first 0.1 release, so you can use it to compress your own models.

Figure 1: The LLM Compressor architecture flow.

In recent months, the high-performance computing team at Neural Magic has brought performant inference for various quantization schemes to vLLM, including custom Marlin kernels for weight-only quantization and custom CUTLASS kernels for INT8 and FP8 activation quantization.

However, before today, creating quantized checkpoints required navigating a fragmented ecosystem of bespoke compression libraries such as AutoGPTQ, AutoAWQ, and AutoFP8. We built LLM Compressor from the ground up as a single library for applying the latest compression best practices, including GPTQ, SmoothQuant, SparseGPT, and RTN, with many more actively being added. It works natively with Hugging Face models for seamless use in the open source ecosystem, and vLLM supports directly loading checkpoints from LLM Compressor for accelerated inference.

Using LLM Compressor, you can create compressed, accurate versions of your models, including:

  • Activation and weight quantization for up to 3X faster server/throughput deployments. This includes FP8 models using RTN for NVIDIA's Ada Lovelace and Hopper GPUs, and INT8 models using SmoothQuant and GPTQ for NVIDIA's Turing and Ampere GPUs.
  • Weight quantization for up to 4X faster latency with INT4 weight-only models using GPTQ for NVIDIA's Ampere GPUs and newer.
  • Weight pruning for up to 1.5X faster general performance with 2:4, 50% sparse models utilizing SparseGPT for NVIDIA's Ampere GPUs and newer.

Enabling activation quantization in vLLM

LLM Compressor's flexibility enables a critical new feature: activation quantization.

The open source compression ecosystem has so far focused mainly on weight-only quantization, through libraries such as AutoGPTQ and AutoAWQ. Weight-only quantization enables smaller models and faster latency, but with 16-bit activations the compute still runs through the same 16-bit tensor cores as the unquantized model, and inference can even slow down for compute-heavy workloads due to the penalty of dequantizing the weights. Activation quantization, where the inputs to each layer are quantized alongside the weights, enables use of the faster INT8 or FP8 tensor cores for the matrix multiplies, doubling the performance for compute-bound inference.
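
To make the difference concrete, here is a minimal PyTorch sketch (an emulation for illustration only, not vLLM's Marlin or CUTLASS kernels) contrasting the two compute paths, using symmetric per-channel weight scales and per-token activation scales:

import torch

x = torch.randn(4, 8)    # layer inputs (activations), 16-bit in practice
w = torch.randn(16, 8)   # layer weights

# Weight-only path (e.g., w4a16/w8a16): weights are stored quantized but must be
# dequantized before the matmul, so the multiply still runs in 16/32-bit.
w_scale = w.abs().max(dim=1, keepdim=True).values / 127
w_q = (w / w_scale).round().clamp(-128, 127).to(torch.int8)
y_weight_only = x @ (w_q.float() * w_scale).T

# Weight + activation path (w8a8): inputs are quantized per token, so the matmul
# itself can run as an integer GEMM on INT8 tensor cores. The integer GEMM is
# emulated in float here; int8 products are exactly representable in float32.
x_scale = x.abs().max(dim=1, keepdim=True).values / 127
x_q = (x / x_scale).round().clamp(-128, 127).to(torch.int8)
y_w8a8 = (x_q.float() @ w_q.float().T) * x_scale * w_scale.T

print((y_weight_only - y_w8a8).abs().max())  # small; both approximate x @ w.T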

Weight-only quantization often fails to deliver speed improvements in production serving deployments, which are typically compute-bound and see minimal benefit from weight-only quantization. Activation quantization, however, offers a substantial performance boost in these high-compute scenarios while still delivering faster inference at lower queries per second (QPS).

Figure 2 demonstrates a 1.6X speedup at 5 QPS for the INT8 weight and activation quantized model (w8a8) compared to the 16-bit baseline (w16a16), while the 4-bit weight quantized model (w4a16) shows little improvement.

Figure 2: A 1.6X speedup at 5 QPS for the INT8 weight and activation quantized model (w8a8) compared to the 16-bit baseline (w16a16), while the 4-bit weight quantized model (w4a16) shows little improvement.

Activation quantization performance in vLLM

Let’s take Llama 3.1 70B running in vLLM on a 4xA100 GPU setup as an example to see whether this analysis holds up.

We will compare the serving latency of three variants of Llama 3.1 70B:

  • Unquantized FP16 (w16a16): meta-llama/Meta-Llama-3.1-70B-Instruct
  • Weight and activation quantization to INT8 (w8a8): neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8
  • Weight-only quantization to INT4 (w4a16): neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16

Figure 3 illustrates the average time to generate each new token (TPOT) across different server loads, measured in queries per second (QPS). Additionally, a deployment constraint of 5 seconds is set for the time to generate the first token (TTFT) to ensure the serving application maintains reasonable initial response times.

Figure 3: The average time to generate each new token (TPOT) across different server loads, measured in queries per second (QPS). Full replication instructions for the benchmark are available in the appendix.

At low QPS, weight-only quantization offers improved latency relative to an unquantized model. However, as the server load increases and becomes compute-bound, the performance of the weight-only model levels off, matching the unquantized model. In contrast, the activation quantized model performs better under high load, supporting more queries per second before the system becomes overloaded and TTFT exceeds our limits for a responsive application.

For a 70B model on an A100 system, we see that the W8A8 model achieves similar latency performance with just 2 GPUs compared to the unquantized model running with 4, meaning similar latency guarantees with half the resources!

Figure 4: Llama 3.1 70B time per output token, comparing w16a16 on 4 GPUs and w8a8 on 2 GPUs. Full replication instructions for the benchmark are available in the appendix.
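
As a rough deployment sketch (illustrative only; adjust tensor_parallel_size to your hardware), the published w8a8 checkpoint loads directly in vLLM on 2 GPUs:

from vllm import LLM, SamplingParams

# Illustrative: serve the INT8 w8a8 70B checkpoint with tensor parallelism across
# 2 GPUs, versus 4 GPUs for the unquantized FP16 model at similar latency.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Activation quantization helps because"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)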

Activation quantization accuracy

vLLM’s CUTLASS kernels for activation quantization offer flexible support for various schemes, allowing for a high degree of customization, including any combination of:

  • Per-tensor or per-channel quantization for weights
  • Per-tensor or per-token quantization for activations
  • Symmetric or asymmetric quantized activations (for INT8)
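
As a generic illustration of the asymmetric option (a sketch of the math, not vLLM's kernel code), an asymmetric per-token INT8 quantizer adds a zero point so the full [-128, 127] range covers a skewed activation distribution:

import torch

def quantize_per_token_asymmetric(x: torch.Tensor):
    """Asymmetric INT8 quantization with a scale and zero point per token (row)."""
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / 255.0
    zero_point = (-128 - x_min / scale).round()
    x_q = (x / scale + zero_point).round().clamp(-128, 127).to(torch.int8)
    return x_q, scale, zero_point

x = torch.randn(4, 8) + 2.0                      # skewed activations
x_q, scale, zero_point = quantize_per_token_asymmetric(x)
x_dq = (x_q.float() - zero_point) * scale        # dequantize to check the round trip
print((x - x_dq).abs().max())                    # small reconstruction error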

This flexibility in vLLM, combined with LLM Compressor's advanced algorithms such as GPTQ and SmoothQuant, ensures that model accuracy is maintained even after quantization. The model card for neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 shows a negligible drop, using static per-channel weight scales and dynamic per-token activation scales, compared to the FP16 baseline on the Open LLM Leaderboard (Table 1).

Table 1: Open LLM Leaderboard evaluation scores.

Benchmark                            Meta-Llama-3.1-70B-Instruct   quantized.w8a8 (this model)   Recovery
MMLU (5-shot)                        83.88                         83.65                         99.7%
MMLU (CoT, 0-shot)                   85.74                         85.41                         99.6%
ARC Challenge (0-shot)               93.26                         93.26                         100.0%
GSM-8K (CoT, 8-shot, strict-match)   93.10                         93.25                         100.2%
Hellaswag (10-shot)                  86.40                         86.28                         99.9%
Winogrande (5-shot)                  85.00                         85.00                         100.0%
TruthfulQA (0-shot, mc2)             59.83                         60.88                         101.8%
Average                              83.89                         83.96                         100.2%

This combination of fine-grained quantization and sophisticated algorithms enables users to achieve faster inference without compromising on the precision and reliability of their models.

Try LLM Compressor

The following snippet is a minimal example of quantizing meta-llama/Meta-Llama-3.1-8B-Instruct with INT8 weights and activations.

Install LLM Compressor via PyPI

LLM Compressor is available for installation via PyPI:

pip install llmcompressor

Apply quantization with LLM Compressor

Quantization is applied by selecting an algorithm and calling the oneshot API, which applies the selections in a post-training setting.

In this case, we apply SmoothQuant to make the activations easier to quantize and GPTQ to apply the weight and activation quantization. We apply these algorithms to all linear layers of the network using the built-in open_platypus dataset (note: see the examples for how to use your own calibration set).

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
# Select quantization algorithm. In this case, we:
#   * apply SmoothQuant to make the activations easier to quantize
#   * quantize the weights to int8 with GPTQ (static per channel)
#   * quantize the activations to int8 (dynamic per token)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]
# Apply quantization using the built in open_platypus dataset.
oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)

Run inference on compressed models with vLLM

The resulting model is ready to be loaded and run in vLLM out-of-the-box:

from vllm import LLM
model = LLM("./Meta-Llama-3.1-8B-Instruct-INT8")
output = model.generate("My name is")
print("Output:", output[0].outputs[0].text)
# Output: Jakob Schmid.  I live in the Republic of South Moluccas

Under the hood, vLLM understands how to load and run the compressed model by looking at the quantization configuration stored in the config.json next to the weight files. Check out some of our more detailed examples to try out other quantization flows:

  • FP8 activation quantization with PTQ
  • INT8 activation quantization with GPTQ and SmoothQuant
  • INT4 weight-only quantization with GPTQ
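
For instance, one FP8 flow uses dynamic per-token activation scales and therefore needs no calibration data. The following is a minimal sketch under the assumption that the QuantizationModifier preset scheme names from the LLM Compressor examples apply to your installed version; check the linked FP8 example for the exact API:

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Assumed recipe shape based on the FP8 example: dynamic FP8 activation scales
# need no calibration data, so no dataset argument is passed to oneshot.
recipe = QuantizationModifier(scheme="FP8_DYNAMIC", targets="Linear", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    recipe=recipe,
    output_dir="Meta-Llama-3.1-8B-Instruct-FP8-Dynamic",
)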

LLM Compressor roadmap

We have a robust roadmap planned to expand model compression support in LLM Compressor, prioritized across the following initiatives:

  • Expand model support: Mixture of Experts (MoE) and vision-language models
  • Expand algorithm and scheme support: AWQ, additional quantized floating point formats (fp8 and fp4), and KV cache quantization
  • Support for non-NVIDIA hardware: We are actively collaborating with AMD, Google, and Intel teams to support models created by LLM Compressor on non-NVIDIA hardware devices.
  • Tools for creating non-uniform quantization schemes
  • 2:4 sparsity: Sparse foundation models, sparse fine-tuning from sparse foundational models, combining sparsity and quantization
  • Expand support for training aware methods: Quantization-Aware Training (QAT) and Low-Rank Adaptation (LoRA)

If you have any feature requests, large or small, please comment on our Roadmap Issue on GitHub.

Final thoughts

At Neural Magic, we believe the future of AI is open, and we are on a mission to bring the power of open source models and vLLM to every enterprise on the planet.

We offer nm-vllm, an enterprise distribution of vLLM, with:

  • Stable builds with bug fixes and selected model backporting
  • Enterprise support with SLAs for production deployments of vLLM
  • Tools and access to our teams for applying model optimizations via LLM Compressor
  • Pre-optimized model registry
  • Kubernetes reference architectures for production deployments

Appendix: Benchmark details

We used the following three model stubs:

  • meta-llama/Meta-Llama-3.1-70B-Instruct
  • neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8
  • neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16

Models were deployed with the following command on A100-80GB-SXM4 (vllm==0.5.4):

# Set MODEL on its own line so the $MODEL expansion below sees it.
MODEL=MODEL_STUB_TO_BENCHMARK
vllm serve $MODEL \
--enable-chunked-prefill \
--disable-log-requests \
--tensor-parallel-size 4

We ran the following bash script in vllm-project/vllm to generate the data:

MODEL=MODEL_STUB_TO_BENCHMARK
TOTAL_SECONDS=120
QPS_RATES=("1" "3" "5" "7" "9")
for QPS in ${QPS_RATES[@]}; do
    NUM_PROMPTS=$((TOTAL_SECONDS * QPS))
    echo "===== RUNNING NUM_PROMPTS = $NUM_PROMPTS QPS = $QPS ====="
    python3 benchmarks/benchmark_serving.py \
        --model $MODEL \
        --dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150 \
        --dataset-path benchmarks/sonnet.txt \
        --num-prompts $NUM_PROMPTS --request-rate $QPS
done
Last updated: March 25, 2025
