2:4 Sparse Llama FP8: SOTA performance for NVIDIA Hopper GPUs

Introducing 2:4 Sparse Llama with FP8

December 18, 2024
Alexandre Marques, Eldar Kurtić, Mark Kurtz, Dan Alistarh, Shubhra Pandit, Faraz Shahsavan
Related topics: Artificial intelligence
Related products: Red Hat AI


    A sparse summary

    • Hardware-accelerated sparsity: Achieves an average of 30% lower latency and 20% higher throughput from sparsity alone on NVIDIA Hopper GPUs.
    • FP8 quantization compatible: Supports NVIDIA's FP8 format with sparsity, enabling an average of 1.7X lower latency and 1.5X faster throughput.
    • Open source with vLLM: Built into vLLM with custom CUTLASS-based sparse FP8 kernels to support further adoption and development.

    Advancing AI efficiency is more critical than ever, and sparsity has proven to be a cornerstone in this pursuit. Building on our previous work at Neural Magic with the 2:4 Sparse Llama 3.1 8B foundation model, which increases model efficiency by eliminating unnecessary parameters while preserving accuracy, we are excited to introduce the next step forward: sparse 8-bit floating point (FP8) models and the associated high-performance kernels for vLLM.

    FP8 precision, the latest hardware-supported quantization format on NVIDIA GPUs, delivers compute and memory reductions comparable to 8-bit integer (INT8) formats: roughly 2X faster compute and 2X lower memory usage than 16-bit baselines. The difference is that FP8's floating-point representation captures outliers within the model better than INT8, enabling easier and more accurate quantization. By combining FP8 with the advantages of the 2:4 sparsity pattern and CUTLASS-based performance kernels in vLLM, we achieve optimal hardware utilization and state-of-the-art performance on NVIDIA's Hopper architecture. This integration unlocks new levels of efficiency: a total of 1.7X lower latency and 1.5X more queries per second with full accuracy recovery.

    Figure 1: Inference performance and accuracy results for dense BF16, sparse BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.
    Figure 2: Server-based inference performance results for a multi-turn chat use case with batch size one at various QPS rates for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.
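
    To make the 2:4 pattern concrete, here is a minimal sketch (assuming PyTorch; an illustration only, not the production pruning pipeline) of masking a weight matrix so that only two of every four contiguous values along a row are nonzero, which is the structure NVIDIA's sparse tensor cores accelerate:

    ```python
    import torch

    def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
        """Zero the two smallest-magnitude values in every group of four.

        A one-shot magnitude mask for illustration only; the Sparse Llama
        models are pruned and then fine-tuned to recover accuracy.
        """
        out_features, in_features = weight.shape
        assert in_features % 4 == 0, "2:4 sparsity groups the input dim in fours"

        groups = weight.reshape(out_features, in_features // 4, 4)
        # Keep the two largest-magnitude entries in each group of four.
        topk = groups.abs().topk(k=2, dim=-1).indices
        mask = torch.zeros_like(groups, dtype=torch.bool)
        mask.scatter_(-1, topk, True)
        return (groups * mask).reshape(out_features, in_features)

    w = torch.randn(8, 16)
    w_sparse = prune_2_of_4(w)
    # Each group of four now holds exactly two zeros: 50% structured sparsity.
    print((w_sparse.reshape(8, -1, 4) == 0).sum(dim=-1))
    ```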

    Cutting latency with CUTLASS

    The development of high-performance FP8 sparse kernels for vLLM marks a new chapter in inference optimization, delivering state-of-the-art performance on NVIDIA Hopper GPUs. By combining FP8 precision with the 2:4 structured sparsity pattern, we created custom kernels using CUTLASS v3.6, NVIDIA's library for efficient matrix multiplication, that tackle memory bottlenecks and improve computational efficiency. FP8 cuts memory bandwidth usage in half compared to BF16, while sparsity doubles the theoretical tensor core throughput by skipping redundant computations.

    Building on existing FP8 kernel implementations in vLLM, which leverage CUTLASS and the torch.float8_e4m3fn tensor type, we enabled high-performance sparse FP8 support through:

    • Custom sparse FP8 CUTLASS kernels: Optimized to handle sparse FP8 weight matrices with FP8 quantized activations efficiently.
    • Optimization and tuning: Fine-tuning CUTLASS parameters across scenarios to maximize inference performance.

    Matrix multiplication performance benchmarks illustrate the impact of these advancements. Compared to a naive PyTorch BF16 implementation, the FP8 CUTLASS kernels alone achieve up to 1.9X speedups. These gains are further amplified when combined with the 2:4 sparsity pattern, delivering up to 30% lower latency across batch sizes. FP8 precision and sparsity unlock a total potential speedup of 2.5X over BF16 while maintaining consistent performance advantages over dense FP8 implementations, as shown in Figure 3.

    Figure 3: Performance comparison of different matmul kernel implementations on an H100 GPU for a weight matrix of size 4096x28672.
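
    As a point of reference for Figure 3, the sketch below shows one way the naive PyTorch BF16 baseline could be timed at the same 4096x28672 weight shape. It assumes a CUDA-capable GPU and an arbitrary example batch size; the CUTLASS FP8 and sparse FP8 kernels themselves ship inside vLLM rather than as a standalone Python API.

    ```python
    import torch

    # Naive BF16 matmul baseline for the 4096x28672 weight shape in Figure 3.
    # The batch size (number of input rows) is an example value, not the
    # benchmark's exact configuration.
    batch, in_features, out_features = 16, 4096, 28672

    x = torch.randn(batch, in_features, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(out_features, in_features, device="cuda", dtype=torch.bfloat16)

    def timed_matmul(iters: int = 50) -> float:
        for _ in range(5):  # warm up so one-time CUDA initialization does not skew timing
            torch.matmul(x, w.t())
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            torch.matmul(x, w.t())
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # milliseconds per matmul

    print(f"BF16 matmul: {timed_matmul():.3f} ms")
    ```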

    Accuracy without compromise

    To ensure sparse FP8 models retain accuracy while delivering inference performance gains and remaining easy to quantize, we employed a two-part quantization strategy: dynamic per-token FP8 for activations and static per-channel FP8 for weights. This quantization was applied post-training, following fine-tuning processes identical to those outlined in the original 2:4 Sparse Llama blog.
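
    The sketch below illustrates where the two kinds of scales live, assuming a PyTorch build with float8_e4m3fn support. It shows only the quantization math; in vLLM the scales are folded into the fused sparse FP8 kernels rather than applied in Python.

    ```python
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_weight_per_channel(w: torch.Tensor):
        """Static per-channel FP8: one scale per output channel, computed once offline."""
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
        w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return w_fp8, scale

    def quantize_activation_per_token(x: torch.Tensor):
        """Dynamic per-token FP8: one scale per token (row), computed at runtime."""
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
        x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return x_fp8, scale

    # Example shapes only (an up/gate projection from an 8B-class model).
    w = torch.randn(28672, 4096, dtype=torch.bfloat16)
    x = torch.randn(4, 4096, dtype=torch.bfloat16)

    w_fp8, w_scale = quantize_weight_per_channel(w)
    x_fp8, x_scale = quantize_activation_per_token(x)

    # Dequantized reference matmul; the real kernels multiply in FP8 and fold
    # the per-channel and per-token scales into the epilogue.
    y_ref = (x_fp8.to(torch.bfloat16) * x_scale) @ (w_fp8.to(torch.bfloat16) * w_scale).t()
    ```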

    The fine-tuning and evaluations were conducted across the same key domains to measure accuracy recovery and robustness:

    • Mathematical reasoning: Fine-tuned on GSM8K, evaluated with strict-match accuracy in a zero-shot setting.
    • Coding tasks: Fine-tuned on Evol-CodeAlpaca, evaluated with pass@1 performance on HumanEval.
    • Conversational AI: Fine-tuned on Ultrachat-200K, evaluated with win rate on AlpacaEval.

    As summarized in Table 1, Sparse FP8 models achieve near-full accuracy recovery, comparable to earlier results observed with INT8 quantization. These findings demonstrate the robustness of FP8 quantization, ensuring maximum compression and performance gains without sacrificing accuracy.

    Table 1: Accuracy evaluations comparing dense BF16, sparse BF16, and sparse FP8 versions of Llama 3.1 8B.

    Efficient inference at scale

    To evaluate the real-world impact of sparse FP8 models, we benchmarked them against dense FP8 and dense BF16 versions. The benchmarks cover scenarios that reflect practical deployments and span a range of prefill vs. decode sizes, including code completion, docstring generation, instruction following, multi-turn chat, summarization, and long-context retrieval-augmented generation (RAG), as given in Table 2.

    Table 2: Prefill and decode token amounts for various real-world use cases used for benchmarking.

    Single-stream latency results

    To illustrate the latency-bound end of the inference spectrum, we benchmarked the scenarios in a single-stream setup: batch size one, with a single request at a time. Here, sparse FP8 models show an average of 1.7X lower inference latency than dense BF16 models, with up to 30% of these gains attributed to sparsity alone, as seen in Table 3.

    Table 3: Inference latencies across various use cases for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU with batch size 1 and 1 request at a time.
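
    For readers who want to reproduce a single-stream measurement on their own hardware, a minimal sketch using vLLM's offline API is shown below. The model ID, prompt, and token counts are placeholders rather than the exact configuration behind Table 3.

    ```python
    import time
    from vllm import LLM, SamplingParams

    # Placeholder checkpoint: substitute the sparse FP8 Llama 3.1 8B model under test.
    llm = LLM(model="path/or/hub-id-of-sparse-fp8-llama")

    # Fixed decode length so every request generates the same number of tokens.
    params = SamplingParams(max_tokens=256, ignore_eos=True)
    prompt = "Summarize the following paragraph: ..." * 8  # stand-in prefill text

    llm.generate([prompt], params)  # warm up once before timing

    latencies = []
    for _ in range(10):  # batch size one, a single request at a time
        start = time.perf_counter()
        llm.generate([prompt], params)
        latencies.append(time.perf_counter() - start)

    print(f"median end-to-end latency: {sorted(latencies)[len(latencies) // 2]:.3f} s")
    ```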

    Multi-stream throughput results

    To illustrate the opposite end of the performance envelope, we benchmarked the scenarios in a throughput setup: batch size one, with all requests submitted at once. Here, sparse FP8 models show an average 1.5X increase in queries per second over dense BF16 models, with up to 20% of these gains attributed to sparsity alone, as seen in Table 4.

    Table 4: Throughput inference queries per second across various use cases for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU with batch size 1 and all requests at once.

    Multi-stream server results

    To evaluate the scalability of sparse FP8 models in real-world server deployments, and to tie the throughput and latency benchmarks together, we present comprehensive results for two key use cases. These benchmarks scale queries per second (QPS) from single-stream to full-throughput conditions while measuring inter-token latency (ITL).

    Figure 2, introduced earlier in the blog, showcases the performance for multi-turn chat, demonstrating consistent performance gains across a range of QPS rates.
    Figure 4, below, focuses on code completion, a more decode-heavy workload, where Sparse FP8 models similarly deliver consistent performance improvements across various QPS rates.

    Both figures provide two key perspectives for interpreting the results:

    • Fixed ITL (Inter-Token Latency) as a Service Level Agreement (SLA): By setting a target ITL, the graphs illustrate how Sparse FP8 models increase the number of queries that can be processed concurrently while maintaining the desired performance level.
    • Fixed QPS (Queries Per Second): At a specific QPS rate, the graphs demonstrate improvements in ITL, showcasing faster response times and lower latency.
    Figure 4: Server-based inference performance results for a code completion use case with batch size one at various QPS rates for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.

    Unlock efficiency

    Sparse FP8 models deliver exceptional performance, scalability, and cost-effectiveness on NVIDIA Hopper GPUs. By reducing memory bandwidth demands, maximizing tensor core throughput, and maintaining full accuracy recovery, they enable faster, more efficient AI deployments without compromising quality.

    Neural Magic is proud to continue its commitment to the open-source community, empowering developers, researchers, and enterprises to adopt and build upon these innovations. Our open source FP8 models and high-performance kernels for vLLM are designed to simplify integration and experimentation for real-world use cases.
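
    As a starting point, the sketch below loads one of the checkpoints through vLLM's offline API. The model ID is a placeholder for whichever 2:4 sparse FP8 checkpoint you pick from the Hugging Face collection linked below, and vLLM should pick up the quantization and sparsity configuration from the checkpoint itself.

    ```python
    from vllm import LLM, SamplingParams

    # Placeholder ID: choose an actual 2:4 sparse FP8 checkpoint from the
    # Hugging Face collection linked below.
    llm = LLM(model="neuralmagic/<sparse-fp8-llama-checkpoint>")

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain 2:4 structured sparsity in one paragraph."], params)
    print(outputs[0].outputs[0].text)
    ```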

    Looking to get started in open source?

    • Explore Sparse FP8 models on Hugging Face.
    • Access our FP8 kernels on GitHub within vLLM.
    Last updated: September 18, 2025
