
2:4 Sparse Llama: Smaller models for efficient GPU inference

Introducing Sparse Llama 3.1 8B

February 28, 2025
Eldar Kurtić, Alexandre Marques, Mark Kurtz, Dan Alistarh, Shubhra Pandit
Related topics: Artificial intelligence, Data Science
Related products: Red Hat AI


    A sparse summary

    • Sparse foundation model: The first sparse, highly accurate foundation model built on top of Meta’s Llama 3.1 8B with 98% recovery on Open LLM Leaderboard v1 and full recovery across fine-tuning tasks, including math, coding, and chat.
    • Hardware-accelerated sparsity: Features a 2:4 sparsity pattern designed for NVIDIA Ampere GPUs and newer, delivering up to 30% higher throughput and 20% lower latency from sparsity alone with vLLM.
    • Quantization compatible: Fully integrates with advanced 4-bit quantization methods like GPTQ and efficient Sparse-Marlin inference kernels, enabling inference speedups anywhere from 1.2x to 3.0x depending on the hardware and scenario.

    Large language models (LLMs) are approaching their limits in terms of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing out the last possible bits before accuracy plummets. These dense architectures remain large, costly, and resource-intensive, making it challenging and expensive to scale AI. 

    Neural Magic is doubling down on this challenge with sparse LLMs—reducing the model size by removing unneeded connections while retaining accuracy. Sparse models, though underexplored in the LLM space due to the high compute demands of pretraining, offer an increasingly promising dimension in model compression and efficiency.

    Sparse-Llama-3.1-8B-2of4 is our next step in this commitment—a 50% pruned version of Meta's open-source Llama 3.1 8B. Built with a GPU-friendly 2:4 sparsity structure, it removes two of every four parameters while preserving accuracy. Designed as a versatile foundation model for fine-tuning and instruction alignment, Sparse Llama is optimized for both speed and efficiency. Its quantization-friendly architecture enables faster, cheaper inference with roughly half the connections of its dense counterpart.
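
    To make the 2:4 pattern concrete, here is a minimal PyTorch sketch that enforces it on a weight matrix via simple magnitude pruning. This only illustrates the pattern itself; the actual Sparse Llama recipe relies on SparseGPT-based pruning and retraining, not one-shot magnitude pruning.

    ```python
    import torch

    def prune_2of4(weight: torch.Tensor) -> torch.Tensor:
        """Zero the two smallest-magnitude weights in every contiguous
        group of four along the input dimension (50% structured sparsity)."""
        out_features, in_features = weight.shape
        groups = weight.reshape(out_features, in_features // 4, 4)
        _, drop = groups.abs().topk(2, dim=-1, largest=False)
        mask = torch.ones_like(groups)
        mask.scatter_(-1, drop, 0.0)
        return (groups * mask).reshape(out_features, in_features)

    w = torch.randn(8, 16)
    w_24 = prune_2of4(w)
    # Every group of four now holds exactly two zeros.
    assert (w_24.reshape(-1, 4) == 0).sum(dim=-1).eq(2).all()
    ```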

    Research background

    Sparse Llama 3.1 originates from years of prior research, building on previous breakthroughs with SparseGPT, SquareHead Knowledge Distillation, and Sparse Llama 2. These contributions laid the groundwork for our state-of-the-art sparse training approach, tailored to the latest generation of LLMs. Leveraging SparseGPT developed in collaboration with ISTA, we efficiently removed redundant connections, while SquareHead’s layerwise knowledge distillation and Sparse Llama 2’s foundational training recipes provided the basis for sparsity optimization.
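
    As a rough illustration of the layerwise distillation idea, the sketch below compares student and teacher hidden states layer by layer instead of only matching final logits. It is written in the spirit of SquareHead rather than as its exact formulation.

    ```python
    import torch
    import torch.nn.functional as F

    def layerwise_distill_loss(student_hidden, teacher_hidden):
        """Per-layer MSE between student and teacher hidden states,
        normalized by the teacher's magnitude so each layer contributes
        at a comparable scale, then averaged over layers."""
        total = 0.0
        for s, t in zip(student_hidden, teacher_hidden):
            total = total + F.mse_loss(s, t) / (t.pow(2).mean() + 1e-6)
        return total / len(student_hidden)

    # With Hugging Face models, pass output_hidden_states=True to both the
    # sparse student and the dense teacher, then feed the hidden-state
    # tuples to this loss alongside the usual language-modeling loss.
    ```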

    Working with the latest LLMs requires more than applying existing techniques. These models, pushed to the edge of training scaling laws, are highly sensitive to sparsity. We iteratively refined our methods to overcome this, starting with meticulously curating publicly available datasets. By sourcing and filtering only the highest-quality and most representative data for LLM use cases, we reduced the pretraining set to just 13 billion tokens—drastically cutting the environmental impact of further training while preserving performance.

    This curated dataset and advancements in our pruning and sparse training recipes allowed training to converge in just 26 hours on 32 H100 GPUs, demonstrating the efficiency and scalability of our approach while delivering a model optimized for real-world deployments.

    Performance snapshot

    Sparse Llama 3.1 demonstrates exceptional performance across few-shot benchmarks, fine-tuning tasks, and inference scenarios, showcasing the versatility and efficiency of 2:4 sparsity.

    • Few-shot benchmarks: It achieved 98.4% accuracy recovery on the Open LLM Leaderboard V1 and 97.3% recovery on the more challenging Mosaic Eval Gauntlet (Figure 1, right), maintaining competitive performance with dense models. 
    • Fine-tuning results (Figure 1, left): The most exciting results emerged during fine-tuning across math (GSM8K), code (Evol-CodeAlpaca), and conversational AI (Ultrachat-200K) tasks. Despite the small regressions seen on few-shot benchmarks, Sparse Llama achieved full accuracy recovery after fine-tuning and, in some cases, outperformed its dense counterpart. These results underscore Sparse Llama's robustness and adaptability across domains. The fine-tuned models are publicly available as a proof of concept.
    • Sparse-quantized Inference (Figure 2): Using 4-bit post-training quantization combined with 2:4 sparsity delivered impressive inference speedups with vLLM 0.6.4.post1 and minimal effects on accuracy for most cases. The sparse-quantized models achieved 3.0x speedup on A5000 and A6000 GPUs, and 2.1x on A100 GPUs in single-stream latency, with 1.1x to 1.2x of the gains attributed to sparsity alone. Throughput scenarios showed 1.2x to 1.8x improvement, even when quantization alone had minimal impact.
    Figure 1: Results for dense, sparse, and sparse-quantized Llama 3.1 8B on fine-tuning and few-shot benchmarks.
    Figure 2: Impact of 2:4 sparsity and quantization on inference performance with vLLM nightly (11/22/24).

    For a detailed breakdown of metrics and benchmarks, refer to the Full performance details section.

    Get started

    Explore the Sparse Llama base model and our fine-tuned versions today on Neural Magic’s Hugging Face organization. With open-sourced weights, evaluations, and benchmarks, we intend to empower the community to experiment and build on our foundation.
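
    As a starting point, a minimal vLLM script for the base model might look like the following sketch. The repository ID follows the naming used in this post; verify the exact name on the Hugging Face organization page.

    ```python
    from vllm import LLM, SamplingParams

    # Assumed repo ID; check Neural Magic's Hugging Face organization.
    llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(
        ["Explain 2:4 structured sparsity in one paragraph."], params
    )
    print(outputs[0].outputs[0].text)
    ```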

    Stay tuned for more. The future of LLMs is open source, and with sparsity, we're excited to continue pushing the boundaries of what's possible for efficient, scalable, and performant AI.

    Full performance details

    This section breaks down the metrics and benchmarks.

    Base model evaluations

    We evaluated Sparse Llama on two widely used benchmarks to establish its baseline performance:

    Open LLM Leaderboard v1: This benchmark measures capabilities across domains such as grade school mathematics (GSM8K), world knowledge (MMLU, ARC-Challenge), and language understanding (WinoGrande, HellaSwag). Sparse Llama achieved 98.4% accuracy recovery, performing nearly on par with its dense counterpart, as shown in Table 1.

    Table 1: Dense vs. Sparse Llama 3.1 8B evaluation results for the Open LLM Leaderboard v1.
    | Open LLM Leaderboard v1 | ARC-C (25-shot) | MMLU (5-shot) | HellaSwag (10-shot) | WinoGrande (5-shot) | GSM8K (5-shot) | TruthfulQA (0-shot) |
    | --- | --- | --- | --- | --- | --- | --- |
    | Llama-3.1-8B | 58.2 | 65.4 | 82.3 | 78.3 | 50.7 | 44.2 |
    | Sparse-Llama-3.1-8B-2of4 | 59.4 | 60.6 | 79.8 | 75.9 | 56.3 | 40.9 |

    Mosaic Eval Gauntlet v0.3: A comprehensive benchmark covering reasoning, problem-solving, and reading comprehension tasks. Sparse Llama demonstrated robust performance, achieving 97.3% accuracy recovery, even on more challenging datasets, as reported in Table 2.

    Table 2: Dense versus Sparse Llama 3.1 8B evaluation results for the Mosaic Eval Gauntlet v0.3.
    | Mosaic Eval Gauntlet v0.3 | World knowledge | Commonsense reasoning | Language understanding | Symbolic problem solving | Reading comprehension |
    | --- | --- | --- | --- | --- | --- |
    | Llama-3.1-8B | 59.4 | 49.3 | 69.8 | 40.0 | 58.2 |
    | Sparse-Llama-3.1-8B-2of4 | 55.6 | 50.0 | 69.0 | 37.1 | 57.5 |

    Sparse fine-tuning evaluations

    To evaluate Sparse Llama’s adaptability, we fine-tuned both the sparse and dense models across three domains, applying the same hyperparameter tuning budget to each:

    • Mathematical reasoning (GSM8K): Strict-match accuracy in 0-shot setting via lm-evaluation-harness.
    • Coding (Evol-CodeAlpaca): Pass@1 on HumanEval and HumanEval+ via EvalPlus.
    • Conversational AI (Ultrachat-200K): Win rate on AlpacaEval, following the setup of Sparse Llama 2.
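
    For reference, the GSM8K setting above can be approximated with the lm-evaluation-harness Python API, roughly as in this sketch; the model ID is assumed, and the exact flags behind the published numbers may differ.

    ```python
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        # Assumed repo ID for one of the fine-tuned checkpoints.
        model_args="pretrained=neuralmagic/Sparse-Llama-3.1-8B-2of4",
        tasks=["gsm8k"],
        num_fewshot=0,  # 0-shot, strict-match, as described above
    )
    print(results["results"]["gsm8k"])
    ```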

    Sparse Llama consistently achieved full accuracy recovery during fine-tuning and even outperformed the dense baseline on some tasks, demonstrating its versatility. The results are detailed in Table 3.

    Table 3: Dense versus Sparse Llama 3.1 8B evaluation results for fine-tuning across GSM8K, Evol-CodeAlpaca, and Ultrachat-200K.
    | Model | Recovery (%) | Average | GSM8K (0-shot strict-match) | HumanEval (Evol-CodeAlpaca pass@1) | HumanEval+ (Evol-CodeAlpaca pass@1) | AlpacaEval (Ultrachat-200K win rate) |
    | --- | --- | --- | --- | --- | --- | --- |
    | Llama-3.1-8B | 100.1 | 55.3 | 66.3 | 48.5 | 44.1 | 62.0 |
    | Sparse-Llama-3.1-8B-2of4 | 101.1 | 55.8 | 66.9 | 49.1 | 46.3 | 61.1 |

    Quantization and sparsity

    To further optimize Sparse Llama, we applied 4-bit post-training quantization using the W4A16 scheme, in which weights are quantized to 4-bit integers while activations remain at 16-bit precision. This approach preserves the 2:4 sparsity pattern.
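
    The sketch below shows only the W4A16 storage format (per-group symmetric quantization); the released checkpoints use GPTQ-style calibrated quantization, so treat this as an illustration. Because zeros quantize back to exact zeros, the 2:4 mask survives quantization.

    ```python
    import torch

    def quantize_w4a16(weight: torch.Tensor, group_size: int = 128):
        """Quantize weights to 4-bit integers with one floating-point scale
        per group of `group_size` input elements; activations stay 16-bit."""
        w = weight.float().reshape(-1, group_size)
        scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 range [-8, 7]
        q = (w / scale).round().clamp(-8, 7).to(torch.int8)
        return q.reshape(weight.shape), scale

    w = torch.randn(4096, 4096, dtype=torch.float16)
    q, scale = quantize_w4a16(w)
    # Dequantize for reference; pruned (zero) weights come back as exact zeros.
    w_hat = (q.float().reshape(-1, 128) * scale).reshape(w.shape).half()
    ```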

    Sparse quantized versions maintained accuracy with minimal degradation compared to unquantized models while enabling significant compression and inference performance gains. See Table 4 for detailed accuracy comparisons across tasks.

    Table 4: Dense, Sparse, and Sparse Quantized evaluation results for fine-tuning across GSM8K, Evol-CodeAlpaca, and Ultrachat-200K.
    | Model | GSM8K (0-shot strict-match) | HumanEval (Evol-CodeAlpaca pass@1) | HumanEval+ (Evol-CodeAlpaca pass@1) | AlpacaEval (Ultrachat-200K win rate) |
    | --- | --- | --- | --- | --- |
    | Llama-3.1-8B | 66.34 | 48.50 | 44.20 | 61.99 |
    | Sparse-Llama-3.1-8B-2of4 | 66.87 | 49.10 | 46.30 | 61.12 |
    | Sparse-Llama-3.1-8B-2of4-quantized.w4a16 | 64.29 | 50.60 | 48.00 | 61.61 |

    Inference

    Sparse Llama’s inference performance was benchmarked with vLLM (version 0.6.4.post1), a high-performance inference engine, and compared to its dense variants across several real-world use cases, ranging from code completion to large-document summarization, as detailed in Table 5.

    Table 5: Use cases for benchmarking the performance impact of 2:4 sparsity and quantization.
    | Use case | Prefill tokens | Decode tokens |
    | --- | --- | --- |
    | Code completion | 256 | 1024 |
    | Docstring generation | 768 | 128 |
    | Code fixing | 1024 | 1024 |
    | RAG | 1024 | 128 |
    | Instruction following | 256 | 128 |
    | Multi-turn chat | 512 | 256 |
    | Large summarization | 4096 | 512 |

    Single-stream deployments

    In single-stream scenarios, combining sparsity and quantization delivered 2.1x to 3.0x faster inference than the dense, 16-bit model, with 1.1x to 1.2x of the speedup coming from sparsity alone. Table 6 provides full results across the various use cases.

    Table 6: Effect of pruning and quantization on latency for synchronous deployment across multiple use cases.
    Latency in seconds (lower is better):

    | GPU class | Model | Average speedup | Code completion | Docstring generation | Code fixing | RAG | Instruction following | Multi-turn chat | Large summarization |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | A5000 | Llama-3.1-8B | (baseline) | 25.3 | 3.3 | 25.7 | 3.4 | 3.2 | 6.5 | 13.9 |
    | A5000 | Llama-3.1-8B-quantized.w4a16 | 2.5 | 9.6 | 1.4 | 10.0 | 1.4 | 1.3 | 2.5 | 6.1 |
    | A5000 | Sparse-Llama-3.1-8B-2of4-quantized.w4a16 | 3.0 | 8.3 | 1.1 | 8.4 | 1.2 | 1.1 | 2.1 | 5.0 |
    | A6000 | Llama-3.1-8B | (baseline) | 24.6 | 3.2 | 24.9 | 3.2 | 3.1 | 6.2 | 13.4 |
    | A6000 | Llama-3.1-8B-quantized.w4a16 | 2.5 | 9.5 | 1.3 | 9.8 | 1.4 | 1.2 | 2.5 | 5.9 |
    | A6000 | Sparse-Llama-3.1-8B-2of4-quantized.w4a16 | 3.0 | 8.0 | 1.1 | 8.2 | 1.1 | 1.0 | 2.1 | 4.9 |
    | A100 | Llama-3.1-8B | (baseline) | 12.3 | 1.6 | 12.4 | 1.6 | 2.0 | 3.0 | 7.0 |
    | A100 | Llama-3.1-8B-quantized.w4a16 | 1.9 | 6.4 | 0.9 | 6.6 | 0.9 | 1.0 | 2.0 | 4.0 |
    | A100 | Sparse-Llama-3.1-8B-2of4-quantized.w4a16 | 2.1 | 5.7 | 0.8 | 5.8 | 0.8 | 1.0 | 1.0 | 3.0 |
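
    A single-stream measurement like the rows above can be approximated with a script along these lines; the prompt and decode lengths follow the code-completion row of Table 5, the repo ID is assumed, and the benchmarking harness behind the published numbers may differ.

    ```python
    import time
    from vllm import LLM, SamplingParams

    # Assumed repo ID; check Neural Magic's Hugging Face organization.
    llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4-quantized.w4a16")

    # Code completion setting from Table 5: 256 prefill tokens, 1024 decode tokens.
    params = SamplingParams(max_tokens=1024, ignore_eos=True)
    prompt = "def fibonacci(n):" + " #" * 120  # crude stand-in for a ~256-token prompt

    start = time.perf_counter()
    llm.generate([prompt], params)  # one request at a time = single-stream
    print(f"end-to-end latency: {time.perf_counter() - start:.1f}s")
    ```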

    Asynchronous deployments

    In multi-query scenarios, we compare Sparse Llama with its dense counterparts on the maximum query rate each model sustains in the various use cases. Sparse Llama achieved a 1.2x to 1.8x speedup in maximum query rate over the dense, 16-bit model, while quantization alone offered only minimal improvements and, in some cases, performed worse. Table 7 provides full results across the various use cases.

    Table 7: Effect of pruning and quantization on maximum query rate for asynchronous deployment across multiple use cases.

    Maximum queries per second (higher is better):

    | GPU class | Model | Average speedup | Code completion | Docstring generation | Code fixing | RAG | Instruction following | Multi-turn chat | Large summarization |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | A5000 | Llama-3.1-8B | (baseline) | 0.9 | 4.2 | 0.5 | 3.2 | 7.9 | 24.4 | 0.4 |
    | A5000 | Llama-3.1-8B-quantized.w4a16 | 1.4 | 1.6 | 4.9 | 1.0 | 3.8 | 11.6 | 22.5 | 0.6 |
    | A5000 | Sparse-Llama-3.1-8B-2of4-quantized.w4a16 | 1.8 | 2.0 | 7.2 | 1.3 | 5.6 | 15.6 | 16.4 | 0.8 |
    | A6000 | Llama-3.1-8B | (baseline) | 1.5 | 5.6 | 1.1 | 4.5 | 12.5 | 11.8 | 0.8 |
    | A6000 | Llama-3.1-8B-quantized.w4a16 | 1.1 | 2.4 | 5.8 | 1.3 | 4.5 | 13.7 | 12.7 | 0.7 |
    | A6000 | Sparse-Llama-3.1-8B-2of4-quantized.w4a16 | 1.4 | 2.7 | 7.4 | 1.5 | 5.9 | 17.3 | 16.6 | 0.8 |
    | A100 | Llama-3.1-8B | (baseline) | 3.5 | 11.2 | 2.4 | 8.7 | 23.0 | 22.5 | 1.6 |
    | A100 | Llama-3.1-8B-quantized.w4a16 | 1.0 | 3.9 | 10.7 | 2.6 | 8.4 | 23.7 | 22.8 | 1.5 |
    | A100 | Sparse-Llama-3.1-8B-2of4-quantized.w4a16 | 1.2 | 4.4 | 14.0 | 3.0 | 10.8 | 25.5 | 28.2 | 1.8 |
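
    An offline approximation of the throughput setting is to submit many requests at once and let vLLM batch them internally, as in this rough sketch (repo ID assumed; the published numbers come from a proper serving benchmark, so treat this only as a starting point).

    ```python
    import time
    from vllm import LLM, SamplingParams

    # Assumed repo ID; check Neural Magic's Hugging Face organization.
    llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4-quantized.w4a16")
    params = SamplingParams(max_tokens=128, ignore_eos=True)
    prompts = ["Summarize the benefits of structured sparsity."] * 256

    start = time.perf_counter()
    llm.generate(prompts, params)  # vLLM schedules and batches internally
    elapsed = time.perf_counter() - start
    print(f"approximate throughput: {len(prompts) / elapsed:.1f} queries/s")
    ```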

    Empowering AI efficiency

    Sparse Llama 3.1 represents a new step toward scalable, efficient LLMs built on state-of-the-art sparsity and quantization techniques. From few-shot evaluations to fine-tuning and real-world inference benchmarks, Sparse Llama delivers substantial efficiency gains while preserving accuracy, making advanced AI more accessible.

    We are excited to see how the community builds on this foundation. Whether you’re optimizing deployments, exploring model compression, or scaling AI to new heights, Sparse Llama offers a compelling path forward. Together, let’s shape the future of efficient AI.

    This blog was updated on 12/23/24 with corrections to inference performance numbers.

    Last updated: March 25, 2025
