Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Compressed Granite 3.1: Powerful performance in a small package

January 30, 2025
Shubhra Pandit Alexandre Marques Mark Kurtz
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    A compressed summary

    • Neural Magic is now part of Red Hat, accelerating our mission to deliver open and efficient AI.
    • Our new compressed Granite 3.1 models are designed for enterprise deployments, achieving 3.3X smaller models, up to 2.8X better performance, and 99% accuracy recovery.
    • Models and recipes are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.

    Smaller, faster Granite for all

    Neural Magic is excited to join Red Hat, combining our expertise in AI optimization and inference with Red Hat’s legacy of open-source innovation. Together, we’re paving the way for more efficient, scalable, and accessible AI solutions tailored to the needs of developers and enterprises across the hybrid cloud.

    Our first contribution in this new chapter is the release of compressed Granite 3.1 8B and 2B models. Building on the success of the Granite 3.0 series, IBM’s Granite 3.1 introduced significant upgrades, including competitive OpenLLM leaderboard scores, expanded 128K token context length, multilingual support for 12 languages, and enhanced functionality for retrieval-augmented generation (RAG) and agentic workflows. These improvements make Granite 3.1 a versatile, high-performance solution for enterprise applications.

    To enhance accessibility and efficiency, we’ve created quantized versions of the Granite 3.1 models. These compressed variants reduce resource requirements while maintaining 99% accuracy recovery, making them ideal for cost-effective and scalable AI deployments. Available options include:

    • FP8 weights and activations (FP8 W8A8): Optimized for server and throughput-based scenarios on NVIDIA Ada Lovelace and Hopper GPUs.
    • INT8 weights and activations (INT W8A8): Ideal for servers using NVIDIA Ampere and earlier GPUs.
    • INT4 weight-only models (INT W4A16): Tailored for latency-sensitive applications or limited GPU resources.

    Extensive evaluations confirm that the up to 3.3X smaller, compressed Granite 3.1 models deliver 99% accuracy recovery, on average, and up to 2.8X better inference performance, as shown in Figure 1 and Figure 2. Explore these models, recipes, and full evaluations on Hugging Face, where they are freely available under the Apache 2.0 license.

    Comparison of baseline and quantized versions of the Granite 3.1 models (8B, 2B) for accuracy evaluations across averages for OpenLLM Leaderboard V1 and OpenLLM Leaderboard V2.
    Figure 1: Comparison of baseline and quantized versions of the Granite 3.1 models (8B, 2B) for accuracy evaluations across averages for OpenLLM Leaderboard V1 and OpenLLM Leaderboard V2.
    Comparison of baseline and quantized versions of Granite 3.1 8B for inference benchmarks for RAG and code completion scenarios on 1xA6000 and 1xL40 GPUs for best latency (single-stream) and throughput (multi-stream) performance.
    Figure 2: Comparison of baseline and quantized versions of Granite 3.1 8B for inference benchmarks for RAG and code completion scenarios on 1xA6000 and 1xL40 GPUs for best latency (single-stream) and throughput (multi-stream) performance.

    Compressed Granite in action

    The compressed Granite 3.1 models are ready for immediate deployment, integrating seamlessly with the vLLM ecosystem. Developers can quickly start deploying these models, and with LLM Compressor, they can further customize compression recipes to meet specific requirements. Below is a simple example to get started with vLLM:

    from vllm import LLM
    
    llm = LLM(model="neuralmagic/granite-3.1-8b-instruct-quantized.w4a16")
    prompts = ["The Future of AI is"]
    
    for output in llm.generate(prompts)
        print(f"Prompt {output.prompt}, Generated: {output.outputs[0].text}")

    For smaller request sizes, such as instruction-following tasks in agentic pipelines (256 prompt tokens, 128 output tokens), the compressed Granite models consistently deliver performance gains across latency, server, and throughput use cases on various GPUs. Figure 3 illustrates these scenarios for the 8B model deployed on a small A5000 GPU and a medium L40 GPU. Key takeaways from the graph include:

    • Single Stream, Latency: W4A16 achieves the highest efficiency with 2.7X (A5000) to 1.5X (L40) lower latency.
    • Multi-Stream: W8A8 models perform best (after 6 RPS) on an A5000 enabling 1.6X more requests per second at the same performance, and W4A16 performs best on an L40 enabling up to 8X more requests per second at the same performance.
    Comparison of baseline and quantized versions of Granite 3.1 8B for Instruction Following server-based inference performance on 1xA5000 and 1xL40 GPUs comparing request rate (RPS) to request latency.
    Figure 3: Comparison of baseline and quantized versions of Granite 3.1 8B for Instruction Following server-based inference performance on 1xA5000 and 1xL40 GPUs comparing request rate (RPS) to request latency.

    The compressed models provide comparable performance benefits for larger requests, such as retrieval-augmented generation (RAG) or summarization workflows (4096 prompt tokens, 512 output tokens). Figure 4 shows these scenarios for the 8B model on an A5000 GPU and a large A100 GPU. Highlights from the graph include:

    • Single stream, latency: W4A16 achieves the highest efficiency with 2.4X (A5000) to 1.7X (A100) lower latency.
    • Multi-stream: W8A8 models offer the best performance, enabling up to 4X (A5000) to 3X (A100) more requests at the same performance.
    Comparison of baseline and quantized versions of Granite 3.1 8B for RAG/Summarization server-based inference performance on 1xA5000 and 1xA100 GPUs comparing request rate (RPS) to request latency.
    Figure 4: Comparison of baseline and quantized versions of Granite 3.1 8B for RAG/Summarization server-based inference performance on 1xA5000 and 1xA100 GPUs comparing request rate (RPS) to request latency.

    Open and scalable AI

    Neural Magic’s integration into Red Hat marks an exciting new era for open and efficient AI. The compressed Granite 3.1 models and vLLM support exemplify the benefits of combining state-of-the-art model compression, advanced high-performance computing, and enterprise-grade generative AI capabilities.

    Explore these models and recipes on Hugging Face, deploy them with vLLM, or customize them using LLM Compressor to unlock tailored performance and cost savings.

    Last updated: September 23, 2025

    Recent Posts

    • Installing Red Hat Enterprise Linux 10 from a bootc image with bootc

    • Why your database benchmarking data is probably wrong (and how I fixed mine)

    • Type what you want to break: AI-assisted chaos engineering with Krkn

    • Understanding evaluation collections in EvalHub

    • An overview of confidential containers on OpenShift bare metal

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.