The hidden cost of large language models

Why model optimization is essential

June 24, 2025
Mark Kurtz
Related topics: Artificial intelligence
Related products: Red Hat AI

    Large language models (LLMs) have become ubiquitous, fundamentally changing how we build products, work, and interact with technology. They are unlocking immense new capabilities in areas like content generation, coding, and customer support. However, beneath the excitement of their rapid advancements lies a significant, often hidden, cost: the economics of deploying these models.

    The primary challenge is the explosive growth in model size. Models like the Llama 4 herd are setting a new trend with foundation Mixture of Experts (MoE) models reaching up to 2 trillion total parameters. While models are growing rapidly, the memory capacity of accelerators like GPUs has increased only minimally, barely doubling in the past five years, with the latest B200 GPUs offering a maximum of 192 GB. Instead, the focus has shifted to networking multiple accelerators together, which unfortunately increases costs through additional GPU purchases and expensive high-bandwidth networking.

    Deploying these large models is becoming increasingly expensive. Simply fitting the parameters of a 109B parameter model requires three NVIDIA H100 GPUs with 80 GB each. A 400B model needs at least ten 80 GB GPUs. This requirement doesn't even account for the memory needed for things like KV cache storage and active requests, which can range from 2 GB for medium lengths to 25 GB for long context or reasoning models. Consequently, realistic deployments capable of handling a reasonable number of requests often necessitate full 8x 80 GB GPU servers: one for a 109B model and two for a 400B model. Even a 70B model might require up to four 80 GB GPUs. Only the smaller 8B parameter models can reliably fit onto a single 80 GB GPU, often at the expense of accuracy.
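
    To make the memory arithmetic concrete, the short Python sketch below reproduces the back-of-the-envelope sizing above. It assumes 16-bit weights and 80 GB accelerators; the KV cache and per-request overheads mentioned above come on top of these figures, and the numbers are illustrative estimates rather than vendor sizing guidance.

    import math

    # Back-of-the-envelope sizing sketch; illustrative only.
    BYTES_PER_PARAM = 2    # 16-bit (fp16/bf16) weights
    GPU_MEMORY_GB = 80     # H100 80 GB-class accelerator

    def weight_memory_gb(params_billions: float) -> float:
        """GB needed just to hold the weights at 16-bit precision."""
        return params_billions * BYTES_PER_PARAM

    def gpus_for_weights(params_billions: float) -> int:
        """Minimum accelerator count to fit the weights alone (no KV cache)."""
        return math.ceil(weight_memory_gb(params_billions) / GPU_MEMORY_GB)

    for size in (8, 70, 109, 400):
        print(f"{size:>4}B params: ~{weight_memory_gb(size):4.0f} GB of weights "
              f"-> at least {gpus_for_weights(size)} x {GPU_MEMORY_GB} GB GPUs")
    # KV cache and in-flight requests add roughly 2-25 GB on top of this,
    # which is why realistic deployments round up to full 8-GPU servers.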

    This presents a significant and costly proposition for any company choosing to deploy its own LLMs. Without careful planning, companies can face monthly deployment costs of tens of thousands of dollars for a single use case, and hundreds of thousands of dollars across multiple use cases. These costs can quickly erode the savings companies initially anticipated from adopting LLMs.

    Model compression: The essential solution

    The good news is that these high costs are not unavoidable. Model compression is the key answer. Compression techniques target and reduce multiple bottlenecks for inference, including compute bandwidth, memory bandwidth, and total memory footprint. The goal is to achieve this reduction without compromising the model's accuracy.

    Several techniques contribute to model compression and optimization:

    • Quantization leverages the inherent noise and redundancy in LLM training at the baseline 16-bit precision. It works by reducing the number of bits used to represent various model components, such as weights, activations, the key-value (KV) cache, and attention operations. This reduction decreases the memory and compute bandwidth required to run the model. Quantization alone can enable 2-4X faster deployments, all while using less hardware (a minimal weight-quantization sketch follows this list). Techniques like QLoRA extend quantization to post-training, making fine-tuning both cheaper and faster.
    • Pruning involves removing connections from the network entirely. This technique exploits redundancies built during training, where an initially large search space converges to the solution’s smaller optimization space. By removing connections, the technique reduces the number of parameters to store and skips compute for those connections. A 50% sparse model achieved through pruning can be 1.5 to 2 times faster or cheaper for inference (see this example using Sparse Llama).
    • Knowledge and data distillation shrink the model size by training smaller models. In knowledge distillation, a smaller model learns to mimic the behavior of a larger model by learning from its full output distributions. Data distillation involves generating high-quality synthetic datasets that enable smaller models to learn more effectively and efficiently. These distillation methods enable the deployment of models up to 10 times smaller while maintaining reasonable accuracy for specific tasks. They also significantly reduce fine-tuning and iteration times, allowing teams to move faster and spend less.
    • Speculative decoding takes a different approach by extending the model rather than reducing it, trading off extra compute for faster latency. It uses a smaller, faster "speculator" model to predict multiple tokens ahead. The larger, more accurate model then only verifies these predictions. This technique can cut inference latency anywhere from 2-4X.
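
    As a concrete illustration of the first technique, the following sketch applies simple per-channel symmetric int8 quantization to a single fp16 weight matrix. This is a minimal, educational example of the weight-memory effect only; production flows such as LLM Compressor calibrate on data and can also quantize activations, the KV cache, and attention, using more sophisticated algorithms.

    import torch

    # Minimal sketch of symmetric int8 weight quantization, for intuition only.
    # Real compression flows calibrate on data and cover more than weights;
    # this shows only the weight-memory effect.

    def quantize_int8(weight: torch.Tensor):
        """Per-output-channel symmetric quantization of a weight matrix."""
        w = weight.float()                                  # compute scales in fp32
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per output row
        q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)
        return q, scale

    def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return (q.float() * scale).half()

    w = torch.randn(4096, 4096, dtype=torch.float16)        # one layer's fp16 weights
    q, scale = quantize_int8(w)

    fp16_bytes = w.numel() * 2
    int8_bytes = q.numel() + scale.numel() * 4              # int8 weights + fp32 scales
    print(f"fp16: {fp16_bytes / 1e6:.1f} MB, int8: {int8_bytes / 1e6:.1f} MB "
          f"({fp16_bytes / int8_bytes:.1f}x smaller)")
    print("max abs error:", (w.float() - dequantize_int8(q, scale).float()).abs().max().item())

    Halving the bytes per weight also roughly halves the memory bandwidth needed to stream weights for each generated token, which is where much of the 2-4X serving speedup comes from.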

    Combining these techniques can lead to compounding gains. For instance, a 10X smaller, distilled model can be further quantized, enabling an additional 3X performance gain. Despite these significant benefits, over half of vLLM deployments today still run uncompressed models, resulting in compute inefficiencies. 

    Real-world cost examples

    Let's look at the potential cost savings using two common use cases: online retrieval-augmented generation (RAG) and offline summarization. These examples assume deployment on H100 80 GB GPU systems at $2.40 per GPU-hour; comparable A100 setups cost roughly 60% of these figures. (A rough cost model is sketched after the list below.)

    • Online RAG: This involves LLMs utilizing a knowledge base to respond to real-time queries, necessitating low latency. Prompts are often large (up to 10,000 tokens), while responses are typically shorter (a few hundred tokens).
      • A small startup handling 10,000 requests/day could spend $15,000/month for a 109B model or $30,000/month for a 400B model.
      • A large enterprise handling a million+ requests/day could face costs of $200,000/month for 109B and $400,000/month for 400B.
      • With just quantization, these costs can be reduced by one-third to one-half, yielding savings of roughly $5,000/month at the low end up to $130,000/month at the high end.
      • Further, distillation allows for swapping in a fine-tuned 8B model with similar accuracy to larger models, leading to an 8X cost reduction. Costs drop to $1,000/month or less (low end) and $30,000/month (high end). Quantizing the distilled model improves performance even further.
    • Offline summarization: This involves summarizing content such as reviews, where immediate updates are not required, allowing batch processing and scheduling for cheaper inference. Prompts are also large (up to a few thousand tokens), with short responses.
      • Maximum costs are reduced by roughly 30% compared to RAG thanks to the relaxed latency requirements, reaching $150,000/month at the highest end. The low-end cost remains similar, since a full server is still needed.
      • Utilizing distilled models and quantization offers similar scale benefits. Quantization provides a roughly 3X reduction, and distillation through fine-tuning allows deployment with an 8B model, further reducing costs.
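
    The figures above can be approximated with a simple always-on cost model. The sketch below uses this section's $2.40 per GPU-hour assumption and the server counts from the sizing discussion earlier in the article; the results are rough estimates rather than quotes or benchmarks.

    # Rough monthly-cost sketch: H100-class GPUs at $2.40 per GPU-hour,
    # reserved around the clock. Real bills will vary with utilization and traffic.

    HOURS_PER_MONTH = 24 * 30
    GPU_HOURLY_RATE = 2.40          # USD per 80 GB H100 GPU-hour

    def monthly_cost(num_gpus: int, utilization: float = 1.0) -> float:
        """Cost of keeping num_gpus reserved for a month at the given utilization."""
        return num_gpus * GPU_HOURLY_RATE * HOURS_PER_MONTH * utilization

    # One 8x 80 GB server for a 109B model, two for a 400B model (per the text).
    print(f"109B (1 x 8-GPU server):  ${monthly_cost(8):,.0f}/month")    # ~ $14,000
    print(f"400B (2 x 8-GPU servers): ${monthly_cost(16):,.0f}/month")   # ~ $28,000
    # Quantized or distilled variants serve the same traffic with fewer GPUs,
    # which is where the 2-8X cost reductions in the bullets above come from.

    These estimates land near the $15,000 and $30,000 per month small-startup figures quoted above; larger request volumes scale the server count, and with it the cost.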

    Conclusion

    Whether building summarization, chatbots, or other AI-driven applications, compressing models is crucial for achieving the best possible performance and cost savings. We recommend leveraging compression and fine-tuning technologies and deploying the resulting models in vLLM.

    Ready to explore further? Check out LLM Compressor for accurate model compression, InstructLab for easy model customization, Red Hat AI Hugging Face model repository for getting started with already-compressed models, and vLLM to become part of the community shaping AI inference.

    Last updated: June 30, 2025
