Large language models (LLMs) have become ubiquitous, fundamentally changing how we build products, work, and interact with technology. They are unlocking immense new capabilities in areas like content generation, coding, and customer support. However, beneath the excitement of their rapid advancements lies a significant, often hidden, cost: the economics of deploying these models.
The primary challenge is the explosive growth in model size. Model families like the Llama 4 herd continue the trend, with foundation Mixture of Experts (MoE) models reaching up to 2 trillion total parameters. While models are growing rapidly, the memory capacity of accelerators like GPUs has increased only minimally, barely doubling in the past five years, with the latest B200 GPUs topping out at 192 GB. Instead, the focus has shifted to networking multiple accelerators together, which drives up costs through additional GPU purchases and expensive high-bandwidth networking.
Deploying these large models is becoming increasingly expensive. Simply fitting the parameters of a 109B-parameter model at 16-bit precision requires three 80 GB NVIDIA H100s, and a 400B model needs at least ten. That doesn't account for the memory needed for KV cache storage and active requests, which can range from 2 GB for medium-length contexts to 25 GB for long-context or reasoning workloads. Consequently, realistic deployments capable of handling a reasonable number of concurrent requests often require full 8x80GB GPU servers: one for a 109B model and two for a 400B model. Even a 70B model might require up to four 80 GB GPUs. Only smaller 8B-parameter models reliably fit onto a single 80 GB GPU, often at the expense of accuracy.
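To see where these GPU counts come from, here is a back-of-the-envelope sketch (an illustration, not a sizing tool) that divides the 16-bit weight footprint, plus an optional KV-cache allowance, by an 80 GB card:

```python
# Back-of-the-envelope estimate of how many 80 GB GPUs are needed just to
# hold a model's weights at 16-bit precision (2 bytes per parameter), plus
# an optional KV-cache allowance. Real deployments also need headroom for
# activations and batching, so treat these as lower bounds.
import math

GPU_MEMORY_GB = 80  # for example, an NVIDIA H100 80 GB


def min_gpus(params_billions: float, kv_cache_gb: float = 0.0,
             bytes_per_param: int = 2) -> int:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes, expressed in GB
    return math.ceil((weights_gb + kv_cache_gb) / GPU_MEMORY_GB)


for size in (8, 109, 400):
    print(f"{size}B -> weights only: {min_gpus(size)} GPUs, "
          f"with a 25 GB KV cache: {min_gpus(size, kv_cache_gb=25)} GPUs")
```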
This presents a significant and costly proposition for any company choosing to deploy its own LLMs. Without careful planning, companies can face monthly deployment costs of tens of thousands of dollars for a single use case, and hundreds of thousands of dollars across multiple use cases. These costs can quickly erode the savings companies anticipated from adopting LLMs in the first place.
Model compression: The essential solution
The good news is that these high costs are not unavoidable. Model compression is the answer: compression techniques target and reduce multiple inference bottlenecks, including compute bandwidth, memory bandwidth, and total memory footprint, without compromising the model's accuracy.
Several techniques contribute to model compression and optimization:
- Quantization exploits the noise and redundancy already present in LLMs trained at the baseline 16-bit precision. It reduces the number of bits used to represent model components such as weights, activations, the key-value (KV) cache, and attention operations, which lowers the memory and compute bandwidth required to run the model. Quantization alone can enable 2-4X faster deployments while using less hardware, and techniques like QLoRA extend quantization into post-training, making fine-tuning both cheaper and faster. (A minimal weight-quantization sketch follows this list.)
- Pruning removes connections from the network entirely. It exploits redundancies built up during training, where an initially large parameter space converges to a much smaller solution space. Removing connections reduces the number of parameters to store and skips the compute associated with them. A 50% sparse model achieved through pruning can be 1.5 to 2 times faster or cheaper for inference (see this example using Sparse Llama).
- Knowledge and data distillation shrink deployments by training smaller models. In knowledge distillation, a smaller model learns to mimic a larger model by training against its full output distributions. Data distillation generates high-quality synthetic datasets that let smaller models learn more effectively and efficiently. These methods enable the deployment of models up to 10 times smaller while maintaining reasonable accuracy for specific tasks, and they significantly reduce fine-tuning and iteration times, allowing teams to move faster and spend less.
- Speculative decoding takes a different approach, extending the model rather than shrinking it and trading extra compute for lower latency. A smaller, faster "speculator" model predicts multiple tokens ahead, and the larger, more accurate model only verifies these predictions. This technique can cut inference latency by 2-4X. (A toy sketch follows this list.)
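To make the quantization bullet concrete, here is a minimal sketch of symmetric, per-tensor INT8 weight quantization in NumPy. It is illustrative only; production tools such as LLM Compressor use per-channel scales, calibration data, and schemes that also cover activations and the KV cache.

```python
# Minimal sketch of symmetric, per-tensor INT8 weight quantization.
# Illustrative only: production quantizers (for example, LLM Compressor)
# use per-channel scales, calibration data, and activation/KV-cache schemes.
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Map floating-point weights onto int8 with a single scale factor."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


w = np.random.randn(4096, 4096).astype(np.float32)  # one toy weight matrix
q, scale = quantize_int8(w)

print(f"16-bit size: {w.size * 2 / 1e6:.1f} MB")  # 2 bytes per weight
print(f"int8 size:   {q.size * 1 / 1e6:.1f} MB")  # 1 byte per weight
print(f"max abs reconstruction error: {np.abs(dequantize(q, scale) - w).max():.4f}")
```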
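And here is a toy sketch of greedy speculative decoding. The `draft_next` and `target_next` functions are hypothetical stand-ins for the speculator and target models; real implementations (for example, in vLLM) verify all of the drafted tokens in a single batched forward pass of the large model, which is where the latency win comes from, and may use probabilistic acceptance rather than exact matching.

```python
# Toy sketch of greedy speculative decoding. The output is identical to
# running the target model alone; the draft only changes how much work the
# target does per accepted token. `draft_next` and `target_next` are
# hypothetical stand-ins that map a token sequence to its next token.
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_decode(prompt: List[Token], draft_next: NextTokenFn,
                       target_next: NextTokenFn, k: int = 4,
                       max_new_tokens: int = 32) -> List[Token]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model speculates k tokens ahead.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target model verifies the draft, keeping the longest prefix
        #    it agrees with and emitting its own token at the first mismatch.
        #    (Real systems do these k checks in one batched forward pass.)
        accepted: List[Token] = []
        for i in range(k):
            expected = target_next(tokens + accepted)
            accepted.append(expected)
            if expected != draft[i]:
                break  # the rest of the draft is discarded
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens]


if __name__ == "__main__":
    def target(seq: List[Token]) -> Token:  # toy "large model"
        return (seq[-1] * 7 + 3) % 50

    def draft(seq: List[Token]) -> Token:   # toy "speculator", mostly right
        return target(seq) if seq[-1] % 5 else 0

    print(speculative_decode([1], draft, target, k=4, max_new_tokens=12))
```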
Combining these techniques can lead to compounding gains. For instance, a 10X smaller, distilled model can be further quantized, enabling an additional 3X performance gain. Despite these significant benefits, over half of vLLM deployments today still run uncompressed models, resulting in compute inefficiencies.
Real-world cost examples
Let's look at the potential cost savings using two common use cases: online retrieval-augmented generation (RAG) and offline summarization. These examples assume deployment on H100 80GB GPU systems at $2.40 per hour per GPU; A100 setups cost roughly 60% of these figures. (A back-of-the-envelope cost calculation follows the list below.)
- Online RAG: This involves LLMs using a knowledge base to answer real-time queries, which demands low latency. Prompts are often large (up to 10,000 tokens), while responses are typically shorter (a few hundred tokens).
  - A small startup handling 10,000 requests/day could spend $15,000/month for a 109B model or $30,000/month for a 400B model.
  - A large enterprise handling a million+ requests/day could face costs of $200,000/month for 109B and $400,000/month for 400B.
  - With quantization alone, these costs can be cut by one-third to one-half, saving roughly $5,000/month at the low end and up to $130,000/month at the high end.
  - Further, distillation allows swapping in a fine-tuned 8B model with accuracy similar to the larger models, an 8X cost reduction. Costs drop to $1,000/month or less at the low end and around $30,000/month at the high end, and quantizing the distilled model improves performance even further.
- Offline summarization: This involves summarizing content such as reviews, where immediate results are not required, so requests can be batched and scheduled for cheaper inference. Prompts are also large (up to a few thousand tokens), with short responses.
  - Maximum costs drop by roughly 30% compared to RAG thanks to relaxed latency requirements, reaching $150,000/month at the highest end. The lowest-end cost remains similar because a full server is still needed.
  - Distilled models and quantization offer savings at a similar scale: quantization provides a roughly 3X reduction, and distillation through fine-tuning allows deployment with an 8B model, reducing costs further.
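As a sanity check on these figures, the following sketch reproduces the basic arithmetic of GPUs per replica x replicas x $2.40 per GPU-hour x hours per month. The replica counts are illustrative assumptions, not measured capacity requirements:

```python
# Back-of-the-envelope serving cost: GPUs per replica x replicas x $/GPU-hour.
# The replica counts below are illustrative assumptions, not benchmarks.
GPU_HOURLY_USD = 2.40        # H100 80 GB, as assumed above
HOURS_PER_MONTH = 24 * 30


def monthly_cost(gpus_per_replica: int, replicas: int = 1) -> float:
    return gpus_per_replica * replicas * GPU_HOURLY_USD * HOURS_PER_MONTH


# One 8x80GB server for a 109B model (small startup scale).
print(f"109B, one 8-GPU server:  ${monthly_cost(8):,.0f}/month")   # ~$13,800

# Two 8x80GB servers for a 400B model.
print(f"400B, two 8-GPU servers: ${monthly_cost(16):,.0f}/month")  # ~$27,600

# A distilled, quantized 8B model on a single dedicated GPU; sharing or
# scheduling the GPU across workloads can bring this lower still.
print(f"8B, single GPU:          ${monthly_cost(1):,.0f}/month")   # ~$1,700
```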
Conclusion
Whether you are building summarization, chatbots, or other AI-driven applications, compressing models is crucial for achieving the best possible performance and cost savings. We recommend leveraging compression and fine-tuning technologies and deploying the resulting models in vLLM.
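For instance, serving an already-compressed model in vLLM is just a model-ID swap. The checkpoint name below is an illustrative assumption; browse the Red Hat AI Hugging Face repository for a compressed model that fits your use case.

```python
# Minimal vLLM example serving an already-compressed model. The model ID is
# illustrative -- substitute a compressed checkpoint that matches your needs.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16")

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Summarize the key trade-offs of quantizing an LLM for inference."],
    params,
)
print(outputs[0].outputs[0].text)
```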
Ready to explore further? Check out LLM Compressor for accurate model compression, InstructLab for easy model customization, Red Hat AI Hugging Face model repository for getting started with already-compressed models, and vLLM to become part of the community shaping AI inference.
Last updated: June 30, 2025