"Which large language model is best?"
It's a simple question every developer and business is asking, but the answer is anything but. The "best" model might be the one that is fastest, least expensive, or most accurate—and you rarely get all three. Choosing the right model, and the right hardware to run it on, has significant implications for cost and user experience. The entire goal of performance benchmarking is answering the key question: Which setup provides the highest performance for the lowest cost?
As we quickly learned, evaluating this performance isn't a single test; it's a complex, multi-dimensional puzzle. When my team from Jounce joined Red Hat, our first major task was to find a solution to that puzzle.
We weren't just running a few tests. We were staring down a list of 7,488 potential combinations. This "combinatorial explosion" is the hard reality of benchmarking. To get a useful answer, we had to test every mix of the following (the arithmetic is sketched right after this list):
- 26 distinct LLM models
- 4 different GPU types
- 4 different GPU counts per machine
- 9 different load levels (requests per second or RPS)
- 2 different prompt profiles (RAG and chat)
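Multiplying those dimensions out is all it takes to reach 7,488; here is a quick sketch (the dimension values are illustrative stand-ins, not our real model or GPU lists):

```python
from itertools import product

# Illustrative stand-ins for the real benchmark dimensions.
models      = [f"model-{i}" for i in range(26)]     # 26 distinct LLM models
gpu_types   = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]  # 4 GPU types
gpu_counts  = [1, 2, 4, 8]                          # 4 GPU counts per machine (example values)
load_levels = list(range(1, 10))                    # 9 load levels (RPS)
profiles    = ["rag", "chat"]                       # 2 prompt profiles

matrix = list(product(models, gpu_types, gpu_counts, load_levels, profiles))
print(len(matrix))  # 26 * 4 * 4 * 9 * 2 = 7,488
```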
It was a daunting and incredibly expensive challenge. We successfully ran 4,506 of those tests, a monumental effort. Then we checked the results and realized that more than half of those runs were invalid. That's right: a roughly 50% machine-time tax, all because of a subtle problem we call oversaturation.
The problem of oversaturation
Oversaturation occurs when a server can't keep up with the load of incoming requests, so a queue builds up and the server takes progressively longer to start handling each one. When a performance benchmarking tool oversaturates an LLM inference server, the metrics it measures become significantly skewed, rendering them useless.
Think of a cashier getting flustered during a sudden rush: customers arrive faster than they can be served, the line keeps growing, each new customer waits longer than the one before, and eventually there's no room for anyone else. In our case, every minute spent in that state was costly GPU time producing useless measurements, which was heartbreaking.
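To see why the measured numbers become useless, here is a minimal, deliberately simplified queue model (one worker, fixed service time, hypothetical numbers): as soon as requests arrive faster than the server can finish them, the time each new request waits before processing even starts grows without bound.

```python
# Minimal single-server queue: each request needs 1.0 s of processing.
# If requests arrive faster than that, the server falls further behind forever.
def wait_before_start(arrival_interval: float, service_time: float, n_requests: int) -> list[float]:
    server_free_at, waits = 0.0, []
    for i in range(n_requests):
        arrival = i * arrival_interval
        start = max(arrival, server_free_at)  # queue until the server is free
        waits.append(start - arrival)         # time spent waiting before processing starts
        server_free_at = start + service_time
    return waits

print(wait_before_start(1.2, 1.0, 50)[-1])  # undersaturated: the last wait is still 0 s
print(wait_before_start(0.8, 1.0, 50)[-1])  # oversaturated: ~9.8 s and still climbing
```

That ever-growing wait is exactly what ends up polluting every latency metric the benchmark records.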
Our solution was beautifully simple in theory: If we could automatically perform oversaturation detection (OSD) and cut oversaturated runs short the moment we spotted them, we could save a fortune. But as we discovered, OSD is not as straightforward as we hoped.
Our stack: vLLM, GuideLLM, and JBenchmark
To run 4,506 tests, we relied on a three-part stack:
- vLLM (the engine): vLLM is our choice of LLM inference server. As its repository describes, it's a fast and easy-to-use library for LLM inference and serving; this open source, enterprise-oriented engine does the heavy lifting of actually running the LLMs.
- GuideLLM (the measurer): GuideLLM is the tool that simulates real-world user load and measures performance. Created by Neural Magic (now part of Red Hat), it records critical LLM-specific metrics like Time-to-First-Token (TTFT), Inter-Token Latency (ITL), and End-to-End Latency (E2E), and every test run produces a detailed report of these measurements (a sketch of what one run's record might contain follows this list).
- JBenchmark (the orchestrator): JBenchmark is the (currently internal) Red Hat solution that intelligently manages our massive test matrix. It spins up the right cloud resources and spot instances, tells the engine (vLLM) and measurer (GuideLLM) which of the thousands of combinations to run, and ensures we get our results with maximum cost-efficiency.
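Conceptually, every cell of that test matrix becomes one configuration JBenchmark schedules and one GuideLLM report that comes back. A single run's record might look roughly like this (field names and values are hypothetical, not the actual JBenchmark or GuideLLM schema):

```python
# One benchmark run = one point in the test matrix plus the metrics measured for it.
# Every field name and number below is made up for illustration only.
run_record = {
    # The configuration under test (one combination out of 7,488).
    "model": "example-llm-8b",
    "gpu_type": "example-gpu",
    "gpu_count": 4,
    "target_rps": 5,
    "prompt_profile": "rag",   # or "chat"
    # The kind of metrics GuideLLM measures (values in seconds).
    "ttft_p50": 0.21,          # Time-to-First-Token
    "itl_p50": 0.032,          # Inter-Token Latency
    "e2e_p99": 4.7,            # End-to-End Latency
    # The label this whole series is about producing reliably.
    "oversaturated": False,
}
```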
Doesn't oversaturation detection already have a solution?
It's a great question. Why not use a standard load-testing tool?
In most traditional load testing, OSD isn't a major focus. That's because most teams test one core production setup deeply. Their goal is depth, not breadth. The cost of those few tests is tiny compared to running the actual system.
Our situation is fundamentally different. We have thousands of unique setups, each requiring a costly GPU machine. Because all we do is load testing, we don't have a production setup; our load testing costs are the entirety of our infrastructure costs.
In addition, we are working exclusively with LLMs, which introduces unique serving characteristics:
- Streaming HTTP requests (tokens are returned in a stream).
- Very long requests (processing can take seconds or minutes).
- Accelerated hardware (reliance on high-cost GPUs).
By using LLM-specific metrics like Time-to-First-Token (TTFT) and Inter-Token Latency (ITL), we could detect oversaturation more efficiently than traditional methods. However, finding a stable solution was not as easy as simply picking a new metric.
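To ground those metrics, here is how the three quantities can be derived from the wall-clock timestamps of a single streamed response (a simplified sketch; GuideLLM itself handles percentiles, warm-up, and much more):

```python
# Derive per-request LLM metrics from a streamed response (simplified sketch).
# request_sent: when the request left the client; token_times: arrival time of each token.
def llm_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    ttft = token_times[0] - request_sent                  # Time-to-First-Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0          # mean Inter-Token Latency
    e2e = token_times[-1] - request_sent                  # End-to-End Latency
    return {"ttft": ttft, "itl": itl, "e2e": e2e}

print(llm_metrics(0.0, [0.20, 0.25, 0.31, 0.36]))  # healthy-looking request
print(llm_metrics(0.0, [6.40, 6.45, 6.51, 6.56]))  # queued for seconds before the first token
```

In the second (hypothetical) request, the queueing delay described earlier lands entirely in front of the first token, which is part of why TTFT is such a useful signal for this problem.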
Oversaturation detection is not trivial
At any point during a performance benchmark, our OSD algorithm must use all available data to predict whether the load has reached oversaturation. If the prediction is positive, the benchmark terminates immediately. The challenge is rooted in time: True-Alerts should be raised as soon as possible, and False-Alerts must ideally never be raised. This makes OSD more akin to anomaly detection or survival analysis, where timing is everything.
There is, however, a fundamental issue with the definition of oversaturation itself. What exactly does "keep up" mean in practice? The answer shifts from setup to setup because it depends on two critical factors: the maximum throughput of the server and the variability in that throughput. In LLM serving, both can vary wildly. This means simple, static thresholds like "Alert when there are more than 1,000 concurrent requests" are totally inadequate.
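As an illustration only (this is not the algorithm we ended up building; part 3 walks through that), even a toy detector that compares recent TTFT samples against a baseline taken early in the same run sidesteps the hard-coded threshold problem, because it adapts to whatever throughput the current model-and-hardware combination can actually deliver:

```python
# Toy illustration only: alert on relative TTFT degradation within the current run,
# instead of a fixed absolute threshold that can't know this setup's capacity.
# The window sizes and the 3x factor are arbitrary; this is not the real OSD algorithm.
def looks_oversaturated(ttft_samples: list[float],
                        baseline_n: int = 50,
                        window_n: int = 20,
                        factor: float = 3.0) -> bool:
    if len(ttft_samples) < baseline_n + window_n:
        return False  # not enough data to judge yet
    baseline = sorted(ttft_samples[:baseline_n])[baseline_n // 2]  # early-run median TTFT
    recent = sorted(ttft_samples[-window_n:])[window_n // 2]       # recent median TTFT
    return recent > factor * baseline  # alert only on relative degradation
```

Even this toy version exposes the core tension: a small window reacts quickly but risks false alerts from ordinary variability, while a large window is safer but lets an oversaturated run keep burning GPU time.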
This is a genuinely difficult problem where theory alone won't suffice. We need to employ proper data science practices. Luckily for us, after running those 4,506 costly benchmarks, we now happen to have data that is roughly equal parts oversaturated and undersaturated runs. A very useful coincidence, considering the price tag.
Next steps
This article discussed the costly challenge of oversaturation and established the need for a reliable, data-driven OSD solution. In part 2, the focus shifts to the foundational data science: how to evaluate the performance of an OSD algorithm through custom metrics, dataset labeling, and load augmentation techniques. In part 3, we'll walk through how we built the actual OSD algorithm.