
Reduce LLM benchmarking costs with oversaturation detection

November 18, 2025
Alon Kellner
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    "Which large language model is best?"

    It's a simple question every developer and business is asking, but the answer is anything but simple. The "best" model might be the fastest, the least expensive, or the most accurate, and you rarely get all three. Choosing the right model, and the right hardware to run it on, has significant implications for cost and user experience. The entire goal of performance benchmarking is to answer one key question: Which setup provides the highest performance for the lowest cost?

    As we quickly learned, evaluating this performance isn't a single test; it's a complex, multi-dimensional puzzle. When my team from Jounce joined Red Hat, our first major task was to find a solution to that puzzle.

    We weren't just running a few tests. We were staring down a list of 7,488 potential combinations. This "combinatorial explosion" is the hard reality of benchmarking. To get a useful answer, we had to test every mix of the following:

    • 26 distinct LLM models
    • 4 different GPU types
    • 4 different GPU counts per machine
    • 9 different load levels (requests per second or RPS)
    • 2 different prompt profiles (RAG and chat)
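    As a sanity check, the size of that test matrix is simply the product of the option counts. A quick sketch (the concrete option values below are placeholders; only the counts come from the list above):

    ```python
    from itertools import product

    models = [f"model-{i}" for i in range(26)]        # 26 distinct LLM models
    gpu_types = ["gpu-A", "gpu-B", "gpu-C", "gpu-D"]  # 4 GPU types
    gpu_counts = [1, 2, 4, 8]                         # 4 GPU counts per machine (placeholder values)
    load_levels = list(range(1, 10))                  # 9 load levels (RPS, placeholder values)
    profiles = ["rag", "chat"]                        # 2 prompt profiles

    # Every combination of the above is one benchmark run.
    matrix = list(product(models, gpu_types, gpu_counts, load_levels, profiles))
    print(len(matrix))  # 7488
    ```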

    It was a daunting and incredibly expensive challenge. We successfully ran 4,506 of those tests—a monumental effort. Then, we checked the results and realized that more than half of our executed runs were rendered invalid. That's right: a 50% machine-time tax, because of a subtle problem we call oversaturation.

    The problem of oversaturation

    Oversaturation is when a server can't process the load of incoming requests, which causes a queue to build up. This results in the server taking progressively longer to start handling each request. When a performance benchmarking tool oversaturates an LLM inference server, the metrics it measures become significantly skewed, rendering them useless.

    Think of it like a cashier getting flustered during a sudden rush. As the line grows (the load), the cashier can’t keep up, the line gets longer, and there is no room for additional customers. This waste of costly machine time was heartbreaking.
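    The queue dynamic behind that analogy can be sketched with a toy simulation. The rates below are illustrative, not measurements from our benchmarks; the point is only that once arrivals outpace service, the backlog grows without bound:

    ```python
    # Toy model of oversaturation: when requests arrive faster than the server
    # can process them, the waiting queue grows without bound, so each new
    # request waits progressively longer before handling begins.

    def simulate(arrival_rate, service_rate, seconds):
        """Return the queue depth after each second, for constant rates."""
        queue = 0.0
        depths = []
        for _ in range(seconds):
            queue += arrival_rate               # new requests enqueue
            queue -= min(queue, service_rate)   # server drains what it can
            depths.append(queue)
        return depths

    # Undersaturated: the server keeps up, the queue stays empty.
    print(simulate(arrival_rate=8, service_rate=10, seconds=5))
    # [0.0, 0.0, 0.0, 0.0, 0.0]

    # Oversaturated: the queue grows linearly; waits keep increasing.
    print(simulate(arrival_rate=12, service_rate=10, seconds=5))
    # [2.0, 4.0, 6.0, 8.0, 10.0]
    ```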

    Our solution was beautifully simple in theory: If we could automatically perform oversaturation detection (OSD) and stop the runs before they finished, we could save a fortune. But as we discovered, OSD is rarely as straightforward as we hoped.

    Our stack: GuideLLM, vLLM, and JBenchmark

    To run 4,506 tests, we relied on a three-part stack:

    • vLLM (the engine): vLLM is our choice for the LLM inference server. As its repository describes, it's a fast and easy-to-use library for LLM inference and serving. vLLM is an open source, enterprise-oriented engine responsible for the heavy lifting of running LLMs.
    • GuideLLM (the measurer): GuideLLM is the tool that simulates real-world load and measures performance. Created by Neural Magic (now part of Red Hat), GuideLLM simulates real user load and records critical LLM-specific metrics like Time-to-First-Token (TTFT), Inter-Token Latency (ITL), and End-to-End Latency (E2E). Every test run produces a detailed report of these measured metrics.
    • JBenchmark (the orchestrator): JBenchmark is the (currently internal) Red Hat solution that intelligently manages our massive test matrix. It spins up the right cloud resources and spot instances, tells the engine (vLLM) and measurer (GuideLLM) which of the thousands of combinations to run, and ensures we get our results with maximum cost-efficiency.
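    Per request, those LLM-specific metrics boil down to simple arithmetic over token arrival timestamps. Here is a minimal sketch of what they mean (not GuideLLM's actual implementation; the function name and inputs are illustrative):

    ```python
    def llm_request_metrics(request_start, token_times):
        """Compute latency metrics for one streamed LLM response.

        request_start: time the request was sent (seconds)
        token_times:   arrival time of each streamed token (seconds)
        """
        # Time-to-First-Token: delay until the stream begins.
        ttft = token_times[0] - request_start
        # Inter-Token Latency: mean gap between consecutive tokens.
        gaps = [b - a for a, b in zip(token_times, token_times[1:])]
        itl = sum(gaps) / len(gaps) if gaps else 0.0
        # End-to-End Latency: delay until the final token arrives.
        e2e = token_times[-1] - request_start
        return {"ttft": ttft, "itl": itl, "e2e": e2e}

    metrics = llm_request_metrics(0.0, [0.25, 0.30, 0.35, 0.40])
    print(metrics)  # ttft = 0.25, itl ≈ 0.05, e2e = 0.40
    ```

    Under oversaturation, TTFT is typically the first of these to degrade, because new requests spend longer in the queue before the server starts generating at all.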

    Doesn't oversaturation detection already have a solution? 

    It's a great question. Why not use a standard load-testing tool?

    In most traditional load testing, OSD isn't a major focus. That's because most teams test one core production setup deeply. Their goal is depth, not breadth. The cost of those few tests is tiny compared to running the actual system.

    Our situation is fundamentally different. We have thousands of unique setups, each requiring a costly GPU machine. Because all we do is load testing, we don't have a production setup; our load testing costs are the entirety of our infrastructure costs.

    In addition, we are working exclusively with LLMs, which introduces unique serving characteristics:

    • Streaming HTTP requests (tokens are returned in a stream).
    • Very long requests (processing can take seconds or minutes).
    • Accelerated hardware (reliance on high-cost GPUs).

    By using LLM-specific metrics like TTFT and ITL, we could detect oversaturation more efficiently than traditional methods allow. However, finding a stable solution was not as easy as simply picking a new metric.

    Oversaturation detection is not trivial

    At any point during a performance benchmark, our OSD algorithm must use all available data to predict whether the load has reached oversaturation. If the prediction is positive, the benchmark terminates immediately. The challenge is rooted in time: True-Alerts should be raised as soon as possible, and False-Alerts must ideally never be raised. This makes OSD more akin to anomaly detection or survival analysis, where timing is everything.

    There is, however, a fundamental issue with the definition of oversaturation itself. What exactly does "keep up" mean in practice? The definition is highly volatile because it depends on two critical factors: the maximum throughput of the server and the variability in that throughput. In LLM serving, these factors can vary wildly. This means simple, static thresholds like "Alert when there are more than 1,000 concurrent requests" are totally inadequate.
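    To illustrate the alternative to a static threshold, here is a sketch of a relative, trend-based rule that alerts on sustained growth in TTFT rather than on any absolute level. This is only an illustration of the idea, not the algorithm we ultimately built; the window and patience parameters are arbitrary:

    ```python
    def sustained_growth_alert(ttft_samples, window=5, patience=3):
        """Alert if mean TTFT rises across `patience` consecutive windows.

        Compares each window's mean to the previous window's, so the rule
        adapts to whatever baseline TTFT a given setup happens to have.
        """
        means = [
            sum(ttft_samples[i:i + window]) / window
            for i in range(0, len(ttft_samples) - window + 1, window)
        ]
        rising = 0
        for prev, cur in zip(means, means[1:]):
            rising = rising + 1 if cur > prev else 0
            if rising >= patience:
                return True
        return False

    # Healthy run: TTFT hovers around a steady level -> no alert.
    steady = [0.2, 0.21, 0.19, 0.2, 0.2] * 6
    print(sustained_growth_alert(steady))   # False

    # Oversaturated run: TTFT grows as the queue builds up -> alert.
    growing = [0.2 + 0.05 * i for i in range(30)]
    print(sustained_growth_alert(growing))  # True
    ```

    Even a relative rule like this one has to be tuned against real runs; that tuning is exactly where the labeled dataset described below becomes essential.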

    This is a genuinely difficult problem where theory alone won't suffice. We need to employ proper data science practices. Luckily for us, after running those 4,506 costly benchmarks, we now happen to have data that is roughly equal parts oversaturated and undersaturated runs. A very useful coincidence, considering the price tag.

    Next steps

    This article discussed the costly challenge of oversaturation and established the need for a reliable, data-driven OSD solution. In part 2, the focus shifts to the foundational data science: how to evaluate the performance of an OSD algorithm through custom metrics, dataset labeling, and load augmentation techniques. In part 3, we'll walk through how we built the actual OSD algorithm. 

    Read part 2: Defining success: Evaluation metrics and data augmentation for oversaturation detection

    Last updated: November 24, 2025
