Reduce LLM benchmarking costs with oversaturation detection

November 18, 2025
Alon Kellner
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    "Which large language model is best?"

    It's a simple question every developer and business is asking, but the answer is anything but. The "best" model might be the one that is fastest, least expensive, or most accurate, and you rarely get all three. Choosing the right model, and the right hardware to run it on, has significant implications for cost and user experience. The entire goal of performance benchmarking is to answer one key question: which setup provides the highest performance for the lowest cost?

    As we quickly learned, evaluating this performance isn't a single test; it's a complex, multi-dimensional puzzle. When my team from Jounce joined Red Hat, our first major task was to find a solution to that puzzle.

    We weren't just running a few tests. We were staring down a list of 7,488 potential combinations. This "combinatorial explosion" is the hard reality of benchmarking. To get a useful answer, we had to test every mix of the following (the sketch after this list shows how quickly those counts multiply):

    • 26 distinct LLM models
    • 4 different GPU types
    • 4 different GPU counts per machine
    • 9 different load levels (requests per second or RPS)
    • 2 different prompt profiles (RAG and chat)
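
    To make the scale concrete, here is a minimal sketch of how those counts multiply. The model and GPU names are placeholders; only the counts match our matrix:

        from itertools import product

        # Placeholder names; only the counts match the matrix described above.
        models = [f"model-{i}" for i in range(26)]        # 26 distinct LLM models
        gpu_types = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]  # 4 GPU types
        gpu_counts = [1, 2, 4, 8]                         # 4 GPU counts per machine (hypothetical values)
        load_levels = list(range(1, 10))                  # 9 load levels (RPS, hypothetical values)
        prompt_profiles = ["rag", "chat"]                 # 2 prompt profiles

        matrix = list(product(models, gpu_types, gpu_counts, load_levels, prompt_profiles))
        print(len(matrix))  # 26 * 4 * 4 * 9 * 2 = 7,488 combinations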

    It was a daunting and incredibly expensive challenge. We successfully ran 4,506 of those tests—a monumental effort. Then, we checked the results and realized that more than half of our executed runs were rendered invalid. That's right: a 50% machine-time tax, because of a subtle problem we call oversaturation.

    The problem of oversaturation

    Oversaturation is when a server can't process the load of incoming requests, which causes a queue to build up. This results in the server taking progressively longer to start handling each request. When a performance benchmarking tool oversaturates an LLM inference server, the metrics it measures become significantly skewed, rendering them useless.

    Think of it like a cashier getting flustered during a sudden rush: as the line (the load) grows, the cashier can't keep up, the line gets longer, and eventually there is no room for additional customers. Every benchmark run that reached this state was wasted, costly machine time, which was heartbreaking.
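
    A toy simulation makes this concrete (this isn't our benchmark code, and the rates are made up): once requests arrive faster than the server can finish them, the backlog, and with it the wait before a new request is even started, grows every second.

        arrival_rps = 12.0   # incoming requests per second (made-up rate)
        service_rps = 10.0   # requests the server can finish per second (made-up rate)

        queue = 0.0
        for second in range(1, 11):
            queue = max(0.0, queue + arrival_rps - service_rps)  # backlog after this second
            wait = queue / service_rps  # how long a newly arrived request waits before it is handled
            print(f"t={second:2d}s  queue={queue:4.0f}  wait={wait:.1f}s")

    The load is constant, yet the wait keeps climbing, and it is exactly that climb that poisons the measured metrics.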

    Our solution was beautifully simple in theory: if we could automatically perform oversaturation detection (OSD) and stop oversaturated runs early, we could save a fortune. But as we discovered, OSD is not as straightforward as we hoped.

    Our stack: GuideLLM, vLLM, and JBenchmark

    To run 4,506 tests, we relied on a three-part stack:

    • vLLM (the engine): vLLM is our choice for the LLM inference server. As its repository describes, it's a fast and easy-to-use library for LLM inference and serving. vLLM is an open source, enterprise-oriented engine responsible for the heavy lifting of running LLMs.
    • GuideLLM (the measurer): GuideLLM is the tool that simulates real-world user load and measures performance. Created by Neural Magic (now part of Red Hat), it records critical LLM-specific metrics like Time-to-First-Token (TTFT), Inter-Token Latency (ITL), and End-to-End Latency (E2E). Every test run produces a detailed report of these measured metrics (the sketch after this list shows how metrics like these are derived from token timestamps).
    • JBenchmark (the orchestrator): JBenchmark is the (currently internal) Red Hat solution that intelligently manages our massive test matrix. It spins up the right cloud resources and spot instances, tells the engine (vLLM) and measurer (GuideLLM) which of the thousands of combinations to run, and ensures we get our results with maximum cost-efficiency.
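
    To make those metric names concrete, here is a rough sketch of how they can be derived from the timestamps at which streamed tokens arrive. This is an illustration under our own simplified assumptions, not GuideLLM's internals:

        def llm_latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
            """token_times: absolute arrival times (seconds) of each streamed token."""
            ttft = token_times[0] - request_sent                   # Time-to-First-Token
            gaps = [b - a for a, b in zip(token_times, token_times[1:])]
            itl = sum(gaps) / len(gaps) if gaps else 0.0           # mean Inter-Token Latency
            e2e = token_times[-1] - request_sent                   # End-to-End Latency
            return {"ttft": ttft, "itl": itl, "e2e": e2e}

        # Example: request sent at t=0, first token at 0.35 s, last token at 0.57 s
        print(llm_latency_metrics(0.0, [0.35, 0.40, 0.46, 0.51, 0.57]))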

    Doesn't oversaturation detection already have a solution? 

    It's a great question. Why not use a standard load-testing tool?

    In most traditional load testing, OSD isn't a major focus. That's because most teams test one core production setup deeply. Their goal is depth, not breadth. The cost of those few tests is tiny compared to running the actual system.

    Our situation is fundamentally different. We have thousands of unique setups, each requiring a costly GPU machine. Because all we do is load testing, we don't have a production setup; our load testing costs are the entirety of our infrastructure costs.

    In addition, we are working exclusively with LLMs, which introduces unique serving characteristics:

    • Streaming HTTP requests (tokens are returned in a stream).
    • Very long requests (processing can take seconds or minutes).
    • Accelerated hardware (reliance on high-cost GPUs).

    By using LLM-specific metrics like TTFT and ITL, we could detect oversaturation more efficiently than traditional methods can. However, finding a stable solution was not as easy as simply picking a new metric.

    Oversaturation detection is not trivial

    At any point during a performance benchmark, our OSD algorithm must use all available data to predict whether the load has reached oversaturation. If the prediction is positive, the benchmark terminates immediately. The challenge is rooted in time: True-Alerts should be raised as soon as possible, and False-Alerts must ideally never be raised. This makes OSD more akin to anomaly detection or survival analysis, where timing is everything.

    There is, however, a fundamental issue with the definition of oversaturation itself. What exactly does "keep up" mean in practice? The definition is highly volatile because it depends on two critical factors: the maximum throughput of the server and the variability in that throughput. In LLM serving, these factors can vary wildly. This means simple, static thresholds like "Alert when there are more than 1,000 concurrent requests" are totally inadequate.
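
    To illustrate what an adaptive check looks like, here is a deliberately naive sketch (not the algorithm we actually built; that is the subject of part 3) that compares recent latency against a baseline measured at the start of the same run, rather than against any fixed number:

        def naive_osd(ttft_history: list[float], window: int = 5, growth: float = 1.5) -> bool:
            """Alert when recent TTFTs are consistently much higher than the run's own baseline."""
            if len(ttft_history) < 2 * window:
                return False  # too little data: never alert on startup noise alone
            baseline = sum(ttft_history[:window]) / window  # early, presumably healthy TTFTs
            recent = sum(ttft_history[-window:]) / window   # latest TTFTs
            return recent > growth * baseline               # sustained growth hints at a building queue

    Even this toy rule exposes the core trade-off: a smaller growth threshold raises true alerts sooner, but it also raises more false alerts, which is exactly the tension the rest of this series addresses.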

    This is a genuinely difficult problem where theory alone won't suffice. We need to employ proper data science practices. Luckily for us, after running those 4,506 costly benchmarks, we now happen to have data that is roughly equal parts oversaturated and undersaturated runs. A very useful coincidence, considering the price tag.

    Next steps

    This article discussed the costly challenge of oversaturation and established the need for a reliable, data-driven OSD solution. In part 2, we'll cover the foundational data science: how to evaluate the performance of an OSD algorithm through custom metrics, dataset labeling, and load augmentation techniques. In part 3, we'll walk through how we built the actual OSD algorithm.
