EvalHub: Because "looks good to me" isn't a benchmark

Your team has spent the past three months building a customer support assistant with retrieval-augmented generation (RAG). It handled every test query you threw at it in the demo. Leadership is excited. The launch date is set.

Then it goes live. Users complain that answers about product returns are confidently wrong. The legal team flags a response citing a policy that was deprecated 18 months ago. The model is fast, articulate, and unambiguously unreliable.

What went wrong? Most likely, nothing catastrophic and nothing subtle. The evaluation process simply never asked the right questions. "Looks good to me" and a handful of manual spot checks stood in for systematic measurement. The team optimized for the demo, not for the deployment.

This is not a story about one team's bad luck. It is the dominant pattern in enterprise AI development today, with five distinct structural causes.

Series note

This is the second post in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:

Part 1: How EvalHub manages two-layer Kubernetes control planes
Part 2: EvalHub: Because "looks good to me" isn't a benchmark
Part 3: Evaluation-driven development with EvalHub
Part 4: Understanding evaluation collections in EvalHub
Part 5: Bring your own evaluation framework to EvalHub
Part 6: Add automated AI evaluations to your CI/CD pipeline
Part 7: Store immutable AI evaluation records with EvalHub and OCI
Part 8: Manage LLM evaluation workloads at scale with EvalHub and Kueue
Part 9: Connect EvalHub to protected production model servers

Five problems that break AI evaluation at scale

Enterprise AI development currently faces five primary structural challenges that hinder effective measurement and deployment.

1. The tooling fragmentation problem

A typical enterprise AI team evaluating a new model or RAG pipeline might use several tools: EleutherAI's LM Evaluation Harness for capability benchmarks, RAGAS for retrieval quality, Garak for red-teaming and safety probes, and GuideLLM for throughput and latency profiling. Often, different team members run these tools separately and store results in disconnected locations. Stitching these together into a coherent picture requires custom scripts and manual data wrangling. The burden of maintaining this evaluation infrastructure often falls on evaluators, diverting their focus from their primary roles.

The operational cost is real: evaluation runs are skipped under deadline pressure, results are not compared across model versions because the formats are incompatible, and the institutional knowledge of "what we ran last time" lives in one engineer's notebook.

2. The "What should I measure?" problem

Generic evaluation tools are built around academic benchmarks: MMLU, ARC, HellaSwag, HumanEval. These are valuable, but they answer the wrong question for most enterprise use cases. A healthcare LLM needs to be evaluated on clinical reasoning accuracy and regulatory compliance, not on general trivia. A multilingual customer service agent needs a breakdown by language, not a single aggregate accuracy score.

The result is what might be called the vanity metric trap: teams report high scores on benchmarks that do not predict real-world performance in their specific domain, and ship systems that fail in the exact ways the benchmark did not measure.
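To make the aggregate-versus-breakdown point concrete, here is a minimal sketch with made-up results for a multilingual assistant. The language codes and counts are invented for illustration and do not come from any real evaluation.

```python
from collections import defaultdict

# Hypothetical per-example results: (language, answered_correctly).
results = (
    [("en", True)] * 450 + [("en", False)] * 50
    + [("de", True)] * 90 + [("de", False)] * 10
    + [("ja", True)] * 12 + [("ja", False)] * 28
)

def accuracy_by_group(rows):
    """Return overall accuracy and accuracy per group label."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, ok in rows:
        totals[group] += 1
        correct[group] += int(ok)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {group: correct[group] / totals[group] for group in totals}
    return overall, per_group

overall, per_language = accuracy_by_group(results)
print(f"overall: {overall:.2f}")
for lang, acc in sorted(per_language.items()):
    print(f"{lang}: {acc:.2f}")
```

With these invented numbers the single aggregate lands around 0.86, while the Japanese slice sits at 0.30. A report that shows only the first number hides the failure that users of that language would hit first.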

3. The reproducibility crisis

AI model behavior is sensitive to hardware, driver versions, quantization settings, batch sizes, and prompt templates. A benchmark score produced on an A100 cluster might not reproduce on a T4 node. Benchmark results from last quarter might not reproduce today if the model's serving configuration has changed.

Without a systematic way to capture and attach environment metadata (such as hardware specs, software versions, and model configurations) to evaluation results, benchmark scores are claims rather than evidence. This matters enormously for regulated industries. The EU AI Act and FedRAMP require documented, reproducible evidence of model behavior, not a dashboard screenshot.
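As a rough illustration of attaching environment metadata to a result, the sketch below uses only the Python standard library. The function names, the record layout, and the extra keys (accelerator, quantization, batch_size) are assumptions made for this example. They are not EvalHub's schema.

```python
import json
import platform
import sys
from datetime import datetime, timezone

def capture_environment(extra=None):
    """Collect basic facts about the machine running the evaluation."""
    env = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }
    # Facts the stdlib cannot see (GPU model, driver, quantization, batch
    # size) must be supplied by the caller; these keys are illustrative.
    env.update(extra or {})
    return env

def attach_environment(scores, env):
    """Bundle scores with the environment that produced them."""
    return {"scores": dict(scores), "environment": env}

record = attach_environment(
    {"example_benchmark": 0.71},  # hypothetical score
    capture_environment(
        {"accelerator": "A100", "quantization": "fp16", "batch_size": 8}
    ),
)
print(json.dumps(record, indent=2, sort_keys=True))
```

The point of the design is that the score never travels alone: anyone reading the record later can see which hardware and settings produced it before trying to reproduce it.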

4. The documentation and accessibility gap

Evaluation knowledge tends to scatter: MLflow experiments with results no one outside the ML team can query, model cards written once and never updated, and benchmark configurations in READMEs that silently go stale. The consequence is that decisions about which model to deploy, which component to tune, or which risk to accept are made on the basis of tribal knowledge rather than traceable evidence.

This is not a data quality problem. It is a workflow problem: there is no standard, organization-wide format for evaluation reports, no single place to look, and no automated path from a completed evaluation run to a governance-ready artifact.

5. The dev-to-enterprise gap

A developer can run a quick evaluation from a Python script on their laptop in minutes. Moving that same evaluation to a Kubernetes cluster for reproducible, production-scale runs—with proper concurrency management, resource quotas, experiment tracking, and structured results—requires either a platform engineering investment or months of glue code. Most teams never bridge this gap. Their evaluation capability stays at the laptop-script level even as their model deployments scale to production traffic.

The gap means that evaluation remains a manual, developer-local activity that does not fit into CI/CD pipelines, does not run at the scale of production data, and does not produce the governance artifacts the enterprise actually needs.

Introducing EvalHub

EvalHub is the Red Hat AI unified foundation for AI evaluation. It directly addresses these five problems through a single orchestration control plane and a set of deliberately designed primitives.

A single orchestration layer for every framework

EvalHub uses a REST API server written in Go and deployed on Kubernetes via the TrustyAI operator. It routes evaluation requests to supported backends without the calling code needing to know which framework is actually running. The default provider set, declared in YAML configuration files, includes:

lm-evaluation-harness: 167 benchmarks for capability, reasoning, and knowledge evaluation
Garak: Red-teaming, safety probes, and toxicity detection
GuideLLM: Throughput, latency, and infrastructure profiling
LightEval: Fast, lightweight capability benchmarks
MTEB: Massive text embedding benchmarks

A single POST request to /api/v1/evaluations can fan out to multiple backends in parallel, aggregate the results with configurable weights, and write the full experiment record to MLflow without requiring the caller to manage any of that complexity.
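The fan-out-and-aggregate idea can be sketched in a few lines of plain Python. This is not EvalHub's server code (which is written in Go). The backend functions, their fixed scores, and the weights below are all stand-ins invented so the example runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in backends. Real ones would call an evaluation framework;
# these return fixed, made-up scores in [0, 1].
def capability_backend(model_url: str) -> float:
    return 0.80

def safety_backend(model_url: str) -> float:
    return 0.60

def performance_backend(model_url: str) -> float:
    return 0.90

BACKENDS = {
    "capability": capability_backend,
    "safety": safety_backend,
    "performance": performance_backend,
}
WEIGHTS = {"capability": 0.5, "safety": 0.3, "performance": 0.2}  # illustrative

def fan_out(model_url, backends, weights):
    """Run every backend concurrently, then combine scores as a weighted mean."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {name: pool.submit(fn, model_url) for name, fn in backends.items()}
        scores = {name: future.result() for name, future in futures.items()}
    total_weight = sum(weights[name] for name in scores)
    aggregate = sum(scores[name] * weights[name] for name in scores) / total_weight
    return aggregate, scores

aggregate, scores = fan_out("http://vllm-service:8080/v1", BACKENDS, WEIGHTS)
print(f"aggregate: {aggregate:.2f}")
print(scores)
```

Dividing by the total weight means the weights do not have to sum to one, which keeps the aggregate meaningful if a backend is added or removed.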

Adding a new framework does not require a change to the EvalHub server: you extend the FrameworkAdapter base class from the EvalHub SDK, implement a single run_benchmark_job method, and register the provider via a ConfigMap entry. EvalHub handles scheduling, status reporting, and result aggregation from that point forward.
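The shape of that adapter pattern looks roughly like the sketch below. To be clear about what is assumed: the base class here is a local stand-in defined in the example, not the SDK's FrameworkAdapter, and the job and result dictionaries are hypothetical. Only the method name run_benchmark_job comes from the description above; the real signature may differ, so check the SDK before copying anything.

```python
from abc import ABC, abstractmethod

class FrameworkAdapterStandIn(ABC):
    """Local stand-in for the SDK base class. NOT the real evalhub API."""

    @abstractmethod
    def run_benchmark_job(self, job: dict) -> dict:
        """Run one benchmark job and return its results."""

class ExactMatchAdapter(FrameworkAdapterStandIn):
    """Toy 'framework' that scores predictions by exact string match."""

    def run_benchmark_job(self, job: dict) -> dict:
        # Hypothetical job field: a list of (prediction, reference) pairs.
        pairs = job["examples"]
        hits = sum(1 for pred, ref in pairs if pred.strip() == ref.strip())
        return {
            "benchmark_id": job["benchmark_id"],
            "metrics": {"exact_match": hits / len(pairs)},
            "num_examples": len(pairs),
        }

job = {
    "benchmark_id": "toy_exact_match",  # invented identifier
    "examples": [("Paris", "Paris"), ("Rome", "Milan"), ("42 ", "42"), ("yes", "no")],
}
result = ExactMatchAdapter().run_benchmark_job(job)
print(result)
```

The adapter owns only the measurement. Everything around it (queuing, status, aggregation) stays with the orchestrator, which is what keeps a new framework from touching server code.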

The EvalHub SDK packages four capabilities alongside the adapter:

evalhub.client: Sync and async typed REST clients for submitting jobs and querying providers, benchmarks, and collections from Python code or notebooks.
evalhub.cli: A fully featured evalhub CLI with commands for running evaluations, watching status, retrieving results, managing collections and providers, and checking service health; supports YAML/JSON config files, multi-profile configuration, and --wait/--watch flags for blocking runs.
evalhub.mcp (dev preview): A FastMCP-based MCP server exposing nine browsable resources and two action tools, started via evalhub mcp; lets AI agents and coding assistants invoke evaluations and retrieve results via the Model Context Protocol.
OCI artifact persistence (via evalhub.adapter): Evaluation results are pushed to an OCI registry using olot and oras, with SHA256-derived tags and full annotation support; in Kubernetes mode, the sidecar handles registry authentication transparently.
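To show why a SHA256-derived tag is useful for persisted results, here is a minimal sketch of a content-derived tag. The canonicalization, the "sha256-" prefix, and the 12-character truncation are assumptions for illustration. The scheme EvalHub actually uses may be different.

```python
import hashlib
import json

def content_tag(results: dict, length: int = 12) -> str:
    """Derive a short, deterministic tag from the content of a results dict."""
    # Sorting keys and fixing separators gives the same bytes for the same
    # content, regardless of the order the keys were inserted in.
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256-{digest[:length]}"

run_a = {"model": "example-model", "scores": {"safety": 0.91, "reasoning": 0.82}}
run_b = {"scores": {"reasoning": 0.82, "safety": 0.91}, "model": "example-model"}
run_c = {"model": "example-model", "scores": {"safety": 0.90, "reasoning": 0.82}}

print(content_tag(run_a))
print(content_tag(run_a) == content_tag(run_b))  # same content, different key order
print(content_tag(run_a) == content_tag(run_c))  # one score changed
```

Identical results always map to the same tag, and any edit to a score produces a different one, so a tag mismatch is a cheap signal that the stored evidence is not the evidence that was originally recorded.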

Evaluation collections: Answering "What should I measure?"

The platform addresses the domain-specificity problem through evaluation collections. These are named, versioned, expert-curated sets of benchmarks that are framework-neutral and tailored to specific verticals and use cases.

A collection is declared with a list of benchmarks, their providers, and their relative weights. A healthcare_safety_v1 collection, for example, might include clinical reasoning benchmarks from lm-evaluation-harness, safety probes from Garak, and RAG-groundedness metrics from RAGAS, each weighted according to the use case's risk profile.

Calling a collection is as simple as including its ID in an evaluation request:

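# EVALHUB_URL is a placeholder for the address of your EvalHub server.
curl -X POST "$EVALHUB_URL/api/v1/evaluations" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3-v2-healthcare-eval",
    "model": { "url": "http://vllm-service:8080/v1" },
    "collection": { "id": "healthcare_safety_v1" },
    "experiment": { "name": "llama-3-v2-healthcare-eval" }
  }'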
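The same request can be assembled from Python with nothing but the standard library. This sketch deliberately avoids the EvalHub SDK client, whose exact interface is not shown in this post. The base URL is a placeholder, and any authentication your deployment requires is left out.

```python
import json
import urllib.request

def build_evaluation_request(base_url, name, model_url, collection_id):
    """Build (but do not send) a POST request for a collection evaluation."""
    payload = {
        "name": name,
        "model": {"url": model_url},
        "collection": {"id": collection_id},
        "experiment": {"name": name},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/evaluations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_evaluation_request(
    base_url="http://localhost:8080",  # placeholder EvalHub address
    name="llama-3-v2-healthcare-eval",
    model_url="http://vllm-service:8080/v1",
    collection_id="healthcare_safety_v1",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To submit it, pass req to urllib.request.urlopen() against a reachable server.
```

Building the request separately from sending it makes the payload easy to inspect or unit test before it ever reaches a cluster.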

EvalHub looks up the collection, extracts the benchmarks, and groups them by provider to improve execution. It then returns a weighted aggregate score and a breakdown for each benchmark. The collection encodes organizational evaluation knowledge in a shareable, versionable, reusable form.
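The grouping step is easy to picture with a small sketch. The collection layout and benchmark IDs below are invented for illustration; only the collection ID and the provider names echo the examples in this post, and the real collection schema may look different.

```python
from collections import defaultdict

# Hypothetical collection declaration: benchmarks, providers, weights.
collection = {
    "id": "healthcare_safety_v1",
    "benchmarks": [
        {"id": "clinical_reasoning", "provider": "lm-evaluation-harness", "weight": 0.5},
        {"id": "safety_probes", "provider": "garak", "weight": 0.3},
        {"id": "clinical_qa", "provider": "lm-evaluation-harness", "weight": 0.2},
    ],
}

def group_by_provider(collection):
    """Map each provider to the benchmark IDs it is responsible for."""
    groups = defaultdict(list)
    for benchmark in collection["benchmarks"]:
        groups[benchmark["provider"]].append(benchmark["id"])
    return dict(groups)

groups = group_by_provider(collection)
for provider, benchmark_ids in groups.items():
    print(provider, benchmark_ids)
```

Grouping this way means each provider can be handed all of its benchmarks in one go rather than being invoked once per benchmark.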

Built-in reproducibility and governance

Every evaluation run is automatically tracked in MLflow with a structured ExperimentConfig that captures the experiment name, tags (environment, model family, collection version, and so on), and the full configuration used to produce the results. This provides teams with a queryable, historical record of every evaluation run, which serves as the foundation for regression tracking, model comparison, and governance reporting.
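Once runs are recorded consistently, regression tracking reduces to comparing two records. The sketch below assumes each run has already been fetched as a plain mapping of benchmark name to score. The names, scores, and tolerance are invented, and no MLflow call is made here.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return benchmarks whose score dropped by more than `tolerance`."""
    regressions = {}
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            # Benchmark missing from the candidate run; not comparable here.
            continue
        drop = old_score - new_score
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions

baseline = {"clinical_reasoning": 0.82, "safety_probes": 0.91}   # hypothetical
candidate = {"clinical_reasoning": 0.84, "safety_probes": 0.85}  # hypothetical

regressions = find_regressions(baseline, candidate)
print(regressions)
```

In this made-up comparison the candidate improves on reasoning but loses six points on safety probes, which is exactly the kind of trade-off an aggregate score can mask and a per-benchmark history exposes.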

The OCI artifact persistence feature establishes a tamper-evident connection between a model artifact and its deployment justification. This is achieved by directly embedding evaluation results and traces as metadata within the OCI ModelCar images, ensuring that the evaluation evidence is persistently linked to the deployed model.

From laptop to cluster without friction

The EvalHub SDK provides the same evaluation interface whether you are running a quick check in a Python notebook, wiring evaluations into a CI/CD pipeline, or scheduling a production evaluation run on an OpenShift cluster. The server translates structured API calls into Kubernetes primitives to manage execution at scale. This includes Kueue-based resource quota enforcement, pod scheduling, and status tracking through custom resources.

A developer does not need to understand Kubernetes to run an evaluation. A platform team does not need to build custom evaluation infrastructure. The same EvalHub instance serves both.

What EvalHub is not

EvalHub is not a benchmark leaderboard. It is not a replacement for framework-specific tools like lm-evaluation-harness or RAGAS. Those tools do the actual measurement work; EvalHub orchestrates them, routes between them, tracks their outputs, and makes their results governable.

EvalHub is still evolving quickly. Its aim is to address the five challenges described above more comprehensively than any single evaluation framework can, with a server that routes to live backends, tracks real experiments, and deploys on real Kubernetes clusters.

Getting started

EvalHub is open source under the Apache 2.0 license. You can access the EvalHub server directly on GitHub. The Python SDK includes the evalhub CLI, REST client, BYOF adapter, MCP server, and OCI artifact persistence. Use the TrustyAI operator to manage its Kubernetes deployment.

For teams already running Open Data Hub or Red Hat OpenShift AI, EvalHub deploys as a component of the TrustyAI stack and requires no separate infrastructure.

If your current evaluation process involves a combination of "it looked good in testing," a few MLflow runs that no one queries, and a model card written three months ago, EvalHub exists to replace those manual checks with a system you can trust.

Last updated: June 23, 2026

