EvalHub: Because "looks good to me" isn't a benchmark

Why AI evaluation is broken in most enterprises and what a unified platform actually fixes

May 19, 2026
William Caban Babilonia, Rui Vieira
Related topics:
Artificial intelligence, Automation and management, Platform engineering
Related products:
Red Hat AI

    Your team just finished building a customer support assistant using retrieval-augmented generation (RAG) over the past three months. It handled every test query you threw at it in the demo. Leadership is excited. The launch date is set.

    Then it goes live. Users complain that answers about product returns are confidently wrong. The legal team flags a response citing a policy that was deprecated 18 months ago. The model is fast, articulate, and unambiguously unreliable.

    What went wrong? Most likely, nothing catastrophic and nothing subtle. The evaluation process simply never asked the right questions. "Looks good to me" and a handful of manual spot checks stood in for systematic measurement. The team optimized for the demo, not for the deployment.

    This is not a story about one team's bad luck. It is the dominant pattern in enterprise AI development today, with five distinct structural causes.

    Five problems that break AI evaluation at scale

    Enterprise AI development currently faces five primary structural challenges that hinder effective measurement and deployment.

    1. The tooling fragmentation problem

    A typical enterprise AI team evaluating a new model or RAG pipeline might use several tools: EleutherAI's LM Evaluation Harness for capability benchmarks, RAGAS for retrieval quality, Garak for red-teaming and safety probes, and GuideLLM for throughput and latency profiling. Often, different team members run these tools separately and store results in disconnected locations. Stitching these together into a coherent picture requires custom scripts and manual data wrangling. The burden of maintaining this evaluation infrastructure often falls on evaluators, diverting their focus from their primary roles.

    The operational cost is real: evaluation runs are skipped under deadline pressure, results are not compared across model versions because the formats are incompatible, and the institutional knowledge of "what we ran last time" lives in one engineer's notebook.

    2. The "What should I measure?" problem

    Generic evaluation tools are built around academic benchmarks: MMLU, ARC, HellaSwag, HumanEval. These are valuable, but they answer the wrong question for most enterprise use cases. A healthcare LLM needs to be evaluated on clinical reasoning accuracy and regulatory compliance, not on general trivia. A multilingual customer service agent needs a breakdown by language, not a single aggregate accuracy score.

    The result is what might be called the vanity metric trap: teams report high scores on benchmarks that do not predict real-world performance in their specific domain, and ship systems that fail in the exact ways the benchmark did not measure.

    3. The reproducibility crisis

    AI model behavior is sensitive to hardware, driver versions, quantization settings, batch sizes, and prompt templates. A benchmark score produced on an A100 cluster might not reproduce on a T4 node. Benchmark results from last quarter might not reproduce today if the model's serving configuration has changed.

    Without a systematic way to capture and attach environment metadata (such as hardware specs, software versions, and model configurations) to evaluation results, benchmark scores are claims rather than evidence. This matters enormously for regulated industries. The EU AI Act and FedRAMP require documented, reproducible evidence of model behavior, not a dashboard screenshot.
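
    As a rough illustration of attaching environment metadata to a result, the sketch below records a few environment fields with the Python standard library and diffs two runs to show which setting changed. The field names, the `serving.` prefix, and the example settings are assumptions made for this example, not an EvalHub schema. Hardware details such as GPU model and driver version would need vendor tooling and are left out.

```python
import platform
import sys


def capture_run_metadata(serving_config: dict) -> dict:
    """Collect environment details to store next to an evaluation result."""
    return {
        "python_version": sys.version.split()[0],
        "os": platform.system(),
        "machine": platform.machine(),
        # Prefix serving settings so they cannot collide with environment keys.
        **{f"serving.{key}": value for key, value in serving_config.items()},
    }


def diff_metadata(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every field that differs between two runs."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}


# Hypothetical serving settings for two runs of the same benchmark.
last_quarter = capture_run_metadata({"quantization": "fp16", "batch_size": 16})
today = capture_run_metadata({"quantization": "int8", "batch_size": 16})

# Only the quantization setting differs between the two records.
print(diff_metadata(last_quarter, today))
```

    When a score fails to reproduce, a diff like this narrows the search to the settings that actually changed between the two runs.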

    4. The documentation and accessibility gap

    Evaluation knowledge tends to scatter: MLflow experiments with results no one outside the ML team can query, model cards written once and never updated, and benchmark configurations in READMEs that silently go stale. The consequence is that decisions about which model to deploy, which component to tune, or which risk to accept are made on the basis of tribal knowledge rather than traceable evidence.

    This is not a data quality problem. It is a workflow problem: there is no standard, organization-wide format for evaluation reports, no single place to look, and no automated path from a completed evaluation run to a governance-ready artifact.

    5. The dev-to-enterprise gap

    A developer can run a quick evaluation from a Python script on their laptop in minutes. Moving that same evaluation to a Kubernetes cluster for reproducible, production-scale runs—with proper concurrency management, resource quotas, experiment tracking, and structured results—requires either a platform engineering investment or months of glue code. Most teams never bridge this gap. Their evaluation capability stays at the laptop-script level even as their model deployments scale to production traffic.

    The gap means that evaluation remains a manual, developer-local activity that does not fit into CI/CD pipelines, does not run at the scale of production data, and does not produce the governance artifacts the enterprise actually needs.

    Introducing EvalHub

    EvalHub is the Red Hat AI unified foundation for AI evaluation. It directly addresses these five problems through a single orchestration control plane and a set of deliberately designed primitives.

    A single orchestration layer for every framework

    EvalHub uses a REST API server written in Go and deployed on Kubernetes via the TrustyAI operator. It routes evaluation requests to supported backends without the calling code needing to know which framework is actually running. The default provider set, declared in YAML configuration files, includes:

    • lm-evaluation-harness: 167 benchmarks for capability, reasoning, and knowledge evaluation
    • Garak: Red-teaming, safety probes, and toxicity detection
    • GuideLLM: Throughput, latency, and infrastructure profiling
    • LightEval: Fast, lightweight capability benchmarks
    • MTEB: Massive text embedding benchmarks

    A single POST request to /api/v1/evaluations can fan out to multiple backends in parallel, aggregate the results with configurable weights, and write the full experiment record to MLflow without requiring the caller to manage any of that complexity.
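
    The fan-out pattern can be sketched in a few lines of Python. This is a minimal illustration, not EvalHub's server code (which is written in Go): the two backend functions are stand-ins that return fixed scores, and their names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-in backends: each takes a model URL and returns a score in [0, 1].
# A real backend would call out to an evaluation framework instead.
def run_capability(model_url: str) -> float:
    return 0.75


def run_safety(model_url: str) -> float:
    return 0.5


BACKENDS = {"capability": run_capability, "safety": run_safety}


def fan_out(model_url: str, backends: dict) -> dict:
    """Run every backend concurrently and collect scores by backend name."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {name: pool.submit(fn, model_url) for name, fn in backends.items()}
        # result() blocks until that backend finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}


scores = fan_out("http://vllm-service:8080/v1", BACKENDS)
print(scores)
```

    The caller sees one call and one result dictionary, which is the property the single-request design is after.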

    Adding a new framework does not require a change to the EvalHub server: you extend the FrameworkAdapter base class from the EvalHub SDK, implement a single run_benchmark_job method, and register the provider through a ConfigMap entry. EvalHub handles scheduling, status reporting, and result aggregation from that point forward.
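
    To give a feel for the shape of an adapter, here is a self-contained sketch. It does not import the EvalHub SDK: `StandInFrameworkAdapter` is a local stand-in defined only for this example, and the job and result dictionaries are invented. The real base class and the actual signature of run_benchmark_job may differ, so check the SDK before relying on any of these details.

```python
from abc import ABC, abstractmethod


class StandInFrameworkAdapter(ABC):
    """Local stand-in for an adapter base class; not the real SDK class."""

    @abstractmethod
    def run_benchmark_job(self, job: dict) -> dict:
        """Run one benchmark job and return its results."""


class ExactMatchAdapter(StandInFrameworkAdapter):
    """Toy adapter that scores predictions by exact string match."""

    def run_benchmark_job(self, job: dict) -> dict:
        pairs = list(zip(job["predictions"], job["references"]))
        # Compare after stripping surrounding whitespace.
        correct = sum(1 for predicted, expected in pairs if predicted.strip() == expected.strip())
        return {
            "benchmark": job["benchmark"],
            "metrics": {"exact_match": correct / len(pairs) if pairs else 0.0},
        }


# Hypothetical job payload; the field names are invented for this example.
job = {
    "benchmark": "returns-policy-qa",
    "predictions": ["30 days", "no refunds", "store credit"],
    "references": ["30 days", "14 days", "store credit "],
}
result = ExactMatchAdapter().run_benchmark_job(job)
print(result)
```

    The point of the single-method contract is that everything outside the measurement itself (scheduling, status, aggregation) stays with the platform.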

    The EvalHub SDK packages four capabilities alongside the adapter:

    • evalhub.client: Sync and async typed REST clients for submitting jobs and querying providers, benchmarks, and collections from Python code or notebooks.
    • evalhub.cli: A fully featured evalhub CLI with commands for running evaluations, watching status, retrieving results, managing collections and providers, and checking service health; supports YAML/JSON config files, multi-profile configuration, and --wait/--watch flags for blocking runs.
    • evalhub.mcp (dev preview): A FastMCP-based MCP server exposing nine browsable resources and two action tools, started via evalhub mcp; lets AI agents and coding assistants invoke evaluations and retrieve results via the Model Context Protocol.
    • OCI artifact persistence (via evalhub.adapter): Evaluation results are pushed to an OCI registry using olot and oras, with SHA256-derived tags and full annotation support; in Kubernetes mode, the sidecar handles registry authentication transparently.
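
    The idea behind a SHA256-derived tag can be shown with the standard library alone. This sketch does not call olot or oras and does not reproduce EvalHub's actual tag format: the `eval-` prefix, the 12-character truncation, and the result fields are arbitrary choices for illustration.

```python
import hashlib
import json


def results_digest(results: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON form of the results."""
    # Sorted keys and fixed separators give the same bytes for the same content.
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def derive_tag(results: dict, prefix: str = "eval", length: int = 12) -> str:
    """Build a short tag from the digest; prefix and length are arbitrary here."""
    return f"{prefix}-{results_digest(results)[:length]}"


# Hypothetical results payload.
results = {"benchmark": "toxicity_probe", "score": 0.91}
reordered = {"score": 0.91, "benchmark": "toxicity_probe"}
edited = {"benchmark": "toxicity_probe", "score": 0.92}

print(derive_tag(results))
print(derive_tag(results) == derive_tag(reordered))  # same content, same tag
print(derive_tag(results) == derive_tag(edited))     # changed content, different tag
```

    Because the tag is a function of the content, a stored result that has been altered no longer matches the tag it was published under.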

    Evaluation collections: Answering "What should I measure?"

    The platform addresses the domain specificity problem through evaluation collections. These are named, versioned, and expert-curated sets of benchmarks that are framework-neutral and tailored to specific verticals and use cases.

    A collection is declared with a list of benchmarks, their providers, and their relative weights. A healthcare_safety_v1 collection, for example, might include clinical reasoning benchmarks from lm-evaluation-harness, safety probes from Garak, and RAG-groundedness metrics from RAGAS, each weighted according to the use case's risk profile.

    Calling a collection is as simple as including its ID in an evaluation request:

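    # EVALHUB_URL is a placeholder for your EvalHub server's base URL.
    curl -X POST "$EVALHUB_URL/api/v1/evaluations" \
          -H "Content-Type: application/json" \
          -d '{
          "name": "llama-3-v2-healthcare-eval",
          "model": { "url": "http://vllm-service:8080/v1" },
          "collection": { "id": "healthcare_safety_v1" },
          "experiment": { "name": "llama-3-v2-healthcare-eval" }
          }'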

    EvalHub looks up the collection, extracts the benchmarks, and groups them by provider to improve execution. It then returns a weighted aggregate score and a breakdown for each benchmark. The collection encodes organizational evaluation knowledge in a shareable, versionable, reusable form.
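
    The lookup, grouping, and scoring steps can be sketched as follows. The collection layout, benchmark IDs, weights, and scores are invented for illustration, and the aggregate is computed as a weighted mean, which is an assumption about how the weights are applied rather than a description of EvalHub's implementation.

```python
from collections import defaultdict

# Hypothetical collection; benchmark IDs and weights are invented for illustration.
collection = {
    "id": "healthcare_safety_v1",
    "benchmarks": [
        {"id": "clinical_reasoning", "provider": "lm-evaluation-harness", "weight": 2.0},
        {"id": "medical_qa", "provider": "lm-evaluation-harness", "weight": 1.0},
        {"id": "toxicity_probe", "provider": "garak", "weight": 1.0},
    ],
}


def group_by_provider(collection: dict) -> dict:
    """Map each provider to the benchmark IDs it should run."""
    grouped = defaultdict(list)
    for benchmark in collection["benchmarks"]:
        grouped[benchmark["provider"]].append(benchmark["id"])
    return dict(grouped)


def weighted_score(collection: dict, scores: dict) -> float:
    """Weighted mean of per-benchmark scores (assumed aggregation rule)."""
    total_weight = sum(b["weight"] for b in collection["benchmarks"])
    weighted_sum = sum(b["weight"] * scores[b["id"]] for b in collection["benchmarks"])
    return weighted_sum / total_weight


# Hypothetical per-benchmark scores returned by the providers.
scores = {"clinical_reasoning": 0.75, "medical_qa": 0.5, "toxicity_probe": 1.0}

print(group_by_provider(collection))
print(weighted_score(collection, scores))  # (2*0.75 + 1*0.5 + 1*1.0) / 4 = 0.75
```

    Grouping by provider means each framework is invoked once with all of its benchmarks, and the per-benchmark scores remain available alongside the aggregate.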

    Built-in reproducibility and governance

    Every evaluation run is automatically tracked in MLflow with a structured ExperimentConfig that captures the experiment name, tags (environment, model family, collection version, and so on), and the full configuration used to produce the results. This provides teams with a queryable, historical record of every evaluation run, which serves as the foundation for regression tracking, model comparison, and governance reporting.
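
    Once runs are recorded in a queryable form, regression tracking reduces to comparing two records. The sketch below works on plain dictionaries and does not use the MLflow API; the benchmark names, scores, and the 0.02 tolerance are assumptions chosen for the example.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return benchmarks whose score dropped by more than `tolerance`."""
    regressions = {}
    for benchmark, old_score in baseline.items():
        new_score = candidate.get(benchmark)
        if new_score is None:
            continue  # benchmark not run for the candidate; nothing to compare
        drop = old_score - new_score
        if drop > tolerance:
            regressions[benchmark] = round(drop, 4)
    return regressions


# Hypothetical scores from two tracked runs of the same collection.
baseline = {"clinical_reasoning": 0.75, "toxicity_probe": 0.90, "medical_qa": 0.60}
candidate = {"clinical_reasoning": 0.76, "toxicity_probe": 0.80, "medical_qa": 0.59}

# Only toxicity_probe dropped by more than the tolerance.
print(find_regressions(baseline, candidate))
```

    A check like this can gate a CI/CD pipeline: a non-empty result blocks promotion of the candidate model until someone reviews the drop.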

    The OCI artifact persistence feature establishes a tamper-evident connection between a model artifact and its deployment justification. This is achieved by directly embedding evaluation results and traces as metadata within the OCI ModelCar images, ensuring that the evaluation evidence is persistently linked to the deployed model.

    From laptop to cluster without friction

    The EvalHub SDK provides the same evaluation interface whether you are running a quick check in a Python notebook, wiring evaluations into a CI/CD pipeline, or scheduling a production evaluation run on an OpenShift cluster. The server translates structured API calls into Kubernetes primitives to manage execution at scale. This includes Kueue-based resource quota enforcement, pod scheduling, and status tracking through custom resources.

    A developer does not need to understand Kubernetes to run an evaluation. A platform team does not need to build custom evaluation infrastructure. The same EvalHub instance serves both.

    What EvalHub is not

    EvalHub is not a benchmark leaderboard. It is not a replacement for framework-specific tools like lm-evaluation-harness or RAGAS. Those tools do the actual measurement work; EvalHub orchestrates them, routes between them, tracks their outputs, and makes their results governable.

    EvalHub is evolving quickly to keep pace with AI developments. Because it orchestrates several frameworks rather than replacing them, it addresses the five challenges described above more comprehensively than any single evaluation framework. The server routes to live backends, tracks real experiments, and deploys on real Kubernetes clusters.

    Getting started

    EvalHub is open source under the Apache 2.0 license. You can access the EvalHub server directly on GitHub. The Python SDK includes the evalhub CLI, REST client, BYOF adapter, MCP server, and OCI artifact persistence. Use the TrustyAI operator to manage its Kubernetes deployment.

    For teams already running Open Data Hub or Red Hat OpenShift AI, EvalHub deploys as a component of the TrustyAI stack and requires no separate infrastructure.

    If your current evaluation process involves a combination of "it looked good in testing," a few MLflow runs that no one queries, and a model card written three months ago, EvalHub exists to replace those manual checks with a system you can trust.
