Evaluation-driven development with EvalHub

Stop guessing. Start measuring.

June 2, 2026
William Caban Babilonia, Matteo Mortari
Related topics: AI inference, Artificial intelligence, Developer productivity
Related products: Red Hat AI, Red Hat OpenShift AI

    If you have shipped software, you probably know test-driven development (TDD). Write a failing test. Write the code to make it pass. Refactor. Ship with confidence. The red-green-refactor cycle is elegant because it is deterministic: the test either passes or it doesn't. Every state is unambiguous.

    AI systems do not work that way.

    A language model answering a customer support question might give ten different responses to the same query across ten runs, all of them arguably correct, none of them identical. "Does it work?" is the wrong question. The right questions are how often it works, for whom, in what contexts, and how we know. A binary pass/fail check can't answer those. Threshold-based, multidimensional, gradient scoring can.

    This is the insight behind evaluation-driven development (EDD): an evolution of TDD for probabilistic systems that replaces plain assertions with scored, multidimensional evaluation criteria and replaces green/red builds with measurable gaps between current performance and desired thresholds.

    The EDD cycle

    EDD has three steps, each with an analog in traditional software engineering, and each qualitatively different for AI.

    Step 1: Define evaluation criteria before you write a prompt

    In TDD, you write tests before you write code. In EDD, you define evaluation criteria before you write prompts, choose a model, or design a pipeline.

    This sounds obvious. It almost never happens. The default process in most AI projects is to build something, prompt it until it seems to work, demo it, ship it, and then figure out what went wrong in production.

    EDD forces the question upfront: what does good look like? Not in the abstract, but specifically and measurably. For a customer support agent, that might mean:

    • Factual accuracy: At least 90% of claims match source documentation.
    • Escalation rate: Fewer than 20% of queries transferred to a human agent.
    • Response latency: Under 2 seconds at the 95th percentile.
    • Tone: No instances of defensive or dismissive language (measured by a secondary classifier).

    These criteria become your objectives. They determine what you measure, how you weight results, and what "done" means for every subsequent iteration.
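One way to make such criteria concrete before any prompt exists is to write them down as data. The sketch below is illustrative only: the `Criterion` class, the metric names, and the `unmet` helper are hypothetical and not part of EvalHub. Only the thresholds come from the list above.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One measurable objective: a metric name, a target, and a comparison."""
    name: str
    threshold: float
    compare: Callable[[float, float], bool]

    def is_met(self, observed: float) -> bool:
        # The comparison decides the direction, for example >= for accuracy
        # and < for latency.
        return self.compare(observed, self.threshold)


# Thresholds from the customer support example; metric names are hypothetical.
SUPPORT_AGENT_CRITERIA = [
    Criterion("factual_accuracy_pct", 90.0, operator.ge),
    Criterion("escalation_rate_pct", 20.0, operator.lt),
    Criterion("p95_latency_seconds", 2.0, operator.lt),
    Criterion("dismissive_tone_instances", 0.0, operator.le),
]


def unmet(criteria, observed):
    """Return the names of the criteria that the observed measurements miss."""
    return [c.name for c in criteria if not c.is_met(observed[c.name])]
```

Calling `unmet(SUPPORT_AGENT_CRITERIA, measurements)` with a dictionary of measured values returns the list of objectives still open, which is the definition of "not done yet" for the next iteration.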

    Step 2: Measure quality scores—0 to 100, not pass/fail

    Once the criteria are defined, you run your system and score it. Instead of a simple pass/fail, you can define gradient scores—percentages, weighted composites, dimensional breakdowns.

    A score of 82% on factual accuracy against a 90% threshold tells you something a simple failed status cannot: you are eight points away, not infinitely far. It tells your engineering team where to focus. It gives you a baseline against which every subsequent change can be compared.

    This is the reframe that makes AI development pragmatically and scientifically manageable. Instead of a black box that you poke until it seems better, you now have an optimization problem with a measurable current state and a defined target state. Closing the gap becomes engineering work, not intuition alone.
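The arithmetic behind a gradient score is small enough to sketch. The example below assumes the score is the percentage of per-example checks that passed. The sample of 50 claims is invented to reproduce the 82% figure, and both function names are hypothetical.

```python
def score_percent(outcomes):
    """Percentage of per-example checks that passed (True counts as 1)."""
    if not outcomes:
        raise ValueError("need at least one outcome to compute a score")
    return 100.0 * sum(outcomes) / len(outcomes)


def gap_to_threshold(score, threshold):
    """Points still missing to reach the threshold (0.0 once it is met)."""
    return max(0.0, threshold - score)


# Hypothetical run: 41 of 50 sampled claims matched the source documentation.
outcomes = [True] * 41 + [False] * 9
accuracy = score_percent(outcomes)
gap = gap_to_threshold(accuracy, threshold=90.0)
print(f"factual accuracy: {accuracy:.1f}%, gap to threshold: {gap:.1f} points")
```

The gap, rather than a failed status, is what gets tracked from one iteration to the next.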

    Step 3: Iterate on prompts and configurations based on data

    With scores in hand, you run experiments. You A/B test prompt variations. You swap retrieval strategies. You try different models. You compare results not against your gut feeling but against your baseline scores.

    Each iteration either closes the gap, moving you progressively toward your final objective, or it doesn't. The evaluation tells you which. This transforms prompt engineering from an art into a systematic discipline: controlled experiments with measurable outcomes.
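A minimal sketch of that comparison follows. The variant names and scores are invented for illustration, and `rank_variants` is a hypothetical helper, not an EvalHub function. Because model output is probabilistic, a real comparison would typically rest on repeated runs rather than one score per variant.

```python
def rank_variants(baseline, variants, threshold):
    """Order experiment variants by their change against the baseline score."""
    rows = []
    for name, score in variants.items():
        rows.append({
            "variant": name,
            # Positive delta: the variant moved closer to the objective.
            "delta_vs_baseline": round(score - baseline, 2),
            # Points still missing after applying this variant.
            "remaining_gap": round(max(0.0, threshold - score), 2),
        })
    return sorted(rows, key=lambda r: r["delta_vs_baseline"], reverse=True)


# Hypothetical scores for three experiments against an 82% baseline.
results = rank_variants(
    baseline=82.0,
    variants={"prompt_v2": 85.0, "reranker": 88.5, "smaller_model": 79.0},
    threshold=90.0,
)
for row in results:
    print(row)
```

Sorting by delta puts the most promising change first and makes regressions, such as the negative delta for the hypothetical smaller model, visible instead of anecdotal.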

    The sliced evaluation: Where the real work happens

    Here is where EDD reveals its primary strength, and also where most teams miss the opportunity entirely.

    An aggregate accuracy score is a summary. It gives a global view, but it can hide important detail.

    For example, consider an e-commerce recommendation system with an aggregate accuracy of 79%. That sounds reasonable. But what if the per-category breakdown looks like this?

    Category      Threshold   Score   Status
    Electronics   90%         82%     Below target
    Clothing      75%         95%     Exceeding
    Home Goods    80%         45%     Critical

    The aggregate score of 79% masks a critical failure in Home Goods that affects a third of the catalog. An engineering team looking only at the top-line number would have no idea where to focus, or might even deploy to production, reassured that the score is almost 80%.

    Sliced evaluation breaks aggregate scores into the constituent dimensions that actually matter for the specific use case: by product category, by query language, by user segment, by medical specialty, by regulatory domain, and so on. Each slice gets its own threshold, score, and trend over time.
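The per-category breakdown can be sketched as a small report that ranks slices by how far they sit below their own threshold. The thresholds and scores are the ones from the e-commerce example. The 20-point cutoff for "Critical" is an assumption made for this sketch, since the article does not define one, and `slice_report` is a hypothetical helper.

```python
SLICES = {
    "Electronics": {"threshold": 90.0, "score": 82.0},
    "Clothing": {"threshold": 75.0, "score": 95.0},
    "Home Goods": {"threshold": 80.0, "score": 45.0},
}

# Assumption for illustration: a gap of 20 points or more counts as critical.
CRITICAL_GAP = 20.0


def slice_report(slices, critical_gap=CRITICAL_GAP):
    """Return (name, gap, status) tuples, largest gap first."""
    report = []
    for name, values in slices.items():
        gap = values["threshold"] - values["score"]
        if gap <= 0:
            status = "Exceeding"  # at or above the slice threshold
        elif gap >= critical_gap:
            status = "Critical"
        else:
            status = "Below target"
        report.append((name, gap, status))
    return sorted(report, key=lambda row: row[1], reverse=True)


for name, gap, status in slice_report(SLICES):
    print(f"{name:<12} gap={gap:>6.1f}  {status}")
```

Ranking by gap turns the breakdown into a priority list, with Home Goods first.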

    This is the scientific precision that the EDD cycle promises at step 3. You are not optimizing the system as a whole. You are closing the specific gap in Home Goods recommendations, which might mean a different retrieval strategy for product catalog data, additional fine-tuning on product descriptions, or a different prompt template for that category.

    The same principle applies across domains. A healthcare system might track per-specialty evaluation scores. For example, cardiology scores 88% against a 95% threshold, while neurology scores 97%. A multilingual system tracks per-language accuracy. For example, English scores 95%, Mandarin scores 89%, and Spanish scores 78%. Each slice is an independent optimization target.

    EvalHub: EDD as an engineering practice

    The EDD cycle is a methodology. EvalHub is the platform that helps you turn it into an engineering practice.

    Defining evaluation criteria with collections

    EvalHub operationalizes Step 1 through evaluation collections: named, versioned sets of benchmarks with explicit weighting. A collection is the machine-readable form of your evaluation criteria.

    For a healthcare LLM deployment, the collection might look like:

    {
      "id": "healthcare_safety_v1",
      "benchmarks": [
        { "id": "medqa", "provider_id": "lm_evaluation_harness", "weight": 2.0, "pass_criteria":{"threshold": 80.0} },
        { "id": "pubmedqa", "provider_id": "lm_evaluation_harness", "weight": 1.5, "pass_criteria":{"threshold": 75.0} },
        { "id": "toxicity", "provider_id": "garak", "weight": 2.5, "pass_criteria":{"threshold": 70.0} },
        { "id": "faithfulness", "provider_id": "ragas", "weight": 1.5, "pass_criteria":{"threshold": 80.0} }
      ]
    }

    The collection is defined before a single evaluation is run. It encodes the team's measurement strategy: which dimensions matter, how they are weighted, and what threshold each must clear, all in a shareable, versionable artifact. When the criteria change (because the use case evolves or a new regulatory requirement emerges), the collection can be updated and versioned, not rewritten in someone's script.
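To see how weights and thresholds could combine, here is a sketch that reads the collection above and scores a set of results against it. It rests on two assumptions that the article does not confirm: the composite is a weighted arithmetic mean, and a benchmark passes when its score is at or above its threshold. The benchmark scores are invented, and `evaluate_collection` is a hypothetical helper, not EvalHub code.

```python
import json

COLLECTION_JSON = """
{
  "id": "healthcare_safety_v1",
  "benchmarks": [
    {"id": "medqa", "provider_id": "lm_evaluation_harness", "weight": 2.0, "pass_criteria": {"threshold": 80.0}},
    {"id": "pubmedqa", "provider_id": "lm_evaluation_harness", "weight": 1.5, "pass_criteria": {"threshold": 75.0}},
    {"id": "toxicity", "provider_id": "garak", "weight": 2.5, "pass_criteria": {"threshold": 70.0}},
    {"id": "faithfulness", "provider_id": "ragas", "weight": 1.5, "pass_criteria": {"threshold": 80.0}}
  ]
}
"""


def evaluate_collection(collection, scores):
    """Return (weighted mean score, ids of benchmarks below their threshold)."""
    benchmarks = collection["benchmarks"]
    total_weight = sum(b["weight"] for b in benchmarks)
    weighted_sum = 0.0
    failing = []
    for b in benchmarks:
        score = scores[b["id"]]
        weighted_sum += b["weight"] * score
        # Assumed pass rule: the score must reach the benchmark threshold.
        if score < b["pass_criteria"]["threshold"]:
            failing.append(b["id"])
    return weighted_sum / total_weight, failing


collection = json.loads(COLLECTION_JSON)
# Hypothetical scores for one model run.
scores = {"medqa": 84.0, "pubmedqa": 78.0, "toxicity": 66.0, "faithfulness": 82.0}
composite, failing = evaluate_collection(collection, scores)
print(f"composite={composite:.1f} failing={failing}")
```

With these invented scores, the composite looks acceptable while the toxicity benchmark sits below its threshold, which is why per-benchmark pass criteria matter alongside the weighted number.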

    Measuring at scale with automatic experiment tracking

    Step 2, measuring quality scores, is where EDD's operational overhead typically kills the practice. Running five frameworks against three model variants, capturing all configurations, storing results in a queryable format, and making them accessible to the wider team requires a major engineering investment.

    EvalHub handles this automatically. A single POST to /api/v1/evaluations with a collection ID and a model endpoint:

    1. Expands the collection into individual benchmarks grouped by provider.
    2. Routes each group to the appropriate backend (lm-eval, Ragas, Garak, and so on) in parallel.
    3. Applies the collection's weights to produce a dimensional score breakdown.
    4. Writes the full experiment record, including scores, configurations, and tags, to MLflow.

    The MLflow integration means every evaluation run is automatically tracked with the configuration that produced it: model version, collection version, hardware tags, and environment. Reproducing a result from three months ago is a query, not a detective investigation.
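The request itself can be sketched with the Python standard library. Only the path /api/v1/evaluations comes from the article. The payload field names (`collection_id`, `model_endpoint`), the base URL, and the model URL are placeholders, so check the OpenAPI specification for the actual schema before relying on them.

```python
import json
import urllib.request


def build_evaluation_request(base_url, collection_id, model_endpoint):
    """Build (but do not send) a POST request for an evaluation run."""
    # Placeholder field names: the real request schema may differ.
    payload = {"collection_id": collection_id, "model_endpoint": model_endpoint}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/evaluations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local addresses for an EvalHub instance and a model server.
request = build_evaluation_request(
    "http://localhost:8080",
    "healthcare_safety_v1",
    "http://localhost:8000/v1",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would submit it; omitted so the sketch runs offline.
```

The evalhub.client module mentioned later in the article wraps this kind of call in a typed client, so hand-built requests are mainly useful for understanding what travels over the wire.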

    Every provider in EvalHub's default set covers a different evaluation dimension:

    • lm-evaluation-harness (150+ benchmarks): Reasoning, knowledge, coding, commonsense
    • Ragas: Retrieval quality, answer relevance, and faithfulness for RAG pipelines
    • Garak: Red-teaming, toxicity, bias, and adversarial safety probes
    • GuideLLM: Latency, throughput, memory usage, hardware efficiency
    • LightEval: Fast capability checks
    • MTEB: Text embedding quality across retrieval tasks

    The sliced evaluation that EDD requires (per-category, per-language, per-specialty, and so on) is achieved by running collections against different filtered dataset subsets and tagging the results accordingly in MLflow. Each slice becomes a trackable dimension in the experiment history.
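The slicing step can be sketched in plain Python: split the evaluation dataset on the dimension of interest and build one set of tags per subset. The record layout, the tag keys, and both helper names are assumptions for illustration. How a subset is handed to EvalHub and how tags are attached to the MLflow run are not shown here.

```python
from collections import defaultdict


def split_by_slice(records, key):
    """Group evaluation records into subsets, one per value of `key`."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record[key]].append(record)
    return dict(subsets)


def slice_tags(collection_id, key, value):
    """Tags that identify which slice a run measured (hypothetical keys)."""
    return {"collection": collection_id, "slice_key": key, "slice_value": value}


# Hypothetical multilingual evaluation set.
records = [
    {"query": "Where is my order?", "language": "en"},
    {"query": "How do I reset my password?", "language": "en"},
    {"query": "¿Dónde está mi pedido?", "language": "es"},
    {"query": "我的订单在哪里?", "language": "zh"},
]

subsets = split_by_slice(records, "language")
runs = [
    {"tags": slice_tags("support_quality_v1", "language", value), "size": len(rows)}
    for value, rows in subsets.items()
]
for run in runs:
    print(run)
```

Each entry in `runs` corresponds to one evaluation submission, so per-language scores can later be filtered and trended by tag.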

    Iterating from the notebook to the cluster

    Step 3, iterate based on data, is where EvalHub's architecture eliminates the dev-to-enterprise gap.

    A developer running a quick prompt-variation test uses evalhub.client (the SDK's typed Python REST client) to submit an evaluation request against an EvalHub instance. The exact same call, pointed at a production EvalHub deployment on OpenShift, runs as a Kubernetes-native job with Kueue-managed scheduling, resource quotas, structured logging, and Prometheus metrics. There is no separate evaluation mode for development and another for production: there is one API, one format, one results store.

    This means the EDD iterate-and-measure cycle can happen in CI/CD. The SDK's evalhub.cli makes EvalHub directly accessible from shell scripts and pipeline steps without writing Python code. For example, evalhub eval run --config eval.yaml --wait submits a job and blocks until it completes, returning a non-zero exit code on failure. Every pull request that touches a prompt template, a retrieval strategy, or a model configuration can trigger an EvalHub evaluation run as part of the pipeline. The pull request review will then include both the code review and an evaluation report showing the baseline score, the updated score, and a breakdown of which slices improved or regressed.
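A pipeline step built on that exit code could be sketched as follows. The command is the one quoted above. The `run_gate` and `slice_changes` helpers and the score dictionaries are hypothetical, and how scores are read back from a finished run is outside this sketch.

```python
import subprocess

# Command quoted in the article; eval.yaml is the pipeline's own config file.
EVALHUB_COMMAND = ["evalhub", "eval", "run", "--config", "eval.yaml", "--wait"]


def run_gate(command=EVALHUB_COMMAND):
    """Run the evaluation command and return its exit code (0 means success)."""
    completed = subprocess.run(command, check=False)
    return completed.returncode


def slice_changes(baseline, updated):
    """Split slice names into those that improved and those that regressed."""
    improved = sorted(name for name in baseline if updated[name] > baseline[name])
    regressed = sorted(name for name in baseline if updated[name] < baseline[name])
    return improved, regressed
```

A CI job could end with `sys.exit(run_gate())` so that a failed evaluation fails the build, and post the output of `slice_changes` as the pull request comment.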

    The evalhub.mcp module (developer preview) extends this to agentic workflows. AI agents and coding assistants can browse providers, benchmarks, and collections as MCP resources, submit evaluation jobs, and cancel running jobs via two MCP tools—all through a dedicated MCP server. This enables a pattern in which an agent validates its own changes against a defined collection before committing.

    The bring-your-own-framework capability rounds this out. Extend FrameworkAdapter in evalhub.adapter and implement run_benchmark_job to plug your custom evaluation logic into the orchestration, tracking, and reporting infrastructure. The adapter's OCI persistence (callbacks.create_oci_artifact(...)) pushes evaluation results to an OCI registry using olot and oras-py. In Kubernetes mode, the sidecar handles authentication; in local mode, standard Docker config is used. The artifact OCI reference and digest are included in the results reported back to EvalHub, so long-term evaluation provenance is queryable alongside MLflow experiment data.

    EDD in practice: From e-commerce to healthcare

    Let's make this concrete with two examples drawn from real evaluation patterns.

    E-commerce recommendation quality

    The team defines a collection with three slices: Electronics (threshold: 90%), Clothing (threshold: 75%), Home Goods (threshold: 80%). They run the collection against their baseline RAG pipeline. Electronics: 82%, Clothing: 95%, Home Goods: 45%.

    The optimization target is unambiguous: Home Goods at 45% needs urgent attention. The team discovers that the product catalog data for Home Goods is poorly structured and rarely retrieved correctly. They redesign the retrieval strategy for that category specifically—a change that would never have been identified from the 79% aggregate score.

    Healthcare LLM safety validation

    The team uses the healthcare_safety_v1 collection, which covers clinical reasoning (MedQA and PubMedQA), safety (Garak toxicity and adversarial probes), and RAG groundedness (Ragas faithfulness). Each benchmark has a minimum threshold reflecting regulatory requirements. Per-specialty slicing tracks cardiology, dermatology, and neurology separately.

    Before each model update, the collection runs automatically. If any specialty falls below its threshold, the update is blocked. If safety scores decline even as aggregate accuracy improves, the evaluation explicitly surfaces the tradeoff rather than hiding it in the average.
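The gating rule described here can be sketched as a single decision function. The cardiology and neurology figures echo the earlier per-specialty example. Every other number, the dictionary keys, and the `release_decision` helper are hypothetical.

```python
def release_decision(thresholds, specialty_scores, baseline, candidate):
    """Decide whether a model update may ship and list tradeoffs to surface."""
    # Block the update if any specialty is below its own threshold.
    blocked_by = sorted(
        name for name, minimum in thresholds.items()
        if specialty_scores[name] < minimum
    )
    tradeoffs = []
    accuracy_up = candidate["aggregate_accuracy"] > baseline["aggregate_accuracy"]
    safety_down = candidate["safety"] < baseline["safety"]
    if accuracy_up and safety_down:
        # Report the tradeoff explicitly instead of letting the average hide it.
        tradeoffs.append("aggregate accuracy improved while safety declined")
    return {"blocked": bool(blocked_by), "blocked_by": blocked_by, "tradeoffs": tradeoffs}


decision = release_decision(
    thresholds={"cardiology": 95.0, "dermatology": 95.0, "neurology": 95.0},
    specialty_scores={"cardiology": 88.0, "dermatology": 96.0, "neurology": 97.0},
    baseline={"aggregate_accuracy": 90.0, "safety": 75.0},
    candidate={"aggregate_accuracy": 92.0, "safety": 71.0},
)
print(decision)
```

In this invented run the update is blocked by cardiology alone, and the safety decline is reported even though aggregate accuracy went up.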

    The shift: From black box to optimization problem

    The traditional approach to AI development treats the model as a black box: put something in, get something out, and decide subjectively if it's good enough.

    EDD reframes the problem: you have a current performance score, a desired performance threshold, and a measurable gap between them. Your engineering work is to close that gap systematically, through controlled experiments, guided by evaluation data rather than instinct.

    EvalHub is the platform that makes this reframe possible, pragmatic, and effective for enterprise development. It handles the measurement infrastructure so teams can focus on the optimization work. It enforces the discipline of defining criteria before running experiments. It makes sliced evaluation as easy as aggregate evaluation. And it ensures that the evaluation practice scales from a developer's laptop to a production Kubernetes cluster without a platform engineering project in between.

    Stop guessing. Define your criteria. Run your collections. Measure the gap. Close it. Repeat.

    That is EDD. That is what EvalHub is built for.

    Start here

    EvalHub is open source under the Apache 2.0 license and deploys on Kubernetes via the TrustyAI operator. For teams already running Red Hat OpenShift AI, it is available as part of the TrustyAI stack with no additional infrastructure required. Start with the resources below, or learn more about how Red Hat AI supports production-grade, governed AI.

    • EvalHub website
    • EvalHub server
    • EvalHub SDK
    • OpenAPI specification
    • TrustyAI Operator
