If you have shipped software, you probably know test-driven development (TDD). Write a failing test. Write the code to make it pass. Refactor. Ship with confidence. The red-green-refactor cycle is elegant because it is deterministic: the test either passes or it doesn't. Every state is unambiguous.
AI systems do not work that way.
A language model answering a customer support question might give ten different responses to the same query across ten runs, all of them arguably correct, none of them identical. "Does it work?" is the wrong question. The right question is how often it works, for whom, in what contexts, and how we know. A binary pass/fail check can't answer that. Threshold-based, multidimensional, gradient scoring can.
This is the insight behind evaluation-driven development (EDD): an evolution of TDD for probabilistic systems that replaces plain assertions with scored, multidimensional evaluation criteria and replaces green/red builds with measurable gaps between current performance and desired thresholds.
The EDD cycle
EDD has three steps, each with an analog in traditional software engineering, and each qualitatively different for AI.
Step 1: Define evaluation criteria before you write a prompt
In TDD, you write tests before you write code. In EDD, you define evaluation criteria before you write prompts, choose a model, or design a pipeline.
This sounds obvious. It almost never happens. The default process in most AI projects is to build something, prompt it until it seems to work, demo it, ship it, and then figure out what went wrong in production.
EDD forces the question upfront: what does good look like? Not in the abstract, but specifically and measurably. For a customer support agent, that might mean:
- Factual accuracy: At least 90% of claims match source documentation.
- Escalation rate: Fewer than 20% of queries transferred to a human agent.
- Response latency: Under 2 seconds at the 95th percentile.
- Tone: No instances of defensive or dismissive language (measured by a secondary classifier).
These criteria become your objectives. They determine what you measure, how you weight results, and what "done" means for every subsequent iteration.
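One way to make such criteria checkable is to write them down as data before any prompt exists. The following is a minimal sketch, not an EvalHub API: the `Criterion` class, its field names, and the metric names are hypothetical, and only the threshold values come from the list above. Boundary handling is simplified, so thresholds are treated as inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One measurable objective. Class and field names are illustrative only."""
    name: str
    threshold: float
    unit: str
    higher_is_better: bool = True

    def is_met(self, measured: float) -> bool:
        # Accuracy-like metrics must reach the threshold; cost-like metrics
        # (latency, escalation rate) must stay at or below it.
        # Simplification: the threshold itself counts as a pass.
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


# Threshold values from the customer support example.
SUPPORT_AGENT_CRITERIA = [
    Criterion("factual_accuracy", 90.0, "%"),
    Criterion("escalation_rate", 20.0, "%", higher_is_better=False),
    Criterion("p95_latency", 2.0, "s", higher_is_better=False),
    Criterion("dismissive_responses", 0.0, "count", higher_is_better=False),
]

for criterion in SUPPORT_AGENT_CRITERIA:
    direction = ">=" if criterion.higher_is_better else "<="
    print(f"{criterion.name}: {direction} {criterion.threshold} {criterion.unit}")
```

Keeping the direction of each metric next to its threshold avoids a common mistake: treating a lower-is-better metric such as latency as if a higher number were an improvement.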
Step 2: Measure quality scores—0 to 100, not pass/fail
Once the criteria are defined, you run your system and score it. Instead of a simple pass/fail, you can define gradient scores—percentages, weighted composites, dimensional breakdowns.
A score of 82% on factual accuracy against a 90% threshold tells you something a simple failed status cannot: you are eight points away, not infinitely far. It tells your engineering team where to focus. It gives you a baseline against which every subsequent change can be compared.
This is the reframe that makes AI development manageable in practice. Instead of a black box that you poke until it seems better, you now have an optimization problem with a measurable current state and a defined target state. Closing the gap becomes an engineering task rather than a matter of intuition alone.
Step 3: Iterate on prompts and configurations based on data
With scores in hand, you run experiments. You A/B test prompt variations. You swap retrieval strategies. You try different models. You compare results not against your gut feeling but against your baseline scores.
Each iteration either closes the gap, moving you toward your objective, or it doesn't. The evaluation tells you which. This transforms prompt engineering from an art into a systematic discipline: controlled experiments with measurable outcomes.
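As a rough illustration of comparing experiments against a baseline, the sketch below ranks prompt variants by how much of the gap remains. The 82% baseline and 90% threshold reuse the factual accuracy example from Step 2; the variant names and their scores are hypothetical placeholders for real evaluation runs.

```python
THRESHOLD = 90.0  # factual accuracy target defined in Step 1
BASELINE = 82.0   # score measured in Step 2

# Hypothetical scores for three prompt variants. Real values would come
# from evaluation runs, not from a hard-coded dictionary.
variant_scores = {
    "prompt_a": 80.5,
    "prompt_b": 86.0,
    "prompt_c": 84.0,
}


def summarize(name: str, score: float) -> dict:
    """Compare one variant with the baseline and the threshold."""
    return {
        "variant": name,
        "delta_vs_baseline": round(score - BASELINE, 2),
        "remaining_gap": round(max(0.0, THRESHOLD - score), 2),
        "improved": score > BASELINE,
    }


results = [summarize(name, score) for name, score in variant_scores.items()]

# The best variant is the one that leaves the smallest gap to the threshold.
best = min(results, key=lambda r: r["remaining_gap"])
print(best["variant"], best["remaining_gap"])
```

A variant that scores below the baseline, such as `prompt_a` here, is recorded as a regression rather than discarded, so the experiment history shows what was tried.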
The sliced evaluation: Where the real work happens
Here is where EDD reveals its primary strength, and also where most teams miss the opportunity entirely.
An aggregate accuracy score is a summary. It gives a global view, but it can hide important insights.
For example, consider an e-commerce recommendation system with an aggregate accuracy of 79%. That sounds reasonable. But what if the per-category breakdown looks like this?
| Category | Threshold | Score | Status |
|---|---|---|---|
| Electronics | 90% | 82% | Below target |
| Clothing | 75% | 95% | Exceeding |
| Home Goods | 80% | 45% | Critical |
The aggregate score of 79% masks a critical failure in Home Goods that affects a third of the catalog. An engineering team looking only at the top-line number would have no idea where to focus, or might even deploy to production, reassured by a score of almost 80%.
Sliced evaluation breaks aggregate scores into the constituent dimensions that actually matter for the specific use case: by product category, by query language, by user segment, by medical specialty, by regulatory domain, and so on. Each slice gets its own threshold, score, and trend over time.
This is the scientific precision that the EDD cycle promises at Step 3. You are not optimizing the system as a whole. You are closing the specific gap in Home Goods recommendations, which might mean a different retrieval strategy for product catalog data, additional fine-tuning on product descriptions, or a different prompt template for that category.
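The per-category breakdown can be sketched in a few lines of Python. The thresholds and scores below are the ones from the table; the status labels mirror its Status column, while the 20-point cutoff for "Critical" and the "On target" label are assumptions made for this example.

```python
# Thresholds and scores taken from the per-category table.
slices = {
    "Electronics": {"threshold": 90.0, "score": 82.0},
    "Clothing": {"threshold": 75.0, "score": 95.0},
    "Home Goods": {"threshold": 80.0, "score": 45.0},
}

CRITICAL_GAP = 20.0  # assumption: a shortfall this large is flagged as critical


def status(score: float, threshold: float) -> str:
    """Classify one slice by its distance from its own threshold."""
    gap = threshold - score
    if gap < 0:
        return "Exceeding"
    if gap == 0:
        return "On target"
    if gap >= CRITICAL_GAP:
        return "Critical"
    return "Below target"


report = {
    name: status(values["score"], values["threshold"])
    for name, values in slices.items()
}

# The slice with the largest shortfall is the next optimization target.
worst = max(slices, key=lambda n: slices[n]["threshold"] - slices[n]["score"])

for name, label in report.items():
    print(f"{name}: {label}")
print("Focus on:", worst)
```

Because each slice is compared with its own threshold, a strong category cannot compensate for a weak one the way it does in an average.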
The same principle applies across domains. A healthcare system might track per-specialty evaluation scores. For example, cardiology scores 88% against a 95% threshold, while neurology scores 97%. A multilingual system tracks per-language accuracy. For example, English scores 95%, Mandarin scores 89%, and Spanish scores 78%. Each slice is an independent optimization target.
EvalHub: EDD as an engineering practice
The EDD cycle is a methodology. EvalHub is the platform that helps you turn it into an engineering practice.
Defining evaluation criteria with collections
EvalHub operationalizes Step 1 through evaluation collections: named, versioned sets of benchmarks with explicit weighting. A collection is the machine-readable form of your evaluation criteria.
For a healthcare LLM deployment, the collection might look like:
```json
{
  "id": "healthcare_safety_v1",
  "benchmarks": [
    { "id": "medqa", "provider_id": "lm_evaluation_harness", "weight": 2.0, "pass_criteria": { "threshold": 80.0 } },
    { "id": "pubmedqa", "provider_id": "lm_evaluation_harness", "weight": 1.5, "pass_criteria": { "threshold": 75.0 } },
    { "id": "toxicity", "provider_id": "garak", "weight": 2.5, "pass_criteria": { "threshold": 70.0 } },
    { "id": "faithfulness", "provider_id": "ragas", "weight": 1.5, "pass_criteria": { "threshold": 80.0 } }
  ]
}
```
The collection is defined before a single evaluation is run. It encodes the team's measurement strategy, that is, which dimensions matter and how they are weighted, in a shareable, versionable artifact. When the criteria change (because the use case evolves or a new regulatory requirement emerges), the collection can be updated and versioned, not rewritten in someone's script.
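To show how weights and thresholds in such a collection could combine, here is a minimal sketch. It assumes a weighted arithmetic mean on a 0 to 100 scale where higher is better for every benchmark; the aggregation EvalHub actually applies is not described here and may differ. The per-benchmark scores are hypothetical.

```python
import json

# The collection from the example above, embedded so the sketch is self-contained.
COLLECTION_JSON = """
{
  "id": "healthcare_safety_v1",
  "benchmarks": [
    {"id": "medqa", "provider_id": "lm_evaluation_harness", "weight": 2.0, "pass_criteria": {"threshold": 80.0}},
    {"id": "pubmedqa", "provider_id": "lm_evaluation_harness", "weight": 1.5, "pass_criteria": {"threshold": 75.0}},
    {"id": "toxicity", "provider_id": "garak", "weight": 2.5, "pass_criteria": {"threshold": 70.0}},
    {"id": "faithfulness", "provider_id": "ragas", "weight": 1.5, "pass_criteria": {"threshold": 80.0}}
  ]
}
"""

# Hypothetical scores for illustration only.
scores = {"medqa": 84.0, "pubmedqa": 71.0, "toxicity": 90.0, "faithfulness": 80.0}

collection = json.loads(COLLECTION_JSON)
benchmarks = collection["benchmarks"]

# Assumption: the composite is a weighted arithmetic mean of benchmark scores.
total_weight = sum(b["weight"] for b in benchmarks)
weighted_score = sum(b["weight"] * scores[b["id"]] for b in benchmarks) / total_weight

# A benchmark fails when its score is below its own pass threshold.
failing = [
    b["id"] for b in benchmarks
    if scores[b["id"]] < b["pass_criteria"]["threshold"]
]

print(f"Weighted score: {weighted_score:.1f}")
print("Below threshold:", failing)
```

In this example the composite looks healthy while one benchmark still misses its threshold, which is why the per-benchmark pass criteria are checked separately from the weighted score.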
Measuring at scale with automatic experiment tracking
Step 2, measuring quality scores, is where EDD's operational overhead typically kills the practice. Running five frameworks against three model variants, capturing all configurations, storing results in a queryable format, and making them accessible to the wider team requires a major engineering investment.
EvalHub handles this automatically. A single POST to /api/v1/evaluations with a collection ID and a model endpoint:
- Expands the collection into individual benchmarks grouped by provider.
- Routes each group to the appropriate backend (lm-eval, Ragas, Garak, and so on) in parallel.
- Applies the collection's weights to produce a dimensional score breakdown.
- Writes the full experiment record, including scores, configurations, and tags, to MLflow.
The MLflow integration means every evaluation run is automatically tracked with the configuration that produced it: model version, collection version, hardware tags, and environment. Reproducing a result from three months ago is a query, not a detective investigation.
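The request itself might be built as in the sketch below, using only the Python standard library. Only the path /api/v1/evaluations and the idea of sending a collection ID with a model endpoint come from the description above. The base URL, the JSON field names (`collection_id`, `model`, `endpoint`), and the endpoint value are assumptions for illustration, not a documented schema; check the EvalHub API reference for the real request body.

```python
import json
import urllib.request

# Assumptions: base URL, payload field names, and model endpoint are
# placeholders chosen for this sketch, not a documented EvalHub schema.
EVALHUB_URL = "http://localhost:8080"

payload = {
    "collection_id": "healthcare_safety_v1",
    "model": {"endpoint": "http://localhost:8000/v1"},
}

request = urllib.request.Request(
    url=f"{EVALHUB_URL}/api/v1/evaluations",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out so the sketch runs without a live server:
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)

print(request.get_method(), request.full_url)
```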
Every provider in EvalHub's default set covers a different evaluation dimension:
- lm-evaluation-harness (150+ benchmarks): Reasoning, knowledge, coding, commonsense
- Ragas: Retrieval quality, answer relevance, and faithfulness for RAG pipelines
- Garak: Red-teaming, toxicity, bias, safety adversarial probes
- GuideLLM: Latency, throughput, memory usage, hardware efficiency
- LightEval: Fast capability checks
- MTEB: Text embedding quality across retrieval tasks
The sliced evaluation that EDD requires (per-category, per-language, per-specialty, and so on) is achieved by running collections against different filtered dataset subsets and tagging the results accordingly in MLflow. Each slice becomes a trackable dimension in the experiment history.
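The filter-then-tag pattern can be sketched in plain Python. Nothing here calls EvalHub or MLflow: the record layout, the `split_by` and `accuracy` helpers, and the tag names are hypothetical, and the `tags` dictionary simply stands in for whatever tags a tracking backend would store with each run.

```python
from collections import defaultdict

# Hypothetical evaluation records: a slice key plus a correctness flag.
records = [
    {"language": "en", "correct": True},
    {"language": "en", "correct": True},
    {"language": "es", "correct": True},
    {"language": "es", "correct": False},
    {"language": "zh", "correct": True},
]


def split_by(rows: list, key: str) -> dict:
    """Group records into one filtered subset per distinct value of `key`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[key]].append(row)
    return dict(subsets)


def accuracy(subset: list) -> float:
    """Percentage of records in the subset marked correct."""
    return 100.0 * sum(row["correct"] for row in subset) / len(subset)


# One result per slice, tagged so it can be tracked as its own dimension.
runs = [
    {
        "tags": {"slice_key": "language", "slice_value": value},
        "accuracy": accuracy(subset),
        "n": len(subset),
    }
    for value, subset in split_by(records, "language").items()
]

for run in runs:
    print(run["tags"]["slice_value"], run["accuracy"], run["n"])
```

Recording the sample count next to each score matters in practice: a slice with very few examples produces a noisy number that should not be compared with its threshold too literally.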
Iterating from the notebook to the cluster
Step 3, iterate based on data, is where EvalHub's architecture eliminates the dev-to-enterprise gap.
A developer running a quick prompt-variation test uses evalhub.client (the SDK's typed Python REST client) to submit an evaluation request against an EvalHub instance. The exact same call, pointed at a production EvalHub deployment on OpenShift, runs as a Kubernetes-native job with Kueue-managed scheduling, resource quotas, structured logging, and Prometheus metrics. There is no separate evaluation mode for development and another for production: there is one API, one format, and one results store.
This means the EDD iterate-and-measure cycle can happen in CI/CD. The SDK's evalhub.cli makes EvalHub directly accessible from shell scripts and pipeline steps without writing Python code. For example, evalhub eval run --config eval.yaml --wait submits a job and blocks until it completes, returning a non-zero exit code on failure. Every pull request that touches a prompt template, a retrieval strategy, or a model configuration can trigger an EvalHub evaluation run as part of the pipeline. The pull request review will then include both the code review and an evaluation report showing the baseline score, the updated score, and a breakdown of which slices improved or regressed.
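The report attached to a pull request could be assembled along these lines. This is a sketch of the comparison logic only and does not use the EvalHub SDK or CLI: the `diff_slices` helper and the 0.5-point tolerance are assumptions, the baseline scores reuse the e-commerce example, and the updated scores are hypothetical.

```python
def diff_slices(baseline: dict, updated: dict, tolerance: float = 0.5) -> dict:
    """Sort slices into improved, regressed, or unchanged.

    Changes smaller than `tolerance` points are treated as noise.
    """
    report = {"improved": [], "regressed": [], "unchanged": []}
    for name in sorted(baseline):
        delta = updated[name] - baseline[name]
        if delta > tolerance:
            report["improved"].append(name)
        elif delta < -tolerance:
            report["regressed"].append(name)
        else:
            report["unchanged"].append(name)
    return report


baseline = {"Electronics": 82.0, "Clothing": 95.0, "Home Goods": 45.0}
# Hypothetical scores after a change to the Home Goods retrieval strategy.
updated = {"Electronics": 83.0, "Clothing": 91.0, "Home Goods": 68.0}

report = diff_slices(baseline, updated)

# A pipeline step would pass this value to sys.exit() so that any
# regression fails the build; it is only computed here.
exit_code = 1 if report["regressed"] else 0

print(report)
print("exit code:", exit_code)
```

Whether a regression in one slice should fail the build when another slice improves sharply is a policy decision; this sketch takes the strict option.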
The evalhub.mcp module (developer preview) extends this to agentic workflows. AI agents and coding assistants can browse providers, benchmarks, and collections as MCP resources, submit evaluation jobs, and cancel running jobs via two MCP tools—all through a dedicated MCP server. This enables a pattern in which an agent validates its own changes against a defined collection before committing.
The bring-your-own-framework capability rounds this out. Extend FrameworkAdapter in evalhub.adapter and implement run_benchmark_job to plug your custom evaluation logic into the orchestration, tracking, and reporting infrastructure. The adapter's OCI persistence (callbacks.create_oci_artifact(...)) pushes evaluation results to an OCI registry using olot and oras-py. In Kubernetes mode, the sidecar handles authentication; in local mode, standard Docker config is used. The artifact OCI reference and digest are included in the results reported back to EvalHub, so long-term evaluation provenance is queryable alongside MLflow experiment data.
EDD in practice: From e-commerce to healthcare
Let's make this concrete with two examples drawn from real evaluation patterns.
E-commerce recommendation quality
The team defines a collection with three slices: Electronics (threshold: 90%), Clothing (threshold: 75%), Home Goods (threshold: 80%). They run the collection against their baseline RAG pipeline. Electronics: 82%, Clothing: 95%, Home Goods: 45%.
The optimization target is unambiguous: Home Goods at 45% needs urgent attention. The team discovers that the product catalog data for Home Goods is poorly structured and rarely retrieved correctly. They redesign the retrieval strategy for that category specifically—a change that would never have been identified from the 79% aggregate score.
Healthcare LLM safety validation
The team uses the healthcare_safety_v1 collection, which covers clinical reasoning (MedQA and PubMedQA), safety (Garak toxicity and adversarial probes), and RAG groundedness (Ragas faithfulness). Each benchmark has a minimum threshold reflecting regulatory requirements. Per-specialty slicing tracks cardiology, dermatology, and neurology separately.
Before each model update, the collection runs automatically. If any specialty falls below its threshold, the update is blocked. If safety scores decline even as aggregate accuracy improves, the evaluation explicitly surfaces the tradeoff rather than hiding it in the average.
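The gating rule described here can be sketched as a single function. The function name, the input layout, and most of the numbers are hypothetical; only the cardiology figures (88% against a 95% threshold) and the neurology score of 97% echo the earlier healthcare example.

```python
def review_update(specialty_scores: dict, thresholds: dict,
                  previous: dict, current: dict) -> dict:
    """Decide whether a model update may proceed.

    Blocks when any specialty is below its threshold, and flags the case
    where aggregate accuracy rose while the safety score fell.
    """
    blocked_by = sorted(
        name for name, score in specialty_scores.items()
        if score < thresholds[name]
    )
    safety_tradeoff = (
        current["accuracy"] > previous["accuracy"]
        and current["safety"] < previous["safety"]
    )
    return {
        "blocked": bool(blocked_by),
        "blocked_by": blocked_by,
        "safety_tradeoff": safety_tradeoff,
    }


# Cardiology and the neurology score follow the earlier example; the
# dermatology values and the other thresholds are assumed for this sketch.
specialty_scores = {"cardiology": 88.0, "dermatology": 93.0, "neurology": 97.0}
thresholds = {"cardiology": 95.0, "dermatology": 90.0, "neurology": 95.0}

# Hypothetical aggregate scores before and after the model update.
previous = {"accuracy": 84.0, "safety": 91.0}
current = {"accuracy": 87.0, "safety": 88.0}

decision = review_update(specialty_scores, thresholds, previous, current)
print(decision)
```

Reporting the tradeoff as its own field, rather than folding it into one score, keeps a safety decline visible even when the headline accuracy improves.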
The shift: From black box to optimization problem
The traditional approach to AI development treats the model as a black box: put something in, get something out, and decide subjectively if it's good enough.
EDD reframes the problem: you have a current performance score, a desired performance threshold, and a measurable gap between them. Your engineering work is to close that gap systematically, through controlled experiments, guided by evaluation data rather than instinct.
EvalHub is the platform that makes this reframe possible, pragmatic, and effective for enterprise development. It handles the measurement infrastructure so teams can focus on the optimization work. It enforces the discipline of defining criteria before running experiments. It makes sliced evaluation as easy as aggregate evaluation. And it ensures that the evaluation practice scales from a developer's laptop to a production Kubernetes cluster without a platform engineering project in between.
Stop guessing. Define your criteria. Run your collections. Measure the gap. Close it. Repeat.
That is EDD. That is what EvalHub is built for.
Start here
EvalHub is open source under the Apache 2.0 license and deploys on Kubernetes via the TrustyAI operator. For teams already running Red Hat OpenShift AI, it is available as part of the TrustyAI stack with no additional infrastructure required. Start with the resources below, or learn more about how Red Hat AI supports production-grade, governed AI.