In EvalHub: Because "looks good to me" isn't a benchmark, we identified five structural failures in enterprise AI evaluation. The second, the "what should I measure?" problem, is the one that bites teams earliest and most quietly. You have a model. You have an endpoint. You need to know if it's good enough to deploy. So you run MMLU, get a score, and make a judgment call.
The score tells you almost nothing useful.
Not because MMLU is a bad benchmark. It is a rigorous one. The problem is that a single benchmark score is a proxy answer to a different question. What you actually need to know is whether the model clears the bar for your specific use case, deployment context, and risk tolerance. That requires a set of measurements (plural) and a defined bar for each one, not a vague intuition about what a good score looks like.
Evaluation-driven development with EvalHub introduced evaluation-driven development (EDD) and showed how evaluation collections operationalize step 1 of the EDD cycle: defining criteria before running experiments. This post goes one level deeper. It covers how to read an existing system collection, understand its threshold logic, and build your own collection that encodes your actual measurement strategy with thresholds that mean something.
The Leaderboard v2 collection is the worked example. It is one of EvalHub's built-in system collections, and its threshold choices are defensible enough to deserve explanation.
This is the fourth post in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:
- Part 1: How EvalHub manages two-layer Kubernetes control planes
- Part 2: EvalHub: Because "looks good to me" isn't a benchmark
- Part 3: Evaluation-driven development with EvalHub
- Part 4: Understanding evaluation collections in EvalHub
What an evaluation collection actually is
An evaluation collection is a named, versioned artifact that captures three things:
- Which benchmarks to run: Identified by benchmark ID and the provider that executes them
- How to weight them: Numerical weights that determine each benchmark's contribution to the aggregate score
- What thresholds define a pass: Both per-benchmark (individual gates) and collection-level (overall pass criteria)
Collections are stored server-side and referenced by ID in evaluation requests. They are not shell scripts or experiment configurations. Instead, they are first-class EvalHub resources with full CRUD support via the REST API and CLI.
The server distinguishes two scopes:
- system: Built-in collections shipped with EvalHub and maintained by the project team. Read-only. For example, leaderboard-v2.
- user: Collections you create with full read/write access. These are where your domain-specific measurement strategies live.
When you reference a collection ID in an evaluation request, EvalHub expands it: it reads the benchmark list, groups benchmarks by provider, routes each group to the appropriate backend in parallel, applies the weights, and returns a two-tier result: collection-level aggregate and per-benchmark breakdown.
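The grouping step in that expansion can be pictured with a small sketch. This is not EvalHub's implementation, only an illustration of what "groups benchmarks by provider" means; the second provider and its benchmark are hypothetical and exist only to give the grouping more than one bucket.

```python
from collections import defaultdict

# Benchmark entries as they might appear in a collection. The provider
# "example_provider" and its benchmark are hypothetical, added for illustration.
benchmarks = [
    {"id": "leaderboard_ifeval", "provider_id": "lm_evaluation_harness"},
    {"id": "leaderboard_bbh", "provider_id": "lm_evaluation_harness"},
    {"id": "example_safety_check", "provider_id": "example_provider"},
]


def group_by_provider(entries):
    """Map each provider_id to the list of benchmark IDs it should run."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["provider_id"]].append(entry["id"])
    return dict(groups)


groups = group_by_provider(benchmarks)
for provider_id, benchmark_ids in groups.items():
    print(provider_id, benchmark_ids)
```

Each group can then be dispatched to its backend independently, which is what makes the parallel routing possible.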
The Leaderboard v2 collection: A system baseline worth understanding
Leaderboard v2 (leaderboard-v2) is EvalHub's implementation of the Open LLM Leaderboard v2 benchmark suite, which is the current standard for comparing general-purpose instruction-tuned models. It comprises six benchmarks, all executed via lm_evaluation_harness:
id: leaderboard-v2
category: leaderboard
description: "Leaderboard v2"
tags:
  - leaderboard
pass_criteria:
  threshold: 38.0
benchmarks:
  - id: leaderboard_ifeval
    provider_id: lm_evaluation_harness
    metric: inst_level_strict_acc
    threshold: 80.0
    weight: 1
    lower_is_better: false
  - id: leaderboard_bbh
    provider_id: lm_evaluation_harness
    metric: acc_norm
    threshold: 68.0
    weight: 1
    lower_is_better: false
  - id: leaderboard_gpqa
    provider_id: lm_evaluation_harness
    metric: acc_norm
    threshold: 40.0
    weight: 1
    lower_is_better: false
  - id: leaderboard_mmlu_pro
    provider_id: lm_evaluation_harness
    metric: acc_norm
    threshold: 60.0
    weight: 1
    lower_is_better: false
  - id: leaderboard_musr
    provider_id: lm_evaluation_harness
    metric: acc_norm
    threshold: 38.0
    weight: 1
    lower_is_better: false
  - id: leaderboard_math_hard
    provider_id: lm_evaluation_harness
    metric: exact_match
    threshold: 55.0
    weight: 1
    lower_is_better: false
Why these six benchmarks
Each benchmark measures a qualitatively distinct capability:
| Benchmark | What it actually measures | Key callout |
|---|---|---|
| IFEval | Instruction following: can the model do what it is told, in the format it is told | The largest differentiator between base and instruction-tuned models at the <80B scale. |
| BBH | Chain-of-thought reasoning across 23 hard tasks where earlier models scored below human-level | Relatively stable across well-tuned 70B models; Phi-4 (14B) scores around 65, which is above many larger models. |
| GPQA | PhD-level science reasoning in biology, physics, and chemistry | Hardest discriminator in this class. PhD experts score 65%, skilled non-experts 34%; frontier GPT-4 baselines hit only 39% at initial release. |
| MMLU-Pro | Reasoning-centric knowledge across 12,000+ graduate-level questions in 14+ domains | Anything above 55 is competitive; Gemma-3-27B and Phi-4 land 57–62 at this tier. |
| MuSR | Multi-step logical reasoning inside long-form narrative text (murder mysteries, etc.) | Notoriously noisy. Even Qwen2.5-72B scores 11.7, so a score above 30 is solid. |
| Math-Lvl-5 | The hardest 1,324 problems from the MATH dataset—competition math | Math-focused post-training shows clearly here; Phi-4 and Gemma-3-27B score higher per parameter than Llama-3.1-70B. |
Two-tier threshold logic
The collection defines thresholds at two levels, each serving a different purpose.
Benchmark-level thresholds act as the per-dimension gates. They encode what strong performance looks like for a competitive <80B model in each specific capability:
| Benchmark | Threshold | Rationale |
|---|---|---|
| IFEval | 80.0 | A score above 65 indicates consistent instruction adherence; 80 represents frontier <80B performance (where Qwen2.5-72B scores around 86) |
| BBH | 68.0 | Top of the "strong" range for well-tuned 70B models |
| GPQA | 40.0 | Anything above 35 for a sub-80B model is strong; 40 marks the frontier tier |
| MMLU-Pro | 60.0 | Competitive bar for this tier; top models land 57–67 |
| MuSR | 38.0 | Top of what the best <80B models reliably achieve |
| Math-Lvl-5 | 55.0 | Strong post-training signal; above 55 is excellent for <80B |
The collection-level pass_criteria.threshold of 38.0 is not derived from a single benchmark. It is the bar that a model's weighted average across all six benchmarks must clear to pass at the collection level. With equal weights, that weighted average is simply the arithmetic mean of the six per-benchmark scores. The 38.0 threshold is meant as a realistic bar for a genuinely capable model: a model whose average across all six benchmarks exceeds it clears a broader capability floor than any individual benchmark can establish.
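Written out, the collection-level check looks like the following. This assumes a plain weighted mean and a greater-than-or-equal comparison, neither of which is confirmed here as the server's exact behavior; the symbols $s_i$ (score of benchmark $i$), $w_i$ (its weight), and $T$ (the collection threshold) are introduced only for this sketch.

```latex
S = \frac{\sum_{i=1}^{n} w_i \, s_i}{\sum_{i=1}^{n} w_i},
\qquad
\text{collection passes} \iff S \ge T

% Leaderboard v2: n = 6, every w_i = 1, T = 38.0, so S reduces to
S = \frac{1}{6} \sum_{i=1}^{6} s_i
```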
Here is a practical, competitive bar that distinguishes capable from average models at the <80B weight class:
IFEval ≥ 65 | BBH ≥ 55 | GPQA ≥ 25 | MMLU-Pro ≥ 50 | MuSR ≥ 25 | Math-Lvl-5 ≥ 35
The Leaderboard v2 thresholds set a higher baseline, roughly 10 to 20 points higher depending on the benchmark. This is deliberate: system collections are conservative reference points, not deployment gates. Your custom collection is where you set the thresholds that actually matter for your use case.
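The distance between the two bars is easy to check directly. The sketch below uses only the numbers quoted in this post; the variable names are illustrative.

```python
# Per-benchmark thresholds from the Leaderboard v2 system collection.
system_thresholds = {
    "IFEval": 80.0, "BBH": 68.0, "GPQA": 40.0,
    "MMLU-Pro": 60.0, "MuSR": 38.0, "Math-Lvl-5": 55.0,
}

# The practical competitive bar quoted above.
practical_bar = {
    "IFEval": 65.0, "BBH": 55.0, "GPQA": 25.0,
    "MMLU-Pro": 50.0, "MuSR": 25.0, "Math-Lvl-5": 35.0,
}

# Gap in points between the system threshold and the practical bar.
gaps = {name: system_thresholds[name] - practical_bar[name] for name in system_thresholds}

for name, gap in gaps.items():
    print(f"{name}: +{gap:.0f} points")
```

The smallest gap is on MMLU-Pro and the largest on Math-Lvl-5, which is a useful reminder that a custom collection rarely needs to shift every threshold by the same amount.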
Creating a custom collection
System collections, such as Leaderboard v2, are read-only. To define your own measurement strategy with different benchmarks, different weights, and thresholds calibrated to your use case, you create a user-scoped collection.
The full schema for a collection create request:
{
"name": "string (required)",
"category": "string (required)",
"description": "string (optional, max 1024 chars)",
"benchmarks": [
{
"id": "string (required) — benchmark identifier",
"provider_id": "string (required) — which backend runs this",
"metric": "string — primary metric name",
"threshold": "number — pass threshold for this benchmark",
"weight": "number (non-negative) — contribution to aggregate, default 1",
"lower_is_better": "boolean — score direction, default false",
"parameters": "object (optional) — benchmark-specific config",
"url": "string (optional) — custom provider URL"
}
],
"tags": ["string"],
"metadata": {"key": "value"},
"pass_criteria": {
"threshold": "number — collection-level aggregate threshold"
}
}
Worked example: Adapting Leaderboard v2 for a production deployment gate
Suppose your team is deploying a general-purpose assistant on OpenShift, and you want a deployment gate that reflects your actual requirements rather than the frontier-optimized thresholds in the system collection. Your model is expected to handle instruction-following tasks and multi-step reasoning, but you are not optimizing for PhD-level science or competition math.
You want:
- Lower stakes on GPQA and Math-Lvl-5 (not your use case's core capability)
- Higher weight on IFEval (instruction following is critical for your assistant)
- A collection-level pass bar calibrated to a competitive <80B model instead of the frontier tier
name: "General Assistant Deployment Gate v1"
category: "deployment-gate"
description: "Production deployment gate for general-purpose assistants. Based on Leaderboard v2 with weights adjusted for instruction-following priority. Thresholds calibrated to competitive <80B bar."
tags:
  - assistant
  - deployment-gate
  - general-purpose
pass_criteria:
  threshold: 55.0
benchmarks:
  - id: leaderboard_ifeval
    provider_id: lm_evaluation_harness
    metric: inst_level_strict_acc
    threshold: 65.0
    weight: 2.0
    lower_is_better: false
  - id: leaderboard_bbh
    provider_id: lm_evaluation_harness
    metric: acc_norm
    threshold: 55.0
    weight: 1.5
    lower_is_better: false
  - id: leaderboard_gpqa
    provider_id: lm_evaluation_harness
    metric: acc_norm
    threshold: 25.0
    weight: 0.5
    lower_is_better: false
  - id: leaderboard_mmlu_pro
    provider_id: lm_evaluation_harness
    metric: acc_norm
    threshold: 50.0
    weight: 1.5
    lower_is_better: false
  - id: leaderboard_musr
    provider_id: lm_evaluation_harness
    metric: acc_norm
    threshold: 25.0
    weight: 1.0
    lower_is_better: false
  - id: leaderboard_math_hard
    provider_id: lm_evaluation_harness
    metric: exact_match
    threshold: 35.0
    weight: 0.5
    lower_is_better: false
Key design decisions made explicit:
- IFEval weight: 2.0. Instruction following is the primary capability this deployment needs. At 2.0, IFEval contributes twice as much to the aggregate as a benchmark at the default weight of 1.0 (MuSR here) and four times as much as the 0.5-weighted benchmarks.
- GPQA and Math-Lvl-5 weights: 0.5. These benchmarks are still measured (regressions matter), but their contribution to the gate is halved.
- Collection threshold: 55.0. This is the bar the weighted average must clear. With IFEval weighted 2x, an IFEval score of 65 or above pulls the average up significantly; a weak one drags the aggregate down hard. The threshold is set to be achievable by a well-tuned competitive model that prioritizes the right capabilities.
The pass_criteria.threshold is intentional: it is not the average of the per-benchmark thresholds. It is a separate, independently calibrated number. A model can fail one benchmark threshold and still clear the collection-level gate if it excels in more heavily weighted dimensions. A model can pass every individual benchmark threshold and still fail the collection gate if its weighted average falls short. These two signals together are more informative than either alone.
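Both divergence cases can be reproduced with a few lines of arithmetic. The sketch below assumes the aggregate is a plain weighted mean and that a score equal to a threshold counts as a pass; the server's actual aggregation may differ, and both score sets are hypothetical, chosen only to trigger each case.

```python
# Weights and thresholds copied from the custom collection above.
WEIGHTS = {
    "leaderboard_ifeval": 2.0, "leaderboard_bbh": 1.5, "leaderboard_gpqa": 0.5,
    "leaderboard_mmlu_pro": 1.5, "leaderboard_musr": 1.0, "leaderboard_math_hard": 0.5,
}
THRESHOLDS = {
    "leaderboard_ifeval": 65.0, "leaderboard_bbh": 55.0, "leaderboard_gpqa": 25.0,
    "leaderboard_mmlu_pro": 50.0, "leaderboard_musr": 25.0, "leaderboard_math_hard": 35.0,
}
GATE = 55.0  # pass_criteria.threshold


def evaluate(scores):
    """Return the weighted mean, the gate decision, and the failed benchmarks."""
    total_weight = sum(WEIGHTS.values())
    aggregate = sum(scores[b] * w for b, w in WEIGHTS.items()) / total_weight
    failed = [b for b, t in THRESHOLDS.items() if scores[b] < t]
    return {"aggregate": round(aggregate, 2), "gate_passed": aggregate >= GATE, "failed": failed}


# Hypothetical model A: misses GPQA but is strong where the weight is.
uneven = evaluate({
    "leaderboard_ifeval": 85.0, "leaderboard_bbh": 65.0, "leaderboard_gpqa": 20.0,
    "leaderboard_mmlu_pro": 58.0, "leaderboard_musr": 35.0, "leaderboard_math_hard": 40.0,
})

# Hypothetical model B: one point above every per-benchmark threshold.
marginal = evaluate({b: t + 1.0 for b, t in THRESHOLDS.items()})

print(uneven)
print(marginal)
```

Model A fails one individual gate yet clears the collection gate, while model B passes every individual gate and still falls short of 55.0. Neither signal subsumes the other.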
Registering the collection via CLI
Save the preceding YAML as assistant-gate-v1.yaml, then:
evalhub collections create --spec assistant-gate-v1.yaml
The server validates the request, assigns an ID, timestamps it, and registers it as a user-scoped collection. The response includes the assigned id. Use this identifier in evaluation requests.
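Since the server validates the request, a malformed spec is rejected at this step. A local pre-check can catch the obvious mistakes earlier. The sketch below is hypothetical and covers only the constraints stated in the schema above (required fields, description length, non-negative weights); it does not reproduce the server's actual validation rules.

```python
def validate_collection(spec):
    """Return a list of problems found in a collection create request."""
    errors = []

    # Top-level required string fields.
    for field in ("name", "category"):
        if not isinstance(spec.get(field), str) or not spec[field]:
            errors.append(f"{field}: required non-empty string")

    # Optional description, capped at 1024 characters.
    description = spec.get("description")
    if description is not None and len(description) > 1024:
        errors.append("description: longer than 1024 characters")

    # Per-benchmark required fields and weight constraint.
    for index, benchmark in enumerate(spec.get("benchmarks", [])):
        for field in ("id", "provider_id"):
            if not isinstance(benchmark.get(field), str) or not benchmark[field]:
                errors.append(f"benchmarks[{index}].{field}: required non-empty string")
        weight = benchmark.get("weight", 1)  # schema default is 1
        if not isinstance(weight, (int, float)) or weight < 0:
            errors.append(f"benchmarks[{index}].weight: must be a non-negative number")

    return errors


valid_spec = {
    "name": "General Assistant Deployment Gate v1",
    "category": "deployment-gate",
    "benchmarks": [
        {"id": "leaderboard_ifeval", "provider_id": "lm_evaluation_harness", "weight": 2.0},
    ],
    "pass_criteria": {"threshold": 55.0},
}
broken_spec = {
    "name": "",
    "description": "x" * 1025,
    "benchmarks": [{"id": "leaderboard_bbh", "weight": -1}],
}

print(validate_collection(valid_spec))
for problem in validate_collection(broken_spec):
    print(problem)
```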
# Confirm the collection was registered
evalhub collections describe <assigned-id>
# List all collections, including system and user-scoped
evalhub collections list
Via the Python SDK:
from evalhub.client import SyncEvalHubClient
client = SyncEvalHubClient(base_url="http://evalhub-service:8080")
# List all collections
collections = client.collections.list()
# Retrieve the specific collection to verify
collection = client.collections.get(id="<assigned-id>")
Running an evaluation against your collection
With the collection registered, running an evaluation is a single request that references the collection ID:
evalhub eval run \
--collection <assigned-id> \
--model-url http://vllm-service:8080/v1 \
--wait
The --wait flag blocks until the run completes and returns a non-zero exit code if the collection-level pass criteria are not met, making this directly usable as a CI/CD gate.
Equivalently, via the REST API:
curl -X POST http://evalhub-service:8080/api/v1/evaluations \
-H "Content-Type: application/json" \
-d '{
"model": { "url": "http://vllm-service:8080/v1" },
"collection_id": "<assigned-id>",
"experiment": {
"name": "llama-3.2-3b-assistant-gate-eval",
"tags": {
"model_family": "llama-3",
"environment": "staging",
"collection_version": "v1"
}
}
}'
EvalHub expands the collection, groups the six benchmarks (all from lm_evaluation_harness in this case, so a single provider call), runs them, applies the weights, and returns:
{
"status": "completed",
"collection_id": "<assigned-id>",
"collection_score": 61.4,
"pass_criteria": {
"threshold": 55.0,
"passed": true
},
"benchmark_results": [
{ "id": "leaderboard_ifeval", "score": 71.2, "threshold": 65.0, "passed": true },
{ "id": "leaderboard_bbh", "score": 58.3, "threshold": 55.0, "passed": true },
{ "id": "leaderboard_gpqa", "score": 22.1, "threshold": 25.0, "passed": false },
{ "id": "leaderboard_mmlu_pro", "score": 51.8, "threshold": 50.0, "passed": true },
{ "id": "leaderboard_musr", "score": 29.4, "threshold": 25.0, "passed": true },
{ "id": "leaderboard_math_hard","score": 31.2, "threshold": 35.0, "passed": false }
],
"experiment": { "mlflow_run_id": "..." }
}
Two things are immediately visible in this result:
- The model cleared the collection gate (61.4 > 55.0 threshold) and meets your defined criteria for deployment.
- GPQA (22.1 vs 25.0) and Math-Lvl-5 (31.2 vs 35.0) failed their individual thresholds. However, because these are low-weight dimensions, the failure did not lower the aggregate score below the gate.
This is the distinction that matters operationally. The collection gate answers the deployment question: Can this go to production? The per-benchmark breakdown answers the engineering question: Which areas still need work? Both are present in every evaluation result. An aggregate-only result forces you to choose between them.
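Reading both answers out of a result is a few lines of code. The sketch below parses a trimmed copy of the response shown above; the field names come from that response, while the helper name and the trimming are illustrative.

```python
import json

# Trimmed copy of the evaluation result shown above.
RESULT_JSON = """
{
  "status": "completed",
  "collection_score": 61.4,
  "pass_criteria": { "threshold": 55.0, "passed": true },
  "benchmark_results": [
    { "id": "leaderboard_ifeval", "score": 71.2, "threshold": 65.0, "passed": true },
    { "id": "leaderboard_bbh", "score": 58.3, "threshold": 55.0, "passed": true },
    { "id": "leaderboard_gpqa", "score": 22.1, "threshold": 25.0, "passed": false },
    { "id": "leaderboard_mmlu_pro", "score": 51.8, "threshold": 50.0, "passed": true },
    { "id": "leaderboard_musr", "score": 29.4, "threshold": 25.0, "passed": true },
    { "id": "leaderboard_math_hard", "score": 31.2, "threshold": 35.0, "passed": false }
  ]
}
"""


def summarize(result):
    """Split a result into the deployment answer and the engineering answer."""
    gate_passed = result["pass_criteria"]["passed"]
    needs_work = [b["id"] for b in result["benchmark_results"] if not b["passed"]]
    return gate_passed, needs_work


result = json.loads(RESULT_JSON)
gate_passed, needs_work = summarize(result)

print("Deploy:", "yes" if gate_passed else "no")
for entry in result["benchmark_results"]:
    if entry["id"] in needs_work:
        shortfall = entry["threshold"] - entry["score"]
        print(f"Needs work: {entry['id']} (short by {shortfall:.1f})")
```

A CI job would typically act on the first value and attach the second to the build report.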
The full experiment record (scores, configurations, collection version, and hardware tags) is automatically written to MLflow via the tags in the experiment configuration. Every run is queryable. If you reproduce this run three months from now, you will have the exact configuration that produced these numbers.
Managing collection evolution
Evaluation criteria are not static. Regulatory requirements change. New benchmarks emerge. Your use case evolves. The collection schema is built for this.
Update a collection via the REST API:
# Partial update — change the pass_criteria threshold only
curl -X PATCH http://evalhub-service:8080/api/v1/evaluations/collections/<id> \
-H "Content-Type: application/json" \
-d '[
{ "op": "replace", "path": "/pass_criteria/threshold", "value": 60.0 }
]'
# Full replacement — swap in a new benchmark list
curl -X PUT http://evalhub-service:8080/api/v1/evaluations/collections/<id> \
-H "Content-Type: application/json" \
-d '{ ... full updated collection body ... }'
The practical implication: when your collection changes, update the tags in evaluation requests to include a collection_version tag. MLflow then lets you query all runs against v1 of this collection separately from all runs against v2. Collection evolution becomes traceable in the experiment history rather than a silent configuration drift that makes historical scores incomparable.
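To make the pairing of a threshold change with a version bump concrete, the sketch below previews a replace operation locally and builds an evaluation request body that carries the new version tag. It is a local illustration only: the helper names are hypothetical, only the replace operation is handled, and the request shape simply mirrors the REST example earlier in this post.

```python
import copy


def apply_replace_ops(document, ops):
    """Apply 'replace' operations with slash-separated paths to a copy of document."""
    patched = copy.deepcopy(document)
    for op in ops:
        if op["op"] != "replace":
            raise ValueError(f"unsupported op: {op['op']}")
        keys = op["path"].split("/")[1:]  # drop the empty segment before the first "/"
        target = patched
        for key in keys[:-1]:
            target = target[int(key)] if isinstance(target, list) else target[key]
        last = keys[-1]
        if isinstance(target, list):
            target[int(last)] = op["value"]
        elif last in target:
            target[last] = op["value"]
        else:
            raise KeyError(f"replace target does not exist: {op['path']}")
    return patched


def build_eval_request(model_url, collection_id, run_name, collection_version):
    """Build an evaluation request body tagged with the collection version."""
    return {
        "model": {"url": model_url},
        "collection_id": collection_id,
        "experiment": {"name": run_name, "tags": {"collection_version": collection_version}},
    }


collection_v1 = {"name": "General Assistant Deployment Gate v1", "pass_criteria": {"threshold": 55.0}}
patch = [{"op": "replace", "path": "/pass_criteria/threshold", "value": 60.0}]
collection_v2 = apply_replace_ops(collection_v1, patch)

request_v2 = build_eval_request(
    "http://vllm-service:8080/v1", "<assigned-id>", "assistant-gate-eval", "v2"
)
print(collection_v2["pass_criteria"], request_v2["experiment"]["tags"])
```

Keeping the patch and the tag bump in the same change is what prevents runs scored against different thresholds from being compared as if they were equivalent.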
What this unlocks
Every team operating production AI workloads has evaluation criteria, whether written down or not. The criteria live in someone's head, in a Slack thread, in a notebook, or in a one-off script that ran before the last launch. Those undocumented guidelines are not reusable, auditable, or portable across model upgrades or team changes.
An evaluation collection externalizes that knowledge. The thresholds are not judgment calls made the night before a launch. They are deliberate design decisions, documented in a versioned artifact, enforced consistently on every evaluation run. Changing them requires a deliberate act, not an implicit recalibration.
The Leaderboard v2 collection gives you a grounded starting point: six benchmarks whose thresholds are calibrated against the real distribution of <80B model scores, with a documented rationale for each threshold choice. From there, the work is adapting weights and thresholds to your actual use case. That adaptation is what turns a generic leaderboard score into a result you can act on for your specific deployment.
Start with evalhub collections describe leaderboard-v2. Understand what each threshold is actually measuring and why. Then build the collection that answers your deployment question instead of the question from the open source leaderboard.
Start here
Use the following resources to explore EvalHub and begin building your own evaluation collections:
- EvalHub website
- EvalHub server (Collections API, OpenAPI spec)
- EvalHub SDK (EvalHub collections CLI, REST client)
- OpenAPI specification
- TrustyAI Operator (Kubernetes/OpenShift deployment)