Understanding evaluation collections in EvalHub

Define what "good" looks like before you run a single benchmark

June 4, 2026
William Caban Babilonia Julian Payne Marius Ion Danciu
Related topics: Artificial intelligence, Kubernetes
Related products: Red Hat AI

    In EvalHub: Because "looks good to me" isn't a benchmark, we identified five structural failures in enterprise AI evaluation. The second one, the "what should I measure?" problem, is the one that bites teams earliest and most quietly. You have a model. You have an endpoint. You need to know if it's good enough to deploy. So you run MMLU, get a score, and make a judgment call.

    The score tells you almost nothing useful.

    Not because MMLU is a bad benchmark. It is a rigorous one. The problem is that a single benchmark score is a proxy answer to a different question. What you actually need to know is whether the model clears the bar for your specific use case, deployment context, and risk tolerance. That requires a set of measurements (plural) and a defined bar for each one, not a vague intuition about what a good score looks like.

    The previous post, Evaluation-driven development with EvalHub, introduced evaluation-driven development (EDD) and showed how evaluation collections operationalize step 1 of the EDD cycle: defining criteria before running experiments. This post goes one level deeper. It covers how to read an existing system collection, understand its threshold logic, and build your own collection that encodes your actual measurement strategy with thresholds that mean something.

    The Leaderboard v2 collection is the worked example. It is one of EvalHub's built-in system collections, and its threshold choices are defensible enough to deserve explanation.

    This is the fourth post in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:

    • Part 1: How EvalHub manages two-layer Kubernetes control planes
    • Part 2: EvalHub: Because "looks good to me" isn't a benchmark
    • Part 3: Evaluation-driven development with EvalHub
    • Part 4: Understanding evaluation collections in EvalHub

    What an evaluation collection actually is

    An evaluation collection is a named, versioned artifact that captures three things:

    • Which benchmarks to run: Identified by benchmark ID and the provider that executes them
    • How to weight them: Numerical weights that determine each benchmark's contribution to the aggregate score
    • What thresholds define a pass: Both per-benchmark (individual gates) and collection-level (overall pass criteria)

    Collections are stored server-side and referenced by ID in evaluation requests. They are not shell scripts or experiment configurations. Instead, they are first-class EvalHub resources with full CRUD support via the REST API and CLI.

    The server distinguishes two scopes:

    • system: Built-in collections shipped with EvalHub and maintained by the project team. Read-only. For example, leaderboard-v2.
    • user: Collections you create with full read/write access. These are where your domain-specific measurement strategies live.

    When you reference a collection ID in an evaluation request, EvalHub expands it: it reads the benchmark list, groups benchmarks by provider, routes each group to the appropriate backend in parallel, applies the weights, and returns a two-tier result: collection-level aggregate and per-benchmark breakdown.
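
    The grouping step in that expansion can be pictured with a short sketch. This is not EvalHub's implementation, only an illustration of grouping a benchmark list by provider_id; the second provider and its benchmark ID are made up for the example.

```python
from collections import defaultdict

# Hypothetical in-memory form of a collection's benchmark list.
# "example_provider" and "example_safety_check" are invented for illustration.
benchmarks = [
    {"id": "leaderboard_ifeval", "provider_id": "lm_evaluation_harness"},
    {"id": "leaderboard_bbh", "provider_id": "lm_evaluation_harness"},
    {"id": "example_safety_check", "provider_id": "example_provider"},
]


def group_by_provider(benchmarks):
    """Return {provider_id: [benchmark ids]}, preserving list order."""
    groups = defaultdict(list)
    for bench in benchmarks:
        groups[bench["provider_id"]].append(bench["id"])
    return dict(groups)


groups = group_by_provider(benchmarks)
for provider, ids in groups.items():
    # Each group could then be dispatched to its backend independently.
    print(provider, ids)
```

    Each group is an independent unit of work, which is what makes routing the groups to their backends in parallel possible.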

    The Leaderboard v2 collection: A system baseline worth understanding

    Leaderboard v2 (leaderboard-v2) is EvalHub's implementation of the Open LLM Leaderboard v2 benchmark suite, which is the current standard for comparing general-purpose instruction-tuned models. It comprises six benchmarks, all executed via lm_evaluation_harness:

    id: leaderboard-v2
    category: leaderboard
    description: "Leaderboard v2"
    tags:
      - leaderboard
    pass_criteria:
      threshold: 38.0
    
    benchmarks:
      - id: leaderboard_ifeval
        provider_id: lm_evaluation_harness
        metric: inst_level_strict_acc
        threshold: 80.0
        weight: 1
        lower_is_better: false
    
      - id: leaderboard_bbh
        provider_id: lm_evaluation_harness
        metric: acc_norm
        threshold: 68.0
        weight: 1
        lower_is_better: false
    
      - id: leaderboard_gpqa
        provider_id: lm_evaluation_harness
        metric: acc_norm
        threshold: 40.0
        weight: 1
        lower_is_better: false
    
      - id: leaderboard_mmlu_pro
        provider_id: lm_evaluation_harness
        metric: acc_norm
        threshold: 60.0
        weight: 1
        lower_is_better: false
    
      - id: leaderboard_musr
        provider_id: lm_evaluation_harness
        metric: acc_norm
        threshold: 38.0
        weight: 1
        lower_is_better: false
    
      - id: leaderboard_math_hard
        provider_id: lm_evaluation_harness
        metric: exact_match
        threshold: 55.0
        weight: 1
        lower_is_better: false

    Why these six benchmarks

    Each benchmark measures a qualitatively distinct capability:

    Benchmark | What it actually measures | Key callout
    IFEval | Instruction following: can the model do what it is told, in the format it is told | The largest differentiator between base and instruction-tuned models at the <80B scale.
    BBH | Chain-of-thought reasoning across 23 hard tasks where earlier models scored below human level | Relatively stable across well-tuned 70B models; Phi-4 (14B) scores around 65, which is above many larger models.
    GPQA | PhD-level science reasoning in biology, physics, and chemistry | Hardest discriminator in this class. PhD experts score 65%, skilled non-experts 34%; frontier GPT-4 baselines hit only 39% at initial release.
    MMLU-Pro | Reasoning-centric knowledge across 12,000+ graduate-level questions in 14+ domains | Anything above 55 is competitive; Gemma-3-27B and Phi-4 land 57–62 at this tier.
    MuSR | Multi-step logical reasoning inside long-form narrative text (murder mysteries, etc.) | Notoriously noisy. Even Qwen2.5-72B scores 11.7, so a score above 30 is solid.
    Math-Lvl-5 | The hardest 1,324 problems from the MATH dataset (competition math) | Math-focused post-training shows clearly here; Phi-4 and Gemma-3-27B score higher per parameter than Llama-3.1-70B.

    Two-tier threshold logic

    The collection defines thresholds at two levels, each serving a different purpose.

    Benchmark-level thresholds act as the per-dimension gates. They encode what a strong model looks like for a competitive <80B model in each specific capability:

    Benchmark | Threshold | Rationale
    IFEval | 80.0 | A score above 65 indicates consistent instruction adherence; 80 represents frontier <80B performance (where Qwen2.5-72B scores around 86)
    BBH | 68.0 | Top of the "strong" range for well-tuned 70B models
    GPQA | 40.0 | Anything above 35 for a sub-80B model is strong; 40 marks the frontier tier
    MMLU-Pro | 60.0 | Competitive bar for this tier; top models land 57–67
    MuSR | 38.0 | Top of what the best <80B models reliably achieve
    Math-Lvl-5 | 55.0 | Strong post-training signal; above 55 is excellent for <80B

    The collection-level pass_criteria.threshold of 38.0 is not derived from a single benchmark. It is the weighted average score that a model must reach across all six benchmarks to pass at the collection level. Because every weight is 1, that weighted average is simply the arithmetic mean of the six per-benchmark scores. The value of 38.0 sets a realistic bar for a genuinely capable model: clearing it across all six benchmarks demonstrates a broader capability floor than any individual benchmark can establish.
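
    As a rough illustration of that arithmetic, the sketch below computes a weighted mean and compares it with the 38.0 collection threshold. The six scores are hypothetical, and treating "pass" as greater than or equal to the threshold is an assumption; the server's exact comparison and rounding are not specified here.

```python
def weighted_score(results):
    """Weighted mean of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in results)
    if total_weight <= 0:
        raise ValueError("at least one benchmark needs a positive weight")
    return sum(score * weight for score, weight in results) / total_weight


# Hypothetical scores for the six Leaderboard v2 benchmarks, each with weight 1.
scores = [70.0, 60.0, 30.0, 50.0, 20.0, 40.0]
aggregate = weighted_score([(score, 1) for score in scores])

# Assumption: the collection passes when the aggregate meets or exceeds 38.0.
passed = aggregate >= 38.0
print(aggregate, passed)
```

    With equal weights the function reduces to a plain mean, so this hypothetical model passes at the collection level even though several of its individual scores sit below the per-benchmark thresholds.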

    Here is a practical, competitive bar that distinguishes capable from average models at the <80B weight class:

    IFEval ≥ 65 | BBH ≥ 55 | GPQA ≥ 25 | MMLU-Pro ≥ 50 | MuSR ≥ 25 | Math-Lvl-5 ≥ 35

    The Leaderboard v2 thresholds sit roughly 10 to 20 points above this bar on every benchmark. This is deliberate: system collections are conservative reference points, not deployment gates. Your custom collection is where you set the thresholds that actually matter for your use case.

    Creating a custom collection

    System collections, such as Leaderboard v2, are read-only. To define your own measurement strategy with different benchmarks, different weights, and thresholds calibrated to your use case, you create a user-scoped collection.

    The full schema for a collection create request:

    {
      "name": "string (required)",
      "category": "string (required)",
      "description": "string (optional, max 1024 chars)",
      "benchmarks": [
        {
          "id": "string (required) — benchmark identifier",
          "provider_id": "string (required) — which backend runs this",
          "metric": "string — primary metric name",
          "threshold": "number — pass threshold for this benchmark",
          "weight": "number (non-negative) — contribution to aggregate, default 1",
          "lower_is_better": "boolean — score direction, default false",
          "parameters": "object (optional) — benchmark-specific config",
          "url": "string (optional) — custom provider URL"
        }
      ],
      "tags": ["string"],
      "metadata": {"key": "value"},
      "pass_criteria": {
        "threshold": "number — collection-level aggregate threshold"
      }
    }
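
    Before sending a create request, it can help to check a payload against the constraints listed in the schema. The helper below is a client-side sketch only: validate_collection is a hypothetical function, not part of the EvalHub SDK, and the server's own validation remains authoritative.

```python
def validate_collection(spec):
    """Return a list of problems found in a collection create payload."""
    problems = []
    # Top-level required fields from the schema above.
    for field in ("name", "category"):
        if not spec.get(field):
            problems.append(f"missing required field: {field}")
    if len(spec.get("description", "")) > 1024:
        problems.append("description exceeds 1024 characters")
    for i, bench in enumerate(spec.get("benchmarks", [])):
        for field in ("id", "provider_id"):
            if not bench.get(field):
                problems.append(f"benchmarks[{i}] missing required field: {field}")
        # Weight defaults to 1 and must be non-negative.
        if bench.get("weight", 1) < 0:
            problems.append(f"benchmarks[{i}] has a negative weight")
    return problems


good = {
    "name": "Example gate",
    "category": "deployment-gate",
    "benchmarks": [{"id": "leaderboard_ifeval", "provider_id": "lm_evaluation_harness"}],
}
bad = {"name": "Example gate", "benchmarks": [{"id": "leaderboard_bbh", "weight": -1}]}

print(validate_collection(good))
print(validate_collection(bad))
```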

    Worked example: Adapting Leaderboard v2 for a production deployment gate

    Suppose your team is deploying a general-purpose assistant on OpenShift, and you want a deployment gate that reflects your actual requirements rather than the frontier-optimized thresholds in the system collection. Your model is expected to handle instruction-following tasks and multi-step reasoning, but you are not optimizing for PhD-level science or competition math.

    You want:

    • Lower stakes on GPQA and Math-Lvl-5 (not your use case's core capability)
    • Higher weight on IFEval (instruction following is critical for your assistant)
    • A collection-level pass bar calibrated to a competitive <80B model instead of the frontier tier

    name: "General Assistant Deployment Gate v1"
    category: "deployment-gate"
    description: "Production deployment gate for general-purpose assistants. Based on Leaderboard v2 with weights adjusted for instruction-following priority. Thresholds calibrated to competitive <80B bar."
    tags:
      - assistant
      - deployment-gate
      - general-purpose
    pass_criteria:
      threshold: 55.0
    
    benchmarks:
      - id: leaderboard_ifeval
        provider_id: lm_evaluation_harness
        metric: inst_level_strict_acc
        threshold: 65.0
        weight: 2.0
        lower_is_better: false
    
      - id: leaderboard_bbh
        provider_id: lm_evaluation_harness
        metric: acc_norm
        threshold: 55.0
        weight: 1.5
        lower_is_better: false
    
      - id: leaderboard_gpqa
        provider_id: lm_evaluation_harness
        metric: acc_norm
        threshold: 25.0
        weight: 0.5
        lower_is_better: false
    
      - id: leaderboard_mmlu_pro
        provider_id: lm_evaluation_harness
        metric: acc_norm
        threshold: 50.0
        weight: 1.5
        lower_is_better: false
    
      - id: leaderboard_musr
        provider_id: lm_evaluation_harness
        metric: acc_norm
        threshold: 25.0
        weight: 1.0
        lower_is_better: false
    
      - id: leaderboard_math_hard
        provider_id: lm_evaluation_harness
        metric: exact_match
        threshold: 35.0
        weight: 0.5
        lower_is_better: false

    Key design decisions made explicit:

    • IFEval weight: 2.0. Instruction following is the primary capability this deployment needs. At a weight of 2.0, IFEval contributes twice as much to the aggregate as a weight-1.0 benchmark such as MuSR, and four times as much as GPQA or Math-Lvl-5.
    • GPQA and Math-Lvl-5 weights: 0.5. These benchmarks are still measured (regressions matter), but their contribution to the gate is half that of a default-weight benchmark.
    • Collection threshold: 55.0. This is the bar the weighted average must clear. With IFEval weighted at 2.0, clearing IFEval at 65 or above pulls the average up significantly; failing it drags the aggregate down hard. The threshold is set to be achievable by a well-tuned competitive model that prioritizes the right capabilities.

    The pass_criteria.threshold is deliberately not the average of the per-benchmark thresholds. It is a separate, independently calibrated number. A model can fail one benchmark threshold and still clear the collection-level gate if it excels in more heavily weighted dimensions. A model can pass every individual benchmark threshold and still fail the collection gate if its weighted average falls short. These two signals together are more informative than either alone.
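
    The second case is easy to check by hand. The sketch below takes the thresholds and weights from the collection above and computes the aggregate for a hypothetical model that lands exactly on every per-benchmark threshold, assuming the aggregate is a plain weighted mean.

```python
# (threshold, weight) pairs copied from the custom collection above.
gate = {
    "leaderboard_ifeval": (65.0, 2.0),
    "leaderboard_bbh": (55.0, 1.5),
    "leaderboard_gpqa": (25.0, 0.5),
    "leaderboard_mmlu_pro": (50.0, 1.5),
    "leaderboard_musr": (25.0, 1.0),
    "leaderboard_math_hard": (35.0, 0.5),
}

total_weight = sum(weight for _, weight in gate.values())

# Aggregate for a model that scores exactly at each threshold.
floor = sum(threshold * weight for threshold, weight in gate.values()) / total_weight

# Share of the aggregate that each benchmark controls.
shares = {name: weight / total_weight for name, (_, weight) in gate.items()}

print(round(floor, 2))                          # 48.93
print(round(shares["leaderboard_ifeval"], 3))   # 0.286
```

    A model that only just passes every individual gate averages about 48.9, below the 55.0 collection threshold, so it must exceed its per-benchmark bars somewhere, most effectively on IFEval, which controls roughly 29% of the aggregate.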

    Registering the collection via CLI

    Save the preceding YAML as assistant-gate-v1.yaml, then:

    evalhub collections create --spec assistant-gate-v1.yaml

    The server validates the request, assigns an ID, timestamps it, and registers it as a user-scoped collection. The response includes the assigned id. Use this identifier in evaluation requests.

    # Confirm the collection was registered
    evalhub collections describe <assigned-id>
    
    # List all collections, including system and user-scoped
    evalhub collections list

    Via the Python SDK:

    from evalhub.client import SyncEvalHubClient
    
    client = SyncEvalHubClient(base_url="http://evalhub-service:8080")
    
    # List all collections
    collections = client.collections.list()
    
    # Retrieve the specific collection to verify
    collection = client.collections.get(id="<assigned-id>")

    Running an evaluation against your collection

    With the collection registered, running an evaluation is a single request that references the collection ID:

    evalhub eval run \
      --collection <assigned-id> \
      --model-url http://vllm-service:8080/v1 \
      --wait

    The --wait flag blocks until the run completes and returns a non-zero exit code if the collection-level pass criteria are not met, making this directly usable as a CI/CD gate.
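
    For pipelines that are driven from Python rather than a shell step, the same gate can be wrapped with subprocess. This is a minimal sketch: build_command and run_gate are hypothetical helper names, and the only CLI behavior relied on is the command and exit code described above. A non-zero exit can also come from an unrelated failure such as a connection error, so a real pipeline may want to inspect the output as well.

```python
import subprocess


def build_command(collection_id, model_url):
    """Assemble the CLI invocation shown above as an argument list."""
    return [
        "evalhub", "eval", "run",
        "--collection", collection_id,
        "--model-url", model_url,
        "--wait",
    ]


def run_gate(collection_id, model_url):
    """Run the evaluation and return True only when the CLI exits with status 0."""
    completed = subprocess.run(build_command(collection_id, model_url))
    return completed.returncode == 0


# Example usage in a pipeline step (requires the evalhub CLI on PATH):
# if not run_gate("<assigned-id>", "http://vllm-service:8080/v1"):
#     raise SystemExit(1)
```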

    Equivalently, via the REST API:

    curl -X POST http://evalhub-service:8080/api/v1/evaluations \
      -H "Content-Type: application/json" \
      -d '{
        "model": { "url": "http://vllm-service:8080/v1" },
        "collection_id": "<assigned-id>",
        "experiment": {
          "name": "llama-3.2-3b-assistant-gate-eval",
          "tags": {
            "model_family": "llama-3",
            "environment": "staging",
            "collection_version": "v1"
          }
        }
      }'

    EvalHub expands the collection, groups the six benchmarks (all from lm_evaluation_harness in this case, so a single provider call), runs them, applies the weights, and returns:

    {
      "status": "completed",
      "collection_id": "<assigned-id>",
      "collection_score": 61.4,
      "pass_criteria": {
        "threshold": 55.0,
        "passed": true
      },
      "benchmark_results": [
        { "id": "leaderboard_ifeval",   "score": 71.2, "threshold": 65.0, "passed": true  },
        { "id": "leaderboard_bbh",      "score": 58.3, "threshold": 55.0, "passed": true  },
        { "id": "leaderboard_gpqa",     "score": 22.1, "threshold": 25.0, "passed": false },
        { "id": "leaderboard_mmlu_pro", "score": 51.8, "threshold": 50.0, "passed": true  },
        { "id": "leaderboard_musr",     "score": 29.4, "threshold": 25.0, "passed": true  },
        { "id": "leaderboard_math_hard","score": 31.2, "threshold": 35.0, "passed": false }
      ],
      "experiment": { "mlflow_run_id": "..." }
    }

    Two things are immediately visible in this result:

    • The model cleared the collection gate (61.4 > 55.0 threshold) and meets your defined criteria for deployment.
    • GPQA (22.1 vs 25.0) and Math-Lvl-5 (31.2 vs 35.0) failed their individual thresholds. However, because these are low-weight dimensions, the failure did not lower the aggregate score below the gate.

    This is the distinction that matters operationally. The collection gate answers the deployment question: Can this go to production? The per-benchmark breakdown answers the engineering question: Which areas still need work? Both are present in every evaluation result. An aggregate-only result forces you to choose between them.
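
    Both views can be pulled out of a result programmatically. The sketch below lists the failing benchmarks from the sample response and recomputes the aggregate from the collection's weights, assuming a plain weighted mean. Under that assumption the sample per-benchmark scores work out to about 51.9 rather than the 61.4 shown, so treat the figures in the sample response as illustrative, or expect that the server aggregates differently than this sketch does.

```python
# Weights from the custom collection defined earlier.
weights = {
    "leaderboard_ifeval": 2.0,
    "leaderboard_bbh": 1.5,
    "leaderboard_gpqa": 0.5,
    "leaderboard_mmlu_pro": 1.5,
    "leaderboard_musr": 1.0,
    "leaderboard_math_hard": 0.5,
}

# Per-benchmark entries copied from the sample response above.
results = [
    {"id": "leaderboard_ifeval", "score": 71.2, "threshold": 65.0},
    {"id": "leaderboard_bbh", "score": 58.3, "threshold": 55.0},
    {"id": "leaderboard_gpqa", "score": 22.1, "threshold": 25.0},
    {"id": "leaderboard_mmlu_pro", "score": 51.8, "threshold": 50.0},
    {"id": "leaderboard_musr", "score": 29.4, "threshold": 25.0},
    {"id": "leaderboard_math_hard", "score": 31.2, "threshold": 35.0},
]

# Engineering view: which dimensions still need work?
failed = [r["id"] for r in results if r["score"] < r["threshold"]]

# Deployment view: recompute the aggregate as a plain weighted mean (an assumption).
recomputed = sum(r["score"] * weights[r["id"]] for r in results) / sum(weights.values())

print(failed)
print(round(recomputed, 1))
```

    Recomputing the aggregate on the client side is a cheap sanity check that the collection definition and the reported score agree before a result is used as a release decision.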

    The full experiment record (scores, configurations, collection version, and hardware tags) is automatically written to MLflow via the tags in the experiment configuration. Every run is queryable. If you reproduce this run three months from now, you will have the exact configuration that produced these numbers.

    Managing collection evolution

    Evaluation criteria are not static. Regulatory requirements change. New benchmarks emerge. Your use case evolves. The collection schema is built for this.

    Update a collection via the REST API:

    # Partial update — change the pass_criteria threshold only
    curl -X PATCH http://evalhub-service:8080/api/v1/evaluations/collections/<id> \
      -H "Content-Type: application/json" \
      -d '[
        { "op": "replace", "path": "/pass_criteria/threshold", "value": 60.0 }
      ]'
    
    # Full replacement — swap in a new benchmark list
    curl -X PUT http://evalhub-service:8080/api/v1/evaluations/collections/<id> \
      -H "Content-Type: application/json" \
      -d '{ ... full updated collection body ... }'

    The practical implication: when your collection changes, update the tags in evaluation requests to include a collection_version tag. MLflow then lets you query all runs against v1 of this collection separately from all runs against v2. Collection evolution becomes traceable in the experiment history rather than a silent configuration drift that makes historical scores incomparable.

    What this unlocks

    Every team operating production AI workloads has evaluation criteria, whether written down or not. The criteria live in someone's head, in a Slack thread, in a notebook, or in a one-off script that ran before the last launch. Those undocumented guidelines are not reusable, auditable, or portable across model upgrades or team changes.

    An evaluation collection externalizes that knowledge. The thresholds are not judgment calls made the night before a launch. They are deliberate design decisions, documented in a versioned artifact, enforced consistently on every evaluation run. Changing them requires a deliberate act, not an implicit recalibration.

    The Leaderboard v2 collection gives you a grounded starting point: six benchmarks whose thresholds are calibrated against the real distribution of <80B model scores, with a documented rationale for each threshold choice. From there, the work is adapting weights and thresholds to your actual use case. That adaptation is what turns generic benchmark scores into actionable insight about your specific deployment.

    Start with evalhub collections describe leaderboard-v2. Understand what each threshold is actually measuring and why. Then build the collection that answers your deployment question instead of the question from the open source leaderboard.

    Start here

    Use the following resources to explore EvalHub and begin building your own evaluation collections:

    • EvalHub website
    • EvalHub server (Collections API, OpenAPI spec)
    • EvalHub SDK (EvalHub collections CLI, REST client)
    • OpenAPI specification
    • TrustyAI Operator (Kubernetes/OpenShift deployment)
