Add automated AI evaluations to your CI/CD pipeline

The complete practical guide to running EvalHub evaluations from the terminal and wiring them into pipeline gates

June 11, 2026
William Caban Babilonia, Rui Vieira, Matteo Mortari
Related topics:
Artificial intelligence, CI/CD
Related products:
Red Hat AI

    EvalHub's API server and Kubernetes Operator handle orchestration. The Python SDK handles notebook and application integration. But for continuous integration and continuous delivery (CI/CD) pipelines, the CLI is the right surface: it installs in seconds, reads config from environment variables, returns machine-parseable output, and exits non-zero on failure.

    This post covers the full CLI workflow, from first-time setup through a production pipeline gate, without detours into platform architecture or evaluation methodology. Those are covered in the rest of the series. This is the operational reference.

    Series note

    This is the sixth post in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:

    • Part 1: How EvalHub manages two-layer Kubernetes control planes
    • Part 2: EvalHub: Because "looks good to me" isn't a benchmark
    • Part 3: Evaluation-driven development with EvalHub
    • Part 4: Understanding evaluation collections in EvalHub
    • Part 5: Bring your own evaluation framework to EvalHub

    Installation

    The CLI ships as an optional dependency group of the EvalHub SDK:

    pip install "eval-hub-sdk[cli]"

    Verify the install and confirm connectivity to your EvalHub instance:

    evalhub version
    evalhub health

    The evalhub health command exits with a value of 0 if the service is reachable, and 1 if it is not. Use it as the first step in any pipeline that depends on EvalHub.
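
    Because evalhub health reports reachability through its exit code, a pipeline can retry it for a short while before giving up, which helps when the service is still starting. The following is a minimal sketch: wait_for_evalhub and its two parameters are hypothetical names used for illustration, not part of the CLI.

```shell
# wait_for_evalhub ATTEMPTS DELAY_SECONDS
# Retries `evalhub health` until it exits 0 or the attempts run out.
# ATTEMPTS and DELAY_SECONDS are hypothetical parameters of this sketch,
# not options of the evalhub CLI.
wait_for_evalhub() {
  local attempts="${1:-5}"
  local delay="${2:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if evalhub health >/dev/null 2>&1; then
      echo "EvalHub reachable after $i attempt(s)"
      return 0
    fi
    echo "EvalHub not reachable (attempt $i of $attempts)" >&2
    # Sleep between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

    Calling wait_for_evalhub 5 10 as the first pipeline step would try up to five times, ten seconds apart, and fail the step if the service never responds.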

    Configuration

    The CLI stores named profiles in ~/.config/evalhub/config.yaml:

    active_profile: default
    
    profiles:
      default:
        base_url: http://evalhub-service:8080
        token: ""
        tenant: ""
    
      staging:
        base_url: https://evalhub.staging.example.com
        token: "staging-token"
        tenant: "team-platform"
        timeout: 60
    
      prod:
        base_url: https://evalhub.prod.example.com
        token: "prod-token"
        tenant: "team-platform"
        insecure: false
        timeout: 120

    Each profile needs the base_url, token, and tenant keys (token and tenant can be empty strings, as in the default profile above). Set these values with the following commands:

    evalhub config set base_url http://evalhub-service:8080
    evalhub config set token <your-token>
    evalhub config set tenant ""
    
    # Add a second profile
    EVALHUB_PROFILE=prod evalhub config set base_url https://evalhub.prod.example.com
    evalhub config use prod   # switch active profile

    Environment variables

    In CI/CD, prefer environment variables over config files because they integrate directly with secret management:

    Variable          Purpose                  Overrides
    EVALHUB_BASE_URL  Server URL               profile base_url
    EVALHUB_TOKEN     Auth token               profile token
    EVALHUB_PROFILE   Active profile name      config active_profile
    EVALHUB_VERBOSE   Enable debug logging     (none)
    EVALHUB_CONFIG    Custom config file path  default path

    Priority order: CLI flag → environment variable → config file profile → default.

    In any pipeline, setting EVALHUB_BASE_URL and EVALHUB_TOKEN from secrets is sufficient. No config file needed.
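
    A missing secret usually surfaces later as a confusing connection or authentication error, so it can help to check the two variables up front. A minimal sketch follows; require_env is a hypothetical helper, not an EvalHub command, and it only sees exported variables.

```shell
# require_env NAME...
# Returns 1 and lists every missing name if any of the given environment
# variables is unset or empty. A hypothetical helper, not part of the CLI.
require_env() {
  local missing=""
  local name
  for name in "$@"; do
    # printenv prints nothing (and exits non-zero) for unset variables
    if [ -z "$(printenv "$name")" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing required environment variables:$missing" >&2
    return 1
  fi
}
```

    A pipeline step could call require_env EVALHUB_BASE_URL EVALHUB_TOKEN before evalhub health so that a misconfigured secret fails fast with a clear message.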

    Running evaluations

    You can execute automated evaluations by defining your configuration parameters in a dedicated file or running a pre-registered collection.

    The eval.yaml config file

    For repeatable pipeline runs, define the evaluation in a YAML file committed to your repository:

    # eval.yaml
    name: "llama-3.2-staging-gate"
    description: "Pre-merge evaluation gate for staging deployments"
    
    model:
      url: "http://vllm-service:8000/v1"
      name: "meta-llama/Llama-3.2-3B-Instruct"
    
    collection:
      id: "general-assistant-gate-v1"   # reference a registered collection
    
    experiment:
      name: "llama-3.2-staging-eval"
      tags: # MLflow tags
        - key: "environment"
          value: "staging"
        - key: "trigger"
          value: "pre-merge"

    To run against individual benchmarks instead of a collection:

    name: "targeted-benchmark-run"
    model:
      url: "http://vllm-service:8000/v1"
      name: "meta-llama/Llama-3.2-3B-Instruct"
    
    benchmarks:
      - id: "leaderboard_ifeval"
        provider_id: "lm_evaluation_harness"
        parameters:
          metrics:
            - inst_level_strict_acc
      - id: "leaderboard_bbh"
        provider_id: "lm_evaluation_harness"
    
    exports:
      oci:
        coordinates:
          oci_host: quay.io
          oci_repository: my-org/eval-results

    The benchmarks and collection keys are mutually exclusive. Use one or the other in a given eval.yaml.
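
    This rule is easy to break when an eval.yaml is edited by hand, so a pipeline may want to check it before submitting. The sketch below is a plain text check, not a YAML parser: check_eval_config is a hypothetical helper, and it assumes both keys appear at the top level with no indentation.

```shell
# check_eval_config FILE
# Succeeds only if FILE declares exactly one of the top-level keys
# `collection:` or `benchmarks:`. Assumes both keys, when present, start
# at column 0; this is a text check, not a YAML parser.
check_eval_config() {
  local file="$1"
  local has_collection=0
  local has_benchmarks=0
  grep -q '^collection:' "$file" && has_collection=1
  grep -q '^benchmarks:' "$file" && has_benchmarks=1
  if [ $((has_collection + has_benchmarks)) -ne 1 ]; then
    echo "$file: set exactly one of 'collection' or 'benchmarks'" >&2
    return 1
  fi
}
```

    Running check_eval_config eval.yaml before evalhub eval run rejects a file that has both keys or neither.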

    Submit and wait

    The --wait flag polls the job status until it reaches a terminal state (completed or failed). The --timeout flag sets the maximum wait in seconds. The command exits with a value of 1 if the job fails, making it a direct pipeline gate.

    # Submit and return immediately (async)
    evalhub eval run --config eval.yaml
    
    # Submit and block until completion (use this in pipelines)
    evalhub eval run --config eval.yaml --wait --timeout 3600

    To capture the job ID for subsequent commands:

    JOB_ID=$(evalhub eval run --config eval.yaml --format json | jq -r '.[0].id')
    echo "Job submitted: $JOB_ID"
    

    Checking status

    Use the following commands to monitor active jobs or view past run histories:

    # List all jobs
    evalhub eval status
    
    # Filter by state
    evalhub eval status --status running
    evalhub eval status --status completed
    
    # Filter by recency (client-side)
    evalhub eval status --since 24h
    evalhub eval status --provider lm_evaluation_harness --since 7d
    
    # Single job detail
    evalhub eval status $JOB_ID
    
    # Block until terminal state (for async submissions)
    evalhub eval status $JOB_ID --watch --poll-interval 10

    Retrieving results

    Use the following commands to retrieve your evaluation results in different formats:

    # Human-readable table (default)
    evalhub eval results $JOB_ID
    
    # Machine-readable JSON
    evalhub eval results $JOB_ID --format json
    
    # CSV for export
    evalhub eval results $JOB_ID --format csv > results.csv

    Table output:

    benchmark              provider               metric                  value
    leaderboard_ifeval     lm_evaluation_harness  inst_level_strict_acc   0.712
    leaderboard_bbh        lm_evaluation_harness  acc_norm                0.581
    
    MLflow experiment: https://mlflow.example.com/experiments/42
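
    The per-benchmark values can also drive finer-grained checks in a script. The following sketch gates on a single metric from the CSV export using awk. It assumes the CSV has a header row and the same column order as the table above (benchmark, provider, metric, value) with no quoted fields. That layout is an assumption, so check it against your own output. check_threshold is a hypothetical helper name.

```shell
# check_threshold CSV_FILE BENCHMARK METRIC MIN_VALUE
# Succeeds if the row for BENCHMARK/METRIC has value >= MIN_VALUE.
# Assumes a header row and the column order benchmark,provider,metric,value,
# with no quoted or comma-containing fields. Fails if no row matches.
check_threshold() {
  awk -F',' -v b="$2" -v m="$3" -v min="$4" '
    NR > 1 && $1 == b && $3 == m {
      found = 1
      if ($4 + 0 < min + 0) { below = 1 }
    }
    END {
      if (found && !below) exit 0
      exit 1
    }
  ' "$1"
}
```

    For example, after evalhub eval results $JOB_ID --format csv > results.csv, the call check_threshold results.csv leaderboard_ifeval inst_level_strict_acc 0.70 succeeds for the sample value 0.712 shown above. The 0.70 threshold is illustrative only.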

    Managing collections from the CLI

    You can manage your evaluation collections directly from the terminal with the following commands:

    # List all available collections (system + user)
    evalhub collections list
    evalhub collections list --tag leaderboard --format json
    
    # Inspect a collection
    evalhub collections describe leaderboard-v2
    
    # Create a user collection from a spec file
    evalhub collections create --file my-collection.yaml
    
    # Run a collection directly (shorthand for eval run with collection reference)
    evalhub collections run general-assistant-gate-v1 \
      --model-url http://vllm-service:8000/v1 \
      --model-name meta-llama/Llama-3.2-3B-Instruct \
      --wait \
      --timeout 3600

    Cancelling jobs

    If you need to stop an active evaluation, use the cancel command:

    evalhub eval cancel $JOB_ID             # cancel with confirmation prompt
    evalhub eval cancel $JOB_ID --hard-delete  # permanent deletion, no recovery

    CI/CD pipeline integration

    Automating your AI evaluations within a CI/CD workflow helps catch performance regressions early.

    GitHub Actions

    The following workflow file demonstrates how to execute an evaluation gate automatically on every pull request:

    # .github/workflows/model-eval.yaml
    name: Model Evaluation Gate
    
    on:
      pull_request:
        paths:
          - 'model/**'
          - 'prompts/**'
          - 'eval.yaml'
    
    jobs:
      evaluate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
    
          - name: Install EvalHub CLI
            run: pip install "eval-hub-sdk[cli]"
    
          - name: Verify service connectivity
            env:
              EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
              EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
            run: evalhub health
    
          - name: Run evaluation gate
            env:
              EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
              EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
            timeout-minutes: 60
            run: |
              set -euo pipefail
              # Capture the job ID so the export step reads this run's results,
              # not whichever job happened to complete most recently
              JOB_ID=$(evalhub eval run --config eval.yaml --format json | jq -r '.[0].id')
              echo "JOB_ID=$JOB_ID" >> "$GITHUB_ENV"
              # Block until the job reaches a terminal state; exits 1 if it fails
              evalhub eval status "$JOB_ID" --watch --poll-interval 15
    
          - name: Export results
            if: always()
            env:
              EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
              EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
            run: |
              evalhub eval results "$JOB_ID" --format json > eval-results.json
              evalhub eval results "$JOB_ID" --format csv  > eval-results.csv
    
          - name: Upload results artifact
            if: always()
            uses: actions/upload-artifact@v4
            with:
              name: eval-results-${{ github.sha }}
              path: |
                eval-results.json
                eval-results.csv

    The gate step exits with a value of 1 if the evaluation fails. This fails the GitHub Actions job, which blocks the PR merge when the check is marked as required, without a separate post-processing script. The export and upload steps use if: always() so they still run after a failed gate.

    GitLab CI

    You can also run your evaluations using GitLab CI/CD pipelines:

    # .gitlab-ci.yml
    model-evaluation:
      stage: test
      image: python:3.11-slim
      variables:
        EVALHUB_BASE_URL: $EVALHUB_BASE_URL   # from GitLab CI/CD variables
        EVALHUB_TOKEN: $EVALHUB_TOKEN
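      before_script:
        - apt-get update && apt-get install -y --no-install-recommends jq   # jq is not included in python:3.11-slim
        - pip install "eval-hub-sdk[cli]"
        - evalhub health
      script:
        # Save the job ID to a file: after_script runs in a separate shell,
        # so shell variables set here are not visible there
        - evalhub eval run --config eval.yaml --format json | jq -r '.[0].id' > job-id.txt
        - timeout 3600 evalhub eval status "$(cat job-id.txt)" --watch --poll-interval 15
      after_script:
        - evalhub eval results "$(cat job-id.txt)" --format csv > eval-results.csv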
      artifacts:
        when: always
        paths:
          - eval-results.csv
        expire_in: 30 days
      rules:
        - changes:
            - model/**
            - prompts/**
            - eval.yaml

    Reusable shell script

    For pipelines that need more control, such as capturing the job ID, branching on per-benchmark results, or posting summaries, use a script:

    #!/usr/bin/env bash
    # eval-gate.sh — Submit an EvalHub evaluation and gate on the result
    set -euo pipefail
    
    EVAL_CONFIG="${1:-eval.yaml}"
    TIMEOUT="${2:-3600}"
    RESULTS_FILE="eval-results-$(date +%Y%m%d-%H%M%S).json"
    
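    # Requires: EVALHUB_BASE_URL and EVALHUB_TOKEN in the environment,
    # plus jq and coreutils timeout on the PATH
    
    echo "==> Checking EvalHub connectivity"
    evalhub health
    
    echo "==> Submitting evaluation: $EVAL_CONFIG"
    JOB_ID=$(evalhub eval run --config "$EVAL_CONFIG" --format json | jq -r '.[0].id')
    if [[ -z "$JOB_ID" || "$JOB_ID" == "null" ]]; then
      echo "    No job ID returned by evalhub eval run" >&2
      exit 1
    fi
    echo "    Job ID: $JOB_ID"
    
    echo "==> Waiting for completion (timeout: ${TIMEOUT}s)"
    # Enforce $TIMEOUT with coreutils timeout (exit code 124 if it expires)
    timeout "$TIMEOUT" evalhub eval status "$JOB_ID" --watch --poll-interval 15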
    
    echo "==> Fetching results"
    evalhub eval results "$JOB_ID" --format json > "$RESULTS_FILE"
    evalhub eval results "$JOB_ID"   # human-readable summary to stdout
    
    echo "==> Results saved to $RESULTS_FILE"

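    Combining set -euo pipefail with evalhub eval status --watch propagates the exit code correctly: if the job fails, the --watch command exits with a value of 1, set -e halts the script, and the pipeline step fails. Because the script halts at that point, the result-fetching commands run only for jobs that complete successfully.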
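
    A pipeline often needs to tell a failed evaluation apart from one that never finished. The following sketch wraps any command in the coreutils timeout utility and reports how it ended. It assumes timeout is available on the runner (it is part of GNU coreutils, but not installed by default on macOS). run_gate is a hypothetical helper name.

```shell
# run_gate TIMEOUT_SECONDS COMMAND [ARGS...]
# Runs COMMAND under coreutils `timeout` and reports how it ended.
# Returns the command's own exit code, or 124 if the timeout expired.
# COMMAND must be an executable on the PATH, not a shell function.
run_gate() {
  local limit="$1"
  shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  case "$rc" in
    0)   echo "gate passed" ;;
    124) echo "gate timed out after ${limit}s" >&2 ;;
    *)   echo "gate failed (exit code $rc)" >&2 ;;
  esac
  return "$rc"
}
```

    In a script that uses set -e, a call such as run_gate 3600 evalhub eval status "$JOB_ID" --watch --poll-interval 15 still halts the script on failure, and the log line shows whether the cause was a failed job or an expired timeout.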

    CLI reference at a glance

    Use this quick reference table to find common EvalHub CLI commands and their associated flags:

    Command                       Common flags                           Purpose
    evalhub health                (none)                                 Verify service connectivity
    evalhub eval run              --config, --wait, --timeout, --format  Submit evaluation job
    evalhub eval status           --watch, --status, --since, --format   List or monitor jobs
    evalhub eval results          --format                               Retrieve completed job results
    evalhub eval cancel           --hard-delete                          Cancel or delete a job
    evalhub collections list      --tag, --format                        List available collections
    evalhub collections describe  --format                               Inspect a collection
    evalhub collections create    --file                                 Register a new collection
    evalhub collections run       --model-url, --model-name, --wait      Run a collection directly
    evalhub providers list        --format                               List registered providers
    evalhub providers describe    --format                               Inspect registered providers
    evalhub config set            KEY VALUE                              Set a config value
    evalhub config use            PROFILE                                Switch active profile

    All data-returning commands support --format table (default), json, yaml, and csv. JSON output is stable and safe to pipe into jq.

    Practical notes

    Use --wait for gates, --watch for monitoring. The evalhub eval run --wait command submits and blocks in a single command, which is the simplest option for pipelines. The evalhub eval status $JOB_ID --watch command monitors a previously submitted async job. This is useful when the submission and the wait steps run in different pipeline stages.

    Never store tokens in the config file in CI. Set EVALHUB_TOKEN from your secret store. The environment variable automatically takes priority over the config file.

    Scope eval.yaml to your pipeline trigger. A collection reference in eval.yaml means threshold changes are picked up automatically when the collection is updated. Benchmark lists require a file change to update.

    Results are always in MLflow. Every evalhub eval run call automatically writes a full experiment record. evalhub eval results provides a convenience summary, but MLflow serves as the authoritative long-term data store.

    Start here

    • EvalHub website
    • EvalHub SDK (evalhub CLI)
    • EvalHub server (Collections API, provider registration)
    • TrustyAI Operator (Kubernetes/OpenShift deployment)
