Add automated AI evaluations to your CI/CD pipeline

EvalHub's API server and Kubernetes Operator handle orchestration. The Python SDK handles notebook and application integration. But for continuous integration and continuous delivery (CI/CD) pipelines, the CLI is the right surface: it installs in seconds, reads config from environment variables, returns machine-parseable output, and exits non-zero on failure.

This post covers the full CLI workflow, from first-time setup through a production pipeline gate, without detours into platform architecture or evaluation methodology. Those are covered in the rest of the series. This is the operational reference.

Series note

This is the sixth post in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:

Part 1: How EvalHub manages two-layer Kubernetes control planes
Part 2: EvalHub: Because "looks good to me" isn't a benchmark
Part 3: Evaluation-driven development with EvalHub
Part 4: Understanding evaluation collections in EvalHub
Part 5: Bring your own evaluation framework to EvalHub
Part 6: Add automated AI evaluations to your CI/CD pipeline
Part 7: Store immutable AI evaluation records with EvalHub and OCI
Part 8: Manage LLM evaluation workloads at scale with EvalHub and Kueue
Part 9: Connect EvalHub to protected production model servers

Installation

The CLI ships as an optional dependency group of the EvalHub SDK:

pip install "eval-hub-sdk[cli]"

Verify the install and confirm connectivity to your EvalHub instance:

evalhub version
evalhub health

The evalhub health command exits with a value of 0 if the service is reachable, and 1 if it is not. Use it as the first step in any pipeline that depends on EvalHub.
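If the EvalHub service might still be starting when the pipeline runs, a small retry wrapper around evalhub health can avoid spurious failures. The following is a minimal sketch that relies only on the exit code described above. The function name wait_for_evalhub and its two arguments (attempt count and delay in seconds) are hypothetical helpers, not part of the CLI.

```shell
# Retry `evalhub health` until it exits 0 or the attempts run out.
# wait_for_evalhub is an illustrative helper, not an EvalHub command.
wait_for_evalhub() {
  local max_attempts="${1:-5}"    # how many times to try
  local sleep_seconds="${2:-10}"  # pause between attempts
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    if evalhub health >/dev/null 2>&1; then
      echo "EvalHub reachable after $attempt attempt(s)"
      return 0
    fi
    echo "EvalHub not reachable (attempt $attempt/$max_attempts)" >&2
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

Calling wait_for_evalhub 5 10 as the first pipeline step tries up to five times, ten seconds apart, and fails the step only if every attempt fails.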

Configuration

The CLI stores named profiles in ~/.config/evalhub/config.yaml:

active_profile: default

profiles:
  default:
    base_url: http://evalhub-service:8080
    token: ""
    tenant: ""

  staging:
    base_url: https://evalhub.staging.example.com
    token: "staging-token"
    tenant: "team-platform"
    timeout: 60

  prod:
    base_url: https://evalhub.prod.example.com
    token: "prod-token"
    tenant: "team-platform"
    insecure: false
    timeout: 120

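Each profile defines the base_url, token, and tenant keys; token and tenant can be empty strings, as in the default profile above. The example also shows timeout and insecure keys on some profiles. Set these values with the following commands: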

evalhub config set base_url http://evalhub-service:8080
evalhub config set token <your-token>
evalhub config set tenant ""

# Add a second profile
EVALHUB_PROFILE=prod evalhub config set base_url https://evalhub.prod.example.com
evalhub config use prod   # switch active profile

Environment variables

In CI/CD, prefer environment variables over config files because they integrate directly with secret management:

Variable	Purpose	Overrides
`EVALHUB_BASE_URL`	Server URL	profile base_url
`EVALHUB_TOKEN`	Auth token	profile token
`EVALHUB_PROFILE`	Active profile name	config active_profile
`EVALHUB_VERBOSE`	Enable debug logging	—
`EVALHUB_CONFIG`	Custom config file path	default path

Priority order: CLI flag → environment variable → config file profile → default.

In any pipeline, setting EVALHUB_BASE_URL and EVALHUB_TOKEN from secrets is sufficient. No config file needed.
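Because the pipeline depends on these two variables, it can help to fail fast with a clear message when a secret is not wired in. The sketch below assumes nothing about the CLI itself: require_env is a hypothetical helper that only inspects the process environment with printenv.

```shell
# Fail early if any named environment variable is unset or empty.
# require_env is an illustrative helper, not an EvalHub command.
require_env() {
  local missing=0
  local name
  for name in "$@"; do
    # printenv prints the value of an exported variable, or nothing if unset
    if [ -z "$(printenv "$name")" ]; then
      echo "Missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A pipeline step would call require_env EVALHUB_BASE_URL EVALHUB_TOKEN before any evalhub command, so a missing secret fails with the variable name in the log.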

Running evaluations

You can execute automated evaluations by defining your configuration parameters in a dedicated file or running a pre-registered collection.

The eval.yaml config file

For repeatable pipeline runs, define the evaluation in a YAML file committed to your repository:

# eval.yaml
name: "llama-3.2-staging-gate"
description: "Pre-merge evaluation gate for staging deployments"

model:
  url: "http://vllm-service:8000/v1"
  name: "meta-llama/Llama-3.2-3B-Instruct"

collection:
  id: "general-assistant-gate-v1"   # reference a registered collection

experiment:
  name: "llama-3.2-staging-eval"
  tags: # MLflow tags
    - key: "environment"
      value: "staging"
    - key: "trigger"
      value: "pre-merge"

To run against individual benchmarks instead of a collection:

name: "targeted-benchmark-run"
model:
  url: "http://vllm-service:8000/v1"
  name: "meta-llama/Llama-3.2-3B-Instruct"

benchmarks:
  - id: "leaderboard_ifeval"
    provider_id: "lm_evaluation_harness"
    parameters:
      metrics:
        - inst_level_strict_acc
  - id: "leaderboard_bbh"
    provider_id: "lm_evaluation_harness"

exports:
  oci:
    coordinates:
      oci_host: quay.io
      oci_repository: my-org/eval-results

benchmarks and collection are mutually exclusive. Use one or the other.
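One way to catch this mistake before submission is a text-level check in the pipeline or a pre-commit hook. The following is a rough sketch, assuming both keys appear at the top level of the file with no leading indentation, as in the examples above. check_eval_config is a hypothetical helper and does not parse YAML.

```shell
# Reject an eval config that defines both top-level keys.
# This is a plain text match, not a YAML parser: it assumes the keys
# start at column 0, as in the examples in this post.
check_eval_config() {
  local file="$1"
  local has_benchmarks=0
  local has_collection=0

  if grep -q '^benchmarks:' "$file"; then has_benchmarks=1; fi
  if grep -q '^collection:' "$file"; then has_collection=1; fi

  if [ "$has_benchmarks" -eq 1 ] && [ "$has_collection" -eq 1 ]; then
    echo "$file: define either 'benchmarks' or 'collection', not both" >&2
    return 1
  fi
  return 0
}
```

Running check_eval_config eval.yaml before evalhub eval run reports the conflict locally, without a round trip to the server.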

Submit and wait

The --wait flag polls the job status until it reaches a terminal state (completed or failed). The --timeout flag sets the maximum wait in seconds. The command exits with a value of 1 if the job fails, making it a direct pipeline gate.

# Submit and return immediately (async)
evalhub eval run --config eval.yaml

# Submit and block until completion (use this in pipelines)
evalhub eval run --config eval.yaml --wait --timeout 3600

To capture the job ID for subsequent commands:

JOB_ID=$(evalhub eval run --config eval.yaml --format json | jq -r '.[0].id')
echo "Job submitted: $JOB_ID"
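Note that jq -r prints the literal string null when the field is missing, for example if the output is an empty array. A small guard keeps a bad ID from reaching later commands. This is a sketch only; require_job_id is a hypothetical helper name.

```shell
# Validate a job ID extracted with jq before using it.
# jq -r prints "null" when the requested field does not exist,
# so treat both an empty string and "null" as a failed submission.
require_job_id() {
  local job_id="${1:-}"
  if [ -z "$job_id" ] || [ "$job_id" = "null" ]; then
    echo "No job ID returned; check the submission output" >&2
    return 1
  fi
  echo "$job_id"
}

# Example use (requires a reachable EvalHub instance):
#   RAW_ID=$(evalhub eval run --config eval.yaml --format json | jq -r '.[0].id')
#   JOB_ID=$(require_job_id "$RAW_ID") || exit 1
```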

Checking status

Use the following commands to monitor active jobs or review past runs:

# List all jobs
evalhub eval status

# Filter by state
evalhub eval status --status running
evalhub eval status --status completed

# Filter by recency (client-side)
evalhub eval status --since 24h
evalhub eval status --provider lm_evaluation_harness --since 7d

# Single job detail
evalhub eval status $JOB_ID

# Block until terminal state (for async submissions)
evalhub eval status $JOB_ID --watch --poll-interval 10

Retrieving results

Use the following commands to retrieve your evaluation results in different formats:

# Human-readable table (default)
evalhub eval results $JOB_ID

# Machine-readable JSON
evalhub eval results $JOB_ID --format json

# CSV for export
evalhub eval results $JOB_ID --format csv > results.csv

Table output:

benchmark              provider               metric                  value
leaderboard_ifeval     lm_evaluation_harness  inst_level_strict_acc   0.712
leaderboard_bbh        lm_evaluation_harness  acc_norm                0.581

MLflow experiment: https://mlflow.example.com/experiments/42
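To branch on an individual metric, you can post-process the CSV export. The sketch below assumes the CSV has a header row and the same four columns as the table output (benchmark, provider, metric, value) with no quoted or comma-containing fields; verify this against your own output before relying on it. check_threshold is a hypothetical helper, not a CLI command.

```shell
# Compare one metric in a results CSV against a minimum value.
# ASSUMPTION: columns are benchmark,provider,metric,value with a header row.
# Exit codes: 0 = at or above minimum, 1 = below minimum, 2 = metric not found.
check_threshold() {
  local csv_file="$1"
  local benchmark="$2"
  local metric="$3"
  local minimum="$4"

  awk -F',' -v b="$benchmark" -v m="$metric" -v min="$minimum" '
    NR == 1 { next }                       # skip the header row
    $1 == b && $3 == m {
      found = 1
      if ($4 + 0 < min + 0) { below = 1 }  # force numeric comparison
      printf "%s %s = %s (minimum %s)\n", b, m, $4, min
    }
    END {
      if (!found) {
        print "metric not found: " b " " m > "/dev/stderr"
        exit 2
      }
      exit (below ? 1 : 0)
    }
  ' "$csv_file"
}
```

With the sample values shown above, check_threshold results.csv leaderboard_ifeval inst_level_strict_acc 0.70 would succeed, while a minimum of 0.60 on leaderboard_bbh acc_norm would fail.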

Managing collections from the CLI

You can manage evaluation collections directly from the terminal with the following commands:

# List all available collections (system + user)
evalhub collections list
evalhub collections list --tag leaderboard --format json

# Inspect a collection
evalhub collections describe leaderboard-v2

# Create a user collection from a spec file
evalhub collections create --file my-collection.yaml

# Run a collection directly (shorthand for eval run with collection reference)
evalhub collections run general-assistant-gate-v1 \
  --model-url http://vllm-service:8000/v1 \
  --model-name meta-llama/Llama-3.2-3B-Instruct \
  --wait \
  --timeout 3600

Cancelling jobs

If you need to stop an active evaluation, use the cancel command:

evalhub eval cancel $JOB_ID             # cancel with confirmation prompt
evalhub eval cancel $JOB_ID --hard-delete  # permanent deletion, no recovery

CI/CD pipeline integration

Automating your AI evaluations within a CI/CD workflow helps catch performance regressions early.

GitHub Actions

The following workflow file demonstrates how to execute an evaluation gate automatically on every pull request:

# .github/workflows/model-eval.yaml
name: Model Evaluation Gate

on:
  pull_request:
    paths:
      - 'model/**'
      - 'prompts/**'
      - 'eval.yaml'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install EvalHub CLI
        run: pip install "eval-hub-sdk[cli]"

      - name: Verify service connectivity
        env:
          EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
          EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
        run: evalhub health

      - name: Run evaluation gate
        env:
          EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
          EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
        run: |
          evalhub eval run \
            --config eval.yaml \
            --wait \
            --timeout 3600

      - name: Export results
        if: always()
        env:
          EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
          EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
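        run: |
          # Takes the first completed job from the last hour. This assumes no
          # other pipeline completed a job in that window. If nothing matches,
          # skip the export instead of passing "null" to the results command.
          JOB_ID=$(evalhub eval status --status completed --since 1h --format json \
            | jq -r '.[0].id // empty')
          if [ -z "$JOB_ID" ]; then
            echo "No completed job found; skipping export"
            exit 0
          fi
          evalhub eval results "$JOB_ID" --format json > eval-results.json
          evalhub eval results "$JOB_ID" --format csv  > eval-results.csv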

      - name: Upload results artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results-${{ github.sha }}
          path: |
            eval-results.json
            eval-results.csv

The evalhub eval run --wait command exits with a value of 1 if the collection gate fails. This failure stops the GitHub Actions step and blocks the PR merge without requiring a separate post-processing script.

GitLab CI

You can also run your evaluations using GitLab CI/CD pipelines:

# .gitlab-ci.yml
model-evaluation:
  stage: test
  image: python:3.11-slim
  variables:
    EVALHUB_BASE_URL: $EVALHUB_BASE_URL   # from GitLab CI/CD variables
    EVALHUB_TOKEN: $EVALHUB_TOKEN
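  before_script:
    # python:3.11-slim does not ship jq, which after_script needs
    - apt-get update && apt-get install -y --no-install-recommends jq
    - pip install "eval-hub-sdk[cli]"
    - evalhub health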
  script:
    - evalhub eval run --config eval.yaml --wait --timeout 3600
  after_script:
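    - |
      # First completed job from the last hour; assumes no concurrent runs.
      JOB_ID=$(evalhub eval status --status completed --since 1h --format json \
        | jq -r '.[0].id // empty')
      if [ -n "$JOB_ID" ]; then
        evalhub eval results "$JOB_ID" --format csv > eval-results.csv
      fi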
  artifacts:
    when: always
    paths:
      - eval-results.csv
    expire_in: 30 days
  rules:
    - changes:
        - model/**
        - prompts/**
        - eval.yaml

Reusable shell script

For pipelines that need more control, such as capturing the job ID, branching on per-benchmark results, or posting summaries, use a script:

#!/usr/bin/env bash
# eval-gate.sh — Submit an EvalHub evaluation and gate on the result
set -euo pipefail

EVAL_CONFIG="${1:-eval.yaml}"
TIMEOUT="${2:-3600}"
RESULTS_FILE="eval-results-$(date +%Y%m%d-%H%M%S).json"

# Requires: EVALHUB_BASE_URL and EVALHUB_TOKEN in environment

echo "==> Checking EvalHub connectivity"
evalhub health

echo "==> Submitting evaluation: $EVAL_CONFIG"
JOB_ID=$(evalhub eval run --config "$EVAL_CONFIG" --format json | jq -r '.[0].id')
echo "    Job ID: $JOB_ID"

echo "==> Waiting for completion (timeout: ${TIMEOUT}s)"
# Enforce the limit with coreutils `timeout`, which exits 124 if it is reached
timeout "$TIMEOUT" evalhub eval status "$JOB_ID" --watch --poll-interval 15

echo "==> Fetching results"
evalhub eval results "$JOB_ID" --format json > "$RESULTS_FILE"
evalhub eval results "$JOB_ID"   # human-readable summary to stdout

echo "==> Results saved to $RESULTS_FILE"

The set -euo pipefail line combined with evalhub eval status --watch propagates the exit code correctly. If the job fails, the --watch command exits with a value of 1, set -e halts the script, and the pipeline step fails.
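One consequence of set -e is that the script stops before it can post a summary about the failure. If you want to report the outcome first and still fail the step, capture the exit code explicitly. The following is a sketch under that assumption: run_gate is a hypothetical helper, and the 124 case applies only if you wrap the command in the coreutils timeout utility, which is not part of the CLI.

```shell
# Run a command, report the outcome, and pass its exit code through.
# `cmd || status=$?` keeps set -e from aborting before the report is printed.
run_gate() {
  local status=0
  "$@" || status=$?

  case "$status" in
    0)   echo "gate passed" ;;
    124) echo "gate timed out" ;;  # exit code used by coreutils `timeout`
    *)   echo "gate failed (exit code $status)" ;;
  esac
  return "$status"
}

# Example use (requires a reachable EvalHub instance):
#   run_gate evalhub eval status "$JOB_ID" --watch --poll-interval 15
```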

CLI reference at a glance

Use this quick reference table to find common EvalHub CLI commands and their associated flags:

Command	Common flags	Purpose
`evalhub health`	—	Verify service connectivity
`evalhub eval run`	`--config`, `--wait`, `--timeout`, `--format`	Submit evaluation job
`evalhub eval status`	`--watch`, `--status`, `--since`, `--format`	List or monitor jobs
`evalhub eval results`	`--format`	Retrieve completed job results
`evalhub eval cancel`	`--hard-delete`	Cancel or delete a job
`evalhub collections list`	`--tag`, `--format`	List available collections
`evalhub collections describe`	`--format`	Inspect a collection
`evalhub collections create`	`--file`	Register a new collection
`evalhub collections run`	`--model-url`, `--model-name`, `--wait`	Run a collection directly
`evalhub providers list`	`--format`	List registered providers
`evalhub providers describe`	`--format`	Inspect a registered provider
`evalhub config set`	`KEY VALUE`	Set a config value
`evalhub config use`	`PROFILE`	Switch active profile

All data-returning commands support --format table (default), json, yaml, and csv. JSON output is stable and safe to pipe into jq.

Practical notes

Use --wait for gates, --watch for monitoring. The evalhub eval run --wait command submits and blocks in a single command, which is the simplest option for pipelines. The evalhub eval status $JOB_ID --watch command monitors a previously submitted async job. This is useful when the submission and the wait steps run in different pipeline stages.
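When submission and waiting run in different stages, the job ID has to travel between them. One simple approach is a small file passed along as a pipeline artifact. The sketch below assumes that pattern; save_job_id, load_job_id, and the default file name evalhub-job-id.txt are hypothetical choices, not CLI features.

```shell
# Persist a job ID in one stage and read it back in a later one.
# The file is meant to be passed between stages as a pipeline artifact.
save_job_id() {
  local job_id="$1"
  local file="${2:-evalhub-job-id.txt}"
  printf '%s\n' "$job_id" > "$file"
}

load_job_id() {
  local file="${1:-evalhub-job-id.txt}"
  if [ ! -s "$file" ]; then
    echo "Job ID file missing or empty: $file" >&2
    return 1
  fi
  head -n 1 "$file"
}

# Example use (requires a reachable EvalHub instance):
#   stage 1: save_job_id "$JOB_ID"
#   stage 2: evalhub eval status "$(load_job_id)" --watch --poll-interval 10
```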

Never store tokens in the config file in CI. Set EVALHUB_TOKEN from your secret store. The environment variable automatically takes priority over the config file.

Scope eval.yaml to your pipeline trigger. When eval.yaml references a collection, threshold changes take effect on the next run as soon as the collection is updated, with no change to the repository. An inline benchmarks list changes only when the file itself is edited and committed.

Results are always in MLflow. Every evalhub eval run call automatically writes a full experiment record. evalhub eval results provides a convenience summary, but MLflow serves as the authoritative long-term data store.

Start here

EvalHub website
EvalHub SDK (evalhub CLI)
EvalHub server (Collections API, provider registration)
TrustyAI Operator (Kubernetes/OpenShift deployment)

Last updated: June 23, 2026

