EvalHub's API server and Kubernetes Operator handle orchestration. The Python SDK handles notebook and application integration. But for continuous integration and continuous delivery (CI/CD) pipelines, the CLI is the right surface: it installs in seconds, reads config from environment variables, returns machine-parseable output, and exits non-zero on failure.
This post covers the full CLI workflow, from first-time setup through a production pipeline gate, without detours into platform architecture or evaluation methodology. Those are covered in the rest of the series. This is the operational reference.
Series note
This is the sixth post in a series covering how to build scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI.
Installation
The CLI ships as an optional dependency group of the EvalHub SDK:
pip install "eval-hub-sdk[cli]"

Verify the install and confirm connectivity to your EvalHub instance:

evalhub version
evalhub health

The evalhub health command exits with a value of 0 if the service is reachable, and 1 if it is not. Use it as the first step in any pipeline that depends on EvalHub.
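A service that is still starting can fail the first health check, so a pipeline may want to retry before giving up. The following is a minimal sketch of such a retry loop. The wait_for function, its argument order, and the attempt and delay values are illustrative assumptions, not part of the EvalHub CLI.

```shell
# wait_for: hypothetical helper, not part of the EvalHub CLI.
# Retries a command until it exits 0 or the attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumes evalhub is installed and configured):
#   wait_for 5 10 evalhub health
```

Because the helper returns the final failure as a non-zero exit code, it can replace a bare evalhub health call without changing how the pipeline step fails.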
Configuration
The CLI stores named profiles in ~/.config/evalhub/config.yaml:
active_profile: default
profiles:
  default:
    base_url: http://evalhub-service:8080
    token: ""
    tenant: ""
  staging:
    base_url: https://evalhub.staging.example.com
    token: "staging-token"
    tenant: "team-platform"
    timeout: 60
  prod:
    base_url: https://evalhub.prod.example.com
    token: "prod-token"
    tenant: "team-platform"
    insecure: false
    timeout: 120

Each profile requires the base_url, token, and tenant variables. Set these values with the following commands:
evalhub config set base_url http://evalhub-service:8080
evalhub config set token <your-token>
evalhub config set tenant ""
# Add a second profile
EVALHUB_PROFILE=prod evalhub config set base_url https://evalhub.prod.example.com
evalhub config use prod  # switch active profile

Environment variables
In CI/CD, prefer environment variables over config files because they integrate directly with secret management:
| Variable | Purpose | Overrides |
|---|---|---|
| EVALHUB_BASE_URL | Server URL | profile base_url |
| EVALHUB_TOKEN | Auth token | profile token |
| EVALHUB_PROFILE | Active profile name | config active_profile |
| EVALHUB_VERBOSE | Enable debug logging | - |
| EVALHUB_CONFIG | Custom config file path | default path |
Priority order: CLI flag → environment variable → config file profile → default.
In any pipeline, setting EVALHUB_BASE_URL and EVALHUB_TOKEN from secrets is sufficient. No config file needed.
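The priority order can be pictured as a chain of fallbacks in which the first non-empty value wins. The sketch below only illustrates that rule. The resolve_setting function is hypothetical and does not describe how the CLI is implemented internally.

```shell
# resolve_setting: hypothetical illustration of the documented priority order.
# Pass the candidates highest priority first:
#   flag value, environment value, profile value, built-in default.
# The first non-empty candidate is printed.
resolve_setting() {
  local candidate
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # Every candidate was empty.
  return 1
}

# Example: no flag given, so the environment value is used if it is set,
# otherwise the profile value:
#   resolve_setting "" "$EVALHUB_BASE_URL" "http://evalhub-service:8080"
```

This is why setting the two environment variables in a pipeline is enough: they sit above any profile in the chain, so a stale config file on the runner cannot override them.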
Running evaluations
You can execute automated evaluations by defining your configuration parameters in a dedicated file or running a pre-registered collection.
The eval.yaml config file
For repeatable pipeline runs, define the evaluation in a YAML file committed to your repository:
# eval.yaml
name: "llama-3.2-staging-gate"
description: "Pre-merge evaluation gate for staging deployments"
model:
  url: "http://vllm-service:8000/v1"
  name: "meta-llama/Llama-3.2-3B-Instruct"
collection:
  id: "general-assistant-gate-v1"  # reference a registered collection
experiment:
  name: "llama-3.2-staging-eval"
  tags:  # MLflow tags
    - key: "environment"
      value: "staging"
    - key: "trigger"
      value: "pre-merge"

To run against individual benchmarks instead of a collection:
name: "targeted-benchmark-run"
model:
  url: "http://vllm-service:8000/v1"
  name: "meta-llama/Llama-3.2-3B-Instruct"
benchmarks:
  - id: "leaderboard_ifeval"
    provider_id: "lm_evaluation_harness"
    parameters:
      metrics:
        - inst_level_strict_acc
  - id: "leaderboard_bbh"
    provider_id: "lm_evaluation_harness"
exports:
  oci:
    coordinates:
      oci_host: quay.io
      oci_repository: my-org/eval-results

benchmarks and collection are mutually exclusive. Use one or the other.
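Since the two keys cannot be combined, a pipeline can catch a malformed file before it submits anything. The following pre-flight check is a rough sketch. The check_eval_config function is hypothetical, it matches top-level keys as plain text rather than parsing YAML, and it treats a file with neither key as an error, which is an assumption.

```shell
# check_eval_config: hypothetical pre-flight check, not an EvalHub command.
# Succeeds only if the file defines exactly one of the top-level keys
# "benchmarks:" or "collection:". Keys are matched at column 0, so this
# is a text heuristic rather than a YAML parser.
check_eval_config() {
  local file="$1" has_benchmarks=0 has_collection=0
  if grep -q '^benchmarks:' "$file"; then has_benchmarks=1; fi
  if grep -q '^collection:' "$file"; then has_collection=1; fi
  if [ $((has_benchmarks + has_collection)) -ne 1 ]; then
    echo "$file: define exactly one of 'benchmarks' or 'collection'" >&2
    return 1
  fi
}

# Example:
#   check_eval_config eval.yaml && evalhub eval run --config eval.yaml --wait
```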
Submit and wait
The --wait flag polls the job status until it reaches a terminal state (completed or failed). The --timeout flag sets the maximum wait in seconds. The command exits with a value of 1 if the job fails, making it a direct pipeline gate.
# Submit and return immediately (async)
evalhub eval run --config eval.yaml
# Submit and block until completion (use this in pipelines)
evalhub eval run --config eval.yaml --wait --timeout 3600

To capture the job ID for subsequent commands:
JOB_ID=$(evalhub eval run --config eval.yaml --format json | jq -r '.[0].id')
echo "Job submitted: $JOB_ID"
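One thing to watch with this pattern is that jq -r prints the literal string null when the selected field is missing, so a failed submission can leave JOB_ID set to a value that looks valid. A small guard, sketched below, makes that case fail loudly. The require_job_id function is a hypothetical helper, not part of the CLI.

```shell
# require_job_id: hypothetical helper, not part of the EvalHub CLI.
# jq -r prints the literal string "null" when the selected field is
# missing, so an empty or "null" value is treated as a failed submission.
require_job_id() {
  local job_id="$1"
  if [ -z "$job_id" ] || [ "$job_id" = "null" ]; then
    echo "no job ID returned; check the output of evalhub eval run" >&2
    return 1
  fi
  printf '%s\n' "$job_id"
}

# Example:
#   RAW_ID=$(evalhub eval run --config eval.yaml --format json | jq -r '.[0].id')
#   JOB_ID=$(require_job_id "$RAW_ID")
```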
Checking status
Use the following commands to monitor active jobs or view past run histories:
# List all jobs
evalhub eval status
# Filter by state
evalhub eval status --status running
evalhub eval status --status completed
# Filter by recency (client-side)
evalhub eval status --since 24h
evalhub eval status --provider lm_evaluation_harness --since 7d
# Single job detail
evalhub eval status $JOB_ID
# Block until terminal state (for async submissions)
evalhub eval status $JOB_ID --watch --poll-interval 10

Retrieving results
Use the following commands to retrieve your evaluation results in different formats:
# Human-readable table (default)
evalhub eval results $JOB_ID
# Machine-readable JSON
evalhub eval results $JOB_ID --format json
# CSV for export
evalhub eval results $JOB_ID --format csv > results.csv

Table output:

benchmark            provider                metric                  value
leaderboard_ifeval   lm_evaluation_harness   inst_level_strict_acc   0.712
leaderboard_bbh      lm_evaluation_harness   acc_norm                0.581
MLflow experiment: https://mlflow.example.com/experiments/42

Managing collections from the CLI
You can manage your metric collections directly from the terminal with the following commands:
# List all available collections (system + user)
evalhub collections list
evalhub collections list --tag leaderboard --format json
# Inspect a collection
evalhub collections describe leaderboard-v2
# Create a user collection from a spec file
evalhub collections create --file my-collection.yaml
# Run a collection directly (shorthand for eval run with collection reference)
evalhub collections run general-assistant-gate-v1 \
  --model-url http://vllm-service:8000/v1 \
  --model-name meta-llama/Llama-3.2-3B-Instruct \
  --wait \
  --timeout 3600

Cancelling jobs
If you need to stop an active evaluation, use the cancel command:
evalhub eval cancel $JOB_ID # cancel with confirmation prompt
evalhub eval cancel $JOB_ID --hard-delete  # permanent deletion, no recovery

CI/CD pipeline integration
Automating your AI evaluations within a CI/CD workflow helps catch performance regressions early.
GitHub Actions
The following workflow file demonstrates how to execute an evaluation gate automatically on every pull request:
# .github/workflows/model-eval.yaml
name: Model Evaluation Gate
on:
  pull_request:
    paths:
      - 'model/**'
      - 'prompts/**'
      - 'eval.yaml'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install EvalHub CLI
        run: pip install "eval-hub-sdk[cli]"
      - name: Verify service connectivity
        env:
          EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
          EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
        run: evalhub health
      - name: Run evaluation gate
        env:
          EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
          EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
        run: |
          evalhub eval run \
            --config eval.yaml \
            --wait \
            --timeout 3600
      - name: Export results
        if: always()
        env:
          EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
          EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
        run: |
          JOB_ID=$(evalhub eval status --status completed --since 1h --format json \
            | jq -r '.[0].id')
          evalhub eval results "$JOB_ID" --format json > eval-results.json
          evalhub eval results "$JOB_ID" --format csv > eval-results.csv
      - name: Upload results artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results-${{ github.sha }}
          path: |
            eval-results.json
            eval-results.csv

The evalhub eval run --wait command exits with a value of 1 if the collection gate fails. This failure stops the GitHub Actions step and blocks the PR merge without requiring a separate post-processing script.
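On CI systems that have no equivalent of if: always(), the same run-then-export behavior can be approximated in plain shell by remembering the gate's exit code, running the export step anyway, and then returning the original code. The following is a sketch of that pattern. The run_then_always function and the export_results callback named in the example are hypothetical.

```shell
# run_then_always: hypothetical helper, not part of the EvalHub CLI.
# Runs a gate command, then runs a follow-up command regardless of the
# outcome, and finally returns the gate's original exit code so the
# pipeline step still fails when the gate fails.
# Usage: run_then_always FOLLOWUP GATE_COMMAND [ARGS...]
run_then_always() {
  local followup="$1" status=0
  shift
  "$@" || status=$?
  # The follow-up receives the gate's exit code as its only argument.
  # Its own failure is ignored so that it cannot mask the gate result.
  "$followup" "$status" || true
  return "$status"
}

# Example (export_results is a function you would define yourself):
#   run_then_always export_results \
#     evalhub eval run --config eval.yaml --wait --timeout 3600
```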
GitLab CI
You can also run your evaluations using GitLab CI/CD pipelines:
# .gitlab-ci.yml
model-evaluation:
  stage: test
  image: python:3.11-slim
  variables:
    EVALHUB_BASE_URL: $EVALHUB_BASE_URL  # from GitLab CI/CD variables
    EVALHUB_TOKEN: $EVALHUB_TOKEN
  before_script:
    # python:3.11-slim does not include jq, which after_script needs
    - apt-get update && apt-get install -y --no-install-recommends jq
    - pip install "eval-hub-sdk[cli]"
    - evalhub health
  script:
    - evalhub eval run --config eval.yaml --wait --timeout 3600
  after_script:
    - |
      JOB_ID=$(evalhub eval status --status completed --since 1h --format json \
        | jq -r '.[0].id')
      evalhub eval results "$JOB_ID" --format csv > eval-results.csv
  artifacts:
    when: always
    paths:
      - eval-results.csv
    expire_in: 30 days
  rules:
    - changes:
        - model/**
        - prompts/**
        - eval.yaml

Reusable shell script
For pipelines that need more control, such as capturing the job ID, branching on per-benchmark results, or posting summaries, use a script:
#!/usr/bin/env bash
# eval-gate.sh — Submit an EvalHub evaluation and gate on the result
set -euo pipefail
EVAL_CONFIG="${1:-eval.yaml}"
TIMEOUT="${2:-3600}"
RESULTS_FILE="eval-results-$(date +%Y%m%d-%H%M%S).json"
# Requires: EVALHUB_BASE_URL and EVALHUB_TOKEN in environment
echo "==> Checking EvalHub connectivity"
evalhub health
echo "==> Submitting evaluation: $EVAL_CONFIG"
JOB_ID=$(evalhub eval run --config "$EVAL_CONFIG" --format json | jq -r '.[0].id')
echo " Job ID: $JOB_ID"
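echo "==> Waiting for completion (timeout: ${TIMEOUT}s)"
# Enforce TIMEOUT with the coreutils timeout command, which exits 124 if the limit is reached
timeout "$TIMEOUT" evalhub eval status "$JOB_ID" --watch --poll-interval 15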
echo "==> Fetching results"
evalhub eval results "$JOB_ID" --format json > "$RESULTS_FILE"
evalhub eval results "$JOB_ID" # human-readable summary to stdout
echo "==> Results saved to $RESULTS_FILE"

The set -euo pipefail option combined with evalhub eval status --watch propagates the exit code correctly. If the job fails, --watch exits with a value of 1, set -e halts the script, and the pipeline step fails.
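Branching on per-benchmark results can be done against the CSV output with awk alone. The sketch below assumes the CSV has a header row and the columns benchmark, provider, metric, and value, mirroring the table output shown earlier. That layout is an assumption, as are the check_threshold function and the threshold in the example.

```shell
# check_threshold: hypothetical helper, not part of the EvalHub CLI.
# Assumes a header row and the columns benchmark,provider,metric,value.
# Exit codes: 0 = at or above the minimum, 1 = below it, 2 = no matching row.
# Usage: check_threshold RESULTS_CSV BENCHMARK METRIC MINIMUM
check_threshold() {
  awk -F, -v bench="$2" -v metric="$3" -v min="$4" '
    NR > 1 && $1 == bench && $3 == metric {
      found = 1
      # Adding 0 forces a numeric comparison.
      if ($4 + 0 < min + 0) below = 1
    }
    END {
      if (!found) { print "no result for " bench "/" metric > "/dev/stderr"; exit 2 }
      if (below)  { print bench "/" metric " is below " min > "/dev/stderr"; exit 1 }
    }
  ' "$1"
}

# Example (the 0.70 minimum is illustrative):
#   evalhub eval results "$JOB_ID" --format csv > results.csv
#   check_threshold results.csv leaderboard_ifeval inst_level_strict_acc 0.70
```

Using a distinct exit code for a missing row keeps a renamed benchmark or metric from being mistaken for a passing result.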
CLI reference at a glance
Use this quick reference table to find common EvalHub CLI commands and their associated flags:
| Command | Common flags | Purpose |
|---|---|---|
| evalhub health | - | Verify service connectivity |
| evalhub eval run | --config, --wait, --timeout, --format | Submit evaluation job |
| evalhub eval status | --watch, --status, --since, --format | List or monitor jobs |
| evalhub eval results | --format | Retrieve completed job results |
| evalhub eval cancel | --hard-delete | Cancel or delete a job |
| evalhub collections list | --tag, --format | List available collections |
| evalhub collections describe | --format | Inspect a collection |
| evalhub collections create | --file | Register a new collection |
| evalhub collections run | --model-url, --model-name, --wait | Run a collection directly |
| evalhub providers list | --format | List registered providers |
| evalhub providers describe | --format | Inspect registered providers |
| evalhub config set | KEY VALUE | Set a config value |
| evalhub config use | PROFILE | Switch active profile |
All data-returning commands support --format table (default), json, yaml, and csv. JSON output is stable and safe to pipe into jq.
Practical notes
Use --wait for gates, --watch for monitoring. The evalhub eval run --wait command submits and blocks in a single command, which is the simplest option for pipelines. The evalhub eval status $JOB_ID --watch command monitors a previously submitted async job. This is useful when the submission and the wait steps run in different pipeline stages.
Never store tokens in the config file in CI. Set EVALHUB_TOKEN from your secret store. The environment variable automatically takes priority over the config file.
Scope eval.yaml to your pipeline trigger. A collection reference in eval.yaml means threshold changes are picked up automatically when the collection is updated. Benchmark lists require a file change to update.
Results are always in MLflow. Every evalhub eval run call automatically writes a full experiment record. evalhub eval results provides a convenience summary, but MLflow serves as the authoritative long-term data store.
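To make the token advice above enforceable, a pipeline can fail fast when a secret was not injected instead of falling through to a config file or an unauthenticated request. The following sketch checks that the named variables are present in the environment without echoing their values. The require_env function is a hypothetical helper.

```shell
# require_env: hypothetical helper, not part of the EvalHub CLI.
# Fails if any named variable is missing from the environment or empty.
# Only variable names are printed, never their values.
require_env() {
  local name missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what the CLI sees too.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   require_env EVALHUB_BASE_URL EVALHUB_TOKEN
```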
Start here
- EvalHub website
- EvalHub SDK (evalhub CLI)
- EvalHub server (Collections API, provider registration)
- TrustyAI Operator (Kubernetes/OpenShift deployment)