Part 2 of this series, EvalHub: Because "looks good to me" isn't a benchmark, identified the reproducibility crisis as a structural failure in enterprise AI evaluation: benchmark scores without the environment metadata that produced them are claims, not evidence. Part 3, Evaluation-driven development with EvalHub, showed how EvalHub's automatic MLflow integration closes that gap for experiment tracking, recording every evaluation run with its configuration, model version, collection version, and hardware tags.
MLflow solves the queryability problem. It does not solve the immutability problem.
Series note
This is part 7 in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:
- Part 1: How EvalHub manages two-layer Kubernetes control planes
- Part 2: EvalHub: Because "looks good to me" isn't a benchmark
- Part 3: Evaluation-driven development with EvalHub
- Part 4: Understanding evaluation collections in EvalHub
- Part 5: Bring your own evaluation framework to EvalHub
- Part 6: Add automated AI evaluations to your CI/CD pipeline
The governance gap that MLflow alone cannot close
An MLflow experiment record is mutable. Users can delete or overwrite entries, and entries can be lost if the tracking server is rebuilt. For internal iteration, that is fine. For regulated workloads (such as those subject to the EU AI Act, FedRAMP High, and SOC 2), the evidence that a model met its evaluation criteria before deployment needs to be tamper-evident, content-addressed, and independently verifiable. A database row does not satisfy that requirement. A signed OCI artifact in a content-addressable registry does.
EvalHub's OCI persistence layer pushes evaluation result artifacts to any OCI-compliant registry at the end of every evaluation run. The artifact reference, which is a digest of the form sha256:..., is immutable: if the contents change, the digest changes. EvalHub stores the digest in the JobResults and writes it to MLflow alongside the experiment record. The full provenance chain is: evaluation run → MLflow experiment (queryable) → OCI artifact (immutable).
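The content-addressing property can be illustrated with a minimal Python sketch. The two payloads and the oci_digest helper are hypothetical and not part of the EvalHub SDK; the sketch only shows that a SHA-256 digest is a function of the bytes, so changing one value produces a different digest.

```python
import hashlib


def oci_digest(content: bytes) -> str:
    """Return an OCI-style digest string (sha256:<hex>) for a blob of bytes."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


# Hypothetical result payloads: identical except for one metric value.
original = b'{"metric_name": "inst_level_strict_acc", "metric_value": 0.712}'
altered = b'{"metric_name": "inst_level_strict_acc", "metric_value": 0.912}'

print(oci_digest(original))
print(oci_digest(altered))
# The two digests differ, so a reference pinned to the first digest
# can never silently resolve to the altered content.
```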
How OCI persistence works
OCI persistence is opt-in, configured per evaluation job via the exports block. When configured, the adapter calls callbacks.create_oci_artifact() after the evaluation completes, which delegates to the OCIArtifactPersister internally. The persister:
- Takes the results directory (any files your adapter wrote, such as JSON metrics, raw outputs, and logs).
- Creates an OCI artifact layout using olot.
- Authenticates to the registry (via Docker config in local mode, or via the Kubernetes sidecar in cluster mode).
- Pushes the artifact using oras.
- Retrieves the Docker-Content-Digest response header.
- Returns an OCIArtifactResult with the digest and full artifact reference Uniform Resource Identifier (URI).
The result is a content-addressed, immutable artifact at a stable reference like:
    quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3

EvalHub stores that reference in JobResults.oci_artifact, reports it back to the EvalHub server, and writes it as MLflow artifact metadata. Pulling that reference six months from now returns exactly the results that the evaluation run produced, or it fails with a digest mismatch if anything has been altered.
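When a reference of this shape has to be recorded or compared in tooling, it can help to split it into its parts. The following is a minimal sketch: parse_oci_reference and OCIReference are hypothetical names, not EvalHub SDK APIs, and the parser assumes the reference always starts with a registry host followed by a slash.

```python
from typing import NamedTuple, Optional


class OCIReference(NamedTuple):
    host: str
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_oci_reference(ref: str) -> OCIReference:
    """Split host/repository[:tag][@digest] into its parts."""
    # Everything after "@" is the digest (for example, sha256:<hex>).
    name, _, digest = ref.partition("@")
    # Assumption: the first path segment is the registry host (optionally host:port).
    host, _, remainder = name.partition("/")
    # A colon in the remaining path separates the repository from the tag.
    repository, sep, tag = remainder.rpartition(":")
    if not sep:
        repository, tag = remainder, ""
    return OCIReference(host, repository, tag or None, digest or None)


ref = parse_oci_reference("quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:abc123")
print(ref.host, ref.repository, ref.tag, ref.digest)
```

For audit purposes the digest is the part that matters; the tag is only a convenience for locating the artifact.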
How to configure OCI exports
You can configure EvalHub to export evaluation results using a local configuration file or the REST API.
In eval.yaml
Add an exports block to your evaluation configuration file:
    name: "llama-3.2-staging-gate"
    model:
      url: "http://vllm-service:8000/v1"
      name: "meta-llama/Llama-3.2-3B-Instruct"
    collection:
      id: "general-assistant-gate-v1"
    exports:
      oci:
        coordinates:
          oci_host: "quay.io"
          oci_repository: "my-org/eval-results"
          oci_tag: "llama-3.2-staging-2026-04-01"  # optional; auto-generated if omitted
          annotations:
            environment: "staging"
            model-family: "llama-3"
            collection-version: "v1"
        k8s:
          connection: "registry-credentials"  # name of K8s Secret (type: dockerconfigjson)

The k8s.connection property is required in Kubernetes mode. This property names the kubernetes.io/dockerconfigjson Secret that holds registry credentials. In local mode, omit the k8s block so the persister reads from ~/.docker/config.json.
OCI coordinates field reference
| Field | Required | Description |
|---|---|---|
| oci_host | Yes | Registry hostname (for example, quay.io, registry.example.com) |
| oci_repository | Yes | Repository path (for example, my-org/eval-results) |
| oci_tag | No | Custom tag; deterministic SHA256 tag generated if omitted |
| oci_subject | No | Optional subject identifier within the same registry or repository |
| annotations | No | Custom key-value metadata merged into OCI annotations |
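A coordinates block can be checked against this table before a job is submitted. The sketch below is a hypothetical pre-flight helper, not the validation EvalHub itself performs; it only encodes the required and optional field names listed above.

```python
REQUIRED_FIELDS = ("oci_host", "oci_repository")
OPTIONAL_FIELDS = ("oci_tag", "oci_subject", "annotations")


def validate_coordinates(coordinates: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks usable."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not coordinates.get(name)
    ]
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    problems += [
        f"unknown field: {name}" for name in sorted(coordinates) if name not in known
    ]
    # Annotations are key-value metadata, so anything but a mapping is suspect.
    if not isinstance(coordinates.get("annotations", {}), dict):
        problems.append("annotations must be a mapping of keys to values")
    return problems


print(validate_coordinates({"oci_host": "quay.io"}))
```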
In the REST API
The same exports structure is accepted in the POST /api/v1/evaluations/jobs endpoint directly:
    curl -X POST http://evalhub-service:8080/api/v1/evaluations/jobs \
      -H "Content-Type: application/json" \
      -d '{
        "name": "llama-3.2-staging-gate",
        "model": {
          "url": "http://vllm-service:8000/v1",
          "name": "meta-llama/Llama-3.2-3B-Instruct"
        },
        "collection_id": "general-assistant-gate-v1",
        "exports": {
          "oci": {
            "coordinates": {
              "oci_host": "quay.io",
              "oci_repository": "my-org/eval-results",
              "annotations": {
                "environment": "staging",
                "model-family": "llama-3"
              }
            },
            "k8s": {
              "connection": "registry-credentials"
            }
          }
        },
        "experiment": {
          "name": "llama-3.2-staging-eval",
          "tags": { "environment": "staging" }
        }
      }'

Deterministic tags
If oci_tag is omitted, the persister generates a deterministic tag from the evaluation context:
    SHA256(job_id + provider_id + benchmark_id + benchmark_index)

The result is a hex string conforming to OCI tag specifications (alphanumeric characters, underscores, periods, and hyphens; maximum of 128 characters), prefixed with evalhub-:

    evalhub-a3f7c1b2d4e6f8a0b2c4d6e8f0a2b4c6d8e0f2a4b6c8d0e2f4a6b8c0d2e4f6

Deterministic tags mean the same job, provider, and benchmark always produce the same tag, so you can find the artifact for a specific evaluation run without storing the reference externally. The digest still changes if the results change, which provides evidence of tampering; the tag provides human-navigable indexing.
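The tag derivation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the persister's actual code: it assumes the four fields are concatenated as plain strings with no separator and encoded as UTF-8, so tags computed here will not necessarily match the ones EvalHub generates. The deterministic_tag function name is hypothetical.

```python
import hashlib
import re

# Tag grammar from the OCI distribution spec: at most 128 characters.
OCI_TAG_PATTERN = re.compile(r"^[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}$")


def deterministic_tag(job_id: str, provider_id: str,
                      benchmark_id: str, benchmark_index: int) -> str:
    # Assumption: plain concatenation, UTF-8 encoded, no separators.
    material = f"{job_id}{provider_id}{benchmark_id}{benchmark_index}"
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    return f"evalhub-{digest}"


tag = deterministic_tag("eval-abc123", "lm_evaluation_harness", "leaderboard_ifeval", 0)
print(tag)
print(bool(OCI_TAG_PATTERN.match(tag)))  # the prefixed hex string is a valid OCI tag
```

Because the inputs identify the run rather than its results, re-deriving the tag only locates the artifact; integrity still comes from the digest.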
Annotations
Every pushed artifact carries a set of standard OCI annotations merged with any user-provided annotations. Default annotations:
| Annotation | Value |
|---|---|
| org.opencontainers.image.created | ISO 8601 timestamp of the push |
| io.github.eval-hub.job | Job ID |
| io.github.eval-hub.benchmark | Benchmark ID |
| io.github.eval-hub.provider | Provider ID (if present) |
User-provided annotations in coordinates.annotations take precedence over defaults. Use these annotations to attach deployment-relevant metadata (such as environment, model family, collection version, and compliance tags) to make the artifact queryable in your registry's search interface.
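The precedence rule amounts to a dictionary merge in which user keys are applied last. The following sketch uses the default annotation keys from the table; the build_annotations function is a hypothetical illustration rather than the SDK's implementation.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_annotations(job_id: str, benchmark_id: str,
                      provider_id: Optional[str] = None,
                      user_annotations: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    defaults = {
        "org.opencontainers.image.created": datetime.now(timezone.utc).isoformat(),
        "io.github.eval-hub.job": job_id,
        "io.github.eval-hub.benchmark": benchmark_id,
    }
    if provider_id:  # the provider annotation is only set when a provider is known
        defaults["io.github.eval-hub.provider"] = provider_id
    # Later keys win, so user-provided annotations override the defaults.
    return {**defaults, **(user_annotations or {})}


merged = build_annotations(
    "eval-abc123", "leaderboard_ifeval",
    user_annotations={"environment": "staging", "model-family": "llama-3"},
)
for key in sorted(merged):
    print(key, "=", merged[key])
```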
Authentication: Kubernetes sidecar versus local Docker config
EvalHub supports different authentication workflows depending on whether you run your evaluation jobs inside a Kubernetes cluster or in a local development environment.
Kubernetes mode
In Kubernetes mode (EVALHUB_MODE=k8s), the EvalHub-managed pod runs two containers: the adapter and a sidecar. The sidecar acts as an authenticated proxy for registry operations:
    Adapter container
      → OCIArtifactPersister.persist()
      → routes push through sidecar proxy (localhost:8080)
      → sidecar performs bearer token challenge with actual registry
      → sidecar uses credentials from the K8s Secret named in k8s.connection
      → push completes; digest returned to adapter
      → artifact reference uses original registry host (not the proxy address)

The credentials Secret must be of type kubernetes.io/dockerconfigjson:
    apiVersion: v1
    kind: Secret
    metadata:
      name: registry-credentials
      namespace: evalhub
    type: kubernetes.io/dockerconfigjson
    stringData:  # stringData accepts raw JSON; the data field would require a base64-encoded value
      .dockerconfigjson: |
        {
          "auths": {
            "quay.io": {
              "auth": "<base64(username:password)>"
            }
          }
        }

The adapter code never handles registry credentials directly. Authentication is fully delegated to the sidecar, which reads fresh Kubernetes ServiceAccount tokens per request and handles token rotation automatically.
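The auth value inside .dockerconfigjson is the base64 encoding of username:password. A minimal sketch for generating that document follows; the docker_config_json helper and the sample credentials are hypothetical, and in practice a robot account or token would replace a personal password.

```python
import base64
import json


def docker_config_json(registry: str, username: str, password: str) -> str:
    """Build the JSON document stored under the .dockerconfigjson key."""
    # The auth field is base64("username:password").
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": auth}}}, indent=2)


# Placeholder credentials for illustration only.
print(docker_config_json("quay.io", "my-robot-account", "example-token"))
```

Note that a Secret's data field expects each value to be base64-encoded as a whole, whereas stringData accepts the raw JSON shown here.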
Local development mode
In local mode (EVALHUB_MODE=local, the default), the persister reads Docker credentials from ~/.docker/config.json (or the path in DOCKER_CONFIG). Log in to your registry before running:
    docker login quay.io
    EVALHUB_MODE=local python main.py

No other configuration change is needed. The same adapter code runs in both modes.
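To check which credentials a local run would pick up, the Docker config file can be inspected directly. The sketch below follows the Docker CLI convention that DOCKER_CONFIG names a directory containing config.json; whether the EvalHub persister resolves the path the same way is an assumption, and both function names are hypothetical.

```python
import base64
import json
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def docker_config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    env = os.environ if env is None else env
    # Docker CLI convention: DOCKER_CONFIG points at a directory, not a file.
    base = env.get("DOCKER_CONFIG")
    directory = Path(base) if base else Path.home() / ".docker"
    return directory / "config.json"


def read_basic_auth(config_file: Path, registry: str) -> Optional[Tuple[str, str]]:
    """Return (username, password) for a registry, or None if no inline auth exists."""
    config = json.loads(config_file.read_text(encoding="utf-8"))
    encoded = config.get("auths", {}).get(registry, {}).get("auth")
    if not encoded:
        return None  # credentials may live in a credential helper instead
    username, _, password = base64.b64decode(encoded).decode("utf-8").partition(":")
    return username, password


print(docker_config_path())
```

If read_basic_auth returns None after docker login, the credentials are likely held by a credential store, which this sketch does not query.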
Retrieving artifact references
The OCI artifact reference and digest are returned in the evaluation result:
    # Full result including OCI artifact reference
    evalhub eval results $JOB_ID --format json

    [
      {
        "id": "eval-abc123",
        "benchmark_id": "leaderboard_ifeval",
        "model_name": "meta-llama/Llama-3.2-3B-Instruct",
        "results": [
          { "metric_name": "inst_level_strict_acc", "metric_value": 0.712 }
        ],
        "overall_score": 71.2,
        "oci_artifact": {
          "digest": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3",
          "oci_ref": "quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3"
        },
        "mlflow_run_id": "3f7a2c1b4d6e8f0a"
      }
    ]

The oci_ref is the complete, pullable reference. The digest is the tamper-evident fingerprint. EvalHub writes both values to the MLflow experiment record alongside the metrics. This setup allows you to run a standard MLflow query to find evaluation runs that produced an artifact for model X in collection version Y. Each query result includes a pullable artifact reference.
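When the JSON output is consumed from a script, the references can be collected per benchmark. This minimal sketch relies only on the field names shown in the sample output; the artifact_refs helper is hypothetical.

```python
import json


def artifact_refs(results_json: str) -> dict:
    """Map each benchmark_id to its pullable OCI reference."""
    refs = {}
    for entry in json.loads(results_json):
        # Jobs run without an exports block have no oci_artifact to report.
        artifact = entry.get("oci_artifact") or {}
        if artifact.get("oci_ref"):
            refs[entry["benchmark_id"]] = artifact["oci_ref"]
    return refs


sample = json.dumps([
    {"benchmark_id": "leaderboard_ifeval",
     "oci_artifact": {"digest": "sha256:abc", "oci_ref": "quay.io/my-org/eval-results@sha256:abc"}},
    {"benchmark_id": "no-export-benchmark"},
])
print(artifact_refs(sample))
```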
Pull and verify the artifact using ORAS
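    # Pull and inspect using oras
    oras pull quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d08... \
      --output ./retrieved-results
    # Verify digest before using (sha256sum --check expects two spaces between hash and file name)
    echo "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3  retrieved-results/results.json" \
      | sha256sum --check

The digest check verifies your data: if the file hash matches the expected digest, the file is exactly what the evaluation run produced. If the hashes do not match, the file has been tampered with or corrupted. Note that the digest in an OCI reference is computed over the artifact manifest, which in turn records the digests of the content it points to. Pulling by digest with oras already verifies that chain, and sha256sum adds an independent check when you have a file-level hash to compare against.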
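The same file-level check can be done in Python when sha256sum is not available, for example in a cross-platform audit script. This is a minimal sketch: file_matches_digest is a hypothetical helper, and the expected value is assumed to be the SHA-256 hash of the file itself, written in the sha256:<hex> form.

```python
import hashlib
import hmac
from pathlib import Path


def file_matches_digest(path: Path, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a file's SHA-256 hash with a digest of the form sha256:<hex>."""
    algorithm, _, expected_hex = expected.partition(":")
    if algorithm != "sha256" or not expected_hex:
        raise ValueError(f"unsupported digest: {expected!r}")
    hasher = hashlib.sha256()
    # Read in chunks so large result files do not have to fit in memory.
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hmac.compare_digest(hasher.hexdigest(), expected_hex.lower())
```

A caller would pass the path of a pulled file, such as retrieved-results/results.json, together with the hash recorded for that file.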
End-to-end: Evaluation run to auditable artifact
    # 1. Configure OCI export in eval.yaml
    #    (add exports.oci block with coordinates and k8s.connection)
    # 2. Run evaluation; the OCI push happens automatically at job completion
    evalhub eval run --config eval.yaml --wait --timeout 3600
    # 3. Retrieve the artifact reference from results
    evalhub eval results $JOB_ID --format json | \
      jq -r '.[0].oci_artifact.oci_ref'
    # → quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081...
    # 4. Store the reference in your deployment record / SBOM / audit log
    ARTIFACT_REF=$(evalhub eval results $JOB_ID --format json | \
      jq -r '.[0].oci_artifact.oci_ref')
    # 5. At any future point, pull and verify
    oras pull "$ARTIFACT_REF" --output ./audit-evidence

In a CI/CD pipeline, step 4 writes the artifact reference into the deployment manifest or signed software bill of materials (SBOM). A compliance audit then follows the reference from the deployed model version back to the immutable evaluation artifact. This requires no reconstruction, no log parsing, and no reliance on a queryable database that might have changed.
Integrating OCI export into a pipeline gate
To use EvalHub within an automated pipeline, you must first define the OCI coordinates in your evaluation configuration file. The following example shows an eval.yaml file configured to pass environment and pull request metadata as custom annotations:
    # eval.yaml with OCI export
    name: "llama-3.2-pr-gate"
    model:
      url: "http://vllm-service:8000/v1"
      name: "meta-llama/Llama-3.2-3B-Instruct"
    collection:
      id: "general-assistant-gate-v1"
    exports:
      oci:
        coordinates:
          oci_host: "quay.io"
          oci_repository: "my-org/eval-results"
          annotations:
            git-commit: "${GIT_COMMIT}"
            pr-number: "${PR_NUMBER}"
        k8s:
          connection: "registry-credentials"

After defining your configuration, add a step to your continuous integration pipeline to automate the evaluation run. The following GitHub Actions workflow substitutes your active environment variables into the eval.yaml file, triggers the evaluation job, and captures the resulting immutable artifact reference:
    # .github/workflows/model-eval.yaml
    - name: Run evaluation gate with OCI export
      env:
        EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
        EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
      run: |
        # Substitute commit and PR into annotations
        sed -i "s/\${GIT_COMMIT}/${{ github.sha }}/g" eval.yaml
        sed -i "s/\${PR_NUMBER}/${{ github.event.number }}/g" eval.yaml
        evalhub eval run --config eval.yaml --wait --timeout 3600
    - name: Capture artifact reference
      if: success()
      env:
        EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
        EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
      run: |
        JOB_ID=$(evalhub eval status --status completed --since 1h --format json \
          | jq -r '.[0].id')
        ARTIFACT_REF=$(evalhub eval results "$JOB_ID" --format json \
          | jq -r '.[0].oci_artifact.oci_ref')
        echo "EVAL_ARTIFACT_REF=$ARTIFACT_REF" >> $GITHUB_ENV
        echo "Evaluation artifact: $ARTIFACT_REF"

The git-commit annotation in the OCI artifact links the immutable evaluation evidence directly to the source commit. Anyone auditing that deployment can pull the artifact by reference to verify the evaluation results that preceded it.
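The placeholder substitution can also be done in Python, which fails loudly when a placeholder is left unfilled instead of shipping a literal ${PR_NUMBER} into the annotations. This is a sketch of an alternative to the sed commands, not part of EvalHub; render_eval_config is a hypothetical helper, and it assumes eval.yaml contains no other dollar signs, because string.Template treats every $ as the start of a placeholder.

```python
from string import Template


def render_eval_config(template_text: str, git_commit: str, pr_number: str) -> str:
    # substitute() raises KeyError if the template references a placeholder
    # that is not supplied here, so a typo cannot pass silently.
    return Template(template_text).substitute(
        GIT_COMMIT=git_commit,
        PR_NUMBER=pr_number,
    )


snippet = 'git-commit: "${GIT_COMMIT}"\npr-number: "${PR_NUMBER}"\n'
print(render_eval_config(snippet, "0123abc", "42"))
```

In a workflow step, the rendered text would be written back to eval.yaml before calling evalhub eval run.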
Why map evaluation results to OCI artifacts?
Review the following comparison to see how enabling OCI persistence shifts your evaluation workflows and audit capabilities:
| Concern | Without OCI persistence | With OCI persistence |
|---|---|---|
| Evidence durability | MLflow database (mutable) | Content-addressed OCI artifact (immutable) |
| Tamper detection | None | SHA256 digest mismatch on pull |
| Evidence retrieval | Query MLflow (requires live server) | oras pull <reference> (requires only registry access) |
| Audit trail | Log entries and dashboard screenshots | Pullable, verifiable artifact with structured annotations |
| Compliance reporting | Manual reconstruction | Reference in deployment manifest → oras pull → structured result files |
The MLflow integration and OCI persistence are complementary, not alternatives. MLflow gives you queryability across runs to support trend analysis, regression detection, and experiment comparison. OCI gives you durability and verifiability for each result. Use both.
Next steps to get started with EvalHub
Ready to implement immutable artifact tracking for your machine learning workflows? Explore the following open source resources from the EvalHub community:
- EvalHub website
- EvalHub SDK (OCI persistence, OCIArtifactPersister, DefaultCallbacks)
- eval-hub-contrib (reference adapters showing OCI persistence in context)
- EvalHub server (jobs API, exports field)
- ORAS (OCI artifact push/pull tool)