Store immutable AI evaluation records with EvalHub and OCI

Part 2 of this series, EvalHub: Because "looks good to me" isn't a benchmark, identified the reproducibility crisis as a structural failure in enterprise AI evaluation: benchmark scores without the environment metadata that produced them are claims, not evidence. Part 3, Evaluation-driven development with EvalHub, showed how EvalHub's automatic MLflow integration closes that gap for experiment tracking by recording every evaluation run with its configuration, model version, collection version, and hardware tags.

MLflow solves the queryability problem. It does not solve the immutability problem.

Series note

This is part 7 in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:

Part 1: How EvalHub manages two-layer Kubernetes control planes
Part 2: EvalHub: Because "looks good to me" isn't a benchmark
Part 3: Evaluation-driven development with EvalHub
Part 4: Understanding evaluation collections in EvalHub
Part 5: Bring your own evaluation framework to EvalHub
Part 6: Add automated AI evaluations to your CI/CD pipeline
Part 7: Store immutable AI evaluation records with EvalHub and OCI
Part 8: Manage LLM evaluation workloads at scale with EvalHub and Kueue
Part 9: Connect EvalHub to protected production model servers

The governance gap that MLflow alone cannot close

An MLflow experiment record is mutable. Users can delete or overwrite entries, and entries can be lost if the tracking server is rebuilt. For internal iteration, that is fine. For workloads governed by frameworks such as the EU AI Act, FedRAMP High, and SOC 2, the evidence that a model met its evaluation criteria before deployment needs to be tamper-evident, content-addressed, and independently verifiable. A database row does not satisfy that requirement. A signed OCI artifact in a content-addressable registry does.

EvalHub's OCI persistence layer pushes evaluation result artifacts to any OCI-compliant registry at the end of every evaluation run. The artifact reference includes a digest of the form sha256:..., which makes it immutable by construction: if the contents change, the digest changes. EvalHub stores the digest in the JobResults and writes it to MLflow alongside the experiment record. The full provenance chain is: evaluation run → MLflow experiment (queryable) → OCI artifact (immutable).

How OCI persistence works

OCI persistence is opt-in, configured per evaluation job via the exports block. When configured, the adapter calls callbacks.create_oci_artifact() after the evaluation completes, which delegates to the OCIArtifactPersister internally. The persister:

Takes the results directory (any files your adapter wrote, such as JSON metrics, raw outputs, and logs).
Creates an OCI artifact layout using olot.
Authenticates to the registry (via Docker config in local mode, via Kubernetes sidecar in cluster mode).
Pushes the artifact using oras.
Retrieves the Docker-Content-Digest response header.
Returns an OCIArtifactResult with the digest and full artifact reference Uniform Resource Identifier (URI).

The result is a content-addressed, immutable artifact at a stable reference like:

quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3

EvalHub stores that reference in JobResults.oci_artifact, reports it back to the EvalHub server, and writes it as MLflow artifact metadata. Pulling that reference six months from now returns exactly the results that the evaluation run produced, or it fails with a digest mismatch if anything has been altered.
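If you need to work with these references in your own tooling, it helps to split them into host, repository, tag, and digest. The following is a minimal sketch in Python, not part of the EvalHub SDK. The OCIReference class and parse_oci_reference function are hypothetical names, and the sketch assumes the host/repository[:tag][@digest] shape shown above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OCIReference:
    """Hypothetical container for the parts of an artifact reference."""
    host: str
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_oci_reference(ref: str) -> OCIReference:
    """Split host/repository[:tag][@digest] into its parts."""
    # Everything after "@" is the digest (for example, sha256:...).
    name, _, digest = ref.partition("@")
    # The first path segment is the registry host (it may include a port).
    host, slash, remainder = name.partition("/")
    if not slash or not remainder:
        raise ValueError(f"expected host/repository in {ref!r}")
    # A tag separator can only appear in the last segment of the repository path.
    last_segment_start = remainder.rfind("/") + 1
    colon = remainder.find(":", last_segment_start)
    if colon == -1:
        repository, tag = remainder, None
    else:
        repository, tag = remainder[:colon], remainder[colon + 1:]
    return OCIReference(host, repository, tag, digest or None)
```

The digest is the part that pins the content. The tag is only a convenience label, which is why an audit record should always keep the full reference including the digest.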

How to configure OCI exports

You can configure EvalHub to export evaluation results using a local configuration file or the REST API.

In eval.yaml

Add an exports block to your evaluation configuration file:

name: "llama-3.2-staging-gate"
model:
  url: "http://vllm-service:8000/v1"
  name: "meta-llama/Llama-3.2-3B-Instruct"

collection:
  id: "general-assistant-gate-v1"

exports:
  oci:
    coordinates:
      oci_host: "quay.io"
      oci_repository: "my-org/eval-results"
      oci_tag: "llama-3.2-staging-2026-04-01"  # optional; auto-generated if omitted
      annotations:
        environment: "staging"
        model-family: "llama-3"
        collection-version: "v1"
    k8s:
      connection: "registry-credentials"   # name of K8s Secret (type: dockerconfigjson)

The k8s.connection property is required in Kubernetes mode. This property names the kubernetes.io/dockerconfigjson Secret that holds registry credentials. In local mode, omit the k8s block so the persister reads from ~/.docker/config.json.

OCI coordinates field reference

Field	Required	Description
`oci_host`	Yes	Registry hostname (for example, `quay.io`, `registry.example.com`)
`oci_repository`	Yes	Repository path (for example, `my-org/eval-results`)
`oci_tag`	No	Custom tag; deterministic SHA256 tag generated if omitted
`oci_subject`	No	Optional subject identifier within the same registry or repository
`annotations`	No	Custom key-value metadata merged into OCI annotations
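
A pre-flight check on the coordinates block can catch mistakes before a long evaluation run starts. The following sketch is illustrative only: validate_coordinates is a hypothetical helper, not an EvalHub function. It applies the required fields from the table above, the tag grammar from the OCI distribution specification, and the OCI rule that annotations map strings to strings.

```python
import re

# Tag grammar from the OCI distribution specification.
TAG_PATTERN = re.compile(r"[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}")
REQUIRED_FIELDS = ("oci_host", "oci_repository")


def validate_coordinates(coordinates: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not coordinates.get(field):
            problems.append(f"missing required field: {field}")
    tag = coordinates.get("oci_tag")
    if tag is not None and not TAG_PATTERN.fullmatch(tag):
        problems.append(f"invalid oci_tag: {tag!r}")
    annotations = coordinates.get("annotations", {})
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in annotations.items()):
        problems.append("annotations must map strings to strings")
    return problems
```

Whether the EvalHub server performs the same checks on submission is not covered here, so treat this as a client-side convenience rather than a description of server behavior.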

In the REST API

The same exports structure is accepted in the POST /api/v1/evaluations/jobs endpoint directly:

curl -X POST http://evalhub-service:8080/api/v1/evaluations/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2-staging-gate",
    "model": {
      "url": "http://vllm-service:8000/v1",
      "name": "meta-llama/Llama-3.2-3B-Instruct"
    },
    "collection_id": "general-assistant-gate-v1",
    "exports": {
      "oci": {
        "coordinates": {
          "oci_host": "quay.io",
          "oci_repository": "my-org/eval-results",
          "annotations": {
            "environment": "staging",
            "model-family": "llama-3"
          }
        },
        "k8s": {
          "connection": "registry-credentials"
        }
      }
    },
    "experiment": {
      "name": "llama-3.2-staging-eval",
      "tags": { "environment": "staging" }
    }
  }'
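
The same request can be built from Python when you submit jobs from a script instead of a shell. This sketch uses only the standard library and mirrors the curl call above. The build_job_request helper is a hypothetical name, and the request is built without being sent so that you can inspect it first.

```python
import json
import urllib.request


def build_job_request(base_url: str, job: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a new evaluation job."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/evaluations/jobs",
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same payload as the curl example above.
job = {
    "name": "llama-3.2-staging-gate",
    "model": {
        "url": "http://vllm-service:8000/v1",
        "name": "meta-llama/Llama-3.2-3B-Instruct",
    },
    "collection_id": "general-assistant-gate-v1",
    "exports": {
        "oci": {
            "coordinates": {
                "oci_host": "quay.io",
                "oci_repository": "my-org/eval-results",
                "annotations": {"environment": "staging", "model-family": "llama-3"},
            },
            "k8s": {"connection": "registry-credentials"},
        }
    },
    "experiment": {
        "name": "llama-3.2-staging-eval",
        "tags": {"environment": "staging"},
    },
}

request = build_job_request("http://evalhub-service:8080", job)
# To submit the job, pass the request to urllib.request.urlopen(request).
```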

Deterministic tags

If oci_tag is omitted, the persister generates a deterministic tag from the evaluation context:

SHA256(job_id + provider_id + benchmark_id + benchmark_index)

The result is a hex string conforming to OCI tag specifications (alphanumeric, underscores, periods, hyphens; maximum of 128 characters), prefixed with evalhub-:

evalhub-a3f7c1b2d4e6f8a0b2c4d6e8f0a2b4c6d8e0f2a4b6c8d0e2f4a6b8c0d2e4f6

Deterministic tags mean the same evaluation context (job ID, provider ID, benchmark ID, and benchmark index) always produces the same tag, so you can find the artifact for a specific evaluation run without storing the reference externally. The digest still changes if the results change, which provides evidence of tampering; the tag provides human-navigable indexing.
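
To make the idea concrete, here is a rough Python sketch of how such a tag could be derived. It is an assumption-laden illustration, not the persister's actual code: the deterministic_tag name is hypothetical, and the formula above does not say how the fields are encoded or separated, so this sketch simply concatenates their string forms. Tags computed this way are not guaranteed to match the ones EvalHub generates.

```python
import hashlib


def deterministic_tag(job_id: str, provider_id: str, benchmark_id: str, benchmark_index: int) -> str:
    """Derive a stable tag from the evaluation context."""
    # Assumption: fields are joined without a separator, following the
    # formula as written. The real persister may encode them differently.
    material = f"{job_id}{provider_id}{benchmark_id}{benchmark_index}"
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    # "evalhub-" plus 64 hex characters stays within the 128-character tag limit.
    return f"evalhub-{digest}"
```

Because the inputs identify the run rather than its output, the tag stays the same even if the result files differ. Only the digest reflects the content.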

Annotations

Every pushed artifact carries a set of standard OCI annotations merged with any user-provided annotations. Default annotations:

Annotation	Value
`org.opencontainers.image.created`	ISO 8601 timestamp of the push
`io.github.eval-hub.job`	Job ID
`io.github.eval-hub.benchmark`	Benchmark ID
`io.github.eval-hub.provider`	Provider ID (if present)

User-provided annotations in coordinates.annotations take precedence over defaults. Use these annotations to attach deployment-relevant metadata (such as environment, model family, collection version, and compliance tags) to make the artifact queryable in your registry's search interface.
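
The precedence rule is the usual "defaults first, overrides second" dictionary merge. The sketch below shows that behavior in Python. The merge_annotations function is a hypothetical stand-in for whatever the persister does internally, and the default keys are taken from the table above.

```python
from datetime import datetime, timezone


def merge_annotations(job_id, benchmark_id, provider_id=None, user_annotations=None):
    """Build default annotations, then let user-provided keys override them."""
    annotations = {
        "org.opencontainers.image.created": datetime.now(timezone.utc).isoformat(),
        "io.github.eval-hub.job": job_id,
        "io.github.eval-hub.benchmark": benchmark_id,
    }
    # The provider annotation is only set when a provider ID is present.
    if provider_id:
        annotations["io.github.eval-hub.provider"] = provider_id
    # update() runs last, so user-provided values win on key collisions.
    annotations.update(user_annotations or {})
    return annotations
```

One practical consequence: a user annotation that reuses a default key replaces the default value, so avoid the org.opencontainers and io.github.eval-hub prefixes for your own metadata unless you intend to override them.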

Authentication: Kubernetes sidecar versus local Docker config

EvalHub supports different authentication workflows depending on whether you run your evaluation jobs inside a Kubernetes cluster or in a local development environment.

Kubernetes mode

In Kubernetes mode (EVALHUB_MODE=k8s), the EvalHub-managed pod runs two containers: the adapter and a sidecar. The sidecar acts as an authenticated proxy for registry operations:

Adapter container
  → OCIArtifactPersister.persist()
  → routes push through sidecar proxy (localhost:8080)
  → sidecar performs bearer token challenge with actual registry
  → sidecar uses credentials from the K8s Secret named in k8s.connection
  → push completes; digest returned to adapter
  → artifact reference uses original registry host (not the proxy address)

The credentials Secret must be of type kubernetes.io/dockerconfigjson:

apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
  namespace: evalhub
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {
      "auths": {
        "quay.io": {
          "auth": "<base64(username:password)>"
        }
      }
    }

The adapter code never handles registry credentials directly. Authentication is fully delegated to the sidecar, which reads fresh Kubernetes ServiceAccount tokens per request and handles token rotation automatically.
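
If you generate the Secret payload from a script, two layers of encoding are easy to confuse. The auth value inside the Docker config is base64(username:password). Separately, Kubernetes expects values under a Secret's data field to be base64-encoded, while stringData accepts the raw JSON. The following sketch shows both steps. The function names are hypothetical, and the credentials in any call are placeholders.

```python
import base64
import json


def docker_config_json(registry: str, username: str, password: str) -> str:
    """Build the JSON document stored under the .dockerconfigjson key."""
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": auth}}})


def secret_data_value(config_json: str) -> str:
    """Base64-encode the whole document for use under the Secret's data field."""
    return base64.b64encode(config_json.encode("utf-8")).decode("ascii")
```

In practice, a robot account or token with push access scoped to the results repository is a safer choice than a personal password.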

Local development mode

In local mode (EVALHUB_MODE=local, the default), the persister reads Docker credentials from ~/.docker/config.json (or the path in DOCKER_CONFIG). Log in to your registry before running:

docker login quay.io
EVALHUB_MODE=local python main.py

No other configuration change is needed. The same adapter code runs in both modes.
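
Before a local run, you can check that credentials for your registry are actually present. This sketch follows the Docker CLI convention, where DOCKER_CONFIG names a directory that contains config.json. Whether EvalHub's persister resolves the path in exactly the same way is an assumption, and both function names are hypothetical.

```python
import os
from pathlib import Path


def docker_config_path(env=None) -> Path:
    """Resolve the Docker config file, honoring DOCKER_CONFIG if set."""
    env = os.environ if env is None else env
    config_dir = env.get("DOCKER_CONFIG")
    if config_dir:
        # Docker CLI convention: DOCKER_CONFIG is a directory, not a file.
        return Path(config_dir) / "config.json"
    return Path.home() / ".docker" / "config.json"


def registries_with_inline_auth(config: dict) -> list[str]:
    """List registries whose entry carries an inline auth value."""
    # Credential helpers (credsStore) keep secrets outside this file, so an
    # entry can exist here without an "auth" field.
    return [host for host, entry in config.get("auths", {}).items() if entry.get("auth")]
```

To use it, load the file at docker_config_path() with json.loads and confirm that your registry host appears in the returned list.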

Retrieving artifact references

The OCI artifact reference and digest are returned in the evaluation result:

# Full result including OCI artifact reference
evalhub eval results $JOB_ID --format json

[
  {
    "id": "eval-abc123",
    "benchmark_id": "leaderboard_ifeval",
    "model_name": "meta-llama/Llama-3.2-3B-Instruct",
    "results": [
      { "metric_name": "inst_level_strict_acc", "metric_value": 0.712 }
    ],
    "overall_score": 71.2,
    "oci_artifact": {
      "digest": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3",
      "oci_ref": "quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3"
    },
    "mlflow_run_id": "3f7a2c1b4d6e8f0a"
  }
]

The oci_ref is the complete, pullable reference. The digest is the tamper-evident fingerprint. EvalHub writes both values to the MLflow experiment record alongside the metrics. This setup allows you to run a standard MLflow query to find evaluation runs that produced an artifact for model X in collection version Y. Each query result includes a pullable artifact reference.
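
When you process this output in a script, it is worth checking that each oci_ref really pins the digest reported next to it. The following sketch assumes the JSON shape shown above. The artifact_references helper is a hypothetical name, not part of the EvalHub CLI or SDK.

```python
import json


def artifact_references(results_json: str) -> list[dict]:
    """Extract artifact references from the JSON output of the results command."""
    refs = []
    for entry in json.loads(results_json):
        artifact = entry.get("oci_artifact")
        if not artifact:
            continue  # this result has no OCI export
        digest, oci_ref = artifact["digest"], artifact["oci_ref"]
        # A pullable, tamper-evident reference must end with the pinned digest.
        if not oci_ref.endswith("@" + digest):
            raise ValueError(f"oci_ref does not pin the digest for {entry.get('id')}")
        refs.append({"benchmark_id": entry["benchmark_id"], "oci_ref": oci_ref, "digest": digest})
    return refs
```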

Pull and verify the artifact using ORAS

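# Pull by digest; ORAS verifies the downloaded content against the pinned digest
oras pull quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d08... \
  --output ./retrieved-results

# Optionally recompute the manifest digest yourself;
# the output should match the sha256 value in the reference
oras manifest fetch quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d08... \
  | sha256sum

The digest in the artifact reference is the SHA-256 hash of the artifact manifest, which in turn records the digest of every file layer. It is not the hash of an individual file such as results.json, so comparing it directly against a result file does not work. If the recomputed manifest digest matches the digest in the reference, the pulled files are exactly what the evaluation run pushed. If the pull fails or the digests differ, the artifact has been tampered with or corrupted.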
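
The same kind of comparison can be done programmatically: hash the bytes you fetched and compare the result with an expected sha256: digest string. A minimal sketch follows, with matches_digest as a hypothetical helper name. It only handles SHA-256 digests.

```python
import hashlib
import hmac


def matches_digest(content: bytes, expected: str) -> bool:
    """Check raw bytes against a digest string of the form sha256:<hex>."""
    algorithm, _, expected_hex = expected.partition(":")
    if algorithm != "sha256" or not expected_hex:
        raise ValueError(f"unsupported digest: {expected!r}")
    actual_hex = hashlib.sha256(content).hexdigest()
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(actual_hex, expected_hex.lower())
```

A single changed byte produces a completely different hash, which is what makes the digest useful as a tamper-evident fingerprint.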

End-to-end: Evaluation run to auditable artifact

# 1. Configure OCI export in eval.yaml
#    (add exports.oci block with coordinates and k8s.connection)

# 2. Run evaluation — OCI push happens automatically at job completion
evalhub eval run --config eval.yaml --wait --timeout 3600

# 3. Retrieve the artifact reference from results
evalhub eval results $JOB_ID --format json | \
  jq -r '.[0].oci_artifact.oci_ref'
# → quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081...

# 4. Store the reference in your deployment record / SBOM / audit log
ARTIFACT_REF=$(evalhub eval results $JOB_ID --format json | \
  jq -r '.[0].oci_artifact.oci_ref')

# 5. At any future point, pull and verify
oras pull "$ARTIFACT_REF" --output ./audit-evidence

In a CI/CD pipeline, step 4 writes the artifact reference into the deployment manifest or signed software bill of materials (SBOM). A compliance audit then follows the reference from the deployed model version back to the immutable evaluation artifact. This requires no reconstruction, no log parsing, and no reliance on a queryable database that might have changed.

Integrating OCI export into a pipeline gate

To use EvalHub within an automated pipeline, you must first define the OCI coordinates in your evaluation configuration file. The following example shows an eval.yaml file configured to pass environment and pull request metadata as custom annotations:

# eval.yaml with OCI export
name: "llama-3.2-pr-gate"
model:
  url: "http://vllm-service:8000/v1"
  name: "meta-llama/Llama-3.2-3B-Instruct"
collection:
  id: "general-assistant-gate-v1"
exports:
  oci:
    coordinates:
      oci_host: "quay.io"
      oci_repository: "my-org/eval-results"
      annotations:
        git-commit: "${GIT_COMMIT}"
        pr-number: "${PR_NUMBER}"
    k8s:
      connection: "registry-credentials"

After defining your configuration, add a step to your continuous integration pipeline to automate the evaluation run. The following GitHub Actions workflow substitutes your active environment variables into the eval.yaml file, triggers the evaluation job, and captures the resulting immutable artifact reference:

# .github/workflows/model-eval.yaml
- name: Run evaluation gate with OCI export
  env:
    EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
    EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
  run: |
    # Substitute commit and PR into annotations
    sed -i "s/\${GIT_COMMIT}/${{ github.sha }}/g" eval.yaml
    sed -i "s/\${PR_NUMBER}/${{ github.event.number }}/g" eval.yaml

    evalhub eval run --config eval.yaml --wait --timeout 3600

- name: Capture artifact reference
  if: success()
  env:
    EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
    EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
  run: |
    JOB_ID=$(evalhub eval status --status completed --since 1h --format json \
      | jq -r '.[0].id')
    ARTIFACT_REF=$(evalhub eval results "$JOB_ID" --format json \
      | jq -r '.[0].oci_artifact.oci_ref')
    echo "EVAL_ARTIFACT_REF=$ARTIFACT_REF" >> $GITHUB_ENV
    echo "Evaluation artifact: $ARTIFACT_REF"

The git-commit annotation in the OCI artifact links the immutable evaluation evidence directly to the source commit. Anyone auditing that deployment can pull the artifact by reference to verify the evaluation results that preceded it.
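
If you prefer not to rely on sed for the placeholder substitution, Python's string.Template understands the same ${NAME} syntax. The sketch below is one possible alternative, not something the EvalHub CLI provides, and render_eval_config is a hypothetical helper name.

```python
from string import Template


def render_eval_config(template_text: str, values: dict) -> str:
    """Fill ${NAME} placeholders in an eval.yaml template."""
    # substitute() raises KeyError when a placeholder has no value, so a
    # missing variable fails the pipeline step instead of being pushed as-is.
    return Template(template_text).substitute(values)
```

Failing fast matters here because annotations become part of the pushed artifact. An unresolved placeholder would otherwise end up in the audit evidence.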

Why map evaluation results to OCI artifacts?

Review the following comparison to see how enabling OCI persistence shifts your evaluation workflows and audit capabilities:

Concern	Without OCI persistence	With OCI persistence
Evidence durability	MLflow database (mutable)	Content-addressed OCI artifact (immutable)
Tamper detection	None	SHA256 digest mismatch on pull
Evidence retrieval	Query MLflow (requires live server)	`oras pull <reference>` (requires only registry access)
Audit trail	Log entries and dashboard screenshots	Pullable, verifiable artifact with structured annotations
Compliance reporting	Manual reconstruction	Reference in deployment manifest → `oras pull` → structured result files

The MLflow integration and OCI persistence are complementary, not alternatives. MLflow gives you queryability across runs to support trend analysis, regression detection, and experiment comparison. OCI gives you durability and verifiability for each result. Use both.

Next steps to get started with EvalHub

Ready to implement immutable artifact tracking for your machine learning workflows? Explore the following open source resources from the EvalHub community:

EvalHub website
EvalHub SDK (OCI persistence, OCIArtifactPersister, DefaultCallbacks)
eval-hub-contrib (reference adapters showing OCI persistence in context)
EvalHub server (jobs API, exports field)
ORAS (OCI artifact push/pull tool)

Last updated: June 23, 2026
