Store immutable AI evaluation records with EvalHub and OCI

Beyond MLflow

June 16, 2026
William Caban Babilonia, Matteo Mortari
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    Part 2 of this series, EvalHub: Because "looks good to me" isn't a benchmark, identified the reproducibility crisis as a structural failure in enterprise AI evaluation: benchmark scores without the environment metadata that produced them are claims, not evidence. Part 3, Evaluation-driven development with EvalHub, showed how EvalHub's automatic MLflow integration closes that gap for experiment tracking by recording every evaluation run with its configuration, model version, collection version, and hardware tags.

    MLflow solves the queryability problem. It does not solve the immutability problem.

    Series note

    This is part 7 in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:

    • Part 1: How EvalHub manages two-layer Kubernetes control planes
    • Part 2: EvalHub: Because "looks good to me" isn't a benchmark
    • Part 3: Evaluation-driven development with EvalHub
    • Part 4: Understanding evaluation collections in EvalHub
    • Part 5: Bring your own evaluation framework to EvalHub
    • Part 6: Add automated AI evaluations to your CI/CD pipeline

    The governance gap that MLflow alone cannot close

    An MLflow experiment record is mutable. Users can delete or overwrite entries, and entries can be lost if the tracking server is rebuilt. For internal iteration, that is fine. For workloads subject to regulatory and compliance frameworks (such as the EU AI Act, FedRAMP High, and SOC 2), the evidence that a model met its evaluation criteria before deployment needs to be tamper-evident, content-addressed, and independently verifiable. A database row does not satisfy that requirement. A signed OCI artifact in a content-addressable registry does.

    EvalHub's OCI persistence layer pushes evaluation result artifacts to any OCI-compliant registry at the end of every evaluation run. The artifact reference, which is a digest of the form sha256:..., is immutable: if the contents change, the digest changes. EvalHub stores the digest in the JobResults and writes it to MLflow alongside the experiment record. The full provenance chain is: evaluation run → MLflow experiment (queryable) → OCI artifact (immutable).

    How OCI persistence works

    OCI persistence is opt-in, configured per evaluation job via the exports block. When configured, the adapter calls callbacks.create_oci_artifact() after the evaluation completes, which delegates to the OCIArtifactPersister internally. The persister:

    1. Takes the results directory (any files your adapter wrote, such as JSON metrics, raw outputs, and logs).
    2. Creates an OCI artifact layout using olot.
    3. Authenticates to the registry (via Docker config in local mode, via Kubernetes sidecar in cluster mode).
    4. Pushes the artifact using oras.
    5. Retrieves the Docker-Content-Digest response header.
    6. Returns an OCIArtifactResult with the digest and full artifact reference Uniform Resource Identifier (URI).

    The result is a content-addressed, immutable artifact at a stable reference like:

    quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3

    EvalHub stores that reference in JobResults.oci_artifact, reports it back to the EvalHub server, and writes it as MLflow artifact metadata. Pulling that reference six months from now returns exactly the results that the evaluation run produced, or it fails with a digest mismatch if anything has been altered.
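
    A reference of this shape packs four pieces of information into one string: registry host, repository, tag, and digest. The following is a minimal sketch of splitting it apart for logging or audit records. The parse_oci_reference helper and the OCIReference type are hypothetical names for illustration and are not part of the EvalHub SDK.

```python
from typing import NamedTuple, Optional


class OCIReference(NamedTuple):
    host: str
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_oci_reference(ref: str) -> OCIReference:
    """Split host/repository[:tag][@digest] into its parts (illustrative helper)."""
    # The digest, if present, follows the "@".
    name, _, digest = ref.partition("@")
    # The registry host is everything before the first "/".
    host, _, path = name.partition("/")
    # A tag, if present, follows the last ":" in the repository path.
    if ":" in path:
        repository, _, tag = path.rpartition(":")
    else:
        repository, tag = path, ""
    return OCIReference(host, repository, tag or None, digest or None)


ref = parse_oci_reference(
    "quay.io/my-org/eval-results:evalhub-a3f7c1b2"
    "@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3"
)
print(ref.host, ref.repository, ref.tag)
```

    When both a tag and a digest are present, the digest is what pins the content; the tag is there for human navigation.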

    How to configure OCI exports

    You can configure EvalHub to export evaluation results using a local configuration file or the REST API.

    In eval.yaml

    Add an exports block to your evaluation configuration file:

    name: "llama-3.2-staging-gate"
    model:
      url: "http://vllm-service:8000/v1"
      name: "meta-llama/Llama-3.2-3B-Instruct"
    
    collection:
      id: "general-assistant-gate-v1"
    
    exports:
      oci:
        coordinates:
          oci_host: "quay.io"
          oci_repository: "my-org/eval-results"
          oci_tag: "llama-3.2-staging-2026-04-01"  # optional; auto-generated if omitted
          annotations:
            environment: "staging"
            model-family: "llama-3"
            collection-version: "v1"
        k8s:
          connection: "registry-credentials"   # name of K8s Secret (type: dockerconfigjson)

    The k8s.connection property is required in Kubernetes mode. This property names the kubernetes.io/dockerconfigjson Secret that holds registry credentials. In local mode, omit the k8s block so the persister reads from ~/.docker/config.json.

    OCI coordinates field reference

    Field | Required | Description
    oci_host | Yes | Registry hostname (for example, quay.io, registry.example.com)
    oci_repository | Yes | Repository path (for example, my-org/eval-results)
    oci_tag | No | Custom tag; deterministic SHA256 tag generated if omitted
    oci_subject | No | Optional subject identifier within the same registry or repository
    annotations | No | Custom key-value metadata merged into OCI annotations

    In the REST API

    The same exports structure is accepted in the POST /api/v1/evaluations/jobs endpoint directly:

    curl -X POST http://evalhub-service:8080/api/v1/evaluations/jobs \
      -H "Content-Type: application/json" \
      -d '{
        "name": "llama-3.2-staging-gate",
        "model": {
          "url": "http://vllm-service:8000/v1",
          "name": "meta-llama/Llama-3.2-3B-Instruct"
        },
        "collection_id": "general-assistant-gate-v1",
        "exports": {
          "oci": {
            "coordinates": {
              "oci_host": "quay.io",
              "oci_repository": "my-org/eval-results",
              "annotations": {
                "environment": "staging",
                "model-family": "llama-3"
              }
            },
            "k8s": {
              "connection": "registry-credentials"
            }
          }
        },
        "experiment": {
          "name": "llama-3.2-staging-eval",
          "tags": { "environment": "staging" }
        }
      }'
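
    If you submit jobs from a script rather than from curl, the same request can be prepared with the Python standard library. This sketch only builds the request object; the build_job_request helper is a hypothetical name, and the service URL is the one used in the curl example above.

```python
import json
import urllib.request


def build_job_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for the jobs endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/api/v1/evaluations/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same payload as the curl example, expressed as a Python dict.
payload = {
    "name": "llama-3.2-staging-gate",
    "model": {
        "url": "http://vllm-service:8000/v1",
        "name": "meta-llama/Llama-3.2-3B-Instruct",
    },
    "collection_id": "general-assistant-gate-v1",
    "exports": {
        "oci": {
            "coordinates": {
                "oci_host": "quay.io",
                "oci_repository": "my-org/eval-results",
                "annotations": {"environment": "staging", "model-family": "llama-3"},
            },
            "k8s": {"connection": "registry-credentials"},
        }
    },
    "experiment": {
        "name": "llama-3.2-staging-eval",
        "tags": {"environment": "staging"},
    },
}

request = build_job_request("http://evalhub-service:8080", payload)
print(request.get_method(), request.full_url)
# To submit the job, pass the request to urllib.request.urlopen(request).
```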

    Deterministic tags

    If oci_tag is omitted, the persister generates a deterministic tag from the evaluation context:

    SHA256(job_id + provider_id + benchmark_id + benchmark_index)

    The result is a hex string conforming to OCI tag specifications (alphanumeric, underscores, periods, hyphens; maximum of 128 characters), prefixed with evalhub-:

    evalhub-a3f7c1b2d4e6f8a0b2c4d6e8f0a2b4c6d8e0f2a4b6c8d0e2f4a6b8c0d2e4f6

    Because the tag is derived from identifiers rather than from the results, the same job, provider, and benchmark always produce the same tag, so you can recompute the tag and find the artifact for a specific evaluation run without storing the reference externally. The digest still changes if the results change, which provides evidence of tampering; the tag provides human-navigable indexing.
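
    The tag derivation can be sketched in a few lines of Python. How the persister actually joins the four fields before hashing (separators, encoding, field order) is not stated here, so the plain concatenation below is an assumption, and the tags this sketch produces are not guaranteed to match the ones EvalHub generates. The provider ID in the example call is also a made-up value.

```python
import hashlib
import re

# Tag grammar from the OCI distribution specification.
OCI_TAG_PATTERN = re.compile(r"^[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}$")


def deterministic_tag(job_id: str, provider_id: str,
                      benchmark_id: str, benchmark_index: int) -> str:
    """Derive an evalhub- prefixed tag from the evaluation context."""
    # Assumption: the fields are concatenated as plain strings with no separator.
    material = f"{job_id}{provider_id}{benchmark_id}{benchmark_index}"
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    return f"evalhub-{digest}"


# "example-provider" is a hypothetical provider ID.
tag = deterministic_tag("eval-abc123", "example-provider", "leaderboard_ifeval", 0)
print(tag)
print(len(tag))
```

    A SHA-256 hex digest is 64 characters, so the prefixed tag is 72 characters long and stays well inside the 128-character limit.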

    Annotations

    Every pushed artifact carries a set of standard OCI annotations merged with any user-provided annotations. Default annotations:

    Annotation | Value
    org.opencontainers.image.created | ISO 8601 timestamp of the push
    io.github.eval-hub.job | Job ID
    io.github.eval-hub.benchmark | Benchmark ID
    io.github.eval-hub.provider | Provider ID (if present)

    User-provided annotations in coordinates.annotations take precedence over defaults. Use these annotations to attach deployment-relevant metadata (such as environment, model family, collection version, and compliance tags) to make the artifact queryable in your registry's search interface.
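
    The precedence rule amounts to a dictionary merge in which user-provided keys are applied last. The following sketch illustrates that behavior; merge_annotations is a hypothetical helper written for this example, not the SDK's own function.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def merge_annotations(job_id: str, benchmark_id: str,
                      provider_id: Optional[str] = None,
                      user_annotations: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build the default annotations, then let user-provided values override them."""
    defaults = {
        "org.opencontainers.image.created": datetime.now(timezone.utc).isoformat(),
        "io.github.eval-hub.job": job_id,
        "io.github.eval-hub.benchmark": benchmark_id,
    }
    # The provider annotation is only set when a provider ID is present.
    if provider_id is not None:
        defaults["io.github.eval-hub.provider"] = provider_id
    # Later keys win, so user annotations take precedence over defaults.
    return {**defaults, **(user_annotations or {})}


merged = merge_annotations(
    "eval-abc123",
    "leaderboard_ifeval",
    user_annotations={"environment": "staging", "model-family": "llama-3"},
)
for key, value in sorted(merged.items()):
    print(f"{key}={value}")
```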

    Authentication: Kubernetes sidecar versus local Docker config

    EvalHub supports different authentication workflows depending on whether you run your evaluation jobs inside a Kubernetes cluster or in a local development environment.

    Kubernetes mode

    In Kubernetes mode (EVALHUB_MODE=k8s), the EvalHub-managed pod runs two containers: the adapter and a sidecar. The sidecar acts as an authenticated proxy for registry operations:

    Adapter container
      → OCIArtifactPersister.persist()
      → routes push through sidecar proxy (localhost:8080)
      → sidecar performs bearer token challenge with actual registry
      → sidecar uses credentials from the K8s Secret named in k8s.connection
      → push completes; digest returned to adapter
      → artifact reference uses original registry host (not the proxy address)

    The credentials Secret must be of type kubernetes.io/dockerconfigjson:

    apiVersion: v1
    kind: Secret
    metadata:
      name: registry-credentials
      namespace: evalhub
    type: kubernetes.io/dockerconfigjson
    stringData:   # accepts the raw JSON below; values under data would need to be base64-encoded
      .dockerconfigjson: |
        {
          "auths": {
            "quay.io": {
              "auth": "<base64(username:password)>"
            }
          }
        }
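
    The auth value is the base64 encoding of username:password. Kubernetes also distinguishes two Secret fields: stringData accepts the JSON document as plain text, while a value under data must be base64-encoded a second time as a whole. The sketch below generates both forms; the helper names and the credentials are placeholders for illustration.

```python
import base64
import json


def docker_config_json(registry: str, username: str, password: str) -> str:
    """Build the .dockerconfigjson document for a single registry."""
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": auth}}})


def secret_data_value(config_json: str) -> str:
    """Base64-encode the whole document for use under the Secret's data field."""
    return base64.b64encode(config_json.encode("utf-8")).decode("ascii")


# Placeholder credentials; use a robot account or token in practice.
config = docker_config_json("quay.io", "example-user", "example-password")
print(config)                     # suitable for stringData
print(secret_data_value(config))  # suitable for data
```

    In practice, kubectl create secret docker-registry produces an equivalent Secret without hand-encoding anything.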

    The adapter code never handles registry credentials directly. Authentication is fully delegated to the sidecar, which reads fresh Kubernetes ServiceAccount tokens per request and handles token rotation automatically.

    Local development mode

    In local mode (EVALHUB_MODE=local, the default), the persister reads Docker credentials from ~/.docker/config.json (or the path in DOCKER_CONFIG). Log in to your registry before running:

    docker login quay.io
    EVALHUB_MODE=local python main.py

    No other configuration change is needed. The same adapter code runs in both modes.

    Retrieving artifact references

    The OCI artifact reference and digest are returned in the evaluation result:

    # Full result including OCI artifact reference
    evalhub eval results $JOB_ID --format json
    [
      {
        "id": "eval-abc123",
        "benchmark_id": "leaderboard_ifeval",
        "model_name": "meta-llama/Llama-3.2-3B-Instruct",
        "results": [
          { "metric_name": "inst_level_strict_acc", "metric_value": 0.712 }
        ],
        "overall_score": 71.2,
        "oci_artifact": {
          "digest": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3",
          "oci_ref": "quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3"
        },
        "mlflow_run_id": "3f7a2c1b4d6e8f0a"
      }
    ]

    The oci_ref is the complete, pullable reference. The digest is the tamper-evident fingerprint. EvalHub writes both values to the MLflow experiment record alongside the metrics. This setup allows you to run a standard MLflow query to find evaluation runs that produced an artifact for model X in collection version Y. Each query result includes a pullable artifact reference.
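
    When you process the JSON output in a script, a useful sanity check is that the pullable reference pins the same digest that is reported separately. The following sketch works on the sample output shown above; artifact_records is a hypothetical helper, and the field names are taken from that sample.

```python
import json
from typing import Dict, List

RESULTS_JSON = """
[
  {
    "id": "eval-abc123",
    "benchmark_id": "leaderboard_ifeval",
    "model_name": "meta-llama/Llama-3.2-3B-Instruct",
    "oci_artifact": {
      "digest": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3",
      "oci_ref": "quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a3"
    },
    "mlflow_run_id": "3f7a2c1b4d6e8f0a"
  }
]
"""


def artifact_records(results_json: str) -> List[Dict[str, str]]:
    """Collect one audit record per benchmark result that has an OCI artifact."""
    records = []
    for result in json.loads(results_json):
        artifact = result.get("oci_artifact")
        if artifact is None:
            continue  # no OCI export for this result
        # The reference must end with the digest reported alongside it.
        if not artifact["oci_ref"].endswith("@" + artifact["digest"]):
            raise ValueError(f"reference and digest disagree for {result['id']}")
        records.append({
            "benchmark_id": result["benchmark_id"],
            "model_name": result["model_name"],
            "oci_ref": artifact["oci_ref"],
            "mlflow_run_id": result["mlflow_run_id"],
        })
    return records


for record in artifact_records(RESULTS_JSON):
    print(record["benchmark_id"], record["oci_ref"])
```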

    Pull and verify the artifact using ORAS

    # Pull and inspect using oras
    oras pull quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d08... \
      --output ./retrieved-results
    
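    # Optional: recompute the manifest digest; the hash should equal the sha256 value in the reference
    oras manifest fetch quay.io/my-org/eval-results@sha256:9f86d08... | sha256sum

    The digest in the artifact reference is the SHA-256 of the artifact manifest, not of an individual result file, and the manifest in turn records the digest of every file layer. Because the reference pins that digest, oras checks the manifest and each layer during the pull: a successful pull means you have exactly the results that the evaluation run produced. If the content has been tampered with or corrupted, the pull fails with a digest mismatch.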
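
    In OCI terms, verification happens at two levels: the digest in the reference identifies the artifact manifest, and the manifest lists a digest for each file layer. The sketch below reproduces that check on a toy manifest built in memory. It assumes one layer per file, named by the org.opencontainers.image.title annotation, which is how the oras CLI records pushed files; the layout that EvalHub produces may differ, and verify_artifact is a hypothetical helper.

```python
import hashlib
import json
from typing import Dict


def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify_artifact(manifest_bytes: bytes, expected_digest: str,
                    files: Dict[str, bytes]) -> None:
    """Raise ValueError if the manifest or any listed file fails its digest check."""
    # Level 1: the manifest itself must hash to the pinned digest.
    if sha256_digest(manifest_bytes) != expected_digest:
        raise ValueError("manifest digest mismatch")
    # Level 2: every file must hash to the digest recorded in its layer descriptor.
    manifest = json.loads(manifest_bytes)
    for layer in manifest["layers"]:
        title = layer["annotations"]["org.opencontainers.image.title"]
        if sha256_digest(files[title]) != layer["digest"]:
            raise ValueError(f"layer digest mismatch: {title}")


# Toy artifact: one result file and a reduced manifest that describes it.
results = b'{"metric_name": "inst_level_strict_acc", "metric_value": 0.712}'
manifest_bytes = json.dumps({
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "layers": [{
        "mediaType": "application/json",
        "digest": sha256_digest(results),
        "size": len(results),
        "annotations": {"org.opencontainers.image.title": "results.json"},
    }],
}).encode("utf-8")
pinned = sha256_digest(manifest_bytes)

verify_artifact(manifest_bytes, pinned, {"results.json": results})
print("artifact verified")
```

    Changing a single byte of results.json breaks the layer check, and changing the layer digest to match would alter the manifest and break the pinned digest instead.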

    End-to-end: Evaluation run to auditable artifact

    # 1. Configure OCI export in eval.yaml
    #    (add exports.oci block with coordinates and k8s.connection)
    
    # 2. Run evaluation — OCI push happens automatically at job completion
    evalhub eval run --config eval.yaml --wait --timeout 3600
    
    # 3. Retrieve the artifact reference from results
    evalhub eval results $JOB_ID --format json | \
      jq -r '.[0].oci_artifact.oci_ref'
    # → quay.io/my-org/eval-results:evalhub-a3f7c1b2@sha256:9f86d081...
    
    # 4. Store the reference in your deployment record / SBOM / audit log
    ARTIFACT_REF=$(evalhub eval results $JOB_ID --format json | \
      jq -r '.[0].oci_artifact.oci_ref')
    
    # 5. At any future point, pull and verify
    oras pull "$ARTIFACT_REF" --output ./audit-evidence

    In a CI/CD pipeline, step 4 writes the artifact reference into the deployment manifest or signed software bill of materials (SBOM). A compliance audit then follows the reference from the deployed model version back to the immutable evaluation artifact. This requires no reconstruction, no log parsing, and no reliance on a queryable database that might have changed.

    Integrating OCI export into a pipeline gate

    To use EvalHub within an automated pipeline, you must first define the OCI coordinates in your evaluation configuration file. The following example shows an eval.yaml file configured to pass environment and pull request metadata as custom annotations:

    # eval.yaml with OCI export
    name: "llama-3.2-pr-gate"
    model:
      url: "http://vllm-service:8000/v1"
      name: "meta-llama/Llama-3.2-3B-Instruct"
    collection:
      id: "general-assistant-gate-v1"
    exports:
      oci:
        coordinates:
          oci_host: "quay.io"
          oci_repository: "my-org/eval-results"
          annotations:
            git-commit: "${GIT_COMMIT}"
            pr-number: "${PR_NUMBER}"
        k8s:
          connection: "registry-credentials"

    After defining your configuration, add a step to your continuous integration pipeline to automate the evaluation run. The following GitHub Actions workflow substitutes your active environment variables into the eval.yaml file, triggers the evaluation job, and captures the resulting immutable artifact reference:

    # .github/workflows/model-eval.yaml
    - name: Run evaluation gate with OCI export
      env:
        EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
        EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
      run: |
        # Substitute commit and PR into annotations
        sed -i "s/\${GIT_COMMIT}/${{ github.sha }}/g" eval.yaml
        sed -i "s/\${PR_NUMBER}/${{ github.event.number }}/g" eval.yaml
    
        evalhub eval run --config eval.yaml --wait --timeout 3600
    
    - name: Capture artifact reference
      if: success()
      env:
        EVALHUB_BASE_URL: ${{ secrets.EVALHUB_BASE_URL }}
        EVALHUB_TOKEN: ${{ secrets.EVALHUB_TOKEN }}
      run: |
        JOB_ID=$(evalhub eval status --status completed --since 1h --format json \
          | jq -r '.[0].id')
        ARTIFACT_REF=$(evalhub eval results "$JOB_ID" --format json \
          | jq -r '.[0].oci_artifact.oci_ref')
        echo "EVAL_ARTIFACT_REF=$ARTIFACT_REF" >> $GITHUB_ENV
        echo "Evaluation artifact: $ARTIFACT_REF"

    The git-commit annotation in the OCI artifact links the immutable evaluation evidence directly to the source commit. Anyone auditing that deployment can pull the artifact by reference to verify the evaluation results that preceded it.
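
    The ${GIT_COMMIT} and ${PR_NUMBER} placeholders use the same syntax as Python's string.Template, so the substitution step can also be done in Python. One possible advantage over sed is that a placeholder without a value raises an error rather than reaching the registry as a literal ${...} string. This is a sketch of an alternative, not the approach the workflow above uses; render_config and the sample values are illustrative.

```python
from string import Template

EVAL_YAML_TEMPLATE = """\
exports:
  oci:
    coordinates:
      oci_host: "quay.io"
      oci_repository: "my-org/eval-results"
      annotations:
        git-commit: "${GIT_COMMIT}"
        pr-number: "${PR_NUMBER}"
"""


def render_config(template_text: str, values: dict) -> str:
    """Fill ${NAME} placeholders; raises KeyError if a value is missing."""
    return Template(template_text).substitute(values)


rendered = render_config(
    EVAL_YAML_TEMPLATE,
    {"GIT_COMMIT": "0123abc", "PR_NUMBER": "42"},  # illustrative values
)
print(rendered)
```

    Note that Template.substitute also rejects a stray $ elsewhere in the file, so a literal dollar sign in the configuration would need to be written as $$.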

    Why map evaluation results to OCI artifacts?

    Review the following comparison to see how enabling OCI persistence shifts your evaluation workflows and audit capabilities:

    Concern | Without OCI persistence | With OCI persistence
    Evidence durability | MLflow database (mutable) | Content-addressed OCI artifact (immutable)
    Tamper detection | None | SHA256 digest mismatch on pull
    Evidence retrieval | Query MLflow (requires live server) | oras pull <reference> (requires only registry access)
    Audit trail | Log entries and dashboard screenshots | Pullable, verifiable artifact with structured annotations
    Compliance reporting | Manual reconstruction | Reference in deployment manifest → oras pull → structured result files

    The MLflow integration and OCI persistence are complementary, not alternatives. MLflow gives you queryability across runs to support trend analysis, regression detection, and experiment comparison. OCI gives you durability and verifiability for each result. Use both.

    Next steps to get started with EvalHub

    Ready to implement immutable artifact tracking for your machine learning workflows? Explore the following open source resources from the EvalHub community:

    • EvalHub website
    • EvalHub SDK (OCI persistence, OCIArtifactPersister, DefaultCallbacks)
    • eval-hub-contrib (reference adapters showing OCI persistence in context)
    • EvalHub server (jobs API, exports field)
    • ORAS (OCI artifact push/pull tool)
