Manage LLM evaluation workloads at scale with EvalHub and Kueue

EvalHub is a service for running large language model (LLM) evaluation benchmarks in Kubernetes environments. As organizations scale their AI/ML workloads, they face increasing challenges around resource management, fair sharing, and job prioritization. This is where Kueue comes in.

Kueue is a Kubernetes-native system for queueing and managing workloads. This guide explores why and how to use Kueue with EvalHub to build a production-ready evaluation platform.

Series note

This is part 8 in a series covering how to build a scalable, reproducible AI evaluation infrastructure using the EvalHub project and Red Hat AI. Catch up on the other parts in the series:

Part 1: How EvalHub manages two-layer Kubernetes control planes
Part 2: EvalHub: Because "looks good to me" isn't a benchmark
Part 3: Evaluation-driven development with EvalHub
Part 4: Understanding evaluation collections in EvalHub
Part 5: Bring your own evaluation framework to EvalHub
Part 6: Add automated AI evaluations to your CI/CD pipeline
Part 7: Store immutable AI evaluation records with EvalHub and OCI
Part 8: Manage LLM evaluation workloads at scale with EvalHub and Kueue
Part 9: Connect EvalHub to protected production model servers

Why EvalHub needs Kueue

To resolve these operational bottlenecks, you must implement a management layer that governs how jobs access compute resources. The following section details the challenges caused by resource contention and how a native queueing system addresses them.

The challenge: Resource contention in shared clusters

In a typical AI/ML platform deployment without a centralized controller, several issues frequently arise:

Unmanaged resource consumption: Multiple teams run evaluation jobs simultaneously, often exceeding available GPU and CPU capacity.
Lack of prioritization: Urgent evaluations (production model validation) compete with experimental evaluations (research experiments).
Cluster instability: Resource sprawl can lead to cluster instability or quota exhaustion.
Operational inefficiency: Jobs that fail due to insufficient resources waste valuable time and compute cycles, requiring manual intervention to retry or reschedule.

Without a formal scheduling system, resource allocation is chaotic and unpredictable, as illustrated in Figure 1.

Users submitting various AI workloads to a Kubernetes cluster without a scheduler, resulting in resource contention, job failures, and cluster instability. — Figure 1: Kubernetes job scheduling without Kueue.

The solution: Intelligent workload management

With Kueue, you move from an "uncontrolled" model to a queue-based system. Kueue is a job scheduler that manages the lifecycle of your workloads, making sure they only enter the cluster when sufficient resources are available to support them. The structured flow of this managed approach is shown in Figure 2.

A diagram illustrating a managed workflow where multiple AI workloads enter a centralized Kueue job scheduler before being processed in a Kubernetes cluster, ensuring orderly resource allocation and stable operations. — Figure 2: Kubernetes job scheduling with Kueue.

Key advantages of Kueue

Using Kueue for benchmark evaluations offers several operational benefits for managing evaluation workloads at scale.

Fair resource sharing across tenants

Kueue supports multitenancy with configured quotas:

# Team A gets 50% of resources
ClusterQueue: team-a-cq
  CPU: 32 cores
  Memory: 128Gi
  GPU: 4

# Team B gets 50% of resources
ClusterQueue: team-b-cq
  CPU: 32 cores
  Memory: 128Gi
  GPU: 4

Each team's evaluation jobs stay within their quota, preventing one team from monopolizing cluster resources.

Priority-based job scheduling

Critical production evaluations can preempt lower-priority research jobs:

Production model validation: High priority (1000); must complete quickly.
Routine evaluations: Medium priority (500); normal SLA.
Experimental benchmarks: Low priority (100); can wait or be preempted.
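
As a rough mental model of these tiers, pending workloads can be pictured as sorted by priority first and submission time second. The following Python sketch is illustrative only: the `PendingEval` type and the job names are hypothetical, and Kueue's actual ordering logic lives in its controller rather than in user code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingEval:
    """Hypothetical record for a queued evaluation job."""
    name: str
    priority: int      # Higher value means more urgent
    submitted_at: int  # Seconds since an arbitrary reference point


def admission_order(pending: list[PendingEval]) -> list[str]:
    """Order job names by priority (highest first), then by age (oldest first)."""
    ranked = sorted(pending, key=lambda job: (-job.priority, job.submitted_at))
    return [job.name for job in ranked]


queue = [
    PendingEval("experimental-benchmark", priority=100, submitted_at=0),
    PendingEval("routine-eval", priority=500, submitted_at=10),
    PendingEval("prod-validation", priority=1000, submitted_at=20),
    PendingEval("routine-eval-2", priority=500, submitted_at=5),
]
print(admission_order(queue))
```

The production validation job moves to the front even though it was submitted last, while the two routine jobs keep their relative submission order.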

Resource quota enforcement

Quotas prevent runaway jobs from consuming all cluster resources:

# Quota limits per ClusterQueue
resources:
  - name: cpu
    nominalQuota: 32
  - name: memory
    nominalQuota: 128Gi
  - name: nvidia.com/gpu
    nominalQuota: 4

Automatic queueing and admission

When your cluster reaches quota, Kueue prevents job failures by automatically queueing workloads until resources become available:

Without Kueue: The job fails with an Insufficient resources error, which requires a manual retry.
With Kueue: The job is queued automatically and admitted once resources become available.
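
The difference between failing and queueing can be sketched with a toy model of a single quota dimension, such as GPUs. This is a simplified illustration under stated assumptions, not Kueue's implementation: `ToyClusterQueue` and its methods are hypothetical names, and real admission also accounts for flavors, cohorts, and multiple resources.

```python
from dataclasses import dataclass, field


@dataclass
class ToyClusterQueue:
    """Toy model of one quota dimension (for example, GPUs)."""
    nominal_quota: int
    used: int = 0
    pending: list = field(default_factory=list)   # (name, request) tuples
    admitted: dict = field(default_factory=dict)  # name -> request

    def submit(self, name: str, request: int) -> str:
        """Admit the job if it fits in the remaining quota; otherwise queue it."""
        if self.used + request <= self.nominal_quota:
            self.used += request
            self.admitted[name] = request
            return "admitted"
        self.pending.append((name, request))
        return "queued"

    def finish(self, name: str) -> list:
        """Release a job's quota, then admit pending jobs in order while they fit."""
        self.used -= self.admitted.pop(name)
        newly_admitted, still_pending = [], []
        for job, request in self.pending:
            if self.used + request <= self.nominal_quota:
                self.used += request
                self.admitted[job] = request
                newly_admitted.append(job)
            else:
                still_pending.append((job, request))
        self.pending = still_pending
        return newly_admitted


cq = ToyClusterQueue(nominal_quota=4)
print(cq.submit("eval-a", 2))  # fits within quota
print(cq.submit("eval-b", 2))  # fills the quota
print(cq.submit("eval-c", 2))  # waits instead of failing
print(cq.finish("eval-a"))     # frees quota, so eval-c is admitted
```

The third job is never rejected; it waits in the pending list and is admitted as soon as an earlier job releases its share of the quota.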

Cohort-based resource borrowing

Teams can borrow unused quota from other teams within the same cohort.
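
A minimal sketch of the borrowing arithmetic, assuming each queue in the cohort is described only by its nominal quota and current usage for one resource. The function names and the dictionary layout are hypothetical, and the sketch ignores the `borrowingLimit` and `lendingLimit` settings that a real ClusterQueue can use to cap borrowing.

```python
def borrowable_quota(cohort: dict[str, dict[str, int]], borrower: str) -> int:
    """Sum the unused nominal quota of every other queue in the cohort."""
    return sum(
        max(queue["nominal"] - queue["used"], 0)
        for name, queue in cohort.items()
        if name != borrower
    )


def can_admit(cohort: dict[str, dict[str, int]], borrower: str, request: int) -> bool:
    """Check whether a request fits in the borrower's own free quota plus borrowed quota."""
    own = cohort[borrower]
    own_free = max(own["nominal"] - own["used"], 0)
    return request <= own_free + borrowable_quota(cohort, borrower)


# GPU quota for two teams in the same cohort (values are illustrative)
cohort = {
    "team-a-cq": {"nominal": 4, "used": 4},
    "team-b-cq": {"nominal": 4, "used": 1},
}
print(borrowable_quota(cohort, "team-a-cq"))
print(can_admit(cohort, "team-a-cq", 2))
```

Here Team A has exhausted its own 4 GPUs, but Team B is using only 1 of its 4, so Team A can borrow up to 3 more.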

Visibility into job queue status

Track why jobs are pending and their position in the queue:

kubectl get localqueue -n team-a
NAME          CLUSTERQUEUE   PENDING   ADMITTED
local-queue   team-a-cq      3         5

kubectl get workload -n team-a
NAME                  QUEUE         ADMITTED   AGE
eval-job-1-abc123    local-queue   True       2m
eval-job-2-def456    local-queue   False      30s  # Waiting in queue

Understanding the personas

Enabling Kueue for benchmark evaluations involves three key personas, each with distinct responsibilities: the cluster administrator, the namespace owner, and the machine learning (ML) engineer.

Persona	Role	Responsibilities	Scope
Cluster administrator	Manages the Kubernetes cluster and Kueue installation.	Install and configure the Kueue operator. Create `ClusterQueue` and `ResourceFlavor` objects. Define cluster-wide preemption policies. Set up multitenancy boundaries. Monitor cluster-wide resource utilization.	Cluster-wide
Namespace owner or team lead	Manages resources for a specific team or namespace.	Create `LocalQueue` objects in team namespaces. Map `LocalQueue` objects to appropriate `ClusterQueue` objects. Configure namespace labels for Kueue management. Monitor the team's quota usage.	Namespace-specific
EvalHub user or ML engineer	Submits evaluation jobs via the EvalHub API.	Specify the queue name when creating evaluation jobs. Understand job queueing and preemption behavior. Monitor job status through the EvalHub API or `kubectl`.	Individual jobs

Setup guide by persona

Follow these configuration steps tailored to your specific operational role within the cluster environment.

Cluster administrator: Installing and configuring Kueue

Cluster administrators handle the initial cluster-wide setup, including operator installation and global queue definitions.

Step 1: Install the Kueue operator

Install the Kueue operator, create a Kueue cluster instance, and configure ResourceFlavor objects. Refer to the Red Hat build of Kueue installation documentation.

Step 2: Create ClusterQueues for multitenancy

Create ClusterQueue objects for multitenancy. Refer to the Configuring ClusterQueues documentation to set up the cluster-scoped team-a-cq and team-b-cq resources.

Namespace owner: Setting up team resources

Namespace owners configure local namespace labels and connect team resources to the main cluster queues.

Step 1: Label the namespace

Configure your namespace with the following labels:

team=team-a: Matches the ClusterQueue namespaceSelector
kueue.openshift.io/managed=true: Enables Kueue management
evalhub.trustyai.opendatahub.io/tenant=true: EvalHub tenant marker

Step 2: Create LocalQueue

The LocalQueue connects your namespace to the ClusterQueue:

apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: eval-queue
  namespace: team-a-namespace
spec:
  clusterQueue: team-a-cq  # References the ClusterQueue created by admin

EvalHub user: Submitting jobs with Kueue

Submit and monitor your evaluation workloads directly through the application interface or command-line tools.

Job submission via API

Jobs submitted via the EvalHub API are assigned priority 0 by default:

curl --request POST \
  --url https://evalhub-team-a.example.com/api/v1/evaluations/jobs \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "name": "standard-eval",
  "model": {
    "url": "http://llm-service.team-a.svc.cluster.local:8080/v1",
    "name": "granite-3.1-8b"
  },
  "queue": {
    "kind": "kueue",
    "name": "eval-queue"
  },
  "benchmarks": [
    {
      "id": "mmlu",
      "provider_id": "lm_evaluation_harness",
      "parameters": {
        "num_fewshot": 5
      }
    }
  ]
}'

After you submit the job, the system queues it with priority 0 and admits it when quota becomes available.
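
If you script submissions, it can help to build the request body programmatically rather than hand-editing JSON. The following Python sketch mirrors the fields from the curl example; the `build_job_payload` helper is a hypothetical name, and the sketch stops at producing the JSON body rather than sending it.

```python
import json


def build_job_payload(name, model_url, model_name, queue_name, benchmarks):
    """Assemble a request body in the same shape as the curl example."""
    return {
        "name": name,
        "model": {"url": model_url, "name": model_name},
        # The queue block routes the job to a Kueue LocalQueue in the namespace
        "queue": {"kind": "kueue", "name": queue_name},
        "benchmarks": benchmarks,
    }


payload = build_job_payload(
    name="standard-eval",
    model_url="http://llm-service.team-a.svc.cluster.local:8080/v1",
    model_name="granite-3.1-8b",
    queue_name="eval-queue",
    benchmarks=[
        {
            "id": "mmlu",
            "provider_id": "lm_evaluation_harness",
            "parameters": {"num_fewshot": 5},
        }
    ],
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to `curl --data` or to any HTTP client, together with the same authorization header.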

Checking job queue status

You can check the job queue status through the EvalHub API with this command:

curl --request GET \
  --url https://evalhub-team-a.example.com/api/v1/evaluations/jobs/<resource-id> \
  --header 'Authorization: Bearer <token>'

The API returns a JSON response indicating the current high-level state:

{
  "resource": {
    "id": "abc123-def456-...",
    "created_at": "2026-04-13T10:30:00Z"
  },
  "status": {
    "state": "pending",
    "message": {
      "message": "Evaluation job created",
      "message_code": "evaluation_job_created"
    }
  }
}

The EvalHub API currently shows high-level states only:

pending: Job created but not yet admitted
running: Job admitted and executing
completed: Job finished

To view detailed status conditions, query the cluster directly using the command-line interface:

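# Resource ID returned by the EvalHub API when the job was created
RESOURCE_ID="<resource-id>"

# Find the Kubernetes Job (its name contains the resource ID)
JOB_NAME=$(kubectl get jobs -n team-a-namespace | grep "$RESOURCE_ID" | awk '{print $1}')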

# Check job status
kubectl get job "$JOB_NAME" -n team-a-namespace

# Check workload status (shows queue position, preemption, etc.)
WORKLOAD=$(kubectl get workloads -n team-a-namespace -o json | \
  jq -r ".items[] | select(.metadata.ownerReferences[].name == \"$JOB_NAME\") | .metadata.name")

kubectl get workload "$WORKLOAD" -n team-a-namespace -o yaml

Understanding preemption in evaluation jobs

Preemption is a critical concept when using Kueue. Here's what every persona needs to know about how the system manages resource contention.

What is preemption?

Preemption occurs when a high-priority job needs resources, but the cluster is at quota. Kueue performs the following sequence:

Suspends (stops) a lower-priority running job.
Terminates its pod(s).
Admits the higher-priority job.
Requeues the preempted job.
Resumes the preempted job when resources become available.

Default preemption behavior

When you create a ClusterQueue without specifying preemption settings, the system applies these defaults:

# Default behavior (no preemption section specified)
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: my-queue
spec:
  resourceGroups: [...]
  # preemption not specified

The resulting effective configuration is:

preemption:
  withinClusterQueue: Never           # No preemption within queue
  reclaimWithinCohort: Never          # Can't reclaim from cohort
  borrowWithinCohort:
    policy: Never                     # Can't preempt when borrowing

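By default, no preemption occurs, even if you assign different priorities to specific jobs. Priority still influences the order in which pending workloads are considered for admission, so jobs with equal priority (such as the default priority 0) queue in FIFO order.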

Enabling preemption

Use the following configuration to enable priority-based preemption:

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: my-queue
spec:
  preemption:
    withinClusterQueue: LowerPriority  # Enable preemption
  resourceGroups: [...]

With this setting, higher-priority jobs preempt lower-priority jobs within the same ClusterQueue.
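
To make the effect of this policy concrete, the following Python sketch picks which running jobs would have to give way for an incoming job. It is a deliberately simplified model: `pick_victims` is a hypothetical helper, the single integer "request" stands in for one resource such as GPUs, and Kueue's real candidate selection considers more factors than priority alone.

```python
def pick_victims(running, incoming_priority, needed, free):
    """Choose lower-priority jobs to preempt until the incoming job fits.

    running: list of (name, priority, request) tuples for admitted jobs.
    Returns the names to preempt, or an empty list if preemption cannot help.
    """
    victims = []
    # Only jobs with strictly lower priority are candidates; lowest priority goes first
    candidates = sorted(
        (job for job in running if job[1] < incoming_priority),
        key=lambda job: job[1],
    )
    for name, _priority, request in candidates:
        if free >= needed:
            break
        victims.append(name)
        free += request
    return victims if free >= needed else []


running = [("research-bench", 100, 2), ("routine-eval", 500, 2)]
print(pick_victims(running, incoming_priority=1000, needed=2, free=0))
print(pick_victims(running, incoming_priority=100, needed=2, free=0))
```

A priority-1000 job that needs 2 units displaces only the priority-100 research job, while an incoming priority-100 job displaces nothing because no running job has a lower priority.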

Where is preemption status reported?

Understanding where to look for preemption data is essential for effective debugging.

Kubernetes Workload resource

The Workload resource contains the most detailed preemption information:

kubectl get workload <workload-name> -n <namespace> -o yaml

Check the status.conditions field for the following transitions:

During preemption: The Admitted condition transitions to False, while the Evicted, Preempted, and Requeued conditions become True.
After resume: The Admitted and Requeued conditions show as True, while Evicted and Preempted return to False.

Note that the Requeued condition remains True even after resume, preserving the history that the job was preempted.
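
The condition transitions described above can be folded into a small classifier. This Python sketch assumes you have already loaded the `status.conditions` list from the Workload (for example, from `kubectl ... -o json`); the function name and the returned labels are hypothetical and are not states reported by Kueue or EvalHub.

```python
def preemption_state(conditions: list[dict]) -> str:
    """Summarize Workload conditions using the transitions described in this section."""
    # Map each condition type to a boolean, treating missing conditions as False
    status = {c["type"]: c["status"] == "True" for c in conditions}
    admitted = status.get("Admitted", False)
    if status.get("Preempted", False) and not admitted:
        return "preempted"
    if admitted and status.get("Requeued", False):
        return "resumed-after-preemption"
    if admitted:
        return "admitted"
    return "pending"


during_preemption = [
    {"type": "Admitted", "status": "False"},
    {"type": "Evicted", "status": "True"},
    {"type": "Preempted", "status": "True"},
    {"type": "Requeued", "status": "True"},
]
after_resume = [
    {"type": "Admitted", "status": "True"},
    {"type": "Evicted", "status": "False"},
    {"type": "Preempted", "status": "False"},
    {"type": "Requeued", "status": "True"},
]
print(preemption_state(during_preemption))
print(preemption_state(after_resume))
```

Because the Requeued condition stays True after resume, the second case can be told apart from a job that was admitted once and never interrupted.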

Kubernetes Job resource (basic)

The Job resource displays basic suspension status:

kubectl get job <job-name> -n <namespace> -o yaml

status:
  conditions:
  # When preempted:
  - type: Suspended
    status: "True"
    reason: JobSuspended
    message: "Job suspended"
    
  # After resume:
  - type: Suspended
    status: "False"
    reason: JobResumed
    message: "Job resumed"

The Job resource does not indicate why the suspension occurred (for example, preemption versus manual intervention), nor does it provide the preemption UID.

Kubernetes Events

Events provide a historical timeline of cluster actions:

kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <job-name>

This sequence illustrates a typical event log for a preempted workload:

7m30s  Normal   QuotaReserved    workload   Quota reserved in ClusterQueue
7m30s  Normal   Admitted         workload   Admitted by ClusterQueue
6m55s  Normal   EvictedDueToPreempted  workload  Preempted to accommodate workload (UID: ...)
6m55s  Normal   Preempted        workload   Preempted to accommodate workload (UID: ...)
6m55s  Normal   Suspended        job        Job suspended
6m55s  Normal   Stopped          job        Preempted to accommodate workload (UID: ...)
5m50s  Normal   Resumed          job        Job resumed
5m49s  Normal   QuotaReserved    workload   Quota reserved in ClusterQueue (after waiting 65s)
5m49s  Normal   Admitted         workload   Admitted by ClusterQueue

EvalHub API response

The EvalHub API does not currently expose Kueue-specific states like preemption or requeueing.

{
  "status": {
    "state": "pending",  // High-level only: pending, running, completed
    "message": {
      "message": "Evaluation job created"
    }
  }
}

To track preemption for EvalHub jobs, follow these steps:

Retrieve the resource.id from the API response.
Identify the Kubernetes Job (the name contains the resource ID).
Locate the associated Workload resource.
Check the Workload status.conditions for detailed status.

You can automate this check using the following script:

RESOURCE_ID="abc123-def456-..."

# Find Job
JOB_NAME=$(kubectl get jobs -n team-a-namespace | grep "$RESOURCE_ID" | awk '{print $1}')

# Find Workload
WORKLOAD=$(kubectl get workloads -n team-a-namespace -o json | \
  jq -r ".items[] | select(.metadata.ownerReferences[].name == \"$JOB_NAME\") | .metadata.name")

# Check for preemption
kubectl get workload "$WORKLOAD" -n team-a-namespace -o jsonpath='{.status.conditions}' | \
  jq '.[] | select(.type == "Preempted" or .type == "Evicted" or .type == "Requeued")'
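
If you prefer to do the lookup in a program rather than a shell pipeline, the same two steps can be expressed over parsed JSON. The sketch below assumes its inputs are the parsed output of `kubectl get jobs -o json` and `kubectl get workloads -o json`; the helper names and the sample object names are hypothetical.

```python
from typing import Optional


def find_job_name(jobs: dict, resource_id: str) -> Optional[str]:
    """Return the first Job whose name contains the EvalHub resource ID."""
    for item in jobs.get("items", []):
        name = item["metadata"]["name"]
        if resource_id in name:
            return name
    return None


def find_workload_name(workloads: dict, job_name: str) -> Optional[str]:
    """Return the Workload that lists the Job among its ownerReferences."""
    for item in workloads.get("items", []):
        owners = item["metadata"].get("ownerReferences", [])
        if any(owner.get("name") == job_name for owner in owners):
            return item["metadata"]["name"]
    return None


# Sample data shaped like kubectl list output (names are illustrative)
jobs = {"items": [{"metadata": {"name": "eval-job-abc123"}}]}
workloads = {
    "items": [
        {"metadata": {"name": "job-other-11111"}},
        {
            "metadata": {
                "name": "job-eval-job-abc123-9f8e7",
                "ownerReferences": [{"kind": "Job", "name": "eval-job-abc123"}],
            }
        },
    ]
}

job_name = find_job_name(jobs, "abc123")
print(job_name, find_workload_name(workloads, job_name))
```

Unlike the jq filter, this version tolerates Workloads that have no `ownerReferences` and returns a single name even when a Workload has several owners.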

Impact on evaluation results

When a job is preempted and resumed, it restarts from the beginning.

The state transition diagram highlights how preemption alters execution flow (Figure 3).

State transition diagram showing an evaluation job moving from running to preempted, then back to pending and restarted, illustrating that resumed jobs must restart from the beginning. — Figure 3: An indicative sequence of states when an evaluation job is preempted. ‘Resumed’ is effectively the job being restarted.

Preemption introduces several critical operational implications:

No progress is saved: The job doesn't checkpoint its state.
Increased total runtime: Job age includes the suspension period.
Unpredictable completion times: Jobs can be preempted multiple times.

To avoid these issues, create a dedicated ClusterQueue for evaluation jobs and set withinClusterQueue: Never. Because evaluation workloads cannot checkpoint their progress, this configuration helps make sure your jobs complete without interruption.

Job lifecycle with Kueue

Understanding the complete job lifecycle helps with monitoring and troubleshooting. Use these flows to identify your evaluation job's current stage.

Normal flow (no preemption)

In a standard execution, a job is submitted, reserved in a queue, and admitted to the cluster where it runs to completion without interruption (Figure 4).

Lifecycle diagram showing a job submitted to a queue, admitted to a Kubernetes cluster, and successfully running to completion without interruption. — Figure 4: Lifecycle of an evaluation job with preemption disabled.

Preemption flow

When preemption is enabled, a job might be suspended if a higher-priority workload requires resources. Understanding this flow is essential for interpreting job status changes during peak cluster utilization (Figure 5).

Lifecycle diagram showing a job submitted to a queue and admitted to a cluster, with a path for preemption that suspends the job and returns it to the queue before it eventually resumes and runs to completion. — Figure 5: Lifecycle of an evaluation job with preemption enabled.

Monitoring and troubleshooting

Monitoring your EvalHub jobs ensures you can identify and resolve resource contention issues quickly. Here are common scenarios you might encounter while managing your evaluation workloads.

Scenario 1: Job stuck in pending

If a job remains in the pending state, the Kubernetes job status will appear as follows:

kubectl get job my-eval-job -n team-a-namespace
# NAME           STATUS     COMPLETIONS   AGE
# my-eval-job    Suspended  0/1           5m

If a job is stuck in Suspended status, use these commands to diagnose the cause:

# Check workload status
WORKLOAD=$(kubectl get workloads -n team-a-namespace -o json | \
  jq -r ".items[] | select(.metadata.ownerReferences[].name == \"my-eval-job\") | .metadata.name")

kubectl get workload "$WORKLOAD" -n team-a-namespace -o jsonpath='{.status.conditions}' | \
  jq '.[] | select(.type == "QuotaReserved" or .type == "Admitted")'

Common causes include:

Insufficient quota: You might need to wait for resources to free up or request a quota increase.

{
  "type": "QuotaReserved",
  "status": "False",
  "reason": "Pending",
  "message": "couldn't assign flavors to pod set main: insufficient unused quota for cpu in flavor default-flavor, 8 more needed"
}

Invalid queue name: Verify that the LocalQueue exists and the name is correct in your job specification.

kubectl get workload "$WORKLOAD" -n team-a-namespace
# NAME                    QUEUE              RESERVED IN   ADMITTED
# job-my-eval-job-abc12   non-existent-queue              False

Waiting for higher-priority jobs: Increase job priority or wait for the queue to clear.

kubectl get workloads -n team-a-namespace --sort-by=.spec.priority
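
When triaging many pending jobs, it can be convenient to pull the shortfall out of the insufficient quota message shown above. The following Python sketch parses that one message shape. Treat it as an assumption-laden convenience: the `parse_quota_shortfall` helper is hypothetical, and the message wording is not a stable API, so the pattern may need adjusting for your Kueue version.

```python
import re
from typing import Optional

# Matches the tail of the QuotaReserved message shown in this scenario
SHORTFALL = re.compile(
    r"insufficient unused quota for (?P<resource>\S+) "
    r"in flavor (?P<flavor>[^,]+), (?P<amount>\S+) more needed"
)


def parse_quota_shortfall(message: str) -> Optional[dict]:
    """Extract the resource, flavor, and missing amount, or None if the text differs."""
    match = SHORTFALL.search(message)
    return match.groupdict() if match else None


message = (
    "couldn't assign flavors to pod set main: insufficient unused quota "
    "for cpu in flavor default-flavor, 8 more needed"
)
print(parse_quota_shortfall(message))
```

The amount is kept as a string because quantities for resources such as memory can carry unit suffixes.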

Scenario 2: Job was preempted

If a job remains suspended after it has already begun execution, it may have been preempted. You can verify this by checking the job status:

kubectl get job my-eval-job -n team-a-namespace
# NAME           STATUS     COMPLETIONS   AGE
# my-eval-job    Suspended  0/1           10m

If the job shows as Suspended after previously running, inspect the Workload conditions for preemption, reusing the WORKLOAD variable set in Scenario 1:

# Check for preemption
kubectl get workload "$WORKLOAD" -n team-a-namespace -o jsonpath='{.status.conditions}' | \
  jq '.[] | select(.type == "Preempted" or .type == "Evicted")'

The following output confirms that the job was preempted to accommodate a higher-priority workload:

{
  "type": "Preempted",
  "status": "True",
  "reason": "InClusterQueue",
  "message": "Preempted to accommodate a workload (UID: 641031a6-be4d-43f5-b51f-24a4d05dffe6, JobUID: 1f1c675a-711f-4a13-a3bd-da3d50e6f893)"
}

To resolve this, wait for the preempting job to complete, which allows your job to auto-resume. Alternatively, increase your job’s priority to avoid future preemption.

Scenario 3: Job running but progress unknown

If a job has been running for an extended period, verify if the pod was restarted due to preemption:

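# Find the pod created for the Job (the Job controller adds the job-name label)
POD=$(kubectl get pods -n team-a-namespace -l job-name=my-eval-job -o jsonpath='{.items[0].metadata.name}')

kubectl get pod "$POD" -n team-a-namespace -o jsonpath='{.status.containerStatuses[0].restartCount}'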
# 0  (no restarts)

# Check pod age vs job age
kubectl get pod "$POD" -n team-a-namespace -o jsonpath='{.metadata.creationTimestamp}'
kubectl get job my-eval-job -n team-a-namespace -o jsonpath='{.metadata.creationTimestamp}'

# If pod is much newer than job, it was likely preempted and recreated
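
The comparison of the two creation timestamps can be automated with a few lines of Python. This is a minimal sketch: the helper names are hypothetical and the timestamps are illustrative values in the format Kubernetes returns. A large gap is only a hint, because time spent waiting in the queue before the first admission produces the same pattern.

```python
from datetime import datetime, timedelta


def parse_k8s_time(timestamp: str) -> datetime:
    """Parse a UTC timestamp such as 2026-04-13T10:30:00Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def pod_lag(job_created: str, pod_created: str) -> timedelta:
    """Return how much later the pod was created than its Job."""
    return parse_k8s_time(pod_created) - parse_k8s_time(job_created)


lag = pod_lag("2026-04-13T10:30:00Z", "2026-04-13T10:42:30Z")
print(lag)  # 0:12:30
```

Pair a large lag with the Workload's Preempted and Requeued conditions before concluding that the job was preempted and restarted.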

Useful monitoring commands

You can use these commands to monitor workload states, check queue positions, and track resource availability across your cluster:

# View all queued workloads
kubectl get workloads -n team-a-namespace

# View quota usage
kubectl get clusterqueue team-a-cq -o yaml | grep -A 20 "flavorsUsage:"

# View pending workloads count
kubectl get localqueue eval-queue -n team-a-namespace

# Get workload events
kubectl get events -n team-a-namespace --field-selector involvedObject.kind=Workload

# View all preempted workloads
kubectl get workloads -n team-a-namespace -o json | \
  jq -r '.items[] | select(.status.conditions[]? | select(.type == "Preempted" and .status == "True")) | .metadata.name'

Conclusion

Using Kueue with EvalHub transforms ad-hoc evaluation job execution into a managed, fair, and efficient system. By understanding the roles of each persona and following the practices outlined in this guide, you can establish a resilient foundation for the entire AI/ML lifecycle and an evaluation platform that evolves alongside your needs.

Adopting this integrated approach allows you to:

Prevent resource contention through quota enforcement
Enable fair sharing across multiple teams
Prioritize critical work with intelligent preemption
Increase cluster utilization through cohort-based borrowing
Improve visibility into job queueing and resource usage

By aligning your infrastructure with these standard patterns, you ensure that your evaluation platform is ready to support the next generation of LLM development.

Last updated: June 23, 2026