If you're running AI and machine learning workloads on Kubernetes, you're likely facing a familiar problem: resource management.
How do you stop critical jobs from being starved of GPUs by less urgent workloads? How do you ensure your production inference cluster always has the resources it needs? How do you fairly share expensive hardware between multiple teams running essential jobs?
This is a major challenge for any enterprise AI platform, and it's one that Red Hat OpenShift AI 3 is built to solve. The solution is a new integration for efficient resource control: KubeRay and Kueue on OpenShift AI.
Kueue, the Kubernetes-native job queueing system, integrates directly with KubeRay. The new integration uses a built-in SDK on OpenShift AI to let you manage the entire lifecycle of RayCluster resources. This capability provides quota-aware, priority-driven scheduling for your complex Ray workloads.
Let's walk through how it works.
What you'll learn
This guide shows you how OpenShift AI 3 resolves resource contention for Ray workloads using three workflows:
- Long-running RayCluster: Your personal Ray laboratory
- Quick-iteration jobs: Fast feedback loops on your long-running RayCluster
- Ephemeral Ray clusters: Automated lifecycle management for self-cleaning jobs
These features are accessible to data scientists through a simple SDK integrated into Red Hat OpenShift AI 3 (Technology Preview).
Background: What are we working with?
First, let's briefly explain the key components:
- Ray: An open source framework for scaling Python workloads across multiple machines. Think of it as a way to turn your laptop code into distributed computing code with minimal changes.
- KubeRay: The Kubernetes operator that makes it possible to run and manage Ray on Kubernetes.
- Kueue: A Kubernetes-native job queuing system. It acts as a traffic controller for your cluster resources, ensuring fair sharing, priorities, and quotas are enforced.
- CodeFlare SDK: A Python SDK that lets data scientists interact with Ray on OpenShift AI without writing any Kubernetes YAML or learning kubectl commands.
KubeRay custom resource definitions (CRDs):
- RayCluster: A Kubernetes resource that represents a Ray cluster (one head node plus multiple worker nodes). Managing these on Kubernetes traditionally requires writing complex YAML manifests.
- RayJob: A higher-level resource that manages the lifecycle of a Ray workload. It can spin up a temporary RayCluster to run your code and automatically remove the infrastructure when it finishes, or submit against an existing RayCluster.
Kueue CRDs:
- LocalQueue: A namespaced entry point where teams submit their jobs. It aggregates a specific team's workloads and routes them to a ClusterQueue for admission.
- ClusterQueue: A global pool of resources that enforces usage limits and quotas. It governs admission and fair sharing for workloads funnelled from multiple LocalQueue resources.
- Cohort: A grouping of ClusterQueue resources that enables sharing of unused quota. Busier queues can borrow excess capacity from idle queues.
- ResourceFlavor: Defines hardware variations available to the cluster.
- WorkloadPriorityClass: The importance of a job. This influences scheduling order and allows critical workloads to evict less important jobs.
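As a data scientist, you rarely create these objects yourself; you mostly need to know which LocalQueue to submit to. If you want to check what's available in your namespace, here is a minimal sketch using the Kubernetes Python client (not part of the CodeFlare SDK); the namespace name is a placeholder.
# Minimal sketch: list the LocalQueues you can submit to in your namespace.
# Assumes a valid kubeconfig (for example, after `oc login`).
from kubernetes import client, config
config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()
local_queues = api.list_namespaced_custom_object(
    group="kueue.x-k8s.io",
    version="v1beta1",
    namespace="your-namespace",  # placeholder namespace
    plural="localqueues",
)
for lq in local_queues["items"]:
    name = lq["metadata"]["name"]
    cluster_queue = lq["spec"]["clusterQueue"]
    print(f"{name} -> routes to ClusterQueue {cluster_queue}")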
The following sections describe the workflows available in Ray and Kueue, as shown in Figure 1.

Long-running RayCluster: Your personal Ray workspace
Use case: A data scientist needs a persistent environment for interactive development. This environment supports prototyping models, running exploratory analysis with Ray Data, or connecting to live systems for tasks like Feast feature engineering.
The problem
You can't prototype a 4-GPU model on your laptop. You need to run interactive, exploratory code (like a Feast feature engineering task or a Ray Data exploration) on a powerful, multinode cluster. However, you shouldn't have to start a new cluster for every notebook cell you execute.
The solution: A long-lived workspace cluster
The CodeFlare SDK allows you to define and manage this persistent cluster with a few lines of Python:
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication
# Authenticate to RHOAI
auth = TokenAuthentication(
token = "XXXXX",
server = "XXXXX",
skip_tls=False
)
auth.login()
# Define your workspace cluster
workspace_cluster = Cluster(
ClusterConfiguration(
name="my-workspace-cluster",
num_workers=2,
worker_cpu_requests=2,
worker_memory_requests=8,
worker_extended_resource_requests={'nvidia.com/gpu': 1}, # 2 GPUs total
local_queue="ds-workspace-queue", # The team's development queue
labels={"kueue.x-k8s.io/priority-class": "rd-priority"} # The R&D priority
)
)
# Start your cluster
workspace_cluster.apply()
workspace_cluster.wait_ready()
You now have a personal Ray cluster that runs as long as you need. You can connect to it from your notebook by explicitly pointing to its address. This is ideal for any task where you use ray.init() to connect your local notebook directly to the cluster for interactive exploration:
# Connect to Ray for interactive work
import ray
ray.init(workspace_cluster.cluster_uri())
# Now you can use Ray interactively
@ray.remote
def train_model(data):
# Your ML code here
pass
results = ray.get([train_model.remote(batch) for batch in data])
When you're done for the day (or week), clean up:
workspace_cluster.down()  # Resources immediately returned to the queue
The quick iteration loop: Running jobs on your workspace
Use case: You moved your code from a notebook into a .py script and want to test it as a self-contained job (for example, on a data sample) before running it at scale. Note: This requires the Red Hat build of Kueue 1.4.
The problem
Your code has moved from a notebook into a test_model.py script. You are now in the "code-run-check" loop, and startup latency is your enemy. Starting an ephemeral cluster for every test is too slow—you might wait up to five minutes just for the cluster to pull its container images, only to discover a typo. You need an instant, subsecond submission time to iterate quickly.
The solution: Submit jobs to your existing workspace cluster
The CodeFlare SDK provides the RayJob object to submit a job to your existing cluster for quick iteration:
from codeflare_sdk import RayJob
# Define a job that runs on your EXISTING workspace cluster
quick_job = RayJob(
job_name="quick-dev-test",
entrypoint="python test_model.py",
cluster_name="my-workspace-cluster", # Points to your running cluster
namespace="your-namespace",
runtime_env={
"working_dir": ".", # Uses your current directory
"pip": "requirements.txt" # Installs dependencies
}
)
# Submit the job (instant - no cluster startup wait)
quick_job.submit()
# Check status
quick_job.status()
# Get logs
quick_job.logs()
This workflow is the bridge between interactive prototyping and an automated job. It lets you test your full script (test_model.py), including its runtime_env and dependencies, without waiting for a new cluster to spin up.
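If you'd rather drive the same loop with upstream Ray's job submission API (for example, to block until the run finishes and then dump the logs), here is a minimal sketch using Ray's JobSubmissionClient. It assumes the workspace cluster's Ray dashboard route is reachable from where you run it; the URL below is a placeholder (the CodeFlare SDK can give you the real one, for example via the cluster object's cluster_dashboard_uri() helper).
# Minimal sketch: poll a job to completion with upstream Ray's JobSubmissionClient.
# The dashboard URL is a placeholder; substitute your workspace cluster's route.
import time
from ray.job_submission import JobSubmissionClient, JobStatus
client = JobSubmissionClient("https://ray-dashboard-my-workspace-cluster.example.com")
submission_id = client.submit_job(
    entrypoint="python test_model.py",
    runtime_env={"working_dir": ".", "pip": "requirements.txt"},
)
# Wait for a terminal state, then print the logs.
while client.get_job_status(submission_id) not in (
    JobStatus.SUCCEEDED,
    JobStatus.FAILED,
    JobStatus.STOPPED,
):
    time.sleep(5)
print(client.get_job_logs(submission_id))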
The ephemeral cluster: Self-service, automated jobs
Use case: Your code is tested and ready for an automated, "fire-and-forget" run. This could be a 12-hour Ray Train job, a nightly Ray Data batch inference job, or a large-scale experiment. The cluster can be created for the job and automatically deleted on completion.
The problem
You have two new challenges:
- Resource management: You need 8 GPUs, but so do other teams. How do you submit your job without starving their critical workloads? And how does the platform ensure your job gets the resources it needs without manual intervention?
- Resource waste: You don't want to manually manage infrastructure. You need a cluster that is created for your job and automatically destroyed the second it finishes, ensuring zero idle time and zero wasted cost.
The solution: Job-managed, ephemeral clusters
This is the most powerful workflow. Instead of creating a cluster and submitting jobs to it, you define the resources your job needs, and the platform handles everything else.
from codeflare_sdk import RayJob, ManagedClusterConfig
# Define a job with embedded cluster requirements
production_job = RayJob(
job_name="final-training-run",
local_queue="batch-jobs-queue", # Production team's queue
cluster_config=ManagedClusterConfig( # Define cluster inline
num_workers=4,
worker_cpu_requests=8,
worker_cpu_limits=8,
worker_memory_requests=16,
worker_memory_limits=16,
worker_accelerators={'nvidia.com/gpu': 2}, # 8 GPUs total
),
entrypoint="python train_model_full.py",
runtime_env={
"working_dir": "https://github.com/your-org/your-repo/archive/main.zip",
"pip": "./path/to/requirements.txt"
},
labels={"kueue.x-k8s.io/priority-class": "prod-priority"}
)
# Submit and forget
production_job.submit()
The data scientist never touches infrastructure. They just define what they need and let the platform handle the lifecycle.
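In an automated pipeline, you may still want to watch the run without keeping a Python session attached to the SDK. Because this workflow is backed by a KubeRay RayJob custom resource, you can poll that resource directly. The following is a minimal sketch using the Kubernetes Python client; it assumes KubeRay's ray.io/v1 API, assumes the resource name matches the job_name above, and uses a placeholder namespace.
# Minimal sketch: poll the RayJob custom resource that backs the ephemeral run.
import time
from kubernetes import client, config
config.load_kube_config()
api = client.CustomObjectsApi()
while True:
    rayjob = api.get_namespaced_custom_object(
        group="ray.io",
        version="v1",
        namespace="your-namespace",  # placeholder namespace
        plural="rayjobs",
        name="final-training-run",   # assumes the CR name matches job_name
    )
    status = rayjob.get("status", {})
    job_status = status.get("jobStatus", "")            # e.g., RUNNING, SUCCEEDED, FAILED
    deployment = status.get("jobDeploymentStatus", "")  # e.g., Initializing, Running, Complete
    print(f"jobStatus={job_status} jobDeploymentStatus={deployment}")
    if job_status in ("SUCCEEDED", "FAILED") or deployment in ("Complete", "Failed"):
        break
    time.sleep(30)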
Why this matters: Real-world scenarios
Let's look at the same common scenario two ways: operations without Kueue, and how Kueue and Ray work together to address it.
Without Kueue: The manual admin gatekeeper
This is a standard Kubernetes or OpenShift AI cluster with the KubeRay operator, but no Kueue. To prevent users from creating clusters that conflict with each other, the company has made the platform admin the manual gatekeeper for all resources.
The problem: High cost and difficulty
This manual approach is a drain on the organization.
- High cost (idle waste): If a data scientist finishes at 3 PM but forgets to file a deletion ticket, a 2-GPU cluster—which can cost $3-10 per hour—sits idle for five hours. This idle waste, scaled across a whole team, costs the company.
- Resource hoarding: Data scientists know the request process is slow. When they finally get a 6-GPU cluster, they might keep it "just in case" they need it, even if they aren't actively using it. Utilization plummets.
- Zero efficiency (no borrowing): The 12 GPUs reserved for the production job sit 100% idle for 16 hours a day. There is no mechanism to safely lend this massive, expensive block of capacity to the R&D team. The cluster's maximum possible utilization is permanently capped.
- Admin as bottleneck: The administrator is a human queue. Every request is blocked on one person's availability.
- Productivity loss: Data scientists spend their time waiting in queues and filing tickets instead of doing research.
With Kueue: The automated platform
The platform admin's job is no longer to act as a manual gatekeeper, but to design the platform's resource management policies. They configure Kueue once to create a self-service, automated, and efficient system.
The admin's setup
The admin defines the resource management policies (Figure 2):
- cq-production: 12-GPU guarantee, prod-priority (1000).
- cq-development: 8-GPU guarantee, dev-priority (100).
- all-gpus cohort: Both queues are added to a cohort, allowing them to borrow from each other. The reclaimWithinCohort policy is set, allowing production to preempt R&D. (A sketch of these objects follows the list.)
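For illustration, here is a minimal sketch of what that one-time setup might look like if created programmatically with the Kubernetes Python client (admins can just as easily apply equivalent YAML). The ResourceFlavor name and the exact preemption settings are assumptions, not the precise configuration described above.
# Minimal sketch of the admin-side Kueue objects (names and flavor are illustrative).
from kubernetes import client, config
config.load_kube_config()
api = client.CustomObjectsApi()
KUEUE = {"group": "kueue.x-k8s.io", "version": "v1beta1"}
def priority_class(name, value):
    return {"apiVersion": "kueue.x-k8s.io/v1beta1", "kind": "WorkloadPriorityClass",
            "metadata": {"name": name}, "value": value}
def cluster_queue(name, gpu_quota):
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1", "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "cohort": "all-gpus",                          # both queues share this cohort
            "namespaceSelector": {},                       # accept workloads from any namespace
            "preemption": {"reclaimWithinCohort": "Any"},  # reclaim borrowed capacity
            "resourceGroups": [{
                "coveredResources": ["nvidia.com/gpu"],
                "flavors": [{
                    "name": "default-flavor",              # assumed ResourceFlavor (must exist)
                    "resources": [{"name": "nvidia.com/gpu", "nominalQuota": gpu_quota}],
                }],
            }],
        },
    }
for body in (priority_class("prod-priority", 1000), priority_class("dev-priority", 100)):
    api.create_cluster_custom_object(plural="workloadpriorityclasses", body=body, **KUEUE)
for body in (cluster_queue("cq-production", 12), cluster_queue("cq-development", 8)):
    api.create_cluster_custom_object(plural="clusterqueues", body=body, **KUEUE)
Each team's LocalQueue (for example, lq-prod-batch or lq-rd-workspace) then points at the matching ClusterQueue through its spec.clusterQueue field.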

How the scenario plays out
Here's what happens to the exact same R&D and production requests from the "before" scenario.
- 1 PM (R&D team self-service):
  - A data scientist needs their 2-GPU workspace RayCluster. They don't file a ticket. They just run their workspace_cluster.apply() script, which submits their RayCluster to the lq-rd-workspace queue.
  - Kueue's action: Kueue intercepts the request. It checks the 8-GPU quota for cq-development, sees 2 GPUs are available, and instantly admits the cluster.
  - Two other data scientists do the same. Kueue admits their clusters, using the remaining 6 guaranteed R&D GPUs and borrowing 4 GPUs from the idle production quota.
  - Result: All R&D clusters are running. The admin wasn't involved. All resources are 100% utilized.
- 8 PM (Production job automated):
  - A CI/CD pipeline (not an admin) automatically submits the high-priority RayJob to the lq-prod-batch queue.
  - Kueue's action:
    - Intercept: Kueue sees the high-priority (value: 1000) production job arrive.
    - Check quota: It checks cq-production and sees its 12 guaranteed GPUs are currently being borrowed by the R&D queue.
    - Trigger preemption: Kueue's reclaimWithinCohort rule is activated. It identifies the lower-priority (value: 100) R&D clusters that are using the borrowed resources.
    - Clean eviction: Kueue cleanly preempts the R&D clusters. It doesn't just kill random pods; it suspends the RayCluster, and the system removes the associated pods. This is a safe, orderly shutdown.
    - Admit: The 12 GPUs are now free. Kueue instantly admits the high-priority production job.
  - Result: The production job runs on time, every time, with zero manual intervention.
- The next morning (automatic cleanup and recovery):
  - The nightly production job (workflow 3) finishes. Because it's an ephemeral cluster, the system automatically deletes it as part of the RayJob lifecycle and returns its 12 GPUs to the queue.
  - Now that 12 GPUs are available, Kueue un-suspends the preempted RayCluster resources.
  - The data scientist arrives and doesn't need to rerun the workspace_cluster.apply() script, as the clusters have already returned to their previous state.
  - Result: The admin is still not involved. The R&D team is back up and running in minutes.
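If you're curious whether your own workspace cluster has been preempted this way, the suspend flag on the RayCluster resource tells you. Here is a minimal sketch using the Kubernetes Python client; it assumes KubeRay's ray.io/v1 API and uses a placeholder namespace.
# Minimal sketch: check whether Kueue has suspended (preempted) your workspace cluster.
from kubernetes import client, config
config.load_kube_config()
api = client.CustomObjectsApi()
rc = api.get_namespaced_custom_object(
    group="ray.io",
    version="v1",
    namespace="your-namespace",      # placeholder namespace
    plural="rayclusters",
    name="my-workspace-cluster",
)
if rc["spec"].get("suspend", False):
    print("Suspended: Kueue reclaimed these resources for a higher-priority job.")
else:
    print("Admitted and running.")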
The problem: Solved
This automated approach addresses every problem from the manual "before" scenario:
- High costs: Reduced. Ephemeral Ray clusters are deleted automatically, so there is no idle waste, and hoarding is discouraged because a data scientist's long-running Ray cluster is cleanly preempted when a high-priority job needs the resources.
- Resource efficiency: Solved. The R&D team automatically borrows the 12 idle production GPUs during the day. The production job automatically reclaims them at night. The cluster is always doing useful work.
- Admin as bottleneck: Solved. The admin is no longer a human scheduler. The platform is fully self-service. The admin's job is to monitor and tweak the resource management policies, not manage individual requests.
- Productivity loss: Solved. Data scientists get their resources in seconds via the queue, not in hours via a ticket. They can focus on their work.
Conclusion: From manual gatekeeping to automated platform
The days of manually managing RayCluster YAML files, wasting money on idle GPUs, and making data scientists wait in ticket queues are over. The integration of Ray and Kueue on Red Hat OpenShift AI 3 transforms this process into a fully automated, self-service platform.
This solution provides benefits for two key roles:
- For data scientists: You get a simple, powerful set of Python-native workflows. Whether you need an interactive workspace, a fast-iteration loop, or a fire-and-forget ephemeral job, you can access the resources you need without ever writing a line of Kubernetes YAML.
- For platform administrators: You move from being a human bottleneck to a platform architect. You can now design sophisticated, fair, and efficient policies. You can automatically manage quotas, priorities, and resource sharing across the entire organization, ensuring high-priority jobs always run and cluster utilization remains high.
What's next
Explore OpenShift AI learning paths and try OpenShift AI in our no-cost Developer Sandbox.