Gang autoscaling on OpenShift with Kueue and ProvisionRequest

Running high-performance workloads like AI/ML training, high-performance computing (HPC) simulations, or large data processing on Kubernetes introduces a critical challenge: gang scheduling and coordinated autoscaling.

Imagine you have a distributed machine learning job that requires ten worker pods to start simultaneously. If the cluster only has capacity for eight, traditional Kubernetes scheduling might start those eight pods, leaving the remaining two unschedulable. The Cluster Autoscaler might see those two pending pods and scale up the cluster. However, the first eight pods are useless without the full "gang," wasting compute resources and potentially causing the autoscaler to over-provision or get into a scaling loop before the job can even begin.

This scenario is where the concept of gang autoscaling becomes essential. We need a mechanism that can hold the entire group (or gang) of pods, make sure that all required resources are available before any pod is scheduled, and efficiently signal the need for new capacity to the autoscaler. This coordinated approach prevents resource waste so your latency-sensitive, high-throughput workloads get exactly the resources they need right from the start.

This post explores how combining Red Hat build of Kueue, a queueing and resource management tool, with the ProvisionRequest API brings true gang autoscaling capabilities to OpenShift, so your critical workloads start efficiently and reliably.

Configuring OpenShift autoscaler to support ProvisionRequest

Edit the cluster's FeatureGate resource so that its feature set is DevPreviewNoUpgrade:

featureSet: DevPreviewNoUpgrade
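
For context, the featureSet line belongs in the spec of the cluster-scoped FeatureGate resource named cluster. A minimal sketch of the full resource follows. Keep in mind that enabling DevPreviewNoUpgrade cannot be reverted and prevents cluster upgrades, so it is best suited to non-production test clusters.

```yaml
# Cluster-wide feature gate; OpenShift has a single instance named "cluster".
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: DevPreviewNoUpgrade
```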

With the feature set in place, enable the autoscaler by creating the ClusterAutoscaler custom resource (CR).

Note

The values in this example CR are for demonstration purposes only. Consult your cluster administrator before applying them.

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 24
    cores:
      min: 8
      max: 128
    memory:
      min: 4
      max: 256
  logVerbosity: 4
  scaleDown:
    cordonNodeBeforeTerminating: Enabled
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
    unneededTime: 5m
    utilizationThreshold: "0.4"
  scaleUp:
    newPodScaleUpDelay: "10s"
  expanders: ["Random"]

The next step is to create a MachineAutoscaler that tells the autoscaler which MachineSet resource it can scale. The minReplicas and maxReplicas fields bound the number of machines (and therefore nodes) that the targeted MachineSet can shrink or grow to.

apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-autoscaler"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: your_machine_set_you_want_to_scale

Configuring Red Hat build of Kueue to work with autoscaling

Red Hat build of Kueue acts as the intelligent arbiter for resource requests, managing job queues and making sure cluster capacity is reserved or provisioned before jobs are admitted. Integrating it with autoscaling via ProvisionRequest requires specific configuration to enable this coordinated behavior.

To achieve this, Kueue must recognize when a job requires provisioning new capacity by setting the appropriate cluster queue and resource flavor configurations. The following breakdown represents the complete plumbing required to move from a job request to physical cloud infrastructure, connecting the user's queue to an automated provisioning system.

ResourceFlavor: Defining available resource types

This resource is the simplest piece. It defines a type of resource. In a real-world scenario, you might have one flavor for spot instances and one for on-demand instances. Here, there is a single flavor named default-flavor with no node labels attached.

kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta2
metadata:
  name: "default-flavor"

ClusterQueue: Managing cluster resource quotas

The ClusterQueue manages the overall cluster resource budget. This one sets a nominal quota of 36 CPUs for default-flavor.

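kind: ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta2
metadata:
  name: "cluster-queue"        # assumed name; LocalQueues refer to it
spec:
  namespaceSelector: {}        # match LocalQueues in all namespaces
  resourceGroups:
  - coveredResources: ["cpu"]  # resources this group sets quota for
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 36
  admissionChecksStrategy:
    admissionChecks:
      - name: "sample-prov"    # must match the AdmissionCheck name
        onFlavors: [default-flavor]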

The admissionChecksStrategy field acts as a bridge for this configuration. It tells Kueue that even if the user has enough quota (fewer than 36 CPUs), do not admit the job until the check named sample-prov returns a Ready status.
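
While the check is outstanding, its state is recorded on the Workload object that Kueue creates for the job. The excerpt below is a rough sketch of that part of the status, assuming the upstream Kueue Workload schema; it is not captured output.

```yaml
# Excerpt of a Workload's status: quota is reserved, but the check has not passed yet.
status:
  admissionChecks:
  - name: sample-prov
    state: Pending   # changes to Ready once provisioning succeeds
```

Kueue admits the workload only after every check listed for the ClusterQueue reports Ready.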

AdmissionCheck: Controlling job admission gates

The AdmissionCheck resource acts as an automated cluster gatekeeper.

apiVersion: kueue.x-k8s.io/v1beta2
kind: AdmissionCheck
metadata:
  name: sample-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: prov-test-config

Setting controllerName to kueue.x-k8s.io/provisioning-request tells Kueue to handle this check with its built-in provisioning logic. The parameters block points to a ProvisioningRequestConfig resource, which holds the instructions for requesting the infrastructure if it doesn't already exist.

ProvisioningRequestConfig: Specifying autoscaling parameters

This resource holds the instructions that Kueue passes to the Cluster Autoscaler. The provisioningClassName field selects the provisioning class the autoscaler uses, in this case an atomic scale-up class. The managedResources field lists the resources that the autoscaling manages, here only cpu. Finally, retryStrategy sets backoffLimitCount to 2, so if the cloud provider lacks capacity, the request is retried up to two times before the system gives up.

apiVersion: kueue.x-k8s.io/v1beta2
kind: ProvisioningRequestConfig
metadata:
  name: prov-test-config
spec:
  provisioningClassName: best-effort-atomic-scale-up.autoscaling.x-k8s.io
  managedResources:
  - cpu
  retryStrategy:
    backoffLimitCount: 2
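
To make the mapping concrete, here is a hedged sketch of the kind of ProvisioningRequest that Kueue might generate from this config for a ten-pod job. The object name, PodTemplate name, and namespace are hypothetical (Kueue generates the real ones), and the API version is an assumption that may be v1beta1 on older autoscaler releases.

```yaml
apiVersion: autoscaling.x-k8s.io/v1   # assumption: may be v1beta1 depending on version
kind: ProvisioningRequest
metadata:
  name: example-job-sample-prov-1     # hypothetical; generated by Kueue
  namespace: default                  # hypothetical; same namespace as the job
spec:
  # Copied from provisioningClassName in the ProvisioningRequestConfig.
  provisioningClassName: best-effort-atomic-scale-up.autoscaling.x-k8s.io
  podSets:
  - count: 10                         # the whole gang is requested as one unit
    podTemplateRef:
      name: example-job-pod-template  # hypothetical PodTemplate created by Kueue
```

Because the request carries the pod count for the entire group, the autoscaler can size the scale-up for all ten pods rather than reacting to pending pods one at a time.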

How the components flow together

To understand how these individual configurations interact in a live cluster, you can track the end-to-end lifecycle of a job request as it passes through the provisioning pipeline.

1. A user submits a job to the user-queue (the LocalQueue).
2. Kueue evaluates the ClusterQueue to check whether enough of the 36-CPU quota remains.
3. If it does, Kueue sees the AdmissionCheck and notes that the check requires provisioning.
4. Kueue creates a ProvisioningRequest custom resource based on the settings in ProvisioningRequestConfig.
5. The Cluster Autoscaler sees that request, provisions the nodes, and marks the request as Provisioned.
6. The AdmissionCheck flips to Ready, and Kueue allows the job to run on the new nodes.
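
As a sketch of the first step, the manifests below define the user-queue LocalQueue and a ten-pod Job that targets it. The namespace, the ClusterQueue name cluster-queue, the job name, the container image, and the CPU request are all assumptions for illustration; the clusterQueue value must match the name of your own ClusterQueue.

```yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default              # hypothetical namespace
spec:
  clusterQueue: cluster-queue     # hypothetical; use your ClusterQueue's name
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gang-training-job         # hypothetical
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # routes the job to the LocalQueue
spec:
  parallelism: 10                 # the full gang of ten workers
  completions: 10
  suspend: true                   # stays suspended until Kueue admits the job
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          requests:
            cpu: "2"              # 10 pods x 2 CPUs = 20 CPUs, within the 36-CPU quota
```

The job is created suspended, so no pod starts until the quota check and the sample-prov admission check have both passed.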

Optimizing cluster use for AI/ML

Kueue's support for provisioning requests makes OpenShift a better fit for AI/ML workloads, which need gang scheduling and depend on the right nodes being available when the job starts.

This integration of Red Hat build of Kueue and the ProvisionRequest API solves a long-standing challenge in running high-performance workloads on OpenShift. By enabling true gang autoscaling, we eliminate resource waste, prevent scaling deadlocks, and make sure that complex, multi-pod jobs—like distributed AI/ML training—can start immediately and reliably once all required capacity is available. This capability is foundational for optimizing cluster use and increasing the throughput of critical applications.

What's next: Autoscaling for inference

While this post focused on the provisioning of resources for demanding, capacity-guaranteed workloads, the integration of Kueue and the ProvisionRequest API paves the way for advanced autoscaling for inference workloads. We are exploring how to use these same mechanisms to handle sudden, large spikes in demand for AI/ML model serving, allowing the system to scale rapidly to meet real-time user needs without maintaining expensive idle capacity.

Connect with the Kueue community

Are you running distributed AI/ML, HPC, or other gang-scheduled workloads on OpenShift? Does the challenge of coordinated autoscaling resonate with your team's pain points? We are actively seeking feedback and collaborations to refine these features.
