Running high-performance workloads like AI/ML training, high-performance computing (HPC) simulations, or large data processing on Kubernetes introduces a critical challenge: gang scheduling and coordinated autoscaling.
Imagine you have a distributed machine learning job that requires ten worker pods to start simultaneously. If the cluster only has capacity for eight, traditional Kubernetes scheduling might start those eight pods, leaving the remaining two unschedulable. The Cluster Autoscaler might see those two pending pods and scale up the cluster. However, the first eight pods are useless without the full "gang," wasting compute resources and potentially causing the autoscaler to over-provision or get into a scaling loop before the job can even begin.
This scenario is where the concept of gang autoscaling becomes essential. We need a mechanism that can hold the entire group (or gang) of pods, make sure that all required resources are available before any pod is scheduled, and efficiently signal the need for new capacity to the autoscaler. This coordinated approach prevents resource waste so your latency-sensitive, high-throughput workloads get exactly the resources they need right from the start.
This post explores how combining Red Hat build of Kueue, a queueing and resource management tool, with the ProvisioningRequest API brings true gang autoscaling capabilities to OpenShift, so your critical workloads start efficiently and reliably.
Configuring the OpenShift autoscaler to support ProvisioningRequest
Edit the feature gate of the OpenShift cluster so that it uses the DevPreviewNoUpgrade feature set.
featureSet: DevPreviewNoUpgrade
The first step to enable the autoscaler is to create the ClusterAutoscaler custom resource (CR).
Note
The values in this example CR are for demonstration purposes and should only be used after consulting a cluster administrator.
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 24
    cores:
      min: 8
      max: 128
    memory:
      min: 4
      max: 256
  logVerbosity: 4
  scaleDown:
    cordonNodeBeforeTerminating: Enabled
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
    unneededTime: 5m
    utilizationThreshold: "0.4"
  scaleUp:
    newPodScaleUpDelay: "10s"
  expanders: ["Random"]
The next step is to configure the MachineAutoscaler to know which MachineSet resources you need to scale. The maxReplicas field in this CR controls the maximum number of nodes that this MachineAutoscaler will be able to create.
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-autoscaler"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: your_machine_set_you_want_to_scale
Configuring Red Hat build of Kueue to work with autoscaling
Red Hat build of Kueue acts as the intelligent arbiter for resource requests, managing job queues and making sure cluster capacity is reserved or provisioned before jobs are admitted. Integrating it with autoscaling via ProvisioningRequest requires specific configuration to enable this coordinated behavior.
To achieve this, you configure the cluster queue and resource flavor so that Kueue recognizes when a job requires new capacity to be provisioned. The following breakdown covers the complete plumbing required to move from a job request to physical cloud infrastructure, connecting the user's queue to an automated provisioning system.
ResourceFlavor: Defining available resource types
This resource is the simplest piece. It defines a type of resource. In a real-world scenario, you might have one for spot-instances and one for on-demand. Here, it’s just a label called default-flavor.
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta2
metadata:
  name: "default-flavor"
ClusterQueue: Managing cluster resource quotas
The ClusterQueue configuration manages the overall cluster resource budget. It specifies that this cluster can handle 36 CPUs of default-flavor.
kind: ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta2
spec:
  resourceGroups:
  - flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 36
  admissionChecksStrategy:
    admissionChecks:
    - name: "sample-prov"
      onFlavors: [default-flavor]
The admissionChecksStrategy field acts as a bridge for this configuration. It tells Kueue that even if the user has enough quota (fewer than 36 CPUs), do not admit the job until the check named sample-prov returns a Ready status.
AdmissionCheck: Controlling job admission gates
The AdmissionCheck resource acts as an automated cluster gatekeeper.
kind: AdmissionCheck
metadata:
  name: sample-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    kind: ProvisioningRequestConfig
    name: prov-test-config
By setting controllerName to provisioning-request, you tell Kueue to use its internal provisioning logic. This configuration points directly to a ProvisioningRequestConfig resource for instructions on how to build the infrastructure if it doesn't already exist.
ProvisioningRequestConfig: Specifying autoscaling parameters
This configuration block provides the explicit set of instructions for the Cluster Autoscaler. The provisioningClassName tells the cluster which autoscaler driver to use, such as an atomic scale-up driver. The managedResources field specifies that the autoscaler must provision more CPU capacity. Finally, retryStrategy defines that if the cloud provider lacks capacity, the system will attempt to provision the nodes up to two times before giving up.
kind: ProvisioningRequestConfig
metadata:
  name: prov-test-config
spec:
  provisioningClassName: best-effort-atomic-scale-up.autoscaling.x-k8s.io
  managedResources:
  - cpu
  retryStrategy:
    backoffLimitCount: 2
How the components flow together
To understand how these individual configurations interact in a live cluster, you can track the end-to-end lifecycle of a job request as it passes through the provisioning pipeline.
- First, a user submits a job to the user-queue (the LocalQueue).
- Kueue then evaluates the ClusterQueue to see if there are enough of the 36 CPUs left.
- If yes, it sees the AdmissionCheck and notices the check requires provisioning.
- Kueue then creates a ProvisioningRequest custom resource based on the template in ProvisioningRequestConfig.
- The Cluster Autoscaler sees that request, builds the nodes, and marks the request as Succeeded.
- The AdmissionCheck flips to Ready, and Kueue allows the job to run on the new nodes.
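The first step of this flow relies on a LocalQueue and a job that targets it, neither of which is shown in this post. The following is a minimal sketch of both, not a definitive configuration. The name cluster-queue is a hypothetical placeholder, because the ClusterQueue example in this post does not show a metadata.name. The job name, container image, command, and CPU request are also illustrative assumptions. The kueue.x-k8s.io/queue-name label is how a job selects its LocalQueue.

```yaml
# LocalQueue: the namespaced entry point that users submit jobs to.
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: user-queue
spec:
  # Assumed name: replace with the actual name of your ClusterQueue.
  clusterQueue: cluster-queue
---
# A ten-pod job that should start only when all ten pods can be placed.
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-gang-job            # hypothetical name
  labels:
    # Routes the job to the LocalQueue defined above.
    kueue.x-k8s.io/queue-name: user-queue
spec:
  parallelism: 10
  completions: 10
  # The job stays suspended until Kueue admits it.
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        # Placeholder image and command, for illustration only.
        image: registry.access.redhat.com/ubi9/ubi-minimal:latest
        command: ["sleep", "30"]
        resources:
          requests:
            cpu: "1"               # 10 pods x 1 CPU fits within the 36-CPU quota
```

Create both objects in the same namespace. In this sketch the job requests 10 CPUs in total, so it passes the quota check and then waits on the sample-prov admission check before any of its pods are started.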
Optimizing cluster use for AI/ML
Kueue support for ProvisioningRequest makes this coordinated scale-up available to AI/ML workloads. These workloads need gang scheduling, so being able to provide the right nodes for the whole job before it starts is essential.
This integration of Red Hat build of Kueue and the ProvisioningRequest API solves a long-standing challenge in running high-performance workloads on OpenShift. By enabling true gang autoscaling, we eliminate resource waste, prevent scaling deadlocks, and make sure that complex, multi-pod jobs, such as distributed AI/ML training, can start reliably as soon as all required capacity is available. This capability is foundational for optimizing cluster use and increasing the throughput of critical applications.
What's next: Autoscaling for inference
While this post focused on the provisioning of resources for demanding, capacity-guaranteed workloads, the integration of Kueue and the ProvisioningRequest API paves the way for advanced autoscaling for inference workloads. We are exploring how to use these same mechanisms to handle sudden, large spikes in demand for AI/ML model serving, allowing the system to scale rapidly to meet real-time user needs without maintaining expensive idle capacity.
Connect with the Kueue community
Are you running distributed AI/ML, HPC, or other gang-scheduled workloads on OpenShift? Does the challenge of coordinated autoscaling resonate with your team's pain points? We are actively seeking feedback and collaborations to refine these features.