With OpenShift 4.21, Dynamic Resource Allocation (DRA) graduates to General Availability, fundamentally changing how GPU and accelerator resources are requested, allocated, and shared across your cluster. Built on the upstream Kubernetes 1.34 DRA implementation, this release moves beyond the limitations of the old device plug-in model to a richer, expression-driven framework that understands device attributes, not just device counts.
This post covers what DRA is, why it matters, what's new in OpenShift 4.21, and how to use it with real examples running on an OpenShift 4.21 cluster with NVIDIA A100 GPUs.
The problem: Why device plug-ins fall short
Kubernetes has supported hardware accelerators like GPUs through the device plug-in framework since version 1.8. While functional, device plug-ins have fundamental limitations that become painful at scale, especially for AI/ML workloads:
- Count-based allocation only: A pod requests nvidia.com/gpu: 1, but has no way to say which GPU it needs. There is no mechanism to filter by model, memory capacity, compute capability, or driver version.
- No device sharing: A GPU allocated to one container cannot be shared with another, even for lightweight inference workloads that would only use a fraction of the device.
- No topology awareness: The scheduler is blind to PCIe topology, NVLink connectivity, and NUMA placement. Multi-GPU workloads may land on suboptimal device combinations.
- No parameterization: Workloads cannot request specific device configurations like MIG profiles or power limits at scheduling time.
- No cluster autoscaler integration: The autoscaler cannot reason about opaque device resources when deciding whether to add or remove nodes.
Teams work around these limitations with node labels, taints, tolerations, and custom admission webhooks, but these are brittle, error-prone, and do not scale.
What is dynamic resource allocation?
Dynamic resource allocation (DRA) is a Kubernetes API framework under the resource.k8s.io API group that enables workloads to request specialized hardware based on device attributes rather than simple counts. Think of it as the PersistentVolume/PersistentVolumeClaim model applied to devices: GPUs, FPGAs, NICs, and other accelerators.
DRA introduces four core API objects:
- ResourceSlice: Published by DRA drivers on each node. Describes available devices with typed attributes (model, memory, driver version, UUID, and so on).
- DeviceClass: Defines a category of devices using CEL selector expressions. Created by admins or drivers.
- ResourceClaim: A workload's request for specific devices. Supports CEL-based filtering, can be shared across pods, and persists independently of pod lifecycle.
- ResourceClaimTemplate: A template from which Kubernetes auto-generates per-pod ResourceClaims. The generated claim is deleted when its pod terminates.
The key architectural insight is that DRA drivers publish structured, transparent device information (ResourceSlices) to the API server, and the kube-scheduler itself handles allocation decisions by evaluating CEL expressions against device attributes. No external controller negotiation is needed during scheduling, which makes DRA significantly faster and fully compatible with the cluster autoscaler.
See it in practice
On a cluster with the NVIDIA DRA driver installed, each node publishes a ResourceSlice describing its GPUs. Here is what the driver advertises for a full A100 GPU:
{
  "attributes": {
    "architecture": { "string": "Ampere" },
    "brand": { "string": "Nvidia" },
    "cudaComputeCapability": { "version": "8.0.0" },
    "cudaDriverVersion": { "version": "13.0.0" },
    "driverVersion": { "version": "580.105.8" },
    "productName": { "string": "NVIDIA A100-SXM4-40GB" },
    "type": { "string": "gpu" },
    "uuid": { "string": "GPU-ec819aa6-26b9-d90a-00c8-3fcf0a34a0c9" }
  },
  "capacity": {
    "memory": { "value": "40Gi" }
  },
  "name": "gpu-0"
}
And for a MIG slice on the same GPU model:
{
  "attributes": {
    "architecture": { "string": "Ampere" },
    "productName": { "string": "NVIDIA A100-SXM4-40GB" },
    "profile": { "string": "1g.5gb" },
    "type": { "string": "mig" },
    "uuid": { "string": "MIG-e42ee090-5c43-53b2-a164-c6a0b7ac1a57" },
    "parentUUID": { "string": "GPU-e40930a0-c463-2611-3473-bc72ac15679a" }
  },
  "capacity": {
    "memory": { "value": "4864Mi" },
    "multiprocessors": { "value": "14" }
  },
  "name": "gpu-0-mig-1g5gb-19-5"
}
The scheduler can now see the product name, architecture, MIG profile, memory capacity, and more. This is information that was completely invisible in the device plug-in model.
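Those typed attributes and capacities can be filtered on directly in a claim. As an illustrative sketch (not from the demo cluster; it assumes the CEL quantity helpers available in Kubernetes 1.34 and the gpu.nvidia.com attribute domain shown above), a ResourceClaim could ask for any full GPU with at least 20Gi of device memory:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: big-memory-gpu        # hypothetical claim name
  namespace: dra-demo
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Compare the advertised memory capacity against 20Gi.
            # quantity() and compareTo() come from the Kubernetes CEL
            # quantity library available to DRA selectors.
            expression: "device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('20Gi')) >= 0"
```

On the demo cluster, a claim like this could only be satisfied by the full A100 on worker-3, since every MIG slice advertises less than 20Gi.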
The NVIDIA DRA driver also creates DeviceClasses automatically:
$ oc get deviceclasses
NAME             AGE
gpu.nvidia.com   12m
mig.nvidia.com   12m
The gpu.nvidia.com DeviceClass matches full GPUs, while mig.nvidia.com matches MIG slices. Both use CEL selectors against the type attribute published in the ResourceSlice.
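For reference, a driver-created DeviceClass is essentially a named CEL match on those attributes. A minimal sketch of what gpu.nvidia.com might contain (the shape is assumed from the ResourceSlice output above; inspect the actual object with oc get deviceclass gpu.nvidia.com -o yaml):

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      # Match only devices the NVIDIA driver advertises as full GPUs;
      # the MIG class would test for type == 'mig' instead.
      expression: "device.attributes['gpu.nvidia.com'].type == 'gpu'"
```

Workloads then reference the class by name and layer their own selectors on top, so the class handles the "what kind of device" question and the claim handles the "which one" question.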
The road to GA
DRA's path to General Availability (GA) in OpenShift 4.21 spans multiple Kubernetes and OpenShift releases:
| Release | Kubernetes | DRA status | Milestone |
|---|---|---|---|
| OpenShift 4.19 | 1.32 | Not available | Upstream DRA beta with structured parameters; classic DRA withdrawn |
| OpenShift 4.20 | 1.33 | Technology Preview | DRA enabled behind a Technology Preview feature gate in OpenShift, with validation of the NVIDIA DRA driver |
| OpenShift 4.21 | 1.34 | General Availability | DRA enabled by default; resource.k8s.io/v1 API; beta APIs removed |
The feature gate DynamicResourceAllocation was promoted to the default feature set (OCPNODE-3779, now closed). Earlier alpha/beta API enablement was removed since the v1 API is now served by default.
What's GA in OpenShift 4.21
Three DRA capabilities reached General Availability in the 4.21 release. The demos in this article were run on an OpenShift 4.21.3 cluster on Google Cloud with three A100-SXM4-40GB worker nodes, each configured with a different GPU layout:
- worker-1: all-1g.5gb profile, 7x MIG 1g.5gb slices (4.8 GB each)
- worker-2: all-3g.20gb profile, 2x MIG 3g.20gb slices (19.6 GB each)
- worker-3: MIG disabled, 1x full A100 40 GB
1. Attribute-based GPU allocation
This is the headline feature. Pods can now request GPUs based on specific device attributes exposed by a DRA driver, including product name, memory capacity, compute capability, driver version, and MIG profile.
Requesting a specific MIG profile
The following ResourceClaimTemplate requests a 1g.5gb MIG slice using a CEL selector against the profile attribute:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-1g5gb-claim
  namespace: dra-demo
spec:
  spec:
    devices:
      requests:
      - name: mig
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
A pod references this template and runs the CUDA vectorAdd sample to verify that the GPU is actually usable:
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd-1g5gb
  namespace: dra-demo
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: mig-1g5gb-claim
The result:
$ oc logs vectoradd-1g5gb -n dra-demo
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
The pod landed on worker-1 (the node with 1g.5gb slices), and the scheduler matched the claim to a specific MIG device.
Requesting a full GPU
Using the gpu.nvidia.com DeviceClass instead of mig.nvidia.com requests a whole GPU:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: full-gpu-claim
  namespace: dra-demo
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
The pod landed on worker-3 (the only node with a full GPU), and the allocation shows exactly which device was assigned:
{
  "device": "gpu-0",
  "driver": "gpu.nvidia.com",
  "pool": "worker-3",
  "request": "gpu"
}
No node selectors, no taints, no tolerations. The claim describes what the workload needs, and the scheduler finds a match.
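Unlike the ResourceClaimTemplate earlier, a standalone ResourceClaim is referenced by name rather than generated per pod. A minimal pod consuming full-gpu-claim might look like this (the pod name is hypothetical; the image and structure mirror the earlier examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd-full-gpu     # hypothetical pod name
  namespace: dra-demo
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      claims:
      - name: gpu              # must match an entry under resourceClaims
  resourceClaims:
  - name: gpu
    resourceClaimName: full-gpu-claim   # standalone claim, not a template
```

Because the claim persists independently of the pod, deleting this pod does not release the allocation; the claim keeps the device until the claim itself is deleted.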
Device sharing between containers
Two containers in the same pod can reference the same ResourceClaim, giving both access to the same physical device. This is something the device plug-in framework cannot do.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-mig-claim
  namespace: dra-demo
spec:
  devices:
    requests:
    - name: mig
      exactly:
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '3g.20gb'"
---
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
  namespace: dra-demo
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd-1
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    command: ["sh", "-c", "/cuda-samples/vectorAdd && nvidia-smi -L"]
    resources:
      claims:
      - name: gpu
  - name: vectoradd-2
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    command: ["sh", "-c", "/cuda-samples/vectorAdd && nvidia-smi -L"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-mig-claim
Both containers ran vectorAdd successfully, and nvidia-smi -L confirms they see the exact same MIG device:
=== Container 1 ===
[Vector addition of 50000 elements]
...
Test PASSED
Done
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ce5b3123-f4f2-6250-8428-40617e5f9b9d)
MIG 3g.20gb Device 0: (UUID: MIG-90155500-9a09-5016-9746-95ef09bd78a6)
=== Container 2 ===
[Vector addition of 50000 elements]
...
Test PASSED
Done
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ce5b3123-f4f2-6250-8428-40617e5f9b9d)
MIG 3g.20gb Device 0: (UUID: MIG-90155500-9a09-5016-9746-95ef09bd78a6)
The same UUID is in both containers. This enables lightweight inference sidecars, monitoring agents, or multi-process training to share a single GPU allocation without wasting resources.
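Sharing is not limited to containers in one pod. Because a standalone ResourceClaim persists independently of any pod, separate pods can reference the same claim by name and end up on the same device. A sketch under that assumption (pod names are hypothetical; both reference the shared-mig-claim defined above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sharer-a               # hypothetical pod name
  namespace: dra-demo
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-mig-claim   # same claim as sharer-b
---
apiVersion: v1
kind: Pod
metadata:
  name: sharer-b               # hypothetical pod name
  namespace: dra-demo
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-mig-claim
```

Both pods must schedule to the node where the claim was allocated, which the scheduler enforces automatically when it resolves the shared claim.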
2. Prioritized alternatives in device requests
Based on upstream KEP-4816, this feature allows pods to specify a prioritized list of acceptable device types within a single ResourceClaim. The scheduler tries to satisfy requests in priority order and falls back to lower-priority alternatives when preferred devices are unavailable.
Here is a ResourceClaimTemplate that prefers 1g.5gb MIG slices but falls back to 3g.20gb if none are available:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: prefer-small-mig
  namespace: dra-demo-priority
spec:
  spec:
    devices:
      requests:
      - name: gpu
        firstAvailable:
        - name: prefer-1g5gb
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
        - name: fallback-3g20gb
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: "device.attributes['gpu.nvidia.com'].profile == '3g.20gb'"
Step 1: Deploy when a preferred device is available
The first pod gets a 1g.5gb slice. The allocation's request field confirms the first alternative was selected:
{
  "device": "gpu-0-mig-1g5gb-19-5",
  "driver": "gpu.nvidia.com",
  "pool": "worker-1",
  "request": "gpu/prefer-1g5gb"
}
Step 2: Exhaust the preferred device
Deploy seven more pods using the same template. All seven 1g.5gb slices on worker-1 are consumed.
Step 3: Fallback kicks in
The next pod cannot get a 1g.5gb slice because all seven are taken. The scheduler automatically falls back to 3g.20gb on worker-2:
$ oc logs priority-fallback -n dra-demo-priority
[Vector addition of 50000 elements]
...
Test PASSED
Done
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ce5b3123-f4f2-6250-8428-40617e5f9b9d)
MIG 3g.20gb Device 0: (UUID: MIG-90155500-9a09-5016-9746-95ef09bd78a6)
The allocation confirms the fallback alternative was selected:
{
  "device": "gpu-0-mig-3g20gb-9-0",
  "driver": "gpu.nvidia.com",
  "pool": "worker-2",
  "request": "gpu/fallback-3g20gb"
}
Step 4: Total exhaustion
When both 1g.5gb and 3g.20gb slices are consumed, the next pod remains Pending:
$ oc get pod priority-exhausted -n dra-demo-priority
NAME READY STATUS RESTARTS AGE
priority-exhausted 0/1 Pending 0 10s
$ oc get pod priority-exhausted -n dra-demo-priority -o jsonpath='{.status.conditions[0].message}'
0/6 nodes are available: 3 cannot allocate all claims, 3 node(s) had untolerated
taint(s). still not schedulable, preemption: 0/6 nodes are available:
6 Preemption is not helpful for scheduling.
In a heterogeneous cluster, teams no longer need separate deployments for each GPU type. One ResourceClaimTemplate handles the preference logic, and the scheduler does the rest.
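In practice that preference logic plugs straight into a Deployment: each replica gets its own generated claim from the template and tries the alternatives in priority order independently. A hedged sketch (the Deployment name, labels, and sleep command are hypothetical; it references the prefer-small-mig template from above):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workers             # hypothetical name
  namespace: dra-demo-priority
spec:
  replicas: 4
  selector:
    matchLabels:
      app: gpu-workers
  template:
    metadata:
      labels:
        app: gpu-workers
    spec:
      containers:
      - name: worker
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
        command: ["sleep", "infinity"]   # placeholder long-running workload
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        # Each replica gets its own auto-generated ResourceClaim,
        # deleted when that pod terminates.
        resourceClaimTemplateName: prefer-small-mig
```

With the cluster layout above, some replicas would land on 1g.5gb slices and the rest would fall back to 3g.20gb, all from a single manifest.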
3. Namespace-controlled admin access
Cluster administrators can gain privileged access to devices already in use by other workloads. This is useful for monitoring, health checks, and debugging, and it does not disrupt those workloads. To use admin access, the namespace must carry a specific label, and the ResourceClaim must set adminAccess: true:
apiVersion: v1
kind: Namespace
metadata:
  name: dra-demo-admin
  labels:
    resource.kubernetes.io/admin-access: "true"
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: admin-gpu-claim
  namespace: dra-demo-admin
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        adminAccess: true
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '3g.20gb'"
In this demo, the 3g.20gb slices on worker-2 are already allocated to workload pods from the prioritized alternatives demo above. The admin monitoring pod is deployed in a separate namespace with admin access enabled:
apiVersion: v1
kind: Pod
metadata:
  name: admin-monitor
  namespace: dra-demo-admin
spec:
  restartPolicy: Never
  containers:
  - name: monitor
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    command: ["sh", "-c", "nvidia-smi -L && nvidia-smi && sleep 3600"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: admin-gpu-claim
The admin pod gets access to the in-use device and can run nvidia-smi to inspect it:
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ce5b3123-f4f2-6250-8428-40617e5f9b9d)
MIG 3g.20gb Device 0: (UUID: MIG-90155500-9a09-5016-9746-95ef09bd78a6)
+-------------------------------------------------------------------------+
|NVIDIA-SMI 580.105.08 Driver Ver:580.105.08 CUDA Ver: 13.0 |
+-------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|=========================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On |00000000:00:04.0 Off| On |
| N/A 36C P0 93W / 400W | N/A | N/A Default |
| | | Enabled |
+---------------------------+------------------------+----------------------+
| MIG devices: |
+------------------+--------------------+-----------+-----------------------+
| GPU GI CI MIG |Shared Memory-Usage | Vol| Shared |
| ID ID Dev | Shared BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
|==================+================================+===========+===========|
| 0 1 0 0 |107MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 12210MiB | | |
+------------------+--------------------------------+-----------+-----------+
The allocation confirms admin access was granted:
{
  "adminAccess": true,
  "device": "gpu-0-mig-3g20gb-9-0",
  "driver": "gpu.nvidia.com",
  "pool": "worker-2",
  "request": "gpu"
}
Meanwhile, the original workload pod continues running undisturbed on the same device. This gives SREs and platform teams the ability to monitor GPU health and debug allocation issues in production without evicting running workloads.
What's next
DRA continues to evolve upstream. Features currently in alpha or beta in Kubernetes 1.34 that may appear in future OpenShift releases include:
- Partitionable devices allow drivers to advertise overlapping logical device partitions and reconfigure physical hardware dynamically based on actual allocations.
- Device taints and tolerations mark devices as degraded or unusable, similar to node taints, with workloads explicitly tolerating tainted devices.
- Device binding conditions add support for network-attached and fabric-attached accelerators that must be attached to a node before the pod is scheduled.
To learn more, check out these resources: