Partition AMD Instinct GPU accelerators via device config manager (DCM) in Red Hat OpenShift

Make your AI infrastructure more efficient by partitioning AMD Instinct GPUs via the device config manager in Red Hat OpenShift and validate your setup with a vLLM workload.

With your control plane protected and the device config manager (DCM) deployed, you are ready to begin partitioning. Slicing an AMD Instinct GPU requires resetting the underlying device. Because of this, you must clear the target node of all active workloads so that no applications are left trying to use the hardware when the configuration change occurs.

Tainting the GPU node with amd-dcm=up:NoExecute immediately evicts every pod that does not tolerate the taint and prevents new pods without that toleration from being scheduled on the node. This ensures that no workloads are using the GPUs before partitioning. Only pods with a matching toleration, including pods managed by DaemonSets, remain running.

You will then label the node with the desired partition profile, which signals DCM to apply the GPU partitioning configuration.
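
To see what the taint affects, it can help to list the pods on the node before and after tainting, and to confirm that the taint is actually set. The following is a minimal sketch, not part of the official procedure. The function names (show_taints, pods_on_node, has_dcm_taint) are hypothetical helpers, and the node name is assumed to be passed as the first argument.

```shell
# Sketch only: helper names are hypothetical, not part of oc or DCM.

# Print the node's taints, one per line, as key=value:effect.
show_taints() {
  oc get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}

# List every pod currently scheduled on the node, across all namespaces.
pods_on_node() {
  oc get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}

# Succeed only if the amd-dcm=up:NoExecute taint is present on the node.
has_dcm_taint() {
  show_taints "$1" | grep -qx 'amd-dcm=up:NoExecute'
}
```

For example, running pods_on_node "$NODE_NAME" before and after the taint shows which pods were evicted, and has_dcm_taint "$NODE_NAME" returns a zero exit status once the taint is in place.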

In this lesson, you will:

  • Apply a node taint to safely evict active user workloads from your GPU host.
  • Label the node with a partition profile to trigger DCM partitioning.

Taint and label the GPU node 

To protect your workloads and begin partitioning, start by tainting and then labeling the GPU node.

  1. Taint the GPU node to evict non-essential workloads. Run the following command: 

    oc taint nodes "$NODE_NAME" amd-dcm=up:NoExecute
  2. After tainting the node, inspect the status of all Pods in the cluster. In some environments, a Prometheus exporter Pod may enter an Error state (rather than Pending) and continue holding GPU resources, which blocks the DCM partitioning process. If this occurs, scale down the Prometheus instance and force-delete the stuck Pod:

    oc patch prometheus amd-gpu-prometheus -n devmetrics --type='merge' -p '{"spec":{"replicas":0}}'
    oc delete pod -n devmetrics prometheus-amd-gpu-prometheus-0 --force --grace-period=0
  3. With GPU-consuming workloads evicted, label the node to trigger DCM partitioning. 

    oc label node "$NODE_NAME" dcm.amd.com/gpu-config-profile=cpx-profile-nps4 --overwrite

Note

The --overwrite flag accounts for any existing gpu-config-profile label.

  4. Wait for DCM to process the profile. Follow the DCM logs to watch its progress:

    oc logs -n openshift-amd-gpu -l app.kubernetes.io/name=device-config-manager -f

    A successful output will look like this:

    NEW TRIGGER ALERT FROM NODE LABELS
    2026/05/27 09:10:48 Label changed: dcm.amd.com/gpu-config-profile
    Old value: 
    New value: cpx-profile-nps4
    2026/05/27 09:10:48 #####################################
    2026/05/27 09:10:48 Partition profile info:
    2026/05/27 09:10:48 Selected profile name: cpx-profile-nps4
    2026/05/27 09:10:48 #####################################
    
    . . . TRUNCATED . . .
    
    2026/05/27 09:12:10 Successfully Partitioned GPUs of profile 1
    2026/05/27 09:12:10 Partition completed successfully
    2026/05/27 09:12:10 Label "dcm.amd.com/gpu-config-profile-state" added successfully to node "smc6216gpu.partner-accelerators.redhat.lab"
    2026/05/27 09:12:10 AMD SMI shutdown successfully
    2026/05/27 09:12:10 #####################################
    2026/05/27 09:12:10 PartitionGPU executed successfully
    2026/05/27 09:12:10 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    2026/05/27 09:12:10 ServicesList [amd-metrics-exporter gpuagent]
    2026/05/27 09:12:10 Restarting service skipped for: amd-metrics-exporter.service (was not-loaded at 2026-05-27 09:10:58.654388053 +0000 UTC m=+45.487989488)
    2026/05/27 09:12:10 Restarting service skipped for: gpuagent.service (was not-loaded at 2026-05-27 09:11:08.663196594 +0000 UTC m=+55.496798039)
    2026/05/27 09:12:10 Cleaning up PreStateDB...
    2026/05/27 09:12:10 PreStateDB has been successfully emptied.
    2026/05/27 09:12:10 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  5. Once DCM has completed the partitioning procedure, it labels the node with the result of that operation:

    oc get nodes -o json | jq '.items[].metadata.labels | with_entries(select(.key | startswith("dcm.amd.com")))'

    Your output will look like this:

      "dcm.amd.com/gpu-config-profile": "cpx-profile-nps4",
      "dcm.amd.com/gpu-config-profile-state": "success"
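
As an alternative to watching the logs, a script could poll the state label until DCM reports a result. The following is a minimal sketch under stated assumptions: wait_for_dcm_state is a hypothetical helper, the timeout and interval defaults are arbitrary, and "success" is the only state value shown in this lesson, so any other value is simply treated as "not finished yet".

```shell
# Sketch only: wait_for_dcm_state is a hypothetical helper, not part of oc or DCM.
# Usage: wait_for_dcm_state NODE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_dcm_state() {
  local node="$1" timeout="${2:-600}" interval="${3:-10}"
  local elapsed=0 state=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # Backslashes escape the dots inside the label key for jsonpath.
    state="$(oc get node "$node" \
      -o jsonpath='{.metadata.labels.dcm\.amd\.com/gpu-config-profile-state}')"
    if [ "$state" = "success" ]; then
      echo "DCM state: success"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  # Report the last value seen (empty if the label was never set).
  echo "Timed out after ${timeout}s; last state: '${state}'" >&2
  return 1
}
```

A call such as wait_for_dcm_state "$NODE_NAME" 900 15 would check every 15 seconds for up to 15 minutes. Note that if the node already carries a "success" state from an earlier profile, this sketch returns immediately, so the DCM logs remain the more reliable signal in that case.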

Success! You’ve tainted and labeled the GPU node, triggering DCM to partition your GPU accelerators to the selected profile. 

Now it’s time to verify your new GPU partitioning. 
