Partition AMD Instinct GPU accelerators via device config manager (DCM) in Red Hat OpenShift

Make your AI infrastructure more efficient by partitioning AMD Instinct GPUs via the device config manager in Red Hat OpenShift and validate your setup with a vLLM workload.

Now that you’ve successfully partitioned your hardware, you need to make the node schedulable again. 

In this lesson, you will remove the node taint, which returns the node to the cluster scheduler. You will also restore the Prometheus exporter to resume GPU metrics collection for your partitioned resources.

In this lesson, you will:

  • Remove the taint from the GPU node.
  • Restore GPU metrics collection.

Untaint GPU node 

Remove the taint so the node rejoins the schedulable pool and can resume workloads and metrics collection.

  1. Run the following command to untaint the GPU node. The trailing minus sign (-) after the taint effect tells Red Hat OpenShift to remove that taint from the node rather than add it:

    oc taint nodes $NODE_NAME amd-dcm=up:NoExecute-

  2. Restore the Prometheus exporter by scaling it back to one replica so it can resume scraping metrics from the newly partitioned devices:

    oc patch prometheus amd-gpu-prometheus -n devmetrics --type='merge' -p '{"spec":{"replicas":1}}'

  3. With the node untainted and the Prometheus exporter restored, the cluster is fully operational under the new configuration. The updated dashboard (Figure 1) shows the MI300X system partitioned with the Core Partition X (CPX) compute profile and the NUMA Per Socket 4 (NPS4) memory profile, where NUMA stands for Non-Uniform Memory Access. This combination exposes the maximum number of logical GPUs.

    The same Grafana "AMD Instinct Single Node Dashboard" after CPX + NPS4 partitioning, now showing 64 logical GPUs (indices 0–63) instead of the original 8 physical devices, all healthy (green), at 0% utilization with ~20–83 MB memory each, confirming the partition profile was successfully applied.
    Figure 1: MI300X System partitioned with CPX and NPS4 combination (maximum multi-tenancy partitioning).
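
    If you prefer to confirm the result from the command line as well as the dashboard, the following is a minimal sketch, not part of the original lesson. It reuses the node variable, taint key, Prometheus name, and namespace from the steps above. The helper names taint_absent and expected_logical_gpus are hypothetical, the amd.com/gpu resource name is an assumption that depends on how the device plugin is configured, and the figure of 8 accelerator dies per MI300X is assumed here because it matches the change from 8 to 64 devices shown in Figure 1.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, and amd.com/gpu is an assumed
# resource name that may differ in your cluster.

# Reads taint keys from stdin (one per line) and succeeds if KEY is not listed.
taint_absent() {
  local key="$1"
  ! grep -qxF -- "$key"
}

# Expected logical GPU count: physical GPUs multiplied by partitions per GPU.
expected_logical_gpus() {
  echo $(( $1 * $2 ))
}

# The cluster checks run only when oc is installed and NODE_NAME is set.
if command -v oc >/dev/null 2>&1 && [ -n "${NODE_NAME:-}" ]; then
  # List the keys of any taints still on the node.
  keys=$(oc get node "$NODE_NAME" \
    -o jsonpath='{range .spec.taints[*]}{.key}{"\n"}{end}')
  if printf '%s\n' "$keys" | taint_absent amd-dcm; then
    echo "amd-dcm taint removed"
  else
    echo "amd-dcm taint still present"
  fi

  # Replica count requested for the Prometheus resource patched in step 2.
  oc get prometheus amd-gpu-prometheus -n devmetrics \
    -o jsonpath='{.spec.replicas}{"\n"}'

  # GPUs the node advertises to the scheduler (assumed resource name).
  echo "expected logical GPUs: $(expected_logical_gpus 8 8)"
  oc get node "$NODE_NAME" \
    -o jsonpath='{.status.allocatable.amd\.com/gpu}{"\n"}'
fi
```

    If the advertised count does not match the expected value, check which resource name your device plugin configuration exposes before assuming the partitioning failed.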


Success! You’ve removed the node taint, restored metric collection, and verified that your Red Hat OpenShift cluster recognizes the maximum-density configuration.

Now, let's validate your workload using vLLM.
