Partition AMD Instinct GPU accelerators via device config manager (DCM) in Red Hat OpenShift

Make your AI infrastructure more efficient by partitioning AMD Instinct GPUs via the device config manager in Red Hat OpenShift and validate your setup with a vLLM workload.

AMD GPUs cannot be repartitioned while workloads are using them. In a later lesson, you will apply an amd-dcm=up:NoExecute taint to the GPU node, which evicts every pod that lacks a matching toleration. On vanilla Kubernetes, you must manually patch the kube-system DaemonSets and Deployments with the DCM toleration before applying the taint.

On Red Hat OpenShift, this step is not required. Critical control-plane DaemonSets (cluster networking, core services, machine-config, etc.) carry a wildcard toleration (operator: Exists) by default, so they survive the NoExecute taint without any manual patching.

The following table summarizes what stays running and what gets evicted when the taint is applied:

Component                                        Status    Reason
-----------------------------------------------  --------  --------------------------------------
etcd, apiserver, controller-manager, scheduler   Running   Wildcard toleration (operator: Exists)
OVN, Multus, DNS, MCD, node-exporter             Running   Wildcard toleration (operator: Exists)
DCM                                              Running   Explicit amd-dcm=up toleration
device-plugin, node-labeller, metrics-exporter   Evicted   No matching toleration (expected)

The GPU operands (device-plugin, node-labeller, metrics-exporter) are intentionally evicted; they must release the GPU devices before DCM can repartition the hardware.
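
To preview the eviction behavior on a live cluster before any taint exists, one option is to pass the pod list for the GPU node through a jq filter that mimics the standard Kubernetes toleration-matching rules. The following is a minimal sketch and not part of the original lesson: the function name would_be_evicted and the NODE variable are illustrative, and the filter is an approximation that ignores details such as tolerationSeconds.

```shell
# Hypothetical helper: reads a PodList (JSON) on stdin and prints the
# namespace/name of every pod that has no toleration matching the
# amd-dcm=up:NoExecute taint.
would_be_evicted() {
  jq -r --arg key amd-dcm --arg value up --arg effect NoExecute '
    # A toleration matches when its effect is empty or equal to the taint
    # effect, and either it uses Exists (an empty key matches every key)
    # or it uses Equal (the default operator) with the same key and value.
    def tolerates:
      ((.effect // "") == "" or .effect == $effect)
      and (
        ((.operator // "Equal") == "Exists"
          and ((.key // "") == "" or .key == $key))
        or
        ((.operator // "Equal") == "Equal"
          and .key == $key and (.value // "") == $value)
      );
    .items[]
    | select(any(.spec.tolerations[]?; tolerates) | not)
    | "\(.metadata.namespace)/\(.metadata.name)"
  '
}

# Usage (NODE is a placeholder for your GPU node name):
#   oc get pods -A --field-selector spec.nodeName="$NODE" -o json | would_be_evicted
```

If the output lists only the GPU operands, the node behaves as the table describes. Any other pod in the list is a workload you should expect to lose when the taint is applied.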

Prerequisites

  • Full cluster administrator privileges on your Red Hat OpenShift environment.
  • The jq command-line utility installed locally for processing JSON outputs.

In this lesson, you will:

  • Confirm that Red Hat OpenShift control-plane pods already tolerate the DCM taint.
  • Understand which pods will be evicted and why that is expected.

Verify control plane toleration 

First, inspect the tolerations that your control-plane components already carry. This confirms which components are protected and helps you understand the impact of the taint you will apply later.

  1. Pick any critical control-plane DaemonSet and inspect its tolerations. For example, check the OVN-Kubernetes DaemonSet:

    oc get daemonset ovnkube-node -n openshift-ovn-kubernetes -o jsonpath='{.spec.template.spec.tolerations}' | jq .

    You will see a wildcard toleration entry like this:

    [
     {
       "operator": "Exists"
     }
    ]


    The operator: Exists toleration (with no key specified) matches every taint, including amd-dcm=up:NoExecute. This is why no manual patching is needed.

  2. Optional: For additional confidence, verify that the DCM pod itself carries the explicit toleration it needs. Run this only after DCM is deployed; the pod does not exist until then.

    oc get pod -n openshift-amd-gpu -l app.kubernetes.io/name=device-config-manager -o jsonpath='{.items[0].spec.tolerations}' | jq .

    You should see:

    [
     {
       "key": "amd-dcm",
       "operator": "Equal",
       "value": "up",
       "effect": "NoExecute"
     },
    ... TRUNCATED ...
    ]

    This explicit toleration (key amd-dcm, operator Equal, value up, effect NoExecute) matches the amd-dcm=up:NoExecute taint exactly, so the DCM pod keeps running after the node is tainted.
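
Step 1 inspects a single DaemonSet. To repeat the same check across the whole cluster, a jq filter along the following lines could list every DaemonSet that carries the wildcard toleration. This is a sketch under assumptions: the function name list_wildcard_daemonsets is illustrative, and "wildcard" is taken here to mean a toleration with operator Exists, no key, and an effect that is either unset or NoExecute (a keyless toleration restricted to another effect, such as NoSchedule, would not protect against this taint).

```shell
# Hypothetical helper: reads a DaemonSetList (JSON) on stdin and prints the
# namespace/name of every DaemonSet whose pod template carries a keyless
# Exists toleration that also covers the NoExecute effect.
list_wildcard_daemonsets() {
  jq -r '
    .items[]
    | select(any(.spec.template.spec.tolerations[]?;
        .operator == "Exists"
        and (.key // "") == ""
        and ((.effect // "") == "" or .effect == "NoExecute")))
    | "\(.metadata.namespace)/\(.metadata.name)"
  '
}

# Usage:
#   oc get daemonsets -A -o json | list_wildcard_daemonsets
```

DaemonSets that do not appear in the output are not necessarily at risk; they may tolerate the taint explicitly, as the DCM pod does, or may never schedule onto the GPU node.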

Note

On vanilla Kubernetes clusters, you must manually add the amd-dcm toleration to the kube-system Deployments and DaemonSets before tainting. See the upstream DCM documentation for the Kubernetes-specific steps.

Your Red Hat OpenShift control plane is already protected. No toleration patching is required before proceeding with GPU partitioning. You can now configure and deploy the device config manager. 
