Configure and deploy the device config manager
Now that your control-plane workloads are protected with tolerations, you can prepare the infrastructure components that control the accelerator devices. The AMD GPU Operator manages these day-2 hardware operations using a component called the device config manager (DCM).
By default, the DCM is not active. To enable it, define a ConfigMap that maps partition profiles to your hardware—for example, a single monolithic GPU (SPX) or multiple smaller slices (CPX)—so the operator knows which compute and memory layout to apply on each node.
Prerequisites
- Control plane toleration verified.
- The AMD GPU Operator installed in the openshift-amd-gpu namespace.
In this lesson, you will:
- Create a ConfigMap with GPU partition profiles.
- Patch the DeviceConfig custom resource to enable and configure the DCM, which orchestrates GPU partitioning.
Deploy DCM
To ensure your infrastructure components are mapped correctly, create DCM profiles in a ConfigMap and deploy them to the cluster.
Validate the available compute and memory partitions that your target GPU node supports with the command below. Run this on the GPU node itself (e.g., via SSH or oc debug node).

    cat /sys/module/amdgpu/drivers/pci\:amdgpu/*/{available_compute_partition,available_memory_partition}

On a standard AMD Instinct MI300X system with an 8-GPU layout, your output will look like this:

    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
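If you would rather check for a specific mode than read the listing by eye, a small script can do the comparison. The following is a minimal sketch, not part of the AMD tooling: the function names `supports_mode` and `check_all_gpus` are hypothetical, and it assumes the sysfs files contain a comma-separated list in the format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if mode $2 appears as a whole token in the
# comma-separated list $1 (for example "SPX, DPX, QPX, CPX").
supports_mode() {
  # Remove spaces and wrap in commas so that only whole tokens can match.
  case ",$(printf '%s' "$1" | tr -d ' ')," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: check every amdgpu device on this node.
#   $1 = sysfs file name (available_compute_partition or available_memory_partition)
#   $2 = mode to look for (for example CPX or NPS4)
# Returns 0 if all GPUs list the mode, 1 if any GPU does not, 2 if no GPU was found.
check_all_gpus() {
  local f found=0 rc=0
  for f in /sys/module/amdgpu/drivers/pci:amdgpu/*/"$1"; do
    [ -r "$f" ] || continue
    found=1
    if ! supports_mode "$(cat "$f")" "$2"; then
      echo "missing $2: $f"
      rc=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "no amdgpu devices found"
    return 2
  fi
  return "$rc"
}
```

On the GPU node, you could then run, for example, `check_all_gpus available_compute_partition CPX && check_all_gpus available_memory_partition NPS4` before relying on a CPX/NPS4 profile.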
Note
If some of the available partitions are not shown on your MI300X series system (e.g., NPS2), update to the latest firmware and BIOS versions from your vendor. For the MI300X system, the minimum VBIOS version that supports partitioning is 022.040.003.042.
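If you already have the VBIOS version string of a GPU at hand, the comparison against the minimum can be scripted. This is only a sketch under stated assumptions: `version_ge` is a hypothetical helper, it relies on `sort -V` (version sort, available in GNU coreutils), and how you obtain the version string depends on your vendor tooling and is not shown here.

```shell
#!/usr/bin/env bash
# Minimum VBIOS version for partitioning on MI300X, as stated in the note above.
MIN_VBIOS="022.040.003.042"

# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Both versions are sorted with `sort -V`; if $2 sorts first (or both are equal),
# then $1 >= $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge "$VBIOS_VERSION" "$MIN_VBIOS" || echo "VBIOS too old for partitioning"`, where `VBIOS_VERSION` is a variable you set yourself.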
Create DCM profiles via a ConfigMap. This ConfigMap defines three partition profiles: unpartitioned (spx-profile-nps1), dual-partition (dpx-profile-nps2), and maximum-partition (cpx-profile-nps4). Run the command below to create the ConfigMap on your cluster.

    cat <<EOF | tee gpu-partition-profiles.yaml | oc apply -f -
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-manager-config
      namespace: openshift-amd-gpu
    data:
      config.json: |
        {
          "gpu-config-profiles": {
            "spx-profile-nps1": {
              "skippedGPUs": { "ids": [] },
              "profiles": [
                { "computePartition": "SPX", "memoryPartition": "NPS1", "numGPUsAssigned": 8 }
              ]
            },
            "dpx-profile-nps2": {
              "skippedGPUs": { "ids": [] },
              "profiles": [
                { "computePartition": "DPX", "memoryPartition": "NPS2", "numGPUsAssigned": 8 }
              ]
            },
            "cpx-profile-nps4": {
              "skippedGPUs": { "ids": [] },
              "profiles": [
                { "computePartition": "CPX", "memoryPartition": "NPS4", "numGPUsAssigned": 8 }
              ]
            }
          },
          "gpuClientSystemdServices": {
            "names": ["amd-metrics-exporter", "gpuagent"]
          }
        }
    EOF

Patch the DeviceConfig custom resource to reference the ConfigMap and enable DCM.

    oc patch deviceconfig amdgpu-driver-install -n openshift-amd-gpu --type='merge' -p '{
      "spec": {
        "configManager": {
          "enable": true,
          "image": "docker.io/rocm/device-config-manager:v1.4.1",
          "imagePullPolicy": "IfNotPresent",
          "config": {
            "name": "config-manager-config"
          }
        }
      }
    }'

To determine which DCM versions are publicly available, you can use Skopeo to list the tags released by AMD.

    skopeo list-tags docker://docker.io/rocm/device-config-manager
    {
      "Repository": "docker.io/rocm/device-config-manager",
      "Tags": [
        "v1.3.0",
        "v1.3.0-beta.0",
        "v1.3.1",
        "v1.3.1-beta.0",
        "v1.4.0",
        "v1.4.0-beta.0",
        "v1.4.1"
      ]
    }

Wait for the DCM Pod to be ready:

    oc wait --for=condition=ready pod \
      -l app.kubernetes.io/name=device-config-manager \
      -n openshift-amd-gpu \
      --timeout=300s
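Once the Pod reports ready, you may want to confirm that the cluster holds the profiles and settings you intended. The sketch below is one way to do that, not an official procedure: the function names are hypothetical, and it assumes `jq` is installed on your workstation and that the resource names match the ones used in this lesson.

```shell
#!/usr/bin/env bash
NS=openshift-amd-gpu

# Hypothetical helper: read a DCM config.json on stdin and print the profile
# names defined under "gpu-config-profiles", one per line (jq sorts the keys).
list_profiles() {
  jq -r '."gpu-config-profiles" | keys[]'
}

# Hypothetical helper: fetch config.json from the ConfigMap and list its profiles.
# The dot in the key name is escaped for the JSONPath expression.
show_cluster_profiles() {
  oc get configmap config-manager-config -n "$NS" \
    -o jsonpath='{.data.config\.json}' | list_profiles
}

# Hypothetical helper: print the configManager stanza of the patched DeviceConfig.
show_dcm_spec() {
  oc get deviceconfig amdgpu-driver-install -n "$NS" \
    -o jsonpath='{.spec.configManager}{"\n"}'
}
```

Running `show_cluster_profiles` should list the three profile names from the ConfigMap, and `show_dcm_spec` should show the image and ConfigMap reference you set in the patch.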
You have verified your GPU's available partition modes, created the partition profiles, and enabled DCM.
In the next lesson, you will taint and label the GPU node to trigger the partitioning process.