Verify GPU partitioning
Now that the device config manager (DCM) has reported a successful partition, we need to verify that the system recognizes the new GPU devices.
To inspect the low-level hardware changes without leaving the cluster environment, run commands directly inside the DCM Pod. In this lesson, you will use the built-in AMD System Management Interface (amd-smi) tool to confirm that the physical accelerator has been partitioned according to the applied profile.
Prerequisites:
- Control plane toleration verified.
- Configure and deploy the device config manager.
- Taint and label GPU node.
In this lesson, you will:
- Use the user-space amd-smi tool inside the DCM Pod to verify that DCM has successfully partitioned the GPUs according to the profile.
Verify GPU partitioning
Now that you have tainted and labeled the GPU node, it’s time to get the DCM Pod name so that you can verify that the GPU is partitioned and ready for workloads to be scheduled.
Get the DCM Pod name:

    DCM_POD=$(oc get pods -n openshift-amd-gpu \
      -l app.kubernetes.io/name=device-config-manager \
      -o jsonpath='{.items[0].metadata.name}')

With the Pod name saved, use oc exec to list the GPU devices and verify partitioning with amd-smi:

    oc exec -n openshift-amd-gpu "$DCM_POD" -- amd-smi static list

On a successfully partitioned AMD Instinct MI300X host configured for maximum density, the output lists 64 logical GPU devices (indices 0 through 63) instead of the original 8 physical GPUs. Example output on an MI300X system:

    GPU: 0
        BDF: 0000:1b:00.0
        UUID: 09ff74a1-0000-1000-802d-0540b39d66d6
        KFD_ID: 3771
        NODE_ID: 2
        PARTITION_ID: 0

    GPU: 1
        BDF: 0000:1b:00.1
        UUID: 04ff74a1-0000-1000-80b7-90c218050861
        KFD_ID: 62138
        NODE_ID: 3
        PARTITION_ID: 1

    ... TRUNCATED ...

    GPU: 62
        BDF: 0000:dd:00.6
        UUID: 5cff74a1-0000-1000-80e7-da6c49d7b176
        KFD_ID: 32106
        NODE_ID: 64
        PARTITION_ID: 6

    GPU: 63
        BDF: 0000:dd:00.7
        UUID: 66ff74a1-0000-1000-809a-675c73e66337
        KFD_ID: 33131
        NODE_ID: 65
        PARTITION_ID: 7

Next, confirm the specific compute and memory partition modes reported by the hardware:

    oc exec -n openshift-amd-gpu "$DCM_POD" -- amd-smi static --partition

This output should confirm Core Partitioned X-celerator (CPX) compute partitioning paired with Non-Uniform Memory Access (NUMA) Per Socket 4 (NPS4) memory partitioning on each physical GPU. Example output on an MI300X system:

    GPU: 0
        PARTITION:
            ACCELERATOR_PARTITION: CPX
            MEMORY_PARTITION: NPS4
            PARTITION_ID: 0

    GPU: 1
        PARTITION:
            ACCELERATOR_PARTITION: N/A
            MEMORY_PARTITION: N/A
            PARTITION_ID: 1

    ... TRUNCATED ...

    GPU: 56
        PARTITION:
            ACCELERATOR_PARTITION: CPX
            MEMORY_PARTITION: NPS4
            PARTITION_ID: 0

    GPU: 57
        PARTITION:
            ACCELERATOR_PARTITION: N/A
            MEMORY_PARTITION: N/A
            PARTITION_ID: 1

    ... TRUNCATED ...

    GPU: 63
        PARTITION:
            ACCELERATOR_PARTITION: N/A
            MEMORY_PARTITION: N/A
            PARTITION_ID: 7
Note
In a CPX + NPS4 configuration on an MI300X system, only PARTITION_ID: 0 reports the active compute and memory mode; the remaining IDs show N/A.
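Because most entries report N/A, scanning 64 records by eye is error-prone. The following is a minimal sketch of a filter that summarizes the partition output. It assumes that amd-smi prints one field per line, as in the example output, and `partition_modes` is a hypothetical helper name, not part of amd-smi or oc.

```shell
# Hypothetical helper (not part of amd-smi): reads the text output of
# `amd-smi static --partition` on stdin and prints each distinct
# "compute/memory" mode pair, skipping entries that report N/A.
# Assumes one "KEY: value" field per line, as in the example output.
partition_modes() {
  awk '
    $1 == "ACCELERATOR_PARTITION:" { acc = $2 }
    $1 == "MEMORY_PARTITION:" {
      if (acc != "N/A" && $2 != "N/A") print acc "/" $2
    }
  ' | sort -u
}

# Usage (assumes DCM_POD is set as in the earlier step):
#   oc exec -n openshift-amd-gpu "$DCM_POD" -- amd-smi static --partition | partition_modes
```

If every physical GPU uses the same profile, the filter should print a single pair such as CPX/NPS4. More than one line would suggest that the GPUs are not partitioned uniformly.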
Success! You’ve verified that the 8 physical GPUs now appear as 64 logical devices. The hardware is partitioned and ready for workloads to be scheduled.
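The device count can also be checked with a script instead of by reading the listing. The sketch below multiplies the physical GPU count by the partitions per GPU (8 × 8 = 64 in this lesson) and compares the result with the number of GPU entries in the listing. Both function names are hypothetical, and the sketch assumes each device record starts with a `GPU: <index>` line, as in the example output.

```shell
# Hypothetical helper: counts "GPU: <index>" records in amd-smi listing
# text read from stdin. Assumes each record starts on its own line.
count_logical_gpus() {
  awk '/^[[:space:]]*GPU:[[:space:]]*[0-9]/ { n++ } END { print n + 0 }'
}

# Hypothetical helper: compares the record count on stdin with
# <physical GPUs> x <partitions per GPU>. Returns non-zero on a mismatch.
check_gpu_count() {
  physical=$1
  per_gpu=$2
  expected=$((physical * per_gpu))
  actual=$(count_logical_gpus)
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual logical GPUs"
  else
    echo "MISMATCH: found $actual, expected $expected" >&2
    return 1
  fi
}

# Usage (assumes DCM_POD is set as in the earlier step):
#   oc exec -n openshift-amd-gpu "$DCM_POD" -- amd-smi static list | check_gpu_count 8 8
```

Passing the two factors as arguments keeps the check reusable for other profiles, where the number of partitions per GPU differs.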
In the next lesson, you will remove the node taint to open the host back up, allowing Red Hat OpenShift to start scheduling applications using your new hardware resources.