Partition AMD Instinct GPU accelerators via device config manager (DCM) in Red Hat OpenShift

Make your AI infrastructure more efficient by partitioning AMD Instinct GPUs via the device config manager in Red Hat OpenShift and validate your setup with a vLLM workload.

Now that DCM has reported a successful partition, we need to verify that the system recognizes the new GPU devices.

To inspect the low-level hardware changes without leaving the cluster environment, run commands directly inside the device config manager (DCM) Pod. In this lesson, you will use the built-in AMD System Management Interface (amd-smi) tool to confirm that the physical accelerator has successfully been partitioned according to the applied profile.

In this lesson, you will:

  • Use the user-space amd-smi tool inside the DCM Pod to verify that DCM has successfully partitioned the GPUs according to the profile.

Verify GPU partitioning 

Now that you have tainted and labeled the GPU node, get the DCM Pod name so that you can verify that the GPU is partitioned and ready for workloads.

  1. Get the DCM Pod name:

    DCM_POD=$(oc get pods -n openshift-amd-gpu \
      -l app.kubernetes.io/name=device-config-manager \
      -o jsonpath='{.items[0].metadata.name}')
  2. With the Pod name saved, use oc exec to list the GPU devices and verify partitioning with amd-smi:

    oc exec -n openshift-amd-gpu $DCM_POD -- amd-smi list

    On a successfully partitioned AMD Instinct MI300X host configured for maximum density, the output lists 64 logical GPU devices (indices 0 through 63) instead of the original 8 physical GPUs: 8 partitions (PARTITION_ID 0 through 7) on each physical GPU. Example output on an MI300X system:

    GPU: 0
        BDF: 0000:1b:00.0
        UUID: 09ff74a1-0000-1000-802d-0540b39d66d6
        KFD_ID: 3771
        NODE_ID: 2
        PARTITION_ID: 0
    
    GPU: 1
        BDF: 0000:1b:00.1
        UUID: 04ff74a1-0000-1000-80b7-90c218050861
        KFD_ID: 62138
        NODE_ID: 3
        PARTITION_ID: 1
    
    ... TRUNCATED ...
    
    GPU: 62
        BDF: 0000:dd:00.6
        UUID: 5cff74a1-0000-1000-80e7-da6c49d7b176
        KFD_ID: 32106
        NODE_ID: 64
        PARTITION_ID: 6
    
    GPU: 63
        BDF: 0000:dd:00.7
        UUID: 66ff74a1-0000-1000-809a-675c73e66337
        KFD_ID: 33131
        NODE_ID: 65
        PARTITION_ID: 7
  3. Next, confirm the specific compute and memory partition modes reported by the hardware:

    oc exec -n openshift-amd-gpu $DCM_POD -- amd-smi static --partition

    This output should confirm Core Partition X (CPX) compute partitioning paired with Non-Uniform Memory Access (NUMA) Per Socket 4 (NPS4) memory partitioning on each physical GPU. Example output on an MI300X system:

    GPU: 0
        PARTITION:
            ACCELERATOR_PARTITION: CPX
            MEMORY_PARTITION: NPS4
            PARTITION_ID: 0
    
    GPU: 1
        PARTITION:
            ACCELERATOR_PARTITION: N/A
            MEMORY_PARTITION: N/A
            PARTITION_ID: 1
    
    ... TRUNCATED ...
    
    GPU: 56
        PARTITION:
            ACCELERATOR_PARTITION: CPX
            MEMORY_PARTITION: NPS4
            PARTITION_ID: 0
    
    GPU: 57
        PARTITION:
            ACCELERATOR_PARTITION: N/A
            MEMORY_PARTITION: N/A
            PARTITION_ID: 1
    
    ... TRUNCATED ...
    
    GPU: 63
        PARTITION:
            ACCELERATOR_PARTITION: N/A
            MEMORY_PARTITION: N/A
            PARTITION_ID: 7
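
The device count from step 2 can also be checked mechanically rather than by eye. The following is a minimal shell sketch, not part of the original lesson: count_logical_gpus and check_gpu_count are hypothetical helper names, and the script assumes that each device block in the listing begins with a line of the form "GPU: <index>", as in the example output above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the device listing from step 2.

# Count logical GPU entries in listing text read from stdin.
# Assumption: each device block starts with a "GPU: <index>" line.
count_logical_gpus() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the function usable under "set -e".
  grep -c '^[[:space:]]*GPU:[[:space:]]*[0-9]' || true
}

# Compare the observed device count on stdin with an expected value.
check_gpu_count() {
  local expected="$1" actual
  actual="$(count_logical_gpus)"
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: found $actual logical GPUs"
  else
    echo "MISMATCH: expected $expected logical GPUs, found $actual" >&2
    return 1
  fi
}

# Usage sketch: pipe the device listing from step 2 into the check, e.g.
#   <step 2 oc exec command> | check_gpu_count 64
```

The expected value of 64 applies to the 8-GPU MI300X host described in this lesson; a host with a different GPU count or partition profile would need a different number.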

Note

In a CPX + NPS4 configuration on an MI300X system, only the first partition of each physical GPU (PARTITION_ID: 0) reports the active compute and memory modes; the remaining partitions (PARTITION_ID: 1 through 7) show N/A.
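
Because only the PARTITION_ID: 0 entries carry the mode information, a scripted check needs to look at those entries alone. The sketch below is one possible way to do that and is not part of the original lesson: check_partition_modes is a hypothetical helper name, and the parsing assumes that the fields appear in the same order as in the example output above (ACCELERATOR_PARTITION and MEMORY_PARTITION before PARTITION_ID within each GPU block).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read partition report text on stdin and check that
# every PARTITION_ID 0 entry reports the expected compute and memory modes.
# Assumption: field order within each GPU block matches the example output.
check_partition_modes() {
  local want_accel="$1" want_mem="$2"
  awk -v want_accel="$want_accel" -v want_mem="$want_mem" '
    $1 == "GPU:"                   { gpu = $2 }
    $1 == "ACCELERATOR_PARTITION:" { accel = $2 }
    $1 == "MEMORY_PARTITION:"      { mem = $2 }
    $1 == "PARTITION_ID:" && $2 == 0 {
      # Only the first partition of each physical GPU reports the modes.
      primaries++
      if (accel != want_accel || mem != want_mem) {
        printf "GPU %s: got %s/%s, want %s/%s\n", gpu, accel, mem, want_accel, want_mem
        bad++
      }
    }
    END {
      printf "primary partitions checked: %d, mismatches: %d\n", primaries, bad
      # Fail on any mismatch, or if no PARTITION_ID 0 entry was seen at all.
      exit ((bad > 0 || primaries == 0) ? 1 : 0)
    }
  '
}

# Usage sketch, assuming DCM_POD is set as in step 1:
#   oc exec -n openshift-amd-gpu "$DCM_POD" -- amd-smi static --partition \
#     | check_partition_modes CPX NPS4
```

On the 8-GPU host described here, a passing run would be expected to report 8 primary partitions checked, one per physical GPU.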

 

Success! You’ve verified that the 8 physical GPUs are now presented as 64 logical devices. The hardware is partitioned and ready for workloads to be scheduled.

In the next lesson, you will remove the node taint to open the host back up, allowing Red Hat OpenShift to start scheduling applications using your new hardware resources.
