Partition AMD Instinct GPU accelerators via device config manager (DCM) in Red Hat OpenShift

Make your AI infrastructure more efficient by partitioning AMD Instinct GPUs via the device config manager in Red Hat OpenShift and validate your setup with a vLLM workload.

Now that your control-plane workloads are protected with tolerations, you can prepare the infrastructure components that control the accelerator devices. The AMD GPU Operator handles these day-2 hardware operations through a component called the device config manager (DCM).

By default, the DCM is not active. To enable it, define a ConfigMap that maps partition profiles to your hardware—for example, a single monolithic GPU (SPX) or multiple smaller slices (CPX)—so the operator knows which compute and memory layout to apply on each node.

Prerequisites

In this lesson, you will:

  • Create a ConfigMap with GPU partition profiles
  • Patch the DeviceConfig custom resource to enable and configure the DCM, which orchestrates GPU partitioning

Deploy DCM 

To ensure your infrastructure components are mapped correctly, define the DCM profiles in a ConfigMap and deploy it to the cluster.

  1. Validate the available compute and memory partitions that your target GPU node supports with the command below. Run this on the GPU node itself (e.g., via SSH or oc debug node).

    cat /sys/module/amdgpu/drivers/pci\:amdgpu/*/{available_compute_partition,available_memory_partition}

    On a standard AMD Instinct MI300X system with an 8-GPU layout, your output will look like this:

    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    SPX, DPX, QPX, CPX
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4
    NPS1, NPS2, NPS4

Note

If your MI300X series system does not list some of these partitions (for example, NPS2), update to the latest firmware and BIOS versions from your vendor. For the MI300X system, the minimum VBIOS version that supports partitioning is 022.040.003.042.
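
To confirm that every GPU offers a specific mode before you reference it in a profile, a small shell helper can scan the same sysfs files. The following is a minimal sketch, not part of the AMD tooling: the function name missing_partition_mode is hypothetical, and it assumes each file holds a single comma-separated list like the output shown above.

```shell
# Hypothetical helper (not shipped with the AMD GPU Operator).
# Usage: missing_partition_mode MODE FILE...
# Prints every file that does not list MODE and returns non-zero if any do not.
missing_partition_mode() {
  mode="$1"
  shift
  status=0
  for f in "$@"; do
    # Assumes the file holds a comma-separated list such as "SPX, DPX, QPX, CPX".
    # Split on commas, strip blanks, then look for an exact whole-line match.
    if ! tr ',' '\n' < "$f" | tr -d ' \t' | grep -qx "$mode"; then
      echo "$f does not list $mode"
      status=1
    fi
  done
  return "$status"
}
```

On the GPU node, after defining the function, you could check NPS4 support across all GPUs like this:

    missing_partition_mode NPS4 /sys/module/amdgpu/drivers/pci\:amdgpu/*/available_memory_partition

No output and a zero exit status mean every GPU lists the mode.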

 

  1. Create DCM profiles via ConfigMap. This ConfigMap defines three partition profiles: unpartitioned (spx-profile-nps1), dual-partition (dpx-profile-nps2), and maximum-partition (cpx-profile-nps4). Run the command below to create the ConfigMap on your cluster.

    cat <<EOF | tee gpu-partition-profiles.yaml | oc apply -f -
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-manager-config
      namespace: openshift-amd-gpu
    data:
      config.json: |
        {
          "gpu-config-profiles": {
            "spx-profile-nps1": {
              "skippedGPUs": {
                "ids": []
              },
              "profiles": [
                {
                  "computePartition": "SPX",
                  "memoryPartition": "NPS1",
                  "numGPUsAssigned": 8
                }
              ]
            },
            "dpx-profile-nps2": {
              "skippedGPUs": {
                "ids": []
              },
              "profiles": [
                {
                  "computePartition": "DPX",
                  "memoryPartition": "NPS2",
                  "numGPUsAssigned": 8
                }
              ]
            },
            "cpx-profile-nps4": {
              "skippedGPUs": {
                "ids": []
              },
              "profiles": [
                {
                  "computePartition": "CPX",
                  "memoryPartition": "NPS4",
                  "numGPUsAssigned": 8
                }
              ]
            }
          },
          "gpuClientSystemdServices": {
            "names": ["amd-metrics-exporter", "gpuagent"]
          }
        }
    EOF
  2. Patch the DeviceConfig custom resource to reference the ConfigMap and enable DCM.

    oc patch deviceconfig amdgpu-driver-install -n openshift-amd-gpu --type='merge' -p '{
      "spec": {
        "configManager": {
          "enable": true,
          "image": "docker.io/rocm/device-config-manager:v1.4.1",
          "imagePullPolicy": "IfNotPresent",
          "config": {
            "name": "config-manager-config"
          }
        }
      }
    }'
  3. (Optional) To see which DCM versions are publicly available, for example before changing the image tag used in the previous step, use Skopeo to list the tags published by AMD.

    skopeo list-tags docker://docker.io/rocm/device-config-manager

    Example output:

    {
        "Repository": "docker.io/rocm/device-config-manager",
        "Tags": [
            "v1.3.0",
            "v1.3.0-beta.0",
            "v1.3.1",
            "v1.3.1-beta.0",
            "v1.4.0",
            "v1.4.0-beta.0",
            "v1.4.1"
        ]
    }
  4. Wait for the DCM Pod to be ready:

    oc wait --for=condition=ready pod \
      -l app.kubernetes.io/name=device-config-manager \
      -n openshift-amd-gpu \
      --timeout=300s
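
If the wait times out, it can help to confirm that the patch actually reached the DeviceConfig. The following is a rough sketch rather than an official procedure: dcm_enabled and dcm_image are hypothetical helper names, and the resource name, namespace, and field paths are taken from the patch command above.

```shell
# Sketch only: helper names are hypothetical; resource names come from this lesson.
NS=openshift-amd-gpu
DEVICECONFIG=amdgpu-driver-install

# Succeeds when spec.configManager.enable is true on the DeviceConfig.
dcm_enabled() {
  [ "$(oc get deviceconfig "$DEVICECONFIG" -n "$NS" \
      -o jsonpath='{.spec.configManager.enable}')" = "true" ]
}

# Prints the DCM image that the DeviceConfig currently references.
dcm_image() {
  oc get deviceconfig "$DEVICECONFIG" -n "$NS" \
    -o jsonpath='{.spec.configManager.image}'
}
```

After defining the functions in a shell that is logged in to the cluster, you could run:

    dcm_enabled && echo "DCM enabled with image: $(dcm_image)"

If nothing is printed, the DeviceConfig does not have the configManager section enabled, and the patch step is worth repeating.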

You have verified your GPU's available partition modes, created the partition profiles, and enabled DCM. 

In the next lesson, you will taint and label the GPU node to trigger the partitioning process.
