Partition AMD Instinct GPU accelerators via device config manager (DCM) in Red Hat OpenShift

Make your AI infrastructure more efficient by partitioning AMD Instinct GPUs via the device config manager in Red Hat OpenShift and validate your setup with a vLLM workload.

Now that your AMD Instinct accelerators are successfully partitioned into 64 devices and ready for scheduling, you need to verify that workloads can be scheduled onto the partitioned GPU resources.

To validate the newly partitioned setup, you will deploy a vLLM server to host a large language model. While the general process is similar to the official vLLM documentation, the following steps are specifically tailored for a Red Hat OpenShift environment using partitioned AMD Instinct GPUs.

Prerequisites:

In this lesson, you will:

  • Create storage, service account, and role-based access control (RBAC) resources for an inference server.
  • Deploy vLLM requesting a single partitioned GPU (amd.com/gpu: 1) resource.
  • Observe GPU memory and compute utilization on the partitioned device.

vLLM validation 

Validate the partitioned resources by deploying a single model on a single GPU partition.

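  1. Optional: Create a PersistentVolumeClaim (PVC) to store the model cache. This learning path uses LVMS as the storage backend. The Deployment in step 3 mounts this claim, so if you skip this step, remove the models volume and its volumeMount from the Deployment or replace the volume with an emptyDir.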

    cat <<EOF | tee 01_pvc.yaml | oc apply -f -
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: vllm-models
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: lvms-vg1
    EOF
    
  2. Optional: Create a Kubernetes Secret (only required for accessing gated models in Hugging Face). 

    cat <<EOF | tee 02_secret.yaml | oc apply -f -
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token
    type: Opaque
    stringData:
      token: <hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
    EOF
  3. Create a Deployment for vLLM. This learning path uses a small LLM, meta-llama/Llama-3.1-8B-Instruct, for validation.

    Note: The resources section explicitly requests exactly one logical partition (amd.com/gpu: "1"):

    cat <<EOF | tee 03_deployment.yaml | oc apply -f -
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: vllm
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: vllm-anyuid
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:openshift:scc:anyuid
    subjects:
      - kind: ServiceAccount
        name: vllm
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm
      labels:
        app: vllm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm
      template:
        metadata:
          labels:
            app: vllm
        spec:
          serviceAccountName: vllm
          containers:
            - name: vllm
              image: docker.io/rocm/vllm:latest
              command:
                - python3
                - -m
                - vllm.entrypoints.openai.api_server
                - --model
                - <model-name-goes-here>  # -> e.g. meta-llama/Llama-3.1-8B-Instruct
                - --dtype
                - float16
                - --max-model-len
                - "4096"
                - --tensor-parallel-size
                - "1"
                - --host
                - "0.0.0.0"
                - --port
                - "8000"
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token
                      key: token
                      optional: true  # the hf-token Secret from step 2 is optional
                - name: HF_HOME
                  value: /models
              ports:
                - containerPort: 8000
                  name: http
              resources:
                requests:
                  amd.com/gpu: "1"
                  cpu: "4"
                  memory: 16Gi
                limits:
                  amd.com/gpu: "1"
                  cpu: "8"
                  memory: 32Gi
              volumeMounts:
                - name: models
                  mountPath: /models
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: models
              persistentVolumeClaim:
                claimName: vllm-models
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 16Gi
          tolerations:
            - key: amd.com/gpu
              operator: Exists
              effect: NoSchedule
    EOF
  4. Optional: Create a Service and Route to expose the vLLM server to the outside world. Skip this step if you are not using a route to enable public access to the model. 

    cat <<EOF | tee 04_service_route.yaml | oc apply -f -
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-server
      labels:
        app: vllm
    spec:
      ports:
        - port: 8000
          targetPort: 8000
          name: http
      selector:
        app: vllm
    
    ---
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: vllm-server
      labels:
        app: vllm
    spec:
      to:
        kind: Service
        name: vllm-server
      port:
        targetPort: http
      tls:
        termination: edge
    EOF
  5. To test and validate the full Deployment, use the exposed API and the deployed monitoring service. First, store the route host name in a variable:

    VLLM_URL=$(oc get route vllm-server -o jsonpath='{.spec.host}')
  6. Test the API:

    curl -sk -X POST "https://${VLLM_URL}/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
      }' | jq .
    
    {
      "id": "chatcmpl-99b123e22a32d99e",
      "object": "chat.completion",
      "created": 1772549128,
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.\n\nAI technology is based on the principle of creating algorithms that can process data, identify patterns, and make decisions with minimal human intervention. This is achieved through various techniques such as machine learning, natural language processing, and computer vision.\n\nThere are several types of AI, including:\n\n1. **Narrow or Weak AI**: This type of AI is designed to perform a specific task, such as facial recognition, language translation, or playing chess. Narrow AI is trained on a specific dataset and is not capable of general reasoning or problem-solving.\n\n2. **General or Strong AI**: This type of AI is designed to perform any intellectual task that a human can. General AI is still in the realm of science fiction, but researchers are working towards creating AI systems that can learn and adapt like humans.\n\n3. **Superintelligence**: This type of AI is significantly more intelligent than the best human minds. Superintelligence is a hypothetical concept that is still being researched and debated.\n\nAI has many applications in various fields, including:\n\n1. **Virtual assistants**: AI-powered virtual assistants, such as Siri, Alexa, and Google Assistant, can perform tasks such as setting reminders, sending messages, and making phone calls.\n\n2. **Image recognition**: AI-powered image recognition systems can identify objects, people, and patterns in images.\n\n3. **Natural language processing**: AI-powered natural language processing systems can understand and generate human language, enabling applications such as language translation and chatbots.\n\n4. **Predictive analytics**: AI-powered predictive analytics systems can analyze data and make predictions about future events or trends.\n\n5. **Robotics**: AI-powered robots can perform tasks such as assembly, welding, and navigation.\n\nThe benefits of AI include:\n\n1. **Increased efficiency**: AI can automate tasks, freeing up human time and resources.\n\n2. **Improved accuracy**: AI can perform tasks with high accuracy and speed.\n\n3. **Enhanced decision-making**: AI can analyze large amounts of data and provide insights that can inform decision-making.\n\nHowever, AI also raises concerns about:\n\n1. **Job displacement**: AI may displace human workers in certain industries.\n\n2. **Bias and fairness**: AI systems can perpetuate biases and unfairness if they are trained on biased data.\n\n3. **Security and privacy**: AI systems can be vulnerable to cyber attacks and data breaches.\n\nOverall, AI has the potential to revolutionize many aspects of our lives, but it also requires careful consideration of its benefits and risks.",
            "refusal": null,
            "annotations": null,
            "audio": null,
            "function_call": null,
            "tool_calls": [],
            "reasoning": null,
            "reasoning_content": null
          },
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 39,
        "total_tokens": 582,
        "completion_tokens": 543,
        "prompt_tokens_details": null
      },
      "prompt_logprobs": null,
      "prompt_token_ids": null,
      "kv_transfer_params": null
    }
  7. Once the model is deployed and actively processing API requests, you can observe the immediate impact on your partitioned resources. The dashboard shows GPU 0 memory usage at 2.2 GB, consumed by the model weights and KV cache, while the remaining 63 partitions stay idle (Figure 1).

    Grafana dashboard showing a partitioned MI300X system serving a vLLM workload, where GPU 0 memory usage has jumped to 2.2 GB while the remaining 63 logical GPUs stay at ~20 MB, with a brief GPU utilization spike visible around 15:27 during inference.
    Figure 1: Dashboard view after deploying Llama-3.1-8B-Instruct on a single CPX partition. GPU 0 consumes 2.2 GB for model weights and KV cache while all other partitions remain idle.
  8. Observe the GPU compute utilization behavior. Because this is an inference workload, the compute utilization does not sustain a constantly high load. Instead, it experiences a sharp, temporary bump during the short period of time it takes to process the prompt and generate the response (Figure 2).

    Grafana dashboard with three panels showing GPU utilization over time: average GPU utilization across all GPUs remains near 0% with a small blip around 15:27; average memory utilization holds steady at 3.43%; and the GPU utilization min/max/avg panel shows a brief bell-curve spike peaking at ~12% max around 15:27 before returning to 0%.
    Figure 2: Inference workload compute profile. GPU utilization spikes briefly to ~12% during prompt processing and token generation, then drops back to idle, while memory utilization remains constant at 3.43% holding the model weights.
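
The scheduling and readiness checks behind steps 3 to 6 can also be scripted. The following is a minimal sketch, not part of the original procedure. The function names are hypothetical, while the resource names (vllm, app=vllm, amd.com/gpu) come from the manifests above. It assumes that oc is logged in to the cluster and that the vLLM server answers on its OpenAI-compatible /v1/models endpoint through the route.

```shell
# Sketch: helper functions for checking the vLLM Deployment from this lesson.
# All function names are illustrative (hypothetical), not part of any product CLI.

# Print the number of GPU partitions the vllm container is limited to.
vllm_gpu_limit() {
  oc get deployment vllm \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.amd\.com/gpu}'
}

# Print the name of the node that runs the first vllm pod.
vllm_node() {
  oc get pods -l app=vllm -o jsonpath='{.items[0].spec.nodeName}'
}

# Print how many amd.com/gpu resources the given node reports as allocatable.
node_gpu_allocatable() {
  oc get node "$1" -o jsonpath='{.status.allocatable.amd\.com/gpu}'
}

# Succeed only if the Deployment asks for exactly one partition.
check_single_partition() {
  local limit
  limit="$(vllm_gpu_limit)"
  if [ "$limit" = "1" ]; then
    echo "OK: vllm requests 1 partition"
  else
    echo "UNEXPECTED: vllm requests '${limit}' partitions" >&2
    return 1
  fi
}

# Poll https://<host>/v1/models until it answers or the attempts run out.
# Arguments: <route host> [attempts, default 30] [delay in seconds, default 10]
wait_for_api() {
  local host attempts delay i
  host="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -skf "https://${host}/v1/models" >/dev/null; then
      echo "API ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "API not ready after ${attempts} attempts" >&2
  return 1
}
```

After defining these functions in your shell, a possible sequence is to run oc rollout status deployment/vllm --timeout=15m, then check_single_partition, then node_gpu_allocatable "$(vllm_node)", and finally wait_for_api "$VLLM_URL" before sending the chat request. On the setup described in this learning path, the allocatable count would be expected to be 64.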


In this lesson, you deployed a single model on just one logical GPU partition. Even so, the result demonstrates the scaling potential of this configuration: because the Core Partition X (CPX) and NPS4 combination lets Kubernetes recognize 64 GPUs as allocatable resources, this setup could theoretically host 64 concurrent models on a single node.
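
The headroom arithmetic behind that statement can be sketched in a few lines of shell. This is only an illustration: the helper names are hypothetical, and the figures of eight physical accelerators with eight CPX partitions each are an assumption chosen to match the 64 allocatable devices mentioned in this lesson. It also assumes every model requests exactly one partition, as the Deployment above does.

```shell
# Sketch: partition arithmetic for a CPX-partitioned node.
# Helper names and the 8 x 8 layout are assumptions for illustration.

# logical_gpus <physical_gpus> <partitions_per_gpu>
# Print the number of logical GPUs the node would advertise.
logical_gpus() {
  echo $(( $1 * $2 ))
}

# free_partitions <allocatable> <in_use>
# Print how many single-partition workloads could still be scheduled.
free_partitions() {
  local free
  free=$(( $1 - $2 ))
  if [ "$free" -lt 0 ]; then
    free=0
  fi
  echo "$free"
}

logical_gpus 8 8       # prints 64
free_partitions 64 1   # prints 63 (one partition is held by the vllm pod)
```

Whether a given model actually fits depends on the memory available to each partition, so treat the result as an upper bound rather than a capacity guarantee.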

Learning path summary

The device config manager (DCM) on Red Hat OpenShift enables cluster administrators to maximize utilization of AMD Instinct accelerators. By transitioning to advanced partitioning profiles, cluster administrators can drastically increase multi-tenant workload density.

When planning your deployments, ensure you select a supported compute and memory pairing, such as the Core Partition X (CPX) compute mode combined with the NUMA Per Socket 4 (NPS4) memory mode used in this learning path.

By transforming the physical GPUs in a single node into up to 64 allocatable resources, you can achieve maximum concurrency, reduce idle resources, and increase AI workload density.

Acknowledgments

The author would like to acknowledge the thorough reviews of the following individuals, who have directly contributed to enhancing the quality of this learning path:

  • Brett Thurber — Director and Distinguished Engineer, Ecosystems Engineering, Red Hat
  • Ben Schmaus — Senior Principal Software Engineer, Ecosystems Engineering, Red Hat
  • Erwan Gallen — Senior Principal Product Manager, AI Business Unit, Red Hat