As artificial intelligence and machine learning (AI/ML) workloads continue to grow, so does the demand for powerful, efficient hardware accelerators such as GPUs. In this article, we explore how to integrate and use AMD GPUs in Red Hat OpenShift AI for model serving. Specifically, we'll dive into how to set up and configure the AMD Instinct MI300X GPU with KServe in OpenShift AI.
Note
This is the third article in our series covering OpenShift AI capabilities on various AI accelerators. Catch up on the other parts:
AMD GPU devices used
For this tutorial, we will focus on AMD's MI300X GPU, a powerful device designed to accelerate machine learning workloads.
Create an accelerator profile in OpenShift AI
To begin, you must create an accelerator profile that tells OpenShift AI about the AMD GPU. This profile will ensure that the system recognizes and utilizes the AMD GPU effectively for machine learning tasks.
You can use the following YAML file to create an accelerator profile:
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: amd-gpu
  namespace: redhat-ods-applications
spec:
  displayName: AMD GPU
  enabled: true
  identifier: amd.com/gpu
  tolerations:
    - effect: NoSchedule
      key: amd.com/gpu
      operator: Exists
Parameters:
- displayName: The display name of the accelerator, i.e., AMD GPU.
- identifier: The key (amd.com/gpu) used to mark workloads that can be scheduled on AMD GPUs.
- tolerations: Allows the workload to be scheduled on OpenShift nodes labeled with AMD GPU resources.
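If you prefer the command line to the dashboard, you can create the profile with oc. The commands below are a minimal sketch; the file name amd-gpu-accelerator-profile.yaml is just an example name for the manifest above.
# Create the accelerator profile from the manifest above (file name is an example).
oc apply -f amd-gpu-accelerator-profile.yaml

# Confirm that the profile exists in the OpenShift AI applications namespace.
oc get acceleratorprofiles.dashboard.opendatahub.io -n redhat-ods-applications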
Select the accelerator profile in the dashboard
Once the accelerator profile is created, you can select this profile in the OpenShift AI dashboard to leverage AMD GPUs for your workloads. Ensure that the profile is enabled and available for selection.
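It is also worth confirming that your worker nodes actually advertise the amd.com/gpu resource before scheduling workloads. The quick check below assumes the AMD GPU Operator (or an equivalent device plug-in) is already installed and exposing the devices:
# Show each node name together with any amd.com/gpu capacity/allocatable entries.
oc describe nodes | grep -E 'Name:|amd.com/gpu'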
Configure KServe serving runtimes with AMD GPUs
In OpenShift AI, vLLM is supported only on the single-model serving platform, which is based on KServe.
Let's configure serving runtimes for deploying machine learning models with KServe, a model serving framework that makes it easy to deploy and manage models on Kubernetes clusters.
HTTP is simpler and widely used for general web-based API requests, while gRPC offers better performance with lower latency and is ideal for high-throughput, real-time applications. You might choose HTTP for ease of integration or gRPC for performance efficiency, depending on your specific use case.
HTTP ServingRuntime with AMD GPU
To deploy models over HTTP using the AMD GPU, you can use the following ServingRuntime configuration:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["amd.com/gpu"]'
    openshift.io/display-name: vLLM AMD HTTP ServingRuntime for KServe
  name: vllm-amd-http-runtime
spec:
  builtInAdapter:
    modelLoadingTimeoutMillis: 90000
  containers:
    - args:
        - '--port=8080'
        - '--model=/mnt/models'
        - '--served-model-name={{.Name}}'
        - '--distributed-executor-backend=mp'
        - '--chat-template=/app/data/template/template_chatml.jinja'
      command:
        - python3
        - '-m'
        - vllm.entrypoints.openai.api_server
      image: 'quay.io/modh/vllm@sha256:2e7f97b69d6e0aa7366ee6a841a7e709829136a143608bee859b1fe700c36d31'
      name: kserve-container
      ports:
        - containerPort: 8080
          name: http1
          protocol: TCP
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: pytorch
gRPC ServingRuntime with AMD GPU
For deploying models using gRPC, modify the ServingRuntime configuration as follows:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["amd.com/gpu"]'
    openshift.io/display-name: vLLM AMD GRPC ServingRuntime for KServe
  name: vllm-amd-grpc-runtime
  namespace: tgismodel-granite-8b-code
spec:
  builtInAdapter:
    modelLoadingTimeoutMillis: 90000
  containers:
    - args:
        - '--port=8080'
        - '--model=/mnt/models'
        - '--served-model-name={{.Name}}'
        - '--distributed-executor-backend=mp'
        - '--chat-template=/app/data/template/template_chatml.jinja'
      command:
        - python3
        - '-m'
        - vllm_tgis_adapter
      image: 'quay.io/modh/vllm@sha256:2e7f97b69d6e0aa7366ee6a841a7e709829136a143608bee859b1fe700c36d31'
      name: kserve-container
      ports:
        - containerPort: 8033
          name: h2c
          protocol: TCP
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: pytorch
Parameters:
- supportedModelFormats: The above configurations support PyTorch models, but you can modify this based on your model format.
- image: As part of the developer preview for AMD GPU support in the OpenShift AI serving stack, Red Hat published a container image at quay.io/modh/vllm@sha256:2e7f97b69d6e0aa7366ee6a841a7e709829136a143608bee859b1fe700c36d31. In this example, we are using vLLM for LLM inference and serving.
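You can create both runtimes with oc, as in the sketch below. The file names are placeholders for wherever you saved the manifests, and the namespace is the one used in the gRPC example; adjust both to match your project.
# File names and namespace are examples; adjust them to your environment.
oc apply -f vllm-amd-http-runtime.yaml -n tgismodel-granite-8b-code
oc apply -f vllm-amd-grpc-runtime.yaml -n tgismodel-granite-8b-code

# Verify that both ServingRuntimes were created.
oc get servingruntimes -n tgismodel-granite-8b-code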
Configure inference services
Once your serving runtimes are set up, you can create an InferenceService for serverless or raw deployment modes. Below are the configurations for both modes.
Serverless deployment automatically scales based on demand and is ideal for dynamic workloads with fluctuating traffic, while raw deployment offers more control over resource management and is better suited for stable, predictable workloads requiring fine-tuned configurations.
Serverless InferenceService
To create an inference service in serverless mode, use the following InferenceService configuration:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: '3000'
    serving.knative.openshift.io/enablePassthrough: 'true'
    serving.kserve.io/deploymentMode: Serverless
    sidecar.istio.io/inject: 'true'
  name: granite-8b-code
  namespace: tgismodel-granite-8b-code
spec:
  predictor:
    minReplicas: 1
    model:
      env:
        - name: HF_HUB_CACHE
          value: /tmp
        - name: TRANSFORMERS_CACHE
          value: $(HF_HUB_CACHE)
        - name: DTYPE
          value: float16
      modelFormat:
        name: pytorch
      name: ''
      resources:
        limits:
          amd.com/gpu: '1'
        requests:
          memory: 40Gi
      runtime: vllm-runtime
      storageUri: 's3://ods-ci-wisdom/granite-8b-code-base'
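Note that the runtime field must match the name of a ServingRuntime available to the namespace (for example, vllm-amd-http-runtime from the HTTP configuration above). Once the service is created, you can wait for it to become ready and send a test request. The sketch below assumes the HTTP (OpenAI-compatible) runtime; the inference URL, served model name, and any authentication token will depend on your cluster.
# Wait for the InferenceService to report READY and note its URL.
oc get inferenceservice granite-8b-code -n tgismodel-granite-8b-code

# Send a test completion request to the vLLM OpenAI-compatible endpoint.
# Replace <inference-url> with the URL from the previous command and add an
# Authorization header if token authentication is enabled.
curl -k <inference-url>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "granite-8b-code", "prompt": "def fibonacci(n):", "max_tokens": 64}'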
Raw deployment InferenceService
To create an inference service for raw deployment, modify the InferenceService configuration as follows:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: '3000'
    serving.knative.openshift.io/enablePassthrough: 'true'
    serving.kserve.io/deploymentMode: RawDeployment
    sidecar.istio.io/inject: 'true'
  name: granite-8b-code
  namespace: tgismodel-granite-8b-code
spec:
  predictor:
    minReplicas: 1
    model:
      env:
        - name: HF_HUB_CACHE
          value: /tmp
        - name: TRANSFORMERS_CACHE
          value: $(HF_HUB_CACHE)
        - name: DTYPE
          value: float16
      modelFormat:
        name: pytorch
      name: ''
      resources:
        limits:
          amd.com/gpu: '1'
        requests:
          memory: 40Gi
      runtime: vllm-runtime
      storageUri: 's3://ods-ci-wisdom/granite-8b-code-base'
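A raw deployment does not get a Knative route, so a simple way to smoke-test it is to port-forward the predictor service and call it locally. This is a sketch that assumes KServe's usual <name>-predictor service naming and the HTTP runtime; check the actual service name and port with oc get svc before forwarding.
# Confirm the predictor pod is running and find the predictor service.
oc get pods -n tgismodel-granite-8b-code
oc get svc -n tgismodel-granite-8b-code

# Forward a local port to the predictor service (adjust the service name/port if they differ).
oc port-forward -n tgismodel-granite-8b-code svc/granite-8b-code-predictor 8080:80

# In another terminal, send a request through the forwarded port.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "granite-8b-code", "prompt": "Hello", "max_tokens": 32}'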
AMD runtime considerations
Compared to NVIDIA-based runtimes, the AMD runtime typically requires more memory to run efficiently. In the examples above, note that 40Gi of memory is requested; this can vary depending on the model complexity.
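If a larger model needs more headroom, you can raise the memory request on an existing InferenceService without recreating it. The patch below is only an example; the 64Gi value is an assumption, so size the request to your model.
# Example only: raise the predictor memory request to 64Gi for a larger model.
oc patch inferenceservice granite-8b-code -n tgismodel-granite-8b-code --type merge \
  -p '{"spec":{"predictor":{"model":{"resources":{"requests":{"memory":"64Gi"}}}}}}'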
Tested models
Here are some of the models we tested using AMD GPUs:
- granite-8b-code-base
- Meta-Llama-3.1-8B models
AMD GPU ServingRuntime image
As noted earlier, the serving runtime image for AMD GPUs is available at:
quay.io/opendatahub/vllm@sha256:3a84d90113cb8bfc9623a3b4e2a14d4e9263e2649b9e2e51babdbaf9c3a6b1c8
This image is tailored for serving models on AMD GPUs and includes optimizations for performance and compatibility.
Conclusion
Integrating AMD GPUs into your Red Hat OpenShift AI environment offers an efficient way to accelerate AI/ML workloads. By setting up the proper accelerator profiles, serving runtimes, and inference services, you can unlock the full potential of AMD’s MI300X GPUs. Whether you're deploying models via HTTP or gRPC, serverless or raw deployment modes, the flexibility of KServe with AMD GPUs ensures smooth, scalable AI model serving.
Happy GPU computing!
For more information on Red Hat OpenShift AI, visit the OpenShift AI product page.