Multi-Instance GPU (MIG) is a technology provided by NVIDIA to expand the performance and value of some of its GPU products. It can partition a GPU into multiple instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. This gives administrators the ability to support every workload, small or large, ensuring quality of service (QoS) while extending accelerated computing resources to all users.
Kubernetes is an open source system for automating the deployment, scaling, and management of containerized applications. Kubernetes provides access to specialized hardware such as GPUs through its device plug-in framework. The NVIDIA GPU Operator uses the Kubernetes operator framework to automate the management of NVIDIA software components for GPU provisioning, including MIG.
This article explains the foundations of GPU MIG partitioning, the efficiency challenges of different partitioning approaches, and how to improve efficiency with tools like MIG-Adapter.
Mixed strategy and MIG profiles
The mixed strategy offers greater flexibility, so let's examine how these profiles are created and used in the context of GPU partitioning.
GPU partitioning and strategy
NVIDIA allows users to customize MIG partitions based on their needs. Using NVIDIA A100 (40GB) as an example, we will explain the flexibility and restrictions in creating MIG partitions. While other GPU types might have different numbers of compute or memory slices, the partitioning rules remain the same.
There are 7 compute slices and 8 memory slices on an NVIDIA A100 GPU; the partitioning process is really about how to divide them. Figure 1 represents the physical layout of the A100 and can help you better understand the partitioning.
Figure 1: A100 GPU physical layout.
Creating a MIG partition combines compute and memory slices. To provide high performance, the GPU driver maps partitions tightly to the physical layout, which imposes certain restrictions on partitioning. For example:
- A user can only create a MIG partition with either 1, 2, 3, 4, or all 7 compute slices.
- All compute slices in a partition must be physically adjacent.
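As a rough sketch, the slice-budget part of these rules can be checked with a few lines of Python (the slice costs below are taken from the A100 40GB profile table; this illustrates the budget arithmetic, not the driver's actual placement algorithm):

```python
# Sketch: validate a proposed A100 (40GB) MIG layout against the slice budget.
# Each profile consumes (compute slices, memory slices).
PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}

COMPUTE_SLICES = 7
MEMORY_SLICES = 8

def fits(layout: dict) -> bool:
    """Return True if the requested partitions fit the GPU's slice budget.

    This only checks slice totals; the driver additionally requires the
    compute slices of each partition to be physically adjacent.
    """
    compute = sum(PROFILES[p][0] * n for p, n in layout.items())
    memory = sum(PROFILES[p][1] * n for p, n in layout.items())
    return compute <= COMPUTE_SLICES and memory <= MEMORY_SLICES

print(fits({"1g.5gb": 5, "2g.10gb": 1}))  # True: 7 compute, 7 memory slices
print(fits({"3g.20gb": 2, "2g.10gb": 1}))  # False: needs 8 compute slices
```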
In practice, there are 2 partitioning strategies: single or mixed.
In a single strategy, all partitions in the GPU must be the same. That means the user gets 4 options (Figure 2) with a single strategy.
Figure 2: Single partitioning strategy options.
Mixed strategy provides greater customization flexibility. Figure 3 shows the 19 supported configurations a user can use as profiles for partitioning.
Figure 3: Mixed partitioning strategy configurations.
From a user’s point of view, some configurations overlap (e.g. #15–#18 all result in 1x2g.10gb+5x1g.5gb partitions). The NVIDIA GPU Operator simplifies this by offering profiles.
GPU Operator and MIG Profiles
The GPU Operator deploys the MIG Manager to manage MIG configuration on nodes in your Kubernetes cluster. It supports both single and mixed strategies. With a mixed strategy, users can label nodes in Kubernetes with profile names defined in a ConfigMap. For example, here is a profile that applies config #15–#18 to all GPUs on the desired nodes:
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      five-1g-one-2g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 5
            "2g.10gb": 1
The GPU Operator already provides many useful profiles in its default ConfigMap. Review them before defining your own.
Efficiency challenges
In Kubernetes, different MIG partition types are registered as different resource types, and users must explicitly request the type they want in the Pod's resource requirements. A cluster administrator needs to understand how demand is distributed across these resource types in order to plan GPU partitions accordingly. This creates many efficiency challenges.
Single strategy
Single strategy exposes a single type of MIG device for all GPUs on the same node. In this scenario:
- Single strategy cannot make full use of the 4g.20gb type; it can only expose a single 1x4g.20gb partition per GPU, wasting the remaining 3 compute slices and 4 memory slices.
- It wastes certain compute and memory slices because 7 compute slices and 8 memory slices do not divide evenly (for example, a 3x2g.10gb partition wastes 1 compute slice and 2 memory slices).
- It is hard to match certain distributions of resource types.
Single strategy is not widely used in production because of these limitations.
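To make the waste concrete, here is a small sketch that computes the leftover slices for each single-strategy option, using the A100 (40GB) slice costs (an illustration of the arithmetic above, not an NVIDIA tool):

```python
# Sketch: leftover compute/memory slices for each A100 (40GB) single-strategy option.
PROFILES = {  # profile -> (compute slices, memory slices)
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
COMPUTE, MEMORY = 7, 8

for profile, (c, m) in PROFILES.items():
    count = min(COMPUTE // c, MEMORY // m)  # identical partitions that fit
    wasted_c = COMPUTE - count * c
    wasted_m = MEMORY - count * m
    print(f"{count}x{profile}: wastes {wasted_c} compute, {wasted_m} memory slice(s)")
```

Running this reproduces the numbers quoted above: 3x2g.10gb wastes 1 compute and 2 memory slices, and a lone 4g.20gb wastes 3 compute and 4 memory slices.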
Mixed strategy
Mixed strategy provides greater flexibility and efficiency in partitioning. Cluster administrators can leverage every compute/memory slice in the GPU to match a certain distribution of resource requirements. This is the most popular use case for MIG.
However, the challenges come from the unpredictable resource requirements of the workloads in the cluster. A large, multi-node, multi-GPU cluster is good for sharing resources across different types of workloads, but the distribution of requested resource types changes dynamically. It often happens that workloads are left pending for a certain MIG type while resources are available, but in unmatched MIG types.
Dynamic resource allocation
Partitioning GPUs on nodes and registering MIG devices as resource types for workload resource requirements is also known as static resource allocation. There is separate work on dynamic resource allocation using ResourceClaim with ResourceClaimTemplate, DeviceClass, and so on.
With dynamic resource allocation, GPUs are not partitioned until a Pod with resource requirements comes to the cluster. The device driver will only create partitions to fulfill immediate Pod resource requirements and keep remaining compute and memory slices untouched for future workloads.
In theory, this approach satisfies the dynamic distribution of resource requirement types, but it has some critical technical problems with the tight relationship to GPU physical layout:
- Additional scheduling delays due to dynamic partitioning and back-and-forth communication between the scheduler and nodes; the scheduler is on the critical path in the cluster and is very sensitive to per-workload delays.
- GPU resource fragmentation; a MIG partition with more than 1 compute slice must be allocated on adjacent slices. Frequently allocating and freeing MIGs leaves gaps between the available slices and can render the GPU unable to fulfill certain resource requirements.
- Increased Pod start time; creating a MIG partition needs more lead time, which matters for scale-to-zero use cases.
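The adjacency constraint behind the fragmentation problem can be illustrated with a toy sketch (the free-slice indices are hypothetical):

```python
# Sketch: why freeing MIGs fragments the GPU. A partition needs *adjacent*
# compute slices, so four free slices may still be unable to host a 2-slice MIG.
COMPUTE_SLICES = 7

def can_place(free_slices: set, size: int) -> bool:
    """True if `size` physically adjacent compute slices are all free."""
    return any(all(s + i in free_slices for i in range(size))
               for s in range(COMPUTE_SLICES - size + 1))

fragmented = {0, 2, 4, 6}   # 4 slices free, but none adjacent
contiguous = {0, 1, 2, 3}   # 4 slices free, all adjacent

print(can_place(fragmented, 2))  # False: no two adjacent free slices
print(can_place(contiguous, 2))  # True
```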
Because of these issues, dynamic resource allocation with NVIDIA MIG is still not ready for production. It is hard to make progress with the tight relationship between MIG partitions and the GPU physical layout in the driver. So, we’re introducing MIG Adapter to improve cluster resource efficiency with a MIG mixed strategy.
How MIG Adapter works
In order to leverage free but unmatched MIG types in the system for pending workloads, MIG Adapter temporarily boosts the workload resource requirements to match free MIG resources without changing the GPU partitioning. It defines chains of compatible resource types, then performs Borrow and Return actions for workloads.
Chains of compatible resource types
MIG Adapter defines a one-way chain to describe the compatibility of MIG types. For example, for a cluster with all A100 GPUs:
1g5gb → 2g10gb → 3g20gb → 4g20gb
If the cluster uses A100 and H100 together, the chain is:
1g5gb → 1g10gb → 2g10gb → 2g20gb → 3g20gb → 3g40gb → 4g20gb → 4g40gb
These chains enable MIG Adapter to find replacements when a certain type is not available.
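A minimal sketch of the chain lookup (the data structures and function name are illustrative, not MIG Adapter's actual API):

```python
# Sketch: walk the one-way compatibility chain to find a substitute MIG type
# for a pending request. Larger types further down the chain are compatible.
CHAIN = ["1g5gb", "2g10gb", "3g20gb", "4g20gb"]  # all-A100 cluster

def find_substitute(requested, free):
    """Return the first type at or after `requested` in the chain that has
    free capacity, or None if nothing compatible is available.

    free: dict mapping MIG type -> number of free devices in the cluster.
    """
    start = CHAIN.index(requested)
    for mig_type in CHAIN[start:]:
        if free.get(mig_type, 0) > 0:
            return mig_type
    return None

# A pod pending on 1g5gb can borrow a free 3g20gb:
print(find_substitute("1g5gb", {"1g5gb": 0, "2g10gb": 0, "3g20gb": 1}))  # 3g20gb
```

Note that the chain is one-way: a request for 2g10gb can be satisfied by a free 3g20gb, but never by a smaller 1g5gb.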
Borrow resource for pending workloads
MIG Adapter watches for pending pods in the cluster and identifies those waiting for MIG resources. Once a pending pod is identified, it extracts the resource requirements from the Pod spec and looks for available resources through the chain of compatible resources. That means if a pod is pending for 1g5gb MIG, MIG Adapter will try to find 2g10gb, 3g20gb, etc., available in the cluster.
Once an available resource is identified, MIG Adapter restarts the pending pod and patches the new pod with the elevated resource requirements via its mutating admission webhook.
Return resource by restoring the workload requests
MIG Adapter also watches for terminated or succeeded Pods in the cluster because they might free up MIG resources. Once a MIG resource becomes available, MIG Adapter goes through all the workloads currently borrowing resources and identifies the one that would return the most.
For example, if a Pod using 1g5gb succeeds, and there are Pods bumped from 1g5gb to 2g10gb as well as Pods bumped from 1g5gb to 3g20gb, the latter will be restored to their original resource requirements.
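The "returns the most" selection can be sketched as picking the borrower whose elevated type is furthest along the chain from its original request (again an illustration, not the actual implementation):

```python
# Sketch: when a MIG frees up, restore the borrower that gives back the most,
# measured as the distance between borrowed and original type along the chain.
CHAIN = ["1g5gb", "2g10gb", "3g20gb", "4g20gb"]

def pick_borrower_to_restore(borrowers):
    """borrowers: list of (original_type, borrowed_type) pairs.
    Returns the pair with the largest chain distance."""
    return max(borrowers,
               key=lambda b: CHAIN.index(b[1]) - CHAIN.index(b[0]))

# The example above: the Pod bumped to 3g20gb is restored first.
borrowers = [("1g5gb", "2g10gb"), ("1g5gb", "3g20gb")]
print(pick_borrower_to_restore(borrowers))  # ('1g5gb', '3g20gb')
```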
Future improvements
On top of the basic capabilities just described, you can introduce policies to:
- Restrict resource types allowed to be used for adapting.
- Restrict maximum room for adapting.
- Enable/disable restore/preemption for certain workloads.
Beyond that, the adapting concept can go beyond MIG. As long as a chain of compatibility can be defined, you should be able to adapt workloads to any compatible resource type.
How to use MIG-Adapter
MIG-Adapter is implemented with the Operator Framework using the Operator SDK. There are two ways to run it: locally or as a Deployment inside the cluster.
Run locally (recommended)
Technically, you can run the webhooks locally, but for this to work, you need to generate certificates for the webhook server and store them at /tmp/k8s-webhook-server/serving-certs/tls.{crt,key}. For more details about running webhook locally, refer to these instructions: Running and deploying the controller.
Some shell commands to assist certificate generation are kept here.
The MutatingWebhookConfiguration also needs to be updated with the generated certificates.
Run as a Deployment inside the cluster
Running the MIG Adapter as a Deployment inside the cluster is the same as deploying an Operator. For instructions on deploying MIG Adapter into a cluster, refer to the Operator SDK tutorial.
If the target cluster is an OpenShift cluster, refer to the documentation for injecting certificates.
Next steps
Now you know how to configure profiles for the NVIDIA GPU Operator to customize MIG partitions for your business needs, and how to leverage MIG Adapter to improve system efficiency. Go ahead and try it!