Your GPU has a split personality. On Monday morning, it idles while a tiny service waits for requests; by lunch, it’s pegged at 100% serving a single chunky model. What if the same GPU could flex between those extremes: running seven bite-size models before noon and then a full GPU workload after? That’s the promise of NVIDIA multi-instance GPU (MIG) paired with Red Hat OpenShift’s dynamic accelerator slicer operator.
In this post, we’ll take a tour of that world. We’ll start with a quick, human-friendly explanation of MIG. We'll show how the dynamic accelerator slicer turns "GPU partitions" into just-in-time, Kubernetes-native resources, and then spin up three live demos: from seven tiny models on one card, to two medium models, to a single full GPU workload. Along the way, you’ll see how to keep GPUs busy, teams isolated, and operations simple.
MIG explained without the jargon
Think of a large GPU as a high-rise. MIG (Multi-Instance GPU) lets you split that building into separate apartments with walls, doors, and their own utilities. An A100 40 GB card, for example, can become:
- 1g.5gb apartments for tiny workloads (you can fit seven)
- 3g.20gb apartments for mid-sized models
- 7g.40gb, a full-floor penthouse when one large tenant needs everything
Each apartment is isolated, with no noisy neighbors, and performance is predictable. The result: better utilization and safer multi-tenancy without the "who stole my GPU?" drama.
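If you've only ever touched GPUs through Kubernetes, here's roughly what that partitioning looks like at the node level with plain nvidia-smi. Treat it as a sketch: on OpenShift, the GPU Operator and the dynamic accelerator slicer handle these steps for you, so you never run them by hand.
# Sketch only: on OpenShift the GPU Operator and DAS automate all of this.
nvidia-smi -i 0 -mig 1         # enable MIG mode on GPU 0 (may require a GPU reset)
nvidia-smi mig -lgip           # list the GPU instance profiles the card supports
nvidia-smi mig -cgi 1g.5gb -C  # manually carve one 1g.5gb instance (DAS does this for you)
nvidia-smi -L                  # list the GPU and its MIG devices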
Why do it dynamically?
Static partitions go stale. Teams change, workloads spike, and idle slices collect dust. The dynamic accelerator slicer operator makes slicing ephemeral: your pod asks for a slice, the operator creates it right before the container starts, and removes it when the pod goes away. No SSH, no pets, no hand-carved layouts; just standard Kubernetes scheduling with right-sized GPU resources.
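In practice, the only GPU-specific line in the manifests you'll see below is a resource limit naming a MIG profile; everything else is an ordinary Deployment or Pod. A fragment, for orientation:
resources:
  limits:
    nvidia.com/mig-3g.20gb: 1   # DAS creates a 3g.20gb slice just before the container starts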
What you’ll need
You’re on OpenShift with nodes that support NVIDIA MIG, plus Node Feature Discovery and the NVIDIA GPU Operator installed. MIG should be enabled on your GPU nodes per the GPU operator docs. That’s it. No bespoke scripts or cluster snowflakes.
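A few sanity checks before moving on. The namespaces and labels below are the usual defaults for NFD and the GPU Operator on OpenShift; adjust them if your install differs:
kubectl get pods -n openshift-nfd                  # Node Feature Discovery is running
kubectl get pods -n nvidia-gpu-operator            # GPU Operator components are healthy
kubectl get nodes -l nvidia.com/mig.capable=true   # label applied by GPU feature discovery on MIG-capable nodes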
Install the dynamic accelerator slicer operator
Use the OpenShift web console to install cert-manager, then the NVIDIA GPU Operator and Node Feature Discovery, and finally the dynamic accelerator slicer operator. Create a DASOperator instance with defaults (emulation off), and wait for the operator pods to go green. For reference and deeper guidance, see the dynamic accelerator slicer operator documentation.
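A quick way to confirm the operator is healthy before deploying anything (the namespace shown is an assumption; use whichever one you installed the operator into):
kubectl get csv -n das-operator    # the operator's ClusterServiceVersion should report Succeeded
kubectl get pods -n das-operator   # operator pods should be Running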
One-time setup: Hugging Face token
Our examples pull models from Hugging Face. Secrets are namespace-scoped, so create one named huggingface-secret with key HF_TOKEN in the namespace where the demos will run (default here) and reuse it across all three:
# Replace <your_hf_token> with your Hugging Face access token (kubectl handles the base64 encoding)
kubectl create secret generic huggingface-secret \
  --from-literal=HF_TOKEN=<your_hf_token>
# Verify it exists
kubectl get secret huggingface-secret
A day in the life of a GPU: Three demos
We’ll show the entire spectrum, from many tiny models to one full GPU workload, on a single cluster. Each example uses vLLM for serving and requests a different MIG profile. Watch how the same GPU card shape-shifts to match what you deploy.
Demo 1: Seven tiny models, one card
Sometimes throughput beats raw horsepower. Here we run seven replicas of Gemma 3 270M, each requesting nvidia.com/mig-1g.5gb. OpenShift co-schedules them on the same node; the dynamic accelerator slicer carves out seven 1g.5gb slices just in time.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-270m
  labels:
    app: vllm-gemma-270m
spec:
  replicas: 7
  selector:
    matchLabels:
      app: vllm-gemma-270m
  template:
    metadata:
      labels:
        app: vllm-gemma-270m
    spec:
      restartPolicy: Always
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: vllm-gemma-270m
              topologyKey: kubernetes.io/hostname
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8003
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: HF_TOKEN
        command: ["/bin/bash"]
        args:
        - -c
        - |
          # Start vLLM server with Gemma 3 270M model
          # Fixed port so a single Service can target all replicas
          echo "Starting vLLM on port 8003 for pod $HOSTNAME"
          vllm serve google/gemma-3-270m --host 0.0.0.0 --port 8003 --max-model-len 1024 --gpu-memory-utilization 0.7
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1  # Dynamically created by DAS
          requests:
            memory: "2Gi"
            cpu: "0.5"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-gemma-270m
spec:
  selector:
    app: vllm-gemma-270m
  ports:
  - port: 8003
    targetPort: 8003
    name: http
Deploy and watch them land on the same node:
kubectl apply -f gemma.yaml
kubectl get pods -l app=vllm-gemma-270m -o wide
Actual output on the cluster:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vllm-gemma-270m-5d866f98db-2q6qv 1/1 Running 0 71m 10.129.2.56 harpatil000043jma-p5jcx-worker-f-7r4b8 <none> <none>
vllm-gemma-270m-5d866f98db-8hxj4 1/1 Running 0 71m 10.129.2.53 harpatil000043jma-p5jcx-worker-f-7r4b8 <none> <none>
vllm-gemma-270m-5d866f98db-b5rmx 1/1 Running 0 71m 10.129.2.59 harpatil000043jma-p5jcx-worker-f-7r4b8 <none> <none>
vllm-gemma-270m-5d866f98db-m28kx 1/1 Running 0 71m 10.129.2.55 harpatil000043jma-p5jcx-worker-f-7r4b8 <none> <none>
vllm-gemma-270m-5d866f98db-pbmhf 1/1 Running 0 71m 10.129.2.54 harpatil000043jma-p5jcx-worker-f-7r4b8 <none> <none>
vllm-gemma-270m-5d866f98db-vqg9p 1/1 Running 0 71m 10.129.2.58 harpatil000043jma-p5jcx-worker-f-7r4b8 <none> <none>
vllm-gemma-270m-5d866f98db-w8j7b 1/1 Running 0 71m 10.129.2.57 harpatil000043jma-p5jcx-worker-f-7r4b8 <none> <none>
Quickly smoke-test one replica via the service:
kubectl port-forward svc/vllm-gemma-270m 8003:8003 &
curl -X POST http://localhost:8003/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-270m",
"prompt": "The capital of France is ",
"max_tokens": 80,
"temperature": 0.7
}'
Sample response:
{
"id": "cmpl-52bf37b81aed469986d69c9a866706c9",
"object": "text_completion",
"created": 1756411573,
"model": "google/gemma-3-270m",
"choices": [
{
"index": 0,
"text": "<strong>Paris</strong>. It’s also the most expensive city in the world, with a price tag of <strong>$100,000,000</strong>.\n\nParis is the <strong>fifth most expensive city in the world</strong>, and it’s on the list of 100 most expensive cities in the world.\n\n<strong>Paris</strong> is the world’s",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 7,
"total_tokens": 87,
"completion_tokens": 80,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
The result: Seven isolated services cleanly share one GPU, each with its own slice and predictable performance.
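Want proof of the isolation? Each replica should see exactly one MIG device and nothing else. Here's a quick way to check, assuming nvidia-smi is available inside the container (the NVIDIA container toolkit normally injects it):
# Each pod should list exactly one 1g.5gb MIG device
for p in $(kubectl get pods -l app=vllm-gemma-270m -o name); do
  echo "== $p"
  kubectl exec "$p" -- nvidia-smi -L
done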
Demo 2: Two Qwen2-7B-Instruct models
Now for the middle ground: two strong instruct models running side by side. Each requests a 3g.20gb slice for a balanced split of the card.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen-7b
  labels:
    app: vllm-qwen-7b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen-7b
  template:
    metadata:
      labels:
        app: vllm-qwen-7b
    spec:
      restartPolicy: Always
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: vllm-qwen-7b
              topologyKey: kubernetes.io/hostname
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8001
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: HF_TOKEN
        command: ["/bin/bash"]
        args:
        - -c
        - |
          # Start vLLM server with Qwen2 7B Instruct model
          # Fixed port so a single Service can target both replicas
          echo "Starting vLLM on port 8001 for pod $HOSTNAME"
          vllm serve Qwen/Qwen2-7B-Instruct --host 0.0.0.0 --port 8001 --max-model-len 4096 --gpu-memory-utilization 0.8
        resources:
          limits:
            nvidia.com/mig-3g.20gb: 1
          requests:
            memory: "8Gi"
            cpu: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen-7b
spec:
  selector:
    app: vllm-qwen-7b
  ports:
  - port: 8001
    targetPort: 8001
    name: http
Apply and verify both replicas:
kubectl apply -f qwen.yaml
kubectl get pods -l app=vllm-qwen-7b -o wide
kubectl port-forward svc/vllm-qwen-7b 8001:8001 &
curl -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-7B-Instruct",
"prompt": "Answer in one word. Capital of United Kingdom is,",
"max_tokens": 150,
"temperature": 0.7
}'
Actual output on the cluster:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vllm-qwen-7b-847486497b-7qmgg 1/1 Running 0 118m 10.128.2.34 harpatil000043jma-p5jcx-worker-f-plch8 <none> <none>
vllm-qwen-7b-847486497b-qj5vb 1/1 Running 0 118m 10.128.2.33 harpatil000043jma-p5jcx-worker-f-plch8 <none> <none>
Sample response:
{
"id": "cmpl-74a2a1444d5c415db1a7945879eb44ac",
"object": "text_completion",
"created": 1756412318,
"model": "Qwen/Qwen2-7B-Instruct",
"choices": [
{
"index": 0,
"text": " London. \n\nStep-by-step justification:\n1. Identify the question asks for the capital of the United Kingdom.\n2. Recall that London is the capital city of the United Kingdom.\n3. Provide the answer in a single word: London.",
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null,
"prompt_logprobs": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 13,
"total_tokens": 62,
"completion_tokens": 49,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
You get two capable assistants sharing one card with clean isolation and predictable latency.
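As a quick check, both pods should report a limit of one nvidia.com/mig-3g.20gb and the same node:
kubectl describe pods -l app=vllm-qwen-7b | grep -E 'Node:|nvidia.com/mig'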
Demo 3: GPT-OSS 20B on a full slice
Some jobs need a lot of room. Here we dedicate nearly the whole GPU to a single model by requesting a 7g.40gb profile. We use GPT-OSS 20B to illustrate the configuration; the point is to exercise the entire GPU, not to showcase a particular model size.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-gpt-oss-20b
  labels:
    app: vllm-gpt-oss-20b
spec:
  restartPolicy: OnFailure
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    ports:
    - containerPort: 8000
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: huggingface-secret
          key: HF_TOKEN
    command: ["/bin/bash"]
    args:
    - -c
    - |
      # Install latest Transformers from source to support gpt_oss model type
      pip install git+https://github.com/huggingface/transformers.git
      # Start vLLM server with GPT-OSS model
      # GPU memory utilization reduced to 0.6 to fit in 40GB MIG slice
      vllm serve openai/gpt-oss-20b --host 0.0.0.0 --port 8000 --max-model-len 2048 --gpu-memory-utilization 0.6
    resources:
      limits:
        nvidia.com/mig-7g.40gb: 1
      requests:
        memory: "8Gi"
        cpu: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-gpt-oss-20b-service
spec:
  type: ClusterIP
  ports:
  - port: 8000
    targetPort: 8000
    name: http
  selector:
    app: vllm-gpt-oss-20b
Deploy and test:
kubectl apply -f samples/vllm_gpt_oss_20b.yaml
kubectl get pods vllm-gpt-oss-20b -o wide
kubectl port-forward svc/vllm-gpt-oss-20b-service 8002:8000 &
curl -X POST http://localhost:8002/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-20b",
"prompt": "Answer in one word. Capital of Portugal is,",
"max_tokens": 200,
"temperature": 0.7
}'
Actual output on the cluster:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vllm-gpt-oss-20b 1/1 Running 0 155m 10.131.0.52 harpatil000043jma-p5jcx-worker-f-wj6tk <none> <none>
Sample response:
{
"id": "cmpl-670cf60fd2ce4bb9a5b6021e1f97609d",
"object": "text_completion",
"created": 1756412724,
"model": "openai/gpt-oss-20b",
"choices": [
{
"index": 0,
"text": " \"Lisbon\". But the letter L is missing from the sentence. So we could say \"Lisbon\". That is a city. \"Lisbon\" is the capital. But the clue says answer in one word. So \"Lisbon\" is fine. But is \"Lisbon\" the letter missing? No. But it's the capital of Portugal. So that fits the second part. So we choose \"Lisbon\".\n\nBut maybe they want \"Lisbon\" because it's the capital of Portugal. So answer: \"Lisbon\". That is one word. So the answer is \"Lisbon\". That fits the instruction to answer in one word.\n\nThus the answer: \"Lisbon\". But we need to check if the letter missing is L. Thus the missing letter is L. So the answer is \"Lisbon\". That would satisfy both clues.\n\nBut one might think the puzzle expects the answer \"Lisbon\". Because the missing letter is the first letter of the capital. So it's a pun",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 10,
"total_tokens": 210,
"completion_tokens": 200,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
You’ve now seen the full spectrum, from micro-slices to a near full GPU workload, on the same cluster.
What’s happening behind the scenes
When a pod requests a MIG resource like nvidia.com/mig-3g.20gb, the dynamic accelerator slicer operator coordinates with the GPU operator and the node to create that precise slice on demand. The container starts with the slice attached; when the pod terminates, the dynamic accelerator slicer cleans the slice up. The whole dance stays Kubernetes-native: you describe resources, and the platform orchestrates hardware to match.
To make it tangible, here’s a real snapshot from the cluster after deploying the three demos, using the NVIDIA driver DaemonSet to run nvidia-smi -L per node:
=== Node: harpatil000043jma-p5jcx-worker-f-7r4b8 ===
GPU 0: NVIDIA A100-SXM4-40GB
MIG 1g.5gb Device 0
MIG 1g.5gb Device 1
MIG 1g.5gb Device 2
MIG 1g.5gb Device 3
MIG 1g.5gb Device 4
MIG 1g.5gb Device 5
MIG 1g.5gb Device 6
=== Node: harpatil000043jma-p5jcx-worker-f-plch8 ===
GPU 0: NVIDIA A100-SXM4-40GB
MIG 3g.20gb Device 0
MIG 3g.20gb Device 1
=== Node: harpatil000043jma-p5jcx-worker-f-wj6tk ===
GPU 0: NVIDIA A100-SXM4-40GB
MIG 7g.40gb Device 0
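The snapshot above was collected by looping over the NVIDIA driver DaemonSet pods. Here's a sketch of how you might reproduce it; the namespace and label are assumptions based on a default GPU Operator install:
# Run nvidia-smi -L on each GPU node via the driver DaemonSet pods
for pod in $(kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o name); do
  node=$(kubectl get "$pod" -n nvidia-gpu-operator -o jsonpath='{.spec.nodeName}')
  echo "=== Node: $node ==="
  kubectl exec -n nvidia-gpu-operator "$pod" -- nvidia-smi -L
done
Delete one of the demo workloads and rerun the loop, and the corresponding slices disappear: that's the teardown half of the lifecycle.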
Scale without the drama
Once you’re comfortable with the patterns above, the rest looks like standard Kubernetes.
Want more throughput? Add replicas of small slices to pack the card and raise utilization without crosstalk.
Different teams, different needs? Assign slice sizes that match their models and SLOs for clean tenancy and predictable costs.
Prefer native tools? Keep using Deployments, HPAs, and your existing observability stack. The only new thing you request is the MIG resource.
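For example, a plain Horizontal Pod Autoscaler on the Demo 1 deployment works unchanged; every new replica simply requests another 1g.5gb slice. (CPU-based scaling is shown only as a sketch; LLM serving is usually scaled on request or token metrics.)
kubectl autoscale deployment vllm-gemma-270m --min=2 --max=7 --cpu-percent=70
kubectl get hpa vllm-gemma-270m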
What’s next: Queues and fair-share with Kueue
Many platforms want queue-based, fair-share scheduling for ML and batch jobs. Kubernetes Kueue brings queuing, quotas, and admission control on top of Jobs and custom workloads. It pairs naturally with the dynamic accelerator slicer: Kueue admits work when capacity is available, and the dynamic accelerator slicer creates the right slice just in time. The outcome is higher utilization, better fairness, and simpler Day 2 operations.
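As a rough sketch of what that pairing could look like, a Kueue ClusterQueue can cap how many 1g.5gb slices are admitted at once, with teams submitting through a LocalQueue (API shown as of Kueue v1beta1; names and quotas are illustrative):
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: mig-1g-5gb
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-slices
spec:
  namespaceSelector: {}          # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/mig-1g.5gb"]
    flavors:
    - name: mig-1g-5gb
      resources:
      - name: "nvidia.com/mig-1g.5gb"
        nominalQuota: 7          # at most seven 1g.5gb slices admitted at once
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: default
spec:
  clusterQueue: gpu-slices
Jobs opt in with the kueue.x-k8s.io/queue-name: team-a label; Kueue holds them until quota frees up, and DAS carves the slice only once the pod is actually admitted and scheduled.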
Wrap-up
MIG turns one GPU into many. The dynamic accelerator slicer operator brings that power to OpenShift in an on-demand, developer-friendly way: request a slice, run your model, and move on, with no GPU babysitting required. Whether you’re shipping multiple small LLMs or dedicating a card to a single giant, dynamic slicing keeps the cluster busy and your users happy.
Questions or want a hand trying this in your cluster? Open an issue. We’d love to help.