
Dynamic GPU slicing with Red Hat OpenShift and NVIDIA MIG

Why run one AI model when you can run ten?

October 14, 2025
Harshal Patil
Related topics:
Artificial intelligence, Kubernetes, Operators
Related products:
Red Hat OpenShift

    Your GPU has a split personality. On Monday morning, it idles while a tiny service waits for requests; by lunch, it’s pegged at 100% serving a single chunky model. What if the same GPU could flex between those extremes: running seven bite-size models before noon and then a full GPU workload after? That’s the promise of NVIDIA multi-instance GPU (MIG) paired with Red Hat OpenShift’s dynamic accelerator slicer operator.

    In this post, we’ll take a tour of that world. We’ll start with a quick, human-friendly explanation of MIG. We'll show how the dynamic accelerator slicer turns "GPU partitions" into just-in-time, Kubernetes-native resources, and then spin up three live demos: from seven tiny models on one card, to two medium models, to a single full GPU workload. Along the way, you’ll see how to keep GPUs busy, teams isolated, and operations simple.

    MIG explained without the jargon

    Think of a large GPU as a high-rise. MIG (Multi-Instance GPU) lets you split that building into separate apartments with walls, doors, and their own utilities. An A100 40 GB card, for example, can become:

    • 1g.5gb apartments for tiny workloads (you can fit seven)
    • 3g.20gb apartments for mid-sized models
    • 7g.40gb, a full-floor penthouse when one large tenant needs everything

    Each apartment is isolated, with no noisy neighbors, and performance is predictable. The result: better utilization and safer multi-tenancy without the "who stole my GPU?" drama.
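
    Curious what the full menu of apartment sizes looks like on your card? nvidia-smi can list the supported profiles. A minimal sketch, run through the NVIDIA driver DaemonSet (the namespace and DaemonSet name below are typical GPU Operator defaults, so treat them as assumptions and adjust for your cluster):

    # List the GPU instance profiles this card supports and how many of each fit
    kubectl exec -n nvidia-gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi mig -lgip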

    Why do it dynamically?

    Static partitions go stale. Teams change, workloads spike, and idle slices collect dust. The dynamic accelerator slicer operator makes slicing ephemeral: your pod asks for a slice, the operator creates it right before the container starts, and removes it when the pod goes away. No SSH, no pets, no hand-carved layouts; just standard Kubernetes scheduling with right-sized GPU resources.
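
    From the workload's point of view, the entire contract is one extended resource in the pod spec; everything else happens behind the scenes. The demos below all boil down to a snippet like this:

    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # DAS creates this slice just before the container starts and removes it when the pod goes away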

    What you’ll need

    You’re on OpenShift with nodes that support NVIDIA MIG, plus Node Feature Discovery and the NVIDIA GPU Operator installed. MIG should be enabled on your GPU nodes per the GPU operator docs. That’s it. No bespoke scripts or cluster snowflakes.
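
    A quick sanity check before moving on never hurts. The exact namespaces depend on how the operators were installed, so treat these commands as a sketch:

    # NVIDIA GPU Operator and Node Feature Discovery pods should be Running
    kubectl get pods -n nvidia-gpu-operator
    kubectl get pods -n openshift-nfd

    # GPU nodes should report MIG capability once it's enabled
    kubectl get nodes -L nvidia.com/mig.capable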

    Install the dynamic accelerator slicer operator

    Use the OpenShift web console to install cert-manager, then the NVIDIA GPU Operator and Node Feature Discovery, and finally the dynamic accelerator slicer operator. Create a DASOperator instance with defaults (emulation off), and wait for the operator pods to go green. For reference and deeper guidance, see the dynamic accelerator slicer operator documentation.
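
    A couple of checks confirm the installation is healthy. The namespace below is an assumption; use whichever namespace you installed the operator into, and let kubectl api-resources tell you the exact custom resource name:

    # Operator pods should be Running
    kubectl get pods -n das-operator

    # Confirm the DASOperator custom resource is registered and your instance exists
    kubectl api-resources | grep -i dasoperator
    kubectl get dasoperator -A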

    One-time setup: Hugging Face token

    Our examples pull models from Hugging Face. Create a secret named huggingface-secret with key HF_TOKEN in the default namespace once and reuse it everywhere:

    # Replace <your_hf_token>
    kubectl create secret generic huggingface-secret \
      --from-literal=HF_TOKEN=<your_hf_token>
    
    # Verify it exists
    kubectl get secret huggingface-secret

    A day in the life of a GPU: Three demos

    We’ll show the entire spectrum, from many tiny models to one full GPU workload, on a single cluster. Each example uses vLLM for serving and requests a different MIG profile. Watch how the same GPU card shape-shifts to match what you deploy.

    Demo 1: Seven tiny models, one card

    Sometimes throughput beats raw horsepower. Here we run seven replicas of Gemma 3 270M, each requesting nvidia.com/mig-1g.5gb. OpenShift co-schedules them on the same node; the dynamic accelerator slicer carves out seven 1g.5gb slices just in time.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-gemma-270m
      labels:
        app: vllm-gemma-270m
    spec:
      replicas: 7
      selector:
        matchLabels:
          app: vllm-gemma-270m
      template:
        metadata:
          labels:
            app: vllm-gemma-270m
        spec:
          restartPolicy: Always
          affinity:
            podAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: vllm-gemma-270m
                  topologyKey: kubernetes.io/hostname
          containers:
          - name: vllm
            image: vllm/vllm-openai:latest
            ports:
            - containerPort: 8003
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: HF_TOKEN
            command: ["/bin/bash"]
            args:
            - -c
            - |
              # Start vLLM server with Gemma 3 270M model
              # Fixed port so a single Service can target all replicas
              echo "Starting vLLM on port 8003 for pod $HOSTNAME"
              vllm serve google/gemma-3-270m --host 0.0.0.0 --port 8003 --max-model-len 1024 --gpu-memory-utilization 0.7
            resources:
              limits:
                nvidia.com/mig-1g.5gb: 1  # Dynamically created by DAS
              requests:
                memory: "2Gi"
                cpu: "0.5"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-gemma-270m
    spec:
      selector:
        app: vllm-gemma-270m
      ports:
      - port: 8003
        targetPort: 8003
        name: http
    
    Deploy and watch them land on the same node:
    
    kubectl apply -f gemma.yaml
    kubectl get pods -l app=vllm-gemma-270m -o wide

    Actual output on the cluster:

    NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE                                     NOMINATED NODE   READINESS GATES
    vllm-gemma-270m-5d866f98db-2q6qv   1/1     Running   0          71m   10.129.2.56   harpatil000043jma-p5jcx-worker-f-7r4b8   <none>           <none>
    vllm-gemma-270m-5d866f98db-8hxj4   1/1     Running   0          71m   10.129.2.53   harpatil000043jma-p5jcx-worker-f-7r4b8   <none>           <none>
    vllm-gemma-270m-5d866f98db-b5rmx   1/1     Running   0          71m   10.129.2.59   harpatil000043jma-p5jcx-worker-f-7r4b8   <none>           <none>
    vllm-gemma-270m-5d866f98db-m28kx   1/1     Running   0          71m   10.129.2.55   harpatil000043jma-p5jcx-worker-f-7r4b8   <none>           <none>
    vllm-gemma-270m-5d866f98db-pbmhf   1/1     Running   0          71m   10.129.2.54   harpatil000043jma-p5jcx-worker-f-7r4b8   <none>           <none>
    vllm-gemma-270m-5d866f98db-vqg9p   1/1     Running   0          71m   10.129.2.58   harpatil000043jma-p5jcx-worker-f-7r4b8   <none>           <none>
    vllm-gemma-270m-5d866f98db-w8j7b   1/1     Running   0          71m   10.129.2.57   harpatil000043jma-p5jcx-worker-f-7r4b8   <none>           <none>

    Quickly smoke-test one replica via the service:

    kubectl port-forward svc/vllm-gemma-270m 8003:8003 &
    curl -X POST http://localhost:8003/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "google/gemma-3-270m",
        "prompt": "The capital of France is ",
        "max_tokens": 80,
        "temperature": 0.7
      }'

    Sample response:

    {
      "id": "cmpl-52bf37b81aed469986d69c9a866706c9",
      "object": "text_completion",
      "created": 1756411573,
      "model": "google/gemma-3-270m",
      "choices": [
        {
          "index": 0,
          "text": "<strong>Paris</strong>. It’s also the most expensive city in the world, with a price tag of <strong>$100,000,000</strong>.\n\nParis is the <strong>fifth most expensive city in the world</strong>, and it’s on the list of 100 most expensive cities in the world.\n\n<strong>Paris</strong> is the world’s",
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null,
          "prompt_logprobs": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 7,
        "total_tokens": 87,
        "completion_tokens": 80,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }

    The result: Seven isolated services cleanly share one GPU, each with its own slice and predictable performance.
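
    If you want to see the slices from the scheduler's point of view, check the node's resources; the MIG devices should show up under the nvidia.com/mig-1g.5gb resource (exactly how and when they appear depends on the DAS and device plugin integration). The node name comes from the pod listing above:

    # Look for nvidia.com/mig-1g.5gb in the node's Capacity and Allocatable sections
    kubectl describe node harpatil000043jma-p5jcx-worker-f-7r4b8 | grep nvidia.com/mig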

    Demo 2: Two Qwen2-7B-Instruct models

    Now for the middle ground: two strong Instruct models side-by-side. Each requests a 3g.20gb slice for a balanced split of the card.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-qwen-7b
      labels:
        app: vllm-qwen-7b
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: vllm-qwen-7b
      template:
        metadata:
          labels:
            app: vllm-qwen-7b
        spec:
          restartPolicy: Always
          affinity:
            podAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: vllm-qwen-7b
                  topologyKey: kubernetes.io/hostname
          containers:
          - name: vllm
            image: vllm/vllm-openai:latest
            ports:
            - containerPort: 8001
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: HF_TOKEN
            command: ["/bin/bash"]
            args:
            - -c
            - |
              # Start vLLM server with Qwen2 7B Instruct model
              # Fixed port so a single Service can target both replicas
              echo "Starting vLLM on port 8001 for pod $HOSTNAME"
              vllm serve Qwen/Qwen2-7B-Instruct --host 0.0.0.0 --port 8001 --max-model-len 4096 --gpu-memory-utilization 0.8
            resources:
              limits:
                nvidia.com/mig-3g.20gb: 1
              requests:
                memory: "8Gi"
                cpu: "2"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-qwen-7b
    spec:
      selector:
        app: vllm-qwen-7b
      ports:
      - port: 8001
        targetPort: 8001
        name: http

    Apply and verify both replicas:

    kubectl apply -f qwen.yaml
    kubectl get pods -l app=vllm-qwen-7b -o wide
    kubectl port-forward svc/vllm-qwen-7b 8001:8001 &
    curl -X POST http://localhost:8001/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen2-7B-Instruct",
        "prompt": "Answer in one word. Capital of United Kingdom is,",
        "max_tokens": 150,
        "temperature": 0.7
      }'

    Actual output on the cluster:

    NAME                            READY   STATUS    RESTARTS   AGE    IP            NODE                                     NOMINATED NODE   READINESS GATES
    vllm-qwen-7b-847486497b-7qmgg   1/1     Running   0          118m   10.128.2.34   harpatil000043jma-p5jcx-worker-f-plch8   <none>           <none>
    vllm-qwen-7b-847486497b-qj5vb   1/1     Running   0          118m   10.128.2.33   harpatil000043jma-p5jcx-worker-f-plch8   <none>           <none>

    Sample response:

    {
      "id": "cmpl-74a2a1444d5c415db1a7945879eb44ac",
      "object": "text_completion",
      "created": 1756412318,
      "model": "Qwen/Qwen2-7B-Instruct",
      "choices": [
        {
          "index": 0,
          "text": " London. \n\nStep-by-step justification:\n1. Identify the question asks for the capital of the United Kingdom.\n2. Recall that London is the capital city of the United Kingdom.\n3. Provide the answer in a single word: London.",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "prompt_logprobs": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 13,
        "total_tokens": 62,
        "completion_tokens": 49,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }

    You get two capable assistants sharing one card with clean isolation and predictable latency.
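
    Because Qwen2-7B-Instruct is a chat-tuned model, you can also exercise vLLM's OpenAI-compatible chat endpoint rather than raw completions; a quick sketch against the same port-forward:

    curl -X POST http://localhost:8001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen2-7B-Instruct",
        "messages": [{"role": "user", "content": "In one word, what is the capital of the United Kingdom?"}],
        "max_tokens": 50,
        "temperature": 0.7
      }'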

    Demo 3: GPT-OSS 20B on a full slice

    Some jobs need a lot of room. Here we dedicate nearly the whole GPU to a single model by requesting a 7g.40gb profile. We use GPT-OSS 20B to illustrate a full-GPU configuration; the point is to show how to hand one workload the entire card, not to showcase the model itself.

    apiVersion: v1
    kind: Pod
    metadata:
      name: vllm-gpt-oss-20b
      labels:
        app: vllm-gpt-oss-20b
    spec:
      restartPolicy: OnFailure
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: HF_TOKEN
        command: ["/bin/bash"]
        args:
        - -c
        - |
          # Install latest Transformers from source to support gpt_oss model type
          pip install git+https://github.com/huggingface/transformers.git
    
          # Start vLLM server with GPT-OSS model
          # GPU memory utilization reduced to 0.6 to fit in 40GB MIG slice
          vllm serve openai/gpt-oss-20b --host 0.0.0.0 --port 8000 --max-model-len 2048 --gpu-memory-utilization 0.6
        resources:
          limits:
            nvidia.com/mig-7g.40gb: 1
          requests:
            memory: "8Gi"
            cpu: "2"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-gpt-oss-20b-service
    spec:
      type: ClusterIP
      ports:
      - port: 8000
        targetPort: 8000
        name: http
      selector:
        app: vllm-gpt-oss-20b

    Deploy and test:

    kubectl apply -f samples/vllm_gpt_oss_20b.yaml
    kubectl get pods vllm-gpt-oss-20b -o wide
    kubectl port-forward svc/vllm-gpt-oss-20b-service 8002:8000 &
    curl -X POST http://localhost:8002/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "openai/gpt-oss-20b",
        "prompt": "Answer in one word. Capital of Portugal is,",
        "max_tokens": 200,
        "temperature": 0.7
      }'

    Actual output on the cluster:

    NAME               READY   STATUS    RESTARTS   AGE    IP            NODE                                     NOMINATED NODE   READINESS GATES
    vllm-gpt-oss-20b   1/1     Running   0          155m   10.131.0.52   harpatil000043jma-p5jcx-worker-f-wj6tk   <none>           <none>

    Sample response:

    {
      "id": "cmpl-670cf60fd2ce4bb9a5b6021e1f97609d",
      "object": "text_completion",
      "created": 1756412724,
      "model": "openai/gpt-oss-20b",
      "choices": [
        {
          "index": 0,
          "text": " \"Lisbon\". But the letter L is missing from the sentence. So we could say \"Lisbon\". That is a city. \"Lisbon\" is the capital. But the clue says answer in one word. So \"Lisbon\" is fine. But is \"Lisbon\" the letter missing? No. But it's the capital of Portugal. So that fits the second part. So we choose \"Lisbon\".\n\nBut maybe they want \"Lisbon\" because it's the capital of Portugal. So answer: \"Lisbon\". That is one word. So the answer is \"Lisbon\". That fits the instruction to answer in one word.\n\nThus the answer: \"Lisbon\". But we need to check if the letter missing is L. Thus the missing letter is L. So the answer is \"Lisbon\". That would satisfy both clues.\n\nBut one might think the puzzle expects the answer \"Lisbon\". Because the missing letter is the first letter of the capital. So it's a pun",
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null,
          "prompt_logprobs": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 10,
        "total_tokens": 210,
        "completion_tokens": 200,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }

    You’ve now seen the full spectrum, from micro-slices to a near full GPU workload, on the same cluster.
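
    To confirm the model really does have the whole card, you can run nvidia-smi inside the pod, assuming the vLLM image has the binary available (containers started via the NVIDIA container toolkit typically do):

    # Shows the single 7g.40gb MIG device visible to the pod and its memory usage
    kubectl exec vllm-gpt-oss-20b -- nvidia-smi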

    What’s happening behind the scenes

    When a pod requests a MIG resource like nvidia.com/mig-3g.20gb, the dynamic accelerator slicer operator coordinates with the GPU operator and the node to create that precise slice on demand. The container starts with the slice attached; when the pod terminates, the dynamic accelerator slicer cleans the slice up. The whole dance stays Kubernetes-native: you describe resources, and the platform orchestrates hardware to match.

    To make it tangible, here’s a real snapshot from the cluster after deploying the three demos, using the NVIDIA driver DaemonSet to run nvidia-smi -L per node:

    === Node: harpatil000043jma-p5jcx-worker-f-7r4b8 ===
    GPU 0: NVIDIA A100-SXM4-40GB
      MIG 1g.5gb Device 0
      MIG 1g.5gb Device 1
      MIG 1g.5gb Device 2
      MIG 1g.5gb Device 3
      MIG 1g.5gb Device 4
      MIG 1g.5gb Device 5
      MIG 1g.5gb Device 6
    
    === Node: harpatil000043jma-p5jcx-worker-f-plch8 ===
    GPU 0: NVIDIA A100-SXM4-40GB
      MIG 3g.20gb Device 0
      MIG 3g.20gb Device 1
    
    === Node: harpatil000043jma-p5jcx-worker-f-wj6tk ===
    GPU 0: NVIDIA A100-SXM4-40GB
      MIG 7g.40gb Device 0
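
    One way to gather a snapshot like this is a small loop over the driver DaemonSet pods; the label and namespace here are typical GPU Operator defaults, so adjust them for your cluster:

    for pod in $(kubectl get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o name); do
      node=$(kubectl get "$pod" -n nvidia-gpu-operator -o jsonpath='{.spec.nodeName}')
      echo "=== Node: $node ==="
      kubectl exec -n nvidia-gpu-operator "$pod" -- nvidia-smi -L
    done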

    Scale without the drama

    Once you’re comfortable with the patterns above, the rest looks like standard Kubernetes.

    Want more throughput? Add replicas of small slices to pack the card and raise utilization without crosstalk.

    Different teams, different needs? Assign slice sizes that match their models and SLOs for clean tenancy and predictable costs.

    Prefer native tools? Keep using Deployments, HPAs, and your existing observability stack. The only new thing you request is the MIG resource.
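
    For instance, a standard HorizontalPodAutoscaler can scale the small-slice deployment from Demo 1; a minimal sketch, assuming cluster metrics are available (the thresholds are illustrative):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: vllm-gemma-270m
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: vllm-gemma-270m
      minReplicas: 2
      maxReplicas: 7  # one 1g.5gb slice per replica; an A100 40GB fits seven
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70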

    What’s next: Queues and fair-share with Kueue

    Many platforms want queue-based, fair-share scheduling for ML and batch jobs. Kubernetes Kueue brings queuing, quotas, and admission control on top of Jobs and custom workloads. It pairs naturally with the dynamic accelerator slicer: Kueue admits work when capacity is available, and the dynamic accelerator slicer creates the right slice just in time. The outcome is higher utilization, better fairness, and simpler Day 2 operations.
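
    As a taste of what that could look like, here is a minimal Kueue sketch that caps a team at seven 1g.5gb slices; it assumes Kueue is installed, and the names and quota values are illustrative:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: mig-1g-5gb
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: team-a-cq
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["cpu", "memory", "nvidia.com/mig-1g.5gb"]
        flavors:
        - name: mig-1g-5gb
          resources:
          - name: "cpu"
            nominalQuota: 16
          - name: "memory"
            nominalQuota: 64Gi
          - name: "nvidia.com/mig-1g.5gb"
            nominalQuota: 7  # at most seven small slices admitted at once
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: team-a
      namespace: default
    spec:
      clusterQueue: team-a-cq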

    Wrap-up

    MIG turns one GPU into many. The dynamic accelerator slicer operator brings that power to OpenShift in an on-demand, developer-friendly way: request a slice, run your model, and move on, with no GPU babysitting required. Whether you’re shipping multiple small LLMs or dedicating a card to a single giant, dynamic slicing keeps the cluster busy and your users happy.

    Questions or want a hand trying this in your cluster? Open an issue. We’d love to help.
