The benefits of dynamic GPU slicing in OpenShift

May 6, 2025
Gaurav Singh, Harshal Patil, Abhishek Malvankar
Related topics:
Artificial intelligence, Data Science
Related products:
Red Hat AI, Red Hat Enterprise Linux AI, Red Hat OpenShift AI


    In the era of AI and machine learning, efficient resource management is paramount. As Red Hat OpenShift administrators, we face the challenge of deploying intensive AI workloads on a platform where GPUs represent a significant cost. Traditional methods like pre-slicing with NVIDIA’s Multi-Instance GPU (MIG) can lead to resource wastage, especially when the static slices do not align with dynamic workload demands.

    In this article, we will explore how dynamic GPU slicing—enabled by the dynamic accelerator slicer operator—can revolutionize GPU resource management in OpenShift by dynamically adjusting allocation based on workload needs.

    The challenge of GPU utilization

    OpenShift supports the NVIDIA GPU operator, which can pre-partition GPUs into MIG slices. However, conventional approaches come with several limitations:

    • Rigid slicing: Predefined MIG slices may not match the actual resource requirements of your pods, leading to underutilization.
    • Static allocation: Allocating GPU slices at node start-up forces administrators to reconfigure the nodes if workloads change, often causing disruptions.
    • Lack of dynamic provisioning: Without the ability to allocate GPU resources on demand, clusters end up either over-provisioning or underutilizing GPUs, both of which can increase operational costs.

    These issues highlight the need for a more flexible approach that enables fractional GPU sharing, ensuring each workload gets only what it needs. 

    Introducing the dynamic accelerator slicer

    The dynamic accelerator slicer operator is currently in developer preview (DP) and is designed to dynamically allocate and manage GPU slices in OpenShift. Its core objectives include:

    • Dynamic allocation: Provisioning MIG slices based on the precise pod resource requests and limits (a sketch of such a request follows this list).
    • Intelligent scheduling: Leveraging Kubernetes scheduling gates to hold pods until the necessary GPU slices are available.
    • Seamless integration: Working alongside the NVIDIA GPU Operator, it manages GPU slices without requiring changes to the pod specifications.
    • Automated lifecycle management: Tracking allocations and automatically releasing GPU slices once workloads complete.
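
    For a concrete picture, here is a minimal sketch of a pod spec that requests a MIG slice as an extended resource. The pod name, image, and exact resource string are assumptions for illustration; the resource name the operator recognizes depends on your GPU model and operator configuration:

    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-example                       # hypothetical name
    spec:
      containers:
      - name: app
        image: quay.io/example/app:latest     # hypothetical image
        resources:
          limits:
            nvidia.com/mig-3g.20gb: 1         # request one 3g.20gb MIG slice as an extended resource

    Because the operator provisions slices from such requests at scheduling time, no MIG slices need to be pre-created on the node.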

    How dynamic GPU slicing works

    Dynamic GPU slicing leverages the scheduling primitives within OpenShift to optimize resource usage in this three-step process:

    1. Dynamic allocation and placement

      When a pod requests GPU resources, it is held in a pre-scheduled state via Kubernetes scheduling gates. The operator dynamically allocates the required GPU slice only when the workload is ready to run, thereby avoiding the inefficiencies associated with pre-slicing (see the sketch after this list).

    2. Integration with the NVIDIA GPU Operator

      By harnessing the power of the NVIDIA GPU Operator, the Dynamic Accelerator Slicer introduces an external controller that manages slice allocation. This integration ensures that GPU management remains robust and that the slicing mechanism fits seamlessly into the existing ecosystem.

    3. Automated slice lifecycle management

      From allocation to deallocation, the operator handles the entire lifecycle of GPU slices. This automation not only streamlines resource management but also ensures that resources are promptly returned to the pool once the pod completes its execution.
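
    As a rough illustration of the gating mechanism from step 1, a gated pod carries a standard schedulingGates entry in its spec and is reported as SchedulingGated until the slice is ready. The gate name and pod name below are assumptions; schedulingGates itself is a regular Kubernetes pod field:

    spec:
      schedulingGates:
      - name: instaslice.example.com/accelerator   # hypothetical gate name added by the operator

    $ oc get pod vllm-example                      # hypothetical pod name
    NAME           READY   STATUS            RESTARTS   AGE
    vllm-example   0/1     SchedulingGated   0          5s

    Once the slice is provisioned, the operator removes the gate and the pod schedules normally, which corresponds to the "ungated" allocation status you will see later in the instaslice object.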

    Additionally, the solution extends GPU memory quota management via Kubernetes Quota and Kueue and introduces dynamic policy-driven slicing with efficient packing to meet diverse workload demands.
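
    As a hedged sketch of the quota side, a Kueue ClusterQueue can cap how many MIG slices a group of namespaces may consume at once. The flavor name, queue name, and quota value below are assumptions for illustration:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: a100-mig                            # hypothetical flavor name
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: gpu-queue                           # hypothetical queue name
    spec:
      namespaceSelector: {}                     # admit workloads from any namespace
      resourceGroups:
      - coveredResources: ["nvidia.com/mig-3g.20gb"]
        flavors:
        - name: a100-mig
          resources:
          - name: "nvidia.com/mig-3g.20gb"
            nominalQuota: 4                     # admit at most four 3g.20gb slices at a time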

    Deploying the developer preview in OpenShift

    Deploying the dynamic accelerator slicer is straightforward. Before you begin, ensure you have the necessary dependencies installed:

    • oc CLI: Used to interact with your OpenShift cluster.
    • Operator SDK: Required to run the operator bundle.

    Note:

    Make sure the KUBECONFIG environment variable is set to point to your OpenShift cluster's configuration file.
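
    For example (the kubeconfig path below is a placeholder):

    export KUBECONFIG=/path/to/openshift/kubeconfig
    oc whoami --show-server    # confirm the CLI is pointed at the intended cluster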

    Follow these steps to deploy the operator:

    1. Create a new project:

      oc new-project instaslice-system 
    2. Run the operator bundle:

      operator-sdk run bundle quay.io/ibm/instaslice-bundle:0.0.1 -n instaslice-system

    These commands deploy the operator into your OpenShift cluster, allowing it to dynamically manage GPU slices alongside your existing NVIDIA GPU setup.
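
    Before moving on, it is worth confirming that the operator pods are healthy. Exact pod names vary with the bundle version, so treat this as a general sanity check rather than expected output:

    oc get pods -n instaslice-system
    # The operator's controller and per-node daemonset pods should reach the Running state.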

    Deploying a large language model (LLM) for inference

    Multiple inference instances can be deployed on a single GPU, isolated from one another by NVIDIA MIG technology. Here we have a Red Hat OpenShift Container Platform worker node with 4 NVIDIA A100 GPUs:

    [root@nvidia-driver-daemonset-417 drivers]# nvidia-smi -L
    GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-42db43a1-526e-626d-ff2d-456bf26d9df0)
    GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-143a66c4-bd69-7559-8898-26f9886a2a56)
    GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-7d9ad99b-3d06-cf12-bab8-e1f31672e01f)
    GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7cf6fcb8-d173-0224-618c-4813e24a3383)

    These GPUs can be sliced to serve multiple LLM inference models. Download the sample vLLM deployment manifest.

    After downloading, edit the file and set the value of HF_TOKEN in the huggingface-secret to the access token from your Hugging Face account.
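
    If you are unsure which part of the manifest to edit, the secret typically looks something like the snippet below. The namespace and placeholder token are assumptions; only the huggingface-secret name and the HF_TOKEN key come from the sample:

    apiVersion: v1
    kind: Secret
    metadata:
      name: huggingface-secret
      namespace: instaslice-system      # assumed; use the namespace defined in the sample manifest
    type: Opaque
    stringData:
      HF_TOKEN: hf_xxxxxxxxxxxxxxxx     # replace with the access token from your Hugging Face account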

    Apply that file using oc apply -f <filename>.

    Note that this sample model is rather large and may take a while to download depending on your network speed. Make sure the vLLM deployment has come up successfully:

    $ oc get deployments vllm
    NAME   READY   UP-TO-DATE   AVAILABLE   AGE
    vllm   1/1     1            1           35s

    You can verify that instaslice has provisioned the slice requested by the deployment by observing the instaslice object as well as by probing the NVIDIA GPU Operator.

    First, find the UID of the pod running on the host. This is one way to get it:

    $ oc get pods vllm-7dbb49b8f8-znd4s -o json | jq .metadata.uid
    "98c08795-2da3-4598-81b7-538c4e37093b"

    This UID lets us track the allocation status in the instaslice object:

    $ oc get instaslice host-192-168-11-144 -o json | jq .status.podAllocationResults
    {
      "98c08795-2da3-4598-81b7-538c4e37093b": {
        "allocationStatus": {
          "allocationStatusController": "ungated",
          "allocationStatusDaemonset": "created"
        },
        "configMapResourceIdentifier": "a02ae459-0618-4804-83f9-e5ba36af756f",
        "gpuUUID": "GPU-143a66c4-bd69-7559-8898-26f9886a2a56",
        "migPlacement": {
          "size": 4,
          "start": 0
        },
        "nodename": "host-192-168-11-144"
      }
    }

    As you can see in the previous output, the slice has been created dynamically. You can verify this with the NVIDIA GPU Operator:

    # nvidia-smi -L
    GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-42db43a1-526e-626d-ff2d-456bf26d9df0)
    GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-143a66c4-bd69-7559-8898-26f9886a2a56)
      MIG 3g.20gb     Device  0: (UUID: MIG-b065e8ea-01ff-598f-83fd-3eea1f2869b9)
    GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-7d9ad99b-3d06-cf12-bab8-e1f31672e01f)
    GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7cf6fcb8-d173-0224-618c-4813e24a3383)

    Now in another terminal, make sure to set the KUBECONFIG for the OpenShift cluster and execute the following command to forward the port:

    oc port-forward svc/vllm 8000:8000 -n instaslice-system

    Come back to the earlier terminal to query the model:

    $ curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
      }'
    {"id":"cmpl-474b5727068745baa4718925bcfa8f91","object":"text_completion","created":163554,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}

    If you delete this workload, instaslice automatically clears the corresponding slices on the respective GPUs, as shown below.
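
    For example, deleting the deployment and re-checking the GPUs should show the MIG device disappear shortly afterwards (slice release can take a moment):

    oc delete deployment vllm -n instaslice-system   # or: oc delete -f <filename>
    # Then, from the driver pod on the node as before:
    nvidia-smi -L                                    # the MIG 3g.20gb device under GPU 1 should be gone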

    The benefits of dynamic GPU slicing

    Adopting dynamic GPU slicing offers multiple benefits:

    • Increased utilization: Workloads can utilize GPU slices efficiently, reducing the need to reserve entire GPUs when only a fraction is necessary.
    • Cost reduction: By paying only for the resources actively used, organizations can significantly reduce GPU-related expenses.
    • Seamless scaling: Dynamic allocation allows clusters to adapt in real time to changing workloads, ensuring optimal performance and resource distribution.

    Looking ahead: Integration with Kubernetes DRA

    The Kubernetes Dynamic Resource Allocation (DRA) framework aims to provide even finer control over GPU scheduling. While DRA is still evolving, the dynamic accelerator slicer offers a practical solution today. It paves the way for future enhancements, ensuring that OpenShift users can continue to push the boundaries of resource efficiency.

    Efficient GPU management is essential for maximizing the performance of AI workloads in OpenShift. The dynamic accelerator slicer operator delivers dynamic GPU slicing that optimizes resource usage, reduces operational costs, and adapts seamlessly to the fluctuating demands of modern AI applications.

    By integrating dynamically with the NVIDIA GPU operator and leveraging OpenShift’s scheduling capabilities, this innovative approach transforms GPU allocation from a static, often inefficient process into a dynamic, on-demand service. Embracing dynamic GPU slicing is a critical step towards unlocking the full potential of AI workloads in the OpenShift environment.
