Use Stable Diffusion to create images on Red Hat OpenShift AI on a ROSA cluster with GPU enabled

Stable Diffusion is an AI model to generate images from text description. It uses a diffusion process to iteratively de-noise random Gaussian noise into coherent images. This is a simple tutorial for creating images using Stable Diffusion model using Red Hat OpenShift AI (RHOAI) (formerly Red Hat OpenShift Data Science), which is our OpenShift platform for AI/ML projects lifecycle management, running on a Red Hat OpenShift Services on AWS (ROSA) cluster, which is our managed service OpenShift platform on AWS, with NVIDIA GPU enabled.

What you'll be doing

Once we logged in to the cluster as cluster admin, we will first set up the GPU environment by first creating the machine pools and installing Node Feature Discovery (NFD) operator, which will detect hardware and system configuration of the nodes. We will then create NFD instance to automatically label nodes which is important for GPU-aware scheduling. Then, we will install the NVIDIA GPU operator, which will handle the management of all NVIDIA software components needed to provision GPU-enabled containers, and we’ll wrap up the GPU setup by creating ClusterPolicy and fine-tuning the configuration to ensure proper setup for NVIDIA GPUs in the OpenShift environment.

Afterward, we will set up the RHOAI environment by installing the RHOAI operator, which automates management of AI/ML tools and services on OpenShift, creating a DataScienceCluster (DSC) instance to set up the necessary services and tools for AI/ML tasks, and we will also enable GPU accelerator for RHOAI. And finally, we will generate images by running the Stable Diffusion model on RHOAI's Jupyter notebook.

Note that the installation process in this tutorial is all completed through the CLI, so please make sure that you are logged in to the cluster with a cluster admin access prior to running the tutorial. If you prefer to install the RHOAI operator via console and/or save the models/images on Amazon S3 buckets, you can follow the other tutorial. If you do so, however, please be sure to enable the GPU accelerator prior to launching the Jupyter notebook. In addition, the RHOAI operator version used in this tutorial is version 2.10.0 and please note that as RHOAI undergoes ongoing development and refinement, certain features may evolve or change over time.

Disclaimer

When using Stable Diffusion or other open source image generation models, please be aware that while these tools include certain content filters and safety features, these are not foolproof. Therefore, it is your responsibility to use this tool in a safe manner, ensure the prompts you input are appropriate, and verify that the generated images are suitable for your intended audience. Neither the author of this tutorial nor the infrastructure providers can be held responsible for any inappropriate or unwanted results you may generate. By proceeding with this tutorial, you acknowledge that you understand the potential risks and agree to use the tool responsibly. Remember that the output of AI image generation models can sometimes be unpredictable and thus it is important to review all the generated images before sharing or using them in any context.

1. Prerequisites

Classic or HCP (hosted control planes) ROSA cluster.
- Classic: I ran this tutorial on a classic single-AZ ROSA 4.15.17 cluster with m6a.4xlarge node with auto-scaling enabled up to 10 nodes. The cluster has 80 vCPUs with ~339Gi memory.
- HCP: I also ran this tutorial on an HCP ROSA 4.16.3 cluster with m5.xlarge node with 28 vCPUs and ~108Gi memory.
- Either classic or HCP, be sure that you have cluster admin access to the cluster.
oc cli.
rosa cli.

2. Setting up GPU environment

To begin, let's set up the GPU environment.

2.1 Creating GPU machine pools

In this tutorial, I'm using g5.4xlarge node for the GPU machine pools with auto-scaling enabled up to 4 nodes. Replace your-cluster-name with the name of your cluster:

rosa create machinepool --cluster=<your-cluster-name> --name=gpu-pool --instance-type=g5.4xlarge --min-replicas=1 --max-replicas=4 --enable-autoscaling --labels='gpu-node=true' --taints='nvidia.com/gpu=present:NoSchedule'

Copy snippet

2.2 Installing NFD operator

Next, run the below script to install the NFD operator:

# create the openshift-nfd namespace
oc create namespace openshift-nfd

# apply the OperatorGroup and Subscription
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-nfd-
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

echo "Waiting for NFD Operator to be deployed..."

# wait for the NFD Operator deployment to be available
while ! oc get deployment -n openshift-nfd | grep -q nfd-controller-manager; do
  sleep 5
done

# wait for the deployment to be ready
oc wait --for=condition=available --timeout=300s deployment/nfd-controller-manager -n openshift-nfd

# check if the deployment is ready
if [ $? -eq 0 ]; then
  echo "NFD Operator is deployed and ready."
else
  echo "Timeout waiting for NFD Operator to be ready. Please check the deployment status manually."
fi

# display the pods in the openshift-nfd namespace
echo "Pods in openshift-nfd namespace:"
oc get pods -n openshift-nfd

Copy snippet

2.3 Creating NFD instance

Follow the script below to create an NFD instance:

# apply the NodeFeatureDiscovery configuration
cat <<EOF | oc apply -f -
kind: NodeFeatureDiscovery
apiVersion: nfd.openshift.io/v1
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    image: registry.redhat.io/openshift4/ose-node-feature-discovery@sha256:07658ef3df4b264b02396e67af813a52ba416b47ab6e1d2d08025a350ccd2b7b
    servicePort: 12000
EOF

echo "Waiting for NFD instance to be created..."
timeout 300 bash -c 'until oc get nodefeaturediscovery nfd-instance -n openshift-nfd &>/dev/null; do sleep 5; done'

if [ $? -eq 0 ]; then
    echo "NFD instance has been successfully created."
else
    echo "Timed out waiting for NFD instance to be created."
    exit 1
fi

Copy snippet

2.4 Installing GPU operator

Now that we have the NFD operator installed and NFD instance created, we will now install the GPU operator:

# fetch the latest channel
CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace -o jsonpath='{.status.defaultChannel}')

# fetch the latest CSV
STARTINGCSV=$(oc get packagemanifests/gpu-operator-certified -n openshift-marketplace -o json | jq -r --arg CHANNEL "$CHANNEL" '.status.channels[] | select(.name == $CHANNEL) | .currentCSV')

# create namespace if it doesn't exist
oc create namespace nvidia-gpu-operator 2>/dev/null || true

# apply the OperatorGroup and Subscription
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
 targetNamespaces:
 - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "${CHANNEL}"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

# wait for the CSV to be available
echo "Waiting for ClusterServiceVersion to be available..."
while ! oc get csv -n nvidia-gpu-operator ${STARTINGCSV} &>/dev/null; do
    sleep 5
done

# apply the ClusterPolicy
oc get csv -n nvidia-gpu-operator ${STARTINGCSV} -ojsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' | oc apply -f -

echo "GPU Operator installation completed successfully."

Copy snippet

2.5 Updating ClusterPolicy object

And next, let's update the ClusterPolicy:

# apply the ClusterPolicy for the GPU operator
cat <<EOF | oc apply -f -
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: crio
  daemonsets:
    enabled: true
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
  devicePlugin:
    enabled: true
  driver:
    enabled: true
  gfd:
    enabled: true
  migManager:
    enabled: true
  nodeStatusExporter:
    enabled: true
  toolkit:
    enabled: true
EOF

echo "Waiting for ClusterPolicy to be ready..."
if oc wait clusterpolicy/gpu-cluster-policy --for=condition=Ready --timeout=300s -n gpu-operator-resources; then
    echo "ClusterPolicy has been successfully created and is ready."
    echo "ClusterPolicy status:"
    oc get clusterpolicy -n gpu-operator-resources
else
    echo "Timed out waiting for ClusterPolicy to be ready."
    exit 1
fi

Copy snippet

3. Setting up the OpenShift AI environment

Next, we will install the RHOAI operator.

3.1 Installing RHOAI operator

Run the following command to begin the installation:

# apply the RHOAI Operator configurations
cat << EOF | oc apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  name: rhods-operator
  channel: stable
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

echo "RHOAI Operator configurations applied. Waiting for the operator to be ready..."

# check if the RHOAI Operator pod is ready
check_rhoai_operator_pod() {
    local pod_status=$(oc get pods -n redhat-ods-operator -l name=rhods-operator --no-headers -o custom-columns=":status.phase")
    if [[ "$pod_status" == "Running" ]]; then
        return 0
    else
        return 1
    fi
}

# wait for RHOAI Operator pod to be ready with a timeout
timeout=600  
end_time=$((SECONDS + timeout))
while ! check_rhoai_operator_pod; do
    if [ $SECONDS -ge $end_time ]; then
        echo "Timeout waiting for RHOAI Operator pod to be ready."
        break
    fi
    echo "Waiting for RHOAI Operator pod to be ready... ($(($end_time - $SECONDS))s remaining)"
    sleep 30
done

# display the RHOAI Operator pod status
echo "RHOAI Operator pod status:"
oc get pods -n redhat-ods-operator

# check the CSV (ClusterServiceVersion) status
echo "Checking RHOAI Operator CSV status:"
oc get csv -n redhat-ods-operator

# check the Subscription status
echo "Checking RHOAI Operator Subscription status:"
oc get subscription -n redhat-ods-operator rhods-operator -o jsonpath='{.status.state}{"\n"}'

# final check for operator readiness
if check_rhoai_operator_pod; then
    echo "RHOAI Operator is ready and running."
else
    echo "RHOAI Operator deployment might have issues. Please check the pod logs and events for more information."
fi

Copy snippet

3.2 Creating DSC instance

Now let's create a DSC instance:

# apply the DataScienceCluster configuration
cat <<EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    kserve:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    workbenches:
      managementState: Managed
EOF

echo "Waiting for DataScienceCluster to be created..."
timeout 300 bash -c 'until oc get datasciencecluster default-dsc -n redhat-ods-operator &>/dev/null; do sleep 5; done'

if [ $? -eq 0 ]; then
    echo "DataScienceCluster has been successfully created."
    echo "DataScienceCluster status:"
    oc get datasciencecluster -n redhat-ods-operator
else
    echo "Timed out waiting for DataScienceCluster to be created."
    exit 1
fi

Copy snippet

3.3 Enabling GPU as RHOAI accelerator

Next, let's enable the GPU as the accelerator by deleting the migration-gpu-status ConfigMap:

echo "Checking for migration-gpu-status configmap..."
if oc get configmap migration-gpu-status -n redhat-ods-applications &>/dev/null; then
    echo "migration-gpu-status configmap found. Deleting..."
    oc delete configmap migration-gpu-status -n redhat-ods-applications
else
    echo "migration-gpu-status configmap not found. Continuing with dashboard restart."
fi

echo "Restarting rhods-dashboard deployment..."
oc rollout restart deployment/rhods-dashboard -n redhat-ods-applications
echo "Waiting for rhods-dashboard rollout to complete..."

oc rollout status deployment/rhods-dashboard -n redhat-ods-applications
echo "Script completed successfully. GPU acceleration should now be enabled."
echo "Please verify in the RHOAI dashboard that GPU options are available for your workbenches."

Copy snippet

4. Deploying Stable Diffusion model

In this tutorial, we'll use the Stable Diffusion 2.1 model from Stability AI to generate images based on text prompts. We'll generate three images based on prompts about cats and dogs, using 50 inference steps and a guidance scale of 7.5. These images are then displayed vertically using matplotlib, with each image titled by its corresponding prompt.

Now that we have the environment ready, let's go to the console and go to the RHOAI dashboard links (select it from the 9-box icon on the upper right side of the console). Then once you're at the RHOAI dashboard console, from the navigator pane on the left hand side, select Applications and click Enabled, which will then lead you to launch a Jupyter notebook.

Click Launch application and then select TensorFlow 2024.1 notebook. You can leave the container size to Small. Then select NVIDIA GPU as the accelerator from the dropdown option. See Figure 1.

RHOAI dashboard > Applications > Enabled. — Figure 1: Select NVIDIA GPU as Accelerator.

Click the Start server button and wait until the notebook is ready, and click Open in new tab. Once you're routed to the Jupyter notebook, click Python 3.9 notebook button on top, and run the following script in a single cell:

# install the necessary dependencies and libraries
!pip install --upgrade diffusers transformers torch accelerate matplotlib datasets torchvision

import torch
from diffusers import StableDiffusionPipeline
from datasets import load_dataset
import random
from torchvision import transforms
from PIL import Image
import matplotlib.pyplot as plt
import gc

# clean up memory and reset CUDA cache
def cleanup_memory():
    gc.collect()
    torch.cuda.empty_cache()
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()

# load the Stable Diffusion model
def load_model(model_id):
    pipeline = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipeline = pipeline.to("cuda" if torch.cuda.is_available() else "cpu")
    return pipeline

# generate the images
def generate_images(pipeline, prompts, num_images_per_prompt=1, num_inference_steps=50, guidance_scale=7.5):
    images = []
    for prompt in prompts:
        batch = pipeline(
            prompt, 
            num_images_per_prompt=num_images_per_prompt, 
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            output_type="pil"
        )
        images.extend(batch.images)
        cleanup_memory()
    return images

# display the images
def display_images(images, prompts):
    rows = len(images)
    fig, axs = plt.subplots(rows, 1, figsize=(15, 5*rows))
    
    if rows == 1:
        axs = [axs]  
    
    for img, ax, prompt in zip(images, axs, prompts):
        ax.imshow(img)
        ax.set_title(prompt, fontsize=10)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()

# execute the script
if __name__ == "__main__":
    try:        
        pipeline = load_model('stabilityai/stable-diffusion-2-1')
                
        prompts = [
            "A cute cat",
            "A cute dog",
            "A cute cat and a cute dog sit next to each other"
        ]
        num_images_per_prompt = 1
        
        generated_images = generate_images(pipeline, prompts, num_images_per_prompt, num_inference_steps=50, guidance_scale=7.5)
        display_images(generated_images, prompts)
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")
    finally:
        cleanup_memory()

Copy snippet

Here are some pictures that I've gotten from my run (Figures 2-4). Note that the pictures may vary every run.

AN accurate AI-generated image of a cat. — Figure 2: Accurate image of a cat.

An accurate AI-generated image of a dog. — Figure 3: Accurate image of a dog.

An accurate AI-generated image of a cat and a dog sitting next to each other. — Figure 4: Accurate image of a cat and a dog.

Note that these prompts, e.g., “A cute cat”, “A cute dog”, and “A cute cat and a cute dog sit next to each other”, are just examples, and you can modify your prompts to your liking by modifying the prompts in the main function.

If you experience hung kernel or something similar, erestart/refresh RHOAI dashboard and relaunch the notebook. Alternatively, if you were using an HCP cluster, you might also want to add more nodes into the machine pool.

Note that you might encounter the following warning messages, which are informational and generally harmless:

"The cache for model files in Transformers v4.22.0 has been updated…": This is just an informational message that can be safely ignored once the cache migration is complete.
"Unable to register cuDNN/cuFFT/cuBLAS factory...": These messages indicate that these CUDA libraries are being initialized multiple times.
"This TensorFlow binary is optimized to use available CPU instructions...": This is also just an informational message that TensorFlow installation is working but could potentially be optimized further.
"TF-TRT Warning: Could not find TensorRT": This warning indicates that TensorRT is not available, which might affect performance but not functionality.

5. Future research

Note that this is a simple tutorial intended to guide you through the necessary environment set up once you have a ROSA cluster spun up and followed by a simple deployment of generating images using the Stable Diffusion model. If you happen to get unsatisfactory results, i.e., inaccurate images, there are many ways you can go about improving them, such as by adjusting the parameters and using more specific prompts.

In one of my runs, I noticed that the model generated an inaccurate image of a cat and a dog (for the third prompt) as shown in Figure 5.

An inaccurate AI-generated image of a cat and dog sitting next to each other. — Figure 5: Inaccurate image of a cat and a dog.

So here I adjusted the num_inference_steps from 50 to 75, guidance_scale from 7.5 to 8.5, and modified the last prompt into “A cute cat and a cute dog sitting next to each other, both faces and bodies are in the same image and background”. And thus, I got Figure 6 as a result (note that results may vary).

Increasing num_inference_steps will allow the model more iterations to refine the image, adjusting guidance_scale can lead to images that are more closely matching the prompt, and using more detailed prompts can help guide the model better.

However, note that even with these optimizations, generating images with multiple specific elements can be tricky due to the inherent nature of generative models. You might still need to run the code multiple times to get the desired results.

Note that there are many other ways to improve the accuracy that I’m not going to delve further in this blog, such as using negative prompts to exclude what you don’t want to see in the image, fine-tuning the model, using another model, increasing the batch size, etc. These are all potential topics for future research.

Use Stable Diffusion to create images on Red Hat OpenShift AI on a ROSA cluster with GPU enabled

Share:

What you'll be doing

1. Prerequisites

2. Setting up GPU environment

2.1 Creating GPU machine pools

2.2 Installing NFD operator

2.3 Creating NFD instance

2.4 Installing GPU operator

2.5 Updating ClusterPolicy object

3. Setting up the OpenShift AI environment

3.1 Installing RHOAI operator

3.2 Creating DSC instance

3.3 Enabling GPU as RHOAI accelerator

4. Deploying Stable Diffusion model

5. Future research

Products

Build

Quicklinks

Communicate

RED HAT DEVELOPER

Red Hat legal and privacy links

Red Hat legal and privacy links

Report a website issue