Run DialoGPT-small on OpenShift AI for internal model testing

A practical guide to deploying large language models on OpenShift AI for internal testing and validation

September 25, 2025
Christina Zhang
Related topics: Artificial intelligence, DevOps, Linux, Runtimes
Related products: Red Hat AI, Red Hat OpenShift AI

    With the rise of generative AI, many enterprises are exploring how to bring large language models (LLMs) into secure, internal cloud-native environments. When used with KServe, vLLM, and GPU support, platforms like Red Hat OpenShift AI provide a robust approach to serving models efficiently at scale.

    In this blog, I’ll walk you through a complete internal deployment workflow for the DialoGPT-small language model on OpenShift AI using Red Hat AI Inference Server, all without exposing any external endpoints. You’ll learn how to set up your environment, configure a ServingRuntime, manage model storage with persistent volume claims (PVCs), and deploy an inference service ready for testing. The flow is illustrated in Figure 1.

    Warning

    This workflow is designed for internal testing and evaluation purposes only; it is not intended for production use. For production environments, follow the official product documentation and use supported configuration methods provided by Red Hat.

    Figure 1: Deployment flow from platform setup to runtime execution.

    Environment verification

    Ensure the following components are ready:

    • KServe controller running normally
    • All Knative Serving components running normally
    • Istio system components running normally
    • DataScienceCluster status is Ready

    Install the required operators:

    • NVIDIA GPU Operator: Provides GPU support
    • Red Hat OpenShift AI: Provides AI/ML platform functionality
    • Red Hat OpenShift Serverless: Provides Knative Serving support
    • Red Hat OpenShift Service Mesh 2: Provides Istio service mesh support
    • Node Feature Discovery Operator: Automatically discovers node features
    • Package Server: Manages operator packages

    Verify operator status:

    # Check required Operators status
    oc get csv -A | grep -E "(gpu-operator|rhods|serverless|servicemesh|nfd)"
    # View DataScienceCluster status
    oc get datasciencecluster -A
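
    To confirm that GPUs are actually schedulable after the NVIDIA GPU Operator finishes its rollout, one quick check is to look for the nvidia.com/gpu extended resource on your nodes. This is a sketch; the exact resource name depends on your GPU Operator configuration:

    # Show allocatable GPU count per node; an empty GPUS column means no schedulable GPUs yet
    oc get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'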

    Deploy an LLM on OpenShift AI

    1. Create and switch to a dedicated working namespace for this deployment:

      oc new-project ai-inference-demo

      Confirm you are working in the ai-inference-demo project before you proceed.

    2. Configure the namespace as a service mesh member:

      # Add Istio injection label to namespace
      oc label namespace ai-inference-demo istio-injection=enabled
      # Check if ServiceMeshMemberRoll needs to be updated
      oc get servicemeshmemberroll -A
      # If ServiceMeshMemberRoll exists, add namespace to member list
      oc patch servicemeshmemberroll default -n istio-system --type='json' -p='[{"op": "add", "path": "/spec/members/-", "value": "ai-inference-demo"}]'
      # Verify namespace labels
      oc get namespace ai-inference-demo --show-labels
      # Enable anyuid SCC to avoid token and permission issues
      oc adm policy add-scc-to-user anyuid -z default -n ai-inference-demo
    3. Configure the Red Hat registry image pull permissions:

      # Create Red Hat Registry pull secret (requires valid Red Hat Customer Portal credentials)
      oc create secret docker-registry redhat-registry-secret \
          --docker-server=registry.redhat.io \
          --docker-username=YOUR_RH_USERNAME \
          --docker-password='YOUR_RH_PASSWORD' \
          --docker-email=YOUR_EMAIL
      # Link secret to default service account
      oc secrets link default redhat-registry-secret --for=pull
      oc secrets link deployer redhat-registry-secret --for=pull
      # Verify secret creation
      oc get secret redhat-registry-secret

      Note: Replace YOUR_RH_USERNAME, YOUR_RH_PASSWORD, and YOUR_EMAIL with your actual Red Hat Customer Portal credentials.

      Note

      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0-1752784628 was the latest version at the time of writing. You can find the current version in the Red Hat Ecosystem Catalog by searching for rhaiis.
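
      If you prefer to check available tags from the command line instead of the catalog, one option (assuming skopeo is installed and you have valid Red Hat Customer Portal credentials) is:

      # Log in to the Red Hat registry, then list the available image tags
      skopeo login registry.redhat.io
      skopeo list-tags docker://registry.redhat.io/rhaiis/vllm-cuda-rhel9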

    4. Create a ServingRuntime as follows. A ServingRuntime defines the reusable runtime environment (container image, supported model formats, and resource settings) that OpenShift AI uses to serve machine learning models.

      cat <<EOF | oc apply -f -
      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: red-hat-vllm-runtime
        namespace: ai-inference-demo
      spec:
        supportedModelFormats:
          - name: vllm
            version: "1"
            autoSelect: true
          - name: pytorch
            version: "1"
            autoSelect: true
        containers:
          - name: kserve-container
            image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0-1752784628
            ports:
              - containerPort: 8080
                name: http1
                protocol: TCP
            command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
            args:
              - "--model"
              - "/mnt/models/DialoGPT-small"
              - "--host"
              - "0.0.0.0"
              - "--port"
              - "8080"
              - "--served-model-name"
              - "DialoGPT-small"
              - "--max-model-len"
              - "1024"
              - "--disable-log-requests"
            env:
              - name: VLLM_CPU_KVCACHE_SPACE
                value: "4"
              - name: HF_HUB_OFFLINE
                value: "1"
              - name: TRANSFORMERS_OFFLINE
                value: "1"
            resources:
              requests:
                cpu: "1"
                memory: "4Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "1"
            readinessProbe:
              httpGet:
                path: /health
                port: 8080
              initialDelaySeconds: 120
              periodSeconds: 10
              timeoutSeconds: 10
            livenessProbe:
              httpGet:
                path: /health
                port: 8080
              initialDelaySeconds: 180
              periodSeconds: 30
              timeoutSeconds: 10
      EOF
    5. Verify the ServingRuntime status:

      # Check ServingRuntime status
      oc get servingruntime red-hat-vllm-runtime
      # View detailed information
      oc describe servingruntime red-hat-vllm-runtime
    6. Create a persistent volume claim (PVC) for model storage. While this example uses a PVC to store model files locally in the cluster, other storage options, such as downloading directly from Hugging Face, using object storage (like S3), or mounting a hostPath volume, are also possible depending on your environment and security needs. Note that the storage class used in the claim is cluster specific; see the check at the end of this step.

      cat <<EOF | oc apply -f -
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-storage-pvc
        namespace: ai-inference-demo
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: gp3-csi
      EOF

      Verify PVC creation:

      oc get pvc model-storage-pvc
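
      The storageClassName used above (gp3-csi) is specific to the cluster in this example. If the PVC stays in Pending, list the storage classes available in your cluster and adjust the claim accordingly:

      # Show available storage classes; the one marked (default) is used when storageClassName is omitted
      oc get storageclass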
    7. Download the model to the PVC:

      cat <<EOF | oc apply -f -
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: dialogpt-model-downloader
        namespace: ai-inference-demo
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: downloader
              image: python:3.12-slim
              command:
              - /bin/sh
              - -c
              - |
                set -e
                export HOME=/tmp
                pip install --no-cache-dir --user huggingface_hub
                export PATH="\$HOME/.local/bin:\$PATH"
                mkdir -p /models/DialoGPT-small
                python3 -c "from huggingface_hub import hf_hub_download; files = ['config.json', 'pytorch_model.bin', 'tokenizer_config.json', 'vocab.json', 'merges.txt']; [hf_hub_download(repo_id='microsoft/DialoGPT-small', filename=f, local_dir='/models/DialoGPT-small') for f in files]"
                rm /models/DialoGPT-small/tokenizer.json || true
                ls -la /models/DialoGPT-small/
                du -sh /models/DialoGPT-small/pytorch_model.bin
              volumeMounts:
              - name: model-storage
                mountPath: /models
              env:
              - name: HF_TOKEN
                value: "YOUR_HF_TOKEN_HERE"
            volumes:
            - name: model-storage
              persistentVolumeClaim:
                claimName: model-storage-pvc
      EOF

      If you need to access private models, replace YOUR_HF_TOKEN_HERE with your Hugging Face token.

    8. Monitor the model download progress. View job status:

      oc get jobs

      View download logs:

      oc logs job/dialogpt-model-downloader -f
      # Wait until the job completes; the final log lines list the downloaded files and the model size
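
      If you would rather block until the download finishes instead of tailing the logs, one alternative (a sketch; adjust the timeout to your network speed) is to wait on the job's completion condition:

      # Block until the download job reports completion
      oc wait --for=condition=complete job/dialogpt-model-downloader --timeout=15m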
    9. Verify the model file location:

      # Create debug Pod to check model files in PVC
      cat <<EOF | oc apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: pvc-explorer
        namespace: ai-inference-demo
      spec:
        restartPolicy: Never
        containers:
        - name: explorer
          image: busybox:latest
          imagePullPolicy: IfNotPresent
          command: ["sleep", "300"]
          volumeMounts:
          - name: model-storage
            mountPath: /data
        volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage-pvc
      EOF
      # Check model file location - ensure PVC has downloaded LLM
      oc exec pvc-explorer -- ls -la /data/
      oc exec pvc-explorer -- ls -la /data/DialoGPT-small/
      oc exec pvc-explorer -- find /data -name "config.json"
      oc exec pvc-explorer -- du -h /data/DialoGPT-small/pytorch_model.bin
      # Verify content (required)
      oc exec pvc-explorer -- head -n 5 /data/DialoGPT-small/config.json
      oc exec pvc-explorer -- head -n 5 /data/DialoGPT-small/tokenizer_config.json
      # Clean up debug Pod
      oc delete pod pvc-explorer

      You should see a /data/DialoGPT-small/ directory containing the following files:

      • config.json
      • pytorch_model.bin
      • tokenizer_config.json
      • vocab.json
      • merges.txt
    10. Create the InferenceService:

      cat <<EOF | oc apply -f -
      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      metadata:
        name: dialogpt-small-service
        namespace: ai-inference-demo
        annotations:
          sidecar.istio.io/inject: "false"  # Disable Istio sidecar to avoid envoy errors
          serving.kserve.io/enable-service-account-token-mount: "true"  # Mount token to resolve authentication failures
      spec:
        predictor:
          model:
            modelFormat:
              name: pytorch
            runtime: red-hat-vllm-runtime
            storageUri: pvc://model-storage-pvc
            resources:
              requests:
                cpu: "1"
                memory: "4Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "1"
            env:
              - name: VLLM_GPU_MEMORY_UTILIZATION
                value: "0.5"
      EOF

      Model information:

      • microsoft/DialoGPT-small: A lightweight conversational model with about 117M parameters
      • Local storage: Loaded from PVC, fast and stable startup
      • Conversational generation: Suitable for testing inference functionality
      • vLLM optimized: Uses vLLM inference engine for better performance
    11. Monitor the deployment status:

      # Real-time monitor InferenceService status
      oc get inferenceservice dialogpt-small-service -w
      # View related pods
      oc get pods -l serving.kserve.io/inferenceservice=dialogpt-small-service
      # View detailed status
      oc describe inferenceservice dialogpt-small-service
      # View events
      oc get events --sort-by='.lastTimestamp' | head -20

      When you see READY=True, it means the service has started successfully.
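
      Instead of watching the output manually, you can also wait for the Ready condition directly (a sketch; adjust the timeout to however long model loading takes in your environment):

      # Block until KServe marks the InferenceService as Ready
      oc wait --for=condition=Ready inferenceservice/dialogpt-small-service --timeout=15m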

    12. Test the inference service. The simplest method is to call the API from inside the predictor pod.

      Important note: DialoGPT-small is a small conversational model (about 117M parameters) with limited response quality. It can sometimes generate incoherent content; this is expected behavior.

      # Set variables
      PREDICTOR_POD=$(oc get pods -l serving.kserve.io/inferenceservice=dialogpt-small-service -o jsonpath='{.items[0].metadata.name}')
      # Basic health check
      echo "=== Health Check ==="
      oc exec $PREDICTOR_POD -c kserve-container -- curl -s localhost:8080/health
      # Conversation test 1: Simple greeting
      echo -e "\n=== I ask: Hello, how are you? ==="
      oc exec $PREDICTOR_POD -c kserve-container -- curl -s -X POST localhost:8080/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{
         "model": "DialoGPT-small",
         "messages": [{"role": "user", "content": "Hello, how are you?"}],
         "max_tokens": 30,
         "temperature": 0.7
       }'
      # Conversation test 2: Ask for name
      echo -e "\n=== I ask: What is your name? ==="
      oc exec $PREDICTOR_POD -c kserve-container -- curl -s -X POST localhost:8080/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{
         "model": "DialoGPT-small",
         "messages": [{"role": "user", "content": "What is your name?"}],
         "max_tokens": 20,
         "temperature": 0.8
       }'
      # Conversation test 3: Simple question
      echo -e "\n=== I ask: Hi ==="
      oc exec $PREDICTOR_POD -c kserve-container -- curl -s -X POST localhost:8080/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{
         "model": "DialoGPT-small",
         "messages": [{"role": "user", "content": "Hi"}],
         "max_tokens": 10,
         "temperature": 0.5
       }'
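
      If you prefer to test from your workstation while still keeping everything internal, one option (a sketch that reuses the PREDICTOR_POD variable set above and forwards the vLLM port to localhost) is:

      # Forward the vLLM port from the predictor pod to your local machine
      oc port-forward pod/$PREDICTOR_POD 8080:8080 &
      # List the served models, then send a chat completion from your workstation
      curl -s localhost:8080/v1/models
      curl -s -X POST localhost:8080/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "DialoGPT-small", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'
      # Stop the port-forward when finished
      kill %1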
    13. Check performance and monitoring. To view resource usage:

      # View Pod resource usage (requires metrics-server support)
      oc adm top pod -l serving.kserve.io/inferenceservice=dialogpt-small-service
      # If the above command doesn't work, use alternative methods:
      # View Pod resource configuration and limits
      PREDICTOR_POD=$(oc get pods -l serving.kserve.io/inferenceservice=dialogpt-small-service -o jsonpath='{.items[0].metadata.name}')
      oc describe pod $PREDICTOR_POD | grep -A10 -B5 "Limits\|Requests"
      # View Pod status and runtime
      oc get pod $PREDICTOR_POD -o wide
      # View node resource usage
      oc adm top nodes
      # If metrics-server is not available, view basic Pod information
      oc get pod $PREDICTOR_POD -o jsonpath='{.status.containerStatuses[*].restartCount}'
      echo " (restart count)"

      Service status check:

      # Check InferenceService overall status
      oc get inferenceservice dialogpt-small-service -o yaml | grep -A20 status
      # View all related resource status
      oc get pods,svc,inferenceservice -l serving.kserve.io/inferenceservice=dialogpt-small-service
      # View recent cluster events
      oc get events --sort-by='.lastTimestamp' | head -20
      # Check service endpoints
      oc get endpoints dialogpt-small-service-predictor
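
      vLLM also exposes Prometheus-style metrics on the same port, which can give a quick view of request throughput and KV cache usage (a sketch; metric names vary between vLLM releases):

      # Dump vLLM's metrics from inside the predictor pod and filter for vLLM-specific series
      oc exec $PREDICTOR_POD -c kserve-container -- curl -s localhost:8080/metrics | grep "^vllm"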

    Troubleshooting common issues

    Inference service cannot be accessed:

    # Check service status
    oc get svc | grep dialogpt-small-service
    # Check endpoints
    oc get endpoints dialogpt-small-service-predictor
    # Check pods status
    oc get pods -l serving.kserve.io/inferenceservice=dialogpt-small-service

    Model loading failed:

    # View pod events
    oc describe pod $PREDICTOR_POD
    # Check model files
    oc exec $PREDICTOR_POD -c kserve-container -- ls -la /mnt/models/DialoGPT-small/
    # View vLLM startup logs
    oc logs $PREDICTOR_POD -c kserve-container | grep -i error
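
    Because the InferenceService uses a pvc:// storage URI, KServe copies the model with a storage-initializer init container before vLLM starts. If the pod never reaches Running, those logs are worth checking (a hedged example, assuming KServe's default init container name):

    # Inspect the init container that copies the model out of the PVC
    oc logs $PREDICTOR_POD -c storage-initializer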

    Insufficient memory or GPU resources:

    # Check node resources
    oc describe nodes | grep -A5 -B5 "Allocated resources"
    # Reduce resource requirements
    oc patch inferenceservice dialogpt-small-service --type='merge' -p='{
      "spec": {
        "predictor": {
          "model": {
            "resources": {
              "requests": {"cpu": "500m", "memory": "2Gi"},
              "limits": {"cpu": "1", "memory": "4Gi"}
            }
          }
        }
      }
    }'

    Resource cleanup

    If you want to clear your environment after the tests, delete the resources with the following commands:

    # Delete the debug pod (if still present)
    oc delete pod pvc-explorer --ignore-not-found
    # Delete InferenceService
    oc delete inferenceservice dialogpt-small-service
    # Delete ServingRuntime
    oc delete servingruntime red-hat-vllm-runtime
    # Delete download Job
    oc delete job dialogpt-model-downloader
    # Delete PVC (Note: this will delete all downloaded models)
    oc delete pvc model-storage-pvc
    # Delete Pull Secret
    oc delete secret redhat-registry-secret
    # Delete entire project
    oc delete project ai-inference-demo

    Summary

    This guide walked through a complete Red Hat AI Inference Server deployment and internal testing process on OpenShift AI. Its advantages include the following:

    • Security-focused: All testing is done inside the cluster, with no need to expose external endpoints
    • Efficient: Uses PVC local storage for fast model loading
    • Flexible: Supports multiple testing methods and interaction approaches
    • Observable: Provides detailed monitoring and log viewing methods

    Use cases:

    • Development and testing environment verification
    • Internal API integration testing
    • Model performance evaluation
    • AI service deployment in security-compliant environments

    By following this guide, you can fully deploy and test Red Hat AI Inference Server without creating external routes.

    Explore the Red Hat AI Inference Server product page and our guided demo for more information or check out our technical documentation for detailed configurations.
