Scale with OpenShift AI and important considerations

We've validated our model's performance on RHEL AI, so now we are ready to leverage OpenShift AI to productionize and scale our solution.
Prerequisites:
- Red Hat OpenShift Container Platform (RHOCP) 4.12 or later with:
  - Minimum 3 control plane nodes
  - Minimum 3 worker nodes (preferably with GPU support for AI workloads)
  - GPU nodes with the NVIDIA GPU Operator installed (for optimal performance)
- Red Hat subscriptions:
  - Valid OpenShift subscription
  - OpenShift AI subscription
- Required CLI tools:
  - oc (OpenShift CLI) or kubectl
  - podman or docker for container building
- Storage requirements:
  - Persistent storage provisioner (e.g., OpenShift Data Foundation, AWS EBS)
  - Minimum 100GB of available storage for models
In this lesson, you will:
- Utilize OpenShift AI to scale our solution.
Phase 3: Scaling with OpenShift AI
OpenShift AI provides comprehensive tools for the entire ML lifecycle, from experimentation to production deployment at scale.
Note: Before starting with OpenShift AI, ensure you have satisfied the prerequisites.
Create a new data science project for our AI workloads:
# Create the project
oc new-project telecom-ai-prod
# Label it for OpenShift AI
oc label namespace telecom-ai-prod opendatahub.io/dashboard=true modelmesh-enabled=true
Create a PersistentVolumeClaim (PVC) to store your model:
cat << EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: telecom-model-pvc
  namespace: telecom-ai-prod
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
EOF
Verify the PVC is bound:
oc get pvc -n telecom-ai-prod
Transfer model files to PVC
Follow these steps to transfer your trained model to the PVC using a temporary pod:
Create a temporary pod with the PVC mounted:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: model-upload-pod
  namespace: telecom-ai-prod
spec:
  containers:
    - name: upload-container
      image: registry.access.redhat.com/ubi9/ubi:latest
      command: ["/bin/bash", "-c", "sleep 3600"]
      volumeMounts:
        - name: model-storage
          mountPath: /models
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: telecom-model-pvc
EOF
Wait for the pod to be ready:
oc wait --for=condition=Ready pod/model-upload-pod -n telecom-ai-prod --timeout=60s
Copy your model files to the PVC:
# Copy to the OpenShift pod from your local machine
oc cp telecom-model.tar.gz telecom-ai-prod/model-upload-pod:/models/

# Extract in the pod
oc exec -n telecom-ai-prod model-upload-pod -- tar -xzf /models/telecom-model.tar.gz -C /models/
Clean up the temporary pod:
oc delete pod model-upload-pod -n telecom-ai-prod
Create a custom model server container
Since the GGUF format requires special handling, let's create a custom container that can serve GGUF models:
Create a Dockerfile following Red Hat standards:
# Use Red Hat Universal Base Image
FROM registry.access.redhat.com/ubi9/python-311:latest

# Switch to root for installation
USER 0

# Install system dependencies
RUN dnf install -y \
    gcc \
    gcc-c++ \
    make \
    git \
    && dnf clean all

# Install Python dependencies
RUN pip install --no-cache-dir \
    llama-cpp-python==0.2.57 \
    flask==3.0.2 \
    gunicorn==21.2.0 \
    prometheus-client==0.19.0

# Create app directory
WORKDIR /app

# Copy the server script
COPY model_server.py /app/

# Set permissions (user 1001 already exists in UBI images)
RUN chown -R 1001:0 /app && \
    chmod -R g=u /app

# Switch to non-root user
USER 1001

# Expose ports
EXPOSE 8080 9090

# Set environment variables
ENV MODEL_PATH="/models/instructlab-granite-7b-lab-Q4_K_M.gguf"
ENV HOST="0.0.0.0"
ENV PORT="8080"

# Run the server
CMD ["python", "model_server.py"]
Create the model server Python script:
# model_server.py
import os
import json
import logging
import time

from flask import Flask, request, jsonify
from llama_cpp import Llama
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Prometheus metrics
REQUEST_COUNT = Counter('model_requests_total', 'Total number of requests')
REQUEST_LATENCY = Histogram('model_request_duration_seconds', 'Request latency')
ERROR_COUNT = Counter('model_errors_total', 'Total number of errors')

app = Flask(__name__)

# Load model
MODEL_PATH = os.environ.get('MODEL_PATH', '/models/model.gguf')
logger.info(f"Loading model from {MODEL_PATH}")

try:
    llm = Llama(
        model_path=MODEL_PATH,
        n_ctx=4096,
        n_threads=4,
        n_gpu_layers=-1  # Use all available GPU layers
    )
    logger.info("Model loaded successfully")
except Exception as e:
    logger.error(f"Failed to load model: {e}")
    raise


@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint"""
    return jsonify({"status": "healthy", "model": MODEL_PATH})


@app.route('/ready', methods=['GET'])
def ready():
    """Readiness check endpoint"""
    return jsonify({"status": "ready"})


@app.route('/v1/completions', methods=['POST'])
def completions():
    """OpenAI-compatible completions endpoint"""
    REQUEST_COUNT.inc()
    start_time = time.time()

    try:
        data = request.json
        prompt = data.get('prompt', '')
        max_tokens = data.get('max_tokens', 500)
        temperature = data.get('temperature', 0.7)

        # Generate completion
        response = llm(
            prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            echo=False
        )

        # Format response
        result = {
            "id": f"cmpl-{int(time.time())}",
            "object": "text_completion",
            "created": int(time.time()),
            "model": "telecom-assistant",
            "choices": [{
                "text": response['choices'][0]['text'],
                "index": 0,
                "finish_reason": "stop"
            }]
        }

        REQUEST_LATENCY.observe(time.time() - start_time)
        return jsonify(result)

    except Exception as e:
        ERROR_COUNT.inc()
        logger.error(f"Error generating completion: {e}")
        return jsonify({"error": str(e)}), 500


@app.route('/metrics', methods=['GET'])
def metrics():
    """Prometheus metrics endpoint"""
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}


if __name__ == '__main__':
    host = os.environ.get('HOST', '0.0.0.0')
    port = int(os.environ.get('PORT', 8080))
    app.run(host=host, port=port)
Build the container image:
# Create a build directory
mkdir telecom-model-server
cd telecom-model-server

# Copy the Dockerfile and model_server.py to this directory,
# then build using podman for the linux/amd64 architecture
podman build --platform linux/amd64 -t telecom-model-server-amd64:latest .

# Tag for your registry
podman tag telecom-model-server-amd64:latest quay.io/<your-org>/telecom-model-server-amd64:latest

# Push to the registry
podman push quay.io/<your-org>/telecom-model-server-amd64:latest
Deploy with a custom Deployment
Since GGUF models require special handling with llama.cpp, let's create a custom Deployment.
Create a deployment for the model server:
cat << EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: telecom-model-server
  namespace: telecom-ai-prod
  labels:
    app: telecom-model-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: telecom-model-server
  template:
    metadata:
      labels:
        app: telecom-model-server
    spec:
      # Add tolerations to allow scheduling on GPU nodes
      tolerations:
        - key: "p4-gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: model-server
          image: quay.io/<your-org>/telecom-model-server-amd64:latest
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: MODEL_PATH
              value: "/models/instructlab-granite-7b-lab-trained/instructlab-granite-7b-lab-Q4_K_M.gguf"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
            limits:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 300
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 10
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: telecom-model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: telecom-model-service
  namespace: telecom-ai-prod
  labels:
    app: telecom-model-server
spec:
  selector:
    app: telecom-model-server
  ports:
    - name: http
      port: 8080
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: telecom-model-route
  namespace: telecom-ai-prod
  annotations:
    haproxy.router.openshift.io/timeout: "600s"        # 10-minute timeout for model inference
    haproxy.router.openshift.io/timeout-server: "600s" # Server-side timeout
spec:
  to:
    kind: Service
    name: telecom-model-service
  port:
    targetPort: http
  tls:
    termination: edge
EOF
Monitoring and scaling
Set up monitoring for your deployed model:
cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: telecom-model-metrics
  namespace: telecom-ai-prod
spec:
  selector:
    matchLabels:
      app: telecom-model-server
  endpoints:
    - port: metrics
      interval: 30s
EOF
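The model server already exposes Prometheus counters (model_requests_total, model_errors_total), so you can pair the ServiceMonitor with an alerting rule. The following is a minimal sketch; it assumes user workload monitoring is enabled on your cluster, and the alert name and 5% threshold are illustrative choices, not recommendations from this guide:

cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: telecom-model-alerts
  namespace: telecom-ai-prod
spec:
  groups:
    - name: telecom-model.rules
      rules:
        # Fire when more than 5% of completion requests fail over 5 minutes.
        # Metric names come from model_server.py; the threshold is an example.
        - alert: TelecomModelHighErrorRate
          expr: rate(model_errors_total[5m]) / rate(model_requests_total[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Telecom model server error rate is above 5%"
EOF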
Configure horizontal pod autoscaling:
cat << EOF | oc apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: telecom-model-hpa
  namespace: telecom-ai-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: telecom-model-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
EOF
Test your deployed model
Once deployed, test your model:
# Get the route URL
MODEL_URL=$(oc get route telecom-model-route -n telecom-ai-prod -o jsonpath='{.spec.host}')
# Test the health endpoint first
curl -k https://${MODEL_URL}/health
# Start with a minimal request to verify connectivity
curl -k -X POST https://${MODEL_URL}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hi",
"max_tokens": 10,
"temperature": 0.7
}'
# Once minimal requests work, test with larger requests
# Note: Larger token counts will take longer to process
curl -k -X POST https://${MODEL_URL}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "What is fiber optic internet?",
"max_tokens": 100,
"temperature": 0.7
}'
# For production use, consider implementing:
# 1. Streaming responses for better user experience
# 2. Request queuing for handling multiple concurrent requests
# 3. Response caching for common queries
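As an illustration of the first of those suggestions, model_server.py could gain a streaming variant of the completions endpoint. This is a minimal sketch that assumes the app and llm objects already defined in model_server.py; the /v1/completions/stream path and the server-sent-events framing are choices made for this example, not part of any existing API:

# streaming_endpoint.py (sketch) -- assumes `app` and `llm` from model_server.py
import json
from flask import Response, request

@app.route('/v1/completions/stream', methods=['POST'])
def completions_stream():
    """Stream tokens back as they are generated instead of waiting for the full reply."""
    data = request.json
    prompt = data.get('prompt', '')
    max_tokens = data.get('max_tokens', 500)
    temperature = data.get('temperature', 0.7)

    def generate():
        # llama-cpp-python yields partial chunks when stream=True
        for chunk in llm(prompt, max_tokens=max_tokens,
                         temperature=temperature, echo=False, stream=True):
            text = chunk['choices'][0]['text']
            # Server-sent-events style framing; adapt to whatever your client expects
            yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return Response(generate(), mimetype='text/event-stream')

Streaming keeps the connection active, which also makes long generations far less likely to hit the route timeout discussed below.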
Performance considerations
When serving GGUF models through OpenShift routes, consider the following:
- Initial model loading: The first request after pod startup will be slower as the model loads into memory.
- Token generation time: Each token takes time to generate; 200 tokens can take 30-60 seconds, depending on model size and GPU.
- Route timeouts: The default OpenShift route timeout is 30 seconds. For LLM inference, you need longer timeouts (we set 600s).
- Concurrent requests: Consider the pod's ability to handle multiple simultaneous requests.
For production deployments, consider:
- Using streaming responses to provide feedback during generation.
- Implementing a queue system for request management.
- Setting appropriate resource limits based on your GPU capabilities.
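One lightweight way to address the concurrency point above is to serialize access to the single llama.cpp instance, since a shared context is not safe for parallel generation. The following is a minimal sketch assuming the llm object from model_server.py; generate_serialized() is a hypothetical helper, not an existing function:

# concurrency_guard.py (sketch) -- assumes the `llm` object from model_server.py
import threading

# Allow only one in-flight generation at a time on the shared llama.cpp context.
llm_lock = threading.Semaphore(1)

def generate_serialized(prompt, **kwargs):
    """Run a completion while holding the semaphore; return None if the server is busy."""
    acquired = llm_lock.acquire(timeout=1)  # fail fast instead of queueing indefinitely
    if not acquired:
        return None
    try:
        return llm(prompt, **kwargs)
    finally:
        llm_lock.release()

The completions() handler could call generate_serialized() and return HTTP 503 when it gets None so clients can retry; a real deployment would more likely rely on a proper request queue or on additional replicas behind the HPA.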
Important considerations for GGUF models
There are a few key things to consider when deploying GGUF models:
- GGUF models require special handling: Standard model serving frameworks expect PyTorch, TensorFlow, or ONNX formats. For GGUF, you need a custom container built around llama.cpp.
- Performance: GGUF models are optimized for inference and use less memory, making them ideal for edge deployments and resource-constrained environments.
- GPU support: Ensure llama.cpp is compiled with GPU support in your container for optimal performance.
- Model format trade-offs:
  - GGUF format is excellent for inference-only deployments, edge computing, and scenarios where resource efficiency is critical. However, it requires custom containers and doesn't integrate with standard Kubernetes model serving APIs like KServe InferenceService.
  - HuggingFace format offers better operational integration with enterprise MLOps platforms, native support for KServe InferenceService APIs (enabling features like automatic scaling, canary deployments, and A/B testing), and compatibility with a wider ecosystem of tools. Consider keeping your model in HuggingFace format if you need these enterprise features and have sufficient GPU resources.
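For comparison, if you kept the model in HuggingFace format, serving it through KServe could look roughly like the sketch below. Treat it as illustrative only: the huggingface model format requires a matching ServingRuntime in your OpenShift AI/KServe installation, and the resource name and storage path are placeholders:

cat << EOF | oc apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: telecom-assistant-hf          # placeholder name
  namespace: telecom-ai-prod
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface             # requires a matching ServingRuntime in the cluster
      # Placeholder path: a PVC (or S3 URI) holding the HuggingFace-format model
      storageUri: pvc://telecom-model-pvc/instructlab-granite-7b-lab-trained
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
EOF

The practical difference from the custom Deployment above is that KServe then manages scaling, revisions, and canary traffic splitting for you.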
This setup provides a foundational deployment of your custom telecom AI assistant on OpenShift AI with basic monitoring and scaling capabilities. For production use, you should additionally implement:
- Authentication and authorization for API endpoints.
- Network policies and security constraints.
- Comprehensive logging and distributed tracing.
- Model versioning and rollback strategies.
- Rate limiting and circuit breakers.
- Backup and disaster recovery procedures.
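As a starting point for the network policy item in that list, something like the following restricts pod-level ingress to the model server so that only the OpenShift router (and, optionally, the monitoring stack) can reach it. This is a sketch: the namespace-selector labels vary by cluster and OpenShift version, so verify them against your environment before applying:

cat << EOF | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: telecom-model-allow-ingress
  namespace: telecom-ai-prod
spec:
  podSelector:
    matchLabels:
      app: telecom-model-server
  policyTypes:
    - Ingress
  ingress:
    # Allow traffic from the OpenShift router (label may differ on your cluster)
    - from:
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress
      ports:
        - protocol: TCP
          port: 8080
    # Allow the monitoring stack to scrape /metrics (adjust to your monitoring setup)
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: openshift-user-workload-monitoring
      ports:
        - protocol: TCP
          port: 9090
EOF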
Red Hat's InstructLab and RHEL AI provide a powerful foundation for developing and deploying custom AI applications. Through this guide, we've demonstrated how to:
- Customize foundation models with domain-specific knowledge using InstructLab's intuitive taxonomy system.
- Deploy and test models in a production-ready RHEL AI environment on AWS.
- Serve models efficiently using the appropriate backend configuration for GGUF files.
The telecommunications customer support example shows how you can use these tools to create practical AI solutions that incorporate proprietary knowledge. By leveraging Red Hat's open source approach, organizations can:
- Reduce dependency on generic models by adding their own domain expertise.
- Accelerate AI development with user-friendly tools that don't require deep ML knowledge.
- Deploy models in a secure, enterprise-ready environment.
- Iterate quickly based on real-world testing and feedback.
Important considerations for QLoRA training using InstructLab
QLoRA training works well when:
- You're teaching the model domain-specific knowledge and tone, not fundamentally changing its capabilities.
- The base model already has strong conversational abilities.
- You have limited compute resources (single GPU setup).
- The adaptation is relatively narrow in scope.
- You can achieve good results by updating just 0.1-1% of the model's parameters.
For production-grade model quality beyond these cases, consider full fine-tuning: it can deliver higher quality than QLoRA but requires more sophisticated hardware.
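To make the 0.1-1% figure concrete, here is a rough back-of-the-envelope calculation. The dimensions below (hidden size, layer count, LoRA rank, number of adapted projections) are illustrative assumptions for a generic 7B-parameter model, not the exact Granite architecture:

# lora_fraction.py (sketch) -- rough estimate of the trainable-parameter fraction
# All dimensions are illustrative assumptions, not the exact Granite-7B architecture.
hidden = 4096          # assumed hidden size
layers = 32            # assumed number of transformer layers
rank = 8               # assumed LoRA rank
adapted_per_layer = 4  # e.g., the q/k/v/o attention projections

# Each adapted hidden x hidden weight gains two low-rank factors:
# one hidden x rank matrix and one rank x hidden matrix.
lora_params = layers * adapted_per_layer * 2 * hidden * rank
base_params = 7e9

print(f"Trainable LoRA parameters: {lora_params / 1e6:.1f}M "
      f"(~{100 * lora_params / base_params:.2f}% of a 7B model)")
# -> roughly 8.4M parameters, about 0.12% of the base model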
Next steps
To continue your AI journey with Red Hat:
- Expand your taxonomy: Add more domain-specific knowledge to further enhance your model's capabilities.
- Experiment with different base models: Try different foundation models to see which works best for your use case.
- Scale with OpenShift AI: Once you've validated your approach, consider deploying at scale with OpenShift AI for production workloads.
- Implement RAG: Add retrieval-augmented generation to keep your model's responses current with the latest information.
- Join the community: Contribute to the InstructLab taxonomy repository and share your experiences with others.
By mastering InstructLab and RHEL AI, you've taken the crucial first steps in building production-ready AI applications customized for your specific needs while maintaining enterprise-grade reliability and security.
Check out these offerings to learn more about Red Hat AI:
- Learning path: Demystify RAG with OpenShift AI and Elasticsearch
- Learning path: Download, serve, and interact with LLMs on RHEL AI
- E-book: Open source AI for developers