Scale with OpenShift AI and important considerations

We've validated our model's performance on RHEL AI, so now we are ready to leverage OpenShift AI to productionize and scale our solution.
Prerequisites:
- Red Hat OpenShift Container Platform (RHOCP) 4.12 or later with:
  - Minimum 3 control plane nodes
  - Minimum 3 worker nodes (preferably with GPU support for AI workloads)
  - GPU nodes with the NVIDIA GPU Operator installed (for optimal performance)
- Red Hat subscriptions:
  - Valid OpenShift subscription
  - OpenShift AI subscription
- Required CLI tools:
  - oc (OpenShift CLI) or kubectl
  - podman or docker for container building
- Storage requirements:
  - Persistent storage provisioner (e.g., OpenShift Data Foundation, AWS EBS)
  - Minimum 100GB of available storage for models
In this lesson, you will:
- Utilize OpenShift AI to scale our solution.
Phase 3: Scaling with OpenShift AI
OpenShift AI provides comprehensive tools for the entire ML lifecycle, from experimentation to production deployment at scale.
Note: Before starting with OpenShift AI, ensure you have satisfied the prerequisites.
Create a new data science project for our AI workloads:
# Create the project
oc new-project telecom-ai-prod
# Label it for OpenShift AI
oc label namespace telecom-ai-prod opendatahub.io/dashboard=true modelmesh-enabled=true
Create a PersistentVolumeClaim (PVC) to store your model:
cat << EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: telecom-model-pvc
  namespace: telecom-ai-prod
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
EOF
Verify the PVC is bound:
oc get pvc -n telecom-ai-prod
Transfer model files to PVC
Follow these steps to transfer your trained model to the PVC using a temporary pod:
Create a temporary pod with the PVC mounted:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: model-upload-pod
  namespace: telecom-ai-prod
spec:
  containers:
    - name: upload-container
      image: registry.access.redhat.com/ubi9/ubi:latest
      command: ["/bin/bash", "-c", "sleep 3600"]
      volumeMounts:
        - name: model-storage
          mountPath: /models
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: telecom-model-pvc
EOF
Wait for the pod to be ready:
oc wait --for=condition=Ready pod/model-upload-pod -n telecom-ai-prod --timeout=60s
Copy your model files to the PVC:
# Copy to the OpenShift pod from your local machine
oc cp telecom-model.tar.gz telecom-ai-prod/model-upload-pod:/models/

# Extract in the pod
oc exec -n telecom-ai-prod model-upload-pod -- tar -xzf /models/telecom-model.tar.gz -C /models/
Clean up the temporary pod:
oc delete pod model-upload-pod -n telecom-ai-prod
Create a custom model server container
Since the GGUF format requires special handling, let's create a custom container that can serve GGUF models:
Create a Dockerfile following Red Hat standards:
# Use Red Hat Universal Base Image
FROM registry.access.redhat.com/ubi9/python-311:latest

# Switch to root for installation
USER 0

# Install system dependencies
RUN dnf install -y \
    gcc \
    gcc-c++ \
    make \
    git \
    && dnf clean all

# Install Python dependencies
RUN pip install --no-cache-dir \
    llama-cpp-python==0.2.57 \
    flask==3.0.2 \
    gunicorn==21.2.0 \
    prometheus-client==0.19.0

# Create app directory
WORKDIR /app

# Copy the server script
COPY model_server.py /app/

# Set permissions (user 1001 already exists in UBI images)
RUN chown -R 1001:0 /app && \
    chmod -R g=u /app

# Switch to non-root user
USER 1001

# Expose ports
EXPOSE 8080 9090

# Set environment variables
ENV MODEL_PATH="/models/instructlab-granite-7b-lab-Q4_K_M.gguf"
ENV HOST="0.0.0.0"
ENV PORT="8080"

# Run the server
CMD ["python", "model_server.py"]
Create the model server Python script:
# model_server.py
import os
import json
import logging
import time

from flask import Flask, request, jsonify
from llama_cpp import Llama
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Prometheus metrics
REQUEST_COUNT = Counter('model_requests_total', 'Total number of requests')
REQUEST_LATENCY = Histogram('model_request_duration_seconds', 'Request latency')
ERROR_COUNT = Counter('model_errors_total', 'Total number of errors')

app = Flask(__name__)

# Load model
MODEL_PATH = os.environ.get('MODEL_PATH', '/models/model.gguf')
logger.info(f"Loading model from {MODEL_PATH}")

try:
    llm = Llama(
        model_path=MODEL_PATH,
        n_ctx=4096,
        n_threads=4,
        n_gpu_layers=-1  # Use all available GPU layers
    )
    logger.info("Model loaded successfully")
except Exception as e:
    logger.error(f"Failed to load model: {e}")
    raise


@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint"""
    return jsonify({"status": "healthy", "model": MODEL_PATH})


@app.route('/ready', methods=['GET'])
def ready():
    """Readiness check endpoint"""
    return jsonify({"status": "ready"})


@app.route('/v1/completions', methods=['POST'])
def completions():
    """OpenAI-compatible completions endpoint"""
    REQUEST_COUNT.inc()
    start_time = time.time()

    try:
        data = request.json
        prompt = data.get('prompt', '')
        max_tokens = data.get('max_tokens', 500)
        temperature = data.get('temperature', 0.7)

        # Generate completion
        response = llm(
            prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            echo=False
        )

        # Format response
        result = {
            "id": f"cmpl-{int(time.time())}",
            "object": "text_completion",
            "created": int(time.time()),
            "model": "telecom-assistant",
            "choices": [{
                "text": response['choices'][0]['text'],
                "index": 0,
                "finish_reason": "stop"
            }]
        }

        REQUEST_LATENCY.observe(time.time() - start_time)
        return jsonify(result)

    except Exception as e:
        ERROR_COUNT.inc()
        logger.error(f"Error generating completion: {e}")
        return jsonify({"error": str(e)}), 500


@app.route('/metrics', methods=['GET'])
def metrics():
    """Prometheus metrics endpoint"""
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}


if __name__ == '__main__':
    host = os.environ.get('HOST', '0.0.0.0')
    port = int(os.environ.get('PORT', 8080))
    app.run(host=host, port=port)
Build the container image:
# Create a build directory
mkdir telecom-model-server
cd telecom-model-server

# Copy the Dockerfile and model_server.py to this directory,
# then build using podman for the linux/amd64 architecture
podman build --platform linux/amd64 -t telecom-model-server-amd64:latest .

# Tag for your registry
podman tag telecom-model-server-amd64:latest quay.io/<your-org>/telecom-model-server-amd64:latest

# Push to the registry
podman push quay.io/<your-org>/telecom-model-server-amd64:latest
Deploy with a custom Deployment
Since GGUF models require special handling with llama.cpp, let's create a custom Deployment.
Create a deployment for the model server:
cat << EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: telecom-model-server
  namespace: telecom-ai-prod
  labels:
    app: telecom-model-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: telecom-model-server
  template:
    metadata:
      labels:
        app: telecom-model-server
    spec:
      # Add tolerations to allow scheduling on GPU nodes
      tolerations:
        - key: "p4-gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: model-server
          image: quay.io/<your-org>/telecom-model-server-amd64:latest
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: MODEL_PATH
              value: "/models/instructlab-granite-7b-lab-trained/instructlab-granite-7b-lab-Q4_K_M.gguf"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
            limits:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 300
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 10
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: telecom-model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: telecom-model-service
  namespace: telecom-ai-prod
  labels:
    app: telecom-model-server
spec:
  selector:
    app: telecom-model-server
  ports:
    - name: http
      port: 8080
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: telecom-model-route
  namespace: telecom-ai-prod
  annotations:
    haproxy.router.openshift.io/timeout: "600s"        # 10-minute timeout for model inference
    haproxy.router.openshift.io/timeout-server: "600s" # Server-side timeout
spec:
  to:
    kind: Service
    name: telecom-model-service
  port:
    targetPort: http
  tls:
    termination: edge
EOF
Monitoring and scaling
Set up monitoring for your deployed model:
cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: telecom-model-metrics
  namespace: telecom-ai-prod
spec:
  selector:
    matchLabels:
      app: telecom-model-server
  endpoints:
    - port: metrics
      interval: 30s
EOF
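The model server already exposes Prometheus counters (model_requests_total, model_errors_total), so you can pair the ServiceMonitor with an alerting rule. The following is a minimal sketch; it assumes user workload monitoring is enabled on your cluster, and the alert name and 5% threshold are illustrative choices, not recommendations from this guide:

cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: telecom-model-alerts
  namespace: telecom-ai-prod
spec:
  groups:
    - name: telecom-model.rules
      rules:
        # Fire when more than 5% of completion requests fail over 5 minutes.
        # Metric names come from model_server.py; the threshold is an example.
        - alert: TelecomModelHighErrorRate
          expr: rate(model_errors_total[5m]) / rate(model_requests_total[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Telecom model server error rate is above 5%"
EOF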
Configure horizontal pod autoscaling:
cat << EOF | oc apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: telecom-model-hpa
  namespace: telecom-ai-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: telecom-model-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
EOF
Test your deployed model
Once deployed, test your model:
# Get the route URL
MODEL_URL=$(oc get route telecom-model-route -n telecom-ai-prod -o jsonpath='{.spec.host}')
# Test the health endpoint first
curl -k https://${MODEL_URL}/health
# Start with a minimal request to verify connectivity
curl -k -X POST https://${MODEL_URL}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hi",
"max_tokens": 10,
"temperature": 0.7
}'
# Once minimal requests work, test with larger requests
# Note: Larger token counts will take longer to process
curl -k -X POST https://${MODEL_URL}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "What is fiber optic internet?",
"max_tokens": 100,
"temperature": 0.7
}'
# For production use, consider implementing:
# 1. Streaming responses for better user experience
# 2. Request queuing for handling multiple concurrent requests
# 3. Response caching for common queries
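As an illustration of the first of those suggestions, model_server.py could gain a streaming variant of the completions endpoint. This is a minimal sketch that assumes the app and llm objects already defined in model_server.py; the /v1/completions/stream path and the server-sent-events framing are choices made for this example, not part of any existing API:

# streaming_endpoint.py (sketch) -- assumes `app` and `llm` from model_server.py
import json
from flask import Response, request

@app.route('/v1/completions/stream', methods=['POST'])
def completions_stream():
    """Stream tokens back as they are generated instead of waiting for the full reply."""
    data = request.json
    prompt = data.get('prompt', '')
    max_tokens = data.get('max_tokens', 500)
    temperature = data.get('temperature', 0.7)

    def generate():
        # llama-cpp-python yields partial chunks when stream=True
        for chunk in llm(prompt, max_tokens=max_tokens,
                         temperature=temperature, echo=False, stream=True):
            text = chunk['choices'][0]['text']
            # Server-sent-events style framing; adapt to whatever your client expects
            yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return Response(generate(), mimetype='text/event-stream')

Streaming keeps the connection active, which also makes long generations far less likely to hit the route timeout discussed below.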
Performance considerations
When serving GGUF models through OpenShift routes, consider the following:
- Initial model loading: The first request after pod startup will be slower as the model loads into memory.
- Token generation time: Each token takes time to generate; 200 tokens can take 30-60 seconds, depending on model size and GPU.
- Route timeouts: The default OpenShift route timeout is 30 seconds. For LLM inference, you need longer timeouts (we set 600s).
- Concurrent requests: Consider the pod's ability to handle multiple simultaneous requests.
For production deployments, consider:
- Using streaming responses to provide feedback during generation.
- Implementing a queue system for request management.
- Setting appropriate resource limits based on your GPU capabilities.
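One lightweight way to address the concurrency point above is to serialize access to the single llama.cpp instance, since a shared context is not safe for parallel generation. The following is a minimal sketch assuming the llm object from model_server.py; generate_serialized() is a hypothetical helper, not an existing function:

# concurrency_guard.py (sketch) -- assumes the `llm` object from model_server.py
import threading

# Allow only one in-flight generation at a time on the shared llama.cpp context.
llm_lock = threading.Semaphore(1)

def generate_serialized(prompt, **kwargs):
    """Run a completion while holding the semaphore; return None if the server is busy."""
    acquired = llm_lock.acquire(timeout=1)  # fail fast instead of queueing indefinitely
    if not acquired:
        return None
    try:
        return llm(prompt, **kwargs)
    finally:
        llm_lock.release()

The completions() handler could call generate_serialized() and return HTTP 503 when it gets None so clients can retry; a real deployment would more likely rely on a proper request queue or on additional replicas behind the HPA.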
Important considerations for GGUF models
There are a few key things to consider when deploying GGUF models:
- GGUF models require special handling: Standard model serving frameworks expect PyTorch, TensorFlow, or ONNX formats. For GGUF, you need a custom container built around llama.cpp.
- Performance: GGUF models are optimized for inference and use less memory, making them ideal for edge deployments and resource-constrained environments.
- GPU support: Ensure llama.cpp is compiled with GPU support in your container for optimal performance.
- Model format trade-offs:
  - GGUF format is excellent for inference-only deployments, edge computing, and scenarios where resource efficiency is critical. However, it requires custom containers and doesn't integrate with standard Kubernetes model serving APIs like KServe InferenceService.
  - HuggingFace format offers better operational integration with enterprise MLOps platforms, native support for KServe InferenceService APIs (enabling features like automatic scaling, canary deployments, and A/B testing), and compatibility with a wider ecosystem of tools. Consider keeping your model in HuggingFace format if you need these enterprise features and have sufficient GPU resources.
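For comparison, if you kept the model in HuggingFace format, serving it through KServe could look roughly like the sketch below. Treat it as illustrative only: the huggingface model format requires a matching ServingRuntime in your OpenShift AI/KServe installation, and the resource name and storage path are placeholders:

cat << EOF | oc apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: telecom-assistant-hf          # placeholder name
  namespace: telecom-ai-prod
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface             # requires a matching ServingRuntime in the cluster
      # Placeholder path: a PVC (or S3 URI) holding the HuggingFace-format model
      storageUri: pvc://telecom-model-pvc/instructlab-granite-7b-lab-trained
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
EOF

The practical difference from the custom Deployment above is that KServe then manages scaling, revisions, and canary traffic splitting for you.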
This setup provides a foundational deployment of your custom telecom AI assistant on OpenShift AI with basic monitoring and scaling capabilities. For production use, you should additionally implement:
- Authentication and authorization for API endpoints.
- Network policies and security constraints.
- Comprehensive logging and distributed tracing.
- Model versioning and rollback strategies.
- Rate limiting and circuit breakers.
- Backup and disaster recovery procedures.
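As a starting point for the network policy item in that list, something like the following restricts pod-level ingress to the model server so that only the OpenShift router (and, optionally, the monitoring stack) can reach it. This is a sketch: the namespace-selector labels vary by cluster and OpenShift version, so verify them against your environment before applying:

cat << EOF | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: telecom-model-allow-ingress
  namespace: telecom-ai-prod
spec:
  podSelector:
    matchLabels:
      app: telecom-model-server
  policyTypes:
    - Ingress
  ingress:
    # Allow traffic from the OpenShift router (label may differ on your cluster)
    - from:
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress
      ports:
        - protocol: TCP
          port: 8080
    # Allow the monitoring stack to scrape /metrics (adjust to your monitoring setup)
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: openshift-user-workload-monitoring
      ports:
        - protocol: TCP
          port: 9090
EOF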
Red Hat's InstructLab and RHEL AI provide a powerful foundation for developing and deploying custom AI applications. Through this guide, we've demonstrated how to:
- Customize foundation models with domain-specific knowledge using InstructLab's intuitive taxonomy system.
- Deploy and test models in a production-ready RHEL AI environment on AWS.
- Serve models efficiently using the appropriate backend configuration for GGUF files.
The telecommunications customer support example shows how you can use these tools to create practical AI solutions that incorporate proprietary knowledge. By leveraging Red Hat's open source approach, organizations can:
- Reduce dependency on generic models by adding their own domain expertise.
- Accelerate AI development with user-friendly tools that don't require deep ML knowledge.
- Deploy models in a secure, enterprise-ready environment.
- Iterate quickly based on real-world testing and feedback.
Important considerations for QLoRA training using InstructLab
QLoRA training works well when:
- You're teaching the model domain-specific knowledge and tone, not fundamentally changing its capabilities.
- The base model already has strong conversational abilities.
- You have limited compute resources (single GPU setup).
- The adaptation is relatively narrow in scope.
- You can achieve good results by updating just 0.1-1% of the model's parameters.
For production-grade model quality beyond these cases, consider full fine-tuning: it can deliver higher quality than QLoRA but requires more sophisticated hardware.
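To make the 0.1-1% figure concrete, here is a rough back-of-the-envelope calculation. The dimensions below (hidden size, layer count, LoRA rank, number of adapted projections) are illustrative assumptions for a generic 7B-parameter model, not the exact Granite architecture:

# lora_fraction.py (sketch) -- rough estimate of the trainable-parameter fraction
# All dimensions are illustrative assumptions, not the exact Granite-7B architecture.
hidden = 4096          # assumed hidden size
layers = 32            # assumed number of transformer layers
rank = 8               # assumed LoRA rank
adapted_per_layer = 4  # e.g., the q/k/v/o attention projections

# Each adapted hidden x hidden weight gains two low-rank factors:
# one hidden x rank matrix and one rank x hidden matrix.
lora_params = layers * adapted_per_layer * 2 * hidden * rank
base_params = 7e9

print(f"Trainable LoRA parameters: {lora_params / 1e6:.1f}M "
      f"(~{100 * lora_params / base_params:.2f}% of a 7B model)")
# -> roughly 8.4M parameters, about 0.12% of the base model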
Next steps
To continue your AI journey with Red Hat:
- Expand your taxonomy: Add more domain-specific knowledge to further enhance your model's capabilities.
- Experiment with different base models: Try different foundation models to see which works best for your use case.
- Scale with OpenShift AI: Once you've validated your approach, consider deploying at scale with OpenShift AI for production workloads.
- Implement RAG: Add retrieval-augmented generation to keep your model's responses current with the latest information.
- Join the community: Contribute to the InstructLab taxonomy repository and share your experiences with others.
By mastering InstructLab and RHEL AI, you've taken the crucial first steps in building production-ready AI applications customized for your specific needs while maintaining enterprise-grade reliability and security.
Check out these offerings to learn more about Red Hat AI:
- Learning path: Demystify RAG with OpenShift AI and Elasticsearch
- Learning path: Download, serve, and interact with LLMs on RHEL AI
- E-book: Open source AI for developers