Enterprises generate an overwhelming amount of unstructured information: documents, policies, PDFs, wikis, knowledge bases, HR guidelines, legal documents, system manuals, architecture diagrams, and more. When employees struggle to find accurate answers quickly, productivity suffers and hard-to-find knowledge becomes a bottleneck.
Retrieval-augmented generation (RAG) solves this problem by grounding LLM responses in your company’s knowledge. Instead of relying on a model’s memory, which can produce hallucinations, RAG retrieves relevant document chunks from a vector database and supplies them to the model at inference time.
This blog explores the RAG quickstart, a comprehensive blueprint for deploying an enterprise RAG application on Red Hat OpenShift AI. It features high-performance inference, safety guardrails, and automated data ingestion pipelines. AI quickstarts are deployable examples that connect Red Hat technology to business value. You can try them today in the AI quickstart catalog.
What is RAG?
RAG is an architectural pattern that improves the output of an LLM by referencing an authoritative knowledge base outside of its training data.
- Retrieval: When a user asks a question, the system searches a vector database for relevant snippets from your documents.
- Augmentation: The system adds those snippets to the user's original prompt as context.
- Generation: The LLM uses this enriched context to generate an accurate, grounded response that is far less prone to hallucination.
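Put together, the three steps form a simple loop. The following is a minimal, illustrative sketch in Python; the vector store and model client objects, and their method names, are hypothetical stand-ins rather than the quickstart's actual code.

```python
# Minimal, illustrative RAG loop (hypothetical helpers, not the quickstart's code).

def answer(question: str, vector_store, llm_client, top_k: int = 4) -> str:
    # Retrieval: find the document chunks most similar to the question.
    chunks = vector_store.similarity_search(question, k=top_k)

    # Augmentation: prepend the retrieved chunks to the user's prompt as context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Generation: the LLM answers from the enriched, grounded prompt.
    return llm_client.generate(prompt)
```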
The enterprise difference
While a standard RAG setup serves as a functional proof of concept, deploying an enterprise RAG system requires an architecture designed for resilience and compliance, including:
- Security and safety: Guardrails to prevent toxic outputs and protect sensitive data.
- Scalability: The ability to handle thousands of documents and concurrent users.
- Governance: Clear data lineage from Amazon S3 or Git to the vector database.
- Multi-tenancy: Segmented knowledge bases for different departments like HR, Legal, and Sales.
Architecture and features
To transition from a demo to a deployed system, this AI quickstart automates the provisioning of critical components like model serving, vector databases, and ingestion pipelines. Figure 1 shows how the infrastructure is built on Red Hat OpenShift AI.

Model serving and safety guardrails
To serve the LLM, the AI quickstart uses the ServingRuntime and InferenceService custom resource definitions (CRDs) in Red Hat OpenShift AI. This allows for production-grade serving with auto-scaling and monitoring.
This setup provides standardized, GPU-aware, and scalable Kubernetes-native endpoints. This removes the complexity of manual model deployment and lifecycle management.
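As a quick sanity check once a model is deployed, you can query the OpenAI-compatible API that a vLLM-based ServingRuntime exposes. The sketch below is illustrative: the route URL is a placeholder, and your cluster may additionally require a bearer token or a custom CA bundle.

```python
import requests

# Hypothetical route URL; replace with the InferenceService route from your cluster.
BASE_URL = "https://llama-3-1-8b-instruct-rag.apps.example.com"

# vLLM-backed runtimes expose an OpenAI-compatible API; /v1/models lists served models.
resp = requests.get(f"{BASE_URL}/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```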
The AI quickstart deploys two models: the primary LLM and a safety guardrail model, such as Llama Guard.
When you view the OpenShift console under your namespace, you will see two running pods: the main model and the safety shield. All requests are routed through the shield to ensure enterprise compliance.
The Llama Stack server
This AI quickstart deploys a Llama Stack server that acts as a unified, flexible, and open source platform for building AI applications. It ensures portability across different environments and prevents vendor lock-in.
Llama Stack provides integrated, enterprise-focused features like safety guardrails, telemetry, evaluation tools, and complex agentic orchestration, which can be challenging to build from scratch. This simplifies the creation of production-ready AI.
You can find the configuration in rag-values.yaml, which connects vLLM, the PGVector vector store, and tools.
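Once the Llama Stack route is exposed, you can talk to the server with the llama-stack-client Python package. This is a minimal sketch under stated assumptions: the route URL and model ID are placeholders, and exact client method and field names can vary between Llama Stack releases.

```python
from llama_stack_client import LlamaStackClient

# Hypothetical route URL; use the llama-stack route from your cluster.
client = LlamaStackClient(base_url="https://llama-stack-rag.apps.example.com")

# List the models registered with the stack (LLM, guard model, embedding model).
for model in client.models.list():
    print(model.identifier)

# Send a chat request through the unified inference API.
response = client.inference.chat_completion(
    model_id="llama-3-1-8b-instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize FantaCo's PTO policy."}],
)
print(response.completion_message.content)
```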
Enterprise vector storage (PGVector and MinIO)
To store document embeddings for a fictitious company, FantaCo, the AI quickstart creates specific vector databases for departments: HR, Legal, Procurement, Sales, and Tech Support.
The architecture uses MinIO as a local S3-compatible store for raw document staging. It then uses PGVector as a high-performance enterprise vector database. This combination ensures data isolation. For example, a query from the HR department only retrieves HR documents, which prevents unauthorized data leakage.
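The isolation is easy to picture at the SQL level: each department's embeddings live in their own table (or database), so a query only ever searches that department's data. The sketch below uses psycopg and pgvector's distance operator; the connection string, table, and column names are assumptions for illustration, not the quickstart's actual schema.

```python
import psycopg  # requires the pgvector extension in the target PostgreSQL database

# Hypothetical connection string; the quickstart provisions its own PGVector instance.
conn = psycopg.connect("postgresql://rag:rag@pgvector.rag.svc:5432/rag")

def search_hr_documents(query_embedding: list[float], top_k: int = 5):
    # Only the HR table is queried, so Legal or Sales chunks can never leak in.
    embedding_literal = "[" + ",".join(map(str, query_embedding)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT chunk_text, embedding <=> %s::vector AS distance
            FROM hr_document_chunks
            ORDER BY distance
            LIMIT %s
            """,
            (embedding_literal, top_k),
        )
        return cur.fetchall()
```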
Automated ingestion pipelines
Data can be ingested from GitHub, S3, or direct URLs. The AI quickstart uses Kubeflow Pipelines to automate the ingestion workflow. This process transforms documents from various sources into vector embeddings and stores them in a PGVector database for efficient retrieval. You can also use the bring your own document (BYOD) feature in the chatbot interface to upload files for immediate testing.
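The real pipeline definitions ship with the quickstart, but the overall shape of such an ingestion pipeline is easy to sketch with the KFP v2 SDK. The component names, base images, and parameters below are placeholders rather than the quickstart's actual code.

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def fetch_documents(source_uri: str) -> str:
    # Placeholder: download raw documents from S3, Git, or a URL to a staging path.
    return f"/tmp/staged-from-{source_uri}"

@dsl.component(base_image="python:3.11")
def embed_and_ingest(staging_path: str, vector_db: str):
    # Placeholder: chunk documents, generate embeddings, and insert them into PGVector.
    print(f"Ingesting {staging_path} into {vector_db}")

@dsl.pipeline(name="rag-document-ingestion")
def ingestion_pipeline(source_uri: str, vector_db: str = "hr-vector-db-v1-0"):
    staged = fetch_documents(source_uri=source_uri)
    embed_and_ingest(staging_path=staged.output, vector_db=vector_db)

if __name__ == "__main__":
    compiler.Compiler().compile(ingestion_pipeline, "ingestion_pipeline.yaml")
```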
Data science workbenches (notebooks)
A pre-configured Jupyter Notebook is provided to allow data scientists to experiment with the ingestion logic.
In the OpenShift AI dashboard, select Data Science Projects, and launch the Workbench. This pre-configured notebook contains the logic to orchestrate a Kubeflow pipeline that fetches documents from an S3 bucket, generates embeddings, and ingests them into PGVector.
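Conceptually, the notebook's orchestration step boils down to submitting a run to the Data Science Pipelines endpoint with the KFP client. The following sketch reuses the compiled package from the previous example; the host URL, token, bucket, and argument names are placeholders you would replace with values from your own project.

```python
import kfp

# Placeholder route and token; in OpenShift AI, use the Data Science Pipelines
# route for the rag project and an OpenShift bearer token.
client = kfp.Client(
    host="https://ds-pipeline-dspa-rag.apps.example.com",
    existing_token="sha256~replace-me",
)

run = client.create_run_from_pipeline_package(
    "ingestion_pipeline.yaml",
    arguments={"source_uri": "s3://fantaco-docs/hr/", "vector_db": "hr-vector-db-v1-0"},
    run_name="hr-docs-ingestion",
)
print(f"Started run: {run.run_id}")
```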
Deployment modes: OpenShift vs. Local
The AI quickstart supports two deployment strategies depending on the hardware available to you:
- OpenShift (recommended): Uses full GPU acceleration, enterprise security, and automated pipelines.
- Local (development): Allows running the stack via Docker or Podman. However, local performance is often slower because of limited VRAM and reliance on CPUs or Apple silicon (M1/M2 chips) rather than enterprise NVIDIA GPUs.
Getting started
Follow these requirements to prepare your environment and gather the necessary credentials before you begin the deployment.
Prerequisites
Here is what you need to get started.
Business requirements:
- A need to centralize fragmented knowledge across teams
- Interest in improving productivity, onboarding, policy accuracy, and customer-facing knowledge
- Stakeholder support for AI-powered internal search
- Understanding of governance and data privacy requirements
Technical requirements:
Before deploying the RAG AI quickstart, ensure:
- Red Hat OpenShift cluster is configured (4.19+ recommended)
- Red Hat OpenShift AI is installed (2.22+ recommended)
- GPU worker nodes are enabled
- Sufficient cluster resources (GPU, memory, storage)
- oc CLI access
- Helm installed
Accounts and keys:
- Tavily Websearch API key
- Hugging Face token
- Access to Meta Llama model
- Access to Meta Llama Guard model
Source code
Clone the RAG AI quickstart repository:
```bash
git clone https://github.com/rh-ai-quickstart/RAG
cd RAG
```

Sample data
FantaCo sample company documents (HR, benefits, onboarding) are included in the GitHub repository.
Configuration
Check GPU node taints. GPU nodes are high-cost, shared resources in an OpenShift cluster, so taints are used to restrict scheduling to workloads that explicitly require GPU acceleration. Corresponding tolerations must be configured to allow model-serving and embedding workloads to run on these GPU nodes.
Ops teams can view taints in the OpenShift console by selecting Compute → Nodes → GPU Node → YAML.
```yaml
spec:
  taints:
    - key: nvidia.com/gpu
      effect: NoSchedule
```

This ensures model-serving pods are scheduled correctly.
Review Helm chart structure. The RAG Helm chart deploys the RAG UI along with the dependencies defined in deploy/helm/rag/Chart.yaml:

```yaml
dependencies:
  - name: pgvector
    version: 0.5.1
    repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
    condition: pgvector.enabled
  - name: llm-service
    version: 0.5.2
    repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
    condition: llm-service.enabled
  - name: configure-pipeline
    version: 0.5.4
    repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
    condition: configure-pipeline.enabled
  - name: ingestion-pipeline
    version: 0.5.1
    repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
    condition: ingestion-pipeline.enabled
  - name: llama-stack
    version: 0.5.2
    repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
    condition: llama-stack.enabled
  - name: mcp-servers
    version: 0.5.7
    repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
    condition: mcp-servers.enabled
```

This automatically installs the following:
- PGVector database
- Model servers on OpenShift AI
- Embedding pipelines
- Chatbot UI
- Workbench: Jupyter Notebook
- Llama Stack server
- MCP Servers (optional)
Review example values file:
```bash
vi helm/rag-values.example.yaml
```

This defines:
- GPU scheduling rules
- Model choices
- Pipeline behavior
- Vector DB settings
Deployment: Install the RAG AI quickstart
Log in to OpenShift:
```bash
oc login --token=<<token>> --server=<<api-server-url>>
```

View GPU taints (optional):
```bash
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.taints}{"\n"}{end}'
```

List available models:
```bash
make list-models
[INFO] Listing available models...
model: llama-3-1-8b-instruct (meta-llama/Llama-3.1-8B-Instruct)
model: llama-3-2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)
model: llama-3-2-1b-instruct-quantized (RedHatAI/Llama-3.2-1B-Instruct-quantized.w8a8)
model: llama-3-2-3b-instruct (meta-llama/Llama-3.2-3B-Instruct)
model: llama-3-3-70b-instruct (meta-llama/Llama-3.3-70B-Instruct)
model: llama-3-3-70b-instruct-quantization-fp8 (meta-llama/Llama-3.3-70B-Instruct)
model: llama-guard-3-1b (meta-llama/Llama-Guard-3-1B)
model: llama-guard-3-8b (meta-llama/Llama-Guard-3-8B)
model: qwen-2-5-vl-3b-instruct (Qwen/Qwen2.5-VL-3B-Instruct)
```

Install the RAG application:
```bash
# Application will be installed in rag namespace
# It will deploy llama-3-1-8b-instruct model
# If the gpu node has taint, specify the toleration.
make install NAMESPACE=rag LLM=llama-3-1-8b-instruct LLM_TOLERATION="nvidia.com/gpu"
```
The installer will:
- Prompt for Hugging Face token
- Prompt for Tavily Search API key
- Generate a local rag-values.yaml file (first run only)
After installation, OpenShift provisions:
- RAG UI
- Vector DB
- Llama Stack server
- Llama inference services
- Document ingestion pipelines
- A ready-to-run notebook
Execution: Run pipelines, models, and the chatbot interface
Verify running pods. After the installation finishes, navigate to Workloads → Pods in the rag namespace. Verify the pods are in Running or Completed status (Figure 2).

Launch Pipelines in OpenShift AI. Navigate to Red Hat OpenShift AI → Data Science Pipelines → Runs (project: rag). See Figure 3.

Figure 3: RAG AI quickstart ingestion pipelines populating the RAG vector database.
Launch the notebook. Navigate to Red Hat OpenShift AI → Data Science Projects (project: rag). Launch the rag-pipeline-notebook.

Figure 4: Workbenches and pipelines created during the RAG AI quickstart deployment in OpenShift AI.
Verify deployed models. Navigate to Models → Model deployments (project: rag).

Figure 5: Deployed LLM inference service in Red Hat OpenShift AI provisioned by the RAG AI quickstart.
Launch the chat application. Navigate to Networking → Routes → rag. Select the location URL to load the chatbot UI.

Figure 6: RAG AI quickstart–deployed chatbot interface.
Verification: Test direct and agentic RAG
After deploying the application, use the chatbot interface to verify that the system accurately retrieves information from both internal documents and external sources.
Test direct RAG (database agent)
You can test direct RAG by asking a question like, "What are my HR benefits at FantaCo?" To do this, select the Database Agent labeled hr-vector-db-v1-0. The system retrieves the answer from FantaCo HR documents to provide accurate information.
Test agentic RAG (websearch agent)
To test agentic RAG, query a real-time event, such as "Who won the Super Bowl of 2025?" and select the Websearch Agent. In this mode, the system uses the Tavily external search tool to find and return a correct, real-time answer. This process demonstrates how the chatbot identifies the right tool for the job when internal data is insufficient.
Wrap up
In this guide, you learned how to:
- Understand direct RAG and agentic RAG
- Deploy the RAG AI quickstart on OpenShift AI
- Ingest enterprise documents using pipelines
- Serve Llama models on GPU nodes
- Launch an enterprise-ready RAG chatbot
- Run queries using both internal and external retrieval agents
You now have a fully operational enterprise RAG assistant capable of centralizing your company’s knowledge and enhancing employee productivity.
Next steps
- Read Context as architecture: A practical look at retrieval-augmented generation.
- Start a trial to explore what you can do with Red Hat OpenShift AI.
- Browse the AI quickstart catalog for more example use cases.