Deploy an enterprise RAG chatbot with Red Hat OpenShift AI

January 29, 2026
Saurabh Agarwal
Related topics: Artificial intelligence, Data science
Related products: Red Hat AI, Red Hat OpenShift AI

    Enterprises generate an overwhelming amount of unstructured information: documents, policies, PDFs, wikis, knowledge bases, HR guidelines, legal documents, system manuals, architecture diagrams, and more. When employees struggle to find accurate answers quickly, productivity suffers and undocumented knowledge becomes a bottleneck.

    Retrieval-augmented generation (RAG) solves this problem by grounding LLM responses in your company’s knowledge. Instead of relying on a model’s memory or hallucinations, RAG retrieves relevant document chunks from a vector database and supplies them to the model at inference time.

    This blog explores the RAG quickstart, a comprehensive blueprint for deploying an enterprise RAG application on Red Hat OpenShift AI. It features high-performance inference, safety guardrails, and automated data ingestion pipelines. AI quickstarts are deployable examples that connect Red Hat technology to business value. You can try them today in the AI quickstart catalog.

    What is RAG?

    RAG is an architectural pattern that improves the output of an LLM by referencing an authoritative knowledge base outside of its training data.

    • Retrieval: When a user asks a question, the system searches a vector database for relevant snippets from your documents.
    • Augmentation: The system adds those snippets to the user's original prompt as context.
    • Generation: The LLM uses this enriched context to generate an accurate response grounded in your documents.
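
    Putting the three steps above into code, the core loop is small. The following is a minimal sketch only: embed_query, vector_store, and llm are hypothetical stand-ins for the pieces the quickstart actually wires together (an embedding model, PGVector, and a vLLM-served Llama model).

    def answer(question: str, vector_store, llm, embed_query, top_k: int = 5) -> str:
        # Retrieval: find the document chunks closest to the question in embedding space.
        chunks = vector_store.search(embed_query(question), limit=top_k)

        # Augmentation: prepend the retrieved chunks to the prompt as grounding context.
        context = "\n\n".join(chunk.text for chunk in chunks)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

        # Generation: the model answers from the supplied context rather than from memory alone.
        return llm.generate(prompt)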

    The enterprise difference

    While a standard RAG setup works as a functional proof of concept, an enterprise RAG system requires an architecture designed for resilience and compliance. It requires:

    • Security and safety: Guardrails to prevent toxic outputs and protect sensitive data.
    • Scalability: The ability to handle thousands of documents and concurrent users.
    • Governance: Clear data lineage from Amazon S3 or Git to the vector database.
    • Multi-tenancy: Segmented knowledge bases for different departments like HR, Legal, and Sales.

    Architecture and features

    To transition from a demo to a deployed system, this AI quickstart automates the provisioning of critical components like model serving, vector databases, and ingestion pipelines. Figure 1 shows how the infrastructure is built on Red Hat OpenShift AI.

    Architectural diagram of an AI infrastructure on Red Hat OpenShift AI showing two main sections: an ingestion pipeline that processes data from sources like S3 buckets and GitHub into a vector database, and a RAG pipeline where user queries are handled via Llama Stack APIs, an agent with guardrails, and model servers like vLLM.
    Figure 1: The ingestion pipeline for document processing and the RAG pipeline for query handling.

    Model serving and safety guardrails

    To serve the LLM, the AI quickstart uses the ServingRuntime and InferenceService custom resource definitions (CRDs) in Red Hat OpenShift AI. This allows for production-grade serving with auto-scaling and monitoring.

    This setup provides standardized, GPU-aware, and scalable Kubernetes-native endpoints. This removes the complexity of manual model deployment and lifecycle management.
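
    For illustration only, an InferenceService that serves a Llama model through a vLLM runtime looks roughly like the following. The resource name, model format, and runtime reference are placeholders; the quickstart's Helm chart generates the actual resources for you.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: llama-3-1-8b-instruct
    spec:
      predictor:
        model:
          modelFormat:
            name: vLLM                     # placeholder; must match the ServingRuntime's supported format
          runtime: vllm-serving-runtime    # placeholder ServingRuntime name in the same project
          # model storage location omitted for brevity
          resources:
            limits:
              nvidia.com/gpu: "1"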

    The AI quickstart deploys two models: the primary LLM and a safety guardrail model, such as Llama Guard.

    When you view your namespace in the OpenShift console, you will see two running pods: the main model and the safety shield. All requests are routed through the shield to ensure enterprise compliance.

    The Llama Stack server

    This AI quickstart deploys a Llama Stack server that acts as a unified, flexible, and open source platform for building AI applications. It ensures portability across different environments and prevents vendor lock-in.

    Llama Stack provides integrated, enterprise-focused features like safety guardrails, telemetry, evaluation tools, and complex agentic orchestration, which can be challenging to build from scratch. This simplifies the creation of production-ready AI.

    You can find the configuration in rag-values.yaml, which connects vLLM, the PGVector vector store, and tools.
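
    Once the server is up, you can exercise it from the workbench or any pod in the cluster. The sketch below assumes the llama-stack-client Python SDK and an in-cluster service name created by the Helm chart; the exact SDK surface, service hostname, and registered model ID vary by version, so treat every value here as an assumption to verify against your deployment.

    from llama_stack_client import LlamaStackClient

    # Assumed service URL; replace with the Llama Stack service or route in your namespace.
    client = LlamaStackClient(base_url="http://llamastack.rag.svc.cluster.local:8321")

    # List what the stack has registered (LLM, safety shield, embedding model).
    for model in client.models.list():
        print(model.identifier)

    # Send a chat request; the server routes it through vLLM and any configured shields.
    response = client.inference.chat_completion(
        model_id="llama-3-1-8b-instruct",
        messages=[{"role": "user", "content": "What are my HR benefits at FantaCo?"}],
    )
    print(response.completion_message.content)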

    Enterprise vector storage (PGVector and MinIO)

    To store document embeddings for a fictitious company, FantaCo, the AI quickstart creates specific vector databases for departments: HR, Legal, Procurement, Sales, and Tech Support.

    The architecture uses MinIO as a local S3-compatible store for raw document staging. It then uses PGVector as a high-performance enterprise vector database. This combination ensures data isolation. For example, a query from the HR department only retrieves HR documents, which prevents unauthorized data leakage.
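
    Under the hood, a department-scoped retrieval is just a nearest-neighbor query against that department's table or collection. The sketch below assumes psycopg2 and a hypothetical hr_documents table with a pgvector embedding column; the real table names and credentials come from the chart's PGVector deployment.

    import psycopg2

    # Hypothetical connection details; the Helm chart provisions its own service and secret.
    conn = psycopg2.connect(host="pgvector", dbname="rag", user="postgres", password="...")

    query_embedding = [0.01] * 768  # stand-in for the embedding of the user's question

    with conn, conn.cursor() as cur:
        # Search only the HR collection; Legal, Sales, and other tables are never touched.
        cur.execute(
            """
            SELECT document, embedding <=> %s::vector AS distance
            FROM hr_documents  -- hypothetical per-department table
            ORDER BY distance
            LIMIT 5
            """,
            (str(query_embedding),),
        )
        for document, distance in cur.fetchall():
            print(f"{distance:.3f}  {document[:80]}")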

    Automated ingestion pipelines

    Data can be ingested from GitHub, S3, or direct URLs. The AI quickstart uses Kubeflow Pipelines to automate the ingestion workflow. This process transforms documents from various sources into vector embeddings and stores them in a PGVector database for efficient retrieval. You can also use the bring your own document (BYOD) feature in the chatbot interface to upload files for immediate testing.
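
    Conceptually, the ingestion workflow is a two-stage Kubeflow pipeline: fetch the raw documents, then chunk, embed, and load them into PGVector. The sketch below uses the kfp v2 SDK with placeholder component bodies; it illustrates the pattern rather than the quickstart's actual pipeline code.

    from kfp import dsl

    @dsl.component(base_image="python:3.11")
    def fetch_documents(source_uri: str) -> str:
        # Placeholder: download documents from S3, Git, or a URL and return a staging path.
        return "/tmp/staged-docs"

    @dsl.component(base_image="python:3.11")
    def embed_and_store(staging_path: str, collection: str):
        # Placeholder: chunk documents, compute embeddings, and write them to PGVector.
        print(f"ingesting {staging_path} into {collection}")

    @dsl.pipeline(name="rag-ingestion")
    def ingestion_pipeline(source_uri: str, collection: str = "hr-vector-db-v1-0"):
        docs = fetch_documents(source_uri=source_uri)
        embed_and_store(staging_path=docs.output, collection=collection)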

    Data science workbenches (notebooks)

    A pre-configured Jupyter Notebook is provided to allow data scientists to experiment with the ingestion logic.

    In the OpenShift AI dashboard, select Data Science Projects, and launch the Workbench. This pre-configured notebook contains the logic to orchestrate a Kubeflow pipeline that fetches documents from an S3 bucket, generates embeddings, and ingests them into PGVector.

    Deployment modes: OpenShift vs. Local

    The AI quickstart supports two deployment strategies depending on the hardware available to you:

    • OpenShift (recommended): Uses full GPU acceleration, enterprise security, and automated pipelines.
    • Local (development): Allows running the stack via Docker or Podman. However, local performance is typically slower because it relies on CPUs or Apple M-series chips with limited VRAM rather than datacenter-class NVIDIA GPUs.

    Getting started

    Follow these requirements to prepare your environment and gather the necessary credentials before you begin the deployment.

    Prerequisites

    Here is what you need to get started.

    Business requirements:

    • A need to centralize fragmented knowledge across teams
    • Interest in improving productivity, onboarding, policy accuracy, and customer-facing knowledge
    • Stakeholder support for AI-powered internal search
    • Understanding of governance and data privacy requirements

    Technical requirements:

    Before deploying the RAG AI quickstart, ensure:

    • Red Hat OpenShift cluster is configured (4.19+ recommended)
    • Red Hat OpenShift AI is installed (2.22+ recommended)
    • GPU worker nodes are enabled
    • Sufficient cluster resources (GPU, memory, storage)
    • oc CLI access
    • Helm installed

    Accounts and keys:

    • Tavily Websearch API key
    • Hugging Face token
    • Access to Meta Llama model
    • Access to Meta Llama Guard model

    Source code

    Clone the RAG AI quickstart repository:

    git clone https://github.com/rh-ai-quickstart/RAG
    cd RAG

    Sample data

    FantaCo sample company documents (HR, benefits, onboarding) are included in the GitHub repository.

    Configuration

    1. Check GPU node taints. GPU nodes are high-cost, shared resources in an OpenShift cluster, so taints are used to restrict scheduling to workloads that explicitly require GPU acceleration. Corresponding tolerations must be configured so that model-serving and embedding workloads can run on these GPU nodes (a matching toleration is sketched after this list).

      Ops teams can view taints in the OpenShift console by selecting Compute → Nodes → GPU Node → YAML.

      spec:
        taints:
          - key: nvidia.com/gpu
            effect: NoSchedule

      With the matching toleration in place, model-serving pods are scheduled onto the GPU nodes correctly.

    2. Review Helm chart structure. The RAG Helm chart deploys the RAG UI along with the dependencies defined in deploy/helm/rag/Chart.yaml:

      dependencies:
        - name: pgvector
          version: 0.5.1
          repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
          condition: pgvector.enabled
        - name: llm-service
          version: 0.5.2
          repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
          condition: llm-service.enabled
        - name: configure-pipeline
          version: 0.5.4
          repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
          condition: configure-pipeline.enabled
        - name: ingestion-pipeline
          version: 0.5.1
          repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
          condition: ingestion-pipeline.enabled
        - name: llama-stack
          version: 0.5.2
          repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
          condition: llama-stack.enabled
        - name: mcp-servers
          version: 0.5.7
          repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
          condition: mcp-servers.enabled

      This automatically installs the following:

      • PGVector database
      • Model servers on OpenShift AI
      • Embedding pipelines
      • Chatbot UI
      • Workbench: Jupyter Notebook
      • Llama Stack server
      • MCP Servers (optional)
    3. Review example values file:

      vi helm/rag-values.example.yaml

      This defines:

      • GPU scheduling rules
      • Model choices
      • Pipeline behavior
      • Vector DB settings
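
    As referenced in step 1, pods that need GPU acceleration must carry a toleration matching the node taint. A minimal example of such a toleration follows; in practice the quickstart passes the taint key through the LLM_TOLERATION install option shown below, so you normally do not write this by hand.

    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule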

    Deployment: Install the RAG AI quickstart

    1. Log in to OpenShift:

      oc login --token=<token> --server=<api-server-url>
    2. View GPU taints (optional):

      oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"  "}{.spec.taints}{"\n"}{end}'
    3. List available models:

      make list-models
      [INFO] Listing available models...
      model: llama-3-1-8b-instruct (meta-llama/Llama-3.1-8B-Instruct)
      model: llama-3-2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)
      model: llama-3-2-1b-instruct-quantized (RedHatAI/Llama-3.2-1B-Instruct-quantized.w8a8)
      model: llama-3-2-3b-instruct (meta-llama/Llama-3.2-3B-Instruct)
      model: llama-3-3-70b-instruct (meta-llama/Llama-3.3-70B-Instruct)
      model: llama-3-3-70b-instruct-quantization-fp8 (meta-llama/Llama-3.3-70B-Instruct)
      model: llama-guard-3-1b (meta-llama/Llama-Guard-3-1B)
      model: llama-guard-3-8b (meta-llama/Llama-Guard-3-8B)
      model: qwen-2-5-vl-3b-instruct (Qwen/Qwen2.5-VL-3B-Instruct)
    4. Install the RAG application:

      # Application will be installed in rag namespace
      # It will deploy llama-3-1-8b-instruct model
      # If the GPU node has a taint, specify the toleration.
      make install NAMESPACE=rag LLM=llama-3-1-8b-instruct LLM_TOLERATION="nvidia.com/gpu"

    The installer will:

    • Prompt for Hugging Face token
    • Prompt for Tavily Search API key
    • Generate a local rag-values.yaml file (first run only)

    After installation, OpenShift provisions:

    • RAG UI
    • Vector DB
    • Llama Stack server
    • Llama inference services
    • Document ingestion pipelines
    • A ready-to-run notebook
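
    You can also confirm these resources from the CLI before moving to the console, assuming the rag namespace used in the install command:

    oc get pods -n rag
    oc get inferenceservice -n rag
    oc get routes -n rag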

    Execution: Run pipelines, models, and the chatbot interface

    1. Verify running pods. After the installation finishes, navigate to Workloads → Pods in the rag namespace. Verify the pods are in Running or Completed status (Figure 2).
    Screenshot of the Red Hat OpenShift console showing a list of pods in the rag namespace with statuses marked as either Completed or Running.
    Figure 2: List of deployed pods in the rag namespace.
    2. Launch Pipelines in OpenShift AI. Navigate to Red Hat OpenShift AI → Data Science Pipelines → Runs (project: rag). See Figure 3.

      Screenshot of the Runs page in Red Hat OpenShift AI showing five successfully completed pipeline runs in the rag project, each populating the RAG vector database.
      Figure 3: RAG AI quickstart ingestion pipelines populating the RAG vector database.
    3. Launch the Notebook. Navigate to Red Hat OpenShift AI → Data Science Projects (project: rag). Launch the rag-pipeline-notebook.

      Screenshot of the Red Hat OpenShift AI dashboard showing the Workbenches and Pipelines cards for the rag project, including one running workbench named rag-pipeline-notebook and five successfully created data science pipelines.
      Figure 4: Workbenches and pipelines created during the RAG AI quickstart deployment in OpenShift AI.
    4. Verify deployed models. Navigate to Models → Model deployments (project: rag).

      Screenshot of the Model deployments page in Red Hat OpenShift AI showing a successfully started model deployment for meta-llama/Llama-3.1-8B-Instruct using the vLLM serving runtime in the rag project.
      Figure 5: Deployed LLM inference service in Red Hat OpenShift AI provisioned by the RAG AI quickstart.
    5. Launch the chat application. Navigate to Networking → Routes → rag. Select the location URL to load the chatbot UI.

      Screenshot of the chatbot interface in Red Hat OpenShift AI showing the Chat playground with a configuration sidebar for selecting models, processing modes like Direct or Agent-based, and adjusting sampling parameters.
      Figure 6: RAG AI quickstart–deployed chatbot interface.

    Verification: Test direct and agentic RAG

    After deploying the application, use the chatbot interface to verify that the system accurately retrieves information from both internal documents and external sources.

    Test direct RAG (database agent)

    You can test direct RAG by asking a question like, "What are my HR benefits at FantaCo?" To do this, select the Database Agent labeled hr-vector-db-v1-0. The system retrieves the answer from FantaCo HR documents to provide accurate information.

    Test agentic RAG (websearch agent)

    To test agentic RAG, query a real-time event, such as "Who won the Super Bowl in 2025?" and select the Websearch Agent. In this mode, the system uses the Tavily external search tool to find and return a correct, up-to-date answer. This demonstrates how the chatbot identifies the right tool for the job when internal data is insufficient.

    Wrap up

    In this guide, you learned how to:

    • Distinguish direct RAG from agentic RAG
    • Deploy the RAG AI quickstart on OpenShift AI
    • Ingest enterprise documents using pipelines
    • Serve Llama models on GPU nodes
    • Launch an enterprise-ready RAG chatbot
    • Run queries using both internal and external retrieval agents

    You now have a fully operational enterprise RAG assistant capable of centralizing your company’s knowledge and enhancing employee productivity.

    Next steps

    • Read Context as architecture: A practical look at retrieval-augmented generation.
    • Start a trial to explore what you can do with Red Hat OpenShift AI.
    • Browse the AI quickstart catalog for more example use cases.
