Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Build an enterprise RAG system with OGX

May 26, 2026
Abdelhamid Soliman
Related topics:
AI inferenceAPIsArtificial intelligence
Related products:
Red Hat AIRed Hat OpenShift AI

    Anyone who has deployed a naive retrieval-augmented generation (RAG) system has likely run into the same frustrating outcome: a retrieved chunk that appears relevant at first glance, yet fails to answer the question. It might be mathematically similar to the user's query, but contextually, it misses the point entirely.

    The problem is that vanilla semantic search optimizes for vector similarity and quietly ignores everything else that matters in the enterprise: user intent, recency, access boundaries, and domain context. Closing that gap requires more than just a more capable embedding model. It calls for a layered retrieval strategy: metadata filtering to narrow the search space, hybrid retrieval that combines dense and sparse signals, and neural reranking to refine the final results. Together, these techniques transform a simple proof-of-concept chatbot into a production-ready synthesis engine.

    OGX, or Open GenAI Stack, (formerly Llama Stack) is an open source framework for building production-ready RAG applications because it provides a standardized API layer across inference, vector stores, agents, tools, safety, evaluation, and telemetry. Its OpenAI-compatible APIs allow developers to use familiar OpenAI-style clients while running models and retrieval infrastructure on their own platform. For RAG specifically, OGX supports vector store workflows that associate files and documents with vector databases such as Milvus, making it easier to build retrieval pipelines, apply metadata-aware search, and evolve from a simple chatbot into a portable enterprise RAG architecture.

    In this article, we will explore how to evolve a naive RAG chatbot to enterprise RAG by applying metadata filtering, hybrid search, and neural reranking using the OGX framework in Red Hat OpenShift AI.

    Understanding metadata filtering

    Metadata filtering is one of the most practical ways to move a RAG application beyond basic semantic search. Instead of retrieving chunks only because they are mathematically close to the user's query, metadata filters allow the system to narrow the search space using structured attributes such as document type, category, customer, department, date, language, source system, security level, or topic. This makes retrieval more precise because the vector search runs only over content that matches the user's real context and business constraints.

    Why using metadata matters

    Using metadata filters provides practical benefits that help your retrieval pipeline move beyond basic semantic search:

    • Increases application accuracy by retrieving context that better matches the user's intent and business context.
    • Lowers hallucination risk because the model receives fewer unrelated or misleading chunks.
    • Reduces token usage and cost by sending only the most relevant content to the generation step.
    • Allows access control by filtering results based on user permissions, roles, departments, or security levels.
    • Supports multi-tenancy by ensuring users or customers only retrieve data that belongs to their own tenant or workspace.
    • Increases recency-based retrieval by filtering documents using timestamps, publication dates, or version metadata.
    • Allows domain-specific filtering such as filtering by case type, document category, product, region, language, or source system.
    • Increases explainability by making it clearer why specific documents or chunks were selected for the response.

    OGX supports metadata at multiple levels, and each level serves a different purpose in a production RAG application.

    Vector-store metadata is useful for describing and organizing the vector store itself, such as the domain, tenant, version, environment, or ingestion pipeline that created it. This helps when you manage multiple knowledge bases across teams, customers, or applications.

    File or document-based metadata applies to the source document as a whole, such as document type, owner, department, language, access level, publication date, or category. All chunks in the same document will have the same metadata. This is especially useful when you want to filter retrieval to a specific set of documents before searching. For example, "only HR policies," or "only documents from 2025."

    Chunk-based metadata is the most granular level and is attached to each retrieved text chunk, such as section title, page number, topic, incident type, source file name, or security classification. This level is critical when the answer depends on precise context, traceability, and citation back to the exact part of the document.

    Together, these metadata layers make RAG retrieval more controllable. Vector metadata helps organize knowledge bases, file metadata narrows the search scope, and chunk metadata improves answer relevance, filtering precision, and explainability.

    Hybrid search and ranking algorithms

    Hybrid search is an advanced information retrieval technique. It combines dense vector search and sparse keyword search in parallel and merges their results.

    Vector search is effective at finding semantically related content, even when the user uses different wording from the source documents, while keyword-based search, such as BM25, is better at matching exact terms, names, IDs, product codes, or domain-specific phrases.

    By merging both signals, a RAG pipeline can retrieve context that is both meaningfully related and lexically precise. This is especially valuable in enterprise use cases where users might ask natural-language questions but still expect exact matches for entities, policies, error codes, or technical terms.

    Hybrid search reduces missed results, improves recall, lowers the chance of irrelevant context being passed to the large language model (LLM), and produces more accurate and grounded answers.

    Vector search and keyword-based retrieval use fundamentally different scoring models. Dense vector similarity typically produces normalized scores between 0 and 1, while keyword retrieval methods generate unbounded relevance scores based on term frequency and document statistics. Because these scoring systems are not directly comparable, hybrid search combines them using ranking algorithms:

    Reciprocal Rank Fusion (RRF) does not rely on raw retrieval scores. Instead, it evaluates the ranking position of each document across multiple retrieval methods. Documents that consistently appear near the top of different result lists receive higher final rankings. As a result, RRF favors documents that demonstrate strong relevance signals from both semantic understanding and keyword matching, leading to more balanced and reliable retrieval results.

    Weighted Average Fusion combines retrieval methods by preserving their original relevance scores while normalizing them into a common scale, typically between 0 and 1. After normalization, configurable weighting factors are applied to control how much influence each retrieval method contributes to the final ranking. This approach provides fine-grained control over hybrid retrieval behavior, allowing teams to prioritize either semantic vector search or keyword-based retrieval depending on which performs better for their domain.

    OGX provides hybrid search through the Vector Store Search API by allowing the application to set search_mode="hybrid" while specifying ranking options when searching a vector store with a supported provider like Milvus.

    Neural reranking using cross-encoder models

    Traditional embedding models use a bi-encoder architecture where the query and documents are encoded independently into dense vector representations. For example:

    • The user query is converted into a vector embedding.
    • Each document or chunk is also converted into its own vector embedding.
    • Retrieval is then performed by comparing these vectors using similarity metrics such as cosine similarity.

    This design is highly scalable because document embeddings can be precomputed and stored in a vector database during ingestion time, allowing fast similarity search at query time. However, the approach has an important limitation where the query and document are encoded separately and never directly interact during retrieval. As a result, the model might miss deeper contextual relationships between the query and the retrieved content, which can lead to less accurate ranking of results.

    Cross-encoder models use a fundamentally different retrieval strategy compared to traditional bi-encoders. Instead of encoding the query and document separately, a cross-encoder processes both together as a single input pair through the transformer model. The model then produces a single relevance score that reflects how well the document answers the query.

    Because the query and document interact directly across all transformer layers, cross-encoders can capture much richer contextual relationships between terms, phrases, and intent.

    This allows significantly more accurate relevance ranking, especially when semantic similarity alone cannot determine whether a document truly answers the user's question. However, this accuracy comes at the cost of performance, because the model must perform inference for every query-document pair individually. This makes cross-encoders considerably more computationally expensive than traditional bi-encoder retrieval. Figure 1 compares the data flows of the bi-encoder and cross-encoder architectures.

    bi-encoder VS cross-encoder
    Figure 1: Bi-encoder versus cross-encoder.

    To see these advanced retrieval strategies in action, you can set up a hands-on demo environment using Red Hat OpenShift AI.

    Prerequisites

    • Access to an OpenShift cluster with Red Hat OpenShift AI 3.4 or later installed with the LlamaStack/OGX operator enabled.
    • A configured NVIDIA GPU operator with at least one NVIDIA GPU (for example, A10, A100, L40S, T4, or similar).

    Demo setup

    In this demo, we will use the AG News dataset from Hugging Face, which contains news articles along with their corresponding categories. These categories will be used as metadata to improve retrieval in the RAG pipeline.

    1. Clone the GitHub demo repository to bring the sample notebooks and deployment assets into your local environment.
    2. In your terminal, log in to your OpenShift cluster:

      oc login --token=<token> --server=<server>
    3. Create a new OpenShift project:

      PROJECT="agnews-rag-demo" 
      oc new-project ${PROJECT}
      oc label namespace ${PROJECT} opendatahub.io/dashboard=true --overwrite
    4. Install the Helm chart into the project.

      helm install agnews-rag-demo ./chart --set namespace=${PROJECT}
    5. Log in to the OpenShift AI dashboard and create a workbench of type Jupyter minimal CPU inside the agnews-rag-demo project.
    6. Inside the workbench, upload the notebook files from the cloned folder named notebooks (Figure 2).

      Figure 2: Import notebook files into the workbench
      Figure 2: Import notebook files into the workbench.

    The demo uses the following deployed models, each serving a specific role in the RAG pipeline: generation, embedding, and reranking (Figure 3).

    Model namePurposeRuntime
    llama-32-3b-instructLLMvLLM NVIDIA GPU
    granite-embedding-125Embedding modelOGX inline provider sentence-transformers
    qwen3-reranker-0.6bCross-encoder rerankervLLM on CPU (for demo purposes, a CPU is used)
    Figure 2: RAG Demo Architecture
    Figure 3: RAG demo architecture.

    Ingestion pipeline

    The Ingestion_pipeline_ag_news notebook uses a file-based ingestion pattern with OGX vector stores (illustrated in Figure 4):

    1. The pipeline creates a vector store and configures it with the selected embedding model and vector store-level metadata such as version_no and tenant_id to support versioning and tenant isolation.
    2. For each AG News article, the pipeline creates a text document and uploads that file through the Files API.
    3. The pipeline associates the uploaded file with the created vector store. During the association step, the pipeline attaches metadata attributes such as category and document type.
    4. The OGX server handles the heavy lifting: chunking the content, generating embeddings with the configured embedding model, and indexing the vectors into the Milvus vector store.
    RAG Ingestion Pipeline
    Figure 4: RAG ingestion pipeline.

    Retrieval pipeline

    The following flow (illustrated in Figure 5) is implemented in the retrieval_pipeline_ag_news notebook:

    1. The user asks something like: "Find business news about oil prices." The chatbot receives the query as the starting point for the retrieval pipeline.
    2. The user's query is first processed by the LLM using function calling to extract relevant metadata from the natural language query, using chat.completions with a tool such as build_metadata_filter to understand whether the query implies any structured filters.
    3. The chatbot performs vector store search in hybrid mode using vector.stores.search() with the original user query, the vector store ID, search_mode="hybrid", and the extracted metadata filter.
    4. The chatbot then sends the retrieved candidate documents to the /rerank endpoint, together with the original query. The reranker compares the query against each candidate result to determine which is more relevant.
    5. The chatbot makes a final chat.completions call with a system prompt and the retrieved context. The model is instructed to answer only using the reranked results.
    RAG Retrieval Pipeline
    Figure 5: RAG retrieval pipeline.

    Summary

    In this article, we explored how to evolve a naive RAG chatbot into a reliable enterprise RAG application using the OGX framework. The article explained how metadata filtering, hybrid search, and neural reranking improve retrieval accuracy, reduce irrelevant context, and produce more grounded answers. It also demonstrated these techniques using the AG News dataset, Milvus, open embedding models, and a reranker model. Finally, it walked through the ingestion and retrieval pipelines, showing how OGX APIs can orchestrate file ingestion, vector search, metadata-aware filtering, reranking, and final answer.

    Ready to build your own enterprise retrieval pipeline? Explore the AG News RAG demo repository to experiment with metadata filtering and neural reranking on Red Hat OpenShift AI.

    Related Posts

    • Deploy an enterprise RAG chatbot with Red Hat OpenShift AI

    • Fine-tune a RAG model with Feast and Kubeflow Trainer

    • Level up your generative AI with LLMs and RAG

    • Scale LLM fine-tuning with Training Hub and OpenShift AI

    • How to manage Red Hat OpenShift AI dependencies with Kustomize and Argo CD

    • Fine-tune AI pipelines in Red Hat OpenShift AI 3.3

    Recent Posts

    • Testing infrastructure red teaming with abliterated models

    • Build an enterprise RAG system with OGX

    • Solutions for SELinux MCS challenges with GitLab runners

    • MCP servers vs. skills: Choosing the right context for your AI

    • How to route external and local LLMs with Models-as-a-Service

    What’s up next?

    Learning Path intro-to-OS-LP-feature-image

    Introduction to OpenShift AI

    Learn how to use Red Hat OpenShift AI to quickly develop, train, and deploy...
    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.