Synthetic data for RAG evaluation: Why your RAG system needs better testing

February 23, 2026
Aditi Saluja, William Caban Babilonia, Suhas Kashyap
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    Retrieval-augmented generation (RAG) has become the default architecture for enterprise large language model (LLM) applications. By grounding models in external knowledge bases, RAG systems can provide accurate, up-to-date responses without the cost and complexity of fine-tuning. Yet in practice, most RAG systems reach production with weak evaluation strategies.

    Teams tune embeddings, retrievers, chunking strategies, and prompts—but still rely on manual spot checks, small hand-labeled datasets, or generic LLM-as-a-judge metrics to assess quality. The result: systems that appear to work, but fail silently under real user traffic. So the real question becomes: How do you know your RAG system actually works—and why it fails when it doesn't?

    Why RAG evaluation is hard

    Evaluating RAG systems is fundamentally harder than evaluating stand-alone language models. Traditional approaches break down in several ways:

    • Retrieval or generation entanglement: When a RAG system fails, it's often unclear whether the retriever surfaced the wrong documents or the LLM hallucinated despite correct context. Without known ground-truth context (a verifiable reference), debugging becomes guesswork.
    • No ground truth at scale: Most evaluation datasets are manually curated or lightly generated. This makes them expensive, slow to produce, and impossible to scale as knowledge bases grow. More importantly, they rarely include ground truth context, making retrieval evaluation unreliable.
    • Knowledge-base drift: Enterprise documents evolve continuously. Static test sets become outdated almost immediately, resulting in misleading evaluations that no longer reflect production behavior.
    • Domain-specific blind spots: Generic metrics often miss failure modes that matter most in practice, such as regulatory correctness in finance, clinical precision in health care, or procedural accuracy in technical documentation.

    Why evaluation matters more than ever

    As RAG systems are increasingly embedded inside agentic and multi-step workflows, evaluation failures compound. A single retrieval error can cascade across tool calls, memory updates, or downstream decisions. Without rigorous evaluation, teams lose the ability to:

    • Compare retriever or embedding changes objectively
    • Debug failures at the component level
    • Improve system quality systematically over time

    As the saying goes, you can't improve what you can't measure.

    The solution: Synthetic data generation

    Synthetic data unlocks a fundamentally different approach:

    • High-quality question-answer-context triplets: Generate evaluation datasets directly from your knowledge base with realistic questions, grounded answers, and ground truth context.
    • Automatic ground truth creation: No manual annotation needed. Test retrieval and generation separately. Know exactly which component failed.
    • Repeatable benchmarks: Compare different embedding models, chunking strategies, and LLM configurations. Track improvements over time with confidence.
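    The triplet structure described above can be sketched as a simple record. The class and field values here are illustrative, not SDG Hub's API; they only show the shape of one evaluation example:

    ```python
    from dataclasses import dataclass

    @dataclass
    class EvalTriplet:
        # A synthetic, user-style question generated from the knowledge base
        question: str
        # An answer grounded strictly in the source content
        answer: str
        # The exact chunk the answer is based on -- the retrieval gold label
        ground_truth_context: str

    triplet = EvalTriplet(
        question="What does Kubernetes automate?",
        answer="Deployment, scaling, and management of containerized applications.",
        ground_truth_context=(
            "Kubernetes is an open source system for automating deployment, "
            "scaling, and management of containerized applications."
        ),
    )
    ```

    Because every example carries its own gold context, retrieval and generation can be scored independently from the same record.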

    How the RAG evaluation dataset flow works

    SDG Hub is Red Hat's open source Python framework for building synthetic data generation pipelines. It offers a pre-built RAG evaluation dataset flow that generates high-quality question-answer-context triplets, as shown in Figure 1.

    Figure 1: RAG evaluation dataset generation pipeline (supported by SDG Hub).

    The RAG evaluation flow turns raw documents into high-quality evaluation datasets through a structured, grounded pipeline:

    1. Topic extraction identifies key concepts from the document to anchor evaluation.
    2. Question generation creates initial questions guided by the document outline.
    3. Question evolution refines these into realistic, user-style queries.
    4. Answer generation produces answers grounded strictly in the source content.
    5. Groundedness filtering removes low-quality question-answer pairs.
    6. Context extraction isolates the minimal ground truth context needed to answer each question.

    The result is a clean, RAG-ready dataset of questions, answers, and gold contexts designed for reliable retrieval and answer evaluation.
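    To make step 5 concrete, a crude groundedness filter could drop any question-answer pair whose answer uses words that never appear in the source document. SDG Hub's actual filtering is more sophisticated; this lexical-overlap version is only a sketch of the idea:

    ```python
    import string

    def _words(text: str) -> list[str]:
        # Lowercase and strip punctuation so "deployment," matches "deployment"
        return [w.strip(string.punctuation) for w in text.lower().split()]

    def is_grounded(answer: str, document: str, threshold: float = 0.8) -> bool:
        # Fraction of answer words that also appear in the source document
        doc_words = set(_words(document))
        ans_words = _words(answer)
        if not ans_words:
            return False
        overlap = sum(1 for w in ans_words if w in doc_words)
        return overlap / len(ans_words) >= threshold

    doc = ("Kubernetes is an open source system for automating deployment, "
           "scaling, and management of containerized applications.")

    # Mostly reuses document vocabulary -> passes the filter
    grounded = is_grounded("Kubernetes automates deployment and scaling.", doc, threshold=0.5)
    # Introduces content not in the document -> filtered out
    hallucinated = is_grounded("Kubernetes was invented on Mars in 1802.", doc, threshold=0.5)
    ```

    In practice an LLM judge replaces the lexical check, but the contract is the same: pairs that cannot be supported by the source document never enter the evaluation set.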

    Getting started

    Generate grounded evaluation datasets for your RAG system in minutes.

    1. Install

    For SDG Hub, we recommend installing with the uv tool. After you've installed uv, use it to install the sdg-hub Python package:

    uv pip install sdg-hub

    2. Try the RAG evaluation example

    Clone the repo and open the notebook:

    git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub

    For more details about data preprocessing for the SDG flow, and data post-processing for specific evaluation frameworks, refer to the notebook. You can also adapt the flow SDG Hub ships out of the box to your custom evaluation needs. Here's a simple example of SDG Hub in use:

    from datasets import Dataset
    from sdg_hub import Flow, FlowRegistry
    
    # 1) Load the RAG evaluation flow
    flow = Flow.from_yaml(
        FlowRegistry.get_flow_path("RAG Evaluation Dataset Flow")
    )
    
    # 2) Provide minimal input: content + outline
    input_dataset = Dataset.from_dict({
        "document": [
            "Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications.",
            "OpenShift is Red Hat's enterprise Kubernetes platform with added developer and operational tooling."
        ],
        "document_outline": [
            "Kubernetes overview as the standard platform for containers",
            "OpenShift overview as the enteprise Kubenertes platform "
        ],
    })
    
    # 3) Generate the evaluation dataset
    result = flow.generate(input_dataset)
    
    # 4) Inspect outputs
    df = result.to_pandas()
    print(df.columns)
    df.head()

    3. Explore the docs

    Take a look at the documentation available at https://ai-innovation.team/sdg_hub.

    Input contract

    The RAG evaluation flow expects exactly two input columns; this contract is enforced to keep evaluation grounded and debuggable.

    • document: The atomic unit of knowledge your RAG system should retrieve (a document, section, or chunk). This is treated as the gold reference context, so it must match your production chunking strategy.
    • document_outline: A short, intent-level label (title or summary) used to guide realistic question generation. Good outlines prevent trivial or purely extractive questions.

    This separation ensures questions reflect real user intent while answers remain strictly grounded in known context, making downstream retrieval and generation metrics meaningful.

    Output contract

    The RAG evaluation flow returns a dataset containing:

    • question: Synthetic user question grounded in your content
    • answer: Answer generated based on the ground-truth context
    • ground_truth_context: The exact chunk/section used as the "gold" context
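    A typical next step is reshaping these three columns into whatever record shape your evaluation framework expects. The target field names below ("user_input", "reference", "reference_contexts") vary by framework and are purely illustrative:

    ```python
    # Rows as they come out of the flow, using the output-contract column names
    rows = [
        {
            "question": "What does Kubernetes automate?",
            "answer": "Deployment, scaling, and management of containerized applications.",
            "ground_truth_context": "Kubernetes is an open source system ...",
        },
    ]

    # Hypothetical post-processing into an evaluator-friendly record shape
    eval_records = [
        {
            "user_input": r["question"],                         # query sent to the RAG system
            "reference": r["answer"],                            # ground-truth answer
            "reference_contexts": [r["ground_truth_context"]],   # gold context(s)
        }
        for r in rows
    ]
    ```

    Keeping the mapping explicit like this makes it easy to retarget the same generated dataset at different evaluation frameworks.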

    From synthetic data to end-to-end RAG evaluation

    After generation, SDG Hub outputs are post-processed into an evaluation-ready dataset containing synthetic user queries, ground-truth answers, and gold reference contexts. This dataset is then executed against a real RAG pipeline, which produces retrieved contexts and generated answers.

    Together, these signals form the full input required by downstream evaluation frameworks. Because the ground-truth context is known, evaluation metrics reflect true retrieval and generation quality—not proxy judgments from another LLM.

    Typical metrics include:

    • Context precision and recall: Did retrieval surface the correct document spans?
    • Faithfulness: Is the answer supported by retrieved context?
    • Answer relevance: Does the response actually answer the question?
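    Because the gold context is known, context precision and recall reduce to simple set comparisons. The sketch below treats each chunk as atomic; real evaluation frameworks typically score at the span or claim level:

    ```python
    def context_precision(retrieved: list[str], gold: set[str]) -> float:
        # Fraction of retrieved chunks that are actually gold contexts
        if not retrieved:
            return 0.0
        return sum(1 for c in retrieved if c in gold) / len(retrieved)

    def context_recall(retrieved: list[str], gold: set[str]) -> float:
        # Fraction of gold contexts that the retriever surfaced
        if not gold:
            return 0.0
        return sum(1 for c in gold if c in retrieved) / len(gold)

    gold = {"chunk-a"}
    retrieved = ["chunk-a", "chunk-x", "chunk-y"]

    p = context_precision(retrieved, gold)  # 1 of 3 retrieved chunks is gold
    r = context_recall(retrieved, gold)     # the single gold chunk was found
    ```

    Low recall points at the retriever missing gold chunks; low precision points at noisy retrieval padding the context window with irrelevant material.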

    A closed-loop evaluation workflow

    SDG Hub enables a repeatable, metric-driven workflow:

    1. Generate grounded evaluation data from your knowledge base
    2. Run the dataset through your RAG system
    3. Score retrieval and generation quality
    4. Compare configurations and track improvements over time

    Why this matters

    Turn synthetic data from "nice test data" into a systematic optimization tool:

    • Benchmark retrievers, embeddings, and chunking strategies
    • Isolate whether failures come from retrieval or generation
    • Re-run the same evaluation as models or data evolve

    In short, SDG Hub combined with downstream evaluation frameworks replaces intuition-driven RAG tuning with measurable, repeatable improvement. Get started today.
