Synthetic data for RAG evaluation: Why your RAG system needs better testing

February 23, 2026
Aditi Saluja, William Caban Babilonia, Suhas Kashyap
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    Retrieval-augmented generation (RAG) has become the default architecture for enterprise large language model (LLM) applications. By grounding models in external knowledge bases, RAG systems can provide accurate, up-to-date responses without the cost and complexity of fine-tuning. In practice, however, most RAG systems reach production with weak evaluation strategies.

    Teams tune embeddings, retrievers, chunking strategies, and prompts, but still rely on manual spot checks, small hand-labeled datasets, or generic LLM-as-a-judge metrics to assess quality. The result: systems that appear to work but fail silently under real user traffic. So the real question becomes: How do you know your RAG system actually works, and can you explain why it fails when it doesn't?

    Why RAG evaluation is hard

    Evaluating RAG systems is fundamentally harder than evaluating stand-alone language models. Traditional approaches break down in several ways:

    • Retrieval/generation entanglement: When a RAG system fails, it's often unclear whether the retriever surfaced the wrong documents or the LLM hallucinated despite correct context. Without known ground truth context, debugging becomes guesswork.
    • No ground truth at scale: Most evaluation datasets are manually curated or lightly generated. This makes them expensive, slow to produce, and impossible to scale as knowledge bases grow. More importantly, they rarely include ground truth context, making retrieval evaluation unreliable.
    • Knowledge-base drift: Enterprise documents evolve continuously. Static test sets become outdated almost immediately, resulting in misleading evaluations that no longer reflect production behavior.
    • Domain-specific blind spots: Generic metrics often miss failure modes that matter most in practice, such as regulatory correctness in finance, clinical precision in health care, or procedural accuracy in technical documentation.

    Why evaluation matters more than ever

    As RAG systems are increasingly embedded inside agentic and multi-step workflows, evaluation failures compound. A single retrieval error can cascade across tool calls, memory updates, or downstream decisions. Without rigorous evaluation, teams lose the ability to:

    • Compare retriever or embedding changes objectively
    • Debug failures at the component level
    • Improve system quality systematically over time

    As the saying goes, you can't improve what you can't measure.

    The solution: Synthetic data generation

    Synthetic data unlocks a fundamentally different approach:

    • High-quality question-answer-context triplets: Generate evaluation datasets directly from your knowledge base with realistic questions, grounded answers, and ground truth context.
    • Automatic ground truth creation: No manual annotation needed. Test retrieval and generation separately. Know exactly which component failed.
    • Repeatable benchmarks: Compare different embedding models, chunking strategies, and LLM configurations. Track improvements over time with confidence.

    How the RAG evaluation dataset flow works

    SDG Hub is Red Hat's open source Python framework for building synthetic data generation pipelines. SDG Hub offers a pre-built RAG evaluation dataset flow that generates high-quality question-answer-context triplets, as shown in Figure 1.

    Figure 1: RAG evaluation dataset generation pipeline (supported by SDG Hub).

    The RAG evaluation flow turns raw documents into high-quality evaluation datasets through a structured, grounded pipeline:

    1. Topic extraction identifies key concepts from the document to anchor evaluation.
    2. Question generation creates initial questions guided by the document outline.
    3. Question evolution refines these into realistic, user-style queries.
    4. Answer generation produces answers grounded strictly in the source content.
    5. Groundedness filtering removes low-quality question-answer pairs.
    6. Context extraction isolates the minimal ground truth context needed to answer each question.

    The result is a clean, RAG-ready dataset of questions, answers, and gold contexts designed for reliable retrieval and answer evaluation.
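
    For intuition, a single record in the resulting dataset pairs a realistic question with a grounded answer and the minimal gold context that supports it. A sketch with invented values (the field names follow the flow's output contract described below):

```python
# One illustrative record from a generated RAG evaluation dataset.
# Field names match the flow's output contract; the values are made up.
record = {
    "question": "What does Kubernetes automate for containerized applications?",
    "answer": (
        "Kubernetes automates the deployment, scaling, and management "
        "of containerized applications."
    ),
    "ground_truth_context": (
        "Kubernetes is an open source system for automating deployment, "
        "scaling, and management of containerized applications."
    ),
}

# Because the gold context is known per question, retrieval can be scored
# directly: did the retriever return this chunk for this question?
assert set(record) == {"question", "answer", "ground_truth_context"}
```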

    Getting started

    Generate grounded evaluation datasets for your RAG system in minutes.

    1. Install

    For SDG Hub, we recommend using the uv package manager. After you've installed uv, use it to install the sdg-hub Python package:

    uv pip install sdg-hub

    2. Try the RAG evaluation example

    Clone the repo and open the notebook:

    git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub

    For more details about data preprocessing for the SDG flow and data post-processing for specific evaluation frameworks, refer to the notebook. You can also customize the flow SDG Hub ships out of the box to fit your evaluation needs. Here's a simple example of SDG Hub in use:

    from datasets import Dataset
    from sdg_hub import Flow, FlowRegistry
    
    # 1) Load the RAG evaluation flow
    flow = Flow.from_yaml(
        FlowRegistry.get_flow_path("RAG Evaluation Dataset Flow")
    )
    
    # 2) Provide minimal input: content + outline
    input_dataset = Dataset.from_dict({
        "document": [
            "Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications.",
            "OpenShift is Red Hat's enterprise Kubernetes platform with added developer and operational tooling."
        ],
        "document_outline": [
            "Kubernetes overview as the standard platform for containers",
            "OpenShift overview as the enterprise Kubernetes platform"
        ],
    })
    })
    
    # 3) Generate the evaluation dataset
    result = flow.generate(input_dataset)
    
    # 4) Inspect outputs
    df = result.to_pandas()
    print(df.columns)
    df.head()

    3. Explore the docs

    Take a look at the documentation available at https://ai-innovation.team/sdg_hub.

    Input contract

    The RAG Evaluation flow expects exactly two columns, enforced to ensure grounded and debuggable evaluation.

    • document: The atomic unit of knowledge your RAG system should retrieve (document, section, or chunk). This is treated as the gold reference context, so it must match your production chunking strategy.
    • document_outline: A short, intent-level label (title or summary) used to guide realistic question generation. Good outlines prevent trivial or purely extractive questions.

    This separation ensures questions reflect real user intent while answers remain strictly grounded in known context, making downstream retrieval and generation metrics meaningful.
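
    Because the flow enforces this two-column contract, it can help to fail fast on malformed inputs before spending generation time. A minimal sketch (this validation helper is our own illustration, not part of SDG Hub):

```python
REQUIRED_COLUMNS = {"document", "document_outline"}

def validate_rag_eval_input(columns):
    """Check that an input dataset has exactly the two columns the
    RAG evaluation flow expects. `columns` is any iterable of names."""
    cols = set(columns)
    missing = REQUIRED_COLUMNS - cols
    extra = cols - REQUIRED_COLUMNS
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    if extra:
        raise ValueError(f"unexpected columns: {sorted(extra)}")
    return True

# Usage with a dict-of-lists input (mirrors Dataset.from_dict):
data = {"document": [], "document_outline": []}
validate_rag_eval_input(data.keys())
```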

    Output contract

    The RAG evaluation flow returns a dataset containing:

    • question: Synthetic user question grounded in your content
    • answer: Answer generated based on the ground-truth context
    • ground_truth_context: The exact chunk/section used as the "gold" context

    From synthetic data to end-to-end RAG evaluation

    After generation, SDG Hub outputs are post-processed into an evaluation-ready dataset containing synthetic user queries, ground-truth answers, and gold reference contexts. This dataset is then executed against a real RAG pipeline, which produces retrieved contexts and generated answers.

    Together, these signals form the full input required by downstream evaluation frameworks. Because the ground-truth context is known, evaluation metrics reflect true retrieval and generation quality—not proxy judgments from another LLM.

    Typical metrics include:

    • Context precision and recall: Did retrieval surface the correct document spans?
    • Faithfulness: Is the answer supported by retrieved context?
    • Answer relevance: Does the response actually answer the question?
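
    To make the retrieval metrics concrete, here is a deliberately simplified, exact-match version of context precision and recall. Real evaluation frameworks score at a finer granularity (for example, claim- or sentence-level), so treat this only as intuition:

```python
def context_precision(retrieved, gold):
    """Fraction of retrieved chunks that are gold chunks."""
    if not retrieved:
        return 0.0
    gold_set = set(gold)
    return sum(1 for chunk in retrieved if chunk in gold_set) / len(retrieved)

def context_recall(retrieved, gold):
    """Fraction of gold chunks that were retrieved."""
    if not gold:
        return 0.0
    retrieved_set = set(retrieved)
    return sum(1 for chunk in gold if chunk in retrieved_set) / len(gold)

# The retriever found the gold chunk, but padded it with two irrelevant ones:
gold = ["chunk-a"]
retrieved = ["chunk-a", "chunk-b", "chunk-c"]
print(context_precision(retrieved, gold))  # one of three retrieved is gold
print(context_recall(retrieved, gold))     # the gold chunk was found
```

High recall with low precision, as in this example, suggests the retriever finds the right content but drowns it in noise; the reverse pattern points to missing coverage.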

    A closed-loop evaluation workflow

    SDG Hub enables a repeatable, metric-driven workflow:

    1. Generate grounded evaluation data from your knowledge base
    2. Run the dataset through your RAG system
    3. Score retrieval and generation quality
    4. Compare configurations and track improvements over time
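
    The steps above lend themselves to simple regression tracking: score each configuration against the same generated dataset and compare. A sketch where both the configuration names and scores are placeholders (in practice, scores come from an evaluation framework run against each configuration):

```python
# Scores per configuration on the same synthetic evaluation set.
# All names and numbers below are illustrative, not real benchmark results.
runs = {
    "embedder-small + 512-token chunks": {"context_recall": 0.71, "faithfulness": 0.88},
    "embedder-large + 256-token chunks": {"context_recall": 0.83, "faithfulness": 0.90},
}

def best_config(runs, metric):
    """Pick the configuration with the highest score on one metric."""
    return max(runs, key=lambda name: runs[name][metric])

print(best_config(runs, "context_recall"))
```

Because the evaluation dataset is regenerated from the same knowledge base as it evolves, these comparisons stay meaningful over time instead of drifting with a stale test set.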

    Why this matters

    Turn synthetic data from "nice test data" into a systematic optimization tool:

    • Benchmark retrievers, embeddings, and chunking strategies
    • Isolate whether failures come from retrieval or generation
    • Re-run the same evaluation as models or data evolve

    In short, SDG Hub combined with downstream evaluation frameworks replaces intuition-driven RAG tuning with measurable, repeatable improvement. Get started today.
