Generate synthetic data for your AI models with SDG Hub

How to turn a bit of data into a bunch of data

December 2, 2025
Legare Kerrison, Frank La Vigne
Related topics: APIs, Artificial intelligence, Python
Related products: Red Hat AI Inference Server, Red Hat AI, Red Hat OpenShift AI

    "Garbage in, garbage out" might be the oldest cliché in computer science, but in the age of AI, it's practically a prophecy. Models are only as good as the data they're trained on—and most of us don't have mountains of clean, specialized, regulation-safe data lying around.

    Enter SDG Hub, an open source framework that turns a bit of quality data into a lot of useful data. It works by stringing together modular blocks into flexible flows, which are automated pipelines that generate, transform, and validate synthetic data tailored to your domain. You can run these flows locally (so no sensitive data leaves your premises) or through APIs, scaling your dataset without increasing your risk.

    For developers fine-tuning smaller, task-specific models or building agentic AI systems, synthetic data offers a way to teach models faster, more safely, and more cheaply, with async execution, Pydantic validation, and detailed monitoring baked in. Learn more in this related blog post: SDG Hub: Building synthetic data pipelines with modular blocks.

    Try it yourself

    Here is a clean, copy-pasteable Jupyter Notebook walkthrough showing how to use SDG Hub to turn a minimal seed dataset into a more extensive, high-quality set of synthetic question-and-answer pairs built for training a model. Check out the notebook or run the following commands in your preferred terminal.

    Regardless of what tools you use, these commands will:

    1. Install dependencies.
    2. Load a flow that generates question-and-answer pairs.
    3. Create a small dataset.
    4. Test with a dry run.
    5. Generate a large synthetic dataset.

    Environment setup

    Before running any commands, make sure your environment is set up for SDG Hub. You'll need Python 3.10 or newer, a virtual environment (recommended for dependency management), and either a local model endpoint like Ollama or vLLM, or access to an OpenAI-compatible API key.

    Once your environment is ready, installing SDG Hub and its example flows is as simple as a few pip commands. Then you're ready to start generating synthetic data right from your terminal or Jupyter Notebook.
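
    If you want to confirm that your interpreter meets the Python 3.10 requirement from inside a notebook, a quick standard-library check like this one (not part of the original walkthrough) does the job:

    import sys

    # Sanity check: SDG Hub requires Python 3.10 or newer
    assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version}"
    print(f"Detected Python {sys.version.split()[0]}")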

    Step 1: Install dependencies

    In a terminal or a Jupyter Notebook cell, run the following commands to install SDG Hub along with example flows and vLLM integration.

    pip install sdg-hub
    pip install sdg-hub[vllm,examples]

    Step 2: Include the necessary libraries

    from sdg_hub.core.flow import FlowRegistry
    from sdg_hub.core.blocks import BlockRegistry

    Show the available flows

    List all of the available flows. Flows are pre-built workflows for generating synthetic data.

    FlowRegistry.discover_flows()

    Show the available blocks

    Then list all of the available blocks. Blocks are the components that make up flows; you can rearrange them to build your own flow, like LEGO bricks.

    BlockRegistry.discover_blocks()

    Step 3: Run your first flow

    Here we'll import a pre-built question-answer generation flow for knowledge tuning.

    from sdg_hub.core.flow import FlowRegistry, Flow
    from datasets import Dataset

    For our purposes here, we will run one of the pre-built workflows that generates question and answer pairs.

    # Load a pre-built flow
    flow_name = "Advanced Document Grounded Question-Answer Generation Flow for Knowledge Tuning"
    flow_path = FlowRegistry.get_flow_path(flow_name)
    flow = Flow.from_yaml(flow_path)

    Configure your model backend

    This workflow requires a large language model to generate content, and also to act as a teacher and a critic. SDG Hub doesn't download or run these models for you; you'll need to have your chosen model endpoint set up separately before proceeding.

    SDG Hub can connect to any OpenAI-compatible API, whether that's a locally hosted option like Ollama or vLLM, or a cloud-hosted service such as OpenAI or Anthropic.

    Once your endpoint is running, you'll point SDG Hub to it by specifying the model name, API base URL, and API key in the configuration.

    Option A: Ollama (free, easiest local option)

    Ollama is great for testing. Install it, pull a model (for example, ollama pull llama3), and SDG Hub can use it as an OpenAI-compatible endpoint for free.

    To run locally on CPU/GPU via Ollama:

    flow.set_model_config({
        "model": "ollama/llama3",
        "api_base": "http://localhost:11434/v1",
        "api_key": "ollama"
    })

    Option B: Local vLLM (free, GPU required)

    If you're running vLLM locally or as a remote endpoint:

    flow.set_model_config(
        model="hosted_vllm/meta-llama/Llama-3.1-8B-Instruct",
        api_base="http://localhost:8000/v1",  # or http://<remote-ip>:8000/v1 for a remote endpoint
        api_key="your_api_key_here",  # or "dummy" if your endpoint doesn't require a key
    )

    (Note: Using the vllm/ prefix to run vLLM in-process via the local SDK has been deprecated.)

    Option C: OpenAI or Claude API (paid)

    You can use any OpenAI-compatible endpoint, local or hosted.

    flow.set_model_config(
        model="openai/gpt-3.5-turbo",
        api_key="your_api_key_here"
    )
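
    As an aside (not part of the original walkthrough), you can keep the key out of your notebook by reading it from an environment variable; this sketch assumes you exported OPENAI_API_KEY before launching the notebook:

    import os

    # Assumes OPENAI_API_KEY was exported in the shell beforehand
    flow.set_model_config(
        model="openai/gpt-3.5-turbo",
        api_key=os.environ["OPENAI_API_KEY"]
    )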

    Getting the default model for the flow

    Each pre-built flow has a list of recommended models defined in its flow YAML. The following code loads the flow from that definition:

    # load a pre-built flow
    flow_name = "Advanced Document Grounded Question-Answer Generation Flow for Knowledge Tuning"
    flow_path = FlowRegistry.get_flow_path(flow_name)
    flow = Flow.from_yaml(flow_path)

    Step 4: Create a sample dataset

    We'll start with a simple document and a few in-context learning (ICL) example queries and responses. To keep things simple, we have defined the dataset in code. You can also load data from multiple sources, such as documents via Docling, or other data storage systems. 

    # Create a sample and simple dataset
    dataset = Dataset.from_dict({
        'document': ['The Great Dane is a German breed of domestic dog known for its imposing size. It is one of the world\'s tallest dog breeds, often referred to as the "Apollo of Dogs."'],
        'document_outline': ['1. Great Dane Origin; 2. Size and Height; 3. Breed Nicknames'],
        'domain': ['Canine Breeds'],
        'icl_document': ['The Labrador Retriever is a British breed of retriever gun dog that is consistently one of the most popular dog breeds in the world.'],
        'icl_query_1': ['What is the origin of the Labrador Retriever?'],
        'icl_response_1': ['The Labrador Retriever is a British breed.'],
        'icl_query_2': ['What type of dog is a Labrador?'],
        'icl_response_2': ['The Labrador is a retriever gun dog.'],
        'icl_query_3': ['How popular is the Labrador Retriever?'],
        'icl_response_3': ['It is consistently one of the most popular dog breeds in the world.']
    })
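
    The sample above hard-codes everything for clarity. If your source material lives in files rather than in code, a minimal sketch along these lines builds an equivalent dataset; the my_docs/ directory is hypothetical, and reusing one outline, domain, and set of ICL examples for every document is a simplification you would tailor in practice:

    from pathlib import Path
    from datasets import Dataset

    # Hypothetical directory of plain-text source documents
    doc_texts = [p.read_text() for p in sorted(Path("my_docs").glob("*.txt"))]
    n = len(doc_texts)

    dataset = Dataset.from_dict({
        'document': doc_texts,
        # Simplification: the same outline, domain, and ICL examples for every document
        'document_outline': ['1. Overview; 2. Key Facts'] * n,
        'domain': ['Your Domain'] * n,
        'icl_document': ['The Labrador Retriever is a British breed of retriever gun dog that is consistently one of the most popular dog breeds in the world.'] * n,
        'icl_query_1': ['What is the origin of the Labrador Retriever?'] * n,
        'icl_response_1': ['The Labrador Retriever is a British breed.'] * n,
        'icl_query_2': ['What type of dog is a Labrador?'] * n,
        'icl_response_2': ['The Labrador is a retriever gun dog.'] * n,
        'icl_query_3': ['How popular is the Labrador Retriever?'] * n,
        'icl_response_3': ['It is consistently one of the most popular dog breeds in the world.'] * n,
    })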

    Quick note if running the code in a Jupyter Notebook

    When running asynchronous code in a Jupyter Notebook, you might encounter runtime errors like RuntimeError: This event loop is already running.

    That's because SDG Hub executes parts of its pipelines asynchronously to handle multiple model requests efficiently. Jupyter itself already runs an event loop, so without a patch, Python would try to start a second loop and fail.

    The following lines fix that by applying the nest_asyncio patch, which safely allows nested event loops in the same runtime:

    import nest_asyncio
    nest_asyncio.apply()

    Step 5: Dry run (recommended first)

    This runs a quick test to ensure the pipeline has no errors or configuration issues:

    # Test with a small sample first (recommended!)
    print("🧪 Running dry run...")
    dry_result = flow.dry_run(dataset, sample_size=1)

    If that runs without errors, run the following to see the results:

    print(f"✅ Dry run completed in {dry_result['execution_time_seconds']:.2f}s")
    print(f"📊 Output columns: {list(dry_result['final_dataset']['columns'])}")

    Step 6: Generate synthetic data

    Once the dry run completes successfully, the pipeline is ready for a full run. Run the following code:

    # Generate high-quality QA pairs
    print("🏗️ Generating synthetic data...")
    result = flow.generate(dataset)

    Step 7: Review and export your generated data

    You will notice that it takes longer to complete the full run than the dry run. That's because far more data is being generated.

    Run the following code to see how many QA (question and answer) pairs have been generated.

    # Explore the results
    print(f"\n📈 Generated {len(result)} QA pairs!")

    Now that we know how many pairs were generated, run the following code to review the generated QA pairs:

    # The length is determined by the length of any of the lists (e.g., 'question')
    num_pairs = len(result['question'])
    print(f"\n--- Generated {num_pairs} QA pairs ---")
    # Iterate from index 0 up to (but not including) num_pairs
    for i in range(num_pairs):
        print(f"\n--- QA Pair #{i+1} ---")
        print(f"📝 Question: {result['question'][i]}")
        print(f"💬 Answer: {result['response'][i]}")
        print(f"🎯 Faithfulness Score: {result['faithfulness_judgment'][i]}")
        print(f"📏 Relevancy Score: {result['relevancy_score'][i]}")
    print("\n--- End of Report ---")

    Explore synthetic data more closely:

    type(result)
    df = result.to_pandas()
    df.shape
    df.info()

    Show synthetic data:

    df.head()

    Export the entire dataset to a CSV file:

    df.to_csv('entire_synthetic_dataset.csv')

    Narrow the dataset to just Q&A pairs:

    qa_df = result.to_pandas()[["question", "response", "verification_rating", "relevancy_score", "faithfulness_judgment"]]
    qa_df

    Export the Q&A pairs to a CSV file:

    # Assuming qa_df is your DataFrame
    qa_df.to_csv("synthetic_qa_pairs.csv", index=False)
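
    If your fine-tuning tooling expects chat-style JSONL rather than CSV, a minimal sketch like the following converts the pairs; the {"messages": [...]} schema shown here is a common generic format, not something SDG Hub prescribes, so adapt it to whatever your training stack expects:

    import json

    # Convert each Q&A pair into a generic chat-style record and write one JSON object per line
    with open("synthetic_qa_pairs.jsonl", "w") as f:
        for _, row in qa_df.iterrows():
            record = {
                "messages": [
                    {"role": "user", "content": row["question"]},
                    {"role": "assistant", "content": row["response"]},
                ]
            }
            f.write(json.dumps(record) + "\n")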

    Hopefully, you found this helpful and are now left with a purpose-built dataset, ready to train your model. If so, check out SDG Hub, clone the repo, tweak a flow, and start teaching your model something new. Happy generating!
