"Garbage in, garbage out" might be the oldest cliché in computer science, but in the age of AI, it's practically a prophecy. Models are only as good as the data they're trained on—and most of us don't have mountains of clean, specialized, regulation-safe data lying around.
Enter SDG Hub, an open source framework that turns a bit of quality data into a lot of useful data. It works by stringing together modular blocks into flexible flows, which are automated pipelines that generate, transform, and validate synthetic data tailored to your domain. You can run these flows locally (so no sensitive data leaves your premises) or through APIs, scaling your dataset without increasing your risk.
For developers fine-tuning smaller, task-specific models or building agentic AI systems, synthetic data offers a way to teach models faster, more safely, and more cheaply, with async execution, Pydantic validation, and detailed monitoring baked in. Learn more in this related blog post: SDG Hub: Building synthetic data pipelines with modular blocks.
Try it yourself
Here is a clean, copy-pasteable Jupyter Notebook walkthrough showing how to use SDG Hub to turn a minimal seed dataset into a more extensive, high-quality set of synthetic question-and-answer pairs built for training a model. Check out the notebook or run the following commands in your preferred terminal.
Regardless of what tools you use, these commands will:
- Install dependencies.
- Load a flow that generates question-and-answer pairs.
- Create a small dataset.
- Test with a dry run.
- Generate a large synthetic dataset.
Environment setup
Before running any commands, make sure your environment is set up for SDG Hub. You'll need Python 3.10 or newer, a virtual environment (recommended for dependency management), and either a local model endpoint like Ollama or vLLM, or access to an OpenAI-compatible API key.
Once your environment is ready, installing SDG Hub and its example flows is as simple as a few pip commands. Then you're ready to start generating synthetic data right from your terminal or Jupyter Notebook.
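If you want a quick sanity check before installing anything, this optional snippet confirms that your interpreter meets the Python 3.10+ requirement (nothing SDG Hub-specific here):

# Optional: confirm the interpreter meets SDG Hub's Python 3.10+ requirement.
import sys

print(sys.version)
assert sys.version_info >= (3, 10), "SDG Hub requires Python 3.10 or newer"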
Step 1: Install dependencies
In a terminal or a Jupyter Notebook cell, run the following commands to install SDG Hub along with example flows and vLLM integration.
pip install sdg-hub
pip install sdg-hub[vllm,examples]

Step 2: Include the necessary libraries
from sdg_hub.core.flow import FlowRegistry
from sdg_hub.core.blocks import BlockRegistry

Show the available flows
List all of the available flows. Flows are pre-built workflows for generating synthetic data.
FlowRegistry.discover_flows()

Show the available blocks
Then list all of the available blocks. Blocks are the components that make up flows; you can rearrange them like LEGO bricks to build your own flow.
BlockRegistry.discover_blocks()

Step 3: Run your first flow
Here we'll import a pre-built question-answer generation flow for knowledge tuning.
from sdg_hub.core.flow import FlowRegistry, Flow
from datasets import Dataset

For our purposes here, we will run one of the pre-built workflows that generates question and answer pairs.
# Load a pre-built flow
flow_name = "Advanced Document Grounded Question-Answer Generation Flow for Knowledge Tuning"
flow_path = FlowRegistry.get_flow_path(flow_name)
flow = Flow.from_yaml(flow_path)

Configure your model backend
This workflow requires a large language model to generate content, and also to act as a teacher and a critic. SDG Hub doesn't download or run these models for you; you'll need to have your chosen model endpoint set up separately before proceeding.
SDG Hub can connect to any OpenAI-compatible API, whether that's a locally hosted option like Ollama or vLLM, or a cloud-hosted service such as OpenAI or Anthropic.
Once your endpoint is running, you'll point SDG Hub to it by specifying the model name, API base URL, and API key in the configuration.
Option A: Ollama (free, easiest local option)
Ollama is great for testing. Install it, pull a model (for example, ollama pull llama3), and SDG Hub can use it as an OpenAI-compatible endpoint for free.
To run locally on CPU/GPU via Ollama:
flow.set_model_config(
    model="ollama/llama3",
    api_base="http://localhost:11434/v1",
    api_key="ollama"
)

Option B: Local vLLM (free, GPU required)
If you're running vLLM locally or as a remote endpoint:
flow.set_model_config(
    model="hosted_vllm/meta-llama/Llama-3.1-8B-Instruct",
    api_base="http://remote-ip or localhost:8000/v1",
    api_key="your_api_key_here or dummy",
)

(Note: Using vllm/ as a local vLLM SDK in-process has been deprecated.)
Option C: OpenAI or Claude API (paid)
You can use any OpenAI-compatible endpoint, local or hosted.
flow.set_model_config( model="openai/gpt-3.5-turbo", api_key="your_api_key_here" )Getting the default model for the flow
Each pre-built flow has a list of recommended models to use. To view them, run the following code:
# load a pre-built flow
flow_name = "Advanced Document Grounded Question-Answer Generation Flow for Knowledge Tuning"
flow_path = FlowRegistry.get_flow_path(flow_name)
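# Optional sketch, not an official SDG Hub API call: one way to see which
# models a flow recommends is to peek at its YAML definition on disk.
# The key names "metadata" and "recommended_models" are assumptions and may
# differ between flows; requires PyYAML.
import yaml

with open(flow_path) as f:
    flow_definition = yaml.safe_load(f)

if isinstance(flow_definition, dict):
    print(flow_definition.get("metadata", {}).get("recommended_models"))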
flow = Flow.from_yaml(flow_path)

Step 4: Create a sample dataset
We'll start with a simple document and a few in-context learning (ICL) example queries and responses. To keep things simple, we have defined the dataset in code. You can also load data from multiple sources, such as documents via Docling, or other data storage systems.
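If your seed examples already live in a file rather than in code, you could load them with the Hugging Face datasets library instead. Here is a minimal sketch, assuming a hypothetical seed_data.jsonl whose records carry the same fields used in the in-code example below:

from datasets import load_dataset

# Hypothetical file: one JSON object per line, with the same fields as the
# in-code example (document, document_outline, domain, icl_document,
# icl_query_1, icl_response_1, and so on).
dataset = load_dataset("json", data_files="seed_data.jsonl", split="train")

For this walkthrough, we stick with the small in-code dataset: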
# Create a sample and simple dataset
dataset = Dataset.from_dict({
'document': ['The Great Dane is a German breed of domestic dog known for its imposing size. It is one of the world\'s tallest dog breeds, often referred to as the "Apollo of Dogs."'],
'document_outline': ['1. Great Dane Origin; 2. Size and Height; 3. Breed Nicknames'],
'domain': ['Canine Breeds'],
'icl_document': ['The Labrador Retriever is a British breed of retriever gun dog that is consistently one of the most popular dog breeds in the world.'],
'icl_query_1': ['What is the origin of the Labrador Retriever?'],
'icl_response_1': ['The Labrador Retriever is a British breed.'],
'icl_query_2': ['What type of dog is a Labrador?'],
'icl_response_2': ['The Labrador is a retriever gun dog.'],
'icl_query_3': ['How popular is the Labrador Retriever?'],
'icl_response_3': ['It is consistently one of the most popular dog breeds in the world.']
})

Quick note if running the code in a Jupyter Notebook
When running asynchronous code in a Jupyter Notebook, you might encounter runtime errors like RuntimeError: This event loop is already running.
That's because SDG Hub executes parts of its pipelines asynchronously to handle multiple model requests efficiently. Jupyter itself already runs an event loop, so without a patch, Python would try to start a second loop and fail.
The following lines fix that by applying the nest_asyncio patch, which safely allows nested event loops in the same runtime:
import nest_asyncio
nest_asyncio.apply()

Step 5: Dry run (recommended first)
This runs a quick test to ensure the pipeline has no errors or configuration issues:
# Test with a small sample first (recommended!)
print("🧪 Running dry run...")
dry_result = flow.dry_run(dataset, sample_size=1)

If that runs without errors, run the following to see the results:
print(f"✅ Dry run completed in {dry_result['execution_time_seconds']:.2f}s")
print(f"📊 Output columns: {list(dry_result['final_dataset']['columns'])}")Step 6: Generate synthetic data
Once the dry run completes successfully, you have confirmed that the pipeline is ready for a full run. Run the following code:
# Generate high-quality QA pairs
print("🏗️ Generating synthetic data...")
result = flow.generate(dataset)

Step 7: Review and export your generated data
You will notice that it takes longer to complete the full run than the dry run. That's because far more data is being generated.
Run the following code to see how many QA (question and answer) pairs have been generated.
# Explore the results
print(f"\n📈 Generated {len(result)} QA pairs!")Now we know how many pairs have been generated. Run the following code to look at the QA pairs we generated synthetically.
Review the generated QA pairs:
# The length is determined by the length of any of the lists (e.g., 'question')
num_pairs = len(result['question'])
print(f"\n--- Generated {num_pairs} QA pairs ---")
# Iterate from index 0 up to (but not including) num_pairs
for i in range(num_pairs):
    print(f"\n--- QA Pair #{i+1} ---")
    print(f"📝 Question: {result['question'][i]}")
    print(f"💬 Answer: {result['response'][i]}")
    print(f"🎯 Faithfulness Score: {result['faithfulness_judgment'][i]}")
    print(f"📏 Relevancy Score: {result['relevancy_score'][i]}")
print("\n--- End of Report ---")Explore synthetic data more closely:
type(result)
df = result.to_pandas()
df.shape
df.info()

Show synthetic data:
df.head()

Export the entire dataset to a CSV file:
df.to_csv('entire_synthetic_dataset.csv')

Narrow the dataset to just Q&A pairs:
qa_df = result.to_pandas()[["question", "response", "verification_rating", "relevancy_score", "faithfulness_judgment"]]
qa_df

Export the Q&A pairs to a CSV file:
# Assuming qa_df is your DataFrame
qa_df.to_csv("synthetic_qa_pairs.csv", index=False)

Hopefully, you found this helpful and are now left with a purpose-built dataset, ready to train your model. If so, check out SDG Hub, clone the repo, tweak a flow, and start teaching your model something new. Happy generating!