SDG Hub: Building synthetic data pipelines with modular blocks

Synthetic data in 2025: Everywhere in the LLM pipeline

October 27, 2025
Aditi Saluja, Abhishek Bhandwaldar, Shivchander Sudalairaj
Related topics:
Artificial intelligence, Python
Related products:
Red Hat AI

    Large language models (LLMs) today rely on synthetic data at every stage: pre-training with billions of synthetic tokens (e.g., Cosmopedia), instruction tuning with synthetic SFT (supervised fine-tuning) datasets (e.g., LAB, Tülu 3, Orca), and evaluation with benchmarks powered by LLM-as-a-judge (e.g., MT-Bench, AlpacaEval).

    Why use synthetic data?

    • Strong LMs. Open source models now match closed-source performance, making them effective teacher models.
    • It is cheap and fast. Human-labeled instruction data is expensive, but synthetic data generation pipelines scale instantly.
    • It is diverse and controllable. Human expertise is limited; synthetic data enables targeted coverage.

    Synthetic data has shifted from nice-to-have to fundamental. But most teams are still hacking together one-off scripts, which slows innovation, hurts reproducibility, and makes scaling difficult.

    That’s why we built the SDG Hub.

    Introducing SDG Hub

    SDG Hub is an open framework to build, compose, and scale synthetic data pipelines with modular blocks. The following table summarizes its key capabilities.

    What it does | Why it matters
    Build from reusable blocks (LLM-powered or traditional) | Replace ad hoc scripts with a repeatable framework
    Compose flows in Python or YAML | Scale data generation with asynchronous execution and monitoring
    Automatically discover data generation algorithms | Extend with custom blocks to fit your domain

    Blocks: The building units

    At the core of SDG Hub are blocks: self-contained, reusable units, each transforming data in a specific way. Blocks share a consistent interface (Input → Process → Output) and are composable, so users can stack them together to form complex flows.

    from sdg_hub.core.blocks import LLMChatBlock, JSONStructureBlock
    
    # An LLM-powered block: fills the prompt template from the input
    # column and writes the model's response to the output column.
    chat_block = LLMChatBlock(
        block_name="question_answerer",
        model="openai/gpt-4o",
        input_cols=["question"],
        output_cols=["answer"],
        prompt_template="Answer this question: {question}"
    )
    
    # A traditional (non-LLM) block: packs several columns into a single
    # JSON-structured output column.
    structure_block = JSONStructureBlock(
        block_name="json_structurer",
        input_cols=["summary", "entities", "sentiment"],
        output_cols=["structured_analysis"],
        ensure_json_serializable=True
    )
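The contract these blocks follow can be sketched in a few lines of plain Python. This is an illustrative toy, not SDG Hub's actual `BaseBlock` API: a block declares its input and output columns and implements a `generate` step that maps rows in to rows out.

```python
# Conceptual sketch of the block contract (Input -> Process -> Output).
# Illustrative only; SDG Hub's real BaseBlock has a richer interface.

class Block:
    """A self-contained transform: reads input columns, writes output columns."""

    def __init__(self, name, input_cols, output_cols):
        self.name = name
        self.input_cols = input_cols
        self.output_cols = output_cols

    def generate(self, samples):
        """Transform a list of row dicts; subclasses override this."""
        raise NotImplementedError


class UppercaseBlock(Block):
    """Toy block: uppercases one input column into one output column."""

    def generate(self, samples):
        src, dst = self.input_cols[0], self.output_cols[0]
        return [{**row, dst: row[src].upper()} for row in samples]


rows = [{"text": "hello"}, {"text": "world"}]
out = UppercaseBlock("upper", ["text"], ["text_upper"]).generate(rows)
print(out[0]["text_upper"])  # HELLO
```

Because every block exposes the same shape, any block's output can feed any other block's input, which is what makes stacking them into flows possible.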

    From blocks to flows

    Flows are pipelines created by chaining blocks. They act as an orchestration layer, combining multiple blocks into sophisticated pipelines. You can define flows flexibly in Python or YAML. They provide optimized execution with asynchronous parallelism, debugging, and dry-run validation.

    blocks:
      - block_type: "PromptBuilderBlock"
        block_config:
          block_name: "build_summary_prompt"
          input_cols: ["text"]
          output_cols: ["summary_prompt"]
    
      - block_type: "LLMChatBlock"
        block_config:
          block_name: "generate_summary"
          input_cols: ["summary_prompt"]
          output_cols: ["raw_summary"]
          max_tokens: 1024
          temperature: 0.3
          async_mode: true
    
      - block_type: "TextParserBlock"
        block_config:
          block_name: "parse_summary"
          input_cols: ["raw_summary"]
          output_cols: ["summary"]
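The orchestration idea behind a flow can be shown with a minimal plain-Python sketch that mirrors the three-stage YAML above: build a prompt, call a (here faked) LLM, parse the result. The function and field names are illustrative stand-ins, not SDG Hub's real classes, and the real Flow adds async execution, validation, and monitoring on top.

```python
# Minimal sketch of a flow: an ordered chain where each block's
# output rows feed the next block's input. Illustrative only.

def build_prompt(rows):
    # Stage 1: mirror PromptBuilderBlock (text -> summary_prompt).
    return [{**r, "summary_prompt": f"Summarize: {r['text']}"} for r in rows]

def fake_llm(rows):
    # Stage 2: stand-in for LLMChatBlock; echoes the first three words.
    return [{**r, "raw_summary": " ".join(r["text"].split()[:3])} for r in rows]

def parse_summary(rows):
    # Stage 3: mirror TextParserBlock (raw_summary -> summary).
    return [{**r, "summary": r["raw_summary"].strip()} for r in rows]

class Flow:
    """Ordered chain of blocks; runs them in sequence over the dataset."""

    def __init__(self, blocks):
        self.blocks = blocks

    def generate(self, rows):
        for block in self.blocks:
            rows = block(rows)
        return rows

flow = Flow([build_prompt, fake_llm, parse_summary])
result = flow.generate([{"text": "SDG Hub composes blocks into flows"}])
print(result[0]["summary"])  # SDG Hub composes
```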

    SDG Hub offers pre-built flows for common use cases such as knowledge tuning. Users can also build their own custom flows (see Figure 1). 

    Figure 1: Build your own custom flow with SDG Hub: explore existing blocks, fill gaps with custom blocks, wrap as a flow, and generate.
    from sdg_hub.core.blocks.base import BaseBlock
    from sdg_hub.core.blocks.registry import BlockRegistry
    
    # Define a custom block
    @BlockRegistry.register("MyCustomBlock", "custom", "Description of my block")
    class MyCustomBlock(BaseBlock):
        def generate(self, samples, **kwargs):
            # Add domain-specific processing logic here
            return samples
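The registry pattern that makes custom blocks discoverable can be sketched in standalone Python. This toy `Registry` only mimics the shape of SDG Hub's `BlockRegistry` (the names and metadata fields here are assumptions, not the library's API): the decorator records each class under a name and category so flows can look blocks up later instead of importing them directly.

```python
# Toy registry sketch: registration via decorator, discovery by category.
# Mimics the pattern, not SDG Hub's actual BlockRegistry API.

class Registry:
    _blocks = {}

    @classmethod
    def register(cls, name, category, description):
        def decorator(block_cls):
            cls._blocks[name] = {
                "cls": block_cls,
                "category": category,
                "description": description,
            }
            return block_cls
        return decorator

    @classmethod
    def discover(cls, category=None):
        """Return registered blocks, optionally filtered by category."""
        return {
            name: meta for name, meta in cls._blocks.items()
            if category is None or meta["category"] == category
        }


@Registry.register("DedupBlock", "custom", "Drops duplicate rows")
class DedupBlock:
    def generate(self, samples):
        seen, out = set(), []
        for row in samples:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                out.append(row)
        return out


print(sorted(Registry.discover("custom")))  # ['DedupBlock']
```

Registration-by-decorator is what lets a framework list every available algorithm at runtime, which is how SDG Hub can "automatically discover data generation algorithms" without hardcoding them.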

    Applications of SDG Hub

    SDG Hub has several applications:

    • Customizing LLMs on domain knowledge: Fine-tune open-weight models with synthetic, domain-rich data. Flows are model-agnostic: we provide a recommended teacher model, but you can swap in any teacher model.
    • Reasoning data generation: Use reasoning-capable teacher models (for example, GPT-OSS, Qwen, DeepSeek) to generate datasets with reasoning traces. This helps models learn step-by-step reasoning in complex tasks.
    • Multilingual data generation for knowledge tuning: SDG Hub offers a pre-built flow for data generation in Japanese. Users can build other flows for various target languages. 

    Getting started

    SDG Hub provides pre-built flows for common tasks like knowledge tuning, composable pipelines for custom generation and filtering, and example notebooks that show end-to-end use cases. These notebooks also include document pre-processing to ingest any type of document and data mixing to produce training-ready datasets for different models. SDG Hub provides default and compatible teacher model recommendations. You can also experiment with any teacher model and easily swap it into new and existing flows.

    What’s next

    Future updates for SDG Hub will bring new algorithms for synthetic generation and filtering, SDG for retrieval-augmented generation (RAG) evaluations (systematic testing of retrieval pipelines), and evaluation of teacher models to compare the quality of LLMs used for synthetic data.

    Synthetic data is everywhere in the LLM pipeline. Now, with SDG Hub, you can build it in a way that is modular, scalable, and production-ready. SDG Hub helps you move from raw documents to structured data and finally to instruction datasets with composable building blocks.

    With Red Hat AI 3, you can run SDG Hub's pre-built validated pipelines or your own custom pipeline on Red Hat OpenShift AI. This is available as a tech preview feature with a supported build of the SDG Hub Python library.


    Last updated: October 28, 2025
