Best practices for InstructLab instruction datasets

How to build training materials that fine-tune your personal LLM

November 21, 2024
Legare Kerrison
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat Enterprise Linux AI


    Instruction datasets are specialized datasets designed to help a language model understand and respond to specific instructions or prompts. They provide the model with structured examples of questions, commands, or statements and the corresponding desired responses. These datasets essentially "teach" the model how to follow instructions by exposing it to various scenarios where it learns the correct patterns and formats for responses.

    When providing instruction datasets to InstructLab, you can think of the model as a "student" and yourself as the "teacher." You need to provide your student with an educational reading in the form of a Markdown file that will act as the student's source of truth when answering all questions. You also need to provide some example questions and answers in the form of a qna.yaml file to demonstrate how the student will be expected to apply the knowledge from the reading during the test.

    Providing strong instruction datasets is essential

    The model will output answers at the same level of quality as the example question and answer pairs you provide. InstructLab uses your Markdown file as its source of truth and your qna.yaml to generate additional, similar synthetic question and answer pairs for the model to train on, so the quality of your instruction datasets is multiplied across the training data.

    This article provides best practices for instruction datasets so you can effectively train your model to generate relevant outputs.

    Building the context text file

    The "reading" you give your model is the context file, which should be a Markdown or text file. It is most effective if it uses a variety of paragraphs, tables, bullet points, and lists. The model learns best when it sees diverse formats, just like we do.

    If you are pulling this information from somewhere other than your brain, copy the original source verbatim for best results. Make sure to remove any links. 
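The link-removal step above can be sketched in a few lines of Python. This is a minimal sketch, not part of InstructLab: the function name `strip_links` is hypothetical, and it only handles inline Markdown links of the form `[text](url)`, keeping the visible text and leaving image syntax and bare URLs untouched.

```python
import re

# Matches inline Markdown links like [text](url), but not images like ![alt](url).
# Assumption: URLs contain no spaces or closing parentheses.
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def strip_links(markdown_text: str) -> str:
    """Replace each inline link with its visible text, dropping the URL."""
    return LINK_PATTERN.sub(r"\1", markdown_text)


if __name__ == "__main__":
    sample = "See the [reef guide](https://example.com/reefs) for details."
    print(strip_links(sample))
```

Keeping the link text rather than deleting the whole link leaves the copied sentence readable, which matters if you are otherwise copying the source verbatim.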

    Any code or preformatted text in the Markdown document should be enclosed in the corresponding Markdown markup: single backticks for an inline CLI command, or triple backticks for multi-line formatted output and code blocks.

    Building the qna.yaml file: Chunking blurbs from the context file

    Before each set of question and answer pairs in the qna.yaml file, you will place a paragraph extracted verbatim from the Markdown file. These chunks of text should come from various parts of the Markdown file so that the beginning, middle, or end is not overrepresented. Each chunk should correspond with the questions and answers that follow it, and it should be about 500 tokens or 375 words.

    For example:

    version: 3
    domain: coral_reefs
    created_by: lkerriso
    seed_examples:
      - context: |
          Coral reefs support marine biodiversity by creating complex habitats that sustain a wide array of marine species. They provide essential shelter, feeding grounds, and breeding spaces for numerous organisms, including fish, sea turtles, and invertebrates. The structure of coral reefs, with its crevices and nooks, offers safe places for animals to hide from predators, facilitating a diverse community of species to thrive. Coral reefs act as nurseries for young fish, offering them shelter and protection during early life stages, which increases their chances of survival and contributes to population growth. The reefs’ physical structure enables juvenile fish to evade predators, which is essential for the health of broader marine populations. As these fish grow, they may leave the reef and populate other marine environments, supporting biodiversity in nearby ecosystems. The reefs also play a crucial role in the food web. They house tiny algae called zooxanthellae that live within coral tissues and perform photosynthesis, producing energy that sustains both the corals and the organisms that feed on them. This energy supports primary consumers like plankton and herbivorous fish, which are eaten by larger predators, establishing a balanced food web. This structure supports fish species diversity, providing resources for fish of all sizes.
        questions_and_answers:
          - question: |
              [Question based on context]
            answer: |
              [Answer related to question and context]
      - context: |
          In addition, coral reefs are highly efficient in nutrient recycling, capturing and redistributing essential elements that benefit surrounding marine areas. Organisms like sponges help filter water, removing excess organic material and releasing nutrients usable by other reef organisms. This recycling maintains a healthy, balanced ecosystem that can support diverse marine life. In summary, coral reefs support marine biodiversity by providing habitats, nursery grounds, and a balanced food web that sustains various species. The intricate ecosystem services of coral reefs make them indispensable for the health and diversity of marine life worldwide. Protecting coral reefs is essential for the survival of countless marine species and for the stability of ocean ecosystems.
        questions_and_answers:
          - question: |
              [Question based on context]
            answer: |
              [Answer related to question and context]
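One way to follow the chunking advice above, so that the beginning, middle, and end of the Markdown file are all represented, is to pick paragraphs at evenly spaced positions. The sketch below is an illustration only: the helper names `split_paragraphs` and `pick_spread_chunks` are hypothetical, it assumes paragraphs are separated by blank lines, and it returns each paragraph verbatim apart from trimming surrounding whitespace.

```python
def split_paragraphs(markdown_text):
    """Split a document on blank lines, dropping empty pieces."""
    return [p.strip() for p in markdown_text.split("\n\n") if p.strip()]


def pick_spread_chunks(paragraphs, count):
    """Pick `count` paragraphs at evenly spaced positions across the document."""
    if count <= 0 or not paragraphs:
        return []
    if count >= len(paragraphs):
        return list(paragraphs)
    if count == 1:
        # A single chunk comes from the middle of the document.
        return [paragraphs[len(paragraphs) // 2]]
    last = len(paragraphs) - 1
    # Integer arithmetic keeps the first and last paragraphs and spaces the rest evenly.
    indexes = [(i * last) // (count - 1) for i in range(count)]
    return [paragraphs[i] for i in indexes]


if __name__ == "__main__":
    document = "intro\n\nhabitat\n\nnursery\n\nfood web\n\nsummary"
    for chunk in pick_spread_chunks(split_paragraphs(document), 3):
        print(chunk)
```

Evenly spaced picks are only a starting point. You would still read each candidate and swap in a neighboring paragraph if it makes a better basis for questions.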

    The questions

    Now you are crafting example questions and answers for your model student. Strive for a diversity of simple and complex questions to prepare the model to handle various user needs. Include "what, how, and why" questions, and do not be vague, as this will lead to potentially confusing and inaccurate outputs.

    For example, instead of asking the question "What is a coral reef?", I suggest you go deeper with something like "How do coral reefs support marine biodiversity?" or "Why are coral reefs sensitive to environmental changes?"

    The answers

    The answers should refer back to the question and use complete sentences so the student/model will learn to provide contextual and clear outputs. Avoid giving single words or short phrases as answers, as this could lead the model to output answers that just grab keywords and appear less thoughtful.

    When writing both the questions and the answers, pay attention to the wording used in the context/reading text file and use similar language in the qna.yaml file to limit how much the model has to extrapolate. Make sure to wrap text at 120 characters for readability, and follow Markdown’s formatting.

    For example, if the question is "How do coral reefs support marine biodiversity?" the answer should be "Coral reefs support marine biodiversity by providing a habitat for marine animals, including fish, sea turtles, and invertebrates, to live, feed, and reproduce. Coral reefs also act as a "nursery" for young fish, offering shelter and protection from predators, thereby increasing fish populations."
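The 120-character wrapping and the advice against one-word answers can both be checked mechanically. The following is a minimal sketch using Python's standard `textwrap` module. The helper names and the `MIN_ANSWER_WORDS` threshold are assumptions chosen for illustration and do not come from InstructLab.

```python
import textwrap

MAX_LINE_WIDTH = 120   # wrap width recommended in this article
MIN_ANSWER_WORDS = 8   # hypothetical threshold for "too short"; tune to taste


def wrap_for_yaml(text, indent="    "):
    """Wrap text so every line, including its indent, fits in 120 columns."""
    collapsed = " ".join(text.split())  # normalize existing line breaks and spacing
    return textwrap.fill(
        collapsed,
        width=MAX_LINE_WIDTH,
        initial_indent=indent,
        subsequent_indent=indent,
        break_on_hyphens=False,
    )


def is_too_short(answer):
    """Flag answers that are likely a keyword or fragment, not a full sentence."""
    return len(answer.split()) < MIN_ANSWER_WORDS


if __name__ == "__main__":
    answer = (
        "Coral reefs support marine biodiversity by providing a habitat for marine animals, "
        "including fish, sea turtles, and invertebrates, to live, feed, and reproduce."
    )
    print(wrap_for_yaml(answer))
    print(is_too_short("Habitat."))
```

The wrapped text can be pasted under a YAML `|` block scalar. Adjust `indent` to match the nesting level of the `question` or `answer` key.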

    InstructLab’s synthetic data generation 

    Again, good formatting here is essential, because InstructLab will use the files you provided to generate more synthetic questions and answers. High-quality human-generated training data will be multiplied for stronger results, but so will low-quality data.

    The format typically looks something like this:

    ```yaml
    version: 3
    domain: Coral_Reefs
    created_by: <user-name>
    seed_examples:
      - context: |
          [Insert sample paragraph, table, or list with markdown formatting]
        questions_and_answers:
          - question: |
              [Question based on context]
            answer: |
              [Answer related to question and context]
    ```

    How long should my files be? 

    The thing to remember when building a qna.yaml is that you should aim for content that is short yet context-rich. InstructLab uses tokens (basically, pieces of words or characters) to process text, and longer content uses more tokens. The rule of thumb:

    • Context should be about 500 tokens (roughly 375 words).
    • Each Q&A pair should total around 250 tokens (or about 185 words).
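The rule of thumb above implies a ratio of roughly 500 tokens per 375 words, which is enough for a quick size check before you run anything. This is a rough sketch only: it estimates tokens from a whitespace word count and is not a real tokenizer, and the function names are hypothetical. Treat the results as advisory, since the budgets themselves are approximate.

```python
CONTEXT_TOKEN_BUDGET = 500    # "about 500 tokens (roughly 375 words)"
QNA_PAIR_TOKEN_BUDGET = 250   # "around 250 tokens (or about 185 words)"
WORDS_PER_BUDGET = 375        # word count that the article equates with 500 tokens


def estimate_tokens(text):
    """Estimate tokens from a word count using the article's 500:375 ratio."""
    words = len(text.split())
    return round(words * CONTEXT_TOKEN_BUDGET / WORDS_PER_BUDGET)


def check_seed_example(context, question, answer):
    """Report estimated sizes and whether each part fits its budget."""
    context_tokens = estimate_tokens(context)
    pair_tokens = estimate_tokens(question + " " + answer)
    return {
        "context_tokens": context_tokens,
        "pair_tokens": pair_tokens,
        "context_ok": context_tokens <= CONTEXT_TOKEN_BUDGET,
        "pair_ok": pair_tokens <= QNA_PAIR_TOKEN_BUDGET,
    }


if __name__ == "__main__":
    report = check_seed_example(
        context="Coral reefs support marine biodiversity by creating complex habitats.",
        question="How do coral reefs support marine biodiversity?",
        answer="Coral reefs support marine biodiversity by creating complex habitats for marine species.",
    )
    print(report)
```

A real tokenizer will count differently from this estimate, so leave some margin rather than writing right up to the limit.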

    Can I add visuals, hyperlinks, graphs, etc?

    Adding visuals or graphs can also save token space, but keep in mind that the combined length of the context and its Q&A pair should stay under 750 tokens. A bit of trimming here and there will help fit more information without overwhelming the model.

    Avoid hyperlinks in the context. They consume tokens but don’t provide helpful information.

    Can I use multiple documents?

    If you’re working with multiple documents, you can combine them into one qna.yaml if they’re closely related. If they’re unrelated, it’s better to create separate files to keep each topic focused.

    By following these guidelines, you’ll create a qna.yaml file that gives InstructLab strong seed examples to work from. Taking a little extra care in building these files will go a long way toward building a more accurate and user-friendly model.

    To ask more questions and share knowledge you discover, join the InstructLab Slack.
