Best practices for InstructLab instruction datasets

How to build training materials that fine-tune your personal LLM

November 21, 2024
Legare Kerrison
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat Enterprise Linux AI

    Instruction datasets are specialized datasets designed to help a language model understand and respond to specific instructions or prompts. They provide the model with structured examples of questions, commands, or statements and the corresponding desired responses. These datasets essentially "teach" the model how to follow instructions by exposing it to various scenarios where it learns the correct patterns and formats for responses.

    When providing instruction datasets to InstructLab, you can think of the model as a "student" and yourself as the "teacher." You need to provide your student with an educational reading in the form of a Markdown file that will act as the student's source of truth when answering all questions. You also need to provide some example questions and answers in the form of a qna.yaml file to demonstrate how the student will be expected to apply the knowledge from the reading during the test.

    Providing strong instruction datasets is essential

    The model's answers will match the quality of the example question and answer pairs you provide. InstructLab uses your Markdown file as its source of truth and your qna.yaml file to generate additional, similar synthetic question and answer pairs for the model to train on, which multiplies the impact of your instruction datasets on the model.

    This article provides best practices for instruction datasets so you can effectively train your model to generate relevant outputs.

    Building the context text file

    The "reading" you give your model is the context file, which should be a Markdown or text file. It is most effective if it uses a variety of paragraphs, tables, bullet points, and lists. The model learns best when it sees diverse formats, just like we do.

    If you are pulling this information from somewhere other than your brain, copy the original source verbatim for best results. Make sure to remove any links. 

    Any code or preformatted text in the Markdown document should be enclosed in the corresponding Markdown markup: single backticks for an inline CLI command, or triple backticks for multi-line formatted output or code blocks.

    Build the qna.yaml file: Chunking blurbs from the context file

    Before each set of question and answer pairs in the qna.yaml file, you will place a paragraph extracted verbatim from the Markdown file. These chunks of text should come from various parts of the Markdown file so that the beginning, middle, or end is not overrepresented. Each chunk should correspond with the questions and answers that follow it, and it should be about 500 tokens or 375 words.
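One way to keep the beginning, middle, and end of the reading evenly represented is to pick chunks at regular intervals. The sketch below is only an illustration of that idea, not part of InstructLab: the function names and the choice to treat blank-line-separated paragraphs as candidate chunks are assumptions, and word count stands in for a real token count.

```python
import re


def split_paragraphs(markdown_text: str) -> list[str]:
    """Split Markdown text into paragraphs on blank lines, dropping empties."""
    return [p.strip() for p in re.split(r"\n\s*\n", markdown_text) if p.strip()]


def pick_spread_chunks(markdown_text: str, count: int, max_words: int = 375) -> list[str]:
    """Return up to `count` verbatim paragraphs, evenly spaced through the file.

    Paragraphs longer than `max_words` are skipped rather than trimmed,
    so every returned chunk is an unmodified excerpt of the source.
    """
    candidates = [p for p in split_paragraphs(markdown_text) if len(p.split()) <= max_words]
    if count <= 0 or not candidates:
        return []
    if count >= len(candidates):
        return candidates
    if count == 1:
        # A single chunk comes from the middle of the document.
        return [candidates[len(candidates) // 2]]
    # Evenly spaced positions from the first candidate to the last.
    step = (len(candidates) - 1) / (count - 1)
    return [candidates[round(i * step)] for i in range(count)]
```

You would still read each selected chunk yourself and swap in a neighboring paragraph if it does not carry enough information to write good questions about.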

    For example:

    version: 3
    domain: coral_reefs
    created_by: lkerriso
    seed_examples:
      - context: |
          Coral reefs support marine biodiversity by creating complex habitats that sustain a wide array of marine species. They provide essential shelter, feeding grounds, and breeding spaces for numerous organisms, including fish, sea turtles, and invertebrates. The structure of coral reefs, with its crevices and nooks, offers safe places for animals to hide from predators, facilitating a diverse community of species to thrive. Coral reefs act as nurseries for young fish, offering them shelter and protection during early life stages, which increases their chances of survival and contributes to population growth. The reefs’ physical structure enables juvenile fish to evade predators, which is essential for the health of broader marine populations. As these fish grow, they may leave the reef and populate other marine environments, supporting biodiversity in nearby ecosystems. The reefs also play a crucial role in the food web. They house tiny algae called zooxanthellae that live within coral tissues and perform photosynthesis, producing energy that sustains both the corals and the organisms that feed on them. This energy supports primary consumers like plankton and herbivorous fish, which are eaten by larger predators, establishing a balanced food web. This structure supports fish species diversity, providing resources for fish of all sizes.
        questions_and_answers:
          - question: |
              [Question based on context]
            answer: |
              [Answer related to question and context]
      - context: |
          In addition, coral reefs are highly efficient in nutrient recycling, capturing and redistributing essential elements that benefit surrounding marine areas. Organisms like sponges help filter water, removing excess organic material and releasing nutrients usable by other reef organisms. This recycling maintains a healthy, balanced ecosystem that can support diverse marine life. In summary, coral reefs support marine biodiversity by providing habitats, nursery grounds, and a balanced food web that sustains various species. The intricate ecosystem services of coral reefs make them indispensable for the health and diversity of marine life worldwide. Protecting coral reefs is essential for the survival of countless marine species and for the stability of ocean ecosystems.
        questions_and_answers:
          - question: |
              [Question based on context]
            answer: |
              [Answer related to question and context]
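Because indentation mistakes are easy to make in a file like this, it can help to check the structure before handing it to InstructLab. The following is a minimal sketch, assuming the file has already been loaded into a Python dictionary (for instance with PyYAML's `yaml.safe_load`). The function name `validate_qna` is hypothetical, and the keys it checks are simply the ones shown in the example above, not InstructLab's official schema.

```python
REQUIRED_TOP_LEVEL = ("version", "domain", "created_by", "seed_examples")


def validate_qna(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in data:
            problems.append(f"missing top-level key: {key}")

    for i, seed in enumerate(data.get("seed_examples") or []):
        # Every seed example needs a non-empty context chunk.
        if not str(seed.get("context", "")).strip():
            problems.append(f"seed_examples[{i}]: empty or missing context")

        pairs = seed.get("questions_and_answers") or []
        if not pairs:
            problems.append(f"seed_examples[{i}]: no questions_and_answers")

        # Every pair needs both a question and an answer.
        for j, pair in enumerate(pairs):
            for field in ("question", "answer"):
                if not str(pair.get(field, "")).strip():
                    problems.append(
                        f"seed_examples[{i}].questions_and_answers[{j}]: empty or missing {field}"
                    )
    return problems
```

A context that was accidentally left at the wrong indentation level usually shows up here as an empty or missing `context`, which is much easier to spot than a confusing result later on.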

    The questions

    Now you are crafting example questions and answers for your model student. Strive for a diversity of simple and complex questions to prepare the model to handle various user needs. Include "what, how, and why" questions, and do not be vague, as this will lead to potentially confusing and inaccurate outputs.

    For example, instead of asking the question "What is a coral reef?", I suggest you go deeper with something like "How do coral reefs support marine biodiversity?" or "Why are coral reefs sensitive to environmental changes?"
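If you want a quick check that your questions cover "what, how, and why," a few lines of Python can count how each question opens. This is only a rough heuristic sketch with hypothetical function names; it looks at the first word of each question and says nothing about whether a question is actually specific enough.

```python
from collections import Counter


def question_openers(questions: list[str]) -> Counter:
    """Count questions by their first word, lowercased (for example 'what', 'how', 'why')."""
    counts = Counter()
    for question in questions:
        words = question.strip().split()
        if words:
            counts[words[0].lower().strip("\"'")] += 1
    return counts


def missing_openers(questions: list[str], wanted=("what", "how", "why")) -> list[str]:
    """Return the wanted opening words that none of the questions start with."""
    counts = question_openers(questions)
    return [word for word in wanted if counts[word] == 0]
```

For the two coral reef questions above, `missing_openers` would report that no "what" question has been written yet.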

    The answers

    The answers should refer back to the question and use complete sentences so the student/model will learn to provide contextual and clear outputs. Avoid single-word or short-phrase answers, which can lead the model to produce outputs that merely grab keywords and appear less thoughtful.

    When writing both the questions and the answers, pay attention to the wording used in the context/reading text file and use similar language in the qna.yaml file to limit how much the model has to extrapolate. Make sure to wrap text at 120 characters for readability, and follow Markdown formatting.

    For example, if the question is "How do coral reefs support marine biodiversity?" the answer should be "Coral reefs support marine biodiversity by providing a habitat for marine animals, including fish, sea turtles, and invertebrates, to live, feed, and reproduce. Coral reefs also act as a "nursery" for young fish, offering shelter and protection from predators, thereby increasing fish populations."
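Wrapping answers at 120 characters by hand gets tedious, so here is a small sketch that uses Python's standard `textwrap` module to do it. The helper name `wrap_for_qna` is hypothetical. It is meant for plain prose only: tables and code blocks should be left alone, because their line breaks are meaningful.

```python
import textwrap


def wrap_for_qna(text: str, width: int = 120) -> str:
    """Collapse runs of whitespace and re-wrap prose so no line exceeds `width` characters."""
    return textwrap.fill(" ".join(text.split()), width=width)


answer = (
    "Coral reefs support marine biodiversity by providing a habitat for marine animals, "
    "including fish, sea turtles, and invertebrates, to live, feed, and reproduce."
)
wrapped_answer = wrap_for_qna(answer)
```

The wording of the answer is unchanged; only the line breaks move.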

    InstructLab’s synthetic data generation 

    Again, good formatting here is essential, because InstructLab will use the files you provided to generate more synthetic questions and answers. High-quality, human-generated training data will be multiplied for stronger results, but so will low-quality data.

    The format typically looks something like this:

    ```yaml
    version: 3
    domain: Coral_Reefs 
    created_by: <user-name>
    seed_examples:
      - context: |
          [Insert sample paragraph, table, or list with markdown formatting]
        questions_and_answers:
          - question: |
              [Question based on context]
            answer: |
              [Answer related to question and context]
    ```

    How long should my files be? 

    The thing to remember when building a qna.yaml file is to aim for content that is short yet context-rich. InstructLab uses tokens (basically, pieces of words or characters) to process text, and longer content uses more tokens. The rule of thumb:

    • Context should be about 500 tokens (roughly 375 words).
    • Each Q&A pair should total around 250 tokens (or about 185 words).
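To apply this rule of thumb without a tokenizer, you can estimate tokens from word counts using the ratio implied above (500 tokens for roughly 375 words). The sketch below does only that; the function names are hypothetical, and a real tokenizer will give different numbers, so treat the result as a rough guide rather than a hard limit.

```python
TOKENS_PER_WORD = 500 / 375  # ratio implied by "500 tokens (roughly 375 words)"
CONTEXT_BUDGET = 500         # approximate token budget for one context chunk
QA_PAIR_BUDGET = 250         # approximate token budget for one question plus its answer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its whitespace-separated word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def check_budgets(context: str, question: str, answer: str) -> dict:
    """Compare one context and one Q&A pair against the rule-of-thumb budgets."""
    context_tokens = estimate_tokens(context)
    pair_tokens = estimate_tokens(question + " " + answer)
    return {
        "context_tokens": context_tokens,
        "pair_tokens": pair_tokens,
        "context_ok": context_tokens <= CONTEXT_BUDGET,
        "pair_ok": pair_tokens <= QA_PAIR_BUDGET,
    }
```

A context of 400 words, for example, comes out at an estimated 533 tokens, which suggests trimming a sentence or two.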

    Can I add visuals, hyperlinks, graphs, etc.?

    Adding visuals or graphs can also save token space, but keep in mind that the combined length of a context and its Q&A pair should stay under 750 tokens. A bit of trimming here and there will help fit more information without overwhelming the model.

    Avoid hyperlinks in the context—these consume tokens but don’t provide helpful info.
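If your source has many inline links, removing them by hand is error-prone. Here is a minimal sketch that replaces Markdown inline links of the form `[text](url)` with just their text. The function name `strip_links` is hypothetical, and the pattern is deliberately simple: it leaves images, reference-style links, and bare URLs untouched, and it will not handle URLs that themselves contain parentheses.

```python
import re

# Matches [text](target) but not ![alt](target), so images are left as they are.
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)]+)\)")


def strip_links(markdown_text: str) -> str:
    """Replace each Markdown inline link with its visible text only."""
    return INLINE_LINK.sub(r"\1", markdown_text)
```

The surrounding sentence stays verbatim, so the context still matches the original source apart from the removed link targets.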

    Can I use multiple documents?

    If you’re working with multiple documents, you can combine them into one qna.yaml if they’re closely related. If they’re unrelated, it’s better to create separate files to keep each topic focused.

    By following these guidelines, you’ll create a well-formed qna.yaml file. Taking a little extra care in building these files will go a long way toward building a more accurate and user-friendly model.

    To ask more questions and share knowledge you discover, join the InstructLab Slack.
