Retrieval-augmented generation with Llama Stack and Python

August 5, 2025
Michael Dawson
Related topics:
Artificial intelligence, Programming languages & frameworks, Python, Runtimes
Related products:
Red Hat AI, Red Hat Enterprise Linux AI, Red Hat OpenShift AI


    With the recent release of Llama Stack earlier this year, we decided to look at how to implement key aspects of an AI application with Python and Llama Stack. This post covers retrieval-augmented generation (RAG). 

    Catch up on the rest of our series exploring how to use large language models with Python and Llama Stack:

    • Part 1: Exploring Llama Stack with Python: Tool calling and agents
    • Part 2: Retrieval-augmented generation with Llama Stack and Python (this post)
    • Part 3: Implement AI safeguards with Python and Llama Stack
    • Part 4: How to implement observability with Python and Llama Stack

    How retrieval-augmented generation works

    Retrieval-augmented generation is one way to provide context that helps a model respond with the most appropriate answer. The basic concept is as follows (a rough sketch in code appears after the list):

    1. Data that provides additional context (in our case, the Markdown files from the Node.js reference architecture) is transformed into a format suitable for model augmentation. This often includes breaking the data into chunks of a maximum size that will later be provided as additional context to the model.
    2. An embedding model converts the source data into a set of vectors that represent the content of the data. These vectors are stored in a database so the data can be retrieved through a query against the matching vectors. Most commonly, the data is stored in a vector database like Chroma.
    3. The application is enhanced so that before passing a query on to the model, it first uses the question to query the database for matching document chunks. The most relevant document chunks are then added to the context and sent along with the question to the model as part of the prompt.
    4. The model returns an answer based partly on the context provided.
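    In rough Python-flavored pseudocode, the flow described above looks something like the following. The helper names (load_documents, split_into_chunks, VectorDB, model) are illustrative placeholders, not Llama Stack APIs; the concrete Llama Stack calls appear later in this post.

    # Illustrative sketch of the RAG flow (placeholder helpers, not real APIs)

    # 1. Break the source documents into chunks small enough to use as context
    chunks = []
    for doc in load_documents("docs/"):
        chunks.extend(split_into_chunks(doc, max_tokens=125))

    # 2. Embed each chunk and store the vectors in a vector database
    db = VectorDB(embedding_model="all-MiniLM-L6-v2")
    db.add(chunks)

    # 3. At query time, retrieve the chunks most similar to the user's question
    question = "Should I use npm to start an application"
    relevant_chunks = db.query(question, top_k=5)

    # 4. Ask the model, providing the retrieved chunks as additional context
    prompt = f"Answer based only on this context:\n{relevant_chunks}\n\nQuestion: {question}"
    answer = model.generate(prompt)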

    So why not just pass the content of all of the documents to the model? There are a number of reasons:

    • The size of the context that can be passed to a model might be limited.
    • Passing a large context might cost you more money.
    • Providing a smaller, more closely related set of information can result in better answers.

    With that in mind, we need to identify the most relevant document chunks and pass only that subset. 

    Now, let's look at how we implemented retrieval-augmented generation with Llama Stack.

    Setting up Llama Stack

    The first step was to get a running Llama Stack instance that we could experiment with. Llama Stack is a bit different from other frameworks in a few ways. 

    Instead of providing a single implementation with a set of defined APIs, Llama Stack aims to standardize a set of APIs and drive a number of distributions. In other words, the goal is to have many implementations of the same API, with each implementation being shipped by a different organization as a distribution. 

    As is common when this approach is followed, a "reference distribution" is provided, but there are already a number of alternative distributions available. You can see the list of available distributions in the GitHub README.

    Another difference is a strong focus on plug-in APIs that let you add implementations for specific components behind the API itself. For example, you could plug in an implementation (maybe one that is custom tailored for your organization) for a specific feature like Telemetry while using an existing distribution. We won't go into the details of these APIs in this post, but we hope to look at them in a future post.

    With that said, the first question was which distribution to use in order to get started. The Llama Stack quick start shows how to spin up a container running Llama Stack, which uses Ollama to serve the large language model. Because we already had a working Ollama install, we decided that was the path of least resistance.

    Getting the Llama Stack instance running

    We followed the original Llama Stack quick start, which used a container to run the stack with it pointing to an existing Ollama server. Following the instructions, we put together this short script that allowed us to easily start/stop the Llama Stack instance:

    export INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct"
    export LLAMA_STACK_PORT=8321
    export OLLAMA_HOST=10.1.2.46
    podman run -it \
     --user 1000 \
     -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
     -v ~/.llama:/root/.llama \
     llamastack/distribution-ollama:0.2.8 \
     --port $LLAMA_STACK_PORT \
     --env INFERENCE_MODEL=$INFERENCE_MODEL \
     --env OLLAMA_URL=http://$OLLAMA_HOST:11434 
    

    Our existing Ollama server was running on a machine with the IP 10.1.2.46, which is what we set OLLAMA_HOST to.

    We followed the instructions for starting the container from the quick start so Llama Stack was running on the default port. We ran the container on a Fedora virtual machine with IP 10.1.2.128, so you will see us using http://10.1.2.128:8321 as the endpoint for the Llama Stack instance in our code examples.

    At this point, we had a running Llama Stack instance to experiment with.
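    The Python examples that follow use the llama-stack-client package and assume a client pointed at that endpoint. A minimal sketch of creating it, using the endpoint from our setup:

    from llama_stack_client import LlamaStackClient

    # Connect to the Llama Stack instance started above
    client = LlamaStackClient(base_url="http://10.1.2.128:8321")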

    Llama Stack now supports more models

    At the time of our initial exploration of Llama Stack, only a limited set of models were supported when using Ollama. This time, we used a later version of Llama Stack, which allows you to register and use different models. 

    In our previous experiments, we often used more heavily quantized models because they are smaller and easier to run on less capable GPUs. Given the ability to register and run new models, we tried llama3.1:8b-instruct-q4_K_M instead of the larger llama3.2:8b-instruct-fp16.

    To use a model not already known to Llama Stack, you use the models API to register it. The code to do that was as follows:

          model_id = "meta-llama/Llama-3.1-8B-instruct-q4_K_M"
          client.models.register(
            model_id=model_id,
            provider_id="ollama",
            provider_model_id="llama3.1:8b-instruct-q4_K_M",
            model_type="llm",
          )

    Having done that, we could request the model using the model ID meta-llama/Llama-3.1-8B-instruct-q4_K_M. It's great to have that additional flexibility; in our case, it let us run the first RAG example much faster.
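    For example, once registered, the model can be requested through the inference API just like any built-in model. A minimal sketch, reusing the client from the setup above:

    response = client.inference.chat_completion(
        model_id="meta-llama/Llama-3.1-8B-instruct-q4_K_M",
        messages=[{"role": "user", "content": "Say hello"}],
    )
    print(response.completion_message.content)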

    Retrieval-augmented generation with the completion API

    As outlined earlier, the first step to using RAG is to ingest and store the documents in a vector database. Llama Stack provides a set of APIs to make this easy and consistent across different vector databases.  The full code for the example is in python-ai-experimentation/llama-stack-rag/llama-stack-chat-rag.py.

    Creating the database

    To get started, you'll need to create a database in which the documents will be stored. Llama Stack provides an API to manage vector databases and supports an in-memory implementation by default. Additionally, you can register other vector database implementations to use the one best suited to your organization. In our example, we are using the in-memory vector database that is available by default.

    The code we used to create the database was as follows:

        # use the first available provider
        providers = client.providers.list()
        provider = next(p for p in providers if p.api == "vector_io")
    
        # register a vector database
        vector_db_id = f"test-vector-db-{uuid.uuid4()}"
        client.vector_dbs.register(
            vector_db_id=vector_db_id,
            provider_id=provider.provider_id,
            embedding_model="all-MiniLM-L6-v2",
        )

    The in-memory database was the first provider in the default configuration, so we just used that. When creating the database with the register call, we created a random vector_db_id and used the embedding model that comes with Llama Stack by default, all-MiniLM-L6-v2.

    The database persists as long as the Llama Stack container runs, so we deleted the database at the end of our experiment for cleanup. In a production deployment, you would probably separate the creation and ingestion process so that you ingest the documents on a less frequent basis and use the database for multiple requests. 

    The clean-up code was as follows:

        client.vector_dbs.unregister(vector_db_id)

    Ingesting the documents

    As with our earlier explorations, we wanted to use the documents in the Node.js reference architecture as our knowledge base. We first cloned the GitHub repo to a local directory (in this case, /home/user1/newpull/nodejs-reference-architecture). 

    We then wrote some code that would read the documents into objects we could pass to the Llama Stack APIs:

        # read in all of the files to be used with RAG
        rag_documents = []
        docs_path = Path("/home/user1/newpull/nodejs-reference-architecture/docs")
    
        i = 0
        for file_path in docs_path.rglob("*.md"):
            i += 1
            if file_path.is_file():
                with open(file_path, "r", encoding="utf-8") as f:
                    contents = f.read()
    
                # Convert markdown to plain text using strip_markdown
                plain_text = strip_markdown(contents)
    
                rag_documents.append(
                    {
                        "document_id": f"doc-{i}",
                        "content": plain_text,
                        "mime_type": "text/plain",
                        "metadata": {},
                    }
                )

    Because the reference architecture documents were in Markdown, we used the strip_markdown package to convert them to plain text and set the mime_type to text/plain in the objects we planned to pass to Llama Stack.

    Next, we used the Llama Stack rag_tool insert API to ingest the documents:

        client.tool_runtime.rag_tool.insert(
            documents=rag_documents,
            vector_db_id=vector_db_id,
            chunk_size_in_tokens=125,
        )

    The Llama Stack API handled breaking the documents into chunks based on the value we passed for chunk_size_in_tokens. Because this value is in tokens, we needed to set it lower than the equivalent setting in other frameworks, where the value that controls how documents are split is typically in characters. We could not find a way to tune the overlap between chunks; hopefully we'll see that in later versions of the API.
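    If overlap matters for your use case, one workaround is to pre-chunk the documents yourself before ingesting them. A purely illustrative helper (not a Llama Stack API, and splitting on words rather than real tokens) might look like the following; each overlapping chunk could then be ingested as its own document, with a chunk_size_in_tokens large enough that it is not split further:

    def chunk_with_overlap(text, chunk_size=125, overlap=25):
        """Split text into word-based chunks with a fixed overlap."""
        words = text.split()
        chunks = []
        step = chunk_size - overlap
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + chunk_size]))
        return chunks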

    Querying the documents

    Now that we have the documents in the vector database, we can use the Llama Stack APIs to find the chunks most relevant to a given question from the user. You can do this with the rag_tool query method:

                raw_rag_results = client.tool_runtime.rag_tool.query(
                    content=question,
                    vector_db_ids=[vector_db_id],
                )

    This returns the document chunks that are most relevant to the question being asked by the user. You then provide those chunks to the model as part of the prompt, as follows:

                rag_results = []
                for content_item in raw_rag_results.content:
                    rag_results.append(str(content_item.text))
    
                if SHOW_RAG_DOCUMENTS:
                    for result in rag_results:
                        print(result)
    
                prompt = f"""Answer the question based only on the context provided
                       <question>{question}</question>
                       <context>{' '.join(rag_results)}</context>"""

    Asking questions

    Having created the prompt with the context retrieved using the rag_tool, you can now send the user's question to the model to get an answer based on the RAG results:

                messages.append({"role": "user", "content": prompt})
                response = client.inference.chat_completion(
                    messages=messages,
                    model_id=model_id,
                )
    
                print("  RESPONSE:" + response.completion_message.content)

    The result

    We used the same question we used in previous RAG experiments based on the Node.js reference architecture: "Should I use npm to start an application".

    The result was as expected:

    python llama-stack-chat-rag.py 
    Iteration 0 ------------------------------------------------------------
    QUESTION: Should I use npm to start an application
      RESPONSE:No, you generally don't need npm to start your application. It can be avoided to prevent security vulnerabilities and reduce the number of processes running.

    Based on the answer, we can see that the information from the Node.js reference architecture was used to answer the question. Yay!

    If you want to see which document chunks were used, change SHOW_RAG_DOCUMENTS from False to True and the example will print them out. Here, we can see that the first document chunk used in our run was:

    Result 1
    Content: start"] in docker files
    used to build Node.js applications there are a number
    of good reasons to avoid this:
    
    One less component. You generally don't need npm to start
      your application. If you avoid using it in the container
      then you will not be exposed to any security vulnerabilities
      that might exist in that component or its dependencies.
    One less process. Instead of running 2 process (npm and node)
      you will only run 1.
    There can be issues with signals and child processes. You
      can read more about that in the Node.js docker best practices
      CMD.

    It was obviously used in formulating the response returned by the model.

    Agents, agents, agents

    Agents are often regarded as the best way to leverage large language model capabilities. Llama Stack provides an easier way to use RAG through agents, with a built-in tool that can search and return relevant information from a vector database. This tool is in the builtin::rag/knowledge_search tool group.

    Compared to the previous example using the completions API, the required code is simplified, as we don't need to query the vector database and build the results into the prompt. Instead, we simply provide the agent with the tools from the builtin::rag/knowledge_search tool group, and the agent figures out that it needs to use the tool to get the related results from the vector database.

    The full code for the agent example is in python-ai-experimentation/llama-stack-rag/llama-stack-agent-rag.py. The code to read and ingest the documents into the vector database is the same as when using the completions API, so we won't cover that again.

    To use RAG with the agent, we started by creating the agent, specifying the builtin::rag/knowledge_search tool group and the ID of the vector database it should use:

        agentic_system_create_response = client.agents.create(
            agent_config={
                "model": model_id,
                "instructions": "You are a helpful assistant, answer questions only based on information in the documents provided",
                "toolgroups": [
                    {
                        "name": "builtin::rag/knowledge_search",
                        "args": {"vector_db_ids": [vector_db_id]},
                    }
                ],
                "tool_choice": "auto",
                "input_shields": [],
                "output_shields": [],
                "max_infer_iters": 10,
            }
        )
        agent_id = agentic_system_create_response.agent_id

    We also created a session that will be used to maintain state across questions:

        session_create_response = client.agents.session.create(
            agent_id, session_name="agent1"
        ) 
        session_id = session_create_response.session_id

    We could then use the agent to ask the user's question. Most of the code is for displaying the chunks used from the database when SHOW_RAG_DOCUMENTS is True:
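    The response_stream handled below comes from creating a turn on the agent with streaming enabled. A minimal sketch of that call, assuming the agent_id, session_id, and question variables from above:

                response_stream = client.agents.turn.create(
                    agent_id=agent_id,
                    session_id=session_id,
                    messages=[{"role": "user", "content": question}],
                    stream=True,
                )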

              # Handle streaming response
                response = ""
                for chunk in response_stream:
                    if hasattr(chunk, "event") and hasattr(chunk.event, "payload"):
                        if chunk.event.payload.event_type == "turn_complete":
                            response = (
                                response + chunk.event.payload.turn.output_message.content
                            )
                        elif (
                            chunk.event.payload.event_type == "step_complete"
                            and chunk.event.payload.step_type == "tool_execution"
                            and SHOW_RAG_DOCUMENTS
                        ):
                            # Extract and print RAG document content in readable format
                            step_details = chunk.event.payload.step_details
                            if (
                                hasattr(step_details, "tool_responses")
                                and step_details.tool_responses
                            ):
                                print("\n" + "=" * 60)
                                print("RAG DOCUMENTS RETRIEVED")
                                print("=" * 60)
    
                                for tool_response in step_details.tool_responses:
                                    if (
                                        hasattr(tool_response, "content")
                                        and tool_response.content
                                    ):
                                        for item in tool_response.content:
                                            if (
                                                hasattr(item, "text")
                                                and "Result" in item.text
                                            ):
                                                # This is a result item, extract the content
                                                text = item.text
                                                if "Content:" in text:
                                                    # Extract content after "Content:"
                                                    content_start = text.find(
                                                        "Content:"
                                                    ) + len("Content:")
                                                    content_end = text.find("\nMetadata:")
                                                    if content_end == -1:
                                                        content_end = len(text)
    
                                                    content = text[
                                                        content_start:content_end
                                                    ].strip()
                                                    result_num = (
                                                        text.split("\n")[0]
                                                        if "\n" in text
                                                        else "Result"
                                                    )
    
                                                    print(f"\n--- {result_num} ---")
                                                    print(content)
                                                    print("-" * 40)
                                print("=" * 60)
    
                print("  RESPONSE:" + response)
    

    Overall, we can see that it takes less code using the agent. We simply give the agent a tool it can use to get the information needed to answer the question, and it figures out when to call it.

    We did find, however, that using the more quantized model did not work with the agent example. When we used the llama3.1:8b-instruct-q4_K_M model, the agent failed to figure out that it should call the RAG tool and answered without using the additional information. We therefore had to fall back to using the llama3.2:8b-instruct-fp16 model.

    Having done that, we got a similar answer:

    python llama-stack-agent-rag.py 
    Iteration 0 ------------------------------------------------------------
    QUESTION: Should I use npm to start a node.js application
      RESPONSE:Based on the search results, it seems that you don't necessarily need npm to start a Node.js application. In fact, one of the benefits of not using npm is that you'll have one less process running (npm and node) which can be beneficial for security and performance reasons.
    
    However, if you're planning to use a monorepo setup with multiple projects in a single repository, it's recommended to use either yarn workspaces or npm workspaces to manage the dependencies and packages.
    
    So, to answer your question, you don't necessarily need npm to start a Node.js application, but if you're planning a complex project structure, using npm or yarn workspaces might be beneficial.

    Just like with the earlier example, we can change SHOW_RAG_DOCUMENTS from False to True to get more info on which document chunks are being used. From that, we can see that the first chunk was the same as before:

    --- Result 1 ---
    of good reasons to avoid this:
    
    One less component. You generally don't need npm to start
      your application. If you avoid using it in the container
      then you will not be exposed to any security vulnerabilities
      that might exist in that component or its dependencies.
    One less process. Instead of running 2 process (npm and node)
      you will only run 1.
    There can be issues with signals and child processes. You
      can read more about that in the Node.js docker best practices
      CMD.
    
    Instead use a command like CMD ["node","index.js"],
    tooling for building containers
    ----------------------------------------

    Despite getting similar results from the RAG tool call, looking at the Llama Stack logs, we can see that the tool call was made with a different query than our question:

    tool call: knowledge_search with args: 
        {'query': 'npm vs yarn for node.js applications'}    

    We found this interesting, as using this query instead of the user's question for the RAG tool lookup could easily have returned fewer related document chunks, leading to a poorer answer. This illustrates that when you use lower-level APIs like the completion API, you might have more control than when you are using agents. In the agent case, you don't really have any direct control over the query used for RAG, as the agent decides when and how to call the tools.

    In addition, the fact that we had to use a less quantized model to get a similar result illustrates that agents typically have higher resource requirements, so they might not always be the first choice for some deployments. Using simpler options like the completion API might give you more control and let you achieve similar results at lower cost. On the flip side, agents often require less work or knowledge of low-level details to get started quickly.

    Wrapping up

    This post looked at implementing retrieval-augmented generation (RAG) using Python with large language models and Llama Stack. We explored how to ingest documents, query the vector database and ask questions that leverage RAG. We also showed how Llama Stack makes this even easier by integrating RAG into its agent support. We hope it has given you, as a Python developer, a good start on using large language models with Llama Stack.

    Read Part 3: Implement AI safeguards with Python and Llama Stack

    You can also explore more AI tutorials on our AI/ML topic page.

    Last updated: September 15, 2025
