With the recent release of Llama Stack earlier this year, we decided to look at how to implement key aspects of an AI application with Python and Llama Stack. This post covers retrieval-augmented generation (RAG).
Catch up on the rest of our series exploring how to use large language models with Python and Llama Stack:
- Part 1: Exploring Llama Stack with Python: Tool calling and agents
- Part 2: Retrieval-augmented generation with Llama Stack and Python (this post)
- Part 3: Implement AI safeguards with Python and Llama Stack
- Part 4: How to implement observability with Python and Llama Stack
How retrieval-augmented generation works
Retrieval-augmented generation is one way to provide context that helps a model respond with the most appropriate answer. The basic concept is as follows:
- Data that provides additional context—in our case, the Markdown files from the Node.js reference architecture—is transformed into a format suitable for model augmentation. This often includes breaking the data up into chunks of a maximum size that will later be provided as additional context to the model.
- An embedding model is used to convert the source data into a set of vectors that represent the content of the data. These are stored in a database so the data can be retrieved through a query against the matching vectors. Most commonly, the data is stored in a vector database like Chroma.
- The application is enhanced so that before passing a query on to the model, it first uses the question to query the database for matching document chunks. The most relevant document chunks are then added to the context and sent along with the question to the model as part of the prompt.
- The model returns an answer based partly on the context provided.
So why not just pass the content of all of the documents to the model? There are a number of reasons:
- The size of the context that can be passed to a model might be limited.
- Passing a large context might cost you more money.
- Providing a smaller, more closely related set of information can result in better answers.
With that in mind, we need to identify the most relevant document chunks and pass only that subset.
Now, let's look at how we implemented retrieval-augmented generation with Llama Stack.
Setting up Llama Stack
The first step was to get a running Llama Stack instance that we could experiment with. Llama Stack is a bit different from other frameworks in a few ways.
Instead of providing a single implementation with a set of defined APIs, Llama Stack aims to standardize a set of APIs and drive a number of distributions. In other words, the goal is to have many implementations of the same API, with each implementation being shipped by a different organization as a distribution.
As is common when this approach is followed, a "reference distribution" is provided, but there are already a number of alternative distributions available. You can see the list of available distributions in the GitHub README.
Another difference is a strong focus on plug-in APIs that allow you to add implementations for specific components behind the API implementation itself. For example, you could plug in an implementation (maybe one that is custom tailored for your organization) for a specific feature like Telemetry while using an existing distribution. We won't go into the details of these APIs in this post, but hope to look at them later on.
With that said, the first question was which distribution to use in order to get started. The Llama Stack quick start shows how to spin up a container running Llama Stack, which uses Ollama to serve the large language model. Because we already had a working Ollama install, we decided that was the path of least resistance.
Getting the Llama Stack instance running
We followed the original Llama Stack quick start, which uses a container to run the stack, pointed at an existing Ollama server. Following the instructions, we put together this short script that allowed us to easily start and stop the Llama Stack instance:
export INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct"
export LLAMA_STACK_PORT=8321
export OLLAMA_HOST=10.1.2.38
podman run -it \
  --user 1000 \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-ollama:0.2.8 \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://$OLLAMA_HOST:11434
Our existing Ollama server was running on a machine with the IP 10.1.2.38, which is what we set OLLAMA_HOST to.
We followed the instructions for starting the container from the quick start, so Llama Stack was running on the default port. We ran the container on a Fedora virtual machine with IP 10.1.2.128, so you will see us using http://10.1.2.128:8321 as the endpoint for the Llama Stack instance in our code examples.
At this point, we had a running Llama Stack instance to experiment with.
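In the code examples that follow, client refers to a Llama Stack Python client connected to that endpoint. A minimal sketch of that setup, assuming the llama-stack-client package, looks something like this:

import uuid  # used later when generating a unique vector database ID

from llama_stack_client import LlamaStackClient

# connect to the Llama Stack instance started above
client = LlamaStackClient(base_url="http://10.1.2.128:8321")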
Llama Stack now supports more models
At the time of our initial exploration of Llama Stack, only a limited set of models were supported when using Ollama. This time, we used a later version of Llama Stack, which allows you to register and use different models.
In our previous experiments, we often used more heavily quantized models because they are smaller and easier to run on smaller GPUs. Given the ability to register new models, we tried running with llama3.1:8b-instruct-q4_K_M instead of the larger llama3.1:8b-instruct-fp16.
To use a model not already known to Llama Stack, you use the models API to register the new model. The code to do that was as follows:
model_id = "meta-llama/Llama-3.1-8B-instruct-q4_K_M"
client.models.register(
    model_id=model_id,
    provider_id="ollama",
    provider_model_id="llama3.1:8b-instruct-q4_K_M",
    model_type="llm",
)
After doing that, we could request the model using the model ID meta-llama/Llama-3.1-8B-instruct-q4_K_M. It's great to have that additional flexibility; in our case, it let us run the first RAG example much faster.
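For example, the registered ID could then be passed as the model_id for the same kind of inference call used later in this post (a minimal sketch with an example prompt of our own):

# ask the newly registered, more quantized model a question
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.completion_message.content)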
Retrieval-augmented generation with the completion API
As outlined earlier, the first step to using RAG is to ingest and store the documents in a vector database. Llama Stack provides a set of APIs to make this easy and consistent across different vector databases. The full code for the example is in python-ai-experimentation/llama-stack-rag/llama-stack-chat-rag.py.
Creating the database
To get started, you'll need to create a database in which the documents will be stored. Llama Stack provides an API to manage vector databases and supports an in-memory implementation by default. Additionally, you can register other vector database implementations to use the one best suited to your organization. In our example, we are using the in-memory vector database that is available by default.
The code we used to create the database was as follows:
# use the first available provider
providers = client.providers.list()
provider = next(p for p in providers if p.api == "vector_io")
# register a vector database
vector_db_id = f"test-vector-db-{uuid.uuid4()}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    provider_id=provider.provider_id,
    embedding_model="all-MiniLM-L6-v2",
)
The in-memory database was the first provider in the default configuration, so we just used that. When creating the database with the register call, we created a random vector_db_id and used the embedding model that comes with Llama Stack by default, all-MiniLM-L6-v2.
The database persists as long as the Llama Stack container runs, so we deleted the database at the end of our experiment for cleanup. In a production deployment, you would probably separate the creation and ingestion process so that you ingest the documents on a less frequent basis and use the database for multiple requests.
The clean-up code was as follows:
client.vector_dbs.unregister(vector_db_id)
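If you do keep the database around between script runs instead of recreating it each time, one approach is to check whether it has already been registered before registering and ingesting again. A rough sketch of that idea, assuming the client's vector_dbs.list method and a fixed database name of our own choosing:

# reuse an existing vector database if it was registered in an earlier run
vector_db_id = "nodejs-reference-architecture"  # example fixed name
existing_ids = [db.identifier for db in client.vector_dbs.list()]
if vector_db_id not in existing_ids:
    client.vector_dbs.register(
        vector_db_id=vector_db_id,
        provider_id=provider.provider_id,
        embedding_model="all-MiniLM-L6-v2",
    )
    # ingest the documents here, only when the database is first created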
Ingesting the documents
As with our earlier explorations, we wanted to use the documents in the Node.js reference architecture as our knowledge base. We first cloned the GitHub repo to a local directory (in this case, /home/user1/newpull/nodejs-reference-architecture).
We then wrote some code that would read the documents into objects we could pass to the Llama Stack APIs:
# read in all of the files to be used with RAG
rag_documents = []
docs_path = Path("/home/user1/newpull/nodejs-reference-architecture/docs")
i = 0
for file_path in docs_path.rglob("*.md"):
    i += 1
    if file_path.is_file():
        with open(file_path, "r", encoding="utf-8") as f:
            contents = f.read()
            # Convert markdown to plain text using strip_markdown
            plain_text = strip_markdown(contents)
            rag_documents.append(
                {
                    "document_id": f"doc-{i}",
                    "content": plain_text,
                    "mime_type": "text/plain",
                    "metadata": {},
                }
            )
Because the reference architecture documents were in Markdown, we used a package (strip_markdown) to convert them to plain text and then set the mime_type to text/plain in the objects we planned to pass to Llama Stack.
Next, we used the Llama Stack rag_tool insert API to ingest the documents:
client.tool_runtime.rag_tool.insert(
    documents=rag_documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=125,
)
The Llama Stack API handled breaking the documents up into chunks based on the value we passed for chunk_size_in_tokens. Because this value is in tokens, we needed to make it smaller than the values we passed to other frameworks, where the setting that controls how documents are split is in characters. We could not find a way to tune the overlap between chunks; hopefully we'll see that in later versions of the API.
Querying the documents
Now that we have the documents in the vector database, we can use the Llama Stack APIs to find the chunks most relevant to a given question from the user. You can do this with the rag_tool query method:
raw_rag_results = client.tool_runtime.rag_tool.query(
    content=question,
    vector_db_ids=[vector_db_id],
)
This returns the document chunks that are most relevant to the question being asked by the user. You then provide those chunks to the model as part of the prompt, as follows:
rag_results = []
for content_item in raw_rag_results.content:
    rag_results.append(str(content_item.text))
if SHOW_RAG_DOCUMENTS:
    for result in rag_results:
        print(result)
prompt = f"""Answer the question based only on the context provided
<question>{question}</question>
<context>{' '.join(rag_results)}</context>"""
Asking questions
Having created the prompt with the context retrieved using the RAG tool, you can now send the user's question to the model to get an answer based on the RAG results:
messages.append({"role": "user", "content": prompt})
response = client.inference.chat_completion(
    messages=messages,
    model_id=model_id,
)
print(" RESPONSE:" + response.completion_message.content)
The result
We used the same question we used in previous RAG experiments based on the Node.js reference architecture: "Should I use npm to start an application"
The result was as expected:
python llama-stack-chat-rag.py
Iteration 0 ------------------------------------------------------------
QUESTION: Should I use npm to start an application
RESPONSE:No, you generally don't need npm to start your application. It can be avoided to prevent security vulnerabilities and reduce the number of processes running.
Based on the answer, we can see that the information from the Node.js reference architecture was used to answer the question. Yay!
If you want to see which document chunks were used, change the value of SHOW_RAG_DOCUMENTS = False to True and the example will print them out. Here, we can see that the first document chunk used in our run was:
Result 1
Content: start"] in docker files
used to build Node.js applications there are a number
of good reasons to avoid this:
One less component. You generally don't need npm to start
your application. If you avoid using it in the container
then you will not be exposed to any security vulnerabilities
that might exist in that component or its dependencies.
One less process. Instead of running 2 process (npm and node)
you will only run 1.
There can be issues with signals and child processes. You
can read more about that in the Node.js docker best practices
CMD.
It was obviously used in formulating the response returned by the model.
Agents, agents, agents
Agents are often regarded as the best way to leverage large language model capabilities. Llama Stack provides an easier way to use RAG through agents, with a built-in tool that can search and return relevant information from a vector database. This tool is in the builtin::rag/knowledge_search tool group.
Compared to the previous example using the completion API, the required code is simplified, as we don't need to query the vector database and build the results into the prompt ourselves. Instead, we simply provide the agent with the tools from the builtin::rag/knowledge_search tool group, and the agent figures out that it needs to use the tool to get the related results from the vector database.
The full code for the agent example is in python-ai-experimentation/llama-stack-rag/llama-stack-agent-rag.py. The code to read and ingest the documents into the vector database is the same as when using the completion API, so we won't cover that again.
To use RAG with the agent, we started by creating an agent, specifying the builtin::rag/knowledge_search tool group and the ID of the vector database that should be used:
agentic_system_create_response = client.agents.create(
    agent_config={
        "model": model_id,
        "instructions": "You are a helpful assistant, answer questions only based on information in the documents provided",
        "toolgroups": [
            {
                "name": "builtin::rag/knowledge_search",
                "args": {"vector_db_ids": [vector_db_id]},
            }
        ],
        "tool_choice": "auto",
        "input_shields": [],
        "output_shields": [],
        "max_infer_iters": 10,
    }
)
agent_id = agentic_system_create_response.agent_id
We also created a session that will be used to maintain state across questions:
session_create_response = client.agents.session.create(
    agent_id, session_name="agent1"
)
session_id = session_create_response.session_id
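We could then use the agent to ask the user's question. The exact call is in the linked example file; a minimal sketch of what creating the turn and getting back a streaming response might look like, assuming the client's agents.turn.create method:

# ask the question; the agent decides when to call the knowledge_search tool
response_stream = client.agents.turn.create(
    agent_id=agent_id,
    session_id=session_id,
    messages=[{"role": "user", "content": question}],
    stream=True,
)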
We then handled the streamed response; most of the code below is for displaying the chunks used from the database when SHOW_RAG_DOCUMENTS is True:
# Handle streaming response
response = ""
for chunk in response_stream:
    if hasattr(chunk, "event") and hasattr(chunk.event, "payload"):
        if chunk.event.payload.event_type == "turn_complete":
            response = (
                response + chunk.event.payload.turn.output_message.content
            )
        elif (
            chunk.event.payload.event_type == "step_complete"
            and chunk.event.payload.step_type == "tool_execution"
            and SHOW_RAG_DOCUMENTS
        ):
            # Extract and print RAG document content in readable format
            step_details = chunk.event.payload.step_details
            if (
                hasattr(step_details, "tool_responses")
                and step_details.tool_responses
            ):
                print("\n" + "=" * 60)
                print("RAG DOCUMENTS RETRIEVED")
                print("=" * 60)
                for tool_response in step_details.tool_responses:
                    if (
                        hasattr(tool_response, "content")
                        and tool_response.content
                    ):
                        for item in tool_response.content:
                            if (
                                hasattr(item, "text")
                                and "Result" in item.text
                            ):
                                # This is a result item, extract the content
                                text = item.text
                                if "Content:" in text:
                                    # Extract content after "Content:"
                                    content_start = text.find(
                                        "Content:"
                                    ) + len("Content:")
                                    content_end = text.find("\nMetadata:")
                                    if content_end == -1:
                                        content_end = len(text)
                                    content = text[
                                        content_start:content_end
                                    ].strip()
                                    result_num = (
                                        text.split("\n")[0]
                                        if "\n" in text
                                        else "Result"
                                    )
                                    print(f"\n--- {result_num} ---")
                                    print(content)
                                    print("-" * 40)
                print("=" * 60)
print(" RESPONSE:" + response)
Overall, we can see that it takes less code using the agent. We simply give the agent a tool it can use to get the information needed to answer the question, and it figures out when to call it.
We did find, however, that the more heavily quantized model did not work with the agent example. When we used the llama3.1:8b-instruct-q4_K_M model, the agent failed to figure out that it should call the RAG tool and answered without using the additional information. We therefore had to fall back to using the llama3.1:8b-instruct-fp16 model.
Having done that, we got a similar answer:
python llama-stack-agent-rag.py
Iteration 0 ------------------------------------------------------------
QUESTION: Should I use npm to start a node.js application
RESPONSE:Based on the search results, it seems that you don't necessarily need npm to start a Node.js application. In fact, one of the benefits of not using npm is that you'll have one less process running (npm and node) which can be beneficial for security and performance reasons.
However, if you're planning to use a monorepo setup with multiple projects in a single repository, it's recommended to use either yarn workspaces or npm workspaces to manage the dependencies and packages.
So, to answer your question, you don't necessarily need npm to start a Node.js application, but if you're planning a complex project structure, using npm or yarn workspaces might be beneficial.
Just like with the earlier example, we can change SHOW_RAG_DOCUMENTS = False to True to get more info on which document chunks are being used. From that, we can see that the first chunk was the same as before:
--- Result 1 ---
of good reasons to avoid this:
One less component. You generally don't need npm to start
your application. If you avoid using it in the container
then you will not be exposed to any security vulnerabilities
that might exist in that component or its dependencies.
One less process. Instead of running 2 process (npm and node)
you will only run 1.
There can be issues with signals and child processes. You
can read more about that in the Node.js docker best practices
CMD.
Instead use a command like CMD ["node","index.js"],
tooling for building containers
----------------------------------------
Despite getting similar results from the RAG tool call, looking at the Llama Stack logs, we can see that the tool call was made with a different query than our question:
tool call: knowledge_search with args:
{'query': 'npm vs yarn for node.js applications'}
We found this interesting, as using this query instead of the user's question for the RAG lookup could easily have returned fewer related document chunks, leading to a poorer answer. This illustrates that lower-level APIs like the completion API can give you more control than agents. In the agent case, you don't really have any direct control over the query used for the RAG lookup, as the agent decides when and how to call the tools.
In addition, the fact that we had to use a less quantized model to get a similar result illustrates that agents typically come with higher resource requirements, so they might not always be the first choice for some deployments. Using simpler options like the completion API might give you more control and let you achieve similar results at a lower cost. On the flip side, agents often require less work or knowledge of low-level details to get started quickly.
Wrapping up
This post looked at implementing retrieval-augmented generation (RAG) using Python with large language models and Llama Stack. We explored how to ingest documents, query the vector database and ask questions that leverage RAG. We also showed how Llama Stack makes this even easier by integrating RAG into its agent support. We hope it has given you, as a Python developer, a good start on using large language models with Llama Stack.
Read Part 3: Implement AI safeguards with Python and Llama Stack
You can also explore more AI tutorials on our AI/ML topic page.