With the recent release of Llama Stack earlier this year, we decided to look at how to implement key aspects of an AI application with Python and Llama Stack. This post covers retrieval-augmented generation (RAG).
Catch up on the rest of our series exploring how to use large language models with Python and Llama Stack:
- Part 1: Exploring Llama Stack with Python: Tool calling and agents
- Part 2: Retrieval-augmented generation with Llama Stack and Python (this post)
- Part 3: Implement AI safeguards with Python and Llama Stack
- Part 4: How to implement observability with Python and Llama Stack
How retrieval-augmented generation works
Retrieval-augmented generation is one way to provide context that helps a model respond with the most appropriate answer. The basic concept is as follows:
- Data that provides additional context—in our case, the Markdown files from the Node.js reference architecture—is transformed into a format suitable for model augmentation. This often includes breaking the data up into chunks of a maximum size that will later be provided as additional context to the model.
- An embedding model is used to convert the source data into a set of vectors that represent the content of the data. These are stored in a database so the data can be retrieved through a query against the matching vectors. Most commonly, the data is stored in a vector database like Chroma.
- The application is enhanced so that before passing a query on to the model, it first uses the question to query the database for matching document chunks. The most relevant document chunks are then added to the context and sent along with the question to the model as part of the prompt.
- The model returns an answer based partly on the context provided.
So why not just pass the content of all of the documents to the model? There are a number of reasons:
- The size of the context that can be passed to a model might be limited.
- Passing a large context might cost you more money.
- Providing a smaller, more closely related set of information can result in better answers.
With that in mind, we need to identify the most relevant document chunks and pass only that subset.
Now, let's look at how we implemented retrieval-augmented generation with Llama Stack.
Setting up Llama Stack
The first step was to get a running Llama Stack instance that we could experiment with. Llama Stack is a bit different from other frameworks in a few ways.
Instead of providing a single implementation with a set of defined APIs, Llama Stack aims to standardize a set of APIs and drive a number of distributions. In other words, the goal is to have many implementations of the same API, with each implementation being shipped by a different organization as a distribution.
As is common when this approach is followed, a "reference distribution" is provided, but there are already a number of alternative distributions available. You can see the list of available distributions in the GitHub README.
Another difference is a strong focus on plug-in APIs that allow you to add implementations for specific components behind the API implementation itself. For example, you could plug in an implementation (maybe one that is custom tailored for your organization) for a specific feature like Telemetry while using an existing distribution. We won't go into the details of these APIs in this post, but hope to look at them later on.
With that said, the first question was which distribution to use in order to get started. The Llama Stack quick start shows how to spin up a container running Llama Stack, which uses Ollama to serve the large language model. Because we already had a working Ollama install, we decided that was the path of least resistance.
Getting the Llama Stack instance running
We followed the original Llama Stack quick start, which uses a container to run the stack, pointed at an existing Ollama server. Following the instructions, we put together this short script that allowed us to easily start and stop the Llama Stack instance:
export INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct"
export LLAMA_STACK_PORT=8321
export OLLAMA_HOST=10.1.2.38
podman run -it \
  --user 1000 \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-ollama:0.2.8 \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://$OLLAMA_HOST:11434
Our existing Ollama server was running on a machine with the IP 10.1.2.38, which is what we set OLLAMA_HOST to.
We followed the instructions for starting the container from the quick start, so Llama Stack was running on the default port. We ran the container on a Fedora virtual machine with IP 10.1.2.128, so you will see us using http://10.1.2.128:8321 as the endpoint for the Llama Stack instance in our code examples.
At this point, we had a running Llama Stack instance to experiment with.
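In the code examples that follow, client refers to a Llama Stack Python client connected to that endpoint. A minimal sketch of that setup, assuming the llama-stack-client package, looks something like this:

import uuid  # used later when generating a unique vector database ID

from llama_stack_client import LlamaStackClient

# connect to the Llama Stack instance started above
client = LlamaStackClient(base_url="http://10.1.2.128:8321")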
Llama Stack now supports more models
At the time of our initial exploration of Llama Stack, only a limited set of models were supported when using Ollama. This time, we used a later version of Llama Stack, which allows you to register and use different models.
In our previous experiments, we often used more heavily quantized models because they are smaller and easier to run on smaller GPUs. Given the ability to register new models, we tried running with llama3.1:8b-instruct-q4_K_M instead of the larger llama3.1:8b-instruct-fp16.
To use a model not already known to Llama Stack, you use the models API to register the new model. The code to do that was as follows:
model_id = "meta-llama/Llama-3.1-8B-instruct-q4_K_M"
client.models.register(
    model_id=model_id,
    provider_id="ollama",
    provider_model_id="llama3.1:8b-instruct-q4_K_M",
    model_type="llm",
)
After doing that, we could request the model using the model ID meta-llama/Llama-3.1-8B-instruct-q4_K_M. It's great to have that additional flexibility; in our case, it let us run the first RAG example much faster.
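For example, the registered ID could then be passed as the model_id for the same kind of inference call used later in this post (a minimal sketch with an example prompt of our own):

# ask the newly registered, more quantized model a question
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.completion_message.content)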
Retrieval-augmented generation with the completion API
As outlined earlier, the first step to using RAG is to ingest and store the documents in a vector database. Llama Stack provides a set of APIs to make this easy and consistent across different vector databases. The full code for the example is in python-ai-experimentation/llama-stack-rag/llama-stack-chat-rag.py.
Creating the database
To get started, you'll need to create a database in which the documents will be stored. Llama Stack provides an API to manage vector databases and supports an in-memory implementation by default. Additionally, you can register other vector database implementations to use the one best suited to your organization. In our example, we are using the in-memory vector database that is available by default.
The code we used to create the database was as follows:
# use the first available provider
providers = client.providers.list()
provider = next(p for p in providers if p.api == "vector_io")
# register a vector database
vector_db_id = f"test-vector-db-{uuid.uuid4()}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    provider_id=provider.provider_id,
    embedding_model="all-MiniLM-L6-v2",
)
The in-memory database was the first provider in the default configuration, so we just used that. When creating the database with the register call, we created a random vector_db_id and used the embedding model that comes with Llama Stack by default, all-MiniLM-L6-v2.
The database persists as long as the Llama Stack container runs, so we deleted the database at the end of our experiment for cleanup. In a production deployment, you would probably separate the creation and ingestion process so that you ingest the documents on a less frequent basis and use the database for multiple requests.
The clean-up code was as follows:
client.vector_dbs.unregister(vector_db_id)
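If you do keep the database around between script runs instead of recreating it each time, one approach is to check whether it has already been registered before registering and ingesting again. A rough sketch of that idea, assuming the client's vector_dbs.list method and a fixed database name of our own choosing:

# reuse an existing vector database if it was registered in an earlier run
vector_db_id = "nodejs-reference-architecture"  # example fixed name
existing_ids = [db.identifier for db in client.vector_dbs.list()]
if vector_db_id not in existing_ids:
    client.vector_dbs.register(
        vector_db_id=vector_db_id,
        provider_id=provider.provider_id,
        embedding_model="all-MiniLM-L6-v2",
    )
    # ingest the documents here, only when the database is first created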
Ingesting the documents
As with our earlier explorations, we wanted to use the documents in the Node.js reference architecture as our knowledge base. We first cloned the GitHub repo to a local directory (in this case, /home/user1/newpull/nodejs-reference-architecture).
We then wrote some code that would read the documents into objects we could pass to the Llama Stack APIs:
# read in all of the files to be used with RAG
rag_documents = []
docs_path = Path("/home/user1/newpull/nodejs-reference-architecture/docs")
i = 0
for file_path in docs_path.rglob("*.md"):
    i += 1
    if file_path.is_file():
        with open(file_path, "r", encoding="utf-8") as f:
            contents = f.read()
            # Convert markdown to plain text using strip_markdown
            plain_text = strip_markdown(contents)
            rag_documents.append(
                {
                    "document_id": f"doc-{i}",
                    "content": plain_text,
                    "mime_type": "text/plain",
                    "metadata": {},
                }
            )
Because the reference architecture documents were in Markdown, we used a package (strip_markdown) to convert them to plain text and then set the mime_type to text/plain in the objects we planned to pass to Llama Stack.
Next, we used the Llama Stack rag_tool insert API to ingest the documents:
client.tool_runtime.rag_tool.insert(
    documents=rag_documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=125,
)
The Llama Stack API handled breaking the documents up into chunks based on the value we passed for chunk_size_in_tokens. Because this value is in tokens, we needed to make it smaller than the values we passed to other frameworks, where the setting that controls how documents are split is in characters. We could not find a way to tune the overlap between chunks; hopefully we'll see that in later versions of the API.
Querying the documents
Now that we have the documents in the vector database, we can use the Llama Stack APIs to find the chunks most relevant to a given question from the user. You can do this with the rag_tool query method:
raw_rag_results = client.tool_runtime.rag_tool.query(
    content=question,
    vector_db_ids=[vector_db_id],
)
This returns the document chunks that are most relevant to the question being asked by the user. You then provide those chunks to the model as part of the prompt, as follows:
rag_results = []
for content_item in raw_rag_results.content:
    rag_results.append(str(content_item.text))
if SHOW_RAG_DOCUMENTS:
    for result in rag_results:
        print(result)
prompt = f"""Answer the question based only on the context provided
<question>{question}</question>
<context>{' '.join(rag_results)}</context>"""
Asking questions
Having created the prompt with the context retrieved using the RAG tool, you can now send the user's question to the model to get an answer based on the RAG results:
messages.append({"role": "user", "content": prompt})
response = client.inference.chat_completion(
    messages=messages,
    model_id=model_id,
)
print(" RESPONSE:" + response.completion_message.content)
The result
We used the same question we used in previous RAG experiments based on the Node.js reference architecture: "Should I use npm to start an application"
The result was as expected:
python llama-stack-chat-rag.py
Iteration 0 ------------------------------------------------------------
QUESTION: Should I use npm to start an application
RESPONSE:No, you generally don't need npm to start your application. It can be avoided to prevent security vulnerabilities and reduce the number of processes running.
Based on the answer, we can see that the information from the Node.js reference architecture was used to answer the question. Yay!
If you want to see which document chunks were used, change the value of SHOW_RAG_DOCUMENTS = False to True and the example will print them out. Here, we can see that the first document chunk used in our run was:
Result 1
Content: start"] in docker files
used to build Node.js applications there are a number
of good reasons to avoid this:
One less component. You generally don't need npm to start
your application. If you avoid using it in the container
then you will not be exposed to any security vulnerabilities
that might exist in that component or its dependencies.
One less process. Instead of running 2 process (npm and node)
you will only run 1.
There can be issues with signals and child processes. You
can read more about that in the Node.js docker best practices
CMD.
It was obviously used in formulating the response returned by the model.
Agents, agents, agents
Agents are often regarded as the best way to leverage large language model capabilities. Llama Stack provides an easier way to use RAG through agents, with a built-in tool that can search and return relevant information from a vector database. This tool is in the builtin::rag/knowledge_search tool group.
Compared to the previous example using the completion API, the required code is simplified, as we don't need to query the vector database and build the results into the prompt ourselves. Instead, we simply provide the agent with the tools from the builtin::rag/knowledge_search tool group, and the agent figures out that it needs to use the tool to get the related results from the vector database.
The full code for the agent example is in python-ai-experimentation/llama-stack-rag/llama-stack-agent-rag.py. The code to read and ingest the documents into the vector database is the same as when using the completion API, so we won't cover that again.
To use RAG with the agent, we started by creating an agent, specifying the builtin::rag/knowledge_search tool group and the ID of the vector database that should be used:
agentic_system_create_response = client.agents.create(
    agent_config={
        "model": model_id,
        "instructions": "You are a helpful assistant, answer questions only based on information in the documents provided",
        "toolgroups": [
            {
                "name": "builtin::rag/knowledge_search",
                "args": {"vector_db_ids": [vector_db_id]},
            }
        ],
        "tool_choice": "auto",
        "input_shields": [],
        "output_shields": [],
        "max_infer_iters": 10,
    }
)
agent_id = agentic_system_create_response.agent_id
We also created a session that will be used to maintain state across questions:
session_create_response = client.agents.session.create(
    agent_id, session_name="agent1"
)
session_id = session_create_response.session_id
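We could then use the agent to ask the user's question. The exact call is in the linked example file; a minimal sketch of what creating the turn and getting back a streaming response might look like, assuming the client's agents.turn.create method:

# ask the question; the agent decides when to call the knowledge_search tool
response_stream = client.agents.turn.create(
    agent_id=agent_id,
    session_id=session_id,
    messages=[{"role": "user", "content": question}],
    stream=True,
)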
We then handled the streamed response; most of the code below is for displaying the chunks used from the database when SHOW_RAG_DOCUMENTS is True:
# Handle streaming response
response = ""
for chunk in response_stream:
    if hasattr(chunk, "event") and hasattr(chunk.event, "payload"):
        if chunk.event.payload.event_type == "turn_complete":
            response = (
                response + chunk.event.payload.turn.output_message.content
            )
        elif (
            chunk.event.payload.event_type == "step_complete"
            and chunk.event.payload.step_type == "tool_execution"
            and SHOW_RAG_DOCUMENTS
        ):
            # Extract and print RAG document content in readable format
            step_details = chunk.event.payload.step_details
            if (
                hasattr(step_details, "tool_responses")
                and step_details.tool_responses
            ):
                print("\n" + "=" * 60)
                print("RAG DOCUMENTS RETRIEVED")
                print("=" * 60)
                for tool_response in step_details.tool_responses:
                    if (
                        hasattr(tool_response, "content")
                        and tool_response.content
                    ):
                        for item in tool_response.content:
                            if (
                                hasattr(item, "text")
                                and "Result" in item.text
                            ):
                                # This is a result item, extract the content
                                text = item.text
                                if "Content:" in text:
                                    # Extract content after "Content:"
                                    content_start = text.find(
                                        "Content:"
                                    ) + len("Content:")
                                    content_end = text.find("\nMetadata:")
                                    if content_end == -1:
                                        content_end = len(text)
                                    content = text[
                                        content_start:content_end
                                    ].strip()
                                    result_num = (
                                        text.split("\n")[0]
                                        if "\n" in text
                                        else "Result"
                                    )
                                    print(f"\n--- {result_num} ---")
                                    print(content)
                                    print("-" * 40)
                print("=" * 60)
print(" RESPONSE:" + response)
Overall, we can see that it takes less code using the agent. We simply give the agent a tool it can use to get the information needed to answer the question, and it figures out when to call it.
We did find, however, that the more heavily quantized model did not work with the agent example. When we used the llama3.1:8b-instruct-q4_K_M model, the agent failed to figure out that it should call the RAG tool and answered without using the additional information. We therefore had to fall back to using the llama3.1:8b-instruct-fp16 model.
Having done that, we got a similar answer:
python llama-stack-agent-rag.py
Iteration 0 ------------------------------------------------------------
QUESTION: Should I use npm to start a node.js application
RESPONSE:Based on the search results, it seems that you don't necessarily need npm to start a Node.js application. In fact, one of the benefits of not using npm is that you'll have one less process running (npm and node) which can be beneficial for security and performance reasons.
However, if you're planning to use a monorepo setup with multiple projects in a single repository, it's recommended to use either yarn workspaces or npm workspaces to manage the dependencies and packages.
So, to answer your question, you don't necessarily need npm to start a Node.js application, but if you're planning a complex project structure, using npm or yarn workspaces might be beneficial.
Just like with the earlier example, we can change SHOW_RAG_DOCUMENTS = False to True to get more info on which document chunks are being used. From that, we can see that the first chunk was the same as before:
--- Result 1 ---
of good reasons to avoid this:
One less component. You generally don't need npm to start
your application. If you avoid using it in the container
then you will not be exposed to any security vulnerabilities
that might exist in that component or its dependencies.
One less process. Instead of running 2 process (npm and node)
you will only run 1.
There can be issues with signals and child processes. You
can read more about that in the Node.js docker best practices
CMD.
Instead use a command like CMD ["node","index.js"],
tooling for building containers
----------------------------------------
Despite getting similar results from the RAG tool call, looking at the Llama Stack logs, we can see that the tool call was made with a different query than our question:
tool call: knowledge_search with args:
{'query': 'npm vs yarn for node.js applications'}
We found this interesting, as using this query instead of the user's question for the RAG lookup could easily have returned fewer related document chunks, leading to a poorer answer. This illustrates that lower-level APIs like the completion API can give you more control than agents. In the agent case, you don't really have any direct control over the query used for the RAG lookup, as the agent decides when and how to call the tools.
In addition, the fact that we had to use a less quantized model to get a similar result illustrates that agents typically come with higher resource requirements, so they might not always be the first choice for some deployments. Using simpler options like the completion API might give you more control and let you achieve similar results at a lower cost. On the flip side, agents often require less work or knowledge of low-level details to get started quickly.
Wrapping up
This post looked at implementing retrieval-augmented generation (RAG) using Python with large language models and Llama Stack. We explored how to ingest documents, query the vector database and ask questions that leverage RAG. We also showed how Llama Stack makes this even easier by integrating RAG into its agent support. We hope it has given you, as a Python developer, a good start on using large language models with Llama Stack.
Read Part 3: Implement AI safeguards with Python and Llama Stack
You can also explore more AI tutorials on our AI/ML topic page.