With the release of Llama Stack earlier this year, we decided to look at how to implement key aspects of an AI application with Python and Llama Stack. This post covers AI observability.
Catch up on the rest of our series exploring how to use large language models with Python and Llama Stack:
- Part 1: Exploring Llama Stack with Python: Tool calling and agents
- Part 2: Retrieval-augmented generation with Llama Stack and Python
- Part 3: Implement AI safeguards with Python and Llama Stack
- Part 4: How to implement observability with Python and Llama Stack (this post)
What is observability?
When your application is running in production, you need to be able to figure out what is going on when things are not working properly. Three key components of observability are:
- Logging
- Metrics
- Distributed tracing
Learn more: What is observability?
In this post, we will dive into distributed tracing when using Llama Stack and Python.
What is OpenTelemetry?
OpenTelemetry is quickly becoming the de facto standard for observability. It supports a number of languages, including Python, and provides a set of APIs that you can use to instrument your application. While you might want to add additional instrumentation of your own, the good news is that there are already packages that automatically instrument existing libraries, so you might not need to instrument your application at all to get the information you need.
OpenTelemetry is not specific to the AI and large language model ecosystem, but, as in other domains, it is being adopted as the standard for capturing traces. Auto-instrumentation or built-in support has already been added to a number of the libraries used to interact with large language models, and Llama Stack is no different.
OpenTelemetry provides a way to instrument your application and generate traces, but it does not provide the mechanism needed to receive, store, and visualize those traces. One of the tools commonly used for that is Jaeger.
Setting up Jaeger
For experimentation we can easily set up a Jaeger instance using Podman or Docker. This is the script that we used:
podman run --pull always --rm --name jaeger \
-p 16686:16686 -p 4318:4318 \
jaegertracing/jaeger:2.1.0
This starts a Jaeger instance with the UI available on port 16686 and an endpoint on port 4318 where traces can be sent to be captured and stored.
Once you start the container, you can verify that you can access the Jaeger UI, shown in Figure 1.

Both our Llama Stack instance and our Jaeger instance were running on a Fedora machine with IP 10.1.2.128.
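If you want to confirm the UI port is reachable from the machine where your application will run, a quick check with Python's standard library is enough. This is just a convenience sketch; the IP address is from our setup, so substitute your own:
# Quick check that the Jaeger UI port answers; substitute the host for your own setup.
import urllib.request

JAEGER_HOST = "10.1.2.128"

with urllib.request.urlopen(f"http://{JAEGER_HOST}:16686", timeout=5) as response:
    # A 200 response means the UI is up; traces are sent separately to port 4318.
    print("Jaeger UI reachable, HTTP status:", response.status)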
Setting up Llama Stack
We wanted to get a running Llama Stack instance with tracing enabled that we could experiment with. The Llama Stack quick start shows how to spin up a container running Llama Stack, which uses Ollama to serve the large language model. Because we already had a working Ollama install, we decided that was the path of least resistance.
We followed the original Llama Stack quick start, which used a container to run the stack with it pointing to an existing Ollama server. Following the instructions, we put together this short script that allowed us to easily start/stop the Llama Stack instance:
export INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct"
export LLAMA_STACK_PORT=8321
export OLLAMA_HOST=10.1.2.46
export OTEL_SERVICE_NAME=LlamaStack
export TELEMETRY_SINKS=otel_trace,otel_metric
podman run -it \
--user 1000 \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run-otel-new.yaml:/app/run.yaml:z \
llamastack/distribution-ollama:0.2.8 \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://$OLLAMA_HOST:11434 \
--env OTEL_SERVICE_NAME=$OTEL_SERVICE_NAME \
--env TELEMETRY_SINKS=$TELEMETRY_SINKS \
--yaml-config run.yaml
Note that this is different from what we used in other posts in the series: the Llama Stack version has been updated to 0.2.8, and we use a modified run.yaml that we extracted from the container. We needed at least 0.2.8 because some of the trace propagation that we will cover later was fixed in that version.
The first difference you might notice is that we set some additional environment variables and pass them with --env command-line options when we start the container.
export OTEL_SERVICE_NAME=LlamaStack
export TELEMETRY_SINKS=otel_trace,otel_metric
These set the service name used for traces sent to Jaeger and enable trace and metric generation within the Llama Stack server.
The next difference is the updated run.yaml. We started the default 0.2.8 Llama Stack container and extracted the run.yaml template for Ollama (./usr/local/lib/python3.10/site-packages/llama_stack/templates/ollama/run.yaml) from it. We then added pointers to our Jaeger instance in the telemetry section using the otel_metric_endpoint and otel_trace_endpoint configuration options. After our changes, the telemetry section looked like the following:
telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: ${env.OTEL_SERVICE_NAME:}
      sinks: ${env.TELEMETRY_SINKS:console,sqlite}
      otel_metric_endpoint: "http://10.1.2.128:4318/v1/metrics"
      otel_trace_endpoint: "http://10.1.2.128:4318/v1/traces"
      sqlite_db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/trace_store.db
We ran the container on a Fedora virtual machine with IP 10.1.2.128, so you will see us using http://10.1.2.128:8321 as the endpoint for the Llama Stack instance in our code examples (8321 is the default port for Llama Stack).
At this point, we had a running Llama Stack instance that we could use to start to experiment with distributed tracing.
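Before wiring tracing into an application, it's worth confirming the client can reach the stack. Assuming the same llama-stack-client package we used in the earlier posts in this series, a quick sanity check looks something like this:
# Quick sanity check that the Llama Stack server is reachable (assumes llama-stack-client is installed).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://10.1.2.128:8321")

# List the registered models to confirm the server is responding.
for model in client.models.list():
    print(model)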
A first run
Having enabled tracing in Llama Stack, we wanted to see what would be captured with one of the programs we had used in earlier experimentations. We chose llama-stack-agent-rag.py. If you want to learn more about the application, read Retrieval-augmented generation with Llama Stack and Python.
After running that application, we went to the Jaeger search page and selected LlamaStack for the Service, matching the service name we configured earlier (Figure 2).

And voilà—there were spans for each of the calls to the Llama Stack APIs, as shown in Figure 3.

Expanding the trace for the creation of a vector database (VectorDB), you can see details such as the API endpoint and arguments, as illustrated in Figure 4.

You can see that the endpoint used when we created the vector database was /v1/vector-dbs and the arguments were:
{'vector_db_id': 'test-vector-db-6f0546e3-0d2e-46e2-9ac0-edee30f5b92d', 'embedding_model': 'all-MiniLM-L6-v2', 'embedding_dimension': '384', 'provider_id': 'faiss', 'provider_vector_db_id': ''}
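For context, that span corresponds to the vector database registration made on the client side. Based on the arguments captured in the span, the call looks roughly like the following; the full code is in the RAG post's sample application:
# Roughly the client-side call behind the /v1/vector-dbs span shown above.
import uuid
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://10.1.2.128:8321")

vector_db_id = f"test-vector-db-{uuid.uuid4()}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="faiss",
)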
You can similarly expand the spans for each of the calls to get the details of what's going on. And that is without having to change our application at all!
Not quite perfect
One thing we did notice when looking at the spans, however, was that each of the Llama Stack API calls was in its own span, and the calls that we made in the application were not tied together in any way. This would be OK if we only had one request at a time, but in a real Llama Stack instance we are likely to have many concurrent applications making requests, which means we would have no way to figure out which spans belong together.
What is typically done to address this issue is to instrument the application and to open a span at the beginning of a request and then close it after all of the related work is done.
There is an API in Llama Stack (/v1/telemetry/events) that allows us to start/stop a span, but after experimenting with it we found there is no way to get the ID of the span it creates, so we cannot use it as the parent span for our sequence of calls. Instead, we needed to use the OpenTelemetry APIs to instrument the application so that we could create a parent span to tie things together.
Instrumenting the application
The full code for the instrumented application is available in llama-stack-agent-rag-otel.py. This is a modified version of the original application that enables the OpenTelemetry SDK auto-instrumentation for HTTP and creates a span that wraps all of the steps in the application.
One key thing to keep in mind is that because spans are now being sent both by the Llama Stack server and from the application, they must both have access to the Jaeger instance, as outlined in Figure 5.

If Jaeger does not receive the span information from the Python application, the spans from Llama Stack will still reference the correct parent span, but Jaeger will report that parent as an invalid span.
OpenTelemetry Python SDK and auto-instrumentation
The first step to instrumenting the application was to set up the Python SDK and configure it to send spans to the Jaeger instance and instrument http calls.
The code to do that was as follows:
########################
# Set up the instrumentation tracing
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.propagate import set_global_textmap
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
import uuid
import logging
from pathlib import Path
from strip_markdown import strip_markdown
# Set up the tracer provider
trace.set_tracer_provider(TracerProvider())
# Set up the OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint="http://10.1.2.128:4318/v1/traces",
)
# Set up the span processor
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
# Set up instrumentations
HTTPXClientInstrumentor().instrument()
# Set up propagator
set_global_textmap(TraceContextTextMapPropagator())
There are a few key parts we will point out:
- We have included a propagator, TraceContextTextMapPropagator(). We'll talk about this in more detail later.
- Llama Stack uses HTTPX for remote calls, so we enable auto-instrumentation for it.
- This code must come before any of the application code, including the imports for that code.
Starting and stopping a span
The next modification to the application is to wrap the existing code with the creation of a span as follows:
# Start the span for the overall request
tracer = trace.get_tracer("Python LlamaStack application")
with tracer.start_as_current_span(f"Python LlamaStack request - {uuid.uuid4()}"):
Python LlamaStack request - {uuid.uuid4()} is the name that we give the top-level span, and it is what we will look for in Jaeger after running the instrumented application.
Otherwise the application is the same as it was in our earlier experimentation.
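Concretely, the rest of the application simply moves inside that with block. Here is a sketch of the overall shape, with the RAG steps from the earlier post elided (trace and uuid come from the setup code shown above):
# Sketch of the overall shape; the elided steps are the unchanged RAG code from the earlier post.
tracer = trace.get_tracer("Python LlamaStack application")
with tracer.start_as_current_span(f"Python LlamaStack request - {uuid.uuid4()}"):
    # ... register the vector database and ingest the documents ...
    # ... create the agent and run the conversation turns ...
    # The span is closed automatically when the with block exits,
    # even if an exception is raised part way through.
    ...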
Running the application
The last thing to mention is that before running the application we set:
export OTEL_SERVICE_NAME=PythonAgentRAG
This sets the service name for our Python application.
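If you would rather not rely on the environment variable, the OpenTelemetry SDK also lets you set the service name in code as a resource attribute when creating the tracer provider. A sketch that would replace the trace.set_tracer_provider(TracerProvider()) call shown earlier:
# Alternative to OTEL_SERVICE_NAME: set the service name as a resource attribute in code.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "PythonAgentRAG"}))
)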
A better result
After running the instrumented application, we can again go to Jaeger. This time, let's search on the PythonAgentRAG service name that we configured for our application (Figure 6).

Then, look at the resulting spans (Figure 7).

It shows a single top-level span for our application run (which could be a request if it were a microservice). It is named PythonAgentRAG: Python LlamaStack request - e0b618b8-1c2a-4955-b610-e551bba9fe91, as requested in our application, with the last part being a unique UUID generated for this particular run.
We can then expand the span to see all of the work related to that run, shown in Figure 8.

Looking at just a smaller part (Figure 9), you can see that the top-level span includes spans with the following service names:
- PythonAgentRAG: These come from the Python application.
- LlamaStack: These come from Llama Stack.

The first one, tagged with PythonAgentRAG, is the top-level span that we created in the application.
The second one, tagged with PythonAgentRAG, is the outgoing GET request made by the Llama Stack client. That span is generated by the HTTPX auto-instrumentation.
Under that GET request, you can see the spans generated in the Llama Stack server as it processes the request.
Having the top-level span gives us a much better grouping of related spans, and we can now easily find all of the related work for a run/request even if there were multiple concurrent runs at the same time. It also allows us to better visualize the overall timing, as we can see how long the full run took as well as how long each of the individual Llama Stack API requests took.
If we had a more complicated application, we could also instrument other aspects of it (for example, database access) or create additional subspans to get information on a more granular basis. For example, wrapping an individual step in its own child span is just another with block, as shown in the sketch below.
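The step and helper names here are only illustrative; tracer is the one created earlier in the application:
# Illustrative only: give one step of the application its own child span.
with tracer.start_as_current_span("prepare-documents"):
    documents = load_and_chunk_documents()  # hypothetical helper in your application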
One last thing to note: it is important to ensure that sensitive information is not being captured in the spans, so it's a good idea to review what the spans contain and to carefully control access to the Jaeger UI.
Connecting the spans between the application and Llama Stack
It's great that the spans created by the application and the spans created by Llama Stack are tied together so that we get a unified view, but how does that work across the http call?
Earlier we mentioned the inclusion of a propagator when we configured the OpenTelemetry SDK:
set_global_textmap(TraceContextTextMapPropagator())
It's the job of the propagator, working with the HTTPX instrumentation, to add an additional HTTP header to each of the requests made by the Llama Stack client to the Llama Stack server.
These headers look like this:
traceparent: version-traceid-spanid-traceflags
Here's a specific example:
traceparent: 00-1e9a84a8c5ae45c30b1305a0f41ed275-215435bcec6efa72-00
When the Llama Stack server receives an API request, it looks for this header and, if it is present, uses the span it identifies as the parent for any spans created while processing the request.
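You can see what the propagator injects by asking it to fill in a carrier dictionary yourself; this is essentially what the HTTPX instrumentation does for each outgoing request. A standalone sketch:
# Standalone demo: inject the current trace context into a header carrier,
# mirroring what the HTTPX auto-instrumentation does for outgoing requests.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("propagation-demo")
propagator = TraceContextTextMapPropagator()

with tracer.start_as_current_span("demo-span"):
    headers = {}
    propagator.inject(headers)
    print(headers["traceparent"])  # e.g., 00-<trace id>-<span id>-01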
If you want to read about all the details of HTTP trace propagation, see the W3C recommendation Trace Context.
Wrapping up
In this post, we outlined our experiments with distributed tracing using Python, large language models, and Llama Stack. As part of that, we showed you the code to get a unified span for all of the work in an application request or run. As you can see, distributed tracing is well supported in Llama Stack, which is key, because observability is important for real-world applications. Llama Stack also includes support for logging and metrics, but we will leave looking at those to a later post.
Explore more tutorials on our AI/ML and observability topic pages.