
Your agent, your rules: A deep dive into the Responses API with Llama Stack

August 20, 2025
J. William Murdock, Roland Huß, Ann Marie Fred
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI

    The OpenAI Responses API provides substantial value for developers building AI applications. With many earlier inference APIs, creating agents that could use tools involved a clunky, multi-step process. Client applications had to orchestrate each part of the process: call the model with a list of possible tools, get the plan for tool execution from the model, execute the tools, send the results of the tool execution back to the model, and repeat. 

    This required developers to build and maintain complex state and orchestration logic in their own applications. Less experienced developers might have done this poorly, resulting in applications that are slow, place unnecessary load on the model servers, or have poor accuracy because the orchestration is suboptimal.
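    The client-side loop described above can be sketched as follows. This is an illustrative skeleton, not any particular SDK's API: the `call_model` callable and the `TOOL_IMPLS` registry are hypothetical names standing in for whatever inference client and tool implementations an application uses.

```python
import json

# Hypothetical registry mapping tool names to Python callables,
# populated by the application before the loop runs.
TOOL_IMPLS = {}

def run_tool_loop(call_model, messages, tools, max_turns=5):
    """Repeatedly call the model, execute any tools it requests,
    and feed the results back until it answers in plain text."""
    for _ in range(max_turns):
        reply = call_model(messages=messages, tools=tools)
        tool_calls = reply.get("tool_calls")
        if not tool_calls:
            return reply["content"]  # final answer, loop is done
        messages.append(reply)  # keep the model's plan in the history
        for call in tool_calls:
            # Execute each requested tool and append its result.
            result = TOOL_IMPLS[call["name"]](**json.loads(call["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    raise RuntimeError("model did not converge within max_turns")
```

    Every application that wanted tool use had to carry some version of this state machine, along with error handling, retries, and history management.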

    The Responses API provides a structured interface where the AI service can perform multi-step reasoning, call multiple tools, and manage conversational state within a single, consolidated interaction. By allowing the server to handle more of the internal orchestration, it greatly streamlines the development of sophisticated agentic applications.
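    With the Responses API, that whole loop collapses into a single request. A minimal sketch, assuming an OpenAI-compatible client pointed at a self-hosted Llama Stack server; the base URL, model name, and vector store ID are placeholders for your own deployment:

```python
def build_responses_request(model, user_input, tools=()):
    """Assemble one Responses API request; the server, not the
    client, plans and executes any tool calls it needs."""
    return {"model": model, "input": user_input, "tools": list(tools)}

request = build_responses_request(
    model="llama-3.2-3b-instruct",  # placeholder: any model your server hosts
    user_input="Plan a weekend trip to a national park near Boston.",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_parks"]}],  # illustrative
)

# With a Llama Stack server running (URL is an assumption):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
# print(client.responses.create(**request).output_text)
```

    The client sends one request and reads one answer; the orchestration from the previous section happens server-side.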

    However, OpenAI’s implementation of the Responses API comes with a catch: it is tied to specific models and a proprietary cloud service. What if you wanted this advanced architecture with the freedom to choose your own models? What if your organization's security posture demands you run on your own infrastructure?

    Introducing Llama Stack

    Llama Stack is a powerful, open source server for AI capabilities. Among its many features is a Responses API that is compatible with the OpenAI Responses API specification. It allows you to deploy a production-ready endpoint on your own hardware, powered by any model you choose, from versatile open source models to highly specialized models you've created or fine-tuned yourself, including highly optimized models you can find in the Red Hat AI Hugging Face repository.

    This post will walk you through the key features of the Responses API. We'll show you how Llama Stack empowers you to build next-generation AI agents with the ease and performance you need.

    Note

    Follow along with our companion Python Notebook for hands-on examples.

    Private and powerful RAG

    Retrieval-augmented generation (RAG) enhances large language models by grounding them in authoritative knowledge sources. This enables them to provide answers that are more accurate and trustworthy, whether they're drawing from up-to-the-minute data, private documents, or a canonical set of public information. The Responses API formalizes this with built-in tools like file_search, which allows a model to intelligently query document collections.

    With a public hosted service, using this feature might require uploading your sensitive documents to a third party, which can be a non-starter for organizations in finance, healthcare, law, and other highly regulated industries. With Llama Stack, your RAG workflows remain entirely within your security perimeter.

    Our companion notebook demonstrates this with a practical example.

    1. Document ingestion: A PDF document containing information about U.S. National Parks is downloaded. Using the Llama Stack client, we create a vector_store and upload the file. This entire process happens on your local server, ensuring the document remains private.
    2. Intelligent querying: We then ask the model a question that can be answered from the document: When did the Bering Land Bridge become a national preserve?
    3. Automated retrieval and synthesis: In a single API call, the model running on Llama Stack sees the user's question and the available file_search tool. It automatically generates and executes a search query against the vector store, finds the relevant passage, and synthesizes the correct answer: December 2, 1980. Crucially, the response also includes references to the source text, allowing for easy verification.

    In the RAG section of the notebook, you can run the code to see how Llama Stack acts as a secure, intelligent orchestrator for private RAG.
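    The flow above can be sketched with the OpenAI Python client pointed at a self-hosted Llama Stack server. The base URL, port, model name, and file path are assumptions about your deployment, and the exact upload helper may differ across client versions; the companion notebook is the authoritative version.

```python
def file_search_tool(vector_store_ids, max_num_results=5):
    """Build the file_search tool spec for a Responses request."""
    return {"type": "file_search",
            "vector_store_ids": list(vector_store_ids),
            "max_num_results": max_num_results}

def run_rag_demo(base_url="http://localhost:8321/v1/openai/v1"):
    """Sketch of the notebook's RAG flow; requires a running
    Llama Stack server at the assumed base_url."""
    from openai import OpenAI  # imported here so the sketch stays optional
    client = OpenAI(base_url=base_url, api_key="none")

    # 1. Document ingestion: create a vector store and upload the PDF.
    #    Everything stays on your local server.
    store = client.vector_stores.create(name="national-parks")
    with open("national_parks.pdf", "rb") as f:  # placeholder path
        client.vector_stores.files.upload(vector_store_id=store.id, file=f)

    # 2-3. Query and synthesis: one call; the server searches the
    #      vector store and grounds its answer in the retrieved text.
    response = client.responses.create(
        model="llama-3.2-3b-instruct",  # placeholder model name
        input="When did the Bering Land Bridge become a national preserve?",
        tools=[file_search_tool([store.id])],
    )
    return response.output_text

# run_rag_demo()  # uncomment with a Llama Stack server running
```

    The only network the documents ever cross is the one between your client and your own server.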

    Automated multi-tool orchestration with MCP

    The true power of an agent lies in its ability to deconstruct a complex request into a sequence of smaller steps. The Responses API enables this by allowing a model to plan and execute a chain of tool calls in a single turn. Llama Stack brings this sophisticated capability to the model of your choice.

    The notebook showcases this with an example using the Model Context Protocol (MCP), an open standard for third-party tool integration. We ask our agent a complex question: Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.

    To answer this, the model needs to perform several steps. With Llama Stack, this entire workflow is automated within one API interaction:

    1. Tool discovery: The model first inspects the available tools from a connected National Parks Service (NPS) MCP server.
    2. Initial search: It identifies the search_parks tool and calls it with the argument state_code="RI" to find relevant parks.
    3. Iterative event search: The search_parks tool returns four national parks. The model then intelligently calls the get_park_events tool for each of the four parks, automatically using the correct park_code from the initial search response.
    4. Final synthesis: After receiving the event information from all four calls, the model synthesizes the data into a single, user-friendly summary.

    The most important part? This entire 7-step process (1 tool discovery, 1 park search, 4 event searches, and 1 final synthesis) happens within a single call to the Responses API. The client-side application doesn't need to write any of the complex orchestration logic.
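    The MCP example can be sketched the same way: a single Responses request whose tools list contains an `mcp` entry pointing at the NPS MCP server. The server URL, label, and model name below are placeholders for whatever endpoint you have running.

```python
def mcp_tool(server_label, server_url):
    """Build an MCP tool spec; the Responses server discovers the
    tools the MCP endpoint exposes and calls them as needed."""
    return {"type": "mcp",
            "server_label": server_label,
            "server_url": server_url,
            "require_approval": "never"}  # let the server call tools freely

request = {
    "model": "llama-3.2-3b-instruct",  # placeholder model name
    "input": ("Tell me about some parks in Rhode Island, and let me "
              "know if there are any upcoming events at them."),
    "tools": [mcp_tool("nps", "http://localhost:3005/sse")],  # assumed URL
}
# response = client.responses.create(**request)
# All seven steps (discovery, park search, four event lookups,
# synthesis) happen inside this one call.
```

    The client code is the same size whether the model needs one tool call or seven.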

    You can see this entire interaction, including the model's intermediate steps and final output, in the MCP tool calling section of the companion notebook.

    Use your favorite framework: LangChain, OpenAI, and more

    If you're already using a popular agentic framework, integrating Llama Stack is seamless. Because Llama Stack implements the OpenAI-compatible Responses API, you can use it as a drop-in replacement for a proprietary, hosted endpoint. Llama Stack becomes the server-side engine that powers your existing client-side toolkit.

    The notebook demonstrates this by running the exact same basic RAG and MCP queries with both the native Llama Stack Python client and the OpenAI Python client. It also provides a brief introduction to using Llama Stack with LangChain. To switch from a proprietary service to your self-hosted Llama Stack server in LangChain, you only need to change the ChatOpenAI constructor. The rest of your agent and chain logic remains exactly the same. The LangChain section of the notebook shows you how.
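    The swap can be sketched in a few lines. The base URL and model name are assumptions about your Llama Stack deployment; everything else in your chains stays untouched.

```python
def llama_stack_chat_kwargs(base_url="http://localhost:8321/v1/openai/v1",
                            model="llama-3.2-3b-instruct"):
    """Constructor kwargs for ChatOpenAI; only these change when
    moving off a proprietary hosted service."""
    return {"base_url": base_url, "model": model, "api_key": "none"}

# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(**llama_stack_chat_kwargs())
# The rest of your agent and chain logic uses `llm` unchanged.
```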

    This drop-in compatibility allows you to leverage the vast ecosystem of frameworks like LangChain while maintaining full control over your model, data, and infrastructure.

    Your agent, your way

    Llama Stack’s Responses API compatibility is still maturing. Furthermore, the OpenAI API specification is proprietary and moves quickly; there will be a delay between when a new feature lands in the official OpenAI specification and when it is fully implemented in Llama Stack. However, Llama Stack provides significant benefits that offset the downside of waiting for the latest features.

    The Responses API provides an excellent blueprint for the future of AI agents, and Llama Stack takes that blueprint and makes it open, flexible, and yours to command. 

    With Llama Stack, you gain: 

    • Model freedom: go beyond a handful of proprietary models and choose from Llama Stack's inference providers, host an open source model, or deploy one you've fine-tuned yourself.
    • Data sovereignty: build powerful RAG and tool-calling agents for your most sensitive data, confident that it remains within your secure infrastructure.
    • An open, extensible stack: avoid vendor lock-in by building on an open source server that implements the widely adopted Responses API.

    To see these examples in action and start building more powerful, private, and customizable agents, explore the Llama Stack documentation and run the companion notebook for this blog post today.

    The power to choose is yours, today and tomorrow.
