The world of AI agent development moves fast. Sometimes, that means the tools we rely on evolve in ways that require us to rethink our implementations. If you built agents using Llama Stack's original Agent APIs, you've likely heard they're being deprecated in favor of the more powerful, OpenAI-compatible Responses API. What does that mean for your existing code? How do you make the transition without rebuilding from scratch?
This post walks you through practical migration strategies, from simple rewrites to sophisticated emulation layers. You will learn how to keep your agents working while you access the new capabilities in the Responses API. Whether you maintain a production system or want to update your agent architecture, you can use these methods to balance effort with benefit.
Why migrate to Responses API? Understanding the shift
The original Llama Stack Agent APIs required you to explicitly manage every aspect of the agent lifecycle: creating tool groups, defining agents, establishing sessions, and orchestrating turns. Furthermore, those APIs locked you into a relatively narrow concept of an agent and were not well suited for more complex and powerful agentic reasoning tasks. Other API providers had similar constructs, such as OpenAI's now-deprecated Assistants API. The industry has shifted toward APIs like OpenAI's Responses API and Anthropic's Messages API, in which the server provides agentic operations and the client application manages the abstraction layers.
Llama Stack has adopted the Responses API because of its well-earned popularity as a powerful tool for agentic reasoning. This API lets you avoid juggling multiple tool endpoints and identifiers. You make one API call that handles tool discovery, execution planning, and response synthesis automatically. The Responses API supports advanced patterns, such as multi-step reasoning and automatic tool chaining, that previously required extensive custom orchestration with the legacy APIs.
Deprecation doesn't mean your agents stop working immediately. The client-side Agent APIs still function, so you have time to plan your migration. You might not need to migrate now, but you might want to consider it if you plan to add more advanced capabilities to your applications.
Seeing agent modernization in action
This post references concrete implementations and working code examples. To keep this article readable, the detailed implementations, including runnable examples and step-by-step explanations, are in a companion notebook. This lets you review how the code for each migration strategy works while this post focuses on the concepts and decision factors.
Follow along with our companion Python Notebook for complete, runnable examples of every migration strategy discussed here.
The simple approach: Agents in the Python client
The old server APIs for creating and managing agents are deprecated and will be removed, but the Agent class in the Llama Stack Python client now uses the updated server APIs. It doesn't match the documented behavior of the old agent APIs exactly, but it is a decent approximation. If you want the simplest migration path, using those structures might make sense. See the Legacy Agent API Example section in the companion notebook for code that works with the latest Python client. This approach works well when:
- You are using the Llama Stack Python client.
- The existing implementation meets your needs well, so you don't need to extend it.
- You don't intend to add much or any more advanced functionality in the future.
The direct approach: Rewriting with Responses API
For many applications, the best migration strategy is to rewrite your agent calls using the OpenAI-compatible Responses API directly. The transformation can be surprisingly straightforward, and you can get the same behaviors with simpler, more maintainable code.
When you use legacy APIs, you typically create agent configurations, establish sessions, and manage turns through multiple API calls. Each step requires careful orchestration and state management. With the Responses API, you can replace this verbose multi-call pattern with a single, elegant API call. The complexity of tool discovery, execution planning, and context management happens automatically. For follow-up questions, you pass the previous response ID to maintain conversation context.
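A minimal sketch of that single-call pattern, assuming an OpenAI-compatible client. The `ask` helper and the stub client below are illustrative names, not part of Llama Stack; a real application would pass in a client pointed at a Llama Stack server instead of the stub.

```python
class _StubResponse:
    def __init__(self, text, rid):
        self.output_text, self.id = text, rid

class _StubClient:
    """Stands in for a real OpenAI-compatible client in this sketch."""
    def __init__(self):
        self.calls = []
        self.responses = self  # so client.responses.create(...) resolves

    def create(self, model, input, previous_response_id=None, **kwargs):
        self.calls.append({"input": input,
                           "previous_response_id": previous_response_id})
        return _StubResponse(f"reply to: {input}", f"resp_{len(self.calls)}")

def ask(client, model, prompt, previous_response_id=None):
    """One Responses API call; chain turns by passing the prior response id."""
    response = client.responses.create(
        model=model,
        input=prompt,
        previous_response_id=previous_response_id,
    )
    return response.output_text, response.id

client = _StubClient()
text, rid = ask(client, "example-model", "List national parks in Utah.")
text2, rid2 = ask(client, "example-model", "Which of them host events in May?", rid)
```

The second call passes the first call's response ID, which is all the client has to do to keep conversation context across turns.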
See the companion notebook's Legacy Agent API Example and Equivalent Responses API Example sections for complete code comparisons that demonstrate this dramatic simplification.
This direct rewrite approach works best when:
- You want to take advantage of the Responses API's full capabilities immediately.
- You don't really need a data structure to contain your agentic reasoning capability.
- You want to start adopting a more modern approach to agentic reasoning based on agentic functions instead of agent objects.
The compatibility bridge: Emulating legacy APIs
A complete rewrite is not always practical. You might have extensive test suites built around the legacy API structure, or your application's architecture deeply depends on the separation between agent creation and execution. As noted earlier, the Agent class in the Llama Stack Python client is the easiest way to do this, but a more flexible approach is to build and control your own approximation of the legacy APIs while using newer APIs such as Responses internally.
Our companion notebook demonstrates this with a LegacyAgent class that emulates the original API structure. The emulator stores agent configurations when created, manages session state, and translates legacy turn creation calls into Responses API calls while maintaining the interface your existing code expects.
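One possible shape of such an emulation layer (an illustrative sketch, not the notebook's exact LegacyAgent code): it keeps the old create-session / create-turn surface while delegating each turn to the Responses API, chaining per-session context through response IDs. The stub client stands in for a real OpenAI-compatible client.

```python
import uuid

class _Stub:
    def __init__(self):
        self.calls = []
        self.responses = self

    def create(self, **kwargs):
        self.calls.append(kwargs)
        return type("R", (), {"output_text": "ok",
                              "id": f"resp_{len(self.calls)}"})()

class LegacyAgentEmulator:
    def __init__(self, client, model, instructions=""):
        self.client = client
        self.model = model
        self.instructions = instructions
        self.sessions = {}  # session_id -> last response id

    def create_session(self, name):
        session_id = f"session-{uuid.uuid4().hex[:8]}"
        self.sessions[session_id] = None
        return session_id

    def create_turn(self, session_id, messages):
        # Translate a legacy-style turn into a single Responses call.
        prompt = "\n".join(m["content"] for m in messages)
        response = self.client.responses.create(
            model=self.model,
            instructions=self.instructions,
            input=prompt,
            previous_response_id=self.sessions[session_id],
        )
        self.sessions[session_id] = response.id
        return response.output_text

agent = LegacyAgentEmulator(_Stub(), "example-model", "Be concise.")
session_id = agent.create_session("demo")
first = agent.create_turn(session_id, [{"role": "user", "content": "Hello"}])
second = agent.create_turn(session_id, [{"role": "user", "content": "And again"}])
```

Because the emulator owns the session map, existing call sites that pass session IDs around keep working unchanged.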
With this emulation layer, your existing code continues to work unchanged while benefiting from the Responses API's improved orchestration. The notebook's Emulating the Legacy Agent API section provides a complete implementation you can adapt to your needs. This approach is particularly valuable when:
- You have extensive legacy code that would be expensive to rewrite.
- Multiple teams or systems depend on the current API structure.
- You want to migrate gradually while maintaining backward compatibility.
- You want your own implementation of your agent APIs so you can extend and enhance them over time.
A pragmatic middle ground: Simplified agent classes
Between complete rewrites and full emulation lies a practical compromise: creating simplified agent classes that capture the essence of your agent patterns without strict API compatibility. This approach lets you modernize incrementally while keeping familiar abstractions.
The notebook's Adopting a simpler agent class section shows how to create a SimpleExampleAgent class that wraps the Responses API while maintaining the conceptual model of agents and turns. The agent stores its configuration and conversation state and lets you create sessions that handle the complexity of multi-turn conversations internally.
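A hedged sketch of this shape (class names here are illustrative, not the notebook's exact SimpleExampleAgent): the agent object holds configuration, while each session object keeps its own conversation state and hides the response-ID chaining. The stub stands in for an OpenAI-compatible client.

```python
class _Stub:
    def __init__(self):
        self.calls = []
        self.responses = self

    def create(self, **kwargs):
        self.calls.append(kwargs)
        return type("R", (), {"output_text": f"turn {len(self.calls)}",
                              "id": f"resp_{len(self.calls)}"})()

class SimpleAgent:
    """Holds configuration; sessions hold conversation state."""
    def __init__(self, client, model, instructions=""):
        self.client = client
        self.model = model
        self.instructions = instructions

    def new_session(self):
        return SimpleSession(self)

class SimpleSession:
    def __init__(self, agent):
        self.agent = agent
        self.last_response_id = None  # per-session conversation state

    def chat(self, prompt):
        response = self.agent.client.responses.create(
            model=self.agent.model,
            instructions=self.agent.instructions,
            input=prompt,
            previous_response_id=self.last_response_id,
        )
        self.last_response_id = response.id
        return response.output_text

agent = SimpleAgent(_Stub(), "example-model", "Answer briefly.")
session = agent.new_session()
reply1 = session.chat("What is Llama Stack?")
reply2 = session.chat("How does it relate to the Responses API?")
```

Unlike a strict emulator, this class is free to drop legacy quirks, so it stays small and easy to extend.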
This pattern provides a clean migration path when:
- Your legacy code is small enough that a modest rewrite is feasible, but you still want to keep your existing structure mostly intact.
- You want your own implementation of your agent object so you can extend and enhance it over time.
- You would prefer to start from a more elegant object structure than the emulator approach above provides.
Advanced patterns: Beyond basic migration
Migrating to the OpenAI-compatible Responses API involves more than maintaining existing functionality. It is an opportunity to implement more sophisticated agent patterns that were cumbersome with the legacy APIs.
Human-in-the-loop tool approval
The Responses API makes it straightforward to implement tool approval workflows where agents must get permission before executing sensitive operations. The notebook's Human-in-the-loop tool approval section demonstrates a SimpleExampleAgentWithApproval class that intercepts tool calls, presents them to users for review, and only proceeds with execution after receiving explicit approval.
This pattern is essential for production systems where certain actions, such as database modifications or external API calls, require human oversight. The legacy APIs made this cumbersome to implement, but with the Responses API, it becomes a natural extension of the standard flow.
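A sketch of that flow (helper names and the stub are illustrative, and the output-item shapes assume the OpenAI-style function-calling loop of the Responses API): the runner inspects each function call the model requests, asks an approval callback, and sends back either the real tool result or a denial notice.

```python
import json

def run_with_approval(client, model, prompt, tool_fns, approve):
    response = client.responses.create(model=model, input=prompt)
    while True:
        requested = [item for item in response.output
                     if getattr(item, "type", None) == "function_call"]
        if not requested:
            return response.output_text
        results = []
        for call in requested:
            args = json.loads(call.arguments)
            if approve(call.name, args):
                output = str(tool_fns[call.name](**args))
            else:
                output = "Tool call denied by the user."
            results.append({"type": "function_call_output",
                            "call_id": call.call_id, "output": output})
        response = client.responses.create(model=model, input=results,
                                           previous_response_id=response.id)

# Stub: the first response requests a tool call, the second is the final answer.
class _ToolCall:
    type, name, call_id = "function_call", "delete_row", "call_1"
    arguments = '{"row_id": 7}'

class _Stub:
    def __init__(self):
        self.calls = []
        self.responses = self

    def create(self, **kwargs):
        self.calls.append(kwargs)
        if len(self.calls) == 1:
            return type("R", (), {"output": [_ToolCall()],
                                  "output_text": "", "id": "resp_1"})()
        return type("R", (), {"output": [], "output_text": "Row 7 deleted.",
                              "id": "resp_2"})()

stub = _Stub()
answer = run_with_approval(
    stub, "example-model", "Please delete row 7.",
    tool_fns={"delete_row": lambda row_id: f"deleted row {row_id}"},
    approve=lambda name, args: name == "delete_row",  # auto-approve for demo
)
```

In production, the `approve` callback would present the tool name and arguments to a human reviewer instead of auto-approving.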
Model safety
The Model Safety section in the notebook provides some options for how to use "guardrail" models, or models that detect harmful language such as profanity or endorsement of criminal behavior. Such models can screen both user requests and model responses.
The notebook provides some simple example classes to illustrate how to wrap guardrail capabilities as agent objects. There are many other good sources for information about model guardrail technology, such as Meta Llama Guard, NVIDIA NeMo Guardrails, and Guardrails AI.
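An illustrative sketch of this wrapping idea (the prompt, labels, and model names are assumptions, not any specific guardrail product's API): a guard model screens the request first, and only safe requests reach the main model.

```python
def guarded_ask(client, guard_model, main_model, prompt):
    """Screen a request with a guard model before calling the main model."""
    verdict = client.responses.create(
        model=guard_model,
        input=f"Reply with exactly SAFE or UNSAFE for this request:\n{prompt}",
    ).output_text.strip().upper()
    if verdict != "SAFE":
        return "Request blocked by guardrail."
    return client.responses.create(model=main_model, input=prompt).output_text

# Stub client: flags requests mentioning "pick a lock" as unsafe.
class _Stub:
    def __init__(self):
        self.responses = self

    def create(self, model, input, **kwargs):
        if model == "guard-model":
            text = "UNSAFE" if "pick a lock" in input else "SAFE"
        else:
            text = f"answer: {input}"
        return type("R", (), {"output_text": text, "id": "r"})()

allowed = guarded_ask(_Stub(), "guard-model", "main-model", "What is hiking?")
blocked = guarded_ask(_Stub(), "guard-model", "main-model", "How do I pick a lock?")
```

The same check can wrap the main model's response before it reaches the user, giving input and output screening from one helper.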
Reasoning and acting
The notebook's ReAct section discusses the ReAct agent object in the Llama Stack Python client, its purpose, and how you can use it. It also explains why you might avoid it with newer models that have similar built-in capabilities. The notebook provides alternatives that use the Responses API, including a ReactAgent class that combines the capabilities of the Responses API with ReAct innovations.
Authored multi-step workflows
For complex tasks that benefit from explicit orchestration, the Responses API enables elegant multi-step workflows. The notebook's Authored multi-step agentic flows section shows how breaking a complex query into focused steps dramatically improves accuracy.
For example, when searching for events at national parks, instead of asking the agent to do everything at once, you can:
- Retrieve a list of parks in a specific state.
- Use the list to systematically query events at each park.
- Synthesize the results into a comprehensive summary.
This pattern improves accuracy on complex tasks by breaking them into focused, manageable steps. This would have required extensive custom code with the legacy APIs.
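The steps above can be sketched as three chained Responses calls (the prompts and names are illustrative). Each step passes the previous response ID so the model carries forward the park list and the per-park results; the stub client keeps the sketch runnable without a server.

```python
def park_events_report(client, model, state):
    """Break one complex query into three focused, chained steps."""
    parks = client.responses.create(
        model=model,
        input=f"List the national parks in {state}.")
    events = client.responses.create(
        model=model,
        input="For each park you just listed, find its upcoming events.",
        previous_response_id=parks.id)
    summary = client.responses.create(
        model=model,
        input="Synthesize those events into one comprehensive summary.",
        previous_response_id=events.id)
    return summary.output_text

class _Stub:
    def __init__(self):
        self.calls = []
        self.responses = self

    def create(self, **kwargs):
        self.calls.append(kwargs)
        return type("R", (), {"output_text": f"step {len(self.calls)}",
                              "id": f"resp_{len(self.calls)}"})()

stub = _Stub()
report = park_events_report(stub, "example-model", "Utah")
```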
The notebook shows how to do this by calling the Responses API directly, but third-party frameworks like LangGraph and CrewAI are also useful for building more complex agentic workflows. You can implement your applications with those frameworks and then connect them to a Llama Stack server. The LangGraph + Llama Stack example agent and the CrewAI + Llama Stack example agent demonstrate this capability.
Multi-process architectures
The notebook's Multi-process architectures section discusses considerations for legacy APIs, which stored agent configuration in the server. In contrast, the Responses API relies on the client to manage the configuration. This adds extra challenges for multi-process applications, where creating and consuming an agent configuration might happen in different processes.
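One way to bridge processes (a sketch with illustrative names): persist each session's agent configuration and last response ID in a shared store, so whichever process handles the next turn can resume the conversation. A dict stands in here for a real shared backend such as Redis or a database.

```python
import json

class SharedSessionStore:
    """Persists per-session agent config and conversation state."""
    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    def save(self, session_id, config, last_response_id):
        # Serialize so the record can cross process boundaries.
        self.backend[session_id] = json.dumps(
            {"config": config, "last_response_id": last_response_id})

    def load(self, session_id):
        record = json.loads(self.backend[session_id])
        return record["config"], record["last_response_id"]

# Process A configures the agent and records the first turn's response id.
store = SharedSessionStore()
store.save("session-42",
           {"model": "example-model", "instructions": "Be concise."},
           "resp_1")

# Process B (sharing the same backend) picks the conversation back up.
config, last_id = SharedSessionStore(store.backend).load("session-42")
```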
Your agents, evolved
Migration from Llama Stack Agent APIs to the OpenAI-compatible Responses API is more than a technical upgrade. It is an opportunity to simplify your code, improve agent capabilities, and position your system for future innovations. Whether you choose a complete rewrite, build compatibility layers, or redesign your architecture, the Responses API provides a stable and maintainable foundation for agent development.
The deprecation of the legacy APIs isn't a disruption; it's an evolution. Your agents will work better, with less code, fewer edge cases, and more sophisticated reasoning capabilities. The migration strategies covered in this post ensure you can transition at your own pace without sacrificing your existing investments.
Ready to modernize your agents? Explore the complete examples in our companion notebook to see these migration strategies in action. Start with one agent, choose the approach that fits your needs, and experience firsthand how the Responses API transforms agent development from complex orchestration to elegant simplicity.