
Your agent, your rules: A deep dive into the Responses API with Llama Stack

August 20, 2025
J. William Murdock, Roland Huß, Ann Marie Fred
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI

    The OpenAI Responses API provides substantial value for developers building AI applications. With many earlier inference APIs, creating agents that could use tools involved a clunky, multi-step process. Client applications had to orchestrate each part of the process: call the model with a list of possible tools, get the plan for tool execution from the model, execute the tools, send the results of the tool execution back to the model, and repeat. 

    This required developers to build and maintain complex state and orchestration logic in their own applications. Less experienced developers might have done this poorly, resulting in applications that are slow, place unnecessary load on the model servers, or have poor accuracy because the orchestration is suboptimal.
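    The client-side loop described above can be sketched as follows. This is an illustrative skeleton, not any particular SDK's API: the `call_model` callable and the `TOOL_IMPLS` registry are hypothetical names standing in for whatever inference client and tool implementations an application uses.

```python
import json

# Hypothetical registry mapping tool names to Python callables,
# populated by the application before the loop runs.
TOOL_IMPLS = {}

def run_tool_loop(call_model, messages, tools, max_turns=5):
    """Repeatedly call the model, execute any tools it requests,
    and feed the results back until it answers in plain text."""
    for _ in range(max_turns):
        reply = call_model(messages=messages, tools=tools)
        tool_calls = reply.get("tool_calls")
        if not tool_calls:
            return reply["content"]  # final answer, loop is done
        messages.append(reply)  # keep the model's plan in the history
        for call in tool_calls:
            # Execute each requested tool and append its result.
            result = TOOL_IMPLS[call["name"]](**json.loads(call["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    raise RuntimeError("model did not converge within max_turns")
```

    Every application that wanted tool use had to carry some version of this state machine, along with error handling, retries, and history management.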

    The Responses API provides a structured interface where the AI service can perform multi-step reasoning, call multiple tools, and manage conversational state within a single, consolidated interaction. By allowing the server to handle more of the internal orchestration, it greatly streamlines the development of sophisticated agentic applications.
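    With the Responses API, that whole loop collapses into a single request. A minimal sketch, assuming an OpenAI-compatible client pointed at a self-hosted Llama Stack server; the base URL, model name, and vector store ID are placeholders for your own deployment:

```python
def build_responses_request(model, user_input, tools=()):
    """Assemble one Responses API request; the server, not the
    client, plans and executes any tool calls it needs."""
    return {"model": model, "input": user_input, "tools": list(tools)}

request = build_responses_request(
    model="llama-3.2-3b-instruct",  # placeholder: any model your server hosts
    user_input="Plan a weekend trip to a national park near Boston.",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_parks"]}],  # illustrative
)

# With a Llama Stack server running (URL is an assumption):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
# print(client.responses.create(**request).output_text)
```

    The client sends one request and reads one answer; the orchestration from the previous section happens server-side.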

    However, OpenAI’s implementation of the Responses API comes with a catch: it is tied to specific models and a proprietary cloud service. What if you wanted this advanced architecture with the freedom to choose your own models? What if your organization's security posture demands you run on your own infrastructure?

    Introducing Llama Stack

    Llama Stack is a powerful, open source server for AI capabilities. Among its many features is a Responses API that is compatible with the OpenAI Responses API specification. It allows you to deploy a production-ready endpoint on your own hardware, powered by any model you choose, from versatile open source models to highly specialized models you've created or fine-tuned yourself, including highly optimized models you can find in the Red Hat AI Hugging Face repository.

    This post will walk you through the key features of the Responses API. We'll show you how Llama Stack empowers you to build next-generation AI agents with the ease and performance you need.

    Note

    Follow along with our companion Python Notebook for hands-on examples.

    Private and powerful RAG

    Retrieval-augmented generation (RAG) enhances large language models by grounding them in authoritative knowledge sources. This enables them to provide answers that are more accurate and trustworthy, whether they're drawing from up-to-the-minute data, private documents, or a canonical set of public information. The Responses API formalizes this with built-in tools like file_search, which allows a model to intelligently query document collections.

    With a public hosted service, using this feature might require uploading your sensitive documents to a third party, which can be a non-starter for organizations in finance, healthcare, law, and other highly regulated industries. With Llama Stack, your RAG workflows remain entirely within your security perimeter.

    Our companion notebook demonstrates this with a practical example.

    1. Document ingestion: A PDF document containing information about U.S. National Parks is downloaded. Using the Llama Stack client, we create a vector_store and upload the file. This entire process happens on your local server, ensuring the document remains private.
    2. Intelligent querying: We then ask the model a question that can be answered from the document: When did the Bering Land Bridge become a national preserve?
    3. Automated retrieval and synthesis: In a single API call, the model running on Llama Stack sees the user's question and the available file_search tool. It automatically generates and executes a search query against the vector store, finds the relevant passage, and synthesizes the correct answer: December 2, 1980. Crucially, the response also includes references to the source text, allowing for easy verification.

    In the RAG section of the notebook, you can run the code to see how Llama Stack acts as a secure, intelligent orchestrator for private RAG.
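    The flow above can be sketched with the OpenAI Python client pointed at a self-hosted Llama Stack server. The base URL, port, model name, and file path are assumptions about your deployment, and the exact upload helper may differ across client versions; the companion notebook is the authoritative version.

```python
def file_search_tool(vector_store_ids, max_num_results=5):
    """Build the file_search tool spec for a Responses request."""
    return {"type": "file_search",
            "vector_store_ids": list(vector_store_ids),
            "max_num_results": max_num_results}

def run_rag_demo(base_url="http://localhost:8321/v1/openai/v1"):
    """Sketch of the notebook's RAG flow; requires a running
    Llama Stack server at the assumed base_url."""
    from openai import OpenAI  # imported here so the sketch stays optional
    client = OpenAI(base_url=base_url, api_key="none")

    # 1. Document ingestion: create a vector store and upload the PDF.
    #    Everything stays on your local server.
    store = client.vector_stores.create(name="national-parks")
    with open("national_parks.pdf", "rb") as f:  # placeholder path
        client.vector_stores.files.upload(vector_store_id=store.id, file=f)

    # 2-3. Query and synthesis: one call; the server searches the
    #      vector store and grounds its answer in the retrieved text.
    response = client.responses.create(
        model="llama-3.2-3b-instruct",  # placeholder model name
        input="When did the Bering Land Bridge become a national preserve?",
        tools=[file_search_tool([store.id])],
    )
    return response.output_text

# run_rag_demo()  # uncomment with a Llama Stack server running
```

    The only network the documents ever cross is the one between your client and your own server.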

    Automated multi-tool orchestration with MCP

    The true power of an agent lies in its ability to deconstruct a complex request into a sequence of smaller steps. The Responses API enables this by allowing a model to plan and execute a chain of tool calls in a single turn. Llama Stack brings this sophisticated capability to the model of your choice.

    The notebook showcases this with an example using the Model Context Protocol (MCP), an open standard for third-party tool integration. We ask our agent a complex question: Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.

    To answer this, the model needs to perform several steps. With Llama Stack, this entire workflow is automated within one API interaction:

    1. Tool discovery: The model first inspects the available tools from a connected National Parks Service (NPS) MCP server.
    2. Initial search: It identifies the search_parks tool and calls it with the argument state_code="RI" to find relevant parks.
    3. Iterative event search: The search_parks tool returns four national parks. The model then intelligently calls the get_park_events tool for each of the four parks, automatically using the correct park_code from the initial search response.
    4. Final synthesis: After receiving the event information from all four calls, the model synthesizes the data into a single, user-friendly summary.

    The most important part? This entire 7-step process (1 tool discovery, 1 park search, 4 event searches, and 1 final synthesis) happens within a single call to the Responses API. The client-side application doesn't need to write any of the complex orchestration logic.
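    The MCP example can be sketched the same way: a single Responses request whose tools list contains an `mcp` entry pointing at the NPS MCP server. The server URL, label, and model name below are placeholders for whatever endpoint you have running.

```python
def mcp_tool(server_label, server_url):
    """Build an MCP tool spec; the Responses server discovers the
    tools the MCP endpoint exposes and calls them as needed."""
    return {"type": "mcp",
            "server_label": server_label,
            "server_url": server_url,
            "require_approval": "never"}  # let the server call tools freely

request = {
    "model": "llama-3.2-3b-instruct",  # placeholder model name
    "input": ("Tell me about some parks in Rhode Island, and let me "
              "know if there are any upcoming events at them."),
    "tools": [mcp_tool("nps", "http://localhost:3005/sse")],  # assumed URL
}
# response = client.responses.create(**request)
# All seven steps (discovery, park search, four event lookups,
# synthesis) happen inside this one call.
```

    The client code is the same size whether the model needs one tool call or seven.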

    You can see this entire interaction, including the model's intermediate steps and final output, in the MCP tool calling section of the companion notebook.

    Use your favorite framework: LangChain, OpenAI, and more

    If you're already using a popular agentic framework, integrating Llama Stack is seamless. Because Llama Stack implements the OpenAI-compatible Responses API, you can use it as a drop-in replacement for a proprietary, hosted endpoint. Llama Stack becomes the server-side engine that powers your existing client-side toolkit.

    The notebook demonstrates this by running the exact same basic RAG and MCP queries with both the native Llama Stack Python client and the OpenAI Python client. It also provides a brief introduction to using Llama Stack with LangChain. To switch from a proprietary service to your self-hosted Llama Stack server in LangChain, you only need to change the ChatOpenAI constructor. The rest of your agent and chain logic remains exactly the same. The LangChain section of the notebook shows you how.
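    The swap can be sketched in a few lines. The base URL and model name are assumptions about your Llama Stack deployment; everything else in your chains stays untouched.

```python
def llama_stack_chat_kwargs(base_url="http://localhost:8321/v1/openai/v1",
                            model="llama-3.2-3b-instruct"):
    """Constructor kwargs for ChatOpenAI; only these change when
    moving off a proprietary hosted service."""
    return {"base_url": base_url, "model": model, "api_key": "none"}

# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(**llama_stack_chat_kwargs())
# The rest of your agent and chain logic uses `llm` unchanged.
```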

    This drop-in compatibility allows you to leverage the vast ecosystem of frameworks like LangChain while maintaining full control over your model, data, and infrastructure.

    Your agent, your way

    Llama Stack’s Responses API compatibility is still maturing. Furthermore, the OpenAI API specification is proprietary and moves quickly; there will be a delay between when a new feature lands in the official OpenAI specification and when it is fully implemented in Llama Stack. However, Llama Stack provides significant benefits that offset the downside of waiting for the latest features.

    The Responses API provides an excellent blueprint for the future of AI agents, and Llama Stack takes that blueprint and makes it open, flexible, and yours to command. 

    With Llama Stack, you gain: 

    • Model freedom: go beyond a handful of proprietary models and choose from Llama Stack's inference providers, host an open source model, or deploy one you've fine-tuned yourself.
    • Data sovereignty: build powerful RAG and tool-calling agents for your most sensitive data, confident that it remains within your secure infrastructure.
    • An open, extensible stack: avoid vendor lock-in by building on an open source server that implements the widely adopted Responses API.

    To see these examples in action and start building more powerful, private, and customizable agents, explore the Llama Stack documentation and run the companion notebook for this blog post today.

    The power to choose is yours, today and tomorrow.
