Your agent, your rules: A deep dive into the Responses API with Llama Stack

August 20, 2025
J William Murdock, Roland Huß, Ann Marie Fred
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI


    The OpenAI Responses API provides substantial value for developers building AI applications. With many earlier inference APIs, creating agents that could use tools involved a clunky, multi-step process. Client applications had to orchestrate each part of the process: call the model with a list of possible tools, get the plan for tool execution from the model, execute the tools, send the results of the tool execution back to the model, and repeat. 

    This required developers to build and maintain complex state and orchestration logic in their own applications. Less experienced developers might have done this poorly, resulting in applications that are slow, place unnecessary load on the model servers, or suffer from poor accuracy because the orchestration is suboptimal.
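
    To make the contrast concrete, here is a rough sketch of what that client-side loop often looked like against a Chat Completions-style endpoint. The endpoint URL, model name, and get_weather tool are purely illustrative placeholders, not values from the companion notebook.

```python
# Illustrative sketch of the manual tool-calling loop the client had to own.
# The endpoint, model name, and get_weather tool are hypothetical placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Boston?"}]

while True:
    resp = client.chat.completions.create(model="my-model", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:          # the model produced a final answer
        print(msg.content)
        break
    messages.append(msg)            # keep the assistant's tool-call plan
    for call in msg.tool_calls:     # the client must execute every tool itself
        args = json.loads(call.function.arguments)
        result = f"Sunny in {args['city']}"   # stand-in for a real tool call
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```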

    The Responses API provides a structured interface where the AI service can perform multi-step reasoning, call multiple tools, and manage conversational state within a single, consolidated interaction. By allowing the server to handle more of the internal orchestration, it greatly streamlines the development of sophisticated agentic applications.
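
    For comparison, a minimal sketch of the same kind of request as a single Responses API call; the local endpoint and model name are assumptions, and any server-side tools would be passed in via the tools parameter:

```python
# A minimal sketch of one Responses API call, with the server handling any
# tool orchestration internally. Endpoint and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.responses.create(
    model="my-model",
    input="What's the weather in Boston?",
    # Server-side tools (file_search, MCP servers, etc.) would be passed here
    # via the tools parameter; see the examples later in this post.
)
print(response.output_text)
```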

    However, OpenAI’s implementation of the Responses API comes with a catch: it is tied to specific models and a proprietary cloud service. What if you wanted this advanced architecture with the freedom to choose your own models? What if your organization's security posture demands you run on your own infrastructure?

    Introducing Llama Stack

    Llama Stack is a powerful, open source server for AI capabilities. Among its many features is a Responses API that is compatible with the OpenAI Responses API specification. It allows you to deploy a production-ready endpoint on your own hardware, powered by any model you choose, from versatile open source models to highly specialized models you've created or fine-tuned yourself, including highly optimized models you can find in the Red Hat AI Hugging Face repository.
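
    As a minimal connection sketch, assuming a Llama Stack server is already running locally (the port is an assumption; adjust to your deployment):

```python
# Connect the native Llama Stack Python client to a locally running server.
# The base_url is an assumption; point it at your own deployment.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List the models the server exposes so you can pick one for the Responses API.
for model in client.models.list():
    print(model.identifier)
```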

    This post will walk you through the key features of the Responses API. We'll show you how Llama Stack empowers you to build next-generation AI agents with the ease and performance you need.

    Note

    Follow along with our companion Python Notebook for hands-on examples.

    Private and powerful RAG

    Retrieval-augmented generation (RAG) enhances large language models by grounding them in authoritative knowledge sources. This enables them to provide answers that are more accurate and trustworthy, whether they're drawing from up-to-the-minute data, private documents, or a canonical set of public information. The Responses API formalizes this with built-in tools like file_search, which allows a model to intelligently query document collections.

    With a public hosted service, using this feature might require uploading your sensitive documents to a third party, which can be a non-starter for organizations in finance, healthcare, law, and other highly regulated industries. With Llama Stack, your RAG workflows remain entirely within your security perimeter.

    Our companion notebook demonstrates this with a practical example.

    1. Document ingestion: A PDF document containing information about U.S. National Parks is downloaded. Using the Llama Stack client, we create a vector_store and upload the file. This entire process happens on your local server, ensuring the document remains private.
    2. Intelligent querying: We then ask the model a question that can be answered from the document: When did the Bering Land Bridge become a national preserve?
    3. Automated retrieval and synthesis: In a single API call, the model running on Llama Stack sees the user's question and the available file_search tool. It automatically generates and executes a search query against the vector store, finds the relevant passage, and synthesizes the correct answer: December 2, 1980. Crucially, the response also includes references to the source text, allowing for easy verification.

    In the RAG section of the notebook, you can run the code to see how Llama Stack acts as a secure, intelligent orchestrator for private RAG.
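
    The snippet below condenses those three steps into a hedged sketch using the OpenAI Python client pointed at a Llama Stack server. The endpoint, model name, file name, and vector store name are illustrative assumptions; the companion notebook remains the authoritative version (and also shows the native Llama Stack client).

```python
# Condensed RAG sketch against an OpenAI-compatible Llama Stack endpoint.
# Endpoint, model, file name, and vector store name are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

# 1. Document ingestion: create a vector store on your own server and attach the PDF.
vector_store = client.vector_stores.create(name="national-parks")
uploaded = client.files.create(file=open("national_parks.pdf", "rb"), purpose="assistants")
client.vector_stores.files.create(vector_store_id=vector_store.id, file_id=uploaded.id)

# 2-3. Querying, retrieval, and synthesis happen in a single Responses API call:
# the server searches the vector store via file_search and grounds its answer.
response = client.responses.create(
    model="my-model",
    input="When did the Bering Land Bridge become a national preserve?",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
)
print(response.output_text)
```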

    Automated multi-tool orchestration with MCP

    The true power of an agent lies in its ability to deconstruct a complex request into a sequence of smaller steps. The Responses API enables this by allowing a model to plan and execute a chain of tool calls in a single turn. Llama Stack brings this sophisticated capability to the model of your choice.

    The notebook showcases this with an example using the Model Context Protocol (MCP), an open standard for third-party tool integration. We ask our agent a complex question: Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.

    To answer this, the model needs to perform several steps. With Llama Stack, this entire workflow is automated within one API interaction:

    1. Tool discovery: The model first inspects the available tools from a connected National Parks Service (NPS) MCP server.
    2. Initial search: It identifies the search_parks tool and calls it with the argument state_code="RI" to find relevant parks.
    3. Iterative event search: The search_parks tool returns four national parks. The model then intelligently calls the get_park_events tool for each of the four parks, automatically using the correct park_code from the initial search response.
    4. Final synthesis: After receiving the event information from all four calls, the model synthesizes the data into a single, user-friendly summary.

    The most important part? This entire 7-step process (1 tool discovery, 1 park search, 4 event searches, and 1 final synthesis) happens within a single call to the Responses API. The client-side application doesn't need to write any of the complex orchestration logic.

    You can see this entire interaction, including the model's intermediate steps and final output, in the MCP tool calling section of the companion notebook.
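
    As a hedged sketch of what that single call looks like, assuming a locally reachable NPS MCP server (the server URL, label, and model name below are placeholders):

```python
# One Responses API call; the model drives tool discovery, the park search, and
# the per-park event lookups against the MCP server before summarizing.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.responses.create(
    model="my-model",
    input="Tell me about some parks in Rhode Island, and let me know if there "
          "are any upcoming events at them.",
    tools=[{
        "type": "mcp",
        "server_label": "nps",
        "server_url": "http://localhost:3000/sse",  # placeholder MCP endpoint
    }],
)

# The output lists the intermediate tool-call items as well as the final message.
for item in response.output:
    print(item.type)
print(response.output_text)
```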

    Use your favorite framework: LangChain, OpenAI, and more

    If you're already using a popular agentic framework, integrating Llama Stack is seamless. Because Llama Stack implements the OpenAI-compatible Responses API, you can use it as a drop-in replacement for a proprietary, hosted endpoint. Llama Stack becomes the server-side engine that powers your existing client-side toolkit.

    The notebook demonstrates this by running the exact same basic RAG and MCP queries with both the native Llama Stack Python client and the OpenAI Python client. It also provides a brief introduction to using Llama Stack with LangChain. To switch from a proprietary service to your self-hosted Llama Stack server in LangChain, you only need to change the ChatOpenAI constructor. The rest of your agent and chain logic remains exactly the same. The LangChain section of the notebook shows you how.
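
    For illustration, a minimal sketch of that LangChain switch; the base URL, API key, and model name are assumptions rather than values from the notebook:

```python
# Point LangChain's ChatOpenAI at a self-hosted Llama Stack server instead of a
# hosted service; only the constructor arguments change.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="my-model",                               # any model your server hosts
    base_url="http://localhost:8321/v1/openai/v1",  # your Llama Stack endpoint
    api_key="none",                                 # no hosted-service key needed
)

print(llm.invoke("When did the Bering Land Bridge become a national preserve?").content)
```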

    This drop-in compatibility allows you to leverage the vast ecosystem of frameworks like LangChain while maintaining full control over your model, data, and infrastructure.

    Your agent, your way

    Llama Stack’s Responses API compatibility is still maturing. Furthermore, the OpenAI API specification is proprietary and moves quickly, so there will be a delay between when a new feature lands in the official OpenAI specification and when it is fully implemented in Llama Stack. However, Llama Stack provides significant benefits that offset the downside of waiting for the latest features from each new API release.

    The Responses API provides an excellent blueprint for the future of AI agents, and Llama Stack takes that blueprint and makes it open, flexible, and yours to command. 

    With Llama Stack, you gain: 

    • Model freedom: Go beyond a handful of proprietary models and choose from Llama Stack inference providers, host an open source model, or deploy one you fine-tuned yourself.
    • Data sovereignty: Build powerful RAG and tool-calling agents for your most sensitive data, confident that it remains within your secure infrastructure.
    • An open, extensible stack: Avoid vendor lock-in by building on an open source server that implements the widely adopted Responses API.

    To see these examples in action and start building more powerful, private, and customizable agents, explore the Llama Stack documentation and run the companion notebook for this blog post today.

    The power to choose is yours, today and tomorrow.
