Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Build a local voice agent with Red Hat OpenShift AI

June 8, 2026
Mike Hepburn
Related topics:
AI inferenceArtificial intelligenceObservability
Related products:
Red Hat OpenShift AI

    Writing a voice agent can be a daunting task. There are many moving parts to designing and developing a functional voice agent. In this post, I will explain how to to create a functional Red Hat pizza shop voice agent using Red Hat OpenShift AI, focusing on practical architecture choices and implementation lessons learned along the way. Like a lot of journeys you might take, the fun is not only the destination but the stops you get to take along the way.

    My voice agent is far from complete, but I decided it was good enough to write a blog post about.

    Working out the basic architecture

    I want to give a massive shout out to the folks at LangChain for blogging about how they built an online voice agent. I wanted to use the voice sandwich architecture described in the video (Figure 1), but inside my cluster with my own models. I wanted to replace the use of online voice services so that I could control all the models and steps in my environment.

    Horizontal sequence from left to right showing the progression from User input through VAD, STT, Agent, and TTS to Agent output.
    Figure 1: The voice sandwich architecture.

    As shown in Figure 1, the pipeline begins when an STT model converts human speech into clean text inputs. The voice agent routes this text to the model for reasoning. After processing, a TTS model converts the generated response back into spoken audio, and a transport layer manages moving the audio data between the client and the server backend.

    In the Red Hat pizza shop voice agent, I decided to omit voice activity detection (VAD), leaving it for a future iteration.

    The entire pipeline needs to execute quickly. What qualifies as fast? While this is somewhat objective and requires measurement, I found that an execution time under 600 ms is fast enough to mimic a natural conversation.

    After mapping out the basic architecture components, the next step was to explore some models.

    Model selection

    The Red Hat OpenShift AI model catalog can help you decide on model selection. If I select Audio-to-text as the task and enable the Model performance view, I can see that the recommended model is a validated model from Google: gemma-3n-E4B-it-FP8-dynamic. See Figure 2.

    Red Hat OpenShift AI model catalog interface with model performance view enabled, showing the filtered audio-to-text validated Gemma model card.
    Figure 2: Choosing models with the model catalog.

    You can easily select and display the model performance characteristics for the model based on your hardware. You can discover benchmark stats including time to first token (TTFT) latency (Figure 3), which is important when chatting with your agent—the generation must occur quickly enough to sound like a natural conversation.

    OpenShift AI interface displaying a performance metrics table comparing TTFT latency and throughput across various hardware options.
    Figure 3: Model performance benchmarks for various hardware.

    My main requirements were as follows.

    • Be an open weights model.
    • Ideally a validated model that is available from the Red Hat models on Hugging Face.
    • The models must perform well within the application so that the conversation does not experience unnatural delays (I needed to quantitatively figure this out along the way).
    • To save on GPU NVRAM during development, the models should be available in quantized versions or hosted on a Models-as-a-Service (MaaS) platform.

    Whisper was an easy selection for speech-to-text processing—it is a widely known model from OpenAI with a quantized version available on Hugging Face from Red Hat AI and performed well on initial testing.

    Text to speech was much harder to nail down. When I started, models like Qwen-3 TTS had not yet been released. I plan to test additional options now that more models are available. I found a well-performing model from Higgs Audio that received a lot of positive reviews on Reddit at the time. Higgs Audio also had a vLLM container that could be easily used.

    For the agent model my main criteria was that it worked well with my soon-to-be coded agent framework. The Llama 4 family includes models like Scout, which reduces reasoning anomalies through enhanced reasoning compared to previous generations. It was also available on our internal MaaS platform, so I did not need to deploy it myself. The voice application architecture uses the open weights models described in Table 1.

    Table 1: Model selection for the voice application.
    Model nameLicenseDescriptionNotes
    Whisperapache-2.0Speech-to-textQuantized version of openai/whisper-large-v3-turbo whisper-large-v3-turbo-quantized.w4a16
    Higgs-AudioHiggs Audio 2 (Meta 3) communityText-to-speechAudio foundation model github higgs-audio-v2-generation-3B-base
    Llama4 ScoutLlama 4 communityAgent modelLlama-4-Scout-17B-16E-Instruct-quantized.w4a16

    New models come out all the time. Ultimately the models I ended up selecting were from a point in time at the beginning of the year. Because of vLLM and standard OpenAI endpoints, I can use Red Hat OpenShift AI to continue experimenting, measuring, and testing newer models in my voice agent system.

    Speech-to-text

    Implementing speech-to-text processing is pretty straightforward, especially with validated models like Whisper. Review these great blog posts from my colleagues to learn more about using Whisper:

    • Speech-to-text with Whisper and Red Hat AI Inference Server
    • From local prototype to enterprise production: Private speech transcription with Whisper and Red Hat AI

    My initial workflow involved creating an InferenceService or LLMInferenceService YAML file and experimenting with vLLM settings to run the models on Red Hat OpenShift AI. This deployment ran a single-node OpenShift cluster running in AWS with a hardware accelerator featuring 24 GB of memory (m6i.4xlarge). Squeezing the models onto a compact GPU required some trial and error to optimize the settings, but the configuration process was straightforward. You can examine the structural setup in the LLMInferenceService YAML file.

    Initial testing involved recording a WAV audio file and submitting a POST request to the /v1/audio/transcriptions API endpoint on the Whisper model.

    curl -s -X POST ${MODEL_URL}/v1/audio/transcriptions \
      -H "Content-Type: multipart/form-data" \
      -H "Authorization: Bearer ${ACCESS_TOKEN}" \
      --form file=@/home/mike/Downloads/hello.wav \
      --form model=whisper | jq

    The next step is to configure the text-to-speech model.

    Evaluating text-to-speech capabilities

    One major piece of advice: Before evaluating a text-to-speech model, measure its generation capabilities on your specific hardware configuration. Why? Well, I initially overlooked this step and spent a lot of time writing client application code to patch choppy audio playback, when all I needed was better GPU hardware for my TTS model.

    (gen x) = (audio seconds produced) / (wall clock seconds elapsed)

    This measurement tells us the key truth:

    • If the gen x ratio drops below 1.0 for significant periods, the TTS stream is generating slower than real time. This latency causes underruns or chops unless you introduce additional buffering delays.
    • If gen x consistently remains greater than or equal to 1.0, any remaining playback degradation stems from client-side runtime issues.

    Upgrading to a higher-capacity hardware accelerator resolved the choppy playback. The improvement stemmed from the hardware's increased allocation of CUDA and Tensor Cores rather than total memory capacity. I added a new accelerated compute MachineSet node within Red Hat OpenShift AI to host a g6e.xlarge for this specific model.

    Table 2: GPU STT generation speed comparison.
    Card detailsCoresAMI instance typegen x
    NVIDIA L4 (24 GB)7,424 CUDA cores, 240 Tensor Cores (Gen 4), 60 RT Cores (Gen 3)g6-8xlarge0.78
    NVIDIA L40S (48 GB)18,176 CUDA cores, 568 Tensor Cores (Gen 4), 142 RT Cores (Gen 3)g6e-xlarge1.95

    Apart from the easy configuration, MachineSet resources allow you to scale to zero on demand when you are not using that GPU and model during development and testing, ultimately saving you money. This environment is shown in Figure 4.

    OpenShift MachineSets table within the openshift-machine-api project, displaying multiple worker machine sets scaled to zero of zero machines.
    Figure 4: OpenShift MachineSets.

    During the debugging, I also ended up adding metrics to the user interface to help me understand what was going on with the choppy speech that was generated: TTS (time to stream), buffering stats, underruns count, gen x, TTFT (time to first token), TTFO (time to first speech byte output), chunks, bytes, frames, and outHZ. These UI parameters are displayed in Figure 5.

    Dark user interface panel titled Playback displaying real-time text metrics for audio stream performance and latency.
    Figure 5: Client user metrics.

    Calling the TTS model was initially as simple as using curl and piping the output into ffmpeg and ffplay to hear the output on my laptop. This was handy for quickly testing the Higgs Audio model.

    curl -X POST $MODEL_URL/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{
        "model": "higgs-audio-v2-generation-3B-base",
        "voice": "belinda",
        "input": "What would you like on your pizza?",
        "response_format": "pcm"
      }' \
      --output - | ffmpeg -f s16le -ar 24000 -ac 1 -i pipe:0 -f wav - | ffplay -nodisp -autoexit -
    

    Managing agent collaboration

    All the code is shared at the end of this blog post; however, I want to highlight what I learned on this journey to ordering a pizza. I was already familiar with LangChain and LangGraph. Ultimately, though, you can achieve similar code with one of many client configurations. Red Hat OpenShift AI is about meeting developers where they are, so you can bring your own client libraries using standard Red Hat Universal Base Image (UBI) containers to deploy them. You choose the language and the frameworks.

    So my first lesson learned was about agent handoffs. This is where we take a number of agents and hand off to another agent as the workflow progresses through the business process. You can explicitly code the handoff or, in my case, let the supervisor agent determine when to shift processing loops. I started with four agents—the supervisor agent that holds the session state for the pizza order, and which then routes to one of the other agents depending on the structured output received.

    def build_graph():
        """Compile and return the LangGraph instance (with checkpointer for interrupts)."""
        graph = StateGraph(SupervisorState)
        graph.add_node("supervisor", supervisor_command_node)
        graph.add_node("order_agent", order_agent_node)
        graph.add_node("pizza_agent", pizza_agent_node)
        graph.add_node("delivery_agent", delivery_agent_node)
    
        # Interrupt nodes
        graph.add_node("wait_for_user_after_pizza", wait_for_user_after_pizza)
        graph.add_node("wait_for_user_after_order", wait_for_user_after_order)
        graph.add_node("wait_for_user_after_delivery", wait_for_user_after_delivery)
    
        graph.add_edge(START, "supervisor")
        return graph.compile(checkpointer=MemorySaver())

    In this case:

    • I have a pizza agent that helps the user choose a pizza.
    • I have an order agent that adds items to the order and keeps track of the cost.
    • I have a delivery agent that helps the customer choose delivery options.

    This path guides the user through the process of ordering a pizza quite naturally. The magic is in the prompt engineering. Here is the pizza agent prompt. It has been optimized to work well with my Llama 4 Scout agent model.

    # pizza agent prompt
    PIZZA_AGENT_PROMPT = """You are a voice agent that helps the user choose a pizza.
    Your tasks:
    1. Respond with plain text that will be spoken aloud by the browser UI, and ask the user for a pizza type if they haven't chosen one yet.
    2. Extract any pizza type from the user's query.
    3. Wait for the user to speak again before responding.
    
    Important:
    - Do NOT call `convert_text_to_speech`. The server/browser will handle TTS playback automatically.
    - Do NOT include tool call syntax in your response. Output only what you want the user to hear.
    
    # Context: {context}
    Based on the conversation history, provide your response:"""

    There are some hand-crafted “Important” notes added to the agent prompt to prevent it from trying to call tools to do the text-to-speech conversion. I initially handled all conversions using tools but found it was too unreliable. Instead, I opted to let the server and client code to handle this directly.

    Currently, the only session state I extract from the conversation is the pizza type. In a production system, you would need to extract all other state parameters as well, such as cost, toppings, extras, delivery times, and physical addresses.

    # =========================================================
    # State and Models
    # =========================================================
    class SupervisorState(TypedDict, total=False):
        """State shared across all agents in the graph."""
    
        messages: Annotated[
            list, add_messages
        ] # Conversation history (uses add_messages reducer)
        pizza_type: Annotated[str, "The type of pizza the user wants to order."]
    
    
    class SupervisorDecision(BaseModel):
        """Structured output from supervisor for routing decisions."""
    
        next_agent: Literal["order_agent", "pizza_agent", "delivery_agent", "none"]
        pizza_type: Annotated[str, "The type of pizza the user wants to order."]
        response: str = "" # Direct response if no routing needed

    Each agent uses embedded tools defined within the execution code. For example, the pizza agent can use the get_pizza_type tool. Ultimately, I envision that this could be an external menu service called via the Model Context Protocol (MCP). Similarly, the price of a pizza is completely generated by the LLM, meaning the generated pricing parameters reflect the training data of the Llama 4 Scout model.

    @tool
    def get_pizza_type(query: str) -> dict:
        """Supported pizza types."""
        pizza_type_dictionary = {
            "margherita": "Margherita",
            "pepperoni": "Pepperoni",
            "vegetarian": "Vegetarian",
            "hawaiian": "Hawaiian",
            "meatlovers": "Meat Lovers",
            "bbq_chicken": "BBQ Chicken",
            "spinach_and_mushroom": "Spinach and Mushroom",
        }
        return pizza_type_dictionary

    The main reason I completely ignored VAD was the complexity of when to interrupt the TTS model and how to interrupt the multi-agent handoff flows. Instead, I opted for interrupt nodes in my agent graph as a simple first attempt. That way, the application waits for you to speak the next command when you select Talk and Send (Figure 6). Clunky, but it works.

    Voice agent client interface displaying prominent green Talk and red Send buttons below a status indicator and a text entry box.
    Figure 6: Client interface with manual talk and send controls.

    In code, the interrupt nodes serve as simple placeholders for now.

    def wait_for_user_after_pizza(state: SupervisorState) -> Command:
        """Interrupt after pizza agent, waiting for user's next input."""
        user_text = interrupt(_interrupt_payload(state, "pizza_agent"))
        return Command(
            goto="supervisor", update={"messages": [HumanMessage(content=str(user_text))]}
        )

    Observability

    LLM systems generate a high volume of token interactions. This token activity increases when you use reasoning models. This rapid interaction can make it easy to lose track of internal system dialogues. Red Hat OpenShift AI includes MLflow (Figure 7) for agent tracing, alongside platform tracing capabilities offered by the Tempo stack.

    Side-by-side view of the pizza application interface and the MLflow dashboard displaying logged interaction traces in a data table.
    Figure 7: MLfLow agent observability.

    Using the MLflow OpenTelemetry (OTEL) integration, you can examine each trace to inspect input and output responses from both the agent and the LLM (Figure 8). This framework helps you understand delays, timing, and the agent conversation state. This is invaluable for debugging. You also can replay this data to generate datasets for future testing scenarios based on real user input.

    MLflow trace interface displaying a LangGraph execution tree breakdown on the left and details of input and output messages on the right.
    Figure 8: MLfLow agent conversation traces.

    The need for guardrails

    You can persuade the agent to give you a pizza for one dollar (Figure 9). Try simple prompting to see if you can persuade the agent.

    Pizza shop interface showing a chat bubble from the agent confirming the order total has been updated to one dollar.
    Figure 9: A lack of guardrails: Pizza for one dollar.

    The system has no guardrails or safety mechanisms other than prompts and the data the LLM was trained on. We intend to use Red Hat OpenShift AI to add safety mechanisms such as guardrails to protect the pizza agent from abuse, following the pattern illustrated in Figure 10.

    Linear sequence inside a generative AI application showing user input passing through input detectors, a generative model, and output detectors.
    Figure 10: Basic guardrails and detectors.

    TrustyAI is an open source AI safety toolkit supported by Red Hat OpenShift AI. By using integrations with MLflow, we can store all testing observations and evaluations for pizza ordering conversations. This toolkit includes testing techniques such as LLM-as-a-judge to evaluate responses against expected outputs.

    You can also shift security left by evaluating these scenarios with red-teaming techniques. Red teaming is the systematic practice of simulating adversarial attacks to identify vulnerabilities, safety flaws, and alignment issues before the system is deployed to production.

    Next steps and resources

    Now that you have seen how to build the Red Hat pizza shop voice agent, you can try the live demo or explore the complete project details.

    To recap, this deployment demonstrates how to:

    • Run a voice-sandwich pattern in a local environment without external Software-as-a-Service (SaaS) voice services.
    • Select speech-to-text (STT), text-to-speech (TTS), and voice agent models.
    • Develop a voice agent while measuring system performance on specific hardware configurations.
    • Capture complex agent interactions and traces with MLflow observability tools.
    • Integrate security-focused mechanisms to protect the model configuration from prompt exploits.

    The voice agent code and configurations are available on GitHub, including all model configurations used on hardware accelerators, so you can test the framework and OpenShift AI with your own use cases.

    Voice applications provide an excellent use case for gen AI. I hope this brings you one step closer to talking with your agents!

    Related Posts

    • How to run a Red Hat-powered local AI audio transcription

    • From local prototype to enterprise production: Private speech transcription with Whisper and Red Hat AI

    • Speech-to-text with Whisper and Red Hat AI Inference Server

    • Use OpenVINO to convert speech to text

    • Eval-driven development: Build and evaluate reliable AI agents

    • Prompt engineering: Big vs. small prompts for AI agents

    Recent Posts

    • Build a local voice agent with Red Hat OpenShift AI

    • Gang autoscaling on OpenShift with Kueue and ProvisionRequest

    • Installing Red Hat Enterprise Linux 10 from a bootc image with bootc

    • Why your database benchmarking data is probably wrong (and how I fixed mine)

    • Type what you want to break: AI-assisted chaos engineering with Krkn

    What’s up next?

    Learning Path intro-to-OS-LP-feature-image

    Introduction to OpenShift AI

    Learn how to use Red Hat OpenShift AI to quickly develop, train, and deploy...
    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.