Build a local voice agent with Red Hat OpenShift AI

Writing a voice agent can be a daunting task. There are many moving parts to designing and developing a functional voice agent. In this post, I will explain how to to create a functional Red Hat pizza shop voice agent using Red Hat OpenShift AI, focusing on practical architecture choices and implementation lessons learned along the way. Like a lot of journeys you might take, the fun is not only the destination but the stops you get to take along the way.

My voice agent is far from complete, but I decided it was good enough to write a blog post about.

Working out the basic architecture

I want to give a massive shout out to the folks at LangChain for blogging about how they built an online voice agent. I wanted to use the voice sandwich architecture described in the video (Figure 1), but inside my cluster with my own models. I wanted to replace the use of online voice services so that I could control all the models and steps in my environment.

Horizontal sequence from left to right showing the progression from User input through VAD, STT, Agent, and TTS to Agent output. — Figure 1: The voice sandwich architecture.

As shown in Figure 1, the pipeline begins when an STT model converts human speech into clean text inputs. The voice agent routes this text to the model for reasoning. After processing, a TTS model converts the generated response back into spoken audio, and a transport layer manages moving the audio data between the client and the server backend.

In the Red Hat pizza shop voice agent, I decided to omit voice activity detection (VAD), leaving it for a future iteration.

The entire pipeline needs to execute quickly. What qualifies as fast? While this is somewhat objective and requires measurement, I found that an execution time under 600 ms is fast enough to mimic a natural conversation.

After mapping out the basic architecture components, the next step was to explore some models.

Model selection

The Red Hat OpenShift AI model catalog can help you decide on model selection. If I select Audio-to-text as the task and enable the Model performance view, I can see that the recommended model is a validated model from Google: gemma-3n-E4B-it-FP8-dynamic. See Figure 2.

Red Hat OpenShift AI model catalog interface with model performance view enabled, showing the filtered audio-to-text validated Gemma model card. — Figure 2: Choosing models with the model catalog.

You can easily select and display the model performance characteristics for the model based on your hardware. You can discover benchmark stats including time to first token (TTFT) latency (Figure 3), which is important when chatting with your agent—the generation must occur quickly enough to sound like a natural conversation.

OpenShift AI interface displaying a performance metrics table comparing TTFT latency and throughput across various hardware options. — Figure 3: Model performance benchmarks for various hardware.

My main requirements were as follows.

Be an open weights model.
Ideally a validated model that is available from the Red Hat models on Hugging Face.
The models must perform well within the application so that the conversation does not experience unnatural delays (I needed to quantitatively figure this out along the way).
To save on GPU NVRAM during development, the models should be available in quantized versions or hosted on a Models-as-a-Service (MaaS) platform.

Whisper was an easy selection for speech-to-text processing—it is a widely known model from OpenAI with a quantized version available on Hugging Face from Red Hat AI and performed well on initial testing.

Text to speech was much harder to nail down. When I started, models like Qwen-3 TTS had not yet been released. I plan to test additional options now that more models are available. I found a well-performing model from Higgs Audio that received a lot of positive reviews on Reddit at the time. Higgs Audio also had a vLLM container that could be easily used.

For the agent model my main criteria was that it worked well with my soon-to-be coded agent framework. The Llama 4 family includes models like Scout, which reduces reasoning anomalies through enhanced reasoning compared to previous generations. It was also available on our internal MaaS platform, so I did not need to deploy it myself. The voice application architecture uses the open weights models described in Table 1.

Table 1: Model selection for the voice application.
Model name	License	Description	Notes
Whisper	apache-2.0	Speech-to-text	Quantized version of openai/whisper-large-v3-turbo whisper-large-v3-turbo-quantized.w4a16
Higgs-Audio	Higgs Audio 2 (Meta 3) community	Text-to-speech	Audio foundation model github higgs-audio-v2-generation-3B-base
Llama4 Scout	Llama 4 community	Agent model	Llama-4-Scout-17B-16E-Instruct-quantized.w4a16

New models come out all the time. Ultimately the models I ended up selecting were from a point in time at the beginning of the year. Because of vLLM and standard OpenAI endpoints, I can use Red Hat OpenShift AI to continue experimenting, measuring, and testing newer models in my voice agent system.

Speech-to-text

Implementing speech-to-text processing is pretty straightforward, especially with validated models like Whisper. Review these great blog posts from my colleagues to learn more about using Whisper:

My initial workflow involved creating an InferenceService or LLMInferenceService YAML file and experimenting with vLLM settings to run the models on Red Hat OpenShift AI. This deployment ran a single-node OpenShift cluster running in AWS with a hardware accelerator featuring 24 GB of memory (m6i.4xlarge). Squeezing the models onto a compact GPU required some trial and error to optimize the settings, but the configuration process was straightforward. You can examine the structural setup in the LLMInferenceService YAML file.

Initial testing involved recording a WAV audio file and submitting a POST request to the /v1/audio/transcriptions API endpoint on the Whisper model.

curl -s -X POST ${MODEL_URL}/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  --form file=@/home/mike/Downloads/hello.wav \
  --form model=whisper | jq

The next step is to configure the text-to-speech model.

Evaluating text-to-speech capabilities

One major piece of advice: Before evaluating a text-to-speech model, measure its generation capabilities on your specific hardware configuration. Why? Well, I initially overlooked this step and spent a lot of time writing client application code to patch choppy audio playback, when all I needed was better GPU hardware for my TTS model.

(gen x) = (audio seconds produced) / (wall clock seconds elapsed)

This measurement tells us the key truth:

If the gen x ratio drops below 1.0 for significant periods, the TTS stream is generating slower than real time. This latency causes underruns or chops unless you introduce additional buffering delays.
If gen x consistently remains greater than or equal to 1.0, any remaining playback degradation stems from client-side runtime issues.

Upgrading to a higher-capacity hardware accelerator resolved the choppy playback. The improvement stemmed from the hardware's increased allocation of CUDA and Tensor Cores rather than total memory capacity. I added a new accelerated compute MachineSet node within Red Hat OpenShift AI to host a g6e.xlarge for this specific model.

Table 2: GPU STT generation speed comparison.
Card details	Cores	AMI instance type	gen x
NVIDIA L4 (24 GB)	7,424 CUDA cores, 240 Tensor Cores (Gen 4), 60 RT Cores (Gen 3)	g6-8xlarge	0.78
NVIDIA L40S (48 GB)	18,176 CUDA cores, 568 Tensor Cores (Gen 4), 142 RT Cores (Gen 3)	g6e-xlarge	1.95

Apart from the easy configuration, MachineSet resources allow you to scale to zero on demand when you are not using that GPU and model during development and testing, ultimately saving you money. This environment is shown in Figure 4.

OpenShift MachineSets table within the openshift-machine-api project, displaying multiple worker machine sets scaled to zero of zero machines. — Figure 4: OpenShift MachineSets.

During the debugging, I also ended up adding metrics to the user interface to help me understand what was going on with the choppy speech that was generated: TTS (time to stream), buffering stats, underruns count, gen x, TTFT (time to first token), TTFO (time to first speech byte output), chunks, bytes, frames, and outHZ. These UI parameters are displayed in Figure 5.

Dark user interface panel titled Playback displaying real-time text metrics for audio stream performance and latency. — Figure 5: Client user metrics.

Calling the TTS model was initially as simple as using curl and piping the output into ffmpeg and ffplay to hear the output on my laptop. This was handy for quickly testing the Higgs Audio model.

curl -X POST $MODEL_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "higgs-audio-v2-generation-3B-base",
    "voice": "belinda",
    "input": "What would you like on your pizza?",
    "response_format": "pcm"
  }' \
  --output - | ffmpeg -f s16le -ar 24000 -ac 1 -i pipe:0 -f wav - | ffplay -nodisp -autoexit -

Managing agent collaboration

All the code is shared at the end of this blog post; however, I want to highlight what I learned on this journey to ordering a pizza. I was already familiar with LangChain and LangGraph. Ultimately, though, you can achieve similar code with one of many client configurations. Red Hat OpenShift AI is about meeting developers where they are, so you can bring your own client libraries using standard Red Hat Universal Base Image (UBI) containers to deploy them. You choose the language and the frameworks.

So my first lesson learned was about agent handoffs. This is where we take a number of agents and hand off to another agent as the workflow progresses through the business process. You can explicitly code the handoff or, in my case, let the supervisor agent determine when to shift processing loops. I started with four agents—the supervisor agent that holds the session state for the pizza order, and which then routes to one of the other agents depending on the structured output received.

def build_graph():
    """Compile and return the LangGraph instance (with checkpointer for interrupts)."""
    graph = StateGraph(SupervisorState)
    graph.add_node("supervisor", supervisor_command_node)
    graph.add_node("order_agent", order_agent_node)
    graph.add_node("pizza_agent", pizza_agent_node)
    graph.add_node("delivery_agent", delivery_agent_node)

    # Interrupt nodes
    graph.add_node("wait_for_user_after_pizza", wait_for_user_after_pizza)
    graph.add_node("wait_for_user_after_order", wait_for_user_after_order)
    graph.add_node("wait_for_user_after_delivery", wait_for_user_after_delivery)

    graph.add_edge(START, "supervisor")
    return graph.compile(checkpointer=MemorySaver())

In this case:

I have a pizza agent that helps the user choose a pizza.
I have an order agent that adds items to the order and keeps track of the cost.
I have a delivery agent that helps the customer choose delivery options.

This path guides the user through the process of ordering a pizza quite naturally. The magic is in the prompt engineering. Here is the pizza agent prompt. It has been optimized to work well with my Llama 4 Scout agent model.

# pizza agent prompt
PIZZA_AGENT_PROMPT = """You are a voice agent that helps the user choose a pizza.
Your tasks:
1. Respond with plain text that will be spoken aloud by the browser UI, and ask the user for a pizza type if they haven't chosen one yet.
2. Extract any pizza type from the user's query.
3. Wait for the user to speak again before responding.

Important:
- Do NOT call `convert_text_to_speech`. The server/browser will handle TTS playback automatically.
- Do NOT include tool call syntax in your response. Output only what you want the user to hear.

# Context: {context}
Based on the conversation history, provide your response:"""

There are some hand-crafted “Important” notes added to the agent prompt to prevent it from trying to call tools to do the text-to-speech conversion. I initially handled all conversions using tools but found it was too unreliable. Instead, I opted to let the server and client code to handle this directly.

Currently, the only session state I extract from the conversation is the pizza type. In a production system, you would need to extract all other state parameters as well, such as cost, toppings, extras, delivery times, and physical addresses.

# =========================================================
# State and Models
# =========================================================
class SupervisorState(TypedDict, total=False):
    """State shared across all agents in the graph."""

    messages: Annotated[
        list, add_messages
    ] # Conversation history (uses add_messages reducer)
    pizza_type: Annotated[str, "The type of pizza the user wants to order."]


class SupervisorDecision(BaseModel):
    """Structured output from supervisor for routing decisions."""

    next_agent: Literal["order_agent", "pizza_agent", "delivery_agent", "none"]
    pizza_type: Annotated[str, "The type of pizza the user wants to order."]
    response: str = "" # Direct response if no routing needed

Each agent uses embedded tools defined within the execution code. For example, the pizza agent can use the get_pizza_type tool. Ultimately, I envision that this could be an external menu service called via the Model Context Protocol (MCP). Similarly, the price of a pizza is completely generated by the LLM, meaning the generated pricing parameters reflect the training data of the Llama 4 Scout model.

@tool
def get_pizza_type(query: str) -> dict:
    """Supported pizza types."""
    pizza_type_dictionary = {
        "margherita": "Margherita",
        "pepperoni": "Pepperoni",
        "vegetarian": "Vegetarian",
        "hawaiian": "Hawaiian",
        "meatlovers": "Meat Lovers",
        "bbq_chicken": "BBQ Chicken",
        "spinach_and_mushroom": "Spinach and Mushroom",
    }
    return pizza_type_dictionary

The main reason I completely ignored VAD was the complexity of when to interrupt the TTS model and how to interrupt the multi-agent handoff flows. Instead, I opted for interrupt nodes in my agent graph as a simple first attempt. That way, the application waits for you to speak the next command when you select Talk and Send (Figure 6). Clunky, but it works.

Voice agent client interface displaying prominent green Talk and red Send buttons below a status indicator and a text entry box. — Figure 6: Client interface with manual talk and send controls.

In code, the interrupt nodes serve as simple placeholders for now.

def wait_for_user_after_pizza(state: SupervisorState) -> Command:
    """Interrupt after pizza agent, waiting for user's next input."""
    user_text = interrupt(_interrupt_payload(state, "pizza_agent"))
    return Command(
        goto="supervisor", update={"messages": [HumanMessage(content=str(user_text))]}
    )

Observability

LLM systems generate a high volume of token interactions. This token activity increases when you use reasoning models. This rapid interaction can make it easy to lose track of internal system dialogues. Red Hat OpenShift AI includes MLflow (Figure 7) for agent tracing, alongside platform tracing capabilities offered by the Tempo stack.

Side-by-side view of the pizza application interface and the MLflow dashboard displaying logged interaction traces in a data table. — Figure 7: MLfLow agent observability.

Using the MLflow OpenTelemetry (OTEL) integration, you can examine each trace to inspect input and output responses from both the agent and the LLM (Figure 8). This framework helps you understand delays, timing, and the agent conversation state. This is invaluable for debugging. You also can replay this data to generate datasets for future testing scenarios based on real user input.

MLflow trace interface displaying a LangGraph execution tree breakdown on the left and details of input and output messages on the right. — Figure 8: MLfLow agent conversation traces.

The need for guardrails

You can persuade the agent to give you a pizza for one dollar (Figure 9). Try simple prompting to see if you can persuade the agent.

Pizza shop interface showing a chat bubble from the agent confirming the order total has been updated to one dollar. — Figure 9: A lack of guardrails: Pizza for one dollar.

The system has no guardrails or safety mechanisms other than prompts and the data the LLM was trained on. We intend to use Red Hat OpenShift AI to add safety mechanisms such as guardrails to protect the pizza agent from abuse, following the pattern illustrated in Figure 10.

Linear sequence inside a generative AI application showing user input passing through input detectors, a generative model, and output detectors. — Figure 10: Basic guardrails and detectors.

TrustyAI is an open source AI safety toolkit supported by Red Hat OpenShift AI. By using integrations with MLflow, we can store all testing observations and evaluations for pizza ordering conversations. This toolkit includes testing techniques such as LLM-as-a-judge to evaluate responses against expected outputs.

You can also shift security left by evaluating these scenarios with red-teaming techniques. Red teaming is the systematic practice of simulating adversarial attacks to identify vulnerabilities, safety flaws, and alignment issues before the system is deployed to production.

Next steps and resources

Now that you have seen how to build the Red Hat pizza shop voice agent, you can try the live demo or explore the complete project details.

To recap, this deployment demonstrates how to:

Run a voice-sandwich pattern in a local environment without external Software-as-a-Service (SaaS) voice services.
Select speech-to-text (STT), text-to-speech (TTS), and voice agent models.
Develop a voice agent while measuring system performance on specific hardware configurations.
Capture complex agent interactions and traces with MLflow observability tools.
Integrate security-focused mechanisms to protect the model configuration from prompt exploits.

The voice agent code and configurations are available on GitHub, including all model configurations used on hardware accelerators, so you can test the framework and OpenShift AI with your own use cases.

Voice applications provide an excellent use case for gen AI. I hope this brings you one step closer to talking with your agents!

Build a local voice agent with Red Hat OpenShift AI

Working out the basic architecture

Model selection

Speech-to-text

Evaluating text-to-speech capabilities

Managing agent collaboration

Observability

The need for guardrails

Next steps and resources

Red Hat Dependency Analytics works with your private Trusted Profile Analyzer instance!

Understanding ApplicationSets - Generators (Part 2)

Benchmark Red Hat Data Grid in OpenShift 4 using Hyperfoil

Layered sandboxing for AI agents: OpenShift and OpenShell

How obs-mcp boosts AI-native OpenShift observability

Introduction to OpenShift AI

Platforms

Build

Quicklinks

Communicate

RED HAT DEVELOPER

Red Hat legal and privacy links

Red Hat legal and privacy links