Key takeaways
- Mistral AI has released Voxtral Mini 4B Realtime, a streaming speech recognition model designed for low-latency voice workloads.
- The model supports real-time ASR with sub-500 ms latency and multilingual transcription across 13 languages.
- Voxtral is supported upstream in vLLM on Day 0 through the realtime streaming API.
- Red Hat AI makes Voxtral ready for Day 1 experimentation using Red Hat AI Inference Server.
- Developers can immediately prototype streaming voice applications using open infrastructure and open model ecosystems.
Real-time speech recognition is becoming a key area in generative AI. Organizations are rapidly adopting voice interfaces for customer engagement, internal productivity tools, accessibility initiatives, and conversational AI workflows.
Mistral AI recently released Voxtral Mini 4B Realtime, a streaming automatic speech recognition model optimized for low latency audio processing. Unlike traditional ASR models that rely on batch processing, Voxtral enables continuous streaming transcription designed for conversational workloads. You can download the model directly from Hugging Face.
The release highlights how open AI infrastructure is accelerating model adoption. Voxtral is already supported upstream in vLLM, enabling developers to serve the model immediately. With Red Hat AI Inference Server, developers and organizations can begin experimenting with streaming ASR workloads on Day 1.
What’s new in Voxtral Mini 4B Realtime
Voxtral Mini 4B Realtime is a lightweight streaming ASR model designed to balance transcription accuracy with real-time responsiveness.
Streaming ASR architecture
The model is built specifically for real-time inference, which lets you transcribe while audio is actively streaming. This reduces end-to-end latency and supports interactive conversational AI experiences.
Efficient model size
With approximately 4 billion parameters, Voxtral provides an efficient balance between model quality and deployability across enterprise infrastructure environments.
Multilingual capabilities
Voxtral supports transcription across 13 languages, so organizations can build global voice-driven applications without deploying multiple specialized models.
Designed for interactive voice applications
Voxtral supports a variety of interactive workloads, including voice assistants, live meeting transcription, real-time captioning, and multilingual customer support automation.
Licensing and openness
Voxtral continues Mistral AI’s commitment to open ecosystem development. The model is available publicly through Hugging Face, so developers and organizations can experiment and deploy without proprietary lock-in. This model is released in BF16 under the Apache License 2.0, ensuring flexibility for both research and commercial use.
The model runs with upstream vLLM without requiring custom forks or specialized integrations. This upstream compatibility accelerates adoption and ensures consistent developer workflows across open AI infrastructure.
The power of open: Immediate support in vLLM
vLLM recently introduced a realtime streaming API that supports audio streaming workloads through the /v1/realtime endpoint. This lets developers serve streaming ASR models without building custom streaming pipelines.
Using Voxtral with vLLM lets you load models directly from Hugging Face and serve them through native realtime APIs. This approach supports scalable, low-latency speech recognition and helps you integrate audio pipelines into your conversational AI applications. This makes vLLM the fastest path from speech model release to production-ready serving infrastructure.
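To give a sense of what this looks like from the client side, here is a minimal sketch of opening a realtime session against a locally running vLLM server. It assumes the server is listening on 127.0.0.1:8000 and follows the session and event message types used by the full demo client later in this post; treat it as an illustration rather than a complete client.

```python
# Minimal sketch: open a realtime session against a local vLLM server.
# Assumes vLLM is serving Voxtral on 127.0.0.1:8000 and exposes the
# /v1/realtime WebSocket endpoint described later in this post.
import asyncio
import json

import websockets  # pip install websockets


async def open_session(host: str = "127.0.0.1", port: int = 8000):
    uri = f"ws://{host}:{port}/v1/realtime"
    async with websockets.connect(uri) as ws:
        # The server announces the new session as its first event.
        event = json.loads(await ws.recv())
        print("First server event:", event.get("type"))
        # Audio would then be streamed with input_audio_buffer.append events,
        # as shown in the full demo client below.


if __name__ == "__main__":
    asyncio.run(open_session())
```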
Experiment with Red Hat AI on Day 1
Red Hat AI provides enterprise-ready infrastructure for running open source AI models across the hybrid cloud. By integrating with upstream inference technologies like vLLM, Red Hat AI allows organizations to evaluate new models immediately after release.
Red Hat AI Inference Server provides:
- Scalable model serving infrastructure
- Production-aligned deployment workflows
- Hybrid cloud deployment flexibility
- Integration with open model ecosystems
Using Red Hat AI Inference Server, you can start experimenting with Voxtral today to see how it fits into your existing hybrid cloud workflows.
Serve and run streaming ASR workloads using Red Hat AI Inference Server
This guide demonstrates how to deploy Voxtral Mini 4B Realtime using Red Hat AI Inference Server and vLLM realtime APIs.
Prerequisites
- Linux server with an NVIDIA GPU (16 GB+ VRAM)
- Podman or Docker installed
- (Optional) Python 3.9+ with websockets, librosa, and numpy installed
- Access to Red Hat container images
- (Optional) Hugging Face account and token for model download
Technology preview notice
The Red Hat AI Inference Server images used in this guide are intended for experimentation and evaluation purposes. Production workloads should use upcoming stable releases from Red Hat.
Procedure: Deploy Voxtral Mini 4B Realtime using Red Hat AI Inference Server
This section walks you through how to run the Voxtral Mini 4B Realtime model with Podman and Red Hat AI Inference Server on NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime image as a custom runtime, use it to serve the model, and add the vLLM parameters described in this procedure to enable model-specific features.
Log in to the Red Hat registry. Open a terminal on your server and log in to registry.redhat.io:

```bash
podman login registry.redhat.io
```

Pull the Red Hat AI Inference Server image (CUDA version):

```bash
podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime
```

If SELinux is enabled on your system, allow container access to devices:

```bash
sudo setsebool -P container_use_devices 1
```

Create a volume directory for model caching and set its permissions:

```bash
mkdir -p rhaiis-cache
chmod g+rwX rhaiis-cache
```

Create or append your Hugging Face token to a local private.env file and source it:

```bash
echo "export HF_TOKEN=<your_HF_token>" > private.env
source private.env
```

Start the AI Inference Server container. If your system includes multiple NVIDIA GPUs connected via NVSwitch, follow these steps first:
Check for NVSwitch. To detect NVSwitch support, check for these devices:

```bash
ls /proc/driver/nvidia-nvswitch/devices/
```

Example output:

```
0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0
```

Start NVIDIA Fabric Manager (root required):

```bash
sudo systemctl start nvidia-fabricmanager
```

Important: NVIDIA Fabric Manager is only required for systems with multiple GPUs connected via NVSwitch.

Verify GPU visibility from the container. Run the following command to verify GPU access inside a container:

```bash
podman run --rm -it \
    --security-opt=label=disable \
    --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
    nvidia-smi
```

Start the Red Hat AI Inference Server container with the Voxtral Mini 4B Realtime model:
```bash
podman run --rm \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/hf:Z \
    -e HF_HUB_OFFLINE=1 \
    -e VLLM_DISABLE_COMPILE_CACHE=1 \
    -e HF_HOME=/hf \
    -e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime \
    --model mistralai/Voxtral-Mini-4B-Realtime-2602 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --trust-remote-code \
    --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
    --tensor-parallel-size 1 \
    --max-model-len 45000 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 --port 8000
```
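Model weights can take a while to load after the container starts. The following sketch is one way to wait for the server to come up before streaming audio; it assumes the standard vLLM health endpoint is exposed on the published port (8000 on localhost in this setup).

```python
# Sketch: poll the server until it reports healthy before streaming audio.
# Assumes the standard vLLM /health endpoint is reachable on 127.0.0.1:8000.
import time
import urllib.error
import urllib.request

URL = "http://127.0.0.1:8000/health"

for attempt in range(60):
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            if resp.status == 200:
                print("Server is ready.")
                break
    except (urllib.error.URLError, ConnectionError):
        pass
    print(f"Waiting for server... ({attempt + 1})")
    time.sleep(10)
else:
    raise SystemExit("Server did not become ready in time.")
```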
Query the /v1/realtime endpoint with streaming audio. The /v1/realtime endpoint in vLLM uses WebSockets to stream audio and receive inferred transcriptions. We have provided a demo script that streams audio to the WebSocket; feel free to use or customize this implementation.
You can also download a sample audio file for transcription (this one is a recording of a train defect detection system, from Wikimedia Commons):

```bash
curl -o CSX_Wikipedia_Detector_demo.wav \
    https://upload.wikimedia.org/wikipedia/commons/5/5a/CSX_Wikipedia_Detector_demo.wav
```

Install these dependencies if you don't have them:

```bash
pip install websockets librosa numpy
```
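Optionally, you can sanity-check the downloaded audio the same way the demo client will load it. This short sketch only confirms the file decodes and reports its duration after resampling to 16 kHz (the rate the client converts audio to before streaming); the file name matches the sample downloaded above.

```python
# Optional sanity check: confirm the sample audio loads and resamples to 16 kHz,
# matching how the demo client prepares audio before streaming.
import librosa

audio_path = "CSX_Wikipedia_Detector_demo.wav"
audio, sr = librosa.load(audio_path, sr=16000, mono=True)

print(f"Sample rate: {sr} Hz")
print(f"Samples: {len(audio)}")
print(f"Duration: {len(audio) / sr:.1f} seconds")
```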
Then, create a realtime_test.py file (our modifications on top of the example vLLM streaming client):

```python
#!/usr/bin/env python3
"""
Simplified realtime client for testing Voxtral.
"""
import argparse
import asyncio
import base64
import json

import librosa
import numpy as np
import websockets


def audio_to_pcm16_base64(audio_path: str) -> str:
    """Load an audio file and convert it to base64-encoded PCM16 @ 16kHz."""
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    pcm16 = (audio * 32767).astype(np.int16)
    return base64.b64encode(pcm16.tobytes()).decode("utf-8")


async def realtime_transcribe(audio_path: str, host: str, port: int, model: str):
    """Connect to the Realtime API and transcribe an audio file."""
    uri = f"ws://{host}:{port}/v1/realtime"
    print(f"Connecting to {uri}...")

    async with websockets.connect(uri) as ws:
        # Wait for session.created
        response = json.loads(await ws.recv())
        if response["type"] == "session.created":
            print(f"Session created: {response['id']}")
        else:
            print(f"Unexpected response: {response}")
            return

        # Validate model
        await ws.send(json.dumps({"type": "session.update", "model": model}))

        # Signal ready to start
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Convert audio file to base64 PCM16
        print(f"Loading audio from: {audio_path}")
        audio_base64 = audio_to_pcm16_base64(audio_path)

        # Send audio in chunks
        chunk_size = 4096
        audio_bytes = base64.b64decode(audio_base64)
        total_chunks = (len(audio_bytes) + chunk_size - 1) // chunk_size
        print(f"Sending {total_chunks} audio chunks...")

        for i in range(0, len(audio_bytes), chunk_size):
            chunk = audio_bytes[i : i + chunk_size]
            await ws.send(
                json.dumps(
                    {
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode("utf-8"),
                    }
                )
            )

        # Signal all audio is sent
        await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))
        print("Audio sent. Waiting for transcription...\n")

        # Receive transcription
        print("Transcription: ", end="", flush=True)
        while True:
            response = json.loads(await ws.recv())

            if response["type"] == "transcription.delta":
                print(response["delta"], end="", flush=True)
            elif response["type"] == "transcription.done":
                print(f"\n\nFinal transcription: {response['text']}")
                if response.get("usage"):
                    print(f"Usage: {response['usage']}")
                break
            elif response["type"] == "error":
                print(f"\nError: {response['error']}")
                break


def main():
    parser = argparse.ArgumentParser(description="Realtime WebSocket Transcription Client")
    parser.add_argument("--model", type=str, default="mistralai/Voxtral-Mini-4B-Realtime-2602")
    parser.add_argument("--audio_path", type=str, required=True)
    parser.add_argument("--host", type=str, default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args()

    asyncio.run(realtime_transcribe(args.audio_path, args.host, args.port, args.model))


if __name__ == "__main__":
    main()
```

Finally, run the demo script. Adjust the host address if you are running from a remote host:
```bash
python3 realtime_test.py --audio_path CSX_Wikipedia_Detector_demo.wav --host 127.0.0.1 --port 8000
```
Experimentation ideas
After you deploy the model, you can evaluate its performance by testing streaming latency, transcription accuracy across different languages, and how well it integrates into your agentic or conversational AI workflows.
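For example, a rough way to probe streaming latency is to time the gap between finishing the audio upload and receiving the first partial and final transcription events. The sketch below mirrors the message flow of realtime_test.py above and only adds timing; the host, port, model name, and audio file name are assumptions carried over from the earlier steps.

```python
# Rough latency probe: time from the final audio commit to the first
# transcription.delta and to transcription.done. Mirrors realtime_test.py;
# host, port, model name, and file name are assumptions from this guide.
import asyncio
import base64
import json
import time

import librosa
import numpy as np
import websockets

MODEL = "mistralai/Voxtral-Mini-4B-Realtime-2602"


async def measure(audio_path: str, host: str = "127.0.0.1", port: int = 8000):
    # Prepare PCM16 @ 16 kHz audio, as the demo client does.
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    pcm16 = (audio * 32767).astype(np.int16).tobytes()

    async with websockets.connect(f"ws://{host}:{port}/v1/realtime") as ws:
        await ws.recv()  # session.created
        await ws.send(json.dumps({"type": "session.update", "model": MODEL}))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Stream the audio in the same 4096-byte chunks as the demo client.
        for i in range(0, len(pcm16), 4096):
            chunk = base64.b64encode(pcm16[i : i + 4096]).decode("utf-8")
            await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": chunk}))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))
        sent_at = time.monotonic()

        first_delta = None
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "transcription.delta" and first_delta is None:
                first_delta = time.monotonic() - sent_at
            elif event["type"] == "transcription.done":
                total = time.monotonic() - sent_at
                if first_delta is not None:
                    print(f"First partial after {first_delta:.2f}s")
                print(f"Final transcription after {total:.2f}s")
                break
            elif event["type"] == "error":
                print("Error:", event.get("error"))
                break


if __name__ == "__main__":
    asyncio.run(measure("CSX_Wikipedia_Detector_demo.wav"))
```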
Conclusion
The release of Voxtral Mini 4B Realtime shows how streaming speech models are becoming core components of modern AI inference stacks. Real-time ASR introduces new infrastructure challenges such as low-latency streaming, continuous audio ingestion, and scaling interactive workloads, which open source serving frameworks like vLLM now address.
With real-time API support in vLLM, you can deploy streaming ASR models without building custom audio pipelines. Red Hat AI Inference Server provides a production-aligned environment for evaluating these models using containerized deployments, GPU-accelerated inference, and hybrid deployment patterns.
As real-time multimodal and voice-driven applications continue to grow, the ability to quickly test and use new models like Voxtral will play a key role in building AI applications.