Key takeaways
- Mistral AI has released Voxtral Mini 4B Realtime, a streaming speech recognition model designed for low-latency voice workloads.
- The model supports real-time ASR with sub-500 ms latency and multilingual transcription across 13 languages.
- Voxtral is supported upstream in vLLM on Day 0 through the realtime streaming API.
- Red Hat AI makes Voxtral ready for Day 1 experimentation using Red Hat AI Inference Server.
- Developers can immediately prototype streaming voice applications using open infrastructure and open model ecosystems.
Real-time speech recognition is becoming a key area in generative AI. Organizations are rapidly adopting voice interfaces for customer engagement, internal productivity tools, accessibility initiatives, and conversational AI workflows.
Mistral AI recently released Voxtral Mini 4B Realtime, a streaming automatic speech recognition model optimized for low latency audio processing. Unlike traditional ASR models that rely on batch processing, Voxtral enables continuous streaming transcription designed for conversational workloads. You can download the model directly from Hugging Face.
The release highlights how open AI infrastructure is accelerating model adoption. Voxtral is already supported upstream in vLLM, enabling developers to serve the model immediately. With Red Hat AI Inference Server, developers and organizations can begin experimenting with streaming ASR workloads on Day 1.
What’s new in Voxtral Mini 4B Realtime
Voxtral Mini 4B Realtime is a lightweight streaming ASR model designed to balance transcription accuracy with real-time responsiveness.
Streaming ASR architecture
The model is built specifically for real-time inference, which lets you transcribe while audio is actively streaming. This reduces end-to-end latency and supports interactive conversational AI experiences.
Efficient model size
With approximately 4 billion parameters, Voxtral provides an efficient balance between model quality and deployability across enterprise infrastructure environments.
Multilingual capabilities
Voxtral supports transcription across 13 languages, so organizations can build global voice-driven applications without deploying multiple specialized models.
Designed for interactive voice applications
Voxtral supports a variety of interactive workloads, including voice assistants, live meeting transcription, real-time captioning, and multilingual customer support automation.
Licensing and openness
Voxtral continues Mistral AI’s commitment to open ecosystem development. The model is available publicly through Hugging Face, so developers and organizations can experiment and deploy without proprietary lock-in. This model is released in BF16 under the Apache License 2.0, ensuring flexibility for both research and commercial use.
The model runs with upstream vLLM without requiring custom forks or specialized integrations. This upstream compatibility accelerates adoption and ensures consistent developer workflows across open AI infrastructure.
The power of open: Immediate support in vLLM
vLLM recently introduced a realtime streaming API that supports audio streaming workloads through the /v1/realtime endpoint. This lets developers serve streaming ASR models without building custom streaming pipelines.
Using Voxtral with vLLM lets you load models directly from Hugging Face and serve them through native realtime APIs. This approach supports scalable, low-latency speech recognition and helps you integrate audio pipelines into your conversational AI applications. This makes vLLM the fastest path from speech model release to production-ready serving infrastructure.
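To give a sense of what this looks like from the client side, here is a minimal sketch of opening a realtime session against a locally running vLLM server. It assumes the server is listening on 127.0.0.1:8000 and follows the session and event message types used by the full demo client later in this post; treat it as an illustration rather than a complete client.

```python
# Minimal sketch: open a realtime session against a local vLLM server.
# Assumes vLLM is serving Voxtral on 127.0.0.1:8000 and exposes the
# /v1/realtime WebSocket endpoint described later in this post.
import asyncio
import json

import websockets  # pip install websockets


async def open_session(host: str = "127.0.0.1", port: int = 8000):
    uri = f"ws://{host}:{port}/v1/realtime"
    async with websockets.connect(uri) as ws:
        # The server announces the new session as its first event.
        event = json.loads(await ws.recv())
        print("First server event:", event.get("type"))
        # Audio would then be streamed with input_audio_buffer.append events,
        # as shown in the full demo client below.


if __name__ == "__main__":
    asyncio.run(open_session())
```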
Experiment with Red Hat AI on Day 1
Red Hat AI provides enterprise-ready infrastructure for running open source AI models across the hybrid cloud. By integrating with upstream inference technologies like vLLM, Red Hat AI allows organizations to evaluate new models immediately after release.
Red Hat AI Inference Server provides:
- Scalable model serving infrastructure
- Production-aligned deployment workflows
- Hybrid cloud deployment flexibility
- Integration with open model ecosystems
Using Red Hat AI Inference Server, you can start experimenting with Voxtral today to see how it fits into your existing hybrid cloud workflows.
Serve and run streaming ASR workloads using Red Hat AI Inference Server
This guide demonstrates how to deploy Voxtral Mini 4B Realtime using Red Hat AI Inference Server and vLLM realtime APIs.
Prerequisites
- Linux server with an NVIDIA GPU (16 GB+ VRAM)
- Podman or Docker installed
- (Optional) Python 3.9+ with websockets, librosa, and numpy installed
- Access to Red Hat container images
- (Optional) Hugging Face account and token for model download
Technology preview notice
The Red Hat AI Inference Server images used in this guide are intended for experimentation and evaluation purposes. Production workloads should use upcoming stable releases from Red Hat.
Procedure: Deploy Voxtral Mini 4B Realtime using Red Hat AI Inference Server
This section walks you through how to run the Voxtral Mini 4B Realtime model with Podman and Red Hat AI Inference Server on NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime image as a custom runtime, use it to serve the model, and add the vLLM parameters described in this procedure to enable model-specific features.
Log in to the Red Hat registry. Open a terminal on your server and log in to registry.redhat.io:

```bash
podman login registry.redhat.io
```

Pull the Red Hat AI Inference Server image (CUDA version):

```bash
podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime
```

If SELinux is enabled on your system, allow container access to devices:

```bash
sudo setsebool -P container_use_devices 1
```

Create a volume directory for model caching and set its permissions:

```bash
mkdir -p rhaiis-cache
chmod g+rwX rhaiis-cache
```

Create or append your Hugging Face token to a local private.env file and source it:

```bash
echo "export HF_TOKEN=<your_HF_token>" > private.env
source private.env
```

Start the AI Inference Server container. If your system includes multiple NVIDIA GPUs connected via NVSwitch, follow these steps first:
Check for NVSwitch. To detect NVSwitch support, check for these devices:

```bash
ls /proc/driver/nvidia-nvswitch/devices/
```

Example output:

```
0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0
```

Start NVIDIA Fabric Manager (root required):

```bash
sudo systemctl start nvidia-fabricmanager
```

Important: NVIDIA Fabric Manager is only required for systems with multiple GPUs connected via NVSwitch.

Verify GPU visibility from the container. Run the following command to verify GPU access inside a container:

```bash
podman run --rm -it \
    --security-opt=label=disable \
    --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
    nvidia-smi
```

Start the Red Hat AI Inference Server container with the Voxtral Mini 4B Realtime model:
```bash
podman run --rm \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/hf:Z \
    -e HF_HUB_OFFLINE=1 \
    -e VLLM_DISABLE_COMPILE_CACHE=1 \
    -e HF_HOME=/hf \
    -e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime \
    --model mistralai/Voxtral-Mini-4B-Realtime-2602 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --trust-remote-code \
    --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
    --tensor-parallel-size 1 \
    --max-model-len 45000 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 --port 8000
```
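Model weights can take a while to load after the container starts. The following sketch is one way to wait for the server to come up before streaming audio; it assumes the standard vLLM health endpoint is exposed on the published port (8000 on localhost in this setup).

```python
# Sketch: poll the server until it reports healthy before streaming audio.
# Assumes the standard vLLM /health endpoint is reachable on 127.0.0.1:8000.
import time
import urllib.error
import urllib.request

URL = "http://127.0.0.1:8000/health"

for attempt in range(60):
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            if resp.status == 200:
                print("Server is ready.")
                break
    except (urllib.error.URLError, ConnectionError):
        pass
    print(f"Waiting for server... ({attempt + 1})")
    time.sleep(10)
else:
    raise SystemExit("Server did not become ready in time.")
```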
Query the /v1/realtime endpoint with streaming audio. The /v1/realtime endpoint in vLLM uses WebSockets to stream audio and receive inferred transcriptions. We have provided a demo script that streams audio to the WebSocket; feel free to use or customize this implementation.
You can also download a sample audio file for transcription (this one is a recording of a train defect detection system, from Wikimedia Commons):

```bash
curl -o CSX_Wikipedia_Detector_demo.wav \
    https://upload.wikimedia.org/wikipedia/commons/5/5a/CSX_Wikipedia_Detector_demo.wav
```

Install these dependencies if you don't have them:

```bash
pip install websockets librosa numpy
```
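Optionally, you can sanity-check the downloaded audio the same way the demo client will load it. This short sketch only confirms the file decodes and reports its duration after resampling to 16 kHz (the rate the client converts audio to before streaming); the file name matches the sample downloaded above.

```python
# Optional sanity check: confirm the sample audio loads and resamples to 16 kHz,
# matching how the demo client prepares audio before streaming.
import librosa

audio_path = "CSX_Wikipedia_Detector_demo.wav"
audio, sr = librosa.load(audio_path, sr=16000, mono=True)

print(f"Sample rate: {sr} Hz")
print(f"Samples: {len(audio)}")
print(f"Duration: {len(audio) / sr:.1f} seconds")
```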
Then, create a realtime_test.py file (our modifications on top of the example vLLM streaming client):

```python
#!/usr/bin/env python3
"""
Simplified realtime client for testing Voxtral.
"""
import argparse
import asyncio
import base64
import json

import librosa
import numpy as np
import websockets


def audio_to_pcm16_base64(audio_path: str) -> str:
    """Load an audio file and convert it to base64-encoded PCM16 @ 16kHz."""
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    pcm16 = (audio * 32767).astype(np.int16)
    return base64.b64encode(pcm16.tobytes()).decode("utf-8")


async def realtime_transcribe(audio_path: str, host: str, port: int, model: str):
    """Connect to the Realtime API and transcribe an audio file."""
    uri = f"ws://{host}:{port}/v1/realtime"
    print(f"Connecting to {uri}...")

    async with websockets.connect(uri) as ws:
        # Wait for session.created
        response = json.loads(await ws.recv())
        if response["type"] == "session.created":
            print(f"Session created: {response['id']}")
        else:
            print(f"Unexpected response: {response}")
            return

        # Validate model
        await ws.send(json.dumps({"type": "session.update", "model": model}))

        # Signal ready to start
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Convert audio file to base64 PCM16
        print(f"Loading audio from: {audio_path}")
        audio_base64 = audio_to_pcm16_base64(audio_path)

        # Send audio in chunks
        chunk_size = 4096
        audio_bytes = base64.b64decode(audio_base64)
        total_chunks = (len(audio_bytes) + chunk_size - 1) // chunk_size
        print(f"Sending {total_chunks} audio chunks...")

        for i in range(0, len(audio_bytes), chunk_size):
            chunk = audio_bytes[i : i + chunk_size]
            await ws.send(
                json.dumps(
                    {
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode("utf-8"),
                    }
                )
            )

        # Signal all audio is sent
        await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))
        print("Audio sent. Waiting for transcription...\n")

        # Receive transcription
        print("Transcription: ", end="", flush=True)
        while True:
            response = json.loads(await ws.recv())

            if response["type"] == "transcription.delta":
                print(response["delta"], end="", flush=True)
            elif response["type"] == "transcription.done":
                print(f"\n\nFinal transcription: {response['text']}")
                if response.get("usage"):
                    print(f"Usage: {response['usage']}")
                break
            elif response["type"] == "error":
                print(f"\nError: {response['error']}")
                break


def main():
    parser = argparse.ArgumentParser(description="Realtime WebSocket Transcription Client")
    parser.add_argument("--model", type=str, default="mistralai/Voxtral-Mini-4B-Realtime-2602")
    parser.add_argument("--audio_path", type=str, required=True)
    parser.add_argument("--host", type=str, default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args()

    asyncio.run(realtime_transcribe(args.audio_path, args.host, args.port, args.model))


if __name__ == "__main__":
    main()
```

Finally, run the demo script. Adjust the host address if you are running from a remote host:
```bash
python3 realtime_test.py --audio_path CSX_Wikipedia_Detector_demo.wav --host 127.0.0.1 --port 8000
```
Experimentation ideas
After you deploy the model, you can evaluate its performance by testing streaming latency, transcription accuracy across different languages, and how well it integrates into your agentic or conversational AI workflows.
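For example, a rough way to probe streaming latency is to time the gap between finishing the audio upload and receiving the first partial and final transcription events. The sketch below mirrors the message flow of realtime_test.py above and only adds timing; the host, port, model name, and audio file name are assumptions carried over from the earlier steps.

```python
# Rough latency probe: time from the final audio commit to the first
# transcription.delta and to transcription.done. Mirrors realtime_test.py;
# host, port, model name, and file name are assumptions from this guide.
import asyncio
import base64
import json
import time

import librosa
import numpy as np
import websockets

MODEL = "mistralai/Voxtral-Mini-4B-Realtime-2602"


async def measure(audio_path: str, host: str = "127.0.0.1", port: int = 8000):
    # Prepare PCM16 @ 16 kHz audio, as the demo client does.
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    pcm16 = (audio * 32767).astype(np.int16).tobytes()

    async with websockets.connect(f"ws://{host}:{port}/v1/realtime") as ws:
        await ws.recv()  # session.created
        await ws.send(json.dumps({"type": "session.update", "model": MODEL}))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Stream the audio in the same 4096-byte chunks as the demo client.
        for i in range(0, len(pcm16), 4096):
            chunk = base64.b64encode(pcm16[i : i + 4096]).decode("utf-8")
            await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": chunk}))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))
        sent_at = time.monotonic()

        first_delta = None
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "transcription.delta" and first_delta is None:
                first_delta = time.monotonic() - sent_at
            elif event["type"] == "transcription.done":
                total = time.monotonic() - sent_at
                if first_delta is not None:
                    print(f"First partial after {first_delta:.2f}s")
                print(f"Final transcription after {total:.2f}s")
                break
            elif event["type"] == "error":
                print("Error:", event.get("error"))
                break


if __name__ == "__main__":
    asyncio.run(measure("CSX_Wikipedia_Detector_demo.wav"))
```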
Conclusion
The release of Voxtral Mini 4B Realtime shows how streaming speech models are becoming core components of modern AI inference stacks. Real-time ASR introduces new infrastructure challenges such as low-latency streaming, continuous audio ingestion, and scaling interactive workloads, which open source serving frameworks like vLLM now address.
With real-time API support in vLLM, you can deploy streaming ASR models without building custom audio pipelines. Red Hat AI Inference Server provides a production-aligned environment for evaluating these models using containerized deployments, GPU-accelerated inference, and hybrid deployment patterns.
As real-time multimodal and voice-driven applications continue to grow, the ability to quickly test and use new models like Voxtral will play a key role in building AI applications.