Run Voxtral Mini 4B Realtime on vLLM with Red Hat AI on Day 1: A step-by-step guide

February 6, 2026
Saša Zelenović, Doug Smith
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI

    Key takeaways

    • Mistral AI has released Voxtral Mini 4B Realtime, a streaming speech recognition model designed for low-latency voice workloads.
    • The model supports real-time ASR with sub-500 ms latency and multilingual transcription across 13 languages.
    • Voxtral is supported upstream in vLLM on Day 0 through the realtime streaming API.
    • Red Hat AI makes Voxtral ready for Day 1 experimentation using Red Hat AI Inference Server.
    • Developers can immediately prototype streaming voice applications using open infrastructure and open model ecosystems.

    Real-time speech recognition is becoming a key area in generative AI. Organizations are rapidly adopting voice interfaces for customer engagement, internal productivity tools, accessibility initiatives, and conversational AI workflows.

    Mistral AI recently released Voxtral Mini 4B Realtime, a streaming automatic speech recognition (ASR) model optimized for low-latency audio processing. Unlike traditional ASR models that rely on batch processing, Voxtral enables continuous streaming transcription designed for conversational workloads. You can download the model directly from Hugging Face.

    The release highlights how open AI infrastructure is accelerating model adoption. Voxtral is already supported upstream in vLLM, enabling developers to serve the model immediately. With Red Hat AI Inference Server, developers and organizations can begin experimenting with streaming ASR workloads on Day 1.

    What’s new in Voxtral Mini 4B Realtime

    Voxtral Mini 4B Realtime is a lightweight streaming ASR model designed to deliver accuracy while maintaining real-time responsiveness.

    Streaming ASR architecture

    The model is built specifically for real-time inference, which lets you transcribe while audio is actively streaming. This reduces end-to-end latency and supports interactive conversational AI experiences.

    Efficient model size

    With approximately 4 billion parameters, Voxtral provides an efficient balance between model quality and deployability across enterprise infrastructure environments.
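
    As a rough back-of-the-envelope sizing check (an illustrative assumption, not an official sizing guide), 4 billion parameters stored in BF16 occupy roughly 8 GB for the weights alone, which is why a GPU with 16 GB or more of VRAM (see the prerequisites below) leaves headroom for the KV cache and activations:

    # Rough sizing sketch: approximate weight memory for a ~4B-parameter BF16 model.
    # The parameter count and 2 bytes/parameter (BF16) are illustrative assumptions.
    params = 4e9          # ~4 billion parameters
    bytes_per_param = 2   # BF16 stores each parameter in 2 bytes
    weight_gib = params * bytes_per_param / 1024**3
    print(f"Approximate weight memory: {weight_gib:.1f} GiB")  # ~7.5 GiB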

    Multilingual capabilities

    Voxtral supports transcription across 13 languages, so organizations can build global voice-driven applications without deploying multiple specialized models.

    Designed for interactive voice applications

    Voxtral supports a variety of interactive workloads, including voice assistants, live meeting transcription, real-time captioning, and multilingual customer support automation.

    Licensing and openness

    Voxtral continues Mistral AI’s commitment to open ecosystem development. The model is available publicly through Hugging Face, so developers and organizations can experiment and deploy without proprietary lock-in. This model is released in BF16 under the Apache License 2.0, ensuring flexibility for both research and commercial use.

    The model runs with upstream vLLM without requiring custom forks or specialized integrations. This upstream compatibility accelerates adoption and ensures consistent developer workflows across open AI infrastructure.

    The power of open: Immediate support in vLLM

    vLLM recently introduced a realtime streaming API that supports audio streaming workloads through the /v1/realtime endpoint. This lets developers serve streaming ASR models without building custom streaming pipelines.

    Using Voxtral with vLLM lets you load the model directly from Hugging Face and serve it through the native realtime API. This approach supports scalable, low-latency speech recognition, makes it straightforward to integrate audio pipelines into conversational AI applications, and gives you one of the fastest paths from model release to production-ready serving infrastructure.
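
    As a minimal sketch of what this looks like from the client side, the snippet below opens a WebSocket to the /v1/realtime endpoint and waits for the initial session.created event. It assumes a vLLM server is already listening on localhost:8000; the full client later in this article follows the same pattern.

    # Minimal sketch: connect to a running vLLM realtime endpoint and read the first event.
    # Assumes a server is already listening on 127.0.0.1:8000 (see the deployment steps below).
    import asyncio
    import json

    import websockets


    async def main():
        uri = "ws://127.0.0.1:8000/v1/realtime"
        async with websockets.connect(uri) as ws:
            # The server announces the session before any audio is streamed.
            event = json.loads(await ws.recv())
            print("First event:", event.get("type"))  # expected: "session.created"


    asyncio.run(main())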

    Experiment with Red Hat AI on Day 1

    Red Hat AI provides enterprise-ready infrastructure for running open source AI models across the hybrid cloud. By integrating with upstream inference technologies like vLLM, Red Hat AI allows organizations to evaluate new models immediately after release.

    Red Hat AI Inference Server provides:

    • Scalable model serving infrastructure
    • Production-aligned deployment workflows
    • Hybrid cloud deployment flexibility
    • Integration with open model ecosystems

    Using Red Hat AI Inference Server, you can start experimenting with Voxtral today to see how it fits into your existing hybrid cloud workflows.

    Serve and run streaming ASR workloads using Red Hat AI Inference Server

    This guide demonstrates how to deploy Voxtral Mini 4B Realtime using Red Hat AI Inference Server and vLLM realtime APIs.

    Prerequisites

    • A Linux server with an NVIDIA GPU with at least 16 GB of VRAM
    • Podman or Docker installed
    • (Optional) Python 3.9+ with websockets, librosa, and numpy installed
    • Access to Red Hat container images
    • (Optional) Hugging Face account and token for model download

    Technology preview notice

    The Red Hat AI Inference Server images used in this guide are intended for experimentation and evaluation purposes. Production workloads should use upcoming stable releases from Red Hat.

    Procedure: Deploy Voxtral Mini 4B Realtime using Red Hat AI Inference Server

    This section walks you through running the model with Podman and Red Hat AI Inference Server on NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime image as a custom runtime, use it to serve the model, and add the vLLM parameters described in this procedure to enable model-specific features.

    1. Log in to the Red Hat Registry. Open a terminal on your server and log in to registry.redhat.io:

      podman login registry.redhat.io
    2. Pull the Red Hat AI Inference Server image (CUDA version):

      podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime
    3. If SELinux is enabled on your system, allow container access to devices:

      sudo setsebool -P container_use_devices 1
    4. Create a volume directory for model caching and set its permissions:

      mkdir -p rhaiis-cache
      chmod g+rwX rhaiis-cache
    5. Write your Hugging Face token to a local private.env file and source it:

      echo "export HF_TOKEN=<your_HF_token>" > private.env
      source private.env
    6. Start the AI Inference Server container. Substeps 1 and 2 apply only to systems with multiple NVIDIA GPUs connected by NVSwitch; complete substeps 3 and 4 on all systems:
      1. Check for NVSwitch support by listing the NVSwitch devices:

        ls /proc/driver/nvidia-nvswitch/devices/

        Example output:

        0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0
      2. Start NVIDIA Fabric Manager (root required):

        sudo systemctl start nvidia-fabricmanager

        Important: NVIDIA Fabric Manager is only required for systems with multiple GPUs using NVSwitch.

      3. Verify GPU visibility from inside a container by running the following command:

        podman run --rm -it \
          --security-opt=label=disable \
          --device nvidia.com/gpu=all \
          nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
          nvidia-smi
      4. Start the Red Hat AI Inference Server container with the Voxtral Mini 4B Realtime model (a readiness-check sketch follows this procedure):

        podman run --rm \
            --device nvidia.com/gpu=all \
            --security-opt=label=disable \
            --shm-size=4g \
            -p 8000:8000 \
            -v ~/.cache/huggingface:/hf:Z \
            -e HF_HUB_OFFLINE=1 \
            -e VLLM_DISABLE_COMPILE_CACHE=1 \
            -e HF_HOME=/hf \
            -e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
            registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:voxtral-realtime \
              --model mistralai/Voxtral-Mini-4B-Realtime-2602 \
              --tokenizer-mode mistral \
              --config-format mistral \
              --load-format mistral \
              --trust-remote-code \
              --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
              --tensor-parallel-size 1 \
              --max-model-len 45000 \
              --max-num-batched-tokens 8192 \
              --max-num-seqs 16 \
              --gpu-memory-utilization 0.90 \
              --host 0.0.0.0 --port 8000
    7. Query the /v1/realtime endpoint with streaming audio. The /v1/realtime endpoint in vLLM uses websockets to stream audio and receive inferred transcriptions.

      We have provided a demo script to stream audio to a websocket. Feel free to use or customize this implementation.

      You can also download a sample audio file to transcribe (this one is a recording of a train defect detector system from Wikimedia Commons):

      curl -o CSX_Wikipedia_Detector_demo.wav \
        https://upload.wikimedia.org/wikipedia/commons/5/5a/CSX_Wikipedia_Detector_demo.wav

      Install these dependencies if you don’t have them:

      pip install websockets librosa numpy

      Then, create a realtime_test.py file (our modifications on top of the example vLLM streaming client):

      #!/usr/bin/env python3
      """
      Simplified realtime client for testing Voxtral.
      """
      import argparse
      import asyncio
      import base64
      import json
      import librosa
      import numpy as np
      import websockets


      def audio_to_pcm16_base64(audio_path: str) -> str:
          """Load an audio file and convert it to base64-encoded PCM16 @ 16kHz."""
          audio, _ = librosa.load(audio_path, sr=16000, mono=True)
          pcm16 = (audio * 32767).astype(np.int16)
          return base64.b64encode(pcm16.tobytes()).decode("utf-8")


      async def realtime_transcribe(audio_path: str, host: str, port: int, model: str):
          """Connect to the Realtime API and transcribe an audio file."""
          uri = f"ws://{host}:{port}/v1/realtime"
          print(f"Connecting to {uri}...")
          async with websockets.connect(uri) as ws:
              # Wait for session.created
              response = json.loads(await ws.recv())
              if response["type"] == "session.created":
                  print(f"Session created: {response['id']}")
              else:
                  print(f"Unexpected response: {response}")
                  return
              # Validate model
              await ws.send(json.dumps({"type": "session.update", "model": model}))
              # Signal ready to start
              await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
              # Convert audio file to base64 PCM16
              print(f"Loading audio from: {audio_path}")
              audio_base64 = audio_to_pcm16_base64(audio_path)
              # Send audio in chunks
              chunk_size = 4096
              audio_bytes = base64.b64decode(audio_base64)
              total_chunks = (len(audio_bytes) + chunk_size - 1) // chunk_size
              print(f"Sending {total_chunks} audio chunks...")
              for i in range(0, len(audio_bytes), chunk_size):
                  chunk = audio_bytes[i : i + chunk_size]
                  await ws.send(
                      json.dumps(
                          {
                              "type": "input_audio_buffer.append",
                              "audio": base64.b64encode(chunk).decode("utf-8"),
                          }
                      )
                  )
              # Signal all audio is sent
              await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))
              print("Audio sent. Waiting for transcription...\n")
              # Receive transcription
              print("Transcription: ", end="", flush=True)
              while True:
                  response = json.loads(await ws.recv())
                  if response["type"] == "transcription.delta":
                      print(response["delta"], end="", flush=True)
                  elif response["type"] == "transcription.done":
                      print(f"\n\nFinal transcription: {response['text']}")
                      if response.get("usage"):
                          print(f"Usage: {response['usage']}")
                      break
                  elif response["type"] == "error":
                      print(f"\nError: {response['error']}")
                      break


      def main():
          parser = argparse.ArgumentParser(description="Realtime WebSocket Transcription Client")
          parser.add_argument("--model", type=str, default="mistralai/Voxtral-Mini-4B-Realtime-2602")
          parser.add_argument("--audio_path", type=str, required=True)
          parser.add_argument("--host", type=str, default="127.0.0.1")
          parser.add_argument("--port", type=int, default=8000)
          args = parser.parse_args()
          asyncio.run(realtime_transcribe(args.audio_path, args.host, args.port, args.model))


      if __name__ == "__main__":
          main()
    8. Finally, run the demo script. Adjust the host address if you are running from a remote host:

      python3 realtime_test.py --audio_path CSX_Wikipedia_Detector_demo.wav --host 127.0.0.1 --port 8000
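
    Before (or after) streaming audio, a quick way to confirm the server is healthy is to query the OpenAI-compatible /v1/models endpoint that vLLM also exposes. The following is a minimal sketch using only the Python standard library; adjust the host and port if the server is remote:

    # Minimal readiness check: list the models served by the vLLM server.
    # Assumes the container from the procedure above is listening on 127.0.0.1:8000.
    import json
    import urllib.request

    with urllib.request.urlopen("http://127.0.0.1:8000/v1/models", timeout=10) as resp:
        data = json.load(resp)

    for model in data.get("data", []):
        print("Serving:", model.get("id"))  # expect mistralai/Voxtral-Mini-4B-Realtime-2602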

    Experimentation ideas

    After you deploy the model, you can evaluate its performance by testing streaming latency, transcription accuracy across different languages, and how well it integrates into your agentic or conversational AI workflows.
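
    For example, one simple way to measure streaming latency is to time the gap between committing the audio and receiving the first transcription.delta event. The sketch below reuses the event types from realtime_test.py above and is meant as an illustration rather than a rigorous benchmark; you could call it in place of the final commit-and-receive loop in realtime_transcribe, after all audio chunks have been sent.

    # Latency sketch: seconds from the final input_audio_buffer.commit to the first
    # transcription.delta event, reusing the event flow from realtime_test.py above.
    import json
    import time


    async def first_delta_latency(ws) -> float:
        """Return seconds between committing the audio and the first transcription delta."""
        await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))
        start = time.monotonic()
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "transcription.delta":
                return time.monotonic() - start
            if event["type"] in ("transcription.done", "error"):
                return float("nan")  # stream ended before any delta arrived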

    Conclusion

    The release of Voxtral Mini 4B Realtime shows how streaming speech models are becoming core components of modern AI inference stacks. Real-time ASR introduces new infrastructure challenges such as low-latency streaming, continuous audio ingestion, and scaling interactive workloads, which open source serving frameworks like vLLM now address.

    With real-time API support in vLLM, you can deploy streaming ASR models without building custom audio pipelines. Red Hat AI Inference Server provides a production-aligned environment for evaluating these models using containerized deployments, GPU-accelerated inference, and hybrid deployment patterns.

    As real-time multimodal and voice-driven applications continue to grow, the ability to quickly test and use new models like Voxtral will play a key role in building AI applications.
