Run Mistral Large 3 & Ministral 3 on vLLM with Red Hat AI on Day 0: A step-by-step guide

December 2, 2025
Saša Zelenović, Doug Smith, Tyler Michael Smith, Dipika Sikka, Kyle Sayers, Eldar Kurtić, Tarun Kumar
Related topics: Artificial intelligence, Open source
Related products: Red Hat AI Inference Server, Red Hat AI

    Key takeaways

    • Mistral has released Mistral Large 3 and the Ministral 3 family under Apache 2, providing fully open source checkpoints to the community.
    • Mistral Large 3 is a sparse MoE model available in BF16, with optimized low-precision variants, including FP8 and NVFP4 checkpoints produced through a collaboration between Mistral AI, Red Hat AI, vLLM, and NVIDIA using the LLM Compressor library.
    • Ministral 3 includes 3B, 8B, and 14B dense models in base, instruct, and reasoning variants. The official release notes that multiple compressed formats are available; refer to the model cards for the exact precision formats.
    • Mistral describes the Ministral 3 models as state-of-the-art small dense models with strong cost-to-performance tradeoffs, multilingual and multimodal support, and built-in vision encoders.
    • All new models are designed to run with upstream vLLM, with no custom fork required.
    • vLLM and Red Hat AI give developers Day 0 access, allowing immediate experimentation and deployment.

    Mistral has released a major new wave of open source models under Apache 2, including Mistral Large 3 and the Ministral 3 family. This generation reflects Mistral's continued move toward a fully open ecosystem, with open weights, multimodal capability, multilingual support, and upstream compatibility with vLLM. See the Mistral AI launch blog for more details, including benchmarks.

    As part of this release, we collaborated with Mistral AI using the llm-compressor library to produce optimized FP8 and NVFP4 variants of Mistral Large 3, giving the community smaller and faster checkpoints that preserve strong accuracy.

    With vLLM and Red Hat AI, developers and organizations can run these models on Day 0. There are no delays or proprietary forks. You can pull the weights and start serving the models immediately.

    What's new in the Mistral models

    Mistral Large 3:

    • Sparse Mixture of Experts with fewer but larger experts
    • Softmax expert routing
    • Top-4 expert selection
    • Llama 4-style RoPE scaling for long context
    • Native multimodal support with image understanding
    • Released in BF16, FP8, and NVFP4, with additional low-precision formats produced using the llm-compressor library with support from the Red Hat AI team
    • Mistral states that the new model provides parity with the best instruction-tuned open-weight models on the market and improves quality in tone, writing, and general knowledge tasks

    Ministral 3 (3B, 8B, 14B):

    • Dense models in base, instruct, and reasoning variants
    • Mistral describes these as state-of-the-art small dense models with leading cost-to-performance ratios
    • All include vision encoders
    • Support multilingual and multimodal inputs
    • Multiple precision formats are available, although users should refer to the model cards for specifics since not all variants are guaranteed in BF16 and FP8

    Licensing and openness:

    • All checkpoints released under Apache 2
    • Fully compatible with upstream vLLM
    • Reflects Mistral's continued shift toward open source, which benefits the entire community and Red Hat customers

    The power of open: Immediate support in vLLM

    The new Mistral 3 models are designed to work directly with upstream vLLM, giving users immediate access without custom integrations.

    • Load the models directly from Hugging Face with no code changes
    • Serve MoE and dense architectures efficiently
    • Use multimodal text and vision capabilities
    • Leverage quantization formats provided by Mistral, including FP8 and NVFP4 for Mistral Large 3
    • Enable speculative decoding, function calling, and long context features

    This makes vLLM the fastest path from model release to model serving.
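
    If you want to try the models with upstream vLLM directly, rather than through the Red Hat container image used later in this guide, a minimal sketch looks like the following. It assumes a recent vLLM release installed from PyPI with Mistral 3 support, and the model ID is a placeholder; check the Hugging Face model cards for the exact repository names.

    pip install -U vllm

    # The model ID below is a placeholder; replace it with the exact repository
    # name from the model card. The Mistral-native format flags match the ones
    # used with the Red Hat AI Inference Server image later in this guide.
    vllm serve <ministral-3-model-id> \
      --tokenizer-mode mistral \
      --config-format mistral \
      --load-format mistral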

    Quick side note: We at Red Hat hosted a vLLM meetup in Zurich together with Mistral AI in November 2025. You can view the meetup recording here for a primer and deep dive into Mistral AI's approach to building open, foundational models. 

    Experiment with Red Hat AI on Day 0

    Red Hat AI includes OpenShift AI and the Red Hat AI Inference Server, both built on top of open source foundations. The platform gives users a secure and efficient way to run the newest open models without waiting for long integration cycles.

    Red Hat AI Inference Server, built on vLLM, lets customers run open source LLMs in production environments, on premises or in the cloud. With the current Red Hat preview build, you can experiment with Mistral Large 3 and Ministral 3 today. A free 60-day trial is available for new users.

    If you are using OpenShift AI, you can import the following preview runtime as a custom image:

    registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series

    You can then use it to serve the models in the standard way, and add vLLM parameters to enable features such as speculative decoding, function calling, and multimodal serving.
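
    As a rough sketch of what that import can look like outside the OpenShift AI dashboard, the runtime can be registered as a KServe ServingRuntime resource. The resource name, port, and model format entries below are illustrative assumptions, so adjust them to your cluster's conventions; the dashboard's custom serving runtime workflow achieves the same result.

    oc apply -f - <<'EOF'
    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      name: vllm-mistral-3-preview        # illustrative name
    spec:
      supportedModelFormats:
        - name: vLLM
          autoSelect: true
      multiModel: false
      containers:
        - name: kserve-container
          image: registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series
          ports:
            - containerPort: 8000         # assumed to match the vLLM serving port
              protocol: TCP
    EOF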

    This gives teams a fast and reliable way to explore the new Apache licensed Mistral models on Red Hat's AI platform while full enterprise support is prepared for upcoming stable releases.

    Serve and inference a large language model with Podman and Red Hat AI Inference Server (CUDA)

    This guide explains how to serve and run inference on a large language model using Podman and Red Hat AI Inference Server, leveraging NVIDIA CUDA AI accelerators.

    Prerequisites

    Make sure you meet the following requirements before proceeding:

    System requirements

    • A Linux server with data center-grade NVIDIA AI accelerators installed.

    Software requirements

    • You have installed Podman or Docker.
    • You have access to Red Hat container images and are logged in to registry.redhat.io.

    Technology Preview notice

    The Red Hat AI Inference Server images used in this guide are a Technology Preview and not yet fully supported. They are for evaluation only, and production workloads should wait for the upcoming official GA release from the Red Hat container registries.

    Procedure: Serve and inference a model using Red Hat AI Inference Server (CUDA)

    This section walks you through the steps to run a large language model with Podman and Red Hat AI Inference Server using NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the image registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series as a custom runtime and use it to serve the model in the standard way, optionally adding the vLLM parameters described in the following procedure to enable features such as speculative decoding and function calling.

    1. Log in to the Red Hat Registry

    Open a terminal on your server and log in to registry.redhat.io: 

    podman login registry.redhat.io

    2. Pull the Red Hat AI Inference Server Image (CUDA version)

    podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series

    3. Configure SELinux (if enabled)

    If SELinux is enabled on your system, allow container access to devices:

    sudo setsebool -P container_use_devices 1
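
    If you are not sure whether SELinux is enabled on your host, you can check with:

    getenforce

    The command prints Enforcing, Permissive, or Disabled; the boolean change above only matters when SELinux is enforcing.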

    4. Create a volume directory for model caching

    Create the cache directory and set appropriate permissions. If you want downloaded model weights to persist across runs, you can mount this directory into the container at the HF_HUB_CACHE path used below (for example, -v ./rhaiis-cache:/opt/app-root/src/.cache):

    mkdir -p rhaiis-cache
    chmod g+rwX rhaiis-cache

    5. Add your Hugging Face token

    Create or append your Hugging Face token to a local private.env file and source it:

    echo "export HF_TOKEN=<your_HF_token>" > private.env
    source private.env

    6. Start the AI Inference Server container

    If your system includes multiple NVIDIA GPUs connected via NVSwitch, perform steps a and b first; otherwise, skip ahead to step c to verify GPU visibility.

    a. Check for NVSwitch

    To detect NVSwitch support, check for the presence of devices:

    ls /proc/driver/nvidia-nvswitch/devices/

    Example output:

    0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

    b. Start NVIDIA Fabric Manager (root required)

    sudo systemctl start nvidia-fabricmanager

    Important

    NVIDIA Fabric Manager is only required for systems with multiple GPUs using NVSwitch.
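
    You can confirm that the service is active before launching the inference container:

    sudo systemctl status nvidia-fabricmanager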

    c. Verify GPU visibility from container

    Run the following command to verify GPU access inside a container:

    podman run --rm -it \
     --security-opt=label=disable \
     --device nvidia.com/gpu=all \
     nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
     nvidia-smi

    d. Start the Red Hat AI Inference Server container

    Start the Red Hat AI Inference Server container with the Mistral Large 3 FP8 model:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      --tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      -e HF_HUB_CACHE=/opt/app-root/src/.cache \
      registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
        --model mistralai/Mistral-Large-3-675B-Instruct-2512 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --kv-cache-dtype fp8 \
        --tensor-parallel-size 8 \
        --limit-mm-per-prompt '{"image":10}' \
        --enable-auto-tool-choice \
        --tool-call-parser mistral \
        --host 0.0.0.0 \
        --port 8000

    Note: This configuration runs Mistral Large 3 (FP8) on a single 8x H200 node. Adjust the --tensor-parallel-size parameter to match your hardware.
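
    Once the server reports that it is ready, you can send a quick request to vLLM's OpenAI-compatible endpoint to confirm that everything is wired up; the prompt below is just an example:

    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
            "messages": [
              {"role": "user", "content": "Summarize what vLLM does in one sentence."}
            ]
          }'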

    Function calling

    vLLM also supports calling user-defined functions (tool calling). To enable it, run the model with the following arguments:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      --tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      -e HF_HUB_CACHE=/opt/app-root/src/.cache \
      registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
        --model mistralai/Mistral-Large-3-675B-Instruct-2512 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --kv-cache-dtype fp8 \
        --tensor-parallel-size 8 \
        --limit-mm-per-prompt '{"image":10}' \
        --enable-auto-tool-choice \
        --tool-call-parser mistral
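
    With tool calling enabled, you can exercise it through the same OpenAI-compatible API. The get_weather function below is a made-up example to illustrate the request shape:

    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
            "messages": [
              {"role": "user", "content": "What is the weather like in Paris right now?"}
            ],
            "tools": [{
              "type": "function",
              "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                  "type": "object",
                  "properties": {"city": {"type": "string"}},
                  "required": ["city"]
                }
              }
            }]
          }'

    If the model decides to call the function, the response contains a tool_calls entry with the function name and JSON arguments instead of a plain text answer.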

    Speculative decoding with the draft model (EAGLE3): a smaller Eagle draft model proposes candidate tokens that Mistral Large 3 then verifies in parallel, which can reduce generation latency without changing the output distribution. Run the model with the following arguments:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      --tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      -e HF_HUB_CACHE=/opt/app-root/src/.cache \
      registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
        --model mistralai/Mistral-Large-3-675B-Instruct-2512 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --kv-cache-dtype fp8 \
        --tensor-parallel-size 8 \
        --limit-mm-per-prompt '{"image":10}' \
        --host 0.0.0.0 \
        --port 8000 \
        --speculative_config '{
          "model": "mistralai/Mistral-Large-3-675B-Instruct-2512-Eagle",
          "num_speculative_tokens": 3,
          "method": "eagle",
          "max_model_len": "16384"
        }'
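
    After startup, vLLM's Prometheus metrics endpoint is a quick way to check that the draft model is being used; the exact metric names vary between vLLM versions, so treat this as a sketch:

    curl -s http://localhost:8000/metrics | grep -i spec_decode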

    What's next

    The release of Mistral Large 3 and Ministral 3 represents another major step for open source LLMs and the open infrastructure supporting them.

    Coming soon:

    • WideEP on GB200 for next-generation multi-expert parallelism with llm-d
    • Full enterprise support for Mistral 3 models in future Red Hat AI stable builds.

    Open models are evolving faster than ever, and with vLLM and Red Hat AI, developers and enterprises can experiment on Day 0 safely, openly, and at scale.
