Key takeaways
- Mistral has released Mistral Large 3 and the Ministral 3 family under Apache 2, providing fully open source checkpoints to the community.
- Mistral Large 3 is a sparse Mixture of Experts (MoE) model available in BF16, with optimized low-precision FP8 and NVFP4 checkpoints produced through a collaboration between Mistral AI, Red Hat AI, vLLM, and NVIDIA using the LLM Compressor library.
- Ministral 3 includes 3B, 8B, and 14B dense models in base, instruct, and reasoning variants. The official release notes that multiple compressed formats are available, and users should refer to the model cards for exact precision formats.
- Mistral describes the Ministral 3 models as state of the art small dense models with strong cost-to-performance tradeoffs, multilingual and multimodal support, and built in vision encoders.
- All new models are designed to run with upstream vLLM, with no custom fork required.
- vLLM and Red Hat AI give developers Day 0 access, allowing immediate experimentation and deployment.
Mistral has released a major new wave of open source models under Apache 2, including Mistral Large 3 and the Ministral 3 family. This generation reflects Mistral's continued move toward a fully open ecosystem, with open weights, multimodal capability, multilingual support, and upstream compatibility with vLLM. See the Mistral AI launch blog for more details, including benchmarks.
As part of this release, we collaborated with Mistral AI using the llm-compressor library to produce optimized FP8 and NVFP4 variants of Mistral Large 3, giving the community smaller and faster checkpoints that preserve strong accuracy.
With vLLM and Red Hat AI, developers and organizations can run these models on Day 0. There are no delays or proprietary forks. You can pull the weights and start serving the models immediately.
What's new in the Mistral models
Mistral Large 3:
- Sparse Mixture of Experts with fewer but larger experts
- Softmax expert routing
- Top-4 expert selection
- Llama 4-style RoPE scaling for long context
- Native multimodal support with image understanding
- Released in BF16, FP8, and NVFP4, with additional low-precision formats produced using the llm-compressor library with support from the Red Hat AI team
- Mistral states that the new model provides parity with the best instruction tuned open weight models on the market and improves quality in tone, writing, and general knowledge tasks
Ministral 3 (3B, 8B, 14B):
- Dense models in base, instruct, and reasoning variants
- Mistral describes these as state of the art small dense models with leading cost-to-performance ratios
- All include vision encoders
- Support multilingual and multimodal inputs
- Multiple precision formats are available, although users should refer to the model cards for specifics since not all variants are guaranteed in BF16 and FP8
Licensing and openness:
- All checkpoints released under Apache 2
- Fully compatible with upstream vLLM
- Reflects Mistral's continued shift toward open source, which benefits the entire community and Red Hat customers
The power of open: Immediate support in vLLM
The new Mistral 3 models are designed to work directly with upstream vLLM, giving users immediate access without custom integrations.
- Load the models from Hugging Face with no code changes; links here
- Serve MoE and dense architectures efficiently
- Use multimodal text and vision capabilities
- Leverage quantization formats provided by Mistral, including FP8 and NVFP4 for Mistral Large 3
- Enable speculative decoding, function calling, and long context features
This makes vLLM the fastest path from model release to model serving.
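As a quick illustration of what Day 0 serving looks like, the sketch below starts the Mistral Large 3 instruct checkpoint with upstream vLLM's OpenAI-compatible server. The Mistral-specific flags mirror the container commands later in this post; check the model card for the exact model identifier, required vLLM version, and GPU requirements before running it:
# Install a recent upstream vLLM release
pip install -U vllm

# Serve Mistral Large 3 with Mistral-native tokenizer, config, and weight formats
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tensor-parallel-size 8
The server then exposes the standard OpenAI-compatible API on port 8000, so existing clients work unchanged.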
Quick side note: We at Red Hat hosted a vLLM meetup in Zurich together with Mistral AI in November 2025. You can view the meetup recording here for a primer and deep dive into Mistral AI's approach to building open, foundational models.
Experiment with Red Hat AI on Day 0
Red Hat AI includes OpenShift AI and the Red Hat AI Inference Server, both built on top of open source foundations. The platform gives users a secure and efficient way to run the newest open models without waiting for long integration cycles.
Red Hat AI Inference Server, built on vLLM, lets customers run open source LLMs in production environments on prem or in the cloud. With the current Red Hat preview build, you can experiment with Mistral Large 3 and Ministral 3 today. A free 60-day trial is available for new users.
If you are using OpenShift AI, you can import the following preview runtime as a custom image:
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series
You can then use it to serve the models in the standard way, and add vLLM parameters to enable features such as speculative decoding, function calling, and multimodal serving.
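For reference, the sketch below shows one way such a custom runtime could be expressed as a KServe ServingRuntime resource pointing at the preview image. This is an illustration only, not the official runtime definition: the resource name, labels, and arguments are assumptions, and your OpenShift AI installation may expect additional annotations, so follow the OpenShift AI documentation for the exact procedure.
oc apply -f - <<'EOF'
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-mistral-3-preview            # hypothetical name for this sketch
  labels:
    opendatahub.io/dashboard: "true"      # assumption: exposes the runtime in the dashboard
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  multiModel: false
  containers:
    - name: kserve-container
      image: registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series
      args:
        - --port=8080
        - --model=/mnt/models
      ports:
        - containerPort: 8080
          protocol: TCP
EOF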
This gives teams a fast and reliable way to explore the new Apache licensed Mistral models on Red Hat's AI platform while full enterprise support is prepared for upcoming stable releases.
Serve and inference a large language model with Podman and Red Hat AI Inference Server (CUDA)
This guide explains how to serve and run inference on a large language model using Podman and Red Hat AI Inference Server, leveraging NVIDIA CUDA AI accelerators.
Prerequisites
Make sure you meet the following requirements before proceeding:
System requirements
- A Linux server with data center-grade NVIDIA AI accelerators installed.
Software requirements
- You have installed Podman or Docker
- You have access to Red Hat container images and are logged in to registry.redhat.io
Technology Preview notice
The Red Hat AI Inference Server images used in this guide are a Technology Preview and not yet fully supported. They are for evaluation only, and production workloads should wait for the upcoming official GA release from the Red Hat container registries.
Procedure: Serve and inference a model using Red Hat AI Inference Server (CUDA)
This section walks you through the steps to run a large language model with Podman and Red Hat AI Inference Server using NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the image registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series as a custom runtime and use it to serve the model in the standard way, optionally adding the vLLM parameters described in the following procedure to enable specific features (speculative decoding, function calling, and so on).
1. Log in to the Red Hat Registry
Open a terminal on your server and log in to registry.redhat.io:
podman login registry.redhat.io
2. Pull the Red Hat AI Inference Server Image (CUDA version)
podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series
3. Configure SELinux (if enabled)
If SELinux is enabled on your system, allow container access to devices:
sudo setsebool -P container_use_devices 1
4. Create a volume directory for model caching
Create and set proper permissions for the cache directory:
mkdir -p rhaiis-cache
chmod g+rwX rhaiis-cache
5. Add your Hugging Face token
Create or append your Hugging Face token to a local private.env file and source it:
echo "export HF_TOKEN=<your_HF_token>" > private.envsource private.env6. Start the AI Inference Server container
If your system includes multiple NVIDIA GPUs connected via NVSwitch, perform the following steps:
a. Check for NVSwitch
To detect NVSwitch support, check for the presence of devices:
ls /proc/driver/nvidia-nvswitch/devices/
Example output:
0000:0c:09.0 0000:0c:0a.0 0000:0c:0b.0 0000:0c:0c.0 0000:0c:0d.0 0000:0c:0e.0
b. Start NVIDIA Fabric Manager (root required)
sudo systemctl start nvidia-fabricmanager
Important
NVIDIA Fabric Manager is only required for systems with multiple GPUs using NVSwitch.
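To double-check that the service came up before starting the inference container, you can query its status with standard systemd tooling (nothing specific to this guide):
systemctl status nvidia-fabricmanager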
c. Verify GPU visibility from container
Run the following command to verify GPU access inside a container:
podman run --rm -it \
--security-opt=label=disable \
--device nvidia.com/gpu=all \
nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
nvidia-smi
d. Start the Red Hat AI Inference Server container
Start the Red Hat AI Inference Server container with the Mistral Large 3 FP8 model:
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
--model mistralai/Mistral-Large-3-675B-Instruct-2512 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--limit-mm-per-prompt '{"image":10}' \
--enable-auto-tool-choice \
--tool-call-parser mistral \
--host 0.0.0.0 \
--port 8000
Note: This configuration runs Mistral Large 3 (FP8) on a single 8x H200 node. Adjust the --tensor-parallel-size parameter to match your hardware.
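Once the server reports that it is ready, you can send a request to the OpenAI-compatible API that vLLM exposes on port 8000. The prompt below is only an illustrative example; adjust the model name if you serve a different checkpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
        "messages": [
          {"role": "user", "content": "Explain in two sentences what a sparse Mixture of Experts model is."}
        ],
        "max_tokens": 128
      }'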
Function calling
vLLM also supports calling user-defined functions (tool calling). To enable it, run the model with the following arguments.
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
--model mistralai/Mistral-Large-3-675B-Instruct-2512 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--limit-mm-per-prompt '{"image":10}' \
--enable-auto-tool-choice \
--tool-call-parser mistral
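With --enable-auto-tool-choice and the Mistral tool call parser set, clients can pass tool definitions through the OpenAI-compatible API. The request below is a minimal sketch, assuming the server is reachable on localhost:8000 as in the earlier command; get_weather is a made-up function used only for illustration:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
        "messages": [
          {"role": "user", "content": "What is the weather like in Zurich today?"}
        ],
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Get the current weather for a city",
              "parameters": {
                "type": "object",
                "properties": {
                  "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"]
              }
            }
          }
        ]
      }'
When the model chooses to call the function, the response contains a tool_calls entry with the function name and JSON-encoded arguments, which your application executes before sending the result back in a follow-up message.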
Speculative decoding with the draft model (EAGLE3):
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
--model mistralai/Mistral-Large-3-675B-Instruct-2512 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--limit-mm-per-prompt '{"image":10}' \
--host 0.0.0.0 \
--port 8000 \
--speculative_config '{
"model": "mistralai/Mistral-Large-3-675B-Instruct-2512-Eagle",
"num_speculative_tokens": 3,
"method": "eagle",
"max_model_len": "16384"
}'
What's next
The release of Mistral Large 3 and Ministral 3 represents another major step for open source LLMs and the open infrastructure supporting them.
Coming soon:
- WideEP on GB200 for next-generation multi-expert parallelism with llm-d
- Full enterprise support for Mistral 3 models in future Red Hat AI stable builds.
Open models are evolving faster than ever, and with vLLM and Red Hat AI, developers and enterprises can experiment on Day 0 safely, openly, and at scale.