Key takeaways
- Mistral has released Mistral Large 3 and the Ministral 3 family under Apache 2, providing fully open source checkpoints to the community.
- Mistral Large 3 is a sparse Mixture of Experts (MoE) model available in BF16, with optimized low-precision FP8 and NVFP4 checkpoints produced through a collaboration between Mistral AI, Red Hat AI, vLLM, and NVIDIA using the LLM Compressor library.
- Ministral 3 includes 3B, 8B, and 14B dense models in base, instruct, and reasoning variants. The official release notes that multiple compressed formats are available, and users should refer to the model cards for exact precision formats.
- Mistral describes the Ministral 3 models as state of the art small dense models with strong cost-to-performance tradeoffs, multilingual and multimodal support, and built in vision encoders.
- All new models are designed to run with upstream vLLM, with no custom fork required.
- vLLM and Red Hat AI give developers Day 0 access, allowing immediate experimentation and deployment.
Mistral has released a major new wave of open source models under Apache 2, including Mistral Large 3 and the Ministral 3 family. This generation reflects Mistral's continued move toward a fully open ecosystem, with open weights, multimodal capability, multilingual support, and upstream compatibility with vLLM. See the Mistral AI launch blog for more details, including benchmarks.
As part of this release, we collaborated with Mistral AI using the llm-compressor library to produce optimized FP8 and NVFP4 variants of Mistral Large 3, giving the community smaller and faster checkpoints that preserve strong accuracy.
With vLLM and Red Hat AI, developers and organizations can run these models on Day 0. There are no delays or proprietary forks. You can pull the weights and start serving the models immediately.
What's new in the Mistral models
Mistral Large 3:
- Sparse Mixture of Experts with fewer but larger experts
- Softmax expert routing
- Top-4 expert selection
- Llama 4-style RoPE scaling for long context
- Native multimodal support with image understanding
- Released in BF16, FP8, and NVFP4, with additional low-precision formats produced using the llm-compressor library with support from the Red Hat AI team
- Mistral states that the new model provides parity with the best instruction tuned open weight models on the market and improves quality in tone, writing, and general knowledge tasks
Ministral 3 (3B, 8B, 14B):
- Dense models in base, instruct, and reasoning variants
- Mistral describes these as state of the art small dense models with leading cost-to-performance ratios
- All include vision encoders
- Support multilingual and multimodal inputs
- Multiple precision formats are available, although users should refer to the model cards for specifics since not all variants are guaranteed in BF16 and FP8
Licensing and openness:
- All checkpoints released under Apache 2
- Fully compatible with upstream vLLM
- Reflects Mistral's continued shift toward open source, which benefits the entire community and Red Hat customers
The power of open: Immediate support in vLLM
The new Mistral 3 models are designed to work directly with upstream vLLM, giving users immediate access without custom integrations.
- Load the models from Hugging Face with no code changes; links here
- Serve MoE and dense architectures efficiently
- Use multimodal text and vision capabilities
- Leverage quantization formats provided by Mistral, including FP8 and NVFP4 for Mistral Large 3
- Enable speculative decoding, function calling, and long context features
This makes vLLM the fastest path from model release to model serving.
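As a quick illustration of what Day 0 serving looks like, the sketch below starts the Mistral Large 3 instruct checkpoint with upstream vLLM's OpenAI-compatible server. The Mistral-specific flags mirror the container commands later in this post; check the model card for the exact model identifier, required vLLM version, and GPU requirements before running it:
# Install a recent upstream vLLM release
pip install -U vllm

# Serve Mistral Large 3 with Mistral-native tokenizer, config, and weight formats
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tensor-parallel-size 8
The server then exposes the standard OpenAI-compatible API on port 8000, so existing clients work unchanged.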
Quick side note: We at Red Hat hosted a vLLM meetup in Zurich together with Mistral AI in November 2025. You can view the meetup recording here for a primer and deep dive into Mistral AI's approach to building open, foundational models.
Experiment with Red Hat AI on Day 0
Red Hat AI includes OpenShift AI and the Red Hat AI Inference Server, both built on top of open source foundations. The platform gives users a secure and efficient way to run the newest open models without waiting for long integration cycles.
Red Hat AI Inference Server, built on vLLM, lets customers run open source LLMs in production environments on prem or in the cloud. With the current Red Hat preview build, you can experiment with Mistral Large 3 and Ministral 3 today. A free 60-day trial is available for new users.
If you are using OpenShift AI, you can import the following preview runtime as a custom image:
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series
You can then use it to serve the models in the standard way, and add vLLM parameters to enable features such as speculative decoding, function calling, and multimodal serving.
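For reference, the sketch below shows one way such a custom runtime could be expressed as a KServe ServingRuntime resource pointing at the preview image. This is an illustration only, not the official runtime definition: the resource name, labels, and arguments are assumptions, and your OpenShift AI installation may expect additional annotations, so follow the OpenShift AI documentation for the exact procedure.
oc apply -f - <<'EOF'
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-mistral-3-preview            # hypothetical name for this sketch
  labels:
    opendatahub.io/dashboard: "true"      # assumption: exposes the runtime in the dashboard
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  multiModel: false
  containers:
    - name: kserve-container
      image: registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series
      args:
        - --port=8080
        - --model=/mnt/models
      ports:
        - containerPort: 8080
          protocol: TCP
EOF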
This gives teams a fast and reliable way to explore the new Apache licensed Mistral models on Red Hat's AI platform while full enterprise support is prepared for upcoming stable releases.
Serve and inference a large language model with Podman and Red Hat AI Inference Server (CUDA)
This guide explains how to serve and run inference on a large language model using Podman and Red Hat AI Inference Server, leveraging NVIDIA CUDA AI accelerators.
Prerequisites
Make sure you meet the following requirements before proceeding:
System requirements
- A Linux server with data center-grade NVIDIA AI accelerators installed.
Software requirements
- You have installed Podman or Docker
- You have access to Red Hat container images and are logged in to registry.redhat.io
Technology Preview notice
The Red Hat AI Inference Server images used in this guide are a Technology Preview and not yet fully supported. They are for evaluation only, and production workloads should wait for the upcoming official GA release from the Red Hat container registries.
Procedure: Serve and inference a model using Red Hat AI Inference Server (CUDA)
This section walks you through the steps to run a large language model with Podman and Red Hat AI Inference Server using NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the image registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series as a custom runtime and use it to serve the model in the standard way, optionally adding the vLLM parameters described in the following procedure to enable specific features (speculative decoding, function calling, and so on).
1. Log in to the Red Hat Registry
Open a terminal on your server and log in to registry.redhat.io:
podman login registry.redhat.io
2. Pull the Red Hat AI Inference Server Image (CUDA version)
podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series
3. Configure SELinux (if enabled)
If SELinux is enabled on your system, allow container access to devices:
sudo setsebool -P container_use_devices 1
4. Create a volume directory for model caching
Create and set proper permissions for the cache directory:
mkdir -p rhaiis-cache
chmod g+rwX rhaiis-cache
5. Add your Hugging Face token
Create or append your Hugging Face token to a local private.env file and source it:
echo "export HF_TOKEN=<your_HF_token>" > private.envsource private.env6. Start the AI Inference Server container
If your system includes multiple NVIDIA GPUs connected via NVSwitch, perform the following steps:
a. Check for NVSwitch
To detect NVSwitch support, check for the presence of devices:
ls /proc/driver/nvidia-nvswitch/devices/
Example output:
0000:0c:09.0 0000:0c:0a.0 0000:0c:0b.0 0000:0c:0c.0 0000:0c:0d.0 0000:0c:0e.0
b. Start NVIDIA Fabric Manager (root required)
sudo systemctl start nvidia-fabricmanager
Important
NVIDIA Fabric Manager is only required for systems with multiple GPUs using NVSwitch.
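To double-check that the service came up before starting the inference container, you can query its status with standard systemd tooling (nothing specific to this guide):
systemctl status nvidia-fabricmanager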
c. Verify GPU visibility from container
Run the following command to verify GPU access inside a container:
podman run --rm -it \
--security-opt=label=disable \
--device nvidia.com/gpu=all \
nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
nvidia-smi
d. Start the Red Hat AI Inference Server container
Start the Red Hat AI Inference Server container with the Mistral Large 3 FP8 model:
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
--model mistralai/Mistral-Large-3-675B-Instruct-2512 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--limit-mm-per-prompt '{"image":10}' \
--enable-auto-tool-choice \
--tool-call-parser mistral \
--host 0.0.0.0 \
--port 8000
Note: This configuration runs Mistral Large 3 (FP8) on a single 8x H200 node. Adjust the --tensor-parallel-size parameter to match your hardware.
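Once the server reports that it is ready, you can send a request to the OpenAI-compatible API that vLLM exposes on port 8000. The prompt below is only an illustrative example; adjust the model name if you serve a different checkpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
        "messages": [
          {"role": "user", "content": "Explain in two sentences what a sparse Mixture of Experts model is."}
        ],
        "max_tokens": 128
      }'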
Function calling
vLLM also supports calling user-defined functions (tool calling). To enable it, run the model with the following arguments.
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
--model mistralai/Mistral-Large-3-675B-Instruct-2512 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--limit-mm-per-prompt '{"image":10}' \
--enable-auto-tool-choice \
--tool-call-parser mistral
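With --enable-auto-tool-choice and the Mistral tool call parser set, clients can pass tool definitions through the OpenAI-compatible API. The request below is a minimal sketch, assuming the server is reachable on localhost:8000 as in the earlier command; get_weather is a made-up function used only for illustration:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
        "messages": [
          {"role": "user", "content": "What is the weather like in Zurich today?"}
        ],
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Get the current weather for a city",
              "parameters": {
                "type": "object",
                "properties": {
                  "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"]
              }
            }
          }
        ]
      }'
When the model chooses to call the function, the response contains a tool_calls entry with the function name and JSON-encoded arguments, which your application executes before sending the result back in a follow-up message.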
Speculative decoding with the draft model (EAGLE3):
podman run --rm -it \
--device nvidia.com/gpu=all \
--shm-size=4g \
-p 8000:8000 \
--tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
-e HF_HUB_CACHE=/opt/app-root/src/.cache \
registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
--model mistralai/Mistral-Large-3-675B-Instruct-2512 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--limit-mm-per-prompt '{"image":10}' \
--host 0.0.0.0 \
--port 8000 \
--speculative_config '{
"model": "mistralai/Mistral-Large-3-675B-Instruct-2512-Eagle",
"num_speculative_tokens": 3,
"method": "eagle",
"max_model_len": "16384"
}'
What's next
The release of Mistral Large 3 and Ministral 3 represents another major step for open source LLMs and the open infrastructure supporting them.
Coming soon:
- WideEP on GB200 for next-generation multi-expert parallelism with llm-d
- Full enterprise support for Mistral 3 models in future Red Hat AI stable builds.
Open models are evolving faster than ever, and with vLLM and Red Hat AI, developers and enterprises can experiment on Day 0 safely, openly, and at scale.