Run Mistral Large 3 & Ministral 3 on vLLM with Red Hat AI on Day 0: A step-by-step guide

December 2, 2025
Saša Zelenović, Doug Smith, Tyler Michael Smith, Dipika Sikka, Kyle Sayers, Eldar Kurtić, Tarun Kumar
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI Inference, Red Hat AI

    Key takeaways

    • Mistral has released Mistral Large 3 and the Ministral 3 family under Apache 2, providing fully open source checkpoints to the community.
    • Mistral Large 3 is a sparse MoE model available in BF16, with optimized low-precision FP8 and NVFP4 checkpoints provided through the Mistral AI, Red Hat AI, vLLM, and NVIDIA collaboration, using the LLM Compressor library.
    • Ministral 3 includes 3B, 8B, and 14B dense models in base, instruct, and reasoning variants. The official release notes that multiple compressed formats are available, and users should refer to the model cards for exact precision formats.
    • Mistral describes the Ministral 3 models as state-of-the-art small dense models with strong cost-to-performance tradeoffs, multilingual and multimodal support, and built-in vision encoders.
    • All new models are designed to run with upstream vLLM, with no custom fork required.
    • vLLM and Red Hat AI give developers Day 0 access, allowing immediate experimentation and deployment.

    Mistral has released a major new wave of open source models under Apache 2, including Mistral Large 3 and the Ministral 3 family. This generation reflects Mistral's continued move toward a fully open ecosystem, with open weights, multimodal capability, multilingual support, and upstream compatibility with vLLM. See the Mistral AI launch blog for more details, including benchmarks.

    As part of this release, we collaborated with Mistral AI using the llm-compressor library to produce optimized FP8 and NVFP4 variants of Mistral Large 3, giving the community smaller and faster checkpoints that preserve strong accuracy.

    With vLLM and Red Hat AI, developers and organizations can run these models on Day 0. There are no delays or proprietary forks. You can pull the weights and start serving the models immediately.

    What's new in the Mistral models

    Mistral Large 3:

    • Sparse Mixture of Experts architecture with fewer but larger experts
    • Softmax expert routing
    • Top-4 expert selection
    • Llama 4-style RoPE scaling for long context
    • Native multimodal support with image understanding
    • Released in BF16, FP8, and NVFP4, with additional low-precision formats produced using the llm-compressor library with support from the Red Hat AI team
    • Mistral states that the new model reaches parity with the best instruction-tuned open weight models on the market and improves quality in tone, writing, and general knowledge tasks

    Ministral 3 (3B, 8B, 14B):

    • Dense models in base, instruct, and reasoning variants
    • Mistral describes these as state-of-the-art small dense models with leading cost-to-performance ratios
    • All include vision encoders
    • Support multilingual and multimodal inputs
    • Multiple precision formats are available, although users should refer to the model cards for specifics since not all variants are guaranteed in BF16 and FP8

    Licensing and openness:

    • All checkpoints released under Apache 2
    • Fully compatible with upstream vLLM
    • Reflects Mistral's continued shift toward open source, which benefits the entire community and Red Hat customers

    The power of open: Immediate support in vLLM

    The new Mistral 3 models are designed to work directly with upstream vLLM, giving users immediate access without custom integrations.

    • Load the models from Hugging Face with no code changes
    • Serve MoE and dense architectures efficiently
    • Use multimodal text and vision capabilities
    • Leverage quantization formats provided by Mistral, including FP8 and NVFP4 for Mistral Large 3
    • Enable speculative decoding, function calling, and long context features

    This makes vLLM the fastest path from model release to model serving.
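    As a rough sketch of that upstream path, assuming a recent vLLM release with Mistral 3 support and a node with enough accelerator memory (the Red Hat AI commands later in this guide use an 8x H200 node), serving could look like this:

    ```shell
    # Install upstream vLLM; a recent release with Mistral 3 support is assumed
    pip install --upgrade vllm

    # Serve Mistral Large 3 straight from Hugging Face using the Mistral-native
    # tokenizer, config, and weight formats
    vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
      --tokenizer-mode mistral \
      --config-format mistral \
      --load-format mistral \
      --tensor-parallel-size 8
    ```

    Adjust --tensor-parallel-size to the number of GPUs available on your node.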

    Quick side note: We at Red Hat hosted a vLLM meetup in Zurich together with Mistral AI in November 2025. The meetup recording offers a primer and deep dive into Mistral AI's approach to building open foundation models.

    Experiment with Red Hat AI on Day 0

    Red Hat AI includes OpenShift AI and the Red Hat AI Inference Server, both built on top of open source foundations. The platform gives users a secure and efficient way to run the newest open models without waiting for long integration cycles.

    Red Hat AI Inference Server, built on vLLM, lets customers run open source LLMs in production environments on premises or in the cloud. With the current Red Hat preview build, you can experiment with Mistral Large 3 and Ministral 3 today. A free 60-day trial is available for new users.

    If you are using OpenShift AI, you can import the following preview runtime as a custom image:

    registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series

    You can then use it to serve the models in the standard way, and add vLLM parameters to enable features such as speculative decoding, function calling, and multimodal serving.

    This gives teams a fast and reliable way to explore the new Apache licensed Mistral models on Red Hat's AI platform while full enterprise support is prepared for upcoming stable releases.

    Serve and inference a large language model with Podman and Red Hat AI Inference Server (CUDA)

    This guide explains how to serve and run inference on a large language model using Podman and Red Hat AI Inference Server, leveraging NVIDIA CUDA AI accelerators.

    Prerequisites

    Make sure you meet the following requirements before proceeding:

    System requirements

    • A Linux server with data center-grade NVIDIA AI accelerators installed.

    Software requirements

    • You have installed Podman or Docker.
    • You have access to Red Hat container images and are logged in to registry.redhat.io.

    Technology Preview notice

    The Red Hat AI Inference Server images used in this guide are a Technology Preview and not yet fully supported. They are for evaluation only, and production workloads should wait for the upcoming official GA release from the Red Hat container registries.

    Procedure: Serve and inference a model using Red Hat AI Inference Server (CUDA)

    This section walks you through the steps to run a large language model with Podman and Red Hat AI Inference Server using NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, simply import the image registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series as a custom runtime and use it to serve the model in the standard way, optionally adding the vLLM parameters described in the following procedure to enable features such as speculative decoding and function calling.

    1. Log in to the Red Hat Registry

    Open a terminal on your server and log in to registry.redhat.io: 

    podman login registry.redhat.io

    2. Pull the Red Hat AI Inference Server Image (CUDA version)

    podman pull registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series

    3. Configure SELinux (if enabled)

    If SELinux is enabled on your system, allow container access to devices:

    sudo setsebool -P container_use_devices 1

    4. Create a volume directory for model caching

    Create and set proper permissions for the cache directory:

    mkdir -p rhaiis-cache
    chmod g+rwX rhaiis-cache

    5. Add your Hugging Face token

    Create or append your Hugging Face token to a local private.env file and source it:

    echo "export HF_TOKEN=<your_HF_token>" > private.env
    source private.env

    6. Start the AI Inference Server container

    If your system includes multiple NVIDIA GPUs connected via NVSwitch, perform the following steps:

    a. Check for NVSwitch

    To detect NVSwitch support, check for the presence of devices:

    ls /proc/driver/nvidia-nvswitch/devices/

    Example output:

    0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

    b. Start NVIDIA Fabric Manager (root required)

    sudo systemctl start nvidia-fabricmanager

    Important

    NVIDIA Fabric Manager is only required for systems with multiple GPUs using NVSwitch.

    c. Verify GPU visibility from container

    Run the following command to verify GPU access inside a container:

    podman run --rm -it \
     --security-opt=label=disable \
     --device nvidia.com/gpu=all \
     nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
     nvidia-smi

    d. Start the Red Hat AI Inference Server container

    Start the Red Hat AI Inference Server container with the Mistral Large 3 FP8 model:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      --tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      -e HF_HUB_CACHE=/opt/app-root/src/.cache \
      registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
        --model mistralai/Mistral-Large-3-675B-Instruct-2512 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --kv-cache-dtype fp8 \
        --tensor-parallel-size 8 \
        --limit-mm-per-prompt '{"image":10}' \
        --enable-auto-tool-choice \
        --tool-call-parser mistral \
        --host 0.0.0.0 \
        --port 8000

    Note: This configuration runs Mistral Large 3 (FP8) on a single 8x H200 node. Adjust the --tensor-parallel-size parameter to match your GPU configuration.
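    Once the container reports it is serving, you can exercise the server's OpenAI-compatible API with a quick request. This assumes the server is reachable on localhost:8000, as mapped above; the "model" field must match the --model value passed to the server:

    ```shell
    # Send a chat completion request to the vLLM OpenAI-compatible endpoint
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
            "messages": [
              {"role": "user", "content": "In one sentence, what is a sparse MoE model?"}
            ],
            "max_tokens": 128
          }'
    ```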

    Function calling

    vLLM also supports calling user-defined functions. Make sure to serve the model with the --enable-auto-tool-choice and --tool-call-parser mistral arguments:

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      --tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      -e HF_HUB_CACHE=/opt/app-root/src/.cache \
      registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
        --model mistralai/Mistral-Large-3-675B-Instruct-2512 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --kv-cache-dtype fp8 \
        --tensor-parallel-size 8 \
        --limit-mm-per-prompt '{"image":10}' \
        --enable-auto-tool-choice \
        --tool-call-parser mistral
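    With those arguments in place, a tool-enabled request can be sketched as follows. The get_weather function here is purely hypothetical; the response should contain a structured tool call rather than plain text:

    ```shell
    # Send a request with a tool definition; with --enable-auto-tool-choice and
    # --tool-call-parser mistral, the model can return a parsed tool call
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
            "messages": [{"role": "user", "content": "What is the weather in Zurich today?"}],
            "tools": [{
              "type": "function",
              "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                  "type": "object",
                  "properties": {"city": {"type": "string"}},
                  "required": ["city"]
                }
              }
            }]
          }'
    ```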

    Speculative decoding with the draft model (EAGLE3):

    podman run --rm -it \
      --device nvidia.com/gpu=all \
      --shm-size=4g \
      -p 8000:8000 \
      --tmpfs /home/vllm/.cache:rw,exec,uid=2000,gid=2000 \
      --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
      --env "HF_HUB_OFFLINE=0" \
      -e HF_HUB_CACHE=/opt/app-root/src/.cache \
      registry.redhat.io/rhaiis-preview/vllm-cuda-rhel9:mistral-3-series \
        --model mistralai/Mistral-Large-3-675B-Instruct-2512 \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --kv-cache-dtype fp8 \
        --tensor-parallel-size 8 \
        --limit-mm-per-prompt '{"image":10}' \
        --host 0.0.0.0 \
        --port 8000 \
        --speculative_config '{
          "model": "mistralai/Mistral-Large-3-675B-Instruct-2512-Eagle",
          "num_speculative_tokens": 3,
          "method": "eagle",
          "max_model_len": "16384"
        }'
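    For any of the serve commands above, a simple readiness check is to list the models the server exposes once weight loading completes (again assuming localhost:8000); the returned IDs should include the --model value:

    ```shell
    # List the served models from the OpenAI-compatible endpoint
    curl -s http://localhost:8000/v1/models \
      | python3 -c 'import json,sys; print([m["id"] for m in json.load(sys.stdin)["data"]])'
    ```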

    What's next

    The release of Mistral Large 3 and Ministral 3 represents another major step for open source LLMs and the open infrastructure supporting them.

    Coming soon:

    • WideEP on GB200 for next-generation multi-expert parallelism with llm-d
    • Full enterprise support for Mistral 3 models in future Red Hat AI stable builds.

    Open models are evolving faster than ever, and with vLLM and Red Hat AI, developers and enterprises can experiment on Day 0 safely, openly, and at scale.
