
Run Gemma 4 with Red Hat AI on Day 0: A step-by-step guide

April 2, 2026
Saša Zelenović, Selbi Siddiqi, Tarun Kumar, Daniele Trifirò, Lucas Wilkinson
Related topics:
Artificial intelligence
Related products:
Red Hat AI, Red Hat AI Inference Server

    Key takeaways

    • Gemma 4 is Google DeepMind's latest open model family, spanning four sizes from 2B to 31B parameters. All variants support text, image, and video input; the two smallest (E2B, E4B) also support audio.
    • The 26B MoE model activates only 3.8B parameters per forward pass, delivering frontier-class reasoning at a fraction of the inference cost.
    • Edge models (E2B, E4B) support 128K context; larger models (26B, 31B) support 256K.
    • Gemma 4 is available to run today: vLLM upstream ships Day 0 support, and Red Hat AI Inference Server (Technology Preview) enables immediate experimentation.
    • All models are released under the Apache 2.0 license and support 140+ languages.

    Google DeepMind released the Gemma 4 model family today, and it delivers an impressive leap in intelligence per parameter. Gemma 4 introduces multimodal capabilities across the entire lineup, a Mixture-of-Experts (MoE) architecture for efficient inference, and native support for thinking mode and function calling.

    At Red Hat, we believe that open models and open infrastructure go hand in hand. That's why we work closely with the upstream community to ensure new models like Gemma 4 are ready for deployment the moment they land. vLLM ships with Day 0 support, and Red Hat AI Inference Server is available on the same day for experimentation. 

    This post walks you through what's new in Gemma 4, how to get started with vLLM, and how to experiment using Red Hat AI.

    What's new in Gemma 4

    Gemma 4 introduces significant improvements in model efficiency and multimodal capabilities.

    The model family

    Gemma 4 ships as four distinct models, each optimized for a different set of constraints:

    Model           | Parameters                    | Context window | Modalities
    Gemma 4 E2B     | 2.3B effective                | 128K           | Text, image, video, audio
    Gemma 4 E4B     | 4.5B effective                | 128K           | Text, image, video, audio
    Gemma 4 26B A4B | 26B total / 3.8B active (MoE) | 256K           | Text, image, video
    Gemma 4 31B     | 31B dense                     | 256K           | Text, image, video

    Multimodal by default

    Every Gemma 4 model accepts text, image, and video input out of the box. The two smallest variants, E2B and E4B, go further by supporting audio input as well, making them well suited for speech-aware applications without requiring a separate model. The vision encoder supports variable aspect ratios and configurable token budgets, giving you control over the speed, memory, and quality tradeoff.

    Mixture-of-Experts efficiency

    The 26B A4B model is the standout efficiency story in the Gemma 4 lineup. It uses a MoE architecture where the full model contains 26 billion parameters, but only 3.8 billion are active during any single forward pass. This means you get the reasoning depth of a large model with the inference cost of a much smaller one, a compelling option when you're optimizing for throughput or running on constrained hardware.
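    To make the efficiency claim concrete, here is a back-of-envelope sketch in Python. The parameter counts come from the model table above; bf16 weights and ~2 FLOPs per active parameter per token are common rules of thumb, not published figures, and real deployments need additional memory for KV cache and activations.

```python
# Back-of-envelope numbers for the Gemma 4 26B A4B MoE model (a sketch;
# real memory use adds KV cache, activations, and runtime overhead).

TOTAL_PARAMS = 26e9    # all experts must be resident in GPU memory
ACTIVE_PARAMS = 3.8e9  # parameters actually used per forward pass
BYTES_PER_PARAM = 2    # bf16 weights

# Weight memory is driven by *total* parameters:
weight_mem_gib = TOTAL_PARAMS * BYTES_PER_PARAM / 2**30
print(f"weights: {weight_mem_gib:.1f} GiB")  # ~48.4 GiB

# Per-token compute is driven by *active* parameters:
compute_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction: {compute_fraction:.1%}")  # ~14.6%
```

    The first number lines up with the ~49 GB of weight memory noted in the deployment procedure below; the second is why the model can reason like a 26B model while billing compute like a ~4B one.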

    Thinking mode and native function calling

    All Gemma 4 models support thinking mode, which allows the model to reason step-by-step through complex problems before returning a final answer. Native function calling is also built in across the family, making it straightforward to integrate Gemma 4 into agentic workflows and tool-use pipelines.

    Long context and multilingual support

    Context window support varies by model size. The edge models (E2B, E4B) offer 128K tokens. The larger 26B and 31B models scale up to 256K tokens, letting you pass entire codebases or long-form content in a single prompt. All four variants were trained on 140+ languages, making it straightforward to build applications for a global audience.

    Open license

    All Gemma 4 models are released under the Apache 2.0 license, which permits commercial use, modification, and redistribution, subject only to the license's standard attribution and notice requirements.

    The power of open: Use Gemma 4 on Day 0

    One of the clearest signals of a model's production readiness is how quickly the open source ecosystem responds to it. For Gemma 4, vLLM ships with Day 0 support, meaning you can start serving inference requests today without waiting for framework updates.

    To get started with vLLM, install the latest release and launch the server with your chosen Gemma 4 variant. For example:

    docker run --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=$HF_TOKEN" \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:gemma4 \
        --model google/gemma-4-31B-it

    For multimodal inference (text and image), you can pass image inputs directly through the OpenAI-compatible API that vLLM exposes. Refer to the vLLM multimodal documentation for the full request format.

    vLLM supports Gemma 4 across NVIDIA, AMD, and Intel GPUs, as well as Google TPUs. CPU inference support is also in active development (PR #38676) and expected to land soon, further broadening where you can run Gemma 4.

    Get started using Red Hat AI Inference Server

    This guide demonstrates how to deploy Gemma 4 26B A4B using Red Hat AI Inference Server.

    Prerequisites

    • Linux server with NVIDIA GPU with 80 GB+ VRAM. Smaller variants such as Gemma 4 E2B can run on GPUs with 16 GB+ VRAM.
    • Podman or Docker installed
    • Access to Red Hat container images
    • (Optional) Hugging Face account and token for model download

    Technology preview notice

    The Red Hat AI Inference Server images used in this guide are intended for experimentation and evaluation purposes. Production workloads should use upcoming stable releases from Red Hat.

    Procedure: Deploy Gemma 4 26B A4B using Red Hat AI Inference Server

    This section walks you through how to run Gemma 4 with Podman and Red Hat AI Inference Server using NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4 image as a custom runtime. Then, use it to serve the model and add the vLLM parameters described in this procedure to enable model-specific features.

    1. Log in to the Red Hat registry. Open a terminal on your server and log in to registry.redhat.io:

      $ podman login registry.redhat.io
    2. Pull the Red Hat AI Inference Server image (CUDA version):

      $ podman pull registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4
    3. If SELinux is enabled on your system, allow container access to devices:

      $ sudo setsebool -P container_use_devices 1
    4. Create a volume directory for model caching. Create and set permissions for the cache directory:

      $ mkdir -p rhaii-cache
      $ chmod g+rwX rhaii-cache
    5. Create or append your Hugging Face token to a local private.env file and source it:

      $ echo "export HF_TOKEN=<your_HF_token>" > private.env
      $ source private.env
    6. Start the AI Inference Server container. If your system includes multiple NVIDIA GPUs connected via NVSwitch, complete the first two substeps; on single-GPU systems, skip ahead to verifying GPU visibility:
      1. Check for NVSwitch support: Detect supported devices by running this command:

        $ ls /proc/driver/nvidia-nvswitch/devices/

        Example output:

        0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0
      2. Start NVIDIA Fabric Manager (root required):

        $ sudo systemctl start nvidia-fabricmanager

        Important: NVIDIA Fabric Manager is only required for systems with multiple GPUs using NVSwitch.

      3. Verify GPU visibility from the container. Run the following command to verify GPU access inside a container:

        $ podman run --rm -it \
          --security-opt=label=disable \
          --device nvidia.com/gpu=all \
          nvcr.io/nvidia/cuda:12.9.0-base-ubi9 \
          nvidia-smi
      4. Start the Red Hat AI Inference Server container with the Gemma 4 26B A4B model:

        $ podman run --rm \
            --device nvidia.com/gpu=all \
            --security-opt=label=disable \
            --shm-size=4g \
            -p 8000:8000 \
            -v ./rhaii-cache:/opt/app-root/src/.cache \
            -e HF_HUB_OFFLINE=0 \
            -e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
            registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4 \
              --model google/gemma-4-26B-A4B-it \
              --tensor-parallel-size 1 \
              --max-model-len 4096 \
              --gpu-memory-utilization 0.90 \
              --host 0.0.0.0 --port 8000

        Note: Adjust --tensor-parallel-size to match the number of GPUs you want to use. The 26B A4B variant requires approximately 49 GB of GPU memory for model weights alone, so a single 80 GB GPU is recommended.

    7. Query the model using the OpenAI-compatible API. Once the server logs show Application startup complete, you can interact with Gemma 4 through the /v1/chat/completions endpoint. 

    Query the Gemma 4 model

    The following examples demonstrate how to use Gemma 4 for chat, reasoning, multimodal, and function calling capabilities.

    Basic chat completion

    $ curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "google/gemma-4-26B-A4B-it",
          "messages": [
            {"role": "user", "content": "What are the key benefits of mixture-of-experts models?"}
          ],
          "max_tokens": 256,
          "temperature": 0.7
        }'
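    The same request can be issued from Python using only the standard library. This is a minimal sketch, assuming the server started in the procedure above is listening on localhost:8000; `build_payload` and `chat` are hypothetical helper names, not part of any SDK.

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "google/gemma-4-26B-A4B-it", **params) -> dict:
    """Assemble a single-turn chat completion request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            **params}

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST to the OpenAI-compatible endpoint and return the completion text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, max_tokens=256, temperature=0.7)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires the running server from the procedure above:
# print(chat("What are the key benefits of mixture-of-experts models?"))
```

    Because the endpoint is OpenAI-compatible, the official openai Python client works the same way by pointing its base_url at the server.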

    Reasoning (thinking mode)

    Gemma 4 supports a structured reasoning mode where the model shows its step-by-step thinking before producing a final answer. This is particularly effective for math, logic, coding, and scientific tasks. To enable it, pass enable_thinking through chat_template_kwargs and set skip_special_tokens to false:

    $ curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "google/gemma-4-26B-A4B-it",
          "messages": [
            {"role": "user", "content": "How many rs are in the word strawberry? Think step by step."}
          ],
          "max_tokens": 512,
          "temperature": 0,
          "chat_template_kwargs": {"enable_thinking": true},
          "skip_special_tokens": false
        }'

    When thinking mode is enabled, the response will contain the model's internal reasoning process followed by the final answer.
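    Applications usually want to separate the reasoning from the answer before showing it to users. A minimal parsing sketch follows; the `<think>...</think>` delimiters are an assumption borrowed from other reasoning models, so check the served model's chat template for the actual special tokens.

```python
import re

def split_thinking(text: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Split a thinking-mode completion into (reasoning, final_answer).

    The delimiter tokens are an assumption; verify them against the
    model's chat template before relying on this in production.
    """
    m = re.search(re.escape(open_tag) + r"(.*?)" + re.escape(close_tag), text, re.DOTALL)
    if not m:
        return "", text.strip()          # no reasoning block found
    reasoning = m.group(1).strip()       # text between the delimiters
    answer = text[m.end():].strip()      # everything after the close tag
    return reasoning, answer

reasoning, answer = split_thinking(
    "<think>s-t-r-a-w-b-e-r-r-y: r at positions 3, 8, 9.</think>There are 3 rs."
)
print(answer)  # There are 3 rs.
```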

    Multimodal input (text + image)

    Gemma 4 can process both text and images. To send an image along with a text prompt:

    $ curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "google/gemma-4-26B-A4B-it",
          "messages": [
            {
              "role": "user",
              "content": [
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
                {"type": "text", "text": "Describe this image in detail."}
              ]
            }
          ],
          "max_tokens": 256,
          "temperature": 0.7
        }'

    This enables use cases such as visual question answering, image captioning, and document understanding directly through the same API endpoint.
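    For local files that are not publicly hosted, the OpenAI-style image_url field also accepts base64 data URIs (vLLM supports this format). A sketch with a hypothetical helper, `image_content`:

```python
import base64
from pathlib import Path

def image_content(path: str, prompt: str) -> list:
    """Build an OpenAI-style multimodal 'content' list from a local image,
    embedding the file as a base64 data URI instead of a public URL."""
    suffix = Path(path).suffix.lstrip(".").lower() or "jpeg"
    if suffix == "jpg":
        suffix = "jpeg"  # normalize to the proper MIME subtype
    data = base64.b64encode(Path(path).read_bytes()).decode()
    return [
        {"type": "image_url",
         "image_url": {"url": f"data:image/{suffix};base64,{data}"}},
        {"type": "text", "text": prompt},
    ]
```

    The returned list drops straight into the "content" field of a user message in the curl example above.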

    Function calling (agentic workflows)

    Gemma 4 supports native function calling, enabling workflows where the model can decide when and how to use external tools. Define your tools and let the model invoke them:

    $ curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "google/gemma-4-26B-A4B-it",
          "messages": [
            {"role": "user", "content": "What is the weather like in Paris today?"}
          ],
          "tools": [
            {
              "type": "function",
              "function": {
                "name": "get_weather",
                "description": "Get the current weather for a given location",
                "parameters": {
                  "type": "object",
                  "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                  },
                  "required": ["location"]
                }
              }
            }
          ],
          "max_tokens": 256,
          "temperature": 0
        }'

    The model will return a tool_calls response with the function name and arguments, which your application can then execute and feed back to the model for a final answer.
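    The execute-and-feed-back loop described above can be sketched as follows. The get_weather body is a stand-in (a real application would call an actual weather API), and the message shape mirrors the OpenAI-compatible tool_calls format shown in the request above.

```python
import json

def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stand-in tool implementation; replace with a real weather API call."""
    return {"location": location, "temperature": 18, "unit": unit}

TOOLS = {"get_weather": get_weather}

def execute_tool_calls(message: dict) -> list:
    """Run each tool call from an assistant message and build the
    'tool' role messages to append for the model's next turn."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])      # arguments arrive as a JSON string
        output = TOOLS[fn["name"]](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results

# Shape of an assistant message containing a tool call:
msg = {"tool_calls": [{"id": "call_0",
                       "function": {"name": "get_weather",
                                    "arguments": '{"location": "Paris"}'}}]}
print(execute_tool_calls(msg)[0]["content"])
```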

    To serve a different variant, replace the --model argument with the corresponding model identifier, for example google/gemma-4-E2B-it or google/gemma-4-31B-it. Larger variants such as the 31B might require multiple GPUs and a higher --tensor-parallel-size, while the E2B and E4B variants can run on a single mid-range GPU.

    Explore more

    • Explore the Gemma 4 models on Hugging Face
    • Get started with vLLM
    • Learn more about Red Hat AI Inference Server (no-cost trial available)
    • Learn more about Red Hat OpenShift AI
    Last updated: April 3, 2026

