Run Gemma 4 with Red Hat AI on Day 0: A step-by-step guide

Key takeaways

Gemma 4 is Google DeepMind's latest open model family, spanning four sizes from 2B to 31B parameters. All variants support text, image, and video input; the two smallest (E2B, E4B) also support audio.
The 26B MoE model activates only 3.8B parameters per forward pass, delivering frontier-class reasoning at a fraction of the inference cost.
Edge models (E2B, E4B) support 128K context; larger models (26B, 31B) support 256K.
Gemma 4 is available to run today: vLLM upstream supports it on Day 0, and Red Hat AI Inference Server (Technology Preview) is available for immediate experimentation.
All models are released under the Apache 2.0 license and support 140+ languages.

Google DeepMind released the Gemma 4 model family today, and it delivers an impressive leap in intelligence per parameter. Gemma 4 introduces multimodal capabilities across the entire lineup, a Mixture-of-Experts (MoE) architecture for efficient inference, and native support for thinking mode and function calling.

At Red Hat, we believe that open models and open infrastructure go hand in hand. That's why we work closely with the upstream community to ensure new models like Gemma 4 are ready for deployment the moment they land. vLLM ships with Day 0 support, and Red Hat AI Inference Server is available on the same day for experimentation.

This post walks you through what's new in Gemma 4, how to get started with vLLM, and how to experiment using Red Hat AI.

What's new in Gemma 4

Gemma 4 introduces significant improvements in model efficiency and multimodal capabilities.

The model family

Gemma 4 ships as four distinct models, each optimized for a different set of constraints:

Model	Parameters	Context window	Modalities
Gemma 4 E2B	2.3B effective	128K	Text, image, video, audio
Gemma 4 E4B	4.5B effective	128K	Text, image, video, audio
Gemma 4 26B A4B	26B total / 3.8B active (MoE)	256K	Text, image, video
Gemma 4 31B	31B dense	256K	Text, image, video

Multimodal by default

Every Gemma 4 model accepts text, image, and video input out of the box. The two smallest variants, E2B and E4B, go further by supporting audio input as well, making them well suited for speech-aware applications without requiring a separate model. The vision encoder supports variable aspect ratios and configurable token budgets, giving you control over the speed, memory, and quality tradeoff.

Mixture-of-Experts efficiency

The 26B A4B model is the standout efficiency story in the Gemma 4 lineup. It uses a MoE architecture in which the full model contains 26 billion parameters, but only 3.8 billion are active during any single forward pass. Per-token compute is therefore closer to that of a much smaller dense model, which makes it a compelling option when you're optimizing for throughput. Keep in mind that all 26 billion parameters must still be loaded into GPU memory.
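To put that sparsity in numbers, the short calculation below derives the share of active parameters from the two figures quoted above (26B total, 3.8B active). It is plain arithmetic on those nominal counts, and active_share is a made-up helper name used only for illustration.

```shell
#!/bin/sh
# Percentage of parameters that are active per forward pass.
# Usage: active_share <active_billions> <total_billions>
# (active_share is a hypothetical helper, not part of any tool.)
active_share() {
    awk -v a="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * a / t }'
}

# Gemma 4 26B A4B: 3.8B active out of 26B total.
active_share 3.8 26   # prints 14.6
```

In other words, roughly one seventh of the weights take part in computing each token.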

Thinking mode and native function calling

All Gemma 4 models support thinking mode, which allows the model to reason step-by-step through complex problems before returning a final answer. Native function calling is also built in across the family, making it straightforward to integrate Gemma 4 into agentic workflows and tool-use pipelines.

Long context and multilingual support

Context window support varies by model size. The edge models (E2B, E4B) offer 128K tokens. The larger 26B and 31B models scale up to 256K tokens, letting you pass entire codebases or long-form content in a single prompt. All four variants were trained on 140+ languages, making it straightforward to build applications for a global audience.

Open license

All Gemma 4 models are released under the Apache 2.0 license, which permits commercial use, modification, and redistribution, subject to the license's attribution and notice requirements.

The power of open: Use Gemma 4 on Day 0

One of the clearest signals of a model's production readiness is how quickly the open source ecosystem responds to it. For Gemma 4, vLLM ships with Day 0 support, meaning you can start serving inference requests today without waiting for framework updates.

To get started with vLLM, launch the OpenAI-compatible server with your chosen Gemma 4 variant. For example, using the vLLM container image:

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gemma4 \
    --model google/gemma-4-31B-it

For multimodal inference (text and image), you can pass image inputs directly through the OpenAI-compatible API that vLLM exposes. Refer to the vLLM multimodal documentation for the full request format.

vLLM supports Gemma 4 across NVIDIA, AMD, and Intel GPUs, as well as Google TPUs. CPU inference support is also in active development (PR #38676) and expected to land soon, further broadening where you can run Gemma 4.

Get started using Red Hat AI Inference Server

This guide demonstrates how to deploy Gemma 4 26B A4B using Red Hat AI Inference Server.

Prerequisites

A Linux server with an NVIDIA GPU that has at least 80 GB of VRAM. Smaller variants such as Gemma 4 E2B can run on GPUs with 16 GB or more of VRAM.
Podman or Docker installed
Access to Red Hat container images
(Optional) Hugging Face account and token for model download

Technology preview notice

The Red Hat AI Inference Server images used in this guide are intended for experimentation and evaluation purposes. Production workloads should use upcoming stable releases from Red Hat.

Procedure: Deploy Gemma 4 26B A4B using Red Hat AI Inference Server

This section walks you through how to run Gemma 4 with Podman and Red Hat AI Inference Server using NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4 image as a custom runtime. Then, use it to serve the model and add the vLLM parameters described in this procedure to enable model-specific features.

Log in to the Red Hat registry. Open a terminal on your server and log in to registry.redhat.io:
```
$ podman login registry.redhat.io
```

Pull the Red Hat AI Inference Server image (CUDA version):

$ podman pull registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4

If SELinux is enabled on your system, allow container access to devices:
```
$ sudo setsebool -P container_use_devices 1
```
Create a directory for the model cache and make it group-writable so the container can use it:
```
$ mkdir -p rhaii-cache
$ chmod g+rwX rhaii-cache
```
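Write your Hugging Face token to a local private.env file and source it. The command below overwrites any existing private.env; use >> instead of > if you want to append to an existing file: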
```
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
$ source private.env
```

Prepare to start the AI Inference Server container. If your system includes multiple NVIDIA GPUs connected through NVSwitch, complete the following steps first:

Check for NVSwitch support: Detect supported devices by running this command:

$ ls /proc/driver/nvidia-nvswitch/devices/

Example output:

0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

Start NVIDIA Fabric Manager (root required):
```
$ sudo systemctl start nvidia-fabricmanager
```
Important: NVIDIA Fabric Manager is only required for systems with multiple GPUs using NVSwitch.
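The NVSwitch check and the Fabric Manager start can be combined into one guarded step. The following is a minimal sketch, assuming the device directory shown above; the function names has_nvswitch and start_fabric_manager_if_needed are hypothetical helpers, not part of any NVIDIA or Red Hat tooling.

```shell
#!/bin/sh
# Succeed only if the NVSwitch device directory exists and is non-empty.
# The default path is the one used in the check above.
has_nvswitch() {
    dir="${1:-/proc/driver/nvidia-nvswitch/devices}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Start Fabric Manager only when NVSwitch devices are present.
start_fabric_manager_if_needed() {
    if has_nvswitch "$@"; then
        sudo systemctl start nvidia-fabricmanager
    else
        echo "No NVSwitch devices found; skipping Fabric Manager."
    fi
}
```

After sourcing the snippet, run start_fabric_manager_if_needed with no arguments. On a single-GPU host it prints the skip message and changes nothing.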

Verify that containers can see the GPUs by running nvidia-smi inside a CUDA base image:

$ podman run --rm -it \
  --security-opt=label=disable \
  --device nvidia.com/gpu=all \
  nvcr.io/nvidia/cuda:12.9.0-base-ubi9 \
  nvidia-smi

Start the Red Hat AI Inference Server container with the Gemma 4 26B A4B model:

$ podman run --rm \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    -v ./rhaii-cache:/opt/app-root/src/.cache \
    -e HF_HUB_OFFLINE=0 \
    -e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4 \
      --model google/gemma-4-26B-A4B-it \
      --tensor-parallel-size 1 \
      --max-model-len 4096 \
      --gpu-memory-utilization 0.90 \
      --host 0.0.0.0 --port 8000

Note: Adjust --tensor-parallel-size to match the number of GPUs you want to use. The 26B A4B variant requires approximately 49 GB of GPU memory for model weights alone, so a single 80 GB GPU is recommended.
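As a rough sanity check on that memory figure, you can estimate the weight footprint from the parameter count. The sketch below assumes 16-bit weights (2 bytes per parameter) and the nominal parameter counts from the model table. It ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound. The helper name weight_gib is hypothetical.

```shell
#!/bin/sh
# Rough weight-memory estimate in GiB.
# Assumption: weights are stored in a 16-bit format (2 bytes per parameter)
# unless a different size is passed as the second argument.
# Usage: weight_gib <parameters_in_billions> [bytes_per_param]
weight_gib() {
    awk -v p="$1" -v b="${2:-2}" \
        'BEGIN { printf "%.1f\n", p * 1e9 * b / (1024 * 1024 * 1024) }'
}

weight_gib 26   # 26B A4B: prints 48.4
weight_gib 31   # 31B dense: prints 57.7
```

The 26B estimate lands close to the figure quoted in the note. The 31B estimate leaves noticeably less headroom on a single 80 GB GPU once the KV cache is accounted for.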

Query the model using the OpenAI-compatible API. Once the server logs show Application startup complete, you can interact with Gemma 4 through the /v1/chat/completions endpoint.
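If you are scripting the deployment, you can poll the server instead of watching the logs. This is a minimal sketch that assumes the image exposes vLLM's /health endpoint on the port mapped above; wait_for_server and its default values are illustrative, not part of the product.

```shell
#!/bin/sh
# Poll the server until it answers, or give up after a number of attempts.
# Usage: wait_for_server [base_url] [max_attempts] [delay_seconds]
# Assumption: GET <base_url>/health returns HTTP 200 once the model is loaded.
wait_for_server() {
    base_url="${1:-http://localhost:8000}"
    max_attempts="${2:-60}"
    delay="${3:-5}"
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors; -s hides progress output.
        if curl -sf -o /dev/null "$base_url/health"; then
            echo "Server is ready after $attempt attempt(s)."
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "Server did not become ready in time." >&2
    return 1
}
```

Calling wait_for_server with no arguments polls http://localhost:8000 every 5 seconds for up to 5 minutes. Loading a large model can take longer on a cold cache, so raise max_attempts if needed.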

Query the Gemma 4 model

The following examples demonstrate how to use Gemma 4 for chat, reasoning, multimodal, and function calling capabilities.

Basic chat completion

$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {"role": "user", "content": "What are the key benefits of mixture-of-experts models?"}
      ],
      "max_tokens": 256,
      "temperature": 0.7
    }'
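The request above returns a full JSON document. To print only the generated text, you can filter the response with jq. This sketch assumes jq is installed and that the response follows the OpenAI chat completions schema; extract_reply and request.json are hypothetical names.

```shell
#!/bin/sh
# Print only the assistant's text from a chat completion response on stdin.
# Requires jq. The field path follows the OpenAI chat completions schema.
extract_reply() {
    jq -r '.choices[0].message.content'
}

# Example (needs the running server; request.json holds the request body):
# curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d @request.json | extract_reply
```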

Reasoning (thinking mode)

Gemma 4 supports a structured reasoning mode where the model shows its step-by-step thinking before producing a final answer. This is particularly effective for math, logic, coding, and scientific tasks. To enable it, pass enable_thinking in chat_template_kwargs and set skip_special_tokens to false so that special tokens are kept in the output:

$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {"role": "user", "content": "How many rs are in the word strawberry? Think step by step."}
      ],
      "max_tokens": 512,
      "temperature": 0,
      "chat_template_kwargs": {"enable_thinking": true},
      "skip_special_tokens": false
    }'

When thinking mode is enabled, the response will contain the model's internal reasoning process followed by the final answer.

Multimodal input (text + image)

Gemma 4 can process both text and images. To send an image along with a text prompt:

$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
            {"type": "text", "text": "Describe this image in detail."}
          ]
        }
      ],
      "max_tokens": 256,
      "temperature": 0.7
    }'

This enables use cases such as visual question answering, image captioning, and document understanding directly through the same API endpoint.
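The example above references an image by URL. For a local file, one common approach is to embed the image as a base64 data URL. The sketch below assumes the server accepts data: URLs in the image_url field, as OpenAI-compatible servers generally do. The helper name image_request is hypothetical.

```shell
#!/bin/sh
# Build a chat request that embeds a local image as a base64 data URL.
# Usage: image_request <image_path> <mime_type> <prompt>
# Note: the prompt is inserted as-is, so it must not contain double quotes
# or backslashes; use a JSON tool such as jq for arbitrary text.
image_request() {
    # Strip newlines because some base64 implementations wrap their output.
    b64=$(base64 < "$1" | tr -d '\n')
    cat <<EOF
{
  "model": "google/gemma-4-26B-A4B-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:$2;base64,$b64"}},
        {"type": "text", "text": "$3"}
      ]
    }
  ],
  "max_tokens": 256
}
EOF
}
```

To send the request, pipe the output to curl, which reads the body from standard input when given -d @-. For example: image_request ./cat.jpg image/jpeg "Describe this image in detail." | curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @-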

Function calling (agentic workflows)

Gemma 4 supports native function calling, enabling workflows where the model can decide when and how to use external tools. Define your tools and let the model invoke them:

$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {"role": "user", "content": "What is the weather like in Paris today?"}
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location",
            "parameters": {
              "type": "object",
              "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
              },
              "required": ["location"]
            }
          }
        }
      ],
      "max_tokens": 256,
      "temperature": 0
    }'

The model will return a tool_calls response with the function name and arguments, which your application can then execute and feed back to the model for a final answer.
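The feed-back step can be sketched as a second request that repeats the conversation, echoes the assistant's tool call, and appends a tool message carrying the result. The example below assumes jq is installed and that the first response follows the OpenAI chat completions schema. The helper name tool_followup and the file name first_response.json are hypothetical, and your application supplies the actual weather data.

```shell
#!/bin/sh
# Build the follow-up request that returns a tool result to the model.
# Usage: tool_followup <first_response.json> <tool_result_text>
# Requires jq. Field names follow the OpenAI chat completions schema.
tool_followup() {
    jq --arg result "$2" '{
      model: "google/gemma-4-26B-A4B-it",
      messages: [
        {role: "user", content: "What is the weather like in Paris today?"},
        # Echo the assistant turn, including its tool_calls, unchanged.
        .choices[0].message,
        # Answer the first tool call, matched by its id.
        {role: "tool",
         tool_call_id: .choices[0].message.tool_calls[0].id,
         content: $result}
      ],
      max_tokens: 256
    }' "$1"
}
```

Save the first response to first_response.json, run your own get_weather implementation, and post the output of tool_followup first_response.json "<result>" to the same endpoint. In practice you would normally resend the same tools array as well; it is omitted here for brevity.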

To serve a different variant, replace the --model argument with the corresponding model identifier, for example google/gemma-4-E2B-it or google/gemma-4-31B-it. Larger variants such as the 31B might require multiple GPUs and a higher --tensor-parallel-size, while the E2B and E4B variants can run on a single mid-range GPU.

Explore more

Explore the Gemma 4 models on Hugging Face
Get started with vLLM
Learn more about Red Hat AI Inference Server (no-cost trial available)
Learn more about Red Hat OpenShift AI

Last updated: April 3, 2026
