Key takeaways
- Gemma 4 is Google DeepMind's latest open model family, spanning four sizes from 2B to 31B parameters. All variants support text, image, and video input; the two smallest (E2B, E4B) also support audio.
- The 26B MoE model activates only 3.8B parameters per forward pass, delivering frontier-class reasoning at a fraction of the inference cost.
- Edge models (E2B, E4B) support 128K context; larger models (26B, 31B) support 256K.
- Gemma 4 is ready to run today, with Day 0 support in upstream vLLM and a Technology Preview image of Red Hat AI Inference Server available for immediate experimentation.
- All models are released under the Apache 2.0 license and support 140+ languages.
Google DeepMind released the Gemma 4 model family today, and it delivers an impressive leap in intelligence per parameter. Gemma 4 introduces multimodal capabilities across the entire lineup, a Mixture-of-Experts (MoE) architecture for efficient inference, and native support for thinking mode and function calling.
At Red Hat, we believe that open models and open infrastructure go hand in hand. That's why we work closely with the upstream community to ensure new models like Gemma 4 are ready for deployment the moment they land. vLLM ships with Day 0 support, and Red Hat AI Inference Server is available on the same day for experimentation.
This post walks you through what's new in Gemma 4, how to get started with vLLM, and how to experiment using Red Hat AI.
What's new in Gemma 4
Gemma 4 introduces significant improvements in model efficiency and multimodal capabilities.
The model family
Gemma 4 ships as four distinct models, each optimized for a different set of constraints:
| Model | Parameters | Context window | Modalities |
| --- | --- | --- | --- |
| Gemma 4 E2B | 2.3B effective | 128K | Text, image, video, audio |
| Gemma 4 E4B | 4.5B effective | 128K | Text, image, video, audio |
| Gemma 4 26B A4B | 26B total / 3.8B active (MoE) | 256K | Text, image, video |
| Gemma 4 31B | 31B dense | 256K | Text, image, video |
Multimodal by default
Every Gemma 4 model accepts text, image, and video input out of the box. The two smallest variants, E2B and E4B, go further by supporting audio input as well, making them well suited for speech-aware applications without requiring a separate model. The vision encoder supports variable aspect ratios and configurable token budgets, giving you control over the speed, memory, and quality tradeoff.
Mixture-of-Experts efficiency
The 26B A4B model is the standout efficiency story in the Gemma 4 lineup. It uses an MoE architecture in which the full model contains 26 billion parameters, but only 3.8 billion are active during any single forward pass. You get the reasoning depth of a large model with per-token compute closer to that of a much smaller one, a compelling option when you're optimizing for throughput. All 26 billion parameters must still be loaded into GPU memory, so the savings are in compute rather than in memory footprint.
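As a quick back-of-the-envelope check on these figures, the following shell sketch computes the share of parameters that are active per forward pass. It uses only the 26B and 3.8B values from this post; the helper name active_share is made up for illustration.

```shell
# Share of parameters active per forward pass, as a percentage.
# active_share is a hypothetical helper; both inputs are in billions.
active_share() {
  awk -v total="$1" -v active="$2" \
    'BEGIN { printf "%.1f\n", 100 * active / total }'
}

# Gemma 4 26B A4B: 26B total parameters, 3.8B active.
echo "Active share: $(active_share 26 3.8)%"
# prints "Active share: 14.6%"
```

In other words, roughly one parameter in seven participates in any given forward pass.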
Thinking mode and native function calling
All Gemma 4 models support thinking mode, which allows the model to reason step-by-step through complex problems before returning a final answer. Native function calling is also built in across the family, making it straightforward to integrate Gemma 4 into agentic workflows and tool-use pipelines.
Long context and multilingual support
Context window support varies by model size. The edge models (E2B, E4B) offer 128K tokens. The larger 26B and 31B models scale up to 256K tokens, letting you pass entire codebases or long-form content in a single prompt. All four variants were trained on 140+ languages, making it straightforward to build applications for a global audience.
Open license
All Gemma 4 models are released under the Apache 2.0 license, which permits commercial use, modification, and redistribution, provided that you retain the license text and required notices.
The power of open: Use Gemma 4 on Day 0
One of the clearest signals of a model's production readiness is how quickly the open source ecosystem responds to it. For Gemma 4, vLLM ships with Day 0 support, meaning you can start serving inference requests today without waiting for framework updates.
To get started with vLLM, launch the vLLM container image with your chosen Gemma 4 variant. For example:
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gemma4 \
    --model google/gemma-4-31B-it
For multimodal inference (text and image), you can pass image inputs directly through the OpenAI-compatible API that vLLM exposes. Refer to the vLLM multimodal documentation for the full request format.
vLLM supports Gemma 4 across NVIDIA, AMD, and Intel GPUs, as well as Google TPUs. CPU inference support is also in active development (PR #38676) and expected to land soon, further broadening where you can run Gemma 4.
Get started using Red Hat AI Inference Server
This guide demonstrates how to deploy Gemma 4 26B A4B using Red Hat AI Inference Server.
Prerequisites
- A Linux server with an NVIDIA GPU that has 80 GB or more of VRAM. Smaller variants such as Gemma 4 E2B can run on GPUs with 16 GB or more of VRAM.
- Podman or Docker installed
- Access to Red Hat container images
- (Optional) Hugging Face account and token for model download
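The VRAM guidance above can be sanity-checked with a rough weights-only estimate. The sketch below assumes 16-bit weights (2 bytes per parameter), which is an assumption rather than something this post states, and it ignores the KV cache, activations, and runtime overhead. The helper name weights_gib is hypothetical.

```shell
# Rough weights-only memory estimate in GiB.
# $1 = parameter count in billions, $2 = bytes per parameter
# (defaults to 2, an ASSUMPTION of 16-bit weights).
weights_gib() {
  awk -v params_b="$1" -v bytes="${2:-2}" \
    'BEGIN { printf "%.1f\n", params_b * 1e9 * bytes / (1024 ^ 3) }'
}

echo "26B A4B weights: ~$(weights_gib 26) GiB"   # ~48.4 GiB
echo "31B weights:     ~$(weights_gib 31) GiB"   # ~57.7 GiB
```

Under that assumption, the 26B estimate lands close to the roughly 49 GB this post cites for the 26B A4B weights, which leaves headroom on an 80 GB GPU for the KV cache.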
Technology preview notice
The Red Hat AI Inference Server images used in this guide are intended for experimentation and evaluation purposes. Production workloads should use upcoming stable releases from Red Hat.
Procedure: Deploy Gemma 4 26B A4B using Red Hat AI Inference Server
This section walks you through how to run Gemma 4 with Podman and Red Hat AI Inference Server using NVIDIA CUDA AI accelerators. For deployments in OpenShift AI, import the registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4 image as a custom runtime. Then, use it to serve the model and add the vLLM parameters described in this procedure to enable model-specific features.
- Log in to the Red Hat registry. Open a terminal on your server and log in to registry.redhat.io:
$ podman login registry.redhat.io
- Pull the Red Hat AI Inference Server image (CUDA version):
$ podman pull registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4
- If SELinux is enabled on your system, allow container access to devices:
$ sudo setsebool -P container_use_devices 1
- Create a volume directory for model caching. Create and set permissions for the cache directory:
$ mkdir -p rhaii-cache
$ chmod g+rwX rhaii-cache
- Create or append your Hugging Face token to a local private.env file and source it:
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
$ source private.env
- Start the AI Inference Server container. If your system includes multiple NVIDIA GPUs connected via NVSwitch, follow these steps:
  - Check for NVSwitch support. Detect supported devices by running this command:
$ ls /proc/driver/nvidia-nvswitch/devices/
    Example output:
0000:0c:09.0 0000:0c:0a.0 0000:0c:0b.0 0000:0c:0c.0 0000:0c:0d.0 0000:0c:0e.0
  - Start NVIDIA Fabric Manager (root required):
$ sudo systemctl start nvidia-fabricmanager
    Important: NVIDIA Fabric Manager is only required for systems with multiple GPUs using NVSwitch.
  - Verify GPU visibility from the container. Run the following command to verify GPU access inside a container:
$ podman run --rm -it \
    --security-opt=label=disable \
    --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.9.0-base-ubi9 \
    nvidia-smi
  - Start the Red Hat AI Inference Server container with the Gemma 4 26B A4B model:
$ podman run --rm \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    -v ./rhaii-cache:/opt/app-root/src/.cache \
    -e HF_HUB_OFFLINE=0 \
    -e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4 \
    --model google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 --port 8000
    Note: Adjust --tensor-parallel-size to match the number of GPUs you want to use. The 26B A4B variant requires approximately 49 GB of GPU memory for model weights alone, so a single 80 GB GPU is recommended.
- Query the model using the OpenAI-compatible API. Once the server logs show Application startup complete, you can interact with Gemma 4 through the /v1/chat/completions endpoint.
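Rather than watching the logs by hand, a script can poll the server until it answers. The following is a minimal sketch, assuming the server exposes the standard /v1/models route of the OpenAI-compatible API on localhost:8000. The helper name wait_for_server and its default attempt count and delay are illustrative choices, not part of the product.

```shell
# Poll the OpenAI-compatible /v1/models endpoint until it responds.
# wait_for_server is a hypothetical helper; the defaults are illustrative.
wait_for_server() {
  local base_url="${1:-http://localhost:8000}"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s hides progress output.
    if curl -sf "${base_url}/v1/models" >/dev/null 2>&1; then
      echo "Server is ready after ${i} attempt(s)."
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Server did not become ready after ${attempts} attempts." >&2
  return 1
}

# Usage, once the container is starting (up to 60 tries, 5 seconds apart):
# wait_for_server http://localhost:8000 60 5
```

Large models can take several minutes to download and load on first start, so a generous attempt count is a reasonable default.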
Query the Gemma 4 model
The following examples demonstrate how to use Gemma 4 for chat, reasoning, multimodal, and function calling capabilities.
Basic chat completion
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {"role": "user", "content": "What are the key benefits of mixture-of-experts models?"}
      ],
      "max_tokens": 256,
      "temperature": 0.7
    }'
Reasoning (thinking mode)
Gemma 4 supports a structured reasoning mode where the model shows its step-by-step thinking before producing a final answer. This is particularly effective for math, logic, coding, and scientific tasks. To enable it, pass enable_thinking through chat_template_kwargs and set skip_special_tokens to false in the request:
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {"role": "user", "content": "How many rs are in the word strawberry? Think step by step."}
      ],
      "max_tokens": 512,
      "temperature": 0,
      "chat_template_kwargs": {"enable_thinking": true},
      "skip_special_tokens": false
    }'
When thinking mode is enabled, the response will contain the model's internal reasoning process followed by the final answer.
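Because thinking mode is a per-request setting, it can be convenient to toggle it from a small helper. The sketch below builds the request body with thinking on or off, using the same fields as the request above. The helper name chat_payload is hypothetical, and it does no JSON escaping, so it assumes the prompt contains no double quotes, backslashes, or newlines. Setting skip_special_tokens back to true when thinking is off is also an assumption about sensible defaults.

```shell
# Print a chat-completions request body with thinking mode on or off.
# chat_payload is a hypothetical helper. It performs NO JSON escaping, so
# the prompt must not contain double quotes, backslashes, or newlines.
chat_payload() {
  local prompt="$1"
  local thinking="${2:-false}"
  local skip=true
  # Keep special tokens in the output only when thinking is enabled.
  if [ "$thinking" = "true" ]; then
    skip=false
  fi
  cat <<EOF
{
  "model": "google/gemma-4-26B-A4B-it",
  "messages": [{"role": "user", "content": "${prompt}"}],
  "max_tokens": 512,
  "temperature": 0,
  "chat_template_kwargs": {"enable_thinking": ${thinking}},
  "skip_special_tokens": ${skip}
}
EOF
}

# Show the body that would be sent with thinking enabled.
chat_payload "What is 17 * 23? Think step by step." true

# Usage against a running server:
# chat_payload "What is 17 * 23?" true |
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @-
```

For prompts with arbitrary content, a JSON-aware tool is a safer way to build the body than string interpolation.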
Multimodal input (text + image)
Gemma 4 can process both text and images. To send an image along with a text prompt:
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
            {"type": "text", "text": "Describe this image in detail."}
          ]
        }
      ],
      "max_tokens": 256,
      "temperature": 0.7
    }'
This enables use cases such as visual question answering, image captioning, and document understanding directly through the same API endpoint.
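The request above fetches the image from a public URL. For local files, a common pattern with OpenAI-compatible servers is to inline the image as a base64 data URL in the same image_url field. The sketch below assumes the server accepts data URLs, which is worth confirming in the vLLM multimodal documentation. The helper name image_data_url is hypothetical.

```shell
# Encode a local file as a data URL suitable for an image_url field.
# image_data_url is a hypothetical helper; $2 is the MIME type
# (defaults to image/jpeg).
image_data_url() {
  local file="$1"
  local mime="${2:-image/jpeg}"
  # tr strips the line wrapping that some base64 implementations add.
  printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$file" | tr -d '\n')"
}

# Demonstrate on a tiny stand-in file rather than a real image.
tmp_file="$(mktemp)"
printf 'hi' > "$tmp_file"
image_data_url "$tmp_file" image/png   # prints data:image/png;base64,aGk=
rm -f "$tmp_file"
```

The resulting string would replace the https URL in the "url" field of the request. Base64 inflates the payload by about a third, so large images make for large request bodies.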
Function calling (agentic workflows)
Gemma 4 supports native function calling, enabling workflows where the model can decide when and how to use external tools. Define your tools and let the model invoke them:
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {"role": "user", "content": "What is the weather like in Paris today?"}
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location",
            "parameters": {
              "type": "object",
              "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
              },
              "required": ["location"]
            }
          }
        }
      ],
      "max_tokens": 256,
      "temperature": 0
    }'
The model will return a tool_calls response with the function name and arguments, which your application can then execute and feed back to the model for a final answer.
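The feed-back step can be sketched as a second request that replays the conversation, echoes the assistant's tool call, and appends a tool message carrying the result. The body below follows the OpenAI chat-completions convention for tool messages. The call id call_0 and the weather values are made-up placeholders, since a real id comes from the first response and real data comes from your own get_weather implementation. The helper name followup_payload is hypothetical.

```shell
# Print the follow-up request body that returns a tool result to the model.
# followup_payload is a hypothetical helper. "call_0" and the weather data
# are PLACEHOLDERS: use the id from the model's tool_calls response and the
# output of your own tool.
followup_payload() {
  cat <<'EOF'
{
  "model": "google/gemma-4-26B-A4B-it",
  "messages": [
    {"role": "user", "content": "What is the weather like in Paris today?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_0",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"Paris\", \"unit\": \"celsius\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_0",
      "content": "{\"temperature\": 18, \"unit\": \"celsius\", \"condition\": \"cloudy\"}"
    }
  ],
  "max_tokens": 256,
  "temperature": 0
}
EOF
}

followup_payload

# Usage against a running server:
# followup_payload |
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @-
```

The tool_call_id in the tool message has to match the id in the assistant message so the model can pair the result with the call it made.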
To serve a different variant, replace the --model argument with the corresponding model identifier, for example google/gemma-4-E2B-it or google/gemma-4-31B-it. Larger variants such as the 31B might require multiple GPUs and a higher --tensor-parallel-size, while the E2B and E4B variants can run on a single mid-range GPU.
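Switching variants amounts to changing two arguments, which a small wrapper can make explicit. The sketch below reuses the podman flags from the procedure above and parameterizes only the model identifier and the tensor parallel size. The function name serve_gemma and the DRY_RUN switch are hypothetical conveniences, and the tensor parallel value of 2 for the 31B model is illustrative rather than a sizing recommendation.

```shell
# Launch (or, with DRY_RUN=1, just print) the inference server command.
# serve_gemma and DRY_RUN are hypothetical conveniences for this sketch.
# $1 = model identifier, $2 = tensor parallel size (defaults to 1).
serve_gemma() {
  local model="$1"
  local tp_size="${2:-1}"
  # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
  ${DRY_RUN:+echo} podman run --rm \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    -v ./rhaii-cache:/opt/app-root/src/.cache \
    -e HF_HUB_OFFLINE=0 \
    -e "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}" \
    registry.redhat.io/rhaii-preview/vllm-cuda-rhel9:gemma4 \
    --model "$model" \
    --tensor-parallel-size "$tp_size" \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 --port 8000
}

# Print the commands without starting any containers.
# The size of 2 for the 31B model is illustrative only.
DRY_RUN=1 serve_gemma google/gemma-4-E2B-it 1
DRY_RUN=1 serve_gemma google/gemma-4-31B-it 2
```

Note that the dry run prints the Hugging Face token if HF_TOKEN is set, so avoid pasting that output into shared logs.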
Explore more
- Explore the Gemma 4 models on Hugging Face
- Get started with vLLM
- Learn more about Red Hat AI Inference Server (no-cost trial available)
- Learn more about Red Hat OpenShift AI