How to run OpenAI's gpt-oss models locally with RamaLama

Secure, simple, and containerized

September 9, 2025
Cedric Clyburn
Related topics: Artificial intelligence, Containers
Related products: Red Hat AI


    The release of OpenAI's gpt-oss models is a significant milestone for developers and enterprises looking to control their own AI journey. These open-weight models, available in 20B and 120B parameter variants, bring ChatGPT-level reasoning capabilities to your local machine under the Apache 2.0 license. But here’s the catch: How do you run these models securely, without compromising your system or spending hours configuring GPU drivers?

    Enter RamaLama, a command-line tool that makes running AI models as simple as running containers. By leveraging OCI containers and intelligent GPU detection, RamaLama eliminates the complexity of AI infrastructure while providing strong isolation via containerization.

    This post guides you through the steps to get gpt-oss running on your machine in minutes so you can quickly integrate it into your chat interface, RAG application, agentic workflow, and more.

    Why use RamaLama for the gpt-oss models?

    Before diving into the setup, let's address the elephant in the room: Why not just use Ollama or run the models directly?

    The answer lies in RamaLama's unique approach to AI model management:

    • Zero trust security: Models run in rootless containers with no network access by default
    • Automatic GPU optimization: RamaLama detects your hardware and pulls the right container image
    • Familiar container workflows: Use the same tools and patterns you already know
    • Production-ready path: Easily transition from local development to Kubernetes deployment

    Understanding the gpt-oss models

    OpenAI's gpt-oss models come in two flavors, gpt-oss-20b and gpt-oss-120b. 

    Model          Parameters   Active per token   Memory required                 Use case
    gpt-oss-20b    ~21B         ~3.6B              ~16 GB                          General chat, coding assistance
    gpt-oss-120b   ~117B        ~5.1B              ~80 GB (e.g., an NVIDIA H100)   Complex reasoning, advanced tasks

    Both models support a 128k (128,000) token context length (although reduced context lengths between 8k and 32k are recommended unless you have ≥ 80 GB of VRAM or substantial unified memory on Apple silicon).

    They use MXFP4 quantization, which enables memory-efficient deployment on consumer GPUs. Read Optimizing generative AI models with quantization to learn more about how quantization works.

    Benchmarks show 20B ≈ o3-mini and 120B ≈ o4-mini on tasks like reasoning, coding, and MMLU (learn more on the OpenAI blog).

    Getting started with RamaLama

    Let's get gpt-oss running on your machine with RamaLama.

    Step 1: Install RamaLama

    On macOS/Linux via the install script:

    curl -fsSL https://ramalama.ai/install.sh | bash

    Or via PyPI:

    pip install ramalama

    One line, and that's it. You can now use ramalama in the terminal to pull, run, and serve models from your system. Behind the scenes, RamaLama will automatically detect your container runtime, like Podman or Docker, when running and serving models.
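
    To sanity-check the install, you can print the version and see what RamaLama detected on your system (the exact output of these subcommands may vary between releases):

    ramalama version
    ramalama info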

    Step 2: Pull and run gpt-oss-20b

    Here's where RamaLama shines. With a single command, it will:

    1. Detect your GPU configuration.
    2. Pull the appropriate container image (CUDA, ROCm, or CPU).
    3. Download the model.
    4. Launch it in an isolated container.

    Enter:

    ramalama run gpt-oss:20b

    With that single command, we’ve pulled and started an inference server for gpt-oss, right from our command line using RamaLama.

    While you still need the appropriate GPU drivers installed, RamaLama removes the need to install CUDA, CUDA deep neural network (cuDNN), or other GPU dependencies in your environment—the container image includes those.
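
    Once the download finishes, you can confirm the model is in your local store (output columns may differ slightly between versions):

    ramalama ls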

    Note

    RamaLama isn't limited to Ollama's registry; it's transport-agnostic. It supports Hugging Face (huggingface://), OCI (oci://), ModelScope, and Ollama (ollama://).
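
    For example, to pull the same weights from a different source, prefix the model reference with a transport. The Hugging Face repository below is only illustrative; substitute whichever GGUF repository you actually want to use:

    ramalama pull ollama://gpt-oss:20b
    ramalama pull huggingface://ggml-org/gpt-oss-20b-GGUF    # illustrative repository name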

    Security by default

    When RamaLama runs your model, several security measures kick in automatically:

    • Container runs with --network=none (no internet access)
    • Model mounted read-only
    • All Linux capabilities dropped (shrinking the attack surface)
    • Temporary data wiped on exit with --rm

    Why does this matter? Many models today are shared peer-to-peer or through community hubs, and their provenance isn’t always clear. Running such models directly on your host could expose you to data leaks or system tampering. By default, RamaLama’s container isolation ensures that—even if a model is malicious—it cannot exfiltrate data or modify your system outside its sandbox.
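
    If you want to verify that isolation yourself, you can inspect the running container with your engine. The commands below assume Podman is the engine in use and that the container name comes from ramalama containers; exact field names can differ across Podman versions:

    podman inspect --format '{{.HostConfig.NetworkMode}}' <container_name>    # expect "none"
    podman inspect --format '{{json .Mounts}}' <container_name>              # the model mount should report read-only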

    Maximizing your performance with RamaLama

    RamaLama automatically detects your hardware and pulls the appropriate container image, but you can fine-tune performance based on your system's capabilities.

    High-end systems (16 GB+ VRAM or 64 GB+ unified memory)

    For NVIDIA RTX 4060 Ti or better, or Apple silicon with substantial memory:

    ramalama serve gpt-oss:20b

    This runs with full GPU acceleration and launches a REST API server with web UI at http://localhost:8080. RamaLama automatically uses all available GPU layers (--ngl 999) and the model's default context size.
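
    Because the server speaks an OpenAI-compatible HTTP API (provided by the underlying llama.cpp server), you can also query it with plain curl. The endpoint and port below assume the defaults shown above; adjust them if you changed either:

    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "gpt-oss:20b", "messages": [{"role": "user", "content": "Explain containers in one sentence."}]}'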

    Figure 1 shows the process of locally serving the gpt-oss model with RamaLama and interacting with it via a local web interface.

    Figure 1: Serving the gpt-oss model locally using RamaLama (left) and testing the local web interface to chat with the model (right).

    Memory-constrained systems (8-16GB VRAM)

    For mid‑range GPUs or systems with limited memory, you can offload 10 layers to the GPU with --ngl 10 (leaving the rest on the CPU to save VRAM) while limiting the context to ~16k tokens with --ctx-size 16384 to reduce overall memory usage.

    ramalama serve --ngl 10 --ctx-size 16384 gpt-oss:20b

    CPU-only systems

    On systems without a compatible GPU, use --ngl 0 to force CPU-only inference and --threads 8 (adjust as needed) to set the CPU thread count.

    ramalama serve --ngl 0 --threads 8 --ctx-size 4096 gpt-oss:20b
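
    A reasonable starting point for --threads is the number of CPU cores on your machine, which you can check before launching (then adjust up or down based on the load you observe):

    nproc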

    Monitoring resource usage

    Running AI models can be heavy on your system. RamaLama containers make it easy to keep an eye on performance so you know whether you’re maxing out CPU, GPU, or memory. Let’s check container details and resource consumption:

    ramalama containers

    If you're using Podman, you can then use podman stats to stream container resource usage:

    podman stats <container_name>

    Alternatively, we can use nvtop, the task monitor for NVIDIA GPUs and other accelerators, to monitor load and memory usage (shown in Figure 2):

    nvtop

    Figure 2: Metrics of GPU usage with nvtop, a command-line utility for real-time monitoring of accelerator load.

    Community and next steps

    RamaLama is a collaborative effort to make AI as simple as possible by using containers to run and serve models. With support for a wide variety of registries, including Hugging Face and Ollama (even OCI registries), as well as multiple inference runtimes (namely llama.cpp and vLLM), you can run and build apps with countless different models, including gpt-oss. What will you try?
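
    If you'd rather serve with vLLM instead of the default llama.cpp runtime, you can select it on the command line. The flag below reflects RamaLama's --runtime option; run ramalama --help on your version to confirm the accepted values:

    ramalama --runtime=vllm serve gpt-oss:20b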

    In the meantime, here are some helpful links:

    • Check out the RamaLama repository.
    • Join the Matrix chat.
    • Try the quick start examples.
    • Read the blog post How RamaLama runs AI models in isolation by default.

    The future of AI is local, secure, and containerized, and with tools like RamaLama, that future is already here.
