How to run OpenAI's gpt-oss models locally with RamaLama

Secure, simple, and containerized

September 9, 2025
Cedric Clyburn

    The release of OpenAI's gpt-oss models is a significant milestone for developers and enterprises looking to control their own AI journey. These open-weight models, available in 20B and 120B parameter variants, bring ChatGPT-level reasoning capabilities to your local machine under the Apache 2.0 license. But here’s the catch: How do you run these models securely, without compromising your system or spending hours configuring GPU drivers?

    Enter RamaLama, a command-line tool that makes running AI models as simple as running containers. By leveraging OCI containers and intelligent GPU detection, RamaLama eliminates the complexity of AI infrastructure while providing strong isolation via containerization.

    This post guides you through the steps to get gpt-oss running on your machine in minutes so you can quickly integrate it into your chat interface, RAG application, agentic workflow, and more.

    Why use RamaLama for the gpt-oss models?

    Before diving into the setup, let's address the elephant in the room: Why not just use Ollama or run the models directly?

    The answer lies in RamaLama's unique approach to AI model management:

    • Zero trust security: Models run in rootless containers with no network access by default
    • Automatic GPU optimization: RamaLama detects your hardware and pulls the right container image
    • Familiar container workflows: Use the same tools and patterns you already know
    • Production-ready path: Easily transition from local development to Kubernetes deployment (see the sketch just after this list)
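
    On that last point, RamaLama can generate deployment artifacts for you. A minimal sketch, assuming a recent RamaLama version that supports the --generate option (check ramalama serve --help on your install; the service name here is arbitrary):

    # Emit Kubernetes YAML for serving the model instead of starting a local server
    ramalama serve --name gpt-oss-demo --generate kube gpt-oss:20b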

    Understanding the gpt-oss models

    OpenAI's gpt-oss models come in two flavors: gpt-oss-20b and gpt-oss-120b.

    Model        | Parameters | Active per token | Memory required            | Use case
    gpt-oss-20b  | ~21B       | ~3.6B            | ~16 GB                     | General chat, coding assistance
    gpt-oss-120b | ~117B      | ~5.1B            | ~80 GB (e.g., NVIDIA H100) | Complex reasoning, advanced tasks

    Both models support a 128k-token context length (although reduced context lengths between 8k and 32k are recommended unless you have ≥ 80 GB of VRAM or substantial unified memory on Apple silicon).

    They use MXFP4 quantization, which enables memory-efficient deployment on consumer GPUs. Read Optimizing generative AI models with quantization to learn more about how quantization works.
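
    As a rough sanity check on the numbers above (assuming MXFP4's ~4.25 bits per parameter: 4-bit values plus one shared 8-bit scale per 32-element block): 21B parameters × 4.25 bits ≈ 89 Gbit ≈ 11 GB of weights, which leaves room for the KV cache and runtime overhead within the ~16 GB figure in the table.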

    Benchmarks show gpt-oss-20b performing roughly on par with OpenAI's o3-mini, and gpt-oss-120b with o4-mini, on tasks like reasoning, coding, and MMLU (learn more on the OpenAI blog).

    Getting started with RamaLama

    Let's get gpt-oss running on your machine with RamaLama.

    Step 1: Install RamaLama

    On macOS/Linux via the install script:

    curl -fsSL https://ramalama.ai/install.sh | bash

    Or via PyPI:

    pip install ramalama

    One line, and that's it. You can now use ramalama in the terminal to pull, run, and serve models from your system. Behind the scenes, RamaLama will automatically detect your container runtime, like Podman or Docker, when running and serving models.
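
    To confirm the installation worked (assuming the installer placed ramalama on your PATH), print the version:

    ramalama version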

    Step 2: Pull and run gpt-oss-20b

    Here's where RamaLama shines. With a single command, it will:

    1. Detect your GPU configuration.
    2. Pull the appropriate container image (CUDA, ROCm, or CPU).
    3. Download the model.
    4. Launch it in an isolated container.

    Enter:

    ramalama run gpt-oss:20b

    With that single command, we’ve pulled and started an inference server for gpt-oss, right from our command line using RamaLama.

    While you still need the appropriate GPU drivers installed, RamaLama removes the need to install CUDA, CUDA deep neural network (cuDNN), or other GPU dependencies in your environment—the container image includes those.
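
    To confirm the model was downloaded, and to see its size on disk, list your locally stored models:

    ramalama list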

    Note

    RamaLama isn't limited to Ollama's registry; it's transport-agnostic. It supports Hugging Face (huggingface://), OCI (oci://), ModelScope, and Ollama (ollama://).
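
    For example, to pull a GGUF model from Hugging Face instead of the default registry (the org/repo path below is a placeholder, not a real repository):

    ramalama pull huggingface://<org>/<repo>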

    Security by default

    When RamaLama runs your model, several security measures kick in automatically:

    • Container runs with --network=none (no internet access)
    • Model mounted read-only
    • All Linux capabilities dropped (shrinking the attack surface)
    • Temporary data wiped on exit with --rm

    Why does this matter? Many models today are shared peer-to-peer or through community hubs, and their provenance isn’t always clear. Running such models directly on your host could expose you to data leaks or system tampering. By default, RamaLama’s container isolation ensures that—even if a model is malicious—it cannot exfiltrate data or modify your system outside its sandbox.
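
    You can verify this isolation yourself. A quick check, assuming Podman as the container runtime (the container name will differ on your system):

    # Find the name of the running RamaLama container
    podman ps
    # Print the container's network mode; with --network=none this returns "none"
    podman inspect --format '{{ .HostConfig.NetworkMode }}' <container_name>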

    Maximizing your performance with RamaLama

    RamaLama automatically detects your hardware and pulls the appropriate container image, but you can fine-tune performance based on your system's capabilities.

    High-end systems (16 GB+ VRAM or 64 GB+ unified memory)

    For NVIDIA RTX 4060 Ti or better, or Apple silicon with substantial memory:

    ramalama serve gpt-oss:20b

    This runs with full GPU acceleration and launches a REST API server with web UI at http://localhost:8080. RamaLama automatically uses all available GPU layers (--ngl 999) and the model's default context size.
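
    Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with curl, assuming the default llama.cpp runtime and port (adjust the port and prompt to suit your setup):

    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}]}'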

    Figure 1 shows the process of locally serving the gpt-oss model with RamaLama and interacting with it via a local web interface.

    Figure 1: Serving the gpt-oss model locally using RamaLama (left) and testing the local web interface to chat with the model (right).

    Memory-constrained systems (8–16 GB VRAM)

    For mid‑range GPUs or systems with limited memory, you can offload 10 layers to the GPU with --ngl 10 (leaving the rest on the CPU to save VRAM) while limiting the context to ~16k tokens with --ctx-size 16384 to reduce overall memory usage.

    ramalama serve --ngl 10 --ctx-size 16384 gpt-oss:20b

    CPU-only systems

    On systems without a compatible GPU, use --ngl 0 to force CPU-only inference and --threads 8 (adjust as needed) to set the CPU thread count.

    ramalama serve --ngl 0 --threads 8 --ctx-size 4096 gpt-oss:20b

    Monitoring resource usage

    Running AI models can be heavy on your system. RamaLama containers make it easy to keep an eye on performance so you know whether you’re maxing out CPU, GPU, or memory. Let’s check container details and resource consumption:

    ramalama containers

    Assuming we’re using Podman, we can use podman stats to stream container resource usage:

    podman stats <container_name>

    Alternatively, we can use nvtop, a task monitor for NVIDIA and other GPUs, to monitor accelerator load and memory usage (shown in Figure 2):

    nvtop
    Figure 2: Metrics of GPU usage with nvtop, a command-line utility for real-time monitoring of accelerator load.
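
    When you’re finished, stop the serving container, using the name shown by ramalama containers:

    ramalama stop <container_name>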

    Community and next steps

    RamaLama is a collaborative effort to make AI as simple as possible by using containers to run and serve models. With support for a wide variety of registries, including Hugging Face, Ollama, and even OCI registries, as well as multiple inference runtimes (namely llama.cpp and vLLM), you can run and build apps with countless different types of models, including gpt-oss. What will you try?

    In the meantime, here are some helpful links:

    • Check out the RamaLama repository.
    • Join the Matrix chat.
    • Try the quick start examples.
    • Read the blog post How RamaLama runs AI models in isolation by default.

    The future of AI is local, secure, and containerized, and with tools like RamaLama, that future is already here.
