
How RamaLama makes working with AI models boring

November 22, 2024
Daniel Walsh
Related topics:
Artificial intelligence, Containers, Open source, Python
Related products:
Podman Desktop

    Over the last few months, our team has been working on a new AI project called RamaLama (Figure 1). Yes, another name that contains lama.

    Figure 1: The RamaLama project logo, featuring a llama wearing aviator sunglasses and a leather jacket.

    What does RamaLama do?

    RamaLama facilitates local management and serving of AI models.

    RamaLama's goal is to make it easy for developers and administrators to run and serve AI models. RamaLama merges the world of AI inferencing with the world of containers as defined by Podman and Docker, and eventually Kubernetes.

    When you first launch RamaLama, it inspects your system for GPU support, falling back to CPU support if no GPUs are present. It then uses a container engine like Podman or Docker to download a container image from quay.io/ramalama. The container image contains all the software necessary to run an AI model for your system's setup. Currently, RamaLama supports llama.cpp and vLLM for running containerized models.
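The hardware probe can be pictured as a simple fallback chain. The sketch below is illustrative only (the function name and the specific checks are assumptions, not RamaLama's actual code): prefer a GPU-enabled image when a GPU driver is visible, otherwise fall back to CPU.

```python
import os
import shutil

def detect_accelerator():
    """Illustrative sketch of a GPU probe with CPU fallback.

    This is NOT RamaLama's actual implementation; it only shows the
    idea: pick a runtime image matching whatever hardware is visible.
    """
    if shutil.which("nvidia-smi"):   # NVIDIA driver tools on PATH
        return "cuda"
    if os.path.exists("/dev/kfd"):   # AMD ROCm kernel interface
        return "rocm"
    return "cpu"                     # safe default when no GPU is found

print(detect_accelerator())
```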

    Once the container image is in place, RamaLama pulls the specified AI model from any of three types of model registries: Ollama, Hugging Face, or an OCI registry.
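Conceptually, resolving a model reference means splitting off a transport-style prefix that names the source registry. The helper below is hypothetical, not RamaLama's real parser; the exact prefixes and the default transport are assumptions for illustration.

```python
def parse_model_ref(ref, default_transport="ollama"):
    """Split a reference like 'huggingface://org/model' into (transport, name).

    Purely illustrative; RamaLama's real resolution logic and its
    default registry may differ.
    """
    if "://" in ref:
        transport, name = ref.split("://", 1)
        return transport, name
    # No prefix given: fall back to an assumed default registry.
    return default_transport, ref

print(parse_model_ref("huggingface://ibm-granite/granite-3b"))
print(parse_model_ref("tinyllama"))
```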

    At this point, once RamaLama has pulled the AI model, it’s showtime, baby! Time to run our inferencing runtime. RamaLama offers switchable inferencing runtimes, namely llama.cpp and vLLM, for running containerized models.

    RamaLama launches a container with the AI model volume mounted into it, starting a chatbot or a REST API service with a single, simple command. Models are treated similarly to how Podman and Docker treat container images. RamaLama works with Podman Desktop and Docker Desktop on macOS and Windows.
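Under the hood, this amounts to an ordinary container invocation with the model volume-mounted in. The sketch below assembles such a command for illustration; the image name, mount path, and serving arguments are assumptions, not RamaLama's exact invocation.

```python
def build_serve_command(model_path, image="quay.io/ramalama/ramalama", port=8080):
    """Build a podman command that mounts a model read-only and serves it.

    Illustrative only: the flags and paths are assumed for this sketch,
    not copied from RamaLama's own code.
    """
    return [
        "podman", "run", "--rm",
        "-v", f"{model_path}:/mnt/models/model.file:ro",  # mount the model in
        "-p", f"{port}:{port}",                           # expose the REST API
        image,
        "llama-server", "--model", "/mnt/models/model.file",
        "--port", str(port),
    ]

print(" ".join(build_serve_command("/var/lib/models/granite.gguf")))
```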

    Running AI workloads in containers eliminates the user's need to configure the host system for AI.

    8 reasons to use RamaLama

    RamaLama thinks differently about LLMs, connecting your use cases with the rest of the Linux and container world. You should use RamaLama if:

    1. You want a simple and easy way to test out AI models.
    2. You don’t want to mess with installing specialized software to support your specific GPU.
    3. You want to find and pull models from any catalog including Hugging Face, Ollama, and even container registries.
    4. You want to use whichever runtime works best for your model and hardware combination: llama.cpp, vLLM, whisper.cpp, etc.
    5. You value running AI models in containers for the simplicity, collaborative properties, and existing infrastructure you have (container registries, CI/CD workflows, etc.).
    6. You want an easy path to run AI models on Podman, Docker, and Kubernetes.
    7. You love the power of running models at system boot using containers with Quadlets.
    8. You believe in the power of collaborative open source to enable the fastest and most creative solutions when tackling new problems in a fast-moving space.

    Why not just use Ollama?

    Realizing that lots of people currently use Ollama, we looked into working with it. While we loved its ease of use, we did not think it fit our needs. We decided to build an alternative tool that allows developers to run and serve AI models from a simple interface, while making it easy to take those models, put them in containers, and gain all of the local, collaborative, and production benefits that containers offer.

    Differences between Ollama and RamaLama

    Table 1 compares Ollama and RamaLama capabilities.

    Table 1: Ollama versus RamaLama.

    • Running models on the host OS
      Ollama: Defaults to running AI models directly on the host system.
      RamaLama: Defaults to running AI models in containers, but can also run them directly on the host using the --nocontainer option.

    • Running models in containers
      Ollama: Not supported.
      RamaLama: The default. RamaLama wraps Podman or Docker, first downloading a container image with all of the AI tools ready to execute. It also downloads the AI model to the host, then launches the container with the AI model mounted into it and runs the serving app.

    • Support for alternative AI runtimes
      Ollama: Supports llama.cpp.
      RamaLama: Currently supports llama.cpp and vLLM.

    • Optimization and installation of AI software
      Ollama: Statically linked with llama.cpp; configuring the host system to run the AI model is left to the user.
      RamaLama: Downloads container images with all of the software, optimized for your specific GPU configuration.
      Benefit: Users get started faster with software optimized for the specific GPU they have, similar to how Flatpak pulls the whole display stack at once and reuses it everywhere. The same optimized containers are used for every model you pull.

    • AI model registry support
      Ollama: Defaults to pulling models from Ollama; some support for Hugging Face, and no support for OCI content.
      RamaLama: Supports pulling from OCI registries, Ollama, and Hugging Face.
      Benefit: Sometimes the latest model is only available in one or two places. RamaLama lets you pull it from almost anywhere. If you can find what you want, you can pull it.

    • Podman Quadlet generation
      Ollama: None.
      RamaLama: Can generate a Podman Quadlet file suitable for launching the AI model and container under systemd as a service on an edge device. The Quadlet is based on the locally running AI model, making it easy for the developer to go from experimenting to production.

    • Kubernetes YAML generation
      Ollama: None.
      RamaLama: Can generate a Kubernetes YAML file so users can easily move from a locally running AI model to running the same model in a Kubernetes cluster.
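To make the Quadlet output concrete, here is a hedged sketch of what such a systemd container unit might look like. Every field value below is illustrative, not RamaLama's actual generated output; the model path, image, and port are assumptions.

```ini
# Hypothetical Quadlet unit, e.g. /etc/containers/systemd/mymodel.container.
# Values are illustrative only, not RamaLama's exact output.
[Unit]
Description=Serve an AI model as a systemd service

[Container]
Image=quay.io/ramalama/ramalama
Volume=/var/lib/models/granite.gguf:/mnt/models/model.file:ro,Z
PublishPort=8080:8080
Exec=llama-server --model /mnt/models/model.file --port 8080

[Install]
WantedBy=multi-user.target default.target
```

Podman's Quadlet generator translates the [Container] section into a podman run invocation managed by systemd, so the model serves at boot without hand-written unit files.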

    Bottom line

    We want to iterate quickly on RamaLama and experiment with how we can help developers run and package AI workloads, with patterns like retrieval-augmented generation (RAG), Whisper support, summarization, and more.

    Install RamaLama

    You can install RamaLama via PyPI, an install script, or your distribution's package manager.

    PyPI

    RamaLama is available on PyPI:

    pipx install ramalama

    Install by script

    Install RamaLama by running one of the following one-liners.

    Linux:

    curl -fsSL https://raw.githubusercontent.com/containers/ramalama/s/install.sh | sudo bash

    macOS (run without sudo):

    curl -fsSL https://raw.githubusercontent.com/containers/ramalama/s/install.sh | bash

    Distro install

    Fedora:

    $ sudo dnf -y install ramalama

    We need your help!

    We want you to install the tool, try it out, and then give us feedback on what you think.

    Looking for a project to contribute to? RamaLama welcomes you. It is written in simple Python and wraps other tools, so the barrier to contributing is low. We would love help with documentation and, potentially, web design. This is definitely a community project where we can use varied talents.

    We are looking for help packaging RamaLama for other Linux distributions, Mac (Brew?), and Windows. We have it packaged for Fedora and plan on getting it into CentOS Stream and hopefully RHEL. But we really want to see it available everywhere you can run Podman and/or Docker.
