The frontier of AI is no longer just in the data center; it's on the factory floor, in our retail stores, in public infrastructure, and in the smart devices all around us. Driving this shift are vision language models (VLMs), a revolutionary class of AI that can interpret and reason about images and videos. The potential is immense, but any developer who has tried to deploy these models at the edge knows the harsh reality: it's a battle against dependency hell, driver incompatibilities, and inconsistent environments.
What if you could bypass that complexity entirely? What if deploying a VLM was as simple as ramalama run?
Enter RamaLama, an open source command-line interface (CLI) designed to make working with AI models "boring" by handling the hard parts for you. It leverages container engines like Podman and Docker to abstract away hardware and software dependencies, providing an isolated and repeatable environment for your models with a stronger security posture.
In this article, we'll show you how to leverage RamaLama to deploy powerful multimodal AI models on your edge devices. We will cover everything from pulling your first VLM to serving it via an API, turning a once-daunting task into a manageable, efficient workflow.
Vision language models for the edge
For years, the most powerful artificial intelligence lived in the cloud, accessible only through an internet connection. But a new frontier is rapidly emerging: the edge. Imagine a smart camera that doesn't just record video but describes what it sees in real time, an offline industrial scanner that identifies product defects, or a device that can help identify threats and hazards in public infrastructure and raise the appropriate warnings. This is the promise of vision language models: AI that can see, understand, and communicate information about the world.
Bringing this capability to local, resource-constrained edge devices has traditionally been a challenge, fraught with complex dependencies, hardware-specific configurations, and deployment nightmares. RamaLama simplifies this by embracing a container-native philosophy, making the deployment of sophisticated VLMs as straightforward and reliable as running any other container on the edge device.
The RamaLama architecture: Simplicity through containerization
At its core, RamaLama's architecture is built on a simple premise: “Treat AI models with the same robust, portable, and isolated approach that containers brought to software applications.”
Instead of requiring users to manually configure complex environments with specific GPU drivers, Python libraries, and model dependencies, RamaLama abstracts this entire process away. It acts as an intelligent orchestrator that sits between the user and a container engine like Podman or Docker. When a user issues a command like ramalama run, the tool first inspects the host system to detect available hardware. It then automatically determines whether to use a CPU or a specific type of GPU (such as an NVIDIA GPU via CUDA, an AMD GPU via ROCm, or an Intel GPU).
Based on this hardware detection, RamaLama's key architectural function comes into play: it selects and pulls a pre-configured OCI (Open Container Initiative) container image specifically optimized for that environment. These images, stored in registries like Quay.io, contain all the necessary system libraries, drivers, and runtimes (such as llama.cpp or vLLM) needed to execute the AI model efficiently. The user's chosen model, pulled from a source like Hugging Face or Ollama, is then mounted into this container. (See Figure 1.) This elegant design means the user's machine stays clean, and the model runs in a more consistent, reproducible environment, effectively eliminating the "it works on my machine" problem for AI.

By default, RamaLama runs models inside rootless containers, which provides crucial isolation from the host system. Furthermore, it applies a stringent security policy out of the box: the AI model is mounted as a read-only volume, preventing any potential modification of the host's file system. Critically, containers are run with networking disabled (--network=none), blocking any unauthorized communication and severely limiting the potential for leaks of the sensitive data processed by the model. This multi-layered, security-first approach allows developers to experiment with and serve a wide variety of models from the community with a significantly reduced risk profile.
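To make the description above concrete, here is a minimal, purely illustrative Python sketch of the kind of podman invocation RamaLama assembles. This is not RamaLama's actual code; the model path is a placeholder, but the image tag matches the one you will see later in this article, and the --network=none flag and read-only (:ro) model mount correspond to the behavior described in this section.

import shlex

# Illustrative only: a simplified approximation of the kind of container
# command RamaLama assembles. The model path below is a placeholder.
image = "quay.io/ramalama/ramalama:latest"            # runtime image matched to the detected hardware
model_path = "/path/to/Qwen2.5-VL-3B-Instruct.gguf"   # locally pulled model file (placeholder path)

cmd = [
    "podman", "run", "--rm",
    "--network=none",                                 # no network access from inside the container
    "-v", f"{model_path}:/mnt/models/model.gguf:ro",  # model mounted read-only into the container
    image,
    # ... inference runtime arguments (for example, llama.cpp options) would follow here
]
print(shlex.join(cmd))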
Deploying Qwen vision language models using RamaLama
We will now walk through deploying the Qwen2.5VL-3B VLM on a local workstation using RamaLama. The major steps are as follows:
- Install RamaLama on your platform.
- Pull the qwen2.5vl:3b model from Hugging Face.
- Serve the VLM on the local system to expose an OpenAI-compliant REST API endpoint for applications.
- Test the VLM inference using images and videos.
Prerequisites
Before you begin, ensure you have the following prerequisites in place:
- Install Podman (recommended) or Docker.
- Create a Hugging Face account to download private LLMs that require a token. Set your token as the value of the HF_TOKEN environment variable on your system (for example, by setting it in ~/.bashrc or ~/.zshrc).
- Install RamaLama.
- Run ramalama info to verify that the ramalama CLI is correctly installed. You should see various details about your environment and no errors. (An optional scripted version of this check is sketched after this list.)
- Clone the code and test files from the GitHub repository at https://github.com/rsriniva/ramalama-edge to your local system.
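If you prefer to script these checks, the following minimal Python sketch (not part of the repository) verifies that HF_TOKEN is set and that the ramalama CLI is on your PATH:

import os
import shutil

# Optional sanity check for the prerequisites above.
if not os.environ.get("HF_TOKEN"):
    raise SystemExit("HF_TOKEN is not set; export it in ~/.bashrc or ~/.zshrc first.")
if shutil.which("ramalama") is None:
    raise SystemExit("The ramalama CLI was not found on PATH; install RamaLama first.")
print("Prerequisites look good.")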
Pull the Qwen2.5 VLM from Hugging Face
Download the qwen2.5vl:3b model from Hugging Face as follows.
ramalama pull qwen2.5vl:3b
Verify that the model has been downloaded and is stored locally:
ramalama ls
NAME                                       MODIFIED      SIZE
hf://ggml-org/Qwen2.5-VL-3B-Instruct-GGUF  23 hours ago  2.58 GB
Serve the VLM on the local system
To expose the VLM as a callable REST API endpoint (on port 8081), use the ramalama serve command.
ramalama serve --port 8081 qwen2.5vl:3b
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Virtio-GPU Venus (Apple M1 Pro) (venus)
...
main: HTTP server is listening, hostname: 0.0.0.0, port: 8081, http threads: 4
main: loading model
srv load_model: loading model '/mnt/models/Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf'
...
main: server is listening on http://0.0.0.0:8081 - starting the main loop
srv update_slots: all slots are idle
Note: You can start multiple instances of this VLM or other LLMs and serve each on a different port, as long as your system has the CPU, memory, and GPU capacity. Each instance runs in its own container.
List the models that are running on your system:
ramalama ps
CONTAINER ID   IMAGE                             PORTS
0685b116a10c   quay.io/ramalama/ramalama:latest  0.0.0.0:8081->8081/tcp
Test the VLM using the built-in web UI
RamaLama comes with a built-in, simple chat web UI that you can use to interact with the model:
- Navigate to http://localhost:8081 with a web browser.
- Click the attachment icon at the bottom of the web UI next to the chat box and upload the jazz-quintet.jpg file from the folder where you cloned the test code and files from GitHub.
- You can now ask questions to the VLM about the image. Try entering How many people are in the image? to prompt a response from the VLM (Figure 2).

- Ask a follow-up question like What is the person in the center of the image holding? and inspect the response from the VLM. Because we are using a small quantized model, the answer might not be fully accurate, but it should be close to what you expect. Larger, non-quantized models will produce more accurate results.
Communicate with the VLM from applications
In real-world scenarios, you will invoke the VLM using the RamaLama-provided REST API (which is OpenAI-compliant) from applications that reside on or near the edge device.
Inspect the Python script image_analyze.py in the Git repository. This script uses the OpenAI client and sends the image to the VLM for analysis. A real-world use case is a camera or image capture device that sends images to the VLM for analysis.
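To get a feel for what such a call looks like, here is a minimal sketch using the OpenAI Python client against the locally served endpoint. It approximates what image_analyze.py does; the model name, variable values, and exact request structure shown here are assumptions for illustration, not the repository's exact code.

import base64
from openai import OpenAI

# Point the OpenAI client at the local RamaLama endpoint; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:8081/v1", api_key="none")

IMAGE_FILE = "jazz-quintet.jpg"                  # image from the cloned repository
PROMPT = "How many people are in this image?"

# Encode the image as a base64 data URL so it can be embedded in the request.
with open(IMAGE_FILE, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2.5vl:3b",  # name passed to the endpoint; the server runs the model you served
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)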
To run the script, you first need to create a Python virtual environment in the same folder and activate it:
python -m venv venv
source venv/bin/activate
(venv)
Next, install the dependencies required for the script to run. A requirements.txt file listing all the dependencies is provided for you.
(venv) pip install -r requirements.txt
Inspect the script and change the IMAGE_FILE and PROMPT variables as needed. You can now run the image_analyze.py script:
(venv) python image_analyze.py
Querying local model served by RamaLama...
Prompt: How many people are in this image?
Response from model:
There are five people in the image.
You can provide your own images and change the PROMPT variable to the query you want to send to the VLM.
You can pull and serve other VLMs, and perform your own tests before deploying your application to production. RamaLama makes it easy to experiment with different models with minimal code changes, and once testing is complete, you can deploy the models to a Kubernetes or OpenShift cluster using RamaLama-generated YAML artifacts.
Test with videos
You can also analyze video streams using the same setup. Inspect the video_analyze.py script; a sample video file in MP4 format is provided in the same folder. To process the video file, we take image snapshots of the video at 7-second intervals using the OpenCV library and send the resulting sequence of images to the VLM for analysis.
Note: The snapshot interval is customizable and depends on the length of the video and the hardware capacity backing the VLM. Because we are testing this on our local systems, we use a modest 30-second video and take a snapshot every 7 seconds.
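The frame extraction step might look roughly like the following sketch. It approximates the approach video_analyze.py takes, assuming OpenCV is among the dependencies in requirements.txt; the function and variable names are illustrative, not the repository's exact code.

import base64
import cv2  # OpenCV, used to read frames from the video file

def extract_frames(video_path, interval_seconds=7):
    """Grab one frame every interval_seconds and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30   # fall back if FPS metadata is missing
    frame_step = int(fps * interval_seconds)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of the video
        if index % frame_step == 0:
            ok_enc, jpeg = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(base64.b64encode(jpeg.tobytes()).decode("utf-8"))
        index += 1
    cap.release()
    return frames

frames = extract_frames("classroom.mp4")
print(f"Successfully extracted {len(frames)} frames.")
# Each base64 frame can then be sent as an image_url data URL, as in the image example above.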
Run the video_analyze.py script:
(venv) python video_analyze.py
Extracting frames from 'classroom.mp4' every 7 seconds...
Successfully extracted 5 frames.
Sending frames to the VLM for analysis...
====================
Video Analysis Summary
====================
In the video, there are four individuals seated at desks in a classroom or meeting room. The person on the far right, who appears to be the last in the sequence, is standing up and leaning towards the table, then turning and walking away. The other three individuals remain seated, with their body language indicating they are engaged in the conversation or activity. The person on the far right also interacts with the individual next to him, possibly discussing something or gesturing with their hands.
You can debug the script by checking the output in the terminal where you started the VLM. You should see the communication between the client and the VLM, along with some basic statistics about the inference, printed to the console.
srv log_server_r: request: POST /v1/chat/completions 192.168.127.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 49 | processing task
...
srv process_chun: processing image...
srv process_chun: image processed in 9983 ms
slot update_slots: id 0 | task 49 | prompt processing progress, n_past = 51, n_tokens = 3, progress = 0.727273
slot update_slots: id 0 | task 49 | kv cache rm [51, end)
srv process_chun: processing image...
srv process_chun: image processed in 11417 ms
slot update_slots: id 0 | task 49 | prompt processing progress, n_past = 55, n_tokens = 3, progress = 0.787879
slot update_slots: id 0 | task 49 | kv cache rm [55, end)
srv process_chun: processing image...
srv process_chun: image processed in 12865 ms
slot update_slots: id 0 | task 49 | prompt processing progress, n_past = 59, n_tokens = 3, progress = 0.848485
slot update_slots: id 0 | task 49 | kv cache rm [59, end)
srv process_chun: processing image...
srv process_chun: image processed in 14268 ms
slot update_slots: id 0 | task 49 | prompt processing progress, n_past = 66, n_tokens = 6, progress = 0.954545
srv log_server_r: request: POST /v1/chat/completions 192.168.127.1 500
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 208 | processing task
slot update_slots: id 0 | task 208 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 66
slot update_slots: id 0 | task 208 | need to evaluate at least 1 token for each active slot, n_past = 66, n_prompt_tokens = 66
slot update_slots: id 0 | task 208 | kv cache rm [65, end)
slot update_slots: id 0 | task 208 | prompt processing progress, n_past = 66, n_tokens = 1, progress = 0.015152
slot update_slots: id 0 | task 208 | prompt done, n_past = 66, n_tokens = 1
slot release: id 0 | task 208 | stop processing: n_past = 162, truncated = 0
slot print_timing: id 0 | task 208 |
prompt eval time = 72.13 ms / 1 tokens ( 72.13 ms per token, 13.86 tokens per second)
eval time = 5809.82 ms / 97 tokens ( 59.90 ms per token, 16.70 tokens per second)
total time = 5881.95 ms / 98 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.127.1 200
Further reading
- RamaLama website
- How RamaLama runs AI models in isolation by default
- How to run OpenAI's gpt-oss models locally with RamaLama
- Unleashing multimodal magic with RamaLama
- How RamaLama makes working with AI models boring
- Supercharging AI isolation: microVMs with RamaLama & libkrun
- Simplify AI data integration with RamaLama and RAG
- Simplifying AI with RamaLama and llama-run
- Podman AI Lab and RamaLama unite for easier local AI
- How to run AI models in cloud development environments
Acknowledgements
This article was created with assistance from Daniel Walsh and Michael Engel.