Multimodal AI at the edge: Deploy vision language models with RamaLama

October 27, 2025
Ravishankar Srinivasan
Related topics: Artificial intelligence
Related products: Red Hat AI

    The frontier of AI is no longer just in the data center; it's on the factory floor, in our retail stores, in public infrastructure, and in the smart devices all around us. Driving this shift are vision language models (VLMs), a revolutionary class of AI that can interpret and reason about images and videos. The potential is immense, but any developer who has tried to deploy these models at the edge knows the harsh reality: it's a battle against dependency hell, driver incompatibilities, and inconsistent environments.

    What if you could bypass that complexity entirely? What if deploying a VLM were as simple as ramalama run?

    Enter RamaLama, an open source command-line interface (CLI) designed to make working with AI models "boring" by handling the hard parts for you. It leverages container engines like Podman and Docker to abstract away hardware and software dependencies, providing an isolated and repeatable environment for your models with a stronger security posture.

    In this article, we'll show you how to use RamaLama to deploy powerful multimodal AI models on your edge devices. We will cover everything from pulling your first VLM to serving it via an API, turning a once-daunting task into a manageable, efficient workflow.

    Vision language models for the edge

    For years, the most powerful artificial intelligence lived in the cloud, accessible only through an internet connection. But a new frontier is rapidly emerging: the edge. Imagine a smart camera that doesn't just record video but describes what it sees in real time, an offline industrial scanner that identifies product defects, or a device that can help identify threats and hazards in public infrastructure and raise the appropriate warnings. This is the promise of vision language models: AI that can see, understand, and communicate information about the world.

    Bringing this capability to local, resource-constrained edge devices has traditionally been a challenge, fraught with complex dependencies, hardware-specific configurations, and deployment nightmares. RamaLama simplifies this by embracing a container-native philosophy, making the deployment of sophisticated VLMs as straightforward and reliable as running a container on the edge device.

    The RamaLama architecture: Simplicity through containerization

    At its core, RamaLama's architecture is built on a simple premise: “Treat AI models with the same robust, portable, and isolated approach that containers brought to software applications.”

    Instead of requiring users to manually configure complex environments with specific GPU drivers, Python libraries, and model dependencies, RamaLama abstracts this entire process away. It acts as an intelligent orchestrator that sits between the user and a container engine like Podman or Docker. When a user issues a command like ramalama run, the tool first inspects the host system to detect available hardware. It then automatically determines whether to use a CPU or a specific type of GPU (like NVIDIA via CUDA, AMD via ROCm, or an Intel GPU).

    Based on this hardware detection, RamaLama's key architectural function comes into play: it selects and pulls a pre-configured OCI (Open Container Initiative) container image specifically optimized for that environment. These images, stored in registries like Quay.io, contain all the necessary system libraries, drivers, and runtimes (such as llama.cpp or vLLM) needed to execute the AI model efficiently. The user's chosen model, pulled from a source like Hugging Face or Ollama, is then mounted into this container. (See Figure 1.) This elegant design means the user's machine stays clean, and the model runs in a more consistent, reproducible environment, effectively eliminating the "it works on my machine" problem for AI.

    Architecture diagram illustrating the RamaLama tool's use of a container engine and OCI images to pull, run, and serve an AI model from various registries like Hugging Face.
    Figure 1: RamaLama architecture.
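
    In day-to-day use, that entire chain is hidden behind a single command. For example, the following (using the model deployed later in this article) triggers hardware detection, image selection, model pull, and an interactive chat session:

    ramalama run qwen2.5vl:3b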

    By default, RamaLama runs models inside rootless containers, which provides crucial isolation from the host system. Furthermore, it applies a stringent security policy out of the box: the AI model is mounted as a read-only volume, preventing any potential modification of the host's file system. Critically, containers are run with networking disabled (--network=none), blocking any unauthorized communication and severely limiting the potential for leaks of the sensitive data processed by the model. This multi-layered, security-first approach allows developers to experiment with and serve a wide variety of models from the community with a significantly reduced risk profile.
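
    To get a feel for what those defaults amount to at the container-engine level, here is a rough, illustrative approximation. It is not the exact command RamaLama generates; the image name, mount path, and server invocation (shown here as the llama.cpp server) are assumptions that vary by hardware and model:

    # Run as a regular (non-root) user for a rootless container.
    # --network=none  -> no network access from inside the container
    # :ro             -> the model file is mounted read-only
    podman run --rm --network=none \
        -v ~/models/Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf:/mnt/models/model.gguf:ro \
        quay.io/ramalama/ramalama:latest \
        llama-server --model /mnt/models/model.gguf --port 8081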

    Deploying Qwen vision language models using RamaLama

    We will now walk through deploying the Qwen2.5-VL-3B VLM on a local workstation using RamaLama. The major steps are as follows:

    1. Install RamaLama on your platform.
    2. Pull the qwen2.5vl:3b model from Hugging Face.
    3. Serve the VLM on the local system to expose an OpenAI-compliant REST API endpoint for applications.
    4. Test the VLM inference using images and videos.

    Prerequisites

    Before you begin, ensure you have the following prerequisites in place:

    • Install Podman (recommended) or Docker.
    • Create a Hugging Face account to download gated models that require a token. Set your token as the value of the HF_TOKEN environment variable on your system (for example, by exporting it in ~/.bashrc or ~/.zshrc).
    • Install RamaLama.
    • Run ramalama info to verify that the ramalama CLI is correctly installed. You should see various details about your environment and no errors.
    • Clone the code and test files from the GitHub repository at https://github.com/rsriniva/ramalama-edge to your local system. A sample shell setup covering these prerequisites follows this list.
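
    On a typical Linux or macOS workstation, satisfying these prerequisites might look like the following. The install command shown is one common option; check the RamaLama documentation for the method recommended for your platform:

    # Install the RamaLama CLI (one option; see the RamaLama docs for alternatives)
    pip install ramalama

    # Make your Hugging Face token available (add the export to ~/.bashrc or ~/.zshrc to persist it)
    export HF_TOKEN=<your-hugging-face-token>

    # Verify the CLI is installed and can inspect your environment
    ramalama info

    # Clone the test code and sample files used in this article
    git clone https://github.com/rsriniva/ramalama-edge
    cd ramalama-edge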

    Pull the Qwen2.5 VLM from Hugging Face

    Download the qwen2.5vl:3b model from Hugging Face as follows.

    ramalama pull qwen2.5vl:3b

    Verify the model has been downloaded and is stored locally:

    ramalama ls
    
    NAME                                                  MODIFIED       SIZE
    hf://ggml-org/Qwen2.5-VL-3B-Instruct-GGUF             23 hours ago   2.58 GB

    Serve the VLM on the local system

    To expose the VLM as a callable REST API endpoint (on port 8081), use the ramalama serve command.

    ramalama serve --port 8081 qwen2.5vl:3b
    
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Virtio-GPU Venus (Apple M1 Pro) (venus)
    ...
    main: HTTP server is listening, hostname: 0.0.0.0, port: 8081, http threads: 4
    main: loading model
    srv    load_model: loading model '/mnt/models/Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf'
    ...
    main: server is listening on http://0.0.0.0:8081 - starting the main loop
    srv  update_slots: all slots are idle

    Note: You can start additional instances of this VLM or other LLMs and serve each on a different port, as long as your system has sufficient CPU, memory, and GPU capacity. Each instance runs in its own container.

    List the models that are running on your system:

    ramalama ps
    
    CONTAINER ID  IMAGE                             PORTS
    0685b116a10c  quay.io/ramalama/ramalama:latest  0.0.0.0:8081->8081/tcp
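
    As a quick sanity check from another terminal, you can exercise the OpenAI-compliant endpoint directly with curl. This is an illustrative request against the standard chat completions route; how strictly the model field is interpreted depends on the serving runtime:

    curl -s http://localhost:8081/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "qwen2.5vl:3b",
              "messages": [
                {"role": "user", "content": "Reply with the single word: ready"}
              ]
            }'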

    Test the VLM using the built-in web UI

    RamaLama comes with a simple, built-in chat web UI that you can use to interact with the model:

    1. Navigate to http://localhost:8081 with a web browser.
    2. Click the attachment icon at the bottom of the web UI next to the chat box and upload the jazz-quintet.jpg file from the folder where you cloned the test code and files from GitHub.
    3. You can now ask questions to the VLM about the image. Try entering How many people are in the image? to prompt a response from the VLM (Figure 2).
    Screenshot of the RamaLama web UI showing the user input and response.
    Figure 2: RamaLama web UI chat.
    4. Ask a follow-up question like What is the person in the center of the image holding? and inspect the response from the VLM. Because we are using a smaller quantized model, the response might not be fully accurate, but it should be close to the answer you expect. Larger, non-quantized models produce more accurate responses.

    Communicate with the VLM from applications

    In real-world scenarios, you will invoke the VLM using the RamaLama-provided REST API (which is OpenAI-compliant) from applications that reside on or near the edge device. 

    Inspect the Python script image_analyze.py in the Git repository. It uses the OpenAI client to send an image to the VLM for analysis. A real-world analogue is a camera or image capture device that forwards the images it captures to the VLM.

    To run the script, you first need to create a Python virtual environment in the same folder and activate it:

    python -m venv venv
    source venv/bin/activate
    (venv)

    Next, install the dependencies required for the script to run. A requirements.txt file listing all the dependencies is provided for you.

    (venv) pip install -r requirements.txt

    Inspect the script and change the IMAGE_FILE and PROMPT variables as needed. You can now run the image_analyze.py script.

    (venv) python image_analyze.py
    
    Querying local model served by RamaLama...
    Prompt: How many people are in this image?
    
    Response from model:
    There are five people in the image.

    You can provide your own images and set the PROMPT variable to the query you want to send to the VLM.
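
    To give a sense of what such a script does, here is a minimal sketch of an equivalent call using the OpenAI Python client with a base64-encoded image. The variable names mirror the ones mentioned above, but the actual image_analyze.py in the repository may be structured differently:

    import base64
    from openai import OpenAI

    # Point the OpenAI client at the local RamaLama-served endpoint.
    # The local server does not validate the API key, but the client requires a value.
    client = OpenAI(base_url="http://localhost:8081/v1", api_key="none")

    IMAGE_FILE = "jazz-quintet.jpg"                # sample image from the cloned repository
    PROMPT = "How many people are in this image?"

    # Encode the image as a base64 data URL, the format expected for image_url content parts.
    with open(IMAGE_FILE, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="qwen2.5vl:3b",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )

    print(response.choices[0].message.content)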

    You can pull and serve other VLMs, and perform your own tests before deploying your application to production. RamaLama makes it easy to experiment with different models with minimal code changes, and once testing is complete, you can deploy the models to a Kubernetes or OpenShift cluster using RamaLama-generated YAML artifacts.
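
    RamaLama can generate those artifacts for you. At the time of writing, the serve command accepts a --generate option for emitting deployment files (run ramalama serve --help to confirm the values supported by your version); something like the following writes Kubernetes YAML that you can adapt for your cluster:

    # Generate Kubernetes YAML for serving the model (supported values may vary by version)
    ramalama serve --generate kube qwen2.5vl:3b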

    Test with videos

    You can also analyze video streams using the same setup. Inspect the video_analyze.py script, which sits alongside a sample MP4 video file in the same folder. To process the video file, the script takes image snapshots at 7-second intervals using the OpenCV library and sends the resulting sequence of images to the VLM for analysis.

    Note: The snapshot interval is customizable and depends on the length of the video and the hardware capacity backing the VLM. Because we are testing on local systems, we use a modest 30-second video and take a snapshot every 7 seconds, which yields roughly 5 frames.
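
    Conceptually, the frame-extraction step looks something like the following OpenCV sketch; the actual video_analyze.py in the repository may differ in detail. The extracted frames are then base64-encoded and attached to a chat request in the same way as the single-image example above:

    import cv2

    def extract_frames(video_path: str, interval_seconds: int = 7) -> list[bytes]:
        """Grab one JPEG-encoded frame every interval_seconds from the video."""
        capture = cv2.VideoCapture(video_path)
        fps = capture.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS metadata is missing
        frame_step = int(fps * interval_seconds)

        frames = []
        index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break                                 # end of video
            if index % frame_step == 0:
                encoded, jpeg = cv2.imencode(".jpg", frame)
                if encoded:
                    frames.append(jpeg.tobytes())
            index += 1

        capture.release()
        return frames

    frames = extract_frames("classroom.mp4", interval_seconds=7)
    print(f"Extracted {len(frames)} frames")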

    Run the video_analyze.py script:

    (venv) python video_analyze.py
    
    Extracting frames from 'classroom.mp4' every 7 seconds...
    Successfully extracted 5 frames.
    Sending frames to the VLM for analysis...
    
    ====================
    Video Analysis Summary
    ====================
    In the video, there are four individuals seated at desks in a classroom or meeting room. The person on the far right, who appears to be the last in the sequence, is standing up and leaning towards the table, then turning and walking away. The other three individuals remain seated, with their body language indicating they are engaged in the conversation or activity. The person on the far right also interacts with the individual next to him, possibly discussing something or gesturing with their hands.

    You can debug the script by checking the output in the terminal where you started the VLM. You should see the requests arriving from the client, along with some basic statistics about the inference, printed to the console:

    srv  log_server_r: request: POST /v1/chat/completions 192.168.127.1 200
    srv  params_from_: Chat format: Content-only
    slot launch_slot_: id  0 | task 49 | processing task
    ...
    srv  process_chun: processing image...
    srv  process_chun: image processed in 9983 ms
    slot update_slots: id  0 | task 49 | prompt processing progress, n_past = 51, n_tokens = 3, progress = 0.727273
    slot update_slots: id  0 | task 49 | kv cache rm [51, end)
    srv  process_chun: processing image...
    srv  process_chun: image processed in 11417 ms
    slot update_slots: id  0 | task 49 | prompt processing progress, n_past = 55, n_tokens = 3, progress = 0.787879
    slot update_slots: id  0 | task 49 | kv cache rm [55, end)
    srv  process_chun: processing image...
    srv  process_chun: image processed in 12865 ms
    slot update_slots: id  0 | task 49 | prompt processing progress, n_past = 59, n_tokens = 3, progress = 0.848485
    slot update_slots: id  0 | task 49 | kv cache rm [59, end)
    srv  process_chun: processing image...
    srv  process_chun: image processed in 14268 ms
    slot update_slots: id  0 | task 49 | prompt processing progress, n_past = 66, n_tokens = 6, progress = 0.954545
    srv  log_server_r: request: POST /v1/chat/completions 192.168.127.1 500
    srv  update_slots: all slots are idle
    srv  params_from_: Chat format: Content-only
    slot launch_slot_: id  0 | task 208 | processing task
    slot update_slots: id  0 | task 208 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 66
    slot update_slots: id  0 | task 208 | need to evaluate at least 1 token for each active slot, n_past = 66, n_prompt_tokens = 66
    slot update_slots: id  0 | task 208 | kv cache rm [65, end)
    slot update_slots: id  0 | task 208 | prompt processing progress, n_past = 66, n_tokens = 1, progress = 0.015152
    slot update_slots: id  0 | task 208 | prompt done, n_past = 66, n_tokens = 1
    slot      release: id  0 | task 208 | stop processing: n_past = 162, truncated = 0
    slot print_timing: id  0 | task 208 |
    prompt eval time =      72.13 ms /     1 tokens (   72.13 ms per token,    13.86 tokens per second)
           eval time =    5809.82 ms /    97 tokens (   59.90 ms per token,    16.70 tokens per second)
          total time =    5881.95 ms /    98 tokens
    srv  update_slots: all slots are idle
    srv  log_server_r: request: POST /v1/chat/completions 192.168.127.1 200

    Further reading

    • RamaLama website
    • How RamaLama runs AI models in isolation by default
    • How to run OpenAI's gpt-oss models locally with RamaLama
    • Unleashing multimodal magic with RamaLama
    • How RamaLama makes working with AI models boring
    • Supercharging AI isolation: microVMs with RamaLama & libkrun
    • Simplify AI data integration with RamaLama and RAG
    • Simplifying AI with RamaLama and llama-run
    • Podman AI Lab and RamaLama unite for easier local AI
    • How to run AI models in cloud development environments 

    Acknowledgements

    This article was created with assistance from Daniel Walsh and Michael Engel.
