In our previous blog post, we introduced the RamaLama project, a bold initiative to make AI development and deployment delightfully boring by leveraging the power of OCI containers. Since then, the RamaLama community has been busy, contributing tools and workflows to simplify AI integration. One of the contributions we’d like to spotlight today is llama-run, a new addition to the llama.cpp ecosystem.
In this post, we’ll explore how llama-run aligns with RamaLama’s vision of effortless AI workflows, enabling developers to work with large language models (LLMs) in a straightforward, intuitive way.
What Is llama-run?
The llama-run tool was designed to be a minimal and versatile interface for running LLMs with llama.cpp. While llama.cpp has been widely recognized for its efficient and resource-friendly approach to running LLMs, configuring and invoking these models could sometimes feel daunting—until now.
With llama-run, we’ve abstracted away much of the complexity, focusing on simplicity and usability. Developers can now load and run models with a single command or fine-tune their configuration through lightweight options. llama-run will power the “ramalama run” command.
Here’s how it works:
A Simple Example
llama-run granite-code
The above command launches the granite-code model, pulling it automatically from its source (Ollama, in this case).
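Since llama-run will power the “ramalama run” command, the same model should be just as easy to launch through RamaLama itself; a minimal sketch, assuming the model name resolves the same way:

ramalama run granite-code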
Need more control? Check out these additional options:
llama-run --context-size 4096 --verbose llama3
This flexibility makes llama-run an excellent tool for developers who want to focus on outcomes rather than infrastructure.
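The options also compose; here is a quick sketch combining the flags covered later in this post (the prompt text is just an illustration):

llama-run --context-size 4096 --ngl 999 --verbose llama3 "Summarize what OCI containers are"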
Why llama-run Matters for RamaLama
The RamaLama project is all about reducing friction in AI workflows. By using OCI containers as the foundation for deploying LLMs, we aim to eliminate the headaches of dependency management, environment setup, and operational inconsistencies. llama-run integrates seamlessly into this philosophy by providing:
- Protocol-Agnostic Model Retrieval: With support for multiple protocols (e.g., hf://, huggingface://, ollama://, https://, or file://), llama-run enables easy integration with your preferred model repositories or local files.
- Declarative Simplicity: Whether you’re pulling a model from Hugging Face, Ollama, or a custom endpoint, llama-run simplifies the process:
llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
- Interoperability with Containers: Combined with OCI containerized environments from RamaLama, llama-run becomes a lightweight runtime for deploying and experimenting with LLMs at scale (a sketch follows this list).
- Debugging Made Easy: With options like --verbose, llama-run provides detailed logs for debugging, making it easier to iterate on configurations or troubleshoot issues.
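To make the container angle concrete, here is a rough, hypothetical sketch of running llama-run inside a containerized environment with Podman; the image name, mount path, and model file are placeholders, and in practice the “ramalama run” command handles this wiring for you:

# Hypothetical manual invocation (image name and paths are placeholders)
podman run --rm -it -v $(pwd)/models:/models:Z quay.io/ramalama/ramalama \
  llama-run /models/granite-code.gguf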
Key Features of llama-run
Here’s a closer look at the features that make llama-run a must-have for AI developers:
Flexible Model Support
Run models from a variety of sources with minimal effort. llama-run intelligently handles model retrieval, ensuring you always have access to the resources you need:
- hf:// (or huggingface://)
- ollama://
- file://
- https://
If no protocol is specified, llama-run defaults to local files or assumes an Ollama-hosted model. This flexibility ensures compatibility with a wide array of use cases.
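As a quick sketch of how those defaults play out (the local file name is only an illustration):

# Explicit protocol: pull the model from Hugging Face
llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf

# No protocol, existing path: treated as a local GGUF file
llama-run ./SmolLM-135M.Q2_K.gguf

# No protocol, bare name: assumed to be an Ollama-hosted model
llama-run granite-code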
Configurable Context Sizes
Control the context size to suit your specific requirements:
llama-run --context-size 4096 some-model
GPU Optimization
Leverage GPU acceleration with the --ngl option, giving you the ability to specify the number of GPU layers used:
llama-run --ngl 999 some-model
Debugging Options
Use --verbose to log detailed information during execution, which is invaluable for debugging and profiling:
llama-run --verbose model-name
Example Use Cases
Running a Hugging Face Model
llama-run hf://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
Running an Ollama Model
llama-run granite-code
Testing Locally Stored Models
llama-run some-local-file.gguf "Hello World"
Stdin Support
git diff | llama-run granite-code "Write a git commit message for this change"
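The same stdin pattern drops neatly into a small shell helper; a minimal sketch (the function name and the use of --staged are our own choices, not part of llama-run):

# Draft a commit message from the currently staged changes
aicommit() {
  git diff --staged | llama-run granite-code "Write a git commit message for this change"
}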
GPU Acceleration for Performance
llama-run --ngl 999 my-model.gguf
Debugging
llama-run --verbose dev-model.gguf
llama-run in the RamaLama Ecosystem
By combining the flexibility of llama-run with the containerized, predictable environments offered by RamaLama, we’re making AI workflows simpler and more productive. Developers can:
- Quickly prototype AI applications.
- Standardize deployments across environments.
- Collaborate easily, sharing configurations and results without worrying about underlying infrastructure.
This approach empowers teams to focus on creating meaningful applications without being bogged down by the intricacies of model management or deployment.
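For teams that want to pin a model and its runtime together as a single artifact, the OCI foundation suggests one more pattern: baking the model into an image. This is a hypothetical sketch only; the base image, model file, and paths are placeholders, and llama-run being available on the image’s PATH is an assumption:

# Hypothetical Containerfile (base image and paths are placeholders)
FROM quay.io/ramalama/ramalama:latest
COPY my-model.gguf /models/my-model.gguf
ENTRYPOINT ["llama-run", "/models/my-model.gguf"]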
What’s Next?
At Red Hat and in the RamaLama community, we’re continually looking for ways to make AI boring—in the best possible way. Join us by experimenting with RamaLama, contributing to the RamaLama project, or sharing your feedback and use cases. Together, we can make the extraordinary seem routine.
Let’s keep innovating—one boring AI tool at a time. 🚀