In our previous blog post, we introduced the RamaLama project, a bold initiative to make AI development and deployment delightfully boring by leveraging the power of OCI containers. Since then, the RamaLama community has been busy, contributing tools and workflows to simplify AI integration. One of the contributions we’d like to spotlight today is llama-run, a new addition to the llama.cpp ecosystem.
In this post, we’ll explore how llama-run aligns with RamaLama’s vision of effortless AI workflows, enabling developers to work with large language models (LLMs) in a straightforward, intuitive way.
What Is llama-run?
The llama-run tool was designed to be a minimal and versatile interface for running LLMs with llama.cpp. While llama.cpp has been widely recognized for its efficient and resource-friendly approach to running LLMs, configuring and invoking these models could sometimes feel daunting—until now.
With llama-run, we’ve abstracted away much of the complexity, focusing on simplicity and usability. Developers can now load and run models with a single command or fine-tune their configuration through lightweight options. llama-run will power the “ramalama run” command.
Here’s how it works:
A Simple Example
llama-run granite-code
The above command launches the granite-code model, pulling it automatically from its source (Ollama, in this case).
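Since llama-run will power the “ramalama run” command, the same model should be just as easy to launch through RamaLama itself; a minimal sketch, assuming the model name resolves the same way:

ramalama run granite-code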
Need more control? Check out these additional options:
llama-run --context-size 4096 --verbose llama3
This flexibility makes llama-run an excellent tool for developers who want to focus on outcomes rather than infrastructure.
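The options also compose; here is a quick sketch combining the flags covered later in this post (the prompt text is just an illustration):

llama-run --context-size 4096 --ngl 999 --verbose llama3 "Summarize what OCI containers are"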
Why llama-run Matters for RamaLama
The RamaLama project is all about reducing friction in AI workflows. By using OCI containers as the foundation for deploying LLMs, we aim to eliminate the headaches of dependency management, environment setup, and operational inconsistencies. llama-run integrates seamlessly into this philosophy by providing:
- Protocol-Agnostic Model Retrieval: With support for multiple protocols (e.g., hf://, huggingface://, ollama://, https://, or file://), llama-run enables easy integration with your preferred model repositories or local files.
- Declarative Simplicity: Whether you’re pulling a model from Hugging Face, Ollama, or a custom endpoint, llama-run simplifies the process:
llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
- Interoperability with Containers: Combined with OCI containerized environments from RamaLama, llama-run becomes a lightweight runtime for deploying and experimenting with LLMs at scale (a sketch follows this list).
- Debugging Made Easy: With options like --verbose, llama-run provides detailed logs for debugging, making it easier to iterate on configurations or troubleshoot issues.
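To make the container angle concrete, here is a rough, hypothetical sketch of running llama-run inside a containerized environment with Podman; the image name, mount path, and model file are placeholders, and in practice the “ramalama run” command handles this wiring for you:

# Hypothetical manual invocation (image name and paths are placeholders)
podman run --rm -it -v $(pwd)/models:/models:Z quay.io/ramalama/ramalama \
  llama-run /models/granite-code.gguf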
Key Features of llama-run
Here’s a closer look at the features that make llama-run a must-have for AI developers:
Flexible Model Support
Run models from a variety of sources with minimal effort. llama-run intelligently handles model retrieval, ensuring you always have access to the resources you need:
- hf:// (or huggingface://)
- ollama://
- file://
- https://
If no protocol is specified, llama-run defaults to local files or assumes an Ollama-hosted model. This flexibility ensures compatibility with a wide array of use cases.
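As a quick sketch of how those defaults play out (the local file name is only an illustration):

# Explicit protocol: pull the model from Hugging Face
llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf

# No protocol, existing path: treated as a local GGUF file
llama-run ./SmolLM-135M.Q2_K.gguf

# No protocol, bare name: assumed to be an Ollama-hosted model
llama-run granite-code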
Configurable Context Sizes
Control the context size to suit your specific requirements:
llama-run --context-size 4096 some-model
GPU Optimization
Leverage GPU acceleration with the --ngl option, giving you the ability to specify the number of GPU layers used:
llama-run --ngl 999 some-model
Debugging Options
Use --verbose to log detailed information during execution, which is invaluable for debugging and profiling:
llama-run --verbose model-name
Example Use Cases
Running a Hugging Face Model
llama-run hf://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
Running an Ollama Model
llama-run granite-code
Testing Locally Stored Models
llama-run some-local-file.gguf "Hello World"
Stdin Support
git diff | llama-run granite-code "Write a git commit message for this change"
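The same stdin pattern drops neatly into a small shell helper; a minimal sketch (the function name and the use of --staged are our own choices, not part of llama-run):

# Draft a commit message from the currently staged changes
aicommit() {
  git diff --staged | llama-run granite-code "Write a git commit message for this change"
}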
GPU Acceleration for Performance
llama-run --ngl 999 my-model.gguf
Debugging
llama-run --verbose dev-model.gguf
llama-run in the RamaLama Ecosystem
By combining the flexibility of llama-run with the containerized, predictable environments offered by RamaLama, we’re making AI workflows simpler and more productive. Developers can:
- Quickly prototype AI applications.
- Standardize deployments across environments.
- Collaborate easily, sharing configurations and results without worrying about underlying infrastructure.
This approach empowers teams to focus on creating meaningful applications without being bogged down by the intricacies of model management or deployment.
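For teams that want to pin a model and its runtime together as a single artifact, the OCI foundation suggests one more pattern: baking the model into an image. This is a hypothetical sketch only; the base image, model file, and paths are placeholders, and llama-run being available on the image’s PATH is an assumption:

# Hypothetical Containerfile (base image and paths are placeholders)
FROM quay.io/ramalama/ramalama:latest
COPY my-model.gguf /models/my-model.gguf
ENTRYPOINT ["llama-run", "/models/my-model.gguf"]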
What’s Next?
At Red Hat and in the RamaLama community, we’re continually looking for ways to make AI boring—in the best possible way. Join us by experimenting with RamaLama, contributing to the RamaLama project, or sharing your feedback and use cases. Together, we can make the extraordinary seem routine.
Let’s keep innovating—one boring AI tool at a time. 🚀