
vLLM or llama.cpp: Choosing the right LLM inference engine for your use case
See how vLLM’s throughput and latency compare to llama.cpp's and discover which tool is right for your specific deployment needs on enterprise-grade hardware.
Deploy DialoGPT-small on OpenShift AI for internal model testing, with step-by-step instructions for setting up the runtime, model storage, and inference service.
Walk through how to set up KServe autoscaling by leveraging the power of vLLM, KEDA, and the custom metrics autoscaler operator in Open Data Hub.
See how to use Cursor AI to migrate a Bash test suite to Python, including how to replace functions and create a new PyTest suite.
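As a hedged illustration of the kind of conversion that article walks through (not its exact code), the sketch below rewrites a hypothetical Bash health check as a PyTest test; the endpoint, environment variable, and test name are placeholders.

```python
# Hypothetical example: a Bash check such as
#   [ "$(curl -s -o /dev/null -w '%{http_code}' "$ENDPOINT/health")" = "200" ] || exit 1
# rewritten as a PyTest test. The endpoint and names are illustrative only.
import os

import requests

ENDPOINT = os.environ.get("ENDPOINT", "http://localhost:8080")


def test_health_endpoint_returns_200():
    """The service health check should answer with HTTP 200."""
    response = requests.get(f"{ENDPOINT}/health", timeout=5)
    assert response.status_code == 200
```

Run it with `pytest -q` once the service under test is up.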
Discover how llama.cpp API remoting brings AI inference to near-native speed on macOS, closing the gap between API-remoted and native performance.
Learn how to leverage kernel live patching to keep your infrastructure updated and minimize the amount of manual work required.
AI agents are where things get exciting! In this episode of The Llama Stack Tutorial, we'll dive into Agentic AI with Llama Stack, showing you how to give your LLM real-world capabilities like searching the web, pulling in data, and connecting to external APIs. You'll learn how agents are built with models, instructions, tools, and safety shields, and see live demos of using the Agentic API, running local models, and extending functionality with Model Context Protocol (MCP) servers. Join Senior Developer Advocate Cedric Clyburn as we learn all things Llama Stack! Next episode? Guardrails, evals, and more!
Learn how to install Red Hat OpenShift AI to enable an on-premise inference service for Ansible Lightspeed in this step-by-step guide.
Discover the benefits of using Rust for building concurrent, scalable agentic systems, and learn how it addresses the GIL bottleneck in Python.
Learn how to deploy Red Hat AI Inference Server using vLLM and evaluate its performance with GuideLLM in a fully disconnected Red Hat OpenShift cluster.
Learn how to deploy the Qwen3-Next model on vLLM using Red Hat AI. This guide covers the steps for serving the model with Podman.
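As a rough sketch of what querying such a deployment might look like (not the guide's exact steps), the snippet below assumes the Podman container exposes vLLM's OpenAI-compatible API on localhost:8000; the base URL, API key, and model ID are assumptions to replace with values from your own deployment.

```python
# Minimal sketch: query a vLLM server's OpenAI-compatible API once the
# Podman container is running. The base URL, API key, and model ID below
# are assumptions; substitute the values from your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```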
Enhance your Python AI applications with distributed tracing. Discover how to use Jaeger and OpenTelemetry for insights into Llama Stack interactions.
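For orientation, here is a minimal, generic OpenTelemetry setup that exports spans over OTLP to a locally running Jaeger collector; the service name, span name, attributes, and endpoint are assumptions, and the article's Llama Stack-specific instrumentation will differ.

```python
# Minimal sketch: emit OpenTelemetry spans from a Python app and export them
# over OTLP to a local Jaeger collector (default gRPC port 4317).
# The service name, span name, and attributes are illustrative.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "llama-stack-client-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("chat-request") as span:
    span.set_attribute("llm.model", "llama3.2:3b")  # placeholder attribute
    # ... call your model or Llama Stack client here ...
```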
Learn how to deploy the lightweight AI model Llama-3.2-1B-Instruct-quantized.w8a8 using Red Hat AI Inference Server containerization.
Discover the vLLM Semantic Router, an open source system for intelligent, cost-aware request routing that ensures every token generated truly adds value.
Deploy a Llama language model using Red Hat OpenShift AI. This guide walks you through GPU setup, model deployment, and internal and external testing.
Explore an AI-powered fashion search application built on Red Hat OpenShift AI with EDB Postgres AI.
Learn to run and serve OpenAI's gpt-oss models locally with RamaLama, a CLI tool that automates secure, containerized deployment and GPU optimization.
Building AI apps is one thing, but making them chat with your documents is next-level. In Part 3 of the Llama Stack Tutorial, we dive into Retrieval Augmented Generation (RAG), a pattern that lets your LLM reference external knowledge it wasn't trained on. Using the open source Llama Stack project from Meta, you'll learn how to:

- Spin up a local Llama Stack server with Podman
- Create and ingest documents into a vector database
- Build a RAG agent that selectively retrieves context from your data
- Chat with real docs like PDFs, invoices, or project files, using Agentic RAG

By the end, you'll see how RAG brings your unique data into AI workflows and how Llama Stack makes it easy to scale from local dev to production on Kubernetes.
Learn how to deploy and scale Mixture of Experts (MoE) models using vLLM's new execution model and llm-d's intelligent Kubernetes-native inference framework.
Welcome back to Red Hat Dan on Tech, where Senior Distinguished Engineer Dan Walsh dives deep on all things technical, from his expertise in container technologies with tools like Podman and Buildah, to runtimes, Kubernetes, AI, and SELinux! In this episode, Eric Curtin joins to discuss Sorcery AI, a new AI code review tool that helps find bugs, review PRs, and much more!
Welcome back to Red Hat Dan on Tech, where Senior Distinguished Engineer Dan Walsh dives deep on all things technical, from his expertise in container technologies with tools like Podman and Buildah, to runtimes, Kubernetes, AI, and SELinux! In this episode, you'll see a live demo of RamaLama's new RAG capability, allowing you to use your unique data with a local LLM. Learn more: https://developers.redhat.com/articles/2025/04/03/simplify-ai-data-integration-ramalama-and-rag
Explore how platform engineering, OpenShift, and Developer Hub create a governed, repeatable, and scalable foundation for enterprise AI.
Building AI applications is more than just running a model: you need a consistent way to connect inference, agents, storage, and safety features across different environments. That's where Llama Stack comes in. In this second episode of The Llama Stack Tutorial Series, Cedric (Developer Advocate @ Red Hat) walks through how to:

- Run Llama 3.2 (3B) locally and connect it to Llama Stack
- Use the Llama Stack server as the backbone for your AI applications
- Call REST APIs for inference, agents, vector databases, guardrails, and telemetry
- Test out a Python app that talks to Llama Stack for inference

By the end of the series, you'll see how Llama Stack gives developers a modular API layer that makes it easy to start building enterprise-ready generative AI applications, from local testing all the way to production. In the next episode, we'll use Llama Stack to chat with your own data (PDFs, websites, and images) with local models.

🔗 Explore more:
- Llama Stack GitHub: https://github.com/meta-llama/llama-stack
- Docs: https://llama-stack.readthedocs.io
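As a taste of that API layer, the sketch below calls a locally running Llama Stack server from Python. It assumes the llama-stack-client package, the default port 8321, and a Llama 3.2 3B model ID; exact method names and model IDs can vary between Llama Stack releases, so treat this as an outline rather than the tutorial's code.

```python
# Minimal sketch, assuming the llama-stack-client Python package and a
# Llama Stack server on localhost:8321 serving a Llama 3.2 3B model.
# Method names and model IDs vary across releases; check your server's docs.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",  # placeholder model ID
    messages=[{"role": "user", "content": "What does Llama Stack provide?"}],
)
print(response.completion_message.content)
```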
Learn how to optimize PyTorch code with minimal effort using torch.compile, a just-in-time compiler that generates optimized kernels automatically.
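As a minimal illustration (independent of that article's benchmarks), the sketch below wraps a small elementwise function with torch.compile and checks that the eager and compiled outputs agree; the function and tensor shapes are arbitrary examples.

```python
# Minimal sketch: wrap a plain PyTorch function with torch.compile so that
# the compiler can JIT-generate fused kernels on the first call.
import torch


def scaled_gelu(x: torch.Tensor) -> torch.Tensor:
    # A small elementwise chain that benefits from operator fusion.
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x**3)))


compiled_gelu = torch.compile(scaled_gelu)

x = torch.randn(4096, 4096)
# Eager and compiled results should match to within floating-point tolerance.
print(torch.allclose(scaled_gelu(x), compiled_gelu(x), atol=1e-5))
```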
Learn how a pattern engine plus a small LLM can perform production-grade failure analysis on low-cost hardware, slashing inference costs by over 99%.