
GuideLLM: Evaluate LLM deployments for real-world inference
Learn how to evaluate the performance of your LLM deployments with the open source GuideLLM toolkit to optimize cost, reliability, and user experience.
RamaLama's new multimodal feature integrates vision-language models with containers. Discover how it helps developers download and serve multimodal AI models.
Integrate Red Hat AI Inference Server with LangChain to build agentic document processing workflows. This article presents a use case and Python code.
Explore Red Hat Summit 2025 with Dan Russo and Repo, the Red Hat Developer mascot!
Discover how to deploy compressed, fine-tuned models for efficient inference with the new Axolotl and LLM Compressor integration.
Learn how to run vLLM on CPUs with OpenShift using Kubernetes APIs and dive into performance experiments for LLM benchmarking in this beginner-friendly guide.
Discover why Kafka is the foundation behind modular, scalable, and controllable AI automation.
Learn how to secure, observe, and control AI models at scale without code changes to simplify zero-trust deployments by using service mesh.
Enhance your Node.js AI applications with distributed tracing. Discover how to use Jaeger and OpenTelemetry for insights into Llama Stack interactions.
Learn how to deploy a Whisper model on Red Hat AI Inference Server within a RHEL 9 environment using Podman containers and NVIDIA GPUs for speech recognition.
Learn to build a chatbot leveraging vLLM for generative AI inference. This guide provides source code and steps to connect to a Llama Stack Swift SDK server.
Deploy AI at the edge with Red Hat OpenShift AI. Learn to set up OpenShift AI, configure storage, train models, and serve using KServe's RawDeployment.
Dive into the world of containers and Kubernetes with Podman Desktop, an open source tool that empowers your container development workflow and lets you seamlessly deploy applications to local and remote Kubernetes environments. For developers, operations teams, and anyone looking to simplify building and deploying containers, Podman Desktop provides an intuitive interface compatible with container engines such as Podman, Docker, Lima, and more.
Learn about Podman AI Lab and how you can start using it today to test and build AI-enabled applications. An extension for Podman Desktop, the container and cloud-native tool for application developers and administrators, the AI Lab is your one-stop shop for popular generative AI use cases like summarizers, chatbots, and RAG applications. From the model catalog, you can also easily download and start AI models as local services on your machine. We'll cover all this and more, so be sure to try out Podman AI Lab today!
Learn to harness the power of natural language processing by creating LLM tools with Apache Camel's low-code UI. Engage with this interactive tutorial in the Developer Sandbox for a hands-on experience.
In this video, Maarten demonstrates a neural network and how it works in AI/ML models. Neural networks are a class of ML models inspired by the human brain, made up of interconnected units of neurons, or nodes. Neural networks are the foundation of many AI applications, including image recognition, speech processing, and natural language understanding.
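The idea of interconnected neurons described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the video: a single neuron computes a weighted sum of its inputs plus a bias, applies an activation function, and layers of such neurons feed into one another. All weights here are hand-picked placeholders; a real network would learn them from data.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias, passed through the activation
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# A tiny two-layer network: 2 inputs -> 2 hidden neurons -> 1 output.
# Weights are illustrative; training would adjust them.
def forward(x):
    h1 = neuron(x, [2.0, -1.0], 0.5)
    h2 = neuron(x, [-1.5, 1.0], 0.0)
    return neuron([h1, h2], [1.0, 1.0], -1.0)

out = forward([1.0, 0.0])  # a value between 0 and 1
```

Image recognition, speech processing, and language models all build on this same forward-pass structure, just with many more neurons and layers.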
In this recording, we demonstrate how to compose model compression experiments, highlighting the benefits of advanced algorithms requiring custom data sets and how evaluation results and model artifacts can be shared with stakeholders.
Podman enables developers to run Linux containers on macOS within virtual machines, with GPU acceleration for improved AI inference performance.
Learn how to control the output of vLLM's AI responses with structured outputs. Discover how to define choice lists, JSON schemas, regex, and more.
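As a rough sketch of what such a request can look like: vLLM's OpenAI-compatible server accepts extra fields that constrain generation to a choice list, a JSON schema, or a regex. The helper below only builds the request body (no server is contacted); the endpoint URL and model name are placeholders, not values from the article.

```python
import json

BASE_URL = "http://localhost:8000/v1"  # assumed local vLLM endpoint (placeholder)
MODEL = "my-model"                     # placeholder model name

def build_request(prompt, *, choices=None, schema=None, regex=None):
    # Structured-output constraints ride along as extra request fields.
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    if choices is not None:
        body["guided_choice"] = choices  # restrict output to one of these strings
    if schema is not None:
        body["guided_json"] = schema     # constrain output to a JSON schema
    if regex is not None:
        body["guided_regex"] = regex     # constrain output to match a regex
    return body

req = build_request(
    "Is this review positive or negative?",
    choices=["positive", "negative"],
)
print(json.dumps(req, indent=2))
```

In practice you would send this body through an OpenAI-compatible client (e.g., via its `extra_body` mechanism) pointed at the running vLLM server; the article covers the full setup.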
Headed to WeAreDevelopers World Congress 2025? Visit the Red Hat Developer booth on-site to speak to our expert technologists.
Explore how to utilize guardrails for safety mechanisms in large language models (LLMs) with Node.js and Llama Stack, focusing on LlamaGuard and PromptGuard.
Learn how to optimize GPU resource use with NVIDIA Multi-Instance GPU (MIG) and discover how MIG-Adapter enhances GPU resource utilization in Kubernetes.
Members from the Red Hat Node.js team were recently at PowerUp 2025.
Discover how IBM used OpenShift AI to maximize GPU efficiency on its internal AI supercomputer, using open source tools like Kueue for efficient AI workloads.
Gain detailed insights into vLLM deployments on OpenShift AI. Learn to build dashboards with Dynatrace and OpenTelemetry to enable reliable LLM performance.