AI inference

Learn how to build a distributed RAG pipeline with Ray Data on OpenShift AI for high-performance parsing, embedding, and writing to a vector database.

Boost AI and analytics workloads with Kove:SDM on Red Hat OpenShift, enabling applications to access memory resources beyond local node limits.

Learn how to build an open cloud native architecture for AI agents. This blueprint explains how to improve workload isolation and implement inference routing.

Learn how to set up local agentic AI computer use. Run quantized models like Qwen 3.6 with Hermes to automate desktop tasks on your own terms today.

Deploy a self-hosted AI coding assistant with vLLM and Red Hat OpenShift AI for privacy and operational independence.

Learn about the llm-d batch gateway, a Kubernetes-native batch inference service that plugs into the same llm-d inference stack managed by Red Hat OpenShift AI.

Explore a demo of serving a multimodal model (Qwen3-Omni) with vLLM-Omni on a single hardware accelerator.

Learn how to implement GPU-as-a-Service on Red Hat OpenShift using Kueue, NVIDIA MIG, and a custom dashboard plug-in for self-service GPU resource booking.

Learn how to optimize deployment of vLLM for various traffic shapes, including high-concurrency chat, long-context RAG, high-throughput batch, and distributed AI-grid.

Learn about the three optimization levers for distributed AI inference: prefill/decode disaggregation, KV cache strategy, and speculative decoding.

Learn how Red Hat's SastAI initiative, in collaboration with NVIDIA, automates false positive identification in static application security testing (SAST) using generative AI. By employing an agentic, multi-stage research workflow, SastAI reduces noise and improves triage efficiency. Discover the pattern harvesting methodology that strengthens SastAI with better knowledge and reasoning.

Learn how to connect the EvalHub runtime to internal or external model servers using service account tokens, API keys, or custom certificates.

Learn how to connect a modern Apache Iceberg lakehouse to LLM-hosted models using nothing but SQL on Red Hat OpenShift AI.

Learn about the five-dimensional design space in modern LLM serving, including tensor, pipeline, expert, data, and context parallelism.

Discover how personal AI notebooks in Red Hat Developer Lightspeed can help developers find specific details in project documents quickly, grounded in context.

Learn when to use llama.cpp and vLLM for local inference of large language models (LLMs). Discover the key differences, benchmarks, and use cases for each engine.

Learn how speculative decoding can improve the performance of large language models (LLMs) in production by using a small, fast model to generate tokens speculatively and a large model to verify them.

Learn how Model-as-a-Service (MaaS) solves the problem of managing AI costs, security, and models for every developer in an organization.

Learn how llm-d routes each inference request to the GPU that already has the relevant data cached, cutting time to first token and doubling throughput without changing hardware. Discover how Red Hat's stack packages this into a single Kubernetes resource.

Learn how to create a functional Red Hat pizza shop voice agent using Red Hat OpenShift AI, focusing on practical architecture choices and implementation lessons learned along the way.

Speculators v0.5.0 introduces DFlash support, enabling single-pass draft token generation with block diffusion for more efficient speculative decoding workflows. The release also adds unified online and offline training through vLLM’s native hidden states extraction system, improving training flexibility, version stability, and production readiness.

Red Hat and DeepLearning.AI have released a free hands-on course on the full LLM

Learn how to use Red Hat OpenShift AI's reusable components to build modular AI pipelines, speed up development, and focus on what differentiates your applications.

Learn how to deploy Hermes Agent, a self-improving AI agent with a learning loop, on OpenShift AI with GPU-accelerated vLLM model serving.

Evaluation-driven development with EvalHub

William Caban Babilonia +1

June 2, 2026

Learn how evaluation-driven development (EDD) turns AI optimization from an art into an engineering discipline with EvalHub.

Build a distributed RAG pipeline with Ray Data on OpenShift AI

Optimize OpenShift workloads with software-defined memory

Architect an open blueprint for cloud-native AI agents

Computer use: How AI agents can automate almost anything

Run Claude Code locally with vLLM and OpenShift AI

Batch inference on OpenShift AI with llm-d: Architecture, integration, and workflows

Inside the vLLM-Omni architecture: Serving Qwen3-Omni

Implement GPU-as-a-Service with Kueue and NVIDIA MIG

Deploying distributed AI inference: Blueprints & troubleshooting

Optimizing distributed AI inference: Advanced deployment patterns

Beyond regex: Harvesting security logic with LLMs

Connect EvalHub to protected production model servers

SQL with GenAI: Building an Apache Iceberg lakehouse on Red Hat OpenShift

Designing distributed AI inference: Core concepts and scaling dimensions

Chat with your docs with Red Hat Developer Hub

llama.cpp vs. vLLM: Choosing the right local LLM inference engine

How speculative decoding delivers faster LLM inference

Model-as-a-Service: How to run your own private AI API

Intelligent inference scheduling with llm-d on Red Hat AI

Build a local voice agent with Red Hat OpenShift AI

Speculators v0.5.0: DFlash support and online training

Learn to optimize, deploy, and benchmark LLMs with vLLM: A New Free Course

Build modular AI pipelines with OpenShift AI and reusable components

Deploy Hermes Agent on OpenShift AI with vLLM model serving

