AI inference

Article

Chat with your docs with Red Hat Developer Hub

Lucas Yoon

Discover how personal AI notebooks in Red Hat Developer Lightspeed can help developers find specific details in project documents quickly, grounded in context.

Article

How speculative decoding delivers faster LLM inference

Sawyer Bowerman

Learn how speculative decoding can improve the performance of large language models (LLMs) in production by using a small, fast model to generate tokens speculatively and a large model to verify them.
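The draft-and-verify idea in this summary can be illustrated with a minimal greedy sketch. It is not taken from the article: the toy "models" are plain functions that map a token list to the next token, and the names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical. A real system would verify all draft tokens in one batched forward pass of the large model and would usually use probabilistic acceptance rather than exact matching.

```python
from typing import Callable, List

# A "model" here is just a function: token context -> next token (assumption).
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The small model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The large model checks each proposal (looped here for clarity;
    #    in practice this is a single batched pass).
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the large model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the large model adds one bonus token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    """Repeat speculative rounds until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + max_new]


def toy_target(ctx: List[int]) -> int:
    # Toy large model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Toy small model: agrees with the target except after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With exact-match acceptance, the output is identical to what the large model would generate on its own; the speedup comes from accepting several draft tokens per large-model verification round.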

Article

Intelligent inference scheduling with llm-d on Red Hat AI

Madhu Goutham Reddy Ambati +1

Learn how llm-d routes each inference request to the GPU that already has the relevant data cached, cutting time-to-first-token and doubling throughput without changing hardware. Discover how Red Hat's stack packages this into a single Kubernetes resource.
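The cache-aware routing idea in this summary can be sketched as follows. This is an illustration under stated assumptions, not llm-d's actual scheduler or API: `Replica`, `block_hashes`, `route`, and the block size of 4 tokens are all hypothetical. Each request is sent to the replica that already holds the longest cached prefix, with the lowest in-flight load as a tie-breaker.

```python
from dataclasses import dataclass, field
from typing import List, Set

BLOCK = 4  # tokens per cache block (assumed value for illustration)


def block_hashes(tokens: List[int]) -> List[int]:
    """Hash every full-block prefix so equal prefixes give equal hashes."""
    return [hash(tuple(tokens[:end]))
            for end in range(BLOCK, len(tokens) + 1, BLOCK)]


@dataclass
class Replica:
    name: str
    cached: Set[int] = field(default_factory=set)  # prefix hashes held
    in_flight: int = 0                             # requests routed so far


def cached_prefix_blocks(replica: Replica, hashes: List[int]) -> int:
    """Count leading blocks of the request already cached on the replica."""
    count = 0
    for h in hashes:
        if h not in replica.cached:
            break
        count += 1
    return count


def route(replicas: List[Replica], tokens: List[int]) -> Replica:
    """Pick the replica with the longest cached prefix, then the lowest load."""
    hashes = block_hashes(tokens)
    best = max(replicas,
               key=lambda r: (cached_prefix_blocks(r, hashes), -r.in_flight))
    best.cached.update(hashes)  # the chosen replica now holds these blocks
    best.in_flight += 1         # never decremented in this sketch
    return best
```

In this sketch, two requests that share a system prompt land on the same replica, so the second one can skip recomputing the shared prefix, while an unrelated request falls back to the least-loaded replica.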

Article

Build a local voice agent with Red Hat OpenShift AI

Mike Hepburn

Learn how to create a functional Red Hat pizza shop voice agent using Red Hat OpenShift AI, focusing on practical architecture choices and implementation lessons learned along the way.

Article

Speculators v0.5.0: DFlash support and online training

Helen Zhao +2

Speculators v0.5.0 introduces DFlash support, enabling single-pass draft token generation with block diffusion for more efficient speculative decoding workflows. The release also adds unified online and offline training through vLLM’s native hidden states extraction system, improving training flexibility, version stability, and production readiness.

Article

Evaluation-driven development with EvalHub

William Caban Babilonia +1

Learn how evaluation-driven development (EDD) turns AI optimization from an art into an engineering discipline with EvalHub.

Article

Claude as your performance analysis partner

Archana Ravindar

Explore the benefits of using Claude for performance analysis of CPU profiles and traces, with the Go Green Tea garbage collector as a case study. Learn about optimization opportunities and low-level code analysis.

Article

Running AI inference on Rebellions ATOM NPU with Red Hat AI

Erwan Gallen +2

Learn how to deploy and serve large language models (LLMs) on Rebellions ATOM NPUs using Red Hat OpenShift AI and a certified vLLM container image on the Red Hat AI Inference Server. This post walks through the steps to set up the joint solution between Red Hat and Rebellions: installing the Node Feature Discovery operator and the Rebellions NPU operator, creating the ATOM hardware profile in OpenShift AI, and creating the vLLM RBLN ServingRuntime.

Article

Build an enterprise RAG system with OGX

Abdelhamid Soliman

Learn how to transform a simple chatbot into an enterprise RAG application by applying metadata filtering, hybrid search, and neural reranking using the OGX framework in Red Hat OpenShift AI.

Article

Centralized routing for external and self-hosted LLMs on OpenShift AI

Edward Arthur Quarm Jnr

Discover how Red Hat OpenShift AI 3.4's Models-as-a-Service (MaaS) capability streamlines AI inference by acting as an integrated AI gateway within the platform, providing centralized governance and routing requests to both self-hosted models and external providers.

Article

Combining KServe and llm-d for optimized generative AI inference

Ran Pollak +1

Learn how to combine KServe and llm-d to optimize generative AI inference, improve performance, and reduce infrastructure costs. This article demonstrates the integration architecture and provides practical guidance for AI platform teams.