Red Hat AI Inference Server

Learn how to deploy and experiment with Gemma 4, the latest open model family from Google DeepMind. This guide covers text, image, and video input, Mixture-of-Experts architecture, and more. Get started with Red Hat AI Inference Server today.

Explore the four pillars of AI coding: vibes, secs, skills, and agents, and learn how they can improve the coding quality and reduce the encoding/decoding gap. Discover the benefits of a spec-driven approach and the importance of modular specs and skills in achieving harmony.

Learn how to integrate Anthropic's Claude Code, an agentic coding tool, using Red Hat AI Inference Server on OpenShift. Keep the inference process private on your own infrastructure while retaining the full Claude Code workflow.

Learn how to set up vLLM Semantic Router locally with two models: a quantized Qwen3-Coder-Next running on Apple Silicon, and Google's Gemini 2.5 Pro as the cloud fallback. This router can significantly reduce token costs by routing common requests to a less expensive model.

Learn how to set up and run a local AI audio transcription using an Red Hat open source model.

Learn how to deploy multiple large language models (LLMs) behind a single OpenAI-compatible endpoint on OpenShift using a Model-as-a-Service (MaaS) approach. This guide demonstrates how to build an intelligent routing infrastructure that dynamically inspects the request payload and directs traffic based on the specified model field, reducing GPU waste and simplifying application logic.

Discover a practical solution pattern for building a modern financial application that makes loan decisions using multiple machine learning systems deployed across hybrid environments.

LLM Compressor v0.10 introduces Distributed Data Parallel (DDP) for faster compression, memory management, and advanced quantization formats. Make model compression workflows more efficient for large language models.

Learn how to enable the NVIDIA RTX PRO 4500 Blackwell Server Edition on Red Hat AI for compact, power-efficient AI deployments. This hardware offers inference performance without adding unnecessary operational complexity for Red Hat AI users.

5 steps to triage vLLM performance

David Whyte-Gray +3

March 9, 2026

Learn how to improve the performance of your vLLM deployments with a diagnostic workflow that isolates latency issues and server saturation. Discover the key metrics to monitor and techniques to alleviate memory pressure.

Learn how to run OpenAI's Whisper model through vLLM on Apple Silicon, giving you an OpenAI-compatible endpoint on localhost. Then, discover how to take this architecture into production using Red Hat AI Inference Server.

Learn how to estimate memory requirements for your LLM fine-tuning experiments using Red Hat Training Hub's memory_estimator.py API. This guide covers the memory components, adjusting training setups for specific GPU specifications, and using the memory estimator in your code. Streamline your model fine-tuning process with runtime estimates and automated hyperparameter suggestions.

Learn how to deploy and test an Earth and space model inference service on Red Hat AI Inference Server and Red Hat OpenShift AI. This article includes two self-contained activities, one deploying Prithvi using a traditional Deployment object and another serving the model using KServe and observing Knative scaling.

Understand the PyTorch autograd engine internals to debug gradient flows. Learn about computational graphs, saved tensors, and performance optimization techniques.

Optimize vLLM performance with practical tuning tips. Learn how to use GuideLLM for benchmarking, adjust GPU ratios, and maximize KV cache to improve throughput.

Learn how to design agentic workflows, and how the Red Hat AI portfolio supports production-ready agentic systems across the hybrid cloud.

See how to use Apache Camel to turn LLMs into reliable text-processing engines for generative parsing, semantic routing, and "air-gapped" database querying.

Learn about NVFP4, a 4-bit floating-point format for high-performance inference on modern GPUs that can deliver near-baseline accuracy at large scale.

Explore how Red Hat OpenShift AI uses LLM-generated summaries to distill product reviews into a form users can quickly process.

Learn how to build AI-enabled applications for product recommendations, semantic product search, and automated product review summarization with OpenShift AI.

Deploy an Oracle SQLcl MCP server on an OpenShift cluster and use it with the OpenShift AI platform in this AI quickstart.

Explore the latest release of LLM Compressor, featuring attention quantization, MXFP4 support, AutoRound quantization modifier, and more.

This article compares the performance of llm-d, Red Hat's distributed LLM inference solution, with a traditional deployment of vLLM using naive load balancing.

Whether you're just getting started with artificial intelligence or looking to deepen your knowledge, our hands-on tutorials will help you unlock the potential of AI while leveraging Red Hat's enterprise-grade solutions.

Take a look back at Red Hat Developer's most popular articles of 2025, covering AI coding practices, agentic systems, advanced Linux networking, and more.

Red Hat AI Inference Server

Platforms

Build

Quicklinks

Communicate

RED HAT DEVELOPER

Red Hat legal and privacy links

Red Hat legal and privacy links

Report a website issue