Learn to optimize, deploy, and benchmark LLMs with vLLM: A New Free Course
Red Hat and DeepLearning.AI have released a free hands-on course on the full LLM
Learn how to deploy and serve large language models (LLMs) on Rebellions ATOM NPUs using Red Hat OpenShift AI and a certified vLLM container image on the Red Hat AI Inference Server. This post walks through setting up the joint Red Hat and Rebellions solution: installing the Node Feature Discovery operator and the Rebellions NPU operator, creating the ATOM hardware profile in OpenShift AI, and creating the vLLM RBLN ServingRuntime.
Learn how to combine KServe and llm-d to optimize generative AI inference, improve performance, and reduce infrastructure costs. This article demonstrates the integration architecture and provides practical guidance for AI platform teams.
Learn how speculative decoding in vLLM can significantly increase throughput without altering a model's output quality, resulting in 19% cost savings at scale for enterprise AI. This post benchmarks gpt-oss-120B with Eagle3 speculative decoding on vLLM and demonstrates consistent throughput and latency improvements across varying concurrency levels, datasets, tensor-parallelism settings, and draft-token budgets.
Learn how to deploy and experiment with Gemma 4, the latest open model family from Google DeepMind. This guide covers text, image, and video input, Mixture-of-Experts architecture, and more. Get started with Red Hat AI Inference Server today.
Explore the four pillars of AI coding: vibes, specs, skills, and agents, and learn how they can improve coding quality and reduce the encoding/decoding gap. Discover the benefits of a spec-driven approach and the importance of modular specs and skills in achieving harmony.
Learn how to integrate Anthropic's Claude Code, an agentic coding tool, using Red Hat AI Inference Server on OpenShift. Keep the inference process private on your own infrastructure while retaining the full Claude Code workflow.
Learn how to set up vLLM Semantic Router locally with two models: a quantized Qwen3-Coder-Next running on Apple Silicon, and Google's Gemini 2.5 Pro as the cloud fallback. This router can significantly reduce token costs by routing common requests to a less expensive model.
Learn how to set up and run local AI audio transcription using a Red Hat open source model.
Learn how to deploy multiple large language models (LLMs) behind a single OpenAI-compatible endpoint on OpenShift using a Model-as-a-Service (MaaS) approach. This guide demonstrates how to build an intelligent routing infrastructure that dynamically inspects the request payload and directs traffic based on the specified model field, reducing GPU waste and simplifying application logic.
Discover a practical solution pattern for building a modern financial application that makes loan decisions using multiple machine learning systems deployed across hybrid environments.
LLM Compressor v0.10 introduces Distributed Data Parallel (DDP) for faster compression, memory management, and advanced quantization formats. Make model compression workflows more efficient for large language models.
Learn how to enable the NVIDIA RTX PRO 4500 Blackwell Server Edition on Red Hat AI for compact, power-efficient AI deployments. This hardware offers inference performance without adding unnecessary operational complexity for Red Hat AI users.
Learn how to improve the performance of your vLLM deployments with a diagnostic workflow that isolates latency issues and server saturation. Discover the key metrics to monitor and techniques to alleviate memory pressure.
Learn how to run OpenAI's Whisper model through vLLM on Apple Silicon, giving you an OpenAI-compatible endpoint on localhost. Then, discover how to take this architecture into production using Red Hat AI Inference Server.
Learn how to estimate memory requirements for your LLM fine-tuning experiments using Red Hat Training Hub's memory_estimator.py API. This guide covers the memory components, adjusting training setups for specific GPU specifications, and using the memory estimator in your code. Streamline your model fine-tuning process with runtime estimates and automated hyperparameter suggestions.
Learn how to deploy and test an Earth and space model inference service on Red Hat AI Inference Server and Red Hat OpenShift AI. This article includes two self-contained activities, one deploying Prithvi using a traditional Deployment object and another serving the model using KServe and observing Knative scaling.
Understand the PyTorch autograd engine internals to debug gradient flows. Learn about computational graphs, saved tensors, and performance optimization techniques.
Optimize vLLM performance with practical tuning tips. Learn how to use GuideLLM for benchmarking, adjust GPU ratios, and maximize KV cache to improve throughput.
Learn how to design agentic workflows, and how the Red Hat AI portfolio supports production-ready agentic systems across the hybrid cloud.
See how to use Apache Camel to turn LLMs into reliable text-processing engines for generative parsing, semantic routing, and "air-gapped" database querying.
Learn about NVFP4, a 4-bit floating-point format for high-performance inference on modern GPUs that can deliver near-baseline accuracy at large scale.
Explore how Red Hat OpenShift AI uses LLM-generated summaries to distill product reviews into a form users can quickly process.
Learn how to build AI-enabled applications for product recommendations, semantic product search, and automated product review summarization with OpenShift AI.
Deploy an Oracle SQLcl MCP server on an OpenShift cluster and use it with the OpenShift AI platform in this AI quickstart.