Chat with your docs with Red Hat Developer Hub
Discover how personal AI notebooks in Red Hat Developer Lightspeed can help developers find specific details in project documents quickly, grounded in context.
Learn when to use llama.cpp and vLLM for local inference of large language models (LLMs). Discover the key differences, benchmarks, and use cases for each engine.
Learn how speculative decoding can improve the performance of large language models (LLMs) in production by using a small, fast model to generate tokens speculatively and a large model to verify them.
Learn how Model-as-a-Service (MaaS) solves the problem of managing AI costs, security, and models for every developer in an organization.
Learn how llm-d routes each inference request to the GPU that already has the relevant data cached, cutting time to first token and doubling throughput without changing hardware. Discover how Red Hat's stack packages this neatly into a single Kubernetes resource.
Learn how to create a functional Red Hat pizza shop voice agent using Red Hat OpenShift AI, focusing on practical architecture choices and implementation lessons learned along the way.
Speculators v0.5.0 introduces DFlash support, enabling single-pass draft token generation with block diffusion for more efficient speculative decoding workflows. The release also adds unified online and offline training through vLLM’s native hidden states extraction system, improving training flexibility, version stability, and production readiness.
Red Hat and DeepLearning.AI have released a free hands-on course on the full LLM
Learn how to use Red Hat OpenShift AI's reusable components to build modular AI pipelines, speed up development, and focus on what differentiates your applications.
Learn how to deploy Hermes Agent, a self-improving AI agent with a learning loop, on OpenShift AI with GPU-accelerated vLLM model serving.
Learn how evaluation-driven development (EDD) turns AI optimization from an art into an engineering discipline with EvalHub.
Learn how we fine-tuned the vLLM Semantic Router's embedding model to reduce misrouting rates and improve routing accuracy in enterprise deployments.
Explore the benefits of using Claude for performance analysis of CPU profiles and traces, with the Go Green Tea garbage collector as a case study. Learn about optimization opportunities and low-level code analysis.
Learn about LogAn, an open source tool designed to overcome the limitations of using LLMs to analyze massive volumes of production logs.
Learn how to deploy and serve large language models (LLMs) on Rebellions ATOM NPUs using Red Hat OpenShift AI and a certified vLLM container image on the Red Hat AI Inference Server. This post walks through the steps to set up the joint solution between Red Hat and Rebellions: installing the Node Feature Discovery operator and the Rebellions NPU operator, creating the ATOM hardware profile in OpenShift AI, and creating the vLLM RBLN ServingRuntime.
Learn how to transform a simple chatbot into an enterprise RAG application by applying metadata filtering, hybrid search, and neural reranking using the OGX framework in Red Hat OpenShift AI.
Discover how Red Hat OpenShift AI 3.4's Models-as-a-Service (MaaS) capability streamlines AI inference by acting as an integrated AI gateway within the platform, providing centralized governance and routing requests to both self-hosted models and external providers.
Learn how to prevent silent failures in your production AI inference stack with end-to-end benchmarking.
Learn about GPU compute kernels, their role in distributed AI inference, and the Hugging Face Kernel Hub.
Learn how our team implemented CI/CD pipelines for the it-self-service-agent AI quickstart and the benefits of using CI/CD for agentic systems.
Learn how Red Hat AI 3.4 uses EvalHub to orchestrate AI evaluations on Kubernetes. Scale frameworks like Garak and LightEval with built-in MLflow tracking.
Learn how to combine KServe and llm-d to optimize generative AI inference, improve performance, and reduce infrastructure costs. This article demonstrates the integration architecture and provides practical guidance for AI platform teams.
Users can deploy vLLM on a variety of hardware with a simple command. But a lot of work goes on below the surface to make the magic happen.