Red Hat AI Inference Server
Move large models from code to production faster with an enterprise-grade inference engine built on vLLM.
Faster inference with Red Hat AI Inference Server
Red Hat® AI Inference Server provides fast and cost-effective inference at scale, across the hybrid cloud. Its open source nature allows it to support your preferred generative AI (gen AI) model, on any AI accelerator, in any cloud environment.
Powered by vLLM, the inference server maximizes GPU utilization and minimizes latency to enable faster response times. Combined with LLM Compressor capabilities, it increases inference efficiency without sacrificing performance or increasing compute costs.
Additionally, a pre-optimized model repository ensures rapid model serving. Access to ready-to-use models enables faster work, improves cross-team consistency, and grants the flexibility to choose a model that fits your use case.
Red Hat AI Inference Server can be deployed on other Linux and Kubernetes platforms, with support under Red Hat’s third-party support policy. It is also certified with all Red Hat products and is included in the Red Hat AI portfolio.
_________________________________________________________
Why use Red Hat AI Inference Server?
Centralize your models with the Red Hat AI Model Repository
Red Hat AI Inference Server provides access to a set of validated third-party models that run efficiently on vLLM, a fast and efficient serving engine for large language models (LLMs).
Red Hat AI validates our collection of pre-optimized models by running a series of capacity planning scenarios. The validation process is performed on leading LLMs and across a wide range of hardware to ensure reproducibility for popular enterprise use cases. With this guidance, customers can properly size inference workloads for domain-specific use cases, such as virtual assistants, retrieval augmented generation (RAG) applications, and summarization.
Benefits:
- Flexibility
- Optimized inference
- Predictability
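To give a sense of how the repository fits a vLLM-based workflow, the sketch below loads a pre-optimized model with vLLM's offline Python API. The model ID and prompt are illustrative placeholders, not a recommendation; the models actually published in the repository are listed on Hugging Face.

```python
# Minimal sketch, assuming a quantized model from the Red Hat AI repository
# on Hugging Face. The model ID below is a placeholder, not an endorsement.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Llama-3.1-8B-Instruct-quantized.w8a8")  # placeholder model ID
sampling = SamplingParams(temperature=0.2, max_tokens=128)

# Offline batch inference: pass one or more prompts and read the generated text.
outputs = llm.generate(["Summarize the benefits of quantized inference."], sampling)
print(outputs[0].outputs[0].text)
```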
________________________________________
Make the most of your GPUs with vLLM
Red Hat AI Inference Server is powered by vLLM, an inference engine that speeds up the output of generative AI applications by making better use of GPU memory.
Building cost-efficient and reliable LLM services requires significant computing power, energy resources, and specialized operational skills. These challenges can put the benefits of customized, deployment-ready AI out of reach for many organizations.
vLLM makes more efficient use of the hardware needed to support AI workloads, helping to make AI at scale a reality for organizations working within a budget.
vLLM is an open source library maintained by the vLLM community. It helps LLMs perform calculations more efficiently and at scale. With cross-platform adaptability and a growing community of contributors, vLLM is emerging as the Linux® of gen AI inference.
Benefits:
- Higher GPU utilization
- Minimized latency
- Faster response time
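As a rough illustration of what this looks like in practice, the sketch below queries a running vLLM-based inference server through its OpenAI-compatible endpoint. The URL, model name, and prompt are placeholders for whatever your own deployment actually serves.

```python
# Minimal sketch, assuming an inference server is already running locally and
# exposing vLLM's OpenAI-compatible API. URL and model ID are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-3.1-8B-Instruct-quantized.w8a8",  # placeholder model ID
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```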
________________________________________________________
Optimize models and cut costs with LLM Compressor
Red Hat AI Inference Server optimizes model deployments with LLM Compressor, which compresses both foundation and trained models of any size for more efficient inference. This reduces compute utilization and its related costs without sacrificing performance or model accuracy.
LLM Compressor allows users to apply various compression algorithms to LLMs for optimized deployment with vLLM.
Benefits:
- Reduced compute utilization and costs
- Preserved performance and model accuracy
- Optimized deployment of large models
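As a rough illustration of that workflow, the sketch below applies one of the quantization algorithms from the upstream llm-compressor library to a small open model in one-shot mode. The model, dataset, scheme, and calibration settings are illustrative values drawn from the project's public examples, not a Red Hat-recommended recipe, and import paths may vary by library version.

```python
# Minimal sketch, following the upstream llm-compressor one-shot examples.
# Model ID, dataset, scheme, and sample counts are illustrative placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # placeholder base model
    dataset="open_platypus",                      # placeholder calibration dataset
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-W8A8",        # compressed model output
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting output directory can then be loaded by vLLM like any other model.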
________________________________________________________
Try Red Hat AI Inference Server
Start a no-cost product trial of Red Hat AI Inference Server, which includes access to:
- A single, 60-day, self-supported subscription to Red Hat AI Inference Server.
- Red Hat’s award-winning Customer Portal, with product documentation, helpful videos, discussions, and more.
- An enterprise-grade inference runtime, based on the de facto standard for LLM inference.
- Our third-party validated and optimized model repository hosted on Hugging Face.
- LLM Compressor—the tool Red Hat used to build its optimized model repository—so you can optimize your own customized models.
- Inference benchmarks and accuracy evaluation tools such as GuideLLM.
Refer to Red Hat AI Inference Server documentation for more details.
Red Hat AI blogs and articles
- Learn more about how AI models apply training data to real-world situations.
- Learn more about vLLM and how it speeds up gen AI inference.
- Learn how to optimize your AI runtimes and AI inference for the enterprise.
- Learn how Red Hat AI engineers are using AI and open source.