Red Hat AI Inference Server

Move large models from code to production faster with an enterprise-grade inference engine built on vLLM.

Try it
Explore Red Hat portfolio

Faster inference with Red Hat AI Inference Server

Red Hat® AI Inference Server provides fast and cost-effective inference at scale, across the hybrid cloud. Its open source nature allows it to support your preferred generative AI (gen AI) model, on any AI accelerator, in any cloud environment.

Powered by vLLM, the inference server maximizes GPU utilization and minimizes latency to enable faster response times. Combined with LLM Compressor, it increases inference efficiency without sacrificing performance or increasing compute costs.

Additionally, a pre-optimized model repository ensures rapid model serving. Access to ready-to-use models enables faster work, improves cross-team consistency, and grants the flexibility to choose a model that fits your use case.
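
Because the serving layer is built on vLLM, a running Red Hat AI Inference Server exposes an OpenAI-compatible API. The snippet below is a minimal sketch of sending a chat request with the openai Python client; the endpoint URL and model ID are illustrative placeholders rather than product defaults.

```python
# A minimal sketch of querying an OpenAI-compatible inference endpoint.
# The URL and model ID below are placeholders, not product defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="EMPTY",                      # local deployments often need no key
)

response = client.chat.completions.create(
    model="RedHatAI/Llama-3.1-8B-Instruct-FP8-dynamic",  # illustrative model ID
    messages=[{"role": "user", "content": "Summarize this support ticket for me."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```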

Red Hat AI Inference Server can be deployed across other Linux and Kubernetes platforms with support under Red Hat’s third-party support policy. It is also certified for all Red Hat products and is included in the Red Hat AI portfolio.

_________________________________________________________

Why use Red Hat AI Inference Server?

Centralize your models with the Red Hat AI Model Repository

Red Hat AI Inference Server provides access to a set of validated third-party models that run efficiently on vLLM, a fast and efficient serving engine for large language models (LLMs).

Red Hat validates its collection of pre-optimized models by running them through a series of capacity planning scenarios. The validation process is performed on leading LLMs and across a wide range of hardware to ensure reproducibility for popular enterprise use cases. With this guidance, customers can properly size inference workloads for domain-specific use cases, such as virtual assistants, retrieval augmented generation (RAG) applications, and summarization.

Benefits: 

  • Flexibility
  • Optimized inference
  • Predictability 

________________________________________

Make the most of your GPUs with vLLM

Red Hat AI Inference Server is powered by vLLM, an inference engine that speeds up the output of generative AI applications by making better use of GPU memory.

Building cost-efficient and reliable LLM services requires significant computing power, energy resources, and specialized operational skills. These challenges can put the benefits of customized, deployment-ready AI out of reach for many organizations.

vLLM makes more efficient use of the hardware needed to support AI workloads, helping make AI at scale a reality for organizations on a budget.

vLLM is an open source code library maintained by the vLLM community. It helps LLMs perform calculations more efficiently and at scale. With cross-platform adaptability and a growing community of contributors, vLLM is emerging as the Linux® of gen AI inference.
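
As an illustration of what that looks like in practice, here is a minimal sketch of offline batch inference with vLLM's Python API; the model name is a placeholder, and any supported Hugging Face-compatible checkpoint could be substituted.

```python
# A minimal sketch of offline batch inference with the vLLM Python API.
# The model ID is a placeholder; substitute any supported checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain paged attention in one paragraph.",
    "List three ways to reduce LLM serving latency.",
]

# vLLM batches and schedules these prompts together to keep the GPU busy.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```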

Benefits: 

  • Higher GPU utilization
  • Minimized latency
  • Faster response time 

________________________________________________________

Optimize models and cut costs with LLM Compressor

Red Hat AI Inference Server uses LLM Compressor to optimize model deployments and inference by compressing both foundation models and trained models of any size. This reduces compute utilization and its related costs without sacrificing performance or model accuracy.

LLM Compressor allows users to apply various compression algorithms to LLMs for optimized deployment with vLLM.
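
As a concrete illustration, the sketch below applies a one-shot FP8 quantization recipe and saves a vLLM-ready checkpoint. It is modeled on the upstream llm-compressor examples, so exact import paths and arguments may differ between releases; the model ID and output directory are placeholders.

```python
# A minimal sketch of one-shot FP8 dynamic quantization with LLM Compressor.
# Modeled on upstream llm-compressor examples; details may vary by release.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Recipe: quantize every Linear layer to FP8, leaving the output head untouched.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# One-shot compression: this scheme requires no calibration data or retraining.
oneshot(model=model, recipe=recipe)

# Save a compressed checkpoint that vLLM can load directly.
save_dir = "Llama-3.1-8B-Instruct-FP8-dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```

The saved directory can then be loaded by vLLM or the inference server in place of the original full-precision model.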

Benefits:

  • Reduced compute utilization and costs
  • Preserved performance and model accuracy
  • Optimized deployment of large models

________________________________________________________

Try Red Hat AI Inference Server

Start a no-cost product trial of Red Hat AI Inference Server, which includes access to:

  • A single, 60-day, self-supported subscription to Red Hat AI Inference Server.
  • Red Hat’s award-winning Customer Portal, with product documentation, helpful videos, discussions, and more.
  • An enterprise-grade inference runtime, based on the de facto standard for LLM inference.
  • Our third-party validated and optimized model repository hosted on Hugging Face.
  • LLM Compressor—the tool Red Hat used to build its optimized model repository—so you can optimize your own customized models.
  • Inference benchmarks and accuracy evaluation tools such as GuideLLM.

Refer to Red Hat AI Inference Server documentation for more details. 

Start your trial

Red Hat Enterprise Linux® AI

Red Hat Enterprise Linux AI is a foundation model platform that makes open source-licensed gen AI models work for the enterprise. 

Its hybrid cloud flexibility lowers costs and removes barriers to testing and experimentation.

Explore Red Hat Enterprise Linux AI

Red Hat OpenShift® AI

Red Hat OpenShift AI is an AI platform with tools to develop, train, serve, and monitor machine learning models quickly and consistently. 

Its hybrid flexibility enables AI tooling on-site, in the public cloud, or at the edge.

Explore Red Hat OpenShift AI

Red Hat AI blogs and articles

Learn more about how AI models apply training data to real-world situations.

Learn more about vLLM and how it speeds up gen AI inference.

Learn how to optimize your AI runtimes and AI inference for the enterprise.

Learn how Red Hat AI engineers are using AI and open source.

Ready to use AI in production?

Transitioning to production with Red Hat AI offers enhanced stability for enterprises that want to scale. As one of the largest commercial contributors to vLLM, we have a deep understanding of the technology. Our AI consultants are ready to help you achieve your enterprise AI goals. 

Talk to an expert