Red Hat AI Inference

Move larger models from code to production faster with an end-to-end inference stack built on vLLM.

Try it Explore Red Hat portfolio

Faster inference with Red Hat AI Inference

Red Hat® AI Inference provides fast and cost-effective inference at scale across the hybrid cloud. Its open source nature supports your preferred generative AI model, on any accelerator, in any cloud environment. 

Powered by vLLM, the comprehensive end-to-end inference stack maximizes GPU utilization and minimizes latency to enable faster response times. 

As the engine for agentic workflows and Models-as-a-Service patterns, it improves inference efficiency without sacrificing performance or increasing compute costs.

Additionally, users gain access to llm-d, an open source framework that gives developers a blueprint for building distributed inference in Kubernetes environments. It is covered under Red Hat's third-party support policy.

As part of the Red Hat AI platform, Red Hat AI Inference can be deployed across other Linux and Kubernetes platforms and is certified for use with all Red Hat products.

_________________________________________________________

Why use Red Hat AI Inference?

Make the most of your GPUs with vLLM

Red Hat AI Inference is powered by vLLM, an inference server that speeds up the output of generative AI applications by making better use of the GPU memory.

Building cost-efficient and reliable LLM services requires significant computing power, energy resources, and specialized operational skills. These challenges can put the benefits of customized, deployment-ready AI out of reach for many organizations.

vLLM uses the hardware needed to support AI workloads more efficiently, helping to make AI at scale a reality for organizations on a budget.

vLLM is a library of open source code maintained by the vLLM community. It helps large language models (LLMs) perform calculations more efficiently and at scale. With cross-platform adaptability and a growing community of contributors, vLLM is emerging as the Linux® of gen AI inference.

Benefits: 

  • Higher GPU utilization
  • Minimized latency
  • Faster response time 
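The memory savings behind vLLM come from paged KV-cache management: instead of reserving a contiguous buffer sized for each request's maximum possible sequence length, the cache is allocated in small fixed-size blocks as tokens are generated. A back-of-the-envelope sketch with illustrative numbers (assumptions, not benchmarks):

```python
# Illustrative sketch of why paged KV-cache allocation saves GPU memory.
# The per-token cost and block size are assumptions, not measurements.

BYTES_PER_TOKEN_KV = 0.5 * 1024 * 1024  # ~0.5 MiB of KV cache per token (assumed)
MAX_SEQ_LEN = 2048                       # context window reserved per request
BLOCK_SIZE = 16                          # tokens per cache block (assumed)

def contiguous_bytes(actual_tokens: int) -> float:
    """Naive serving: reserve the full context window up front."""
    return MAX_SEQ_LEN * BYTES_PER_TOKEN_KV

def paged_bytes(actual_tokens: int) -> float:
    """Paged serving: allocate only the blocks the sequence actually uses."""
    blocks = -(-actual_tokens // BLOCK_SIZE)  # ceiling division
    return blocks * BLOCK_SIZE * BYTES_PER_TOKEN_KV

# A typical chat turn might use ~200 tokens of a 2048-token window.
used = 200
naive = contiguous_bytes(used)
paged = paged_bytes(used)
print(f"naive: {naive / 2**20:.0f} MiB, paged: {paged / 2**20:.0f} MiB, "
      f"saved: {100 * (1 - paged / naive):.0f}%")
```

With these assumed numbers, the paged scheme allocates roughly a tenth of the memory the naive scheme reserves, which is the headroom vLLM turns into higher batch sizes and GPU utilization.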

________________________________________

Centralize your models with the Red Hat AI model repository

Access pre-optimized, validated, third-party models that run on vLLM, a fast and efficient serving engine for models of any size.

 

Red Hat validates its collection of pre-optimized models by running a series of capacity planning scenarios. The validation process is performed on leading LLMs and across a wide range of hardware to ensure reproducibility for popular enterprise use cases. With this guidance, customers can properly size inference workloads for domain-specific use cases, such as virtual assistants, retrieval-augmented generation (RAG) applications, and summarization.
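Capacity planning for inference comes down to fitting model weights plus per-request KV cache into GPU memory with some headroom. A minimal sizing sketch, using assumed illustrative numbers rather than validated figures:

```python
# Back-of-the-envelope capacity planning for an inference deployment.
# All numbers are illustrative assumptions, not validated benchmark figures.

def required_gib(weight_gib: float, kv_gib_per_user: float,
                 concurrent_users: int, headroom: float = 0.9) -> float:
    """GPU memory needed: model weights plus KV cache for each concurrent
    request, divided by a utilization headroom factor that leaves room
    for activations and fragmentation."""
    return (weight_gib + kv_gib_per_user * concurrent_users) / headroom

# Assumed: a quantized 8B-class model (~4 GiB of weights) and
# ~0.5 GiB of KV cache per concurrent user.
need = required_gib(weight_gib=4.0, kv_gib_per_user=0.5, concurrent_users=32)
print(f"~{need:.0f} GiB of GPU memory for 32 concurrent users")
```

The same arithmetic, run against measured per-model numbers, is what lets you match a given accelerator to a target concurrency level for a specific use case.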

Benefits: 

  • Flexibility
  • Optimized inference
  • Predictability

________________________________________________________

Optimize models and cut costs with LLM Compressor

Optimize model deployments using LLM Compressor. It can optimize model inference by compressing both foundation models and fine-tuned models of any size. This reduces compute utilization and its related costs without sacrificing performance or model accuracy.

LLM Compressor lets users apply various compression algorithms to LLMs for optimized deployment with vLLM.

Benefits:

  • Reduced compute utilization and costs
  • Preserved performance and model accuracy
  • Optimized deployment of large models
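To see roughly what weight quantization buys, here is an arithmetic sketch of the memory footprint at different precisions (parameter count and formats are illustrative assumptions, not product claims):

```python
# Rough memory footprint of model weights at different precisions.
# The parameter count is an illustrative assumption.

def weights_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring small overheads
    such as quantization scales and zero points."""
    return num_params * bits_per_weight / 8 / 2**30

params = 8e9  # an 8B-parameter model (assumed)

fp16 = weights_gib(params, 16)  # common serving precision
int8 = weights_gib(params, 8)   # 8-bit weight quantization
int4 = weights_gib(params, 4)   # 4-bit weight quantization

for name, gib in [("FP16", fp16), ("INT8", int8), ("INT4", int4)]:
    print(f"{name}: {gib:.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which is why a quantized model can fit on a smaller accelerator, or leave more memory free for KV cache and larger batches.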

________________________________________________________

Try Red Hat AI Inference 

Start a no-cost product trial of Red Hat AI Inference, which includes access to:

  • A single, 60-day, self-supported subscription to Red Hat AI Inference.
  • Red Hat’s award-winning customer portal, with product documentation, helpful videos, discussions, and more.
  • An enterprise-grade inference runtime, based on the de facto standard for LLM inference.
  • A model optimization toolkit to reduce hardware requirements for foundational or custom models with techniques like quantization or sparsity.
  • Our third-party validated and optimized model repository hosted on Hugging Face.
  • LLM Compressor—the tool Red Hat used to build its optimized model repository—so you can optimize your own customized models.
  • GenAI-specific telemetry that surfaces model-specific performance metrics, such as time-to-first-token, KV-cache hit rate, throughput, and GPU utilization, to help you monitor and tune performance.

Refer to Red Hat AI Inference documentation for more details.  
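Metrics like time-to-first-token and throughput are derived from request timestamps. A minimal sketch of the arithmetic, using a hypothetical trace structure rather than the product's actual telemetry API:

```python
# Sketch of how TTFT and throughput are computed from request timestamps.
# RequestTrace is a hypothetical structure, not a product API.
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps (seconds) and token count for one inference request."""
    submitted: float
    first_token: float
    finished: float
    output_tokens: int

def ttft(trace: RequestTrace) -> float:
    """Time-to-first-token: latency before the user sees any output."""
    return trace.first_token - trace.submitted

def throughput(trace: RequestTrace) -> float:
    """Output tokens generated per second over the whole request."""
    return trace.output_tokens / (trace.finished - trace.submitted)

# Hypothetical request: first token after 0.25 s, 128 tokens in 2.25 s total.
trace = RequestTrace(submitted=0.0, first_token=0.25, finished=2.25,
                     output_tokens=128)
print(f"TTFT: {ttft(trace):.2f} s, throughput: {throughput(trace):.1f} tok/s")
```

TTFT governs how responsive an application feels, while throughput governs serving cost per token; tuning one often trades off against the other, which is why both appear in the telemetry.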

 

Start your trial
________________________________________________________

Scale with Red Hat AI Enterprise

Red Hat® AI Enterprise provides the foundation for developing and deploying AI-powered applications across the hybrid cloud.

This integrated AI platform deploys, manages, and scales AI inference, agentic AI workflows, and AI-powered applications on any infrastructure. It keeps your AI applications fast and responsive, even as user demand grows. With layered security and safety, the platform enables hybrid cloud agility while mitigating risk.

Learn more

Red Hat Enterprise Linux® AI

Red Hat Enterprise Linux AI is a foundation model platform that makes open source-licensed gen AI models work for the enterprise. 

Its hybrid cloud flexibility lowers costs and removes barriers to testing and experimentation.

Explore Red Hat Enterprise Linux AI

Red Hat OpenShift® AI

Red Hat OpenShift AI is an AI platform with tools to develop, train, serve, and monitor machine learning models quickly and consistently. 

Its hybrid flexibility enables AI tooling on-site, in the public cloud, or at the edge.

Explore Red Hat OpenShift AI

Red Hat AI blogs and articles


Learn more about how AI models apply training data to real-world situations.


Learn more about vLLM and how it speeds up gen AI inference.


Learn how to optimize your AI runtimes and AI inference for the enterprise.


Learn how Red Hat AI engineers are using AI and open source.

Ready to use AI in production?

Transitioning to production with Red Hat AI offers enhanced stability for enterprises that want to scale. As one of the largest commercial contributors to vLLM, we have a deep understanding of the technology. Our AI consultants are ready to help you achieve your enterprise AI goals. 

Talk to an expert