Faster inference with Red Hat AI Inference
Red Hat® AI Inference provides fast and cost-effective inference at scale across the hybrid cloud. Built on open source technology, it supports your preferred generative AI model on any accelerator, in any cloud environment.
Powered by vLLM, this comprehensive, end-to-end inference stack maximizes GPU utilization and minimizes latency for faster response times.
As the engine for agentic workflows and Models-as-a-Service patterns, it improves inference efficiency without sacrificing performance or increasing compute costs.
Additionally, users gain access to llm-d, an open source framework that gives developers a blueprint for building distributed inference in Kubernetes environments. llm-d is covered under Red Hat's third-party support policy.
As part of the Red Hat AI platform, Red Hat AI Inference can also be deployed on other Linux and Kubernetes platforms and is certified for use with all Red Hat products.
_________________________________________________________
Why use Red Hat AI Inference?
Make the most of your GPUs with vLLM
Red Hat AI Inference is powered by vLLM, an inference server that speeds up the output of generative AI applications by making better use of GPU memory.
Building cost-efficient and reliable LLM services requires significant computing power, energy resources, and specialized operational skills. These challenges can put the benefits of customized, deployment-ready AI out of reach for many organizations.
vLLM makes more efficient use of the hardware that AI workloads require, helping make AI at scale a reality for organizations on a budget.
vLLM is an open source code library maintained by the vLLM community. It helps large language models (LLMs) perform calculations more efficiently and at scale. With cross-platform adaptability and a growing community of contributors, vLLM is emerging as the Linux® of gen AI inference.
Benefits:
- Higher GPU utilization
- Minimized latency
- Faster response time
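To make the GPU-utilization point concrete, here is a minimal offline-inference sketch using vLLM's Python API. The model ID and sampling settings are illustrative placeholders, not Red Hat defaults.

```python
# Minimal vLLM offline-inference sketch (the model ID is a placeholder).
from vllm import LLM, SamplingParams

# vLLM keeps GPU utilization high by batching requests continuously and
# managing KV-cache memory efficiently with PagedAttention.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)

for out in outputs:
    print(out.outputs[0].text)
```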
________________________________________
Centralize your models with the Red Hat AI model repository
Access pre-optimized, validated third-party models that run efficiently on vLLM, a fast serving engine for models of any size.
Red Hat AI validates its collection of pre-optimized models by running a series of capacity-planning scenarios. The validation process is performed on leading LLMs across a wide range of hardware to ensure reproducibility for popular enterprise use cases. With this guidance, customers can properly size inference workloads for domain-specific use cases, such as virtual assistants, retrieval-augmented generation (RAG) applications, and summarization.
Benefits:
- Flexibility
- Optimized inference
- Predictability
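As an example of how a validated model might be consumed, the sketch below queries a vLLM server through its OpenAI-compatible API. The model ID is a placeholder; browse Red Hat AI's repository on Hugging Face for the actual validated checkpoints.

```python
# Querying a validated model served by vLLM (sketch; model ID is a placeholder).
# Assumes a server was started separately, for example:
#   vllm serve RedHatAI/<validated-model-id>
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; a local deployment needs no
# real API key, so the conventional "EMPTY" value is used.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/<validated-model-id>",  # placeholder
    messages=[{"role": "user", "content": "Summarize our returns policy in two sentences."}],
)
print(response.choices[0].message.content)
```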
________________________________________________________
Optimize models and cut costs with LLM Compressor
Optimize model deployments with LLM Compressor, which compresses both foundation and fine-tuned models of any size. Compression reduces compute utilization and its related costs without sacrificing performance or model accuracy.
LLM Compressor lets users apply a range of compression algorithms, such as quantization and sparsity, to LLMs for optimized deployment with vLLM.
Benefits:
- Reduced compute utilization and costs
- Preserved performance and model accuracy
- Optimized deployment of large models
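As an illustration, the sketch below applies one-shot 4-bit quantization with LLM Compressor, following the pattern in the project's public examples. The exact module paths, recipe options, and dataset names vary by release, so treat these details as assumptions to verify against the current documentation.

```python
# One-shot W4A16 quantization with LLM Compressor (sketch; verify the API
# against the llmcompressor release you have installed).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Quantize all Linear layers to 4-bit weights, leaving the output head intact.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # calibration data
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The compressed checkpoint written to output_dir can then be loaded by vLLM like any other model.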
________________________________________________________
Try Red Hat AI Inference
Start a no-cost product trial of Red Hat AI Inference, which includes access to:
- A single, 60-day, self-supported subscription to Red Hat AI Inference.
- Red Hat’s award-winning customer portal, with product documentation, helpful videos, discussions, and more.
- An enterprise-grade inference runtime, based on the de facto standard for LLM inference.
- A model optimization toolkit to reduce hardware requirements for foundation or custom models with techniques like quantization and sparsity.
- Our third-party validated and optimized model repository hosted on Hugging Face.
- LLM Compressor—the tool Red Hat used to build its optimized model repository—so you can optimize your own customized models.
- Gen AI-specific telemetry that reports model-level performance metrics, such as time-to-first-token, KV-cache hit rate, throughput, and GPU utilization, to help you monitor and tune performance (see the sketch after this list).
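As a sketch of what that telemetry looks like in practice, vLLM servers expose Prometheus-format metrics on a /metrics endpoint. The endpoint URL and the exact metric names (for example, vllm:time_to_first_token_seconds) are assumptions that vary by version and deployment.

```python
# Reading gen AI telemetry from a running vLLM server (sketch; the URL and
# metric names depend on your deployment and vLLM version).
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        # Print only vLLM's own gauges, counters, and histograms, skipping
        # the Prometheus HELP/TYPE comment lines.
        if line.startswith("vllm:"):
            print(line)
```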
Refer to Red Hat AI Inference documentation for more details.
________________________________________________________
Scale with Red Hat AI Enterprise
Red Hat® AI Enterprise provides the foundation for developing and deploying AI-powered applications across the hybrid cloud.
It is an integrated AI platform for deploying, managing, and scaling AI inference, agentic AI workflows, and AI-powered applications on any infrastructure. It keeps your AI applications fast and responsive even as user demand grows, and its layered security and safety features enable hybrid cloud agility while mitigating risk.
Learn more
Red Hat AI blogs and articles
Learn more about how AI models apply training data to real-world situations.
Learn more about vLLM and how it speeds up gen AI inference.
Learn how to optimize your AI runtimes and AI inference for the enterprise.
See how Red Hat AI engineers are using AI and open source.