Faster inference with Red Hat AI Inference Server
Red Hat® AI Inference Server provides fast and cost-effective inference at scale, across the hybrid cloud. Its open source nature allows it to support your preferred generative AI (gen AI) model, on any AI accelerator, in any cloud environment.
Powered by vLLM, the inference server maximizes GPU utilization and minimizes latency to enable faster response times. Combined with LLM Compressor, it increases inference efficiency without sacrificing performance or increasing compute costs.
Additionally, a pre-optimized model repository ensures rapid model serving. Access to ready-to-use models enables faster work, improves cross-team consistency, and grants the flexibility to choose a model that fits your use case.
Red Hat AI Inference Server can be deployed across other Linux and Kubernetes platforms with support under Red Hat’s third-party support policy. It is also certified for all Red Hat products and is included in the Red Hat AI portfolio.
________________________________________
Why use Red Hat AI Inference Server?
Centralize your models with the Red Hat AI Model Repository
Red Hat AI Inference Server provides access to a set of validated third-party models that run efficiently on vLLM, a fast and efficient serving engine for large language models (LLMs).
Red Hat AI validates its collection of pre-optimized models by running a series of capacity planning scenarios. The validation process is performed on leading LLMs across a wide range of hardware to ensure reproducibility for popular enterprise use cases. With this guidance, customers can properly size inference workloads for domain-specific use cases such as virtual assistants, retrieval-augmented generation (RAG) applications, and summarization.
Benefits:
- Flexibility
- Optimized inference
- Predictability
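As an illustrative sketch, a validated model can be pulled from the repository on Hugging Face with the `huggingface_hub` library. The organization and model ID below are assumptions for illustration, not a statement of the current catalog; check the repository for available models.

```python
# Sketch: download a pre-optimized model from the Red Hat AI repository
# on Hugging Face. The repo_id is an illustrative assumption.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="RedHatAI/Llama-3.1-8B-Instruct-quantized.w8a8",
    local_dir="./models/llama-3.1-8b-w8a8",  # where the weights land locally
)
print(path)
```

The downloaded directory can then be passed directly to vLLM as the model path.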
________________________________________
Make the most of your GPUs with vLLM
Red Hat AI Inference Server is powered by vLLM, an inference server that speeds up the output of generative AI applications by making better use of the GPU memory.
Building cost-efficient and reliable LLM services requires significant computing power, energy resources, and specialized operational skills. These challenges can put the benefits of customized, deployment-ready AI out of reach for many organizations.
By using the hardware that supports AI workloads more efficiently, vLLM helps make AI at scale a reality for organizations working within a budget.
vLLM is an open source library maintained by the vLLM community. It helps LLMs perform calculations more efficiently and at scale. With cross-platform adaptability and a growing community of contributors, vLLM is emerging as the Linux® of gen AI inference.
Benefits:
- Higher GPU utilization
- Minimized latency
- Faster response time
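A minimal sketch of vLLM's offline inference API, assuming vLLM is installed and a GPU is available; the model ID is an example, not a default from this document:

```python
# Illustrative sketch of serving a model with vLLM's Python API.
# The model ID is an assumption; any compatible Hugging Face model works.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Llama-3.1-8B-Instruct-quantized.w8a8")
params = SamplingParams(temperature=0.2, max_tokens=64)

# vLLM batches prompts and manages KV-cache memory with PagedAttention,
# which is how it keeps GPU utilization high and latency low.
outputs = llm.generate(["Summarize vLLM in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```

The same engine can also be launched as an OpenAI-compatible HTTP server for production serving.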
________________________________________
Optimize models and cut costs with LLM Compressor
Red Hat AI Inference Server uses LLM Compressor to optimize model deployments and inference, compressing both foundational and trained models of any size. This reduces compute utilization and its related costs without sacrificing performance or model accuracy.
LLM Compressor allows users to apply various compression algorithms to LLMs for optimized deployment with vLLM.
Benefits:
- Reduced compute utilization and costs
- Preserved performance and model accuracy
- Optimized deployment of large models
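A sketch of one such compression pass, assuming the `llm-compressor` library's one-shot quantization flow; the scheme and model ID are illustrative assumptions, and the library's documentation lists the supported recipes:

```python
# Sketch: one-shot FP8 quantization with LLM Compressor, producing a
# checkpoint that vLLM can load directly. Model ID is an example.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",       # quantize the Linear layers...
    scheme="FP8_DYNAMIC",   # ...to FP8 with dynamic activation scales
    ignore=["lm_head"],     # keep the output head at full precision
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",   # example source model
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-FP8",     # compressed checkpoint
)
```

The resulting directory is then served like any other model, with the memory and compute savings coming from the lower-precision weights.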
________________________________________
Try Red Hat AI Inference Server
Start a no-cost product trial of Red Hat AI Inference Server, which includes access to:
- A single, 60-day, self-supported subscription to Red Hat AI Inference Server.
- Red Hat’s award-winning customer portal, with product documentation, helpful videos, discussions, and more.
- An enterprise-grade inference runtime, based on the de facto standard for LLM inference.
- A model optimization toolkit to reduce hardware requirements for foundational or custom models with techniques like quantization or sparsity.
- Our third-party validated and optimized model repository hosted on Hugging Face.
- LLM Compressor—the tool Red Hat used to build its optimized model repository—so you can optimize your own customized models.
- Gen AI-specific telemetry that reports model-level performance metrics, such as time-to-first-token, KV-cache hit rate, throughput, and GPU utilization, to help you monitor and tune performance.
Refer to Red Hat AI Inference Server documentation for more details.
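To make the telemetry concrete: vLLM exposes these metrics in Prometheus exposition format on a `/metrics` endpoint. The sketch below parses a small sample of that text with only the standard library; the metric names follow vLLM's conventions but may vary by version.

```python
# Parse a few gen AI telemetry samples in Prometheus exposition format,
# as exposed by an inference server's /metrics endpoint.
# Metric names below follow vLLM's conventions and may vary by version.

SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 3.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage (1 means 100%).
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.42
"""

def parse_metrics(text):
    """Return {metric_name: value} for sample lines, skipping # comments."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics(SAMPLE)
print(metrics["vllm:gpu_cache_usage_perc"])  # 0.42
```

In practice these values would be scraped by Prometheus and graphed, e.g. to alert when KV-cache usage approaches capacity.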
________________________________________
Scale with Red Hat AI Enterprise
Red Hat® AI Enterprise provides the foundation for developing and deploying AI-powered applications across the hybrid cloud.
It is an integrated AI platform for deploying, managing, and scaling AI inference, agentic AI workflows, and AI-powered applications on any infrastructure. It helps your AI applications stay fast and responsive, even as user demand grows. With layered security and safety, the platform enables hybrid cloud agility while mitigating risk.
Learn more
Red Hat AI blogs and articles
- Learn more about how AI models apply training data to real-world situations.
- Learn more about vLLM and how it speeds up gen AI inference.
- Learn how to optimize your AI runtimes and AI inference for the enterprise.
- Learn how Red Hat AI engineers are using AI and open source.