Empower conversational AI at scale with KServe

March 15, 2024
Saurabh Agarwal, Yuan Tang, Adam Tetelman, Rob Esker
Related topics: Artificial intelligence
Related products: Red Hat OpenShift AI

KServe is a standard Model Inference Platform on Kubernetes built for highly scalable use cases. It is a popular open source platform available as a community project, as well as a core component of Red Hat OpenShift AI. It provides a Kubernetes custom resource definition (CRD) for serving machine learning (ML) models on arbitrary frameworks. It aims to solve production model-serving use cases by providing performant, high-abstraction interfaces for common ML frameworks and model formats like TensorFlow, PyTorch, XGBoost, Scikit-learn, and ONNX.
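
To make the CRD concrete, here is a minimal sketch of deploying a model through an InferenceService, applied with the official kubernetes Python client (the manifest is ordinarily written as YAML; the name, namespace, and storageUri below are hypothetical placeholders):

```python
# Sketch: create a KServe InferenceService via the Kubernetes Python client.
# Assumes a cluster with KServe installed and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "models"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                # Hypothetical model location; any supported storageUri works.
                "storageUri": "gs://example-bucket/models/iris",
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    body=inference_service,
)
```

Once the resource is applied, KServe provisions the serving runtime, routing, and autoscaling for the model without any custom deployment code.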

It simplifies the deployment process by abstracting the complexities associated with serving ML models at scale in production. KServe is framework-agnostic, supporting various ML frameworks, and it leverages Kubernetes' scalability features for dynamic scaling based on demand. It facilitates canary deployments for controlled model version updates, offers advanced metrics and monitoring for performance insights, and allows users to serve multiple models concurrently. KServe’s versatility, versioning support, and customizable inference pipelines make it a powerful tool for deploying, scaling, and managing ML models in production environments. Figure 1 illustrates these components.

Figure 1: Interactions between components, including KServe, the inferencing server, and an AI application, running on a Kubernetes/OpenShift platform. Source: https://kserve.github.io/website/master/images/controlplane.png

Why KServe?

  • KServe is a standard, cloud-agnostic Model Inference Platform on Kubernetes, built for highly scalable use cases.
  • Provides a performant, standardized inference protocol across ML frameworks.
  • Supports modern serverless inference workloads with request-based autoscaling, including scale-to-zero on CPUs and GPUs.
  • Provides high scalability, density packing, and intelligent routing using ModelMesh.
  • Offers simple and pluggable production serving for inference, pre/post-processing, monitoring, and explainability.
  • Supports advanced deployments such as canary rollouts, pipelines, and ensembles with InferenceGraph (a canary sketch follows this list).
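
As a concrete illustration of the canary-rollout bullet above, a hedged sketch: the v1beta1 InferenceService exposes a canaryTrafficPercent field on the predictor, so patching in a new model revision with that field set splits traffic between the old and new revisions (the namespace, name, and storage URI are hypothetical):

```python
# Sketch: shift 10% of traffic to a new model revision (canary rollout).
# Assumes the InferenceService from the previous example already exists.
from kubernetes import client, config

config.load_kube_config()

patch = {
    "spec": {
        "predictor": {
            "canaryTrafficPercent": 10,  # 10% to the new revision, 90% to the old
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "gs://example-bucket/models/iris-v2",  # hypothetical
            },
        }
    }
}

client.CustomObjectsApi().patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    name="sklearn-iris",
    body=patch,
)
```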

KServe’s ModelServer is built on top of FastAPI, which brings out-of-the-box support for the OpenAPI specification and Swagger UI. KServe supports the Open Inference Protocol (OIP) specification, which defines a standard protocol for performing ML model inference across different serving runtimes, promotes standardization, and improves interoperability between model-serving runtimes. It enables a cohesive inference experience, empowering the development of versatile client and benchmarking tools that work with all supported serving runtimes, and makes it easy for users to migrate from one platform to another.
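
To show what the protocol looks like on the wire, a minimal sketch of an OIP (v2) REST call; the host and model name are hypothetical, and the tensor layout follows the v2 inference protocol:

```python
# Sketch: call a KServe model over the Open Inference Protocol (v2) REST API.
# Host and model name are hypothetical placeholders.
import requests

base_url = "http://my-model.models.example.com"

# Check that the model is ready to serve.
ready = requests.get(f"{base_url}/v2/models/my-model/ready")
ready.raise_for_status()

# v2 inference request: named tensors with shape, datatype, and flat data.
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [6.8, 2.8, 4.8, 1.4],
        }
    ]
}
response = requests.post(f"{base_url}/v2/models/my-model/infer", json=payload)
print(response.json()["outputs"])
```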

Popular API specs

OpenAI

OpenAI provides an API interface that empowers end users with AI capabilities, facilitating the smooth integration of their applications with robust ML models. Hosted in OpenAI’s cloud, these pre-trained models are accessible through APIs, presenting substantial cost savings for organizations. OpenAI further supports model fine-tuning by allowing dataset uploads via APIs. These models are deployable on a scalable infrastructure, and their adoption has increased because of their user-friendly API interface.
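
For reference, this is the general shape of a call against the OpenAI Chat Completions API using the official openai Python client (the model name is illustrative):

```python
# Sketch: a chat completion request with the official OpenAI Python client.
# Reads the API key from the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what KServe does in one sentence."},
    ],
)
print(completion.choices[0].message.content)
```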

HuggingFace

Starting with version 1.4.0, Text Generation Inference (TGI) offers an API compatible with the OpenAI Chat Completion API. The new Messages API allows customers and users to transition seamlessly from OpenAI models to open LLMs. The API can be directly used with OpenAI’s client libraries or third-party tools like LangChain or LlamaIndex.
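
Because the Messages API mirrors the OpenAI schema, the same client can target a TGI server by overriding the base URL; a brief sketch, with a hypothetical local endpoint:

```python
# Sketch: point the OpenAI client at a TGI (>= 1.4.0) Messages API endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical local TGI endpoint
    api_key="-",  # TGI does not validate the key by default
)
completion = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; the name is a placeholder
    messages=[{"role": "user", "content": "What is KServe?"}],
)
print(completion.choices[0].message.content)
```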

Open Inference Protocol (OIP)

This specification enables a cohesive inference experience, empowering the development of versatile client and benchmarking tools that work with all supported serving runtimes. The OIP has been widely adopted by popular runtimes, including KServe, the NVIDIA Triton Inference Server protocol, Seldon MLServer, the Seldon Core v2 inference protocol, the OpenVINO RESTful and gRPC APIs, the AMD Inference Server, and the TorchServe Inference API. By adopting OIP, developers can build applications that are not tied to a single platform: they can swap out inference servers and even deploy different AI models, such as Mistral or Llama2, without modifying the application layer.
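
A small sketch of what that portability looks like in practice: the same v2 request body works unchanged across OIP-compliant servers, so switching runtimes reduces to switching a base URL (all URLs and the model name are hypothetical):

```python
# Sketch: one OIP client function reused across different serving runtimes.
import requests

def infer(base_url: str, model: str, data: list[float], shape: list[int]) -> dict:
    """Send a v2 (OIP) inference request; works with any OIP-compliant server."""
    payload = {
        "inputs": [
            {"name": "input-0", "shape": shape, "datatype": "FP32", "data": data}
        ]
    }
    r = requests.post(f"{base_url}/v2/models/{model}/infer", json=payload)
    r.raise_for_status()
    return r.json()

# The application layer stays the same; only the endpoint changes.
result = infer("http://kserve.example.com", "my-model", [1.0, 2.0, 3.0, 4.0], [1, 4])
# result = infer("http://triton.example.com", "my-model", [1.0, 2.0, 3.0, 4.0], [1, 4])
```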

NVIDIA and Red Hat

KServe recently added support for a generate endpoint, enabling additional runtimes like HuggingFace and vLLM to work with large models.
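
A hedged sketch of what a call to such a generate endpoint can look like; the path and field names here follow the Triton-style generate extension to the v2 protocol and should be treated as assumptions, and the host and model name are hypothetical:

```python
# Sketch: text generation via the v2 generate extension (path and fields
# follow the Triton-style generate extension; treat them as assumptions).
import requests

base_url = "http://llm.models.example.com"
payload = {
    "text_input": "Explain model serving in one sentence.",
    "parameters": {"temperature": 0.7, "max_tokens": 64},
}
response = requests.post(f"{base_url}/v2/models/my-llm/generate", json=payload)
print(response.json()["text_output"])
```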

With the launch of NVIDIA NIM, NVIDIA provides an industry-standard set of inference applications, including support for optimized LLM models such as Nemotron, Llama2, and Mistral, through an OpenAI API-compliant interface. Along with the tooling to download and optimize these models, NVIDIA also provides TensorRT-LLM and the NVIDIA Triton Inference Server, both as contributions to the open source community and as part of its NVIDIA AI Enterprise offerings. KServe is a powerful inferencing platform that can complement inference services built on Triton and other engines.

To benefit from open standards, NVIDIA is working with Red Hat to improve KServe OIP’s compatibility with the OpenAI schema. This feature will allow standards-based portability for applications that leverage generative AI models, including the integration of such inference services into Red Hat OpenShift AI, which uses KServe for model serving. It will also enable interoperability with current and future NVIDIA services that converge on the OpenAI specifications, making locally deployed LLMs more accessible.

Last updated: March 25, 2024
