How to set up KServe autoscaling for vLLM with KEDA

September 23, 2025
Alberto Perdomo
Related topics:
Artificial intelligence, Automation and management, Kubernetes, Open source
Related products:
Red Hat AI, Red Hat OpenShift AI


    Deploying machine learning models in production presents a unique set of challenges, and one of the most critical is ensuring that your inference service can handle varying levels of traffic efficiently and reliably. The unpredictable nature of AI workloads, where traffic can spike dramatically and resource needs can fluctuate with input sequence length, token generation length, or the number of concurrent requests, often means that traditional autoscaling methods fall short.

    Relying solely on CPU or memory usage can lead to either overprovisioning and wasted resources, or underprovisioning and a poor user experience. Similarly, high GPU utilization might indicate efficient use of accelerators, or it might signal that the accelerator is saturated. As a result, industry best practices for LLM autoscaling have shifted toward workload-specific metrics.

    In this blog post, we will introduce a more sophisticated and flexible solution. We will walk through the process of setting up KServe autoscaling by leveraging the power of vLLM, KEDA (Kubernetes Event-driven Autoscaling, illustrated in Figure 1), and the custom metrics autoscaler operator in Open Data Hub (ODH). This powerful combination allows us to scale our vLLM services on a wide range of custom, application-specific signals, not just generic metrics. This provides a level of control and efficiency that is tailored to the specific demands of your AI workloads.

    A flow diagram shows how KEDA autoscales AI workloads, with an emphasis on the role of the KEDA operator. The user sends an inference request, which goes to the Red Hat OpenShift Service Mesh ingress gateway. A Prometheus server monitors metrics from the gateway and from the machine learning model. Prometheus then sends metrics to KEDA, which uses them to autoscale the AI workload. The KEDA operator scales up or scales down the number of pods running the AI model.
    Figure 1: High-level architecture of the KEDA autoscaling system for AI workloads.

    Why KEDA in the first place?

    Understanding inference workloads requires acknowledging their fundamental difference from traditional web application patterns. Unlike conventional web requests that typically follow predictable response times, AI inference requests can be inherently heterogeneous and computationally intensive processes. These requests carry an element of uncertainty, because the processing time varies dramatically based on input sequence length, model architecture, and output sequence length, just to name a few.

    It's like the difference between a simple database lookup and solving a complex mathematical proof. A standard web request might complete in milliseconds with predictable resource consumption, but an inference request could take seconds or even minutes with unpredictable resource needs, depending on the complexity of the task at hand, among other variables. This fundamental distinction demands autoscaling strategies that look beyond generic resource metrics and instead focus on GPU load, request queues, processing latency patterns, and other metrics unique to inference workloads.

    The traditional OpenShift horizontal pod autoscaler is primarily designed to work with CPU and memory metrics through the built-in metrics server. For scaling based on model serving metrics, an additional component must expose those metrics in a format that the horizontal pod autoscaler can understand. That is precisely when KEDA comes into play.

    KEDA is a game-changer for scaling applications. KEDA extends the functionality of the standard OpenShift horizontal pod autoscaler, allowing it to scale applications from zero to N instances and back down based on a variety of event sources, including Prometheus triggers. While KServe already provides its own built-in autoscaling capabilities, KEDA introduces an open and extensible framework. This framework allows KServe to scale based on virtually any event, whether it's the length of a message queue, the number of tasks in a job, or any other custom metric exposed by vLLM that is directly relevant to the performance of our AI model.

    The custom metrics autoscaler operator is the key to unlocking this flexibility. It acts as a crucial intermediary, simplifying the process of exposing metrics from your services and making them consumable by KEDA. The custom metrics autoscaler operator bridges the gap between the specific performance indicators your application generates and KEDA's scaling logic. With the custom metrics autoscaler operator, we can define highly tailored scaling rules that react precisely to the load on our vLLM services, ensuring optimal performance and cost-effectiveness.

    Enabling vLLM metrics for autoscaling in Open Data Hub

    In this guide, we are using the meta-llama/Llama-3.2-3B model served on Open Data Hub version 2.33.0 using vLLM and a KServe RawDeployment. This guide assumes that ODH is correctly installed and its requirements are met.

    To effectively autoscale your vLLM-served models on ODH, you need to ensure their performance metrics are collected. This involves making vLLM metrics available to the cluster's internal Prometheus instance, which the custom metrics autoscaler operator will then consume. The following steps walk you through this process. 
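    Before wiring up metrics, it is worth confirming that the model is deployed and its predictor pods are ready. A quick check might look like the following (the keda namespace matches the rest of this post; your InferenceService name will differ):

    # List InferenceServices and check their READY status
    oc get inferenceservice -n keda
    # The predictor pods should be Running before metrics can be scraped
    oc get pods -n keda | grep predictor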

    1. Expose vLLM metrics to OpenShift with annotations

    First, you need a way for the cluster to discover and connect your vLLM inference service's metrics endpoint. Luckily, there is a standard way to do so by adding annotations to your InferenceService. These annotations expose the metrics port (typically 8000 or 8080) of your vLLM pods. Your InferenceService should look like this: 

    ---
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        opendatahub.io/hardware-profile-namespace: opendatahub
        opendatahub.io/legacy-hardware-profile-name: migrated-gpu-mglzi-serving
        openshift.io/display-name: llama-3.2-3b
        serving.kserve.io/autoscalerClass: external
        serving.kserve.io/deploymentMode: RawDeployment
        prometheus.io/scrape: "true"  # enable Prometheus scraping
        prometheus.io/path: "/metrics"  # specify the endpoint
        prometheus.io/port: "8000"  # specify the port
      creationTimestamp: "2025-09-01T14:09:17Z"
    # Rest of your InferenceService...
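    Before relying on cluster scraping, it can help to confirm that the vLLM container is actually serving Prometheus metrics on the annotated port. A minimal sketch, with the predictor pod name as a placeholder you would substitute:

    # Forward the metrics port of one predictor pod (pod name is an example)
    oc port-forward pod/<llama-32-3b-predictor-pod> 8000:8000 -n keda &
    # The endpoint should return vLLM metrics such as vllm:num_requests_waiting
    curl -s http://localhost:8000/metrics | grep "^vllm:" | head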

    2. Enable user-defined project monitoring

    To ensure Prometheus discovers and scrapes metrics from your annotated InferenceService, you must enable user-defined project monitoring. This is a cluster-wide setting configured in a ConfigMap.

    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true

    By setting enableUserWorkload: true, you're telling the OpenShift monitoring stack to scrape metrics from user-defined workloads in their own namespaces, making your vLLM metrics visible to the cluster's Prometheus.
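    To confirm that the user workload monitoring stack has started, you can check for its components; a quick check might be (pod names can vary slightly by OpenShift version):

    # Apply the ConfigMap above (saved locally as cluster-monitoring-config.yaml)
    oc apply -f cluster-monitoring-config.yaml
    # Expect prometheus-user-workload-* and thanos-ruler-user-workload-* pods to be Running
    oc get pods -n openshift-user-workload-monitoring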

    Once these steps are complete, your vLLM metrics will be available for use with the custom metrics autoscaler. You can go to the OpenShift console → Observe → Metrics and query any of the vLLM metrics to confirm they are being collected (see Figure 2).

    Figure 2: Metrics dashboard showing vLLM metrics.
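    If you want a few concrete expressions to try in Observe → Metrics, the following PromQL queries are reasonable starting points (exact metric names can vary between vLLM versions):

    # Requests currently being processed, per pod
    sum by (pod) (vllm:num_requests_running)
    # Requests queued and waiting for a slot
    sum by (pod) (vllm:num_requests_waiting)
    # GPU KV cache utilization
    vllm:gpu_cache_usage_perc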

    Set up KServe autoscaling with the custom metrics autoscaler operator

    With your vLLM metrics now successfully being scraped by Prometheus, the next step is to configure KEDA to use these metrics to autoscale your vLLM service. In the following steps, we will set up the custom metrics autoscaler operator to scale your workloads based on your vLLM metrics from Prometheus. For specific steps on how to install the custom metrics autoscaler operator and create the KedaController instance, check out the blog posts Custom Metrics Autoscaler on OpenShift and Boost AI efficiency with GPU autoscaling on OpenShift.

    A quick but important note: When using a KServe RawDeployment, you must disable the default KServe horizontal pod autoscaler. You can do this by adding the following annotation to your InferenceService manifest (it is already included in the example InferenceService above):

    serving.kserve.io/autoscalerClass: external

    We use the label selector app: isvc.llama-32-3b-predictor and the keda namespace. Be sure to replace these with the appropriate label and namespace for your own InferenceService.
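    If you are not sure which label to use, you can inspect the predictor pods and the deployment's selector directly; a quick check might be:

    # Show the labels KServe applied to the predictor pods
    oc get pods -n keda --show-labels | grep predictor
    # Show the selector of the predictor deployment
    oc get deployment llama-32-3b-predictor -n keda -o jsonpath='{.spec.selector.matchLabels}{"\n"}'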

    1. Prepare the environment and access control

    To allow KEDA to communicate securely with the internal cluster Prometheus, we need to create a dedicated ServiceAccount and configure role-based access control (RBAC). This ensures KEDA has the necessary permissions to read the metrics it needs to make scaling decisions.

    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: keda-prometheus-sa
      namespace: keda
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: keda
      name: keda-prometheus-role
    rules:
    - apiGroups: ["metrics.k8s.io", "custom.metrics.k8s.io", "external.metrics.k8s.io"]
      resources: ["*"]
      verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: keda-prometheus-rolebinding
      namespace: keda
    subjects:
    - kind: ServiceAccount
      name: keda-prometheus-sa
      namespace: keda
    roleRef:
      kind: Role
      name: keda-prometheus-role
      apiGroup: rbac.authorization.k8s.io

    After applying this YAML, you must also grant the keda-prometheus-sa service account the cluster-monitoring-view role. This is a critical step that allows it to access data from OpenShift’s central monitoring stack. You can do this with the following command:

    oc adm policy add-cluster-role-to-user cluster-monitoring-view -z keda-prometheus-sa -n keda

    2. Create a TriggerAuthentication for secure access

    KEDA needs a way to securely authenticate with Prometheus. This is done with a TriggerAuthentication object, which uses a bearer token. The token is automatically generated when you create a Secret tied to your service account.

    First, create the Secret:

    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: keda-prometheus-token
      namespace: keda
      annotations:
        kubernetes.io/service-account.name: keda-prometheus-sa
    type: kubernetes.io/service-account-token
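    Once the Secret is applied, the token controller populates it; it is worth verifying that the token and ca.crt entries exist before referencing them:

    # Apply the Secret above (saved locally as keda-prometheus-token.yaml)
    oc apply -f keda-prometheus-token.yaml
    # The Data section should list non-empty token and ca.crt entries
    oc describe secret keda-prometheus-token -n keda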

    Next, create the TriggerAuthentication custom resource. This object references the secret we just created to provide KEDA with the necessary credentials.

    ---
    apiVersion: keda.sh/v1alpha1
    kind: TriggerAuthentication
    metadata:
      name: keda-trigger-auth-prometheus
      namespace: keda
    spec:
      secretTargetRef:
        - parameter: bearerToken
          name: keda-prometheus-token
          key: token
        - parameter: ca
          name: keda-prometheus-token
          key: ca.crt

    The TriggerAuthentication acts as a reusable authentication configuration that can be referenced by any ScaledObject that needs to query Prometheus.
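    A quick sanity check that the resource exists (the TriggerAuthentication CRD is installed by the custom metrics autoscaler operator):

    oc get triggerauthentication keda-trigger-auth-prometheus -n keda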

    3. Define your Prometheus query

    This is arguably the most crucial and nuanced part of the setup. You must craft a Prometheus query that returns a single numeric value that accurately reflects the load on your model. vLLM exposes various useful metrics:

    • vllm:time_per_output_token_seconds_bucket: Inter-Token Latency or Time Per Output Token (TPOT).
    • vllm:e2e_request_latency_seconds_bucket: End-to-end request latency.
    • vllm:gpu_cache_usage_perc: The percentage of GPU KV cache utilization.
    • vllm:num_requests_waiting: The number of requests waiting to be processed.

    For our example, let's say we want to scale based on waiting requests. We'll use the sum() aggregation function to ensure the query returns a single value across all pods of our deployment.

    sum(vllm:num_requests_waiting{namespace="keda", pod=~"llama-32-3b-predictor.*"})

    This query sums up all pending requests for pods within the keda namespace that belong to our llama-32-3b-predictor deployment.
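    Before wiring the query into KEDA, you can test it against the same tenancy endpoint KEDA will use. The sketch below is one way to do it; it assumes a recent oc release (for oc create token) and the standard Thanos querier tenancy port, so double-check both in your environment:

    # Mint a short-lived token for the KEDA service account
    TOKEN=$(oc create token keda-prometheus-sa -n keda)
    # Forward the Thanos querier tenancy port locally
    oc -n openshift-monitoring port-forward svc/thanos-querier 9092:9092 &
    # The tenancy endpoint requires a namespace parameter alongside the query
    curl -skG "https://localhost:9092/api/v1/query" \
      -H "Authorization: Bearer ${TOKEN}" \
      --data-urlencode 'namespace=keda' \
      --data-urlencode 'query=sum(vllm:num_requests_waiting{namespace="keda", pod=~"llama-32-3b-predictor.*"})'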

    4. Create the ScaledObject

    Now, let's bring everything together in a ScaledObject. This KEDA custom resource is the core of our autoscaling configuration. It defines the target deployment, the minimum and maximum replica counts, and the scaling triggers.

    ---
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: llama-32-3b-predictor
      namespace: keda
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: llama-32-3b-predictor
      minReplicaCount: 1
      maxReplicaCount: 5
      pollingInterval: 5
      triggers:
      - type: prometheus
        authenticationRef:
          name: keda-trigger-auth-prometheus
        metadata:
          serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
          query: 'sum(vllm:num_requests_waiting{namespace="keda", pod=~"llama-32-3b-predictor.*"})'
          threshold: '2'
          authModes: "bearer"
          namespace: keda

    In this ScaledObject, we tell KEDA to scale the llama-32-3b-predictor deployment. The triggers section specifies that we'll use a Prometheus source, referencing our TriggerAuthentication. We set a threshold of 2, meaning KEDA will add new pods if the total number of waiting requests exceeds this value. We also define a minReplicaCount of 1 to ensure a pod is always available and a maxReplicaCount of 5 to prevent runaway scaling.
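    To see the trigger in action, you need enough concurrent traffic for requests to queue up. One simple, hypothetical way to generate load is a batch of parallel completion requests; the route host and model name below are placeholders for your own endpoint:

    # Send 50 concurrent completion requests to the vLLM OpenAI-compatible endpoint
    for i in $(seq 1 50); do
      curl -sk https://<your-inferenceservice-route>/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "llama-32-3b", "prompt": "Write a short poem about autoscaling.", "max_tokens": 256}' &
    done
    wait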

    5. Monitor and verify

    Once you apply the ScaledObject, KEDA will automatically create a horizontal pod autoscaler for you, as shown in Figure 3. You can monitor its status to confirm that everything is working as expected.

    Figure 3: Terminal output showing the applied ScaledObject and horizontal pod autoscaler associated with it.

    The horizontal pod autoscaler's output will show the current metric value, the scaling threshold, and the current and desired number of replicas. As traffic to your service increases, you should see the TARGETS value rise, and the REPLICAS will automatically scale up to meet demand. This confirms that your event-driven autoscaling pipeline is now fully operational.
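    The checks behind Figure 3 can be reproduced with a few commands. KEDA typically names the generated horizontal pod autoscaler keda-hpa-<scaledobject-name>, but verify the name in your cluster:

    # Inspect the ScaledObject and the HPA KEDA created for it
    oc get scaledobject -n keda
    oc get hpa -n keda
    # Watch replicas react as load rises and falls
    oc get hpa keda-hpa-llama-32-3b-predictor -n keda -w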

    What's next? Setting the stage for performance validation

    This post has focused on the foundational steps of setting up KServe autoscaling with KEDA and the Custom Metrics Autoscaler operator. We've established the architecture for a highly flexible and efficient scaling solution.

    In our next blog post, we will move from setup to validation. We will share the results of a series of performance and load tests designed to put this autoscaling architecture through its paces. We'll analyze key inference performance indicators and how they drive the behavior of the autoscaler and, in turn, the performance of our service. The goal is to provide a data-driven understanding of the significant benefits of this autoscaling strategy for inference workloads. Stay tuned for the results!
