Red Hat AI Inference on Amazon EKS: Exploring the Kubernetes resources

A look inside AI inference cluster resources on Amazon EKS

June 16, 2026
Alexa Griffith
Related topics: Artificial intelligence, Kubernetes
Related products: Red Hat AI, Red Hat AI Inference

    I recently joined Red Hat and wanted to explore and test Red Hat AI Inference with llm-d on Amazon Elastic Kubernetes Service (EKS) to understand how all of the components work together. In order to understand something well, I think you need to deploy it, especially when it comes to Kubernetes. And I find that digging into the custom resource definitions (CRDs) and each component in the control and data plane is extremely helpful for a beginner looking to understand Kubernetes services. So, I decided to do that here. After setting up a two-GPU cluster with NVIDIA L4s, I deployed a small language model to see exactly how it operates.

    This is not about running llm-d with GuideLLM benchmark numbers (that's part 2). This article focuses on understanding the architecture, including what Kubernetes resources get created, how they connect, and why the Red Hat AI Inference components make these choices.

    Helm deployment

    The Red Hat AI Inference Stack for Kubernetes docs have complete instructions for installing Red Hat AI Inference, setting up pull secrets, cloud-specific configs, and troubleshooting.

    My cluster includes:

    • Amazon EKS 1.30.
    • Two g6.2xlarge instances (NVIDIA L4 GPUs).
    • Red Hat AI Inference 3.4.

    # Verify it's running
    kubectl get pods -n redhat-ods-applications
    kubectl get gateway -A

    Now let's see what actually got deployed.

    Inside the Red Hat AI Inference platform deployment

    The Helm chart installs several platform components. Each component creates a mix of Kubernetes resources:

    • Deployments and pods run the actual controllers or gateways.
    • Services expose those pods internally or externally via LoadBalancer.
    • ConfigMaps and Secrets hold configuration and certificates.
    • CRDs extend Kubernetes with new API types (Gateway, InferencePool, LeaderWorkerSet).
    • Custom Resources are instances of the APIs (like the actual inference-gateway).

    The platform depends on these basic components:

    • cert-manager: TLS automation.
    • Istio and Gateway API: Service mesh and external ingress.
    • KServe and llm-d: Inference controllers and intelligent routing.
    • LeaderWorkerSet: Advanced workload patterns (multi-node inference).
    • Red Hat AI Inference operator: Platform lifecycle orchestration.
    • Cloud Manager operator: Multi-cloud integration.

    Here's what got deployed, organized by component:

    RHAII Inference Platform
    ├── cert-manager (TLS automation)
    │ ├── Namespaces: cert-manager, cert-manager-operator
    │ ├── Workloads (4 deployments):
    │ │ ├── cert-manager
    │ │ ├── cert-manager-cainjector
    │ │ ├── cert-manager-webhook
    │ │ └── cert-manager-operator-controller-manager
    │ ├── Services (4): Metrics and webhook endpoints
    │ └── Purpose: Issues and renews TLS certificates
    │
    ├── Istio / Gateway API (Service mesh & ingress)
    │ ├── Namespace: istio-system
    │ ├── Workloads (2 deployments):
    │ │ ├── istiod (control plane)
    │ │ └── servicemesh-operator3
    │ ├── Services (2): Control plane and metrics
    │ ├── CRDs (7): gateways, httproutes, grpcroutes, backendtlspolicies, etc.
    │ └── Purpose: External LoadBalancer and intelligent routing
    │
    ├── LeaderWorkerSet (Advanced workload orchestration)
    │ ├── Namespace: openshift-lws-operator
    │ ├── Workloads (2 deployments):
    │ │ ├── lws-controller-manager (2 replicas)
    │ │ └── openshift-lws-operator
    │ ├── Services (2): Metrics and webhook
    │ ├── CRDs (2): leaderworkersets, leaderworkersetoperators
    │ └── Purpose: Coordinated multi-pod patterns (prefill/decode disaggregation)
    │
    ├── KServe & llm-d (Inference platform)
    │ ├── Namespace: redhat-ods-applications
    │ ├── Workloads (2 deployments):
    │ │ ├── llmisvc-controller-manager (KServe reconciler)
    │ │ └── inference-gateway-istio (Envoy proxy)
    │ ├── Services (3):
    │ │ ├── inference-gateway-istio (LoadBalancer - external entry point)
    │ │ ├── llmisvc-controller-manager (metrics)
    │ │ └── llmisvc-webhook-server
    │ ├── CRDs (8):
    │ │ ├── KServe: llminferenceservices, llminferenceserviceconfigs, kserves
    │ │ └── llm-d: inferencepools (2 versions), inferenceobjectives,
    │ │     inferencemodelrewrites, inferencepoolimports
    │ ├── Custom Resources (1):
    │ │ └── Gateway/inference-gateway (actual LoadBalancer instance)
    │ └── Purpose: Orchestrates model deployments and intelligent routing
    │
    ├── RHAII Operator (Platform lifecycle)
    │ ├── Namespace: redhat-ods-operator
    │ ├── Workloads (1 deployment):
    │ │ └── rhai-operator (3 replicas for HA)
    │ ├── Services (2): Metrics and webhook
    │ └── Purpose: Orchestrates installation and upgrades
    │
    └── Cloud Manager (Multi-cloud portability)
      ├── Namespace: rhai-cloudmanager-system
      ├── Workloads (1 deployment):
      │ └── azure-cloud-manager-operator
      └── Purpose: Cloud-specific integrations (Azure/AWS/GCP)

    The platform consists of:

    • Seven namespaces.
    • 12 deployments.
    • 13 services (including external LoadBalancer).
    • 17 CRDs.
    • One Gateway instance (the actual entry point).

    All of this is shared infrastructure. You install it once per cluster, then deploy your models on top of it.

    Verify the platform is ready by checking the gateway:

    $ kubectl get gateway -n redhat-ods-applications
    NAME                CLASS   ADDRESS                                                PROGRAMMED
    inference-gateway   istio   k8s-redhatod-inferenc-xxx.elb.us-east-1.amazonaws.com   True

    The PROGRAMMED: True status confirms the platform has an externally reachable entry point through the Istio-backed Gateway API. You can now deploy models.

    Let's take a look at each of these components and what they deploy.

    cert-manager (TLS certificate management)

    The cert-manager component automatically creates and renews TLS certificates for secure communication between components.

    The problem it solves

    This automation fixes a significant operational headache: because KServe and vLLM pods expose metrics over HTTPS, a standard Prometheus instance cannot safely scrape metrics without valid certificates. Without cert-manager stepping in to prevent validation errors, platform teams would have to manually create, mount, and rotate certificates for every single new model deployment.

    Deployed resources

    The platform creates two namespaces for this component: cert-manager-operator and cert-manager.

    Running kubectl get deployment reveals the active controller applications running across both namespaces:

    $ kubectl get deployment -n cert-manager
      NAME                       READY   UP-TO-DATE   AVAILABLE
      cert-manager               1/1     1            1
      cert-manager-cainjector    1/1     1            1
      cert-manager-webhook       1/1     1            1
    $ kubectl get deployment -n cert-manager-operator
      NAME                                       READY   UP-TO-DATE   AVAILABLE
      cert-manager-operator-controller-manager   1/1     1            1

    The platform maps these controllers to internal cluster communication endpoints to manage metrics and webhook validation traffic:

    $ kubectl get svc -n cert-manager
      NAME                       TYPE        PORT(S)
      cert-manager               ClusterIP   9402/TCP
      cert-manager-cainjector    ClusterIP   9402/TCP
      cert-manager-webhook       ClusterIP   443/TCP
    $ kubectl get svc -n cert-manager-operator
      NAME                                                    TYPE        PORT(S)
      cert-manager-operator-controller-manager-metrics-svc    ClusterIP   8443/TCP

    The deployed resources divide the certificate management workload across several specific roles:

    • The cert-manager deployment acts as the core controller that issues certificates.
    • The cert-manager-cainjector deployment injects CA bundles into webhooks and API services.
    • The cert-manager-webhook deployment validates incoming certificate requests.
    • The cert-manager-operator-controller-manager deployment manages the cert-manager installation lifecycle.

    Why it matters

    For internal services, such as vLLM metrics and Prometheus scraping, self-signed certificates work fine. For external-facing services like the inference gateway, you can configure cert-manager with an ACME issuer such as Let's Encrypt to obtain trusted certificates automatically. Either way, cert-manager handles the entire lifecycle so that you never interact with certificate files manually.

    Sail operator and inference gateway (Istio Gateway API)

    The inference gateway provides a central entry point for all model requests using the Kubernetes Gateway API backed by the Istio service mesh. The platform orchestrates these ingress resources declaratively through the open source Sail operator.

    The problem it solves

    This architecture solves a common challenge in production AI deployments: routing traffic to multiple models from a single external endpoint. Traditional Kubernetes Ingress handles basic host and path routing, while Gateway API builds richer matching and backend options into the API itself. In the simplest case, the gateway routes incoming requests to different backend models based on the URL path:

    Request URL                                            Target model
    https://gateway/llm-test/qwen-basic/v1/completions     qwen-basic
    https://gateway/llm-test/llama-7b/v1/completions       llama-7b
    https://gateway/llm-test/mistral-8x7b/v1/completions   mistral-8x7b
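To make the path-to-model mapping concrete, here is a minimal Python sketch of prefix matching on whole path segments, which is roughly how a path-prefix route rule behaves. The `ROUTES` table and `pick_backend` helper are hypothetical names used only for illustration; the real matching is done by the gateway's Envoy proxy, not by code like this.

```python
from typing import Optional

# Hypothetical route table mirroring the URL paths listed above.
ROUTES = {
    "/llm-test/qwen-basic": "qwen-basic",
    "/llm-test/llama-7b": "llama-7b",
    "/llm-test/mistral-8x7b": "mistral-8x7b",
}


def pick_backend(path: str) -> Optional[str]:
    """Return the backend whose prefix matches `path`, preferring the longest prefix."""
    best = None
    for prefix, backend in ROUTES.items():
        # Match whole path segments only, so "/llm-test/llama-7b" does not
        # match "/llm-test/llama-7b-chat".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, backend)
    return best[1] if best else None


print(pick_backend("/llm-test/qwen-basic/v1/completions"))  # qwen-basic
print(pick_backend("/llm-test/unknown/v1/completions"))     # None
```

Because clients only vary the path, adding a model means adding one more entry to the route table rather than exposing another endpoint.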

    The Gateway API provides several operational advantages over standard ingress controllers:

    • Header-based routing: You can route incoming requests to specific destinations based on model version headers.
    • Weighted traffic splits: You can send 90% of traffic to model-v1 and 10% to model-v2 to handle A/B testing and deployment rollouts.
    • Custom backends: You can route traffic directly to an InferencePool instead of a standard Service, which enables intelligent scheduling.
    • Per-route policies: You can set custom timeouts for long-running inference requests, enforce rate limits, or apply retry policies for individual models.
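The weighted traffic split is easy to reason about with a small simulation. The sketch below is only an illustration in Python, assuming each request is assigned independently in proportion to the configured weights; `split_traffic` is a hypothetical helper, and the actual selection logic lives in the gateway proxy.

```python
import random
from collections import Counter


def split_traffic(weights: dict, n_requests: int, seed: int = 0) -> Counter:
    """Assign n_requests to backends in proportion to their weights."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    backends = list(weights)
    picks = rng.choices(backends, weights=[weights[b] for b in backends], k=n_requests)
    return Counter(picks)


counts = split_traffic({"model-v1": 90, "model-v2": 10}, n_requests=10_000)
share_v2 = counts["model-v2"] / 10_000
print(f"model-v2 share: {share_v2:.1%}")  # close to 10%, not exactly 10%
```

The split is statistical rather than exact, which is worth remembering when you compare metrics from a small canary against the main deployment.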

    Deployed resources

    During installation, the platform creates the istio-system namespace to house the core control plane components. This namespace runs two primary deployments that manage your routing infrastructure:

    $ kubectl get deployment -n istio-system
      NAME                     READY   UP-TO-DATE   AVAILABLE
      istiod                   1/1     1            1
      servicemesh-operator3    1/1     1            1

    The corresponding services expose these control plane endpoints internally:

    $ kubectl get svc -n istio-system
      NAME                                 TYPE        PORT(S)
      istiod                               ClusterIP   15010/TCP,15012/TCP,443/TCP
      servicemesh-operator3-metrics        ClusterIP   8443/TCP

    These core control plane components divide the work of managing cluster ingress:

    • The servicemesh-operator3 (Sail operator) deployment manages the Istio control plane lifecycle. Sail makes Istio Kubernetes-native. You define your desired state with custom resources, and Sail deploys and upgrades Istio declaratively. The "3" in the resource name refers to Red Hat OpenShift Service Mesh 3, which uses Sail instead of the legacy Maistra-based operators.
    • The istiod deployment acts as the core Istio control plane. It configures the underlying Envoy proxies and implements Gateway API routing rules.

    Gateway and HTTPRoute custom resources

    The platform installs a collection of CRDs to extend the standard Kubernetes networking capabilities:

    $ kubectl get crd | grep gateway 
      backendtlspolicies.gateway.networking.k8s.io
      gatewayclasses.gateway.networking.k8s.io
      gateways.gateway.networking.k8s.io
      grpcroutes.gateway.networking.k8s.io
      httproutes.gateway.networking.k8s.io
      referencegrants.gateway.networking.k8s.io
      gateways.networking.istio.io

    These definitions split management responsibilities across distinct networking tasks:

    • Core gateway resources: The gateways.gateway.networking.k8s.io CRD defines specific gateway instances, such as the inference-gateway, that trigger the creation of cloud load balancers. The gatewayclasses.gateway.networking.k8s.io CRD defines implementation types for those gateways, including the istio class powered by Envoy proxies.
    • Intelligent routing: The httproutes.gateway.networking.k8s.io CRD maps HTTP traffic paths like /qwen-basic to their respective backend models. The grpcroutes.gateway.networking.k8s.io CRD governs routing for model servers that communicate using gRPC-based inference APIs.
    • Security and policies: The backendtlspolicies.gateway.networking.k8s.io CRD manages TLS encryption for traffic flowing from the gateway to internal backends. The referencegrants.gateway.networking.k8s.io CRD enables security-focused cross-namespace routing, allowing a shared gateway instance to reach services running in other namespaces.
    • Istio compatibility: The gateways.networking.istio.io CRD defines the legacy Istio gateway configuration format, kept for backward compatibility with older deployments.

    Gateway proxy runtime locations

    The actual gateway Envoy proxy runs inside the redhat-ods-applications namespace:

    $ kubectl get gateway -n redhat-ods-applications
      NAME                 CLASS   ADDRESS                                                   PROGRAMMED
      inference-gateway    istio   k8s-redhatod-inferenc-xxx.elb.us-east-1.amazonaws.com     True

    This Gateway resource instructs Istio to create the actual Envoy proxy deployment:

    $ kubectl get deployment -n redhat-ods-applications | grep gateway
      NAME                         READY   UP-TO-DATE   AVAILABLE
      inference-gateway-istio      1/1     1            1

    Note

    You might notice another deployment named llmisvc-controller-manager running in this namespace. The llmisvc-controller-manager component serves as the KServe controller; it shares this namespace but isn't part of the gateway component.

    What components sit behind the inference gateway?

    You can visualize how the platform maps the root gateway configuration down to physical runtime pods by using the kubectl tree plugin:

    $ kubectl tree gateway inference-gateway -n redhat-ods-applications
    Gateway/inference-gateway
    ├── Deployment/inference-gateway-istio (Istio Envoy proxy)
    │   └── ReplicaSet/inference-gateway-istio-xxx
    │       └── Pod/inference-gateway-istio-xxx
    ├── Service/inference-gateway-istio (ClusterIP - internal)
    └── Service/inference-gateway-istio-lb (LoadBalancer - external)

    The cloud provider automatically provisions an external network load balancer to match this configuration:

    $ kubectl get svc inference-gateway-istio -n redhat-ods-applications
    NAME                      TYPE           EXTERNAL-IP                                             PORT(S)
    inference-gateway-istio   LoadBalancer   k8s-redhatod-inferenc-xxx.elb.us-east-1.amazonaws.com   80:32626/TCP

    This LoadBalancer is how external clients (including your application, curl commands, and GuideLLM testing utilities) reach the inference models. All models share this single gateway, and HTTPRoute rules determine which model handles each request.

    Why it matters

    Transitioning to the Gateway API and Istio provides advanced routing capabilities that standard Kubernetes Ingress controllers lack. You get one stable endpoint (the LoadBalancer) that can route traffic to dozens of models, with built-in traffic splitting for A/B tests, header-based routing for model versions, and the ability to route to custom backends like an InferencePool. Your client applications simply target different URL paths, allowing you to update or scale backend models without modifying client-side application code.

    KServe (model serving framework)

    KServe is the control plane for model serving. In practice, it translates your declared intent into various Kubernetes resources needed to serve a model. It provides the LLMInferenceService CRD and an operator that reconciles your YAML into actual workloads, handling the deployments, networking, and runtime configuration needed to make a model available for inference.

    The problem it solves

    Deploying a model to production isn't just "run vLLM." You need to solve a whole class of coordination problems:

    • Cross-component configuration: Settings like the tensor parallelism degree need to be coordinated across your manifest. If you use tensor parallelism across two GPUs, you must reflect that setting in both the serving configuration and the GPU resource request to keep them in sync.
    • Router-aware serving features: Some configurations require the router and the model server to cooperate. For example, if you want the router to make smarter decisions about where to send requests based on what's already cached in GPU memory, the router and model server need a direct communication channel. In this stack, that channel is a ZeroMQ (ZMQ) side channel that carries key-value (KV) cache events, and it has to be configured consistently on both components.
    • Traffic exposure: You need to expose exactly the right subset of traffic. Misconfigured routing is how you accidentally make internal model endpoints publicly accessible. This type of misconfiguration can expose fine-tuning adapters or other model internals you never intended to be reachable.
    • Traffic management: You must align your serving configuration with HTTPRoute definitions to expose the right endpoints, InferencePool resources to group backends, and proper traffic splitting for A/B testing.
    • Security and access control: You must create service accounts, RBAC policies, TLS certificates, and webhook configurations across every cluster component.
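The first item in this list, keeping the tensor parallelism degree and the GPU request in sync, is the kind of check that is easy to get wrong by hand. Here is a minimal Python sketch of that consistency check. It assumes the model server is started with vLLM's `--tensor-parallel-size` flag and that `gpu_limit` stands in for the container's GPU resource limit; both helper functions are hypothetical and are not part of KServe.

```python
def tensor_parallel_size(args: list) -> int:
    """Read --tensor-parallel-size from a vLLM-style argument list (default 1)."""
    for i, arg in enumerate(args):
        if arg.startswith("--tensor-parallel-size="):
            return int(arg.split("=", 1)[1])
        if arg == "--tensor-parallel-size" and i + 1 < len(args):
            return int(args[i + 1])
    return 1


def check_gpu_sync(args: list, gpu_limit: int) -> bool:
    """True when the GPU request matches the tensor parallelism degree."""
    return tensor_parallel_size(args) == gpu_limit


print(check_gpu_sync(["--tensor-parallel-size=2"], gpu_limit=2))  # True
print(check_gpu_sync(["--tensor-parallel-size=2"], gpu_limit=1))  # False
```

A mismatch in either direction is a problem: too few GPUs and the server cannot start, too many and the extra GPUs sit reserved but unused.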

    Without automated coordination, platform teams manually create and synchronize more than 200 lines of YAML configuration per model. This overhead includes managing separate definitions for deployments, services, HTTPRoute mappings, InferencePool groups, certificates, RBAC policies, and model-server configurations. Every change ripples across multiple resources.

    For example, scaling from two replicas to four forces you to manually update the deployment manifest, reconfigure the InferencePool backends, adjust the HTTPRoute weights, and verify that RBAC policies still function correctly without falling out of sync.

    KServe eliminates this manual operational overhead by allowing teams to define the target inference service in a single manifest of approximately 30 lines of YAML. KServe then owns the implementation. If you change a configuration from replicas: 2 to replicas: 4, KServe updates the vLLM deployment, adjusts the InferencePool backends, and rebalances routing automatically.

    Deployed resources

    The platform deploys the core management workloads inside the shared redhat-ods-applications namespace. Running kubectl get deployment highlights the active reconciler:

    $ kubectl get deployment -n redhat-ods-applications | grep llmisvc
      NAME                         READY   UP-TO-DATE   AVAILABLE
      llmisvc-controller-manager   1/1     1            1

    The accompanying network services expose metrics and manage validation webhooks:

    $ kubectl get svc -n redhat-ods-applications | grep llmisvc
      NAME                              TYPE           PORT(S)
      llmisvc-controller-manager        ClusterIP      8443/TCP
      llmisvc-webhook-server-service    ClusterIP      443/TCP

    These core system resources split coordination duties across two specific roles:

    • The llmisvc-controller-manager deployment acts as the KServe controller, watching LLMInferenceService resources and reconciling them into vLLM deployments, services, routes, and pools.
    • The llmisvc-webhook-server-service service fronts the admission webhook that validates and mutates LLMInferenceService resources before they're created.

    KServe custom resource definitions

    The framework installs several CRDs to define the model serving APIs within the cluster:

    $ kubectl get crd | grep -E 'llminferenceservice|kserve'
      kserves.components.platform.opendatahub.io
      llminferenceserviceconfigs.serving.kserve.io
      llminferenceservices.serving.kserve.io

    These CRDs define the API for serving models:

    • The llminferenceservices.serving.kserve.io CRD acts as the primary API, defining configuration fields you can use in your LLMInferenceService YAML (such as model URI, replicas, and GPU limits).
    • The llminferenceserviceconfigs.serving.kserve.io CRD provides reusable serving configurations that can be referenced by multiple models.
    • The kserves.components.platform.opendatahub.io CRD handles platform-level KServe configuration.

    Why automated model coordination matters

    Every other component in this stack—the gateway, the scheduler, the model server—has a specific job to do at runtime. KServe makes sure they all exist, remain configured correctly, and adapt to state changes automatically. Without this automation, that coordination burden falls entirely on your platform team, and it compounds with every new model you deploy.

    llm-d (intelligent inference router)

    The llm-d project provides the custom resource definitions that enable intelligent model routing. Unlike other components, the Helm chart only installs the CRDs—the actual llm-d EPP (Endpoint Picker) scheduler pods are deployed per-model when you create an LLMInferenceService.

    The problem it solves

    This dynamic routing layer resolves a critical limitation of standard Kubernetes services, which balance load without looking at what each pod is doing, typically in round-robin fashion. For example, the service sends the first request to the first pod, the second request to the second pod, and the third request back to the first pod, even if that first pod is still processing the initial query.

    Round-robin doesn't know which pod is busy, which one has relevant context cached in GPU memory, or which requests are high-priority. For LLM serving, this means:

    • Requests queue up behind slow requests instead of going to idle pods.
    • Cache-friendly requests (same prompt prefix) don't land on the pod with that prefix already in memory.
    • Batch jobs and premium requests get the same treatment.

    llm-d enables:

    • Load-aware routing: The system routes incoming queries based on the current pod load by analyzing the queue depth of both vLLM and the llm-d router, alongside the number of active requests currently in flight.
    • Cache-aware routing: The router tracks the KV cache state to minimize cache misses. It automatically directs traffic to pods that already have the prompt prefix cached, or to pods with available cache space to encourage an even distribution of prefix allocation across the system.
    • Priority-based flow control: The system references client request parameters to let premium, high-priority queries bypass the standard queue ahead of long-running batch jobs.
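To see why load and cache awareness change the outcome, consider the toy comparison below. It is a deliberately simplified Python sketch, not llm-d's actual scoring algorithm: the `Pod` fields, the `pick_pod` function, and the `cache_bonus` weight are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Pod:
    name: str
    queue_depth: int                       # requests waiting or in flight
    cached_prefixes: set = field(default_factory=set)


def pick_pod(pods: list, prompt_prefix: str, cache_bonus: int = 2) -> Pod:
    """Pick the pod with the lowest score; a cache hit lowers the score."""
    def score(pod: Pod) -> int:
        hit = prompt_prefix in pod.cached_prefixes
        return pod.queue_depth - (cache_bonus if hit else 0)
    return min(pods, key=score)


pods = [
    Pod("pod-a", queue_depth=5),
    Pod("pod-b", queue_depth=1),
    Pod("pod-c", queue_depth=2, cached_prefixes={"system-prompt-v1"}),
]

round_robin = cycle(pods)
print(next(round_robin).name)                   # pod-a, despite its long queue
print(pick_pod(pods, "other-prefix").name)      # pod-b, the least loaded
print(pick_pod(pods, "system-prompt-v1").name)  # pod-c, cache hit outweighs load
```

The interesting case is the last one: a slightly busier pod can still be the better target when it already holds the prompt prefix in its KV cache.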

    Deployed resources

    Running kubectl get crd reveals the custom API definitions installed by the platform to manage network routing behavior:

    $ kubectl get crd | grep inference.networking
      inferencemodelrewrites.inference.networking.x-k8s.io
      inferenceobjectives.inference.networking.x-k8s.io
      inferencepoolimports.inference.networking.x-k8s.io
      inferencepools.inference.networking.k8s.io
      inferencepools.inference.networking.x-k8s.io

    These installed CRDs expand the cluster's control plane capability to handle advanced model backends:

    • The inferencepools.inference.networking.k8s.io CRD defines the stable, standard version of the InferencePool API.
    • The inferencepools.inference.networking.x-k8s.io CRD provides the experimental version of the InferencePool API (the x-k8s.io group), with additional features.
    • The inferenceobjectives.inference.networking.x-k8s.io CRD defines priority levels (premium=100, normal=0, batch=-10) to be included in the request body from clients.
    • The inferencemodelrewrites.inference.networking.x-k8s.io CRD rewrites model names in requests (for example, mapping gpt-4 to llama-70b for a drop-in replacement).
    • The inferencepoolimports.inference.networking.x-k8s.io CRD imports external inference endpoints as pools (for example, OpenAI API or Amazon Bedrock).
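The priority levels quoted for InferenceObjective imply a simple ordering rule: higher values are served first. The Python sketch below illustrates that ordering with a heap. It only assumes the three example values from the list above; the `PriorityQueue` class and the request names are hypothetical, and this is not how the llm-d scheduler is implemented.

```python
import heapq
from itertools import count

# Priority values taken from the text above; higher runs first.
PRIORITY = {"premium": 100, "normal": 0, "batch": -10}


class PriorityQueue:
    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker keeps FIFO order within a level

    def push(self, request_id: str, level: str) -> None:
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-PRIORITY[level], next(self._arrival), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


q = PriorityQueue()
q.push("nightly-eval", "batch")
q.push("chat-1", "normal")
q.push("vip-chat", "premium")
q.push("chat-2", "normal")

order = [q.pop() for _ in range(len(q))]
print(order)  # ['vip-chat', 'chat-1', 'chat-2', 'nightly-eval']
```

The batch job arrived first but runs last, while requests at the same level keep their arrival order.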

    Where are the llm-d pods?

    The Helm chart installs the CRDs only. The actual llm-d Endpoint Picker (EPP) scheduler pods are created per-model when you deploy an LLMInferenceService.

    For example, if you deploy qwen-basic with two replicas, you get:

    • Two vLLM pods (qwen-basic-kserve-xxx).
    • One llm-d EPP scheduler pod (qwen-basic-kserve-router-scheduler-xxx).

    The EPP scheduler is specific to that model and routes requests only to that model's vLLM pods.

    Why it matters

    llm-d makes routing "intelligent" instead of "random." Without llm-d's intelligent routing, you're limited to round-robin, which means every tenth request might hit the one pod that's 90% full while nine other pods sit idle. With llm-d, the router is aware of the queue depth, cache state, and request priority, so it can send each request to the pod that will handle it fastest. The Helm chart installs the type definitions; KServe creates the actual EPP scheduler pods automatically when you deploy a model.

    LeaderWorkerSet operator (advanced workload orchestration)

    The LeaderWorkerSet (LWS) operator provides coordinated multi-pod deployments where pods work together as a logical unit with differentiated roles. Unlike standard Kubernetes deployments where all pods are identical and interchangeable, LWS enables diverse pod groups with leader-worker coordination.

    The problem it solves

    For advanced AI workloads like wide Expert Parallelism (wideEP) in Mixture of Experts (MoE) models, you need:

    • A leader pod with specific configuration parameters for coordination.
    • Worker pods with different configurations optimized for their role.
    • The ability to scale workers independently of the leader.
    • Co-location and synchronized lifecycle across the pod group.

    Standard Kubernetes deployments cannot express this, as they treat all pods identically. StatefulSet resources provide ordering but not role differentiation or coordinated scaling.

    LeaderWorkerSet enables the following operational patterns:

    • Dual-template design: Separate configurations for leader and worker pods (different resource limits, arguments, and environment variables).
    • Gang scheduling: All pods in a group are scheduled together as a unit; if one cannot be placed, none are created.
    • Topology awareness: Co-locate pod groups on the same infrastructure (nodes, racks, or availability zones).
    • Coordinated failure handling: When one pod fails, restart the entire group to maintain consistency.
    • Group-level rolling updates: Upgrade complete pod groups as units rather than individual pods.
    • Independent worker scaling: Scale workers (zero to N) without recreating the leader.
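Gang scheduling is the least intuitive pattern in this list, so here is a minimal Python sketch of the all-or-nothing rule. It assumes a simple first-fit placement over per-node free GPU counts; `gang_schedule` and the node names are hypothetical, and a real scheduler weighs many more constraints.

```python
from typing import Optional


def gang_schedule(pod_gpu_needs: list, free_gpus: dict) -> Optional[dict]:
    """Place every pod in the group or none of them.

    pod_gpu_needs: GPUs required by each pod (index 0 is the leader).
    free_gpus: free GPU count per node name.
    Returns {pod_index: node_name}, or None if any pod cannot be placed.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt changes nothing
    placement = {}
    for idx, need in enumerate(pod_gpu_needs):
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod does not fit, so the whole group is rejected
        remaining[node] -= need
        placement[idx] = node
    return placement


print(gang_schedule([8, 8], {"node-1": 8, "node-2": 8}))  # {0: 'node-1', 1: 'node-2'}
print(gang_schedule([8, 8], {"node-1": 8, "node-2": 4}))  # None
```

In the second call the leader would fit on its own, but placing it alone would tie up eight GPUs for a server that can never start.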

    LeaderWorkerSet also solves multi-node GPU allocation: if you need Tensor Parallelism (TP)=16 but each node has only eight GPUs, it coordinates two pods across nodes to work as one server with peer-to-peer communication.

    This enables large parallelism patterns:

    • Tensor Parallelism (TP): Split model layers across GPUs (for example, TP=16 across two nodes).
    • Data Parallelism (DP): Replicate model across GPUs (for example, DP=16 across two nodes).
    • Expert Parallelism (EP): Distribute MoE model experts across GPUs, with leader handling coordination.
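The arithmetic behind "TP=16 across two nodes" can be written down directly. This Python sketch assumes one pod per node and that each pod uses all of the node's GPUs; `pods_needed` and `rank_layout` are hypothetical helpers for illustration only.

```python
import math


def pods_needed(parallel_degree: int, gpus_per_node: int) -> int:
    """Number of one-per-node pods needed to supply `parallel_degree` GPUs."""
    return math.ceil(parallel_degree / gpus_per_node)


def rank_layout(parallel_degree: int, gpus_per_node: int) -> dict:
    """Map each pod index to the GPU ranks it would host."""
    return {
        pod: list(range(pod * gpus_per_node,
                        min((pod + 1) * gpus_per_node, parallel_degree)))
        for pod in range(pods_needed(parallel_degree, gpus_per_node))
    }


print(pods_needed(16, 8))     # 2
print(rank_layout(16, 8)[1])  # [8, 9, 10, 11, 12, 13, 14, 15]
```

Pod 0 would act as the leader in this layout, with pod 1 joining it as a worker in the same group.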

    Note

    The LeaderWorkerSet project also supports disaggregated inference patterns (separating prefill and decode phases) via DisaggregatedSet, but Red Hat AI Inference 3.4 doesn't include that capability yet.

    Deployed resources

    The platform creates the openshift-lws-operator namespace for this component.

    Running kubectl get deployment reveals the active management controllers:

    $ kubectl get deployment -n openshift-lws-operator
    NAME                     READY   UP-TO-DATE   AVAILABLE
    lws-controller-manager   2/2     2            2
    openshift-lws-operator   1/1     1            1
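
    If you want to check this output in a script, the READY column can be parsed with a few lines of Python. The sketch below works on the plain-text table shown above; the function names and the sample constant are illustrative. For anything beyond a quick check, parsing the output of kubectl get deployment -o json is likely to be more robust than splitting text columns.

```python
from typing import Dict, List, Tuple

# Sample text copied from the command output in this article.
SAMPLE = """
NAME                     READY   UP-TO-DATE   AVAILABLE
lws-controller-manager   2/2     2            2
openshift-lws-operator   1/1     1            1
"""


def parse_ready_column(table: str) -> Dict[str, Tuple[int, int]]:
    """Map each deployment name to (ready, desired) from the text table."""
    rows = [line.split() for line in table.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    name_col, ready_col = header.index("NAME"), header.index("READY")
    result: Dict[str, Tuple[int, int]] = {}
    for row in body:
        ready, desired = row[ready_col].split("/")
        result[row[name_col]] = (int(ready), int(desired))
    return result


def not_ready(table: str) -> List[str]:
    """Names of deployments with fewer ready replicas than desired."""
    parsed = parse_ready_column(table)
    return sorted(name for name, (ready, desired) in parsed.items() if ready < desired)
```

    Calling not_ready(SAMPLE) returns an empty list because both deployments report all replicas ready.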

    The corresponding cluster services expose metrics and webhook endpoints internally:

    $ kubectl get svc -n openshift-lws-operator
    NAME                                     TYPE        PORT(S)
    lws-controller-manager-metrics-service   ClusterIP   8443/TCP
    lws-webhook-service                      ClusterIP   443/TCP

    The deployed management workloads share specific operational responsibilities:

    • The lws-controller-manager deployment creates coordinated multi-pod groups where pods work together as one logical server.
    • The openshift-lws-operator deployment manages the LeaderWorkerSet controller lifecycle.

    Running kubectl get crd displays the custom API schemas installed for multi-pod orchestration:

    $ kubectl get crd | grep leaderworkerset
    leaderworkersets.leaderworkerset.x-k8s.io
    leaderworkersetoperators.operator.openshift.io

    The platform uses these definitions to manage how the cluster recognizes multi-pod groups:

    • The leaderworkersets.leaderworkerset.x-k8s.io CRD defines coordinated pod groups where multiple pods work together as one logical server.
    • The leaderworkersetoperators.operator.openshift.io CRD manages the LeaderWorkerSet operator installation and configuration.

    Why it matters

    Without LeaderWorkerSet, Kubernetes can't express that the distributed pods are parts of one server. You'd be limited to models that fit on your largest single node. With this operator, node boundaries stop being a constraint, allowing you to coordinate pods across nodes and treat your entire cluster as one large logical GPU pool.

    Cloud manager operator (cloud provider integration)

    The cloud manager operator integrates with cloud-specific features across Microsoft Azure, Amazon Web Services (AWS), and Google Cloud to manage autoscaling, spot instances, and resource quotas.

    The problem it solves

    Different clouds have distinct APIs to handle specific infrastructure tasks:

    • Autoscaling GPU nodes when demand increases.
    • Using spot or preemptible instances to reduce operational costs.
    • Enforcing resource quotas and limits.
    • Integrating with cloud-native load balancers.

    The cloud manager provides a unified interface so that LLMInferenceService resources work the same way across various cloud environments.
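
    To see why a unified interface helps, consider just one of these tasks: targeting spot capacity. Each managed Kubernetes service marks spot nodes with a different node label. The sketch below hides that difference behind one function. It is an illustration only and does not describe how the cloud manager is implemented; the label keys and values reflect the providers' conventions as commonly documented and should be verified against current provider documentation, and the function and constant names are hypothetical.

```python
from typing import Dict

# Node labels that identify spot capacity on each managed service
# (assumed from provider conventions; verify before relying on them).
SPOT_NODE_LABELS: Dict[str, Dict[str, str]] = {
    "eks": {"eks.amazonaws.com/capacityType": "SPOT"},
    "aks": {"kubernetes.azure.com/scalesetpriority": "spot"},
    "gke": {"cloud.google.com/gke-spot": "true"},
}


def spot_node_selector(provider: str) -> Dict[str, str]:
    """Return a nodeSelector that targets spot nodes on the given provider."""
    try:
        # Return a copy so callers cannot mutate the shared table.
        return dict(SPOT_NODE_LABELS[provider.lower()])
    except KeyError:
        raise ValueError(f"unsupported provider: {provider!r}") from None
```

    A caller asks for spot_node_selector("eks") or spot_node_selector("gke") and receives the provider-specific selector, so the calling code stays the same across clouds.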

    Deployed resources

    Running kubectl get deployment reveals the active provider integration controller:

    $ kubectl get deployment -n rhai-cloudmanager-system
    NAME                         READY   UP-TO-DATE   AVAILABLE
    aws-cloud-manager-operator   1/1     1            1

    This single aws-cloud-manager-operator deployment handles all cloud-specific integrations for the cluster.

    Running kubectl get crd displays the custom API schemas installed for multi-cloud integration:

    $ kubectl get crd | grep -E 'kubernetesengine'
    awskubernetesengines.infrastructure.opendatahub.io

    The awskubernetesengines.infrastructure.opendatahub.io CRD defines cloud-specific cluster configurations. For generic Kubernetes, this component is optional.

    Why it matters

    This component is what makes Red Hat AI Inference portable across clouds. The same LLMInferenceService YAML manifest works on Amazon EKS, Microsoft Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), or on-premises Kubernetes clusters. The cloud manager handles the underlying cloud-specific orchestration details.

    Red Hat AI Inference operator (platform orchestrator)

    The Red Hat AI Inference operator acts as a meta-controller that manages the entire platform installation and keeps all components in sync.

    The problem it solves

    The Red Hat AI Inference platform has many moving parts, including KServe, Istio, cert-manager, llm-d, and monitoring utilities. When you upgrade Red Hat AI Inference, all these components must be upgraded together, in a specific sequence, to maintain version compatibility.

    To address this challenge, the Red Hat AI Inference operator handles the following tasks:

    • Manages the lifecycle of all platform components.
    • Ensures compatible versions are installed together.
    • Handles upgrades and rollbacks.
    • Watches for configuration changes.
    • Reconciles desired state versus actual state.
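
    The last task, reconciling desired state against actual state, is the core pattern behind any Kubernetes operator. A minimal sketch of that comparison follows. It is not the operator's code; the component names and version strings in the usage note are made up for illustration.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, str], actual: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare desired and actual component versions and list the actions needed.

    Both arguments map a component name to a version string.
    Returns (action, component) pairs, where action is install, upgrade, or remove.
    """
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("install", name))
        elif actual[name] != desired[name]:
            actions.append(("upgrade", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("remove", name))
    return actions
```

    For example, with a desired state of {"kserve": "1.1", "istio": "2.0"} and an actual state of {"kserve": "1.0", "istio": "2.0"}, the function returns [("upgrade", "kserve")]. When the two states match, it returns an empty list, which is the steady state a controller loop aims for.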

    Deployed resources

    The platform installs these core control plane assets inside the redhat-ods-operator namespace. Running kubectl get deployment displays the active orchestrator controller:

    $ kubectl get deployment -n redhat-ods-operator
    NAME            READY   UP-TO-DATE   AVAILABLE
    rhai-operator   3/3     3            3

    The accompanying network services expose metrics and manage validation endpoints internally:

    $ kubectl get svc -n redhat-ods-operator
    NAME                                               TYPE        PORT(S)
    rhai-operator-controller-manager-metrics-service   ClusterIP   8443/TCP
    rhai-operator-webhook-service                      ClusterIP   443/TCP

    The deployed management workloads share structural and operational responsibilities across the platform:

    • The rhai-operator deployment acts as the meta-controller and runs three replicas for high availability, so if a single pod crashes, the others keep running.
    • The rhai-operator-controller-manager-metrics-service service exposes Prometheus metrics about the operator's health and reconciliation status.
    • The rhai-operator-webhook-service service validates and mutates platform configuration to ensure compatibility across components.

    Platform orchestration custom resource definitions

    The Red Hat AI Inference operator manages platform-level component configurations (like the kserves.components.platform.opendatahub.io CRD from the KServe section). These define which platform components to deploy and how to configure them.

    Why it matters

    Often described as the operator that operates operators, this component is what makes helm upgrade work smoothly. The Red Hat AI Inference operator coordinates every component upgrade so that version transitions happen in the correct order with compatible dependencies. Without this centralized orchestration, you would have to upgrade each operator individually, a manual and error-prone process.
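
    Upgrading in the correct order is, at heart, a topological sort over component dependencies. The sketch below uses Python's standard graphlib module to derive one valid order. The dependency graph is hypothetical and chosen only for illustration; the article does not state the actual upgrade sequence the operator uses.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependencies: each component maps to the components
# that must be upgraded before it.
DEPENDS_ON: Dict[str, Set[str]] = {
    "cert-manager": set(),
    "istio": {"cert-manager"},
    "kserve": {"cert-manager", "istio"},
    "llm-d": {"kserve"},
}


def upgrade_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every component follows its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

    With the graph above, upgrade_order(DEPENDS_ON) places cert-manager before Istio, Istio before KServe, and KServe before llm-d. A circular dependency raises an error instead of producing a misleading order.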

    Bringing the automated inference architecture together

    When you run helm install rhaii, you're not just deploying a container; you're standing up a complete AI inference platform with deployments, services, custom resource types, and an inference gateway working together across multiple namespaces.

    These components work together so you don't have to coordinate them manually:

    • cert-manager handles TLS certificates automatically.
    • Istio and Gateway API provide one stable endpoint that routes to various models.
    • KServe translates your ~30-line LLMInferenceService YAML into about 200 lines of coordinated Kubernetes resources.
    • llm-d enables queue-aware and cache-aware routing that standard Kubernetes Service resources can't achieve.
    • LeaderWorkerSet allows models to span multiple nodes when they exceed single-node GPU capacity.
    • Cloud manager keeps the same YAML portable across EKS, AKS, GKE, and on-premises environments.
    • Red Hat AI Inference operator orchestrates upgrades so all underlying components remain compatible.

    Once the platform is installed, you can deploy as many models as your cluster can support. Each deployment automatically receives intelligent routing, TLS, and multi-cloud portability with no per-model configuration needed.

    Learn more about the core technologies:

    • OpenDataHub RHAII on xKS deployment guide
    • KServe control plane documentation
    • Istio service mesh documentation
    • cert-manager installation and configuration guides

    Next steps: Deploying and testing your first model

    This article covered the core platform components deployed by helm install. Part 2 of this series walks through deploying a model, analyzing the runtime resources that KServe generates for each model (including vLLM pods, llm-d-router, and InferencePool), and validating the environment with GuideLLM to demonstrate intelligent routing capabilities.

    Ready to deploy Red Hat AI Inference on your cluster? Check out the Red Hat AI Inference on xKS documentation for complete installation instructions, or Red Hat OpenShift AI to explore the full platform.

    © 2026 Red Hat
