I recently joined Red Hat and wanted to explore and test Red Hat AI Inference with llm-d on Amazon Elastic Kubernetes Service (EKS) to understand how all of the components work together. To understand something well, I think you need to deploy it, especially when it comes to Kubernetes. I also find that digging into the custom resource definitions (CRDs) and each component in the control and data plane is extremely helpful for a beginner looking to understand Kubernetes services. So, I decided to do that here. After setting up a two-GPU cluster with NVIDIA L4s, I deployed a small language model to see exactly how it operates.
This is not about running llm-d with GuideLLM benchmark numbers (that's part 2). This article focuses on understanding the architecture, including what Kubernetes resources get created, how they connect, and why the Red Hat AI Inference components make these choices.
Helm deployment
The Red Hat AI Inference Stack for Kubernetes docs have complete instructions for installing Red Hat AI Inference, setting up pull secrets, cloud-specific configs, and troubleshooting.
My cluster includes:
- Amazon EKS 1.30.
- Two g6.2xlarge instances (NVIDIA L4 GPUs).
- Red Hat AI Inference 3.4.
# Verify it's running
kubectl get pods -n redhat-ods-applications
kubectl get gateway -A
Now let's see what actually got deployed.
Inside the Red Hat AI Inference platform deployment
The Helm chart installs several platform components. Each component creates a mix of Kubernetes resources:
- Deployments and pods run the actual controllers or gateways.
- Services expose those pods internally or externally via LoadBalancer.
- ConfigMaps and Secrets hold configuration and certificates.
- CRDs extend Kubernetes with new API types (Gateway, InferencePool, LeaderWorkerSet).
- Custom Resources are instances of the APIs (like the actual inference-gateway).
The platform depends on these basic components:
- cert-manager: TLS automation.
- Istio and Gateway API: Service mesh and external ingress.
- KServe and llm-d: Inference controllers and intelligent routing.
- LeaderWorkerSet: Advanced workload patterns (multi-node inference).
- Red Hat AI Inference operator: Platform lifecycle orchestration.
- Cloud Manager operator: Multi-cloud integration.
Here's what got deployed, organized by component:
RHAII Inference Platform
├── cert-manager (TLS automation)
│ ├── Namespaces: cert-manager, cert-manager-operator
│ ├── Workloads (4 deployments):
│ │ ├── cert-manager
│ │ ├── cert-manager-cainjector
│ │ ├── cert-manager-webhook
│ │ └── cert-manager-operator-controller-manager
│ ├── Services (4): Metrics and webhook endpoints
│ └── Purpose: Issues and renews TLS certificates
│
├── Istio / Gateway API (Service mesh & ingress)
│ ├── Namespace: istio-system
│ ├── Workloads (2 deployments):
│ │ ├── istiod (control plane)
│ │ └── servicemesh-operator3
│ ├── Services (2): Control plane and metrics
│ ├── CRDs (7): gateways, httproutes, grpcroutes, backendtlspolicies, etc.
│ └── Purpose: External LoadBalancer and intelligent routing
│
├── LeaderWorkerSet (Advanced workload orchestration)
│ ├── Namespace: openshift-lws-operator
│ ├── Workloads (2 deployments):
│ │ ├── lws-controller-manager (2 replicas)
│ │ └── openshift-lws-operator
│ ├── Services (2): Metrics and webhook
│ ├── CRDs (2): leaderworkersets, leaderworkersetoperators
│ └── Purpose: Coordinated multi-pod patterns (prefill/decode disaggregation)
│
├── KServe & llm-d (Inference platform)
│ ├── Namespace: redhat-ods-applications
│ ├── Workloads (2 deployments):
│ │ ├── llmisvc-controller-manager (KServe reconciler)
│ │ └── inference-gateway-istio (Envoy proxy)
│ ├── Services (3):
│ │ ├── inference-gateway-istio (LoadBalancer - external entry point)
│ │ ├── llmisvc-controller-manager (metrics)
│ │ └── llmisvc-webhook-server
│ ├── CRDs (8):
│ │ ├── KServe: llminferenceservices, llminferenceserviceconfigs, kserves
│ │ ├── llm-d: inferencepools (2 versions), inferenceobjectives,
│ │ │ inferencemodelrewrites, inferencepoolimports
│ ├── Custom Resources (1):
│ │ └── Gateway/inference-gateway (actual LoadBalancer instance)
│ └── Purpose: Orchestrates model deployments and intelligent routing
│
├── RHAII Operator (Platform lifecycle)
│ ├── Namespace: redhat-ods-operator
│ ├── Workloads (1 deployment):
│ │ └── rhai-operator (3 replicas for HA)
│ ├── Services (2): Metrics and webhook
│ └── Purpose: Orchestrates installation and upgrades
│
└── Cloud Manager (Multi-cloud portability)
├── Namespace: rhai-cloudmanager-system
├── Workloads (1 deployment):
│ └── aws-cloud-manager-operator
└── Purpose: Cloud-specific integrations (Azure/AWS/GCP)
The platform consists of:
- Seven namespaces.
- 12 deployments.
- 13 services (including external LoadBalancer).
- 17 CRDs.
- One Gateway instance (the actual entry point).
All of this is shared infrastructure. You install it once per cluster, then deploy your models on top of it.
Verify the platform is ready by checking the gateway:
$ kubectl get gateway -n redhat-ods-applications
NAME CLASS ADDRESS PROGRAMMED
inference-gateway istio k8s-redhatod-inferenc-xxx.elb.us-east-1.amazonaws.com True
The PROGRAMMED: True status confirms the platform has an externally reachable entry point through the Istio-backed Gateway API. You can now deploy models.
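If you want to script this readiness check, the same signal is available in the Gateway's status conditions. The sketch below is a minimal example that inspects a hand-written, trimmed sample of what `kubectl get gateway inference-gateway -o json` might return. The sample document is an assumption for illustration, not captured output from my cluster.

```python
import json


def gateway_is_programmed(gateway: dict) -> bool:
    """Return True if the Gateway has a Programmed condition with status "True"."""
    conditions = gateway.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Programmed" and c.get("status") == "True"
        for c in conditions
    )


# Hypothetical, trimmed sample of a Gateway object (not real cluster output).
sample = json.loads("""
{
  "metadata": {"name": "inference-gateway", "namespace": "redhat-ods-applications"},
  "status": {
    "conditions": [
      {"type": "Accepted", "status": "True"},
      {"type": "Programmed", "status": "True"}
    ]
  }
}
""")

print(gateway_is_programmed(sample))
```

In a real script you would feed the function the parsed output of the kubectl command instead of the embedded sample.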
Let's take a look at each of these components and what they deploy.
cert-manager (TLS certificate management)
The cert-manager component automatically creates and renews TLS certificates for secure communication between components.
The problem it solves
This automation fixes a significant operational headache: because KServe and vLLM pods expose metrics over HTTPS, a standard Prometheus instance cannot safely scrape metrics without valid certificates. Without cert-manager stepping in to prevent validation errors, platform teams would have to manually create, mount, and rotate certificates for every single new model deployment.
Deployed resources
The platform creates two namespaces for this component: cert-manager-operator and cert-manager.
Running kubectl get deployment reveals the active controller applications running across both namespaces:
$ kubectl get deployment -n cert-manager
NAME READY UP-TO-DATE AVAILABLE
cert-manager 1/1 1 1
cert-manager-cainjector 1/1 1 1
cert-manager-webhook 1/1 1 1
$ kubectl get deployment -n cert-manager-operator
NAME READY UP-TO-DATE AVAILABLE
cert-manager-operator-controller-manager 1/1 1 1
The platform maps these controllers to internal cluster communication endpoints to manage metrics and webhook validation traffic:
$ kubectl get svc -n cert-manager
NAME TYPE PORT(S)
cert-manager ClusterIP 9402/TCP
cert-manager-cainjector ClusterIP 9402/TCP
cert-manager-webhook ClusterIP 443/TCP
$ kubectl get svc -n cert-manager-operator
NAME TYPE PORT(S)
cert-manager-operator-controller-manager-metrics-svc ClusterIP 8443/TCP
The deployed resources divide the certificate management workload across several specific roles:
- The cert-manager deployment acts as the core controller that issues certificates.
- The cert-manager-cainjector deployment injects CA bundles into webhooks and API services.
- The cert-manager-webhook deployment validates incoming certificate requests.
- The cert-manager-operator deployment manages the cert-manager installation lifecycle.
Why it matters
For internal services, such as vLLM metrics and Prometheus scraping, self-signed certificates work fine. For external-facing services like the inference gateway, you can configure cert-manager with an ACME issuer such as Let's Encrypt to obtain trusted certificates automatically. Either way, cert-manager handles the entire lifecycle so that you never interact with certificate files manually.
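To make "handles the entire lifecycle" a bit more concrete, here is a rough sketch of the renewal arithmetic. It assumes the renewal window defaults to one third of the certificate lifetime when nothing is configured, which is my understanding of cert-manager's default behavior; treat that fraction as an assumption and check the renewBefore setting in your version. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def renewal_time(
    not_before: datetime,
    not_after: datetime,
    renew_before: Optional[timedelta] = None,
) -> datetime:
    """Return the moment a certificate should be renewed.

    Assumption: with no explicit renew_before, renew once one third
    of the lifetime remains.
    """
    lifetime = not_after - not_before
    if renew_before is None:
        renew_before = lifetime / 3
    return not_after - renew_before


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
print(renewal_time(issued, expires))
```

For a 90-day certificate, this places renewal 60 days after issuance, which leaves a 30-day buffer if the first renewal attempt fails.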
Sail operator and inference gateway (Istio Gateway API)
The inference gateway provides a central entry point for all model requests using the Kubernetes Gateway API backed by the Istio service mesh. The platform orchestrates these ingress resources declaratively through the open source Sail operator.
The problem it solves
This architecture solves a common challenge in production AI deployments: routing traffic to multiple models from a single external endpoint. Traditional Kubernetes Ingress can handle basic path routing, but Gateway API offers more flexibility because it provides advanced routing capabilities. For example, the gateway can route incoming network requests to completely different backend models based entirely on the URL path:
| Request URL | Target model |
|---|---|
| https://gateway/llm-test/qwen-basic/v1/completions | qwen-basic |
| https://gateway/llm-test/llama-7b/v1/completions | llama-7b |
| https://gateway/llm-test/mistral-8x7b/v1/completions | mistral-8x7b |
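The path-based mapping in the table can be sketched in a few lines. This is not how Envoy implements it; it is a simplified model of prefix matching on whole path segments, where the longest matching prefix wins. The route table is a hypothetical stand-in for the HTTPRoute rules.

```python
from typing import Optional

# Hypothetical route table mirroring the URL paths in the table above.
ROUTES = {
    "/llm-test/qwen-basic": "qwen-basic",
    "/llm-test/llama-7b": "llama-7b",
    "/llm-test/mistral-8x7b": "mistral-8x7b",
}


def pick_backend(path: str) -> Optional[str]:
    """Return the backend whose prefix matches the path; longest prefix wins."""
    matches = [
        prefix
        for prefix in ROUTES
        # Match whole path segments only, so /llm-test/llama-7b-chat does not match.
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        return None
    return ROUTES[max(matches, key=len)]


print(pick_backend("/llm-test/llama-7b/v1/completions"))
print(pick_backend("/unknown/v1/completions"))
```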
The Gateway API provides several operational advantages over standard ingress controllers:
- Header-based routing: You can route incoming requests to specific destinations based on model version headers.
- Weighted traffic splits: You can send 90% of traffic to model-v1 and 10% to model-v2 to handle A/B testing and deployment rollouts.
- Custom backends: You can route traffic directly to an InferencePool instead of a standard Service, which enables intelligent scheduling.
- Per-route policies: You can set custom timeouts for long-running inference requests, enforce rate limits, or apply retry policies for individual models.
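The weighted traffic split from the list above is easy to reason about with a small simulation. The sketch below picks a backend in proportion to its weight, using the 90/10 example from the text. The backend names and the fixed random seed are illustrative assumptions, and the real gateway makes this decision inside Envoy, not in Python.

```python
import random
from collections import Counter


def pick_weighted(backends: dict, rng: random.Random) -> str:
    """Pick one backend name with probability proportional to its weight."""
    names = list(backends)
    weights = [backends[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Weights from the A/B example in the text: 90% to v1, 10% to v2.
backends = {"model-v1": 90, "model-v2": 10}
rng = random.Random(0)  # fixed seed so repeated runs give the same counts

counts = Counter(pick_weighted(backends, rng) for _ in range(10_000))
print(counts)
```

Over many requests the observed share converges on the configured weights, which is why a small canary weight is a low-risk way to expose a new model version.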
Deployed resources
During installation, the platform creates the istio-system namespace to house the core control plane components. This namespace runs two primary deployments that manage your routing infrastructure:
$ kubectl get deployment -n istio-system
NAME READY UP-TO-DATE AVAILABLE
istiod 1/1 1 1
servicemesh-operator3 1/1 1 1
The corresponding services expose these control plane endpoints internally:
$ kubectl get svc -n istio-system
NAME TYPE PORT(S)
istiod ClusterIP 15010/TCP,15012/TCP,443/TCP
servicemesh-operator3-metrics ClusterIP 8443/TCP
These core control plane components divide the work of managing cluster ingress:
- The servicemesh-operator3 (Sail operator) deployment manages the Istio control plane lifecycle. Sail makes Istio Kubernetes-native. You define your desired state with custom resources, and Sail deploys and upgrades Istio declaratively. The "3" in the resource name refers to Red Hat OpenShift Service Mesh 3, which uses Sail instead of the legacy Maistra-based operators.
- The istiod deployment acts as the core Istio control plane. It configures the underlying Envoy proxies and implements Gateway API routing rules.
Gateway and HTTPRoute custom resources
The platform installs a collection of CRDs to extend the standard Kubernetes networking capabilities:
$ kubectl get crd | grep gateway
backendtlspolicies.gateway.networking.k8s.io
gatewayclasses.gateway.networking.k8s.io
gateways.gateway.networking.k8s.io
grpcroutes.gateway.networking.k8s.io
httproutes.gateway.networking.k8s.io
referencegrants.gateway.networking.k8s.io
gateways.networking.istio.io
These definitions split management responsibilities across distinct networking tasks:
- Core gateway resources: The gateways.gateway.networking.k8s.io CRD defines specific gateway instances, such as the inference-gateway, that trigger the creation of cloud load balancers. The gatewayclasses.gateway.networking.k8s.io CRD defines implementation types for those gateways, including the istio class powered by Envoy proxies.
- Intelligent routing: The httproutes.gateway.networking.k8s.io CRD maps HTTP traffic paths like /qwen-basic to their respective backend models. The grpcroutes.gateway.networking.k8s.io CRD governs routing for model servers that communicate using gRPC-based inference APIs.
- Security and policies: The backendtlspolicies.gateway.networking.k8s.io CRD manages TLS encryption for traffic flowing from the gateway to internal backends. The referencegrants.gateway.networking.k8s.io CRD enables security-focused cross-namespace routing, allowing a shared gateway instance to reach services running in other namespaces.
- Istio compatibility: The gateways.networking.istio.io CRD keeps the legacy Istio gateway configuration format available for backward compatibility with older deployments.
Gateway proxy runtime locations
The actual gateway Envoy proxy runs inside the redhat-ods-applications namespace:
$ kubectl get gateway -n redhat-ods-applications
NAME CLASS ADDRESS PROGRAMMED
inference-gateway istio k8s-redhatod-inferenc-xxx.elb.us-east-1.amazonaws.com True
This Gateway resource instructs Istio to create the actual Envoy proxy deployment:
$ kubectl get deployment -n redhat-ods-applications | grep gateway
NAME READY UP-TO-DATE AVAILABLE
inference-gateway-istio 1/1 1 1
Note
You might notice another deployment named llmisvc-controller-manager running in this namespace. The llmisvc-controller-manager component serves as the KServe controller; it shares this namespace but isn't part of the gateway component.
What components sit behind the inference gateway?
You can visualize how the platform maps the root gateway configuration down to physical runtime pods by using the kubectl tree utility:
$ kubectl tree gateway inference-gateway -n redhat-ods-applications
Gateway/inference-gateway
├── Deployment/inference-gateway-istio (Istio Envoy proxy)
│ └── ReplicaSet/inference-gateway-istio-xxx
│ └── Pod/inference-gateway-istio-xxx
├── Service/inference-gateway-istio (ClusterIP - internal)
└── Service/inference-gateway-istio-lb (LoadBalancer - external)
The cloud provider automatically provisions an external network load balancer to match this configuration:
$ kubectl get svc inference-gateway-istio -n redhat-ods-applications
NAME TYPE EXTERNAL-IP PORT(S)
inference-gateway-istio LoadBalancer k8s-redhatod-inferenc-xxx.elb.us-east-1.amazonaws.com 80:32626/TCP
This LoadBalancer is how external clients (including your application, curl commands, and GuideLLM testing utilities) reach the inference models. All models share this single gateway, and HTTPRoute rules determine which model handles each request.
Why it matters
Transitioning to the Gateway API and Istio provides advanced routing capabilities that standard Kubernetes Ingress controllers lack. You get one stable endpoint (the LoadBalancer) that can route traffic to dozens of models, with built-in traffic splitting for A/B tests, header-based routing for model versions, and the ability to route to custom backends like an InferencePool. Your client applications simply target different URL paths, allowing you to update or scale backend models without modifying client-side application code.
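From the client's point of view, switching models is just a different URL path. The sketch below builds (but does not send) a completion request against the gateway, following the /llm-test/<model>/v1/completions pattern shown earlier. The host name, the plain-HTTP default, and the assumption that the served model name equals the route name are all placeholders you would adjust for your cluster.

```python
import json
from urllib.request import Request


def completion_request(
    gateway_host: str,
    model: str,
    prompt: str,
    namespace: str = "llm-test",
    scheme: str = "http",  # assumption: the sample service listens on port 80
) -> Request:
    """Build a POST request for the OpenAI-style completions endpoint."""
    url = f"{scheme}://{gateway_host}/{namespace}/{model}/v1/completions"
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 32})
    return Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# gateway.example.com is a placeholder for the load balancer address.
req = completion_request("gateway.example.com", "qwen-basic", "Hello")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it. Pointing the same client at llama-7b only requires changing the model argument.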
KServe (model serving framework)
KServe is the control plane for model serving. In practice, it translates your declared intent into various Kubernetes resources needed to serve a model. It provides the LLMInferenceService CRD and an operator that reconciles your YAML into actual workloads, handling the deployments, networking, and runtime configuration needed to make a model available for inference.
The problem it solves
Deploying a model to production isn't just "run vLLM." You need to solve a whole class of coordination problems:
- Cross-component configuration: Settings like the tensor parallelism degree need to be coordinated across your manifest. If you use tensor parallelism across two GPUs, you must reflect that setting in both the serving configuration and the GPU resource request to keep them in sync.
- Router-aware serving features: Some configurations require the router to recognize and compose with the model server. For example, if you want the router to make smarter decisions about where to send requests based on what's already cached in GPU memory, the router and model server need a direct communication channel. That wiring has to be configured consistently across both components, which is how the ZeroMQ (ZMQ) side channel and key-value (KV) cache event loops function together.
- Traffic exposure: You need to expose exactly the right subset of traffic. Misconfigured routing is how you accidentally make internal model endpoints publicly accessible. This type of misconfiguration can expose fine-tuning adapters or other model internals you never intended to be reachable.
- Traffic management: You must align your serving configuration with HTTPRoute definitions to expose the right endpoints, InferencePool resources to group backends, and proper traffic splitting for A/B testing.
- Security and access control: You must create service accounts, RBAC policies, TLS certificates, and webhook configurations across every cluster component.
Without automated coordination, platform teams manually create and synchronize more than 200 lines of YAML configuration per model. This overhead includes managing separate definitions for deployments, services, HTTPRoute mappings, InferencePool groups, certificates, RBAC policies, and model-server configurations. Every change ripples across multiple resources.
For example, scaling from two replicas to four forces you to manually update the deployment manifest, reconfigure the InferencePool backends, adjust the HTTPRoute weights, and verify that RBAC policies still function correctly without falling out of sync.
KServe eliminates this manual operational overhead by allowing teams to define the target inference service in a single manifest of approximately 30 lines of YAML. KServe then owns the implementation. If you change a configuration from replicas: 2 to replicas: 4, KServe updates the vLLM deployment, adjusts the InferencePool backends, and rebalances routing automatically.
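The core idea is a single source of truth: every dependent setting is derived from one small spec, so nothing can drift. The sketch below is a toy model of that idea, not KServe's reconciler. ModelSpec and the rendered resource shapes are hypothetical simplifications. The only real names are vLLM's --tensor-parallel-size flag and the nvidia.com/gpu resource.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    """Hypothetical, simplified stand-in for an LLMInferenceService spec."""
    name: str
    replicas: int
    tensor_parallel: int


def render(spec: ModelSpec) -> dict:
    """Derive every dependent setting from the one spec so they stay in sync."""
    return {
        "deployment": {
            "name": spec.name,
            "replicas": spec.replicas,
            # The serving flag and the GPU request come from the same field.
            "args": [f"--tensor-parallel-size={spec.tensor_parallel}"],
            "resources": {"limits": {"nvidia.com/gpu": spec.tensor_parallel}},
        },
        "pool": {"name": f"{spec.name}-pool", "expected_backends": spec.replicas},
        "route": {
            "path_prefix": f"/llm-test/{spec.name}",
            "backend": f"{spec.name}-pool",
        },
    }


def changed_resources(old: dict, new: dict) -> list:
    """Names of rendered resources whose content differs between two renders."""
    return sorted(key for key in new if old.get(key) != new[key])


before = render(ModelSpec("qwen-basic", replicas=2, tensor_parallel=1))
after = render(ModelSpec("qwen-basic", replicas=4, tensor_parallel=1))
print(changed_resources(before, after))
```

Changing one field in the spec updates every resource that depends on it and leaves the rest untouched, which is the behavior you want from a controller.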
Deployed resources
The platform deploys the core management workloads inside the shared redhat-ods-applications namespace. Running kubectl get deployment highlights the active reconciler:
$ kubectl get deployment -n redhat-ods-applications | grep llmisvc
NAME READY UP-TO-DATE AVAILABLE
llmisvc-controller-manager 1/1 1 1
The accompanying network services expose metrics and manage validation webhooks:
$ kubectl get svc -n redhat-ods-applications | grep llmisvc
NAME TYPE PORT(S)
llmisvc-controller-manager ClusterIP 8443/TCP
llmisvc-webhook-server-service ClusterIP 443/TCP
These core system resources split coordination duties across two specific roles:
- The llmisvc-controller-manager deployment acts as the KServe controller, watching LLMInferenceService resources and reconciling them into vLLM deployments, services, routes, and pools.
- The llmisvc-webhook-server-service service fronts the admission webhook that validates and mutates LLMInferenceService resources before they're created.
KServe custom resource definitions
The framework installs several CRDs to define the model serving APIs within the cluster:
$ kubectl get crd | grep -E 'llminferenceservice|kserve'
kserves.components.platform.opendatahub.io
llminferenceserviceconfigs.serving.kserve.io
llminferenceservices.serving.kserve.io
These CRDs define the API for serving models:
- The llminferenceservices.serving.kserve.io CRD acts as the primary API, defining configuration fields you can use in your LLMInferenceService YAML (such as model URI, replicas, and GPU limits).
- The llminferenceserviceconfigs.serving.kserve.io CRD provides reusable serving configurations that can be referenced by multiple models.
- The kserves.components.platform.opendatahub.io CRD handles platform-level KServe configuration.
Why automated model coordination matters
Every other component in this stack—the gateway, the scheduler, the model server—has a specific job to do at runtime. KServe makes sure they all exist, remain configured correctly, and adapt to state changes automatically. Without this automation, that coordination burden falls entirely on your platform team, and it compounds with every new model you deploy.
llm-d (intelligent inference router)
The llm-d project provides the custom resource definitions that enable intelligent model routing. Unlike other components, the Helm chart only installs the CRDs—the actual llm-d EPP (Endpoint Picker) scheduler pods are deployed per-model when you create an LLMInferenceService.
The problem it solves
This dynamic routing layer resolves a critical limitation of standard Kubernetes services, which spread requests across pods without looking at what each pod is doing (round-robin is the simplest case). For example, with two pods, a round-robin service sends the first request to the first pod, the second request to the second pod, and the third request back to the first pod, even if that first pod is still processing the initial query.
Round-robin doesn't know which pod is busy, which one has relevant context cached in GPU memory, or which requests are high-priority. For LLM serving, this means:
- Requests queue up behind slow requests instead of going to idle pods.
- Cache-friendly requests (same prompt prefix) don't land on the pod with that prefix already in memory.
- Batch jobs and premium requests get the same treatment.
llm-d enables:
- Load-aware routing: The system routes incoming queries based on the current pod load by analyzing the queue depth of both vLLM and the llm-d router, alongside the number of active requests currently in flight.
- Cache-aware routing: The router tracks the KV cache state to minimize cache misses. It automatically directs traffic to pods that already have the prompt prefix cached, or to pods with available cache space to encourage an even distribution of prefix allocation across the system.
- Priority-based flow control: The system references client request parameters to let premium, high-priority queries bypass the standard queue ahead of long-running batch jobs.
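To see why load and cache awareness matter, compare round-robin with a simple scoring function. This is a toy model, not the llm-d scheduler: the Pod fields, the scoring formula, and the cache bonus value are all assumptions chosen to illustrate the idea.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Pod:
    name: str
    queue_depth: int  # requests waiting or in flight on this pod
    cached_prefixes: set = field(default_factory=set)


def score(pod: Pod, prefix: str, cache_bonus: int = 5) -> int:
    """Lower is better: queue depth, minus a bonus for a prefix-cache hit."""
    bonus = cache_bonus if prefix in pod.cached_prefixes else 0
    return pod.queue_depth - bonus


def pick_pod(pods: list, prefix: str) -> Pod:
    """Choose the pod with the best (lowest) score for this request."""
    return min(pods, key=lambda pod: score(pod, prefix))


pods = [
    Pod("vllm-0", queue_depth=8),
    Pod("vllm-1", queue_depth=3, cached_prefixes={"system-prompt-a"}),
    Pod("vllm-2", queue_depth=1),
]

round_robin = cycle(pods)
print(next(round_robin).name)                  # first in rotation, busy or not
print(pick_pod(pods, "system-prompt-a").name)  # cache hit outweighs the queue
print(pick_pod(pods, "never-seen").name)       # no cache hit: shortest queue
```

Round-robin hands the request to vllm-0 even though it has the deepest queue. The scoring picker sends a cached prompt to vllm-1 and an unseen prompt to the idle vllm-2.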
Deployed resources
Running kubectl get crd reveals the custom API definitions installed by the platform to manage network routing behavior:
$ kubectl get crd | grep inference.networking
inferencemodelrewrites.inference.networking.x-k8s.io
inferenceobjectives.inference.networking.x-k8s.io
inferencepoolimports.inference.networking.x-k8s.io
inferencepools.inference.networking.k8s.io
inferencepools.inference.networking.x-k8s.io
These installed CRDs expand the cluster's control plane capability to handle advanced model backends:
- The inferencepools.inference.networking.k8s.io CRD defines the stable, standard version of the InferencePool API.
- The inferencepools.inference.networking.x-k8s.io CRD provides an extended version of the InferencePool API with additional features.
- The inferenceobjectives.inference.networking.x-k8s.io CRD defines priority levels (premium=100, normal=0, batch=-10) to be included in the request body from clients.
- The inferencemodelrewrites.inference.networking.x-k8s.io CRD rewrites model names in requests (for example, mapping gpt-4 to llama-70b for a drop-in replacement).
- The inferencepoolimports.inference.networking.x-k8s.io CRD imports external inference endpoints as pools (for example, OpenAI API or Amazon Bedrock).
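The priority levels can be pictured as a priority queue in front of the model servers. The sketch below uses the three values quoted in the list (premium=100, normal=0, batch=-10). The FlowQueue class itself is hypothetical and only illustrates the ordering, not the actual flow-control implementation.

```python
import heapq
from itertools import count

# Priority values as quoted in the text.
PRIORITY = {"premium": 100, "normal": 0, "batch": -10}


class FlowQueue:
    """Toy request queue: highest priority first, FIFO within a tier."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = count()

    def push(self, request_id: str, tier: str) -> None:
        # heapq is a min-heap, so negate the priority. The sequence number
        # preserves arrival order among requests of equal priority.
        heapq.heappush(self._heap, (-PRIORITY[tier], next(self._seq), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


queue = FlowQueue()
queue.push("batch-1", "batch")
queue.push("normal-1", "normal")
queue.push("premium-1", "premium")
queue.push("normal-2", "normal")

served = [queue.pop() for _ in range(4)]
print(served)
```

The premium request is served first even though it arrived third, and the batch job waits until everything else has drained.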
Where are the llm-d pods?
The Helm chart installs the CRDs only. The actual llm-d Endpoint Picker (EPP) scheduler pods are created per-model when you deploy an LLMInferenceService.
For example, if you deploy qwen-basic with two replicas, you get:
- Two vLLM pods (qwen-basic-kserve-xxx).
- One llm-d EPP scheduler pod (qwen-basic-kserve-router-scheduler-xxx).
The EPP scheduler is specific to that model and routes requests only to that model's vLLM pods.
Why it matters
llm-d makes routing "intelligent" instead of "random." Without llm-d's intelligent routing, you're limited to round-robin, which means that with ten pods, every tenth request might hit the one pod that's 90% full while nine other pods sit idle. With llm-d, the router is aware of the queue depth, cache state, and request priority, so it can send each request to the pod that will handle it fastest. The Helm chart installs the type definitions; KServe creates the actual EPP scheduler pods automatically when you deploy a model.
LeaderWorkerSet operator (advanced workload orchestration)
The LeaderWorkerSet (LWS) operator provides coordinated multi-pod deployments where pods work together as a logical unit with differentiated roles. Unlike standard Kubernetes deployments where all pods are identical and interchangeable, LWS enables diverse pod groups with leader-worker coordination.
The problem it solves
For advanced AI workloads like wide Expert Parallelism (wideEP) in Mixture of Experts (MoE) models, you need:
- A leader pod with specific configuration parameters for coordination.
- Worker pods with different configurations optimized for their role.
- The ability to scale workers independently of the leader.
- Co-location and synchronized lifecycle across the pod group.
Standard Kubernetes deployments cannot express this, as they treat all pods identically. StatefulSet resources provide ordering but not role differentiation or coordinated scaling.
LeaderWorkerSet enables the following operational patterns:
- Dual-template design: Separate configurations for leader and worker pods (different resource limits, arguments, and environment variables).
- Gang scheduling: All pods in a group are scheduled together as a unit; if one cannot be placed, none are created.
- Topology awareness: Co-locate pod groups on the same infrastructure (nodes, racks, or availability zones).
- Coordinated failure handling: When one pod fails, restart the entire group to maintain consistency.
- Group-level rolling updates: Upgrade complete pod groups as units rather than individual pods.
- Independent worker scaling: Scale workers (zero to N) without recreating the leader.
This also solves multi-node GPU allocation: if you need Tensor Parallelism (TP)=16 but each node has only eight GPUs, LeaderWorkerSet coordinates two pods across nodes to work as one server with peer-to-peer communication.
This enables large parallelism patterns:
- Tensor Parallelism (TP): Split model layers across GPUs (for example, TP=16 across two nodes).
- Data Parallelism (DP): Replicate model across GPUs (for example, DP=16 across two nodes).
- Expert Parallelism (EP): Distribute MoE model experts across GPUs, with leader handling coordination.
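The multi-node arithmetic is worth spelling out. The helper below computes how many pods one logical server needs when the tensor parallelism degree exceeds the GPUs on a node, assuming one pod per node and an even split. The function name and the returned layout are hypothetical, not part of the LeaderWorkerSet API.

```python
import math


def group_layout(tensor_parallel: int, gpus_per_node: int) -> dict:
    """Split one model server across pods, one pod per node (illustrative only)."""
    if tensor_parallel < 1 or gpus_per_node < 1:
        raise ValueError("both values must be positive")
    if tensor_parallel > gpus_per_node and tensor_parallel % gpus_per_node:
        raise ValueError("TP must divide evenly across full nodes")
    pods = math.ceil(tensor_parallel / gpus_per_node)
    return {
        "pods_per_group": pods,
        "leader": 1,
        "workers": pods - 1,
        "gpus_per_pod": tensor_parallel // pods,
    }


print(group_layout(16, 8))  # the TP=16 on eight-GPU nodes case from the text
print(group_layout(2, 8))   # fits on one node, so no workers are needed
```

With TP=16 and eight GPUs per node, the group is one leader plus one worker, each holding eight GPUs, which matches the two-pod example above.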
Note
The LeaderWorkerSet project also supports disaggregated inference patterns (separating prefill and decode phases) via DisaggregatedSet, but Red Hat AI Inference 3.4 doesn't include that capability yet.
Deployed resources
The platform creates the openshift-lws-operator namespace for this component.
Running kubectl get deployment reveals the active management controllers:
$ kubectl get deployment -n openshift-lws-operator
NAME READY UP-TO-DATE AVAILABLE
lws-controller-manager 2/2 2 2
openshift-lws-operator 1/1 1 1
The corresponding cluster services expose metrics and webhook endpoints internally:
$ kubectl get svc -n openshift-lws-operator
NAME TYPE PORT(S)
lws-controller-manager-metrics-service ClusterIP 8443/TCP
lws-webhook-service ClusterIP 443/TCP
The deployed management workloads share specific operational responsibilities:
- The lws-controller-manager deployment creates coordinated multi-pod groups where pods work together as one logical server.
- The openshift-lws-operator deployment manages the LeaderWorkerSet controller lifecycle.
Running kubectl get crd displays the custom API schemas installed for multi-pod orchestration:
$ kubectl get crd | grep leaderworkerset
leaderworkersets.leaderworkerset.x-k8s.io
leaderworkersetoperators.operator.openshift.io
The platform uses these definitions to manage how the cluster recognizes multi-pod groups:
- The leaderworkersets.leaderworkerset.x-k8s.io CRD defines coordinated pod groups where multiple pods work together as one logical server.
- The leaderworkersetoperators.operator.openshift.io CRD manages the LeaderWorkerSet operator installation and configuration.
Why it matters
Without LeaderWorkerSet, Kubernetes can't express that the distributed pods are parts of one server. You'd be limited to models that fit on your largest single node. With this operator, node boundaries stop being a constraint, allowing you to coordinate pods across nodes and treat your entire cluster as one large logical GPU pool.
Cloud manager operator (cloud provider integration)
The cloud manager operator integrates with cloud-specific features across Microsoft Azure, Amazon Web Services (AWS), and Google Cloud to manage autoscaling, spot instances, and resource quotas.
The problem it solves
Different clouds have distinct APIs to handle specific infrastructure tasks:
- Autoscaling GPU nodes when demand increases.
- Using spot or preemptible instances to reduce operational costs.
- Enforcing resource quotas and limits.
- Integrating with cloud-native load balancers.
The cloud manager provides a unified interface so that LLMInferenceService resources work the same way across various cloud environments.
Deployed resources
Running kubectl get deployment reveals the active provider integration controller:
$ kubectl get deployment -n rhai-cloudmanager-system
NAME READY UP-TO-DATE AVAILABLE
aws-cloud-manager-operator 1/1 1 1
This single aws-cloud-manager-operator deployment handles all cloud-specific integrations for the cluster.
Running kubectl get crd displays the custom API schemas installed for multi-cloud integration:
$ kubectl get crd | grep -E 'kubernetesengine'
awskubernetesengines.infrastructure.opendatahub.io
The awskubernetesengines.infrastructure.opendatahub.io CRD defines cloud-specific cluster configurations. For generic Kubernetes, this component is optional.
Why it matters
This component is what makes Red Hat AI Inference portable across clouds. The same LLMInferenceService YAML manifest works on Amazon EKS, Microsoft Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), or on-premise Kubernetes clusters. The cloud manager handles the underlying cloud-specific orchestration details.
Red Hat AI Inference operator (platform orchestrator)
The Red Hat AI Inference operator acts as a meta-controller that manages the entire platform installation and keeps all components in sync.
The problem it solves
The Red Hat AI Inference platform has many moving parts, including KServe, Istio, cert-manager, llm-d, and monitoring utilities. When you upgrade Red Hat AI Inference, all these components must upgrade collectively in a specific sequence to maintain version compatibility.
To address this challenge, the Red Hat AI Inference operator handles the following tasks:
- Manages the lifecycle of all platform components.
- Ensures compatible versions are installed together.
- Handles upgrades and rollbacks.
- Watches for configuration changes.
- Reconciles desired state versus actual state.
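The last item, reconciling desired state against actual state, is the pattern behind every Kubernetes controller. The sketch below reduces it to a dictionary comparison. The component names come from this article, but the version strings and the legacy-addon entry are made up for illustration, and this is not the operator's real logic.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare desired and actual component versions and list what to do."""
    return {
        "install": sorted(set(desired) - set(actual)),
        "upgrade": sorted(
            name
            for name in desired
            if name in actual and actual[name] != desired[name]
        ),
        "remove": sorted(set(actual) - set(desired)),
    }


# Hypothetical versions, for illustration only.
desired = {"kserve": "v2", "istio": "v5", "cert-manager": "v3"}
actual = {"kserve": "v1", "istio": "v5", "legacy-addon": "v9"}

print(plan(desired, actual))
```

A real operator runs this comparison continuously and acts on the result, so manual drift gets corrected on the next reconcile.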
Deployed resources
The platform installs these core control plane assets inside the redhat-ods-operator namespace. Running kubectl get deployment displays the active orchestrator controller:
$ kubectl get deployment -n redhat-ods-operator
NAME READY UP-TO-DATE AVAILABLE
rhai-operator 3/3 3 3
The accompanying network services expose metrics and manage validation endpoints internally:
$ kubectl get svc -n redhat-ods-operator
NAME TYPE PORT(S)
rhai-operator-controller-manager-metrics-service ClusterIP 8443/TCP
rhai-operator-webhook-service ClusterIP 443/TCP
The deployed management workloads share structural and operational responsibilities across the platform:
- The rhai-operator deployment acts as the meta-controller, utilizing three replicas for high availability so that if a single pod crashes, the others keep running.
- The rhai-operator-controller-manager-metrics-service service exposes Prometheus metrics regarding the operator's health and reconciliation status.
- The rhai-operator-webhook-service service fronts the webhook that validates and mutates platform configuration to ensure compatibility across components.
Platform orchestration custom resource definitions
The Red Hat AI Inference operator manages platform-level component configurations (like the kserves.components.platform.opendatahub.io CRD from the KServe section). These define which platform components to deploy and how to configure them.
Why it matters
Often described as the operator that operates operators, this component is what makes helm upgrade work smoothly. The Red Hat AI Inference operator coordinates every component upgrade, making sure that the version transitions happen in the correct order with compatible dependencies. Without this centralized orchestration, upgrading the platform would require a manual, error-prone process of upgrading multiple different operators individually.
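Upgrading "in the correct order" is a dependency-ordering problem, and a topological sort is the standard way to solve it. The sketch below uses Python's graphlib for that. The dependency edges are my own guesses for illustration and should not be read as the operator's actual dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each component maps to what must be upgraded first.
DEPENDS_ON = {
    "cert-manager": set(),
    "istio": {"cert-manager"},
    "leaderworkerset": {"cert-manager"},
    "kserve": {"cert-manager", "istio", "leaderworkerset"},
}


def upgrade_order(depends_on: dict) -> list:
    """Return an order in which every component follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = upgrade_order(DEPENDS_ON)
print(order)
```

Several orders can be valid (istio and leaderworkerset are interchangeable here), but cert-manager always comes first and kserve always comes last.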
Bringing the automated inference architecture together
When you run helm install rhaii, you're not just deploying a container; you're standing up a complete AI inference platform with deployments, services, custom resource types, and an inference gateway working together across multiple namespaces.
These components work together so you don't have to coordinate them manually:
- cert-manager handles TLS certificates automatically.
- Istio and Gateway API provide one stable endpoint that routes to various models.
- KServe translates your ~30-line LLMInferenceService YAML into more than 200 lines of coordinated Kubernetes resources.
- llm-d enables queue-aware and cache-aware routing that standard Kubernetes Service resources can't achieve.
- LeaderWorkerSet allows models to span multiple nodes when they exceed single-node GPU capacity.
- Cloud manager keeps the same YAML portable across EKS, AKS, GKE, and on-premise environments.
- Red Hat AI Inference operator orchestrates upgrades so all underlying components remain compatible.
Once the platform is installed, you can deploy as many models as your cluster can support. Each deployment automatically receives intelligent routing, TLS, and multi-cloud portability with no per-model configuration needed.
Learn more about the core technologies:
- OpenDataHub RHAII on xKS deployment guide
- KServe control plane documentation
- Istio service mesh documentation
- cert-manager installation and configuration guides
Next steps: Deploying and testing your first model
This article covered the core platform components deployed by helm install. Part 2 of this series walks through deploying an active model, analyzing the specific runtime resources that KServe generates on a per-model basis (including vLLM pods, llm-d-router, InferencePool), and validating the environment with GuideLLM to demonstrate intelligent routing capabilities.
Ready to deploy Red Hat AI Inference on your cluster? Check out the Red Hat AI Inference on xKS documentation for complete installation instructions, or Red Hat OpenShift AI to explore the full platform.