Large enterprises run LLM inference, training, and fine-tuning on Kubernetes for the scale and flexibility it provides. As organizations look to optimize both performance and cost, AWS Inferentia and Trainium chips provide a powerful, cost-effective option for accelerating these workloads, delivering up to 70% lower cost per inference compared to other instance types in many scenarios. Through a joint effort between AWS and Red Hat, these AWS AI chips are now available to customers using Red Hat OpenShift Service on AWS and self-managed OpenShift clusters on AWS, giving organizations more choice in how they design and run their AI platforms.
The AWS Neuron Operator brings native support for AWS AI chips to Red Hat OpenShift, enabling you to run inference with full LLM support using frameworks like vLLM. This integration combines the cost benefits of AWS silicon with the enterprise features of OpenShift.
What the AWS Neuron Operator does
The AWS Neuron Operator automates the deployment and management of AWS Neuron devices on OpenShift clusters. It handles four key tasks:
- Kernel module deployment: Installs Neuron drivers using Kernel Module Management (KMM)
- Device plug-in management: Exposes Neuron devices as schedulable resources
- Intelligent scheduling: Deploys a custom Neuron-aware scheduler for optimal workload placement
- Telemetry collection: Provides basic metrics through a node-metrics DaemonSet
The operator reconciles a custom resource called DeviceConfig that lets you configure images and target specific nodes in your cluster.
Joint development by AWS and Red Hat
This operator is the result of a collaboration between AWS and Red Hat engineering teams, covering the core operator functionality, Neuron integration, OpenShift integration patterns, and lifecycle management. Red Hat, which originated the Operator Framework before it became a CNCF project, developed the operator based on established best practices.
The project consists of two open source repositories:
- operator-for-ai-chips-on-aws: The main operator and custom scheduler
- kmod-with-kmm-for-ai-chips-on-aws: Automated builds of KMM-compatible kernel modules
Both repositories use automated GitHub Actions workflows to build and publish container images to public registries, making installation straightforward.
Why use AWS AI chips for LLM workloads
AWS Inferentia and Trainium chips are purpose-built for machine learning. Inferentia focuses on inference workloads, while Trainium handles both training and inference. Here's what makes them compelling for LLM deployments:
- Cost efficiency: Run inference at up to 50% lower cost compared to GPU instances. For high-volume inference workloads, this translates to significant savings.
- Performance: Inferentia2 delivers up to 4x higher throughput and 10x lower latency than first-generation Inferentia. Trainium offers high-performance training for models with hundreds of billions of parameters.
- Framework support: The Neuron SDK integrates with popular frameworks including PyTorch, TensorFlow, and vLLM. You can deploy models from Hugging Face with minimal code changes.
- Full LLM support: Run popular models like Llama 2, Llama 3, Mistral, and other transformer-based architectures. The vLLM integration provides optimized inference with features like continuous batching and PagedAttention.
Architecture overview
The operator uses several OpenShift and Kubernetes components to enable Neuron devices:
- Node Feature Discovery (NFD): Detects Neuron PCI devices (vendor ID 1d0f) and labels nodes accordingly. This allows the operator to target the right nodes.
- Kernel Module Management (KMM): Loads the Neuron kernel driver on nodes with compatible hardware. KMM handles kernel version matching automatically, even across OpenShift upgrades.
- Custom Scheduler: A Neuron-aware scheduler extension that understands neuron core topology. This ensures workloads are placed on nodes with available neuron cores, not just nodes with Neuron devices.
- Device plug-in: Exposes `aws.amazon.com/neuron` and `aws.amazon.com/neuroncore` as allocatable resources. Pods can request these resources in their resource limits (see the example below).
The operator manages all these components through a single DeviceConfig custom resource, simplifying operations.
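To make the scheduling model concrete, here is a minimal sketch of a pod that requests a NeuronCore through the device plug-in. The pod name, image, and command are placeholders, and whether you request `aws.amazon.com/neuron` (whole devices) or `aws.amazon.com/neuroncore` (individual cores) depends on how your workload partitions the hardware.

```yaml
# Minimal sketch: a pod that asks the Neuron device plug-in for one NeuronCore.
# The image and command are placeholders for your own workload.
apiVersion: v1
kind: Pod
metadata:
  name: neuron-smoke-test
spec:
  containers:
  - name: app
    image: public.ecr.aws/amazonlinux/amazonlinux:2023   # placeholder image
    command: ["sleep", "infinity"]
    resources:
      requests:
        aws.amazon.com/neuroncore: 1
      limits:
        aws.amazon.com/neuroncore: 1
```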
Installing the AWS Neuron Operator
You can install the operator through the OpenShift web console or using the command line. Both methods require two prerequisite operators from Red Hat in addition to the AWS Neuron Operator itself.
Prerequisites
The installation uses three operators:
- Node Feature Discovery (NFD): Detects hardware features
- Kernel Module Management (KMM): Manages kernel drivers
- AWS Neuron Operator (by AWS): Manages Neuron devices
All three operators are available in the OpenShift OperatorHub catalog.
Installation via OpenShift console (recommended)
This method uses the OpenShift web console and is the easiest way to get started.
Step 1: Install Node Feature Discovery
- Open your cluster's web console.
- Navigate to Operators → OperatorHub.
- Search for `Node Feature Discovery`.
- Click Node Feature Discovery provided by Red Hat.
- Click Install, then Install again at the bottom.
- Once installed, click View Operator.
- Click Create Instance under NodeFeatureDiscovery.
- Click Create at the bottom (use default settings).
Step 2: Apply the NFD Rule for Neuron Devices
Create a file named neuron-nfd-rule.yaml:
```yaml
apiVersion: nfd.openshift.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: neuron-nfd-rule
  namespace: ai-operator-on-aws
spec:
  rules:
    - name: neuron-device
      labels:
        feature.node.kubernetes.io/aws-neuron: "true"
      matchAny:
        - matchFeatures:
            - feature: pci.device
              matchExpressions:
                vendor: {op: In, value: ["1d0f"]}
                device: {op: In, value: [
                  "7064", "7065", "7066", "7067",
                  "7164", "7264", "7364"
                ]}
```

Apply it:

```bash
oc apply -f neuron-nfd-rule.yaml
```

This rule labels nodes that have AWS Neuron devices, making them discoverable by the operator.
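As an optional sanity check, you can confirm the rule object exists and that nodes with Neuron hardware picked up the label (on a cluster without Neuron instances the second command returns nothing):

```bash
# List NodeFeatureRule objects and labeled nodes
oc get nodefeaturerules -A
oc get nodes -l feature.node.kubernetes.io/aws-neuron=true
```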
Step 3: Install Kernel Module Management
- Go back to Operators → OperatorHub.
- Search for `Kernel Module`.
- Click Kernel Module Management provided by Red Hat.
- Click Install, then Install again.
Step 4: Install AWS Neuron Operator
- Go to Operators → OperatorHub.
- Search for `AWS Neuron`.
- Click AWS Neuron Operator provided by Amazon, Inc.
- Click Install, then Install again.
- Once installed, click View Operator.
- Click Create Instance under DeviceConfig.
- Update the YAML with your desired configuration (see below).
- Click Create.
Installation via command line
For automation or CI/CD pipelines, use the command-line installation method.
Step 1: Install Prerequisites
Install NFD and KMM operators through OperatorHub first, then create the NFD instance and apply the NFD rule shown above.
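If you want the prerequisites scripted as well, they can be installed from the CLI by creating OLM Subscription objects. The sketch below shows the idea for NFD; the namespace, channel, and package name are assumptions, so verify them with `oc get packagemanifests -n openshift-marketplace` before applying. KMM follows the same pattern.

```yaml
# Hedged sketch: install the NFD operator via an OLM Subscription.
# Namespace, channel, and package name are assumptions; verify against
# the packagemanifests in your cluster before applying.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```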
Step 2: Install the Operator
```bash
# Install the latest version
kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/latest/download/aws-neuron-operator.yaml

# Or install a specific version
kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/download/v0.1.1/aws-neuron-operator.yaml
```

Step 3: Create DeviceConfig
Create a file named deviceconfig.yaml:
```yaml
apiVersion: k8s.aws/v1alpha1
kind: DeviceConfig
metadata:
  name: neuron
  namespace: ai-operator-on-aws
spec:
  driversImage: public.ecr.aws/q5p6u7h8/neuron-openshift/neuron-kernel-module:2.24.7.0
  devicePluginImage: public.ecr.aws/neuron/neuron-device-plugin:2.24.23.0
  customSchedulerImage: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.32.9-eks-1-32-24
  schedulerExtensionImage: public.ecr.aws/neuron/neuron-scheduler:2.24.23.0
  selector:
    feature.node.kubernetes.io/aws-neuron: "true"
```

Apply it:

```bash
oc apply -f deviceconfig.yaml
```

The operator will automatically append the kernel version to the driversImage at runtime, ensuring the correct driver is loaded.
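Before the broader verification below, you can optionally confirm that the DeviceConfig was accepted and that the operator created the corresponding KMM Module; the `deviceconfigs.k8s.aws` resource name follows from the apiVersion shown above.

```bash
# Confirm the DeviceConfig exists and the KMM Module was created for it
oc get deviceconfigs.k8s.aws -n ai-operator-on-aws
oc get modules.kmm.sigs.x-k8s.io -n ai-operator-on-aws
```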
Verify installation
Check that all components are running:
```bash
# Check operator pods
oc get pods -n ai-operator-on-aws

# Verify KMM module
oc get modules.kmm.sigs.x-k8s.io -A

# Check node labels
oc get nodes -l feature.node.kubernetes.io/aws-neuron=true

# Verify Neuron resources are available
kubectl get nodes -o json | jq -r '
  .items[]
  | select(((.status.capacity["aws.amazon.com/neuron"] // "0") | tonumber) > 0)
  | .metadata.name as $name
  | "\($name)\n  Neuron devices: \(.status.capacity["aws.amazon.com/neuron"])\n  Neuron cores: \(.status.capacity["aws.amazon.com/neuroncore"])"
'
```

You should see nodes with available Neuron devices and cores.
Running LLM inference with vLLM
Once the operator is installed, you can deploy LLM inference workloads using vLLM, a high-performance inference engine optimized for AWS Neuron.
Set up the inference environment
Create a namespace:
```bash
oc create namespace neuron-inference
```

Create a `PersistentVolumeClaim` for model storage. This PVC stores the downloaded model, so you don't need to download it every time you restart the deployment.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: neuron-inference
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```

Create a Hugging Face token secret. Most LLM models require authentication to download from Hugging Face.

```bash
oc create secret generic hf-token \
  --from-literal=token=YOUR_HF_TOKEN \
  -n neuron-inference
```

Deploy the vLLM inference server. Create a deployment that downloads the model and runs the vLLM server:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: neuron-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      initContainers:
      - name: model-downloader
        image: python:3.10-slim
        command:
        - /bin/bash
        - -c
        - |
          pip install huggingface_hub
          python -c "
          from huggingface_hub import snapshot_download
          import os
          token = os.environ.get('HF_TOKEN')
          snapshot_download('meta-llama/Llama-2-7b-hf', local_dir='/model', token=token)
          "
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-storage
          mountPath: /model
      containers:
      - name: vllm-server
        image: public.ecr.aws/neuron/vllm-neuron:latest
        command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - /model
        - --tensor-parallel-size
        - "2"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            aws.amazon.com/neuron: 2
          requests:
            aws.amazon.com/neuron: 2
        volumeMounts:
        - name: model-storage
          mountPath: /model
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-storage
```

Expose the service. Create a service and route for external access:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: neuron-inference
spec:
  selector:
    app: vllm-inference
  ports:
  - port: 8000
    targetPort: 8000
    name: http
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: vllm-route
  namespace: neuron-inference
spec:
  to:
    kind: Service
    name: vllm-service
  port:
    targetPort: http
  tls:
    termination: edge
```

Apply all resources:

```bash
oc apply -f pvc.yaml
oc apply -f deployment.yaml
oc apply -f service.yaml
```
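The first start can take a while because the init container downloads the model into the PVC. Assuming the resource names above, one way to watch progress is:

```bash
# Wait for the deployment to become ready (the model download runs first)
oc rollout status deployment/vllm-inference -n neuron-inference

# Follow the init container's download progress
oc logs -f deployment/vllm-inference -c model-downloader -n neuron-inference
```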
Testing the inference endpoint
Once the vLLM server is running, you can send requests to the OpenAI-compatible API:
```bash
# Get the route URL
ROUTE_URL=$(oc get route vllm-route -n neuron-inference -o jsonpath='{.spec.host}')

# Send a test request
curl https://${ROUTE_URL}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

The vLLM server provides an OpenAI-compatible API, making it easy to integrate with existing applications.
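If the completion request fails, a quick way to confirm the server is up and to see which model ID it registered is to query the models endpoint of the OpenAI-compatible API:

```bash
# List the models served by the vLLM endpoint
curl https://${ROUTE_URL}/v1/models
```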
Cost optimization strategies
Running LLMs on AWS Neuron chips can significantly reduce your inference costs. Here are strategies to maximize savings:
- Use Inferentia2 for inference-only workloads. Inferentia2 instances like `inf2.xlarge` start at a fraction of the cost of comparable GPU instances. For production inference, this is the most cost-effective option.
- Leverage continuous batching. vLLM's continuous batching feature maximizes throughput by dynamically batching requests. This increases utilization and reduces cost per inference.
- Right-size your instances. Start with smaller instance types and scale up based on actual usage. Inferentia2 instances come in various sizes from `inf2.xlarge` (1 Neuron device) to `inf2.48xlarge` (12 devices).
- Use Spot instances for development. Red Hat OpenShift Service on AWS supports EC2 Spot instances through machine pools (see the sketch after this list). Use Spot for development and testing environments to save up to 90%.
- Cache models on persistent volumes. As shown in the vLLM example, caching models on PVCs eliminates repeated downloads and reduces startup time.
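As a sketch of the Spot approach on Red Hat OpenShift Service on AWS, a machine pool of Inferentia2 Spot instances might be created along these lines. The cluster name is a placeholder and the flags are assumptions; check `rosa create machinepool --help` for your CLI version before running.

```bash
# "my-cluster" and the flag set are placeholders/assumptions; verify with
# `rosa create machinepool --help` before running.
rosa create machinepool \
  --cluster my-cluster \
  --name inf2-spot-pool \
  --instance-type inf2.xlarge \
  --replicas 2 \
  --use-spot-instances
```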
Monitoring and troubleshooting
The operator includes basic telemetry through the node-metrics DaemonSet. For production deployments, integrate with OpenShift monitoring.
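For example, once user workload monitoring is enabled, a PodMonitor along these lines could scrape the telemetry pods. The label selector and port name are assumptions; inspect the node-metrics DaemonSet in your cluster (`oc get ds -n ai-operator-on-aws -o yaml`) and adjust them to match what it actually exposes.

```yaml
# Hedged sketch: scrape the node-metrics pods with a PodMonitor.
# The pod label and port name are assumptions; match them to the DaemonSet.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: neuron-node-metrics
  namespace: ai-operator-on-aws
spec:
  selector:
    matchLabels:
      app: neuron-node-metrics   # assumption
  podMetricsEndpoints:
  - port: metrics                # assumption
    interval: 30s
```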
Common issues
Here are some common issues and their troubleshooting steps.
Pods stuck in Pending state
Check that nodes have the feature.node.kubernetes.io/aws-neuron=true label and that Neuron resources are available:
```bash
oc describe node <node-name> | grep neuron
```

Driver not loading
Verify the KMM module is created and the DaemonSet is running:
```bash
oc get modules.kmm.sigs.x-k8s.io -A
oc get ds -n ai-operator-on-aws
```

Model download failures
Check that the Hugging Face token is valid and the model name is correct. Review init container logs:
```bash
oc logs <pod-name> -c model-downloader -n neuron-inference
```

Scheduler not placing pods
Ensure the custom scheduler is running and pods are using the correct scheduler name:
```bash
oc get pods -n ai-operator-on-aws | grep scheduler
```

What's next
The AWS Neuron Operator for OpenShift enables enterprise-grade AI acceleration. As AWS continues to invest in purpose-built AI chips and Red Hat enhances OpenShift's AI capabilities, expect more features and optimizations.
To get started:
- Review the operator documentation
- Check out the kernel module repository
- Explore AWS Neuron SDK documentation
- Join the discussion in the GitHub repositories
The combination of AWS AI chips and OpenShift provides a powerful platform for running cost-effective AI workloads at scale. Whether you're deploying LLMs for customer service, content generation, or data analysis, this integration makes it easier and more affordable.
Note
The AWS Neuron Operator is developed jointly by AWS and Red Hat. Contributions and feedback are welcome through the GitHub repositories.