Large enterprises run LLM inference, training, and fine-tuning on Kubernetes for the scale and flexibility it provides. As organizations look to optimize both performance and cost, AWS Inferentia and Trainium chips provide a powerful, cost-effective option for accelerating these workloads, delivering up to 70% lower cost per inference compared to other instance types in many scenarios. Through a joint effort between AWS and Red Hat, these AWS AI chips are now available to customers using Red Hat OpenShift Service on AWS and self-managed OpenShift clusters on AWS, giving organizations more choice in how they design and run their AI platforms like Red Hat OpenShift AI.
The AWS Neuron Operator brings native support for AWS AI chips to Red Hat OpenShift, enabling you to run inference with full LLM support using frameworks like vLLM. This integration combines the cost benefits of AWS silicon with the enterprise features of OpenShift and the broader Red Hat AI capabilities.
What the AWS Neuron Operator does
The AWS Neuron Operator automates the deployment and management of AWS Neuron devices on OpenShift clusters. It handles four key tasks:
- Kernel module deployment: Installs Neuron drivers using Kernel Module Management (KMM)
- Device plug-in management: Exposes Neuron devices as schedulable resources
- Intelligent scheduling: Deploys a custom Neuron-aware scheduler for optimal workload placement
- Telemetry collection: Provides basic metrics through a node-metrics DaemonSet
The operator reconciles a custom resource called DeviceConfig that lets you configure images and target specific nodes in your cluster.
Joint development by AWS and Red Hat
This operator is the result of a collaboration between AWS and Red Hat engineering teams. It covers core functionality, Neuron integration, OpenShift integration patterns, and lifecycle management. Red Hat, which originated the Operator Framework before it became a CNCF project, developed the operator following established best practices.
The project consists of two open source repositories:
- operator-for-ai-chips-on-aws: The main operator and custom scheduler
- kmod-with-kmm-for-ai-chips-on-aws: Automated builds of KMM-compatible kernel modules
Both repositories use automated GitHub Actions workflows to build and publish container images to public registries, making installation straightforward.
Why use AWS AI chips for LLM workloads
AWS Inferentia and Trainium chips are purpose-built for machine learning. Inferentia focuses on inference workloads, while Trainium handles both training and inference. Here's what makes them compelling for LLM deployments:
- Cost efficiency: Run inference at up to 50% lower cost compared to GPU instances. For high-volume inference workloads, this translates to significant savings.
- Performance: Inferentia2 delivers up to 4x higher throughput and 10x lower latency than first-generation Inferentia. Trainium offers high-performance training for models with hundreds of billions of parameters.
- Framework support: The Neuron SDK integrates with popular frameworks including PyTorch, TensorFlow, and vLLM. You can deploy models from Hugging Face with minimal code changes.
- Full LLM support: Run popular models like Llama 2, Llama 3, Mistral, and other transformer-based architectures. The vLLM integration provides optimized inference with features like continuous batching and PagedAttention.
Architecture overview
The operator uses several OpenShift and Kubernetes components to enable Neuron devices:
- Node Feature Discovery (NFD): Detects Neuron PCI devices (vendor ID 1d0f) and labels nodes accordingly. This allows the operator to target the right nodes.
- Kernel Module Management (KMM): Loads the Neuron kernel driver on nodes with compatible hardware. KMM handles kernel version matching automatically, even across OpenShift upgrades.
- Custom Scheduler: A Neuron-aware scheduler extension that understands neuron core topology. This ensures workloads are placed on nodes with available neuron cores, not just nodes with Neuron devices.
- Device plug-in: Exposes aws.amazon.com/neuron and aws.amazon.com/neuroncore as allocatable resources. Pods can request these resources in their resource limits.
The operator manages all these components through a single DeviceConfig custom resource, simplifying operations.
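To make this concrete, here is a minimal sketch of a pod that requests Neuron cores and opts into the Neuron-aware scheduler. The pod name and the sleep command are placeholders for illustration, and the image is simply the Neuron vLLM image used later in this article:

apiVersion: v1
kind: Pod
metadata:
  name: neuron-smoke-test            # placeholder name for illustration
spec:
  schedulerName: neuron-scheduler    # the custom scheduler deployed by the operator
  containers:
  - name: app
    image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.7.2-neuronx-py310-sdk2.24.1-ubuntu22.04
    command: ["sleep", "infinity"]   # keep the pod alive; no real workload here
    resources:
      requests:
        aws.amazon.com/neuroncore: 2 # request two Neuron cores
      limits:
        aws.amazon.com/neuroncore: 2 # extended resources require requests == limits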
Installing the AWS Neuron Operator
You can install the operator through the OpenShift web console or using the command line. Both methods require two prerequisite operators from Red Hat: Node Feature Discovery (NFD) and Kernel Module Management (KMM).
Prerequisites
The complete setup uses three operators from OperatorHub:
- Node Feature Discovery (NFD): Detects hardware features
- Kernel Module Management (KMM): Manages kernel drivers
- AWS Neuron Operator (by AWS): Manages Neuron devices
All three operators are available in the OpenShift OperatorHub catalog.
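If you prefer the CLI, you can confirm that the catalogs expose these packages before you start. The grep pattern below is approximate and may need adjusting to the package names in your catalog:

# List catalog entries related to NFD, KMM, and Neuron (pattern is approximate)
oc get packagemanifests -n openshift-marketplace | grep -iE 'nfd|kernel-module|neuron'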
Installation via OpenShift console (recommended)
This method uses the OpenShift web console and is the easiest way to get started. The instructions below were validated with OpenShift 4.20.4 on Red Hat OpenShift Service on AWS.
Step 1: Install Node Feature Discovery
- Open your cluster's web console.
- Navigate to Ecosystem → Software Catalog (under the openshift-operators-redhat project).
- Search for "Node Feature Discovery."
- Click Node Feature Discovery provided by Red Hat.
- Click Install, then Install again at the bottom.
- Once installed, click View Operator.
- Click Create Instance under NodeFeatureDiscovery.
- Click Create at the bottom (use default settings).
Step 2: Apply the NFD Rule for Neuron Devices
We will use namespace ai-operator-on-aws as the target for our configuration settings. Create this namespace first:
oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  labels:
    control-plane: controller-manager
    security.openshift.io/scc.podSecurityLabelSync: 'true'
  name: ai-operator-on-aws
EOF

Create the node feature discovery rule:
oc apply -f - <<EOF
apiVersion: nfd.openshift.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: neuron-nfd-rule
  namespace: ai-operator-on-aws
spec:
  rules:
  - name: neuron-device
    labels:
      feature.node.kubernetes.io/aws-neuron: "true"
    matchAny:
    - matchFeatures:
      - feature: pci.device
        matchExpressions:
          vendor: {op: In, value: ["1d0f"]}
          device: {op: In, value: [
            "7064",
            "7065",
            "7066",
            "7067",
            "7164",
            "7264",
            "7364",
          ]}
EOF

This rule labels nodes that have AWS Neuron devices, making them discoverable by the operator.
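After NFD processes the rule, you can confirm which nodes received the label (it can take a minute for NFD to re-label nodes):

oc get nodes -l feature.node.kubernetes.io/aws-neuron=true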
Step 3: Install Kernel Module Management
- Go back to Ecosystem → Software Catalog.
- Search for "Kernel Module."
- Click Kernel Module Management provided by Red Hat.
- Click Install, then Install again.
Step 4: Install AWS Neuron Operator
- Go back to Ecosystem → Software Catalog.
- Search for "AWS Neuron."
- Click AWS Neuron Operator provided by Amazon, Inc.
- Click Install, then Install again.
- Once installed, click View Operator.
- Click Create Instance under DeviceConfig.
- Update the YAML with your desired configuration (see below).
- Click Create.
Installation via command line
For automation or CI/CD pipelines, use the command-line installation method.
Step 1: Install prerequisites
Install NFD and KMM operators through OperatorHub first, then create the NFD instance and apply the NFD rule shown above.
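For fully scripted environments, the prerequisite operators can also be installed by creating Subscription resources instead of using the console. The following is a sketch for NFD; KMM follows the same pattern with its own namespace and package name. The channel, package, and namespace values are assumptions based on typical Red Hat catalog defaults, so confirm them with oc get packagemanifests before applying:

# Sketch: install the NFD operator from the CLI (verify channel/package names for your catalog)
oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF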
Step 2: Install the Operator
Enter the following:
# Install the latest version
kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/latest/download/aws-neuron-operator.yaml
# Or install a specific version
kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/download/v0.1.1/aws-neuron-operator.yaml

Step 3: Create DeviceConfig
Create a DeviceConfig resource file named deviceconfig.yaml:
apiVersion: k8s.aws/v1alpha1
kind: DeviceConfig
metadata:
  name: neuron
  namespace: ai-operator-on-aws
spec:
  driversImage: public.ecr.aws/q5p6u7h8/neuron-openshift/neuron-kernel-module:2.24.7.0 # actual pull at runtime will use <image>-$KERNEL_VERSION
  devicePluginImage: public.ecr.aws/neuron/neuron-device-plugin:2.24.23.0
  customSchedulerImage: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.32.9-eks-1-32-24
  schedulerExtensionImage: public.ecr.aws/neuron/neuron-scheduler:2.24.23.0
  selector:
    feature.node.kubernetes.io/aws-neuron: "true"

Apply it:

oc apply -f deviceconfig.yaml

The operator will automatically append the kernel version to the driversImage at runtime, ensuring the correct driver is loaded.
Verify installation
Check that all components are running:
# Check operator pods
oc get pods -n ai-operator-on-aws
# Verify KMM module
oc get modules.kmm.sigs.x-k8s.io -A
# Check node labels
oc get nodes -l feature.node.kubernetes.io/aws-neuron=true
# Verify Neuron resources are available
kubectl get nodes -o json | jq -r '
.items[]
| select(((.status.capacity["aws.amazon.com/neuron"] // "0") | tonumber) > 0)
| .metadata.name as $name
| "\($name)\n Neuron devices: \(.status.capacity["aws.amazon.com/neuron"])\n Neuron cores: \(.status.capacity["aws.amazon.com/neuroncore"])"
'

You should see nodes with available Neuron devices and cores.
Running LLM inference with vLLM
Once the operator is installed, you can deploy LLM inference workloads using vLLM, a high-performance inference engine optimized for AWS Neuron.
Set up the inference environment
Prepare the OpenShift cluster by creating the necessary namespace, persistent storage for model caching, and authentication secrets.
Step 1: Create a namespace
oc create namespace neuron-inference

Step 2: Create a PersistentVolumeClaim for model storage
This PVC stores the downloaded model, so you don't need to download it every time you restart the deployment.
oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: neuron-inference
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp3-csi
EOF

Step 3: Create a Hugging Face token secret
Most LLMs on Hugging Face require authentication to download. Make sure you have access to the meta-llama/Llama-3.1-8B-Instruct model, or substitute a model you have access to:
oc create secret generic hf-token \
--from-literal=token=YOUR_HF_TOKEN \
-n neuron-inference

Step 4: Deploy the vLLM inference server
Create a deployment file deployment.yaml that downloads the model and runs the vLLM server:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuron-vllm-test
  namespace: neuron-inference
  labels:
    app: neuron-vllm-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: neuron-vllm-test
  template:
    metadata:
      labels:
        app: neuron-vllm-test
    spec:
      schedulerName: neuron-scheduler
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-cache
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      serviceAccountName: default
      initContainers:
      - name: fetch-model
        image: python:3.11-slim
        env:
        - name: DOCKER_CONFIG
          value: /auth
        - name: HF_HOME
          value: /model
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token # Your existing secret
              key: token
        command: ["/bin/sh","-c"]
        args:
        - |
          set -ex
          echo "--- SCRIPT STARTED ---"
          echo "--- CHECKING /model DIRECTORY PERMISSIONS AND CONTENTS ---"
          # Only pull if /model is empty
          if [ ! -f "/model/config.json" ]; then
            export PYTHONUSERBASE="/tmp/pip"
            export PATH="$PYTHONUSERBASE/bin:$PATH"
            pip install --no-cache-dir --user "huggingface_hub>=1.0"
            echo "Pulling model..."
            $PYTHONUSERBASE/bin/hf download meta-llama/Llama-3.1-8B-Instruct --local-dir /model
          else
            echo "Model already present, skipping model pull"
          fi
        volumeMounts:
        - name: model-volume
          mountPath: /model
      containers:
      - name: granite
        image: 'public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.7.2-neuronx-py310-sdk2.24.1-ubuntu22.04'
        imagePullPolicy: IfNotPresent
        workingDir: /model
        env:
        - name: VLLM_SERVER_DEV_MODE
          value: '1'
        - name: NEURON_CACHE_URL
          value: "/model/neuron_cache"
        command:
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
        args:
        - '--port=8000'
        - '--model=/model'
        - '--served-model-name=meta-llama/Llama-3.1-8B-Instruct'
        - '--tensor-parallel-size=2'
        - '--device'
        - 'neuron'
        - '--max-num-seqs=4'
        - '--max-model-len=4096'
        resources:
          limits:
            memory: "100Gi"
            aws.amazon.com/neuron: 1
          requests:
            memory: "10Gi"
            aws.amazon.com/neuron: 1
        volumeMounts:
        - name: model-volume
          mountPath: /model
        - name: shm
          mountPath: /dev/shm
      restartPolicy: Always

Step 5: Expose the Service
Create a service and a route for external access, and save them in service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: neuron-vllm-test
  namespace: neuron-inference
spec:
  selector:
    app: neuron-vllm-test
  ports:
  - name: vllm-port
    protocol: TCP
    port: 80
    targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: neuron-vllm-test
  namespace: neuron-inference
spec:
  to:
    kind: Service
    name: neuron-vllm-test
  port:
    targetPort: vllm-port
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect

Apply all resources:
oc apply -f deployment.yaml
oc apply -f service.yaml

Testing the inference endpoint
Once the vLLM server is running, you can send requests to the OpenAI-compatible API:
# Get the route URL
ROUTE_URL=$(oc get route neuron-vllm-test -n neuron-inference -o jsonpath='{.spec.host}')
# Send a test request
curl https://$ROUTE_URL/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}],
"max_tokens": 50
}'

The vLLM server provides an OpenAI-compatible API, making it easy to integrate with existing applications.
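Keep in mind that the first startup can take a while because vLLM downloads and compiles the model for the Neuron devices. If requests time out, check that the rollout has finished and follow the server logs (the container name granite comes from the deployment above):

# Wait for the deployment to finish rolling out (model download and compilation take time)
oc rollout status deployment/neuron-vllm-test -n neuron-inference

# Follow the vLLM server logs
oc logs -f deployment/neuron-vllm-test -c granite -n neuron-inference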
Cost optimization strategies
Running LLMs on AWS Neuron chips can significantly reduce your inference costs. Here are strategies to maximize savings:
- Use Inferentia2 for inference-only workloads. Inferentia2 instances like inf2.xlarge start at a fraction of the cost of comparable GPU instances. For production inference, this is the most cost-effective option.
- Leverage continuous batching. vLLM's continuous batching feature maximizes throughput by dynamically batching requests. This increases utilization and reduces cost per inference.
- Right-size your instances. Start with smaller instance types and scale up based on actual usage. Inferentia2 instances come in various sizes, from inf2.xlarge (1 Neuron device) to inf2.48xlarge (12 devices).
- Use Spot instances for development. Red Hat OpenShift Service on AWS supports EC2 Spot instances through machine pools; see the sketch after this list. Use Spot for development and testing environments to save up to 90%.
- Cache models on persistent volumes. As shown in the vLLM example, caching models on PVCs eliminates repeated downloads and reduces startup time.
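As a sketch of the Spot approach on Red Hat OpenShift Service on AWS, a machine pool backed by Inferentia2 Spot capacity could be created roughly as follows. The cluster name, pool name, and replica count are placeholders, and you should confirm the exact flags with rosa create machinepool --help for your rosa CLI version:

# Illustrative only: create a Spot-backed Inferentia2 machine pool with the rosa CLI
rosa create machinepool \
  --cluster my-rosa-cluster \
  --name inf2-spot \
  --instance-type inf2.xlarge \
  --replicas 2 \
  --use-spot-instances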
Monitoring and troubleshooting
The operator includes basic telemetry through the node-metrics DaemonSet. For production deployments, integrate with OpenShift monitoring.
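As one example, the vLLM server deployed earlier exposes Prometheus metrics at /metrics on its API port, so you can scrape it with OpenShift user workload monitoring. The sketch below assumes user workload monitoring is already enabled on the cluster and adds a label to the service so the ServiceMonitor can select it:

# Label the service so the ServiceMonitor selector below matches it
oc label service neuron-vllm-test app=neuron-vllm-test -n neuron-inference

oc apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: neuron-vllm-test
  namespace: neuron-inference
spec:
  selector:
    matchLabels:
      app: neuron-vllm-test
  endpoints:
  - port: vllm-port   # named port from the service defined earlier
    path: /metrics
    interval: 30s
EOF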
Common issues
Here are some common issues and their troubleshooting steps.
Pods stuck in Pending state
Check that nodes have the feature.node.kubernetes.io/aws-neuron=true label and that Neuron resources are available:
oc describe node <node-name> | grep neuron

Driver not loading
Verify the KMM module is created and the DaemonSet is running:
oc get modules.kmm.sigs.x-k8s.io -A
oc get ds -n ai-operator-on-aws

Model download failures
Check that the Hugging Face token is valid and the model name is correct. Review init container logs:
oc logs <pod-name> -c fetch-model -n neuron-inference

Scheduler not placing pods
Ensure the custom scheduler is running and pods are using the correct scheduler name:
oc get pods -n ai-operator-on-aws | grep scheduler

What's next
The AWS Neuron Operator for OpenShift enables enterprise-grade AI acceleration. As AWS continues to invest in purpose-built AI chips and Red Hat enhances OpenShift's AI capabilities, expect more features and optimizations.
To support this vision, Red Hat AI Inference Server support for AWS AI chips (Inferentia and Trainium) is coming by January 2026. This developer preview will allow you to run the supported Red Hat AI Inference Server on AWS silicon, combining the cost efficiency of AWS Neuron with the lifecycle support and security of Red Hat AI.
To get started today:
- Review the operator documentation.
- Check out the kernel module repository.
- Explore the AWS Neuron SDK documentation.
- Join the discussion in the GitHub repositories.
The combination of AWS AI chips and OpenShift provides a powerful platform for running cost-effective AI workloads at scale. Whether you're deploying LLMs for customer service, content generation, or data analysis, this integration makes those workloads easier and more affordable to run.
Note
The AWS Neuron Operator is developed jointly by AWS and Red Hat. Contributions and feedback are welcome through the GitHub repositories.