Large enterprises run LLM inference, training, and fine-tuning on Kubernetes for the scale and flexibility it provides. As organizations look to optimize both performance and cost, AWS Inferentia and Trainium chips provide a powerful, cost-effective option for accelerating these workloads, delivering up to 70% lower cost per inference compared to other instance types in many scenarios. Through a joint effort between AWS and Red Hat, these AWS AI chips are now available to customers using Red Hat OpenShift Service on AWS and self-managed OpenShift clusters on AWS, giving organizations more choice in how they design and run their AI platforms like Red Hat OpenShift AI.
The AWS Neuron Operator brings native support for AWS AI chips to Red Hat OpenShift, enabling you to run inference with full LLM support using frameworks like vLLM. This integration combines the cost benefits of AWS silicon with the enterprise features of OpenShift and the broader Red Hat AI capabilities.
What the AWS Neuron Operator does
The AWS Neuron Operator automates the deployment and management of AWS Neuron devices on OpenShift clusters. It handles four key tasks:
- Kernel module deployment: Installs Neuron drivers using Kernel Module Management (KMM)
- Device plug-in management: Exposes Neuron devices as schedulable resources
- Intelligent scheduling: Deploys a custom Neuron-aware scheduler for optimal workload placement
- Telemetry collection: Provides basic metrics through a node-metrics DaemonSet
The operator reconciles a custom resource called DeviceConfig that lets you configure images and target specific nodes in your cluster.
Joint development by AWS and Red Hat
This operator is the result of a collaboration between AWS and Red Hat engineering teams. It covers core functionality, Neuron integration, OpenShift integration patterns, and lifecycle management. Red Hat, which originated the Operator Framework before it became a CNCF project, developed the operator following established best practices.
The project consists of two open source repositories:
- operator-for-ai-chips-on-aws: The main operator and custom scheduler
- kmod-with-kmm-for-ai-chips-on-aws: Automated builds of KMM-compatible kernel modules
Both repositories use automated GitHub Actions workflows to build and publish container images to public registries, making installation straightforward.
Why use AWS AI chips for LLM workloads
AWS Inferentia and Trainium chips are purpose-built for machine learning. Inferentia focuses on inference workloads, while Trainium handles both training and inference. Here's what makes them compelling for LLM deployments:
- Cost efficiency: Run inference at up to 50% lower cost compared to GPU instances. For high-volume inference workloads, this translates to significant savings.
- Performance: Inferentia2 delivers up to 4x higher throughput and 10x lower latency than first-generation Inferentia. Trainium offers high-performance training for models with hundreds of billions of parameters.
- Framework support: The Neuron SDK integrates with popular frameworks including PyTorch, TensorFlow, and vLLM. You can deploy models from Hugging Face with minimal code changes.
- Full LLM support: Run popular models like Llama 2, Llama 3, Mistral, and other transformer-based architectures. The vLLM integration provides optimized inference with features like continuous batching and PagedAttention.
Architecture overview
The operator uses several OpenShift and Kubernetes components to enable Neuron devices:
- Node Feature Discovery (NFD): Detects Neuron PCI devices (vendor ID 1d0f) and labels nodes accordingly. This allows the operator to target the right nodes.
- Kernel Module Management (KMM): Loads the Neuron kernel driver on nodes with compatible hardware. KMM handles kernel version matching automatically, even across OpenShift upgrades.
- Custom Scheduler: A Neuron-aware scheduler extension that understands neuron core topology. This ensures workloads are placed on nodes with available neuron cores, not just nodes with Neuron devices.
- Device plug-in: Exposes aws.amazon.com/neuron and aws.amazon.com/neuroncore as allocatable resources. Pods can request these resources in their resource limits.
The operator manages all these components through a single DeviceConfig custom resource, simplifying operations.
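To make this concrete, here is a minimal sketch of a pod that requests Neuron cores and opts into the Neuron-aware scheduler. The pod name and the sleep command are placeholders for illustration, and the image is simply the Neuron vLLM image used later in this article:

apiVersion: v1
kind: Pod
metadata:
  name: neuron-smoke-test            # placeholder name for illustration
spec:
  schedulerName: neuron-scheduler    # the custom scheduler deployed by the operator
  containers:
  - name: app
    image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.7.2-neuronx-py310-sdk2.24.1-ubuntu22.04
    command: ["sleep", "infinity"]   # keep the pod alive; no real workload here
    resources:
      requests:
        aws.amazon.com/neuroncore: 2 # request two Neuron cores
      limits:
        aws.amazon.com/neuroncore: 2 # extended resources require requests == limits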
Installing the AWS Neuron Operator
You can install the operator through the OpenShift web console or using the command line. Both methods require two prerequisite operators from Red Hat: Node Feature Discovery (NFD) and Kernel Module Management (KMM).
Prerequisites
The complete setup uses three operators from OperatorHub:
- Node Feature Discovery (NFD): Detects hardware features
- Kernel Module Management (KMM): Manages kernel drivers
- AWS Neuron Operator (by AWS): Manages Neuron devices
All three operators are available in the OpenShift OperatorHub catalog.
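If you prefer the CLI, you can confirm that the catalogs expose these packages before you start. The grep pattern below is approximate and may need adjusting to the package names in your catalog:

# List catalog entries related to NFD, KMM, and Neuron (pattern is approximate)
oc get packagemanifests -n openshift-marketplace | grep -iE 'nfd|kernel-module|neuron'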
Installation via OpenShift console (recommended)
This method uses the OpenShift web console and is the easiest way to get started. The instructions below were validated with OpenShift 4.20.4 on Red Hat OpenShift Service on AWS.
Step 1: Install Node Feature Discovery
- Open your cluster's web console.
- Navigate to Ecosystem → Software Catalog (under the openshift-operators-redhat project).
- Search for "Node Feature Discovery."
- Click Node Feature Discovery provided by Red Hat.
- Click Install, then Install again at the bottom.
- Once installed, click View Operator.
- Click Create Instance under NodeFeatureDiscovery.
- Click Create at the bottom (use default settings).
Step 2: Apply the NFD Rule for Neuron Devices
We will use namespace ai-operator-on-aws as the target for our configuration settings. Create this namespace first:
oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  labels:
    control-plane: controller-manager
    security.openshift.io/scc.podSecurityLabelSync: 'true'
  name: ai-operator-on-aws
EOF

Create the node feature discovery rule:
oc apply -f - <<EOF
apiVersion: nfd.openshift.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: neuron-nfd-rule
  namespace: ai-operator-on-aws
spec:
  rules:
  - name: neuron-device
    labels:
      feature.node.kubernetes.io/aws-neuron: "true"
    matchAny:
    - matchFeatures:
      - feature: pci.device
        matchExpressions:
          vendor: {op: In, value: ["1d0f"]}
          device: {op: In, value: [
            "7064",
            "7065",
            "7066",
            "7067",
            "7164",
            "7264",
            "7364",
          ]}
EOF

This rule labels nodes that have AWS Neuron devices, making them discoverable by the operator.
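After NFD processes the rule, you can confirm which nodes received the label (it can take a minute for NFD to re-label nodes):

oc get nodes -l feature.node.kubernetes.io/aws-neuron=true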
Step 3: Install Kernel Module Management
- Go back to Ecosystem → Software Catalog.
- Search for "Kernel Module."
- Click Kernel Module Management provided by Red Hat.
- Click Install, then Install again.
Step 4: Install AWS Neuron Operator
- Go back to Ecosystem → Software Catalog.
- Search for "AWS Neuron."
- Click AWS Neuron Operator provided by Amazon, Inc.
- Click Install, then Install again.
- Once installed, click View Operator.
- Click Create Instance under DeviceConfig.
- Update the YAML with your desired configuration (see below).
- Click Create.
Installation via command line
For automation or CI/CD pipelines, use the command-line installation method.
Step 1: Install prerequisites
Install NFD and KMM operators through OperatorHub first, then create the NFD instance and apply the NFD rule shown above.
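For fully scripted environments, the prerequisite operators can also be installed by creating Subscription resources instead of using the console. The following is a sketch for NFD; KMM follows the same pattern with its own namespace and package name. The channel, package, and namespace values are assumptions based on typical Red Hat catalog defaults, so confirm them with oc get packagemanifests before applying:

# Sketch: install the NFD operator from the CLI (verify channel/package names for your catalog)
oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF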
Step 2: Install the Operator
Enter the following:
# Install the latest version
kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/latest/download/aws-neuron-operator.yaml
# Or install a specific version
kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/download/v0.1.1/aws-neuron-operator.yaml

Step 3: Create DeviceConfig
Create a DeviceConfig resource file named deviceconfig.yaml:
apiVersion: k8s.aws/v1alpha1
kind: DeviceConfig
metadata:
  name: neuron
  namespace: ai-operator-on-aws
spec:
  driversImage: public.ecr.aws/q5p6u7h8/neuron-openshift/neuron-kernel-module:2.24.7.0 # actual pull at runtime will use <image>-$KERNEL_VERSION
  devicePluginImage: public.ecr.aws/neuron/neuron-device-plugin:2.24.23.0
  customSchedulerImage: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.32.9-eks-1-32-24
  schedulerExtensionImage: public.ecr.aws/neuron/neuron-scheduler:2.24.23.0
  selector:
    feature.node.kubernetes.io/aws-neuron: "true"

Apply it:

oc apply -f deviceconfig.yaml

The operator will automatically append the kernel version to the driversImage at runtime, ensuring the correct driver is loaded.
Verify installation
Check that all components are running:
# Check operator pods
oc get pods -n ai-operator-on-aws
# Verify KMM module
oc get modules.kmm.sigs.x-k8s.io -A
# Check node labels
oc get nodes -l feature.node.kubernetes.io/aws-neuron=true
# Verify Neuron resources are available
kubectl get nodes -o json | jq -r '
.items[]
| select(((.status.capacity["aws.amazon.com/neuron"] // "0") | tonumber) > 0)
| .metadata.name as $name
| "\($name)\n Neuron devices: \(.status.capacity["aws.amazon.com/neuron"])\n Neuron cores: \(.status.capacity["aws.amazon.com/neuroncore"])"
'

You should see nodes with available Neuron devices and cores.
Running LLM inference with vLLM
Once the operator is installed, you can deploy LLM inference workloads using vLLM, a high-performance inference engine optimized for AWS Neuron.
Set up the inference environment
Prepare the OpenShift cluster by creating the necessary namespace, persistent storage for model caching, and authentication secrets.
Step 1: Create a namespace
oc create namespace neuron-inference

Step 2: Create a PersistentVolumeClaim for model storage
This PVC stores the downloaded model, so you don't need to download it every time you restart the deployment.
oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: neuron-inference
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp3-csi
EOF

Step 3: Create a Hugging Face token secret
Most LLMs on Hugging Face require authentication to download. Make sure you have access to the meta-llama/Llama-3.1-8B-Instruct model, or substitute a model you have access to:
oc create secret generic hf-token \
--from-literal=token=YOUR_HF_TOKEN \
-n neuron-inference

Step 4: Deploy the vLLM inference server
Create a deployment file deployment.yaml that downloads the model and runs the vLLM server:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuron-vllm-test
  namespace: neuron-inference
  labels:
    app: neuron-vllm-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: neuron-vllm-test
  template:
    metadata:
      labels:
        app: neuron-vllm-test
    spec:
      schedulerName: neuron-scheduler
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-cache
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      serviceAccountName: default
      initContainers:
      - name: fetch-model
        image: python:3.11-slim
        env:
        - name: DOCKER_CONFIG
          value: /auth
        - name: HF_HOME
          value: /model
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token # Your existing secret
              key: token
        command: ["/bin/sh","-c"]
        args:
        - |
          set -ex
          echo "--- SCRIPT STARTED ---"
          echo "--- CHECKING /model DIRECTORY PERMISSIONS AND CONTENTS ---"
          # Only pull if /model is empty
          if [ ! -f "/model/config.json" ]; then
            export PYTHONUSERBASE="/tmp/pip"
            export PATH="$PYTHONUSERBASE/bin:$PATH"
            pip install --no-cache-dir --user "huggingface_hub>=1.0"
            echo "Pulling model..."
            $PYTHONUSERBASE/bin/hf download meta-llama/Llama-3.1-8B-Instruct --local-dir /model
          else
            echo "Model already present, skipping model pull"
          fi
        volumeMounts:
        - name: model-volume
          mountPath: /model
      containers:
      - name: granite
        image: 'public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.7.2-neuronx-py310-sdk2.24.1-ubuntu22.04'
        imagePullPolicy: IfNotPresent
        workingDir: /model
        env:
        - name: VLLM_SERVER_DEV_MODE
          value: '1'
        - name: NEURON_CACHE_URL
          value: "/model/neuron_cache"
        command:
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
        args:
        - '--port=8000'
        - '--model=/model'
        - '--served-model-name=meta-llama/Llama-3.1-8B-Instruct'
        - '--tensor-parallel-size=2'
        - '--device'
        - 'neuron'
        - '--max-num-seqs=4'
        - '--max-model-len=4096'
        resources:
          limits:
            memory: "100Gi"
            aws.amazon.com/neuron: 1
          requests:
            memory: "10Gi"
            aws.amazon.com/neuron: 1
        volumeMounts:
        - name: model-volume
          mountPath: /model
        - name: shm
          mountPath: /dev/shm
      restartPolicy: Always

Step 5: Expose the Service
Create a service and a route for external access, and save them in service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: neuron-vllm-test
  namespace: neuron-inference
spec:
  selector:
    app: neuron-vllm-test
  ports:
  - name: vllm-port
    protocol: TCP
    port: 80
    targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: neuron-vllm-test
  namespace: neuron-inference
spec:
  to:
    kind: Service
    name: neuron-vllm-test
  port:
    targetPort: vllm-port
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect

Apply all resources:
oc apply -f deployment.yaml
oc apply -f service.yaml

Testing the inference endpoint
Once the vLLM server is running, you can send requests to the OpenAI-compatible API:
# Get the route URL
ROUTE_URL=$(oc get route neuron-vllm-test -n neuron-inference -o jsonpath='{.spec.host}')
# Send a test request
curl https://$ROUTE_URL/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}],
"max_tokens": 50
}'

The vLLM server provides an OpenAI-compatible API, making it easy to integrate with existing applications.
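Keep in mind that the first startup can take a while because vLLM downloads and compiles the model for the Neuron devices. If requests time out, check that the rollout has finished and follow the server logs (the container name granite comes from the deployment above):

# Wait for the deployment to finish rolling out (model download and compilation take time)
oc rollout status deployment/neuron-vllm-test -n neuron-inference

# Follow the vLLM server logs
oc logs -f deployment/neuron-vllm-test -c granite -n neuron-inference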
Cost optimization strategies
Running LLMs on AWS Neuron chips can significantly reduce your inference costs. Here are strategies to maximize savings:
- Use Inferentia2 for inference-only workloads. Inferentia2 instances like inf2.xlarge start at a fraction of the cost of comparable GPU instances. For production inference, this is the most cost-effective option.
- Leverage continuous batching. vLLM's continuous batching feature maximizes throughput by dynamically batching requests. This increases utilization and reduces cost per inference.
- Right-size your instances. Start with smaller instance types and scale up based on actual usage. Inferentia2 instances come in various sizes, from inf2.xlarge (1 Neuron device) to inf2.48xlarge (12 devices).
- Use Spot instances for development. Red Hat OpenShift Service on AWS supports EC2 Spot instances through machine pools; see the sketch after this list. Use Spot for development and testing environments to save up to 90%.
- Cache models on persistent volumes. As shown in the vLLM example, caching models on PVCs eliminates repeated downloads and reduces startup time.
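As a sketch of the Spot approach on Red Hat OpenShift Service on AWS, a machine pool backed by Inferentia2 Spot capacity could be created roughly as follows. The cluster name, pool name, and replica count are placeholders, and you should confirm the exact flags with rosa create machinepool --help for your rosa CLI version:

# Illustrative only: create a Spot-backed Inferentia2 machine pool with the rosa CLI
rosa create machinepool \
  --cluster my-rosa-cluster \
  --name inf2-spot \
  --instance-type inf2.xlarge \
  --replicas 2 \
  --use-spot-instances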
Monitoring and troubleshooting
The operator includes basic telemetry through the node-metrics DaemonSet. For production deployments, integrate with OpenShift monitoring.
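As one example, the vLLM server deployed earlier exposes Prometheus metrics at /metrics on its API port, so you can scrape it with OpenShift user workload monitoring. The sketch below assumes user workload monitoring is already enabled on the cluster and adds a label to the service so the ServiceMonitor can select it:

# Label the service so the ServiceMonitor selector below matches it
oc label service neuron-vllm-test app=neuron-vllm-test -n neuron-inference

oc apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: neuron-vllm-test
  namespace: neuron-inference
spec:
  selector:
    matchLabels:
      app: neuron-vllm-test
  endpoints:
  - port: vllm-port   # named port from the service defined earlier
    path: /metrics
    interval: 30s
EOF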
Common issues
Here are some common issues and their troubleshooting steps.
Pods stuck in Pending state
Check that nodes have the feature.node.kubernetes.io/aws-neuron=true label and that Neuron resources are available:
oc describe node <node-name> | grep neuron

Driver not loading
Verify the KMM module is created and the DaemonSet is running:
oc get modules.kmm.sigs.x-k8s.io -A
oc get ds -n ai-operator-on-aws

Model download failures
Check that the Hugging Face token is valid and the model name is correct. Review init container logs:
oc logs <pod-name> -c fetch-model -n neuron-inference

Scheduler not placing pods
Ensure the custom scheduler is running and pods are using the correct scheduler name:
oc get pods -n ai-operator-on-aws | grep scheduler

What's next
The AWS Neuron Operator for OpenShift enables enterprise-grade AI acceleration. As AWS continues to invest in purpose-built AI chips and Red Hat enhances OpenShift's AI capabilities, expect more features and optimizations.
To support this vision, Red Hat AI Inference Server support for AWS AI chips (Inferentia and Trainium) is coming by January 2026. This developer preview will allow you to run the supported Red Hat AI Inference Server on AWS silicon, combining the cost efficiency of AWS Neuron with the lifecycle support and security of Red Hat AI.
To get started today:
- Review the operator documentation.
- Check out the kernel module repository.
- Explore the AWS Neuron SDK documentation.
- Join the discussion in the GitHub repositories.
The combination of AWS AI chips and OpenShift provides a powerful platform for running cost-effective AI workloads at scale. Whether you're deploying LLMs for customer service, content generation, or data analysis, this integration makes those workloads easier and more affordable to run.
Note
The AWS Neuron Operator is developed jointly by AWS and Red Hat. Contributions and feedback are welcome through the GitHub repositories.