Run cost-effective AI workloads on OpenShift with AWS Neuron Operator

December 2, 2025
Erwan Gallen, Yevgeny Shnaidman, Nenad Peric, Mikhail Shapirov (AWS)
Related topics: Artificial intelligence, Kubernetes, Open source, Operators
Related products: Red Hat AI, Red Hat OpenShift Service on AWS

    Large enterprises run LLM inference, training, and fine-tuning on Kubernetes for the scale and flexibility it provides. As organizations look to optimize both performance and cost, AWS Inferentia and Trainium chips provide a powerful, cost-effective option for accelerating these workloads, delivering up to 70% lower cost per inference compared to other instance types in many scenarios. Through a joint effort between AWS and Red Hat, these AWS AI chips are now available to customers using Red Hat OpenShift Service on AWS and self-managed OpenShift clusters on AWS, giving organizations more choice in how they design and run their AI platforms like Red Hat OpenShift AI.

    The AWS Neuron Operator brings native support for AWS AI chips to Red Hat OpenShift, enabling you to run inference with full LLM support using frameworks like vLLM. This integration combines the cost benefits of AWS silicon with the enterprise features of OpenShift and the overall Red Hat AI capabilities.

    What the AWS Neuron Operator does

    The AWS Neuron Operator automates the deployment and management of AWS Neuron devices on OpenShift clusters. It handles four key tasks:

    • Kernel module deployment: Installs Neuron drivers using Kernel Module Management (KMM)
    • Device plug-in management: Exposes Neuron devices as schedulable resources
    • Intelligent scheduling: Deploys a custom Neuron-aware scheduler for optimal workload placement
    • Telemetry collection: Provides basic metrics through a node-metrics DaemonSet

    The operator reconciles a custom resource called DeviceConfig that lets you configure images and target specific nodes in your cluster.

    Joint development by AWS and Red Hat

    This operator represents a collaboration between AWS and Red Hat engineering teams. The operator includes core functionality, Neuron integration, OpenShift integration patterns, and lifecycle management. Red Hat, as the originator of the Operator Framework before it became a CNCF project, developed the operator following established best practices.

    The project consists of two open source repositories:

    • operator-for-ai-chips-on-aws: The main operator and custom scheduler
    • kmod-with-kmm-for-ai-chips-on-aws: Automated builds of KMM-compatible kernel modules

    Both repositories use automated GitHub Actions workflows to build and publish container images to public registries, making installation straightforward.

    Why use AWS AI chips for LLM workloads

    AWS Inferentia and Trainium chips are purpose-built for machine learning. Inferentia focuses on inference workloads, while Trainium handles both training and inference. Here's what makes them compelling for LLM deployments:

    • Cost efficiency: Run inference at up to 50% lower cost compared to GPU instances. For high-volume inference workloads, this translates to significant savings.
    • Performance: Inferentia2 delivers up to 4x higher throughput and 10x lower latency than first-generation Inferentia. Trainium offers high-performance training for models with hundreds of billions of parameters.
    • Framework support: The Neuron SDK integrates with popular frameworks including PyTorch, TensorFlow, and vLLM. You can deploy models from Hugging Face with minimal code changes.
    • Full LLM support: Run popular models like Llama 2, Llama 3, Mistral, and other transformer-based architectures. The vLLM integration provides optimized inference with features like continuous batching and PagedAttention.

    Architecture overview

    The operator uses several OpenShift and Kubernetes components to enable Neuron devices:

    • Node Feature Discovery (NFD): Detects Neuron PCI devices (vendor ID 1d0f) and labels nodes accordingly. This allows the operator to target the right nodes.
    • Kernel Module Management (KMM): Loads the Neuron kernel driver on nodes with compatible hardware. KMM handles kernel version matching automatically, even across OpenShift upgrades.
    • Custom Scheduler: A Neuron-aware scheduler extension that understands neuron core topology. This ensures workloads are placed on nodes with available neuron cores, not just nodes with Neuron devices.
    • Device plug-in: Exposes aws.amazon.com/neuron and aws.amazon.com/neuroncore as allocatable resources. Pods can request these resources in their resource limits.

    The operator manages all these components through a single DeviceConfig custom resource, simplifying operations.
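
    For illustration, here is a minimal pod sketch that ties these pieces together: it opts into the Neuron-aware scheduler via schedulerName and requests a single Neuron core exposed by the device plug-in. The pod name and container image below are placeholders, not part of the operator:

    apiVersion: v1
    kind: Pod
    metadata:
      name: neuron-smoke-test                # placeholder name, for illustration only
    spec:
      schedulerName: neuron-scheduler        # use the Neuron-aware scheduler
      containers:
        - name: app
          image: public.ecr.aws/docker/library/python:3.11-slim   # placeholder image
          command: ["sleep", "infinity"]
          resources:
            limits:
              aws.amazon.com/neuroncore: 1   # resource exposed by the Neuron device plug-in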

    Installing the AWS Neuron Operator

    You can install the operator through the OpenShift web console or using the command line. Both methods require two prerequisite operators from Red Hat, plus the AWS Neuron Operator itself.

    Prerequisites

    Install the following operators from OperatorHub, in this order:

    1. Node Feature Discovery (NFD): Detects hardware features
    2. Kernel Module Management (KMM): Manages kernel drivers
    3. AWS Neuron Operator (by AWS): Manages Neuron devices

    All three operators are available in the OpenShift OperatorHub catalog.

    Installation via OpenShift console (recommended)

    This method uses the OpenShift web console and is the easiest way to get started. The instructions below were validated with OpenShift 4.20.4 on Red Hat OpenShift Service on AWS.

    Step 1: Install Node Feature Discovery

    1. Open your cluster's web console.
    2. Navigate to Ecosystem → Software Catalog (under the openshift-operators-redhat project).
    3. Search for "Node Feature Discovery."
    4. Click Node Feature Discovery provided by Red Hat.
    5. Click Install, then Install again at the bottom.
    6. Once installed, click View Operator.
    7. Click Create Instance under NodeFeatureDiscovery.
    8. Click Create at the bottom (use default settings).

    Step 2: Apply the NFD Rule for Neuron Devices

    We will use namespace ai-operator-on-aws as the target for our configuration settings. Create this namespace first:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        control-plane: controller-manager
        security.openshift.io/scc.podSecurityLabelSync: 'true'
      name: ai-operator-on-aws
    EOF

    Create the node feature discovery rule:

    oc apply -f - <<EOF
    apiVersion: nfd.openshift.io/v1alpha1
    kind: NodeFeatureRule
    metadata:
      name: neuron-nfd-rule
      namespace: ai-operator-on-aws
    spec:
      rules:
        - name: neuron-device
          labels:
            feature.node.kubernetes.io/aws-neuron: "true"
          matchAny:
            - matchFeatures:
                - feature: pci.device
                  matchExpressions:
                    vendor: {op: In, value: ["1d0f"]}
                    device: {op: In, value: [
                      "7064",
                      "7065",
                      "7066",
                      "7067",
                      "7164",
                      "7264",
                      "7364",
                    ]}
    EOF

    This rule labels nodes that have AWS Neuron devices, making them discoverable by the operator.
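
    After NFD processes the rule (this can take a minute or two), you can confirm that the label was applied to your Neuron instances:

    oc get nodes -l feature.node.kubernetes.io/aws-neuron=true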

    Step 3: Install Kernel Module Management

    1. Go back to Ecosystem → Software Catalog.
    2. Search for "Kernel Module."
    3. Click Kernel Module Management provided by Red Hat.
    4. Click Install, then Install again.

    Step 4: Install AWS Neuron Operator

    1. Go back to Ecosystem → Software Catalog.
    2. Search for "AWS Neuron."
    3. Click AWS Neuron Operator provided by Amazon, Inc.
    4. Click Install, then Install again.
    5. Once installed, click View Operator.
    6. Click Create Instance under DeviceConfig.
    7. Update the YAML with your desired configuration (see below).
    8. Click Create.

    Installation via command line

    For automation or CI/CD pipelines, use the command-line installation method.

    Step 1: Install prerequisites

    Install NFD and KMM operators through OperatorHub first, then create the NFD instance and apply the NFD rule shown above.
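
    If you prefer to script the prerequisites as well, they can be installed with OLM Subscription resources. The sketch below covers NFD only and makes assumptions about the package name, channel, and target namespace; confirm them against your catalog with oc get packagemanifests before applying, and repeat the same pattern for Kernel Module Management:

    # Verify the exact package names and channels available in your catalog first
    oc get packagemanifests -n openshift-marketplace | grep -Ei 'nfd|kernel-module'

    oc apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-nfd
    ---
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: openshift-nfd
      namespace: openshift-nfd
    spec:
      targetNamespaces:
        - openshift-nfd
    ---
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: nfd
      namespace: openshift-nfd
    spec:
      channel: stable                 # assumption: check the packagemanifest output
      name: nfd                       # assumption: package name in the redhat-operators catalog
      source: redhat-operators
      sourceNamespace: openshift-marketplace
    EOF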

    Step 2: Install the Operator

    Enter the following:

    # Install the latest version
    kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/latest/download/aws-neuron-operator.yaml
    # Or install a specific version
    kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/download/v0.1.1/aws-neuron-operator.yaml

    Create DeviceConfig

    Create a DeviceConfig resource file named deviceconfig.yaml:

    apiVersion: k8s.aws/v1alpha1
    kind: DeviceConfig
    metadata:
     name: neuron
     namespace: ai-operator-on-aws
    spec:
     driversImage: public.ecr.aws/q5p6u7h8/neuron-openshift/neuron-kernel-module:2.24.7.0  # actual pull at runtime will use <image>-$KERNEL_VERSION
     devicePluginImage: public.ecr.aws/neuron/neuron-device-plugin:2.24.23.0
     customSchedulerImage: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.32.9-eks-1-32-24
     schedulerExtensionImage: public.ecr.aws/neuron/neuron-scheduler:2.24.23.0
     selector:
       feature.node.kubernetes.io/aws-neuron: "true"

    Apply it:

    oc apply -f deviceconfig.yaml

    The operator will automatically append the kernel version to the driversImage at runtime, ensuring the correct driver is loaded.
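
    If you want to see which kernel version will be appended for your nodes, the KERNEL-VERSION column of the wide node listing shows it:

    oc get nodes -l feature.node.kubernetes.io/aws-neuron=true -o wide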

    Verify installation

    Check that all components are running:

    # Check operator pods
    oc get pods -n ai-operator-on-aws
    # Verify KMM module
    oc get modules.kmm.sigs.x-k8s.io -A
    # Check node labels
    oc get nodes -l feature.node.kubernetes.io/aws-neuron=true
    # Verify Neuron resources are available
    kubectl get nodes -o json | jq -r '
      .items[]
      | select(((.status.capacity["aws.amazon.com/neuron"] // "0") | tonumber) > 0)
      | .metadata.name as $name
      | "\($name)\n  Neuron devices: \(.status.capacity["aws.amazon.com/neuron"])\n  Neuron cores: \(.status.capacity["aws.amazon.com/neuroncore"])"
    '

    You should see nodes with available Neuron devices and cores.

    Running LLM inference with vLLM

    Once the operator is installed, you can deploy LLM inference workloads using vLLM, a high-performance inference engine optimized for AWS Neuron.

    Set up the inference environment

    Prepare the OpenShift cluster by creating the necessary namespace, persistent storage for model caching, and authentication secrets.

    Step 1: Create a namespace

    oc create namespace neuron-inference

    Step 2: Create a PersistentVolumeClaim for model storage

    This PVC stores the downloaded model, so you don't need to download it every time you restart the deployment.

    oc apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-cache
      namespace: neuron-inference
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: gp3-csi
    EOF

    Step 3: Create a Hugging Face token secret

    Most LLMs require authentication to download from Hugging Face. Ensure that you have access to the meta-llama/Llama-3.1-8B-Instruct model, or substitute the model you would like to use instead:

    oc create secret generic hf-token \
      --from-literal=token=YOUR_HF_TOKEN \
      -n neuron-inference

    Step 4: Deploy the vLLM inference server

    Create a deployment file deployment.yaml that downloads the model and runs the vLLM server:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
     name: neuron-vllm-test
     namespace: neuron-inference
     labels:
       app: neuron-vllm-test
    spec:
     replicas: 1
     selector:
       matchLabels:
         app: neuron-vllm-test
     template:
       metadata:
         labels:
           app: neuron-vllm-test
       spec:
         schedulerName: neuron-scheduler
         volumes:
           - name: model-volume
             persistentVolumeClaim:
               claimName: model-cache
           - name: shm
             emptyDir:
               medium: Memory
               sizeLimit: "2Gi"
         serviceAccountName: default
         initContainers:
           - name: fetch-model
             image: python:3.11-slim
             env:
               - name: DOCKER_CONFIG
                 value: /auth
               - name: HF_HOME
                 value: /model
               - name: HF_TOKEN
                 valueFrom:
                   secretKeyRef:
                     name: hf-token # Your existing secret
                     key: token
             command: ["/bin/sh","-c"]
             args:
               - |
                set -ex
                echo "--- SCRIPT STARTED ---"
                echo "--- CHECKING /model DIRECTORY PERMISSIONS AND CONTENTS ---"
                # Only pull if /model is empty
                if [ ! -f "/model/config.json" ]; then
                 export PYTHONUSERBASE="/tmp/pip"
                 export PATH="$PYTHONUSERBASE/bin:$PATH"
                 pip install --no-cache-dir --user "huggingface_hub>=1.0"
                 echo "Pulling model..."
                 $PYTHONUSERBASE/bin/hf download meta-llama/Llama-3.1-8B-Instruct --local-dir /model
                else
                 echo "Model already present, skipping model pull"
                fi
             volumeMounts:
               - name: model-volume
                 mountPath: /model
         containers:
           - name: granite
             image: 'public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.7.2-neuronx-py310-sdk2.24.1-ubuntu22.04'
             imagePullPolicy: IfNotPresent
             workingDir: /model
             env:
               - name: VLLM_SERVER_DEV_MODE
                 value: '1'
               - name: NEURON_CACHE_URL
                 value: "/model/neuron_cache"
             command:
               - python
               - '-m'
               - vllm.entrypoints.openai.api_server
             args:
               - '--port=8000'
               - '--model=/model'
               - '--served-model-name=meta-llama/Llama-3.1-8B-Instruct'
               - '--tensor-parallel-size=2'
               - '--device'
               - 'neuron'
               - '--max-num-seqs=4'
               - '--max-model-len=4096'
             resources:
               limits:
                 memory: "100Gi"
                 aws.amazon.com/neuron: 1
               requests:
                 memory: "10Gi"
                 aws.amazon.com/neuron: 1
             volumeMounts:
               - name: model-volume
                 mountPath: /model
               - name: shm
                 mountPath: /dev/shm
         restartPolicy: Always

    Step 5: Expose the Service

    Create a service and a route for external access, and save them in service.yaml:

    apiVersion: v1
    kind: Service
    metadata:
     name: neuron-vllm-test
     namespace: neuron-inference
    spec:
     selector:
       app: neuron-vllm-test
     ports:
       - name: vllm-port
         protocol: TCP
         port: 80
         targetPort: 8000
    ---
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
     name: neuron-vllm-test
     namespace: neuron-inference
    spec:
     to:
       kind: Service
       name: neuron-vllm-test
     port:
       targetPort: vllm-port
     tls:
       termination: edge
       insecureEdgeTerminationPolicy: Redirect

    Apply all resources:

    oc apply -f deployment.yaml
    oc apply -f service.yaml
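
    The first rollout can take a while because the init container downloads the model into the PVC. You can watch progress with:

    # Wait for the deployment to become available
    oc rollout status deployment/neuron-vllm-test -n neuron-inference
    # Follow the model download in the init container
    oc logs -f deployment/neuron-vllm-test -c fetch-model -n neuron-inference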

    Testing the inference endpoint

    Once the vLLM server is running, you can send requests to the OpenAI-compatible API:

    # Get the route URL
    ROUTE_URL=$(oc get route neuron-vllm-test -n neuron-inference -o jsonpath='{.spec.host}')
    # Send a test request
    curl https://$ROUTE_URL/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}],
        "max_tokens": 50
      }'

    The vLLM server provides an OpenAI-compatible API, making it easy to integrate with existing applications.
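
    As a quick sanity check, you can also list the models the server exposes; the /v1/models endpoint is part of the same OpenAI-compatible API:

    curl https://$ROUTE_URL/v1/models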

    Cost optimization strategies

    Running LLMs on AWS Neuron chips can significantly reduce your inference costs. Here are strategies to maximize savings:

    • Use Inferentia2 for inference-only workloads. Inferentia2 instances like inf2.xlarge start at a fraction of the cost of comparable GPU instances. For production inference, this is the most cost-effective option.
    • Leverage continuous batching. vLLM's continuous batching feature maximizes throughput by dynamically batching requests. This increases utilization and reduces cost per inference.
    • Right-size your instances. Start with smaller instance types and scale up based on actual usage. Inferentia2 instances come in various sizes from inf2.xlarge (1 Neuron device) to inf2.48xlarge (12 devices).
    • Use Spot instances for development. Red Hat OpenShift Service on AWS supports EC2 Spot instances through machine pools. Use Spot for development and testing environments to save up to 90% (see the example after this list).
    • Cache models on persistent volumes. As shown in the vLLM example, caching models on PVCs eliminates repeated downloads and reduces startup time.
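
    For example, a Spot-backed machine pool for development could be created with the rosa CLI along the lines of the sketch below. The cluster name, pool name, and replica count are placeholders, and you should confirm the exact flags for your rosa version with rosa create machinepool --help:

    rosa create machinepool --cluster my-cluster \
      --name neuron-dev-spot \
      --instance-type inf2.xlarge \
      --replicas 2 \
      --use-spot-instances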

    Monitoring and troubleshooting

    The operator includes basic telemetry through the node-metrics DaemonSet. For production deployments, integrate with OpenShift monitoring.
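
    A common first step is to enable user workload monitoring so the built-in Prometheus stack can scrape metrics from user namespaces; scraping the operator's node-metrics DaemonSet itself would additionally require a ServiceMonitor matched to its metrics endpoint, which is not shown here:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true
    EOF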

    Common issues

    Here are some common issues and their troubleshooting steps.

    Pods stuck in Pending state

    Check that nodes have the feature.node.kubernetes.io/aws-neuron=true label and that Neuron resources are available:

    oc describe node <node-name> | grep neuron
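
    If the node labels look correct, the scheduler's reason for leaving the pod Pending is usually visible in the pod's events:

    oc describe pod <pod-name> -n neuron-inference
    oc get events -n neuron-inference --sort-by=.lastTimestamp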

    Driver not loading

    Verify the KMM module is created and the DaemonSet is running:

    oc get modules.kmm.sigs.x-k8s.io -A
    oc get ds -n ai-operator-on-aws

    Model download failures

    Check that the Hugging Face token is valid and the model name is correct. Review init container logs:

    oc logs <pod-name> -c fetch-model -n neuron-inference

    Scheduler not placing pods

    Ensure the custom scheduler is running and pods are using the correct scheduler name:

    oc get pods -n ai-operator-on-aws | grep scheduler
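
    You can also confirm that a given pod actually requested the custom scheduler:

    oc get pod <pod-name> -n neuron-inference -o jsonpath='{.spec.schedulerName}'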

    What's next

    The AWS Neuron Operator for OpenShift enables enterprise-grade AI acceleration. As AWS continues to invest in purpose-built AI chips and Red Hat enhances OpenShift's AI capabilities, expect more features and optimizations.

    To support this vision, Red Hat AI Inference Server support for AWS AI chips (Inferentia and Trainium) is coming by January 2026. This developer preview will allow you to run the supported Red Hat AI Inference Server on AWS silicon, combining the cost efficiency of AWS Neuron with the lifecycle support and security of Red Hat AI.

    To get started today:

    • Review the operator documentation.
    • Check out the kernel module repository.
    • Explore the AWS Neuron SDK documentation.
    • Join the discussion in the GitHub repositories.

    The combination of AWS AI chips and OpenShift provides a powerful platform for running cost-effective AI workloads at scale. Whether you're deploying LLMs for customer service, content generation, or data analysis, this integration makes it easier and more affordable.

     

    Note

    The AWS Neuron Operator is developed jointly by AWS and Red Hat. Contributions and feedback are welcome through the GitHub repositories.

    Last updated: December 3, 2025
