Run cost-effective AI workloads on OpenShift with AWS Neuron Operator

December 2, 2025
Erwan Gallen, Yevgeny Shnaidman, Nenad Peric, Mikhail Shapirov (AWS)
Related topics: Artificial intelligence, Kubernetes, Open source, Operators
Related products: Red Hat AI, Red Hat OpenShift Service on AWS

    Large enterprises run LLM inference, training, and fine-tuning on Kubernetes for the scale and flexibility it provides. As organizations look to optimize both performance and cost, AWS Inferentia and Trainium chips provide a powerful, cost-effective option for accelerating these workloads, delivering up to 70% lower cost per inference compared to other instance types in many scenarios. Through a joint effort between AWS and Red Hat, these AWS AI chips are now available to customers using Red Hat OpenShift Service on AWS and self-managed OpenShift clusters on AWS, giving organizations more choice in how they design and run their AI platforms.

    The AWS Neuron Operator brings native support for AWS AI chips to Red Hat OpenShift, enabling you to run inference with full LLM support using frameworks like vLLM. This integration combines the cost benefits of AWS silicon with the enterprise features of OpenShift.

    What the AWS Neuron Operator does

    The AWS Neuron Operator automates the deployment and management of AWS Neuron devices on OpenShift clusters. It handles four key tasks:

    • Kernel module deployment: Installs Neuron drivers using Kernel Module Management (KMM)
    • Device plug-in management: Exposes Neuron devices as schedulable resources
    • Intelligent scheduling: Deploys a custom Neuron-aware scheduler for optimal workload placement
    • Telemetry collection: Provides basic metrics through a node-metrics DaemonSet

    The operator reconciles a custom resource called DeviceConfig that lets you configure images and target specific nodes in your cluster.

    Joint development by AWS and Red Hat

    This operator is the result of a collaboration between AWS and Red Hat engineering teams. It covers core operator functionality, Neuron integration, OpenShift integration patterns, and lifecycle management. Red Hat, which originated the Operator Framework before it became a CNCF project, developed the operator following established best practices.

    The project consists of two open source repositories:

    • operator-for-ai-chips-on-aws: The main operator and custom scheduler
    • kmod-with-kmm-for-ai-chips-on-aws: Automated builds of KMM-compatible kernel modules

    Both repositories use automated GitHub Actions workflows to build and publish container images to public registries, making installation straightforward.

    Why use AWS AI chips for LLM workloads

    AWS Inferentia and Trainium chips are purpose-built for machine learning. Inferentia focuses on inference workloads, while Trainium handles both training and inference. Here's what makes them compelling for LLM deployments:

    • Cost efficiency: Run inference at up to 50% lower cost compared to GPU instances. For high-volume inference workloads, this translates to significant savings.
    • Performance: Inferentia2 delivers up to 4x higher throughput and 10x lower latency than first-generation Inferentia. Trainium offers high-performance training for models with hundreds of billions of parameters.
    • Framework support: The Neuron SDK integrates with popular frameworks including PyTorch, TensorFlow, and vLLM. You can deploy models from Hugging Face with minimal code changes.
    • Full LLM support: Run popular models like Llama 2, Llama 3, Mistral, and other transformer-based architectures. The vLLM integration provides optimized inference with features like continuous batching and PagedAttention.

    Architecture overview

    The operator uses several OpenShift and Kubernetes components to enable Neuron devices:

    • Node Feature Discovery (NFD): Detects Neuron PCI devices (vendor ID 1d0f) and labels nodes accordingly. This allows the operator to target the right nodes.
    • Kernel Module Management (KMM): Loads the Neuron kernel driver on nodes with compatible hardware. KMM handles kernel version matching automatically, even across OpenShift upgrades.
    • Custom scheduler: A Neuron-aware scheduler extension that understands Neuron core topology. This ensures workloads are placed on nodes with available Neuron cores, not just nodes with Neuron devices.
    • Device plug-in: Exposes aws.amazon.com/neuron and aws.amazon.com/neuroncore as allocatable resources. Pods can request these resources in their resource limits.
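
    For example, a workload requests a Neuron device through standard Kubernetes resource limits. The pod below is a minimal sketch for illustration only; the image and command are placeholders, and a real workload would use a Neuron-enabled container such as the vLLM image shown later in this article:

    apiVersion: v1
    kind: Pod
    metadata:
      name: neuron-resource-test
    spec:
      containers:
      - name: app
        image: registry.access.redhat.com/ubi9/ubi-minimal   # placeholder image
        command: ["sleep", "infinity"]
        resources:
          limits:
            aws.amazon.com/neuron: 1    # one Neuron device; use aws.amazon.com/neuroncore for core-level requests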

    The operator manages all these components through a single DeviceConfig custom resource, simplifying operations.

    Installing the AWS Neuron Operator

    You can install the operator through the OpenShift web console or using the command line. Both methods require two prerequisite operators from Red Hat, Node Feature Discovery and Kernel Module Management, in addition to the AWS Neuron Operator itself.

    Prerequisites

    Before you can run Neuron workloads, install these operators from OperatorHub:

    • Node Feature Discovery (NFD): Detects hardware features
    • Kernel Module Management (KMM): Manages kernel drivers
    • AWS Neuron Operator (by AWS): Manages Neuron devices

    All three operators are available in the OpenShift OperatorHub catalog.

    Installation via OpenShift console (recommended)

    This method uses the OpenShift web console and is the easiest way to get started.

    Step 1: Install Node Feature Discovery

    1. Open your cluster's web console.
    2. Navigate to Operators → OperatorHub.
    3. Search for Node Feature Discovery.
    4. Click Node Feature Discovery provided by Red Hat.
    5. Click Install, then Install again at the bottom.
    6. Once installed, click View Operator.
    7. Click Create Instance under NodeFeatureDiscovery.
    8. Click Create at the bottom (use default settings).

    Step 2: Apply the NFD Rule for Neuron Devices

    Create a file named neuron-nfd-rule.yaml:

    apiVersion: nfd.openshift.io/v1alpha1
    kind: NodeFeatureRule
    metadata:
      name: neuron-nfd-rule
      namespace: ai-operator-on-aws
    spec:
      rules:
        - name: neuron-device
          labels:
            feature.node.kubernetes.io/aws-neuron: "true"
          matchAny:
            - matchFeatures:
                - feature: pci.device
                  matchExpressions:
                    vendor: {op: In, value: ["1d0f"]}
                    device: {op: In, value: [
                      "7064", "7065", "7066", "7067",
                      "7164", "7264", "7364"
                    ]}

    Apply it:

    oc apply -f neuron-nfd-rule.yaml

    This rule labels nodes that have AWS Neuron devices, making them discoverable by the operator.
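
    Once NFD has evaluated the rule on nodes that carry Neuron hardware, you can confirm that the label was applied:

    oc get nodes -l feature.node.kubernetes.io/aws-neuron=true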

    Step 3: Install Kernel Module Management

    1. Go back to Operators → OperatorHub.
    2. Search for Kernel Module Management.
    3. Click Kernel Module Management provided by Red Hat.
    4. Click Install, then Install again.

    Step 4: Install AWS Neuron Operator

    1. Go to Operators → OperatorHub.
    2. Search for AWS Neuron.
    3. Click AWS Neuron Operator provided by Amazon, Inc.
    4. Click Install, then Install again.
    5. Once installed, click View Operator.
    6. Click Create Instance under DeviceConfig.
    7. Update the YAML with your desired configuration (see below).
    8. Click Create.

    Installation via command line

    For automation or CI/CD pipelines, use the command-line installation method.

    Step 1: Install Prerequisites

    Install NFD and KMM operators through OperatorHub first, then create the NFD instance and apply the NFD rule shown above.
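
    If you want to create the NFD instance from the command line as well, a minimal NodeFeatureDiscovery custom resource looks like the sketch below. It assumes the NFD operator was installed into its default openshift-nfd namespace; depending on the operator version, additional spec fields (such as the operand image) may be required, so compare it with the defaults the console form generates. Create a file named nfd-instance.yaml:

    apiVersion: nfd.openshift.io/v1
    kind: NodeFeatureDiscovery
    metadata:
      name: nfd-instance
      namespace: openshift-nfd   # adjust if you installed the NFD operator elsewhere
    spec: {}                     # defaults, equivalent to clicking Create in the console

    Apply it:

    oc apply -f nfd-instance.yaml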

    Step 2: Install the Operator

    # Install the latest version
    kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/latest/download/aws-neuron-operator.yaml
    # Or install a specific version
    kubectl apply -f https://github.com/awslabs/operator-for-ai-chips-on-aws/releases/download/v0.1.1/aws-neuron-operator.yaml

    Step 3: Create DeviceConfig

    Create a file named deviceconfig.yaml:

    apiVersion: k8s.aws/v1alpha1
    kind: DeviceConfig
    metadata:
      name: neuron
      namespace: ai-operator-on-aws
    spec:
      driversImage: public.ecr.aws/q5p6u7h8/neuron-openshift/neuron-kernel-module:2.24.7.0
      devicePluginImage: public.ecr.aws/neuron/neuron-device-plugin:2.24.23.0
      customSchedulerImage: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.32.9-eks-1-32-24
      schedulerExtensionImage: public.ecr.aws/neuron/neuron-scheduler:2.24.23.0
      selector:
        feature.node.kubernetes.io/aws-neuron: "true"

    Apply it:

    oc apply -f deviceconfig.yaml

    The operator will automatically append the kernel version to the driversImage at runtime, ensuring the correct driver is loaded.

    Verify installation

    Check that all components are running:

    # Check operator pods
    oc get pods -n ai-operator-on-aws
    # Verify KMM module
    oc get modules.kmm.sigs.x-k8s.io -A
    # Check node labels
    oc get nodes -l feature.node.kubernetes.io/aws-neuron=true
    # Verify Neuron resources are available
    kubectl get nodes -o json | jq -r '
      .items[]
      | select(((.status.capacity["aws.amazon.com/neuron"] // "0") | tonumber) > 0)
      | .metadata.name as $name
      | "\($name)\n  Neuron devices: \(.status.capacity["aws.amazon.com/neuron"])\n  Neuron cores: \(.status.capacity["aws.amazon.com/neuroncore"])"
    '

    You should see nodes with available Neuron devices and cores.

    Running LLM inference with vLLM

    Once the operator is installed, you can deploy LLM inference workloads using vLLM, a high-performance inference engine optimized for AWS Neuron.

    Set up the inference environment

    1. Create a namespace:

      oc create namespace neuron-inference
    2. Create a PersistentVolumeClaim for model storage. This PVC stores the downloaded model, so you don't need to download it every time you restart the deployment.

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-storage
        namespace: neuron-inference
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
    3. Create a Hugging Face token secret. Most LLMs require authentication to download from Hugging Face.

      oc create secret generic hf-token \
        --from-literal=token=YOUR_HF_TOKEN \
        -n neuron-inference
    4. Deploy the vLLM Inference Server. Create a deployment that downloads the model and runs the vLLM server:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: vllm-inference
        namespace: neuron-inference
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: vllm-inference
        template:
          metadata:
            labels:
              app: vllm-inference
          spec:
            initContainers:
            - name: model-downloader
              image: python:3.10-slim
              command:
              - /bin/bash
              - -c
              - |
                pip install huggingface_hub
                python -c "
                from huggingface_hub import snapshot_download
                import os
                token = os.environ.get('HF_TOKEN')
                snapshot_download('meta-llama/Llama-2-7b-hf', 
                                local_dir='/model',
                                token=token)
                "
              env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token
                    key: token
              volumeMounts:
              - name: model-storage
                mountPath: /model
            containers:
            - name: vllm-server
              image: public.ecr.aws/neuron/vllm-neuron:latest
              command:
              - python
              - -m
              - vllm.entrypoints.openai.api_server
              - --model
              - /model
              - --tensor-parallel-size
              - "2"
              ports:
              - containerPort: 8000
                name: http
              resources:
                limits:
                  aws.amazon.com/neuron: 2
                requests:
                  aws.amazon.com/neuron: 2
              volumeMounts:
              - name: model-storage
                mountPath: /model
            volumes:
            - name: model-storage
              persistentVolumeClaim:
                claimName: model-storage
    5. Expose the Service. Create a service and route for external access:

      apiVersion: v1
      kind: Service
      metadata:
        name: vllm-service
        namespace: neuron-inference
      spec:
        selector:
          app: vllm-inference
        ports:
        - port: 8000
          targetPort: 8000
          name: http
      ---
      apiVersion: route.openshift.io/v1
      kind: Route
      metadata:
        name: vllm-route
        namespace: neuron-inference
      spec:
        to:
          kind: Service
          name: vllm-service
        port:
          targetPort: http
        tls:
          termination: edge

      Apply all resources:

      oc apply -f pvc.yaml
      oc apply -f deployment.yaml
      oc apply -f service.yaml
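
      The initial model download can take several minutes. Watch the rollout until the server reports ready:

      oc rollout status deployment/vllm-inference -n neuron-inference
      oc get pods -n neuron-inference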

    Testing the inference endpoint

    Once the vLLM server is running, you can send requests to the OpenAI-compatible API:

    # Get the route URL
    ROUTE_URL=$(oc get route vllm-route -n neuron-inference -o jsonpath='{.spec.host}')
    # Send a test request
    curl https://${ROUTE_URL}/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/model",
        "prompt": "Explain quantum computing in simple terms:",
        "max_tokens": 100,
        "temperature": 0.7
      }'

    The vLLM server provides an OpenAI-compatible API, making it easy to integrate with existing applications.
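
    Because the API is OpenAI-compatible, you can also call it from the openai Python package. This is a minimal sketch: replace <route-host> with the route host retrieved above, and note that the placeholder API key works only because no key enforcement is configured on this server.

    # pip install openai
    from openai import OpenAI

    client = OpenAI(
        base_url="https://<route-host>/v1",
        api_key="not-used",  # placeholder; this deployment does not enforce API keys
    )

    response = client.completions.create(
        model="/model",      # matches the --model path passed to the vLLM server
        prompt="Explain quantum computing in simple terms:",
        max_tokens=100,
        temperature=0.7,
    )
    print(response.choices[0].text)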

    Cost optimization strategies

    Running LLMs on AWS Neuron chips can significantly reduce your inference costs. Here are strategies to maximize savings:

    • Use Inferentia2 for inference-only workloads. Inferentia2 instances like inf2.xlarge start at a fraction of the cost of comparable GPU instances. For production inference, this is the most cost-effective option.
    • Leverage continuous batching. vLLM's continuous batching feature maximizes throughput by dynamically batching requests. This increases utilization and reduces cost per inference.
    • Right-size your instances. Start with smaller instance types and scale up based on actual usage. Inferentia2 instances come in various sizes from inf2.xlarge (1 Neuron device) to inf2.48xlarge (12 devices).
    • Use Spot instances for development. Red Hat OpenShift Service on AWS supports EC2 Spot instances through machine pools. Use Spot for development and testing environments to save up to 90%.
    • Cache models on persistent volumes. As shown in the vLLM example, caching models on PVCs eliminates repeated downloads and reduces startup time.
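
    To act on the Spot recommendation on Red Hat OpenShift Service on AWS, you add Spot capacity through a machine pool. The rosa invocation below is only a sketch: the cluster name is a placeholder, and the flag names should be verified against rosa create machinepool --help for your CLI version.

    # Sketch only; verify flags with `rosa create machinepool --help`
    rosa create machinepool \
      --cluster my-cluster \
      --name inf2-spot \
      --instance-type inf2.xlarge \
      --replicas 2 \
      --use-spot-instances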

    Monitoring and troubleshooting

    The operator includes basic telemetry through the node-metrics DaemonSet. For production deployments, integrate with OpenShift monitoring.
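
    One way to do this is to enable user workload monitoring on the cluster and point a ServiceMonitor at the metrics endpoint exposed by the node-metrics DaemonSet. The resource below is a sketch only: the Service name, label, and port name are hypothetical, so list the Services the operator actually creates (oc get svc -n ai-operator-on-aws) and adjust the selector and port to match.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: neuron-node-metrics
      namespace: ai-operator-on-aws
    spec:
      selector:
        matchLabels:
          app: neuron-node-metrics   # hypothetical label; match the actual metrics Service
      endpoints:
      - port: metrics                # hypothetical port name; check the Service definition
        interval: 30s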

    Common issues

    Here are some common issues and their troubleshooting steps.

    Pods stuck in Pending state

    Check that nodes have the feature.node.kubernetes.io/aws-neuron=true label and that Neuron resources are available:

    oc describe node <node-name> | grep neuron

    Driver not loading

    Verify the KMM module is created and the DaemonSet is running:

    oc get modules.kmm.sigs.x-k8s.io -A
    oc get ds -n ai-operator-on-aws

    Model download failures

    Check that the Hugging Face token is valid and the model name is correct. Review init container logs:

    oc logs <pod-name> -c model-downloader -n neuron-inference

    Scheduler not placing pods

    Ensure the custom scheduler is running and pods are using the correct scheduler name:

    oc get pods -n ai-operator-on-aws | grep scheduler
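
    If the Neuron scheduler runs as a secondary scheduler rather than extending the default one, workloads must opt in through spec.schedulerName in their pod template. The name below is an assumption for illustration; take the real name from the scheduler deployment listed by the command above:

    apiVersion: v1
    kind: Pod
    metadata:
      name: neuroncore-test
      namespace: neuron-inference
    spec:
      schedulerName: neuron-scheduler    # hypothetical; confirm against the scheduler deployment
      containers:
      - name: app
        image: registry.access.redhat.com/ubi9/ubi-minimal   # placeholder image
        command: ["sleep", "3600"]
        resources:
          limits:
            aws.amazon.com/neuroncore: 1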

    What's next

    The AWS Neuron Operator for OpenShift enables enterprise-grade AI acceleration. As AWS continues to invest in purpose-built AI chips and Red Hat enhances OpenShift's AI capabilities, expect more features and optimizations.

    To get started:

    • Review the operator documentation
    • Check out the kernel module repository
    • Explore AWS Neuron SDK documentation
    • Join the discussion in the GitHub repositories

    The combination of AWS AI chips and OpenShift provides a powerful platform for running cost-effective AI workloads at scale. Whether you're deploying LLMs for customer service, content generation, or data analysis, this integration makes it easier and more affordable.


    Note

    The AWS Neuron Operator is developed jointly by AWS and Red Hat. Contributions and feedback are welcome through the GitHub repositories.
