Running AI inference on Rebellions ATOM NPU with Red Hat AI

Deploying large language models on Rebellions ATOM NPUs with Red Hat OpenShift AI

May 27, 2026
Erwan Gallen Chris Procter Liming Tsai
Related topics:
AI inference, Artificial intelligence
Related products:
Red Hat AI Inference, Red Hat OpenShift AI

    As enterprises scale AI from proof of concept to production, they need flexible, cost-effective inference infrastructure, including a choice of the accelerators that power it. Neural processing units (NPUs), purpose-built for AI inference, complement existing infrastructure by delivering high throughput with greater energy efficiency, giving your organization more options to optimize cost and performance at scale.

    In December 2025, Red Hat and Rebellions announced a joint solution bringing Rebellions' ATOM NPUs to Red Hat OpenShift AI, reinforcing Red Hat's "any model, any accelerator, any cloud" strategy. Today, following months of intensive co-engineering, that solution is generally available.

    This milestone highlights Red Hat's leadership in the vLLM ecosystem, where we have collaborated with Rebellions to drive upstream contributions. In this post, we walk through how to deploy and serve large language models (LLMs) on Rebellions ATOM NPUs using Red Hat OpenShift AI and a certified vLLM container image on the Red Hat AI Inference Server.

    What is the Rebellions ATOM NPU?

    Rebellions, South Korea's first "unicorn" in the AI chip industry, designs processors optimized specifically for AI inference. The ATOM NPU delivers high throughput and low latency for LLM serving while consuming significantly less power than traditional GPUs, reducing both deployment and operational costs at the server and rack level.

    Each ATOM chip provides 16 GB of on-chip memory. A typical ATOM Max card contains 4 chips, and a single server can house multiple cards, providing substantial aggregate memory and compute for even large models. For example, a server with dual ATOM Max cards exposes 8 NPU devices with 128 GB of total NPU memory, which is sufficient to run 70B-parameter models.
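The memory arithmetic above can be sketched in a few lines of Python. This is a rough, weights-only estimate: the 16 GB per chip and 4 chips per card come from the text, while the function names and the bits-per-weight choices are assumptions made for illustration. KV cache, activations, and runtime overhead are ignored.

```python
GB = 10**9  # decimal gigabyte, matching the "16 GB" figure quoted in the text


def total_npu_memory_gb(cards: int, chips_per_card: int = 4, gb_per_chip: int = 16) -> int:
    """Aggregate NPU memory for one server, using the per-chip figure from the text."""
    return cards * chips_per_card * gb_per_chip


def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only size in GB; ignores KV cache, activations, and runtime overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


print(total_npu_memory_gb(cards=2))   # 128
print(weight_footprint_gb(70, 16))    # 140.0
print(weight_footprint_gb(70, 8))     # 70.0
print(weight_footprint_gb(70, 4))     # 35.0
```

Under these assumptions, 16-bit weights for a 70B-parameter model come to about 140 GB, which is more than 128 GB, while 8-bit or 4-bit weights fit with room to spare. The article does not say which precision its 70B example assumes, so treat the numbers as a sizing aid rather than a statement about the validated configuration.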

    Architecture

    The joint solution consists of the following layers, which work together to deliver enterprise AI inference on NPUs:

    • Red Hat AI Inference Server with the Rebellions-certified vLLM container runtime
    • Red Hat OpenShift AI provides enterprise AI services including model serving with KServe, integrated with the Rebellions SDK for hardware-accelerated inference on ATOM NPUs
    • Red Hat OpenShift delivers the enterprise Kubernetes foundation (logging, monitoring, GitOps, service mesh, certified storage) that is now NPU-aware. This enables intelligent scheduling, monitoring, and lifecycle management of NPU-based inference workloads, treating NPUs as first-class resources within the cluster.
    • Rebellions NPU operator, certified for Red Hat OpenShift, seamlessly integrates Rebellions' cloud-native toolkit (drivers, device plugins, monitoring) into OpenShift, enabling native NPU support with high performance and low latency.
    • Integrated infrastructure for the OpenShift control plane with NPU-powered inference nodes, delivered as a rapid deployment pattern for enterprise data centers.

    Prerequisites

    Before you begin, ensure you have the following:

    • Red Hat OpenShift 4.20 or later
    • Red Hat OpenShift AI 3.3 or later
    • A server with Rebellions ATOM NPUs (refer to the Red Hat Ecosystem Catalog for validated hardware configurations)
    • Cluster administrator access to your OpenShift cluster
    • A Hugging Face account with access to the model you want to serve (if pulling from Hugging Face Hub)

    Step 1: Install the Node Feature Discovery operator

    The Node Feature Discovery operator (NFD) detects hardware features on cluster nodes, including Rebellions NPUs. The NPU operator depends on NFD to identify nodes with ATOM devices.

    1. In the OpenShift web console, navigate to Ecosystem > Software Catalog.
    2. Search for Node Feature Discovery and install it with the default settings (see Figure 1).
    3. Create a NodeFeatureDiscovery instance to start detecting hardware features on your nodes.
    Figure 1: Installing the NFD operator.

    Once the NodeFeatureDiscovery resource exists (the default settings are sufficient), nodes with Rebellions NPUs are labeled automatically. You can verify this by checking the node labels:

    oc get node <node-name> -o jsonpath='{.metadata.labels}' \
    | jq 'with_entries(select(.key | contains("1eff")))'

    The expected output on a node with Rebellions NPUs:

    {
      "feature.node.kubernetes.io/pci-1eff.present": "true"
    }

    1eff is the PCI vendor ID for Rebellions Inc., so this label marks the node as having at least one Rebellions device present.
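If you prefer to run the same check from a script, the jq filter above can be mirrored in Python. This is a minimal sketch: the function names are made up for illustration, and the sample labels other than the `pci-1eff.present` entry are hypothetical.

```python
REBELLIONS_PCI_VENDOR_ID = "1eff"  # PCI vendor ID quoted in the article


def rebellions_labels(labels: dict[str, str]) -> dict[str, str]:
    """Keep only labels whose key mentions the vendor ID (same idea as the jq filter)."""
    return {k: v for k, v in labels.items() if REBELLIONS_PCI_VENDOR_ID in k}


def has_rebellions_npu(labels: dict[str, str]) -> bool:
    """True when NFD has marked the node as carrying at least one Rebellions device."""
    key = f"feature.node.kubernetes.io/pci-{REBELLIONS_PCI_VENDOR_ID}.present"
    return labels.get(key) == "true"


# Hypothetical label set, shaped like `.metadata.labels` of a node object.
sample = {
    "kubernetes.io/hostname": "rbln-npu-worker-01",
    "feature.node.kubernetes.io/pci-1eff.present": "true",
}
print(rebellions_labels(sample))
print(has_rebellions_npu(sample))
```

The input would typically be the JSON printed by `oc get node <node-name> -o jsonpath='{.metadata.labels}'`, loaded with `json.loads`.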

    The Node Feature Discovery operator can be used for a wide range of hardware detection and labeling tasks. For more details on using and configuring NFD, see the OpenShift documentation.

    Step 2: Install the Rebellions NPU operator

    The Rebellions NPU operator manages the full lifecycle of NPU drivers, device plugins, and monitoring components on OpenShift.

    1. In the OpenShift web console, navigate to Ecosystem > Software Catalog (see Figure 2).

      Figure 2: Software Catalog in Red Hat OpenShift.
    2. Search for Rebellions NPU (it is certified in the Red Hat OpenShift Ecosystem Catalog).
    3. Install the operator with the default settings, ensuring it installs in the rbln-system namespace (Figure 3).

      Figure 3: Installing an operator in Red Hat OpenShift.

      Verify your progress so far using oc:

    oc get pods -n rbln-system
    NAME                        READY  STATUS   RESTARTS
    controller-manager-867..nqt  1/1   Running  0

    To pull driver containers from the repo.rebellions.ai registry, the operator expects a secret containing the credentials for that registry:

    oc create secret docker-registry drivercred \
      --docker-server=repo.rebellions.ai \
      --docker-username=<your-username> \
      --docker-password=<your-password> \
      --docker-email=<your-email>  \
      -n rbln-system

    The operator then needs two custom resources: first, an RBLNDriver resource defining the driver to be installed, and second, an RBLNClusterPolicy resource configuring the individual pods that the operator manages. You can create both from the Ecosystem > Installed Operators > RBLN operator page by clicking Create instance for each. For most installations, the default settings are correct (Figure 4).

    Figure 4: Creating the two custom resources required by the RBLN operator.

    After creating these, the operator automatically:

    • Builds and deploys the RBLN kernel module
    • Registers ATOM devices with the Kubernetes device plugin framework
    • Deploys metrics exporters for NPU monitoring

    You should now see the operator's pods running in the rbln-system namespace. For example:

    oc get pods -n rbln-system
    
    NAME                                        READY STATUS
    controller-manager-797798d7b8-rjzht         1/1   Running
    rbln-device-plugin-4qgxc                    1/1   Running
    rbln-metrics-exporter-jghbg                 1/1   Running
    rbln-npu-feature-discovery-zg47r            1/1   Running
    rbln-container-toolkit-ttz2c                1/1   Running
    rblndriver-sample-rhel9.6-5.14.0-570..qf9   1/1   Running
    rbln-operator-validator-qhf4t               1/1   Running

    Verify that ATOM devices are visible as allocatable resources on your nodes:

    oc get nodes -o \
    custom-columns=NAME:.metadata.name,NPUs:.status.allocatable.'rebellions\.ai/npu'

    You should see the rebellions.ai/npu count for each node, for example:

    NAME                            NPUs
    rbln-npu-worker-01              32
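The same check can be scripted against the JSON form of the node list. The sketch below assumes input shaped like `oc get nodes -o json`; the function name and the second sample node are hypothetical, and a node without the resource is reported as 0.

```python
def npus_per_node(node_list: dict, resource: str = "rebellions.ai/npu") -> dict[str, int]:
    """Map node name to the number of allocatable NPUs (0 when the resource is absent)."""
    counts = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended resources are reported as integer strings, for example "32".
        counts[node["metadata"]["name"]] = int(allocatable.get(resource, "0"))
    return counts


# Hypothetical, trimmed-down node list for illustration.
sample_nodes = {
    "items": [
        {"metadata": {"name": "rbln-npu-worker-01"},
         "status": {"allocatable": {"cpu": "64", "rebellions.ai/npu": "32"}}},
        {"metadata": {"name": "cpu-worker-01"},
         "status": {"allocatable": {"cpu": "32"}}},
    ]
}
print(npus_per_node(sample_nodes))
```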

    Note: The per-product labels that rbln-npu-feature-discovery applies to the nodes, such as rebellions.ai/npu.product=RBLN-CA25, are the recommended mechanism for workloads to pin to a specific card type in heterogeneous clusters.

    At this point, the ATOM devices should be fully configured and ready for use with OpenShift AI.
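To illustrate how a workload could use the per-product label and the `rebellions.ai/npu` resource together, here is a sketch that builds a plain Kubernetes Pod manifest. The pod name, container name, and image are placeholders, not values from the Rebellions documentation; only the label key, the product value, and the resource name come from the text above.

```python
import json


def npu_pod_manifest(name: str, image: str, npu_count: int, product: str = "RBLN-CA25") -> dict:
    """Build a Pod manifest that requests NPUs and pins to one card type."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Schedule only onto nodes labeled with the requested card type.
            "nodeSelector": {"rebellions.ai/npu.product": product},
            "containers": [
                {
                    "name": "main",  # placeholder container name
                    "image": image,
                    # Extended resources are requested through limits.
                    "resources": {"limits": {"rebellions.ai/npu": str(npu_count)}},
                }
            ],
        },
    }


manifest = npu_pod_manifest("npu-smoke-test", "registry.example.invalid/npu-test:latest", 4)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `oc apply -f`. When you deploy through OpenShift AI, the hardware profile created in the next step takes care of the resource request for you.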

    Step 3: Create the ATOM hardware profile in OpenShift AI

    A hardware profile tells OpenShift AI how much CPU, memory, and accelerator resources to allocate when deploying a model. Create a hardware profile for ATOM-based inference.

    1. Navigate to the Red Hat OpenShift AI dashboard (open it from the grid icon at the top right of the OpenShift web console).
    2. In the OpenShift AI dashboard, navigate to Settings > Environment Setup > Hardware profiles.
    3. Click Create hardware profile and configure it according to your server configuration and the model you intend to serve.
    4. Add the Accelerator resource using the Add Resource button and assign the number of accelerators the model requires:

    For example:

    • Name: rebellions-atom
    • CPU: 28
    • Memory: 720GiB
    • Accelerator: rebellions.ai/npu
    • Accelerator count: 16

    If you intend to serve multiple models of different sizes on the cluster, you can create multiple hardware profiles by repeating the same procedure (Figure 5).

    Figure 5: Example hardware profiles created for Rebellions ATOM accelerators.

    Step 4: Create the vLLM RBLN ServingRuntime

    The ServingRuntime defines the container image, startup arguments, and environment variables used by KServe to serve models on ATOM NPUs. To create a template ServingRuntime that can be reused for every model deployment, navigate to Settings > Model Resources and operations > Serving runtimes, click Add Serving Runtime, and select the appropriate API protocol and model type. Then add the following YAML:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      annotations:
        opendatahub.io/apiProtocol: REST
        opendatahub.io/model-type: '["generative"]'
        opendatahub.io/modelServingSupport: '["single"]'
        opendatahub.io/recommended-accelerators: '["rebellions.ai/ATOM"]'
        openshift.io/display-name: vLLM RBLN ATOM ServingRuntime for Red Hat
      labels:
        opendatahub.io/dashboard: "true"
      name: vllm-rbln-runtime
    spec:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
      containers:
        - args:
            - --port=8080
            - --model=/mnt/models
            - --served-model-name={{.Name}}
            - --block-size=1024
            - --max-num-seqs=1
            - --max-model-len=8192
            - --max-num-batched-tokens=128
            - --enable-chunked-prefill
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          env:
            - name: HOME
              value: /workspace
            - name: HF_HOME
              value: /tmp/hf_home
            - name: VLLM_TARGET_DEVICE
              value: rbln
            - name: VLLM_USE_V1
              value: "1"
            - name: VLLM_RBLN_COMPILE_STRICT_MODE
              value: "1"
            - name: VLLM_RBLN_METRICS
              value: "1"
            - name: VLLM_RBLN_USE_VLLM_MODEL
              value: "1"
            - name: RBLN_KERNEL_MODE
              value: triton
            - name: VLLM_LOGGING_LEVEL
              value: WARNING
            - name: RBLN_ROOT_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: RBLN_LOCAL_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          image: <vllm-rbln-image>
          name: kserve-container
          ports:
            - containerPort: 8080
              protocol: TCP
          readinessProbe:
            failureThreshold: 30
            initialDelaySeconds: 80
            periodSeconds: 20
            tcpSocket:
              port: 8080
            timeoutSeconds: 5
          volumeMounts:
            - mountPath: /workspace
              name: workspace-volume
      multiModel: false
      supportedModelFormats:
        - autoSelect: true
          name: vLLM
      volumes:
        - emptyDir:
            sizeLimit: 100G
          name: workspace-volume

    Replace <vllm-rbln-image> with the certified vLLM RBLN container image from the Rebellions registry. At the time of writing, this is repo.rebellions.ai/rebellions/vllm-rbln-rhel9:3.3, but consult the Rebellions documentation for the latest image reference.

    Figure 6: Creating the serving runtime in Red Hat OpenShift.

    Key environment variables to note:

    • VLLM_TARGET_DEVICE=rbln: Directs vLLM to use the Rebellions NPU backend
    • VLLM_USE_V1=1: Enables the vLLM V1 engine
    • VLLM_RBLN_COMPILE_STRICT_MODE=1: Enforces strict compilation mode for model graphs
    • VLLM_RBLN_METRICS=1: Enables NPU-specific metrics for Prometheus
    • RBLN_KERNEL_MODE=triton: Selects the Triton-based kernel execution mode

    The workspace-volume with a sizeLimit of 100G is required for the RBLN compilation cache, which stores compiled model graphs for faster subsequent startups.
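The numeric startup arguments in the ServingRuntime interact with each other, and a little arithmetic helps when you adjust them. The sketch below relies on the general vLLM meaning of these flags (a KV cache block holds `--block-size` tokens, and chunked prefill processes at most `--max-num-batched-tokens` tokens per scheduler step). The RBLN backend may schedule differently, so treat the results as an approximation; the function names are illustrative.

```python
import math


def kv_blocks_per_sequence(max_model_len: int, block_size: int) -> int:
    """KV cache blocks needed to hold one sequence of maximum length."""
    return math.ceil(max_model_len / block_size)


def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Scheduler steps needed to prefill a prompt in chunks, assuming one active sequence."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


# Values from the ServingRuntime above.
print(kv_blocks_per_sequence(max_model_len=8192, block_size=1024))      # 8
print(prefill_steps(prompt_tokens=8192, max_num_batched_tokens=128))    # 64
```

With `--max-num-seqs=1`, this runtime serves one sequence at a time, so a full-length 8192-token prompt would be prefilled in roughly 64 chunks before decoding starts.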

    Once the ServingRuntime is created, it appears as vLLM RBLN ATOM ServingRuntime for Red Hat in the OpenShift AI dashboard (Figure 7) under the available serving runtimes (Settings > Model Resources and operations > Serving runtimes).

    Figure 7: Available serving runtimes in Red Hat OpenShift.

    Step 5: Deploy a model

    With the hardware profile and ServingRuntime in place, you can now deploy a model through the OpenShift AI dashboard.

    1. In the OpenShift AI dashboard, click on Projects in the menu on the left, and select or create a project.
    2. Under the Connections tab, configure a data connection for model storage (for example, an NFS-backed PersistentVolumeClaim or an S3-compatible object store) and upload your model weights (Figure 8).

      Figure 8: Configuring a data connection for model storage.
    3. Under Serve Models in the Overview tab, click the Deploy model link and configure:
      • Model location: Point to the model path in your data connection
      • Model deployment name: For example, qwen3-0-6b
      • Hardware profile: Select rebellions-atom
      • Serving runtime: Select vLLM RBLN ATOM ServingRuntime for Red Hat
      • Model access: If you require external access to the model, enable Make model deployment available through an external route
    4. Click Deploy model (Figure 9).

      Figure 9: Deploying a model in Red Hat OpenShift.

    This creates all the objects needed to run the model in the project's namespace, for example:

    oc get all -n rbln-demo
    
    NAME                              READY STATUS   RESTARTS
    pod/qwen3-0-6b-predictor-798..s2w  2/2  Running  0
    
    NAME                     TYPE       CLUSTER-IP     EXT..IP PORT
    service/qwen3..metrics   ClusterIP  172.30.176.153 <none>  8080/TCP
    service/qwen3..predictor ClusterIP  None           <none>  8443/TCP
    
    NAME                              READY  UP-TO-DATE   AVAILABLE
    deployment.apps/qwen3..predictor  1/1    1            1
    
    NAME                         DESIRED   CURRENT   READY
    replicaset.apps/qwen3..767   1         1         1
    
    NAME                                                       REFERENCE                         TARGETS              MINPODS   MAXPODS   REPLICAS
    horizontalpodautoscaler.autoscaling/qwen3-0-6b-predictor   Deployment/qwen3-0-6b-predictor   cpu: <unknown>/80%   1         1         1
    
    NAME                                  HOST/PORT                                    PATH   SERVICES               PORT    TERMINATION          WILDCARD
    route.route.openshift.io/qwen3-0-6b   qwen3-0-6b-rbln-demo.apps.sno-prod.rbln.ai          qwen3-0-6b-predictor   https   reencrypt/Redirect   None

    The first deployment takes longer than usual because the RBLN compiler needs to compile the model graph for the ATOM NPU architecture. Subsequent deployments of the same model will reuse the cached compilation artifacts.

    You can monitor the deployment progress in the OpenShift AI dashboard or by watching the pod logs:

    oc logs -f <inference-pod-name> -n <project-namespace>

    Once the readiness probe passes, the model is ready to serve requests.
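For automation, readiness can also be read from the InferenceService status instead of the logs. The sketch below assumes the object is available as JSON (for example from `oc get inferenceservice <model-name> -n <project-namespace> -o json`) and relies on the `Ready` condition that KServe reports under `status.conditions`. The sample object is hypothetical and heavily trimmed.

```python
def is_ready(inference_service: dict) -> bool:
    """True when the InferenceService reports a Ready condition with status "True"."""
    conditions = inference_service.get("status", {}).get("conditions", [])
    for condition in conditions:
        if condition.get("type") == "Ready":
            return condition.get("status") == "True"
    return False  # no Ready condition yet, for example while the model is compiling


# Hypothetical status, shaped like a KServe InferenceService.
sample_isvc = {
    "metadata": {"name": "qwen3-0-6b"},
    "status": {
        "conditions": [
            {"type": "PredictorReady", "status": "True"},
            {"type": "Ready", "status": "True"},
        ]
    },
}
print(is_ready(sample_isvc))
```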

    Step 6: Verify the devices

    You can check the devices visible to the pod, along with useful metrics such as power usage and memory utilization, using the rbln-smi utility:

    oc exec -it pod/qwen3-0-6b-predictor-798b5c767-cks2w \
    -n rbln-demo -- rbln-smi
    
    
    +-------------------------------------------------------------------------------------------------+
    |                                Device Information KMD ver: 3.0.0                                |
    +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
    | NPU |    Name   | Device  |   PCI BUS ID  | Temp |  Power  | Perf |  Memory(used/total) |  Util |
    +=====+===========+=========+===============+======+=========+======+=====================+=======+
    | 0   | RBLN-CA25 | rbln0   |  0000:0b:00.0 |  38C |  43.7W  | P14  |  14.0GiB / 15.7GiB  |   0.0 |
    | 1   |           | rbln1   |  0000:0c:00.0 |  40C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 2   |           | rbln2   |  0000:0d:00.0 |  33C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 3   |           | rbln3   |  0000:0e:00.0 |  29C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
    | 4   | RBLN-CA25 | rbln4   |  0000:0f:00.0 |  35C |  43.5W  | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 5   |           | rbln5   |  0000:10:00.0 |  39C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 6   |           | rbln6   |  0000:11:00.0 |  31C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 7   |           | rbln7   |  0000:12:00.0 |  32C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
    | 8   | RBLN-CA25 | rbln16  |  0000:1b:00.0 |  38C |  44.8W  | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 9   |           | rbln17  |  0000:1c:00.0 |  34C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 10  |           | rbln18  |  0000:1d:00.0 |  32C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 11  |           | rbln19  |  0000:1e:00.0 |  31C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
    | 12  | RBLN-CA25 | rbln20  |  0000:1f:00.0 |  31C |  44.7W  | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 13  |           | rbln21  |  0000:20:00.0 |  34C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 14  |           | rbln22  |  0000:21:00.0 |  28C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    | 15  |           | rbln23  |  0000:22:00.0 |  27C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
    +-----+-----------+---------+---------------+------+---------+------+---------------------+-------+
    +-------------------------------------------------------------------------------------------------+
    |                                       Context Information                                       |
    +-----+---------------------+--------------+-----------+----------+------+---------------+--------+
    | NPU | Process             |     PID      |    CTX    | Priority | PTID |      Memalloc | Status |
    +=====+=====================+==============+===========+==========+======+===============+========+
    | 0   | VLLM::EngineCore    |     521      |   10001   |  normal  |  0   |       14.0GiB |  idle  |
    +-----+---------------------+--------------+-----------+----------+------+---------------+--------+
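If you want to capture these readings in a script, the device rows can be pulled out of the text table. The parser below is a sketch written against the layout shown above (nine columns per device row). It is not an official interface, and it will need adjusting if the table format changes between rbln-smi releases.

```python
import re

_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
_SIZE = re.compile(r"([\d.]+)\s*(B|KiB|MiB|GiB)")


def parse_size(text: str) -> float:
    """Convert strings such as "14.0GiB" or "0.0B" to a byte count."""
    match = _SIZE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def parse_devices(output: str) -> list[dict]:
    """Extract per-NPU rows from the "Device Information" table."""
    devices = []
    for line in output.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Device rows have 9 cells and start with a numeric NPU index;
        # borders, headers, and the 8-column context table are skipped.
        if len(cells) != 9 or not cells[0].isdigit():
            continue
        used, total = (parse_size(part) for part in cells[7].split("/"))
        devices.append({
            "npu": int(cells[0]),
            "device": cells[2],
            "temp_c": int(cells[4].rstrip("C")),
            "used_bytes": used,
            "total_bytes": total,
        })
    return devices


# Two device rows and one context row copied from the output above.
SAMPLE = """\
| NPU |    Name   | Device  |   PCI BUS ID  | Temp |  Power  | Perf |  Memory(used/total) |  Util |
| 0   | RBLN-CA25 | rbln0   |  0000:0b:00.0 |  38C |  43.7W  | P14  |  14.0GiB / 15.7GiB  |   0.0 |
| 1   |           | rbln1   |  0000:0c:00.0 |  40C |         | P14  |    0.0B / 15.7GiB   |   0.0 |
| 0   | VLLM::EngineCore    |     521      |   10001   |  normal  |  0   |       14.0GiB |  idle  |
"""
for device in parse_devices(SAMPLE):
    print(device["device"], device["temp_c"], device["used_bytes"])
```

In the sample output, only rbln0 holds model memory, which matches the single VLLM::EngineCore context listed for NPU 0.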

    Step 7: Run inference

    The deployed model exposes an OpenAI-compatible API endpoint. You can send requests using curl or any OpenAI-compatible client.

    First, retrieve the inference endpoint:

    oc get inferenceservice <model-name> \
    -n <project-namespace> \
    -o jsonpath='{.status.url}'

    Note: If you didn't enable an external route during deployment, this URL is only available to other pods running on the cluster.

    Send a chat completion request:

    curl -s https://<inference-endpoint>/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3-0-6b",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Explain the benefits of NPUs for AI inference in two sentences."}
        ],
        "max_tokens": 256,
        "temperature": 0.7
      }'

    Example response:

    {
      "id": "chatcmpl-abc123",
      "object": "chat.completion",
      "model": "qwen3-0-6b",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "NPUs are purpose-built for AI inference workloads, delivering higher throughput per watt compared to general-purpose GPUs, which translates directly into lower operational costs at scale. Their optimized architecture reduces latency for token generation while enabling dense deployment in data centers without the cooling and power overhead typically associated with GPU clusters."
          },
          "finish_reason": "stop"
        }
      ]
    }
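The same request can be sent from Python with only the standard library. This is a minimal sketch: the helper names are illustrative, the endpoint is whatever URL the `oc get inferenceservice` command returned, and authentication or custom CA handling (which your cluster may require) is left out.

```python
import json
import urllib.request


def build_chat_request(model: str, user_prompt: str,
                       system_prompt: str = "You are a helpful assistant.",
                       max_tokens: int = 256, temperature: float = 0.7) -> dict:
    """Build the JSON body for the OpenAI-compatible chat completions endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(endpoint: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload to <endpoint>/v1/chat/completions and decode the JSON reply."""
    request = urllib.request.Request(
        f"{endpoint.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = build_chat_request(
    "qwen3-0-6b",
    "Explain the benefits of NPUs for AI inference in two sentences.",
)
print(json.dumps(payload, indent=2))
```

To call the model, pass the endpoint and payload to `chat` and feed the result to `extract_reply`, for example `extract_reply(chat("https://<inference-endpoint>", payload))`.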

    Serving large and Mixture of Experts models

    For larger dense models (such as Llama 3.3 70B) or Mixture of Experts (MoE) models (such as Qwen3-30B-A3B), you must distribute the model across multiple ATOM devices using tensor parallelism and expert parallelism.

    When deploying these models, add the following custom runtime arguments in the Configuration parameters section of the Advanced settings stage of the model deployment form:

    --enable-expert-parallel
    --data-parallel-size=4
    --max-model-len=40960
    --block-size=8192

    Add the following environment variables to control the tensor parallel mapping and the number of OpenMP threads:

    VLLM_RBLN_TP_SIZE=4
    OMP_NUM_THREADS=2

    This configuration runs 4 data-parallel replicas for higher throughput, each sharded across 4 ATOM devices with tensor parallelism (16 devices in total), and enables expert parallelism for MoE architectures.

    Update the hardware profile accordingly to allocate 16 ATOM devices, 28 CPUs, and 720 Gi of memory for these larger models.
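The device count for the hardware profile follows from the parallelism settings. The sketch below assumes that each data-parallel replica uses its own group of `VLLM_RBLN_TP_SIZE` devices, which is how the 16-device figure above is reached. The contiguous grouping it prints is illustrative only, since the runtime decides the actual device mapping.

```python
def devices_required(tensor_parallel_size: int, data_parallel_size: int) -> int:
    """Total NPUs needed when every data-parallel replica is sharded with tensor parallelism."""
    return tensor_parallel_size * data_parallel_size


def replica_device_groups(tensor_parallel_size: int, data_parallel_size: int) -> list[list[int]]:
    """Illustrative contiguous device indices per replica (not the runtime's real mapping)."""
    return [
        list(range(replica * tensor_parallel_size, (replica + 1) * tensor_parallel_size))
        for replica in range(data_parallel_size)
    ]


print(devices_required(tensor_parallel_size=4, data_parallel_size=4))   # 16
print(replica_device_groups(4, 4))
```

If you change either value, update the accelerator count in the hardware profile to match the product of the two.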

    Monitoring NPUs

    When the NPU operator is installed, the Rebellions metrics exporter is added and exposes detailed telemetry for Rebellions NPUs in Prometheus format, including the following metrics:

    • rbln_npu_temperature: Device temperature (°C)
    • rbln_npu_power: Card power draw (W)
    • rbln_npu_memory_used: DRAM currently in use (bytes)
    • rbln_npu_memory_total: Total DRAM (bytes)
    • rbln_npu_utilization: SM utilization (%)
    • rbln_npu_health: Binary health (0 = active, 1 = inactive)

    1. Enable OpenShift user workload monitoring by applying this ConfigMap:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true

    2. Navigate to Observe > Metrics in the OpenShift web console (Figure 10) to see Rebellions ATOM metrics.

      Figure 10: Rebellions ATOM metrics in Red Hat OpenShift web console.
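As a small example of working with these metrics outside the console, the sketch below parses Prometheus text-format samples and derives a memory utilization ratio from `rbln_npu_memory_used` and `rbln_npu_memory_total`. The metric names come from the list above; the `card` label and the sample values are hypothetical, and the parser handles only simple `name{labels} value` lines.

```python
import re

_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')


def parse_metrics(text: str) -> dict[tuple[str, str], float]:
    """Parse simple Prometheus exposition lines into {(metric, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


def memory_utilization(samples: dict[tuple[str, str], float]) -> dict[str, float]:
    """Ratio of used to total memory for each label set that reports both metrics."""
    ratios = {}
    for (name, labels), used in samples.items():
        if name != "rbln_npu_memory_used":
            continue
        total = samples.get(("rbln_npu_memory_total", labels))
        if total:
            ratios[labels] = used / total
    return ratios


# Hypothetical scrape output; the label name and values are made up.
METRICS = """\
# HELP rbln_npu_memory_used DRAM currently in use (bytes)
rbln_npu_memory_used{card="0"} 15032385536
rbln_npu_memory_total{card="0"} 17179869184
"""
print(memory_utilization(parse_metrics(METRICS)))
```

In the Observe > Metrics view, dividing the two series directly (`rbln_npu_memory_used / rbln_npu_memory_total`) should give the same ratio, provided both series carry matching labels.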

    Supported models

    The following model families have been validated with the vLLM RBLN runtime on Red Hat OpenShift AI:

    • Llama: Llama3.3-70b, Llama3.2-3b
    • Qwen: Qwen3-0.6b, Qwen3-8b, Qwen3-VL-8b
    • DeepSeek: DeepSeek-R1-Distill-Qwen-32b
    • Gemma: Gemma2-9b, Gemma-7b, Gemma-2b
    • Mistral: Mistral-7b
    • EXAONE: EXAONE-3.5-32b, EXAONE-3.5-2.4b
    • Others: Stable Diffusion, Time-Series-Transformer, gpt-oss-20b

    Supported vLLM features include continuous batching, chunked prefill, prefix caching, speculative decoding, LoRA, sliding window attention, tensor/pipeline/data/expert parallelism, structured output, and w4a16 group quantization.

    For the latest support matrix, including per-model feature compatibility, consult the Rebellions vLLM documentation.

    What comes next

    This is just the beginning of the Red Hat and Rebellions collaboration. Upcoming milestones include:

    • Multi-node NPU clusters for scaling inference across multiple servers
    • Disconnected (air-gapped) environment support for secure, isolated deployments
    • Integration with llm-d for disaggregated prefill/decode and advanced serving topologies
    • Support for REBEL NPU, Rebellions' next-generation chiplet architecture with 144 GB HBM3E, targeting GPU-class performance with NPU-class efficiency

    Get started

    To start running AI inference on Rebellions ATOM NPUs with Red Hat OpenShift AI:

    • Visit the Red Hat Ecosystem Catalog for the validated solution listing and certified operator
    • Review the Rebellions documentation for the latest support matrix, container images, and configuration guides
    • Read the joint press release for more on the partnership and strategic vision

    Red Hat's commitment to "any model, any accelerator, any cloud" means giving enterprises real choice in how they deploy AI. With Rebellions ATOM NPUs now fully supported on Red Hat OpenShift AI, organizations have a validated, energy-efficient path to production AI inference that doesn't compromise on enterprise-grade security, scalability, or operational simplicity.
