Serve and benchmark Prithvi models with vLLM on OpenShift

Stand up and test an Earth and space model inference service on Red Hat AI Inference Server and Red Hat OpenShift AI

March 3, 2026
Michele Gazzetti, Michael Johnston, Christian Pinto, Erwan Gallen
Related topics:
Artificial intelligence, Data science, Kubernetes, Serverless
Related products:
Red Hat AI Inference Server, Red Hat AI, Red Hat OpenShift AI

    In Scaling Earth and space AI models with Red Hat AI Inference Server and Red Hat OpenShift AI, we showed the performance benefits of serving inference for the Prithvi-EO model with Red Hat AI Inference Server. We demonstrated this using both a standalone setup and a combination of KServe and Knative. Here, we will dive deeper and show how to set up and test both cases. If you are feeling adventurous, you can also try using your own Earth and space model instead of Prithvi.

    Let’s dive in!

    Before you start

    This article includes two self-contained activities. In the first part, we deploy Prithvi using a traditional Deployment object. In the second part, we serve the model using KServe and run a benchmark test to observe how Knative scales serving replicas as traffic increases. To follow along, be sure to have a suitable environment meeting the following requirements.

    Prerequisites:

    • A Red Hat OpenShift cluster with at least one NVIDIA GPU
    • Red Hat OpenShift AI 2.25 or later

    Note: We run the service using an NVIDIA A100 80 GB GPU hosted on a bare metal OpenShift cluster.

    How to serve Prithvi with Red Hat AI Inference Server

    The following steps describe how to bring up a vLLM instance serving a Prithvi 2.0 model for flood detection using Red Hat AI Inference Server on OpenShift. They assume you are logged into OpenShift in a namespace where you can request GPUs.

    Step 1: Create a Red Hat AI Inference Server deployment serving the Prithvi model

    First, create a Deployment and Service YAML description to serve the model using Red Hat AI Inference Server. For example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: rhaiis-prithvi
      labels:
        app: rhaiis-prithvi
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: rhaiis-prithvi
      template:
        metadata:
          labels:
            app: rhaiis-prithvi
        spec:
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "2Gi"
          containers:
            - name: rhaiis-prithvi
              image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3
              command: ["vllm"]
              args: ["serve",
                    "ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11",
                    "--enforce-eager",
                    "--skip-tokenizer-init",
                    "--enable-mm-embeds",
                    "--io-processor-plugin",
                    "terratorch_segmentation"]
              env:
                - name: HF_HUB_OFFLINE
                  value: "0"
              ports:
                - containerPort: 8000
              resources:
                limits:
                  cpu: "10"
                  memory: 20G
                  nvidia.com/gpu: "1"
                requests:
                  cpu: "2"
                  memory: 6G
                  nvidia.com/gpu: "1"
              volumeMounts:
                - name: shm
                  mountPath: /dev/shm
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 120
                periodSeconds: 10
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 120
                periodSeconds: 5
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: rhaiis-prithvi
    spec:
      ports:
        - name: rhaiis-prithvi
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: rhaiis-prithvi
      sessionAffinity: None
      type: ClusterIP

    Save the YAML to a file named rhaiis_prithvi.yaml. Run the following command to create the deployment in your current OpenShift namespace:

    oc create -f rhaiis_prithvi.yaml

    Once the rhaiis-prithvi pod becomes Ready (this can take several minutes depending on network speed), inference requests can be sent to the model in the cluster via a service or port-forward. Start the port forward for rhaiis-prithvi using the following command. Note that the port-forward command does not return control of the terminal, so open a new terminal to complete the following section.

    oc port-forward svc/rhaiis-prithvi 8000:8000
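    If you prefer to script the readiness check instead of watching the pod status, you can poll vLLM's /health endpoint. The following is a minimal sketch; the URL assumes the port-forward above is running:

```python
import time
import urllib.request


def wait_until_ready(url: str = "http://localhost:8000/health",
                     timeout_s: int = 600, interval_s: int = 5) -> bool:
    """Poll vLLM's /health endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Server not up yet (connection refused) or transient HTTP error
            pass
        time.sleep(interval_s)
    return False
```

    The timeout is generous because the first start downloads the model weights from Hugging Face.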

    Step 2: Send an inference request to the Prithvi model

    Before sending a request to the service, we need to describe the request payload. The following example payload is for an inference request that specifies the input image using a URL and requests the output as a base64-encoded image. Save the JSON payload to a file named payload.json.

    {
        "data": {
        "data": "https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11/resolve/main/examples/India_900498_S2Hand.tif",
        "indices": [1, 2, 3, 8, 11, 12],
        "data_format": "url",
        "out_data_format": "b64_json",
        "image_format": "tiff"
        },
        "model": "ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11" 
    }

    The image can be specified either by a path to a TIFF file (on a file system the server can access) or by a URL pointing to one. You can also specify whether the service saves the output image to a path on the server's file system or returns it base64-encoded in the response. For a full description of input and output options, see the TerraTorch project documentation.

    From a different terminal window, run the following command to send the inference request to vLLM. Ensure you run the command from the directory where you saved the payload. The command decodes the output into a TIFF image and saves it as a mask.tiff file. Figure 1 shows the input image from the payload's URL (left) and the mask Prithvi produced (right).

    curl -s -H "Content-Type: application/json" \
         --data @payload.json \
         http://localhost:8000/pooling \
      | jq -r '.data.data' \
      | base64 --decode \
      > mask.tiff
    Figure 1: Side-by-side comparison of the input image (a satellite view of a river delta) and the bodies of water detected by Prithvi (binary mask, water in white).
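    For scripted use, the same request can be sent from Python. The following is a minimal sketch, assuming the port-forward from the previous step is still active and payload.json is in the current directory:

```python
import base64
import json
import urllib.request


def decode_mask(b64_mask: str) -> bytes:
    """Decode the base64-encoded TIFF mask returned in the response."""
    return base64.b64decode(b64_mask)


def send_request(payload_path: str = "payload.json",
                 url: str = "http://localhost:8000/pooling") -> bytes:
    """POST the JSON payload to vLLM's /pooling endpoint and return the mask bytes."""
    with open(payload_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    # With out_data_format set to b64_json, the mask arrives under data.data
    return decode_mask(out["data"]["data"])


if __name__ == "__main__":
    with open("mask.tiff", "wb") as f:
        f.write(send_request())
```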

    Benchmark the service

    The vLLM benchmarking tool, vllm bench, tests geospatial models by varying parameters such as request-rate distribution and client-side concurrency. We recommend installing vLLM from source, as this is supported on all major architectures; first, check whether a PyPI package is available for your specific architecture. To install vllm bench, you must specify the extra benchmarking dependencies by adding bench. For example, run the following command when installing from source:

    uv pip install -e ".[bench]"

    After the build process finishes, download the dataset_url_input_india.jsonl file from the repository:

    curl http://mgazz.github.io/dataset_url_input_india.jsonl --output dataset_url_input_india.jsonl

    Then, run the following command line from the repository's top-level directory (substitute the value of --base-url as appropriate).

    vllm bench serve \
      --base-url http://localhost:8000 \
      --dataset-name=custom \
      --model ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11 \
      --skip-tokenizer-init \
      --endpoint /pooling \
      --backend vllm-pooling \
      --percentile-metrics e2el \
      --metric-percentiles 25,75,99 \
      --num-prompts 10 \
      --dataset-path ./dataset_url_input_india.jsonl

    How to create a scalable geospatial inference service

    This section describes how to deploy the Prithvi-EO-2.0-300M-TL-Sen1Floods11 model using OpenShift AI. This installation uses vLLM as the serving engine, KServe as the inference platform, and Knative as the inference autoscaler. Combining these three technologies simplifies deployment and dynamically scales inference servers based on request load.

    These instructions assume you are logged into an OpenShift cluster with GPUs that has OpenShift AI installed.

    Step 1: Create the vLLM ServingRuntime and InferenceService

    The setup uses a custom ServingRuntime backed by a Red Hat AI Inference Server container and a serverless InferenceService with autoscaling based on request concurrency. First, create a YAML description of the KServe objects and a PersistentVolumeClaim (PVC) to deploy Red Hat AI Inference Server.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: shared-pvc
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
    ---
    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      labels:
        app: rhaiis-prithvi-300m
      name: rhaiis-prithvi-300m
    spec:
      containers:
      - args:
        - serve
        - ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11
        - --skip-tokenizer-init
        - --enforce-eager
        - --io-processor-plugin
        - terratorch_segmentation
        - --enable-mm-embeds
        - --runner 
        - pooling
        command:
        - vllm
        env:
        - name: VLLM_LOGGING_LEVEL
          value: INFO
        - name: HF_HOME
          value: /tmp
        - name: HF_HUB_CACHE
          value: /cache
        - name: HOME
          value: /tmp
        image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 2
        imagePullPolicy: Always
        name: kserve-container
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          limits:
            cpu: "64"
            memory: 64G
            nvidia.com/gpu: "1"
          requests:
            cpu: "32"
            memory: 64G
            nvidia.com/gpu: "1"
        securityContext:
          capabilities:
            drop:
              - MKNOD
        volumeMounts:
        - mountPath: /cache
          name: tests-cache
      imagePullSecrets:
      - name: cp-icr-pull-secret
      multiModel: false
      supportedModelFormats:
      - autoSelect: true
        name: vLLM
      volumes:
      - name: tests-cache
        persistentVolumeClaim:
          claimName: shared-pvc
    ---
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        serving.knative.openshift.io/enablePassthrough: "true"
        serving.kserve.io/deploymentMode: Serverless
        sidecar.istio.io/inject: "true"
        sidecar.istio.io/rewriteAppHTTPProbers: "true"
        prometheus.io/scrape: "true"  
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000" 
        autoscaling.knative.dev/metric: concurrency
        autoscaling.knative.dev/target: "13"
        autoscaling.knative.dev/window: "60s"
        autoscaling.knative.dev/panic-threshold-percentage: "150"
     
        sidecar.istio.io/proxyCPU: "2"
        sidecar.istio.io/proxyCPULimit: "4"
        sidecar.istio.io/proxyMemory: "4Gi"
        sidecar.istio.io/proxyMemoryLimit: "4Gi"
     
      name: rhaiis-prithvi-300m
    spec:
      predictor:
        affinity:
          podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                        - key: app
                          operator: In
                          values:
                            - rhaiis-prithvi-300m
                    topologyKey: kubernetes.io/hostname
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
        containerConcurrency: 14
        runtime: rhaiis-prithvi-300m
        maxReplicas: 3
        minReplicas: 1
        model:
          modelFormat:
            name: vLLM
          name: ""

    Save the YAML content to a file named kserve_prithvi.yaml. To create the deployment in your current OpenShift namespace, run the following command:

    oc create -f kserve_prithvi.yaml

    Step 2: Verify that the service is up and running

    Inspect the InferenceService object and verify that it is in a Ready state:

    oc get isvc rhaiis-prithvi-300m

    Fetch the URL for the Red Hat AI Inference Server service:

    RHAIIS=$(oc get isvc rhaiis-prithvi-300m -o jsonpath='{.status.url}{"\n"}')

    From a different terminal window, run the following command to send an inference request to vLLM through KServe. You can use the same payload described in Step 2: Send an inference request to the Prithvi model.

    curl -s -H "Content-Type: application/json" \
         --data @payload.json \
         "${RHAIIS}/pooling" \
      | jq -r '.data.data' \
      | base64 --decode \
      > mask.tiff

    Handling ingress bandwidth limits

    When running these tests, the throughput reported by each InferenceService replica might be lower than expected. This is often caused by network bandwidth saturation when pulling the input images from the Hugging Face repository.

    To remove this bottleneck, deploy a local image server to serve the TIFF files from inside the cluster. The following example uses a simple BusyBox container running httpd to serve the image listed in the dataset_url_input_india.jsonl dataset file.

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: image-server
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: image-server
      template:
        metadata:
          labels:
            app: image-server
        spec:
          containers:
          - command:
            - sh
            - -c
            - |
              wget  https://huggingface.co/christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM/resolve/main/India_900498_S2Hand.tif -O /tmp/India_900498_S2Hand.tif
              httpd -f -p 8080 -h /tmp
            image: busybox:latest
            imagePullPolicy: Always
            name: http
    ---
    kind: Service
    apiVersion: v1
    metadata:
      name: image-server
    spec:
      ports:
        - name: http
          protocol: TCP
          port: 80
          targetPort: 8080
      type: ClusterIP
      selector:
        app: image-server

    After deploying the local image server, update your dataset JSONL file so each input request points to the in‑cluster URL. This ensures the benchmark runs entirely within the cluster.

    http://image-server/India_900498_S2Hand.tif
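    A small script can do the rewrite. The following is a sketch that assumes each dataset line references the image by its Hugging Face URL and simply substitutes the in-cluster URL:

```python
import re
from pathlib import Path

IN_CLUSTER_URL = "http://image-server/India_900498_S2Hand.tif"


def rewrite_line(line: str) -> str:
    """Replace any Hugging Face URL ending in India_900498_S2Hand.tif
    with the in-cluster image-server URL."""
    return re.sub(r'https://huggingface\.co/[^"]*India_900498_S2Hand\.tif',
                  IN_CLUSTER_URL, line)


def rewrite_dataset(path: str = "dataset_url_input_india.jsonl") -> None:
    """Rewrite every line of the dataset file in place."""
    p = Path(path)
    p.write_text("".join(rewrite_line(line)
                         for line in p.read_text().splitlines(keepends=True)))
```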

    Test the service

    To measure dynamic autoscaling performance, run two instances of vllm bench to generate the traffic load. The first instance simulates background traffic at 13 requests per second (RPS). Start the second instance two minutes later to increase traffic with a burst at 26 RPS.

    Both instances run the same benchmarking command. Change the value of the TRAFFIC_LOAD environment variable to 13 for background traffic and 26 for burst traffic.

    vllm bench serve \
      --base-url "${RHAIIS}" \
      --dataset-name=custom \
      --model ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11 \
      --seed 12345 \
      --skip-tokenizer-init \
      --endpoint /pooling \
      --backend vllm-pooling \
      --metric-percentiles 25,75,99 \
      --percentile-metrics e2el \
      --dataset-path ./dataset_url_input_india.jsonl \
      --num-prompts 500 \
      --request-rate ${TRAFFIC_LOAD}  \
      --max-concurrency ${TRAFFIC_LOAD} \
      --burstiness 5

    In a separate terminal, run the following command to watch for changes in the replicas associated with the rhaiis-prithvi-300m InferenceService:

    watch oc get pods -l app=rhaiis-prithvi-300m-predictor-00001 

    When Knative detects a traffic burst that exceeds the InferenceService concurrency constraints, it scales the replicas to handle the benchmark traffic. You can verify the scale-out event by checking the pod status:

    NAME                                               READY   STATUS    RESTARTS   AGE
    rhaiis-prithvi-300m-predictor-00001-deployment-55  2/2     Running   0          45s
    rhaiis-prithvi-300m-predictor-00001-deployment-82  2/2     Running   0          30s
    rhaiis-prithvi-300m-predictor-00001-deployment-99  2/2     Running   0          30s

    Experimenting with configuration settings

    The main parameter for addressing bursty traffic is the panic window threshold, configured with the autoscaling.knative.dev/panic-threshold-percentage annotation. In this example, the configuration scales out when the number of in-flight requests (concurrency) to a server exceeds 150% of the target value, which we set to 13 using the autoscaling.knative.dev/target annotation. This target is based on our evaluation that a single vLLM server can sustain up to 14.6 RPS (where each request is a tile) when downloading from a URL.

    To prevent overloading the vLLM replicas, we set the containerConcurrency option to 14, close to this throughput limit. Knative then begins queuing requests once traffic per vLLM instance approaches the maximum safe limit. This ensures the load is evenly distributed.
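    The scaling arithmetic can be sketched as follows. This is a simplification of Knative's KPA (which averages concurrency over the stable and panic windows), using the target of 13, the 150% panic threshold, and the maxReplicas of 3 from the InferenceService above:

```python
import math


def desired_replicas(observed_concurrency: float, target: float = 13,
                     max_replicas: int = 3) -> int:
    """Simplified stable-mode calculation: desired pods =
    ceil(observed concurrency / per-pod target), capped by maxReplicas."""
    return min(max_replicas, max(1, math.ceil(observed_concurrency / target)))


def panics(observed_concurrency: float, current_replicas: int,
           target: float = 13, panic_pct: float = 150) -> bool:
    """Panic mode triggers when observed concurrency exceeds the
    panic-threshold-percentage of the current ready capacity."""
    return observed_concurrency > current_replicas * target * panic_pct / 100
```

    With one replica, a burst of 26 in-flight requests exceeds 150% of the per-pod target (19.5), so the autoscaler panics and scales to 2 replicas.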

    Try experimenting with different parameter and benchmark settings to see how they change behavior. For example, the following command raises concurrency targets and triggers autoscaling events in response to larger traffic bursts:

    oc patch inferenceservice rhaiis-prithvi-300m \
      --type='merge' \
      -p='{
        "metadata": {
          "annotations": {
            "autoscaling.knative.dev/target": "16"
          }
        },
        "spec": {
          "predictor": {
            "containerConcurrency": 20
          }
        }
      }'

    After the experiment completes, delete the resources to free the GPUs.

    oc delete -f kserve_prithvi.yaml

    Bring your own model

    In this article we used Prithvi, a model available on Hugging Face that vLLM supports natively. You can extend vLLM with general plug-ins to support custom models: register custom out-of-tree models in the vLLM model registry to make them available for serving. To deploy a custom model, make the plug-in available to Red Hat AI Inference Server at startup; for example, use a PVC and install the plug-in into the main Python environment used to start vLLM. As with other models, vLLM expects out-of-tree models to be hosted on Hugging Face or stored in a local directory.
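    A plug-in of this kind is a small Python package whose registration hook calls vLLM's ModelRegistry. The package name, module layout, and model class below are hypothetical placeholders; only the ModelRegistry.register_model call and the vllm.general_plugins entry-point group come from vLLM's plugin mechanism:

```python
# Hypothetical package layout:
#   my_plugin/__init__.py   (this file)
#   my_plugin/model.py      (your model class implementing the vLLM model interface)
#
# The package advertises itself to vLLM via an entry point in pyproject.toml:
#   [project.entry-points."vllm.general_plugins"]
#   my_plugin = "my_plugin:register"


def register() -> None:
    """Called by vLLM at startup; adds the custom model to the registry."""
    # Imported lazily so the plugin module stays importable without vLLM installed
    from vllm import ModelRegistry

    ModelRegistry.register_model(
        "MyGeoModelForSegmentation",                  # architecture name from the model's config.json
        "my_plugin.model:MyGeoModelForSegmentation",  # import path to the class
    )
```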

    Wrap up

    Red Hat OpenShift AI provides a ready-to-use AI application platform that simplifies deploying and scaling AI models based on traffic. This is an essential capability for geospatial use cases, where demand can spike unpredictably due to new data or sudden events such as natural disasters and extreme weather.

    Learn more

    • Check the documentation
    • Read Red Hat’s overview of how vLLM accelerates AI inference and enterprise use cases
    • Deep dive into Red Hat AI Inference Server technical architecture and parallelism
    • Explore using vLLM for geospatial serving mechanics and more
    • Try Prithvi models in your environment: Hugging Face, GitHub
