Benchmarking with GuideLLM in air-gapped OpenShift clusters

A simplified workflow for LLM serving and benchmarking with Red Hat AI Inference Server (vLLM) and GuideLLM

September 15, 2025
Philip Hayes, Thameem Abbas Ibrahim Bathusha
Related topics:
Artificial intelligence, Disconnected Environments
Related products:
Red Hat AI, Red Hat OpenShift AI

    The ability to deploy and benchmark large language models (LLMs) in an air-gapped (disconnected) environment is critical for enterprises operating in highly regulated sectors. In this article, we walk through deploying the Red Hat AI Inference Server using vLLM and evaluating its performance with GuideLLM—all within a fully disconnected Red Hat OpenShift cluster.

    We will use prebuilt container images, Persistent Volume Claims (PVCs) to mount models and tokenizers, and OpenShift-native Job resources to run benchmarks.

    What is GuideLLM?

    GuideLLM is an open source benchmarking tool designed to evaluate the performance of LLMs served through vLLM. It provides fine-grained metrics such as:

    • Token throughput
    • Latency (time-to-first-token, inter-token, request latency)
    • Concurrency scaling
    • Request-level diagnostics

    GuideLLM uses the model’s own tokenizer to prepare evaluation prompts and supports running entirely in disconnected environments. The text corpus used to generate the requests is included in the GuideLLM image, so no additional datasets are required to perform benchmarking.

    Why does GuideLLM need to use the model's own tokenizer?

    Different models use different tokenization schemes (e.g., byte-pair encoding, WordPiece, SentencePiece). These affect how a given input string is split into tokens and how long a sequence is in tokens vs. characters. Metrics like "tokens per second" and "time to first token" depend directly on how many tokens are involved in a prompt or response. Using the model’s native tokenizer ensures that the prompt token count matches what the model actually receives and that the output token count reflects the true workload.

    Architecture overview

    This benchmark stack includes:

    • OpenShift 4.14–4.18: Cluster running the GPU workloads
    • Node Feature Discovery (NFD) and NVIDIA GPU Operator: To enable GPU scheduling
    • Red Hat AI Inference Server (vLLM): Hosts and serves quantized LLMs
    • GuideLLM: Deployed as a containerized job to evaluate inference performance
    • Persistent volumes (PVCs):
      • Model weights (/mnt/models)
      • Tokenizer files (/mnt/tokenizer)
      • Benchmark results (/results)

    Mirroring images for a disconnected OpenShift environment

    In air-gapped OpenShift environments, public registries like quay.io and registry.redhat.io are not accessible at runtime. To ensure successful deployments, all required images must be mirrored into an internal registry that is accessible to the disconnected OpenShift cluster. Red Hat provides the oc-mirror tool to help generate a structured image set and transfer it across network zones.

    This section outlines how to mirror the necessary images and configure OpenShift to trust and pull from your internal registry.

    Prerequisites

    • A reachable internal container registry, such as:
      • A local registry hosted inside OpenShift (e.g., registry.apps.<cluster-domain>)
      • A mirror registry set up per Red Hat’s disconnected installation docs.
    • oc-mirror and rsync installed on a connected machine
    • Authentication credentials (e.g., pull secret) for any source registries (like registry.redhat.io)
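
    Before running oc-mirror, it can be worth confirming that the connected host can authenticate to the source registries and that the oc-mirror plugin is available. A minimal check, assuming podman is installed on the connected host, might look like this:

    # Store credentials where oc-mirror looks for them by default
    podman login registry.redhat.io --authfile ~/.docker/config.json

    # Confirm the oc-mirror plugin is installed and on the PATH
    oc-mirror version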

    Identify required images

    Here’s a list of images required for this disconnected deployment:

    Purpose                                                    | Source image
    Red Hat AI Inference Server vLLM runtime                   | registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1-1756225581
    GuideLLM                                                   | quay.io/rh-aiservices-bu/guidellm:f1f8ca8
    UBI base (for rsync; any other minimal image can be used)  | registry.access.redhat.com/ubi9/ubi:latest

    Create the ImageSet configuration

    An ImageSet configuration is a YAML file used by the oc-mirror CLI tool to define which container images should be mirrored from public or external registries into a disconnected OpenShift environment.

    Start by defining which images you want to mirror. For benchmarking, we need:

    • GuideLLM container image
    • Red Hat AI Inference Server image
    cat << EOF > /mnt/local-images/imageset-config-custom.yaml
    ---
    kind: ImageSetConfiguration
    apiVersion: mirror.openshift.io/v1alpha2
    storageConfig:
      local:
        path: ./
    mirror:
      additionalImages:
      - name: quay.io/rh-aiservices-bu/guidellm:f1f8ca8
      - name: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1-1756225581
      helm: {}
    EOF

    This YAML tells oc-mirror to collect the listed images and store them locally in the current directory.

    Run oc-mirror on the internet-connected host

    Change to the working directory and run:

    cd /mnt/local-images
    oc-mirror --config imageset-config-custom.yaml file:///mnt/local-images

    This will:

    • Download all required image layers
    • Package them into tarballs and metadata directories (e.g., mirror_seq2_000000.tar)

    Transfer mirror tarballs to the disconnected host

    Use rsync or any file transfer method (USB, SCP, etc.) to move the mirrored content:

    rsync -avP /mnt/local-images/ disconnected-server:/mnt/local-images/

    Load the images into the internal registry

    On the disconnected host:

    cd /mnt/local-images
    oc-mirror --from=/mnt/local-images/mirror_seq2_000000.tar docker://$(hostname):8443

    This command unpacks the image layers and pushes them into the image registry running on the current node at port 8443.
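
    Besides pushing the images, oc-mirror also writes generated manifests (including an ImageContentSourcePolicy) into a results directory under the working directory. It is worth inspecting what was produced before writing your own policy; the exact directory name varies per run:

    ls oc-mirror-workspace/results-*/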

    Apply ImageContentSourcePolicy (ICSP)

    To allow OpenShift to redirect image pulls to the internal mirror, create an ImageContentSourcePolicy (ICSP):

    cat << EOF > /mnt/local-images/imageset-content-source.yaml
    ---
    apiVersion: operator.openshift.io/v1alpha1
    kind: ImageContentSourcePolicy
    metadata:
      name: disconnected-mirror
    spec:
      repositoryDigestMirrors:
      - mirrors:
        - registry-host:8443/rh-aiservices-bu/guidellm
        source: quay.io/rh-aiservices-bu/guidellm
      - mirrors:
        - registry-host:8443/rhaiis/vllm-cuda-rhel9
        source: registry.redhat.io/rhaiis/vllm-cuda-rhel9
    EOF
    oc apply -f /mnt/local-images/imageset-content-source.yaml

    Warning: Update registry-host in the mirrors entries to match the address or DNS name of your internal registry.

    This policy ensures that when a pod attempts to pull from quay.io/rh-aiservices-bu/guidellm, OpenShift will instead pull from your local mirror.
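
    To confirm the policy is in place, and optionally that the mirror actually serves the image, checks along these lines can be used (podman on the disconnected host is assumed; adjust TLS options to your registry certificates):

    oc get imagecontentsourcepolicy disconnected-mirror

    # Optional: pull directly from the mirror to verify it serves the GuideLLM image
    podman pull registry-host:8443/rh-aiservices-bu/guidellm:f1f8ca8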

    Now that the images are available in the disconnected environment, we can proceed to setting up persistent volumes for the model weights and tokenizer, and start running benchmarks.

    Copying model weights into a disconnected OpenShift environment

    In disconnected environments, model weights and tokenizer files must first be transferred from a connected system to a machine inside the disconnected environment. From there, they can be loaded into OpenShift using persistent volume claims (PVCs) and the oc CLI.

    Persistent volume setup

    This benchmarking stack requires the following PVCs:

    • The model weights directory for vLLM serving.
    • The tokenizer JSON/config files for generating test sequences.
    • A volume to store benchmark results.

    1. Model weights PVC

    The model weights must be made available to the Red Hat AI Inference Server (vLLM) pod. Create a PVC large enough to store the weights, then copy the weights to the PVC using a temporary model copy pod.

    oc create -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llama-31-8b-instruct-w4a16
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
    EOF
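
    You can check the new claim with oc get pvc. Depending on your storage class's volume binding mode, it may remain in Pending until the copy pod below mounts it:

    oc get pvc llama-31-8b-instruct-w4a16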

    Once the PVC is created, we can create a pod to temporarily mount the PVC so that we can copy data to it.

    Note: In the following example, we're using a ubi9 image to create this temporary pod. Other images can be used if this image is not available in the disconnected environment.

    oc create -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: model-copy-pod
    spec:
      containers:
      - name: model-copy
        image: registry.access.redhat.com/ubi9/ubi:latest
        command: ["sleep", "3600"]
        volumeMounts:
        - name: models
          mountPath: /mnt/models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: llama-31-8b-instruct-w4a16
    EOF

    Copy the model from the location on the disconnected server using the temporary pod, and then delete the pod.

    oc rsync ./local-models/llama/ model-copy-pod:/mnt/models/
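    # Optional check (not part of the original workflow): confirm the weights landed on the PVC before deleting the pod
    oc exec model-copy-pod -- ls -lh /mnt/models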
    oc delete pod model-copy-pod

    2. Tokenizer PVC

    GuideLLM requires access to the tokenizer used by the model. Create a PVC to store the tokenizer and, as in the previous step, copy the files to the PVC using a temporary pod.

    oc create -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llama-31-8b-instruct-w4a16-tokenizer
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    EOF

    Create a temporary pod to mount this PVC:

    oc create -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: tokenizer-copy-pod
    spec:
      containers:
      - name: tokenizer-copy
        image: registry.access.redhat.com/ubi9/ubi:latest
        command: ["sleep", "3600"]
        volumeMounts:
        - name: tokenizer
          mountPath: /mnt/tokenizer
      volumes:
      - name: tokenizer
        persistentVolumeClaim:
          claimName: llama-31-8b-instruct-w4a16-tokenizer
    EOF

    Copy the tokenizer and config to the PVC using the temporary pod and then delete the pod.

    oc cp ./local-models/llama/tokenizer.json tokenizer-copy-pod:/mnt/tokenizer/
    oc cp ./local-models/llama/config.json tokenizer-copy-pod:/mnt/tokenizer/
    oc delete pod tokenizer-copy-pod

    3. Benchmark results PVC

    It's also recommended to store the benchmarking results in a PVC.

    oc create -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: benchmark-results
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
    EOF

    Deploy Red Hat AI Inference Server (vLLM)

    The Red Hat AI Inference Server enables model inference using the vLLM runtime and supports air-gapped environments out of the box. For this setup, you will deploy Red Hat AI Inference Server as a Deployment resource with attached persistent volumes for model weights and configuration.

    This deployment uses the image registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1-1756225581 and is configured for the meta-llama-3.1-8B-instruct-quantized.w4a16 model.

    Example deployment manifest

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llama
      labels:
        app: llama
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llama
      template:
        metadata:
          labels:
            app: llama
        spec:
          containers:
            - name: llama
              image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1-1756225581'
              imagePullPolicy: IfNotPresent
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/mnt/models'
                - '--served-model-name=meta-llama-3.1-8B-instruct-quantized.w4a16'
                - '--tensor-parallel-size=1'
                - '--max-model-len=8096'
              resources:
                limits:
                  nvidia.com/gpu: '1'
                requests:
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: cache-volume
                  mountPath: /mnt/models
                - name: shm
                  mountPath: /dev/shm 
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
          volumes:
            - name: cache-volume
              persistentVolumeClaim:
                claimName: llama-31-8b-instruct-w4a16
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 2Gi
          restartPolicy: Always

    Notes:

    • Model weights PVC (llama-31-8b-instruct-w4a16) is mounted at /mnt/models.
    • --tensor-parallel-size is set to 1 for single-GPU workloads. Adjust accordingly.
    • GPU toleration is added to ensure the pod lands on a node with an NVIDIA GPU.

    Deploying

    Save the manifest as rhaiis-deployment.yaml and run:

    oc apply -f rhaiis-deployment.yaml

    Verify the pod is running and exposing port 8000:

    oc get pods
    oc logs <pod-name> -f
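
    If you prefer to block until the rollout finishes rather than tailing logs, oc rollout status works as well (the deployment name matches the manifest above; adjust the timeout to your model load time):

    oc rollout status deployment/llama --timeout=15m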

    Expose the Red Hat AI Inference Server deployment on port 8000

    After deploying the Red Hat AI Inference Server (vLLM), expose the pod using a Kubernetes Service so it can be reached either internally (e.g., by benchmarking tools like GuideLLM) or externally (via an Ingress or Route, if desired).

    In this guide, we’ll expose the Red Hat AI Inference Server on port 8000 using a Kubernetes Service, which makes it discoverable to other pods within the OpenShift cluster.

    Create the service

    Apply the following YAML to expose the deployment via a ClusterIP service named llama-31-8b-instruct-w4a16-svc.

    apiVersion: v1
    kind: Service
    metadata:
      name: llama-31-8b-instruct-w4a16-svc
      labels:
        app: llama
    spec:
      selector:
        app: llama
      ports:
        - protocol: TCP
          port: 8000 
          targetPort: 8000 
      type: ClusterIP 

    Save the above manifest as rhaiis-service.yaml and apply it:

    oc apply -f rhaiis-service.yaml

    This service:

    • Routes traffic to pods matching the label app: llama (the label on the vLLM deployment's pods)
    • Exposes TCP port 8000 (the same as the container port used by the vLLM server)
    • Is only accessible within the cluster (type ClusterIP)

    With this deployment, the model is now being served over HTTP from port 8000 inside the cluster and is ready for benchmarking via GuideLLM.
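
    As a quick smoke test before benchmarking, you can call the OpenAI-compatible endpoint from a throwaway pod inside the cluster. The service name matches the manifest above, and the ubi9 image is assumed to already be mirrored as described earlier:

    oc run curl-test --rm -it --restart=Never \
      --image=registry.access.redhat.com/ubi9/ubi:latest -- \
      curl -s http://llama-31-8b-instruct-w4a16-svc:8000/v1/models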

    Run the benchmark job

    To run the GuideLLM benchmarks, we deploy an OpenShift Job, which completes once the benchmarking finishes. The expectation is that GuideLLM will be run multiple times with different configurations (e.g., sequence length) to benchmark different loads.

    The args section in the GuideLLM benchmark job defines how the test is run. Here's what each key setting does:

    • --target: The URL of the vLLM model server (Red Hat AI Inference Server endpoint) to benchmark.
    • --tokenizer-path: Path to the model’s tokenizer files, ensuring accurate prompt tokenization.
    • --sequence-length: Length of the input prompt in tokens (e.g., 512).
    • --max-requests: Total number of inference requests to send during the test.
    • --request-concurrency: Number of parallel requests sent at once to simulate load.
    • --output: Location to store the results JSON file (usually backed by a PVC).

    These parameters allow you to control test size, load intensity, and output location, providing a flexible framework to benchmark models consistently in disconnected environments. 

    The following table shows several example configurations demonstrating how different prompt and output sizes, as well as concurrency levels, can be used to mimic real-world workloads.

    Prompt tokens | Output tokens | Max duration (s) | Concurrency levels    | Use case example
    2,000         | 500           | 360              | 1, 5, 10, 25, 50, 100 | General-purpose LLM tasks, chat-like prompts
    8,000         | 500           | 360              | 1, 5, 10, 25, 50, 100 | Extended context (e.g., multi-paragraph summarization)
    10,000        | 500           | 360              | 1, 5, 10, 25, 50, 100 | Deep document prompts or structured context input
    20,000        | 3,000         | 600              | 1, 5, 10, 25, 50, 100 | Long-form generation, code synthesis, legal document Q&A
    20,000        | 5,000         | 600              | 1, 5, 10, 25, 50, 100 | Full document completion or summarization of large reports

    Here's an example job:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: guidellm-benchmark
    spec:
      template:
        spec:
          containers:
          - name: guidellm
            image: quay.io/rh-aiservices-bu/guidellm:f1f8ca8
            args:
              - --target=http://llama-31-8b-instruct-w4a16-svc:8000
              - --tokenizer-path=/mnt/tokenizer
              - --sequence-length=512
              - --max-requests=100
              - --request-concurrency=8
              - --output=/results/output.json
            volumeMounts:
              - name: tokenizer
                mountPath: /mnt/tokenizer
              - name: results
                mountPath: /results
          restartPolicy: Never
          volumes:
            - name: tokenizer
              persistentVolumeClaim:
                claimName: llama-31-8b-instruct-w4a16-tokenizer
            - name: results
              persistentVolumeClaim:
                claimName: benchmark-results

    Apply the job:

    oc apply -f guidellm-job.yaml
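
    Because the same Job is typically re-run at several concurrency levels (see the table above), one option is to keep a template manifest and substitute values in a loop. The guidellm-job-template.yaml file, its __CONCURRENCY__ placeholder, and the per-run output path are hypothetical, not part of the manifests in this article:

    for c in 1 5 10 25 50 100; do
      sed -e "s/__CONCURRENCY__/${c}/g" \
          -e "s/name: guidellm-benchmark/name: guidellm-benchmark-c${c}/" \
          -e "s#/results/output.json#/results/output-c${c}.json#" \
          guidellm-job-template.yaml | oc apply -f -
    done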

    Once the Job is created, a pod starts and runs the GuideLLM benchmark to completion. When the Job finishes, an overview of the results is displayed in the pod logs.
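
    To follow a run and block until it finishes, the standard Job tooling applies:

    # Stream the benchmark output as it runs
    oc logs -f job/guidellm-benchmark

    # Wait for the Job to finish (adjust the timeout to your longest configuration)
    oc wait --for=condition=complete job/guidellm-benchmark --timeout=2h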

    Viewing the results

    The results will be saved as a structured JSON file in the /results mount. Example key metrics:

    • tokens_per_second
    • request_latency
    • time_to_first_token_ms
    • inter_token_latency_ms
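
    To pull the JSON off the cluster for offline analysis, the temporary-pod pattern used earlier for the PVCs can be reused; the results-copy-pod name here is illustrative:

    oc create -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: results-copy-pod
    spec:
      containers:
      - name: results-copy
        image: registry.access.redhat.com/ubi9/ubi:latest
        command: ["sleep", "3600"]
        volumeMounts:
        - name: results
          mountPath: /results
      volumes:
      - name: results
        persistentVolumeClaim:
          claimName: benchmark-results
    EOF

    oc rsync results-copy-pod:/results/ ./benchmark-results/
    oc delete pod results-copy-pod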

    Post-run result analysis

    The terminal output is shown in Figure 1.

    Figure 1: GuideLLM output in the terminal.

    The GuideLLM output JSON structure is as follows:

    • benchmarks: A list of all the individual concurrencies and test types that have been run.
      • benchmark
        • args: Full test configuration
        • duration
        • end_time
        • extras
        • metrics: Computed over successful requests only.
          • inter_token_latency_ms
          • output_token_count
          • output_tokens_per_second
          • prompt_token_count
          • request_concurrency
          • request_latency
          • requests_per_second
          • time_per_output_token_ms
          • time_to_first_token_ms
          • tokens_per_second
        • request_totals
        • run_stats: Includes run-level stats
        • requests: Contains the individual requests, responses, and metrics for the request in question.
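
    For quick extraction of individual metrics from the JSON, a jq query along these lines can be used. Treat the exact field paths as a sketch; they may differ between GuideLLM versions:

    # List the metrics block for each benchmark run (successful requests only, as noted above)
    jq '.benchmarks[].metrics' ./benchmark-results/output.json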

    Conclusion

    Running LLM serving and benchmarking in a fully disconnected cluster doesn’t have to be hard. With vLLM and GuideLLM, the workflow is simple and OpenShift-native:

    • Serve fast with vLLM, no internet required. Point vLLM at a PVC with your model weights, deploy the standard container image, and expose port 8000. Swapping models or scaling GPUs is just a manifest tweak away.
    • Benchmark immediately with GuideLLM. GuideLLM ships its own corpus and uses the model’s native tokenizer, so you get accurate TPS/latency metrics without pulling external datasets or tools.
    • Mirror once, reuse everywhere. Using oc-mirror, the few required images are mirrored into your internal registry. After that, every deploy and benchmark runs repeatably with no outbound calls.
    • All Kubernetes primitives. PVCs, Jobs, and Services keep operations familiar, scriptable, and auditable for regulated environments.

    vLLM makes serving LLMs in a disconnected environment straightforward, and GuideLLM makes offline benchmarking easy. Together they provide a clean, reproducible path from model deployment to performance insights—entirely within an air-gapped OpenShift environment.
