Batch inference on OpenShift AI with Ray Data, vLLM, and CodeFlare

August 7, 2025
Bryan Keane
Related topics:
Artificial intelligence
Related products:
Red Hat AI, Red Hat OpenShift AI


    Inference is the process of using a trained model to make predictions on new, unseen data. While running a prediction on a single row in a Jupyter notebook is straightforward, this approach doesn't scale to the millions of rows found in large datasets.
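
    For contrast, here is roughly what that single-row case looks like. This is a minimal, illustrative sketch using the Hugging Face transformers pipeline; the model name is reused from later in this post purely as an example:

    # A one-off prediction in a notebook: fine for one row, impractical for millions.
    # Illustrative only; any small instruct model would do.
    from transformers import pipeline

    generator = pipeline("text-generation", model="unsloth/Llama-3.2-1B-Instruct")
    print(generator("2 ** 3 = ?", max_new_tokens=20)[0]["generated_text"])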

    This scaling problem often requires collaboration between data scientists and platform engineers to configure Kubernetes-based infrastructure for these compute-heavy workloads. In a Red Hat OpenShift AI environment, the goal is to provide an efficient way to run inference at scale, allowing data scientists to evaluate their models without becoming infrastructure experts.

    This blog demonstrates how both platform engineers and data scientists can solve this challenge using the CodeFlare SDK with Ray Data and vLLM on OpenShift AI. I will show you how to submit a Python script to run a large-scale, distributed batch inference job on a remote Ray cluster, bridging the gap between local development and production-scale execution.

    Online versus offline inference

    Before diving into the solution, it's important to distinguish between two primary types of model inference. The right approach depends entirely on the use case.

    • Online inference is designed for immediate, single-request predictions where low latency is essential. This is the process that powers most interactive AI applications. For instance, when you send a prompt to an LLM like Gemini or a customer service chatbot, your request hits an API endpoint and the model performs inference in real time to generate your response (see the short sketch after this list).
    • Offline (batch) inference is used to process a large volume of data at once. In this pattern, overall completion time is more important than the latency of any single prediction. This is ideal for large-scale, non-interactive tasks like generating reports, analyzing a month's worth of sensor data, or classifying a large collection of images.
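
    To make the online case concrete, a single request to an OpenAI-compatible chat endpoint (such as one served by vLLM) might look like the sketch below; the URL and model name are placeholders:

    # Illustrative only: one online request, one response, low latency.
    import requests

    response = requests.post(
        "http://your-model-server/v1/chat/completions",  # placeholder endpoint
        json={
            "model": "your-model",  # placeholder model name
            "messages": [{"role": "user", "content": "2 ** 3 = ?"}],
        },
    )
    print(response.json()["choices"][0]["message"]["content"])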

    This blog focuses exclusively on performing offline batch inference at scale. 

    Connecting to a Ray cluster via the CodeFlare SDK

    The CodeFlare SDK acts as the bridge between your Python environment and the Ray cluster. You establish this connection the same way, whether you’re working from a workbench within OpenShift AI or from a notebook running locally on your machine.

    To connect, you need two pieces of information: the dashboard URL for the Ray cluster and a valid authentication token. As a data scientist, you can obtain both from your platform engineer or cluster administrator.

    from codeflare_sdk import RayJobClient

    # Token and dashboard URL are provided by your platform engineer or cluster admin
    auth_token = "replace-me"
    header = {
        'Authorization': f'Bearer {auth_token}'
    }

    ray_dashboard = "replace-me"
    client = RayJobClient(address=ray_dashboard, headers=header, verify=True)

    The code block above prepares your connection by:

    1. Placing the auth_token inside a standard HTTP Authorization header.
    2. Identifying the ray_dashboard URL, which is the unique network address for your Ray cluster's API.
    3. Passing both the address and headers into the RayJobClient to initialize it.

    With the client successfully created and authentication set up, you are ready to define your batch inference workload, package it as a Ray Job, and submit it for distributed execution on the cluster.

    Building your remote batch inference job

    Now that you have an authenticated RayJobClient, you can define and submit your batch inference job. The entire logic for your job can be contained in a single Python script. We'll call ours batch_inference.py.

    The process involves two main stages:

    1. Configuring the job: Defining the model to use, where to get it, and performance settings like batch size.
    2. Submitting the job: Sending the script to the Ray cluster for remote execution.

    Let's walk through what batch_inference.py contains.

    Configuring the vLLM inference engine

    At the heart of our script is the vLLMEngineProcessorConfig. This object from Ray Data tells the cluster everything it needs to know about the model and how to run it.

    Sourcing your model

    You have several flexible options for specifying the model_source:

    • From the Hugging Face Hub (default): This is the simplest method. Provide the model's Hugging Face repository ID. If it's a private or gated model, you can provide your HF_TOKEN via the runtime_env, which we'll show in the submission step.

      # In batch_inference.py
      from ray.data.llm import vLLMEngineProcessorConfig

      processor_config = vLLMEngineProcessorConfig(
          model_source="unsloth/Llama-3.2-1B-Instruct",
          ...
      )
    • From cloud storage (Amazon S3, Google Cloud Storage): For production stability and to avoid Hugging Face rate limits, caching your model in your own cloud storage is a best practice. Ray provides a utility to do this easily: 

      python -m ray.llm.utils.upload_model --model-source <repository-id> --bucket-uri s3://your-bucket/llama3-etc

      You can then point model_source to the S3 URI. The documentation recommends using the runai_streamer load format for optimized loading from cloud storage. You must also provide your cloud credentials via the runtime_env during job submission.

      # In batch_inference.py
      processor_config = vLLMEngineProcessorConfig(
          model_source="s3://your-bucket/llama3.2-1b/",
          engine_kwargs={"load_format": "runai_streamer"},
          ...
      )
    • From a local path on the cluster: If a model is already on the cluster's filesystem (e.g., a shared network volume), you can provide its absolute path.

      model_source="/mnt/shared_models/my-custom-model"

    Once you’ve chosen how to source your model, you can move on to tuning it.

    Tuning for performance: Batching and concurrency

    You can tune several parameters to maximize the efficiency of your GPU resources.

    • concurrency: Sets the number of parallel vLLM workers. Set this to the number of available GPUs on your Ray cluster’s workers to process multiple batches simultaneously. So if you have 4 workers with 1 GPU each, this can be set to 4.
    • batch_size: Tells Ray Data how many rows to group together and send to a worker at once. Larger batches can improve GPU utilization.

    For very large models that don't fit on a single GPU, you can use model parallelism to shard the model across multiple devices. This is configured within engine_kwargs.

    # A more tuned configuration
    processor_config = vLLMEngineProcessorConfig(
        model_source="unsloth/Llama-3.1-8B-Instruct",
        batch_size=32,
        concurrency=4,  # Number of parallel vLLM workers (each uses tensor_parallel_size GPUs)
        engine_kwargs={
            "dtype": "half",
            "max_model_len": 4096,
            "tensor_parallel_size": 4,  # Shard the model across 4 GPUs per worker
        },
    )

    Defining the processing logic

    With the configuration defined, you use build_llm_processor to create the full inference pipeline. This involves defining preprocess and postprocess functions.

    • preprocess: This function takes a row from your input dataset and transforms it into the format the LLM expects. For most modern chat models, this is the OpenAI chat message format.
    • postprocess: This function takes the LLM's output and cleans it up, so you're left with just the data you need.

    # In batch_inference.py (continued)
    import ray
    from ray.data.llm import build_llm_processor

    processor = build_llm_processor(
        processor_config,
        preprocess=lambda row: dict(
            messages=[
                {
                    "role": "system",
                    "content": "You are a calculator. Please only output the answer of the given equation.",
                },
                {"role": "user", "content": f"{row['id']} ** 3 = ?"},
            ],
            sampling_params=dict(temperature=0.3, max_tokens=20),
        ),
        postprocess=lambda row: {
            "resp": row["generated_text"],
        },
    )
    # --- This last part executes the pipeline ---
    # 1. Create a dummy dataset of 100 rows
    ds = ray.data.range(100)
    # 2. Apply the processor (this is lazy and doesn't run yet)
    ds = processor(ds)
    # 3. Materialize the results and print them
    ds = ds.materialize()
    # 4. Print all outputs.
    for out in ds.take_all():
        print(out)
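
    Printing is handy for a demo, but in a real batch job you would typically persist the results instead. As a sketch (the bucket path is a placeholder), Ray Data can write the materialized dataset straight to Parquet in object storage, reusing the s3fs dependency and AWS credentials passed in the runtime_env shown in the next section:

    # Instead of printing, persist the results (placeholder bucket path)
    ds.write_parquet("s3://your-bucket/inference-results/")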

    Submitting and monitoring the job

    Now that the complete logic is in batch_inference.py, you can use the RayJobClient from your local notebook to submit it for remote execution.

    Here you define the entrypoint_command and the runtime_env. The runtime_env is critical for specifying dependencies (pip) and environment variables (like cloud credentials or HF_TOKEN).

    # Run this in the notebook where you created the client earlier
    entrypoint_command = "python batch_inference.py"

    # Submit the job using the previously created RayJobClient
    submission_id = client.submit_job(
        entrypoint=entrypoint_command,
        runtime_env={
            "pip": ["vllm==0.9.1", "s3fs"],  # Add any other needed libraries
            "env_vars": {
                "HF_TOKEN": "your-hugging-face-token",
                "AWS_ACCESS_KEY_ID": "your_access_key_id",
                "AWS_SECRET_ACCESS_KEY": "your_secret_access_key",
            },
        },
    )
    print(f"Job submitted with ID: {submission_id}")

    # You can then monitor the job
    print(client.get_job_status(submission_id))
    print("--- JOB LOGS ---")
    print(client.get_job_logs(submission_id))
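
    If you would rather block until the job finishes than check on it manually, a simple polling loop over get_job_status works. This is a minimal sketch; it assumes the status returned by the client compares equal to the standard Ray job status strings:

    import time

    # Poll until the job reaches a terminal state, then dump its logs
    while client.get_job_status(submission_id) not in ("SUCCEEDED", "FAILED", "STOPPED"):
        time.sleep(10)
    print(client.get_job_logs(submission_id))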

    And with that, you have successfully orchestrated a batch inference job with vLLM and Ray Data via the CodeFlare SDK. By defining your model configuration and processing logic in a simple Python script, you can use the RayJobClient to send that workload to a remote Ray cluster. This workflow allows you to use the full power of distributed computing and modern inference engines like vLLM without ever leaving your Python environment.

    Conclusion

    Scaling machine learning inference from a single prediction to millions of records is a common but significant hurdle. As we've shown, this challenge doesn’t require a deep understanding of complex infrastructure management. By learning to connect a RayJobClient, define your logic in a Python script, and submit it for remote execution, you can use the CodeFlare SDK to bridge the gap between your local environment and a powerful, remote cluster.

    To learn more, check out the official documentation:

    • CodeFlare SDK: GitHub repository and examples
    • Ray Data: Official Ray Data documentation
    • vLLM: Official vLLM documentation
