Inference is the process of using a trained model to make predictions on new, unseen data. While running a prediction on a single row in a Jupyter notebook is straightforward, this approach doesn't scale to the millions of rows found in large datasets.
This scaling problem often requires collaboration between data scientists and platform engineers to configure Kubernetes-based infrastructure for these compute-heavy workloads. In a Red Hat OpenShift AI environment, the goal is to provide an efficient way to run inference at scale, allowing data scientists to evaluate their models without becoming infrastructure experts.
This blog demonstrates how both platform engineers and data scientists can solve this challenge using the CodeFlare SDK with Ray Data and vLLM on OpenShift AI. I will show you how to submit a Python script to run a large-scale, distributed batch inference job on a remote Ray cluster, bridging the gap between local development and production-scale execution.
Online versus offline inference
Before diving into the solution, it's important to distinguish between two primary types of model inference. The right approach depends entirely on the use case.
- Online inference is designed for immediate, single-request predictions where low latency is essential. This is the process that powers most interactive AI applications. For instance, when you send a prompt to an LLM like Gemini or a customer service chatbot, your request hits an API endpoint. The model performs inference in real time to generate your response.
- Offline (batch) inference is used to process a large volume of data at once. In this pattern, overall completion time is more important than the latency of any single prediction. This is ideal for large-scale, non-interactive tasks like generating reports, analyzing a month's worth of sensor data, or classifying a large collection of images.
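To make the contrast concrete, here is a rough illustrative sketch of the two patterns in Python. The endpoint URL, model name, and dataset below are placeholders rather than anything from a real deployment:
# Online: one request at a time against a served endpoint; per-request latency matters.
# (The URL and model name below are hypothetical.)
import requests
resp = requests.post(
    "https://my-endpoint.example.com/v1/chat/completions",
    json={"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]},
    timeout=30,
)
print(resp.json())
# Offline (batch): push an entire dataset through the model; overall throughput matters.
prompts = [f"Classify record {i}" for i in range(1_000_000)]
# A framework such as Ray Data groups these rows into batches and fans them out
# to GPU workers, which is what the rest of this post sets up.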
This blog focuses exclusively on performing offline batch inference at scale.
Connecting to a Ray cluster via the CodeFlare SDK
The CodeFlare SDK acts as the bridge between your Python environment and the Ray cluster. You establish this connection the same way, whether you’re working from a workbench within OpenShift AI or from a notebook running locally on your machine.
To connect, you need two pieces of information: the dashboard URL for the Ray cluster and a valid authentication token. As a data scientist, you can obtain both from your platform engineer or cluster administrator.
from codeflare_sdk import RayJobClient

auth_token = "replace-me"
header = {
    'Authorization': f'Bearer {auth_token}'
}
ray_dashboard = "replace-me"
client = RayJobClient(address=ray_dashboard, headers=header, verify=True)
The code block above prepares your connection by:
- Placing the `auth_token` inside a standard HTTP Authorization header.
- Identifying the `ray_dashboard` URL, which is the unique network address for your Ray cluster's API.
- Passing both the address and headers into the `RayJobClient` to initialize it.
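If you are the platform engineer supplying those two values, here is one possible way to look up the dashboard URL with the CodeFlare SDK, assuming the Ray cluster was created through the SDK. The cluster name, namespace, server URL, and token below are placeholders, not values from this post:
# Hedged sketch for a platform engineer; every name and URL here is a placeholder.
from codeflare_sdk import TokenAuthentication, get_cluster
auth = TokenAuthentication(
    token="sha256~your-openshift-token",   # e.g. from `oc whoami --show-token`
    server="https://api.your-openshift-cluster:6443",
    skip_tls=False,
)
auth.login()
# Look up the existing Ray cluster and print the dashboard URL to hand over.
cluster = get_cluster(cluster_name="ray-batch", namespace="data-science")
print(cluster.cluster_dashboard_uri())
The data scientist can then pair that URL with their own OpenShift token in the connection code above.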
With the client successfully created and authentication set up, you are ready to define your batch inference workload, package it as a Ray Job, and submit it for distributed execution on the cluster.
Building your remote batch inference job
Now that you have an authenticated `RayJobClient`, you can define and submit your batch inference job. The entire logic for your job can be contained in a single Python script. We'll call ours `batch_inference.py`.
The process involves two main stages:
- Configuring the job: Defining the model to use, where to get it, and performance settings like batch size.
- Submitting the job: Sending the script to the Ray cluster for remote execution.
Let's walk through what `batch_inference.py` contains.
Configuring the vLLM inference engine
At the heart of our script is the `vLLMEngineProcessorConfig`. This object from Ray Data tells the cluster everything it needs to know about the model and how to run it.
Sourcing your model
You have several flexible options for specifying the `model_source`:
From the Hugging Face Hub (default): This is the simplest method. Provide the model's Hugging Face repository ID. If it's a private or gated model, you can provide your `HF_TOKEN` via the `runtime_env`, which we'll show in the submission step.
# In batch_inference.py
from ray.data.llm import vLLMEngineProcessorConfig
processor_config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.2-1B-Instruct",
    ...
)
From cloud storage (Amazon S3, Google Cloud Storage): For production stability and to avoid Hugging Face rate limits, caching your model in your own cloud storage is a best practice. Ray provides a utility to do this easily:
python -m ray.llm.utils.upload_model --model-source <repository-id> --bucket-uri s3://your-bucket/llama3-etc
You can then point `model_source` to the S3 URI. The documentation recommends using the `runai_streamer` load format for optimized loading from cloud storage. You must also provide your cloud credentials via the `runtime_env` during job submission.
# In batch_inference.py
processor_config = vLLMEngineProcessorConfig(
    model_source="s3://your-bucket/llama3.2-1b/",
    engine_kwargs={"load_format": "runai_streamer"},
    ...
)
From a local path on the cluster: If a model is already on the cluster's filesystem (e.g., a shared network volume), you can provide its absolute path.
model_source="/mnt/shared_models/my-custom-model"
Once you’ve chosen how to source your model, you can move on to tuning the job for performance.
Tuning for performance: Batching and concurrency
You can tune several parameters to maximize the efficiency of your GPU resources.
- `concurrency`: Sets the number of parallel vLLM workers. Set this to the number of available GPUs on your Ray cluster’s workers to process multiple batches simultaneously. So if you have 4 workers with 1 GPU each, this can be set to 4.
- `batch_size`: Tells Ray Data how many rows to group together and send to a worker at once. Larger batches can improve GPU utilization.
For very large models that don't fit on a single GPU, you can use model parallelism to shard the model across multiple devices. This is configured within `engine_kwargs`.
# A more tuned configuration
processor_config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    batch_size=32,
    concurrency=4,  # Number of parallel vLLM workers (replicas)
    engine_kwargs={
        "dtype": "half",
        "max_model_len": 4096,
        "tensor_parallel_size": 4,  # Shard each worker's model across 4 GPUs
    },
    # Note: total GPUs required = concurrency * tensor_parallel_size (16 here).
)
Defining the processing logic
With the configuration defined, you use `build_llm_processor` to create the full inference pipeline. This involves defining `preprocess` and `postprocess` functions.
- `preprocess`: This function takes a row from your input dataset and transforms it into the format the LLM expects. For most modern chat models, this is the OpenAI chat message format.
- `postprocess`: This function takes the LLM's output and cleans it up, so you're left with just the data you need.
# In batch_inference.py (continued)
import ray  # needed for ray.data.range below
from ray.data.llm import build_llm_processor

processor = build_llm_processor(
    processor_config,
    preprocess=lambda row: dict(
        messages=[
            {
                "role": "system",
                "content": "You are a calculator. Please only output the answer of the given equation.",
            },
            {"role": "user", "content": f"{row['id']} ** 3 = ?"},
        ],
        sampling_params=dict(temperature=0.3, max_tokens=20),
    ),
    postprocess=lambda row: {
        "resp": row["generated_text"],
    },
)
# --- This last part executes the pipeline ---
# 1. Create a dummy dataset of 100 rows
ds = ray.data.range(100)
# 2. Apply the processor (this is lazy and doesn't run yet)
ds = processor(ds)
# 3. Materialize the results (this triggers the distributed execution)
ds = ds.materialize()
# 4. Print all outputs.
for out in ds.take_all():
    print(out)
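The `ray.data.range(100)` dataset above is only a stand-in. In a real job you would typically read your input from storage and write the results back out instead of printing them. Here is a minimal sketch of that swap, assuming a hypothetical CSV of prompts in S3 and an output prefix you control; the bucket paths and the prompt column name are placeholders:
# In batch_inference.py, replacing the dummy-dataset section above (paths are placeholders).
ds = ray.data.read_csv("s3://your-bucket/prompts.csv")  # expects a "prompt" column
# The preprocess lambda would then build the user message from row["prompt"]
# instead of the arithmetic example that uses row["id"].
ds = processor(ds)
# Persist the results instead of printing each row.
ds.write_parquet("s3://your-bucket/inference-results/")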
Submitting and monitoring the job
Now that the complete logic is in `batch_inference.py`, you can use the `RayJobClient` from your local notebook to submit it for remote execution.
Here you define the `entrypoint_command` and the `runtime_env`. The `runtime_env` is critical for specifying dependencies (`pip`) and environment variables (like cloud credentials or `HF_TOKEN`).
# Run this in the notebook where you created the client earlier
entrypoint_command = "python batch_inference.py"

# Submit the job using the previously created RayJobClient
submission_id = client.submit_job(
    entrypoint=entrypoint_command,
    runtime_env={
        "working_dir": "./",  # Upload the local directory so batch_inference.py is on the cluster
        "pip": ["vllm==0.9.1", "s3fs"],  # Add any needed libraries
        "env_vars": {
            "HF_TOKEN": "your-hugging-face-token",
            "AWS_ACCESS_KEY_ID": "your_access_key_id",
            "AWS_SECRET_ACCESS_KEY": "your_secret_access_key",
        },
    },
)
print(f"Job submitted with ID: {submission_id}")

# You can then monitor the job
print(client.get_job_status(submission_id))
print("--- JOB LOGS ---")
print(client.get_job_logs(submission_id))
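If you want the notebook to block until the job finishes, a small polling loop is enough. This is a minimal sketch, assuming the status reported by the client maps to Ray's terminal job states (`SUCCEEDED`, `FAILED`, `STOPPED`):
# Minimal polling sketch; `client` and `submission_id` come from the cell above.
import time
terminal_states = {"SUCCEEDED", "FAILED", "STOPPED"}
while True:
    status = client.get_job_status(submission_id)
    print(f"Job status: {status}")
    if getattr(status, "name", str(status)) in terminal_states:
        break
    time.sleep(30)  # Poll every 30 seconds
print("--- FINAL JOB LOGS ---")
print(client.get_job_logs(submission_id))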
And with that, you have successfully orchestrated a batch inference job with vLLM and Ray Data via the CodeFlare SDK. By defining your model configuration and processing logic in a simple Python script, you can use the `RayJobClient` to send that workload to a remote Ray cluster. This workflow allows you to use the full power of distributed computing and modern inference engines like vLLM without ever leaving your Python environment.
Conclusion
Scaling machine learning inference from a single prediction to millions of records is a common but significant hurdle. As we've shown, this challenge doesn’t require a deep understanding of complex infrastructure management. By learning to connect a `RayJobClient`, define your logic in a Python script, and submit it for remote execution, you can use the CodeFlare SDK to bridge the gap between your local environment and a powerful, remote cluster.
To learn more, check out the official documentation:
- CodeFlare SDK: GitHub repository and examples
- Ray Data: Official Ray Data documentation
- vLLM: Official vLLM documentation