Inference is the process of using a trained model to make predictions on new, unseen data. While running a prediction on a single row in a Jupyter notebook is straightforward, this approach doesn't scale to the millions of rows found in large datasets.
This scaling problem often requires collaboration between data scientists and platform engineers to configure Kubernetes-based infrastructure for these compute-heavy workloads. In a Red Hat OpenShift AI environment, the goal is to provide an efficient way to run inference at scale, allowing data scientists to evaluate their models without becoming infrastructure experts.
This blog demonstrates how both platform engineers and data scientists can solve this challenge using the CodeFlare SDK with Ray Data and vLLM on OpenShift AI. I will show you how to submit a Python script to run a large-scale, distributed batch inference job on a remote Ray cluster, bridging the gap between local development and production-scale execution.
Online versus offline inference
Before diving into the solution, it's important to distinguish between two primary types of model inference. The right approach depends entirely on the use case.
- Online inference is designed for immediate, single-request predictions where low latency is essential. This is the process that powers most interactive AI applications. For instance, when you send a prompt to an LLM like Gemini or a customer service chatbot, your request hits an API endpoint. The model performs inference in real time to generate your response.
- Offline (batch) inference is used to process a large volume of data at once. In this pattern, overall completion time is more important than the latency of any single prediction. This is ideal for large-scale, non-interactive tasks like generating reports, analyzing a month's worth of sensor data, or classifying a large collection of images.
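To make the contrast concrete, here is a rough illustrative sketch of the two patterns in Python. The endpoint URL, model name, and dataset below are placeholders rather than anything from a real deployment:
# Online: one request at a time against a served endpoint; per-request latency matters.
# (The URL and model name below are hypothetical.)
import requests
resp = requests.post(
    "https://my-endpoint.example.com/v1/chat/completions",
    json={"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]},
    timeout=30,
)
print(resp.json())
# Offline (batch): push an entire dataset through the model; overall throughput matters.
prompts = [f"Classify record {i}" for i in range(1_000_000)]
# A framework such as Ray Data groups these rows into batches and fans them out
# to GPU workers, which is what the rest of this post sets up.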
This blog focuses exclusively on performing offline batch inference at scale.
Connecting to a Ray cluster via the CodeFlare SDK
The CodeFlare SDK acts as the bridge between your Python environment and the Ray cluster. You establish this connection the same way, whether you’re working from a workbench within OpenShift AI or from a notebook running locally on your machine.
To connect, you need two pieces of information: the dashboard URL for the Ray cluster and a valid authentication token. As a data scientist, you can obtain both from your platform engineer or cluster administrator.
from codeflare_sdk import RayJobClient

auth_token = "replace-me"
header = {
    'Authorization': f'Bearer {auth_token}'
}
ray_dashboard = "replace-me"
client = RayJobClient(address=ray_dashboard, headers=header, verify=True)
The code block above prepares your connection by:
- Placing the `auth_token` inside a standard HTTP Authorization header.
- Identifying the `ray_dashboard` URL, which is the unique network address for your Ray cluster's API.
- Passing both the address and headers into the `RayJobClient` to initialize it.
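If you are the platform engineer supplying those two values, here is one possible way to look up the dashboard URL with the CodeFlare SDK, assuming the Ray cluster was created through the SDK. The cluster name, namespace, server URL, and token below are placeholders, not values from this post:
# Hedged sketch for a platform engineer; every name and URL here is a placeholder.
from codeflare_sdk import TokenAuthentication, get_cluster
auth = TokenAuthentication(
    token="sha256~your-openshift-token",   # e.g. from `oc whoami --show-token`
    server="https://api.your-openshift-cluster:6443",
    skip_tls=False,
)
auth.login()
# Look up the existing Ray cluster and print the dashboard URL to hand over.
cluster = get_cluster(cluster_name="ray-batch", namespace="data-science")
print(cluster.cluster_dashboard_uri())
The data scientist can then pair that URL with their own OpenShift token in the connection code above.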
With the client successfully created and authentication set up, you are ready to define your batch inference workload, package it as a Ray Job, and submit it for distributed execution on the cluster.
Building your remote batch inference job
Now that you have an authenticated `RayJobClient`, you can define and submit your batch inference job. The entire logic for your job can be contained in a single Python script. We'll call ours `batch_inference.py`.
The process involves two main stages:
- Configuring the job: Defining the model to use, where to get it, and performance settings like batch size.
- Submitting the job: Sending the script to the Ray cluster for remote execution.
Let's walk through what `batch_inference.py` contains.
Configuring the vLLM inference engine
At the heart of our script is the `vLLMEngineProcessorConfig`. This object from Ray Data tells the cluster everything it needs to know about the model and how to run it.
Sourcing your model
You have several flexible options for specifying the `model_source`:
From the Hugging Face Hub (default): This is the simplest method. Provide the model's Hugging Face repository ID. If it's a private or gated model, you can provide your `HF_TOKEN` via the `runtime_env`, which we'll show in the submission step.
# In batch_inference.py
from ray.data.llm import vLLMEngineProcessorConfig
processor_config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.2-1B-Instruct",
    ...
)
From cloud storage (Amazon S3, Google Cloud Storage): For production stability and to avoid Hugging Face rate limits, caching your model in your own cloud storage is a best practice. Ray provides a utility to do this easily:
python -m ray.llm.utils.upload_model --model-source <repository-id> --bucket-uri s3://your-bucket/llama3-etc
You can then point `model_source` to the S3 URI. The documentation recommends using the `runai_streamer` load format for optimized loading from cloud storage. You must also provide your cloud credentials via the `runtime_env` during job submission.
# In batch_inference.py
processor_config = vLLMEngineProcessorConfig(
    model_source="s3://your-bucket/llama3.2-1b/",
    engine_kwargs={"load_format": "runai_streamer"},
    ...
)
From a local path on the cluster: If a model is already on the cluster's filesystem (e.g., a shared network volume), you can provide its absolute path.
model_source="/mnt/shared_models/my-custom-model"
Once you’ve chosen how to source your model, you can move on to tuning the job for performance.
Tuning for performance: Batching and concurrency
You can tune several parameters to maximize the efficiency of your GPU resources.
- `concurrency`: Sets the number of parallel vLLM workers. Set this to the number of available GPUs on your Ray cluster’s workers to process multiple batches simultaneously. So if you have 4 workers with 1 GPU each, this can be set to 4.
- `batch_size`: Tells Ray Data how many rows to group together and send to a worker at once. Larger batches can improve GPU utilization.
For very large models that don't fit on a single GPU, you can use model parallelism to shard the model across multiple devices. This is configured within `engine_kwargs`.
# A more tuned configuration
processor_config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    batch_size=32,
    concurrency=4,  # Number of parallel vLLM workers (replicas)
    engine_kwargs={
        "dtype": "half",
        "max_model_len": 4096,
        "tensor_parallel_size": 4,  # Shard each worker's model across 4 GPUs
    },
    # Note: total GPUs required = concurrency * tensor_parallel_size (16 here).
)
Defining the processing logic
With the configuration defined, you use `build_llm_processor` to create the full inference pipeline. This involves defining `preprocess` and `postprocess` functions.
- `preprocess`: This function takes a row from your input dataset and transforms it into the format the LLM expects. For most modern chat models, this is the OpenAI chat message format.
- `postprocess`: This function takes the LLM's output and cleans it up, so you're left with just the data you need.
# In batch_inference.py (continued)
import ray  # needed for ray.data.range below
from ray.data.llm import build_llm_processor

processor = build_llm_processor(
    processor_config,
    preprocess=lambda row: dict(
        messages=[
            {
                "role": "system",
                "content": "You are a calculator. Please only output the answer of the given equation.",
            },
            {"role": "user", "content": f"{row['id']} ** 3 = ?"},
        ],
        sampling_params=dict(temperature=0.3, max_tokens=20),
    ),
    postprocess=lambda row: {
        "resp": row["generated_text"],
    },
)
# --- This last part executes the pipeline ---
# 1. Create a dummy dataset of 100 rows
ds = ray.data.range(100)
# 2. Apply the processor (this is lazy and doesn't run yet)
ds = processor(ds)
# 3. Materialize the results (this triggers the distributed execution)
ds = ds.materialize()
# 4. Print all outputs.
for out in ds.take_all():
    print(out)
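The `ray.data.range(100)` dataset above is only a stand-in. In a real job you would typically read your input from storage and write the results back out instead of printing them. Here is a minimal sketch of that swap, assuming a hypothetical CSV of prompts in S3 and an output prefix you control; the bucket paths and the prompt column name are placeholders:
# In batch_inference.py, replacing the dummy-dataset section above (paths are placeholders).
ds = ray.data.read_csv("s3://your-bucket/prompts.csv")  # expects a "prompt" column
# The preprocess lambda would then build the user message from row["prompt"]
# instead of the arithmetic example that uses row["id"].
ds = processor(ds)
# Persist the results instead of printing each row.
ds.write_parquet("s3://your-bucket/inference-results/")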
Submitting and monitoring the job
Now that the complete logic is in `batch_inference.py`, you can use the `RayJobClient` from your local notebook to submit it for remote execution.
Here you define the `entrypoint_command` and the `runtime_env`. The `runtime_env` is critical for specifying dependencies (`pip`) and environment variables (like cloud credentials or `HF_TOKEN`).
# Run this in the notebook where you created the client earlier
entrypoint_command = "python batch_inference.py"

# Submit the job using the previously created RayJobClient
submission_id = client.submit_job(
    entrypoint=entrypoint_command,
    runtime_env={
        "working_dir": "./",  # Upload the local directory so batch_inference.py is on the cluster
        "pip": ["vllm==0.9.1", "s3fs"],  # Add any needed libraries
        "env_vars": {
            "HF_TOKEN": "your-hugging-face-token",
            "AWS_ACCESS_KEY_ID": "your_access_key_id",
            "AWS_SECRET_ACCESS_KEY": "your_secret_access_key",
        },
    },
)
print(f"Job submitted with ID: {submission_id}")

# You can then monitor the job
print(client.get_job_status(submission_id))
print("--- JOB LOGS ---")
print(client.get_job_logs(submission_id))
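If you want the notebook to block until the job finishes, a small polling loop is enough. This is a minimal sketch, assuming the status reported by the client maps to Ray's terminal job states (`SUCCEEDED`, `FAILED`, `STOPPED`):
# Minimal polling sketch; `client` and `submission_id` come from the cell above.
import time
terminal_states = {"SUCCEEDED", "FAILED", "STOPPED"}
while True:
    status = client.get_job_status(submission_id)
    print(f"Job status: {status}")
    if getattr(status, "name", str(status)) in terminal_states:
        break
    time.sleep(30)  # Poll every 30 seconds
print("--- FINAL JOB LOGS ---")
print(client.get_job_logs(submission_id))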
And with that, you have successfully orchestrated a batch inference job with vLLM and Ray Data via the CodeFlare SDK. By defining your model configuration and processing logic in a simple Python script, you can use the `RayJobClient` to send that workload to a remote Ray cluster. This workflow allows you to use the full power of distributed computing and modern inference engines like vLLM without ever leaving your Python environment.
Conclusion
Scaling machine learning inference from a single prediction to millions of records is a common but significant hurdle. As we've shown, this challenge doesn’t require a deep understanding of complex infrastructure management. By learning to connect a `RayJobClient`, define your logic in a Python script, and submit it for remote execution, you can use the CodeFlare SDK to bridge the gap between your local environment and a powerful, remote cluster.
To learn more, check out the official documentation:
- CodeFlare SDK: GitHub repository and examples
- Ray Data: Official Ray Data documentation
- vLLM: Official vLLM documentation