Batch inference on OpenShift AI with Ray Data, vLLM, and CodeFlare

August 7, 2025
Bryan Keane
Related topics:
Artificial intelligence
Related products:
Red Hat AI, Red Hat OpenShift AI

    Inference is the process of using a trained model to make predictions on new, unseen data. While running a prediction on a single row in a Jupyter notebook is straightforward, this approach doesn't scale to the millions of rows found in large datasets.

    This scaling problem often requires collaboration between data scientists and platform engineers to configure Kubernetes-based infrastructure for these compute-heavy workloads. In a Red Hat OpenShift AI environment, the goal is to provide an efficient way to run inference at scale, allowing data scientists to evaluate their models without becoming infrastructure experts.

    This blog demonstrates how both platform engineers and data scientists can solve this challenge using the CodeFlare SDK with Ray Data and vLLM on OpenShift AI. I will show you how to submit a Python script to run a large-scale, distributed batch inference job on a remote Ray cluster, bridging the gap between local development and production-scale execution.

    Online versus offline inference

    Before diving into the solution, it's important to distinguish between two primary types of model inference. The right approach depends entirely on the use case.

    • Online inference is designed for immediate, single-request predictions where low latency is essential. This is the process that powers most interactive AI applications. For instance, when you send a prompt to an LLM like Gemini or a customer service chatbot, your request hits an API endpoint. The model performs inference in real time to generate your response.
    • Offline (batch) inference is used to process a large volume of data at once. In this pattern, overall completion time is more important than the latency of any single prediction. This is ideal for large-scale, non-interactive tasks like generating reports, analyzing a month's worth of sensor data, or classifying a large collection of images.

    This blog focuses exclusively on performing offline batch inference at scale. 

    Connecting to a Ray cluster via the CodeFlare SDK

    The CodeFlare SDK acts as the bridge between your Python environment and the Ray cluster. You establish this connection the same way, whether you’re working from a workbench within OpenShift AI or from a notebook running locally on your machine.

    To connect, you need 2 pieces of information: the dashboard URL for the Ray cluster and a valid authentication token. As a data scientist, you can get both from your platform engineer or cluster administrator.

    from codeflare_sdk import RayJobClient

    auth_token = "replace-me"  # Token provided by your platform engineer
    header = {
        'Authorization': f'Bearer {auth_token}'
    }

    ray_dashboard = "replace-me"  # Dashboard URL of the Ray cluster
    client = RayJobClient(address=ray_dashboard, headers=header, verify=True)

    The code block above prepares your connection by:

    1. Placing the auth_token inside a standard HTTP Authorization header.
    2. Identifying the ray_dashboard URL, which is the unique network address for your Ray cluster's API.
    3. Passing both the address and headers into the RayJobClient to initialize it.
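
Hard-coding a token in a notebook makes it easy to leak. The steps above can be sketched with the two values read from the environment instead. This is a minimal sketch: the helper functions and the variable names RAY_DASHBOARD_URL and RAY_AUTH_TOKEN are assumptions for illustration and are not part of the CodeFlare SDK.

```python
import os


def build_auth_headers(token: str) -> dict:
    """Return the HTTP headers that carry a bearer token."""
    if not token:
        raise ValueError("An authentication token is required")
    return {"Authorization": f"Bearer {token}"}


def load_connection_settings(env=os.environ) -> tuple:
    """Read the dashboard URL and token from a mapping such as os.environ.

    RAY_DASHBOARD_URL and RAY_AUTH_TOKEN are hypothetical variable names.
    """
    address = env.get("RAY_DASHBOARD_URL", "")
    token = env.get("RAY_AUTH_TOKEN", "")
    if not address:
        raise ValueError("RAY_DASHBOARD_URL is not set")
    return address, build_auth_headers(token)
```

The returned pair can then be passed to the client in the same way as before: RayJobClient(address=address, headers=headers, verify=True).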

    With the client successfully created and authentication set up, you are ready to define your batch inference workload, package it as a Ray Job, and submit it for distributed execution on the cluster.

    Building your remote batch inference job

    Now that you have an authenticated RayJobClient, you can define and submit your batch inference job. The entire logic for your job can be contained in a single Python script. We'll call ours batch_inference.py.

    The process involves 2 main stages:

    1. Configuring the job: Defining the model to use, where to get it, and performance settings like batch size.
    2. Submitting the job: Sending the script to the Ray cluster for remote execution.

    Let's walk through what batch_inference.py contains.

    Configuring the vLLM inference engine

    At the heart of our script is the vLLMEngineProcessorConfig. This object from Ray Data tells the cluster everything it needs to know about the model and how to run it.

    Sourcing your model

    You have several flexible options for specifying the model_source:

    • From the Hugging Face Hub (default): This is the simplest method. Provide the model's Hugging Face repository ID. If it's a private or gated model, you can provide your HF_TOKEN via the runtime_env, which we'll show in the submission step.

      # In batch_inference.py
      from ray.data.llm import vLLMEngineProcessorConfig

      processor_config = vLLMEngineProcessorConfig(
          model_source="unsloth/Llama-3.2-1B-Instruct",
          # ... (other settings elided)
      )
    • From cloud storage (Amazon S3, Google Cloud Storage): For production stability and to avoid Hugging Face rate limits, caching your model in your own cloud storage is a best practice. Ray provides a utility to do this easily: 

      python -m ray.llm.utils.upload_model --model-source <repository-id> --bucket-uri s3://your-bucket/llama3-etc

      You can then point model_source to the S3 URI. The documentation recommends using the runai_streamer format for optimized loading from cloud storage. You must also provide your cloud credentials via the runtime_env during job submission.

      # In batch_inference.py
      processor_config = vLLMEngineProcessorConfig(
          model_source="s3://your-bucket/llama3.2-1b/",
          engine_kwargs={"load_format": "runai_streamer"},
          # ... (other settings elided)
      )
    • From a local path on the cluster: If a model is already on the cluster's filesystem (e.g., a shared network volume), you can provide its absolute path.

      model_source="/mnt/shared_models/my-custom-model"
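
The three options differ only in the shape of the model_source string. The sketch below classifies a string and suggests the matching engine_kwargs, assuming that cloud URIs start with s3:// or gs:// and that local paths are absolute. describe_model_source is a hypothetical helper for illustration, not a Ray API.

```python
def describe_model_source(model_source: str) -> dict:
    """Classify a model_source string and suggest matching engine_kwargs."""
    if model_source.startswith(("s3://", "gs://")):
        # Cloud storage: the article pairs this with the runai_streamer load format.
        return {
            "kind": "cloud_storage",
            "engine_kwargs": {"load_format": "runai_streamer"},
        }
    if model_source.startswith("/"):
        # Absolute path on the cluster's filesystem.
        return {"kind": "local_path", "engine_kwargs": {}}
    # Anything else is treated as a Hugging Face repository ID.
    return {"kind": "hugging_face_hub", "engine_kwargs": {}}


hub = describe_model_source("unsloth/Llama-3.2-1B-Instruct")
cloud = describe_model_source("s3://your-bucket/llama3.2-1b/")
local = describe_model_source("/mnt/shared_models/my-custom-model")
```

A check like this can catch a mistyped source before the job is submitted, which is cheaper than discovering the problem in the remote job logs.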

    Once you’ve chosen how you want to source your model, you can move on to tuning it.

    Tuning for performance: Batching and concurrency

    You can tune several parameters to maximize the efficiency of your GPU resources.

    • concurrency: Sets the number of parallel vLLM workers. When each worker uses a single GPU, set this to the number of available GPUs on your Ray cluster’s workers to process multiple batches simultaneously. For example, if you have 4 workers with 1 GPU each, you can set this to 4.
    • batch_size: Tells Ray Data how many rows to group together and send to a worker at once. Larger batches can improve GPU utilization.
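
To make the two settings concrete, the sketch below groups 100 rows into batches of 32 and spreads them over 4 workers. It is an illustration only: iter_batches and assign_round_robin are hypothetical helpers, and Ray Data schedules batches dynamically rather than in strict round-robin order.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def assign_round_robin(batches, concurrency):
    """Spread batches across workers in turn (a simplification of real scheduling)."""
    workers = [[] for _ in range(concurrency)]
    for index, batch in enumerate(batches):
        workers[index % concurrency].append(batch)
    return workers


batches = list(iter_batches(range(100), batch_size=32))
batch_sizes = [len(batch) for batch in batches]  # three full batches and one partial batch
workers = assign_round_robin(batches, concurrency=4)
```

With these numbers each of the 4 workers receives exactly one batch, so a larger dataset or a smaller batch_size is needed before every worker stays busy for more than one round.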

    For very large models that don't fit on a single GPU, you can use model parallelism to shard the model across multiple devices. This is configured within engine_kwargs.

    # A more tuned configuration
    processor_config = vLLMEngineProcessorConfig(
        model_source="unsloth/Llama-3.1-8B-Instruct",
        batch_size=32,
        concurrency=4,  # Number of parallel vLLM workers
        engine_kwargs={
            "dtype": "half",
            "max_model_len": 4096,
            # Shard each worker's model across 4 GPUs.
            # With concurrency=4, this needs 4 x 4 = 16 GPUs in total.
            "tensor_parallel_size": 4,
        },
    )
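
A quick sizing check helps before submitting a configuration like this one. The sketch below assumes that each vLLM worker claims tensor_parallel_size × pipeline_parallel_size GPUs, so the total is that product multiplied by concurrency. Both functions are hypothetical helpers, not part of Ray or vLLM.

```python
def required_gpus(concurrency, tensor_parallel_size=1, pipeline_parallel_size=1):
    """Total GPUs needed, assuming each worker uses TP x PP GPUs."""
    return concurrency * tensor_parallel_size * pipeline_parallel_size


def max_concurrency(available_gpus, tensor_parallel_size=1, pipeline_parallel_size=1):
    """Largest number of workers that fits on the available GPUs."""
    gpus_per_worker = tensor_parallel_size * pipeline_parallel_size
    return available_gpus // gpus_per_worker


total_for_example = required_gpus(concurrency=4, tensor_parallel_size=4)
fits_on_four_gpus = max_concurrency(available_gpus=4, tensor_parallel_size=4)
```

Under this assumption, a cluster with only 4 GPUs can run either 4 single-GPU workers or 1 worker sharded across all 4 GPUs, but not both at once.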

    Defining the processing logic

    With the configuration defined, you use build_llm_processor to create the full inference pipeline. This involves defining preprocess and postprocess functions.

    • preprocess: This function takes a row from your input dataset and transforms it into the format the LLM expects. For most modern chat models, this is the OpenAI chat message format.
    • postprocess: This function takes the LLM's output and cleans it up, so you're left with just the data you need.

    # In batch_inference.py (continued)
    import ray
    from ray.data.llm import build_llm_processor

    processor = build_llm_processor(
        processor_config,
        preprocess=lambda row: dict(
            messages=[
                {
                    "role": "system",
                    "content": "You are a calculator. Please only output the answer of the given equation.",
                },
                {"role": "user", "content": f"{row['id']} ** 3 = ?"},
            ],
            sampling_params=dict(temperature=0.3, max_tokens=20),
        ),
        postprocess=lambda row: {
            "resp": row["generated_text"],
        },
    )
    # --- This last part executes the pipeline ---
    # 1. Create a dummy dataset of 100 rows
    ds = ray.data.range(100)
    # 2. Apply the processor (this is lazy and doesn't run yet)
    ds = processor(ds)
    # 3. Materialize the results (this triggers execution)
    ds = ds.materialize()
    # 4. Print all outputs
    for out in ds.take_all():
        print(out)
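
Because preprocess and postprocess are plain Python functions, you can check the shape of their inputs and outputs locally before using any GPU time. The sketch below rewrites the two lambdas as named functions and runs them against a stand-in for the model. fake_engine is a hypothetical stub that computes the cube itself; it is not how vLLM produces text.

```python
def preprocess(row: dict) -> dict:
    """Turn a dataset row into an OpenAI-style chat request."""
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are a calculator. Please only output the answer of the given equation.",
            },
            {"role": "user", "content": f"{row['id']} ** 3 = ?"},
        ],
        "sampling_params": {"temperature": 0.3, "max_tokens": 20},
    }


def postprocess(row: dict) -> dict:
    """Keep only the generated text from the model output."""
    return {"resp": row["generated_text"]}


def fake_engine(request: dict) -> dict:
    """Hypothetical stand-in for the LLM: parse the prompt and compute the cube."""
    question = request["messages"][-1]["content"]
    base = int(question.split(" ** ")[0])
    return {"generated_text": str(base ** 3)}


results = [postprocess(fake_engine(preprocess({"id": i}))) for i in range(3)]
```

A real model may add whitespace or extra words to its answer, so a production postprocess function would usually do more cleanup than this one.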

    Submitting and monitoring the job

    Now that the complete logic is in batch_inference.py, you can use the RayJobClient from your local notebook to submit it for remote execution.

    Here you define the entrypoint_command and the runtime_env. The runtime_env is critical for specifying dependencies (pip) and environment variables (like cloud credentials or HF_TOKEN).

    # Run this in the notebook where you created the client earlier
    entrypoint_command = "python batch_inference.py"
    # Submit the job using the previously created RayJobClient
    submission_id = client.submit_job(
        entrypoint=entrypoint_command,
        runtime_env={
            # Assumes batch_inference.py is in the current directory,
            # which is uploaded to the cluster with the job
            "working_dir": "./",
            "pip": ["vllm==0.9.1", "s3fs"],  # Add any needed libraries
            "env_vars": {
                "HF_TOKEN": "your-hugging-face-token",
                "AWS_ACCESS_KEY_ID": "your_access_key_id",
                "AWS_SECRET_ACCESS_KEY": "your_secret_access_key",
            }
        }
    )
    print(f"Job submitted with ID: {submission_id}")
    # You can then monitor the job
    print(client.get_job_status(submission_id))
    print("--- JOB LOGS ---")
    print(client.get_job_logs(submission_id))
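
A single status call only shows the state at that moment. To wait for the job to finish, you can poll until it reaches a terminal state. The sketch below assumes the client's get_job_status returns one of Ray's job states (PENDING, RUNNING, SUCCEEDED, FAILED, STOPPED), either as a string or as an enum with a value attribute. wait_for_job is a hypothetical helper, not part of the CodeFlare SDK.

```python
import time

# Job states after which the status no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, submission_id, poll_seconds=10.0, timeout_seconds=3600.0):
    """Poll the job status until it is terminal, then return its name."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_job_status(submission_id)
        # Accept both plain strings and enum members with a .value attribute.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {submission_id} still {name} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

In the notebook, final_state = wait_for_job(client, submission_id) blocks until the job ends, after which the logs can be printed once instead of being fetched repeatedly.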

    And with that, you have successfully orchestrated a batch inference job with vLLM and Ray Data via the CodeFlare SDK. By defining your model configuration and processing logic in a simple Python script, you can use the RayJobClient to send that workload to a remote Ray cluster. This workflow allows you to use the full power of distributed computing and modern inference engines like vLLM without ever leaving your Python environment.

    Conclusion

    Scaling machine learning inference from a single prediction to millions of records is a common but significant hurdle. As we've shown, this challenge doesn’t require a deep understanding of complex infrastructure management. By learning to connect a RayJobClient, define your logic in a Python script, and submit it for remote execution, you can use the CodeFlare SDK to bridge the gap between your local environment and a powerful, remote cluster.

    To learn more, check out the official documentation:

    • CodeFlare SDK: GitHub repository and examples
    • Ray Data: Official Ray Data documentation
    • vLLM: Official vLLM documentation
