Batch inference on OpenShift AI with llm-d: Architecture, integration, and workflows

July 2, 2026
Lior Aronovich, Raymond Zhao, Jooyeon Mok
Related topics: AI inference, Artificial intelligence, Kubernetes
Related products: Red Hat Connectivity Link, Red Hat OpenShift AI

    Most public LLM discussion centers on interactive inference—chatbots, coding assistants, agents—where latency is the headline, but production traffic is wider.

    In production, teams run heavy-duty tasks: model evaluations, dataset scoring, massive embeddings for retrieval-augmented generation (RAG), backfills when policies or model versions change. These workloads are deadline-driven, not latency-sensitive. That is what batch is for: work the platform can advance when accelerator utilization is low, keep in the background relative to real-time users, and throttle or defer when interactive demand spikes—while still finishing inside an agreed window.

    A job that finishes 50,000 prompts in four hours is equivalent to one that finishes in 20 minutes if both meet the same service-level objective (SLO). That slack creates a clear optimization opportunity: offline workloads can use spare GPU capacity, yield to interactive traffic during spikes, and align with pricing models such as differential batch rates.
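
    To make the arithmetic concrete, here is a minimal Python sketch that computes the average request rate each deadline implies. The function name required_rate is illustrative and is not part of llm-d.

```python
def required_rate(num_requests: int, window_seconds: float) -> float:
    """Average requests per second needed to finish inside the window."""
    return num_requests / window_seconds


# The two deadlines from the paragraph above, for the same 50,000 prompts.
four_hours = required_rate(50_000, 4 * 3600)
twenty_minutes = required_rate(50_000, 20 * 60)

print(f"4 h window:    {four_hours:.2f} req/s")
print(f"20 min window: {twenty_minutes:.2f} req/s")
```

    The four-hour window needs about one twelfth of the sustained throughput, which is the headroom a scheduler can hand back to interactive traffic.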

    Without a dedicated batch layer, operators often script loops against a chat endpoint, which scales poorly and leaves cancel, recovery, and alignment with cluster security and rate limits largely ad hoc.

    Introducing the llm-d batch gateway

    The llm-d batch gateway is a Kubernetes-native batch inference service that plugs into the same llm-d inference stack managed by Red Hat OpenShift AI. It addresses that gap with OpenAI-compatible /v1/batches and /v1/files, so existing clients can target your cluster without a parallel batch stack. Batch work follows the same OpenShift AI path as online traffic, including existing LLMInferenceService backends, with Red Hat Connectivity Link at the gateway for authentication, policy enforcement, and rate control.

    In this article, we look at the architecture of the llm-d batch gateway, its integration with OpenShift AI and Connectivity Link, and the end-to-end workflow from job submission to result retrieval.

    Background information

    To understand how batch workloads operate efficiently, it helps to look at the underlying platform architecture and the limitations of traditional setups.

    llm-d and Red Hat OpenShift AI

    llm-d is an open source, Kubernetes-native platform for LLM inference featuring intelligent request routing, KV cache-aware scheduling, and disaggregated prefill/decode. Combined with Red Hat OpenShift AI, it powers production-grade online inference through the LLMInferenceService custom resource—which automatically provisions model servers, an Endpoint Picker (EPP) for prefix-aware routing, and Gateway API HTTPRoute resources.

    The batch inference gap

    While llm-d excels at interactive inference, production environments also need a way to handle high-volume offline workloads efficiently. Without a dedicated batch layer, teams typically resort to scripting loops of individual API calls—losing job-level lifecycle management, fault tolerance, and the ability to control how batch traffic interacts with interactive traffic on shared GPU resources.

    The OpenAI Batch API standard

    The OpenAI Batch API has established a widely adopted contract for batch inference: upload a JSONL input file, submit a batch job, poll for progress, and download results. The llm-d batch gateway implements the OpenAI-compatible contract (/v1/batches, /v1/files), so clients built for that contract can target a private gateway with the same paths and payloads—typically after changing only base URL, TLS, and authentication settings.

    Core components

    The platform relies on a distributed architecture where components share hardware resources but remain independent at the control plane.

    Architecture overview

    The batch gateway coexists with OpenShift AI's interactive inference stack on the same OpenShift cluster, sharing the Gateway API ingress while maintaining full separation of concerns (Figure 1).

    Client requests flow through an OpenShift Gateway ingress, splitting into separate batch API and LLM inference service routing paths.
    Figure 1: Batch gateway and LLMInferenceService share a single OpenShift gateway, with separate HTTPRoutes and independent security policies.

    There are two traffic paths through the gateway:

    • Interactive path: OpenShift AI automatically creates an HTTPRoute for each LLMInferenceService. Requests flow to the llm-d Endpoint Picker (EPP), which applies prefix-aware routing to maximize KV cache hit rates before forwarding to vLLM worker replicas.
    • Batch path: A dedicated HTTPRoute routes /v1/batches and /v1/files to the batch gateway API server. The processor then asynchronously dispatches individual inference requests to the LLMInferenceService backend—the same model pods that serve interactive traffic.

    Both paths share the same gateway and the same model pods, but each has its own security policies (AuthPolicy, RateLimitPolicy) and independent scaling. This means batch processing benefits from the same llm-d optimizations—KV cache-aware routing, disaggregated prefill/decode—that interactive workloads enjoy.

    Batch gateway internals

    The batch gateway itself consists of three distinct control components that manage the data lifecycle.

    | Component | Role |
    | --- | --- |
    | API server | OpenAI-compatible REST API for /v1/batches and /v1/files. Accepts job submissions, tracks progress, and serves results. |
    | Processor | Pulls jobs from a priority queue, builds per-model execution plans, dispatches requests to the inference backend within configured concurrency limits, and writes successful or failed lines to output or error files for final upload. |
    | Garbage collector | Periodically removes expired files and cleans up completed batch resources. |

    The data layer uses PostgreSQL for job and file metadata, Redis for the priority queue and event channels (cancellation and status updates), and either a file system or an Amazon S3 bucket for input and output file storage.

    Integration with OpenShift AI

    The integration between the batch gateway and OpenShift AI is straightforward. The processor uses a model-to-URL mapping configuration, defined in modelGateways, that points each model name to the internal URL of the corresponding LLMInferenceService. When the processor dispatches a batch request, it sends a standard OpenAI-format inference request to that URL—matching the exact format used by interactive clients.

    Because the batch gateway treats the LLMInferenceService as an OpenAI-compatible endpoint, it operates without needing detailed knowledge of vLLM, KV cache, or EPP internals. Adding a new model requires only a minor configuration change in the modelGateways parameter to map the new model name to its destination URL. No application code changes are needed, though you must restart the processor to apply the updated mapping.

    Additionally, the pipeline forwards the caller's Authorization header directly to the inference backend using the passThroughHeaders parameter. This configuration allows the same Kubernetes role-based access control (RBAC) policies to apply to both interactive and batch requests.
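
    The lookup and header forwarding described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the service URL, and the function names are hypothetical and do not reflect the batch gateway's actual configuration schema or code.

```python
from typing import Mapping

# Hypothetical in-memory view of the modelGateways mapping (model name -> URL).
MODEL_GATEWAYS = {
    "facebook/opt-125m": "http://opt-125m.example.svc.cluster.local/v1",
}

# Hypothetical in-memory view of passThroughHeaders.
PASS_THROUGH_HEADERS = ("Authorization",)


def resolve_backend(model: str, gateways: Mapping[str, str]) -> str:
    """Return the backend URL configured for a model, or fail loudly."""
    try:
        return gateways[model]
    except KeyError:
        raise LookupError(f"no gateway configured for model {model!r}") from None


def forward_headers(incoming: Mapping[str, str], allowed=PASS_THROUGH_HEADERS) -> dict:
    """Keep only the caller headers that are allowed to reach the backend."""
    allowed_lower = {name.lower() for name in allowed}
    return {k: v for k, v in incoming.items() if k.lower() in allowed_lower}


caller = {"Authorization": "Bearer example-token", "Cookie": "session=abc"}
print(resolve_backend("facebook/opt-125m", MODEL_GATEWAYS))
print(forward_headers(caller))
```

    Matching header names case-insensitively is a design choice in this sketch, made because HTTP header names are case-insensitive.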

    Workflows and processes

    Developers can interact with the batch ecosystem using standard tools and predictable execution paths.

    User workflow

    From the user's perspective, the batch gateway follows the same four-step workflow as the OpenAI Batch API:

    1. Upload an input file. The input file is a JSONL file where each line represents one inference request:

      {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "facebook/opt-125m", "messages": [{"role": "user", "content": "What is Kubernetes?"}]}}
      {"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "facebook/opt-125m", "messages": [{"role": "user", "content": "Explain KV cache."}]}}

      Upload it via the Files API:

      curl -X POST https://<gateway>/v1/files \
        -H "Authorization: Bearer ${TOKEN}" \
        -F "purpose=batch" -F "file=@input.jsonl"
    2. Create a batch job. Submit the file ID to create a batch:

      curl -X POST https://<gateway>/v1/batches \
        -H "Authorization: Bearer ${TOKEN}" \
        -H "Content-Type: application/json" \
        -d '{"input_file_id": "file-abc123", "endpoint": "/v1/chat/completions", "completion_window": "24h"}'
    3. Poll for status. The job progresses through states: validating → in_progress → completed (or failed, expired, cancelled). Clients can poll or cancel at any point:

      # Check status
      curl https://<gateway>/v1/batches/batch-xyz789 -H "Authorization: Bearer ${TOKEN}"
      
      # Cancel if needed
      curl -X POST https://<gateway>/v1/batches/batch-xyz789/cancel -H "Authorization: Bearer ${TOKEN}"
    4. Download results. When the job completes (or ends with partial success), retrieve the output file and, if the job produces one, the error file—both use the JSONL format.

      The output file has one line per successful request, typically containing the full inference response and token usage. The error file has one line per failed or undispatched request, including error details and the corresponding custom_id so you can reconcile against your input.

      Example output line:

      {"id": "batch_req_...", "custom_id": "req-1", "response": {"status_code": 200, "body": {"choices": [{"message": {"content": "..."}}], "usage": {"prompt_tokens": 12, "completion_tokens": 64, "total_tokens": 76}}}}

      Example error line (the exact fields depend on the failure type):

      {"id": "batch_req_...", "custom_id": "req-2", "error": {"code": "server_error", "message": "..."}}

      If your integration already targets OpenAI's Batch and Files APIs (/v1/batches, /v1/files), you can aim the same client at your batch gateway instead. Because request shapes and paths match, this transition requires only configuration modifications—such as updating the base URL, TLS trust, and credentials—rather than a rewrite of your job workflow.
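
    The client side of this workflow needs little more than JSON handling. The following sketch, which uses only the Python standard library, builds an input file in the format shown in step 1, recognizes the terminal states listed in step 3, and reconciles output and error lines against the input by custom_id. The helper names are illustrative; the HTTP calls themselves are left to the curl commands above.

```python
import json

# States after which polling can stop, taken from the workflow described above.
TERMINAL_STATES = frozenset({"completed", "failed", "expired", "cancelled"})


def build_input_jsonl(model, prompts, endpoint="/v1/chat/completions"):
    """Serialize prompts into Batch API input lines, one JSON object per line."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": endpoint,
            "body": {"model": model, "messages": [{"role": "user", "content": prompt}]},
        }
        lines.append(json.dumps(request))
    return "\n".join(lines) + "\n"


def is_terminal(status):
    """Return True once a batch has reached a state that will not change."""
    return status in TERMINAL_STATES


def reconcile(input_jsonl, output_jsonl, error_jsonl=""):
    """Map every input custom_id to 'ok', 'error', or 'missing'."""
    outcome = {}
    for line in input_jsonl.splitlines():
        if line.strip():
            outcome[json.loads(line)["custom_id"]] = "missing"
    for text, label in ((output_jsonl, "ok"), (error_jsonl, "error")):
        for line in text.splitlines():
            if line.strip():
                outcome[json.loads(line)["custom_id"]] = label
    return outcome
```

    A "missing" entry after a terminal state means a request appears in neither file, which is worth flagging before you treat a job as fully processed.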

    Internal processing flow

    Behind the scenes, the processor manages the execution steps required to turn a single batch job into thousands of individual inference requests.

    Batch Job Submitted
            │
            ▼
    ┌──────────────────┐
    │  API Server      │  Validates input file, stores file,
    │                  │  creates metadata in PostgreSQL,
    │                  │  enqueues job in Redis priority queue.
    └────────┬─────────┘
             │
             ▼
    ┌──────────────────┐
    │  Processor       │  Dequeues job, downloads/reads input.jsonl,
    │  (Ingestion)     │  builds per-model plans (offset, length,
    │                  │  PrefixHash); sorts by PrefixHash per model.
    └────────┬─────────┘
             │
             ▼
    ┌──────────────────┐
    │  Processor       │  Executes the plan: for each model, walks
    │  (Execution)     │  sorted entries, calls the same
    │                  │  LLMInferenceService endpoints as interactive
    │                  │  clients, within configured concurrency limits.
    └────────┬─────────┘
             │
             ▼
    ┌──────────────────┐
    │  Result Upload   │  Streams results to output file
    │                  │  or error file as appropriate.
    │                  │  Updates job status and request counts
    │                  │  in real time.
    └──────────────────┘

    Key design decisions in the processing flow

    To establish a stable platform architecture, the processing system implements several critical engineering choices at the control and data layers.

    Bounded memory via disk-backed plans

    Jobs are bounded by configurable limits (by default, 50,000 lines and a 200 MB input file). During ingestion, the processor records each request's offset and length in a plan instead of holding request bodies in memory, sorts the plan entries by PrefixHash, and later reads each request back from the input file with random access via ReadAt. As a result, the memory footprint stays small even for the largest permitted inputs.
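
    The offset-and-length technique is easy to demonstrate outside the processor. The sketch below is a Python analogy, not the gateway's implementation: it indexes a JSONL file by byte position while streaming it once, then reads a single line back with a seek, which plays the role that ReadAt plays in the description above. The function names are hypothetical.

```python
import os
import tempfile


def index_lines(path):
    """Return (offset, length) for each non-empty line without keeping line bodies."""
    index = []
    offset = 0
    with open(path, "rb") as f:
        for raw in f:  # streams one line at a time
            if raw.strip():
                index.append((offset, len(raw)))
            offset += len(raw)
    return index


def read_at(path, offset, length):
    """Read exactly one indexed line back from disk by position."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
        tmp.write(b'{"custom_id": "req-1"}\n{"custom_id": "req-2"}\n')
    try:
        entries = index_lines(tmp.name)
        print(read_at(tmp.name, *entries[1]))
    finally:
        os.remove(tmp.name)
```

    Only the index lives in memory, so its size grows with the number of lines and not with the size of the request bodies.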

    Model isolation and prefix hash ordering for downstream cache

    One goroutine per model drains that model's plan. Within a model, entries are sorted by PrefixHash, so requests that share a prefix, such as the same system prompt, run back-to-back. The ordering is meant to improve prefix and KV cache reuse on the inference backend, where interleaving unrelated lines from the JSONL file would work against it.
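
    As a rough illustration of the grouping, the sketch below builds one plan per model and sorts each plan by a prefix hash. How the batch gateway actually computes PrefixHash is not described in this article; hashing the first message's content with SHA-256 is an assumption made purely for the example, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def prefix_hash(body):
    """Assumed hash: SHA-256 of the first message's content (often the system prompt)."""
    messages = body.get("messages") or [{}]
    first = messages[0].get("content", "")
    return hashlib.sha256(first.encode("utf-8")).hexdigest()


def plan_by_model(requests):
    """Group requests per model, then sort each group so equal prefixes are adjacent."""
    plans = defaultdict(list)
    for request in requests:
        body = request["body"]
        plans[body["model"]].append((prefix_hash(body), request["custom_id"]))
    for entries in plans.values():
        entries.sort()  # by hash first, then custom_id as a stable tiebreaker
    return dict(plans)
```

    Sorting does not change which requests run, only their order, so requests with the same leading content reach the backend close together in time.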

    Dual writers for output and error logs

    The engine routes results through a dual-writer interface: successful completions are appended to output.jsonl, while inference failures and undispatched lines (such as those remaining after SLO expiry, user cancellation, or certain errors) are drained into error.jsonl as structured error entries. During finalization, the processor uploads each non-empty file and records both file IDs on the batch job.
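
    A dual writer can be sketched as a small class with one method per destination. The record shapes follow the output and error examples shown earlier in this article; the class name, method names, and the error code used in the test are hypothetical.

```python
import io
import json


class DualWriter:
    """Route per-request results to an output stream or an error stream."""

    def __init__(self, output, errors):
        self.output = output
        self.errors = errors
        self.completed = 0
        self.failed = 0

    def write_success(self, custom_id, status_code, body):
        record = {"custom_id": custom_id, "response": {"status_code": status_code, "body": body}}
        self.output.write(json.dumps(record) + "\n")
        self.completed += 1

    def write_error(self, custom_id, code, message):
        record = {"custom_id": custom_id, "error": {"code": code, "message": message}}
        self.errors.write(json.dumps(record) + "\n")
        self.failed += 1

    def drain_undispatched(self, custom_ids, code, message):
        """Record every request that never ran, for example after cancellation."""
        for custom_id in custom_ids:
            self.write_error(custom_id, code, message)
```

    Keeping counters next to the writes is one simple way to keep request counts in step with what has actually been written.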

    Tunable two-level concurrency controls

    Tunable two-level concurrency controls maintain fairness under load. A global cap limits total in-flight inference for the processor, while per-model caps stop one model inside a multi-model job from crowding out the rest on shared backends.

    Under the hood, the engine acquires per-model capacity before the global slot, so a goroutine that is still waiting on its own model's capacity never holds a global slot and therefore cannot starve other models. Operators rarely need to configure this detail, but it keeps scheduling balanced under heavy load.
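
    The acquisition order can be illustrated with nested semaphores. The following Python asyncio sketch is an analogy for the goroutine-based design, not the processor's code: each call takes its model's slot first and the global slot second. The class name, the limits, and the fake inference call are all hypothetical.

```python
import asyncio


class TwoLevelLimiter:
    """Per-model semaphores nested inside one global semaphore."""

    def __init__(self, global_limit, per_model_limit):
        self._global = asyncio.Semaphore(global_limit)
        self._per_model_limit = per_model_limit
        self._per_model = {}

    def _model_semaphore(self, model):
        if model not in self._per_model:
            self._per_model[model] = asyncio.Semaphore(self._per_model_limit)
        return self._per_model[model]

    async def run(self, model, call):
        # Per-model slot first: a task blocked here holds no global slot.
        async with self._model_semaphore(model):
            async with self._global:
                return await call()


async def demo():
    limiter = TwoLevelLimiter(global_limit=3, per_model_limit=2)
    now = {"total": 0}
    peak = {"total": 0}

    async def fake_inference(model):
        now["total"] += 1
        now[model] = now.get(model, 0) + 1
        peak["total"] = max(peak["total"], now["total"])
        peak[model] = max(peak.get(model, 0), now[model])
        await asyncio.sleep(0.01)  # stand-in for the HTTP call
        now["total"] -= 1
        now[model] -= 1

    models = ["a"] * 5 + ["b"] * 5
    await asyncio.gather(*(limiter.run(m, lambda m=m: fake_inference(m)) for m in models))
    return peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

    With the order reversed, tasks for a saturated model could occupy every global slot while they wait, which is the starvation the paragraph above describes.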

    Separate concerns for shutdowns, deadlines, and cancellations

    The system separates its operational concerns into three distinct routines governing shutdowns, deadlines, and active cancellations. Graceful shutdown ensures that in-flight batches are not abruptly terminated when the system receives a SIGTERM or similar signal.

    Deadline handling enforces the completion_window parameter so jobs are processed within the promised timeframe.

    Finally, cancellation provides a dedicated mechanism for users to manually stop batches. Keeping the three paths separate lets the system record distinct reasons, such as user cancellation, SLO expiry, or system shutdown, instead of grouping them as generic failures.
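
    One way to picture the separation is a function that reports why dispatch should stop. The sketch below is an illustration only: the enum values, the hours-only parser for completion_window, and the precedence between reasons are assumptions, not the batch gateway's behavior.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class StopReason(Enum):
    USER_CANCELLED = "user_cancelled"
    DEADLINE_EXPIRED = "deadline_expired"
    SHUTDOWN = "shutdown"


def parse_window(window):
    """Parse a completion_window such as '24h' into a timedelta (hours only)."""
    if not window.endswith("h") or not window[:-1].isdigit():
        raise ValueError(f"unsupported completion_window: {window!r}")
    return timedelta(hours=int(window[:-1]))


def stop_reason(created_at, window, now, cancel_requested=False, shutting_down=False):
    """Return why dispatch should stop, or None to keep going.

    The precedence (cancel, then deadline, then shutdown) is an assumption.
    """
    if cancel_requested:
        return StopReason.USER_CANCELLED
    if now >= created_at + parse_window(window):
        return StopReason.DEADLINE_EXPIRED
    if shutting_down:
        return StopReason.SHUTDOWN
    return None


created = datetime(2026, 7, 2, 0, 0, tzinfo=timezone.utc)
print(stop_reason(created, "24h", created + timedelta(hours=25)))
```

    Returning a specific reason instead of a boolean is what allows each stop path to be reported separately later.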

    Startup recovery without checkpoint resumes

    The platform performs startup recovery sequences without relying on checkpoint resumes. On container startup, the processor inspects leftover job directories from crashes caused by out-of-memory (OOM) errors, panics, or evictions. It correlates these findings with the PostgreSQL state and acts according to the job status. 

    For example, it will re-enqueue from scratch when there is no output yet, or upload partial output and mark a terminal failure when work should not be discarded. Checkpointed resume mid-plan is not used, as final outcomes remain fully observable via metrics.
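
    The recovery decision can be summarized as a small function over the recorded job status and whether partial output exists on disk. This is a simplified sketch of the two cases named above; the status strings reuse the batch states from the user workflow, and the CLEAN_UP branch for jobs that are already terminal is an assumption added to make the function total.

```python
from enum import Enum

TERMINAL_STATES = frozenset({"completed", "failed", "expired", "cancelled"})


class RecoveryAction(Enum):
    REENQUEUE = "re-enqueue from scratch"
    UPLOAD_PARTIAL_AND_FAIL = "upload partial output and mark terminal failure"
    CLEAN_UP = "remove leftover directory"  # assumed branch, not from the article


def decide_recovery(db_status, has_partial_output):
    """Pick an action for a job directory left behind by a crashed processor."""
    if db_status in TERMINAL_STATES:
        return RecoveryAction.CLEAN_UP
    if has_partial_output:
        return RecoveryAction.UPLOAD_PARTIAL_AND_FAIL
    return RecoveryAction.REENQUEUE


print(decide_recovery("in_progress", has_partial_output=False).value)
print(decide_recovery("in_progress", has_partial_output=True).value)
```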

    Security model

    The integration between the batch gateway, OpenShift AI, and Red Hat Connectivity Link creates a layered security model that separates authentication, authorization, and rate limiting across different components.

    Authentication

    All traffic through the gateway—both interactive and batch—is authenticated via Kubernetes TokenReview. Red Hat Connectivity Link's Authorino component validates that every request carries a valid ServiceAccount token. Unauthenticated requests receive an HTTP 401 response.

    Authorization

    The architecture supports two distinct authorization paths using a single identity. For interactive routes, OpenShift AI creates a gateway-level AuthPolicy that performs a Kubernetes SubjectAccessReview. Under this setup, only service accounts with get permissions on the specific LLMInferenceService resource can reach that model, maintaining fine-grained, Kubernetes-native RBAC.

    Conversely, for batch routes, the batch AuthPolicy validates the token at the gateway layer but defers model-level permission checks downstream. When the processor dispatches individual inference requests to the LLMInferenceService, it passes the caller's original Authorization header through to the backend, where the same access control check applies.

    Rate limiting

    Red Hat Connectivity Link applies two distinct types of rate limiting across different routes to manage cluster resources. The first is token-level rate limiting for inference endpoints. Unlike standard request-count limits, a TokenRateLimitPolicy meters actual token use. For example, a single chat completion that generates 2,000 tokens is metered differently from one that generates 20 tokens, accurately reflecting the underlying accelerator cost. The gateway applies this policy per user.

    The second type is request-count rate limiting for batch endpoints. A RateLimitPolicy on the batch HTTPRoute limits how many API calls each user can make per time window, preventing the abuse of the management API.
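
    The difference between the two metering styles comes down to what each request costs. The sketch below uses a simple fixed-window counter to show it: a request-count limit charges 1 per call, while a token limit charges the tokens used. It is a conceptual illustration, not how Connectivity Link implements either policy, and the limits shown are hypothetical.

```python
class FixedWindowLimiter:
    """Per-user budget that resets at the start of every fixed window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._usage = {}  # user -> (window index, units consumed)

    def charge(self, user, cost, now):
        """Consume `cost` units at time `now`; return False if over budget."""
        window = int(now // self.window_seconds)
        seen_window, used = self._usage.get(user, (window, 0))
        if seen_window != window:
            used = 0  # a new window started, so the budget resets
        if used + cost > self.limit:
            self._usage[user] = (window, used)
            return False
        self._usage[user] = (window, used + cost)
        return True


# Hypothetical budgets: 10 batch API calls or 2,500 tokens per user per minute.
request_limiter = FixedWindowLimiter(limit=10, window_seconds=60)
token_limiter = FixedWindowLimiter(limit=2500, window_seconds=60)

print(request_limiter.charge("alice", cost=1, now=0))     # one API call
print(token_limiter.charge("alice", cost=2000, now=0))    # a 2,000-token completion
print(token_limiter.charge("alice", cost=2000, now=1))    # a second one in the same window
```

    Under the request-count limiter both completions would cost the same, which is why token metering tracks accelerator cost more closely.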

    | Layer | Mechanism | Scope |
    | --- | --- | --- |
    | Authentication | Kubernetes TokenReview (Authorino) | All routes |
    | Inference authorization | SubjectAccessReview on LLMInferenceService | Inference routes only |
    | Batch authorization | Token passthrough to inference backend | Enforced at dispatch time |
    | Token rate limiting | TokenRateLimitPolicy | Inference endpoints |
    | Request rate limiting | RateLimitPolicy | Batch endpoints |

    This design means organizations don't need a separate identity system for LLM access. Existing Kubernetes RBAC policies—managed through GitOps, Open Policy Agent (OPA), or Kyverno—extend naturally to cover both interactive and batch inference.

    Best practices

    When deploying the batch gateway alongside OpenShift AI, keep these practices in mind.

    Tune concurrency limits to your GPU capacity

    The processor's global and per-model concurrency semaphores control how aggressively batch requests compete with interactive traffic. Start conservatively and increase as you understand your cluster's headroom.

    Always enable token passthrough

    Configure passThroughHeaders: {Authorization} so the processor forwards inference requests on behalf of the original user. This makes sure that the LLM route's AuthPolicy can enforce model-level authorization consistently across batch and interactive workloads.

    Use per-model configuration for varied deployments

    Each model in modelGateways can have its own timeout, retry count, and backoff settings. Larger models serving longer prompts typically need longer timeouts and more generous retry budgets.
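
    To show how these per-model settings interact, here is a minimal sketch of a per-model record with an exponential backoff schedule. The field names, default values, URLs, and the doubling strategy are assumptions for illustration; consult the Helm chart for the real modelGateways keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelGateway:
    """Hypothetical per-model settings; field names are not the chart's keys."""

    url: str
    timeout_seconds: float = 60.0
    max_retries: int = 3
    backoff_base_seconds: float = 0.5

    def backoff_delays(self):
        """Exponential delays to wait before each retry attempt."""
        return [self.backoff_base_seconds * (2 ** attempt) for attempt in range(self.max_retries)]


small = ModelGateway(url="http://small-model.example/v1")
large = ModelGateway(url="http://large-model.example/v1", timeout_seconds=300.0, max_retries=5)

print(small.backoff_delays())
print(sum(large.backoff_delays()))
```

    Summing the delays gives the worst-case time spent waiting between attempts, which is useful when checking that retries still fit inside the completion window.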

    Choose the right file storage backend

    File system storage (PVC) works well for single-cluster deployments, while S3-compatible storage is better for multi-cluster or high-volume scenarios where you need durable, shared access to input and output files.

    Apply token-level rate limiting for inference

    Request-count rate limits don't reflect actual GPU cost. Use TokenRateLimitPolicy on inference endpoints to meter real token consumption—this prevents a single user with large prompts from consuming disproportionate resources.

    Common challenges and solutions

    Running high-volume offline workloads alongside real-time systems can sometimes introduce configuration bottlenecks. Review these common production issues and their resolutions to keep execution pipelines balanced.

    Batch traffic starving interactive requests

    Without concurrency limits, a large batch job could flood the model server. The processor's two-level semaphore design (global and per-model) prevents this. If you observe interactive latency degradation during batch processing, reduce the per-model concurrency limit.

    Authorization failures during batch processing

    If batch jobs fail with 403 errors, verify that the submitting user's ServiceAccount has get permission on the target LLMInferenceService resource. The batch route intentionally defers authorization to the LLM route—so RBAC must be configured for the model rather than the batch endpoint.

    Processor crash mid-job

    Recovery is supported, but the processor does not resume a job from an arbitrary mid-plan checkpoint. On restart, startup recovery reconciles local job directories with PostgreSQL: it re-enqueues the job from scratch when there is no output yet, or uploads partial output and marks a terminal state when completed work should not be discarded. Final outcomes remain observable through metrics.

    Model URL changes after redeployment

    The processor resolves model URLs from modelGateways at job dispatch time. If you redeploy an LLMInferenceService, update the modelGateways URL in the Helm values and restart the processor to pick up the new endpoint.

    Use cases and examples

    The batch gateway processes workloads where throughput matters more than latency. Here are common operational scenarios:

    • Model evaluation and benchmarking. Teams can run thousands of evaluation prompts against a new model version to collect structured results for comparison. This workflow runs without affecting the interactive inference services that serve the same model.
    • Dataset scoring and labeling. This platform allows you to score or classify large datasets using LLMs to support training data preparation, content moderation pipelines, or quality assurance workflows.
    • Embedding generation at scale. Engineers can generate embeddings for document corpora, search indexes, or recommendation systems as a batch operation. Running these resource-intensive tasks during off-peak windows takes advantage of available GPU capacity.
    • Differential pricing. Because batch and interactive requests flow through separate HTTPRoute resources with distinct policies, organizations can offer batch inference at a lower cost-per-token. This strategy reflects a lower infrastructure cost achieved through flexible scheduling and a higher tolerance for latency.

    Wrap up

    The llm-d project improves interactive inference by increasing speed and efficiency through KV cache-aware routing, disaggregated prefill and decode models, and multi-turn workload optimization. The batch gateway extends these capabilities to the other half of the inference workload spectrum: offline jobs that do not require results in milliseconds, but need them reliably before a deadline.

    By integrating with Red Hat OpenShift AI and Red Hat Connectivity Link, the batch gateway turns a single OpenShift cluster into a unified inference platform that serves both workload classes. This architecture delivers security-focused, Kubernetes-native management alongside token-level rate limiting on inference traffic, request-count limits on batch management APIs, and concurrency controls that prevent batch workloads from overwhelming the model servers serving live users.

    Get started with the batch gateway

    Ready to try it yourself? Once you spin up your Red Hat OpenShift AI environment, you can follow the deployment guide on OpenShift AI to deploy the batch gateway alongside your existing llm-d inference stack. The guide walks you through installing all prerequisites, deploying the gateway components, and configuring security policies step by step.

    Learn more:

    • llm-d batch gateway repository
    • llm-d project
    • Red Hat OpenShift AI documentation
    • Red Hat Connectivity Link documentation
    • Batch gateway Helm chart
    • OpenAI Batch API reference
