Scale document ingestion with Docling and Ray on OpenShift AI

June 30, 2026
Ana Biazetti Cathal O'Connor Saad Zaher
Related topics:
Artificial intelligence, Data science, Kubernetes
Related products:
Red Hat OpenShift AI

    Many AI projects stall before they begin—before any model is deployed, before the RAG application is running, or before the first chatbot is started. This failure to launch can happen for many reasons, but one of the primary causes is data. Common culprits include a lack of quality data, inaccessible data, and poorly structured or formatted data. Developers spend the bulk of their time on data preparation, and the challenge compounds with "dark data," such as legacy PDFs with complex tables, multi-column layouts, and embedded figures that standard extraction tools mangle or lose. Parsing a few files on a laptop is fine, but scaling to 10,000+ documents without distributed processing can take days.

    We built a guided example that combines Docling for structure-aware parsing with Ray Data for distributed streaming execution, running on Red Hat OpenShift AI. This post covers the architecture, key design decisions, and ways to extend the example for your own batch processing workflow. For the business reasons behind this solution, see our blog post on breaking the RAG bottleneck.

    The stack: High-fidelity parsing at scale

    A production ingestion process needs to pair parsing quality with distributed compute. Red Hat OpenShift AI provides the foundation.

    Docling

    This open source document understanding library and Cloud Native Computing Foundation project uses layout analysis models to recognize tables, code blocks, formulas, and multi-column layouts, producing structured Markdown or JSON.

    The quality comes at a cost. Docling loads about 1 GB of machine learning (ML) models at startup and can take 5 to 20 seconds per PDF. Processing 10,000 documents sequentially can take 14 to 55 hours.
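
The sequential estimate follows directly from the per-file range. A quick check, assuming a flat per-PDF cost and ignoring the one-time model load (the function name is only illustrative):

```python
def sequential_hours(num_docs: int, seconds_per_doc: float) -> float:
    """Wall-clock hours to process num_docs one after another."""
    return num_docs * seconds_per_doc / 3600


fast = sequential_hours(10_000, 5)   # best case: 5 s per PDF
slow = sequential_hours(10_000, 20)  # worst case: 20 s per PDF
print(f"{fast:.1f} to {slow:.1f} hours")  # prints "13.9 to 55.6 hours"
```

That lands on roughly the 14 to 55 hours quoted above, which is the baseline that distributed processing has to beat.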

    Ray Data

    This data processing engine and Ray ecosystem component uses a streaming execution model to overlap read, process, and write stages. The framework processes initial documents while subsequent files are still being read. Ray Data's actor pools amortize Docling's model loading cost: each actor initializes the models once and then processes many documents.

    KubeRay

    This Kubernetes operator and cluster lifecycle management tool uses automated orchestration to govern Ray cluster deployments and autoscaling on OpenShift, handling RayCluster and RayJob custom resources.

    CodeFlare SDK

    This Python API and cluster management framework uses simplified abstraction layers to convert Kubernetes YAML configurations into Python objects, so you can define cluster specifications, submit jobs, and monitor progress directly from a notebook environment.

    Prerequisites

    Before running the example, you need the following in place:

    • Red Hat OpenShift AI with the KubeRay operator installed. The operator manages RayCluster and RayJob custom resources that the example depends on.
    • A custom runtime image containing Ray and Docling. The standard OpenShift AI workbench images do not include Docling, so every Ray worker node must run a purpose-built image. A ready-to-build Dockerfile is available in the distributed-workloads repository. It starts from the OpenShift AI Ray CPU base image (quay.io/modh/ray:2.52.1-py312-cpu) and adds Docling 2.74, pandas, pyarrow, and S3 libraries. Build and push this image to a registry accessible from your cluster, then reference it in the notebook's ClusterConfiguration.
    • A ReadWriteMany (RWX) PVC mounted on all Ray nodes. Input PDFs are read from this shared volume. The PVC must use a storage class that supports RWX access mode, such as NFS or CephFS.
    • A workbench running Minimal Python 3.12 with codeflare-sdk installed. The workbench only submits the RayJob and monitors progress while all heavy processing runs on the RayCluster. No GPU is needed on the workbench.

    Architecture

    The pipeline needs shared storage that every worker can reach: either a persistent volume claim (PVC) or S3 or other object storage. In this example, all Ray workers read from and write to the same ReadWriteMany PVC. By wrapping Docling in a Ray Data map_batches operation, we transform a CPU-bound, single-threaded parsing library into a distributed, multi-node ingestion engine, as shown in Figure 1.

    KubeRay operator managing a RayCluster on OpenShift AI. A head node coordinates worker pods to process PDFs from a shared PVC into Markdown and JSON.
    Figure 1: Architecture of distributed document ingestion using Docling and Ray on Red Hat OpenShift AI.

    Three-stage processing

    The processing script (ray_data_process.py) implements a streaming three-stage process, illustrated in Figure 2:

    1. Read: Ray Data creates a dataset of file paths on the PVC (not file contents, which keeps memory low regardless of corpus size). The dataset is repartitioned into fine-grained blocks for even distribution across actors.
    2. Process: DoclingProcessor actors receive batches of file paths, load Docling's models once, and process many documents. The ActorPoolStrategy scales from a warm pool (min_size) up to max_size under load. For each PDF, the actor converts via Docling, writes Markdown and JSON to the PVC, and returns per-file metrics.
    3. Report: Results stream back as processing completes. The driver aggregates throughput, error rates, actor distribution, and error details into a performance report.
    Three-stage pipeline showing Read, Process, and Report steps, with a dashed container indicating that Ray Data streaming allows all stages to overlap.
    Figure 2: Three-stage streaming execution model in Ray Data.

    Because Ray Data streams execution, all three stages overlap. Wall-clock time is dominated by the processing stage, not the sum of all three.
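
The "load once, process many" shape of the Process stage can be sketched in plain Python. This is a stand-in, not the DoclingProcessor from ray_data_process.py: the class name, the load_converter factory, and the output column names are all hypothetical, and the sketch skips writing Markdown and JSON to the PVC. It takes a column-oriented batch of file paths and returns per-file metrics, using plain lists to keep the example dependency-free.

```python
import time
from typing import Callable, Dict, List


class PathBatchProcessor:
    """Stand-in for an actor: expensive setup once, many files afterwards."""

    def __init__(self, load_converter: Callable[[], Callable[[str], str]]):
        # In the real pipeline this is where Docling's models would load.
        # load_converter is a hypothetical factory so the sketch runs
        # without Docling installed.
        self.convert = load_converter()

    def __call__(self, batch: Dict[str, List[str]]) -> Dict[str, list]:
        out = {"path": [], "ok": [], "seconds": [], "error": []}
        for path in batch["path"]:
            start = time.perf_counter()
            try:
                self.convert(path)
                ok, error = True, ""
            except Exception as exc:  # one bad file must not fail the batch
                ok, error = False, f"{type(exc).__name__}: {exc}"
            out["path"].append(path)
            out["ok"].append(ok)
            out["seconds"].append(time.perf_counter() - start)
            out["error"].append(error)
        return out


loads = []  # records how many times the "models" were loaded


def fake_loader():
    loads.append(1)

    def convert(path: str) -> str:
        if path.endswith(".bad"):
            raise ValueError("unreadable")
        return f"# {path}"

    return convert


processor = PathBatchProcessor(fake_loader)
first = processor({"path": ["a.pdf", "b.bad"]})
second = processor({"path": ["c.pdf"]})
print(len(loads), first["ok"], first["error"][1])
```

The setup cost is paid once in the constructor, so it is spread across every batch the instance handles. Catching exceptions per file means a single corrupt document shows up as a row in the report rather than as a failed batch.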

    Two deployment patterns

    We provide two notebooks using different CodeFlare SDK patterns for different operational needs.

    Ephemeral clusters for batch processing

    This notebook uses RayJob with ManagedClusterConfig for a complete lifecycle: you submit a job, KubeRay creates a cluster, the job runs, and the cluster is torn down (Figure 3).

    Timeline showing a three-step ephemeral lifecycle on OpenShift AI: startup creates a cluster, execution runs a Ray job, and teardown destroys the cluster.
    Figure 3: Ephemeral Ray job architecture lifecycle from startup to teardown.

    This configuration is best for periodic batch processing, continuous integration/continuous delivery (CI/CD), or any scenario where you want zero idle cost between runs.

    Persistent clusters for interactive work

    This notebook uses Cluster with ClusterConfiguration to create a long-lived cluster, then submits jobs via the Ray Job Submission Client, as shown in Figure 4.

    Multiple users submitting jobs to a persistent Ray application on OpenShift AI, utilizing a long-lived head node, worker pods, and a shared object store.

    Figure 4: Persistent Ray cluster architecture for interactive workloads.

    This setup is best for development, debugging, parameter tuning, or submitting multiple jobs to the same cluster.

    Select the deployment pattern that best aligns with your operational requirements and development workflow.

    Operational need                                  Recommended deployment pattern
    Nightly batch processing                          RayJob
    First-time setup, tuning parameters               RayCluster
    CI/CD integration                                 RayJob
    Processing different document sets iteratively    RayCluster
    Production automation                             RayJob

    The configuration calculator

    Sizing a Ray cluster for Docling means balancing actors per worker, CPUs per actor, memory allocation between Ray's object store and actor heap, and dataset partitioning for even distribution.

    The configure.py script encodes these relationships and produces a complete configuration from a few inputs, outputting ready-to-paste code snippets for job submission.

    Key formulas:

    • Schedulable CPUs per worker = worker_cpus - 2 (reserves 2 CPUs for Ray's raylet and object store)
    • Actors per worker = schedulable_cpus / cpus_per_actor
    • Max actors = num_workers x actors_per_worker
    • Memory per actor = (worker_memory - object_store) / actors_per_worker (must be >=4 GB; >=6 GB with OCR)
    • Total blocks = max_actors x repartition_factor (fine-grained blocks prevent straggler effects)

    It also validates constraints and warns about low actor memory, straggler risk from oversized blocks, over-provisioned clusters, or batch sizes that exceed files per block.
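
A minimal sketch of what such checks might look like. This is not the code in configure.py: the function name, the message wording, the 50-files-per-block straggler threshold, and the "more actors than files" test for over-provisioning are all assumptions made for illustration. Only the 4 GB and 6 GB memory floors come from the formulas above.

```python
from typing import List


def sizing_warnings(
    memory_per_actor_gb: float,
    max_actors: int,
    num_files: int,
    total_blocks: int,
    batch_size: int,
    ocr: bool = False,
    max_files_per_block: int = 50,  # assumed threshold, not from configure.py
) -> List[str]:
    """Return human-readable warnings for a proposed configuration."""
    if total_blocks < 1:
        raise ValueError("total_blocks must be at least 1")
    warnings = []
    min_memory_gb = 6 if ocr else 4  # memory floors from the formulas above
    if memory_per_actor_gb < min_memory_gb:
        warnings.append(
            f"low actor memory: {memory_per_actor_gb:.1f} GB < {min_memory_gb} GB"
        )
    files_per_block = num_files / total_blocks
    if files_per_block > max_files_per_block:
        warnings.append("straggler risk: blocks are too large")
    if max_actors > num_files:
        warnings.append("over-provisioned: more actors than files")
    if batch_size > files_per_block:
        warnings.append("batch size exceeds files per block")
    return warnings


# 24 actors at 4.8 GB each, 10,000 files split into 240 blocks
# (240 assumes a repartition factor of 10).
print(sizing_warnings(4.8, 24, 10_000, 240, batch_size=4))
print(sizing_warnings(4.8, 24, 10_000, 240, batch_size=4, ocr=True))
```

The first call returns no warnings. The second flags low actor memory, because 4.8 GB per actor is below the 6 GB floor once OCR is enabled.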

    Example output for 8 workers x 8 CPUs x 16 GB, processing 10,000 PDFs:

    • Workers: 8 x (8 CPUs, 16 GB)
    • Schedulable CPUs: 6 per worker (2 reserved)
    • Actor pool: 8..24 (2 CPUs, approximately 4.8 GB each)
    • Estimated time: Approximately 35 min (fast) to 2.3 hr (slow)
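
The example output can be reproduced from the key formulas. The sketch below is not configure.py itself. It assumes floor division for actors per worker, an object store sized at 10% of worker memory (chosen because it yields the 4.8 GB shown above), a repartition factor of 10, and a time estimate that divides the files evenly across all actors at 5 to 20 seconds per file. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ClusterPlan:
    schedulable_cpus: int
    actors_per_worker: int
    max_actors: int
    memory_per_actor_gb: float
    total_blocks: int


def plan_cluster(
    num_workers: int,
    worker_cpus: int,
    worker_memory_gb: float,
    cpus_per_actor: int = 2,
    object_store_fraction: float = 0.10,  # assumption, see lead-in
    repartition_factor: int = 10,         # assumption, see lead-in
) -> ClusterPlan:
    schedulable = worker_cpus - 2  # 2 CPUs reserved for raylet and object store
    actors_per_worker = schedulable // cpus_per_actor
    if actors_per_worker < 1:
        raise ValueError("not enough schedulable CPUs for a single actor")
    max_actors = num_workers * actors_per_worker
    object_store_gb = worker_memory_gb * object_store_fraction
    memory_per_actor = (worker_memory_gb - object_store_gb) / actors_per_worker
    return ClusterPlan(
        schedulable_cpus=schedulable,
        actors_per_worker=actors_per_worker,
        max_actors=max_actors,
        memory_per_actor_gb=memory_per_actor,
        total_blocks=max_actors * repartition_factor,
    )


def estimated_hours(num_files: int, max_actors: int, seconds_per_file: float) -> float:
    """Idealized wall-clock hours with every actor busy the whole time."""
    return num_files / max_actors * seconds_per_file / 3600


plan = plan_cluster(num_workers=8, worker_cpus=8, worker_memory_gb=16)
print(plan.schedulable_cpus, plan.actors_per_worker, plan.max_actors)
print(f"{plan.memory_per_actor_gb:.1f} GB per actor, {plan.total_blocks} blocks")
print(f"{estimated_hours(10_000, plan.max_actors, 5) * 60:.0f} min (fast)")
print(f"{estimated_hours(10_000, plan.max_actors, 20):.1f} hr (slow)")
```

Under these assumptions the sketch gives 6 schedulable CPUs, 3 actors per worker, 24 actors in total at 4.8 GB each, and an estimate of about 35 minutes to 2.3 hours, matching the output above.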

    Performance characteristics

    Each run produces a performance report covering throughput (files/second, pages/second), actor distribution, error and timeout rates, and per-file processing times.
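
A report like this is a straightforward aggregation over per-file records. The sketch below assumes a hypothetical record shape (path, status, pages, seconds, actor) and counts only successful files toward throughput. The actual report format in the example may differ.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(results: Iterable[Dict], wall_seconds: float) -> Dict:
    """Aggregate per-file metrics into run-level figures."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    rows = list(results)
    total = len(rows)
    ok = [r for r in rows if r["status"] == "ok"]
    errors = sum(1 for r in rows if r["status"] == "error")
    timeouts = sum(1 for r in rows if r["status"] == "timeout")
    return {
        "files": total,
        # Throughput counts only successfully converted files.
        "files_per_second": len(ok) / wall_seconds,
        "pages_per_second": sum(r["pages"] for r in ok) / wall_seconds,
        "error_rate": errors / total if total else 0.0,
        "timeout_rate": timeouts / total if total else 0.0,
        # Shows how evenly the work was spread across actors.
        "files_per_actor": dict(Counter(r["actor"] for r in rows)),
        "slowest_seconds": max((r["seconds"] for r in rows), default=0.0),
    }


sample = [
    {"path": "a.pdf", "status": "ok", "pages": 10, "seconds": 0.5, "actor": "actor-0"},
    {"path": "b.pdf", "status": "ok", "pages": 6, "seconds": 0.7, "actor": "actor-0"},
    {"path": "c.pdf", "status": "error", "pages": 0, "seconds": 0.2, "actor": "actor-0"},
    {"path": "d.pdf", "status": "timeout", "pages": 0, "seconds": 1.0, "actor": "actor-1"},
]
report = summarize(sample, wall_seconds=2.0)
print(report)
```

A lopsided files_per_actor count, as in this sample, is the kind of signal that suggests blocks are too coarse and a higher repartition_factor may help.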

    Performance depends heavily on worker hardware, particularly CPU and memory. Sample throughput for standard business PDFs (5 to 20 pages, text-heavy, some tables) with simple Docling configurations:

    • 8 workers x 8 CPUs: About 4 to 8 files/second (10,000 files in 20 to 40 minutes)
    • 4 workers x 8 CPUs: About 2 to 4 files/second (10,000 files in 40 to 80 minutes)

    Processing time scales linearly with document complexity. The repartition_factor and per-file timeout settings let you tune for your specific document mix.

    Extending for your own environment

    We designed this example as a starting point. Common extensions include:

    • Different document types. Docling supports DOCX, PPTX, HTML, and images. Change the glob pattern and format options. Docling auto-detects by extension.
    • Connecting to S3 instead of PVC. Replace PVC reads/writes with boto3 calls or ray.data.read_binary_files("s3://..."). This removes the ReadWriteMany requirement at the cost of network latency per file.
    • Enabling OCR (optical character recognition). Set do_ocr = True in the processing options. Increase memory per actor to 6 GB or more, because OCR models add about 2 GB. Use the configuration calculator with --ocr to validate sizing. Expect processing to be 2 to 5 times slower.
    • Custom output formats. Docling also supports Doctags and dictionary export. Modify the converter subprocess to call additional export methods alongside Markdown and JSON.
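
For the first extension, the file discovery step is usually the only place the extension list needs to change. A minimal sketch using only the standard library follows. The helper name and its suffix-matching approach are assumptions, not the example's actual glob logic.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def list_documents(root: str, suffixes: Iterable[str] = (".pdf",)) -> List[str]:
    """Return sorted paths under root whose extension is in suffixes."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


# Demonstrate on a throwaway directory with mixed file types.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.pdf", "b.DOCX", "notes.txt"):
        Path(root, name).write_text("")
    found = [Path(p).name for p in list_documents(root, (".pdf", ".docx"))]

print(found)  # prints ['a.pdf', 'b.DOCX']
```

Matching on a lowercased suffix avoids missing files such as REPORT.PDF, which a case-sensitive glob would skip on most Linux file systems.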

    Summary

    Combining Docling's parsing accuracy with Ray Data's distributed streaming execution addresses the two core problems in document ingestion: data quality and scale. Subprocess isolation ensures fault tolerance. The configuration calculator removes guesswork from cluster sizing. And two deployment patterns (ephemeral RayJob runs for automation, persistent RayCluster setups for development) let teams match compute strategy to workflow.

    Of course, there are scenarios (such as a small set of documents) that might not require the distribution and scalability of Ray, in which case using Docling directly on the documents might be enough.

    Red Hat OpenShift AI provides the orchestration and networking to move these workloads from laptop to production with minimal code changes.

    Get started

    Document ingestion is the front of every retrieval-augmented generation (RAG) and agentic pipeline: the quality and scale of your parsing determine what your models can actually reason over. This example gives you a scale-ready starting point.

    Try the guided example

    Clone the repository, navigate to the folder red-hat-ai-examples/examples/ray/data/docling, create an OpenShift AI workbench, and run your first batch in under an hour. The notebooks walk you through both deployment patterns, so you can start interactively and move to automated batch jobs when you are ready.

    Adapt it to your environment

    Point the pipeline at your own document types, switch from PVC to S3, or enable OCR for scanned archives. Use the configuration calculator to size your cluster for your corpus and hardware, rather than guessing at actor and memory settings.

    Build it into your AI platform

    Structured, scalable ingestion is the foundation for RAG and agentic workloads on Red Hat OpenShift AI. Once your documents are parsed into clean Markdown and JSON, the same platform provides the model serving, tuning, and orchestration to turn that data into production AI applications.

    Join the conversation

    Questions, improvements, or a document type that breaks in an interesting way? Open an issue or start a discussion in the Red Hat AI community.

    References and resources

    • Guided example code: Red Hat AI Examples repository on GitHub
    • Docling project: Official Docling documentation and resources
    • Ray project: Ray framework for scaling Python applications
    • KubeRay documentation: Kubernetes operator for Ray clusters
    • CodeFlare SDK: Python API for managing Ray clusters on OpenShift
    • Breaking the RAG Bottleneck: Scalable Document Processing with Ray Data and Docling: Anyscale blog on Ray Data and Docling for RAG document processing
