The controller-runtime library is the standard tool for building Kubernetes operators. Standing up an operator is easy, and going from zero to a deployable operator is extremely fast. However, as operators scale to large production clusters, you may need to understand and tune the caching behavior to manage memory usage effectively. The issue shows up as an out-of-memory (OOM) error, which causes the operator to be restarted. Operators deployed at cluster scope are the most likely to be affected.
In production, operators may seem to use more memory than expected, sometimes causing their pods to be OOM-killed (assuming, of course, that the deployment has its resource requests and limits set). This may be due to the large workloads the operator is handling, or to the unexpected default caching behavior of the controller-runtime.
The OOM issue may also only show itself when the operator restarts, such as during an upgrade. There is a burst of data at startup: the operator fetches all resources required for reconciliation as it performs a sync (remember that term, it'll come up later).
Or sometimes the operator runs fine for a while, and then a large spike in memory usage occurs. The spike may (or may not) be enough to push the operator into an OOM error.
The hidden cost is memory, and how the controller-runtime uses it can be surprising. But why?
Why is the cost there?
To understand why the cost exists, first we need to understand what the operator is doing, and the role the controller-runtime plays in that operation. An operator is made up of controllers. Each controller has a reconcile function, which normally reconciles a custom resource (CR) from the cluster. These reconcile functions use a client to interact with other resources in the cluster. For example, the operator may be configuring a Redis deployment, and for persistent storage it may interact with a persistent volume claim (PVC) on the cluster during its reconcile loop.
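As a minimal sketch of that reconcile loop, the code below fetches a PVC through the standard controller-runtime client. The reconciler type and the "-data" PVC naming convention are illustrative assumptions, not from a real project:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// RedisReconciler is a hypothetical reconciler for a Redis CRD.
type RedisReconciler struct {
	client.Client
}

func (r *RedisReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// This Get goes through the controller-runtime client and, by
	// default, through its cache (this matters later in the article).
	var pvc corev1.PersistentVolumeClaim
	key := types.NamespacedName{Namespace: req.Namespace, Name: req.Name + "-data"}
	if err := r.Get(ctx, key, &pvc); err != nil {
		// A missing PVC is not an error worth retrying here.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ...reconcile the Redis deployment against the PVC state...
	return ctrl.Result{}, nil
}
```

The sketch cannot run without a cluster; it only shows where the client calls sit in the loop.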
In the controller configuration, the For function specifies which custom resource definition (CRD) the reconcile function handles. In the above example, that would be the Redis CRD. Other functions can be configured to define relationships with other resources, such as Watches. In the Redis example, this might watch the PVC. If there are changes to the PVC, the operator may need to make changes to the Redis deployment. Watches can be configured with event handlers and predicates. This is good practice, because it limits which resources (here, PVCs) can trigger the reconcile loop, normally based on labels or field values.
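This wiring can be sketched with the controller builder. Here, cachev1.Redis stands in for the hypothetical Redis CRD type, and the example.com/redis-data label is an assumed convention for marking the PVCs the operator cares about:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	cachev1 "example.com/redis-operator/api/v1" // hypothetical CRD package
)

func (r *RedisReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Only PVC events carrying this label will reach the reconcile queue.
	pvcPredicate, err := predicate.LabelSelectorPredicate(metav1.LabelSelector{
		MatchLabels: map[string]string{"example.com/redis-data": "true"},
	})
	if err != nil {
		return err
	}
	return ctrl.NewControllerManagedBy(mgr).
		// For: the CRD this reconcile function handles.
		For(&cachev1.Redis{}).
		// Watches: PVC events enqueue the owning Redis CR, filtered
		// by the predicate above.
		Watches(
			&corev1.PersistentVolumeClaim{},
			handler.EnqueueRequestForOwner(mgr.GetScheme(), mgr.GetRESTMapper(), &cachev1.Redis{}),
			builder.WithPredicates(pvcPredicate),
		).
		Complete(r)
}
```

Note that the predicate filters events delivered to this controller; as described below, it does not stop the controller-runtime from caching the resources first.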
The role of the controller-runtime
In this setup, the controller-runtime builds a cache of the cluster resources used by the different controllers. The controller-runtime receives an event from the cluster's API server, updates the cache, and then tells the controller about the new resource. This in turn triggers the handlers and predicates on the controller, and the controller then chooses whether to act on the new resource.
During the reconcile loop, when the controller makes a Get request to the cluster, the controller-runtime intercepts the request, checks the cache, and returns the cached result to the controller. This makes the whole process very efficient.
How it works
When the controller-runtime starts up, informers are configured for the resources defined in the controllers' configuration (in this case, the Redis CRD and PVCs). These informers open HTTP streams to the cluster's API server, registering interest in those resources. As time passes, the API server sends events for those resources, which the controller-runtime processes and passes to the controllers.
When a client within the controller does a Get to the cluster, the request is intercepted by the controller-runtime, and the controller-runtime checks its cache. If the resource is not in the cache, the controller-runtime creates a new informer for that resource, in turn creating a new HTTP stream.
At this point, the workflow seems reasonable, so where is the problem?
As you may recall, I stated earlier that an operator is made up of controllers, usually more than one. These controllers watch a number of resources based on their configuration, and any given resource may be watched by several controllers. Each controller can have its own handlers and predicates on each resource.
Because different controllers may apply conflicting predicates, resource filtering is done after the controller-runtime has cached the resources and forwarded them to the interested controllers.
When using the manager with its default configuration, if a controller watches secrets with a given label, then every secret on the cluster is loaded into the controller-runtime cache, and therefore into the operator's memory.
To make this worse, if a client does a Get or List on a resource during the reconcile loop, the controller-runtime configures an informer, and all resources of that kind are requested from the cluster and loaded into the operator's memory, not just the resource that was requested.
Startups and running compared
There is one last step in the workflow to address, and that's the controller-runtime's behavior at startup. When informers are configured, they run a sync, checking with the API server that all resources have been received. Because the operator is only just starting, no resources have been received, so the API server starts sending all of them, and the informers block the operator from running any controller reconcile functions until the sync completes. This causes a burst in memory usage at startup because all watched cluster resources are loaded into the operator's memory.
While the operator is running and the number of resources on the cluster increases, the API server sends events for the new resources. If the rate of these events is slow enough, memory may be freed between them, allowing the operator to absorb the growth; either way, the cache keeps every matching resource, so the result is high memory usage.
How to solve the problem
How the cache works can be configured, and some cluster resources aren't a problem (see When not to filter the cache for more detail). There are two important use cases to review: resources configured in controller configurations, and resources requested during the reconciliation process.
Resources requested during reconciliation process
When a resource is requested from the cluster using the client, the controller-runtime configures an informer for that resource. There are two ways this can be addressed: don't use the client, or tell the client not to use the cache. Which method is best depends on the use case.
Configuring the client not to use the cache is one of the easiest solutions. However, it means that even when a resource is cached based on labels (see Resources watched by the controller), the client does not use that cache; it accesses the cluster directly on every call. Depending on your access patterns, this could be extremely costly. On the plus side, this change does not require refactoring the controllers, so it's fast to implement.
Caching can be disabled in the client per resource type. To do so, you add the setting to the manager options, which are passed when configuring the controller-runtime manager. For example, this configures the client not to use the cache when secrets are requested from within a controller reconcile function:
ctrl.Options{
    Client: client.Options{
        Cache: &client.CacheOptions{
            DisableFor: []client.Object{
                &corev1.Secret{},
            },
        },
    },
}

The second option is to use a different client: in this case, the dynamic client. The dynamic client provided by client-go does not use the cache. The client within the controller-runtime is built using components from the client-go package, so there is no need to add any configuration for the client in the controller-runtime options. Doing this can also have the added benefit of letting you add a cache configuration that rejects any request the standard client makes for a resource no controller has registered a watch for at startup.
There is a big downside to using the dynamic client: it requires a refactor in all controllers, because every request to the cluster must be modified. The dynamic client's API is also different, and if you come from a background of using the standard client, it can feel harder to use.
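For orientation, here is a minimal sketch of what a read through the dynamic client looks like. The namespace and secret name are illustrative; the point is that the request goes straight to the API server and creates no informer:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	ctrl "sigs.k8s.io/controller-runtime"
)

func readSecretDirect(ctx context.Context) error {
	// Reuse the same rest.Config the manager would use.
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return err
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	// The dynamic client works with GroupVersionResource, not typed objects.
	gvr := schema.GroupVersionResource{Version: "v1", Resource: "secrets"}
	// This Get hits the API server directly; nothing is cached.
	_, err = dyn.Resource(gvr).Namespace("default").Get(ctx, "redis-credentials", metav1.GetOptions{})
	return err
}
```

Note how different this is from the typed `r.Get(ctx, key, &secret)` pattern of the standard client, which is why the refactor touches every call site.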
It's difficult to show an example of the dynamic client replacing the standard client. What I can demonstrate is a configuration that prevents a future mistake where someone uses the standard client on a resource that isn't being watched, unknowingly creating an informer for that resource:
ctrl.Options{
    Cache: cache.Options{
        ReaderFailOnMissingInformer: true,
    },
}

Personally, I would use the second option if I were starting a new project, because the cost of refactoring the existing controller clients would be low. For projects that are already up and running, I would use the first option, but I'd also consider doing the refactor to the second option later.
Resources watched by the controller
When a controller is configured to reconcile on a resource, the controller-runtime sets up an informer for that resource. Unlike with the client, the cache cannot be disabled for such a resource while still allowing the controller to reconcile on its events. In these cases, the cache must be configured instead.
There are a number of global default options for the cache that can be configured. However, as documented in When not to filter the cache, it may be better to stay away from these defaults. It's often better to filter object by object, or more specifically with cache.ByObject. The cache.ByObject config provides a lot of options, but the one that matters here is Label. Two other useful filters are Field and Namespaces.
Keep in mind that these are AND filters, not OR filters: if two labels are given in the selector, then both labels are required. The same goes for combining Label, Field, and Namespaces; all must match. This constraint makes migrating from one set of filters to another much harder once an operator is in production. It is also possible to define these filters as defaults at a global level, which ByObject can then override.
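As a sketch of that AND behavior, the following (assumed) configuration caches only secrets that carry both labels and live in the redis namespace; a secret matching just one of the conditions never enters the cache. The per-object Namespaces field is available in recent controller-runtime versions:

```go
cache.Options{
	ByObject: map[client.Object]cache.ByObject{
		&corev1.Secret{}: {
			// Both labels must be present (AND)...
			Label: labels.SelectorFromSet(labels.Set{
				"example.com/redis-secret": "true",
				"example.com/managed":      "true",
			}),
			// ...and the secret must also be in this namespace.
			Namespaces: map[string]cache.Config{"redis": {}},
		},
	},
}
```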
Here's an example of a configuration that uses ByObject to target only secrets. The label selector targets secrets labelled with example.com/redis-secret = true. More information on the cache options in the controller-runtime can be found in the cache package documentation.
ctrl.Options{
    Cache: cache.Options{
        ByObject: map[client.Object]cache.ByObject{
            &corev1.Secret{}: {
                Label: labels.SelectorFromSet(labels.Set{"example.com/redis-secret": "true"}),
            },
        },
    },
}

For resources created by the operator, it is recommended to apply the well-known labels for Kubernetes. These recommendations do not define how user-generated resources should be labeled.
When not to filter the cache
There is a very valid reason not to filter the cache. CRDs defined by the operator are almost always reconciled by that operator, which means all events for those resources should be handled. Filtering the cache by default would also affect these CRDs, so your users would have to add filter-matching criteria to their resources (most likely in the form of labels).
Identifying the scale of the problem
Now you know that the controller-runtime can end up caching every resource on your cluster, but how can you tell which resources are affected?
The types of events that cause the controller-runtime to create informers can be classified as implicit or explicit. The explicit cache informers are created by the watches on the controller configurations. These informers are configured on operator startup. An audit of the controller configuration can show what resources are creating informers.
The implicit cache informers are harder to spot. These informers are created by clients making Get and List requests to the cluster. Auditing the code base for these resources is difficult because it requires reviewing every Get and List call.
To help with auditing, the controller-runtime produces a log line that suggests an informer is being configured. Scanning the operator logs for the message Starting EventSource helps you narrow your search:
{"level":"info","ts":"2026-01-15T14:24:41Z","msg":"Starting EventSource","controller":"secret","controllerGroup":"","controllerKind":"Secret","source":"kind source: *v1.Secret"}

Conclusion
The controller-runtime makes setting up an operator very easy by hiding the complexity of configuring the watches. However, this ease of use comes at the cost of abstracting away how the caching infrastructure is configured. The complexity of caching still exists, but now you understand how to mitigate it.
As a maintainer, the best thing you can do is audit your operators, check your caching configuration, and build software that does not try to read all the resources from the cluster.