This article explains how to diagnose and address application misbehavior after a Red Hat OpenShift upgrade.
Container awareness is a primary focus, as it dictates how an application behaves within a container. I therefore consider this article a follow-up to How to use Java container awareness in OpenShift 4, and a second extension of that article after How does cgroups v2 impact Java, .NET, and Node.js in OpenShift 4?
I discuss several points on this matter in the sections that follow. The topic is extensive, so I will be as concise as possible. This article assumes you are already investigating unexpected application behavior. It therefore does not cover other migration topics, such as how to perform a migration or how to prepare for one.
Migration
Upgrading your Red Hat OpenShift version should not require changes to your application, and most migrations are painless.
This is generally true if your application is compatible with both cgroups v1 and cgroups v2. In these cases, application behavior remains consistent across upgrades.
However, some scenarios can lead to unexpected outcomes; these troublesome scenarios are discussed in the following sections.
Troublesome scenarios
In each of the following scenarios, the application behaves differently after the migration than it did on the previous cluster or version.
Application changes
Although seemingly straightforward, a change in the application itself is a common source of problems, especially when image tags are modified. The tag might be incorrect, or the image can be cached on the registry.
Workload changes
Workload patterns might change due to Ingress parameters or increased application access.
You can verify this with the metrics already embedded in Red Hat OpenShift 4, such as the Prometheus and Grafana metrics for resource use and network (packet) traffic.
Instrumentation
Changes to instrumentation—such as a different image or a new sidecar configuration—can cause the application to consume more memory than expected. Even with an identical image, updated instrumentation settings might introduce unexpected overhead or cause the application to misbehave. Because these issues are case-specific, consult your tool vendor to evaluate how a cgroups version migration affects their software.
Non-container-aware applications
An application that is not container-aware sizes itself from the host hardware rather than from the resource allocations of its pod.
JVM container awareness
Default heap sizing depends on whether the JVM detects its container limits in a given deployment.
If a JVM deployment cannot detect its container limits, the application becomes non-container-aware.
The application typically behaves consistently if it fails to detect limits on both the source and target hosts, assuming both environments share identical CPU and memory resources. The following table summarizes these scenarios:
| Source host state | Target host state | Scenario description |
|---|---|---|
| Container-aware | Non-container-aware | Most common scenario: The application detects limits on the source host but fails to detect them on the target host. |
| Non-container-aware | Non-container-aware | Uncommon scenario: The application must already be deployed as non-container-aware. |
| Non-container-aware | Container-aware | Rare scenario: The application or its deployment script must dynamically detect host limits. |
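One quick way to see which of these states a deployment is in is to print what the JVM itself reports. The following is a minimal sketch, not a definitive diagnostic: the class name ContainerView and its helper methods are hypothetical, and it relies only on the standard Runtime API. Run it in the pod on both the source and target clusters and compare the values with the pod's configured limits.

```java
import java.util.Locale;

// Hypothetical helper: prints the CPU count and maximum heap the JVM reports.
class ContainerView {
    // CPU count the JVM believes it can use (reflects container limits only if detected).
    static int visibleCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Maximum heap the JVM will attempt to use, in bytes.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Formats both values on one line so source and target runs are easy to compare.
    static String describe(int cpus, long maxHeapBytes) {
        return String.format(Locale.ROOT, "cpus=%d maxHeapMiB=%d", cpus, maxHeapBytes / (1024 * 1024));
    }

    public static void main(String[] args) {
        System.out.println(describe(visibleCpus(), maxHeapBytes()));
    }
}
```

If the reported CPU count matches the node rather than the pod's CPU limit, the JVM is most likely running as non-container-aware.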
Off-heap non-container-aware applications
Off-heap components can also ignore container limits. For instance, Netty (depending on the application and configuration) is not always container-aware and can scale according to the number of CPUs on the host. In that case, off-heap resource use inside the container tracks the host's CPU count rather than the container's CPU limit.
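To illustrate the scaling effect, the sketch below models a thread pool sized at twice the visible CPU count. This mirrors what I understand to be Netty's default event loop sizing, but treat the factor of two as an assumption and verify it against your Netty version. The class EventLoopSizing is hypothetical and does not call Netty.

```java
// Hypothetical model of a CPU-derived thread pool size; does not use Netty itself.
class EventLoopSizing {
    // Assumed default: two event loop threads per visible CPU, with a minimum of one.
    static int defaultEventLoopThreads(int visibleCpus) {
        return Math.max(1, visibleCpus * 2);
    }

    public static void main(String[] args) {
        int containerCpus = 2;  // example pod CPU limit
        int hostCpus = 64;      // example node CPU count
        System.out.println("Container-aware threads: " + defaultEventLoopThreads(containerCpus));
        System.out.println("Host-sized threads:      " + defaultEventLoopThreads(hostCpus));
    }
}
```

Because each thread can hold its own off-heap buffers, a pool sized from the host can multiply off-heap memory use well beyond what the container limit anticipates.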
Non-JVM component awareness
Even if the JVM is container-aware, underlying runtime components might not be. For example, native libraries like jemalloc and glibc reside below the JVM level and do not inherently detect container constraints.
When such an application is deployed in an OpenShift cluster whose nodes are larger or less restricted, it scales based on the host resources instead of the container limits.
cgroups version change
This is a specific instance of the previous use case. An application deployed under cgroups v1 might successfully detect container limits but fail to do so after migrating to cgroups v2.
In particular, Java runtime environments that lack cgroups v2 support might fail to detect container memory or CPU limits after the change.
Review the following criteria to identify and resolve these compatibility failures:
- When it will fail: Deploying an application that is incompatible with cgroups v2 can cause unexpected behavior in Red Hat OpenShift 4.19.
- How to fix it: Upgrade to a cgroups v2-compatible version of your runtime environment.
- Workaround: If an upgrade is not possible, limit CPU and memory allocations manually to mitigate broader cluster issues.
For instance, the removal of cgroups v1 in Red Hat OpenShift 4.19 could explain behavior differences after an upgrade, especially if the application is not cgroups v2 compliant. In this case, the application will behave as non-container-aware.
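To confirm which cgroups version a pod actually sees, you can check the layout of the cgroup filesystem. The sketch below assumes the conventional mount point /sys/fs/cgroup and uses the presence of the cgroup.controllers file as the cgroups v2 indicator. The class name CgroupVersion is hypothetical, and mixed or nonstandard mounts might need a different check.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: guesses the cgroups version from the filesystem layout.
class CgroupVersion {
    static String detect(Path cgroupRoot) {
        // The unified hierarchy (v2) exposes cgroup.controllers at its root.
        if (Files.exists(cgroupRoot.resolve("cgroup.controllers"))) {
            return "v2";
        }
        // cgroups v1 mounts one directory per controller, such as "memory".
        if (Files.isDirectory(cgroupRoot.resolve("memory"))) {
            return "v1";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println("cgroups version: " + detect(Paths.get("/sys/fs/cgroup")));
    }
}
```

Running this on both the source and target clusters shows whether the cgroups version changed across the upgrade.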
This is the core topic of the article How does cgroups v2 impact Java, .NET, and Node.js in OpenShift 4?
Unbounded deployments
Unbounded deployments also directly impact application migration.
When deploying an application on a Red Hat OpenShift node, you can configure it without resource limits, allowing it to use the full host for performance.
This configuration is known as an unbounded application. This type of deployment allows the application to spike when workloads peak, taking advantage of over-provisioned OpenShift nodes.
Java uses container limits rather than requests for resource calculations. For unbounded deployments, host-level specifications directly dictate runtime thread and memory footprint calculations.
CPU impact
Host CPU cores directly affect the application thread count and overall resource footprint, as many heap and off-heap components base their internal configurations on available CPU cores. For example, Netty scales its thread pool according to the host processor count. In an unbounded deployment, a high host CPU count creates an excessively large thread pool. This leads to CPU throttling as multiple threads compete for kernel time slices while garbage collection (GC) threads execute simultaneously.
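GC worker threads are one example of a CPU-derived setting. The sketch below reproduces the default heuristic that, to my understanding, HotSpot uses for parallel GC worker threads: one thread per CPU up to 8, then 5/8 of a thread per additional CPU. Treat the formula as an assumption and confirm the actual value for your JVM, for example in the VM.info output. The class name GcThreadEstimate is hypothetical.

```java
// Hypothetical estimate of default parallel GC worker threads from the visible CPU count.
class GcThreadEstimate {
    // Assumed heuristic: all CPUs up to 8, then 5/8 of each CPU beyond 8 (integer arithmetic).
    static int defaultParallelGcThreads(int visibleCpus) {
        return visibleCpus <= 8 ? visibleCpus : 8 + (visibleCpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        System.out.println("4 CPUs  -> " + defaultParallelGcThreads(4) + " GC threads");
        System.out.println("64 CPUs -> " + defaultParallelGcThreads(64) + " GC threads");
    }
}
```

Under this assumption, an unbounded pod on a 64-CPU node would start roughly ten times as many GC workers as the same pod limited to 4 CPUs.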
Memory impact
The JVM calculates its memory sizes from the host memory rather than from a container limit, which results in a larger footprint.
Even at baseline idle states, resource consumption can be significantly higher than in strictly bounded environments.
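The arithmetic behind the larger footprint can be sketched as follows. The example assumes the JVM default of -XX:MaxRAMPercentage=25 and ignores the special cases the JVM applies to very small memory sizes. The class name HeapEstimate and the example sizes are hypothetical.

```java
// Hypothetical estimate of the default maximum heap from the memory the JVM sees.
class HeapEstimate {
    // Applies a MaxRAMPercentage-style fraction to the visible memory.
    static long defaultMaxHeapBytes(long visibleMemoryBytes, double maxRamPercentage) {
        return (long) (visibleMemoryBytes * maxRamPercentage / 100.0);
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long bounded = defaultMaxHeapBytes(2 * gib, 25.0);     // pod limited to 2 GiB
        long unbounded = defaultMaxHeapBytes(256 * gib, 25.0); // unbounded pod on a 256 GiB node
        System.out.println("Bounded max heap (MiB):   " + bounded / (1024 * 1024));
        System.out.println("Unbounded max heap (GiB): " + unbounded / gib);
    }
}
```

With these example numbers, the same application moves from a maximum heap of 512 MiB to one of 64 GiB without any change to the image.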
Given larger memory use, the GC will directly affect performance; for instance, using a memory-intensive collector like G1GC means memory use increases until Full GCs are required.
Although the extent to which you should use unbounded applications is debatable, noisy neighbors and the normal OpenShift workload on the node directly affect the application, even as its resource footprint varies.
FAQ
The following frequently asked questions can help you verify and address this problem:
Q1. What is the first step to verify an issue after migration?
A1. Verify whether the cgroups version differs between the source and target clusters. You can do this proactively, before the migration.
Q2. What would be a possible troubleshooting flow?
A2. Verify the cgroups difference, then deploy and verify the application. Also benchmark to confirm that the problem happens consistently, because intermittent issues are often more challenging to resolve than persistent, predictable failures.
Q3. Is having more or fewer GC cycles a sign of a problem?
A3. Not necessarily. Several underlying conditions can change GC frequency; an unbounded deployment is one example.
Overview
The following table maps common migration scenarios to their corresponding diagnostic actions and verification steps.
| Troubleshooting scenario | Diagnostic action | Verification step |
|---|---|---|
| Application change | Track the specific application change backward. | Verify elements of application difference, such as the image SHA and image metadata. |
| Workload change | Track workload and application behaviors using instrumentation and data collection. | Verify elements of workload difference, cluster metrics, and memory allocation. |
| Instrumentation change or impact | Track the instrumentation overhead and disable the instrumentation or settings. | Verify elements of workload difference not related to application memory use. |
| Cgroups version change | Check whether the application or the container startup script lacks updates, or whether the application is incompatible with cgroups v2. | Verify the application's compatibility with cgroups v2. |
| Unbounded deployment | Check for host configuration fluctuations and noisy neighbor issues that affect resource distribution. | Verify details on the OpenShift host. |
Additional resources
The following articles cover similar themes and are complementary to this guide:
- How does cgroups v2 impact Java, .NET, and Node.js in OpenShift 4?
- Cgroups v2 in OpenJDK container in Openshift 4
- Verifying Cgroup v2 Support in OpenJDK Images
- What Red Hat middleware software is cgroups v2 compatible?
To learn more about Java container awareness and how it prevents heap decoupling, see How to use Java container awareness in OpenShift 4. For detailed release notes, review Severin Gehwolf's article on cgroups v2 support in OpenJDK 8u372.
Data collection
You can gather specific data points to troubleshoot these scenario-based issues. For instance, for Java 11 and later, you can use the VM.info file (the output of the jcmd VM.info diagnostic command) to verify heap size, off-heap configurations, container size, and CPU details. If the runtime fails to detect cgroups limits, the file entries show this mismatch.
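As a lightweight complement, the following sketch prints the JVM's own view next to the raw cgroups v2 limit files so you can spot a mismatch directly. It assumes cgroups v2 with the files cpu.max and memory.max under /sys/fs/cgroup. The class name CgroupLimits and its helpers are hypothetical, and on cgroups v1 the files are simply reported as unavailable.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalDouble;

// Hypothetical helper: compares JVM-reported resources with cgroups v2 limit files.
class CgroupLimits {
    // Parses cpu.max content: "<quota> <period>" yields quota/period CPUs; "max <period>" means no limit.
    static OptionalDouble parseCpuMax(String content) {
        String[] parts = content.trim().split("\\s+");
        if (parts.length != 2 || parts[0].equals("max")) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(parts[0]) / Double.parseDouble(parts[1]));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    // Reads a small text file, or returns a marker when it cannot be read.
    static String readOrNote(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        } catch (IOException e) {
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        String cpuMax = readOrNote(Paths.get("/sys/fs/cgroup/cpu.max"));
        String memoryMax = readOrNote(Paths.get("/sys/fs/cgroup/memory.max"));
        System.out.println("JVM availableProcessors: " + rt.availableProcessors());
        System.out.println("JVM maxMemory (bytes):   " + rt.maxMemory());
        System.out.println("cgroups v2 cpu.max:      " + cpuMax);
        System.out.println("cgroups v2 memory.max:   " + memoryMax);
        OptionalDouble cpuLimit = parseCpuMax(cpuMax);
        if (cpuLimit.isPresent()) {
            System.out.println("CPU limit from cpu.max:  " + cpuLimit.getAsDouble());
        }
    }
}
```

If cpu.max shows a limit of 2 CPUs but the JVM reports the node's full CPU count, the runtime is not detecting the container limits.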
Isolate resource constraints after your next upgrade
When troubleshooting application disruptions after an OpenShift upgrade, check your cgroups compatibility first. Verifying whether your runtime components accurately detect container-level limits isolates the root cause of unexpected resource footprint spikes and streamlines your diagnostic path.
Acknowledgments
Thanks to Moises Lozano and Pamela Giz for their contributions to this work.