Last year, Red Hat announced the availability of resource optimization for Red Hat OpenShift. This article takes a closer look at the benefits of Kruize Autotune, the engine that provides the container right-sizing recommendations behind resource optimization.
What is Kruize Autotune?
Kruize provides container right-sizing recommendations in Kubernetes in the form of CPU and memory requests and limits. For both CPU and memory, the recommended request and limit values are the same. Kruize derives the recommendations from a monitoring data source such as Prometheus, which can be local or remote. The recommendations are based on resource usage over the past 24 hours (short term), 7 days (medium term), and 15 days (long term), and include cost-optimized and performance-optimized suggestions for each term on a per-container basis.
Kruize also provides capacity and utilization data that can be used to plot resource requests against actual resource utilization (e.g., as a box plot), which makes the recommendations easier to understand.
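To make the shape of a recommendation concrete, here is a minimal sketch that turns a recommended CPU and memory value into a Kubernetes-style resources stanza with identical requests and limits. The function name to_resources is hypothetical and is not part of Kruize; the sample values are the cost-optimized figures from the production example later in this article.

```python
def to_resources(cpu_cores: float, memory_mib: int) -> dict:
    """Build a resources stanza where requests and limits are identical.

    Hypothetical helper for illustration only; not a Kruize API.
    """
    values = {
        "cpu": f"{round(cpu_cores * 1000)}m",  # Kubernetes millicore notation
        "memory": f"{memory_mib}Mi",           # Kubernetes mebibyte notation
    }
    # Copy the dict so requests and limits can be edited independently later.
    return {"requests": dict(values), "limits": dict(values)}


# 0.15 cores and 957 MiB, as in the production example below.
resources = to_resources(0.15, 957)
print(resources)
```

The resulting dictionary mirrors the resources block of a container spec, so it could be serialized to YAML or JSON as needed.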
Performance-optimized recommendations currently use the 98th percentile for CPU usage for the given term. The usage includes any throttling that may have happened in the term.
Memory recommendations use the max value in the observed term with an added buffer. The buffer is the smaller of two values: 20% of the max value, or the maximum interval spike in the observed term.
memRecommendation = termMemMaxInterval + min(0.2 * termMemMaxInterval, termSpikeMaxInterval);
Here, Interval refers to the minimum observable duration of the gathered metrics. The default interval has currently been set to 15 minutes.
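The memory formula above can be sketched in a few lines of Python. The function and parameter names simply mirror the symbols in the formula, and the sample values (in MiB) are made up for illustration. This is a sketch of the formula as stated in this article, not Kruize's actual implementation.

```python
def mem_recommendation(term_mem_max_interval: float,
                       term_spike_max_interval: float) -> float:
    """Max observed memory plus a buffer capped at 20% of that max."""
    buffer = min(0.2 * term_mem_max_interval, term_spike_max_interval)
    return term_mem_max_interval + buffer


# Hypothetical values in MiB.
# Small spike: the spike (100) is below the 20% cap (160), so the spike is the buffer.
small_spike = mem_recommendation(800.0, 100.0)
# Large spike: the spike (400) exceeds the 20% cap (160), so the cap is the buffer.
large_spike = mem_recommendation(800.0, 400.0)

print(small_spike, large_spike)
```

In other words, the buffer tracks observed spikes but never exceeds 20% of the observed max.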
Cost-optimized recommendations use the 60th percentile for CPU usage for the given term (including throttling), and the memory recommendation is the same as that of performance-optimized recommendations.
Note that the algorithms used to arrive at these recommendations are subject to change.
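As a rough illustration of how the two CPU profiles differ, the sketch below computes the 98th and 60th percentiles over a list of per-interval CPU samples. It uses a simple nearest-rank percentile, which is an assumption: this article does not say which percentile method Kruize uses. The sample data is synthetic, and each sample is assumed to already include throttling.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. (Assumed method, for illustration only.)"""
    if not samples:
        raise ValueError("no samples in the observed term")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic per-interval CPU usage in cores: 0.01, 0.02, ..., 1.00.
cpu_usage = [i / 100 for i in range(1, 101)]

performance_cpu = percentile(cpu_usage, 98)  # performance-optimized profile
cost_cpu = percentile(cpu_usage, 60)         # cost-optimized profile

print(performance_cpu, cost_cpu)
```

Because the cost-optimized profile sits at a lower percentile, it will generally recommend less CPU than the performance-optimized profile for the same term.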
Production cluster example
Let's look at an example to explain the CPU and memory recommendations. Figure 1 shows the resource utilization of the container swatch-tally-service on a production cluster over a 15-day term.

As Figure 2 shows, the container currently has a CPU request and limit of 2 cores, a memory request of 2 GiB, and a memory limit of 4 GiB.
Based on the past 15 days of actual usage, we see the cost-optimized recommendation from Kruize is 0.15 cores for CPU, which is 93% less than what has been set currently, and 957 MiB for memory, which is 53% less.

If the same container needs to be optimized for performance, Figure 3 shows that Kruize recommends CPU request and limit of 1.18 cores, which is still 41% less than what is set currently. Memory remains 53% less than what has been set currently.

Staging cluster example
In this example, the container swatch-tally-service now runs on a staging (non-production) cluster. Figure 4 shows the resource utilization over the last 24 hours.

We see that the container currently has a CPU request and limit of 2 cores, a memory request of 1 GiB, and a memory limit of 4 GiB.
Based on the past 24 hours of actual usage, we see the cost-optimized recommendation from Kruize to be 0.1 core for CPU, which is 95% less than what has been set currently, and 4.5 GiB for memory, which is 12% more. The memory recommendation is higher than what has been currently set for two reasons.
First, the max is very close to what has been set. Second, there may be observed spikes, which may push the usage beyond what was set as the limit. Since memory is not a compressible resource, the recommendation is higher to help offset any OOM scenarios (Figure 5).

On the other hand, the performance-optimized CPU recommendation for the same term is 3.32 cores, which is an increase of 66% compared to the current setting. The memory recommendation is 4.5 GiB, which is an increase of 12%, as shown in Figure 6.
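The percentages quoted in both examples can be reproduced with simple arithmetic. The sketch below is a hypothetical helper, not part of Kruize. One assumption is worth flagging: the quoted memory figures appear to be measured against the 2 GiB request in the production example and against the 4 GiB limit in the staging example, since those baselines reproduce the numbers in the text.

```python
def percent_change(current: float, recommended: float) -> float:
    """Signed change of the recommendation relative to the current setting."""
    return (recommended - current) / current * 100


# (label, current setting, recommended value); values taken from the text.
comparisons = [
    ("prod CPU, cost (cores)",           2.0,    0.15),  # article: 93% less
    ("prod CPU, performance (cores)",    2.0,    1.18),  # article: 41% less
    ("prod memory (MiB, vs request)",    2048.0, 957.0), # article: 53% less
    ("staging CPU, cost (cores)",        2.0,    0.1),   # article: 95% less
    ("staging CPU, performance (cores)", 2.0,    3.32),  # article: 66% more
    ("staging memory (GiB, vs limit)",   4.0,    4.5),   # article: 12% more
]

for label, current, recommended in comparisons:
    print(f"{label}: {percent_change(current, recommended):+.1f}%")
```

The article rounds these results to whole percentages, so small differences such as 92.5% versus 93% are expected.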

Warnings in the recommendations
Under certain conditions, a recommendation is displayed with a warning. This section discusses these cases in more detail.
Warnings about idle containers:
A container can be idle (less than 1 millicore of CPU usage) in the observed term. In Figure 7, we see a container that has been idle for the last 24 hours.

In this scenario, Kruize will be unable to generate a recommendation for CPU for the respective term. Figure 8 shows a recommendation for the CPU idling case, which has an empty CPU recommendation with a warning icon.
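The idle case can be sketched as a simple threshold check. The 1 millicore threshold comes from the text above; everything else is an assumption. In particular, this sketch treats a container as idle only when every sample in the term is below the threshold, and the function names and warning text are hypothetical rather than taken from Kruize.

```python
from typing import Optional, Sequence

IDLE_THRESHOLD_CORES = 0.001  # 1 millicore, as stated in the article


def is_idle(cpu_samples: Sequence[float]) -> bool:
    """True when the term has data and its peak usage stays under 1 millicore.

    Assumption: "idle" means the peak sample is below the threshold.
    """
    return len(cpu_samples) > 0 and max(cpu_samples) < IDLE_THRESHOLD_CORES


def idle_warning(cpu_samples: Sequence[float]) -> Optional[str]:
    """Return a warning message for idle containers, otherwise None."""
    if is_idle(cpu_samples):
        return "Container was idle for the term; no CPU recommendation generated"
    return None


idle_container = [0.0002, 0.0005, 0.0001]  # hypothetical samples in cores
busy_container = [0.0002, 0.25, 0.4]

print(idle_warning(idle_container))
print(idle_warning(busy_container))
```

A caller would skip the CPU percentile calculation whenever a warning is returned and surface the message instead.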

Containers that do not have either a request or limit set in their current configuration:
Figure 9 shows a case where the CPU limit was not set in the current configuration, which results in a warning icon.

Similarly, Figure 10 shows a case where both CPU and memory requests and limits have not been set.

Important takeaways
For critical production workloads, we recommend applying the performance-optimized recommendation based on the previous 15-day term. To prevent or reduce disruption to production workloads, it is better to update the container configuration less often and only when the recommended configuration differs significantly from the current one.
For non-production workloads, it would be wise to optimize for cost and update the configuration more frequently. In this case, you can get the maximum benefit if the configuration is set to the cost-optimized recommendation based on resource usage of the past 24 hours.
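These two guidelines could be encoded as a small policy, sketched below. The profile and term labels, the function names, and especially the 20% threshold for a "significant" difference are assumptions made for illustration. The guidance above does not define a specific threshold.

```python
from typing import Tuple


def pick_recommendation(is_production: bool) -> Tuple[str, str]:
    """Choose which profile and term to apply, following the guidance above."""
    if is_production:
        return ("performance", "15 days")
    return ("cost", "24 hours")


def should_apply(current: float, recommended: float,
                 min_relative_change: float = 0.2) -> bool:
    """Apply an update only if it differs enough from the current setting.

    min_relative_change is a hypothetical threshold (20% by default).
    """
    return abs(recommended - current) / current >= min_relative_change


print(pick_recommendation(is_production=True))
print(should_apply(current=2.0, recommended=1.18))  # 41% change
print(should_apply(current=2.0, recommended=1.9))   # 5% change
```

For non-production workloads, the threshold could be lowered so that the configuration follows the cost-optimized recommendation more closely.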
In general, when we optimize an entire cluster in this fashion with Kruize, we see a reduction of more than 40% in overall resource usage, along with the associated cost benefits. We are working on new and exciting recommendations, including for AI workloads, so stay tuned.