Speed up Shenandoah garbage collection

Our primary motivation for the Shenandoah OpenJDK garbage collection (GC) project is to reduce garbage collection pause times. In JDK 12, we released the original Shenandoah garbage collector, which implements concurrent heap evacuation and thereby solved the major problem of cleaning (potentially large) heaps without stopping the application. This version was eventually backported to JDK 11. In JDK 14, we implemented concurrent class unloading, and in JDK 16, we added concurrent reference processing, both of which further reduced the pause times of those garbage collection operations. The remaining garbage collection operation performed under a pause was thread-stack processing, which we have made concurrent in JDK 17.

This article introduces the new concurrent thread-stack processing in Shenandoah GC. Processing thread stacks concurrently gives us reliable sub-millisecond pauses in JDK 17.

Thread processing in Java

What is thread processing and why do we need to stop the application for it? Java programs are executed in threads, and each thread owns a stack: a list of stack frames, each frame holding local variables, monitors, and other information related to the currently executed method. Most importantly, in the context of Java garbage collection, a frame holds references to heap objects (e.g., local variables that refer to reference-typed objects).
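To make the notion of frames concrete, here is a minimal sketch that lists the frames of the current thread with the standard `java.lang.StackWalker` API (available since Java 9). The class and method names (`StackFramesDemo`, `recurse`) are made up for illustration. Note that `StackWalker` only exposes method-level information; the references that the garbage collector finds in each frame are not visible through this API.

```java
import java.util.List;
import java.util.stream.Collectors;

final class StackFramesDemo {
    // Recurse `depth` times so the stack holds several frames, then capture
    // the method name of every frame on the current thread's stack.
    static List<String> recurse(int depth) {
        // A reference-typed local variable: it lives in this frame and points
        // to a heap object, which is what the GC looks for when it scans stacks.
        Object local = new Object();
        if (depth > 0) {
            return recurse(depth - 1);
        }
        return StackWalker.getInstance().walk(frames ->
                frames.map(StackWalker.StackFrame::getMethodName)
                      .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        // The top-most (currently executing) frame is listed first.
        recurse(3).forEach(System.out::println);
    }
}
```

Each recursive call adds one frame, so the list starts with four `recurse` entries followed by the frames of whatever called it.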

When a garbage collection cycle is initiated, we first scan all of the threads' stacks to seed marking queues with the references that we find on stacks. We do so at a GC pause (safepoint) because we need a consistent state of the stack at mark start, without the thread's execution concurrently messing with the stack. When we're done, we resume execution and traverse the graph of reachable objects, starting from the references that we found during the initial thread scan.
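The two steps described above (seed the marking queue from the stacks, then traverse the object graph) can be sketched with a toy model. Everything here is an assumption made for illustration: `Obj` stands in for a heap object, a frame is just a list of references, and the names `seedFromStacks` and `mark` are hypothetical. It is not how Shenandoah represents stacks or queues internally.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class MarkingSketch {
    // Toy heap object: a name plus outgoing references.
    static final class Obj {
        final String name;
        final List<Obj> fields = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    // Step 1 (at the pause in the classic scheme): copy every reference found
    // in any frame of any thread stack into the marking queue.
    // Shape of the argument: threads -> frames -> references.
    static Deque<Obj> seedFromStacks(List<List<List<Obj>>> threadStacks) {
        Deque<Obj> queue = new ArrayDeque<>();
        for (List<List<Obj>> stack : threadStacks) {
            for (List<Obj> frame : stack) {
                queue.addAll(frame);
            }
        }
        return queue;
    }

    // Step 2 (after execution resumes): traverse the graph of reachable
    // objects, starting from the seeded references.
    static Set<Obj> mark(Deque<Obj> queue) {
        Set<Obj> marked = Collections.newSetFromMap(new IdentityHashMap<>());
        while (!queue.isEmpty()) {
            Obj o = queue.poll();
            if (marked.add(o)) {          // first visit only, so cycles terminate
                queue.addAll(o.fields);
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c"), d = new Obj("d");
        a.fields.add(b);
        b.fields.add(c);
        b.fields.add(a);                  // a cycle
        // One thread with one frame holding a reference to `a`; `d` is garbage.
        Set<Obj> marked = mark(seedFromStacks(List.of(List.of(List.of(a)))));
        System.out.println(marked.size() + " objects marked, d marked: " + marked.contains(d));
    }
}
```

The point of the sketch is the ordering: the queue can only be seeded correctly if the stacks do not change while they are being read, which is why the classic scheme does step 1 at a safepoint.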

Likewise, when evacuating reachable objects into empty regions, we need to update all references on thread stacks to point to the new object locations. We need to do so at a pause because garbage collection load barriers normally act when a reference is loaded from the heap (for example, into a local variable or register). Once a reference sits in a local variable or register, it is too late to pass it through the garbage collection barrier, which means that local variables and registers must never hold object references in a state that requires GC intervention. Invoking a garbage collection barrier for every local variable or register access would quickly run into performance problems.
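A toy model of that stack update might look as follows. It is only a sketch under loud assumptions: stack slots are modeled as a plain array, and a side table (`forwarding`) stands in for the forwarding information that a real collector keeps with the object itself. The names `evacuate` and `updateStackSlots` are hypothetical.

```java
import java.util.IdentityHashMap;
import java.util.Map;

final class StackUpdateSketch {
    // Toy heap object.
    static final class Obj {
        final String payload;
        Obj(String payload) { this.payload = payload; }
    }

    // Copy an object to its new location and remember where it went.
    static Obj evacuate(Obj old, Map<Obj, Obj> forwarding) {
        Obj copy = new Obj(old.payload);
        forwarding.put(old, copy);
        return copy;
    }

    // Rewrite every stack slot that still points at an evacuated object.
    // No barrier runs when a local variable is read, so stale slots must be
    // fixed before the thread is allowed to use them again.
    static int updateStackSlots(Obj[] slots, Map<Obj, Obj> forwarding) {
        int updated = 0;
        for (int i = 0; i < slots.length; i++) {
            Obj forwarded = forwarding.get(slots[i]);
            if (forwarded != null) {
                slots[i] = forwarded;
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b");
        Obj[] slots = { a, b, a };        // two locals refer to `a`
        Map<Obj, Obj> forwarding = new IdentityHashMap<>();
        Obj newA = evacuate(a, forwarding);
        int updated = updateStackSlots(slots, forwarding);
        System.out.println(updated + " slots updated, slot 0 is new copy: " + (slots[0] == newA));
    }
}
```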

Scanning and processing thread stacks takes time. Smallish workloads (few threads with small stacks) would probably take very few milliseconds to scan, but large workloads (application servers, I'm looking at you!) can easily take several dozen milliseconds to process. All of that processing is done while the application is stopped, so it affects the application's overall end-to-end latency.
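To get a rough idea of how much garbage collection work your own application sees, the standard `java.lang.management` API exposes per-collector counters. The sketch below simply lists whatever collector beans the running JVM reports; the class name `GcTimeSketch` is made up, and the bean names depend on the collector in use. The reported time is an approximate accumulated value in whole milliseconds, so it cannot resolve individual sub-millisecond pauses; GC logging (for example `-Xlog:gc` together with `-XX:+UseShenandoahGC`) gives finer detail.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcTimeSketch {
    // Build one summary line per collector bean reported by this JVM.
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```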

Concurrent thread processing in OpenJDK 17

How can we improve the situation and process thread stacks concurrently? We do so by utilizing a mechanism called stack watermarks (originally implemented by ZGC developers). The central observation is that all of a thread stack's action happens in the top-most frame: the currently executed method. All the frames below that are basically static and don't change, so they can safely be scanned concurrently by garbage collection threads. All we need to do is coordinate GC threads with executing threads whenever a stack frame is destroyed (e.g., by a return to the caller, or by throwing an exception) and thus falls out of GC processing. This coordination is achieved by the stack watermark, a pointer that tells us which parts of the stack are safe to scan, and a barrier that allows garbage collectors to deal with returns. Figure 1 illustrates the role of the stack watermark in concurrent thread processing.

Figure 1: The stack watermark in concurrent thread processing.

Using stack watermarks during garbage collection

Let's consider an example. Say, at the beginning of marking, during the initial pause, we set the stack watermark to the top-most frame of each thread and arm the thread. That means that we consider all frames as safe for concurrent scanning, but none (yet) for execution. Upon returning from the safepoint, we hand control back to the Java program, and thus need the top frame to be safe for execution. Here, the stack-watermark barrier kicks in and lets the garbage collector handle the top frame (and its caller, for practical reasons). The thread scans the top frames, lowers the watermark accordingly, and resumes its own execution at the point where it left off before the safepoint. Meanwhile, GC threads also start scanning the stacks, from the bottom up to the watermark, i.e., in the safe zone. Whenever a thread executes something that would destroy a stack frame, it does a few things:

Lowers the watermark by one frame.
Prevents GC threads from scanning beyond that watermark.
Handles the frame that is now above the watermark, by scanning it for any references.

The end result is that we will effectively scan all the same frames and references as we would have done at the initial mark pause, but we have done so concurrently, while the program is executing.
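The protocol above can be sketched as a small, deterministic simulation. This is a toy model, not HotSpot code: frames are strings, the watermark is an array index, the `synchronized` methods stand in for the real coordination between threads, and the mutator and GC steps are interleaved by hand in `main`. All names (`WatermarkSketch`, `resume`, `returnFromTopFrame`, `gcStep`) are hypothetical, and only returns are modeled.

```java
import java.util.ArrayList;
import java.util.List;

final class WatermarkSketch {
    private final String[] frames;   // index 0 = bottom (oldest) frame
    private final boolean[] scanned;
    private int top;                 // index of the currently executing frame
    private int watermark;           // GC threads may scan frames [0, watermark)
    private int gcCursor = 0;        // next frame a GC thread would scan
    final List<String> log = new ArrayList<>();

    WatermarkSketch(String... frames) {
        this.frames = frames;
        this.scanned = new boolean[frames.length];
        this.top = frames.length - 1;
        this.watermark = frames.length;  // at the pause: everything is GC territory
    }

    // Mutator leaves the safepoint: it handles the top frame and its caller.
    synchronized void resume() {
        lowerWatermarkTo(Math.max(0, top - 1));
    }

    // Mutator returns to its caller, destroying the top frame.
    synchronized void returnFromTopFrame() {
        if (top < 0) throw new IllegalStateException("stack is empty");
        top--;
        lowerWatermarkTo(Math.max(0, top - 1));
    }

    // Lower the watermark first (so GC threads stop short of it), then let
    // the mutator scan the frames that are now above it.
    private void lowerWatermarkTo(int newMark) {
        int old = watermark;
        if (newMark >= old) return;
        watermark = newMark;
        for (int i = old - 1; i >= newMark; i--) scan(i, "mutator");
    }

    // One unit of GC work: scan the next frame below the watermark, if any.
    synchronized boolean gcStep() {
        if (gcCursor >= watermark) return false;
        scan(gcCursor++, "gc");
        return true;
    }

    private void scan(int i, String who) {
        if (!scanned[i]) {           // every frame is scanned exactly once
            scanned[i] = true;
            log.add(who + ":" + frames[i]);
        }
    }

    synchronized boolean allScanned() {
        for (boolean s : scanned) if (!s) return false;
        return true;
    }

    public static void main(String[] args) {
        WatermarkSketch stack = new WatermarkSketch("main", "handle", "parse", "read");
        stack.resume();              // mutator scans read and parse
        stack.gcStep();              // GC scans main
        stack.returnFromTopFrame();  // read returns; mutator scans handle
        System.out.println(stack.log + " allScanned=" + stack.allScanned());
    }
}
```

In this interleaving, the mutator ends up scanning three frames and the GC thread one. With a different interleaving the split changes, but every frame is still scanned exactly once, by whichever side reaches it first.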

Benchmarking Shenandoah GC

So what is the effect of those changes in practice? I have run a number of benchmarks measuring garbage collection pauses. The table below shows the average pause times over all benchmarks in JDK 11, JDK 16, and JDK 17. The difference between JDK 16 and JDK 17 shows the improvement that is achieved by concurrent stack processing. The JDK 11 numbers are shown for completeness; the difference from them also includes the various other improvements from previous releases.

JDK version	Init mark	Final mark
JDK 11	421 µs	1294 µs
JDK 16	321 µs	704 µs
JDK 17	63 µs	328 µs
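To put the table into relative terms, the following sketch computes the percentage reduction in average pause time from the numbers above. The class and method names (`PauseReduction`, `reductionPercent`) are made up; the inputs are copied from the table, in microseconds.

```java
import java.util.Locale;

final class PauseReduction {
    // Relative reduction from `before` to `after`, as a percentage.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Average pause times in microseconds, taken from the table.
        double initMark16 = 321, finalMark16 = 704;
        double initMark17 = 63, finalMark17 = 328;

        System.out.println(String.format(Locale.ROOT,
                "Init mark, JDK 16 to JDK 17: %.1f%% shorter",
                reductionPercent(initMark16, initMark17)));
        System.out.println(String.format(Locale.ROOT,
                "Final mark, JDK 16 to JDK 17: %.1f%% shorter",
                reductionPercent(finalMark16, finalMark17)));
    }
}
```

From JDK 16 to JDK 17, the average init-mark pause shrinks by about 80% (321 µs to 63 µs) and the average final-mark pause by about 53% (704 µs to 328 µs).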

Conclusion

This article explained how concurrent thread stack processing in Shenandoah GC solves the remaining garbage collection pause-time problem and delivers reliable sub-millisecond garbage collection pauses in JDK 17. To learn more, visit the GitHub repository and OpenJDK Wiki page for the Shenandoah GC project. See the previous articles in this series detailing how Shenandoah improves garbage collection pause times in OpenJDK. You can also download the Red Hat build of OpenJDK.

Last updated: January 6, 2023
