Featured image for Java.

The primary motivation behind the Shenandoah garbage collection (GC) project in the OpenJDK was to reduce garbage collection pause times. Reference processing has traditionally been one of the primary contributors to GC pauses. The relationship is mostly linear: The more references the application is churning, the higher is the impact on garbage collection pauses and latency. The key here is "churning," or how many references need to be processed at every GC cycle. The references with referents that never die, or that die along with references themselves, are not a problem.

I have myself recommended in the past that if you care about latency, you had better not churn soft, weak, and phantom references or finalizees. In this article, I want to show why reference processing has contributed to Java garbage collection pauses in the past, and how we solved that problem by making reference processing concurrent in JDK 16.

TL;DR: If your application churns through soft, weak, or phantom references or finalizees, JDK 16 with its concurrent reference processing in Shenandoah GC might significantly improve your application's latency.

References recap

The term references in this article means Java objects of type java.lang.ref.Reference and its subtypes SoftReference, WeakReference, PhantomReference, and FinalReference (more on the last one later). Regular references between objects are also called strong references. Each reference points to one referent. The purpose of the various reference types is to be able to reference an object, but not keep that object from being reclaimed by the reference. Reclamation follows reachability rules, which are specified roughly as follows (in order of decreasing reachability):

  • SoftReference: The referent can be reclaimed as soon as it is no longer reachable by any strong reference. Reclamation is additionally subject to other heuristics; e.g., usually soft references are not expected to be reclaimed unless memory pressure is high. This makes soft references suitable for memory-sensitive (cache) implementations. A rather bad choice, but a choice nonetheless: GC would clear soft references when memory is tight.
  • WeakReference: The referent can be reclaimed as soon as it is no longer reachable by any strong or soft reference. As long as it has not been reclaimed, the referent may still be accessed, at which point it becomes strongly reachable again.
  • FinalReference: You probably don't know this one, because it is a JDK-internal reference type. It is used to implement Object.finalize(). Any object that implements finalize() becomes the referent of a FinalReference and is registered in a corresponding ReferenceQueue. As soon as that object is no longer reachable by strong, soft, or weak references, it is processed and its finalize() method is called.
  • PhantomReference: The referent may be reclaimed as soon as it is no longer reachable by any strong, soft, weak, or final reference; in other words, when it is properly unreachable. The referent can never be accessed. This is done so that the referent cannot be accidentally resurrected after it has been determined to be unreachable. As soon as the reachability of referents has been determined, unreachable referents get cleared (set to null) and their corresponding reference objects get enqueued in a respective ReferenceQueue for further processing by the application. Phantom references are typically used for managing resources such as native memory or operating system file handles that could otherwise not be handled by GC.

Traditional reference processing

In versions prior to JDK 16, Shenandoah discovers and processes references in a way that closely follows the reachability rules in the previous section:

  • First we determine the reachability of all strongly reachable objects during concurrent marking. As soon as the marking wavefront reaches a reference object that does not still have a marked referent (i.e., an object that is not strongly reachable from somewhere else), it stops there, and enqueues the reference in one of four discovery queues for later processing. There is one discovery queue per reference type. A special case concerns soft references: the GC may decide, based on heuristics (e.g., memory pressure and/or object age), whether to treat soft references like strong references.
  • Next, as soon as concurrent (strong) marking is complete, the JVM stops the application and starts processing the discovery queues:
    1. All SoftReference objects are inspected. If the referent is not strongly reachable (i.e., marked by concurrent marking), it is cleared, and the SoftReference is put in the processing queue.
    2. All WeakReference objects are inspected. If the referent is not strongly reachable, it is cleared, and the WeakReference is put in the processing queue.
    3. All FinalReference objects are inspected. If the referent is not strongly reachable, the FinalReference is put in the processing queue, but the referent is not yet cleared, because it is still needed later so that the GC can call its finalize() method. Also, something special happens now: starting from the otherwise unreachable referent, marking is resumed, and the subgraph of objects found from the referent is marked. This is important in the next step, to avoid reclaiming PhantomReference objects that are reachable from finalizees. This additional marking pass is problematic because it happens while the JVM is stopped, and it is theoretically bounded only by the live data set size. In other words, we may spend a lot of time here marking the subgraph from finalizees.
    4. All PhantomReference objects are inspected. If the referent is not reachable (neither strongly nor from finalizees), the referent is cleared, and the PhantomReference is put in the processing queue.
  • Finally, the processing queue is added to a Java linked list for further ReferenceQueue processing.

From these steps, it follows that the contribution of reference processing to GC pause time is basically proportional to number-of-processed-references + size-of-newly-marked-subgraph.

In order to address the pause-time problem of reference processing, we need the GC to do it concurrently. The task of reference processing can be divided into two subtasks: the concurrent marking of referents, including their respective subgraphs, and the concurrent processing and clearing of references. Those two subtasks are entangled in the traditional implementation, so let's see how we can untangle them.

Concurrent reference marking

Upon closer inspection of the traditional implementation, we find that we can simplify our reachability model. We don't really have five levels of reachability (strong, soft, weak, final, phantom), but only two. Let's look at the criteria for classifying references from a different angle:

  • SoftReference objects get cleared and enqueued when the referent is not strongly reachable and we meet a certain heuristic. In essence, we can decide, before or during concurrent marking, whether the SoftReference should be treated like a strong reference or a weak reference.
  • WeakReference objects get cleared and enqueued when the referent is not strongly reachable.
  • FinalReference objects get enqueued when the referent is not strongly reachable.
  • PhantomReferences get cleared and enqueued when the referent is not strongly reachable and not reachable from any FinalReference.

In other words, our two relevant reachability levels are strongly reachable and finalizably reachable.

The traditional implementation determines reachability by marking in the following steps:

  1. During concurrent marking, we establish the set of all strongly reachable objects.
  2. We process all references—soft, weak, and final—that require only this reachability level.
  3. We continue marking from finalizees.
  4. We process the remaining phantom references that also require this new information.

Can we determine both strong and finalizable reachability concurrently? Sure we can. But we need to extend our marking bitmap a little bit: Instead of using one bit per object (marked versus unmarked), we now need two bits to represent all possible states: strongly reachable, finalizably reachable, and unreachable (and a 4th state that is used internally). Equipped with this information, we can now concurrently mark through all the live objects and determine both strong and finalizable reachability:

  1. Start marking from roots with normal strong wavefront.
  2. As soon as a strong wavefront encounters a SoftReference, decide (based on heuristics) whether it should be treated as a strong or weak reference. If the soft references should be treated as strong, normally mark through the referent (marking its subgraph as strong), otherwise stop the wavefront there and enqueue the soft reference in the discovered queue.
  3. When a strong wavefront encounters a WeakReference or PhantomReference, stop the wavefront there and enqueue the reference in the discovered queue.
  4. As soon as a strong wavefront encounters a FinalReference, mark that FinalReference as strong, and switch to the finalizable wavefront for marking the referent. All objects reachable from there will now be marked finalizable.
  5. When a strong wavefront encounters an object that is already marked finalizable, upgrade the object and its subgraph to strong.

We now have concurrently marked all reachable objects and determined whether they are strongly or finalizably reachable. We also have enqueued all reference objects in our discovered queue for further processing. This solves the first half of the problem. We still need to clear the unreachable referents and enqueue the reference objects.

Concurrent reference processing

We established reachability concurrently during marking. This is finished in the final-mark pause, and we also set up for evacuation during that pause. Concurrent reference processing happens at the beginning of the concurrent evacuation phase. We need to do two things:

  1. Scan the discovered queue and clear all unreachable referents.
  2. Enqueue "unreachable" reference objects in a processing queue for further processing on the Java side.

The tasks are mostly as simple as they sound: We concurrently scan the discovered queue, inspect each reference object, and see whether the referent is reachable. We follow the reachability rules cited early: to be reachable, soft, weak, and final references must be strongly reachable, while phantom references must be either strongly or finalizably reachable. If a reference is unreachable, clear the referent and put the reference into the processing queue.

But wait: What if the Java program tries to access the referent before we cleared it? It would still see the referent that we determined to be otherwise unreachable, and would thus resurrect it. This would be a violation of the spec and cause all sorts of troubles, up to JVM crashes, because the GC would subsequently reclaim or override that object, and the Java program would end up with a dangling pointer.

The solution to this problem is a special barrier that we insert in Reference.get(). We already have an LRB there because it is a reference load:

// Pseudocode of Reference.get() intrinsic
T Reference_get() {
  T ref = this.referent;
  return lrb(ref);
}
Let's return null when the referent is unreachable:

// Pseudocode of Reference.get() for concurrent reference processing
T Reference_get() {
  T ref = this.referent;
  if (isUnreachable(ref) {
    return null;
  }
  return lrb(ref);
}

Voila! We now always return null whenever we try to access an unreachable referent, and the Java application never gets to resurrect an object by accident.

Enough theory, show me the numbers

How bad is concurrent garbage collection, actually?

Let's look at a workload that makes modest use of references and finalizees, running under JDK 11. The -Xlog:gc+stats option gives us some statistics:

Pause Final Mark (N)           =    0,278 s (a =     3812 us) (n =    73) (lvls, us =      893,     2891,     3535,     4414,     7707)
  Finish Queues                =    0,020 s (a =      273 us) (n =    73) (lvls, us =       94,      123,      137,      227,     4426)
  Weak References              =    0,214 s (a =     2929 us) (n =    73) (lvls, us =      158,     2109,     2695,     3555,     6072)
    Process                    =    0,213 s (a =     2924 us) (n =    73) (lvls, us =      154,     2109,     2695,     3555,     6067)

This shows that out of the average final-mark pause of 3.8ms, GC spends 2.9ms in weak reference processing, and in the worst case even 4.4ms out of 7.7ms.

Let's look at the same code with JDK 16:

Pause Final Mark (G)           =    0,208 s (a =     2044 us) (n =   102) (lvls, us =      688,      820,      914,     1562,    36975)
Pause Final Mark (N)           =    0,089 s (a =      870 us) (n =   102) (lvls, us =      500,      625,      705,      764,     4489)
  Finish Queues                =    0,034 s (a =      329 us) (n =   102) (lvls, us =      107,      125,      174,      246,     4072)
  Update Region States         =    0,004 s (a =       38 us) (n =   102) (lvls, us =       26,       34,       38,       41,       52)
  Manage GC/TLABs              =    0,001 s (a =       13 us) (n =   102) (lvls, us =        9,       13,       13,       14,       22)
  Choose Collection Set        =    0,019 s (a =      183 us) (n =   102) (lvls, us =       97,      156,      188,      203,      323)
  Rebuild Free Set             =    0,002 s (a =       23 us) (n =   102) (lvls, us =       13,       20,       23,       25,       29)
  Initial Evacuation           =    0,028 s (a =      274 us) (n =   102) (lvls, us =      133,      207,      227,      260,     3264)
    E:                  =    0,247 s (a =     2422 us) (n =   102) (lvls, us =      799,     1875,     2031,     2090,    39583)
    E: Thread Roots            =    0,247 s (a =     2422 us) (n =   102) (lvls, us =      799,     1875,     2031,     2090,    39583)
Concurrent Weak References     =    0,967 s (a =     9479 us) (n =   102) (lvls, us =      152,     4824,     9473,    11914,    39231)
  Process                      =    0,966 s (a =     9470 us) (n =   102) (lvls, us =      148,     4805,     9453,    11914,    39215)
[

Reference processing is completely gone from the final-mark pause, and the pause is down to 0.9ms. Reference processing is now listed separately under "Concurrent Weak References" and takes an average of 9.5ms—but we don't care all that much because it doesn't keep the application from running.

These numbers are from a relatively modest workload. We have worked with a customer workload with literally millions of weak references and finalizees, and as is to be expected, the effect is much more dramatic there.

Conclusion

If you are using weak references, soft reference caches, finalizees, or phantom references for resource cleanups, and you care about garbage collection pauses and latency, you might want to upgrade to JDK 16 and give Shenandoah GC a try.

For more information about Java, please refer to the related Red Hat Developer topic page.