
Have you ever enabled a HotSpot Java virtual machine (JVM) debugging option and seen performance drastically improve? Totally unexpected and confusing. We had that very problem during a JVM warm-up investigation, and this article will detail the steps, options, and tools used to get to the bottom of the issue. Hopefully our journey will help you better understand how to debug unexpected warm-up behavior.

Problem and solution

We observed strange behavior affecting the warm-up performance of the HotSpot JVM while measuring the throughput of Spring Boot's PetClinic application. During the measurement runs, the application warmed up much more quickly when the PrintTieredEvents option was enabled (using -XX:+PrintTieredEvents) than when running with the default settings (i.e., without the PrintTieredEvents option, or with -XX:-PrintTieredEvents).

Note: "Warm-up time" refers to the time it takes to reach the peak throughput. An application's warm-up time depends a lot on the just-in-time (JIT) compilation of hot methods. HotSpot has two JIT compilers, C1 and C2, and, by default, employs tiered compilation where the compilation level ranges from 1 to 4. While the C1 compiler takes care of compilations done at levels 1 to 3, C2 is responsible for level 4 compilations. It is level 4 where all the heavy optimizations are performed and is therefore by far the most CPU-intensive compilation level.

PrintTieredEvents is an option that logs various tiered compilation events.

This was very puzzling behavior because PrintTieredEvents is not expected to alter compilation behavior and, consequently, the performance characteristics of the application. To give an idea of the impact, the warm-up time with the default settings (i.e., without the PrintTieredEvents option, or with -XX:-PrintTieredEvents) was around 14 minutes, but when -XX:+PrintTieredEvents was added, the warm-up time decreased to around 4 minutes. The peak throughput in both cases was comparable, but the time to reach peak throughput was very different (see Figure 1).

Figure 1: Throughput for Spring Boot PetClinic.

The root cause for this unexpected behavior turns out to be competition between compiler and application threads for CPU. With PrintTieredEvents disabled, the application threads run at full throttle. Large compilations get context switched more frequently to allow the application to run, so they take longer to complete. The result is a delay in delivery of the JITted code needed to boost throughput. With PrintTieredEvents enabled, the application threads suspend regularly on I/O waits. Large compilations switch less often or complete without switching, delivering JITted code and higher throughput earlier.

This highlights a configuration issue that HotSpot users need to be aware of: the request rate a server can support during warm-up needs to be capped at a lower value than the rate it can sustain at peak. This was confirmed when we started with fewer JMeter threads during the warm-up phase and then ramped up later; the compiler threads were no longer starved of CPU.

This solution to the puzzle of delayed ramp-up makes sense and is relatively clear and understandable once you see it. However, when faced with the initial, counter-intuitive observed behavior, it was unclear what this resolution might be. It is no accident that we eventually solved the puzzle. HotSpot supports a host of built-in monitoring, tracing, and debugging tools and options that make it possible to gather evidence and analyze these sorts of anomalous behaviors. Although they are familiar to most HotSpot developers, they are less well known by end users, even though they are actually very useful.

The rest of this article explains how we used various available tools and options to arrive at the above resolution of this ramp-up puzzle.

Setup

This investigation was done on a 4-core system running Red Hat Enterprise Linux 8.6. The CPU model was Intel(R) i7-8665U. SMT (simultaneous multithreading) was disabled on the system. To prevent interference from other system processes, the PetClinic app was bound to 2 CPUs. You can achieve this using TuneD with the "cpu-partitioning" profile and then using taskset to pin the process to the desired CPUs. THP (transparent huge pages) and turbo boost were also disabled.

# disable SMT

sudo sh -c "/bin/echo off > /sys/devices/system/cpu/smt/control"



# disable THP

sudo sh -c "/bin/echo "never" > /sys/kernel/mm/transparent_hugepage/enabled"



# disable turbo boost

sudo sh -c "/bin/echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

To generate the load, we used Apache JMeter with 50 threads for a duration of 20 minutes.
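
A sketch of the corresponding JMeter invocation follows; the test plan name and the property names it reads for thread count and duration are assumptions for illustration:

# non-GUI run: 50 threads for 20 minutes (1200 seconds), results written to results.jtl
jmeter -n -t petclinic_load.jmx -Jthreads=50 -Jduration=1200 -l results.jtl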

Debugging

To determine the root cause of this behavior, we enabled the -XX:+PrintCompilation and -XX:+PrintCompilation2 options to get an idea of compilation progress in both -XX:-PrintTieredEvents and -XX:+PrintTieredEvents modes.

Note: -XX:+PrintCompilation logs a message for every compilation. The log includes the time, compilation level, and the method name. -XX:+PrintCompilation2 is a diagnostic option that logs a message when a compilation is completed. This message includes the time taken to compile the method.
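
Because PrintCompilation2 is a diagnostic option, it has to be unlocked with -XX:+UnlockDiagnosticVMOptions. A minimal launch sketch (the jar name is a placeholder):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintCompilation2 -jar spring-petclinic.jar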

An example of the output generated by -XX:+PrintCompilation and -XX:+PrintCompilation2 is:

     56    3       3       jdk.internal.util.Preconditions::checkIndex (18 bytes)

     56    3   size: 560(360) time: 0 inlined: 0 bytes

     56    4       3       java.lang.String::charAt (25 bytes)

     57    4   size: 1224(864) time: 0 inlined: 44 bytes

     58    5       3       java.util.zip.ZipUtils::SH (21 bytes)

     58    5   size: 488(272) time: 0 inlined: 0 bytes

After post-processing the output, we realized that some of the C2 compilations were taking exceptionally long to complete when running with -XX:-PrintTieredEvents compared to -XX:+PrintTieredEvents. More importantly, these results were consistent across multiple runs. Here are some examples of such compilations from the log file:

With -XX:-PrintTieredEvents:

  23692 10541       4       org.thymeleaf.engine.ElementTagStructureHandler::applyAttributes (247 bytes)

  27263 10541   size: 33400(23784) time: 3571 inlined: 3228 bytes



  27293 10281       4       org.thymeleaf.engine.ProcessorTemplateHandler::handleOpenElement (1189 bytes)

  36088 10281   size: 68408(42528) time: 8794 inlined: 7688 bytes



  62596 9872       4       org.thymeleaf.engine.ProcessorTemplateHandler::handleStandaloneElement (1266 bytes)

  68251 9872   size: 65560(39920) time: 5513 inlined: 8636 bytes

Compilation of the same methods when running with -XX:+PrintTieredEvents:

 44083 13049       4       org.thymeleaf.engine.ElementTagStructureHandler::applyAttributes (247 bytes)

  44155 13049   size: 9632(6600) time: 71 inlined: 573 bytes



  27899 9771       4       org.thymeleaf.engine.ProcessorTemplateHandler::handleOpenElement (1189 bytes)

  28323 9771   size: 50120(36416) time: 423 inlined: 3899 bytes



  28526 9789       4       org.thymeleaf.engine.ProcessorTemplateHandler::handleStandaloneElement (1266 bytes)

  28855 9789   size: 57816(34400) time: 329 inlined: 9389 bytes

Here is a table created from the above logs for easier comparison of the compilation times of the methods with and without the PrintTieredEvents option:

Method                                              -XX:-PrintTieredEvents    -XX:+PrintTieredEvents

ElementTagStructureHandler::applyAttributes         3571 ms                   71 ms
ProcessorTemplateHandler::handleOpenElement         8794 ms                   423 ms
ProcessorTemplateHandler::handleStandaloneElement   5513 ms                   329 ms

So now we have a lead to follow. It seems like something is wrong with C2 compilation, at least for these methods.

We decided to focus on the org.thymeleaf.engine.ProcessorTemplateHandler::handleOpenElement() method and trace its compilation to look for oddities and anything that would explain the additional time.

Using LogCompilation and CompilerDirectives

To get compilation logs, HotSpot provides the -XX:+LogCompilation option. OpenJDK has a nice wiki page describing LogCompilation. This option produces XML-based output of the compilations. By default, it generates logs for every compilation, which is far more data than is useful for analysis. Instead, we limited the logs to only the handleOpenElement method under investigation using Compiler Directives.

See the JEP (JEP 165: Compiler Control) and the documentation, which describe how to use them. Compiler directives provide fine-grained control over compilations performed by the C1 and C2 compilers. They are specified in JSON format in a file that is passed to the JVM using the -XX:CompilerDirectivesFile option.

To enable logging for the org.thymeleaf.engine.ProcessorTemplateHandler::handleOpenElement() method, we used the following directives:

[

  {

    match: "org/thymeleaf/engine/ProcessorTemplateHandler.handleOpenElement",

    c2: {

      Log: true,

    },

  },

  {

    match: "*.*",

    Log: false,

  },

]

The above snippet has two directives. The first directive matches the method org/thymeleaf/engine/ProcessorTemplateHandler.handleOpenElement and tells the compiler to generate the logs for C2 compilation(s).

The second directive matches any method (using the wildcard pattern *.*) and disables the generation of compilation logs.

The order of the directives is important here because the compiler applies the first directive whose pattern matches the method being compiled. So, if we reverse the order of directives, then no logs will be generated.

The actual command-line options to use for generating the logs for the target method are:

-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=<file name>

Looking at the logs thus generated, we realized they were missing information about the individual phases of the compilation performed by the C2 compiler. To get that data, we needed to add -XX:+CITimeVerbose to the command line. So the command line expands to:

-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=<file name> -XX:+CITimeVerbose
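
Putting it together, a sketch of the full launch (the jar and directives file names are placeholders; -XX:LogFile is optional and simply names the output file):

java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=compilation.log -XX:CompilerDirectivesFile=directives.json -XX:+CITimeVerbose -jar spring-petclinic.jar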

The compilation logs now show a timestamp (elapsed time in seconds since the start of the application) for the start and end of each phase, from which the time taken can be computed. For instance:

  <phase name="scheduler" nodes="13572" live="13572" stamp="31.857">

    <phase name="ctorChaitin" nodes="13697" live="13572" stamp="31.870">

      <phase_done stamp="31.870"/>

    </phase>

    <phase name="computeLive" nodes="13697" live="13572" stamp="31.871">

      <phase_done stamp="31.883"/>

    </phase>

    <phase_done stamp="31.889”/>


The same information can also be obtained using -XX:+PrintCompilation -XX:+CITimeVerbose, but in that case the output generated by -XX:+CITimeVerbose can be interleaved with the -XX:+PrintCompilation output, which makes it difficult to read the data correctly.

The compilation process can be divided into a few main phases, each of which can have many sub-phases. Compilation time obtained for each phase is compared in the table below. As there are many sub-phases in C2 compilation, we have omitted their details to keep the table short.

Phase        Time (ms) with -XX:-PrintTieredEvents    Time (ms) with -XX:+PrintTieredEvents

parse        870                                       39
optimizer    1877                                      131
matcher      176                                       32
scheduler    32                                        38
regalloc     2522                                      190
output       15                                        20

Other than the scheduler and output phases, the time taken without PrintTieredEvents is roughly 5 to 20 times the time taken with PrintTieredEvents.

One possible reason for the difference in compilation time is different inlining decisions between the two runs. JIT compilation is non-deterministic and depends on profiling counters and other state that vary from run to run, so inlining decisions can also vary across runs. We therefore switched our focus to obtaining the inlining sequence for this method.

Inlining

One approach to obtaining the inlining sequence is to use -XX:+PrintInlining. But we can also extract the same information from the compilation logs we already have. OpenJDK includes a tool called LogCompilation that can parse the compilation log and generate PrintCompilation-style output. It also has various other options for deriving useful information from the compilation log. One of those options prints the inlining decisions, which is the same data generated by -XX:+PrintInlining.

List of options supported by LogCompilation:

Usage: LogCompilation [ -v ] [ -c ] [ -s ] [ -e ] file1 ...

By default, the tool will print the logged compilations ordered by start time.

  -c:   clean up malformed 1.5 xml

  -i:   print inlining decisions

  -S:   print compilation statistics

  -U:   print uncommon trap statistics

  -t:   print with time stamps

  -s:   sort events by start time (default)

  -e:   sort events by elapsed time

  -n:   sort events by name and start

  -z:   sort events by compiled code size

  -C:   compare logs (give files to compare on command line)

  -d:   do not print compilation IDs
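
Assuming the tool has been built from the OpenJDK sources into a jar (the jar name below is a placeholder; the log file is whatever -XX:+LogCompilation produced), printing the inlining decisions looks roughly like this:

# parse the compilation log and print the inlining decisions
java -jar logcompilation.jar -i <compilation log file>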

Comparing the inlining decisions generated by LogCompilation for ProcessorTemplateHandler.handleOpenElement, we did see differences in the inlining sequence. Some call trees were inlined when running with -XX:+PrintTieredEvents but were absent in the logs for -XX:-PrintTieredEvents, and vice versa.

But these inlining decisions also differed across runs, so it was not clear whether these differences were responsible for the compilation time deltas.

HotSpot has a mechanism called replay compilation that allows a compilation to be recorded and replayed later in another JVM process. At this point in our investigation, we decided to try replay compilation: if the same behavior was reproduced, it would give us some confidence that the compilation process was misbehaving somewhere, and if it wasn't, we could, with some certainty, rule out the compilation process from the picture.

Either way it would help us to make further progress.

Replay compilation

How does replay compilation work? During a normal run, the JVM outputs all the dependencies and profiling data used by the compiler during compilation of a method to a replay file. Using this replay file, the JVM can replay the compilation of that particular method. The key assumption is that if all the dependencies are met, then the recorded profiling data would cause the compiler to exhibit the same behavior during replay compilation as in the initial run.

This technique is really helpful in debugging crashes during compilation. But be aware that it is not always successful in recreating the issue, as mentioned in Tobias Hartmann's talk Debugging the Java HotSpot VM.

The first step in replay compilation is to create the replay file. The compiler directives we used earlier also support a DumpReplay option to dump the replay file (See the documentation for the list of supported options). So we updated our directives file as follows:

[

  {

    match: "org/thymeleaf/engine/ProcessorTemplateHandler.handleOpenElement",

    c2: {

      Log: true,

      DumpReplay: true,

    },

  },

  {

    match: "*.*",

    Log: false,

  },

]

This would generate a replay file that captures all the information that the compiler needs to replay the compilation later. To use the replay file, we need to specify two options:

  -XX:+ReplayCompiles -XX:ReplayDataFile=<replay file>

In addition to that, all the classes that the compiler needs to compile the method must be specified on the classpath. How do you know all the classes needed for compilation? You don't. And actually, you don't need to know the exact set of classes. It's sufficient to specify all the jar files used by the app on the classpath, which is certainly a superset of the classes needed by the compiler.
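
A sketch of such a replay run, assuming the application jars live next to the main jar (all names are placeholders; add -XX:+UnlockDiagnosticVMOptions if your JDK build requires it for these flags):

# same class path as the original application, plus the replay options
java -cp "spring-petclinic.jar:lib/*" -XX:+ReplayCompiles -XX:ReplayDataFile=<replay file> <main class>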

We generated a replay file for the ProcessorTemplateHandler.handleOpenElement method without PrintTieredEvents (the slow run) and then replayed it with PrintTieredEvents enabled. The time consumed for the compilation was of the same order as the time taken when running with PrintTieredEvents. This more or less confirmed for us that the problem is not really with the compilation of the method; something else is off here.

Perf

Another tool commonly used to detect where an application is spending time is the Linux perf tool. You can find many resources on using perf for performance investigations. One such resource is Brendan Gregg's blog.

We decided to try out perf-stat on the C2 compiler thread. To do that, you need the thread ID of the C2 compiler thread, which can be obtained easily using jcmd's Thread.print command and grepping for "C2 CompilerThread".

For example:

$ jcmd <PID> Thread.print | grep "C2 CompilerThread"

"C2 CompilerThread0" #13 [1114912] daemon prio=9 os_prio=0 cpu=5134.24ms elapsed=62.64s tid=0x00007fa0bc134200 nid=1114912 waiting on condition  [0x0000000000000000]

Here, the native thread ID is 1114912. We can now run perf-stat on this thread after the application has started:

$ perf stat -t 1114912 -i record -o perf.data

When the application terminates, the perf-stat data can be shown using the following command:

$ perf stat report -i perf.data

To continue with our investigation, we realized ProcessorTemplateHandler.handleOpenElement gets compiled quite early in our run, within 15 seconds of starting the load. So we decided to cut the load duration down to 15 seconds and recorded perf-stat data for the C2 compiler thread.

With PrintTieredEvents disabled, perf-stat showed following output:

          2,741.32 msec task-clock:u              #    0.132 CPUs utilized          

                 0      context-switches:u        #    0.000 /sec                   

                 0      cpu-migrations:u          #    0.000 /sec                   

             3,046      page-faults:u             #    1.111 K/sec                  

     5,029,272,291      cycles:u                  #    1.835 GHz                    

     6,754,992,986      instructions:u            #    1.34  insn per cycle         

     1,414,343,830      branches:u                #  515.935 M/sec                  

        43,941,056      branch-misses:u           #    3.11% of all branches        



      20.771120850 seconds time elapsed

With PrintTieredEvents enabled, the perf-stat data was:

          4,830.36 msec task-clock:u              #    0.238 CPUs utilized          

                 0      context-switches:u        #    0.000 /sec                   

                 0      cpu-migrations:u          #    0.000 /sec                   

               531      page-faults:u             #  109.930 /sec                   

     8,764,471,042      cycles:u                  #    1.814 GHz                    

    11,598,331,303      instructions:u            #    1.32  insn per cycle         

     2,436,266,591      branches:u                #  504.365 M/sec                  

        78,176,556      branch-misses:u           #    3.21% of all branches        



      20.261335595 seconds time elapsed

The two profiles gathered are very different in every relevant event. The main takeaway is that with PrintTieredEvents disabled, the C2 compiler thread gets only about half as much CPU time as with PrintTieredEvents enabled: 2.7 seconds versus 4.8 seconds of CPU time. This clearly indicates that enabling PrintTieredEvents results in the C2 compiler thread getting more CPU cycles.

Recall that PrintTieredEvents causes frequent I/O to log compilation events. These I/O waits give the OS scheduler opportunities to hand the C2 compiler thread more cycles. This is also reflected in the number of context switches reported by perf if we use the context-switches:k event to measure them.
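
One hedged way to gather these per-thread counts (the exact invocation we used may have differed; the thread ID comes from the earlier jcmd output) is:

# count kernel-level context switches for the C2 compiler thread
$ perf stat --per-thread -e context-switches:k -t 1114912

The command runs until interrupted with Ctrl-C, so it can be stopped once the measurement window of interest has passed.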

With -XX:-PrintTieredEvents:

 C2 CompilerThre-3119318                609      context-switches:k

With -XX:+PrintTieredEvents:

 C2 CompilerThre-3121836              4,307      context-switches:k

Another mystery: if perf shows that the C2 compiler thread ran for only 2.7 seconds, why do the logs generated by -XX:+PrintCompilation2 indicate that the compilation of ProcessorTemplateHandler::handleOpenElement took almost 9 seconds? How is that possible?

Looking at the source code, we realized the values reported by -XX:+PrintCompilation2 are wall-clock elapsed time, not the actual CPU time consumed by the compiler thread. The timestamps generated for each phase of compilation by LogCompilation are also elapsed time. The elapsed time is obtained using the clock ID CLOCK_MONOTONIC on Linux. To track the actual compilation time, we updated the HotSpot code to use the clock ID CLOCK_THREAD_CPUTIME_ID, which reports thread-specific CPU time, for each phase. And as expected, with this change the compilation time was almost the same whether the PrintTieredEvents option was disabled or enabled.

We have created an issue with OpenJDK to enhance the compilation logs by logging the actual compilation time of each phase.

Conclusion

With the help of various tools and options provided by HotSpot, we were able to rule out any issue in the compilation process itself. Furthermore, the perf report clearly highlighted the role of CPU contention between the application threads and the C2 compiler thread, which led to the slow warm-up observed when the PrintTieredEvents option was disabled.

Summary

This article shared various tools and techniques that HotSpot provides to give users insight into the compilation strategy and a better understanding of the JIT compilation process. It also introduced some advanced debugging techniques, such as replay compilation.

We hope these techniques will be useful to you in your debugging journey with HotSpot JVM.

Acknowledgements

I would like to acknowledge Roland Westrelin (Red Hat) for helping with this investigation, and Andrew Dinn (Red Hat), Dan Heidinga (Red Hat), and Francesco Nigro (Red Hat) for reviewing the article and suggesting numerous improvements.