How we solved a HotSpot performance puzzle

Debugging with the HotSpot Java virtual machine

September 29, 2023
Ashutosh Mehra
Related topics: Java, Kubernetes
Related products: Red Hat build of OpenJDK

    Have you ever enabled a HotSpot Java virtual machine (JVM) debugging option and had performance drastically improved? Totally unexpected and confusing. We had that very problem during a JVM warm-up investigation, and this article will detail the steps, options, and tools used to get to the bottom of the issue. Hopefully our journey will help you better understand how to debug unexpected warm-up behavior.

    Problem and solution

    We observed strange warm-up behavior from the HotSpot JVM while measuring the throughput of Spring Boot's PetClinic application. During the measurement runs, the application warmed up much more quickly when the PrintTieredEvents option was enabled (using -XX:+PrintTieredEvents) than when running with the default settings (i.e., without the PrintTieredEvents option, or with -XX:-PrintTieredEvents).

    Note: "Warm-up time" refers to the time it takes to reach the peak throughput. An application's warm-up time depends a lot on the just-in-time (JIT) compilation of hot methods. HotSpot has two JIT compilers, C1 and C2, and, by default, employs tiered compilation where the compilation level ranges from 1 to 4. While the C1 compiler takes care of compilations done at levels 1 to 3, C2 is responsible for level 4 compilations. It is level 4 where all the heavy optimizations are performed and is therefore by far the most CPU-intensive compilation level.

    PrintTieredEvents is an option that logs various tiered compilation events.

    This was very puzzling behavior because PrintTieredEvents is not expected to alter compilation behavior and, consequently, the performance characteristics of the application. To give an idea of the impact, the warm-up time with the default settings (i.e., without the PrintTieredEvents option, or with -XX:-PrintTieredEvents) was around 14 minutes, but when -XX:+PrintTieredEvents was added, the warm-up time decreased to around 4 minutes. The peak throughput in both cases was comparable, but the time to reach it was very different (see Figure 1).

    Figure 1: Throughput for Spring Boot PetClinic.

    The root cause for this unexpected behavior turns out to be competition between compiler and application threads for CPU. With PrintTieredEvents disabled, the application threads run at full throttle. Large compilations get context switched more frequently to allow the application to run, so they take longer to complete. The result is a delay in delivery of the JITted code needed to boost throughput. With PrintTieredEvents enabled, the application threads suspend regularly on I/O waits. Large compilations switch less often or complete without switching, delivering JITted code and higher throughput earlier.

    This highlights a configuration issue that HotSpot users need to be aware of. The request rate a server can support during warm-up needs to be capped at a lower value than the rate for peak running. This was confirmed when we started with fewer JMeter threads during the warm-up phase and then ramped up later; the compiler threads were no longer starved of the CPU.

    This solution to the puzzle of delayed ramp-up makes sense and is relatively clear and understandable once you see it. However, when faced with the initial, counter-intuitive observed behavior, it was unclear what this resolution might be. It is no accident that we eventually solved the puzzle. HotSpot supports a host of built-in monitoring, tracing, and debugging tools and options that make it possible to gather evidence and analyze these sorts of anomalous behaviors. Although they are familiar to most HotSpot developers, they are less well known by end users, even though they are actually very useful.

    The rest of this article explains how we used various available tools and options to arrive at the above resolution of this ramp-up puzzle.

    Setup

    This investigation was done on a 4-core system running Red Hat Enterprise Linux 8.6. The CPU model was Intel(R) i7-8665U. SMT (simultaneous multithreading) was disabled on the system. To prevent interference from other system processes, the PetClinic app was bound to 2 CPUs. You can achieve this using TuneD with the "cpu-partitioning" profile and then using taskset to pin the process to the desired CPUs. THP (transparent huge pages) and turbo boost were also disabled.

    # disable SMT
    sudo sh -c "/bin/echo off > /sys/devices/system/cpu/smt/control"

    # disable THP
    sudo sh -c "/bin/echo never > /sys/kernel/mm/transparent_hugepage/enabled"

    # disable turbo boost
    sudo sh -c "/bin/echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

    To generate the load, we used Apache JMeter with 50 threads for a duration of 20 minutes.

    Debugging

    To determine the root cause of this behavior, we enabled the -XX:+PrintCompilation and -XX:+PrintCompilation2 options to get an idea of compilation progress in both -XX:-PrintTieredEvents and -XX:+PrintTieredEvents modes.

    Note: -XX:+PrintCompilation logs a message for every compilation. The log includes the time, compilation level, and the method name. -XX:+PrintCompilation2 is a diagnostic option that logs a message when a compilation is completed. This message includes the time taken to compile the method.

    An example of the output generated by -XX:+PrintCompilation and -XX:+PrintCompilation2 is:

         56    3       3       jdk.internal.util.Preconditions::checkIndex (18 bytes)
         56    3   size: 560(360) time: 0 inlined: 0 bytes
         56    4       3       java.lang.String::charAt (25 bytes)
         57    4   size: 1224(864) time: 0 inlined: 44 bytes
         58    5       3       java.util.zip.ZipUtils::SH (21 bytes)
         58    5   size: 488(272) time: 0 inlined: 0 bytes

    After post-processing the output, we realized that some of the C2 compilations take an exceptionally long time to complete when running with -XX:-PrintTieredEvents compared to -XX:+PrintTieredEvents. More importantly, these results were consistent across multiple runs. Here are some examples of such compilations from the log files:

    With -XX:-PrintTieredEvents:

      23692 10541       4       org.thymeleaf.engine.ElementTagStructureHandler::applyAttributes (247 bytes)
      27263 10541   size: 33400(23784) time: 3571 inlined: 3228 bytes
      27293 10281       4       org.thymeleaf.engine.ProcessorTemplateHandler::handleOpenElement (1189 bytes)
      36088 10281   size: 68408(42528) time: 8794 inlined: 7688 bytes
      62596 9872       4       org.thymeleaf.engine.ProcessorTemplateHandler::handleStandaloneElement (1266 bytes)
      68251 9872   size: 65560(39920) time: 5513 inlined: 8636 bytes

    Compilation of the same methods when running with -XX:+PrintTieredEvents:

      44083 13049       4       org.thymeleaf.engine.ElementTagStructureHandler::applyAttributes (247 bytes)
      44155 13049   size: 9632(6600) time: 71 inlined: 573 bytes
      27899 9771       4       org.thymeleaf.engine.ProcessorTemplateHandler::handleOpenElement (1189 bytes)
      28323 9771   size: 50120(36416) time: 423 inlined: 3899 bytes
      28526 9789       4       org.thymeleaf.engine.ProcessorTemplateHandler::handleStandaloneElement (1266 bytes)
      28855 9789   size: 57816(34400) time: 329 inlined: 9389 bytes

    Here is a table created from the above logs for easier comparison of the compilation times of the methods with and without the PrintTieredEvents option:

    Method                                              -XX:-PrintTieredEvents   -XX:+PrintTieredEvents
    ElementTagStructureHandler::applyAttributes         3571 ms                  71 ms
    ProcessorTemplateHandler::handleOpenElement         8794 ms                  423 ms
    ProcessorTemplateHandler::handleStandaloneElement   5513 ms                  329 ms

    So now we have a lead to follow. It seems like something is wrong with C2 compilation, at least for these methods.
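The post-processing step can be sketched roughly as follows. This is a hypothetical helper, not the exact script we used; the regexes assume the log format shown in the excerpts above (PrintCompilation lines and PrintCompilation2 completion lines paired by compilation ID):

```python
import re

# Start line: "<stamp> <id> <level> <method> (...)" -- captures id, level, method.
START = re.compile(r"^\s*\d+\s+(\d+)\s+(\d)\s+(\S+)")
# Completion line: "<stamp> <id> size: X(Y) time: T ..." -- captures id, time in ms.
DONE = re.compile(r"^\s*\d+\s+(\d+)\s+size: \d+\(\d+\) time: (\d+)")

def slow_c2_compiles(lines, threshold_ms=1000):
    """Return (method, compile time in ms) for level-4 compiles over the threshold."""
    methods = {}  # compilation ID -> (level, method name)
    slow = []
    for line in lines:
        m = START.match(line)
        if m:
            cid, level, method = m.groups()
            methods[cid] = (int(level), method)
            continue
        m = DONE.match(line)
        if m:
            cid, time_ms = m.groups()
            level, method = methods.get(cid, (0, "?"))
            if level == 4 and int(time_ms) >= threshold_ms:
                slow.append((method, int(time_ms)))
    return sorted(slow, key=lambda x: -x[1])
```

Feeding it the -XX:-PrintTieredEvents excerpt above surfaces exactly the Thymeleaf methods listed in the table.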

    We decided to focus on the org.thymeleaf.engine.ProcessorTemplateHandler::handleOpenElement() method and trace its compilation to look for oddities and anything that would explain the additional time.

    Using LogCompilation and CompilerDirectives

    To get compilation logs, HotSpot provides the -XX:+LogCompilation option. OpenJDK has a nice wiki page describing LogCompilation. This option produces XML-based output of the compilations. By default, it generates logs for every compilation, which is far too much data for any useful analysis. Instead, we limited the logs to only the handleOpenElement method under investigation using Compiler Directives.

    See the JEP and the documentation, which describe how to use them. Compiler Directives provide fine-grained control over compilations performed by the C1 and C2 compilers. They are specified in JSON format in a file that is passed to the JVM using the -XX:CompilerDirectivesFile option.

    To enable logging for the org.thymeleaf.engine.ProcessorTemplateHandler::handleOpenElement() method, we used the following directives:

    [
      {
        match: "org/thymeleaf/engine/ProcessorTemplateHandler.handleOpenElement",
        c2: {
          Log: true,
        },
      },
      {
        match: "*.*",
        Log: false,
      },
    ]

    The above snippet has two directives. The first directive matches the method org/thymeleaf/engine/ProcessorTemplateHandler.handleOpenElement and tells the compiler to generate the logs for C2 compilation(s).

    The second directive matches any method (using the wildcard pattern *.*) and disables the generation of compilation logs.

    The order of the directives is important here because the compiler applies the first directive whose pattern matches the method being compiled. So, if we reverse the order of directives, then no logs will be generated.
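The first-match-wins rule can be illustrated with a small model in ordinary Python. This is illustrative only, not JVM code; fnmatch-style wildcards stand in for the directive patterns here:

```python
import fnmatch

def log_enabled(method, directives):
    """Apply the FIRST directive whose pattern matches the method (first match wins)."""
    for d in directives:
        if fnmatch.fnmatchcase(method, d["match"]):
            return d["Log"]
    return False  # no directive matched

directives = [
    {"match": "org/thymeleaf/engine/ProcessorTemplateHandler.handleOpenElement", "Log": True},
    {"match": "*.*", "Log": False},
]
```

With the directives in this order, only the target method gets logging; if the list is reversed, the catch-all *.* wins for every method and no logs are generated.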

    The actual command line options to use for generating the logs for the target method are:

    -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=<file name>

    Looking at the logs thus generated, we realized it was missing the information about each individual phase of the compilation performed by the C2 compiler. To get that data, we need to add -XX:+CITimeVerbose to the command line. So the command line expands to:

    -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=<file name> -XX:+CITimeVerbose

    The compilation logs now show the timestamp (elapsed time in milliseconds since the start of the application) for the start and end of each phase from which the time taken can be computed.  For instance:

      <phase name="scheduler" nodes="13572" live="13572" stamp="31.857">
        <phase name="ctorChaitin" nodes="13697" live="13572" stamp="31.870">
          <phase_done stamp="31.870"/>
        </phase>
        <phase name="computeLive" nodes="13697" live="13572" stamp="31.871">
          <phase_done stamp="31.883"/>
        </phase>
        <phase_done stamp="31.889"/>
      </phase>
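As a rough sketch (assuming a well-formed fragment of this XML, with stamps in seconds since JVM start), the per-phase durations can be computed by subtracting each phase's start stamp from the stamp of its direct phase_done child:

```python
import xml.etree.ElementTree as ET

def phase_durations_ms(xml_text):
    """Map phase name -> duration in ms, from a LogCompilation-style fragment."""
    durations = {}
    for phase in ET.fromstring(xml_text).iter("phase"):
        done = phase.find("phase_done")  # direct child only; nested phases are visited separately
        if done is not None:
            durations[phase.get("name")] = round(
                (float(done.get("stamp")) - float(phase.get("stamp"))) * 1000.0, 3)
    return durations
```

Applied to the scheduler fragment above, this yields 32 ms for the scheduler phase as a whole and 12 ms for its computeLive sub-phase.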


    The same information can also be obtained using -XX:+PrintCompilation -XX:+CITimeVerbose, but in that case the output generated by -XX:+CITimeVerbose can be interleaved with the -XX:+PrintCompilation output, making it difficult to read the data correctly.

    The compilation process can be divided into a few main phases, each of which can have many sub-phases. Compilation time obtained for each phase is compared in the table below. As there are many sub-phases in C2 compilation, we have omitted their details to keep the table short.

    Phase       Time in ms with -XX:-PrintTieredEvents   Time in ms with -XX:+PrintTieredEvents
    parse       870                                      39
    optimizer   1877                                     131
    matcher     176                                      32
    scheduler   32                                       38
    regalloc    2522                                     190
    output      15                                       20

    Other than the scheduler and output phases, the time taken without PrintTieredEvents is roughly 5–20 times the time taken with PrintTieredEvents.

    One possible reason for the difference in compilation time could be different inlining decisions between the two runs. JIT compilation is non-deterministic and depends on profiling counters that vary from run to run, so inlining decisions can also differ across runs. We therefore switched our focus to getting the inlining sequence for this method.

    Inlining

    One approach to obtaining the inlining sequence is to use -XX:+PrintInlining. But we can also extract the same information from the compilation logs we already have. There is a tool called LogCompilation available in OpenJDK. This tool can parse the compilation log and generate PrintCompilation-style output. It also has various other options to derive useful information from the compilation log; one of them prints the inlining decisions, the same data as generated by -XX:+PrintInlining.

    List of options supported by LogCompilation:

    Usage: LogCompilation [ -v ] [ -c ] [ -s ] [ -e ] file1 ...
    By default, the tool will print the logged compilations ordered by start time.
      -c:   clean up malformed 1.5 xml
      -i:   print inlining decisions
      -S:   print compilation statistics
      -U:   print uncommon trap statistics
      -t:   print with time stamps
      -s:   sort events by start time (default)
      -e:   sort events by elapsed time
      -n:   sort events by name and start
      -z:   sort events by compiled code size
      -C:   compare logs (give files to compare on command line)
      -d:   do not print compilation IDs

    Comparing the inlining decisions generated by LogCompilation for ProcessorTemplateHandler.handleOpenElement, we did see differences in the inlining sequence. Some call trees were inlined when running with -XX:+PrintTieredEvents but were absent in the logs for -XX:-PrintTieredEvents, and vice versa.

    But these inlining decisions also differed across runs, so it was not clear whether these differences were responsible for the compilation time deltas.

    HotSpot has a mechanism called replay compilation that allows a compilation to be recorded and replayed later in another JVM process. At this point in our investigation, we decided to try replay compilation: if the same behavior was reproduced, it would give us some confidence that the compilation process was misbehaving somewhere. And if it wasn't, we could, with some certainty, rule the compilation process out of the picture.

    Either way it would help us to make further progress.

    Replay compilation

    How does replay compilation work? During a normal run, the JVM outputs all the dependencies and profiling data used by the compiler during compilation of a method to a replay file. Using this replay file, the JVM can replay the compilation of that particular method. The key assumption is that if all the dependencies are met, then the recorded profiling data would cause the compiler to exhibit the same behavior during replay compilation as in the initial run.

    This technique is really helpful in debugging crashes during compilation. But be aware that it is not always successful in recreating the issue, as mentioned in Tobias Hartmann's talk Debugging the Java HotSpot VM.

    The first step in replay compilation is to create the replay file. The compiler directives we used earlier also support a DumpReplay option to dump the replay file (See the documentation for the list of supported options). So we updated our directives file as follows:

    [
      {
        match: "org/thymeleaf/engine/ProcessorTemplateHandler.handleOpenElement",
        c2: {
          Log: true,
          DumpReplay: true,
        },
      },
      {
        match: "*.*",
        Log: false,
      },
    ]

    This would generate a replay file that captures all the information that the compiler needs to replay the compilation later. To use the replay file, we need to specify two options:

      -XX:+ReplayCompiles -XX:ReplayDataFile=<replay file>

    In addition to that, all the classes that the compiler needs to compile the method must be specified on the classpath. How do you know all the classes needed for compilation? You don't. And actually, you don't need to know the exact set of classes. It is sufficient to specify all the JAR files used by the app on the classpath, which should certainly be a superset of the classes needed by the compiler.

    We generated a replay file for the ProcessorTemplateHandler.handleOpenElement method without PrintTieredEvents (the slow run) and then used it with PrintTieredEvents enabled. The time consumed by the compilation was of the same order as the time taken when running with PrintTieredEvents. This more or less confirmed that the problem is not with the compilation of the method itself. Something else was off here.

    Perf

    Another tool commonly used to detect where an application is spending time is the Linux perf tool. You can find many resources on using perf for performance investigations. One such resource is Brendan Gregg's blog.

    We decided to try out perf-stat on the C2 compiler thread. To do that, you need the native thread ID of the C2 compiler thread, which can be obtained easily using jcmd's Thread.print command and grepping for "C2 CompilerThread".

    For example:

    $ jcmd <PID> Thread.print | grep "C2 CompilerThread"
    
    "C2 CompilerThread0" #13 [1114912] daemon prio=9 os_prio=0 cpu=5134.24ms elapsed=62.64s tid=0x00007fa0bc134200 nid=1114912 waiting on condition  [0x0000000000000000]

    Here, the native thread ID is 1114912. We can now run perf-stat on this thread after the application has started:

    $ perf stat -t 1114912 -i record -o perf.data

    When the application terminates, the perf-stat data can be shown using the following command:

    $ perf stat report -i perf.data

    To continue with our investigation, we realized ProcessorTemplateHandler.handleOpenElement gets compiled quite early in our run, within 15 seconds of starting the load. So we cut the load time down to 15 seconds and recorded perf-stat data for the C2 compiler thread.

    With PrintTieredEvents disabled, perf-stat showed the following output:

              2,741.32 msec task-clock:u              #    0.132 CPUs utilized
                     0      context-switches:u        #    0.000 /sec
                     0      cpu-migrations:u          #    0.000 /sec
                 3,046      page-faults:u             #    1.111 K/sec
         5,029,272,291      cycles:u                  #    1.835 GHz
         6,754,992,986      instructions:u            #    1.34  insn per cycle
         1,414,343,830      branches:u                #  515.935 M/sec
            43,941,056      branch-misses:u           #    3.11% of all branches

          20.771120850 seconds time elapsed

    With PrintTieredEvents enabled, the perf-stat data was:

              4,830.36 msec task-clock:u              #    0.238 CPUs utilized
                     0      context-switches:u        #    0.000 /sec
                     0      cpu-migrations:u          #    0.000 /sec
                   531      page-faults:u             #  109.930 /sec
         8,764,471,042      cycles:u                  #    1.814 GHz
        11,598,331,303      instructions:u            #    1.32  insn per cycle
         2,436,266,591      branches:u                #  504.365 M/sec
            78,176,556      branch-misses:u           #    3.21% of all branches

          20.261335595 seconds time elapsed

    The two profiles gathered are very different in every relevant event. The main takeaway is that when running with PrintTieredEvents disabled, the C2 compiler thread got only about half as much CPU time as with PrintTieredEvents enabled, i.e., 2.7 seconds versus 4.8 seconds of CPU time. This clearly indicates that enabling PrintTieredEvents results in the C2 compiler thread getting more CPU cycles.

    Recall that PrintTieredEvents causes frequent I/O to log compilation events. These I/O operations allow the OS scheduler to give the C2 compiler thread more cycles. This is also reflected in the number of context switches reported by perf when measured with the context-switches:k event.

    With -XX:-PrintTieredEvents:

     C2 CompilerThre-3119318                609      context-switches:k

    With -XX:+PrintTieredEvents:

     C2 CompilerThre-3121836              4,307      context-switches:k

    Another mystery remained: if perf shows the C2 compiler thread ran for only 2.7 seconds, why do the logs generated by -XX:+PrintCompilation2 indicate that the compilation of ProcessorTemplateHandler::handleOpenElement alone took almost 9 seconds? How is that possible?

    Looking at the source code, we realized the values reported by -XX:+PrintCompilation2 are the system-wide elapsed time, not the actual CPU time consumed by the compiler thread. The timestamps generated for each phase of compilation in the LogCompilation output are elapsed time as well. The elapsed time is obtained using the CLOCK_MONOTONIC clock ID on Linux. To track the actual compilation time, we updated the HotSpot code to use the CLOCK_THREAD_CPUTIME_ID clock ID, which reports the thread-specific CPU time for each phase. As expected, the compilation time was then almost the same whether the PrintTieredEvents option was disabled or enabled.
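The distinction between the two clock IDs is easy to demonstrate outside HotSpot. This minimal Python sketch (Linux) shows thread CPU time diverging from monotonic elapsed time whenever the thread is off-CPU, which is exactly what happened to the C2 compiler thread:

```python
import time

def measure():
    """Return (wall, cpu) seconds for a mix of CPU work and off-CPU waiting."""
    wall0 = time.clock_gettime(time.CLOCK_MONOTONIC)
    cpu0 = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
    total = sum(range(200_000))  # burn a little CPU
    time.sleep(0.2)              # off-CPU: counts toward wall time, not thread CPU time
    wall = time.clock_gettime(time.CLOCK_MONOTONIC) - wall0
    cpu = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID) - cpu0
    return wall, cpu
```

Here wall comes out at 0.2 seconds or more while cpu stays far smaller, mirroring how a context-switched compile can report 8.8 seconds of elapsed time against 2.7 seconds of CPU time.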

    We have created an issue with OpenJDK to enhance the compilation logs by logging the actual compilation time of each phase.

    Conclusion

    With the help of various tools and options provided by HotSpot, we were able to rule out any issue in the compilation process. Furthermore, the perf report clearly highlighted the role of CPU contention between application threads and the C2 compiler thread, leading to the slow warm-up observed when PrintTieredEvents option was disabled.

    Summary

    This article shared various tools and techniques that HotSpot provides to give users useful insights into the compilation strategy, and thus a better understanding of the JIT compilation process. It also introduced some advanced debugging techniques, such as replay compilation.

    We hope these techniques will be useful to you in your debugging journey with HotSpot JVM.

    Acknowledgements

    I would like to acknowledge Roland Westrelin (Red Hat) for helping in this investigation, and Andrew Dinn (Red Hat), Dan Heidinga (Red Hat), and Francesco Nigro (Red Hat) for reviewing the content and suggesting numerous changes to improve the article.
