In a previous article, I explained that OpenJDK's Just-in-Time (JIT) compiler relies heavily on speculation to achieve higher performance. This optimization technique generates code under an assumption that is expected to be correct but can't be proven certain. If the assumption turns out to be wrong, the JIT compiler prevents incorrect execution by returning control to the bytecode interpreter. This process is called deoptimization.
This article explains how profiling at runtime can improve speculation and contribute to optimization in other ways. I'll show examples of profile data collected at runtime, explain how the JIT compiler uses the data, and illustrate the benefits of runtime profiling.
Patterns in compiler speculation
Speculation means that the compiler starts by making an assumption that is likely true. Sometimes, as with null checks, a reasonable assumption is almost obvious: Null checks rarely cause an exception, so speculating that they don't cause one is expected to pay off.
A similar pattern exists for range checks. Range checks are tests that guard references to array elements in Java code to protect the program from out-of-bound accesses. In practice, an array access looks like the following pseudocode:
int[] array; // an array of some type
if (array == null) {
  throw new NullPointerException();
}
if (i < 0 || i >= array.length) {
  throw new ArrayIndexOutOfBoundsException();
}
int v = array[i];
Failing range checks are usually caused by programmer errors, as is the case for null pointer exceptions. Therefore, it is not considered crucial for performance to handle a failed check efficiently. With that consideration in mind, the previous snippet can be compiled down to:
int[] array;
if (array == null) {
  deoptimize();
}
if (i < 0 || i >= array.length) {
  deoptimize();
}
int v = array[i];
If a thread makes an out-of-bound access, it reaches the deoptimize() pseudo-call, which causes the thread's execution to exit the compiled code and resume in the interpreter at the exact same point in the Java method. Interpreted execution then throws the exception. As a consequence, the exception-throwing logic does not need to be included in the compiled code.
Why we need runtime profiling
Consider this example illustrating the limits of speculation:
Object o;
...
if (o instanceof C) {
  ...
}
It would be nice to have the compiler speculate here, as well, regarding whether the instanceof test will succeed. But without knowing anything else from the program, how can we know whether the test is more likely to succeed or fail? Can we guess a likely type for the o variable from this expression? A good answer to either question appears unlikely from looking at this code snippet alone.
Using runtime data to drive our decisions is a much better solution. If we expect the test to fail, we can compile the snippet down to the following code. It falls through to the rest of the compiled code when the test fails, and returns to the interpreter when the test succeeds:
if (o instanceof C) {
  deoptimize();
}
If we expect the test to succeed, we can instead insert the following code. It falls through the rest of the compiled code when the test succeeds and returns to the interpreter when the test fails:
if (!(o instanceof C)) {
  deoptimize();
}
Hot methods and tiered compilation
My previous article explained that JIT compilation is resource hungry. Because of this, it is beneficial only if resources are focused on "hot" methods, which are expected to execute often, rather than having all methods that execute at least once be highly optimized.
As a consequence, all code is executed initially either in the interpreter or as compiled code generated at a low level of optimization. Only hot methods run at the highest level of optimization. This optimization choice is known as tiered compilation. The highest level of optimization is the high tier. For the rest of this discussion, I will designate all other levels as low tiers.
Runtime profiling for hot methods
Profiling helps determine when a transition to a hot method is justified. Code at low tiers is instrumented to increment counters that enable the detection of hot methods. When counters cross a threshold, the transition to highly optimized compiled code is triggered.
For hot methods, the speed of highly optimized code is crucial. The speed of code running at low tiers is not as important, as long as methods that turn out to be hot are promoted to highly optimized code early in the program run.
The key idea, then, is to observe what the program does when it runs at low tiers by using instrumentation and to record important observations, effectively collecting profile data. Once a method is being compiled by the high tier, the compiler can look for meaningful data points in what was collected and use them to drive speculation. Profiling slows down execution speed at the low tiers, but, carefully crafted profiling pays off overall as a slight slowdown to code running at low tiers lets us generate code at the high tier that's measurably faster. That, in turn, can result in an overall speedup of applications running on the JVM.
Benefits of profile-guided optimization
Profile-guided optimizations are not specific to virtual machines. Some offline compilers feature them. The way to take advantage of them is usually to have training runs driven by synthetic workloads and then feed the data to the build process that produces the production-ready binaries. As a consequence, in offline compilers, leveraging this optimization has the drawback of requiring extra work from developers. With the JVM design explained in this section, profile data is collected during the application's execution. The resulting data should be more faithful and comes at no cost to the developer.
Example #1: Optimizing a subtype check with profile data
What kind of profile data would make sense for the type check shown earlier? Are we only interested in the success or failure of the instanceof test? Knowing that would allow us to trim unneeded paths from the methods, which is certainly good, but we should hope for more.
Consider the cost of instanceof (and type checks, in general). As implemented in HotSpot, the OpenJDK JVM, type checks generally require several tests and even, in some cases, a loop. Can profile data help both determine the result of the type check and improve its execution time?
HotSpot collects the type of the object involved in a type check. In the snippet shown in the previous section, let's say that o is always observed to be of type A. Then, the snippet can be compiled down to the following (expressed as pseudocode):
if (o.getClass() != A.class) {
  deoptimize();
}
if (C.class.isAssignableFrom(A.class)) { // this was o instanceof C
  ...
}
The o instanceof C check can be transformed to C.class.isAssignableFrom(A.class) because the preceding runtime check guarantees that o is of type A. Given that both A and C are constant, that second check folds at compile time and is eliminated.
As a result, the time taken by instanceof improves and an unneeded path in the method is trimmed.
But there's another benefit to this optimization. With the snippet compiled into the following code, o would be known to be a subclass of C following a successful test:
if (o instanceof C) {
  deoptimize();
}
But if the snippet is compiled to the following, o is known more precisely to be of class A following the test:
if (o.getClass() != A.class) {
  deoptimize();
}
Knowing the exact type of o will likely help us further optimize code that follows, whereas knowing simply that o is a subtype of C wouldn't. Suppose that later in the method, we have:
if (o instanceof B) {
  ...
}
and B is a subtype of C. Knowing only that o is a subtype of C doesn't help with this test. Knowing that o is of class A allows this test to fold away.
As a summary, in this case, profile data helps us to:
- Remove an unneeded path.
- Decrease the cost of the type check.
- Improve the compiler's knowledge of types so that it can better optimize the surrounding code.
This example has shown how profile data improves speculation (through the first two items in the list) and positively interacts with existing classical optimizations (through the third item).
Example #2: Optimizing never-taken branches
Now, what if, in the example snippet, o is seen to have either type A or B, both subtypes of C? Profile data then doesn't report a single type for o, and the transformation discussed in the previous section can't apply anymore. We would still want the never-taken branch to optimize out, though.
In HotSpot, low tiers count the number of times a branch is taken. For conditional branches (remember that HotSpot operates on Java bytecode that can contain unconditional branches), HotSpot also counts the number of times the branch is not taken.
This profile data, in turn, indicates whether a branch is always taken (indicated when the untaken count is zero) or never taken (indicated when the taken count is zero). This information allows the JIT to trim paths that are expected to never be executed. The optimization applies to all types of branches, not only to those checking for a subtype.
Peeking at profile data
As is true for most aspects of code execution in HotSpot, it's hard to observe which profile is available to the high tier and how the profile is used. The observation requires carefully crafted benchmarks to observe performance variations, or looking at the generated assembly code. One way to peek into what profile data is collected is to run the java command with the (very verbose) -XX:+PrintMethodData diagnostic option. For instance, assume we want to profile the following example:
public class TestProfileData {
    public static void main(String[] args) {
        for (int i = 0; i < 20_000; i++) {
            test(new A());
//            test(new B());
//            test(new D());
        }
    }
    private static void test(Object o) {
        if (o instanceof C) {
            System.out.println("A subtype");
        }
    }
    private static class C {
    }
    private static class A extends C {
    }
    private static class B extends C {
    }
    private static class D {
    }
}
Running the code with the -XX:+PrintMethodData option produced the following output with an OpenJDK 16 build:
$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintMethodData TestProfileData
...
static TestProfileData::test(Ljava/lang/Object;)V
  interpreter_invocation_count:     7308 
  invocation_counter:               7308 
  backedge_counter:                    0 
  mdo size: 576 bytes
0 fast_aload_0
1 instanceof 16 <TestProfileData$C>
  0   bci: 1    ReceiverTypeData    count(0) nonprofiled_count(0) entries(1)
                                    'TestProfileData$A'(6797 1.00)
4 ifeq 15
  56  bci: 4    BranchData          taken(0) displacement(88)
                                    not taken(6797)
7 getstatic 18 <java/lang/System.out/Ljava/io/PrintStream;> 
10 fast_aldc A subtype
12 invokevirtual 26 <java/io/PrintStream.println(Ljava/lang/String;)V> 
  88  bci: 12   VirtualCallData     count(0) nonprofiled_count(0) entries(1)
                                    'java/io/PrintStream'(6797 1.00)
15 return
...
A long output is produced. The truncated output shown here includes data for the method performing the instanceof test. The output lists all bytecodes of the method. Below the method name and signature, counters that are used to trigger compilation appear (see lines containing interpreter_invocation_count, invocation_counter, and backedge_counter).
Then for bytecode 1, the instanceof test, TestProfileData$A is listed as the only class seen at this bytecode (see the line containing ReceiverTypeData and the following line). Bytecode 4 is the branch that takes the result of the instanceof test as input. The branch is shown as never taken (see the line containing BranchData and the following line).
Uncommenting line 5 of the source code, which generates code using the B type, leads to a slightly different profile:
static TestProfileData::test(Ljava/lang/Object;)V
  interpreter_invocation_count:     6882 
  invocation_counter:               6882 
  backedge_counter:                    0 
  mdo size: 576 bytes
0 fast_aload_0
1 instanceof 19 <TestProfileData$C>
  0   bci: 1    ReceiverTypeData    count(0) nonprofiled_count(0) entries(2)
                                    'TestProfileData$B'(3186 0.50)
                                    'TestProfileData$A'(3185 0.50)
4 ifeq 15
  56  bci: 4    BranchData          taken(0) displacement(88)
                                    not taken(6371)
7 getstatic 21 <java/lang/System.out/Ljava/io/PrintStream;> 
10 fast_aldc A subtype
12 invokevirtual 29 <java/io/PrintStream.println(Ljava/lang/String;)V> 
  88  bci: 12   VirtualCallData     count(0) nonprofiled_count(0) entries(1)
                                    'java/io/PrintStream'(6371 1.00)
15 return
Two different classes are seen at the instanceof test with roughly equal counts (see the line containing ReceiverTypeData and following two lines). The branch is still never taken (see the line containing BranchData and the following line).
Uncommenting line 6 in the source code, which generates code using the D type, yields the following:
static TestProfileData::test(Ljava/lang/Object;)V
  interpreter_invocation_count:     8452 
  invocation_counter:               8452 
  backedge_counter:                    0 
  mdo size: 576 bytes
0 fast_aload_0
1 instanceof 22 <TestProfileData$C>
  0   bci: 1    ReceiverTypeData    count(4294964573) nonprofiled_count(0) entries(2)
                                    'TestProfileData$D'(4 1.00)
                                    'TestProfileData$A'(2723 680.75)
4 ifeq 15
  56  bci: 4    BranchData          taken(2723) displacement(88)
                                    not taken(5445)
7 getstatic 24 <java/lang/System.out/Ljava/io/PrintStream;> 
10 fast_aldc A subtype
12 invokevirtual 32 <java/io/PrintStream.println(Ljava/lang/String;)V> 
  88  bci: 12   VirtualCallData     count(0) nonprofiled_count(0) entries(1)
                                    'java/io/PrintStream'(5445 1.00)
15 return
It’s notable (see the line containing ReceiverTypeData and the following two lines), that only two types are recorded at the instanceof test, even though three should be seen at runtime. One type (class B in this run) was not recorded.
The reason B is missing is that the compiler wouldn’t be able to take advantage of the third recorded type if there was one, so it makes little sense to record it. Actually, the compiler doesn’t even need the second recorded type. All it would need is some indication that more than one type was seen. The decision to collect two types at the instanceof test, when one would be good enough, is an artifact of the implementation: The same logic for type profiling is used at virtual calls, where the compiler can make use of more than one type entry in the profile data. The global count (the line containing ReceiverTypeData) is what tells the compiler that some types were not recorded.
The branch is now reported as usually not taken, but still sometimes taken (see the line containing BranchData and following line).
Conclusion
This article offered a couple of examples of code patterns that are optimized by collecting profile data at runtime and relying on speculation. Profiling is such a powerful technique that the HotSpot JVM uses it in many other ways not discussed here. For instance, profiling helps determine what call is hot in a method and drives inlining, a key performance optimization. Profiling and speculation together are especially powerful, because they can dramatically improve performance while transparently adapting to the workload being run on the JVM. They also interact well with other classic optimizations.
Last updated: January 12, 2024