The just-in-time (JIT) compiler in OpenJDK improves Java performance through a number of optimizations, particularly in loops. Until recently, many optimizations worked only when the loop index was an `int` variable. This article shows how the HotSpot virtual machine was upgraded to add the same optimizations for `long` variables, focusing in particular on out-of-bounds checking (also called *range checks*).

## Why optimization was added for long variables

One of the important promises of Java, along with many other modern languages, is to catch out-of-bounds errors, such as when you mistakenly end a loop at `array.length` instead of at `array.length-1`. The HotSpot virtual machine eliminates range checks when possible as a performance optimization. As discussed in a previous article, the JIT compiler emits range checks when it cannot prove that an index will stay within the bounds of an array.

The JIT compiler implements range checks as follows:

```
for (int i = start; i < stop; i += stride) {
    if (scale * i + offset >=u array.length) { // range check
        deoptimize();
    }
    // access to element scale * i + offset of array
}
```

In the previous code, `>=u` is an unsigned comparison, and the `deoptimize()` function causes execution of the thread to continue in the interpreter, where the out-of-bounds exception is thrown.
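The unsigned comparison folds the two signed checks (`index < 0` and `index >= length`) into a single test, because a negative index reinterpreted as unsigned becomes a very large value. A minimal sketch of that equivalence using the standard `Integer.compareUnsigned` method (the `UnsignedCheck` class name is invented for this example):

```java
public class UnsignedCheck {
    // Naive bounds test: two signed comparisons.
    static boolean twoChecks(int i, int length) {
        return i < 0 || i >= length;
    }

    // Same test as a single unsigned comparison: a negative i
    // reinterpreted as unsigned is larger than any valid length.
    static boolean oneUnsignedCheck(int i, int length) {
        return Integer.compareUnsigned(i, length) >= 0;
    }

    public static void main(String[] args) {
        int[] lengths = {0, 1, 10, Integer.MAX_VALUE};
        int[] indices = {-5, -1, 0, 1, 9, 10, Integer.MAX_VALUE};
        for (int len : lengths) {
            for (int i : indices) {
                if (twoChecks(i, len) != oneUnsignedCheck(i, len)) {
                    throw new AssertionError("mismatch at i=" + i + ", len=" + len);
                }
            }
        }
        System.out.println("unsigned trick holds");
    }
}
```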

Until recent versions of OpenJDK, optimization of range checks worked only if the loop variable `i` was an `int`. For a `long`, range checks would always be performed. Many other optimizations were also unavailable.

The reason for restricting loop optimization to `int` indexes was that they were considered the only ones common enough to deserve special treatment. One reason is that loops commonly iterate over arrays. Because the size of a Java array is a 32-bit integer, an `int` is the natural choice for the loop variable.

Usages are evolving, though. The Panama Project offers developers a better way to access off-heap memory areas. Offsets within memory are 64 bits, so loops that iterate over memory with that API tend to use a `long` loop variable.
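As a concrete illustration, here is a sketch of such a long-indexed loop over off-heap memory, assuming the foreign memory API as finalized in JDK 22 (`java.lang.foreign`); the `LongIndexedLoop` class and `fillAndSum` helper are invented for this example:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class LongIndexedLoop {
    // Fills an off-heap segment of `count` ints with 0..count-1, then sums
    // them back. Offsets into a MemorySegment are 64-bit, so the natural
    // loop variable is a long.
    static long fillAndSum(long count) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(count * Integer.BYTES);
            for (long i = 0; i < count; i++) {
                seg.set(ValueLayout.JAVA_INT, i * Integer.BYTES, (int) i);
            }
            long sum = 0;
            for (long i = 0; i < count; i++) {
                sum += seg.get(ValueLayout.JAVA_INT, i * Integer.BYTES);
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        System.out.println(fillAndSum(1_000)); // 0 + 1 + ... + 999 = 499500
    }
}
```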

The lack of proper optimizations for `long` counted loops initially was such a pain point for the Panama Project that it implemented workarounds in the library code to detect cases where the offsets fit into a 32-bit integer. In such cases, the project included special code that the JIT could better optimize. But once the improvements covered in this article were rolled out, those workarounds became unnecessary and could be removed. The end result is simplified library code with better overall performance.

The loop shape that needs to be optimized is as follows:

```
for (long i = start; i < stop; i += stride) {
    if (scale * i + offset >=u length) { // range check
        deoptimize();
    }
    // access to memory at offset scale * i + offset
}
```

The discussion in this article assumes a loop that goes upward (that is, an index that increases), along with a positive `scale`. The project has generalized the optimization to accommodate a decreasing loop and a negative `scale` as well, but we won't include those cases because they would overcomplicate the discussion.

## Recognizing and optimizing long counted loops

Teaching the compiler to recognize the new loop shape with a `long` loop variable instead of an `int` loop variable is fairly straightforward, but is not sufficient to enable existing optimizations such as range check elimination, loop unrolling, and vectorization. Those optimization passes must be adapted to operate on a `long` counted loop as well. Doing so is not a simple matter of including `long` variables, because some of the optimizations have to deal with integer overflow. They were protected from overflow for `int` variables by promoting some 32-bit integer values to 64-bit integers. Supporting these same optimizations on 64-bit integers would require promotion to the next larger integer type (probably 128 bits), for which the compiler lacks support.

Is there a way to upgrade the existing optimizations for a `long` without having to rewrite these optimizations and incur a high risk of introducing bugs?

The solution we went with started by transforming the `long` counted loop into a nested loop with an `int` counted inner loop. The previous example was transformed roughly to:

```
for (long i = start; i < stop;) {
    int j;
    for (j = 0; j < min(stop - i, max_int); j += (int)stride) {
        if (scale * (j + i) + offset >=u length) { // range check
            deoptimize();
        }
        // access to memory at offset scale * (j+i) + offset
    }
    i += j;
}
```

Here, `max_int` represents the largest signed 32-bit integer. The loop variable for the inner loop is also a 32-bit integer. That inner loop has the shape of a counted loop, so it's subject to existing counted loop optimizations such as unrolling. Adding an extra loop has some overhead, but if the original loop executes for a large number of iterations, most of the time should be spent in the inner loop and the cost of setting it up should be negligible.
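The equivalence of the two loop shapes can be sketched in plain Java. In this sketch the `sumDirect` and `sumStripMined` helpers are invented for the example, and the inner index is kept as a `long` for brevity (the JIT uses an `int`, since each chunk covers at most `max_int` iterations):

```java
public class StripMine {
    // The original shape: a single long-indexed loop that visits
    // start, start + stride, ... and sums the visited indexes.
    static long sumDirect(long start, long stop, long stride) {
        long sum = 0;
        for (long i = start; i < stop; i += stride) {
            sum += i;
        }
        return sum;
    }

    // The transformed shape: an outer long loop around an inner loop that
    // covers at most Integer.MAX_VALUE iterations per chunk. The element
    // index is i + j, and i advances by the distance the inner loop covered.
    static long sumStripMined(long start, long stop, int stride) {
        long sum = 0;
        for (long i = start; i < stop; ) {
            long innerLimit = Math.min(stop - i, Integer.MAX_VALUE);
            long j;
            for (j = 0; j < innerLimit; j += stride) {
                sum += i + j;
            }
            i += j;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Both shapes visit the same indexes, so the sums agree.
        System.out.println(sumDirect(0, 100, 3) == sumStripMined(0, 100, 3));
    }
}
```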

Note that this transformation is valid only for `stride` values that fit into a 32-bit integer, and pays off mostly if `stride` is a relatively small 32-bit integer. (Otherwise, the inner loop executes only a small number of iterations and the overhead of nested loops is bigger.)

This extra transformation has the benefit of enabling several existing loop optimizations for `long` loop counters. But one important optimization still isn't triggered by the transformed loop: eliminating the range check. Indeed, the range check in the previous loop nest is still expressed in terms of 64-bit integers: `j+i` and `length`. The next section discusses how we make sure the range check has a proper shape so that the compiler recognizes it and enables range check optimization.

## A new API point for range checks

A prerequisite for range check elimination is to make sure the loop has a shape that the compiler can properly optimize. In particular, the range check in the loop has to follow a canonical shape like the following, with an unsigned comparison and a deoptimization if the range check fails:

```
if (scale * i + offset >=u length) {
    deoptimize();
}
```

Array accesses in Java include a built-in range check, so the compiler is free to generate the pattern that works best for optimization. In the Panama Project's memory access API, the range check is not built into the language, so the check would have to be performed explicitly by the API in code similar to:

```
long o = scale * i + offset;
if (o >= length || o < 0) {
    throw new SomeException();
}
```

The JIT compiler would then have to recognize this code pattern as a range check, which would be complicated to perform reliably. After all, there would be more than one way to write this logic.

We have a more robust solution, which involves extending the core Java libraries with a new API point for range checks:

`long o = java.util.Objects.checkIndex(scale * i + offset, length);`

The `checkIndex()` call already exists for `int` arguments.

We have made the JIT compiler aware of the new API function. The compiler can use its own implementation for `checkIndex()` as long as its behavior appears identical to the Java implementation to outside callers.

The technique of replacing a standard implementation of a function with one provided by the compiler is called a *compiler intrinsic*. That implementation can be carefully crafted to allow optimizations.

The new `checkIndex()` function and the corresponding underlying intrinsic are available as of JDK 16. Note that the new function is not restricted to the memory access API. It offers a reliable way of performing range checks that can be optimized well by the virtual machine, making the function a valuable addition to the core libraries for all developers.
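A quick sketch of the new method's behavior on JDK 16 or later (the `CheckIndexDemo` class name is invented for this example): in bounds, `checkIndex()` returns the index unchanged; out of bounds, it throws `IndexOutOfBoundsException`.

```java
import java.util.Objects;

public class CheckIndexDemo {
    public static void main(String[] args) {
        long length = 100;

        // In bounds: checkIndex returns the index unchanged.
        long ok = Objects.checkIndex(42L, length);
        System.out.println(ok); // prints 42

        // Out of bounds (index == length): checkIndex throws.
        try {
            Objects.checkIndex(100L, length);
            System.out.println("not reached");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("out-of-bounds index rejected");
        }
    }
}
```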

## Optimizing `long` range checks

So far, we have discussed a transformation of the loop so that it becomes suitable for existing optimizations. In the same spirit, this section discusses how to transform range checks in order to trigger existing range check optimizations for `long` indexes. We need to reshape the nested loop to:

```
for (long i = start; i < stop;) {
    int j;
    int scale' = ..;
    int offset' = ..;
    int length' = ..;
    for (j = 0; j < min(stop - i, max_int); j += (int)stride) {
        if (scale' * j + offset' >=u length') {
            deoptimize();
        }
        // access to memory at offset scale * (j+i) + offset
    }
    i += j;
}
```

The `scale'`, `offset'`, and `length'` variables end with a prime symbol (`'`) to denote a variable that is derived from or related to another variable. These variables are 32-bit integers that are invariant in the inner loop. Because the range check is expressed as a 32-bit comparison that operates on the loop variable of the inner loop, which is itself a loop with a 32-bit index, existing optimizations are triggered.

Assuming, for instance, that loop predication optimizes this loop nest, the result would be roughly:

```
for (long i = start; i < stop;) {
    int j;
    int scale' = ..;
    int offset' = ..;
    int length' = ..;
    if (scale' * 0 + offset' >=u length') {
        deoptimize();
    }
    if (scale' * jmax + offset' >=u length') {
        deoptimize();
    }
    for (j = 0; j < min(stop - i, max_int); j += (int)stride) {
        // access to memory at offset scale * (j+i) + offset
    }
    i += j;
}
```

`jmax` is the largest value `j` takes in the inner loop for a particular iteration `i`.

Assuming, as we did before, that the inner loop runs for a large number of iterations, the range check becomes essentially free.

`scale'`, `offset'`, and `length'` have to be derivatives of `scale`, `offset`, `length`, and `i`, the variables of the initial range check, which are all 64-bit integers. Let's see how to compute them.

`scale'` can be set to `scale`, but only if it fits in a 32-bit integer. Otherwise, there's no way to transform the range check. Another tricky issue here is that `scale' * j` could overflow the `int` range. One easy way around that problem is to adjust the inner loop's bounds so that overflow never happens, such as:

```
for (long i = start; i < stop;) {
    int j;
    int scale' = scale;
    int offset' = ..;
    int length' = ..;
    for (j = 0; j < min(stop - i, max_int / scale); j += (int)stride) {
        if (scale' * j + offset' >=u length') {
            deoptimize();
        }
        // access to memory at offset scale * (j+i) + offset
    }
    i += j;
}
```

If `scale` turns out to be a relatively large 32-bit integer, the number of iterations of the inner loop is small and this transformation is unlikely to pay off. The Panama Project expects a small `scale`. Typically, this variable is the size of some basic data type.

Let's note the range of values of `j` for some iteration `i` of the outer loop as `[0, jmax]`. Then `scale * (i + j) + offset` is in the range `[scale * i + offset, scale * (i + jmax) + offset]` (remember that this discussion assumes a positive `scale`). Let's call that interval `[range_min, range_max]`. It's then possible to express `offset'` and `length'` in terms of `range_min` and `range_max`. In the simplest case (`range_min >= 0`, no overflow), we set:

`offset' = 0`

`length' = max(range_min, min(length, range_max+1)) - range_min`
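The claim that these definitions preserve the outcome of the range check can be spot-checked by brute force. The following sketch (class and method names invented for the example) compares the original check and the transformed check for every `j` in `[0, jmax]`, under the stated assumptions (`scale > 0`, `range_min >= 0`, no overflow):

```java
public class RangeCheckTransform {
    // Returns true if the transformed range check agrees with the original
    // one for every j in [0, jmax], given a particular length.
    static boolean checksAgree(long scale, long offset, long i, long jmax, long length) {
        long rangeMin = scale * i + offset;
        long rangeMax = scale * (i + jmax) + offset;
        long offsetPrime = 0;
        long lengthPrime = Math.max(rangeMin, Math.min(length, rangeMax + 1)) - rangeMin;
        for (long j = 0; j <= jmax; j++) {
            // Original check: scale * (i + j) + offset <u length
            boolean original = Long.compareUnsigned(scale * (i + j) + offset, length) < 0;
            // Transformed check: scale' * j + offset' <u length'
            boolean transformed = Long.compareUnsigned(scale * j + offsetPrime, lengthPrime) < 0;
            if (original != transformed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // scale = 4, offset = 8, i = 10, jmax = 1000: range is [48, 4048].
        // Try lengths below, inside, and above that interval.
        for (long length : new long[] {0, 47, 48, 2000, 4048, 4049, 10_000}) {
            System.out.println(length + ": " + checksAgree(4, 8, 10, 1000, length));
        }
    }
}
```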

Let's see why that works as expected—that is, why the transformed range check (expressed with `j`) succeeds whenever the initial range check (expressed with `i`) succeeds, and fails when the initial check fails.

We know that `j` falls within `[0, jmax]`. Let `range` be the value `scale * (i + j) + offset` and `range'` be the value `scale' * j + offset'`. Using variables we defined earlier, `range` is in `[range_min, range_max]` and `range = range' + range_min`. What happens for various values of `length`?

### When length is greater than range_max

In this case, the range check always succeeds. `length'` is `range_max+1 - range_min`. The range check becomes:

`range' <u range_max+1 - range_min`

This expression is always true because `range` is within `[range_min, range_max]`, and so `range'` is within `[0, range_max - range_min]`.

### When length is less than range_min

In this case, the range check always fails. `length'` is `0`. The range check becomes `range' <u 0`, which is always false (no unsigned value is less than zero).

### When length is somewhere in [range_min, range_max]

If `length` is in `[range_min, range_max]`, the range check succeeds sometimes and fails other times. `length'` is `length - range_min`. The range check becomes:

`range' <u length - range_min`

which is equivalent to:

```
0 <= range' < length - range_min
```

or

```
range_min <= range' + range_min < length
```

or

`range_min <= range < length`

Because we assumed `range_min` to be positive, this is the same as `range <u length`, the range check before transformation.

`length'` must be a 32-bit integer, but is computed in terms of 64-bit values. Note, however, that `range_max - range_min` fits in a 32-bit integer, because the loop bounds require `scale * j` to fit in a 32-bit integer. That, in turn, guarantees that `length'` can safely be stored as a 32-bit value.

## Extensive optimizations accommodate more Java loops

This article covered recent optimizations in the OpenJDK HotSpot virtual machine that support loops with `long` loop variables, with a particular focus on range checks operating on them. I gave an overview of the code patterns that developers can expect to be properly optimized. The article also demonstrated how new APIs require special virtual machine support and how the entire Java platform evolves to meet changing usages. Finally, I showed how reshaping a loop with a `long` index enables a range of optimizations.

The cases we didn't cover in this article—loops going downward and a negative `scale`—are discussed in a lengthy comment added to the JDK change request, augmented by this follow-up.