The just-in-time (JIT) compiler in OpenJDK improves Java performance through a number of optimizations, particularly in loops. Until recently, many optimizations worked only when the loop index was an `int`

variable. This article shows how the HotSpot virtual machine was upgraded to add the same optimizations for `long`

variables. The article covers particularly out-of-bounds checking (also called *range checks*).

## Why optimization was added for long variables

One of the important promises of Java, along with many other modern languages, is to catch out-of-bounds errors, such as when you mistakenly end a loop at `array.length`

instead of at `array.length-1`

. The HotSpot virtual machine eliminates range checks when possible as a performance optimization. As discussed in a previous article, the JIT compiler enables range checks when the compiler is not sure that an index will stay within the bounds of an array.

The JIT compiler implements range checks as follows:

```
for (int i = start; i < stop; i += stride)) {
if (scale * i + offset >=u array.length) { // range check
deoptimize();
}
// access to element scale * i + offset of array
}
```

In the previous code, `>=u`

is an unsigned comparison, and the `deoptimize()`

function causes execution of the thread to continue in the interpreter where the out of bound exception is thrown.

Until recent versions of OpenJDK, optimization of range checks worked only if the loop variable `i`

was an `int`

. For a `long`

, range checks would always be performed. Many other optimizations were also unavailable.

The reason for restricting loop optimization to `int`

indexes was that they were considered the only ones common enough to deserve special treatment. One reason is that loops commonly iterate over arrays. Because the size of a Java array is a 32-bit integer, an `int`

is the natural choice for the loop variable.

Usages are evolving, though. The Panama Project offers developers a better way to get access to off-heap memory areas. Offsets within memory are 64 bits, so loops that iterate over memory with that API tend to use a `long`

loop variable.

The lack of proper optimizations for `long`

counted loops initially was such a pain point for the Panama Project that it implemented workarounds in the library code to detect cases where the offsets fit into a 32-bit integer. In such cases, the project included special code that the JIT could better optimize. But once the improvements covered in this article were rolled out, those workarounds became unnecessary and could be removed. The end result is simplified library code with better overall performance.

The loop shape that needs to be optimized is as follows:

```
for (long i = start; i < stop; i += stride)) {
if (scale * i + offset >=u length) { // range check
deoptimize();
}
// access to memory at offset scale * i + offset
}
```

The discussion in this article assumes a loop that goes upward (that is, an index that increases), along with a positive `scale`

. The project has generalized the optimization to accommodate a decreasing loop and a negative `scale`

as well, but we won't include those cases because they would overcomplicate the discussion.

## Recognizing and optimizing long counted loops

Teaching the compiler to recognize the new loop shape with a `long`

loop variable instead of an `int`

loop variable is fairly straightforward, but is not sufficient to enable existing optimizations such as range check elimination, loop unrolling, and vectorization. Those optimization passes must be adapted to operate on a `long`

counted loop as well. Doing so is not a simple matter of including `long`

variables, because some of the optimizations have to deal with integer overflow. They were protected from overflow for `int`

variables by promoting some 32-bit integer values to 64-bit integers. Supporting these same optimizations on 64-bit integers would require promotion to the next larger integer type (probably 128 bits), for which the compiler lacks support.

Is there a way to upgrade the existing optimizations for a `long`

without having to rewrite these optimizations and incur a high risk of introducing bugs?

The solution we went with started by transforming the `long`

counted loop into a nested loop with an `int`

counted inner loop. The previous example was transformed roughly to:

```
for (long i = start; i < stop;) {
int j;
for (j = 0; j < min(stop - i, max_int); j += (int)stride) {
if (scale * (j + i) + offset >=u length) { // range check
deoptimize();
}
// access to memory at offset scale * (j+i) + offset
}
i += j;
}
```

Here, `max_int`

represents the largest signed 32-bit integer. The loop variable for the inner loop is also a 32-bit integer. That inner loop has the shape of a counted loop, so it's subject to existing counted loop optimizations such as unrolling. Adding an extra loop has some overhead, but if the original loop executes for a large number of iterations, most of the time should be spent in the inner loop and the cost of setting it up should be negligible.

Note that this transformation is valid only for `stride`

values that fit into a 32-bit integer, and pays off mostly if `stride`

is a relatively small 32-bit integer. (Otherwise, the inner loop executes only a small number of iterations and the overhead of nested loops is bigger.)

This extra transformation has the benefit of enabling several existing loop optimizations to `long`

loop counters. But one important optimization still isn't triggered by the transformed loop: Eliminating the range check. Indeed, the range check in the previous loop nest is still expressed in terms of 64-bit integers: `j+i`

and `length`

. The next section discusses how we make sure the range check has a proper shape so that the compiler recognizes it and enables range check optimization.

## A new API point for range checks

A prerequisite for range check elimination is to make sure the loop has a shape that the compiler can properly optimize. In particular, the range check in the loop has to follow a canonical shape like the following, with an unsigned comparison and a deoptimization if the range check fails:

```
if (scale * i + offset >=u length) {
deoptimize();
}
```

The array accesses in Java include a range check, so the compiler is free to generate the pattern that works best for optimization. In the Panama Project's memory access API, the range check is not built into the language. So the check would have to be performed explicitly by the API in code similar to:

```
long o = scale * i + offset;
if (o >= length || o < 0) {
throw new SomeException();
}
```

The JIT compiler would then have to recognize this code pattern as a range check, which would be complicated to perform reliably. After all, there would be more than one way to write this logic.

We have a more robust solution, which involves extending the core Java libraries with a new API point for range checks:

`long o = java.util.Objects.checkIndex(scale * i + offset, length);`

The `checkIndex()`

call already exists for `int`

arguments.

We have made the JIT compiler aware of the new API function. The compiler can use its own implementation for `checkIndex()`

as long as its behavior appears identical to the Java implementation to outside callers.

The technique of replacing a standard implementation of a function with one provided by the compiler is called a *compiler intrinsic*. That implementation can be carefully crafted to allow optimizations.

The new `checkIndex()`

function and the corresponding underlying intrinsic are available as of JDK 16. Note that the new function is not restricted to the memory access API. It offers a reliable way of performing range checks that can be optimized well by the virtual machine, making the function a valuable addition to the core libraries for all developers.

## Optimizing long-range checks

So far, we have discussed a transformation of the loop so that it becomes suitable for existing optimizations. In the same spirit, this section discusses how to transform range checks in order to trigger existing range check optimizations for `long`

indexes. We need to reshape the nested loop to:

```
for (long i = start; i < stop;) {
int j;
int scale' = ..;
int offset' = ..;
int length' = ..;
for (j = 0; j < min(stop - i, max_int); j += (int)stride) {
if (scale' * j + offset' >=u length') {
deoptimize();
}
// access to memory at offset scale * (j+i) + offset
}
i += j;
}
```

The `scale'`

, `offset'`

, and `length'`

variables end with a prime symbol (`'`

) to denote a variable that is derived from or related to another variable. These variables are 32-bit integers that are invariant in the inner loop. Because the range check is expressed as a 32-bit comparison that operates on the loop variable of the inner loop, which is itself a loop with a 32-bit index, existing optimizations are triggered.

Assuming for instance that loop predication optimizes this loop nest, the result would be roughly:

```
for (long i = start; i < stop;) {
int j;
int scale' = ..;
int offset' = ..;
int length' = ..;
if (scale' * 0 + offset' >=u length') {
deoptimize();
}
if (scale' * jmax + offset' >=u length') {
deoptimize();
}
for (j = 0; j < min(stop - i, max_int); j += (int)stride) {
// access to memory at offset scale * (j+i) + offset
}
i += j;
}
```

`jmax`

is the largest value `j`

takes in the inner loop for a particular iteration `i`

.

Assuming, as we did before, that the inner loop runs for a large number of iterations, the range check becomes essentially free.

`scale'`

, `offset'`

, and `length'`

have to be derivatives of `scale`

, `offset`

, `length`

, and `i`

, the variables of the initial range check. They are all 64-bit integers. Let's see how to compute them.

`scale'`

can be set to `scale`

, but only if it fits in a 32-bit integer. Otherwise, there's no way to transform the range check. Another tricky issue here is that `scale' * j`

could overflow the `int`

range. One easy way around that problem is to adjust the inner loop's bounds so that overflow never happens, such as:

```
for (long i = start; i < stop;) {
int j;
int scale' = scale;
int offset' = ..;
int length' = ..;
for (j = 0; j < min(stop - i, max_int / scale); j += (int)stride) {
if (scale' * j + offset' >=u length') {
deoptimize();
}
// access to memory at offset scale * (j+i) + offset
}
i += j;
}
```

If `scale`

turns out to be a relatively large 32-bit integer, the number of iterations of the inner loop is small and this transformation is unlikely to pay off. The Panama Project expects a small `scale`

. Typically, this variable is the size of some basic data type.

Let's note the range of values of `j`

for some iteration `i`

of the outer loop as`[0, jmax]`

. Then `scale * (i + j) + offset`

is in the range`[scale * i + offset, scale * (i + jmax) + offset]`

(remember that this discussion assumes a positive `scale`

). Let's call that interval `[range_min, range_max]`

. It's then possible to express `offset'`

and `length'`

in terms of `range_min`

and `range_max`

. In the simplest case (`range_min >= 0`

, no overflow), we set:

`offset' = 0`

`length' = max(range_min, min(length, range_max+1)) - range_min`

Let's see why that works as expected—that is, why the transformed range check (expressed with `j`

) succeeds whenever the initial range check (expressed with `i`

) succeeds, and fails when the initial check fails.

We know that `j`

falls within `[0, jmax]`

. Then `range`

has the value `scale * (i + j) + offset`

and `range'`

has the value `scale * j + offset'`

. Using variables we defined earlier, `range`

is in `[range_min, range_max]`

and `range = range' + range_min`

. What happens for various values of `length`

?

### When length is greater than range_max

In this case, the range check always succeeds. `length'`

is `range_max+1 - range_min`

. The range check becomes:

`range' <u range_max+1 - range_min`

This expression is always true because `range`

is within `[range_min, range_max]`

and so `range'`

is within `[0, range_max - range_min]`

.

### When length is less than range_min

In this case, the range check always fails. `length'`

is `0`

. The range check becomes `range'<u 0`

, which is always false (no unsigned value can be negative).

### When length is somewhere in [range_min, range_max]

If `length`

is in `[range_min, range_max]`

, the range check succeeds sometimes and fails other times. `length'`

is `length - range_min`

. The range check becomes:

`range' <u length - range_min`

which is equivalent to:

```
0 <= range' < length - range_min
```

or

```
range_min <= range' + range_min < length
```

or

`range_min <= range < length`

Because we assumed `range_min`

to be positive, this is the same as `range <u length`

, the range check before transformation.

`length'`

must be a 32-bit integer, but is computed in terms of 64-bit values. Note, however, that `range_max - range_min`

fits in a 32-bit integer, because loop bounds require `scale * j`

to fit in a 32-bit integer. That, in turn, guarantees that `length'`

can safely be stored as a 32-bit value.

## Extensive optimizations accommodate more Java loops

This article covered recent optimizations in the OpenJDK HotSpot virtual machine that support loops with `long`

loop variables, with a particular focus on range checks operating on it. I gave an overview of the code patterns that developers can expect to be properly optimized. The article also demonstrated how new APIs require special virtual machine support and how the entire Java platform evolves to meet changing usages. Finally, I showed how reshaping a loop with a `long`

index enables a range of optimizations.

The cases we didn't cover in this article—loops going downward and a negative `scale`

—are discussed in a lengthy comment added to the JDK change request, augmented by this follow-up.

*Last updated: February 11, 2024*