In Java, as I discussed in a previous article on Red Hat Developer, every array access is guarded by a range check, a test that guards references to array elements and protects the program from out-of-bound accesses by throwing an ArrayIndexOutOfBoundsException
. In this article, you'll see how HotSpot, OpenJDK's Java Virtual Machine, can improve application performance by cleverly transforming code to eliminate the need for range checks.
Note: This article uses interval notation, which consists of parentheses and brackets, to describe ranges. If you're unfamiliar with it, check out the table on interval notation in this document.
Why range checks matter for performance
Code where your application spends much of its time, called hot code, is just-in-time compiled in HotSpot. Given that out-of-bounds accesses are usually programmer errors, HotSpot compiles the bound check for an array[i]
read or write down to:
if (i < 0 || i >= array.length) {
deoptimize();
}
In other words, if the check fails, thread execution switches to the interpreter rather than continuing in compiled code. The interpreter re-executes the bound check and throws the exception.
The two comparisons of the bound check can be expressed as a single unsigned one:
if (i >=u array.length) {
deoptimize();
}
Indeed, a negative i
, once converted into an unsigned number, is an integer larger than the maximum signed integer.
Most array accesses are performed in loops. As a result, array range checks are frequently executed and finding ways to decrease their cumulative cost is particularly important.
The basics of range check elimination in loops
Consider a generic loop with an array bound check, which we'll call Loop 1:
for (int i = start; should_continue(i); i = next(i)) {
if (element(i) >=u array.length) {
deoptimize();
}
// some code that uses element(i)'s array element
}
Here, should_continue()
tests a loop exit condition on i
, next()
computes the value of i
for the next iteration, and element()
computes the index of the array element to access.
Now, assume that, over the set of values that i
takes as the loop executes, you can compute that element(i)
is in the range [elementmin
, elementmax
]. If the range check succeeds for both bounds of the range, it's guaranteed to succeed for all i
values in that loop. This observation allows the elimination of the bound check from the loop. That's illustrated in the following code, which we'll call Loop 2:
if (element_min >=u array.length) {
deoptimize();
}
if (element_max >=u array.length) {
deoptimize();
}
for (int i = start; should_continue(i); i = next(i)) {
// some code that uses element(i)'s array element
}
If the range check fails for some value of i
in Loop 1, one of the two checks before the loop body of Loop 2 will catch it. At that point, the thread would deoptimize, continue execution in the interpreter, and execute the loop body with actual bound checks for each iteration until it hits the value of i
that causes the out-of-bound exception to be thrown. Note that the tests that precede the loop above only tell us that the bound check fails for some value of i
, but not for which value.
Now, consider a loop body that looks like the following. We'll call it Loop 3:
for (int i = start; should_continue(i); i = next(i)) {
if (element(i) >=u array.length) {
throw new ArrayIndexOutOfBoundsException();
}
// some code that uses element(i)'s array element
}
If you applied the same transformation to Loop 3 that turned Loop 1 into Loop 2, you'd get the following—call it Loop 4:
if (element_min >=u array.length) {
throw new ArrayIndexOutOfBoundsException();
}
if (element_max >=u array.length) {
throw new ArrayIndexOutOfBoundsException();
}
for (int i = start; should_continue(i); i = next(i)) {
// some code that uses element(i)'s array element
}
But this transformation isn't legal. If an out-of-bound access occurs for some i
in Loop 3, the exception is thrown before Loop 4 starts executing and not at the iteration where the failing access should execute. What makes the transformation from Loop 1 to Loop 2 legal is the fact that range checks deoptimize rather than throw. In short, speculation is what allows this optimization.
Now consider the following code, which we'll call Loop 5:
for (int i = start; should_continue(i); i = next(i)) {
if (some_condition) {
// frequently executed but not always
if (element(i) >=u array.length) {
deoptimize();
}
}
// some code that uses element(i)'s array element
}
Loop 5 could be transformed as follows into Loop 6:
if (element_min >=u array.length) {
deoptimize();
}
if (element_max >=u array.length) {
deoptimize();
}
for (int i = start; should_continue(i); i = next(i)) {
if (some_condition) {
// frequently executed but not always
}
}
In this case, it's possible that one of the conditions before Loop 6 fails even though no range check fails in Loop 5. In other words, for the values of i
for which the range check would fail in Loop 5, the range check is not executed (because in those situations some_condition
is false). In that case, Loop 6 causes a deoptimization, and when execution resumes in the interpreter, no exception is thrown, so the false positive doesn't affect correctness.
The technique described here to hoist range checks out of loops is called loop predication in HotSpot.
Range check elimination in practice
Obviously, computing elementmin
and elementmax
in the most general case is out of reach for the compiler. HotSpot addresses this by restricting the shape of the loop functions should_continue()
and next()
and of the addressing function element()
to shapes commonly used by developers. These shapes are simple enough that it's feasible to reason about them. In particular:
next(i)
isi = i + stride
, withstride
a positive (the loop goes up) or negative (the loop goes down) constant.should_continue(i)
isi < stop
ori > stop
(withstop
some integer value) depending on whether the loop goes up or down.element(i)
isscale * i + offset
, withscale
andoffset
required to be loop-invariant integer values.
A loop with the shape defined by points 1 and 2 above is called a counted loop. (Note that the following discussion assumes that scale
and stride
are positive numbers, but the reasoning described here can be generalized to all combinations of signs for scale
and stride
.)
The loop variable i
takes values in the range [start
, stop
). The last value that i
takes is in the range [stop
- stride
, stop
). That is, if stride
is 1, the last value of i
is in [stop
- 1, stop
)—or, in other words, is stop
- 1. If stride
is 2, the last value of i
is in [stop
- 2, stop
) and can be stop
- 2 or stop
- 1, depending on the exact value of start
. Let's call that last value last
.
With i
in the range [start
, last
], it's then feasible to compute the minimal and the maximal values taken by element(i)
: element(i)
is in [scale
* start
+ offset
, scale
* last
+ offset
].
last
can be computed with:
start
+ ((stop
- start
- 1) / stride
) * stride
The division operations here round down. For instance, if start
is 1, stop
is 10, and stride
is 3, i
takes the values 1, 4, and 7. last
is 1 + ((10 - 1 - 1) / 3) * 3 = 7.
With elementmin
and elementmax
computed, it's now possible to hoist the range check using the technique described previously.
Another range check elimination technique
Again, hoisting of range checks out of loop as described so far is only possible when the range checks are speculated never to fail, and a failing range check causes deoptimization. However, HotSpot implements a second range check elimination pass that wouldn't in principle require speculation, although in practice, it's only implemented in HotSpot for speculative range checks, as speculation is so pervasive.
Given a counted loop and a range check, assume you can compute [imin
, imax
), which is a range of values of i
for which you know the range check is guaranteed to not fail. How can this be leveraged to eliminate bound checks for most loop iterations? The trick, shown in the next listing, is to make copies of the loop and run each copy for a smaller number of iterations; the combination of the copies runs all the iterations.
Note that the main loop has no range check; there's no need for one, as it executes the range of values of i
for which there can't be a range check failure. For most programs, most of the execution time should then be spent in that loop. If a range check fails, then it happens in the pre or post loops, which still have the required logic to throw an exception.
int i;
// pre loop
for (i = start; i < imin; i += stride)) {
if (scale * i + offset >=u array.length) {
deoptimize();
}
// some code that uses element(i)'s array element
}
// main loop
for (; i < imax; i += stride)) {
// some code that uses element(i)'s array element
}
// post loop
for (; i < stop; i += stride)) {
if (scale * i + offset >=u array.length) {
deoptimize();
}
// some code that uses element(i)'s array element
}
Recall the final loop shape pattern outlined above: element(i)
is scale * i + offset
, with scale
and offset
required to be loop-invariant integer values, and with scale
and stride
assumed to be positive. A range check that has this pattern is guaranteed to not fail for:
scale
*i
+offset
>=0scale
*i
+offset
<array.length
Which implies:
i
>= -offset
/scale
i
< (array.length
-offset
) /scale
It's actually a bit trickier than that because integer rounding has to be taken into account. For instance, if offset
is -1 and scale
is 2, the formula in bullet 1 above gives i
>= 0; but for i
= 0, scale
* i
+ offset
= -1, which is an out-of-bound access. So instead:
i
>= -offset
/scale
+ 1i
< (array.length
-offset
) /scale
Integer rounding can lead to a slightly pessimistic but still correct inequality for the formula in bullet 2 here, as the integer division rounds down. For instance, if array.length
is 10, offset
is 1 and scale
is 2, the formula in the second bullet results in i
< 4; but for i
= 4, scale
* i
+ offset
= 9, which is a valid access.
Using the inequalities above, all it takes is to set:
imin
= min(stop
, (-offset
+ 1) /scale
)imax
= min(stop
, (array.length
-offset
) /scale
)
This can be generalized to multiple range checks (with different values of scale
, offset
, and array.length
) and all combinations of signs for scale
and stride
.
Why counted loops are key to performance
Counted loops lend themselves well to optimizations other than range check elimination. Unrolling is a particularly important technique used in HotSpot. Unrolling involves making multiple copies of the loop body, one after another, to create a new loop body that executes several iterations of the initial loop at a time with a single exit test at the end of all the copies. For instance, a loop body can be unrolled 8 times. If the initial loop iterates 80 times, then after unrolling it only executes the unrolled loop body 10 times.
But what if the iteration count is 81? Then executing the unrolled loop body an extra time would cause the iteration count to reach 88! To solve this issue, unrolling also takes advantage of loop clones. The main loop body is unrolled to execute the 8 copies of the initial loop. The main loop's bounds are adjusted so that when only 1 iteration (in this example) is left, execution leaves the unrolled loop. Execution then continues in a copy that executes iterations one at a time.
Unrolling is important because the cost of the exit test is amortized over several iterations, and because it enables optimizations to apply to a larger body where redundancies are likely to be found and eliminated. One particularly important optimization that's enabled by unrolling in HotSpot is vectorization. A vector CPU instruction is an instruction that processes multiple elements at a time rather than operating on a single element of an array. For instance, with vector instructions, a CPU can load or store multiple array elements or perform arithmetic operations on multiple elements with a single instruction.
Code shapes that lend themselves well to vectorization usually experience a major performance improvement. HotSpot implements auto-vectorization by looking in an unrolled loop body for memory or arithmetic operations that apply to adjacent array elements. This rarely happens in most untransformed loops. However, unrolling by a factor of 4 or 8 can often produce code that includes operations on the exact number of adjacent elements that can be stored in a vector register.
Conclusion
This article has covered range check elimination in loops, a key optimization for Java because every array access is guarded by a bound check, and arrays are usually used heavily in loops. You learned how speculation enables such optimization and how targeting common loop shapes allows HotSpot to eliminate bound checks. It also covers counted loops, a particular category of loops commonly found in developers' code, and how they are particularly suitable for optimizations.