Coding shared image

For languages like Rust and C++, one of the main obstacles to developer productivity are long compilation times. Build times for large code bases can easily range from tens of minutes to hours, depending on the available hardware. The LLVM compiler toolchain, which provides the optimization and code-generation capabilities for many programming languages, is a case in point. Even on a 16-core machine, my standard build configuration takes about fifteen minutes for a non-incremental build. You can imagine how long this takes on a laptop.

Figure 1 illustrates a comedic look at compiling time (courtesy of xkcd).

A cartoon comic-like image showing developers passing the time by sword fighting while compiling.
Figure 1: What you do while compiling.

Unfortunately, compilers suffer from the "flat profile" syndrome. There usually aren't any significant performance hotspots, on which optimization efforts can be focused. Instead, time is spent across hundreds of different analysis and optimization passes.

Even a large improvement in a single pass usually only translates into a small improvement in end-to-end compilation times. As such, compile-time changes (both improvements and regressions) are typically the result of accumulating many small changes on the order of 0.1%.

This article covers one of the rare exceptions to this rule. A change to the InstCombine pass resulted in a ~4% end-to-end compile-time reduction on CTMark (Figure 2). This change will be part of LLVM 18.

A graph showing compile time change over time.
Figure 2: This graph shows compile time change in percent over time for compiler configurations. The big jump on the left is the InstCombine change.
Figure 2: This graph shows compile time change in percent over time for different compiler configurations. The big jump on the left is the InstCombine change.

The instruction combining pass

The instruction combining pass is responsible for performing peephole optimizations, where a small sequence of instructions is replaced with some other, hopefully better, sequence of instructions. For example, x / 2 could be replaced by x >> 1, or x + y == y by x == 0.

The InstCombine pass consists of hundreds (likely thousands) of folds of varying complexity. Some of them are as simple as the previous examples, while others have to inspect larger instruction graphs or prove non-trivial preconditions using analysis passes.

InstCombine is the proverbial death by a thousand cuts. Each individual transform may be cheap, but if you add up hundreds of them, it comes as no surprise that InstCombine tends to be the single most expensive pass in the optimization pipeline. As new transforms get added to InstCombine on a daily basis, this is a situation that only gets worse with time. It is one of the ways in which compile-times keep slowly creeping upwards, and why continuous effort is required to push them back down.

Worklists and fixpoints

One way to improve InstCombine performance is to optimize individual transforms and the analyses they use. I have done my fair share of this kind of optimization over the years, which results in exactly the kind of small, incremental improvements previously mentioned. The other way is to improve the high-level algorithm, which is what I will talk about here. To that end, we should first discuss how InstCombine works on a technical level.

The basic task of InstCombine is simple, visit each instruction in the function and apply some folds to it. The interesting part is what happens when a fold succeeds and produces new instructions. It's possible that the new instructions enable new folds. If we only perform a single pass over the function, we will miss such opportunities.

InstCombine solves this problem by using the worklist algorithm. Initially, the worklist is populated with all instructions in the function. On each iteration, one instruction is removed from the worklist and visited. If any fold applies, the changed or newly created instructions are added back to the worklist, together with any instructions using them to be reprocessed in a future iteration.

In an ideal world, this approach would be sufficient to find and exploit all optimization opportunities. In practice, this does not quite work out due to transforms that do something unusual or fail to manage the worklist correctly. We'll discuss examples in the following sections. To mitigate these cases, InstCombine combined the worklist algorithm with an additional fixpoint iteration. After folding all instructions with the worklist algorithm, it would check whether any changes have been made. If yes, it would start again from scratch, folding all instructions a second time. This would continue until no more changes were made. That is, until a fixpoint was reached.

This approach is quite wasteful. In the common case, InstCombine would first perform one iteration that does useful work, and then a second iteration that serves no purpose other than verifying that a fixpoint has indeed been reached. The compile-time improvement comes down to removing this fixpoint iteration and making InstCombine only perform a single iteration. This is easier said than done because we first have to fix the various ways in which the worklist algorithm may fail to reach a fixpoint in one iteration.

Convergence failures

Fixing failures to reach a fixpoint involved many dozens of individual changes, so I will only discuss some of the more important ones here. To the most part, these come down to incorrect worklist management.

IRBuilder polymorphism

All instructions that are newly created by InstCombine need to be added back to the worklist for reprocessing. For this purpose, InstCombine uses a special IRBuilder, which will automatically add new instructions to the worklist.

As long as all instructions are created through this IRBuilder, everything is fine. However, InstCombine sometimes calls into other transformation utilities such as SimplifyLibCalls, which historically could not use InstCombine's IRBuilder, because it was templated over InstCombine implementation details. For this reason, I migrated IRBuilder from using templates to virtual methods, so that there is a common base class that can be freely shared across passes. This also had various benefits outside this work.

Deferred worklist

Even if all instructions get added to the worklist, we have to be careful about their order. If an instruction C depends on instructions A and B, we want A and B to be simplified before C. Ignoring the cross-block case, this can be achieved by visiting instructions in program order. On a technical level, the worklist is actually a deduplicated stack, from which the top element is popped on each iteration. Initially, all instructions are pushed to this stack in reverse program order, so that they will be popped in program order.

If a transform creates new instructions A, B and C, these used to be pushed to the worklist stack in that order, which means they were popped in the order C, B, A. That's the exact opposite of what we want. To fix this, an additional deferred worklist was introduced, into which newly created instructions are initially pushed. Before visiting the next instruction, these will be popped from the deferred worklist and pushed to the normal worklist, reversing them in the process and establishing the desired order.

The deferred worklist also serves the additional purpose of removing dead instructions earlier. Unused instructions without side-effects will be removed when processing the deferred worklist, so that dead code is no longer present when the next instruction is visited. This is beneficial, because many transforms are use-count sensitive.

In particular, it is common that transforms have a one-use restriction. For example, InstCombine will convert (A * B) + (C * B) into (A + C) * B. But what happens if the expressions (A * B) and (C * B) are also used somewhere else and cannot be removed? In that case, this transform would not save one multiply, but instead add one. The transform is only profitable under a one-use restriction. As such, we should also revisit users if a transform makes the use count of an instruction drop to one.

Reverse post order

One case I brushed over are cross-block dependencies. There is no well-defined static notion of program order if the function contains control flow. Consider a simple if/else structure like this:

if (...) {
} else {

Historically, InstCombine visited blocks using a depth-first search. That is, it would either use the order A B D C or A C D B. This is bad, because it violates our goal of making sure that folds always see already simplified operands. If D depends on values from B and C, then it will see unsimplified values in one of the blocks.

The solution is to use reverse post-order (RPO) instead, which results in the order A B C D or A C B D. RPO guarantees that predecessors are visited before successors, except in the presence of loop backedges, where this is fundamentally impossible. The downside is that RPO is expensive to compute, making this a small compile-time sacrifice in pursuit of the greater good.

Unreachable code elimination

As a result of instruction combining, we may determine that the argument to some conditional branches is either true or false. In this case we could, in theory, remove the dead edge from the control flow graph, as well as any resulting unreachable code.

InstCombine is not actually allowed to do this, because it is a control-flow preserving pass. This limitation allows it to preserve certain expensive analysis passes, such as the dominator tree. Other passes like SimplifyCFG are responsible for cleaning up the control-flow later.

However, removing unreachable code can often expose new optimization opportunities in InstCombine, for example, by reducing use-counts. For that reason, we make a compromise, where InstCombine will not modify the control-flow graph but still remove all the instructions from unreachable blocks.

Historically, this was done during initial worklist population, which means it only happened on the next InstCombine iteration. What we now do instead is to keep track of all control-flow edges that are known to be dead and add new ones if a branch condition turns into a constant. When this renders new code unreachable, it is directly removed as part of visiting the branch instruction.

The long tail

The previous changes are some of the more high level ones. Many others were fixes to individual transforms. InstCombine provides a number of utilities to perform certain operations (like replacing or removing instructions) in a way that correctly manages the worklist. However, quite a few transforms failed to use these (especially the older ones), resulting in failures to reach the fixpoint.

Another common issue were transforms that don't just look at instruction operands, but also at adjacent instructions in program order. For transforms like these, it is important to scan instructions in the correct direction. For example, when trying to eliminate dead stores, it's necessary to walk upwards to find a store to the same pointer, instead of downwards. That's because the preceding instructions have already been simplified, while the following have not.

Removing the fixpoint iteration

There will always be cases where InstCombine fails to a reach a fixpoint in one iteration. However, at some point they become so rare that paying the compile-time cost of performing the fixpoint iteration is no longer worthwhile. Before switching InstCombine to single-iteration mode, I collected some statistics on how many iterations it performs on llvm-test-suite. The result was that only in 0.04% of cases an additional iteration was needed to reach a fixpoint. Conversely, in 22.3% of cases, InstCombine performed an extra iteration that did no useful work and only verified that the fixpoint was reached.

Paying 4% of total compile-time to only see a difference in 0.04% of cases is clearly a bad tradeoff, especially when taking into account that InstCombine runs at least 8 times during the optimization pipeline, which means that something not caught by one run can still be handled by a later one.

An important consideration here is how to avoid backsliding. If we only perform one iteration, it would be easy to introduce new cases that fail to reach the fixpoint without anyone noticing. We prevent this by enabling fixpoint-verification in our tests, which runs two iterations and checks that the second one made no changes. The verification is disabled for normal compilations.

Lessons learned

The basic idea behind this change is simple. Don't run a transform in a loop. There is nothing groundbreaking here. Making optimization algorithms sparse (that is, reprocessing only what is strictly necessary instead of whole functions) is a core tenet of optimizing compilers. The choice to add fixpoint iteration to InstCombine was made 15 years ago, when InstCombine was much simpler and smaller. Undoing that choice took a good bit of effort over the course of three years. This shows the importance of engineering optimization passes for compile-time efficiency from the start and tracking compile-time changes over time. It is much more expensive to fix mistakes decades later, than to avoid introducing them in the first place.