A pipelined processor requires a steady stream of instructions to be fed into the pipeline. Any delay in feeding instructions into the pipeline will hurt performance. For a sequence of instructions without branches it is relatively easy to determine the next instruction to feed into the pipeline, particularly for processors with fixed sized instructions. Variable-sized instructions might complicate finding the start of each instruction, but it is still a contiguous, linear stream of bytes from memory.
Keeping the processor pipeline filled when there are changes in the control flow of a program due to branches is more difficult. Function calls, function returns, conditional branches, and indirect jumps all affect which instruction is the next in the sequence. In the worst case the pipeline might have to sit idle until all the prior instructions complete and this can significantly hurt performance. Given the frequency that instructions changing the control flow are executed in code these waits can greatly reduce performance.
To improve pipeline performance the processor attempts to predict what instructions to execute next. These mechanisms help the processor speculatively execute code past the branch instructions. If the prediction is wrong, the speculative work is discarded to preserve the programmer's simple execution model semantics.
Static Branch Prediction
For branches where the processor has no previous information about the instruction's behavior the processor may resort to making assumptions about what the result of the instruction will be. Below is a example list of predictions made by some processors:
- unconditional branches are taken
- backward conditional branches are taken
- forward conditional branches are not taken
When the compiler generates code it will attempt to generate a sequences of instructions that follows those assumption as much as possible. The compiler estimates the relative frequency that conditional branches are taken. However, there are cases where the compiler does not have enough information to accurately structure the conditional statement to take advantage of the processor's static predictions. For example, there might be some branch in a function that is only taken for some rare condition based on a value passed into the function. The Linux kernel code uses some GCC attributes to allow the programmer to provide hints on the likelihood of the condition being true. The hinting macros are in the Linux source code file include/linux/compiler.h:
/*
* Using __builtin_constant_p(x) to ignore cases where the return
* value is always the same. This idea is taken from a similar patch
* written by Daniel Walker.
*/
# ifndef likely
# define likely(x) (__builtin_constant_p(x) ? !!(x) : __branch_check__(x, 1))
# endif
# ifndef unlikely
# define unlikely(x) (__builtin_constant_p(x) ? !!(x) : __branch_check__(x, 0))
# endif
An example use is seen in the snippet of code for the the Linux kernel's pipe_read function where the vast majority of times there number of bytes read is going to be more than zero. The following C code hints to GCC to generate code that make the common non-zero total_len take advantage of the static prediction rules so that case is faster:
static ssize_t
pipe_read(struct kiocb *iocb, struct iov_iter *to)
{
size_t total_len = iov_iter_count(to);
...
/* Null read succeeds. */
if (unlikely(total_len == 0))
return 0;
These static hints are most useful for the cases where the code is run infrequently. If the code is run frequently such as as in a loop, the processor can use other dynamic prediction techniques to better determine which instruction will come next.
Dynamic Branch Prediction
Dynamic branch prediction techniques uses recent runtime history to predict whether the conditional branch is taken or will fall through, with the goal of being more accurate than the simple static prediction. The dynamic prediction hardware also predicts the target addresses of branch instructions with computed destinations and return instructions.
Branch History Prediction
When static branch prediction fails for a branch the processor will use a history of branches up to that branch and start recording the history of branches for that particular branch to improve the prediction the next time that particular branch instruction is encountered. The number of branch instructions that the processor can concurrently keep branch histories for range from a few hundred to several thousand. If there are many different conditional branch instructions are executed in the code, some of the previous predictions will be discarded to make room for more recent predictions, yet another reason to avoid "spaghetti code".
In some cases the randomness of the branch makes it impossible for the branch prediction hardware to accurate predict the outcome of the branch. The basic implementation of absolute value function is one example of code with unpredictable branches:
int abs (int i) { return i < 0 ? -i : i; }
As suggested by older AMD optimization manuals, the 32-bit absolute value function could be rewritten as straight line code using bit manipulations like the following to avoid the unpredictable conditional branch:
static inline int abs(int ecx) { int ebx = ecx; ecx = ecx >> 31; ebx = ebx ^ ecx; ebx -= ecx; return ebx; }
Return Stack
Return instructions can also benefit from specialized prediction hardware. The return instruction itself does not have any information about its destination. Each time a call instruction is executed the specialized prediction hardware pushes the return address onto a return address prediction stack. When a return instruction is encountered the top entry of that return address prediction stack is used as the target for the next instruction to fetch. This allows better prediction in the case that the same function might be call recursively or from different locations in the code.
The depth of the return address prediction stack is limited. It might hold the most recent eight or 16 calls. Thus, if the call tree is very deep some of the predictions might be lost resulting in reduced performance. Inlining some functions might help by eliminating some call/return pairs and limiting the depth of the call tree.
Branch Target Buffers
Branch Target Buffers (BTB) hardware is another method of reducing the delay when branches and jumps are encountered in the code. These are caches that store the location of the predicted branch instruction and the associated predicted taken branch destination address based on previous execution of the branch instruction. If the predicted address was not available, the processor may be required to wait for a computation of a relative branch destination or for a register to be loaded with an indirect call for a method in an object-oriented language. The Branch Target Buffer allows the processor to speculatively do work when that information is not yet available. However, if the prediction target address is wrong, the speculative work must be discarded.
The Branch Target Buffer can only store one address for each branch instruction. For a switch-case statement a C compiler might generate a table with the address of the code for each case and index that table to get the address of the next instruction to execute. Each time the switch-case statement is executed there might be a different case run, resulting the processor mispredicting what instruction to execute next and processor performance suffers. Object-oriented code might suffer a similar misprediction problem if polymorphism is used (the same code being used for different classes and different classes have different implementations of the same method). Each time the code used for a different class there would be misprediction penalties.
Investigating Branch Prediction Further
Both Intel and AMD have documentation discussing performance tuning for their processors which includes suggestions on how to address some of these performance issues:
Last updated: February 23, 2024