Most commonly used contemporary CPUs support some SIMD instructions, which allow multiple data elements to be processed by a single instruction. Using these instructions can significantly improve the performance of parallelizable code. There are three main ways to make use of these vector capabilities: intrinsics, explicit vectorization, and auto-vectorization.

Intrinsics

Because the C language doesn't provide direct access to these instructions, CPU manufacturers almost always provide header files and a library that expose processor capabilities the language doesn't directly support. These are called intrinsics. The header files define the C types and functions, and the library is frequently implemented with inline assembly code that directly uses the processor's vector instructions. A developer can then make use of these types and functions by including the appropriate header, such as <x86intrin.h> for the x86 intrinsics.
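For instance, here is a minimal sketch of what using the x86 SSE2 intrinsics looks like (the function name add4 is an illustrative choice, not part of any API):

#include <emmintrin.h>  /* SSE2 intrinsics; <x86intrin.h> also pulls these in */

/* Add four 32-bit integers at once; _mm_add_epi32 maps to the paddd instruction. */
__m128i add4 (__m128i x, __m128i y) {
  return _mm_add_epi32 (x, y);
}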

Almost all CPU vendors provide intrinsics of this kind. They are readily available for ARM Neon, MVE, and SVE; PowerPC AltiVec and VSX; and many more platforms.

In GCC, it is important to note that while the intrinsic macros/functions defined in such headers are supported, using the special built-in functions from their implementations directly is not: those built-ins were added only in order to implement the intrinsics, and newer GCC versions can (and in many cases already have) remove them, or change their arguments or behavior, if the intrinsics can be implemented differently.

A problem with these intrinsics is that, because they are written for a particular CPU, they are not portable at all. To give programmers a way to write portable code, GCC provides some compiler extensions that abstract away the architecture-specific code. The GNU generic vectors provide access to many of the vectorization features, but their coverage of the vector instruction sets is not as complete as that of the native CPU intrinsics.
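For instance, here is a minimal sketch using GCC's documented vector_size attribute; the type name v4si follows the convention used in the GCC manual:

typedef int v4si __attribute__ ((vector_size (16))); /* 4 x 32-bit int */

/* Element-wise addition: GCC emits SIMD instructions where the target
   provides them and falls back to scalar code elsewhere. */
v4si add (v4si x, v4si y) {
  return x + y;
}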

Auto-vectorization

Auto-vectorization is a compiler optimization in which the compiler analyzes the source code and determines that it can convert scalar code into vectorized code to make it run faster. Broadly speaking, there are two kinds of auto-vectorization that GCC can perform: loop vectorization and basic block vectorization, sometimes called superword-level parallelism vectorization.

Loop vectorization is where multiple iterations of a loop are executed at the same time using SIMD instructions. For example, if the CPU's vector width is four 32-bit integers and the loop uses only integer types up to 32 bits wide, then iterations 0, 1, 2, and 3 would execute simultaneously, then iterations 4, 5, 6, and 7, and so on.

Basic block vectorization is where independent scalar instructions are combined into SIMD instructions. Reordering the loads and stores within the basic block to use vector registers, and then operating on those registers, can often allow the CPU and its memory controller to optimize use of the memory bandwidth and complete a task in fewer cycles than if the instructions were executed sequentially. Performing arithmetic and other operations on multiple values with a single instruction can boost performance as well.

An auto-vectorizer was added in GCC 4.0 and has been significantly extended since then. When GCC 4.3 was released 15 years ago, the auto-vectorizer was turned on by default only when -O3 was specified. Unfortunately, most user code is compiled with -O2 instead, so unless users specifically asked for vectorization, it wasn't performed in such compilations. Only relatively recently, in GCC 12.1, was the auto-vectorizer enabled when -O2 is specified.

There are several reasons why there was this 14-year delay.

Vectorization is not unconditionally beneficial; there are costs to using vector operations. One of these costs is that the size of the resulting code is almost always larger. Part of the reason is that the resulting code must often also contain a scalar version of the loop, in addition to the vector portion, to deal with any remainder of iterations that doesn't fill a whole vector. While larger code is acceptable at -O3, it is less acceptable at -O2.
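To illustrate, here is a rough C-level sketch of the shape of code the vectorizer must emit for a simple loop, assuming a vectorization factor of 4 (the names are illustrative, and the inner loop stands in for a single SIMD instruction):

void add (int *a, const int *b, int n) {
  int i;
  /* Main loop: each iteration here stands for one 4-wide SIMD step. */
  for (i = 0; i + 4 <= n; i += 4)
    for (int j = 0; j < 4; ++j)
      a[i + j] += b[i + j];
  /* Scalar remainder loop for the last n % 4 elements; this second copy
     of the loop body is part of the code size growth. */
  for (; i < n; ++i)
    a[i] += b[i];
}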

The analysis that goes into deciding whether some code should be vectorized is also hard. When the compiler doesn't know which code is executed often and which code is only rarely executed, it has to guess using heuristics. The compiler could end up wasting instruction cache on vectorized versions of rarely executed code with insufficient performance benefit. The heuristics that guide the compiler's decision that vectorization is worthwhile for a loop or a basic block are called cost models.

When auto-vectorization was first enabled at -O3 in GCC 4.3, it defaulted to the dynamic cost model, where the compiler estimates the cost of the loop using scalar instructions and using vector instructions and decides whether vectorization is profitable using either compile-time or runtime checks.

Another cost model is the cheap cost model. It is like the dynamic model, but it doesn't vectorize if the runtime checks for data dependencies or alignment would be too costly.

The very cheap cost model vectorizes only if the entire loop can be replaced with a vectorized loop, without a non-vectorized scalar remainder and without any runtime checks. Modern vector ISAs make this much easier because they allow remainders to be handled by masking off portions of the vector register rather than requiring a scalar version of the loop.

Finally, there is the unlimited cost model, which assumes vectorization is always profitable. The unlimited cost model crosses over from auto-vectorization into explicit vectorization. It is used by default when a developer marks a loop with the OpenMP simd or loop directives, or with combined and composite directives which include them.

Since the particular cost model can be important in deciding whether a loop is vectorized, the cost model can be changed using the -fvect-cost-model= option (or, for OpenMP simd loops, the -fsimd-cost-model= option).
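For example, assuming a source file foo.c:

gcc -O2 -fvect-cost-model=cheap foo.c
gcc -O2 -fopenmp-simd -fsimd-cost-model=dynamic foo.c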

If you want the right cost model applied to each loop, you can use a technique called profile-guided optimization (PGO). This gathers information from sample runs of the program to inform the compiler how aggressively it should try to optimize particular portions of the code.
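The typical sequence, assuming a program built from foo.c with a representative input in train.data, looks like this:

gcc -O2 -fprofile-generate foo.c -o foo
./foo < train.data                    # training run writes profile data
gcc -O2 -fprofile-use foo.c -o foo    # rebuild using the collected profile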

The advantages of auto-vectorization can be quite notable. For example, when GCC 13 was used to compile the SPEC CPU 2017 benchmark suite on x86_64, enabling vectorization at -O2 by default (measured by comparing -O2 -fno-tree-vectorize against -O2) resulted in 0.35% code size growth on average, a 4% improvement on SPECint 2017, and a 3% improvement on SPECfp 2017. The most notable improvements were 55% on the 625.x264_s benchmark and 20% on the 638.imagick_s benchmark. However, the fact that vectorization can be a tradeoff is highlighted by the 605.mcf_s benchmark, which was 2% slower.

Also, note that whether vectorization can be used at all for a particular loop, how efficient it is, and how many values it can process simultaneously depend heavily on the selected vector ISAs.

While GCC can be configured to default to a particular architecture and tuning, when using GCC builds shipped by distributions, the default is usually the least common denominator for a particular architecture that the distribution supports. So, for example, on 32-bit x86 that default typically does not include any vector ISAs at all, and vectorization is mostly not performed, while on 64-bit x86-64 the default is often just SSE2 and nothing newer. So some vectorization is possible, but only 128-bit wide, and many instructions are missing.

If a particular program will only be used on newer hardware, it is desirable for developers to tell the compiler explicitly which architecture that is, using the -march= and -mtune= options, or by selecting specific ISAs, e.g., -mavx2 if all supported CPUs will support at least AVX2. If code will only be run on the exact machine on which it was compiled, one can use the -march=native option to enable all ISAs of the detected CPU. In programs or libraries that need to run on any hardware, performance-sensitive parts of the program can be built multiple times, with some versions using target pragmas or attributes, and the right version selected at runtime depending on the CPU capabilities, e.g., using ifuncs. Or one can use the GCC function multi-versioning feature to do that automatically.
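For example, GCC's target_clones attribute compiles a function once per listed target and dispatches among the versions at runtime via an ifunc; a minimal sketch on x86-64 (with an illustrative function name) might look like this:

__attribute__ ((target_clones ("avx2", "default")))
void scale (int *p, int n) {
  /* Built once for AVX2 and once for the baseline ISA; the best
     version supported by the CPU is selected at program startup. */
  for (int i = 0; i < n; ++i)
    p[i] *= 2;
}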

Explicit vectorization

Because choosing to vectorize code can increase code size and sometimes result in slower code, and because the compiler doesn't always make the right choice about when it is beneficial for a loop or a block to be vectorized, developers need ways to make their wishes explicit.

One way to do this was already mentioned. Adding the OpenMP simd directive to a loop and compiling with either -fopenmp or -fopenmp-simd tells the compiler to use the unlimited cost model by default for that particular loop, and therefore to assume that vectorizing it is always beneficial; in addition, it promises that the loop does not contain loop-carried lexical backward dependencies. For example:

void f8 (void) {
  extern int a[1024], b[1024], c[1024];
  #pragma omp simd
  for (int i = 0; i < 512; ++i) { a[i + 1] = b[i] + 1; c[i] = a[i] - 2; }
}

Sometimes, the compiler's analysis of a loop doesn't allow it to tell whether the loop is actually vectorizable. Sometimes the C restrict keyword (or, in C++, __restrict) can be used to mark pointers so that the compiler can assume a lack of data dependencies in a loop where it can't determine that otherwise:

void f4 (int *restrict p, int *restrict q) {
  for (int i = 0; i < 1024; ++i) p[i] += q[i];
}

In other cases, #pragma GCC ivdep can be used to assert that the loop doesn't contain loop-carried lexical backward dependencies, which would otherwise require a runtime dependency check. Lexical forward dependencies are still okay.
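A minimal sketch, with illustrative names:

void g (int *p, int *q) {
  /* Promise: no loop-carried lexical backward dependencies in this loop. */
  #pragma GCC ivdep
  for (int i = 0; i < 1024; ++i) p[i] += q[i];
}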

If auto-vectorization is not beneficial for a certain compilation unit or certain loops, it can be disabled using the GCC command-line options -fno-tree-vectorize, -fno-tree-loop-vectorize, or -fno-tree-slp-vectorize. More locally, a developer can place #pragma GCC novector above a particular loop to explicitly tell the compiler not to vectorize that loop. If OpenMP is being used, then #pragma omp simd if (0) accomplishes the same thing.
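For example, in GCC versions that support the pragma (again with illustrative names):

void h (int *p, int *q) {
  /* Ask GCC not to vectorize this particular loop. */
  #pragma GCC novector
  for (int i = 0; i < 1024; ++i) p[i] += q[i];
}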

Warnings

Developers should be aware that vectorization can trigger crashes in broken code. For example, accessing data through unaligned pointers is undefined behavior in C/C++ (unless using, for example, packed structures), even in scalar code. However, programs often get away with it on many common CPUs, which don't trap on unaligned scalar accesses. Vector instructions are often not as forgiving and will trap on misaligned accesses, causing the code to crash. If a program suddenly starts crashing after vectorization is turned on, you can use the -fsanitize=undefined option to diagnose misaligned accesses at runtime and fix such bugs.
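Here is a sketch of the kind of broken code that often appears to work as scalar code but can crash once vectorized (the names are illustrative; the undefined behavior is casting a misaligned char pointer to int *):

int sum_misaligned (char *buf, int n) {
  /* Undefined behavior: buf + 1 is not suitably aligned for int.
     Scalar loads may happen to work; an aligned vector load can trap. */
  int *p = (int *) (buf + 1);
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += p[i];
  return s;
}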

Other examples

int a[1024], b[1024], c[1024];
struct S { int a, b, c, d; } d, e;

void f1 (void) {
  for (int i = 0; i < 1024; ++i) a[i] += b[i];
}

void f2 (void) {
  d.a *= e.a; d.b *= e.b; d.c *= e.c; d.d *= e.d;
}

void f3 (int *p, int *q) {
  for (int i = 0; i < 1024; ++i) p[i] += q[i];
}

void f5 (int *p, int *q) {
  #pragma omp simd
  for (int i = 0; i < 1024; ++i) p[i] += q[i];
}

void f6 (void) {
  for (int i = 0; i < 47; ++i) a[i] += b[i];
}

void f7 (int j) {
  for (int i = 0; i < j; ++i) a[i] += b[i];
}

void f9 (void) {
  for (int i = 0; i < 512; ++i) { c[i] = a[i] - 2; a[i + 1] = b[i] + 1; }
}

In the f1 function, the number of loop iterations is known and divisible by the usual vectorization factors, and there are no data dependencies that would prevent vectorization, so the loop is successfully vectorized even at -O2 with the very cheap cost model: e.g., on x86-64 with -msse2 (the default) by performing 4 iterations of the loop concurrently, with -mavx2 8 iterations, or with -mavx512f 16 iterations.

The f2 function is an example of SLP vectorization: there is no loop, but the same arithmetic is performed on adjacent data, so it can be done with two vector loads of 4 ints, one vector multiplication, and one vector store of 4 ints.

The f3 loop has a possible data dependency: if called, for example, as f3 (&a[1], a), the loop must stay scalar and can't be vectorized; if called as f3 (&a[2], a), it could be vectorized with a vectorization factor of at most 2; if as f3 (&a[4], a), with a factor of at most 4; etc. When p is smaller than or equal to q, or when p is larger than or equal to q + 1024, the loop can be vectorized with any vectorization factor. So, in order to vectorize that loop, the compiler needs to add a runtime check that (uintptr_t) p - (uintptr_t) q - sizeof (int) > (vectorization_factor - 2) * sizeof (int).

This is considered too costly for the very cheap cost model (as there would then need to be both a vectorized and a scalar loop), so f3 is not vectorized at -O2, but it is cheap enough for, e.g., -O2 -fvect-cost-model=cheap or -O3; though, if there are too many data dependency checks, even the cheap cost model gives up.

The f4 function earlier is a variant of f3, where the user promises the pointers will not overlap using the C restrict keyword and so the loop is vectorized without a runtime data dependency check, even at -O2. Calling f4 (&a[32], a) would invoke undefined behavior.

The f5 function is another variant of f3, where the promise is expressed with an OpenMP directive. Calling f5 (&a[1], a) would invoke undefined behavior when using -fopenmp or -fopenmp-simd, and so the loop is also vectorized. If one specifies #pragma omp simd safelen(4) in f5 instead, the compiler could vectorize it with a vectorization factor of at most 4, and, e.g., calling it as f5 (&a[8], a) would be valid.

The f6 function has a known number of loop iterations, but that number is odd, so it is not vectorized with the very cheap cost model, because that would require scalar remainder iterations after the vectorized loop; it is vectorized with all the other cost models.

The f7 function has an unknown number of loop iterations. Normal vectorization would compute at runtime how many vectorized loop iterations are needed and perform up to vectorization factor minus one scalar loop iterations (or one vector iteration with masking) after it, so it is again inappropriate for the very cheap cost model enabled by default at -O2, but it is vectorized, e.g., with -O3 or -O2 -fvect-cost-model=cheap.

The f8 loop earlier is an example of a loop-carried lexical forward dependency, which is allowed in OpenMP simd loops (but not in other OpenMP directives, which often assert a lack of any dependencies between iterations), and so the loop is vectorized with -fopenmp or -fopenmp-simd.

The f9 loop is an example of a loop-carried lexical backward dependency, which would not be allowed in OpenMP simd loops (if it were, e.g., a[i + 4], it would be okay to use simdlen(4)), and the loop is not vectorized at all.

Conclusion

For performance-sensitive code, I'd strongly recommend using PGO: build the program using -fprofile-generate, run it on a good enough, realistic workload, and then rebuild with -fprofile-use. In that case, the compiler will know which parts of the code are hot and which are cold, and will use the appropriate vectorization cost models on the hot code. Also, checking which of the hot loops seen during profiling are vectorized, and if not, why, can be very helpful. Sometimes, with minor changes, it is possible to get them auto-vectorized by the compiler, or, if that is not possible, to vectorize them manually using CPU intrinsics or generic vectors.

You can use the -fopt-info-missed option, or the even more detailed -fopt-info-note-missed or -fdump-tree-vect-details options, to find out which loops weren't vectorized and why.
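For example, assuming a source file foo.c:

gcc -O2 -fopt-info-missed -c foo.c
gcc -O2 -fdump-tree-vect-details -c foo.c   # writes a detailed vectorizer dump file next to the object file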