Customize the compilation process with Clang: Making compromises

In this two-part series, we’re looking at the Clang compiler and various ways of customizing the compilation process. These articles are an expanded version of the presentation, called Merci le Compilo, which was given at CPPP in June.

In part one, we looked at specific options for customization. In this article, we’ll look at some examples of the compromises and tradeoffs involved in different approaches.

Making compromises

Debug precision vs. size

Increasing the accuracy of debug information leads to a bigger binary. Conversely, decreasing the accuracy of debug information reduces the binary's size. You can control this behavior with:

  • -g1: Lower precision.
  • -g2
  • -g3: Higher precision.
  • -fdebug-macro: Include debug information for macros!

Recall that one can extract debug information to a separate file. Distributions like Fedora, Red Hat Enterprise Linux, or Debian do that and provide separate debug packages:

  • objcopy --only-keep-debug to extract debug information.
  • objcopy --compress-debug-sections to compress them.

Unlike gcc, clang makes no distinction between -g2 and -g3 on our test case:

$ for g in 1 2 3 ""
  do
    printf "-g$g: \t" && curl $sq | clang -c -O2 -g$g -xc - -o- | wc -c
  done
-g1   : 3168632
-g2   : 7025488
-g3   : 7025488
-g    : 7025488

Bonus: -fdebug-macro -g : 7167752
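
To make the macro case concrete, here is a minimal sketch; the SQUARE macro and the area function are hypothetical names chosen for illustration. Without macro debug information, a debugger only sees the expanded expression. With it, a debugger that understands macro records (GDB and its macro expand command, for instance) should be able to resolve SQUARE by name.

```cpp
#include <cassert>

// Hypothetical macro: its definition only reaches the debug information
// when macro records are requested (e.g. with -fdebug-macro).
#define SQUARE(x) ((x) * (x))

// After preprocessing, this body is just ((side) * (side)).
int area(int side) {
  return SQUARE(side);
}
```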

Impact of the optimization level on compilation time

One could expect that more optimization takes more time—that the compiler tries harder. The following experiment, however, invalidates this intuition:

$ for O in 0 1 2 3
  do
  /usr/bin/time -f "-O$O: %e s" clang sqlite3.c -c -O$O
  done
-O0: 22.15 s
-O1: 24.02 s
-O2: 22.68 s
-O3: 22.36 s

This is still understandable; many optimizations remove instructions from the code, which leads to smaller input and thus faster processing by later optimization steps.

Accuracy vs. performance

In some situations, it may be relevant to trade accuracy (of the computations) for performance. This is especially true for floating-point operations:

  • -ffp-contract=fast|on|off: Floating-point expression contraction.
  • -ffast-math: Assume floating-point arithmetic is associative and that there’s no NaN, inf or denormalized numbers.
  • -freciprocal-math: Allow a division to be replaced by a multiplication by the reciprocal.
  • -Ofast: Equivalent to -O3 -ffast-math.

The following example illustrates how the compiler can turn a (slow) division into a (faster) multiply:

$ clang -xc - -o- -S -emit-llvm -O2 -freciprocal-math << EOF
double rm(double x) {
  return x / 10.;
}
EOF
define double @rm(double) {
  %2 = fmul arcp double %0, 1.000000e-01
  ret double %2
}

The next example shows that clang successfully vectorizes the sum of a vector of doubles. It takes advantage of -Ofast to reorder the additions and vectorize them, as the <2 x double> LLVM vector type points out.

$ clang -xc++ - -o- -S -emit-llvm -Ofast << EOF
#include <numeric>
#include <vector>
using namespace std;
double acc(vector<double> const& some)
{
  return accumulate(
           some.begin(),
           some.end(),
           0.);
}
EOF
...
%95 = fadd fast <2 x double> %94, %93
...
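
Why does reordering need a flag at all? Floating-point addition is not associative, so the two groupings below can give different answers. This is a minimal sketch with function names of our own choosing. With the inputs 1e30, -1e30 and 1.0, strict IEEE 754 double arithmetic gives 1.0 for the first grouping and 0.0 for the second, because the 1.0 is absorbed when added to -1e30.

```cpp
#include <cassert>

// (a + b) + c: the two large values cancel first, so c survives.
double sum_left_first(double a, double b, double c) {
  return (a + b) + c;
}

// a + (b + c): c is lost when added to a much larger b.
double sum_right_first(double a, double b, double c) {
  return a + (b + c);
}
```

With -ffast-math (and thus -Ofast), the compiler is allowed to treat both functions as interchangeable, which is exactly the freedom the vectorizer uses above.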

Portability vs. performance

A binary may either be generic for an architecture, say x86_64, or take advantage of some instruction set (e.g., AVX). Trading one for the other can provide a great performance boost, at the cost of constraining the binary to a specific processor family.

  • -march=native: Use all instructions available on the host architecture.
  • -mavx: Generate code that can use the AVX instruction set (even if it’s not available on the host).

The following code combines an architecture-specific feature, here the availability of fused multiply-add, with a relaxation of floating-point accuracy:

$ clang++ -xc++ - -O2 -S -o- -march=native -ffp-contract=fast << EOF
double fma(double x, double y, double z) {
  return x + y * z;
}
EOF
...
vfmadd213sd %xmm0, %xmm2, %xmm1
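
If you want the fused behavior regardless of the contraction flag, the standard library offers std::fma, which computes the product and the sum with a single rounding. The sketch below uses a function name of our own choosing; whether the call compiles down to a single instruction such as vfmadd213sd still depends on the target supporting it.

```cpp
#include <cassert>
#include <cmath>

// Computes x + y * z with a single rounding step, independently of
// the -ffp-contract setting.
double fused_mul_add(double x, double y, double z) {
  return std::fma(y, z, x);
}
```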

Performance vs. security

The clang compiler provides several sanitizers that perform runtime checking of various aspects of the program. Combined with a decent test suite, they are a good way to detect problems in one’s program. It’s usually considered a bad idea to ship software compiled with sanitizer flags, as they significantly impact performance. The impact is smaller than running Valgrind on uninstrumented executables, though.

  • -fsanitize=address: Instrument memory accesses, adding out-of-bound checks.
  • -fsanitize=memory: Trace accesses to uninitialized values.
  • -fsanitize=undefined: Trace undefined behavior.
  • -fsanitize=thread: Detect data races in multi-threaded programs.

To illustrate the impact of instrumentation, let’s investigate the LLVM bitcode generated by the compilation of the following snippet:

// mem.cpp
#include <memory>
double x(std::unique_ptr<double> y) {
  return *y;
}
$ clang++ -fsanitize=address mem.cpp -S -emit-llvm -o- -O2

Right before the memory access through a getelementptr, a key is computed and looked up to determine the status of the memory location referenced by the pointer. The code then branches on that check and either reports an error or goes on.

...
%h = getelementptr inbounds %"class.std::unique_ptr", %"class.std::unique_ptr"* %y, i64 0, i32 0, i32 0, i32 0, i32 0, i32 0
%1 = ptrtoint double** %h to i64
%2 = lshr i64 %1, 3
%3 = add i64 %2, 2147450880
%4 = inttoptr i64 %3 to i8*
%5 = load i8, i8* %4
%6 = icmp ne i8 %5, 0
br i1 %6, label %7, label %8

; <label>:7:
call void @__asan_report_load8(i64 %1)
call void asm sideeffect "", ""()
unreachable

; <label>:8:
%9 = load double*, double** %h, align 8
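
The key computation in the listing above can be restated in C++. This is only a sketch: the names are ours, and the shift and offset are simply copied from the lshr and add instructions in this particular output, so other targets may use different values.

```cpp
#include <cassert>
#include <cstdint>

// Constants copied from the IR above: a right shift by 3 and an
// offset of 2147450880 (0x7fff8000). Assumed to be target-specific.
constexpr std::uint64_t kShadowShift = 3;
constexpr std::uint64_t kShadowOffset = 2147450880ULL;

// Maps an application address to the address of its status byte.
constexpr std::uint64_t shadow_address(std::uint64_t address) {
  return (address >> kShadowShift) + kShadowOffset;
}

// Mirrors the icmp ne in the listing: for this 8-byte load, any
// non-zero status byte leads to the reporting branch.
constexpr bool needs_report(std::uint8_t shadow_byte) {
  return shadow_byte != 0;
}
```

Because of the shift by 3, eight consecutive application bytes share one status byte, which keeps the lookup cheap.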

from __future__ import

The Clang compiler supports different versions of the C++ standard, so that if you’re working on a given codebase, you can control which language features you’re allowed to use. This capability is especially important if you plan to have a codebase compilable by several toolchains: the language version is firm common ground.

  • -std=c++11/14/17: Choose your standard version.
  • -std=gnu11/...: Pick your poison, and allow usage of a dialect.
  • -fcoroutines-ts: Enable the experimental Coroutines Technical Specification.

Using the clang CLI auto-completion feature bundled in clang itself, it is possible to list all supported standards:

$ clang --autocomplete=-std=,
...
c++2a
...
cuda
...
gnu1x
...
iso9899:2011
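
From inside the code, the selected standard is visible through the predefined __cplusplus macro. The sketch below, with function names of our own choosing, shows one way to branch on it; compiling it with different -std= values should change what selected_standard returns.

```cpp
#include <cassert>

// Reports which standard family the translation unit was compiled as,
// based on the value of the predefined __cplusplus macro.
const char* selected_standard() {
#if __cplusplus >= 201703L
  return "C++17 or later";
#elif __cplusplus >= 201402L
  return "C++14";
#elif __cplusplus >= 201103L
  return "C++11";
#else
  return "pre-C++11";
#endif
}

// Exposes the raw macro value, e.g. 201402L under -std=c++14.
long standard_macro() {
  return __cplusplus;
}
```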

Control security features

It’s also possible to insert various kinds of countermeasures in the code to prevent basic attacks that exploit buffer overflows or ROP.

  • -fstack-protector: Add a stack canary, to detect (some) stack smashing.
  • -fstack-protector-strong: Same as above, but applied to more functions.
  • -fstack-protector-all: Same as above, but applied to all functions. The canary check is not particularly costly, but it does make your code slower and bigger.
  • -fsanitize=safe-stack: Split the stack into a safe stack (return addresses and other values that are only accessed safely) and an unsafe stack (everything else), to make it harder to smash the stack.
  • -fsanitize=cfi: Instrument control flow to detect various situations where an attacker could take over the control flow. Various protection schemes exist; see the Control Flow Integrity documentation.

Let’s have a look at the flight of a (stack) canary:

$ clang -O2 -fstack-protector-all -S \
-o- -xc++ - << EOF
#include <array>
using namespace std;
auto access(array<__int128_t, 10> a,
            unsigned i)
{
  return a[i];
}
EOF
...
cmpq    (%rsp), %rcx
jne .LBB0_2
popq    %rcx
retq
.LBB0_2:
callq   __stack_chk_fail

At the end of the function, -fstack-protector-all has inserted a check between a value and the stack canary, leading to __stack_chk_fail being called if the comparison fails.
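
For comparison, here is a sketch of the kind of function the stack protector is designed for: one that owns a local character buffer. The function name is hypothetical, and which functions get a canary under plain -fstack-protector depends on the compiler's heuristics, so treat this as an illustration rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies the input into a fixed-size local buffer and returns the
// length of the (possibly truncated) copy. The local char array is
// what makes this function a stack-smashing candidate.
std::size_t copy_and_measure(const char* input) {
  char buffer[64];
  std::strncpy(buffer, input, sizeof buffer - 1);
  buffer[sizeof buffer - 1] = '\0';  // strncpy does not always terminate
  return std::strlen(buffer);
}
```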

Feeding the compiler

Intuitively, the more information the compiler has, the better it can apply its optimizations. You can either gather more information or provide compiler hints.

Profile guided optimization (PGO)

If you have a relevant sample use case for your application, and you’re willing to optimize your application based on that sample, you can use Profile Guided Optimization (PGO).

  1. Compile the whole code base with -fprofile-generate. This generates extra code to record the functions and branches most frequently visited.
  2. Run the generated binaries on the use cases.
  3. Merge the raw profiles with llvm-profdata merge, then recompile your code with -fprofile-use, pointing it at the merged profile.

Thanks to the information gathered, the compiler can better group and place functions, group and place basic blocks, and it has better hints for optimizations like loop unrolling or inlining.
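
As an illustration, consider the hypothetical function below. We assume that negative values are rare in the sample workload. The compiler cannot know that from the source alone, but a profile would record how often each branch is taken, which is the kind of hint basic block placement benefits from.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the non-negative entries. In the assumed workload the
// `values[i] < 0` branch is almost never taken.
long sum_non_negative(const std::vector<int>& values) {
  long total = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (values[i] < 0) {
      continue;  // rare path, according to our assumed profile
    }
    total += values[i];
  }
  return total;
}
```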

Link time optimization (LTO)

Back in the day, separate compilation was a requirement due to memory limitations. Now, loading the whole program in memory during the compilation may be a valid option: that’s Link Time Optimization.

  • -flto=full: At link time, the whole program is optimized once more. The memory requirements are higher, but this opens more optimization opportunities.
  • -flto=thin: For each function, extra summaries are computed, and the compiler can make its decisions based on these summaries, lowering the memory requirements at the expense of potentially missing some optimizations.

As a fun fact, the -flto flag actually produces LLVM bitcode instead of an ELF object file:

$ echo 'int foo(void) { return 0; }' | clang -flto -O2 -xc - -c -ofoo.o
$ file foo.o
foo.o: LLVM bitcode

Tuning optimization

Some individual passes accept extra parameters to control threshold effects. Most notably:

  • -mllvm -inline-threshold=n : controls inlining.
  • -mllvm -unroll-threshold=n : controls unrolling.

The greater the threshold, the more functions are inlined and more loops are unrolled.

Unfortunately, this applies to the whole compilation unit. For finer-grained control, one can rely on compiler directives, a.k.a. pragmas.

The following pragmas are valid on a loop and control various aspects of loop optimizations. Their effect is relatively straightforward: they control whether the compiler unrolls the given loop, and let you choose the unrolling factor.

#pragma clang loop unroll(enable|full|disable)
#pragma clang loop unroll_count(8)
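
In context, the pragma goes right before the loop it applies to. The following is a minimal sketch with a hypothetical function; the factor of 4 is an arbitrary choice, and the compiler may still decline the request.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the elements, asking clang to unroll the loop four times.
// Compilers that do not know the pragma are expected to ignore it.
long sum(const std::vector<int>& data) {
  long total = 0;
#pragma clang loop unroll_count(4)
  for (std::size_t i = 0; i < data.size(); ++i) {
    total += data[i];
  }
  return total;
}
```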

It’s also possible to have a targeted version of -ffp-contract using the following pragma. In that case, the specified contract strategy only applies within the scope where the pragma appears (the file, or the enclosing block), and the default contract (or the one specified through the command line) is applied otherwise.

#pragma clang fp contract(fast)
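
Here is a minimal sketch of the pragma placed at the start of a function body, so that it only covers that function. The function name is ours; whether the multiply and add are actually fused still depends on the target.

```cpp
#include <cassert>

// Allows contraction of a * x + y inside this function only.
double axpy(double a, double x, double y) {
#pragma clang fp contract(fast)
  return a * x + y;
}

// Same expression, left to the default or command-line setting.
double axpy_default(double a, double x, double y) {
  return a * x + y;
}
```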

More pragmas are detailed in the Language Extension documentation.

Getting compiler feedback

It’s a well-known fact that compiling C++ can take, say, some time. Clang can provide detailed feedback on how much time it spent in each compilation step. The relevant flag is -ftime-report.

To follow the optimization process in detail, it’s also possible to ask for a verbose output of the optimization process, on a per-optimization basis, using the remark mechanism:

  • -Rpass=inline
  • -Rpass=unroll
  • -Rpass=loop-vectorize

These flags tend to produce a lot of noise though:

$ { clang -xc++ - -c \
  -O2 -Rpass=inline << EOF
#include <numeric>
#include <vector>
using namespace std;
double acc(vector<double> const& some)
{
  return accumulate(
           some.begin(),
           some.end(),
           0.);
}
EOF
} 2>&1 | c++filt
...
... remark: __gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >::__normal_iterator(double const* const&) \
... inlined into std::vector<double, std::allocator<double> >::begin() const with cost=-40 (threshold=337) [-Rpass=inline]
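
A smaller input keeps the output readable. The hypothetical function below is a plausible candidate for trying -Rpass=loop-vectorize: it is an integer reduction, so the compiler may reorder the additions without any floating-point flag. Whether a remark is emitted, and its exact wording, depends on the Clang version and the target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Integer reduction over a contiguous buffer: unsigned arithmetic is
// associative, so vectorizing it does not change the result.
unsigned sum_of_squares(const std::vector<unsigned>& values) {
  unsigned total = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    total += values[i] * values[i];
  }
  return total;
}
```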

Concluding words

These two articles aimed to show that a compiler is a complex piece of software with many more levers than just the ones that make the code faster. Compilation speed, executable size, security, and a faster development process are just some of the many targets a compiler tries hard to cover.

Exploring the various flags is an endless quest, but here’s a worthy one:

$ clang --autocomplete=- | wc -l  # Count the number of compiler options that Clang accepts
3197

And this doesn’t even take into account all the low-level tuning that can be done on the LLVM level!
