Customize the compilation process with Clang: Optimization options

When using C++, developers generally aim to keep a high level of abstraction without sacrificing performance. That’s the famous motto of “zero-cost abstractions.” Yet the C++ language gives developers few guarantees in terms of performance. You do get guarantees such as copy elision or compile-time evaluation, but key optimizations like inlining, unrolling, constant propagation or, dare I say, tail call elimination are subject to the goodwill of the standard’s best friend: the compiler.
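To make the distinction concrete, here is a minimal sketch; the names `factorial` and `sum_to` are invented for illustration. The `static_assert` relies on compile-time evaluation, which the language guarantees for constant expressions. Whether the tail call in `sum_to` is turned into a loop is left entirely to the optimizer.

```cpp
// Guaranteed by the language: a constexpr function used in a constant
// expression (here, a static_assert) is evaluated by the compiler.
constexpr unsigned long factorial(unsigned n) {
  return n <= 1 ? 1UL : n * factorial(n - 1);
}
static_assert(factorial(5) == 120, "evaluated at compile time");

// Not guaranteed: the recursive call is in tail position, but the standard
// does not require the compiler to eliminate it. At -O0 it will most likely
// remain a real call that consumes stack space.
unsigned long sum_to(unsigned long n, unsigned long acc = 0) {
  return n == 0 ? acc : sum_to(n - 1, acc + n);
}
```

Calling `sum_to` with a very large argument is therefore only safe if the optimizer happens to remove the recursion, which is exactly the kind of dependency on compiler flags this article is about.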

This article focuses on the Clang compiler and the various flags it offers to customize the compilation process. I’ve tried to keep this from being a boring list, and it certainly is not an exhaustive one.

This write-up is an expanded version of the talk “Merci le Compilo” given at CPPP on June 15, 2019.

The clang version used is based on trunk, running on RHEL 7.

Every now and then, I’ll be using the SQLite Amalgamation C source as a large third-party codebase. Let’s assume that the following variable has been set in the shell:

sq=https://raw.githubusercontent.com/azadkuh/sqlite-amalgamation/master/sqlite3.c

Introduction: Stating goals

The following source code is a relatively dumb version of a program that sums up numbers read from standard input. It’s most likely memory bound, but there’s still some processing going on:

#include <iostream>
int main(int argc, char** argv) {
  long s = 0;
  while (std::cin) {
    long tmp = 0;
    std::cin >> tmp;
    s += tmp;
  }
  std::cout << s << std::endl;
  return 0;
}

This is a relatively similar—but not equivalent—program written in Python. Python uses big integers by default so it behaves differently with respect to overflow, but it’s enough for our purposes.

import sys
print(sum(int(x) for x in sys.stdin.readlines()))

Let’s take a dumb approach and measure the execution time of these two programs on a relatively large input set:

$ seq 1000000 > numbers
$ clang++ sum.cpp -o sum
$ time ./sum < numbers
0.61s user 0.01s system 94% cpu 0.659 total

$ time python sum.py < numbers
0.77s user 0.04s system 99% cpu 0.818 total

The native code certainly is faster, but not by much. We can’t draw too many conclusions from a single run, but there’s at least one sure thing: The clang user has not specified their intent, so the compiler just generated a valid binary—this is thankfully a hard constraint—and didn’t try to optimize it for whatever metric its user is interested in.

Had the user wanted to optimize for execution speed, they should have specified that intent, say, through the -O2 flag:

$ clang++ -O2 sum.cpp -o sum
$ time ./sum < numbers
0.34s user 0.00s system 99% cpu 0.348 total

Multi-criteria optimization

For a wide range of codebases, there’s more to it than just optimizing for speed. Sometimes, you want to limit the size of the binary; sometimes, you’re okay with trading speed for extra security. The priorities also depend on where you are in the development life cycle: During code editing, for example, you want a fast analysis of your code, and during bug tracking, you want as much debug information as possible.

 #
 ##                           #
 ##                           ##
 ##            ##             ##
 ##            ##             ##
 ##            ##             ##
 ##    ##      ##             ##
 ##    ##      ##      #      ##
 ##    ##      ##      ##     ##
PERF  DEBUG   EDIT    SECU   SIZE

Performance

“I want the generated binary to run fast” is a very common request to the compiler, so the following flags are among the most used ones:

  • -O0: No optimization at all.
  • -O1: O1 = (O0 + O2)/(2). I scarcely use this flag.
  • -O2: Optimize as much as possible, without taking the risk of significantly increasing the binary size or degrading performance.
  • -O3: Optimize even more, trading binary size for speed, and sometimes making decisions that may negatively impact performance.
  • -O4: Equivalent to -O3. The idea that it optimizes further is a myth.

Bonus: -O3 -mllvm -polly activates polyhedral optimizations, if Clang was compiled with Polly support.
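As a quick way to check which of these modes a file was actually built with, here is a minimal sketch. It assumes the predefined macros `__OPTIMIZE__` and `__OPTIMIZE_SIZE__`, which Clang and GCC provide as extensions and which are not part of the C++ standard; the function name `optimization_mode` is invented for illustration.

```cpp
#include <string>

// Returns a label describing how this translation unit was compiled.
// Relies on compiler-specific predefined macros (Clang and GCC), so on
// other compilers it will most likely always report "none".
std::string optimization_mode() {
#if defined(__OPTIMIZE_SIZE__)
  return "size";   // size-oriented levels such as -Os or -Oz
#elif defined(__OPTIMIZE__)
  return "speed";  // any other level above -O0
#else
  return "none";   // -O0, or no -O flag at all
#endif
}
```

Printing the result from a build script or a test can help catch the situation described in the introduction, where a binary is benchmarked without any optimization flag.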

Debug

“I want to debug my code, I don’t care about performance” is sadly a common request too :-/

  • -g: Include debug information.
  • -Og: Currently treated like -O1 by Clang. It does not imply -g, so combine the two (-Og -g). That’s already a trade-off between performance and debuggability.

For the curious ones, the following snippet verifies that debug information sections are actually generated when passing the -g flag:

$ curl $sq | clang -xc -c -g - -o sq.o
$ objdump -h sq.o | grep debug
  #  name            size      ...
   9 .debug_str      00012b2d  ...
  10 .debug_abbrev   0000038d  ...
  11 .debug_info     0005056c  ...
  12 .debug_ranges   00000240  ...
  13 .debug_macinfo  00000001  ...
  14 .debug_pubnames 0000c73a  ...
  15 .debug_pubtypes 00001068  ...
  19 .debug_line     00073402  ...

Security

“I want to protect my code from others, and from myself” is a concern that is growing in importance these days. There aren’t a lot of flags that impact security without impacting performance, but it’s worth mentioning -D_FORTIFY_SOURCE=2. When optimization is enabled, this macro makes the C library headers pick a different declaration for a few functions, for example:

$ clang -xc -c -O2 - -S -emit-llvm -o - -D_FORTIFY_SOURCE=2 << EOF
#include <stdio.h>
void foo(char *s) {
  printf(s, s);
}
EOF
define void @foo(i8*) {
  %2 = tail call i32 (i32, i8*, ...) @__printf_chk(i32 1, i8* %0, i8* %0)
  ret void
}

The macro definition enables a hardened version of printf, namely __printf_chk, that also checks the number of variadic arguments.
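For functions that write into buffers, such as memcpy or strcpy, the fortified variants depend on the compiler knowing the size of the destination. The sketch below illustrates that idea with the `__builtin_object_size` builtin available in Clang and GCC. The helper `checked_copy` is hypothetical and is not the glibc implementation; a real fortified call aborts the program instead of returning false.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies src into a local 16-byte buffer only if it
// fits, mimicking the kind of bound check a fortified wrapper performs.
bool checked_copy(const char* src) {
  char buf[16];
  // Compiler builtin (Clang/GCC): size of the object buf points to,
  // or (size_t)-1 when the compiler cannot determine it.
  std::size_t capacity = __builtin_object_size(buf, 0);
  if (capacity == static_cast<std::size_t>(-1)) {
    capacity = sizeof buf;  // fall back to the size we know statically
  }
  const std::size_t needed = std::strlen(src) + 1;  // include the terminator
  if (needed > capacity) {
    return false;  // a real fortified function would abort here
  }
  std::memcpy(buf, src, needed);
  return true;
}
```

The important design point is that the size comes from the compiler rather than from the caller, which is why this kind of hardening only works when the compiler can see the destination object.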

Size

“I want to do some kind of weight control over my binary” may be a valid requirement for some embedded systems. In that case, you can use:

  • -Os: Same as -O2 with extra code size optimization, including different parameters for transformations like inlining.
  • -Oz: Same as -Os with more size optimizations, at the price of less performance.

Let’s showcase the impact of these flags on the amalgamation binary:

$ curl $sq|clang -xc - -O2 -c -o-|wc -c
1488400
$ curl $sq|clang -xc - -Os -c -o-|wc -c
850696
$ curl $sq|clang -xc - -Oz -c -o-|wc -c
796976
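Size optimization does not have to be all or nothing. As a hedged sketch, Clang also accepts a `minsize` function attribute that asks for size-oriented code generation on a single function. The macro `SKETCH_MINSIZE` and the function `parse_digit` are invented for illustration, and the guard makes the attribute expand to nothing on compilers that do not report support for it.

```cpp
// Use the attribute only when the compiler reports support for it.
#if defined(__has_attribute)
#  if __has_attribute(minsize)
#    define SKETCH_MINSIZE __attribute__((minsize))
#  endif
#endif
#ifndef SKETCH_MINSIZE
#  define SKETCH_MINSIZE  // attribute unavailable: expands to nothing
#endif

// A rarely executed helper: favor small code over fast code for this
// function only, whatever -O level the rest of the file is built with.
SKETCH_MINSIZE int parse_digit(char c) {
  return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

This keeps hot code at -O2 or -O3 while trimming cold paths, though the actual gain depends on the function and should be measured as in the commands above.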

Editing

The compiler also helps to produce better code through a bunch of warning and code-editing features:

  • -Wall: (Almost) all warnings.
  • -Werror[=...]: If you believe that a warning should be an error, you can selectively enable that feature, per warning.
  • -w: If you don’t know what it does, you probably don’t want to 🙂
  • -Xclang -code-completion-at: An internal flag that can be used by IDEs to provide smart code completion.

$ cat hello.cpp
#include <iostream>
int main(int argc, char**argv) {
  std::co
$ clang++ -Xclang -code-completion-at=hello.cpp:3:10 -fsyntax-only hello.cpp
COMPLETION: codecvt : codecvt<<#typename _InternT#>, <#typename _ExternT#>, <#typename _StateT#>>
COMPLETION: codecvt_base : codecvt_base
...
COMPLETION: cout : [#ostream#]cout

In this case, clang outputs all identifiers starting with co available in namespace std.

In the next article, we’ll look at various compromises and tradeoffs involved in optimization, such as debug precision versus binary size, the impact of the optimization level on compilation time, and performance versus security. Stay tuned.
