
In this article, I will discuss the promising experiments I’ve done to implement and evaluate a PGO-enabled LLVM toolchain in Fedora. The official documentation for Profile-Guided Optimization (PGO) covers the technique and the differences between sampling and instrumentation. In my experiments, I’ve focused solely on instrumentation.

It’s okay to think of PGO as a black box

Applying PGO to any application can be quite an involved task, and there’s enough to consider even when thinking about PGO as a black box. If you have an application, a performance testing workload, and maybe even a validation workload, then you can try PGO. If you don’t have a validation workload, just split your performance testing workload in half. The idea is that you don’t overfit your application to just one set of inputs and outputs, namely the performance testing workload.

PGO is like a feedback-driven recompilation of your application, which means that, unlike compiling with -O2 or -O3 in clang, you cannot benefit from PGO unless you help your application progress through the following five phases (a condensed command-line sketch follows the list):

  • Phase 1: Turn on instrumentation that makes your application slower.

    Compile your application and enable PGO in it. It is just a flag in clang. Note that this doesn’t turn on an optimization like -O2. When you execute your application now, it will output PGO files. So, if anything, this will make your application slower.

  • Phase 2: Gather feedback from your workload.

    Then, you can run the application with the performance testing workload and collect the PGO output. Don’t look at the performance of this run because there’s a bit of overhead going on for the instrumentation.

  • Phase 3: Merge the feedback into one file.

    After that, you take the PGO output, which could be multiple files, and merge it into one file. Loosely speaking, if your application passes a certain code branch twice in each output, this phase will collect both numbers and put more weight onto that code branch.

  • Phase 4: Recompile your application.

    Next, recompile your application by feeding the merged PGO output into the compile process. This will help steer clang to make better decisions when ordering basic blocks (source).

  • Phase 5: Validate the performance improvement.

    Finally, test your application against the performance testing workload and the validation workload and make sure you get good performance optimizations out of both.
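To make the five phases concrete, here is a condensed command-line sketch. The file myapp.c and the --workload flags are placeholders, and the flags shown are clang’s IR-level instrumentation flags; a frontend-instrumentation or sampling setup would use different ones:

# Phase 1: build with PGO instrumentation enabled (not an optimization)
clang -O2 -fprofile-generate=./pgo-raw -o myapp myapp.c

# Phase 2: run the performance testing workload; each run writes .profraw files
./myapp --workload=performance-testing-input

# Phase 3: merge all raw profiles into a single indexed profile
llvm-profdata merge -o myapp.profdata ./pgo-raw/*.profraw

# Phase 4: recompile, feeding the merged profile back into clang
clang -O2 -fprofile-use=myapp.profdata -o myapp myapp.c

# Phase 5: validate against both workloads
./myapp --workload=performance-testing-input
./myapp --workload=validation-input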

Applying the 5 phases to LLVM in Fedora

The previous section discussed an application without going into further details. Let’s exchange the word “application” with “any PGO-optimizable binary from the LLVM toolchain.” Why do I use such complicated phrasing? LLVM is made up of sub-packages with many binaries, amongst which clang is one of the most prominent. In Fedora and Red Hat Enterprise Linux, we build packages using a so-called “standalone build mode.” That is a rather antiquated build mode that dates back to when LLVM was organized in SVN as separate repositories. Nowadays, LLVM is a Git monorepository and the default way of building it is as one. In fact, some sub-projects are deprecating the standalone build mode upstream.

Throughout my experiments, I wanted to keep the build mode that we have for two reasons. First, I wanted to make my changes as minimally disruptive as possible to ease their potential adoption. Second, I wanted to compare the results against our existing LLVM toolchain.

Most of my work circled around changing three repositories, the Fedora package sources for llvm, clang, and lld, from which I built part of the LLVM toolchain.

I decided to build for one operating system and one architecture only, namely Fedora 37 on x86_64.

I branched off from the aforementioned Fedora sources and created a branch in each repository to store my work. Then, in phase 1, I would build the llvm, lld, and clang packages inside a Fedora Copr repository: kkleine/llvm-pgo-instrumented. Recall that a PGO-instrumented clang, when compiling an application, will produce the regular application binaries but also PGO data files.

In phase 2, I created another Copr repository to run the application against a workload and gather PGO feedback. This is where things became interesting. For the clang application, a workload can be any other project that is compiled using clang. For example, blender and chromium are two of the packages that we build and ship in our Fedora Linux distribution, so why not just use them? After all, wouldn’t it be nice to build a package in Copr and get a sub-package with the PGO data inside, generated automatically just like the *-debug packages? And can this be automated so we don’t have to modify each package’s spec file individually, but instead keep them as is? The answer is yes to all of this.

Copr can build in different chroots, depending on which operating system and architecture you want. So I first needed to create a new project called kkleine/profile-data-collection in Copr and smuggle in our instrumented LLVM toolchain:

copr create \
  --chroot fedora-37-x86_64 \
  --repo copr://kkleine/llvm-pgo-instrumented \
  profile-data-collection

Any package that I build in the kkleine/profile-data-collection project that has a BuildRequires: clang tag in its spec file will now use the PGO-instrumented clang. For the sake of clarity, let’s just assume the package I want to build is blender. But how do I get an automatic sub-package with PGO data inside? This is a bit more involved, but I’ll show you how it can be done. You need to modify the buildroot in which blender is built so that another package is already installed before the build starts.

copr edit-chroot \
  --packages llvm-pgo-instrumentation-macros \
  kkleine/profile-data-collection/fedora-37-x86_64

This installs a so-called llvm-pgo-instrumentation-macros package into the buildroot. What this does is best explained with the chromium package. Chromium takes a long time to build. Seriously, it sometimes takes more than a day to finish. The number of clang or clang++ calls along the way is enormous, and each time a file is compiled, many of the same places in the clang code are exercised. The gathered PGO information in the produced files therefore has a lot of overlap and can be greatly reduced by merging the files.

In fact, my first naive approach to compiling chromium was to let it produce as many PGO files as needed and only merge them into one at the very end. As it turns out, some of our builders ran out of disk space because of the PGO files. In total, there was more than a gigabyte of PGO data on disk. Luckily, you can continuously merge PGO files so that they always fit on disk. For chromium, the PGO data collected about clang was reduced from 1.6 GB to 2 MB.
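The trick, sketched here with hypothetical file names, is that an indexed profile can itself serve as an input to the next merge, so freshly written raw profiles can be folded in and deleted right away:

# Fold new raw profiles into the running indexed profile (file names are
# placeholders; assumes merged.profdata exists from a previous merge).
llvm-profdata merge -o merged.tmp merged.profdata new-1.profraw new-2.profraw
mv merged.tmp merged.profdata
rm -f new-1.profraw new-2.profraw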

The background merge job and RPM macros

WARNING: I’m not going to explain every RPM macro that I’m using because I want to emphasize the workflow in general and only hint at the overall complexity of this endeavor.

RPM spec files can have sections like %prep, %build, %install, or %changelog to organize the build process, from unpacking sources to compiling and installing files on the target system. RPM itself taps into this process and kills any remaining background job at the end of the %build section:

%__spec_build_post   %{___build_post}
%___build_post \
  RPM_EC=$?\
  for pid in $(jobs -p); do kill -9 ${pid} || continue; done\
  exit ${RPM_EC}\
%{nil}

In order to automatically start a background job that collects and merges PGO data, I redefine the existing __spec_build_pre macro from /usr/lib/rpm/macros, appending to its original body:

# Adds a backslash to each new line unless it is a %{nil} line
function add_backslash() {
  sed '/^%{nil}/!s,$, \\,'
}

# Emit the original macro body (line-continued), followed by our hook;
# together these lines form the new __spec_build_pre definition.
rpm --eval "%%__spec_build_pre %{macrobody:__spec_build_pre}" \
  | add_backslash
echo "%{?__llvm_pgo_instrumented_spec_build_pre}"

The appended macro then starts the background job: an endlessly running process that uses inotifywait to watch for close_write events on files matching the .*\.profraw$ regular expression in a given directory. Here’s how the background job is started through RPM macros:

# Where to store all raw PGO profiles
%__pgo_profdir %{_builddir}/raw-pgo-profdata
# Auxiliary PGO profile to which the background
# job merges continuously
%__pgo_background_merge_target %{_builddir}/%{name}.llvm.background.merge
# Place where the background job stores its PID file
%__pgo_pid_file /tmp/background-merge.pid
%__llvm_pgo_instrumented_spec_build_pre \
[ 0%{__llvm_pgo_subpackage} > 0 ] \\\
&& %{__pgo_env} \\\
&& /usr/lib/rpm/redhat/pgo-background-merge.sh \\\
   -d %{__pgo_profdir} \\\
   -f %{__pgo_background_merge_target} \\\
   -p %{__pgo_pid_file} & \
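For illustration, here is a minimal sketch of what such a background merge script could look like. This is not the actual pgo-background-merge.sh, and the shutdown handling discussed next is omitted for brevity:

#!/usr/bin/bash
# Hypothetical sketch (the real pgo-background-merge.sh may differ).
# -d: directory with raw profiles, -f: merge target, -p: PID file.
while getopts "d:f:p:" opt; do
  case "${opt}" in
    d) profdir="${OPTARG}" ;;
    f) target="${OPTARG}" ;;
    p) pidfile="${OPTARG}" ;;
  esac
done

mkdir -p "${profdir}"
echo $$ > "${pidfile}"

# Wait for raw profiles to be fully written, fold each one into the single
# merge target, and delete it afterwards to keep disk usage low.
inotifywait --monitor --event close_write \
            --include '.*\.profraw$' "${profdir}" \
| while read -r dir _events file; do
    tmp="$(mktemp)"
    llvm-profdata merge -o "${tmp}" \
      $([ -s "${target}" ] && echo "${target}") "${dir}/${file}"
    mv "${tmp}" "${target}"
    rm -f "${dir}/${file}"
  done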

As we’ve seen before, RPM is unfortunately not very accommodating when it comes to running merge jobs in the background of a package’s %build section, and there are good reasons for that. In order to stop the background job, I write to a file for which the background job continuously listens: %{__pgo_shutdown_file}. Once the background job has finished its final merge, it deletes its PID file %{__pgo_pid_file}, and that deletion is what the macro waits for.

# Overriding __spec_build_post macro from /usr/lib/rpm/macros
%__spec_build_post \
  %{?__llvm_pgo_instrumented_spec_build_post} \
  %{___build_post}

%__llvm_pgo_instrumented_spec_build_post    \
  if [ 0%{__llvm_pgo_subpackage} > 0 ]\
  then\
   echo 'please exit' > %{__pgo_shutdown_file};\
   [ -e %{__pgo_pid_file} ] && inotifywait -e delete_self %{__pgo_pid_file} || true;\
  fi\

I use a technique similar to how debug information is automatically created as a sub-package without the spec file actually asking for it:

# Generate profiledata packages for the compiler
%__llvm_pgo_subpackage_template \
%package -n %{name}-llvm-pgo-profdata \
Summary: Indexed PGO profile data from %{name} package \
%description -n %{name}-llvm-pgo-profdata \
This package contains profiledata for llvm that was generated while \
compiling %{name}. This can be used for doing Profile Guided Optimizations \
(PGO) builds of llvm \
%files -n %{name}-llvm-pgo-profdata \
%{_libdir}/llvm-pgo-profdata/%{name}/%{name}.llvm.profdata \
%{nil}

Think of %{name}.llvm.profdata as the file into which we’ve continuously merged our PGO data in the background job.

As you can see, the sub-package with PGO data will be called %{name}-llvm-pgo-profdata, where %{name} resolves to the Name: tag in a spec file. For blender, for example, this results in a blender-llvm-pgo-profdata sub-package that ships %{_libdir}/llvm-pgo-profdata/blender/blender.llvm.profdata.

This completes phase 3, with the creation of the background job at the start of %build and its coordinated termination at the end, gathering and merging the PGO data files along the way.

Recompilation of LLVM with PGO

There’s only one thing left to do. What if I want to gather, and later use, PGO data from multiple projects like blender and chromium together? This is possible by collecting all the generated sub-packages through BuildRequires: tags in another package called llvm-pgo-profdata. During the build of this llvm-pgo-profdata package, all profiles are merged into one indexed profile data file. The final llvm-pgo-profdata RPM then installs this indexed profile data file into a location from which a PGO-optimized build of LLVM can read it. This PGO-optimized build of the LLVM toolchain is done in a third Copr project called kkleine/llvm-pgo-optimized. The llvm.spec file contains these lines to pull in the profile data:

%if %{with pgo_optimized_build}
BuildRequires: llvm-pgo-profdata
%endif

Then when CMake is invoked, the only change needed is to pass along the profile data:

%if %{with pgo_optimized_build}
    -DLLVM_PROFDATA_FILE=%{_libdir}/llvm-pgo-profdata/llvm-pgo.profdata \
%endif
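To connect the dots, here is a sketch of how the llvm-pgo-profdata package itself could merge the per-package profiles into the single indexed file referenced above. The package names and the exact spec layout are assumptions; the real spec file may differ:

# Hypothetical llvm-pgo-profdata.spec fragment: pull in the automatically
# generated profile sub-packages...
BuildRequires: blender-llvm-pgo-profdata
BuildRequires: chromium-llvm-pgo-profdata

%install
mkdir -p %{buildroot}%{_libdir}/llvm-pgo-profdata
# ...and merge them into the one indexed profile that the PGO-optimized
# LLVM build consumes via -DLLVM_PROFDATA_FILE.
llvm-profdata merge \
  -o %{buildroot}%{_libdir}/llvm-pgo-profdata/llvm-pgo.profdata \
  %{_libdir}/llvm-pgo-profdata/*/*.llvm.profdata

%files
%{_libdir}/llvm-pgo-profdata/llvm-pgo.profdata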

Evaluation

What I tested here is the LLVM shipped with Rawhide at the time, version 16.0.3, against a PGO-optimized LLVM 16.0.2 that I built.

I tested this using the LLVM test suite:

“The test-suite contains benchmark and test programs. The programs come with reference outputs so that their correctness can be checked. The suite comes with tools to collect metrics such as benchmark runtime, compilation time and code size.”
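For reference, results files like the ones compared below are typically produced by building the test suite with the compiler under test and collecting metrics with llvm-lit. The paths and cache file here are placeholders rather than my exact invocation:

# Assumed invocation based on the LLVM test-suite documentation.
cmake -S test-suite -B build \
      -DCMAKE_C_COMPILER=/usr/bin/clang \
      -C test-suite/cmake/caches/O3.cmake
cmake --build build -j "$(nproc)"
llvm-lit -j 1 -o results.json build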

In the evaluation, I kept an eye on the execution, compile, and link time:

$ /root/test-suite/utils/compare.py --metric exec_time --metric compile_time --metric link_time --lhs-name 16.0.3 --rhs-name 16.0.2-pgo /root/rawhide/results.json vs /root/pgo/results.json
Warning: 'test-suite :: SingleSource/UnitTests/X86/x86-dyn_stack_alloc_realign.test' has no metrics, skipping!
Warning: 'test-suite :: SingleSource/UnitTests/X86/x86-dyn_stack_alloc_realign2.test' has no metrics, skipping!
Warning: 'test-suite :: SingleSource/UnitTests/X86/x86-dyn_stack_alloc_realign.test' has no metrics, skipping!
Warning: 'test-suite :: SingleSource/UnitTests/X86/x86-dyn_stack_alloc_realign2.test' has no metrics, skipping!
Tests: 3052
Metric: exec_time,compile_time,link_time

Program                                       exec_time                    compile_time                  link_time
                                              16.0.3    16.0.2-pgo diff    16.0.3       16.0.2-pgo diff  16.0.3    16.0.2-pgo diff
920428-1.t                                      0.00      0.00        inf%   0.00         0.00             0.03      0.02     -27.8%
pr17078-1.t                                     0.00      0.00        inf%   0.00         0.00             0.03      0.03      -4.2%
enum-2.t                                        0.00      0.00        inf%   0.00         0.00             0.03      0.04      36.4%
doloop-1.t                                      0.00      0.00        inf%   0.00         0.00             0.03      0.04      30.0%
divconst-3.t                                    0.00      0.00        inf%   0.00         0.00             0.02      0.02     -17.9%
pr81556.t                                       0.00      0.00        inf%   0.00         0.00             0.03      0.03      24.6%
divcmp-4.t                                      0.00      0.00        inf%   0.00         0.00             0.03      0.04      13.9%
20020307-1.t                                    0.00      0.00        inf%   0.00         0.00             0.03      0.02     -26.5%
20020314-1.t                                    0.00      0.00        inf%   0.00         0.00             0.02      0.03      23.7%
divcmp-3.t                                      0.00      0.00        inf%   0.00         0.00             0.03      0.03     -20.3%
20020328-1.t                                    0.00      0.00        inf%   0.00         0.00             0.03      0.03       6.0%
20020406-1.t                                    0.00      0.00        inf%   0.00         0.00             0.03      0.03      27.0%
20020411-1.t                                    0.00      0.00        inf%   0.00         0.00             0.04      0.03     -20.1%
complex-4.t                                     0.00      0.00        inf%   0.00         0.00             0.03      0.03       1.4%
20020508-1.t                                    0.00      0.00        inf%   0.00         0.00             0.04      0.03     -14.0%
                           Geomean difference                      -100.0%                         -9.7%                       -1.2%
           exec_time                             compile_time                             link_time
l/r           16.0.3     16.0.2-pgo         diff       16.0.3   16.0.2-pgo        diff       16.0.3   16.0.2-pgo         diff
count  3034.000000    3034.000000    2401.000000  2505.000000  2505.000000  440.000000  2505.000000  2505.000000  2505.000000
mean   1091.690748    1074.387911    inf          0.259116     0.225875    -0.077137    0.049104     0.048398     0.014828
std    21120.154138   20962.649384  NaN           2.214408     1.988421     0.199779    0.032997     0.032546     0.237169
min    0.000000       0.000000      -1.000000     0.000000     0.000000    -0.494005    0.017100     0.017500    -0.551422
25%    0.000000       0.000000      -0.227273     0.000000     0.000000    -0.195129    0.029100     0.029100    -0.161290
50%    0.001100       0.001100       0.000000     0.000000     0.000000    -0.110612    0.034300     0.033600    -0.010672
75%    0.126725       0.123600       0.212121     0.000000     0.000000     0.011439    0.045700     0.044400     0.161049
max    817849.818925  828252.719527  inf          74.697400    69.996700    0.844595    0.206500     0.227000     0.980296

The most important line to look at is this:

Geomean difference       -100.0%     -9.7%    -1.2%

To interpret the results, one has to understand that the tested programs run too fast for their execution time to be measured reliably, hence the inf% values. The compile time, on the other hand, shows a performance improvement of 9.7% when going from LLVM 16.0.3 to the PGO-optimized LLVM 16.0.2. Link time also improved, by 1.2%.

A close to 10% performance improvement for the compiler is quite good, given that I haven’t changed a single line in the compiler itself.

Summary

I previously mentioned our rather antiquated way of building llvm, clang, compiler-rt, and openmp. While it provides us with great turnaround times for releasing bug fixes, it has also caused us quite some trouble. That is why we are internally investigating building those packages as one. This would open up a range of opportunities, such as bootstrap builds, LTO across all packages, and PGO without the need for multiple Copr repositories. Note that Fedora and RHEL are built with Koji, not Copr.

I hope you enjoyed reading this and are as excited as I am about the potential changes to the LLVM toolset.