Many profiling tools on Linux have been limited by their reliance on stack unwinding algorithms that require the commonly used frame pointer omission optimization to be disabled. This article introduces eu-stacktrace, a prototype tool that uses the elfutils toolkit’s unwinding libraries to let a sampling profiler unwind stack samples from programs compiled without frame pointers.
Background
Developers and customers benefit from profiling the performance of their applications in both development and production environments. Useful profile data typically requires an accurate stack trace listing the active functions at every sample in the profile. Commonly used profiling tools try to meet this requirement with basic unwinder implementations that assume programs were compiled with a stack frame format that includes frame pointers.
The reliance on frame pointer-based unwinding has led to a conflict of priorities in recent versions of popular Linux distributions. Historically, the omit-frame-pointer optimization in GCC has been a popular compiler default. This optimization reduces register pressure by reassigning the frame pointer register on architectures such as x86 to store general-purpose data. Recently, various users reliant on profiling have requested that this compiler optimization be disabled to allow existing profilers to function with system programs, while other users have been concerned about the resulting performance loss even on systems where profiling functionality is never used.
The elfutils toolkit includes a more versatile unwinder implementation that relies on .eh_frame data included in a program’s executable file. The .eh_frame format closely follows the call-frame information portion of the DWARF debug information format. Programs in almost all Linux distributions, including Red Hat Enterprise Linux and Fedora, are packaged and shipped with .eh_frame sections included in the executables.
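One way to check this on a given system is to look for the section in an executable's section header table. The following is a minimal sketch, assuming that either eu-readelf (from elfutils) or readelf (from binutils) is installed. The helper name has_eh_frame is hypothetical and is not part of either toolkit.

```shell
# has_eh_frame FILE
# Hypothetical helper: exits 0 if FILE has an .eh_frame section,
# 1 if it does not (or cannot be read), 2 if no readelf tool is installed.
has_eh_frame() {
    if command -v eu-readelf >/dev/null 2>&1; then
        tool=eu-readelf
    elif command -v readelf >/dev/null 2>&1; then
        tool=readelf
    else
        echo "has_eh_frame: neither eu-readelf nor readelf found" >&2
        return 2
    fi
    # -S prints the section header table. Match the exact section name
    # so that .eh_frame_hdr on its own does not count.
    "$tool" -S "$1" 2>/dev/null | grep '[[:space:]]\.eh_frame[[:space:]]' >/dev/null
}

# Example: report on the shell binary itself.
if has_eh_frame /bin/sh; then
    echo "/bin/sh: .eh_frame present"
else
    echo "/bin/sh: .eh_frame not found (or no readelf tool available)"
fi
```

The check only confirms that the section exists. It says nothing about whether the unwind data covers every function in the binary.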
To make the elfutils unwinder implementation available for use by sampling profilers, I have been working on a tool called eu-stacktrace. Currently, there is a proof-of-concept version of eu-stacktrace that integrates with a patched version of the Sysprof whole-system sampling profiler.
In the following sections, I present the design of eu-stacktrace, compare the effectiveness of Sysprof with and without eu-stacktrace, and describe further goals for development. It is my hope that eu-stacktrace can help profiling tools on Linux to work reliably regardless of the presence or absence of frame pointers in the compiled applications.
Implementation
The prototype version of eu-stacktrace consists of a command line tool implemented in a branch of the elfutils source repository and a patchset for the Sysprof profiler.
Sysprof is a whole-system profiler that uses the Linux kernel’s perf_events framework to periodically sample the processes and threads running on each CPU, recording a syscap file containing a stream of sample packets. The syscap file can then be visualized in Sysprof’s graphical interface.
In its existing implementation, Sysprof invokes perf_events with the PERF_SAMPLE_CALLCHAIN option, which requests the kernel to analyze frame pointers to identify the sequence of program counters in the stack data of a process. To produce a stack trace from this sequence, Sysprof maps the program counters to function names via a simple post-processing pass that runs after the profile data has been captured. However, this method cannot be used to profile programs which were compiled without frame pointers.
To use eu-stacktrace for stack unwinding, the patched version of Sysprof instead configures perf_events with the PERF_SAMPLE_STACK_USER option, which asks the kernel to return a fixed-size portion of the program’s user-space stack data.
The eu-stacktrace command line tool is launched concurrently with Sysprof and used as a helper process to unwind the stack samples.
Sysprof sends stack sample packets to the eu-stacktrace process through a fifo. For each sample, eu-stacktrace retrieves any .eh_frame information available for the profiled program, unwinds the stack sample to produce a sequence of program counters, and writes the program counters to the syscap file as a sample packet in the same format that Sysprof would generate in its default mode of operation. Sysprof’s post-processing pass then works exactly as before, reading the syscap file and appending function information.
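The hand-off described above can be tried out without either tool. The following is a stand-in sketch only: sed plays the role of eu-stacktrace and printf plays the role of Sysprof. The packet text, the file names, and the function name fifo_demo are all made up for illustration. The sketch shows one workable ordering: the helper starts first, the writer sends packets through the fifo, and the final append happens only after the helper has exited.

```shell
# fifo_demo: stand-in for the Sysprof / eu-stacktrace data flow.
# Nothing here is real profiler output; all names and packet text are made up.
fifo_demo() {
    workdir=$(mktemp -d) || return 1
    fifo="$workdir/stacktrace.fifo"
    out="$workdir/demo.out"
    mkfifo "$fifo" || return 1

    # Helper (stand-in for eu-stacktrace): reads raw packets from the fifo,
    # rewrites each one, and writes the result to the output file.
    sed 's/^raw-stack/callchain/' <"$fifo" >"$out" &
    helper=$!

    # Profiling pass (stand-in for Sysprof): writes raw packets to the fifo.
    # Closing the fifo signals end-of-input to the helper.
    printf 'raw-stack 1\nraw-stack 2\n' >"$fifo"

    # Wait for the helper to exit before appending to the same file.
    wait "$helper"

    # Annotation pass (stand-in): appends to the file the helper wrote.
    printf 'annotations\n' >>"$out"

    cat "$out"
    rm -rf "$workdir"
}

fifo_demo
```

Because the helper writes the same packet format the profiler would have written itself, the later pass does not need to know that a helper was involved. That property is what keeps the required changes to the profiler small.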
The following command-line example clarifies how Sysprof and eu-stacktrace exchange data:
mkfifo /tmp/stacktrace.fifo
# eu-stacktrace reads from fifo, writes to test.syscap:
eu-stacktrace </tmp/stacktrace.fifo >test.syscap &
# sysprof writes sample packets to fifo during its profiling pass,
# then appends to test.syscap during its annotation pass
sysprof-cli --sample-stack --use-fifo=/tmp/stacktrace.fifo test.syscap
However, the most convenient way to use Sysprof with eu-stacktrace is through the --use-stacktrace option, which will instruct the patched version of Sysprof to launch an eu-stacktrace process automatically:
sysprof-cli --use-stacktrace test.syscap
Estimate of effectiveness
It’s important to check that the overhead of call-frame information (CFI) unwinding with eu-stacktrace is not too large compared to Sysprof’s default mode of operation. If this overhead turns out to be in the same range as, or lower than, the performance loss from compiling programs with frame pointers, that would make a strong argument for re-enabling the frame pointer removal optimization once CFI unwinding is generally accessible to profilers.
To give an initial idea of the CPU overhead of eu-stacktrace unwinding compared to Sysprof’s default mode of operation, I used Sysprof with and without eu-stacktrace to profile a system that was running the stress-ng "matrix" benchmark, invoked with stress-ng --matrix 0 -t 30s. On a system that was otherwise lightly loaded, using Sysprof with the default frame pointer profiling resulted in 0.09% of the samples coming from the sysprof-cli profiler process, while profiling with eu-stacktrace resulted in 1.18% of the samples coming from sysprof-cli and eu-stacktrace.
The overhead of the elfutils unwinder scales with the number of distinct processes for which .eh_frame data needs to be processed, rather than with the number of samples. After launching several desktop applications and re-running the benchmark, the profiling overhead rose to 1.39% of the total samples.
According to Fedora project discussions around the time frame pointers were being re-enabled in major distributions, slowdown due to frame pointers is reported to fall within the range of 0…2%. More extreme slowdowns have been observed for particular programs such as the Python interpreter, but are not ubiquitous.
It is important to note that, unlike with overhead due to profiling, slowdown due to frame pointers occurs regardless of whether a particular system is being profiled or will ever need to be profiled. Thus, approximately 1% overhead with eu-stacktrace only during profiling is a reasonable tradeoff over 0…2% overhead for frame pointer inclusion on every system, all of the time. The overhead could be further reduced by making eu-stacktrace accessible via a library API rather than a fifo, at the cost of requiring more complex modifications to the profiling tools that use it.
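The comparison above comes down to simple arithmetic on the sample shares reported earlier. As a sketch, the snippet below subtracts the 0.09% frame pointer baseline from the two eu-stacktrace measurements to get the additional cost in percentage points. The function name extra_overhead is hypothetical, and the result is only as precise as the single benchmark runs it is based on.

```shell
# extra_overhead BASELINE_PCT TOTAL_PCT
# Prints the additional share of samples (in percentage points)
# attributed to profiling when eu-stacktrace is in use.
extra_overhead() {
    LC_ALL=C awk -v base="$1" -v total="$2" 'BEGIN { printf "%.2f\n", total - base }'
}

# Lightly loaded system: 0.09% (frame pointers) vs 1.18% (eu-stacktrace).
extra_overhead 0.09 1.18   # prints 1.09

# After launching several desktop applications: 1.39% with eu-stacktrace.
# No separate baseline was reported for this case, so 0.09% is reused
# here as an approximation.
extra_overhead 0.09 1.39   # prints 1.30
```

Both figures sit inside the 0…2% range quoted for frame pointer slowdown, with the difference that this cost is paid only while a profile is being captured.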
Next steps
As of the time of writing, several tasks remain before eu-stacktrace can work off-the-shelf as a solution for profiling without frame pointers. In particular, additional fixes are needed to make the implementation portable across architectures (the current prototype works on x86_64 systems) and to handle executables within containers. More detailed benchmarking is also desirable to estimate the upper limit for the complexity of a workload that can be handled within a given target profiling overhead (e.g., less than 2%).
After that, it will be feasible to integrate eu-stacktrace with other profiling tools beyond Sysprof. For now, having eu-stacktrace interface with the profiling tool through a fifo keeps the required changes to the profiling tool as simple as possible.
The eu-stacktrace prototype is available in a branch of the elfutils source repository, and the README describes how to build and test it with the currently-required Sysprof patches.