Recent versions of commonly-used Linux distributions including Fedora and Ubuntu have disabled frame pointer optimizations with the goal of allowing profiling tools to produce stack traces without needing to include a call-frame information interpreter. In this article I will explain some overlooked limitations of unwinding with frame pointers and why enabling frame pointers does not constitute a full solution to enable profiling. I will also list some initiatives that aim to enable system-wide profiling without the need for frame pointers.
Overview
Several recent articles have discussed the interaction of frame pointer optimization defaults and profiling, including Guinevere Larsen’s overview of the issue, Will Cohen’s article on call-frame information and unwinding, and my own article on profiling frame pointer-less code with eu-stacktrace.
In short, modern compilers can produce code either with or without a frame pointer register indicating the beginning of the current stack frame. With frame pointers enabled, the structure of the call stack is trivial to analyze; without frame pointers, an additional register becomes available for general computation. Since around 2011 with GCC version 4.6, the default has been to omit the frame pointer register, which means that debugging and profiling tools must use call-frame information to produce stack traces. In user space, call-frame information in the DWARF-based .eh_frame
format is universally available.
Unfortunately, it has not been feasible to include a full interpreter for DWARF and .eh_frame
bytecode in the Linux kernel. Thus, the kernel’s perf_events
framework can only use frame pointer unwinding for user-space code, which has affected profiling tools based on this framework and made them non-functional on most Linux distributions. This led to widespread calls to recompile distributions with frame pointers enabled, in the hopes of enabling system-wide stack trace profiling based on perf_events
.
Unfortunately, there are several issues with frame pointer unwinding that have been overlooked in the recent discussions.
Uneven distribution of performance gains and losses
First, the users most impacted by the slowdown due to frame pointers are different from the users who benefit from profiling-driven fixes. This creates a win-lose tradeoff that cannot be discussed in a satisfying fashion.
In general, one group of users is concerned with performance losses on systems with large numbers of interacting components. Such systems can exhibit issues due to mistuning, which can be fixed for a large performance impact when a profiler is available.
Another group of users are concerned with raw computational capacity when obtaining the maximum degree of optimization from their compiler. There are no low-hanging fruit for such users to find; all that they get from re-enabling frame pointers is a 1-2% performance loss, which translates to the loss of about 1 or 2 years of compiler improvements.
It is never good for an upstream project to be forced to decide which group of users is more important. This is especially true for projects whose user base is as large and diverse as that of a compiler or Linux distribution. Thus, there is a lot of motivation to develop a profiling solution that does not require frame pointer optimizations to be disabled.
Function prologues and epilogues
Second, the profiles produced by frame pointer unwinding will inevitably exhibit gaps around function prologues and epilogues and in procedure lookup table (PLT) sections. In these portions of an executable, the frame pointer register does not accurately reflect the current stack frame, which causes the frame pointer unwinder to skip the innermost function. In particular, this affects the validity of the profile for evaluating code-locality optimizations.
A lower-bound estimate of the problem can be obtained by a method suggested by Will Cohen. The minimum size of an x86 function prologue is 8 bytes. We can use the following perf
command to check the number of samples that fall into the first 8 bytes of a function and are thus guaranteed to have an inaccurate frame pointer:
# perf report --sort=sample,symoff
| grep -E '0x[01234567]$' # first 8 bytes only
| grep -v "[k]" | grep -v "@plt" # userspace only
When tested on x86, this analysis yielded about 5.2% of samples falling into the first 8 bytes of a function. Therefore, at least that proportion of samples will have an inaccurate stack trace when frame pointer unwinding is used. The actual proportion is likely to be greater, since compiler optimizations may expand the prologue with additional initialization code. Similarly, on aarch64, where the minimal function prologue size is 12 bytes, a minimum of 6.0% of inaccurate samples were found to occur. The frequent occurrence of samples early in a function may be a result of sampling being more likely to happen immediately after code is loaded into cache after a TLB miss.
In any case, this means that even with frame pointers enabled, call-frame information is still required to obtain an accurate profile.
Assembly-code functions in libraries
Third, the existence of hand-written assembly-code functions in commonly-used libraries, particularly the glibc string and memory manipulation functions, causes another source of inaccuracy. Again, the assembly-code sections do not maintain the frame pointer register the same way that an ordinary function call does.
In the best case, the frame pointer unwind will skip the caller of the assembly-code function. That is, if function f
calls function g
which calls strcpy
, the resulting stack trace will claim that function f
called strcpy
directly. In the worst case, if the assembly-code function uses the frame pointer register for general-purpose computation, the unwind will not be able to proceed at all.
On the other hand, a Call Frame Information (CFI) unwinder will be able to unwind the call correctly. Since around 2003, the glibc assembly code has been hand-annotated with CFI directives, and these document how the canonical frame address can be calculated relative to the stack pointer, or relative to a value spilled to memory, or otherwise. Any other library that includes similar annotations can enable accurate CFI unwinding of assembly-code.
To support frame pointer unwinding by modifying the glibc assembly-code functions to stop using the frame pointer register for computation and imitate frame pointer-enabled code emitted by a compiler is not a likely prospect. In addition to the volume of the work required, only a subset of the glibc users would find such a change desirable.
Alternatives to frame pointer unwinding
Fortunately, there are signs that frame pointer enablement in current Linux distributions is only a stopgap measure. Several initiatives are underway, each of which would make it feasible to obtain profiles via perf_events
without relying on a frame pointer unwinder:
- My own
eu-stacktrace
project was described in a prior article. As of the time of writing, an initial version has been merged upstream into elfutils release 0.192 and can be enabled by compiling elfutils with the--enable-stacktrace
option, as described in the README. - The SFrame project is a simplified call-frame information format with stronger efficiency guarantees than
.eh_frame
, albeit slightly less flexibility. As of the time of writing, a patchset to implement SFrame unwinding forperf_events
is being reviewed for inclusion in the Linux kernel. After that, SFrame support will need to be added to elfutils and then major distributions will consider compiling their packages with.sframe
sections by default. - New generations of hardware will include shadow stack support. Shadow stacks are a security feature which uses hardware assistance to track the structure of the call stack and monitor its integrity. This would also allow stack traces to be obtained without relying on call-frame information or on frame pointers.
Overall, the accuracy of a stack trace profile depends on a number of subtle details that are easy to overlook. Fortunately, the current projects seem to be on track to improve the quality of profile information over what has been available in the past.