
Recent discussion on the Fedora mailing list about frame pointers led us to think it would be interesting to explain what frame pointers can and can't do for unwinding. To summarize the discussion: a Fedora developer requested that, by default, all packages in Fedora be built with explicit frame pointers to make them easier to profile and optimize. Those opposing the change cited the performance impact it would have on the system. Even though the change has already been implemented, we thought it would be worthwhile to give some insight into its costs from the compiler team's perspective.

Unwinding or backtracing?

A common first point of confusion is the difference between unwinding and backtracing, as some discussions use the terms interchangeably or refer to one when they actually mean the other. So, what is the difference?

Backtracing is the simple process of getting a list of which function called which: a call chain. Simple as that. It does not alter the program's state in any way; it only gives us information about it.

Unwinding, on the other hand, is restoring the program's state to how it was before some function(s) were called, possibly giving a profiler much more information. This is useful for more than just profiling or debugging: some language constructs, such as exceptions or non-local jumps, require accurate unwinding, meaning it must always be available.

Why are we still doing backtraces again?

Despite what that explanation may make it sound like, backtraces are enough information for some use cases. One is performance tuning, since the program collecting performance data does not need to know exactly what the profiled program looked like when a function was called. Knowing in which functions our programs spend most of their time is enough.

The benefit of using backtraces is speed. We want the performance-measuring program to be as fast as possible so that it influences the measured program as little as possible. Since all programs require unwinding information (but don't use it very often), great effort goes into compressing that information as much as possible. Very good for the end user, not so much for profiling.

Where do frame pointers fit into this?

After a function's prologue finishes, the frame pointer always points to the topmost frame, and the previous return address is at a known offset from it. This makes generating a backtrace very easy and fast: for each function we need to analyze, we have the exact address where all the important information is.

This also helps with another problem: unreliable backtraces. Sometimes, when the compiler gets very clever, it can be challenging to tell which function we're actually in (with inlined functions, for instance). Ensuring that the compiler always adds frame pointers to all functions makes it trivial to locate the return address in the variable-sized function activation frames and generate the full backtrace. Alternatively, the stack may be corrupted while the frame pointers are still accurate, or vice versa, and using both to generate a backtrace may give the developer additional information.

Why are they not used everywhere, then?

Frame pointers can only be used to determine the function calls. They carry no information about parameters or anything beyond the addresses where the functions were called, and sometimes knowing the parameters can be just as important as knowing the call itself. They also have two drawbacks: we need to remember the frame pointer in the current frame, and we need to save it when entering a new frame. These may sound blindingly obvious, but they have real costs. Let's look at them:

In practice, "remembering the frame pointer in the current frame" means keeping that information in a register at all times. Some architectures may not have a big problem with this, but on architectures with a limited supply of registers it means one fewer register for holding data, so more memory operations may be required. As for the second drawback, it means that every function call may need three extra instructions: a push of the old frame pointer, a copy of the stack pointer into the frame pointer register, and a pop on return (depending on the architecture and a few other factors; on x86_64, this is always the case). Again, more slow memory operations and a bit more memory used for every call.

Along with accessing slow memory, the extra instructions make the program bigger, which is a problem in itself, and they can make cache misses more likely, further slowing down the program. The sum of all these factors creates a 2% decrease in performance, according to the tests in the Fedora change proposal.

2% doesn't sound too bad

Indeed, it doesn't if we take the number at face value. However, a separate benchmark shared by Phoronix got a very different result, showing the geometric mean of the performance of its 100 tests decreasing by 13.2%. Internal testing showed only a 0.8% geometric-mean change in the SPEC2000 benchmark using GCC 12 and -O3 optimizations. No single benchmark can easily give us an answer, though: each application is impacted differently, so each user can end up feeling different effects.

To give frame pointers a realistic chance, let's assume that 2% is the final number. It sounds reasonable if this is the only way to get accurate backtracing information, since the possible gains from developers better optimizing their code are much greater than a measly 2%.

As one can imagine from the existence of this blog post, it is not the only way. There are a few options, each with its own pros and cons. The main one we'll focus on here is DWARF Call Frame Information (CFI) since it is always available in userspace (the focus of the proposal), but the change request lists other options.

What is DWARF CFI?

DWARF CFI is how unwinding is usually done for C++ exceptions and POSIX thread cancellation. It is a list of frame description entries (FDEs), which can be thought of as tables. Each row of an FDE table represents the program's state at a given point. To save space, each row describes only the changes from the previous row rather than the full state, which is one of the reasons DWARF unwinding is slow. Will Cohen wrote a very enlightening blog post on the topic if you are interested in going more in-depth on how the CFI actually works.

Pros

In no particular order, some of the pros of using DWARF instead of frame pointers are:

  • Almost complete information: DWARF has all the information required to do full unwinding. This means that if you need information about the parameters of a given function call, you can usually get it. The main exception is if the program generating the backtrace can't determine the calling convention.
  • Outside of the code section: The information is stored in the .debug_frame or .eh_frame sections, so it does not influence execution time when the application is not being profiled.
  • Can be distributed separately: Since the parts of the CFI that are not essential are part of the .debug_frame section, they can be distributed with the rest of the separate debug information, so, once again, regular users won't be affected.

Cons

As for the cons, they are:

  • Slower: As mentioned before (quite a few times), when using the CFI, we need to deal with compressed information, so fully unwinding to get a backtrace is much slower than frame pointers.
  • Only usable in userspace: The Linux kernel does not use DWARF at all, instead choosing to use ORC (unfortunately only available for x86_64 processors), a subset of the DWARF CFI optimized for how the kernel must operate. ORC comes with its own issues, but exploring them is beyond the scope of this post. Suffice it to say, the two together still don't cover all the use cases of frame pointers.

Another issue, common when trying to get around the slowness of unwinding, is saving a big snapshot of memory with each sample. This lets perf unwind after the fact, with no performance concerns, but it also generates large reports. If the profiled program needs multiple or long runs, the size of each report may become a concern. For example, the largest report generated during our testing was over 1.5 GB for a relatively short execution time and a single run. If the development machine is seriously hurting for disk space, this can be a serious trade-off.

Just how slow is using DWARF, anyway?

If we need to make trade-offs, we need to know how much each side costs for different use cases; otherwise we can't make informed decisions. To answer this question, Will Cohen has done some rough testing using perf. This isn't supposed to be a comprehensive, perfectly scientific test; it is instead meant to show whether we're looking at a 10% or a 10-fold performance degradation.

To assess the impact of DWARF unwinding, Will set up a Fedora 36 virtual machine and locally compiled SystemTap. He then compiled perf from the upstream Linux kernel commit 15205c2829ca2cbb5ece5ceaafe1171a8470e62b, and ran the following commands:

~/bin/perf record -e cpu-clock --output=perf_record.data --call-graph=dwarf -- ~/bin/perf record -e cpu-clock --call-graph=dwarf --output=perf_stap.data ../install/bin/stap --example -p2 sleeptime.stp

~/bin/perf record -e cpu-clock --output=perf_report.data --call-graph=dwarf -- ~/bin/perf report --input=perf_stap.data

~/bin/perf record -e cpu-clock --output=perf_fp_record.data --call-graph=fp -- ~/bin/perf record -e cpu-clock --call-graph=fp --output=perf_fp_stap.data ../install/bin/stap --example -p2 sleeptime.stp

~/bin/perf record -e cpu-clock --output=perf_fp_report.data --call-graph=fp -- ~/bin/perf report --input=perf_fp_stap.data

Notice the difference in the --call-graph parameter: the first two commands use DWARF, the last two frame pointers.

Recording should have minimal overhead from perf, and the data bears this out: when using DWARF information, the overhead was less than 0.3% of the collected samples compared to the profiled program, while using frame pointers had only 0.16% overhead.

Reporting is when the DWARF unwinding happens and where we expect to see the biggest difference, and there is indeed some visible impact: around 17% of the time spent generating the report went into unwinding. Also, to be able to naively record and still unwind correctly after the fact, the DWARF unwinder had to generate files 50-60 times larger.

So the current trade-off is doubling the overhead when collecting the data (from 0.16% to 0.3% of the total runtime) and a file 50-60 times larger, but we get a more accurate backtrace, and the performance of the profiled program is not affected, assuming the act of writing itself doesn't affect the system.

So, is unwind-based backtracing a bad choice?

Once again, the answer is not so simple. We have a big example of how unwind-based backtracing can be done at great speed: ORC. According to this thread on the kernel mailing list, it works 20-40 times faster than DWARF while working from the same base principles. Granted, part of this speed improvement is due to choices that can't be replicated in userspace, but it shows that the concept is solid and gives us some possibilities to test for improving speed.

Another speed-limiting factor is that perf always does full state unwinding when showing backtraces, which is not strictly necessary. And remember, unwind-based backtracing is guaranteed to be accurate; otherwise basic language constructs would fail. Frame pointers carry no such guarantee. If, for instance, the compiler decides to inline a function, there is no activation frame, and so no way for perf to know the function was there at all; it would simply be missing from the report. Missing inlined functions is a big deal, since a situation may arise where the inlined function needs to be optimized, but the performance report says the caller is slow. While perf is working on showing inlined functions even when using frame pointers, that still depends on having debug information, i.e., DWARF, which may not always be available and is part of the reason frame pointers were suggested in the first place.

However, we don't have to settle

Even though this is the current situation, there is some future work under way to make an even better case for using DWARF CFI. If perf implemented a minimal unwinder, only recovering the frame pointer and instruction pointer, instead of the current full unwinding, I believe we could get even closer to the performance of ORC while keeping all the benefits of using unwinding for accurate backtracing. Polar Signals recently published a blog post on how such minimal unwinding may be accomplished using BPF, and they have finished a minimum viable product. It's hard to tell how much of a performance boost it would be for perf itself (especially when the blog post explicitly mentions that performance was not the main concern), but we at least know the concept is solid.

Another optimization just waiting to happen is improving the stack snapshot stored by perf. 8 KB every time we collect a sample is way too big in most situations (perf's --call-graph=dwarf,<size> option can already lower the dump size, but it is a fixed, manual setting). If perf could store less information per sample, or had a heuristic to decide how much to store each time, the trade-off could become much more manageable.

Conclusion

Right now, there is no surefire solution; the best way to profile a program depends on your specific use case. DWARF is widely available for userspace programs, but it requires more storage space and can add a lot of overhead when processing the collected data. Frame pointers are always available, but they affect the performance of the profiled program and require recompiling it, making them impractical for projects that depend on prebuilt libraries. There are known ways to improve this, but they need a lot of work before we can even be sure what the end result may look like.