Exhaustive profiling toolkit: elfutils and libdwfl_stacktrace

Various (good enough) 80% solutions (i.e., framepointer unwinding and SFrame) have tended to dominate the Linux stack profiling landscape. These proved simpler to implement and deploy than 20% solutions (e.g., elfutils) that use call frame information (CFI) for more exhaustive profile coverage, including coverage of difficult control-flow sections (e.g., function prologues and epilogues, unusual ABIs). This article discusses the libdwfl_stacktrace initiative to make the elfutils project easier to use for stack profiling and examines ideas for further work, including potential improvements to the kernel’s perf_events infrastructure to benefit both 80% and 20% profiling solutions.

An exhaustive profiling solution

In 2024, the elfutils project released the eu-stacktrace prototype. I developed eu-stacktrace to enable system-wide stack sample profiling that does not require compiling programs with frame pointers, and to remove the barriers that made the existing unwinding library in elfutils difficult for a system-wide profiler to adopt.

The initial version of eu-stacktrace was an executable that communicated with a profiling tool through a FIFO, accepting stack sample packets, unwinding them, and returning call chains. This conservative design assumed minimal modifications to the profiling tool. In early 2025, based on feedback from the Sysprof profiler project, I reworked the design into a more conventional and efficient library interface, released as libdwfl_stacktrace in elfutils 0.193.

We designed libdwfl_stacktrace for adoption by profiler projects that currently receive callchain data from Linux perf_events. You must modify the profiler to collect stack samples rather than callchains, and then pass these stack samples to libdwfl_stacktrace for unwinding.
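To make the difference between callchain collection and stack sample collection concrete, here is a minimal sketch of a perf_event_attr configured to capture user registers and a copy of the user stack with each sample. The helper name init_stack_sample_attr, the 1000 Hz frequency, and the choice of a CPU clock event are assumptions made for illustration; the register mask must come from the target architecture's perf_regs.h definitions. A profiler would pass the resulting structure to perf_event_open(2).

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Fill in ATTR so that each sample carries the user register file and a
   copy of the top of the user stack instead of a kernel-built callchain.
   REGS_MASK selects registers using the architecture's perf_regs.h bit
   numbering; STACK_BYTES is the number of stack bytes copied per sample.
   Both values, and the 1000 Hz frequency, are illustrative assumptions.  */
void
init_stack_sample_attr (struct perf_event_attr *attr,
                        uint64_t regs_mask, uint32_t stack_bytes)
{
  /* The kernel rejects a stack dump size that is not a multiple of 8.  */
  assert (stack_bytes % 8 == 0);

  memset (attr, 0, sizeof *attr);
  attr->size = sizeof *attr;
  attr->type = PERF_TYPE_SOFTWARE;
  attr->config = PERF_COUNT_SW_CPU_CLOCK;
  attr->freq = 1;               /* interpret sample_freq as a rate in Hz */
  attr->sample_freq = 1000;
  /* Note the absence of PERF_SAMPLE_CALLCHAIN: unwinding happens later,
     in user space, from the captured registers and stack bytes.  */
  attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID
                      | PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER;
  attr->sample_regs_user = regs_mask;
  attr->sample_stack_user = stack_bytes;
  attr->exclude_kernel = 1;
}
```

Larger stack_bytes values let the unwinder reach deeper frames but increase the amount of data copied on every sample, which matters for the throttling behavior discussed later in this article.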

The core of libdwfl_stacktrace is the following function, which translates a perf_events stack snapshot into a sequence of frames and passes the frames to a callback.

int dwflst_perf_sample_getframes (dwfl, elf, pid, tid,
  const void *stack, size_t stack_size,
  const Dwarf_Word *regs, size_t n_regs,
  uint64_t perf_regs_mask, uint32_t abi,
  callback, void *arg);

With a couple of minor changes, the interface also adapts to non-perf_events-based profiling infrastructures.

-int dwflst_perf_sample_getframes (dwfl, elf, pid, tid,
+int dwflst_sample_getframes (dwfl, elf, pid, tid,
  const void *stack, size_t stack_size,
  const Dwarf_Word *regs, size_t n_regs,
-  uint64_t perf_regs_mask, uint32_t abi,
+  const int *regs_mapping, size_t n_regs_mapping,
  callback, void *arg);

Previously, the elfutils libdwfl library enabled stack unwinding, but the public interface made a number of limiting assumptions. A libdwfl library session was represented by a Dwfl structure tied to one process, generally assumed to be accessed as a core file or via a ptrace interface. This does not match the model used by profiling tools, which process sample packets for all processes on a system. Multiple Dwfl structures created for different processes did not share information, so CFI for the same library was loaded repeatedly when that library was dynamically linked into different processes.

The libdwfl_stacktrace interface handles multiple processes by providing profiling tools with a Dwfl_Process_Tracker data structure that maintains a table of Dwfl structures and a cache of associated module data. Dwfl structures obtained from the tracker via the new dwflst_tracker_find_pid() interface cache CFI for modules within the tracker. Thus, a commonly used module such as libc.so.6 will only load once, and Dwfl structures representing different processes will share it.
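The sharing idea can be illustrated with a toy cache. This is not the elfutils implementation or its API: the toy_tracker and toy_module types and the toy_tracker_get_module function are hypothetical names invented for this sketch, and the fixed-size table is a simplification. It shows only the principle that lookups from different processes for the same module path resolve to one cached entry, so the expensive load runs once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_MODULES 16

/* Hypothetical stand-in for the unwind data of one loaded module.  */
struct toy_module
{
  const char *path;
};

/* Hypothetical stand-in for a tracker: one cache shared by every
   per-process structure that asks it for a module.  */
struct toy_tracker
{
  struct toy_module modules[TOY_MAX_MODULES];
  size_t n_modules;
  unsigned loads;   /* how many times the expensive load step ran */
};

/* Return the cached entry for PATH, loading it only on first request.
   Returns NULL when the fixed-size table is full.  The caller must keep
   PATH alive for the lifetime of the tracker.  */
struct toy_module *
toy_tracker_get_module (struct toy_tracker *tracker, const char *path)
{
  assert (tracker != NULL && path != NULL);

  for (size_t i = 0; i < tracker->n_modules; i++)
    if (strcmp (tracker->modules[i].path, path) == 0)
      return &tracker->modules[i];   /* cache hit: nothing is reloaded */

  if (tracker->n_modules == TOY_MAX_MODULES)
    return NULL;

  struct toy_module *mod = &tracker->modules[tracker->n_modules++];
  mod->path = path;
  tracker->loads++;   /* a real unwinder would read CFI at this point */
  return mod;
}
```

If two processes both map the same library path, the second request returns the entry created by the first and the load counter stays at one.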

The aim of libdwfl_stacktrace is to allow stack trace profilers to use elfutils’ mature support for CFI-based unwinding to cover programs across an entire system without missing edge case packets. Further work to make elfutils into an exhaustive profiling solution revolves around obtaining a more consistent data rate from the perf_events stack sampler.

SFrame as a lightweight profiling solution

In contrast to the exhaustive aim of libdwfl_stacktrace, the SFrame profiling effort is based around a lightweight CFI format whose interpreter is simple enough to include easily in the Linux kernel. Initial versions of SFrame improved on framepointer unwinding and did not aim to cover all architectures or all control-flow patterns. However, the project has since expanded its ambitions.

A kernel patchset has been under review for quite some time, and the format was slated for testing in Fedora, but there have been delays. The initial infrastructure for deferred stack tracing was merged only this August and is not yet enabled for any architecture. To be honest, the slow pace of adoption was surprising. My interpretation of events is that SFrame, a promising 80% solution, has become bogged down in trying to also cover the 20% solution space. There is an inherent tension between SFrame’s reliance on established, implicit ABI rules and the potential need to cover code sections that do not adhere to these implicit rules.

As SFrame becomes more complex to handle edge cases, the complexity of the standard begins to approach that of .eh_frame CFI, the kernel and distribution review process slows down, and the value of a completely new format becomes less clear. On the other hand, SFrame support could already be available in the Linux kernel and slated for inclusion in Fedora as a simpler alternative to .eh_frame if the project were more committed to keeping its design footprint small and to supplanting framepointer unwinding as the existing "good enough, but not perfect" solution. Widely used framepointer unwinding exhibits various coverage gaps, such as function prologues and epilogues, many of which SFrame resolves even in its initial form.

An area that the SFrame project is beginning to explore is the generation of CFI data from JIT compilers. This is a slow but very promising area of development. Keeping the data format straightforward allows JIT compilers to implement support with minimal effort. Any work in this area is universally beneficial since we could also extend elfutils to read JIT-generated SFrame sections.

The importance of profile consistency

While performance-testing various profiling solutions, I've come to appreciate that consistency of the sample rate and profiling overhead is more important than the absolute overhead. The kernel often lowers the perf_events sampling rate on a loaded system, as in the following kernel log:

[806714.250516] perf: interrupt took too long (3949 > 3940), lowering kernel.perf_event_max_sample_rate to 50000
[806714.881792] perf: interrupt took too long (5081 > 4936), lowering kernel.perf_event_max_sample_rate to 39000
[806715.097290] perf: interrupt took too long (6390 > 6351), lowering kernel.perf_event_max_sample_rate to 31000
[806715.175595] perf: interrupt took too long (8084 > 7987), lowering kernel.perf_event_max_sample_rate to 24000
[806716.304292] perf: interrupt took too long (10176 > 10105), lowering kernel.perf_event_max_sample_rate to 19000

A sample rate that changes during a profile makes the resulting data harder to study than a high but constant profiling overhead does. Measurements such as per-function sample counts are inevitably distorted once this kernel safety measure is triggered.
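A small arithmetic sketch shows the size of the distortion. It assumes, hypothetically, that the profiler knows which sampling rate was in effect for each batch of samples; the sample_batch type and both functions are names made up for this example. Raw sample counts treat every sample equally, while the time-weighted variant lets each sample stand for 1/rate seconds.

```c
#include <assert.h>
#include <stddef.h>

/* One run of samples attributed to a function while a given rate was in
   effect.  All names and numbers here are illustrative.  */
struct sample_batch
{
  int function_id;
  unsigned long count;     /* samples collected in this batch */
  unsigned long rate_hz;   /* sampling rate in effect for the batch */
};

/* Fraction of all samples that landed in FUNCTION_ID.  */
double
raw_share (const struct sample_batch *b, size_t n, int function_id)
{
  unsigned long mine = 0, total = 0;
  for (size_t i = 0; i < n; i++)
    {
      total += b[i].count;
      if (b[i].function_id == function_id)
        mine += b[i].count;
    }
  return total ? (double) mine / (double) total : 0.0;
}

/* Fraction of estimated run time spent in FUNCTION_ID, where each sample
   stands for 1/rate_hz seconds.  */
double
time_share (const struct sample_batch *b, size_t n, int function_id)
{
  double mine = 0.0, total = 0.0;
  for (size_t i = 0; i < n; i++)
    {
      assert (b[i].rate_hz > 0);
      double seconds = (double) b[i].count / (double) b[i].rate_hz;
      total += seconds;
      if (b[i].function_id == function_id)
        mine += seconds;
    }
  return total > 0.0 ? mine / total : 0.0;
}
```

Take two functions that each run for one second, the first sampled at 50000 Hz and the second after the limit has dropped to 19000 Hz, the first and last values in the log above. The raw counts attribute about 72% of samples to the first function, while the time-weighted view gives each function 50%.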

On heavily-loaded systems, even the baseline Sysprof profiler using perf_events with the default framepointer-based unwinding mechanism can trigger the throttling. Hence, this is an ongoing issue lurking in the background of the profile-quality discourse, and we need further testing to identify how often the kernel adjusts the sample rate. It’s known that the throttling becomes more likely as the number of cores on a system increases, which makes profiling on high-CPU-count systems more difficult to achieve reliably.
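One low-effort way to gather data on how often the limit changes is to read the sysctl before and after a profiling run. The sketch below reads a single decimal value from a file such as /proc/sys/kernel/perf_event_max_sample_rate, which is where the kernel.perf_event_max_sample_rate setting named in the log messages is exposed. The function names are hypothetical, and comparing only two readings is a simplification that cannot count intermediate adjustments.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single decimal value from PATH, for example
   /proc/sys/kernel/perf_event_max_sample_rate.
   Returns -1 if the file cannot be opened or parsed.  */
long
read_sample_rate_limit (const char *path)
{
  assert (path != NULL);

  FILE *f = fopen (path, "r");
  if (f == NULL)
    return -1;

  long rate = -1;
  if (fscanf (f, "%ld", &rate) != 1)
    rate = -1;
  fclose (f);
  return rate;
}

/* Return 1 when the limit dropped between two readings, which suggests
   that the kernel throttled sampling in between.  Returns 0 otherwise,
   including when either reading failed.  */
int
limit_was_lowered (long before, long after)
{
  return before > 0 && after > 0 && after < before;
}
```

A profiler could record one reading when it starts and another when it stops, and flag the profile as potentially distorted when limit_was_lowered() returns 1.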

We may overcome the tendency of the kernel to throttle perf_events sample rates with additional tuning parameters that accommodate high-bandwidth profiling data streams. Depending on the results from planned experimentation, it might make sense to reduce the likelihood of triggering long-term sample rate reductions from one-off data gluts by introducing an optional grace period for the throttling or automatically raising kernel.perf_event_max_sample_rate after a period of heavy load has passed. Such tuning options do not currently exist, but they are justifiable for exhaustive, debug-oriented profiling on non-production systems.

Wrap up

In this article, we discussed two approaches to Linux stack profiling: elfutils with libdwfl_stacktrace for exhaustive profiling and SFrame for lightweight profiling. Future work revolves around obtaining a more consistent data rate from the perf_events stack sampler to make elfutils into a more viable exhaustive profiling solution.
