SystemTap

This post is the third in a series on stapbpf, SystemTap's BPF (Berkeley Packet Filter) backend. In the first post, "Introducing stapbpf – SystemTap’s new BPF backend", I explain what BPF is and what features it brings to SystemTap. In the second post, "What are BPF Maps and how are they used in stapbpf", I examine BPF maps, one of BPF's key components, and their role in stapbpf's implementation.

In this post, I introduce stapbpf's recently added support for tracepoint probes. Tracepoints are statically inserted hooks in the Linux kernel to which user-defined probes can be attached. Tracepoints can be found in a variety of locations throughout the Linux kernel, including performance-critical subsystems such as the scheduler. Tracepoint probes must therefore terminate quickly to avoid significant performance penalties or unusual behavior in these subsystems. Because BPF programs cannot contain loops and are limited to 4k instructions, they are guaranteed to terminate after a bounded amount of work, which makes BPF well suited to this task.

Using tracepoint probes with stapbpf

SystemTap makes it easy for users to write BPF programs and attach them to tracepoints. SystemTap's high-level scripting language provides a straightforward way to interface with the kernel's BPF facilities. The following example script attaches a probe to the mm_filemap_add_to_page_cache tracepoint. It tracks how many pages are added by each process and which process has added the most pages. Once the tracepoint probe has been running for 30 seconds, the timer probe (also a BPF program) fires and the process that added the most pages is printed along with the number of pages it added. Probing is then terminated via exit().

$ cat example.stp
global faults[250]
global max = -1

probe kernel.trace("mm_filemap_add_to_page_cache")
{
  faults[pid()]++

  if (max == -1 || faults[pid()] > faults[max])
    max = pid()
}

probe timer.s(30)
{
  if (max != -1)
    printf("Pid %d added %d pages\n", max, faults[max])
  else
    printf("No page cache adds detected\n")

  exit()
}

To run this script using stapbpf, simply use stap --bpf:

# stap --bpf example.stp
Pid 5099 added 5894 pages
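
The same pattern carries over to tracepoints in hotter paths such as the scheduler. The following is a minimal sketch rather than a tested recipe: it assumes the running kernel exposes the sched_switch tracepoint, and the global name switches and the 5-second interval are chosen purely for illustration. The probe body is a single increment, in keeping with the need for tracepoint probes to finish quickly.

```systemtap
# Illustrative sketch: 'switches' and the 5-second window are arbitrary choices.
global switches = 0

# Fires on every context switch; the body is kept to a single increment.
probe kernel.trace("sched_switch")
{
  switches++
}

# Report the total once and stop probing.
probe timer.s(5)
{
  printf("%d context switches observed\n", switches)
  exit()
}
```

Saved to a file, this sketch would be run the same way as example.stp, by passing it to stap --bpf.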

Advantages of stapbpf

SystemTap's scripting language abstracts away many low-level BPF details that may not be pertinent to a user's inquiry and that could complicate the investigation or steepen the learning curve associated with BPF tooling. Actions such as declaring a hash map with space for 250 key-value pairs (global faults[250]) and checking whether it contains a specific key (pid() in faults) are simple to express in SystemTap. With other BPF tools, the same actions may require more verbose code or additional knowledge of BPF internals, such as the various types of BPF maps and which kernel-provided BPF helper functions should be used to access a map correctly.
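
As a rough illustration of the membership test mentioned above, the sketch below reports the first time each process is seen adding a page to the page cache. It reuses the faults map from example.stp; the message text and the 10-second duration are assumptions made for the example, and it has not been checked against every stapbpf release.

```systemtap
# Same map declaration as in example.stp: room for 250 key-value pairs.
global faults[250]

probe kernel.trace("mm_filemap_add_to_page_cache")
{
  # 'in' tests whether the map already holds an entry for this pid.
  if (!(pid() in faults))
    printf("first page cache add from pid %d\n", pid())

  # Incrementing a missing key creates the entry.
  faults[pid()]++
}

# Stop probing after an arbitrary 10 seconds.
probe timer.s(10)
{
  exit()
}
```

Neither the map type nor a lookup helper function is named anywhere in the script; choosing them is left to stapbpf.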

Stapbpf can also create tracepoint probes for a kernel build other than the one running on the machine where the probes are compiled (the host machine). This is useful when the target machine requires minimal tooling or when probes must be compiled for modules that have not yet been loaded into the kernel. To cross-compile the probes, stapbpf derives tracepoint information directly from the kernel header files of the target machine.

To do so, stapbpf uses an interesting technique adapted from SystemTap's default (loadable kernel module) runtime. Kernel header files containing tracepoint definitions are compiled along with additional headers created by SystemTap. These SystemTap-specific headers modify tracepoint definition macros so that debug info found in the resulting modules contains the information needed to construct the probes. In particular, stapbpf requires the size of the tracepoint arguments and their location during probe execution so that these arguments can be properly accessed. SystemTap also uses this technique to implement typecasting via the @cast operator. This allows users of stapbpf to cast void pointer context variables to a type that can be dereferenced:

probe kernel.trace("hrtimer_init")
{
  state = @cast($hrtimer, "hrtimer", "kernel<linux/hrtimer.h>")->state
  printf("hrtimer state: %d\n", state)
}

Cross-compiling requires having the target machine's kernel build tree on the host machine and the stapbpf binary on the target machine. Then it's just a matter of using stap --remote to compile the probes locally and run them on the target machine via SSH:

# stap --bpf --remote ssh://user@hostname example.stp
Pid 8346 added 89 pages

Where to get SystemTap

Stapbpf development is ongoing, so it's recommended that you build and install SystemTap using the most up-to-date source code. Instructions for doing so can be found in the SystemTap project's documentation.

Last updated: November 15, 2018