Featured image for: Automating the testing process for SystemTap, Part 1: Test automation with libvirt and Buildbot.

Low-level library functions and kernel system calls on Linux report errors by returning a POSIX error code such as ENOENT or EINVAL. With SystemTap, an open source tracing toolkit maintained by Red Hat, it is possible to look inside the Linux kernel to debug the problem. This article demonstrates how to investigate the cause of an error code using SystemTap with the whythefail.stp script.

Limitations of POSIX error codes

Low-level system libraries on Linux follow the POSIX standard, which defines a list of numerical error codes. Each error code is associated with an identifier that provides a mnemonic hint about the problem. For instance, error code 22 is identified as EINVAL, denoting an invalid argument to a system call.

The Linux man pages document the error codes returned by each function and the likely circumstances that could produce these error codes. Sometimes, this documentation is sufficient to understand and fix a problem. More often, the range of events that could produce an error code is too complex to document.

For example, suppose we are working on a C program that interacts with the Linux kernel's perf_events framework through the perf_event_open system call. When we test the program, the system call fails and returns the EINVAL error code.

Searching for EINVAL in the manual page for this system call, we find the following paragraph that gives some hints about possible causes:

"Returned if the specified event is invalid. There are many possible reasons for this. A non-exhaustive list: sample_freq is higher than the maximum setting; the CPU to monitor does not exist; read_format is out of range; sample_type is out of range; the flag's value is out of range; exclusive or pinned set and the event is not a group leader; the event config values are out of range or set reserved bits; the generic event selected is not supported; or there is not enough room to add the selected event."

Although this paragraph offers a list of possible reasons for the error, it describes this list as "non-exhaustive." Debugging the program will only show that the system call failed. It will not give any information about whether the cause is one of the reasons on this list or some other cause entirely. The information that would solve this error can be obtained only from Linux kernel internals, beyond the reach of a standard user-space debugger such as GDB.

How SystemTap probes a system call's implementation

Fortunately, we can access additional information on an open source operating system in the following ways.

  • We could delve into the source code of the Linux kernel or the library producing the error code.
  • We can also use the SystemTap tracing framework.
  • SystemTap can gather information from numerous probe points in the Linux kernel and user-space programs. 
  • We can specify the probe points and information-gathering procedures by writing a script in SystemTap's built-in scripting language. SystemTap also includes a toolkit of ready-made example scripts that can answer common questions about a Linux system's behavior. One of these scripts is whythefail.stp, a powerful tool designed to help find the cause of an unexpected error code by tracing the control-flow path through a system call's implementation.

Gathering information about available probe points

In our example, we want to trace the path taken through the perf_event_open syscall. SystemTap can probe almost any statement within functions in the kernel. We can verify the breadth of available probe points with the stap -L command, which finds all accessible probe points that match a wildcard pattern. We can specify the wildcard patterns in a format similar to file glob patterns in the shell.

Thus, even if we are not aware of the full name of the kernel function implementing perf_event_open, we can check for kernel.statement probe points that match the pattern *perf_event_open@*:*. SystemTap will give the source code locations of these probe points and the list of variables visible at each source code location as follows:

$ stap -L 'kernel.statement("*perf_event_open@*:*")'

...

kernel.statement("__do_sys_perf_event_open@kernel/events/core.c:12107") $attr_uptr:struct perf_event_attr* $pid:pid_t $cpu:int $group_fd:int $flags:long unsigned int $group_leader:struct perf_event* $output_event:struct perf_event* $attr:struct perf_event_attr $event_file:struct file* $group:struct fd $task:struct task_struct* $event_fd:int $move_group:int $f_flags:int $cgroup_fd:int

kernel.statement("__do_sys_perf_event_open@kernel/events/core.c:12108") $attr_uptr:struct perf_event_attr* $pid:pid_t $cpu:int $group_fd:int $flags:long unsigned int $group_leader:struct perf_event* $output_event:struct perf_event* $attr:struct perf_event_attr $event_file:struct file* $group:struct fd $task:struct task_struct* $move_group:int $cgroup_fd:int

...

Note: If we cannot find the probe points, the system may not have debuginfod installed for retrieving kernel debug information. In that case, we must install the debug information through the legacy method of running the stap-prep script. This script invokes dnf debuginfo-install on Fedora (or apt on Debian or Ubuntu) to install kernel debug information and source code packages (e.g. kernel-debuginfo and kernel-debuginfo-common-* on Fedora). This alternative method has some drawbacks, including requiring the system to run an up-to-date kernel that matches the kernel debug information package currently available in the package repository. These drawbacks prompted the development of debuginfod.

The stap -L invocation displays dozens of probe points, one for each statement of the __do_sys_perf_event_open function. The kernel source code declares this function and assigns it to handle the perf_event_open system call by invoking the kernel's SYSCALL_DEFINE5 internal convenience macro. The implementation of this macro varies between kernel versions, so the function name could also vary.

How the whythefail.stp script works

The whythefail.stp utility is a script in the SystemTap examples collection that diagnoses an error by tracing the control flow of the function returning that error. The implementation of this script starts with a simple idea: to understand control flow through a function invocation. It is enough to probe every statement in the function. By reporting the sequence of statements, a developer can see which branch every if statement took and discover where the function exited.

A simplified version of this procedure can be expressed with the SystemTap scripting language as follows:

probe kernel.statement("__do_sys_perf_event_open@*:*") {

  tokenize(pp(),"@"); fileline=tokenize("","@")

  file=tokenize(fileline,":\""); line=tokenize("",":\"")

  printf("%s[%d] %s:%s $$vars:%s\n", execname(), tid(), file, line, $$vars)

}

This code prints a comprehensive description of each statement, including the process and thread responsible for the syscall invocation, the filename and line number of the statement, and the contents of local variables. The calls to tokenize in the code snippet extract the file name and line number from the pp() probe point string describing the probe point. After wildcard resolution, this string has a form similar to the probe points produced by stap -L such as:

kernel.statement("__do_sys_perf_event_open at kernel/events/core.c:12107")

Of course, if we trace every invocation of a function, we will see a lot of irrelevant information. Our program might invoke perf_event_open multiple times and only one of these invocations returns the error code causing the problem. At the same time, it is difficult to filter invocations because we don't know ahead of time whether any given invocation of perf_event_open is the one that ends up returning the error code.

The whythefail.stp script solves this dilemma with a strategy called speculative tracing. A speculative tracer collects information about every invocation of the problematic call (perf_event_open). At the end of each invocation, the tracer checks the collected data and determines whether it is relevant to the final report, reporting relevant data and discarding irrelevant data. In this case, the relevance criterion is simple: any invocation of perf_event_open that returns EINVAL is relevant.

SystemTap supports speculative tracing via the speculation.stp tapset library, including the following functions:

  • speculate(id, msg) stores a string msg into a speculation buffer identified by the label id until the script can decide whether the information it contains is relevant. For this example, we create a separate speculation buffer for every invocation of perf_event_open.
  • commit(id) outputs the strings accumulated in the speculation buffer labeled id. For this example, we commit the speculation buffer for an invocation if we find that the invocation returns EINVAL.
  • discard(id) discards the strings in the speculation buffer labeled id.

When we use whythefail.stp to trace invocations of perf_event_open, we store a message in the speculation buffer for every statement in __do_sys_perf_event_open. We commit or discard the speculation at the exit point of the function, traced via the following probe point:

kernel.function("__do_sys_perf_event_open").return

Using whythefail.stp to find the error source

The whythefail.stp script is a general-purpose utility adapted to our search by providing the following arguments on the SystemTap command line:

$ stap --example whythefail.stp PROC_OR_KERNEL FUNCTION_NAME CONDITION_EXPRESSION [VARIABLE]

As documented in the script's source code, these arguments take the following meanings:

  • PROC_OR_KERNEL is a SystemTap probe point prefix identifying the codebase to probe. We specify kernel to probe the kernel, but the argument could specify a kernel module or user-space process instead.
  • FUNCTION_NAME is the function invocation we want to trace.
  • CONDITION is the condition evaluated at function return. The whythefail.stp prints trace data only when CONDITION evaluates to true.
  • Optionally, VARIABLE is a SystemTap expression instructing whythefail.stp to print an additional value at each statement. The default for VARIABLE is $$vars, which prints all of the local variables at each statement.

Optionally, we can add the SystemTap command-line option -c CMD or -x PID to trace system call invocations from one particular process. These options are useful for system calls invoked by many other programs simultaneously with the program we debug.

In our example scenario, we invoke whythefail.stp with a FUNCTION_NAME of __do_sys_perf_event_open and a CONDITION of $return == -22. Recall that 22 is the numerical error code for EINVAL. We can obtain it by following the chain of includes from /usr/include/errno.h or by running the errno command-line utility. With these arguments, whythefail.stp traces only invocations of perf_event_open that finish by returning EINVAL as follows:

$ stap --example whythefail.stp kernel __do_sys_perf_event_open "\$return == -22"

We use a backslash to escape the $ dollar sign in $return to suppress the shell's variable substitution.

The script's output shows the path that the __do_sys_perf_event_open function took before returning the error code. There will be an entry for each statement, displaying its filename and line number with the contents of local variables, as shown in the following example:

myprogram[20919] kernel function(__do_sys_perf_event_open) $return: -22

myprogram[20919] kernel/events/core.c:12049 $$vars: attr_uptr=0x7ffcd1ef0bc0 pid=0xffffffffffffffff cpu=0x1 group_fd=0xffffffffffffffff flags=0x0 group_leader=0x0 output_event=0x0 event=? sibling=? attr={...} ctx=? gctx=? event_file=0x0 group={...} task=0x0 pmu=? event_fd=? move_group=0x0 err=? f_flags=0x2 cgroup_fd=0xffffffffffffffff

myprogram[20919] kernel/events/core.c:12053 $$vars: attr_uptr=0x7ffcd1ef0bc0 pid=0xffffffffffffffff cpu=0x1 group_fd=0xffffffffffffffff flags=0x0 group_leader=0x0 output_event=0x0 event=? sibling=? attr={...} ctx=? gctx=? event_file=0x0 group={...} task=0x0 pmu=? event_fd=? move_group=0x0 err=? f_flags=0x2 cgroup_fd=0xffffffffffffffff

myprogram[20919] kernel/events/core.c:12054 $$vars: attr_uptr=0x7ffcd1ef0bc0 pid=0xffffffffffffffff cpu=0x1 group_fd=0xffffffffffffffff flags=0x0 group_leader=0x0 output_event=0x0 event=? sibling=? attr={...} ctx=? gctx=? event_file=0x0 group={...} task=0x0 pmu=? event_fd=? move_group=0x0 err=0x0 f_flags=0x2 cgroup_fd=0xffffffffffffffff

myprogram[20919] kernel/events/core.c:12489 $$vars: attr_uptr=? pid=? cpu=? group_fd=? flags=? group_leader=? output_event=? event=? sibling=? attr={...} ctx=? gctx=? event_file=? group={...} task=? pmu=? event_fd=? move_group=? err=? f_flags=? cgroup_fd=?

Analyzing the Linux source

We can obtain the source code of kernel/events/core.c with the debuginfod-find source utility. When invoking it, we need to know the absolute path of the kernel source as configured in the distribution's source code packages (e.g., kernel-debuginfo-common-* on Fedora or linux-source-* on Debian). For example, the absolute path has the prefix /usr/src/debug/kernel-5.19.11/linux-5.19.11-200.fc36.x86_64 on a Fedora system running the kernel version 5.19.11-200.fc36.x86_64.

$ debuginfod-find source /boot/vmlinuz-5.19.11-200.fc36.x86_64 /usr/src/debug/kernel-5.19.11/linux-5.19.11-200.fc36.x86_64/kernel/events/core.c

Downloading from https://debuginfod.fedoraproject.org/ 83040/331982

~/.cache/debuginfod_client/9b687addfc9de54d0ff36eeb801cadde85fde7b3/source##usr##src##debug##kernel-5.19.11##linux-5.19.11-200.fc36.x86_64##kernel##events##core.c

Then, we can look for the line numbers specified in the output of whythefail.stp as follows:

$ less -N ~/.cache/debuginfod_client/9b687*/source*kernel##events##core.c

...

  12052         /* Do we allow access to perf_event_open(2) ? */

  12053         err = security_perf_event_open(&attr, PERF_SECURITY_OPEN);

  12054         if (err)

  12055                 return err;

  12056

  12057         err = perf_copy_attr(attr_uptr, &attr);

  12058         if (err)

  12059                 return err;

...

  12489 }

...

Because of inlining and optimization, the kernel debug information is not guaranteed to provide an accessible probe point for every line. We can see that the statement coverage is incomplete if whythefail.stp outputs warning messages such as the following:

WARNING: probe kernel.statement("__do_sys_perf_event_open at kernel/events/core.c:12057") (address 0xffffffffa829e017) registration error [man warning::pass5] (rc -84)

In this situation, it is still possible to diagnose the problem. We need to be careful to consider multiple possibilities when interpreting the whythefail.stp output.

At first glance, it looks as though the error code is coming from the call to security_perf_event_open(). We test this first hypothesis by running whythefail.stp again for invocations of security_perf_event_open as follows:

$ stap --example whythefail.stp kernel security_perf_event_open "\$return == -22"

...

^C

statistics:

entry count: 11

exit-miss count: 11

statement count: 22

The invocation does not yield any output, which suggests that security_perf_event_open() returned as normal. The statistics report printed when we terminate whythefail.stp, showing that whythefail.stp detected 11 function entries and exits that did not satisfy the condition $return == -22 which supports this conclusion.

The original whythefail.stp output for __do_sys_perf_event_open stated that err=0x0 (i.e., the value of the local variable err is still 0) when execution reaches line 12054. This suggests that execution continued to line 12057, one of the lines SystemTap warned us it could not probe. In that case, the error code might be coming from the call to perf_copy_attr().

We test this second hypothesis by running whythefail.stp again for invocations of perf_copy_attr:

$ stap --example whythefail.stp kernel perf_copy_attr "\$return == -22"

...

myprogram[41981] kernel function(perf_copy_attr) $return: -22

myprogram[41981] kernel/events/core.c:11704 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=?

myprogram[41981] kernel/events/core.c:11705 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=?

myprogram[41981] kernel/events/core.c:11711 $$vars: __ret_gu=? __val_gu=0x0 uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=?

myprogram[41981] kernel/events/core.c:11712 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=0x80 ret=0x0

myprogram[41981] kernel/events/core.c:11716 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=0x80 ret=0x0

myprogram[41981] kernel/events/core.c:11718 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=0x80 ret=0x0

myprogram[41981] kernel/events/core.c:11721 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0

myprogram[41981] kernel/events/core.c:11722 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0

myprogram[41981] kernel/events/core.c:11733 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0

myprogram[41981] kernel/events/core.c:11736 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0

myprogram[41981] kernel/events/core.c:11739 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0

myprogram[41981] kernel/events/core.c:11740 $$vars: mask=? uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0

myprogram[41981] kernel/events/core.c:11743 $$vars: mask=0x40001 uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0

myprogram[41981] kernel/events/core.c:11826 $$vars: uattr=0x7ffff30f9140 attr=? size=? ret=?

Our program received the error code from this function. These indicated lines of kernel/events/core.c reveal the cause:

$ less -N ~/.cache/debuginfod_client/9b687*/source*kernel##events##core.c

...

  11739         if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {

  11740                 u64 mask = attr->branch_sample_type;

  11741

  11742                 /* only using defined bits */

  11743                 if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))

  11744                         return -EINVAL;

...

  11826 }

Evidently, our program passed an invalid parameter value in attr->branch_sample_type, where attr is the struct perf_event_attr parameter data structure provided when invoking the system call.

SystemTap and whythefail.stp define system call errors

This article demonstrated how source code inspection and SystemTap tracing could narrow down the causes of a confusing error code. With the help of whythefail.stp, we utilized the general information provided by an EINVAL error code and traced the Linux kernel's implementation of the system call to understand which argument was invalid and why. We also illustrated how debugging tools could double as program understanding tools for programmers whose code interacts with an unfamiliar project. When we used whythefail.stp, our goal was not to diagnose a problem within the kernel itself but to understand the kernel's internals and behavior in more detail than the documentation conveys.

Please comment below if you have any questions. We welcome your feedback!