Low-level library functions and kernel system calls on Linux report errors by returning a POSIX error code such as ENOENT or EINVAL. With SystemTap, an open source tracing toolkit maintained by Red Hat, it is possible to look inside the Linux kernel to debug the problem. This article demonstrates how to investigate the cause of an error code using SystemTap with the whythefail.stp
script.
Limitations of POSIX error codes
Low-level system libraries on Linux follow the POSIX standard, which defines a list of numerical error codes. Each error code is associated with an identifier that provides a mnemonic hint about the problem. For instance, error code 22 is identified as EINVAL, denoting an invalid argument to a system call.
The Linux man pages document the error codes returned by each function and the likely circumstances that could produce these error codes. Sometimes, this documentation is sufficient to understand and fix a problem. More often, the range of events that could produce an error code is too complex to document.
For example, suppose we are working on a C program that interacts with the Linux kernel's perf_events
framework through the perf_event_open
system call. When we test the program, the system call fails and returns the EINVAL error code.
Searching for EINVAL in the manual page for this system call, we find the following paragraph that gives some hints about possible causes:
"Returned if the specified event is invalid. There are many possible reasons for this. A non-exhaustive list: sample_freq is higher than the maximum setting; the CPU to monitor does not exist; read_format is out of range; sample_type is out of range; the flag's value is out of range; exclusive or pinned set and the event is not a group leader; the event config values are out of range or set reserved bits; the generic event selected is not supported; or there is not enough room to add the selected event."
Although this paragraph offers a list of possible reasons for the error, it describes this list as "non-exhaustive." Debugging the program will only show that the system call failed. It will not give any information about whether the cause is one of the reasons on this list or some other cause entirely. The information that would solve this error can be obtained only from Linux kernel internals, beyond the reach of a standard user-space debugger such as GDB.
How SystemTap probes a system call's implementation
Fortunately, we can access additional information on an open source operating system in the following ways.
- We could delve into the source code of the Linux kernel or the library producing the error code.
- We can also use the SystemTap tracing framework.
- SystemTap can gather information from numerous probe points in the Linux kernel and user-space programs.
- We can specify the probe points and information-gathering procedures by writing a script in SystemTap's built-in scripting language. SystemTap also includes a toolkit of ready-made example scripts that can answer common questions about a Linux system's behavior. One of these scripts is whythefail.stp, a powerful tool designed to help find the cause of an unexpected error code by tracing the control-flow path through a system call's implementation.
Gathering information about available probe points
In our example, we want to trace the path taken through the perf_event_open
syscall. SystemTap can probe almost any statement within functions in the kernel. We can verify the breadth of available probe points with the stap -L
command, which finds all accessible probe points that match a wildcard pattern. We can specify the wildcard patterns in a format similar to file glob patterns in the shell.
Thus, even if we are not aware of the full name of the kernel function implementing perf_event_open
, we can check for kernel.statement
probe points that match the pattern *perf_event_open@*:*
. SystemTap will give the source code locations of these probe points and the list of variables visible at each source code location as follows:
$ stap -L 'kernel.statement("*perf_event_open@*:*")'
...
kernel.statement("__do_sys_perf_event_open@kernel/events/core.c:12107") $attr_uptr:struct perf_event_attr* $pid:pid_t $cpu:int $group_fd:int $flags:long unsigned int $group_leader:struct perf_event* $output_event:struct perf_event* $attr:struct perf_event_attr $event_file:struct file* $group:struct fd $task:struct task_struct* $event_fd:int $move_group:int $f_flags:int $cgroup_fd:int
kernel.statement("__do_sys_perf_event_open@kernel/events/core.c:12108") $attr_uptr:struct perf_event_attr* $pid:pid_t $cpu:int $group_fd:int $flags:long unsigned int $group_leader:struct perf_event* $output_event:struct perf_event* $attr:struct perf_event_attr $event_file:struct file* $group:struct fd $task:struct task_struct* $move_group:int $cgroup_fd:int
...
Note: If we cannot find the probe points, the system may not have debuginfod installed for retrieving kernel debug information. In that case, we must install the debug information through the legacy method of running the stap-prep script. This script invokes dnf debuginfo-install
on Fedora (or apt
on Debian or Ubuntu) to install kernel debug information and source code packages (e.g. kernel-debuginfo
and kernel-debuginfo-common-*
on Fedora). This alternative method has some drawbacks, including requiring the system to run an up-to-date kernel that matches the kernel debug information package currently available in the package repository. These drawbacks prompted the development of debuginfod.
The stap -L
invocation displays dozens of probe points, one for each statement of the __do_sys_perf_event_open
function. The kernel source code declares this function and assigns it to handle the perf_event_open
system call by invoking the kernel's SYSCALL_DEFINE5
internal convenience macro. The implementation of this macro varies between kernel versions, so the function name could also vary.
How the whythefail.stp script works
The whythefail.stp
utility is a script in the SystemTap examples collection that diagnoses an error by tracing the control flow of the function returning that error. The implementation of this script starts with a simple idea: to understand control flow through a function invocation. It is enough to probe every statement in the function. By reporting the sequence of statements, a developer can see which branch every if
statement took and discover where the function exited.
A simplified version of this procedure can be expressed with the SystemTap scripting language as follows:
probe kernel.statement("__do_sys_perf_event_open@*:*") {
tokenize(pp(),"@"); fileline=tokenize("","@")
file=tokenize(fileline,":\""); line=tokenize("",":\"")
printf("%s[%d] %s:%s $$vars:%s\n", execname(), tid(), file, line, $$vars)
}
This code prints a comprehensive description of each statement, including the process and thread responsible for the syscall invocation, the filename and line number of the statement, and the contents of local variables. The calls to tokenize
in the code snippet extract the file name and line number from the pp()
probe point string describing the probe point. After wildcard resolution, this string has a form similar to the probe points produced by stap -L
such as:
kernel.statement("__do_sys_perf_event_open at kernel/events/core.c:12107")
Of course, if we trace every invocation of a function, we will see a lot of irrelevant information. Our program might invoke perf_event_open
multiple times and only one of these invocations returns the error code causing the problem. At the same time, it is difficult to filter invocations because we don't know ahead of time whether any given invocation of perf_event_open
is the one that ends up returning the error code.
The whythefail.stp
script solves this dilemma with a strategy called speculative tracing. A speculative tracer collects information about every invocation of the problematic call (perf_event_open
). At the end of each invocation, the tracer checks the collected data and determines whether it is relevant to the final report, reporting relevant data and discarding irrelevant data. In this case, the relevance criterion is simple: any invocation of perf_event_open
that returns EINVAL
is relevant.
SystemTap supports speculative tracing via the speculation.stp tapset library, including the following functions:
speculate(id, msg)
stores a stringmsg
into a speculation buffer identified by the labelid
until the script can decide whether the information it contains is relevant. For this example, we create a separate speculation buffer for every invocation ofperf_event_open
.commit(id)
outputs the strings accumulated in the speculation buffer labeledid
. For this example, we commit the speculation buffer for an invocation if we find that the invocation returns EINVAL.discard(id)
discards the strings in the speculation buffer labeledid
.
When we use whythefail.stp
to trace invocations of perf_event_open
, we store a message in the speculation buffer for every statement in __do_sys_perf_event_open
. We commit or discard the speculation at the exit point of the function, traced via the following probe point:
kernel.function("__do_sys_perf_event_open").return
Using whythefail.stp to find the error source
The whythefail.stp
script is a general-purpose utility adapted to our search by providing the following arguments on the SystemTap command line:
$ stap --example whythefail.stp PROC_OR_KERNEL FUNCTION_NAME CONDITION_EXPRESSION [VARIABLE]
As documented in the script's source code, these arguments take the following meanings:
PROC_OR_KERNEL
is a SystemTap probe point prefix identifying the codebase to probe. We specifykernel
to probe the kernel, but the argument could specify a kernel module or user-space process instead.FUNCTION_NAME
is the function invocation we want to trace.CONDITION
is the condition evaluated at function return. Thewhythefail.stp
prints trace data only whenCONDITION
evaluates to true.- Optionally,
VARIABLE
is a SystemTap expression instructingwhythefail.stp
to print an additional value at each statement. The default forVARIABLE
is$$vars
, which prints all of the local variables at each statement.
Optionally, we can add the SystemTap command-line option -c CMD
or -x PID
to trace system call invocations from one particular process. These options are useful for system calls invoked by many other programs simultaneously with the program we debug.
In our example scenario, we invoke whythefail.stp
with a FUNCTION_NAME
of __do_sys_perf_event_open
and a CONDITION
of $return == -22
. Recall that 22 is the numerical error code for EINVAL. We can obtain it by following the chain of includes from /usr/include/errno.h
or by running the errno command-line utility. With these arguments, whythefail.stp
traces only invocations of perf_event_open
that finish by returning EINVAL
as follows:
$ stap --example whythefail.stp kernel __do_sys_perf_event_open "\$return == -22"
We use a backslash to escape the $
dollar sign in $return
to suppress the shell's variable substitution.
The script's output shows the path that the __do_sys_perf_event_open
function took before returning the error code. There will be an entry for each statement, displaying its filename and line number with the contents of local variables, as shown in the following example:
myprogram[20919] kernel function(__do_sys_perf_event_open) $return: -22
myprogram[20919] kernel/events/core.c:12049 $$vars: attr_uptr=0x7ffcd1ef0bc0 pid=0xffffffffffffffff cpu=0x1 group_fd=0xffffffffffffffff flags=0x0 group_leader=0x0 output_event=0x0 event=? sibling=? attr={...} ctx=? gctx=? event_file=0x0 group={...} task=0x0 pmu=? event_fd=? move_group=0x0 err=? f_flags=0x2 cgroup_fd=0xffffffffffffffff
myprogram[20919] kernel/events/core.c:12053 $$vars: attr_uptr=0x7ffcd1ef0bc0 pid=0xffffffffffffffff cpu=0x1 group_fd=0xffffffffffffffff flags=0x0 group_leader=0x0 output_event=0x0 event=? sibling=? attr={...} ctx=? gctx=? event_file=0x0 group={...} task=0x0 pmu=? event_fd=? move_group=0x0 err=? f_flags=0x2 cgroup_fd=0xffffffffffffffff
myprogram[20919] kernel/events/core.c:12054 $$vars: attr_uptr=0x7ffcd1ef0bc0 pid=0xffffffffffffffff cpu=0x1 group_fd=0xffffffffffffffff flags=0x0 group_leader=0x0 output_event=0x0 event=? sibling=? attr={...} ctx=? gctx=? event_file=0x0 group={...} task=0x0 pmu=? event_fd=? move_group=0x0 err=0x0 f_flags=0x2 cgroup_fd=0xffffffffffffffff
myprogram[20919] kernel/events/core.c:12489 $$vars: attr_uptr=? pid=? cpu=? group_fd=? flags=? group_leader=? output_event=? event=? sibling=? attr={...} ctx=? gctx=? event_file=? group={...} task=? pmu=? event_fd=? move_group=? err=? f_flags=? cgroup_fd=?
Analyzing the Linux source
We can obtain the source code of kernel/events/core.c
with the debuginfod-find source utility. When invoking it, we need to know the absolute path of the kernel source as configured in the distribution's source code packages (e.g., kernel-debuginfo-common-*
on Fedora or linux-source-*
on Debian). For example, the absolute path has the prefix /usr/src/debug/kernel-5.19.11/linux-5.19.11-200.fc36.x86_64
on a Fedora system running the kernel version 5.19.11-200.fc36.x86_64
.
$ debuginfod-find source /boot/vmlinuz-5.19.11-200.fc36.x86_64 /usr/src/debug/kernel-5.19.11/linux-5.19.11-200.fc36.x86_64/kernel/events/core.c
Downloading from https://debuginfod.fedoraproject.org/ 83040/331982
~/.cache/debuginfod_client/9b687addfc9de54d0ff36eeb801cadde85fde7b3/source##usr##src##debug##kernel-5.19.11##linux-5.19.11-200.fc36.x86_64##kernel##events##core.c
Then, we can look for the line numbers specified in the output of whythefail.stp
as follows:
$ less -N ~/.cache/debuginfod_client/9b687*/source*kernel##events##core.c
...
12052 /* Do we allow access to perf_event_open(2) ? */
12053 err = security_perf_event_open(&attr, PERF_SECURITY_OPEN);
12054 if (err)
12055 return err;
12056
12057 err = perf_copy_attr(attr_uptr, &attr);
12058 if (err)
12059 return err;
...
12489 }
...
Because of inlining and optimization, the kernel debug information is not guaranteed to provide an accessible probe point for every line. We can see that the statement coverage is incomplete if whythefail.stp
outputs warning messages such as the following:
WARNING: probe kernel.statement("__do_sys_perf_event_open at kernel/events/core.c:12057") (address 0xffffffffa829e017) registration error [man warning::pass5] (rc -84)
In this situation, it is still possible to diagnose the problem. We need to be careful to consider multiple possibilities when interpreting the whythefail.stp
output.
At first glance, it looks as though the error code is coming from the call to security_perf_event_open()
. We test this first hypothesis by running whythefail.stp
again for invocations of security_perf_event_open
as follows:
$ stap --example whythefail.stp kernel security_perf_event_open "\$return == -22"
...
^C
statistics:
entry count: 11
exit-miss count: 11
statement count: 22
The invocation does not yield any output, which suggests that security_perf_event_open()
returned as normal. The statistics report printed when we terminate whythefail.stp
, showing that whythefail.stp
detected 11 function entries and exits that did not satisfy the condition $return == -22
which supports this conclusion.
The original whythefail.stp
output for __do_sys_perf_event_open
stated that err=0x0
(i.e., the value of the local variable err
is still 0) when execution reaches line 12054. This suggests that execution continued to line 12057, one of the lines SystemTap warned us it could not probe. In that case, the error code might be coming from the call to perf_copy_attr()
.
We test this second hypothesis by running whythefail.stp
again for invocations of perf_copy_attr
:
$ stap --example whythefail.stp kernel perf_copy_attr "\$return == -22"
...
myprogram[41981] kernel function(perf_copy_attr) $return: -22
myprogram[41981] kernel/events/core.c:11704 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=?
myprogram[41981] kernel/events/core.c:11705 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=?
myprogram[41981] kernel/events/core.c:11711 $$vars: __ret_gu=? __val_gu=0x0 uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=?
myprogram[41981] kernel/events/core.c:11712 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=0x80 ret=0x0
myprogram[41981] kernel/events/core.c:11716 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=0x80 ret=0x0
myprogram[41981] kernel/events/core.c:11718 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=0x80 ret=0x0
myprogram[41981] kernel/events/core.c:11721 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0
myprogram[41981] kernel/events/core.c:11722 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0
myprogram[41981] kernel/events/core.c:11733 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0
myprogram[41981] kernel/events/core.c:11736 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0
myprogram[41981] kernel/events/core.c:11739 $$vars: uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0
myprogram[41981] kernel/events/core.c:11740 $$vars: mask=? uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0
myprogram[41981] kernel/events/core.c:11743 $$vars: mask=0x40001 uattr=0x7ffff30f9140 attr=0xffffa70d8af2be48 size=? ret=0x0
myprogram[41981] kernel/events/core.c:11826 $$vars: uattr=0x7ffff30f9140 attr=? size=? ret=?
Our program received the error code from this function. These indicated lines of kernel/events/core.c
reveal the cause:
$ less -N ~/.cache/debuginfod_client/9b687*/source*kernel##events##core.c
...
11739 if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {
11740 u64 mask = attr->branch_sample_type;
11741
11742 /* only using defined bits */
11743 if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
11744 return -EINVAL;
...
11826 }
Evidently, our program passed an invalid parameter value in attr->branch_sample_type
, where attr
is the struct perf_event_attr
parameter data structure provided when invoking the system call.
SystemTap and whythefail.stp define system call errors
This article demonstrated how source code inspection and SystemTap tracing could narrow down the causes of a confusing error code. With the help of whythefail.stp
, we utilized the general information provided by an EINVAL error code and traced the Linux kernel's implementation of the system call to understand which argument was invalid and why. We also illustrated how debugging tools could double as program understanding tools for programmers whose code interacts with an unfamiliar project. When we used whythefail.stp
, our goal was not to diagnose a problem within the kernel itself but to understand the kernel's internals and behavior in more detail than the documentation conveys.
Please comment below if you have any questions. We welcome your feedback!
Last updated: June 4, 2024