SystemTap

A number of the SystemTap script examples in the newly released SystemTap 4.0 available in Fedora 28 and 29 have reduced the amount of time required to convert the scripts into running instrumentation by using the syscall_any tapset.

This article discusses the particular changes made in the scripts and how you might also use this new tapset to make the instrumentation that monitors system calls smaller and more efficient. (This article is a follow-on to my previous article: Analyzing and reducing SystemTap’s startup cost for scripts.)

The key observation that triggered the creation of the syscall_any tapset was a number of scripts that did not use the syscall arguments. The scripts often used syscall.* and syscall.*.return, but they were only concerned with the particular syscall name and the return value. This type of information for all the system calls is available from the sys_entry and sys_exit kernel tracepoints. Thus, rather than creating hundreds of kprobes for each of the individual functions implementing the various system calls, there are just a couple of tracepoints being used in their place.

Let's take a concrete look at the impact of using syscall_any.*.return in place of syscall.*.return with the errno.stp example on a Fedora 28 machine. This errno.stp script provides system-wide information about the system calls that return error codes. It prints a separate line for each combination of PID, system call, and error. At the end of each line is a count of the times that particular combination was seen. Below is a version of the script using syscall.*.return.

#!/usr/bin/stap
#
# Copyright (C) 2010, 2018 Red Hat, Inc.
# By Dominic Duval, Red Hat Inc.
# dduval@redhat.com
#
# Monitors errors returned by system calls.
#
# USAGE: stap errno.stp
#

global execname, errors

probe syscall.*.return {
  errno = retval
  if ( errno < 0 ) {
    p = pid()
    execname[p]=execname();
    errors[p, errno, name] <<< 1
  }
}

probe end {
  printf("\n")
  printf("%8s %-32s %-16s %-12s %8s\n",
    "PID", "Syscall", "Process", "Error", "Count")
  foreach ([pid, error, thissyscall] in errors- limit 20) {
    printf("%8d %-32s %-16s %-12s %8d\n",
      pid,
      thissyscall, 
      execname[pid],
      error ? errno_str(error) : "",
      @count(errors[pid, error, thissyscall])
    )
  }
}

global prom_arr

probe prometheus {
  foreach ([p, errno, name] in errors- limit 20) {
    prom_arr[p, errno, name] = @count(errors[p, errno, name])
  }

  @prometheus_dump_array3(prom_arr, "count", "pid", "errno", "name")
  delete prom_arr
}

We can use the stap_time.stp script discussed in my previous SystemTap article to record the time required by the various SystemTap phases. The stap_time.stp script will be running in another window.

Below is a run of the old_errorno.stp script that instruments the system calls with the syscall.*.return probe points. The command line has -m old_error to keep the resulting kernel module old_errno.ko around, the keep temporary files (-k) option so we can look at the size of the files used to create the instrumentation, and -T 0.25 to run the instrumentation for a quarter second.

$ stap -m old_errno -k old_errno.stp -T 0.25

     PID Syscall                          Process          Error           Count
    8489 read                             pulseaudio       EAGAIN             50
    9016 recvmsg                          Timer            EAGAIN             16
    9223 recvmsg                          firefox          EAGAIN             11
    9016 futex                            Timer            ETIMEDOUT           7
    8386 recvmsg                          gnome-shell      EAGAIN              7
   25430 recvmsg                          Web Content      EAGAIN              6
    2196 futex                            pmdaperfevent    ETIMEDOUT           2
   16278 read                             stapio           EAGAIN              1
   25430 futex                            Web Content      ETIMEDOUT           1
    9792 recvmsg                          hexchat          EAGAIN              1
Keeping temporary directory "/tmp/stapUfJ9KC"

Now we'll replace the syscall.*.return in the script with syscall_any.return. The new, improved version has similar output:

$ stap -m new_errno -k new_errno.stp -T 0.25

     PID Syscall                          Process          Error           Count
    8489 read                             pulseaudio       EAGAIN             50
    9016 recvmsg                          Web Content      EAGAIN             40
    9016 futex                            Web Content      ETIMEDOUT          10
    8386 recvmsg                          gnome-shell      EAGAIN              9
   11884 recvmsg                          Web Content      EAGAIN              8
    9792 recvmsg                          hexchat          EAGAIN              4
    2196 futex                            pmdaperfevent    ETIMEDOUT           3
   10336 read                             gnome-terminal-  EAGAIN              2
   16840 read                             stapio           EAGAIN              1
    5262 read                             qemu-system-x86  EAGAIN              1
    1863 recvmsg                          gsd-color        EAGAIN              1
     811 openat                           systemd-journal  ENOENT              1
   11884 futex                            Web Content      ETIMEDOUT           1
Keeping temporary directory "/tmp/staprI6U1R"

The stap_time script provided the following output for the various passes of generating the instrumentation:

$ stap stap_time.stp 
old_errno.stp 80 2592 13033 807 20384
new_errno.stp 96 2594 1593 22 4125

With some minor work, Gnuplot can generate a graph of the data, as seen below. Pass 2 (elaboration) and pass 4 (compilation) are the ones that take a majority of the time in old_errno.stp using the syscall.*.return probe points. We see those are greatly reduced in new_errno.stp using the syscall_any.return probe point: from 13,033 ms to 1,593 ms for pass 2 and from 20,384 ms to 4,125 ms for pass 4. We also see pass 3 code generation is virtually nonexistent on the graph (reduced from 807 ms to 22 ms).

Graph showing reduced times for generating the instrumentation

The new_errno.ko script is less than 1/3 the size of the old_errno.ko script:

$ ls -ls *.ko
 472 -rw-rw-r--. 1 wcohen wcohen  481016 Nov  1 11:07 new_errno.ko
1528 -rw-rw-r--. 1 wcohen wcohen 1564360 Nov  1 11:06 old_errno.ko

Looking at the *src.c files in the temporary directories /tmp/stapUfJ9KC/ and /tmp/staprI6U1R/, we can see that old_errno_src.c is about four times larger than new_errno_src.c. This explains the additional overhead for the pass 3 code generation and pass 4 compilation and the larger old_errno.ko.

$ ls -l /tmp/stapUfJ9KC/*src.c /tmp/staprI6U1R/*src.c
-rw-rw-r--. 1 wcohen wcohen  967746 Nov  1 11:07 /tmp/staprI6U1R/new_errno_src.c
-rw-rw-r--. 1 wcohen wcohen 4119779 Nov  1 11:06 /tmp/stapUfJ9KC/old_errno_src.c

Conclusion

If your SystemTap script just requires information about the system call name and return value for all the system calls, you should consider using the syscall_any tapset in place of the syscall.* to make the instrumentation compile faster and result in small instrumentation.

Additional resources

See my other  articles on Red Hat Developers:

Red Hat Developer has many other articles on SystemTap and performance.