Featured image for: SCTP over UDP in the Linux kernel.

This is the second article in a two-part series about SystemTap, a tool for adding instrumentation to Linux systems to better understand the behavior of the kernel and of userspace applications or libraries. The first article used an existing example script called linetimes.stp to uncover a possible performance problem.

Now we'll show how SystemTap determines where to place the instrumentation, a few common SystemTap coding techniques, how the linetimes.stp script was created, and some issues encountered when implementing the script.

How does linetimes.stp work?

The debugging information produced by the compiler contains several different pieces of information, including the regions occupied by functions and output mapping machine instructions back to the files and lines of source code.

Multiple machine-language instructions can be associated with a particular line of code. The compiler marks the appropriate instruction that represents the start of a statement in the line information. The script probes each statement start using the following statement:

probe $1.statement(@2 "@*:*")

The $1 will be filled in with the first argument passed to the script, indicating what binary (the kernel, a kernel module, or a userspace application) is being instrumented. The @2 will be filled in by the second argument passed to the script: the name of the function being instrumented. The at symbol (@) indicates that the second argument should be converted to a string. The $1 argument requires no conversion and remains as is.

The string @*:* specifies a wildcard probe, requesting all filenames through the first asterisk and all line numbers through the second asterisk.

Now that we have the probes, the next step is to create a probe handler that computes the times between the successive probes, accumulates those times, and tracks the control flow through the program. The script needs some state information stored in global associative arrays. The elapsed times (in an array called times) and last probe point seen (in an array called last_pp) need to be tracked on a per-thread basis. The probe handler uses the thread ID, tid(), to access those associative arrays and keep information about each thread distinct.

By default, if there is no entry in the associative array for a key, the value 0 is returned when the program tries to read the entry. To avoid computing a negative elapsed time by subtracting the current time from 0, you need to check that times[pid()] has a non-zero value. A non-zero time value indicates that an earlier probe on the thread was triggered, so there is information about the time in times and the previous probe point in last_pp.

The body of the if statement in the following probe handler computes the elapsed time, e. The times are accumulated in the region statistical associative array with the <<< operator. The probe point stored in last_pp is used as a key to group together times with the same starting line.

The last statement in the body of the if statement records the path from the previous probe location (last_pp[tid()]) to the current probe location (pp()) in the cfg associative array. Following the if statement, the probe handler ends by updating the times and last_pp entries for the thread:

global times, last_pp, region, cfg
probe $1.statement(@2 "@*:*") {
  t = gettimeofday_us()
  s = times[tid()]
  if (s) {
    e = t - s
    region[last_pp[tid()]] <<< e
    cfg[last_pp[tid()], pp()] <<< 1
  }
  times[tid()] = t
  last_pp[tid()] = pp()
}

Properly handling function returns

One situation the script needs to address is leaving the function via a return and then having the function called again. With just the probe shown previously, the script would include the time between the last statement before the function return and the first statement executed on the next call to the function. To avoid this error, the script needs to have a probe similar to the previous one, but on the function return. The main difference between the two is that the return probe removes the times and last_pp entries from the associative arrays to avoid computing the unwanted interval.

This design results in the following probe. As before, $1 indicates whether the instrumented binary is the kernel, a kernel module, or a userspace binary, and @2 is the function name:

probe $1.function(@2).return {
  t = gettimeofday_us()
  s = times[tid()]
  if (s) {
    e = t - s
    region[last_pp[tid()]] <<< e
    cfg[last_pp[tid()], pp()] <<< 1
  }
  delete times[tid()]
  delete last_pp[tid()]
}

Counting the function entries

To gauge the relative frequency of paths through the code, one would like to know the number of times the function is called. This information is recorded in a global variable (calls) and the following one-line probe that counts each time the function is entered. This probe uses the statistics operator (<<<) for efficiency, because multiple probes can operate concurrently without locking:

global calls
probe $1.function(@2).call { calls <<< 1 }

Generating the results

One last probe is needed to generate the output when the script exits. This task is implemented with an end probe. The output is generated using printf statements. The SystemTap printf allows formatting with field width and justification. The %-58s in the third printf left-justifies the output of a string printed in a 58-character wide field, and the following two %10d entries format two right-justified decimal numbers in 10-character wide fields.

There are two sections of code. The first section prints out the headers for the region information, then uses a foreach loop to print a line with the statistical average and maximum time recorded for each region. The loop variable is p and the + sign indicates that the loop should sort the values of p in ascending order. In this case, output is in the same order as the lines in the source code.

The second section prints the control flow graph information, which is implemented with nested foreach loops. The outer loop prints each line executed, and the inner loop prints out all the lines that were executed immediately following it, with a count of the executions:

probe end {
  printf("\n%s %s call count: %d\n", @1, @2, @count(calls));
  printf("\n%-58s %10s %10s\n", "region", "avg(us)", "max(us)");
  foreach (p+ in region) {
    printf("%-58s %10d %10d\n", p, @avg(region[p]), @max(region[p]));
  }
  printf("\n\ncontrol flow graph information\n")
  printf("from\n\tto\n=======================\n")
  foreach ([src+] in region) {
    printf("%-s\n", src)
    foreach ([s,dest+] in cfg[src,*]) { # slice for all dest's
      printf("\t%-s %d\n", dest, @count(cfg[src,dest]));
    }
  }
}

Putting the script together

All the previous pieces of code have been assembled into a single file in /usr/share/systemtap/examples/process/linetimes.stp:

#! /usr/bin/env stap
#
# Copyright (C) 2010-2015 Red Hat, Inc.
# Written by William Cohen <wcohen@redhat.com>
#
# The linetimes.stp script takes two arguments: where to find the function
# and the function name. linetimes.stp will instrument each line in the
# function. It will print out the number of times that the function is
# called, a table with the average and maximum time each line takes,
# and control flow information when the script exits.
#
# For example all the lines of the do_unlinkat function:
#
# stap linetimes.stp kernel do_unlinkat

global calls, times, last_pp, region, cfg

probe $1.function(@2).call { calls <<< 1 }
probe $1.function(@2).return {
  t = gettimeofday_us()
  s = times[tid()]
  if (s) {
    e = t - s
    region[last_pp[tid()]] <<< e
    cfg[last_pp[tid()], pp()] <<< 1
  }
  delete times[tid()]
  delete last_pp[tid()]
}

probe $1.statement(@2 "@*:*") {
  t = gettimeofday_us()
  s = times[tid()]
  if (s) {
    e = t - s
    region[last_pp[tid()]] <<< e
    cfg[last_pp[tid()], pp()] <<< 1
  }
  times[tid()] = t
  last_pp[tid()] = pp()
}

probe end {
  printf("\n%s %s call count: %d\n", @1, @2, @count(calls));
  printf("\n%-58s %10s %10s\n", "region", "avg(us)", "max(us)");
  foreach (p+ in region) {
    printf("%-58s %10d %10d\n", p, @avg(region[p]), @max(region[p]));
  }

  printf("\n\ncontrol flow graph information\n")
  printf("from\n\tto\n=======================\n")
  foreach ([src+] in region) {
     printf("%-s\n", src)
     foreach ([s,dest+] in cfg[src,*]) { # slice for all dest's
        printf("\t%-s %d\n", dest, @count(cfg[src,dest]));
     }
  }
}

Limitations

The linetimes.stp script is useful, but suffers from several limitations:

  • The measurements include overhead from the probe handlers. The line of code being measured might only be a few instructions, whereas the probe hander's overhead is significantly more. The script is therefore most useful when a line of code calls a function and you are trying to determine which calls are slower and faster in the function.
  • The script can't monitor inline functions, because the probes for function entry (probe $1.function(@2).call) and return (probe $1.function(@2).return) are not available for inline functions. This limitation extends farther than it might look, because compiler optimization can implicitly make functions inline even if they aren't explicitly specified as inline by the developer.
  • Compiler reordering of statements can lead to some unexpected sequences in the control flow graph.
  • Recursive functions can't be probed. If the script monitors a function using recursion, one would see the line with the call followed by the line that is the entry point of the function. However, each time the function returns, the probe clears data and doesn't properly track the times for statements following the return.

Conclusion

When you encounter a system problem in the future, take a look at the SystemTap examples to see whether there's an existing script that might help you diagnose the problem or provide a good starting point to create your own bespoke instrumentation. You can also learn more about SystemTap on the upstream project's home page.