Red Hat Developer

A common question that SystemTap can be used to answer involves how often particular events occur on the system. Some events, such as system calls, can happen frequently and the goal is to make the SystemTap script as efficient as possible.

Using the statistical aggregate in the place of regular integers is one way to improve the performance of SystemTap scripts. The statistical aggregates record data on a per-CPU basis to reduce the amount of coordination required between processors, allowing information to be recorded with less overhead. In this article, I’ll show an example of how to reduce overhead in SystemTap scripts.

Below is a version of the syscalls_by_proc.stp example script that tallies the number of times each executable makes a system call. When the script exits, it prints out the number of times each executable on the system make a system call sorted from most to fewest system calls.

Line 1 defines the global array used to store the information. Line 3 is the probe that actually tallies each system call using the ++. The probe end starting at line 5 prints out the data stored in the global syscalls associative array.

global syscalls

probe kernel.trace("sys_enter") { syscalls[execname()] ++ }

probe end {
  printf ("%-10s %-s\n", "#SysCalls", "Process Name")
  foreach (proc in syscalls-)
    printf("%-10d %-s\n", syscalls[proc], proc)
}

The following code example shows a run of the previous script. The -v option causes SystemTap to provide timing information for each of the five passes passes. The -t provides timing information for each of the probes in the script and will provide a way to compare the efficiency of the different implementations. The -c "make -j8" starts the make command once the SystemTap instrumentation is ready and stops the instrumentation once the command finishes. At the end of the output is the "probe hit report," which provides information about the overhead of the various probes.

The middle of line for the kernel.trace("raw_syscalls:sys_enter") shows it was triggered more than 24 million times. The middle of that same line states it took an average of 4224 clock cycles to do the operation. Also over 3 million lock contentions occur on the global associative array (__global_syscall) in the "refresh report" section.

$ stap -v -t ~/present/2019blog/fast/syscalls_by_proc.stp -c "make -j8"
Pass 1: parsed user script and 504 library scripts using 290680virt/85620res/3552shr/82780data kb, in 590usr/30sys/621real ms.
Pass 2: analyzed script: 2 probes, 1 function, 0 embeds, 1 global using 296352virt/92244res/4484shr/88452data kb, in 130usr/190sys/331real ms.
Pass 3: using cached /home/wcohen/.systemtap/cache/b8/stap_b8412b9e49934149b789a69c3e2a2b4e_1469.c
Pass 4: using cached /home/wcohen/.systemtap/cache/b8/stap_b8412b9e49934149b789a69c3e2a2b4e_1469.ko
Pass 5: starting run.
  DESCEND  objtool
  CALL    scripts/atomic/check-atomics.sh
  CALL    scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  TEST    posttest
  Building modules, stage 2.
  MODPOST 3484 modules
arch/x86/tools/insn_decoder_test: success: Decoded and checked 5057175 instructions
  TEST    posttest
arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0x18e55059)
Kernel: arch/x86/boot/bzImage is ready  (#9)
#SysCalls  Process Name
22614118   make
1675055    sh
138647     awk
112797     modpost
93456      cat
69563      objdump
68771      insn_decoder_te
49887      grep
14656      cc1
13544      gcc
11714      rm
10174      as
...
----- probe hit report: 
kernel.trace("raw_syscalls:sys_enter"), (/home/wcohen/present/2019blog/fast/syscalls_by_proc.stp:17:1), hits: 24894191, cycles: 310min/4224avg/477851max, variance: 5282267, from: kernel.trace("sys_enter"), index: 0
end, (/home/wcohen/present/2019blog/fast/syscalls_by_proc.stp:21:1), hits: 1, cycles: 109027min/109027avg/109027max, variance: 0, from: end, index: 1
----- refresh report:
'__global_syscalls' lock contention occurred 3230491 times
Pass 5: run completed in 174200usr/100470sys/69766real ms.

The overhead of the script can be significantly reduce by using statistical aggregates. The ++ is replaced by a <<< 1 to tally each system call and the tallies are now printed with a @sum(syscalls[proc]) in the code below.

global syscalls

probe kernel.trace("sys_enter") { syscalls[execname()] &lt;&lt;&lt; 1 }

probe end {
  printf ("%-10s %-s\n", "#SysCalls", "Process Name")
  foreach (proc in syscalls-)
    printf("%-10d %-s\n", @sum(syscalls[proc]), proc)
}

Below is an equivalent run of the script using the statistical aggregates in place of the slower ++. Looking at the "probe hit report" toward the end of the listing, you will see there the raw_syscalls:sys_enter trace point was run a similar number of times as the other script, about 25 million times. However, notice that the average time required by the handler is 409 clock cycles, much lower that the 4224 cycles observed on the version with ++. There is no lock contention listed for the syscalls associative array listed in the "refresh report" section for the run using statistical aggregates either.

$ stap -v -t ~/present/2019blog/faster/syscalls_by_proc.stp -c "make -j8"
Pass 1: parsed user script and 504 library scripts using 290680virt/85624res/3552shr/82780data kb, in 580usr/30sys/616real ms.
Pass 2: analyzed script: 2 probes, 1 function, 0 embeds, 1 global using 296348virt/92256res/4492shr/88448data kb, in 140usr/190sys/332real ms.
Pass 3: using cached /home/wcohen/.systemtap/cache/de/stap_de5d5dc935de72aac43603205a17cad4_1482.c
Pass 4: using cached /home/wcohen/.systemtap/cache/de/stap_de5d5dc935de72aac43603205a17cad4_1482.ko
Pass 5: starting run.
  DESCEND  objtool
  CALL    scripts/atomic/check-atomics.sh
  CALL    scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  TEST    posttest
  Building modules, stage 2.
  MODPOST 3484 modules
arch/x86/tools/insn_decoder_test: success: Decoded and checked 5057175 instructions
  TEST    posttest
arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0xbefec711)
Kernel: arch/x86/boot/bzImage is ready  (#9)
#SysCalls  Process Name
22616433   make
1675055    sh
138647     awk
112797     modpost
93456      cat
69563      objdump
68771      insn_decoder_te
49891      grep
14706      cc1
13544      gcc
11714      rm
10174      as
...
----- probe hit report:
kernel.trace("raw_syscalls:sys_enter"), (/home/wcohen/present/2019blog/faster/syscalls_by_proc.stp:17:1), hits: 24895995, cycles: 216min/409avg/395696max, variance: 3425671, from: kernel.trace("sys_enter"), index: 0
end, (/home/wcohen/present/2019blog/faster/syscalls_by_proc.stp:19:1), hits: 1, cycles: 847759min/847759avg/847759max, variance: 0, from: end, index: 1
----- refresh report:
Pass 5: run completed in 172780usr/66010sys/64215real ms.

The tradeoff of using the statistical aggregates is that when using the @sum() is that the data needs to be fetched from all the processors in the machine. This is more expensive than just fetching a single integer value stored in an associative array. However, for this example, reducing the overhead of the system call probes more than makes up for the overhead of @sum() used to print the results.

When writing SystemTap scripts, you can use the -t option to better understand the overhead of your scripts and consider using the statistical aggregates when feasible. As this example shows, the overhead in instrumentation scripts can be significantly reduced in SystemTap scripts.