One common idiom in performance monitoring is how long did it take for a program to do something. For example you may want to know the time taken for database queries in PostgreSQL or just-in-time translations in a Java Virtual Machine. SystemTap and user-space markers in Linux packages make it much easier to determine the duration of those operations.

The user-space markers compiled into Linux packages mark key points in the code where particular actions occur. The user-space markers also provide arguments that provide additional information about the action. For example, the markers and the available arguments in PostgreSQL can be listed using using the SystemTap command:

$ stap -L 'process("postgres").mark("*")'

The two user-space markers related to the start and completion of a query are:

process("postgres").mark("query__start") $arg1:long
process("postgres").mark("query__done") $arg1:long

The $arg1 is a pointer to the string describing the query. There could be multiple queries being processed at the same time and $arg1 is useful in matching up the query__start and query__done events.

The postgresql-query.stp below is a short SystemTap script that produces statistics on the minimum, average, and maximum time for database queries for each Thread ID. The query__start probe records the time in the query_time global array using the Thread ID and pointer $arg1 as keys. When the query__done probe executes it will search the query_time global array for query start time. If a time is found, the elapsed time is computed and that sample is added to the query_summary statistics for the Thread ID. When the script exits the end probe executes and prints out the number of queries, the minimum time, average time, and maximum time for each thread ID.

global query_time, query_summary

probe process("postgres").mark("query__start") {
  query_time[tid(), $arg1] = gettimeofday_us();
}

probe process("postgres").mark("query__done") {
  p = tid()
  t = query_time[p, $arg1]; delete query_time[p, $arg1]
  if (t) {
    query_summary[p] <<< (gettimeofday_us() - t);
  }
}

probe end {
  printf("ntid count min(us) avg(us) max(us)n");
  foreach (p in query_summary) {
    printf("%d %d %d %d %dn", p, @count(query_summary[p]),
     @min(query_summary[p]), @avg(query_summary[p]), @max(query_summary[p]));
  }
}

Assuming that PostgreSQL has already been setup to run pgbench, the script is started in one terminal window with:

$ stap postgresql-query.stp

In another terminal window pgbench is run:

# su postgres -c "pgbench -T 10"

When the pgbench completes the SystemTap script in the previous terminal widow is stopped with a ctl-C yielding the following output:

tid count min(us) avg(us) max(us)
 4205 4 699 7049 15276
 4206 128072 8 55 240476

It shows there were two threads with those events, thread 4205 and 4206. The bulk of the queries were done in thread 4206 with an average time of 55 microseconds and a maximum of 240,476 microseconds.

Additional information about the PostgreSQL user-space markers is available at http://www.postgresql.org/docs/8.4/static/dynamic-trace.html .

The IcedTea JVM also has similar user-space markers in the hotspot just-in-time translator and SystemTap tapsets which provide additional variables. With the java-1.6.0-openjdk-devel-1.6.0.0 RPM installed to provide the tapsets the available probes can be listed out with:

$ stap -L "hotspot.*,hotspot.*.*"

The postgresql-query.stp could be adapted to provide information on how many just-in-time translations are performed and how long they take on average. The two probes of interest from the "stap -L" above are:

 hotspot.method_compile_begin name:string compiler:string class:string method:string 
     sig:string probestr:string $arg1:long $arg2:long $arg3:long $arg4:long 
     $arg5:long $arg6:long $arg7:long $arg8:long
 hotspot.method_compile_end name:string compiler:string class:string method:string 
     sig:string probestr:string $arg1:long $arg2:long $arg3:long $arg4:long $arg5:long 
     $arg6:long $arg7:long $arg8:long $arg9:long

We end up with a similar script, icedtea-jit.stp, below:

global jit_time, jit_summary

probe hotspot.method_compile_begin {
  jit_time[tid(), class, method] = gettimeofday_us();
}

probe hotspot.method_compile_end {
  p = tid()
  t = jit_time[p, class, method]; delete jit_time[p, class, method]
  if (t) {
    jit_summary[p] <<< (gettimeofday_us() - t);
  }
}

probe end {
  printf("ntid count min(us) avg(us) max(us)n");
  foreach (p in jit_summary) {
    printf("%d %d %d %d %dn", p, @count(jit_summary[p]),
     @min(jit_summary[p]), @avg(jit_summary[p]), @max(jit_summary[p]));
  }
}

In one terminal window start the icedtea-jit.stp with:

$ stap icedtea-jit.stp

In another window start up a Java application such as Eclipse:

$ eclipse

After the Eclipse Java application is exited with icedtea-jit.stp with a ctrl-C. This results in something like the following output:

tid count min(us) avg(us) max(us)
4736 501 123 18601 901362
4735 481 131 21573 2097815

This shows that there were two threads doing just-in-time translation when eclipse was run and each translated approximately 500 methods. The minimum times were fairly short around 120 to 130 microseconds, but the average times were much larger on the order of 20 milliseconds. It would be nice to see what the distributions of times were. The @hist_log() operator can be used in the end probe to print histograms of the times spent doing just-in-time translation. The probe handlers for the start and end of just-in-time translation remain the same and the end probe is changed to the following:

probe end {
  foreach (p in jit_summary) {
    printf("tid = %dn", p)
    print(@hist_log(jit_summary[p]))
  }
}

Doing the same experiment the new script now better shows the distributions of time:

tid = 5975
  value |-------------------------------------------------- count
     32 |                                                    0
     64 |                                                    0
    128 |@@@@@@@@@@@@@@@                                    31
    256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     60
    512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     60
   1024 |@@@@@@@@@@@@@@@@@@@@@@@@                           48
   2048 |@@@@@@@@@@@@@@@@                                   33
   4096 |@@@@@@@@@@@@@@@@@@@@@@@@                           48
   8192 |@@@@@@@@@@@@@@@@@@                                 36
  16384 |@@@@@@@@@@@@@@@@@@@                                39
  32768 |@@@@@@@@@                                          18
  65536 |@@@@@                                              11
 131072 |@@                                                  5
 262144 |@@                                                  5
 524288 |                                                    1
1048576 |                                                    0
2097152 |                                                    1
4194304 |                                                    0
8388608 |                                                    0

tid = 5976
  value |-------------------------------------------------- count
     16 |                                                    0
     32 |                                                    0
     64 |                                                    1
    128 |@@@@@@@@@@@@@@@@@                                  34
    256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 69
    512 |@@@@@@@@@@@@@@@@@@@@@@@@@                          51
   1024 |@@@@@@@@@@@@@@@@@@@@@                              43
   2048 |@@@@@@@@@@@@@@@@@@@@@@@@@                          51
   4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              74
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@                           48
  16384 |@@@@@@@@@@@@@@@@@@@@@@@@@                          50
  32768 |@@@@@@@@@@@@@@@                                    31
  65536 |@@@@                                                8
 131072 |@@@                                                 6
 262144 |@                                                   2
 524288 |                                                    1
1048576 |                                                    0
2097152 |                                                    0

The plots show most of the translations take less than 65536 micro seconds, but there are a few slow translations that are approaching a second of wall clock time. One could further modify the probe handlers to show precisely which classes and methods those slow translations were in.

Last updated: January 9, 2023