William Cohen

William Cohen has been a developer of performance tools at Red Hat for over a decade and has worked on a number of the performance tools in Red Hat Enterprise Linux and Fedora, such as OProfile, PAPI, SystemTap, and Dyninst.

Recent Posts

Algorithms != Programs and Programs are not “One size fits all”

You’ve probably been taught that picking an algorithm that has the best Big-O asymptotic cost will yield the best performance. You might be surprised to find that on current hardware, this isn’t always the case. Much of algorithmic analysis assumes very simple costs where the order of operations doesn’t matter. Memory access times are assumed to be the same. However, the difference between a cache hit (a few processor clock cycles) and a cache miss that requires access to main memory (a couple hundred cycles) is immense.
This article series grew out of a discussion between the authors (William Cohen and Ben Woodard) about the disconnect between the typical ideas of algorithm efficiency taught in computer science and computer engineering and what is currently encountered in actual computer systems.
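
The gap between a cache hit and a cache miss can be sketched with two loops that do the same arithmetic in a different order. This is a minimal illustration and not code from the article series; the function names `sum_row_major` and `sum_column_major` are hypothetical. Both functions have the same Big-O cost, but the first walks memory sequentially while the second jumps `cols` elements between accesses, which tends to cause more cache misses on large matrices.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order, visiting
 * elements in the order they are laid out in memory. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long sum = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];   /* adjacent addresses */
    return sum;
}

/* Same sum, but visiting one column at a time. Each access is
 * cols * sizeof(int) bytes away from the previous one. */
long sum_column_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];   /* strided addresses */
    return sum;
}
```

Both functions return the same result for the same input. Any difference in run time comes from the memory access pattern, and its size depends on the matrix dimensions and the cache hierarchy of the machine.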

Reducing the startup overhead of SystemTap monitoring scripts with syscall_any tapset

A number of the SystemTap script examples in the newly released SystemTap 4.0 available in Fedora 28 and 29 have reduced the amount of time required to convert the scripts into running instrumentation by using the syscall_any tapset.

This article discusses the particular changes made in the scripts and how you might also use this new tapset to make the instrumentation that monitors system calls smaller and more efficient. (This article is a follow-on to my previous article: Analyzing and reducing SystemTap’s startup cost for scripts.)

The key observation that triggered the creation of the syscall_any tapset was that a number of scripts did not use the syscall arguments. The scripts often used syscall.* and syscall.*.return, but they were only concerned with the particular syscall name and the return value. This type of information for all the system calls is available from the sys_entry and sys_exit kernel tracepoints. Thus, rather than creating hundreds of kprobes, one for each of the individual functions implementing the various system calls, just a couple of tracepoints are used in their place.

Analyzing and reducing SystemTap’s startup cost for scripts

SystemTap is a powerful tool for investigating system issues, but for some SystemTap instrumentation scripts, the startup times are too long. This article is Part 1 of a series and describes how to analyze and reduce SystemTap’s startup costs for scripts.

We can use SystemTap to investigate this problem and provide some hard data on the time required for each of the passes that SystemTap uses to convert a SystemTap script into instrumentation. SystemTap has a set of probe points marking the start and end of passes from 0 to 5:

  • pass0: Parsing command-line arguments
  • pass1: Parsing scripts
  • pass2: Elaboration
  • pass3: Translation to C
  • pass4: Compilation of C code into kernel module
  • pass5: Running the instrumentation

Making the Operation of Code More Transparent and Obvious with SystemTap

You can study source code and manually instrument functions as described in the “Use the dynamic tracing tools, Luke” blog article, but why not make it easier to find key points in the software by adding user-space markers to the application code? User-space markers have been available in Linux for quite some time (since 2009). Inactive user-space markers do not significantly slow down the code. Having them available allows you to get a more accurate picture of what the software is doing internally when unexpected issues occur. The diagnostic instrumentation can be more portable with user-space markers, because the instrumentation does not need to rely on particular function names or line numbers in the source code. The naming of the instrumentation points can also make it clearer which event is associated with a particular instrumentation point.

For example, Ruby MRI on Red Hat Enterprise Linux 7 has a number of different instrumentation points made available as a SystemTap tapset. If SystemTap is installed on the system, as described in “What is SystemTap and how to use it?”, the installed Ruby MRI instrumentation points can be listed with the stap -L command. These events show the start and end of various operations in the Ruby runtime, such as the start and end of garbage collection (GC) marking and sweeping.

“Use the dynamic tracing tools, Luke”

A common refrain for tracking down issues on computer systems running open source software is “Use the source, Luke.” Reviewing the source code can be helpful in understanding how the code works, but the static view may not give you a complete picture of how things work (or are broken) in the code. The paths taken through code are heavily data dependent. Without knowledge about specific values at key locations in code, you can easily miss what is happening. Dynamic instrumentation tools, such as SystemTap, that trace and instrument the software can help provide a more complete understanding of what the code is actually doing.

I have wanted to better understand how the Ruby interpreter works. This is an opportunity to use SystemTap to investigate Ruby MRI internals on Red Hat Enterprise Linux 7. The article “What is SystemTap and how to use it?” has more information about installing SystemTap. The x86_64 RHEL 7 machine has ruby-2.0.0.648-33.el7_4.x86_64.rpm installed, so the matching debuginfo RPM is installed to provide SystemTap with information about function parameters and to provide me with human-readable source code. The debuginfo RPM is installed by running the following command as root:


Find what capabilities an application requires to successfully run in a container

Many developers would like to run their existing applications in a container with restricted capabilities to improve security. However, it may not be clear which capabilities the application uses because the code uses libraries or other code developed elsewhere. The developer could run the application in an unrestricted container that allows all syscalls and capabilities, which avoids hard-to-diagnose failures caused by the application’s use of forbidden capabilities or syscalls. Of course, this eliminates the enhanced security of restricted containers. At Red Hat, we have developed a SystemTap script (container_check.stp) to provide information about the capabilities that an application uses. Read the SystemTap Beginners Guide for information on how to set up SystemTap.


How to avoid wasting megabytes of memory a few bytes at a time

Maybe you have so much memory in your computer that you never have to worry about it — then again, maybe you find that some C or C++ application is using more memory than expected. This could be preventing you from running as many containers on a single system as you expected, it could be causing performance bottlenecks, and it could even be forcing you to pay for more memory in your servers.

You do some quick “back of the envelope” calculations to estimate the amount of memory your application should be using, based on the size of each element in some key data structures, and the number of those data structures in each array. You think to yourself, “That doesn’t add up! Why is the application using so much more memory?” The reason it doesn’t add up is that you didn’t take into account the memory space being wasted in the layout of the data structures.
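
As a rough sketch of where that wasted space comes from, consider two structures that hold the same three fields in a different order. The struct names, field names, and the `padding_bytes` helper are hypothetical and chosen only for illustration. On a typical x86_64 Linux system the first layout occupies 24 bytes and the second 16 bytes for the same 10 bytes of payload, although exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small fields on either side of a large one: the compiler inserts
 * padding after flag so that id is suitably aligned, and again after
 * kind so that array elements stay aligned. */
struct padded {
    uint8_t  flag;
    uint64_t id;
    uint8_t  kind;
};

/* Same fields, largest first: the two small fields share the space
 * that follows id, so only tail padding may remain. */
struct reordered {
    uint64_t id;
    uint8_t  flag;
    uint8_t  kind;
};

/* Bytes of real data carried by either struct. */
#define PAYLOAD_BYTES (sizeof(uint64_t) + 2 * sizeof(uint8_t))

/* Bytes in a struct of the given size that carry no data. */
size_t padding_bytes(size_t struct_size)
{
    return struct_size - PAYLOAD_BYTES;
}

/* Reordering never makes this particular struct larger. */
static_assert(sizeof(struct reordered) <= sizeof(struct padded),
              "largest-first ordering should not increase the size");
```

Calling `padding_bytes(sizeof(struct padded))` and `padding_bytes(sizeof(struct reordered))` shows how many bytes per element are lost to layout. Multiplied across millions of array elements, a few bytes each can account for the megabytes that the back-of-the-envelope estimate misses.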


Instruction-level Multithreading to improve processor utilization

No one wants the hardware in their computer sitting idle; we all want to get as much useful work out of our hardware as possible. Mechanisms such as caches and branch prediction have been incorporated into processors to minimize the amount of processor idle time caused by memory accesses and changes in program flow. However, these mechanisms are not perfect.

There are still times when the processor could be idle waiting for data or computational results to become available. These delays are relatively short: generally less than a few hundred clock cycles, and typically around ten. An operating system context switch to another runnable task takes on the order of hundreds of cycles. Thus, the overhead is too large for the operating system to switch to another runnable task to hide these short periods of idleness.

One approach to getting better utilization is to have the physical processor support multiple logical processors. If one logical processor has to wait for some result, the physical processor can switch to processing instructions from other logical processors, which keeps the hardware busy doing useful work.


“Don’t cross the streams”: Thread safety and memory accesses at the speed of light

The classic 1984 movie Ghostbusters offered an important safety tip for all of us:

“Don’t cross the streams.” – “Why not?” – “It would be bad.” – “I’m fuzzy on the whole good/bad thing. What do you mean, ‘bad’?” – “Try to imagine all life as you know it stopping instantaneously and every molecule in your body exploding at the speed of light.” – “Right. That’s bad. Okay. All right. Important safety tip. Thanks…”

Similarly, in computing, there are cases where data crossing through memory between different instruction streams can have a similar effect on a software application: “all execution as we know it stopping instantaneously”.

This is due to the performance optimizations that both hardware and software implement to reorder and eliminate memory accesses. Ignoring these “memory access reordering” issues can result in extremely problematic debugging scenarios.

The bug from hell was a scenario where Java’s OpenJDK runtime parallel garbage collector very occasionally crashed because one thread’s write would signal that the data structure had been updated. This signal occurred before the actual update writes (to the same data structure), and the result was that other threads would end up reading invalid values. We’re going to take a deeper look into this scenario to understand exactly what went on in this notorious issue.
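
The general shape of this kind of bug, and one common way to avoid it, can be sketched with C11 atomics. This is not the OpenJDK code or its fix; the `mailbox` structure and its functions are hypothetical names used only to illustrate the pattern. The writer stores the data first and then sets a flag with release ordering, and the reader checks the flag with acquire ordering before touching the data, so neither the compiler nor the hardware may move the data write past the flag write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mailbox {
    int         payload;  /* ordinary data, written before publication */
    atomic_bool ready;    /* signals that payload holds a valid value  */
};

void mailbox_init(struct mailbox *mb)
{
    assert(mb);
    mb->payload = 0;
    atomic_init(&mb->ready, false);
}

/* Writer: update the data, then signal. The release store keeps the
 * payload write from being reordered after the flag write. */
void mailbox_publish(struct mailbox *mb, int value)
{
    assert(mb);
    mb->payload = value;
    atomic_store_explicit(&mb->ready, true, memory_order_release);
}

/* Reader: check the signal, then read the data. The acquire load keeps
 * the payload read from being reordered before the flag read.
 * Returns false if nothing has been published yet. */
bool mailbox_try_read(struct mailbox *mb, int *out)
{
    assert(mb && out);
    if (!atomic_load_explicit(&mb->ready, memory_order_acquire))
        return false;
    *out = mb->payload;
    return true;
}
```

If `ready` were a plain `bool` written with an ordinary store, the signal could become visible to another thread before the payload write, which is the failure pattern described above.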


Superscalar Execution

In the traditional processor pipeline model, under ideal circumstances, one new instruction enters the processor’s pipeline and one instruction completes execution each cycle. Thus, in the best case the processor can have an average execution rate of one clock per instruction. A superscalar processor allows multiple unrelated instructions to start on the same clock cycle on separate hardware units or pipelines. Under ideal conditions a superscalar processor could have an average clocks per instruction (CPI) of less than one, meaning your 2GHz processor might be able to execute “billions and billions of instructions per second,” as the famed astronomer Carl Sagan might like to say.

Different processors have different constraints on which instructions can be issued together for superscalar execution. A processor such as the Intel Atom can issue two instructions per cycle. However, the compiler needs to take care in the instructions it selects and the order of those instructions because the Atom processor has restrictions on which instructions can start at the same time. For example, the Intel Atom processor can only start one older x87 floating point instruction per cycle, so the compiler should avoid back-to-back x87 floating point instructions. The Intel Haswell-based processors have four units for basic arithmetic/logic operations and three units for load/store operations, and the Intel Skylake-based processors can issue up to six micro-operations (uops) per cycle. Given these detailed differences, selecting and ordering the instructions is probably best left to the compiler.
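
While instruction selection is the compiler’s job, the source code still determines how many unrelated instructions exist in the first place. The following sketch is an illustration under that assumption and not code from the article; the function names are hypothetical. The first function forms one long dependency chain, because every addition needs the previous result. The second keeps two independent running sums, which gives a superscalar processor additions it can start in the same cycle.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: each add must wait for the previous add. */
double sum_single_chain(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Two independent chains: the add into even and the add into odd do
 * not depend on each other, so they can execute in parallel. */
double sum_two_chains(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double even = 0.0, odd = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += v[i];
        odd  += v[i + 1];
    }
    if (i < n)            /* leftover element when n is odd */
        even += v[i];
    return even + odd;
}
```

Because floating point addition is not associative, the two functions can differ in the last bits of the result for general inputs, which is also why a compiler will not usually make this transformation on its own without being told that reassociation is acceptable.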
