William Cohen

William Cohen has been a developer of performance tools at Red Hat for over a decade and has worked on a number of the performance tools in Red Hat Enterprise Linux and Fedora such as OProfile, PAPI, SystemTap, and Dyninst.

Recent Posts

“Use the dynamic tracing tools, Luke”

A common refrain for tracking down issues on computer systems running open source software is “Use the source, Luke.” Reviewing the source code can be helpful in understanding how the code works, but this static view may not give you a complete picture of how things work (or are broken) in the code. The paths taken through code are heavily data dependent. Without knowledge of the specific values at key locations in the code, you can easily miss what is happening. Dynamic instrumentation tools, such as SystemTap, that trace and instrument the software can help provide a more complete understanding of what the code is actually doing.

I have wanted to better understand how the Ruby interpreter works, and this is an opportunity to use SystemTap to investigate Ruby MRI internals on Red Hat Enterprise Linux 7. The article What is SystemTap and how to use it? has more information about installing SystemTap. The x86_64 RHEL 7 machine has ruby-2.0.0.648-33.el7_4.x86_64.rpm installed, so the matching debuginfo RPM is installed to provide SystemTap with information about function parameters and to provide me with human-readable source code. The debuginfo RPM is installed by running the following command as root:
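The excerpt cuts off before showing the command; on RHEL this is most likely the standard debuginfo-install invocation, sketched here with the version string taken from the paragraph above:

```shell
# Run as root. Fetches the debuginfo matching the installed ruby RPM
# (exact package name assumed from the version mentioned above).
debuginfo-install ruby-2.0.0.648-33.el7_4.x86_64
```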

Continue reading ““Use the dynamic tracing tools, Luke””


Find what capabilities an application requires to successfully run in a container

Many developers would like to run their existing applications in a container with restricted capabilities to improve security. However, it may not be clear which capabilities the application uses, because the code relies on libraries or other code developed elsewhere. The developer could run the application in an unrestricted container that allows all syscalls and capabilities, avoiding the hard-to-diagnose failures caused by the application’s use of forbidden capabilities or syscalls. Of course, this eliminates the enhanced security of restricted containers. At Red Hat, we have developed a SystemTap script (container_check.stp) to provide information about the capabilities that an application uses. Read the SystemTap Beginners Guide for information on how to set up SystemTap.
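As a sketch of how such a script is typically run (the exact options accepted by container_check.stp may differ; SystemTap’s standard -c option runs a script for the duration of a given command, and ./myapp is a placeholder):

```shell
# Run as root: trace the capabilities and syscalls used while the
# target application runs ("./myapp" is a hypothetical placeholder).
stap container_check.stp -c ./myapp
```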

Continue reading “Find what capabilities an application requires to successfully run in a container”


How to avoid wasting megabytes of memory a few bytes at a time

Maybe you have so much memory in your computer that you never have to worry about it — then again, maybe you find that some C or C++ application is using more memory than expected. This could be preventing you from running as many containers on a single system as you expected, it could be causing performance bottlenecks, and it could even be forcing you to pay for more memory in your servers.

You do some quick “back of the envelope” calculations to estimate the amount of memory your application should be using, based on the size of each element in some key data structures, and the number of those data structures in each array. You think to yourself, “That doesn’t add up! Why is the application using so much more memory?” The reason it doesn’t add up is that you didn’t take into account the memory space being wasted in the layout of the data structures.

Continue reading “How to avoid wasting megabytes of memory a few bytes at a time”


Instruction-level Multithreading to improve processor utilization

No one wants the hardware in their computer sitting idle – we all want to get as much useful work out of our hardware as possible. Mechanisms such as caches and branch prediction have been incorporated into processors to minimize the amount of processor idle time caused by memory accesses and changes in program flow; however, these mechanisms are not perfect.

There are still times when the processor could be idle waiting for data or computational results to become available. These delays are relatively short, generally less than a few hundred clock cycles and typically around ten. An operating system context switch to another runnable task, however, takes on the order of hundreds of cycles, so the overhead is too large for the operating system to hide these short periods of idleness by switching tasks.

One approach to getting better utilization is to have the physical processor support multiple logical processors. If one logical processor has to wait for some result, the physical processor can switch to processing instructions from the other logical processors, keeping the hardware busy doing useful work and improving utilization.

Continue reading “Instruction-level Multithreading to improve processor utilization”


“Don’t cross the streams”: Thread safety and memory accesses at the speed of light

The classic 1984 movie Ghostbusters offered an important safety tip for all of us:

“Don’t cross the streams.” – “Why not?” – “It would be bad.” – “I’m fuzzy on the whole good/bad thing. What do you mean, ‘bad’?” – “Try to imagine all life as you know it stopping instantaneously and every molecule in your body exploding at the speed of light.” – “Right. That’s bad. Okay. All right. Important safety tip. Thanks…”

Similarly, in computing, there are cases where data crossing between different instruction streams through memory can have a similar effect on a software application: “all execution as we know it stopping instantaneously”.

This is due to the performance optimizations that both hardware and software implement to reorder and eliminate memory accesses. Ignoring these “memory access reordering” issues can result in extremely problematic debugging scenarios.

The bug from hell was a scenario where Java’s OpenJDK runtime parallel garbage collector very occasionally crashed because one thread’s write would signal that the data structure had been updated. This signal occurred before the actual update writes (to the same data structure), and the result was that other threads would end up reading invalid values. We’re going to take a deeper look into this scenario to understand exactly what went on in this notorious issue.

Continue reading ““Don’t cross the streams”: Thread safety and memory accesses at the speed of light”


Superscalar Execution

In the traditional processor pipeline model, under ideal circumstances one new instruction enters the processor’s pipeline and one instruction completes execution each cycle. Thus, in the best case the processor can have an average execution rate of one clock per instruction. A superscalar processor allows multiple unrelated instructions to start on the same clock cycle on separate hardware units or pipelines. Under ideal conditions a superscalar processor could have an average clocks per instruction (CPI) of less than one, meaning your 2GHz processor might be able to execute “billions and billions of instructions per second” as the famed astronomer Carl Sagan might like to say.

Different processors have different constraints on which instructions can be issued together for superscalar execution. A processor such as the Intel Atom can issue two instructions per cycle. However, the compiler needs to take care in which instructions it selects and how it orders them, because the Atom processor has restrictions on which instructions can start at the same time. For example, the Intel Atom processor can start only one of the older X87 floating point instructions per cycle, so the compiler should avoid back-to-back X87 floating point instructions. The Intel Haswell-based processors have four units for basic arithmetic/logic operations and three units for load/store operations, and the Intel Skylake-based processors can issue up to six uops per cycle. Given these detailed differences, selecting and ordering instructions is probably best left to the compiler.

Continue reading “Superscalar Execution”


Quickly determine which instruction comes next with Branch Prediction

A pipelined processor requires a steady stream of instructions to be fed into the pipeline. Any delay in feeding instructions into the pipeline hurts performance. For a sequence of instructions without branches it is relatively easy to determine the next instruction to feed into the pipeline, particularly for processors with fixed-size instructions. Variable-sized instructions might complicate finding the start of each instruction, but the instructions still form a contiguous, linear stream of bytes in memory.

Keeping the processor pipeline filled when there are changes in the control flow of a program due to branches is more difficult. Function calls, function returns, conditional branches, and indirect jumps all affect which instruction comes next in the sequence. In the worst case the pipeline might have to sit idle until all the prior instructions complete, and given how frequently control-flow instructions occur in code, these waits can greatly reduce performance.

To improve pipeline performance, the processor attempts to predict which instructions to execute next. These prediction mechanisms allow the processor to speculatively execute code past branch instructions. If a prediction is wrong, the speculative work is discarded to preserve the semantics of the programmer’s simple execution model.

Continue reading “Quickly determine which instruction comes next with Branch Prediction”


Assembly Line for Computations

The simple programmer’s model of a processor executing machine language instructions is a loop of the following steps, with each step finished before moving on to the next step:

  1. Fetch instruction
  2. Decode instruction and fetch register operands
  3. Execute arithmetic computation
  4. Possible memory access (read or write)
  5. Writeback results to register

As mentioned in the introduction blog article, even if the processor can get each step down to a single cycle, that would be 2.5ns (5*0.5ns) per instruction for a 2GHz (2×10^9 cycles per second) processor, only 400 million instructions per second. If the instructions can be processed in assembly-line fashion, with the steps of different instructions overlapped, the performance can be significantly improved. Rather than completing one instruction every 2.5ns, the processor could potentially complete an instruction every clock cycle, a five-fold improvement in speed. This technique is known as pipelining.

Continue reading “Assembly Line for Computations”


Programmer’s Model of a Processor Executing Instructions Versus Reality

Everything on a computer system eventually ends up being run as a sequence of machine instructions. People want to keep things simple and understandable even if that is not really the way things work. The simple programmer’s model of a Reduced Instruction Set Computer (RISC) processor executing those machine language instructions is a loop of the following steps, with each step finished before moving on to the next step:

Continue reading “Programmer’s Model of a Processor Executing Instructions Versus Reality”