Profiling

Which tasks are periodically taking processor time?

When running a latency-sensitive application, one might notice a delay that recurs on a regular basis (for example, every 5 minutes). The SystemTap periodic.stp script can point to possible causes of that regular delay: it generates a list of the number of times that various scheduled functions run and the time between each scheduled execution. In the case of a delay every five minutes, one would run the periodic script for tens of minutes and then look through the output for a function with a period of approximately 300,000,000 microseconds (5 minutes × 60 seconds/minute × 1,000,000 microseconds/second).
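
As a rough sketch (assuming periodic.stp has been copied from the installed SystemTap example scripts into the current directory, and that the script prints its summary when interrupted), the run would look like:

    # Run as root; interrupt with Ctrl-C after tens of minutes to print
    # the list of scheduled functions, their counts, and their periods.
    stap periodic.stp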

Continue reading “Which tasks are periodically taking processor time?”

Examining Huge Pages or Transparent Huge Pages performance

All modern processors use page-based mechanisms to translate the virtual addresses of user-space processes into physical addresses in RAM. The pages are commonly 4KB in size, and the processor can hold a limited number of virtual-to-physical address mappings in the Translation Lookaside Buffer (TLB). The number of TLB entries ranges from tens to hundreds of mappings, which limits a processor to a few megabytes of memory it can address without changing the TLB entries. When a virtual-to-physical address mapping is not in the TLB, the processor must perform an expensive page-table walk to generate the new mapping.
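
For example, the current huge page and transparent huge page settings can be inspected from the shell (the sysfs path below exists on RHEL 6 and newer kernels built with transparent huge page support):

    # Number and size of explicitly configured huge pages, plus anonymous
    # memory currently backed by transparent huge pages (AnonHugePages)
    grep Huge /proc/meminfo
    # Transparent huge page mode: [always], madvise, or never
    cat /sys/kernel/mm/transparent_hugepage/enabled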

Continue reading “Examining Huge Pages or Transparent Huge Pages performance”

Determining whether an application has poor cache performance

Modern computer systems include cache memory to hide the higher latency and lower bandwidth of RAM from the processor. The cache has access latencies ranging from a few processor cycles to ten or twenty cycles, rather than the hundreds of cycles needed to access RAM. If the processor must frequently obtain data from RAM rather than the cache, performance will suffer. With Red Hat Enterprise Linux 6 and newer distributions, the system's use of cache can be measured with the perf utility, available from the perf RPM.
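
As a simple sketch (with ./myapp as a placeholder for the program under test), perf stat can count the generic cache events for a single run and show how often cache references miss:

    # Count cache references and misses while the program runs;
    # a high miss-to-reference ratio suggests poor cache behavior.
    perf stat -e cache-references,cache-misses ./myapp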

Continue reading “Determining whether an application has poor cache performance”

Profiling Ruby Programs

The Ruby interpreter includes a profiling tool that is invoked with the -rprofile option on the command line. Below is an example of running the Ruby Fibonacci program (fib.rb) included in the Ruby documentation samples. The list of functions is sorted from most to least time spent exclusively in each function (self seconds). The first column gives the percentage of self seconds for each function. The cumulative seconds column indicates the amount of time spent in the function and in the functions it calls directly and indirectly. The calls, self ms/call, and total ms/call columns give some indication of whether the function is called frequently and the average cost of each call.
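
The invocation is just the interpreter with the profiler loaded first (assuming fib.rb has been copied from the Ruby documentation samples into the current directory); the profile table is printed when the program exits:

    # -rprofile loads the profile library before the script runs;
    # the report is written to stderr when the interpreter exits.
    ruby -rprofile fib.rb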

Continue reading “Profiling Ruby Programs”

Profiling Python Programs

For RHEL 6 and newer distributions, tools are available to profile Python code and to generate dynamic call graphs of a program’s execution. Flat profiles can be obtained with the cProfile module, and dynamic call graphs can be obtained with pycallgraph.

The cProfile Python module records information about each of the Python functions run. For older versions of Python that do not include the cProfile module, the higher-overhead profile module can be used instead. Profiling is fairly simple with the cProfile module.
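
For example, an entire script can be profiled from the command line without modifying its source (myscript.py is a placeholder name); sorting by time matches the flat-profile view described above:

    # Run the script under cProfile; -s time sorts the report by the
    # time spent within each function itself.
    python -m cProfile -s time myscript.py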

Continue reading “Profiling Python Programs”

Dive deeper in NUMA systems

A common performance-related issue we see is certain instructions causing bottlenecks.  Sometimes it just doesn’t make sense, especially when lots of threads or shared memory on NUMA systems are involved.

For quite a while a bunch of us have been writing tools that exploit features of the CPU to provide insight into not only the instruction at the bottleneck but the data address too.

See, the instruction is only half the picture.  Having the data address lets you see two distinct functions operating on what looks like distinct data that is nevertheless intertwined on a cache line.  These functions end up tugging the memory back and forth, causing huge latency spikes.

Sometimes the answer is to separate the data onto different cache lines; other times (in the case of locks) it is perhaps to change the granularity to reduce contention.

Intel CPUs can provide data addresses for loads and stores (along with latency times for loads) through their performance counters.  Userspace exploits this feature with a tool called ‘perf’.

The latest perf can be run as shown below.
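
A sketch of one possible invocation, assuming a perf and kernel recent enough to include the perf mem subcommand (the system-wide, 10-second sampling window is only an example):

    # Sample memory loads and stores across all CPUs for 10 seconds...
    perf mem record -a sleep 10
    # ...then report the sampled data addresses and load latencies
    perf mem report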

Continue reading “Dive deeper in NUMA systems”

Starting with SystemTap

As I stare at this blank screen to start writing my first blog entry, I have the same feeling that so many developers have when starting with an unfamiliar programming language or application.  The developers in our group realize that it is not easy starting from nothing, and we strive to make it easier to use SystemTap productively to investigate performance problems.

Continue reading “Starting with SystemTap”
