I had an interesting question from one of our developers here at Red Hat:
“When I was investigating a performance issue in our project after switching to Oracle’s JDK 7u40, I found a performance regression in the class
sun.net.www.protocol.http.HttpURLConnection.getOutputStream(). This method takes more CPU time than with JDK 7u25.”
And it does take much more time. In fact, when
fixedLengthStreamingMode is enabled,
HttpURLConnection.getOutputStream() takes roughly 25 times as long: about 1.2 milliseconds versus 47 microseconds.
Continue reading “An ultra-lightweight high-precision logger for OpenJDK”
Now that Red Hat Enterprise Linux 7 is publicly available, we thought RHEL application developers would be interested in seeing how the new C/C++ toolchain compares to the equivalent in Red Hat Enterprise Linux 6 in terms of raw performance. The numbers are pretty surprising so stay tuned. But first a little introduction to set the scene.
Continue reading “Red Hat Enterprise Linux 7 toolchain a major performance boost for C/C++ developers”
When running a latency-sensitive application, one might notice a delay that recurs on a regular basis (for example, every 5 minutes). The SystemTap
periodic.stp script can suggest possible causes of that regular delay. The
periodic.stp script generates a list of the number of times that various scheduled functions ran and the time between each scheduled execution. For a delay every five minutes, one would run the periodic script for tens of minutes and then look through the output for a function with a period of approximately 300,000,000 microseconds (5 minutes * 60 seconds * 1,000,000 microseconds/second).
Continue reading “Which tasks are periodically taking processor time?”
If an important task is processor limited, one would like to make sure that the task is getting as much processor time as possible and other tasks are not delaying the execution of the important task. The SystemTap example script,
cycle_thief.stp, lists what interrupts and other tasks run on the same processor as the important task. The
cycle_thief.stp script provides the following pieces of information:
All modern processors use page-based mechanisms to translate a user-space process’s virtual addresses into physical addresses for RAM. Pages are commonly 4KB in size, and the processor can hold a limited number of virtual-to-physical address mappings in the Translation Lookaside Buffer (TLB). The number of TLB entries ranges from tens to hundreds of mappings. This limits a processor to a few
megabytes of memory it can address without changing the TLB entries. When a virtual-to-physical address mapping is not in the TLB, the processor must do an expensive page-table walk to generate a new virtual-to-physical address mapping.
Continue reading “Examining Huge Pages or Transparent Huge Pages performance”
Modern computer systems include cache memory to hide the higher latency and lower bandwidth of RAM memory from the processor. The cache has access latencies ranging from a few processor cycles to ten or twenty cycles rather than the hundreds of cycles needed to access RAM. If the processor must frequently obtain data from the RAM rather than the cache, performance will suffer. With Red Hat Enterprise Linux 6 and newer distributions, the system use of cache can be measured with the
perf utility available from the
Continue reading “Determining whether an application has poor cache performance”
In an earlier post we looked into using the Performance Co-Pilot toolkit to explore performance characteristics of complex systems. While surprisingly rewarding, and often unexpectedly insightful, this kind of analysis can be rightly criticized for being “hit and miss”. When a system has many thousands of metric values it is not feasible to manually explore the entire metric search space in a short amount of time. Or the problem may be less obvious than the example shown – perhaps we are looking at a slow degradation over time.
There are other tools that we can use to help us quickly reduce the search space and find interesting nuggets. To illustrate, here’s a second example from our favorite ACME Co. production system.
Continue reading “Performance Regression Analysis with Performance Co-Pilot “
Investigating performance in a complex system is a fascinating undertaking. When that system spans multiple, closely-cooperating machines and has open-ended input sources (shared storage, Internet-facing services, etc.), the degree of difficulty of such investigations ratchets up quickly. There are often many confounding factors, with many things going on all at the same time.
The observable behaviour of the system as a whole can change frequently even while at a micro level things appear the same. Or vice-versa – the system may appear healthy, average and 95th percentile response times are in excellent shape, yet a small subset of tasks are taking an unusually large amount of time to complete, just today perhaps. Fascinating stuff!
Let’s first consider the characteristics we’d want the performance tools at our disposal to have for exploring performance in this environment.
Continue reading “Exploratory Performance Analysis with Performance Co-Pilot “
As I mentioned here, Joe Mario and I delivered this session at Red Hat’s Developer Exchange session in Boston. There were a lot of great questions and we hope you’ll find this video-recorded session useful.
Now that you followed all the steps to make your application NUMA-aware, how do you know if you got it right, or if you shifted your performance problem elsewhere?
In this session, Don and Joe will:
- discuss initial high level steps to verify correct memory and cpu-process placement, including:
- showing how performance can easily suffer with incorrect placement.
- describing available options to correct placement.
- discuss the open source tools, both available now and in development, which use the hardware’s performance counters to more accurately pinpoint:
- where your program is making costly remote NUMA memory accesses,
- identifying if and where other programs are inflicting NUMA-related performance penalties on your program,
- how much those remote accesses are hurting your performance.
- discuss various approaches for resolving these low-level issues.
Continue reading “NUMA – Verifying it’s not hurting your application performance “
A common performance-related issue we are seeing is certain instructions
causing bottlenecks. Sometimes it just doesn’t make sense, especially
when lots of threads or shared memory on NUMA systems are involved.
For quite a while, a group of us have been writing tools that exploit features
of the CPU to give us insight into not only the instruction at the bottleneck
but the data address too.
See, the instruction is only half the picture. Having the data address allows
you to see two distinct functions operating on what looks like distinct data,
yet intertwined on a cache line. These functions end up tugging
memory back and forth, causing huge latency spikes.
Sometimes the answer is to separate the data onto different cache lines; other
times (in the case of locks) perhaps change the granularity to reduce contention.
Intel CPUs have support for providing data addresses for loads and stores (along
with latency times for loads) through their performance counters. Userspace
exploits this feature with a tool called ‘perf’.
The latest perf can be run with:
Continue reading “Dive deeper in NUMA systems”