In an earlier post we looked into using the Performance Co-Pilot toolkit to explore performance characteristics of complex systems. While surprisingly rewarding, and often unexpectedly insightful, this kind of analysis can be rightly criticized as "hit and miss". When a system has many thousands of metric values it is not feasible to manually explore the entire metric search space in a short amount of time. Or the problem may be less obvious than in the example shown - perhaps we are looking at a slow degradation over time.
There are other tools that we can use to help us quickly reduce the search space and find interesting nuggets. To illustrate, here's a second example from our favorite ACME Co. production system.
For this scenario we'll be looking at data from a MS Windows SQL Server machine that is backing the acme.com web application. Demonstrating the cross-platform tools (and cos we're warped and twisted people), we have recorded the performance data from a (remote) RHEL6 machine and we'll be analyzing it on a Mac OS X desktop.
http://youtu.be/Z9fSymDfuvQ
Win - we've found areas of interest with very little assistance, and we've diagnosed the problem starting from almost no knowledge of the system. And we had fun exploring system activity too - good times!
Notice that this kind of analysis is made possible by a model where performance data from all components of complex systems are extracted, integrated, and analysed together (in this case - hardware, kernel, and database components). All of the metrics are described with sufficient metadata (units, data type, and so on) that generic tools like pmie(1) and pmlogsummary(1) can sensibly interpret them.
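To make that concrete, here is a minimal sketch of a pmie(1) rule. It is based on the examples in the pmie documentation rather than on the ACME Co. setup, and the 5% threshold and 2 minute interval are illustrative choices, not values from this scenario. Because filesys.free and filesys.capacity carry matching units metadata, pmie can verify that the ratio is dimensionless before evaluating it:

```
// Evaluate the rule every 2 minutes
delta = 2 min;

// Fire if any filesystem instance drops below ~5% free space
// (threshold chosen for illustration only)
some_inst (
    filesys.free / filesys.capacity < 0.05
) -> print "filesystem nearly full";
```

A rule file like this would typically be passed to pmie with its -c option, against either a live host or a recorded archive.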
Performance regression detection can be applied in several ways. Developers at ACME Co. use it to compare platform and application activity between load test runs. New versions of their application are automatically scanned for unexpectedly large database query results (see the accompanying clip) or other resource utilization patterns that don't match a known-good baseline.
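A baseline comparison of this sort can be sketched with the archive tools mentioned above. The archive names below (acme-loadtest-v1, acme-loadtest-v2) are hypothetical stand-ins for two recorded load test runs:

```shell
# Report time-averaged values for every metric in one recorded archive
pmlogsummary acme-loadtest-v1

# Compare two archives and rank the metrics whose values
# changed most between the runs - a quick way to surface
# unexpected regressions against a known-good baseline
pmdiff acme-loadtest-v1 acme-loadtest-v2
```

Both commands work entirely from archives, so the comparison can be done long after the runs finished, on a different machine than the one that recorded them.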
Another use is in evaluating the effects of an operating system upgrade. In such cases, there will be a large number of unrelated changes at once (system libraries, kernel, toolchain, JVM runtime, etc). Having a disciplined approach to evaluating improvements and regressions at a system level (where "system level" might involve many cooperating machines) is invaluable.
You can find more information, books, and other Performance Co-Pilot resources on the project website. In Fedora and EPEL, look for the "pcp", "pcp-gui" and "pcp-doc" RPMs.
http://oss.sgi.com/projects/pcp
Last updated: April 5, 2018