Investigating performance in a complex system is a fascinating undertaking.  When that system spans multiple, closely-cooperating machines and has open-ended input sources (shared storage, Internet-facing services, and so on), the degree of difficulty of such investigations ratchets up quickly.  There are often many confounding factors, with many things going on at the same time.

The observable behaviour of the system as a whole can change frequently even while, at a micro level, things appear the same.  Or vice versa: the system may appear healthy, with average and 95th percentile response times in excellent shape, yet a small subset of tasks is taking an unusually long time to complete, perhaps just today.  Fascinating stuff!

Let's first consider the endearing characteristics we'd want in the performance tools at our disposal for exploring performance in this environment.

We are talking about production systems, of course, so they must be lightweight in all dimensions (processor, memory, storage, networking), and being always-on is a requirement.  They'll be operating under duress, 24x7.  When other components are failing around these tools, we want them to keep doing their thing (complex systems often exhibit cascading failures); the tools should remain operational even when three of four engines have failed, there's no hydraulic pressure, and a plume of thick black smoke is trailing behind the system!

We want them to give us data immediately, and to be able to save data for later playback and offline analysis.  We do not want less tangible issues like financial cost, licensing, lack of security, lack of openness or similar constraints to give us pause when deploying the tools to all the cooperating hosts in our complex system.
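
This live-versus-recorded distinction is one PCP makes explicit: the same API works against a live collector or against a previously recorded archive.  Here is a minimal sketch, assuming the PCP Python bindings (the pcp.pmapi module) are installed; the host name and archive path are placeholders rather than values from any real system.

    #!/usr/bin/env python
    # Minimal sketch: the same PCP calls can read live data from a host's
    # collector daemon (pmcd) or replay an archive recorded by pmlogger.
    # The host name and archive path below are placeholders.
    from pcp import pmapi
    import cpmapi as c_api

    # Live: metrics are fetched on demand from the collector on "appserver".
    live = pmapi.pmContext(c_api.PM_CONTEXT_HOST, "appserver")

    # Playback: the same calls work against a recorded archive for offline
    # analysis, long after the events of interest have passed.
    replay = pmapi.pmContext(c_api.PM_CONTEXT_ARCHIVE,
                             "/var/log/pcp/pmlogger/appserver/20120101")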

Perhaps most importantly of all, they must provide access to all of the many metrics from the many different domains (components) of these complex systems, because combinations of factors from different components and different hosts might be (will be!) contributing to today's crisis.

With this basic tooling in place, we then move up a level and want these tools to play nicely with higher-level tools (monitoring and reporting systems, data warehouses, analytics, capacity planning or modelling tools).  Ideally that means APIs for extracting data, and tools for producing easily-consumed output that others can build on and analyse further.
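
To make that concrete, here is a minimal sketch of programmatic extraction using the PCP Python bindings (pcp.pmapi); the metric names are common PCP metrics, and the simple name/instance/value output is just an illustration of easily-consumed output, not a format PCP itself prescribes.

    #!/usr/bin/env python
    # Minimal sketch: pulling current metric values out of PCP so that
    # higher-level tools (reporting, analytics, capacity planning) can
    # consume them.  The metric names are common PCP metrics; the simple
    # comma-separated output is only an example format.
    from pcp import pmapi
    import cpmapi as c_api

    context = pmapi.pmContext(c_api.PM_CONTEXT_HOST, "localhost")

    metrics = ("kernel.all.load", "mem.util.free")
    pmids = context.pmLookupName(metrics)
    descs = context.pmLookupDescs(pmids)
    result = context.pmFetch(pmids)

    for i, name in enumerate(metrics):
        # Each metric may have several instances (e.g. the 1, 5 and 15
        # minute load averages); extract and print each value as a float.
        for j in range(result.contents.get_numval(i)):
            atom = context.pmExtractValue(result.contents.get_valfmt(i),
                                          result.contents.get_vlist(i, j),
                                          descs[i].contents.type,
                                          c_api.PM_TYPE_FLOAT)
            print("%s,%d,%f" % (name, j, atom.f))

    context.pmFreeResult(result)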

This is the set of requirements underpinning the Performance Co-Pilot (PCP) toolkit.

So, with this in mind, let us turn our focus back to the topic at hand: how can we perform simple exploratory performance analysis in complex production systems using this not-so-hypothetical-after-all set of tools?  And what do we even mean by "exploratory" in this context?

When confronted with a need to analyse performance, we often start with an initial observation about the system (poor response time of an interactive web application, perhaps).  From there, we can explore the available data and come up with a series of hypotheses, informed by our observations.

To illustrate the approach, and to introduce some of the concepts and tools that are part of the Performance Co-Pilot, the screencast below (under 12 minutes) shows the exploration of some data from two machines in a loosely-coupled cluster: a storage server and an application server.  This is production system data, with some names changed to protect the innocent.  For simplicity, many of the other machines that complete this (complex) system are not presented.

http://youtu.be/zrAjevr8_Ds

You can find more information, books and other Performance Co-Pilot resources on the project website.  In Fedora and EPEL, look for the "pcp", "pcp-gui" and "pcp-doc" RPMs.

http://oss.sgi.com/projects/pcp