Is your application having trouble scaling? If so, do you know whether false cacheline sharing is the cause, with cachelines ping-ponging between NUMA nodes?
False sharing occurs when one or more processes or threads repeatedly modify data that sits in the same cacheline as unrelated data used by other processes or threads. Each write forces the others to invalidate their cached copies and reload the line with the updated values, often from main memory. This can slow programs down considerably.
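To make the definition concrete, here is a minimal sketch in C of two layouts: one where two independent counters share a cacheline, and one where each counter gets its own line. The 64-byte line size, the struct names, and the `run_pair` helper are assumptions made for illustration; they are not part of the tool.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64            /* assumed line size; check your hardware */
#define ITERATIONS 1000000UL

/* Both counters fit in one cacheline, so two writers on different
 * CPUs keep invalidating each other's copy of the line. */
struct shared_line {
    volatile uint64_t a;
    volatile uint64_t b;
};

/* Each counter is aligned to the start of its own cacheline. */
struct separate_lines {
    alignas(CACHELINE) volatile uint64_t a;
    alignas(CACHELINE) volatile uint64_t b;
};

static_assert(offsetof(struct separate_lines, b) % CACHELINE == 0,
              "b must start on a cacheline boundary");

/* Thread body: increment only the counter this thread was given. */
static void *bump(void *arg)
{
    volatile uint64_t *counter = arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        (*counter)++;
    return NULL;
}

/* Run two threads, each incrementing its own counter.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_pair(volatile uint64_t *a, volatile uint64_t *b)
{
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, bump, (void *)a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, bump, (void *)b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```

Calling `run_pair(&s.a, &s.b)` on a `struct shared_line` and then on a `struct separate_lines` produces the same counter values either way. Only the memory layout differs, which is why this kind of slowdown is hard to spot by reading the code. Any timing difference between the two runs depends on the machine and on where the threads are scheduled.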
Red Hat has developed an extension to the perf tool to detect cache-to-cache thrashing. It will be invoked as "perf c2c".
The tool will enable you to see:
- The hottest cachelines in your application.
- All the processes and threads that are accessing those cachelines (including pid, tid, and object names).
- The offsets into the cachelines for the accesses.
- Details for which accesses are modifying or just reading the cachelines.
- The NUMA nodes and CPUs where those accesses are coming from.
- The instruction address in the object where those memory accesses are occurring.
The above information makes it clear whether your application is suffering from false sharing, and gives you the details you need to address it.
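As a rough illustration of how the cacheline and offset details fit together, the sketch below splits an address into its cacheline base and the offset within that line, then flags a pair of accesses as a false-sharing candidate. The 64-byte line size, the `struct access` record, and the `may_false_share` heuristic are hypothetical and do not describe the tool's internals or output format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE 64u           /* assumed line size; check your hardware */

static_assert((CACHELINE & (CACHELINE - 1)) == 0,
              "masking below requires a power-of-two line size");

/* Address of the first byte of the cacheline containing addr. */
uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHELINE - 1);
}

/* Byte offset of addr within its cacheline (0 .. CACHELINE-1). */
unsigned line_offset(uintptr_t addr)
{
    return (unsigned)(addr & (CACHELINE - 1));
}

/* Hypothetical record of one sampled memory access. */
struct access {
    uintptr_t addr;     /* data address that was touched */
    bool is_store;      /* true for a write, false for a read */
    int tid;            /* thread that made the access */
};

/* Simplified heuristic: different threads, same cacheline, different
 * offsets, and at least one of the two accesses is a store. */
bool may_false_share(const struct access *x, const struct access *y)
{
    return x->tid != y->tid
        && line_base(x->addr) == line_base(y->addr)
        && line_offset(x->addr) != line_offset(y->addr)
        && (x->is_store || y->is_store);
}
```

For example, a store at address `0x1008` by one thread and a load at `0x1030` by another both fall in the line starting at `0x1000`, at offsets `0x08` and `0x30`, so the pair is flagged. A store at `0x1040` belongs to the next line and is not.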
As we developed this tool, we uncovered some nice performance gains in various open source code we used for testing. To learn more, come to our Thursday morning DevNation talk: "Pinpoint memory cacheline tearing for NUMA performance gains" - Don Zickus & Joe Mario