Understanding malloc behavior using Systemtap userspace probes
The malloc family of functions is critical to almost every serious application program, and its performance characteristics often have a big impact on application performance. Because the default malloc implementation needs to perform consistently across all general cases, it exposes a number of tunables that help developers tweak its behavior to suit their programs.
About two years ago I wrote an article on the Red Hat Customer Portal that described the high-level design of the GNU C Library memory allocator and introduced the various magic environment variables that malloc understands to change its behavior. The behavior documented in that article, and the tricks to tweak malloc behavior, hold just as true for RHEL-7, which is based on upstream glibc 2.17, as they did for RHEL-6, which is based on upstream glibc 2.12.
However, it can be pretty cumbersome for an application developer to find out exactly which aspect of the malloc implementation to tweak. Until now, the best approach was to try tweaking all of the magic variables and stick with the combination that worked best. This is very ad hoc, which is why we came up with the idea of adding Systemtap static probe points to malloc to make the process much more systematic. These malloc probes were included upstream in glibc 2.19 and have been backported to RHEL-7.
The Userspace Probing chapter in the Systemtap Beginners’ Guide on the Customer Portal is a good starting point if you’ve never used Systemtap for userspace probes before. The glibc manual (invoked using the info libc command) has details for each of the probe points. This post is only intended as a guideline to explain what hitting those probe points may mean for your application in terms of performance.
Changes to the process heap
There are two major events that carry a performance penalty when the process heap changes, namely growing and shrinking the heap. These are indicated by the following probe points:
memory_sbrk_more (void *$arg1, size_t $arg2)
memory_sbrk_less (void *$arg1, size_t $arg2)
As the names suggest, memory_sbrk_more is hit when the heap is grown using the sbrk system call and memory_sbrk_less is hit when the heap is shrunk using the same system call. $arg1 is the pointer that marks the new end of the process heap and $arg2 is the size by which the process heap was grown or shrunk. The penalty here is the invocation of a system call, which results in a context switch into the kernel.
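To get a feel for when the heap grows, the program break can be sampled with sbrk(0) before and after a burst of small allocations. The following is only a minimal sketch: the function name break_growth_during_allocs is made up for this post, and it assumes it runs on the main thread of a glibc-based Linux process where small requests are serviced from the sbrk-managed heap. Any growth it reports is the kind of event I would expect memory_sbrk_more to flag.

```c
#define _DEFAULT_SOURCE   /* exposes sbrk() in glibc's <unistd.h> */
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate `count` blocks of `size` bytes and report
 * how far the program break moved while doing so. Returns the growth in
 * bytes, or -1 if the bookkeeping array could not be allocated.
 */
long break_growth_during_allocs(size_t count, size_t size)
{
    void **blocks = malloc(count * sizeof *blocks);
    if (blocks == NULL)
        return -1;

    char *before = sbrk(0);   /* current end of the process heap */

    for (size_t i = 0; i < count; i++) {
        blocks[i] = malloc(size);
        if (blocks[i] != NULL)
            *(volatile char *)blocks[i] = 1;   /* touch the block so it is used */
    }

    char *after = sbrk(0);    /* end of the process heap after the burst */

    /* free(NULL) is a no-op, so failed allocations need no special casing. */
    for (size_t i = 0; i < count; i++)
        free(blocks[i]);
    free(blocks);

    return (long)(after - before);
}
```

A result of zero simply means the existing heap already had enough room, or that the requests were serviced some other way, so no sbrk call was needed.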
The other important factor in allocator behavior is the choice between allocating a malloc request on the heap and using mmap to service it. Larger requests are better off being mmapped, since on the heap they tend to leave larger unused gaps. Allocating on the heap, however, is advantageous from a performance point of view since it usually does not involve a syscall. To strike a balance, the allocator maintains a dynamic threshold that is adjusted so that frequently requested sizes are serviced from the heap, improving performance at the cost of potentially higher space wastage. Requests smaller than the threshold are allocated on the heap while larger ones are serviced using mmap. Changes in this dynamic threshold are important because they can change the balance between heap wastage and speed of allocation.
memory_mallopt_free_dyn_thresholds (int $arg1, int $arg2)
This probe is triggered when the free function adjusts the dynamic threshold. Arguments $arg1 and $arg2 are the adjusted mmap and trim thresholds, respectively.
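The threshold adjustment is easier to reason about with a small model. The sketch below is not glibc's code. It is a simplified model of my understanding of the upstream behavior: freeing an mmapped chunk larger than the current threshold raises the mmap threshold to that chunk's size and sets the trim threshold to twice that. All names (toy_thresholds, toy_uses_mmap, toy_free_mmapped) are hypothetical, the 128 KiB and 32 MiB constants are the defaults I would assume for a 64-bit build, and the model ignores the difference between request size and chunk size.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed 64-bit defaults; treat these as illustrative values only. */
#define TOY_DEFAULT_MMAP_THRESHOLD (128UL * 1024)          /* 128 KiB */
#define TOY_MMAP_THRESHOLD_MAX     (32UL * 1024 * 1024)    /* 32 MiB  */

struct toy_thresholds {
    size_t mmap_threshold;   /* requests at or above this are mmapped */
    size_t trim_threshold;   /* free space at the heap top before trimming */
    bool   dynamic;          /* false once a threshold has been pinned */
};

/* Would a request of `size` bytes be serviced using mmap in this model? */
bool toy_uses_mmap(const struct toy_thresholds *t, size_t size)
{
    return size >= t->mmap_threshold;
}

/*
 * Model of freeing an mmapped chunk of `chunk_size` bytes. Returns true
 * when the thresholds were adjusted, which is the point at which
 * memory_mallopt_free_dyn_thresholds would be hit.
 */
bool toy_free_mmapped(struct toy_thresholds *t, size_t chunk_size)
{
    if (!t->dynamic)
        return false;
    if (chunk_size <= t->mmap_threshold || chunk_size > TOY_MMAP_THRESHOLD_MAX)
        return false;

    t->mmap_threshold = chunk_size;       /* similar requests now use the heap */
    t->trim_threshold = 2 * chunk_size;   /* keep more free space before trimming */
    return true;
}
```

In this model, freeing a single 1 MiB block moves the mmap threshold from 128 KiB to 1 MiB, so a later 512 KiB request lands on the heap instead of being mmapped.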
Changes to arenas
Multiple threads in a program cannot possibly scale in performance if they have to synchronize access to the process heap for memory allocation. For this reason, the allocator maintains multiple arenas so that threads attach themselves to their own arenas and hence don’t have to contend with each other during malloc. Maintaining these arenas has some bottlenecks. The first major cost is the creation of an arena.
memory_arena_new (void *$arg1, size_t $arg2)
A new arena is typically created when a newly created thread calls malloc for the first time. This is an expensive event because it involves getting an address space mapping from the kernel. It is also expensive because it means additional address space utilization by the process. $arg1 is the address returned for the arena and $arg2 is the size of the arena.
Of course, not every new thread results in the creation of a new arena. When a thread exits, it may leave behind an arena that is available for reuse in a free list. Hitting the memory_arena_reuse_free_list probe point is an indicator that this may have happened. This is a good sign, since reusing an arena for exclusive use means that resources are being used optimally.
Alternatively, a thread may fail to get an existing free arena and may also be unable to create one because of the limit on the number of arenas that can be created in a process. This limit is determined either by the M_ARENA_MAX mallopt parameter or by the number of available CPU cores. Once this limit is reached, threads have to share arenas with other threads, which is when the following probe point is hit:
memory_arena_reuse (void *$arg1, void *$arg2)
$arg1 is the arena that’s about to be reused and $arg2 is the arena that the thread failed to allocate space on. If $arg2 is NULL, it means that the calling thread has invoked malloc for the first time and that we have reached the limit for maximum arenas. This is an indicator that your application may experience lock contention when allocating memory.
On the other hand, if $arg2 is not NULL, it means that the calling thread may have failed to allocate space on the arena at $arg2. If the memory_arena_reuse probe is hit without hitting memory_arena_reuse_wait, then the thread did not encounter any contention at that moment. This obviously does not mean that the thread will never see contention on this arena.
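If you see memory_arena_reuse being hit often, it helps to know what the arena limit actually is for your process. The sketch below makes an assumption that is not stated above: that the core-based default is 8 arenas per core on 64-bit builds and 2 per core on 32-bit builds, which is my understanding of upstream glibc. The function names are hypothetical, while mallopt, M_ARENA_MAX and sysconf are the real interfaces.

```c
#include <malloc.h>
#include <unistd.h>

/*
 * Assumed default limit: 8 arenas per core on 64-bit builds and 2 per
 * core on 32-bit builds. This multiplier is an assumption of this sketch.
 */
long assumed_default_arena_limit(long cores)
{
    if (cores < 1)
        cores = 1;   /* fallback chosen by this sketch, not by glibc */
    return cores * (sizeof(long) == 4 ? 2 : 8);
}

/* Number of cores currently online, as reported by sysconf(). */
long online_cores(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/*
 * Set an explicit arena limit. mallopt() returns 1 on success and 0 on
 * failure. Calling it is what the mallopt probes described later report.
 */
int cap_arenas(int max_arenas)
{
    return mallopt(M_ARENA_MAX, max_arenas);
}
```

Raising the limit trades address space for less sharing between threads, while lowering it does the opposite, so it is worth watching the arena probes before and after such a change.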
memory_arena_reuse_wait (void *$arg1, void *$arg2, void *$arg3)
If this probe is hit just before memory_arena_reuse, it means that the calling thread is about to enter a wait state on the lock for the arena. The lock address is in $arg1, $arg2 is the arena it is trying to acquire and $arg3 is the arena the thread failed to allocate memory on previously. As in the case of memory_arena_reuse, if $arg3 is NULL, then the thread is calling malloc for the first time and was unable to secure an arena exclusively for itself.
The process heap is easy to extend using the sbrk system call. Arenas, however, are allocated using mmap and it is not always possible to extend them contiguously. To work around this, the allocator implements the concept of an arena heap, which is an mmapped region chained on to the arena to extend it.
memory_heap_new (void *$arg1, size_t $arg2)
This probe is hit when a new heap is allocated for an arena. This is an expensive operation since it involves a system call to the kernel to get address space for the heap. $arg1 is the returned heap address and $arg2 is the size of the heap.
On allocation, much of the heap has PROT_NONE permissions. In fact, this is true for the originally allocated arenas as well, and it is done to reduce the actual commit charge for the process. Portions of the arena heaps are given read and write permissions using the mprotect system call to give the effect of growing. Similarly, portions are given back to the system either by using the madvise system call or by using mprotect to set PROT_NONE permissions again. There are probe points to capture these events since they are also costly.
memory_heap_more (void *$arg1, size_t $arg2)
As the name suggests, this probe point tracks the growth of the arena heap, which is done using the mprotect syscall to give read+write permissions to appropriate blocks within the heap. $arg1 is the address of the heap and $arg2 is the new size of the heap.
memory_heap_less (void *$arg1, size_t $arg2)
This is exactly the opposite of memory_heap_more. The trailing portion of the heap is returned to the system using either the madvise or the mprotect system call. $arg1 is the address of the heap and $arg2 is the new size of the heap.
memory_heap_free (void *$arg1, size_t $arg2)
When an arena heap is completely unused, it may be freed. This probe point tracks this operation since it involves calling the munmap syscall to return the arena heap to the system. $arg1 is the address of the arena heap and $arg2 is the size.
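The life cycle of an arena heap described above (reserve address space with PROT_NONE, grow with mprotect, shrink again, and finally unmap) can be reproduced with the same system calls outside the allocator. This is only a sketch of the technique and not how glibc lays out its heaps. The function name heap_lifecycle_demo is hypothetical, and the shrink step uses mprotect rather than madvise purely to keep the example short.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS in glibc's <sys/mman.h> */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve `reserve_pages` pages with no access rights, make the first
 * `grow_pages` of them usable, write to them, take access away again and
 * finally unmap the region. Returns 0 on success, -1 on any failure.
 */
int heap_lifecycle_demo(size_t reserve_pages, size_t grow_pages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || grow_pages == 0 || grow_pages > reserve_pages)
        return -1;

    size_t reserved = reserve_pages * (size_t)page;
    size_t usable = grow_pages * (size_t)page;

    /* Comparable to memory_heap_new: address space only, nothing accessible. */
    char *heap = mmap(NULL, reserved, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED)
        return -1;

    /* Comparable to memory_heap_more: open up the leading part for use. */
    if (mprotect(heap, usable, PROT_READ | PROT_WRITE) != 0) {
        munmap(heap, reserved);
        return -1;
    }
    heap[0] = 'x';            /* both ends of the usable part are writable */
    heap[usable - 1] = 'y';

    /* Comparable to memory_heap_less: withdraw access from that part again. */
    if (mprotect(heap, usable, PROT_NONE) != 0) {
        munmap(heap, reserved);
        return -1;
    }

    /* Comparable to memory_heap_free: return the whole region to the system. */
    return munmap(heap, reserved) == 0 ? 0 : -1;
}
```

Every step after the initial reservation is a separate system call, which is why each of them has its own probe point.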
Memory pressure in arenas
memory_malloc_retry (size_t $arg1)
memory_realloc_retry (size_t $arg1, void *$arg2)
memory_memalign_retry (size_t $arg1, size_t $arg2)
memory_calloc_retry (size_t $arg1)
memory_arena_retry (size_t $arg1, void *$arg2)
These probes are triggered when the corresponding functions fail to obtain the requested amount of memory from the arena in use. The memory_arena_retry probe is a catch-all for all of the individual probes, which is useful when one only wants to see cases where a thread had to change arenas due to resource limitations.
These probes are an indication that another arena may be tried or allocation may fail. Usually this would also result in sharing of arenas among threads, which in turn increases contention between threads, thus affecting performance.
$arg1 is the user requested size for all probes above. In memory_realloc_retry, $arg2 is the old memory address, and in memory_arena_retry it is the arena on which the allocation failed. In memory_memalign_retry, $arg2 is the requested alignment.
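The retry behavior can be pictured with a toy model. Everything in the sketch below is hypothetical (toy_arena, toy_arena_alloc, toy_alloc_with_retry) and deliberately much simpler than the real allocator: an arena is reduced to a capacity and a usage counter, and the fallback simply walks the remaining arenas in order. It only illustrates the point at which a retry probe would be hit and why a retry makes arena sharing more likely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an arena: a capacity and a usage counter. */
struct toy_arena {
    size_t capacity;
    size_t used;      /* invariant: used <= capacity */
};

/* Try to carve `size` bytes out of one toy arena. */
static bool toy_arena_alloc(struct toy_arena *a, size_t size)
{
    if (size > a->capacity - a->used)
        return false;
    a->used += size;
    return true;
}

/*
 * Try the preferred arena first and fall back to the others on failure.
 * Returns the index of the arena that satisfied the request, or -1 if
 * none could. *retried is set when the preferred arena failed, which is
 * where one of the *_retry probes would be hit in the real allocator.
 */
int toy_alloc_with_retry(struct toy_arena *arenas, int count,
                         int preferred, size_t size, bool *retried)
{
    *retried = false;
    if (toy_arena_alloc(&arenas[preferred], size))
        return preferred;

    *retried = true;
    for (int i = 0; i < count; i++) {
        if (i != preferred && toy_arena_alloc(&arenas[i], size))
            return i;
    }
    return -1;   /* the allocation fails altogether */
}
```

Once a request lands on an arena other than the preferred one, two threads may end up working on the same arena, which is the contention cost described above.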
Looking out for mallopt
Finally, you may want to see when your mallopt tweaks kicked in. Alternatively, libraries that the application depends on may be doing their own tweaking of malloc behavior, which will have consequences for application performance. There are probe points to indicate when such tweaking is done.
memory_mallopt (int $arg1, int $arg2)
This is a catch-all probe that is hit whenever an application calls the mallopt function. $arg1 and $arg2 are the arguments passed to mallopt.
memory_mallopt_mxfast (int $arg1, int $arg2)
memory_mallopt_trim_threshold (int $arg1, int $arg2, int $arg3)
memory_mallopt_top_pad (int $arg1, int $arg2, int $arg3)
memory_mallopt_mmap_threshold (int $arg1, int $arg2, int $arg3)
memory_mallopt_mmap_max (int $arg1, int $arg2, int $arg3)
memory_mallopt_check_action (int $arg1, int $arg2)
memory_mallopt_perturb (int $arg1, int $arg2)
memory_mallopt_arena_test (int $arg1, int $arg2)
memory_mallopt_arena_max (int $arg1, int $arg2)
These are separate probes for each of the allowed mallopt parameters. $arg1 is the requested value and $arg2 is the previous value of this parameter.
The third argument ($arg3) in some of the probe points above is nonzero if dynamic threshold adjustment was already disabled.
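For reference, this is the kind of call that triggers these probes. The sketch below pins the mmap and trim thresholds with the real mallopt interface. The wrapper name pin_thresholds is hypothetical. As I understand it, setting either parameter also turns off dynamic threshold adjustment, so I would expect $arg3 to be zero for the first call and nonzero for the second, assuming nothing else had already disabled the adjustment.

```c
#include <malloc.h>

/*
 * Pin the mmap and trim thresholds to fixed values.
 * Returns 1 if both mallopt() calls succeeded, 0 otherwise.
 */
int pin_thresholds(int mmap_threshold, int trim_threshold)
{
    /* Expected to hit memory_mallopt and memory_mallopt_mmap_threshold. */
    if (mallopt(M_MMAP_THRESHOLD, mmap_threshold) != 1)
        return 0;

    /* Expected to hit memory_mallopt and memory_mallopt_trim_threshold. */
    if (mallopt(M_TRIM_THRESHOLD, trim_threshold) != 1)
        return 0;

    return 1;
}
```

A call such as pin_thresholds(1024 * 1024, 2 * 1024 * 1024) early in main makes the heap versus mmap decision predictable, at the cost of giving up the automatic tuning described earlier.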
Hopefully this article has helped you gain a deeper understanding of how to tweak malloc behavior for your applications, or even of how the allocator works. There are a number of such static probe points throughout glibc in RHEL-7, including in the dynamic linker, the pthreads implementation and even the math library. Watch out for information on static probe points in future posts.