Adding buffer overflow detection to string functions

This article describes the steps required to add buffer overflow protection to string functions. As a real-world example, we use the strlcpy function, which is implemented in the libbsd library on some GNU/Linux systems.

This kind of buffer overflow protection uses a GNU Compiler Collection (GCC) feature for array size tracking (“source fortification”), accessed through the __builtin_object_size GCC built-in function. In general, these checks are added in a size-checking wrapper function around the original (wrapped) function, which is strlcpy in our example.

Continue reading “Adding buffer overflow detection to string functions”

Share

Practical micro-benchmarking with ‘ltrace’ and ‘sched’

Recently I was asked to look into an issue related to QEMU’s I/O subsystem performance – specifically, I was looking for differences in performance between glibc’s malloc and other malloc implementations’. After a good deal of benchmarking I was unable to see a clear difference between our malloc (glibc) and others, not because the implementations were similar, but because there was too much noise in the system; the benchmarks were being polluted by other things, not just in QEMU, but elsewhere in the operating system. I really just wanted to see how much time malloc was using, but it was a small signal in a large noisy system.

To get to the results I needed, I had to isolate malloc’s activity so I could measure it more accurately. In this blog, I’ll document how I extracted the malloc workload, and how to isolate that workload from other system activity so it can be measured more effectively.

Continue reading “Practical micro-benchmarking with ‘ltrace’ and ‘sched’”

Share

Upgrading the GNU C Library within Red Hat Enterprise Linux

Occasionally, there’s a need for a new GNU C Library for a given application to run.  For example, some versions of the Google Chrome browser started to warn users on Red Hat Enterprise Linux 7 that future versions of Chrome would not support their operating system. The Chromium source code contained a version check, flagging all versions of the GNU C Library (glibc) older than 2.19 as obsolete. This check has since been relaxed to 2.17 (the version in Red Hat Enterprise Linux 7), but it is still worth discussing what we can do to support application binaries in Red Hat Enterprise Linux which require a newer glibc version to run.

Distribution-specific binaries

Before discussing the feasibility of glibc upgrades, it is worth noting that there is a disconnect between how GNU/Linux distributions build the applications they ship as part of the distribution, and how independent software vendors (ISVs) build their application binaries.

Continue reading “Upgrading the GNU C Library within Red Hat Enterprise Linux”

Share

Dirty Tricks: Launching a helper process under memory and latency constraints (pthread_create and vfork)

You need to launch a helper process, and while Linux’s fork is copy-on-write (COW), the page tables still need to be duplicated, and for a large virtual address space that could result in running out of memory and performance degradation. There are a wide array of solutions available to use, but one of them, namely vfork is mostly avoided due to a few difficult issues. First is that vfork pauses the parent thread while the child executes and eventually calls an exec family function, this is a huge latency problem for applications. Secondly is that there are a great many number of considerations to take into account when using vfork in a threaded application, and missing any one of those considerations can lead to serious problems.

It should be possible for posix_spawn to safely do all of this work via POSIX_SPAWN_USEVFORK, but often there is quite a lot of “work” that needs to be done just before the helper calls an exec family function, and that has lead to ever increasingly complex versions of posix_spawn like posix_spawn_file_actions_addclose, posix_spawn_file_actions_adddup2, posix_spawn_file_actions_destroy, posix_spawnattr_destroy, posix_spawnattr_getsigdefault, posix_spawnattr_getflags, posix_spawnattr_getpgroup, posix_spawnattr_getschedparam, posix_spawnattr_getschedpolicy, and posix_spawnattr_getsigmask. It might be simpler if the GNU C Library documented a small subset of functions you can safely call, which is in fact what the preceding functions are modelling. If you happen to select a set of operations that can’t be supported by posix_spawn with vfork then the implementation falls back to fork and you don’t know why. Therefore it is hard to use posix_spawn robustly.

Continue reading “Dirty Tricks: Launching a helper process under memory and latency constraints (pthread_create and vfork)”

Share

Recent improvements to concurrent code in glibc

gnu logoIn this post, I will give examples of recent improvements to concurrent code in glibc, the GNU C library, in the upstream community project. In other words, this is code that can be executed by multiple threads at the same time and has to coordinate accesses to shared data using synchronization. While some of these improvements are user-visible, many of them are not but can serve as examples of how concurrent code in other code bases can be improved.

One of the user-visible improvements is a new implementation of Pthreads semaphores that I contributed. It puts less requirements on when a semaphore can be destructed by a program. Previously, programs had to wait for all calls to sem_wait or sem_post to return before they were allowed to call sem_destroy; now, under certain conditions, a thread that returned from sem_wait can call sem_destroy immediately even though the matching sem_post call has woken this thread but not returned yet. This works if, for example, the semaphore is effectively a reference counter for itself; specifically, the program must still ensure that there are no other concurrent, in-flight sem_wait calls or sem_post calls that are yet to increment the semaphore. The new semaphore implementation is portable code due to being based on C11 atomic operations (see below) and replaces several architecture-specific implementations.

Continue reading “Recent improvements to concurrent code in glibc”

Share

Malloc systemtap probes: an example

gnu logoOne feedback I got from my blog post on Understanding malloc behavior using Systemtap userspace probes was that I should have included an example script to explain how this works. Well, better late than never, so here’s an example script. This script prints some diagnostic information during a program run and also logs some information to print out a summary at the end. I’ll go through the script a few related probes at a time.

global sbrk, waits, arenalist, mmap_threshold = 131072, heaplist

Continue reading “Malloc systemtap probes: an example”

Share

Improving math performance in glibc

Update: Vincent Lefèvre helpfully pointed out that I had linked to the incorrect Worst Cases paper. That link is now fixed.

Update 2: Dan Courcy pointed out that my equation in the “Multiplying zeroes” section had an error, which I have now fixed.

Mathematical function implementations usually have to trade off between speed of computation and the accuracy of the result. This is especially true for transcendentals (i.e. the exponential and trigonometric functions), where results often have to be computed to a fairly large precision to get last bit accuracy in a result that is to be stored in an IEEE-754 double variable.

gnu logoTranscendental functions in glibc are implemented as multiple phases. The first phases of computation use a lookup table of pre-computed values and a polynomial approximation, a combination of which gives an accurate result for a majority of inputs. If it is found that the lookup table may not give an accurate result the next, slower phase is employed. This phase uses a multiple precision representation to compute results to precisions of up to 768 bits before rounding the result to double. As expected, this kills performance; the slowest path for pow for example is a few thousand times slower than the table lookup phase.

I looked into this problem not very long ago and tried to improve performance of the multiple precision phase so that it is not as bad as it currently is. The result of that was an improvement of about 8 times in the performance of the slowest path of the pow function. Other transcendentals got similar improvements since the fixes were mostly in the generic multiple precision code. These improvements went into glibc-2.18 upstream. We’ll take a look at what these changes were but first, a quick look at the multiple precision number for context on the changes.

Continue reading “Improving math performance in glibc”

Share