Recent improvements to concurrent code in glibc
In this post, I will give examples of recent improvements to concurrent code in glibc, the GNU C library, in the upstream community project. Concurrent code is code that can be executed by multiple threads at the same time and has to coordinate accesses to shared data using synchronization. While some of these improvements are user-visible, many are not, but they can still serve as examples of how concurrent code in other code bases can be improved.
One of the user-visible improvements is a new implementation of Pthreads semaphores that I contributed. It places fewer requirements on when a program may destroy a semaphore. Previously, programs had to wait for all calls to sem_wait or sem_post to return before they were allowed to call sem_destroy; now, under certain conditions, a thread that has returned from sem_wait can call sem_destroy immediately, even though the matching sem_post call has woken this thread but has not returned yet. This works if, for example, the semaphore is effectively a reference counter for itself; specifically, the program must still ensure that there are no other concurrent, in-flight sem_wait calls or sem_post calls that are yet to increment the semaphore. The new semaphore implementation is portable code because it is based on C11 atomic operations (see below), and it replaces several architecture-specific implementations.
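The relaxed destruction rule can be sketched as follows. This is only an illustration, assuming a glibc that contains the new semaphore implementation; `struct request`, `worker`, and `run_request` are hypothetical names invented for this example, not glibc code. The semaphore is stored inside the object it guards, and the waiter destroys and frees that object right after sem_wait returns.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

/* Hypothetical request object: the semaphore lives in the same heap
   block that the waiter frees once it has been woken. */
struct request {
    sem_t done;
    int result;
};

static void *worker(void *arg)
{
    struct request *req = arg;
    req->result = 42;
    /* This sem_post is the worker's last access to *req. The waiter may
       destroy and free the request as soon as its sem_wait returns,
       possibly before this call has returned. */
    sem_post(&req->done);
    return NULL;
}

/* Returns the worker's result. Error handling is reduced to assertions
   to keep the sketch short. */
int run_request(void)
{
    struct request *req = malloc(sizeof *req);
    assert(req != NULL);
    int rc = sem_init(&req->done, 0, 0);
    assert(rc == 0);

    pthread_t thread;
    rc = pthread_create(&thread, NULL, worker, req);
    assert(rc == 0);

    /* Retry if a signal handler interrupts the wait. */
    while (sem_wait(&req->done) == -1 && errno == EINTR)
        continue;

    int result = req->result;
    /* No other sem_wait or sem_post on this semaphore is in flight or
       still to come, so destroying it right away is allowed. */
    sem_destroy(&req->done);
    free(req);

    pthread_join(thread, NULL);
    return result;
}
```

Calling `run_request()` from `main` (and linking with the threads library where needed) returns the value the worker stored. Under the old rules, the waiter would have had to wait for the worker's sem_post to return, for example by joining the thread, before calling sem_destroy.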
Another recent improvement, contributed by others in the glibc community, is support for transactional lock elision of Pthreads mutexes on PowerPC; this can improve the performance of critical sections by using Hardware Transactional Memory to execute the code speculatively in parallel, only falling back to using locks when necessary. This complements the existing lock elision support on s390 and Intel systems. Lock elision needs to be explicitly enabled when building glibc.
I have also worked on improving glibc’s internal interfaces around futexes. Futexes are an abstraction offered by the Linux kernel that allows a thread to block until it is woken up by another thread or times out, with the help of the operating system (i.e., unlike when spin-waiting in a busy loop, the OS is aware of the blocking relationship and can, for example, execute another thread on this CPU while the original thread is blocked). This is ongoing work, and it also involves collaboration with the kernel community, which is currently improving the documentation of the futex operations offered by the kernel. Doing so is important because futexes are very useful, yet have not been specified in full detail previously. The efforts in the kernel and the glibc communities should help make futexes easier to use by other programs.
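To make the blocking behavior concrete, here is a minimal Linux-only sketch of waiting on and waking a futex word. It is not glibc's internal interface: the `futex` wrapper, `producer`, and `futex_demo` are hypothetical names for this example, the raw system call is issued through `syscall`, and the `__atomic` builtins assume GCC or Clang.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw system call. */
static long futex(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static uint32_t ready;   /* futex word: 0 = not ready, 1 = ready */
static int payload;      /* plain data published through 'ready' */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
    /* Wake at most one thread blocked on the futex word. */
    futex(&ready, FUTEX_WAKE_PRIVATE, 1);
    return NULL;
}

int futex_demo(void)
{
    __atomic_store_n(&ready, 0, __ATOMIC_RELAXED);

    pthread_t thread;
    int rc = pthread_create(&thread, NULL, producer, NULL);
    assert(rc == 0);

    while (__atomic_load_n(&ready, __ATOMIC_ACQUIRE) == 0) {
        /* Block only if the word still contains 0. The kernel rechecks
           the value, so a wake-up that happens between the load above
           and this call is not lost. The call can also return early
           (value changed, signal, spurious wake-up), which is why the
           condition is rechecked in a loop. */
        futex(&ready, FUTEX_WAIT_PRIVATE, 0);
    }

    int result = payload;
    pthread_join(thread, NULL);
    return result;
}
```

While the waiter is blocked in `FUTEX_WAIT_PRIVATE`, the kernel can schedule other threads on that CPU, which is the difference from a pure spin loop described above.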
I want to conclude this overview with a very low-level improvement, which I consider very important: We have started to transition glibc to using the C11 memory model. In a nutshell, a memory model defines the behavior of a multi-threaded program, in particular how the sequential instruction stream in each thread communicates with other threads through reads and writes to shared memory. A previous post explains the C11/C++11 memory model in some more detail (note that while the interfaces that C11 and C++11 provide for synchronization differ, the model itself is intentionally the same).
I mention the memory model because it really is the foundation on top of which other concurrency abstractions such as mutexes or semaphores can be implemented in glibc. The C11 memory model is a programming-language memory model, and it can be implemented on top of the memory models of the various hardware architectures supported by glibc.
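A small message-passing example in standard C11 shows how threads communicate through shared memory under this model. It is a sketch for illustration only, with hypothetical names (`writer`, `reader`, `message_passing_demo`); it uses `<stdatomic.h>` directly rather than anything glibc-internal.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int data;          /* plain, non-atomic payload */
static atomic_int flag;   /* synchronization variable */

static void *writer(void *arg)
{
    (void)arg;
    data = 42;
    /* Release store: everything sequenced before it in this thread
       happens-before whatever follows an acquire load that reads
       the value 1 from this store. */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

static void *reader(void *out)
{
    /* Acquire load: spin until we read the value stored by writer(). */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        continue;
    /* No data race: the write to 'data' happens-before this read. */
    *(int *)out = data;
    return NULL;
}

int message_passing_demo(void)
{
    int seen = 0;
    atomic_store_explicit(&flag, 0, memory_order_relaxed);

    pthread_t w, r;
    int rc = pthread_create(&r, NULL, reader, &seen);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

If both accesses to `flag` used `memory_order_relaxed` instead, the read of `data` would no longer be ordered after the write, and the program would contain a data race.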
Previously – and still in the vast majority of glibc concurrent code that has not been changed to use the C11 model – we were using an insufficiently documented memory model and a mixture of normal memory accesses and architecture-specific assembly implementations of atomic operations to synchronize in shared memory. This allowed experts to write correct concurrent code, but using the C11 memory model and atomic operations has advantages:
- Over time, more programmers will become familiar with the C11 memory model; using it lowers the learning curve for new glibc developers. Also, we get access to the existing and future tool support for this model. For example, the cppmem tool is a great way to explore all possible executions of small snippets of C11-like concurrent code; it runs in your browser, and can be really helpful to understand the model by example – and interactively!
- It puts glibc’s interactions with, and expectations on, the compiler on a well-specified foundation, namely the C11 memory model. Prior to C11, C didn’t define behavior of multi-threaded programs, so one often essentially relied on knowledge about compilers and their specific implementation to write working concurrent code.
- Requiring code to be data-race free is necessary to tell the compiler which memory accesses are part of synchronization and must not be optimized like sequential code. A useful side effect is that it becomes easy to spot which accesses to memory are actually part of synchronization, because either the atomic operation or the data type is visible in the code; in other words, it makes potentially complex concurrent code stand out more.
- In glibc’s case, we don’t lose anything in terms of performance, at least when a decent compiler is used to build glibc. The previous atomic operations are a subset of what is offered by C11, and in a few cases we were able to select the required hardware barriers more carefully.
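As an illustration of selecting barriers more carefully, consider a reference counter written with C11 atomics. This is a generic, well-known pattern and not code taken from glibc; `struct object`, `object_ref`, `object_unref`, and `refcount_demo` are hypothetical names. Only the decrement needs ordering; the increment can be relaxed, and the acquire fence is paid only by the thread that actually frees the object.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct object {
    atomic_int refs;
    int payload;
};

static atomic_int destroyed;   /* counts destructions, for the demo only */

static void object_ref(struct object *o)
{
    /* The caller already holds a reference, so the object cannot go
       away concurrently and no ordering is required here. */
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

static void object_unref(struct object *o)
{
    /* Release: our earlier accesses to *o are ordered before the
       decrement becomes visible to the thread that frees the object. */
    if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_release) == 1) {
        /* Acquire fence: only the last owner pays for it. It makes the
           other threads' accesses happen-before the free below. */
        atomic_thread_fence(memory_order_acquire);
        atomic_fetch_add_explicit(&destroyed, 1, memory_order_relaxed);
        free(o);
    }
}

static void *user(void *arg)
{
    struct object *o = arg;
    object_ref(o);            /* take an extra reference ...        */
    int value = o->payload;   /* ... read the object ...            */
    (void)value;
    object_unref(o);          /* ... and drop both references again */
    object_unref(o);
    return NULL;
}

/* Returns how many times the object was destroyed. */
int refcount_demo(void)
{
    enum { NTHREADS = 4 };
    struct object *o = malloc(sizeof *o);
    assert(o != NULL);
    atomic_init(&o->refs, 1);
    o->payload = 7;
    atomic_store_explicit(&destroyed, 0, memory_order_relaxed);

    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        object_ref(o);        /* reference handed over to the thread */
        int rc = pthread_create(&threads[i], NULL, user, o);
        assert(rc == 0);
    }
    object_unref(o);          /* the creator drops its own reference */

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return atomic_load_explicit(&destroyed, memory_order_relaxed);
}
```

A version that used sequentially consistent operations everywhere would also be correct, but on some architectures it would emit stronger barriers than this pattern requires.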
Of course, transforming all concurrent code in a code base as big as glibc is not a simple change but rather a multi-phase process with incremental steps. We made a few trade-offs to ease transitioning to the C11 model, which are explained in more detail on a glibc wiki page. Here are a few choices we made that might be useful approaches for your code bases too:
- We introduce the new, C11-like atomic operations alongside the old atomic operations, allowing us to transition one cluster of atomic code in glibc at a time instead of having to switch everything at once. The C11-like atomics do have the same semantics as their counterparts in C11 but different names, so that we do not conflict with any actual C11 code.
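- We require the memory order of an atomic operation to always be explicitly specified. We want efficient code, so programmers should make a conscious choice instead of silently getting the default, which in C11 is the strongest order (sequential consistency).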
- We do not use explicitly atomic types for data accessed by atomic operations (e.g., equivalents of C11’s atomic types). Note that this is not ideal, and I would not recommend it for new code. Nonetheless, we know that the existing data type declarations in glibc work correctly, so things like requirements on alignment to make atomic operations work are already taken care of. Perhaps more importantly, we require all accesses to a certain variable to use atomic operations if just one atomic access to it exists; this tells the compilers we support that the variable is in fact used for synchronization.
- The documentation guidelines request that concurrent code should be documented using the terms and semantics specified in the C11 memory model (e.g., relations such as happens-before or reads-from). There should not be a disconnect between the model the code is based on and the terminology used to document the code.
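The choices above can be sketched in a few lines. The `demo_atomic_*` macros below are hypothetical stand-ins written for this example, not glibc's actual definitions; they assume the GCC/Clang `__atomic` builtins. They show the general shape: names that differ from C11's, the memory order spelled out in every operation, plain types for the shared variable, and comments that use the C11 model's vocabulary.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical C11-like operations with distinct names. The memory
   order is part of the name, so it is always chosen explicitly. */
#define demo_atomic_load_acquire(mem) \
    __atomic_load_n((mem), __ATOMIC_ACQUIRE)
#define demo_atomic_store_release(mem, val) \
    __atomic_store_n((mem), (val), __ATOMIC_RELEASE)
#define demo_atomic_store_relaxed(mem, val) \
    __atomic_store_n((mem), (val), __ATOMIC_RELAXED)

/* A plain int, not an atomic type. By convention, ALL accesses to it
   go through the demo_atomic_* operations. */
static int initialized;
/* Written before, and read after, synchronization on 'initialized'. */
static int config_value;

static void *initializer(void *arg)
{
    (void)arg;
    config_value = 123;
    /* Release MO: the write to config_value happens-before any code
       that follows an acquire load which reads-from this store. */
    demo_atomic_store_release(&initialized, 1);
    return NULL;
}

static void *consumer(void *out)
{
    /* Acquire MO: once this load reads-from the release store in
       initializer(), the read of config_value below is ordered after
       the write to it. */
    while (demo_atomic_load_acquire(&initialized) == 0)
        continue;
    *(int *)out = config_value;
    return NULL;
}

int style_demo(void)
{
    int seen = 0;
    /* Relaxed MO suffices: pthread_create orders this store before
       everything the new threads do. */
    demo_atomic_store_relaxed(&initialized, 0);

    pthread_t a, b;
    int rc = pthread_create(&a, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, initializer, NULL);
    assert(rc == 0);

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return seen;
}
```

Because old and new operations can coexist under different names, a code base can convert one group of related accesses at a time, as long as every access to a given variable ends up using the atomic operations.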
I hope that this look at what has been happening recently in the upstream glibc project was interesting for you. Feel free to leave comments if you have further questions.