Multi-Thread, Async-Signal and Async-Cancel Safety Docs in GNU libc

POSIX specifies which of the interfaces it defines are safe to call in multi-threaded (MT) programs, in asynchronous signal (AS) handlers, and when asynchronous thread cancellation (AC) is enabled. But to what extent does GNU libc comply with POSIX, or even strive to? In this article we will dig into this topic and look at a recent project Red Hat has completed together with the upstream community.


Although GNU libc strives to comply with POSIX, there have long been differences between POSIX and GNU libc when it comes to these safety properties. It is not just that GNU libc has extensions that affect them: there could also be implementation errors that led to unintended and unknown differences.

In order to alleviate this problem, last year, we started a GNU libc code review project to assess and document these safety properties. The project did not uncover any major surprises, but it produced information that was long needed in the GNU libc manual. As of GNU libc 2.19, all functions in the manual carry the results of this safety assessment, and this documentation is in the process of moving downstream into Red Hat’s products.

Work in progress

The included information is not yet in its final form, because it states what the properties of the implementation are as of specific releases, rather than an ABI-like commitment, by the GNU libc community, to retain these properties in future releases. This distinction is relevant in the following situations:

  • when our implementation is safe, while current POSIX does not require safety, because future revisions of POSIX might require us to make changes that cannot be implemented while retaining the safety we currently provide; and
  • when POSIX demands safety, while our current implementation requires developers to take additional steps to achieve safety, because future implementation changes, intended to attain POSIX’s safety requirements for all users, may cause significant performance degradation of the affected functions. This would be in addition to the overhead incurred by callers targeting the current implementations for using our recommended workarounds that enable the functions to be called safely.

For the reasons above, until the GNU libc community commits to safety features in the same way it commits to ABIs, any differences between documented safety properties and POSIX requirements should be kept on the software maintenance radar.

Although we have envisioned marking all such differences, and introduced infrastructure to do so, we have not got to that point yet. If you notice such an undocumented difference, and you have always wanted to contribute to GNU libc, here is your chance! 🙂

Why are functions not safe?

Ideally, every function would be safe to call in any imaginable circumstance, but efficiency and history often make this impossible.

Take asynchronous thread cancellation, for one, and resource acquisition functions, e.g., memory allocation, file descriptor opening, mutex locking. We would like the resource to be released if and only if it was acquired by the time the thread is canceled. A caller might try this (in pseudocode, with try/catch standing in for a cancellation cleanup handler):

ptr = NULL;
try {
  ptr = malloc (sizeof (*ptr));
}
catch (...) {
  free (ptr);
}

but if cancellation happens between the instruction that calls malloc and the one that stores the returned pointer in ptr, the cleanup will call free on a NULL pointer, which does nothing, and the allocated block is leaked. Leaking memory or file descriptors might be tolerable in some cases (which is why we have not regarded functions that may leak such resources as AC-Unsafe), but locks can be more of a problem:

int islocked = 0;
try {
  flockfile (stderr);
  islocked = 1;
  // multiple writes to stderr
  funlockfile (stderr);
  islocked = 0;
}
catch (...) {
  if (islocked)
    funlockfile (stderr);
}

If the thread is canceled after flockfile takes the lock but before islocked is set, the canceled thread will leave stderr locked forever; if cancellation hits after funlockfile releases the lock but before islocked is cleared, the thread may release the lock one too many times, invoking undefined behavior. Moving the assignments cannot avoid the problem: either both acquisition and assignment must be executed, or neither must be.

Library interfaces could be reworked so as to save callers from the problem:

flockfile_setflag (stderr, &islocked);

or:

malloc_into (sizeof (*ptr), &ptr);

But then, the libraries would have to overcome the problem of atomically acquiring the resource and storing a value through the passed-in pointer; that is, any of the atomic instructions often used to take a lock would also have to set the flag. Finding CPUs that offer such complex instructions would be a challenge. Given these implementation challenges, it should be no surprise that POSIX requires so few interfaces to be AC-Safe.
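Since the window between acquisition and bookkeeping cannot be closed under asynchronous cancellation, one approach callers can take is to switch the thread to deferred cancellation while the lock is held. The sketch below is only an illustration under stated assumptions: write_locked and unlock_stream are hypothetical names, and it assumes flockfile is not itself a cancellation point, so that nothing can cancel the thread between taking the lock and registering the cleanup handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>

/* Cleanup handler: runs only if the thread is canceled (or exits)
   while the stream is still locked. */
static void unlock_stream(void *arg)
{
    funlockfile((FILE *)arg);
}

/* Hypothetical helper: write two strings to fp as one locked unit.
   Returns 0 on success, or the error number from pthread_setcanceltype. */
int write_locked(FILE *fp, const char *first, const char *second)
{
    int oldtype, ignored;

    /* Leave asynchronous mode before acquiring anything. */
    int err = pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &oldtype);
    if (err != 0)
        return err;

    /* Assumption: no cancellation point between these two lines. */
    flockfile(fp);
    pthread_cleanup_push(unlock_stream, fp);

    fputs(first, fp);   /* may act on a pending deferred cancellation */
    fputs(second, fp);

    pthread_cleanup_pop(1);   /* pops the handler and runs it: unlocks fp */

    /* Restore whatever cancellation type the caller had. */
    return pthread_setcanceltype(oldtype, &ignored);
}
```

With deferred cancellation, the thread can only be canceled inside calls that are cancellation points, and by then the cleanup handler is already registered, so no separate islocked flag is needed.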

Locks also make for asynchronous signal safety problems. Say a data structure is guarded by a lock and a signal interrupts a thread while it holds the lock. If the signal handler attempts to take the lock again, it will deadlock. A recursive lock would most often just trade one set of problems for another: if the interrupted thread was part-way through an update, the signal handler will find an inconsistent data structure, and changes it makes can be partially undone by the interrupted thread when it regains control. Just think of an insertion in a doubly-linked list interrupted at any point by another insertion.
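A common way for callers to stay clear of both the deadlock and the corruption scenario is to have the handler do nothing but record that the signal arrived, and to touch lock-protected data only from normal context. The following is a minimal sketch of that pattern; install_handler and consume_signal are hypothetical names chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* The only state shared between the handler and the rest of the program. */
static volatile sig_atomic_t got_signal = 0;

/* Handler: takes no locks and touches no shared data structures. */
static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Hypothetical helper: install the handler for signo.
   Returns 0 on success, -1 on failure (as sigaction does). */
int install_handler(int signo)
{
    struct sigaction sa;
    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Called from normal (non-handler) context, where taking locks is fine.
   Returns 1 if at least one signal arrived since the last call.
   Signals arriving between the test and the reset are coalesced. */
int consume_signal(void)
{
    if (!got_signal)
        return 0;
    got_signal = 0;
    return 1;
}
```

The main loop would call consume_signal periodically and perform the real work, including any locking, outside the handler.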

Alas, avoiding mutual exclusion constructs is seldom an option, and even when it is, it may have its caveats. Locale objects, for example, are created and managed by GNU libc in such a way that, once they are instantiated, they are never modified or released. In theory, this enables readers to avoid locking, but in practice, changing the active locale may cause functions to behave in ways that are not consistent with either locale. To avoid deeming all locale users MT-Unsafe, we deemed the locale-changing ones MT-Unsafe, so that the locale cannot change after multiple threads are started. Once it was decided that it had to be constant in multi-threaded situations, reading from the locale object without synchronization was no longer a reason to deem a function MT-Unsafe.
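From the caller's side, this rule amounts to choosing the locale once, while the process is still single-threaded, and only reading it afterwards. A minimal sketch, with hypothetical names (struct job, count_letters, count_in_two_threads) and the "C" locale chosen only to keep the example deterministic:

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <locale.h>
#include <pthread.h>
#include <stddef.h>

struct job {
    const char *text;
    int letters;
};

/* Worker: a locale reader. isalpha consults the active locale,
   which must not change while this runs. */
static void *count_letters(void *arg)
{
    struct job *j = arg;
    for (const char *p = j->text; *p != '\0'; p++)
        if (isalpha((unsigned char)*p))
            j->letters++;
    return NULL;
}

/* Hypothetical helper: set the locale first, then start the threads.
   Returns the total letter count, or -1 on failure. */
int count_in_two_threads(const char *a, const char *b)
{
    /* The locale setter runs before any other thread exists. */
    if (setlocale(LC_ALL, "C") == NULL)
        return -1;

    struct job jobs[2] = { { a, 0 }, { b, 0 } };
    pthread_t tid[2];
    int started = 0;

    while (started < 2) {
        if (pthread_create(&tid[started], NULL, count_letters,
                           &jobs[started]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return started == 2 ? jobs[0].letters + jobs[1].letters : -1;
}
```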

This sort of reasoning guided one of our first decisions: to go beyond just marking functions as MT-, AS- or AC-Safe or Unsafe depending on whether they behave according to their specifications or may deviate from them in undefined ways, if called in multi-threaded programs, in asynchronous signal handlers, or when asynchronous thread cancellation is enabled. When they were found unsafe, we wanted to indicate why, and provide advice and workarounds when possible, sometimes coordinating with callers of otherwise safe functions.

Safety Annotations

Over the year devoted to the project, we have evolved a set of keywords that group functions that share similar safety features.

For example, locale readers are annotated with the “locale” keyword, whereas locale setters are deemed MT-Unsafe and marked with “const:locale”, a reminder that the locale must be constant for safety. Functions that, when called from signal handlers, may find or cause corrupt data structures are marked with “corrupt” under AS-Unsafe, whereas those that may deadlock are marked with “lock” there. Those that, upon cancellation, may leak memory, file descriptors or locks are marked, under AC-Safe or AC-Unsafe, with “mem”, “fd” or “lock”.

If a function uses a static buffer for its return value, or an internal variable in ways that may destructively interfere with other threads using the same variable, we qualify its MT-Unsafe (or MT-Safe, see below) status with the “race” keyword, because calling it in multi-threaded programs is likely to exercise race conditions and cause the function to deviate from its specified behavior.
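A familiar instance of the internal-variable case is strtok, which remembers its position in the string between calls in a hidden static variable, so two threads tokenizing at the same time can disturb each other. Its reentrant counterpart strtok_r keeps that state in a caller-provided pointer instead. A minimal sketch, where count_tokens is a hypothetical helper:

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the tokens in buf separated by any
   character in delims. Like strtok, this modifies buf in place. */
size_t count_tokens(char *buf, const char *delims)
{
    size_t count = 0;
    char *state = NULL;   /* per-call state, owned by the caller */

    for (char *tok = strtok_r(buf, delims, &state);
         tok != NULL;
         tok = strtok_r(NULL, delims, &state))
        count++;

    return count;
}
```

Because the state lives in a local variable, each thread calling count_tokens on its own buffer has nothing to race on.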

Can racy functions be MT-Safe?

Surprisingly, yes, or at least this appears to be compatible with the informal requirements set forth by POSIX. Consider, for example, a function such as memset. If it is run concurrently by two or more threads, writing to the same buffer, it will not take any responsibility for avoiding the race condition. Nevertheless, POSIX states memset is MT-Safe.

Which is not to say that the race does not invoke undefined behavior. It does, but our understanding is that it is the user’s responsibility to avoid races involving user-chosen buffers and variables, by introducing appropriate synchronization primitives. A library function accessing or modifying caller-chosen buffers is doing so on caller’s behalf.
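To make the division of responsibility concrete, here is a minimal sketch in which two threads memset the same caller-chosen buffer, and the caller supplies the synchronization. All names (struct shared_buf, filler, fill_from_two_threads) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* A caller-chosen buffer, together with the caller's own lock. */
struct shared_buf {
    pthread_mutex_t lock;
    char data[4096];
};

static struct shared_buf buf = { PTHREAD_MUTEX_INITIALIZER, { 0 } };

/* Each thread fills the whole buffer with its own byte value.
   memset does no locking; the mutex around it is the caller's job. */
static void *filler(void *arg)
{
    int value = *(const int *)arg;
    pthread_mutex_lock(&buf.lock);
    memset(buf.data, value, sizeof buf.data);
    pthread_mutex_unlock(&buf.lock);
    return NULL;
}

/* Returns 1 if the buffer ends up uniformly filled by one of the
   threads, 0 if the fills were interleaved, -1 on thread errors. */
int fill_from_two_threads(void)
{
    int a = 'A', b = 'B';
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, filler, &a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, filler, &b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    /* Both threads are joined, so reading without the lock is safe. */
    for (size_t i = 1; i < sizeof buf.data; i++)
        if (buf.data[i] != buf.data[0])
            return 0;
    return 1;
}
```

Without the mutex, the two memset calls would race on the buffer, and under the interpretation above that race would be the caller's doing rather than a defect in memset.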

Our interpretation is that the responsibility of the implementation, when an MT-Safe implementation is required, is to avoid introducing races of its own, e.g., when accessing internal data structures that users might not even know to be there. Static buffers used for return values, although exposed to users, are chosen internally by the implementation, so functions whose interfaces require the use of such buffers cannot possibly be MT-Safe, as noted in POSIX.

With this interpretation we have adopted, it does not matter whether FILE streams are implicitly synchronized by default as a mere convenience to users, or as a consequence of the fact that some functions implicitly operate on streams that are not chosen by callers (e.g., printf always writes to stdout, which would require printf to be deemed MT-Unsafe if any functions operating on streams did not synchronize). The interpretation is consistent with both possibilities.

Other MT-Safe functions that may trigger races are those that deallocate objects that other threads might be using, whether synchronization objects (locks, condition variables, etc.) or streams being closed. The release functions (free, fclose) are MT-Safe, in spite of these potential races, because (we reasoned) accessing an object after it is released (or concurrently with its release) invokes undefined behavior, and since these are caller-chosen objects, the caller is responsible for avoiding it. When there is internal synchronization in the implementation, it defines an order in which the release clearly happens before the access; otherwise, in the absence of synchronization, there are concurrent accesses from different threads, which amounts to a user-initiated race condition.
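In practice, this means the caller has to order the release after every other thread's last use of the object. A minimal sketch, using pthread_join to establish that order before calling free; struct task, square and square_in_thread are hypothetical names:

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>

struct task {
    int input;
    int result;
};

/* Worker: uses the heap object chosen by the caller. */
static void *square(void *arg)
{
    struct task *t = arg;
    t->result = t->input * t->input;
    return NULL;
}

/* Hypothetical helper: compute x*x in another thread.
   Returns 0 and stores the result in *out, or -1 on failure. */
int square_in_thread(int x, int *out)
{
    struct task *t = malloc(sizeof *t);
    if (t == NULL)
        return -1;
    t->input = x;
    t->result = 0;

    pthread_t tid;
    if (pthread_create(&tid, NULL, square, t) != 0) {
        free(t);   /* no other thread ever saw t */
        return -1;
    }

    /* The join orders the worker's last access before the free below.
       Freeing t before this point would be the caller's race. */
    if (pthread_join(tid, NULL) != 0)
        return -1;   /* t is deliberately leaked: the worker may still use it */

    *out = t->result;
    free(t);
    return 0;
}
```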

In spite of our regarding synchronization on caller-chosen objects as caller responsibility, in some cases of opaque objects passed as arguments, users might reasonably assume library-implemented synchronization, by analogy with FILE streams, whereas the library does not have to and does not implement such synchronization. In cases in which such assumptions seemed likely, we explicitly annotated functions with “race” on the named arguments, but without deeming the functions MT-Unsafe for this reason.

Read more

For additional details about the annotations, their meanings and rationales, refer to the GNU libc manual, starting at the info node “POSIX Safety Concepts” of upstream version 2.19 or newer. The current safety assessment of each function is next to the function definition, in the same manual.

Furthermore, we have started collaboration with the Linux Documentation Project to bring equivalent documentation, under the same conventions, to man pages in future releases of that project.

If you have thoughts or comments on this topic, we would love to hear from you in the comments.

