
Restartable sequences (rseq) are a Linux kernel feature that makes it possible to maintain per-CPU data structures in userspace without relying on atomic instructions. A restartable sequence is written under the assumption that it runs from beginning to end without the kernel interrupting it to run other code on that CPU. It can therefore access per-CPU data without further synchronization.

The term "restartable" refers to the fallback mechanism that kicks in if the kernel does have to interrupt the sequence, for example to reschedule the thread. In this case, control is transferred to a fallback path, which can retry the execution or use a different algorithm to implement the required functionality. It turns out that this facility is sufficient to implement a variety of algorithms using per-CPU data, especially if combined with an explicit memory barrier system call.

Why add rseq to Red Hat Enterprise Linux?

Not much software uses restartable sequences in production yet; Google's tcmalloc is one example. But the rseq kernel facility provides one more benefit: the kernel writes the number of the currently running CPU to a data structure shared with userspace. This is sufficient for a userspace-only (and therefore fast) implementation of the sched_getcpu function that glibc provides.

Why does sched_getcpu performance matter? Some database software needs a fast sched_getcpu function for optimization purposes. Traditionally, Linux provides the getcpu system call to obtain the number of the current CPU, but the system call overhead, unfortunately, defeats its use for performance optimizations. Some architectures already provide a fast getcpu function (the Linux variant of sched_getcpu) via the vDSO acceleration mechanism, but it turns out that this vDSO approach is impossible to implement on AArch64. Other architectures use some hidden and otherwise unused CPU state to stash the CPU number, but there is simply no available CPU state on AArch64 that could be used for this purpose. This means that for performance parity, AArch64 really needs support for restartable sequences.
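As an illustration of the kind of optimization that depends on a cheap sched_getcpu, here is a minimal sketch of a counter split into per-CPU shards. It is not taken from any particular database; the names NUM_SHARDS, counter_add, and counter_read are hypothetical, and the shard count of 64 is an arbitrary assumption. Note that this sketch still uses atomic updates, because the thread can migrate right after sched_getcpu returns; the CPU number only serves as a hint that keeps contention low.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* sched_getcpu is a GNU extension */
#endif
#include <sched.h>
#include <stdatomic.h>

#define NUM_SHARDS 64        /* hypothetical bound, not the real CPU count */

/* One counter per shard, aligned to 64 bytes to limit false sharing. */
struct shard {
    _Alignas(64) atomic_ulong value;
};

static struct shard shards[NUM_SHARDS];

/* Add n to the shard selected by the current CPU number. */
static void counter_add(unsigned long n)
{
    int cpu = sched_getcpu();
    /* sched_getcpu returns -1 on error; fall back to shard 0. */
    unsigned int idx = cpu < 0 ? 0u : (unsigned int)cpu % NUM_SHARDS;
    /* The thread may have migrated by now, so the update stays atomic. */
    atomic_fetch_add_explicit(&shards[idx].value, n, memory_order_relaxed);
}

/* Sum all shards to obtain the total. */
static unsigned long counter_read(void)
{
    unsigned long total = 0;
    for (int i = 0; i < NUM_SHARDS; i++)
        total += atomic_load_explicit(&shards[i].value, memory_order_relaxed);
    return total;
}
```

Every call to counter_add pays for one sched_getcpu call, so if that call costs a full system call, the sharding no longer pays off.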

In Red Hat Enterprise Linux (RHEL), the sched_getcpu function is provided by the GNU C Library (glibc). We did not want to build a custom glibc variant just for AArch64, which is why we chose to enable rseq on all architectures. Furthermore, the rseq-based sched_getcpu function turned out to be slightly faster than the previous vDSO-based implementation.

The glibc ABI impact

Restartable sequences involve a memory area shared between userspace and kernel, which is updated by the kernel. The Linux kernel only supports one such area per thread, which means that all rseq-using code in the process needs to use that single area. Once glibc starts using rseq for its sched_getcpu function, direct activation of the rseq area via the rseq system call fails in applications because an area has already been registered by glibc. Therefore, applications need to reuse the glibc-managed rseq area. In the glibc 2.35 upstream version, we solved this coordination problem by exposing three new symbols (__rseq_size, __rseq_flags, and __rseq_offset) as part of the glibc application binary interface (ABI). Applications that want to use the rseq facility can use these ABI symbols to locate the rseq area and use it according to the restartable sequences protocol.
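The following is a minimal sketch of how an application might locate the glibc-managed rseq area and read the CPU number the kernel stored there. It assumes a glibc that ships <sys/rseq.h> with the three symbols, plus a compiler that provides __builtin_thread_pointer; when either is missing, the sketch simply reports that the area is unavailable. The function name rseq_area_cpu and the macros HAVE_GLIBC_RSEQ and CAN_LOCATE_RSEQ are hypothetical and not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: <sys/rseq.h> exists only with glibc 2.35 or a backport. */
#if defined(__has_include)
# if __has_include(<sys/rseq.h>)
#  include <sys/rseq.h>
#  define HAVE_GLIBC_RSEQ 1
# endif
#endif

/* Assumption: the compiler can return the thread pointer directly. */
#if defined(HAVE_GLIBC_RSEQ) && defined(__has_builtin)
# if __has_builtin(__builtin_thread_pointer)
#  define CAN_LOCATE_RSEQ 1
# endif
#endif

/* Return the CPU number from this thread's rseq area, or -1 if the
   area cannot be located or glibc has not registered it. */
static int rseq_area_cpu(void)
{
#ifdef CAN_LOCATE_RSEQ
    /* A size of zero means registration is disabled or failed. */
    if (__rseq_size == 0)
        return -1;
    /* The area sits at a fixed offset from the thread pointer. */
    struct rseq *area =
        (struct rseq *)((char *)__builtin_thread_pointer() + __rseq_offset);
    /* The kernel updates cpu_id behind our back, so read it as volatile. */
    return (int)*(volatile uint32_t *)&area->cpu_id;
#else
    return -1;
#endif
}
```

A result of -1 is expected whenever glibc's use of rseq is turned off or the kernel refuses the registration. Actual restartable critical sections need architecture-specific assembly on top of this and are beyond the scope of the sketch.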

RHEL 9 is based on glibc 2.34, not 2.35, so it does not immediately benefit from this upstream work. Initially, we backported the rseq-enhanced sched_getcpu into Red Hat Enterprise Linux 9.0 in a disabled-by-default configuration without the new symbols. Users could activate the glibc.pthread.rseq glibc tunable to enable it. In the default configuration, applications can register rseq directly with the kernel, but sched_getcpu on AArch64 remains very slow until the tunable is set manually on a per-process basis.

In Red Hat Enterprise Linux 9.1, we will enable glibc's use of restartable sequences by default. But this means we also need to provide a coordination mechanism for applications to use. We could have come up with a RHEL-specific approach or a custom version of librseq that applications are expected to use on RHEL. But in the end, we decided to backport the rseq-specific parts of the glibc 2.35 ABI into the glibc version that will be released with Red Hat Enterprise Linux 9.1. We have traditionally not augmented the glibc ABI after a release, but we felt that the use case was important enough to make an exception here.

Our RPM-based dependency management does not work on individual symbols (unlike Debian's symbol files). This means that, in general, we have to backport all symbols in the version set rather than the few we are actually interested in. In this case, we knew about the possible backport requirement when making the change upstream, so we made sure that the relevant symbol set included just the three symbols: __rseq_size, __rseq_flags, and __rseq_offset.

To maintain a consistent glibc ABI across the entire RHEL 9 release, a RHEL 9.0 glibc update includes these new symbols as well. However, unlike in RHEL 9.1, glibc's own use of rseq remains disabled by default there. This additional backport ensures that rseq-enabled applications built against RHEL 9.1 or later can still launch on RHEL 9.0, although by default they run without the glibc-integrated rseq facility.

The glibc.pthread.rseq tunable remains available in future RHEL 9 glibc versions and can be used to disable glibc's use of rseq in case there are compatibility issues. For example, we worked with the CRIU authors to make sure that checkpointing and restoring applications continues to work despite rseq usage by the process, but for compatibility with checkpointing tools that lack rseq awareness, glibc's use of the feature can be disabled. Likewise, disabling glibc's use makes rseq registration available once more to early adopters of rseq that have not yet been ported to the glibc coordination mechanism (such as tcmalloc).

Restartable sequences availability

Starting with Red Hat Enterprise Linux 9.1, restartable sequences are available as part of the system C library, and they are used to accelerate the sched_getcpu function that is critical to some database workloads. Maintaining compatibility with future rseq-using applications required adding new ABI symbols. A glibc update has been released for RHEL 9.0 to maintain a consistent glibc ABI across all RHEL 9 minor releases.