Guidelines for instruction encoding in the NOP space
More and more CPUs implement new features with instructions that are executed as NOPs (no-operation instructions) on previous CPU generations. This results in some challenges for operating system builders, particularly in the area of legacy software support.
An instruction that can be ignored completely by some CPUs does not seem useful, but it turns out that there are quite a few applications for it, including:
- Performance hints, such as marking atomic loads and stores which are part of mutex lock and unlock operations.
- Optional array bounds checking, as once implemented by Intel MPX.
- Security hardening, such as verifying at load time that memory is read-only (and not writable), for making it harder to inject C++ vtables or bytecode.
- Markers in the instruction stream, for control-flow integrity validation (e.g., Intel CET), or as hints to dynamic instrumentation tools, such as Valgrind.
Using an instruction encoding in the NOP space means that operating system and application developers can deliver a single set of binaries. These binaries work on both old and new CPUs. On old CPUs, due to their no-operation nature, the instructions and the information they provide are completely ignored. New CPUs that recognize the instructions can run the same binaries, which then benefit from the additional CPU features. A few caveats apply if this is to work smoothly in practice.
NOP support between implementations can vary
Most instruction sets have multiple independent implementations. Even if there is just a single CPU vendor for real silicon, distributions like Fedora and Red Hat Enterprise Linux deal with at least three implementations: the actual hardware, emulation support in QEMU, and the Valgrind instruction decoder and compiler. If one of the existing implementations does not treat a new instruction as a NOP even though its encoding should be in the NOP space, the results can be unpredictable. Often, these not-yet-implemented NOPs result in illegal instruction traps, which completely invalidates the reason for using instructions in the NOP space in the first place.
For example, the Intel x86 architecture initially supported only a limited length for NOP instructions. Support for longer NOPs first appeared in the Intel Pentium Pro CPU, but long NOP support was not present in all CPUs with an otherwise comparable feature set, resulting in interoperability issues. Binaries using long NOPs crashed with an illegal instruction trap on CPUs which did not support them.
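To make the long NOP case concrete, the following sketch lists the multi-byte NOP encodings commonly recommended for x86 and pads a gap with them. Encodings of three bytes or more use the two-byte opcode 0F 1F, which is the form that older CPUs reject. The helper name pad_with_nops and its allow_long_nops parameter are hypothetical and only illustrate the choice a toolchain has to make; this is not how any particular assembler implements padding.

```python
# x86 NOP encodings by length in bytes. Entries of length 3 and up use the
# 0F 1F opcode ("long NOP"), which traps on CPUs that do not implement it.
NOP_ENCODINGS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}


def pad_with_nops(length, allow_long_nops=True):
    """Return `length` bytes of NOP padding (hypothetical helper).

    With allow_long_nops=False only the single-byte NOP (0x90) is used,
    which is the safe choice when the target CPU is unknown.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    largest = max(NOP_ENCODINGS) if allow_long_nops else 1
    padding = b""
    remaining = length
    while remaining > 0:
        # Greedily take the longest permitted encoding that still fits.
        step = min(remaining, largest)
        padding += NOP_ENCODINGS[step]
        remaining -= step
    return padding
```

Padding built with allow_long_nops=False is longer in instruction count but runs on every x86 CPU, which is the trade-off the interoperability problem above forces on toolchains.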
Changed NOP behavior needs to be optional
As an additional safety measure, instructions in the NOP space should not become active immediately after merely upgrading the CPU. There are two reasons for this. First, a toolchain (compiler/assembler/linker combination) may have produced these NOPs by accident (e.g., a non-mainstream toolchain that the CPU vendor is not tracking effectively). Second, there could be existing binaries that use the NOP instructions deliberately. Because the NOPs have been ignored so far, no longer ignoring them could expose new bugs (incorrect results or crashes). These bugs could be either genuine toolchain bugs or simple misuse of the new feature by application developers who did not have an opportunity to test their code.
Both cases can be very challenging for users who want to upgrade their hardware. Software usually outlives hardware. It is not always possible to rebuild affected binaries from source code (or re-link them against updated libraries). Under such circumstances, new instructions in the NOP space can delay hardware updates indefinitely.
In the past, virtualization was a sufficient solution to this dilemma: The hypervisor could provide a CPU model that disables these NOP-space instructions, even though the physical CPU would support them. Of course, this still needs some level of support in the CPU for a split configuration: new instruction support for the hypervisor itself and some guests, but no support (true NOP operation) for other guests. (Typically, system administrators do not want to disable security hardening features on the hypervisor for the sake of a single guest.)
In a container-based world, such coarse control does not appear to be sufficient. In environments with high security requirements (or multiple tenants), the container host and certain privileged containers will be expected to run with full hardening enhancements provided by NOP-space instructions. However, it is still desirable to provide full compatibility with container images that end users supply and that are affected by either or both bug categories described above. A boot-time configuration setting will often be too coarse-grained to be useful. (We recently encountered a similar trade-off between security hardening and compatibility with old container images in the context of vsyscall page support on Linux.)
How to make new behavior optional
The low-level software interfaces to enable such selective support can take various forms.
- Use ELF markup to enable the new CPU feature. The ELF loader in the kernel (or in the glibc dynamic loader) can inspect existing binaries and verify that everything has the required support level. If the dynamic loader in userspace is responsible for turning on support, there is an overlap with the next option.
- Compatible binaries could explicitly opt in to the new behavior, using a system call (or a new sub-command for the arch_prctl system call), or by setting a special CPU register (this is how libmpx switched on Intel MPX). This setting would only apply to the current process or thread. (What makes sense here depends on the feature in question.) When the process image is replaced by a new one using execve, support for the new instructions would revert to the default (disabled).
- For the container use case, it could be interesting to have a system call and make the support state inheritable across execve. This means that the entire process tree in a container would use the new CPU feature. It would be up to the container engine to make sure that the image is compatible with that, probably by inspecting metadata associated with the container image and conveying that information to the kernel via a system call.
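The three options above differ mainly in what happens to the activation state at execve. The following sketch models that state transition in plain Python. It is a toy model, not a kernel interface: the names ProcessState, opt_in, execve, and the elf_marked and inheritable flags are all hypothetical and stand in for whatever ELF markup and system call a real implementation would define.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProcessState:
    """Per-process activation state for a NOP-space feature (toy model)."""
    image: str
    enabled: bool = False      # are the NOP-space instructions active?
    inheritable: bool = False  # does `enabled` survive execve?


def opt_in(state, inheritable=False):
    """Explicit opt-in by the process itself or by a container engine."""
    return replace(state, enabled=True, inheritable=inheritable)


def execve(state, image, elf_marked=False):
    """Model the state after the process image is replaced.

    - Inheritable opt-in (container case): the state carries over.
    - Otherwise the state reverts to the default (disabled), unless the
      new image carries ELF markup that the loader uses to enable it.
    """
    if state.enabled and state.inheritable:
        return replace(state, image=image)
    return ProcessState(image=image, enabled=elf_marked)


# Explicit opt-in without inheritance is lost at execve.
shell = ProcessState("/bin/sh")
legacy = execve(opt_in(shell), "/usr/bin/legacy")

# A container engine marks the state inheritable for the whole process tree.
app = execve(opt_in(shell, inheritable=True), "/usr/bin/app")

# An ELF-marked binary is enabled by the loader, for this image only.
marked = execve(shell, "/usr/bin/new", elf_marked=True)
```

In this model, legacy runs with the feature disabled, app keeps it enabled and inheritable, and marked has it enabled but not inheritable. What the model leaves out is the check that every shared object loaded into the process also has the required support level.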
A concrete implementation involving kernel, hypervisor, CPU and perhaps firmware could take many forms. The key point is that a single running operating system kernel must be able to support different feature activation states across its separate processes. Once the CPU and firmware support this capability, the kernel and hypervisor can collaborate to implement the per-process model. If it is just a boot-time option, either per guest or (worse) per physical host, enabling the feature in a container-based world is much more difficult.
Coordination with dynamic instrumentation tools, such as Valgrind
Valgrind does not only emulate NOPs; it is also possible to embed special NOP-like instruction sequences using the valgrind.h header file. These instructions are (mostly) ignored when running on a real CPU, but they provide useful information to Valgrind when running under emulation. For example, they can explicitly mark memory as undefined when Valgrind assumes it is defined (based on previous program actions).
New NOP space instructions should not conflict with the marker instructions Valgrind uses.
Instructions in the NOP space are an attractive way to provide new performance and security features. However, some care is necessary to avoid conflicts with existing uses of NOP instructions. And, if such conflicts arise, end users will need a way to work around them, which means that the operating system they use will need to be able to enable and disable support for the new NOP-space instructions at the individual process level.