Various distributions have been experimenting with generating x86-64-v3 instructions for all compiled code. Although Valgrind already supported emulation of those instructions, there was still work to do to make the Memcheck tool produce correct diagnostics for memory issues.
What is Valgrind?
Valgrind is an instrumentation framework for building dynamic analysis tools that check C and C++ programs for errors. Memcheck is the default tool Valgrind uses when you don't ask for another tool (using --tool=). Memcheck keeps track of the validity and addressability of all memory a program uses. This means Memcheck can warn about the use of unaddressable memory or about program execution depending on values that were never defined. It also reports memory leaks and allocated memory that is never freed.
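As a small, made-up illustration of the kinds of problems Memcheck reports, the following program reads heap memory that was never initialized and never frees its buffer. Run under valgrind (optionally with --leak-check=full), it produces a "Conditional jump or move depends on uninitialised value(s)" report and a leak report for the buffer:

/* memcheck_demo.c (hypothetical name): compile with "gcc -g memcheck_demo.c"
 * and run with "valgrind ./a.out". */
#include <stdlib.h>

int main(void)
{
    int *buf = malloc(4 * sizeof(int));   /* addressable, but contents undefined */
    int sum = 0;

    for (int i = 0; i < 4; i++)
        sum += buf[i];                    /* reads values that were never written */

    if (sum > 0)                          /* branch depends on undefined values */
        return 1;
    return 0;                             /* buf is never freed: reported as leaked */
}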
How does Valgrind work?
Valgrind works by using dynamic binary instrumentation. Valgrind translates all instructions in a program into an intermediate representation, called VEX. A tool like Memcheck then instruments this intermediate representation to track all memory operations. The transformed intermediate representation is then translated back into native instructions, which is what gets executed. Valgrind can be seen as a virtual machine with a just-in-time compiler that uses the original native instruction stream as its byte code.
What is x86-64-v3?
x86-64-v3 is a set of common x86-64 CPU features (instruction sets). It includes the AVX and AVX2 instructions, which add many new 256-bit vector operations. The fused multiply-add (FMA) instructions, which also work on 256-bit vectors, combine a multiplication and an addition into a single operation that computes the intermediate result with infinite precision (we will see below how Valgrind got this "wrong"). And it includes the BMI1 and BMI2 instruction sets, which provide various bit manipulation instructions.
For the next Red Hat Enterprise Linux (RHEL) 10 release (currently still in development), the GCC compiler will default to -march=x86-64-v3, which means programs will use all these instructions by default. See this article for more information. The current RHEL 9 uses x86-64-v2 as the default.
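One quick way to see which of these feature sets a given compiler invocation targets is to check the predefined feature macros. The macro names below are GCC/Clang predefines, not something x86-64-v3 itself defines:

/* features.c (hypothetical name): compile with "gcc -march=x86-64-v3 features.c"
 * and compare the output with "gcc -march=x86-64-v2 features.c". */
#include <stdio.h>

int main(void)
{
#ifdef __AVX2__
    puts("AVX2 enabled");
#endif
#ifdef __FMA__
    puts("FMA enabled");
#endif
#ifdef __BMI2__
    puts("BMI2 enabled");
#endif
    return 0;
}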
Although Valgrind already had support for all these x86-64-v3 instructions, we found several issues once the whole distribution was built with -march=x86-64-v3 as the default. These have been fixed in the Valgrind 3.23.0 release.
More accurate instruction emulation
The fused multiply-add (FMA) instruction does the calculation r = (a * b) + c in one instruction. This has two advantages. First, you get two operations for the price of one, instead of having to do a multiplication and an addition in separate instructions. Second, it improves the accuracy of the whole operation by rounding only the end result once, instead of rounding the result of the multiplication step and then the addition step separately.
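The rounding difference is easy to see in C with the fma() function from <math.h>. This is a minimal sketch; compile with -ffp-contract=off so the compiler doesn't fuse the separate multiply and add itself, and link with -lm:

/* fma_demo.c (hypothetical name): gcc -ffp-contract=off fma_demo.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + 0x1p-27;        /* chosen so a*b is not exactly representable */
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    double separate = a * b + c;     /* rounds after the * and again after the + */
    double fused    = fma(a, b, c);  /* rounds the exact a*b+c only once */

    printf("separate: %a\n", separate);   /* prints 0x0p+0 */
    printf("fused:    %a\n", fused);      /* prints -0x1p-54 */
    return 0;
}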
When FMA instructions were originally introduced, AMD and Intel each introduced a slightly different variant. FMA4 allowed the result register to be separate from the three input registers, while in FMA3 the result register has to be one of the input registers. See this page for more information.
In the end, the FMA3 variant became the one supported by both AMD and Intel and is part of x86-64-v3. Valgrind did support both FMA3 and FMA4, but didn't keep track of which variant the CPU it ran on supported. When translating either instruction, it used a common generic implementation that did a multiplication followed by an addition. This caused exactly the subtle rounding issues in floating point arithmetic that FMA was supposed to prevent.
To make these FMA calculations precise, Valgrind now keeps track of the FMA3 or FMA4 flag in the CPUID. On FMA-capable hosts it now emits a VFMADD instruction. This makes the floating point operations as accurate as they are when the program is not run under Valgrind.
Reverse engineering GCC optimizations on large vectors
Sometimes the compiler is really clever and Valgrind Memcheck has to work extra hard to analyze the generated code correctly. This is especially true when GCC uses the large vector operations introduced by AVX and AVX2 to optimize string operations. Using vector registers and operations is attractive since it allows comparing 16 (for 128-bit vectors) or 32 (for 256-bit vectors) characters at once.
One such tricky optimization is when GCC sees a strcmp call with one argument being a static constant string that is as large as (or larger than) the vector size. GCC will generate code that compares (the start of) the strings by loading them into two vector registers and XORing them using the VPXOR instruction, and then using the VPTEST instruction on the resulting vector, which does a bitwise AND and sets the ZF flag if the result is all zeros (which indicates the strings were equal). The generated code looks like this:
VMOVDQU (%str1), %ymm1 # load str1 from memory into register vec1
VMOVDQU (%str2), %ymm2 # load str2 from memory into register vec2
VPXOR %ymm1, %ymm2, %ymm3 # vec3 = vec1 xor vec2
VPTEST %ymm3, %ymm3 # set ZF if bitwise AND of vec3 with itself is all zeros
JE equal1f # jump to equal1f if ZF is set
Assuming str1 is at least as wide as the vector register, this is a very efficient way of comparing it with str2. And even if str2 is shorter than str1, it is a quick way to check that the strings aren't equal, as long as the compiler can prove that the memory after the end of str2 can be loaded directly into the vector register (for example, when the string is placed on the stack). This works because a shorter string will have a \0 (zero) character at the end. That character will produce a non-zero XOR result, so the VPTEST will see at least one bit set in the result vector register and the ZF flag will not be set (whatever the bytes after this zero character are).
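Source code along these lines (a made-up example; whether GCC emits exactly this sequence depends on the compiler version and options) can produce such a comparison when built with -O2 -march=x86-64-v3:

/* strcmp_demo.c (hypothetical name): one strcmp argument is a constant
 * string of more than 32 characters; the other lives in a 64-byte stack
 * buffer, so the compiler may load 32 bytes of each and compare them
 * with a VPXOR/VPTEST pair. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char buf[64] = { 0 };   /* on the stack, so over-reading it is safe */

    if (argc > 1)
        strncpy(buf, argv[1], sizeof(buf) - 1);

    if (strcmp(buf, "a constant string that is longer than 32 characters") == 0)
        puts("equal");
    else
        puts("not equal");
    return 0;
}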
Although the above check is logically correct, it does create some challenges for Valgrind Memcheck. In the case where one of the strings is shorter than the other, we first hit the issue that the bytes right after the end-of-string zero character might not technically be addressable (Memcheck tracks this very precisely). So normally, it would produce a warning that (partially) unaddressable memory is loaded into a register.
But the above optimization depends on being able to read a few more bytes than strictly needed. For this there is the option --partial-loads-ok=yes (which is now the default). With this option, such loads do not produce an address error. Instead, loaded bytes originating from illegal addresses are marked as uninitialized, and those corresponding to legal addresses are handled in the normal way. So for such loads, Memcheck will mark the bytes in the vector register after the string's zero terminator as undefined.
This brings us to the second tricky issue to get right. We are now operating on vector registers with partially defined values. Valgrind Memcheck needs to do exact instrumentation to make sure the result is properly tracked as (un)defined. This is fairly simple for the result of the XOR of the two vector registers: the result vector is defined up to the first undefined byte in either of the input registers.
When setting the ZF flag for the VPTEST instruction, it is enough to check whether any bit is set in the defined part of the result vector register; if so, the flag value is defined (and not set). This is because the result depends on all bits being zero: once you see any bit in the defined part being one, it doesn't matter what the other (undefined) bits are. We also know this only matters when the strings are unequal because one of them is shorter. In that case at least the end-of-string zero terminating byte is defined (and unequal to the corresponding byte in the longer string).
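One way to picture that rule is the following sketch. It is a simplified model, not Valgrind's actual code; vbits uses Memcheck's convention that a 1 bit means "undefined":

/* Simplified model of the definedness rule for "VPTEST v, v" setting ZF. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct shadowed256 {
    uint64_t value[4];   /* the 256-bit register contents */
    uint64_t vbits[4];   /* shadow bits: 1 means "undefined" */
};

/* Returns true if ZF is defined, and stores its value in *zf. */
static bool ptest_zf(const struct shadowed256 *v, bool *zf)
{
    for (int i = 0; i < 4; i++) {
        /* Any defined bit that is 1 forces ZF to 0, regardless of the
         * undefined bits. */
        if (v->value[i] & ~v->vbits[i]) {
            *zf = false;
            return true;
        }
    }
    /* All defined bits are 0: if any bit is undefined, ZF is undefined. */
    for (int i = 0; i < 4; i++)
        if (v->vbits[i])
            return false;
    *zf = true;
    return true;
}

int main(void)
{
    /* First byte differs and is defined; the upper half is undefined. */
    struct shadowed256 v = { .value = { 0x41, 0, 0, 0 },
                             .vbits = { 0, 0, ~0ull, ~0ull } };
    bool zf;
    if (ptest_zf(&v, &zf))
        printf("ZF is defined and %s\n", zf ? "set" : "clear");
    else
        printf("ZF is undefined\n");
    return 0;
}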
Intercepting glibc dynamic linker/loader string optimizations
Valgrind Memcheck intercepts various glibc memory and string functions (e.g., strcpy, strcmp, strlen, memcpy or memmove). It does this partially because it is hard to prove some of these functions, which are optimized hand-written assembly, correct. And partially because Memcheck wants to check some preconditions of these functions, like whether memory arguments overlap.
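The overlap check is easy to trigger with a small made-up example. Compiled without optimization (so the call is not expanded inline), Memcheck reports that the source and destination overlap in memcpy:

/* overlap_demo.c (hypothetical name): gcc -g -O0 overlap_demo.c */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[32] = "overlapping copy example";

    memcpy(buf + 1, buf, 10);   /* overlapping regions: memmove should be used */
    printf("%s\n", buf);
    return 0;
}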
To intercept these functions, Valgrind sets the LD_PRELOAD environment variable when launching the program, so that alternative, simple, instrumented versions of these string and memory functions are loaded. This code is loaded before any other library, so any use of those functions by the program or a library can be intercepted.
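The mechanism itself is ordinary symbol interposition. As a standalone sketch (this is not Valgrind's replacement library, just a toy interposer), a preloaded shared object can override a libc function like this:

/* toy_strlen.c (hypothetical name): a toy LD_PRELOAD interposer.
 * Build: gcc -shared -fPIC -o toy_strlen.so toy_strlen.c  (add -ldl on older glibc)
 * Run:   LD_PRELOAD=./toy_strlen.so ls */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <unistd.h>

size_t strlen(const char *s)
{
    /* Look up the next (real) strlen in the library search order. */
    static size_t (*real_strlen)(const char *);
    if (!real_strlen)
        real_strlen = (size_t (*)(const char *))dlsym(RTLD_NEXT, "strlen");

    write(2, "*", 1);   /* visible side effect for every intercepted call */
    return real_strlen(s);
}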
This works for any such optimized string or memory function in glibc, whether it uses x86-64-v3 instructions or earlier vector instructions. The exception is the functions that the dynamic loader (ld.so) uses itself when loading the LD_PRELOAD libraries. ld.so contains its own implementation of these functions (since it is responsible for loading glibc, it cannot use the glibc functions itself).
The ld.so versions used to not be built with x86-64-v3 optimized instructions, so Valgrind Memcheck could simply interpret these simpler versions directly.
Since we cannot use the LD_PRELOAD trick to load the alternative code into the process, we needed to add a hardwire for this specific ld.so function. The hardwire is a simple implementation of that function which is called instead of the original. The disadvantage of having to use a hardwire is that it is architecture-specific and that Valgrind has to look up the symbol addresses itself. Since these symbols are normally private to ld.so, that means Valgrind needs the full symbol table available. So ld.so cannot be stripped (to remove unnecessary/debug symbols). Luckily, ld.so is fairly small, so not stripping the debug symbols doesn't make it much bigger than necessary.
Putting it all together
When using a distribution that defaults to building all code for x86-64-v3, like the upcoming RHEL 10 beta, or when using -march=x86-64-v3 to build your own code, you want to use the Valgrind 3.23.0 release that the current Fedora 40 ships. Valgrind 3.23.0 will accurately execute the new vector code, even with GCC optimizations taking advantage of the new AVX and AVX2 instructions, and it will intercept tricky memory and string operations so that Memcheck can track undefined values in your code.