
Valgrind is a great tool not only for finding errors related to memory management in a program, but also for analyzing memory consumption, profiling performance, detecting multithreading issues, and more. In this article, I introduce Valgrind's undocumented --trace-flags option and explain how we improved Valgrind's accuracy in one area related to Arm's AArch64 architecture.

A case of rounding errors

Valgrind works as an abstraction layer between the application and the operating system. It disassembles the application's code and adds instrumentation to it depending on which of the Valgrind tools are used. To execute and analyze memory or register manipulations, Valgrind parses instructions and translates them into an intermediate representation (IR) called VEX. For example, the front end of Valgrind translates an ADD instruction in the application's assembly-language code to the Iop_Add IR. The Valgrind tools (memcheck, helgrind, and so on) instrument the IR, after which the Valgrind back end re-assembles the code.

Although Valgrind's translation process is very complex, it works well. But sometimes it does make mistakes. For example, the following simple C language code run under Valgrind showed imprecision in the rounding of some floating-point operations (documented in this bug report):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1004.3;
    double y = 2.0;
    double r = pow(x, y);

    printf("r = %.10f\n", r);
    return 0;
}

Compiled properly and run on AArch64, this code should print the value of r as 1008618.4899999999. However, when run under Valgrind, it printed 1008618.4900000000. The reason was that Valgrind lacked correct support for the AArch64 fused multiply-add (FMADD) instruction, which is used in the pow function. That gap caused the rounding issue.

Fused multiply add

Adding the product of two operands to a third is common enough to earn a special instruction on many processors. FMADD stands for floating-point fused multiply-add. The basic operation is:

D(destination) = A(accumulator) + N * M

There are 32-bit (float) and 64-bit (double) variants of the instruction.

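The rounding issues come in because doing A + N * M in one go gives slightly different results from doing (N * M) first and then adding A. The fused instruction rounds only once, after the addition, whereas the separate multiplication and addition each round their own result, so two rounding errors can accumulate.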

Different processors recognize the prevalence of this operation with a variety of instructions. The PowerPC ppc64 and IBM s390x, like the AArch64, have scalar FMADD. But most other architectures have only vector instructions that are similar to FMADD. For instance, the Intel x86 offers VFMADD instructions, which are similar to the AArch64's FMADD, but are vector (single instruction/multiple data, or SIMD) instructions. Vector registers are large registers that store several numbers, allowing one operation to be performed on all of them at once. For example, Intel's AVX-512 extension uses 512-bit registers. Scalar registers are much smaller, usually 32 or 64 bits long, and contain one scalar value.

Because Valgrind needed to support scalar fused multiply-add instructions for the ppc64 and s390x, it already defined an IR for it called Iop_MAddF32. This VEX IR operation represents a 32-bit float fused multiply-add instruction. But the arm64 front end and back end for Valgrind didn't implement it yet. A team I worked on created a patch that adds Arm64 VEX front- and back-end support for Iop_MAdd/SubF32/64. Before this patch was added, FMADD was implemented in VEX as two IRs, Iop_Add and Iop_Mul, to represent one actual instruction. That caused the rounding errors.

The front-end part of the patch replaced the use of Iop_Add and Iop_Mul with one Iop_MAdd IR, which allows Valgrind to avoid the rounding error. The back end then turns the IR into the actual instructions again.

To test our patch to Valgrind, we wrote assembly language code to make sure that Valgrind generates the Iop_MAdd IR operation when it is supposed to. If an architecture supports scalar FMA instructions, the compiler will hopefully turn something like x = a + (b * c) into an efficient FMADD instruction instead of a multiplication followed by an addition. But it is easier to use inline assembly directly:

asm("fmadd %s0, %s1, %s2, %s3\n;" : "=w"(dst) : "w"(x), "w"(y), "w"(z));

Here, the s in %s0 is an operand modifier that makes the compiler print the operand as an S register, the 32-bit view of a SIMD/FP register, used in this case as a floating-point (FP) register.

Here is a minimal test that could be compiled with the command gcc -g -o tst test.c:

#include <stdio.h>

int
main(int argc, char **argv)
{
    float x = 55;
    float y = 0.69314718055994529;
    float z = 38.123094930796988;
    float dst;

    /* 32-bit variant: dst = z + x * y */
    asm("fmadd %s0, %s1, %s2, %s3\n;" : "=w"(dst) : "w"(x), "w"(y), "w"(z));
    printf("%f = %f + %f * %f\n", dst, z, x, y);

    return 0;
}

The --trace-flags option

For figuring out precisely what Valgrind does, its --trace-flags option is very useful. This option helps you spot problematic places in the Valgrind code, and is also useful for expert users who want to know what exactly Valgrind is handling in an application.

The --trace-flags option is not documented in the Valgrind manual page, nor is it displayed with valgrind --help. However, you can list its individual flags and their meanings by running valgrind --help-debug. Table 1 shows the flags and their effects.

Table 1: Flags in Valgrind's --trace-flags option.

Flag       Effect
10000000   Show conversion into IR
01000000   Show after initial opt
00100000   Show after instrumentation
00010000   Show after second opt
00001000   Show after tree building
00000100   Show selecting insns
00000010   Show after reg-alloc
00000001   Show final assembly
00000000   (all bits cleared) Show summary profile only

Note: To get full details from --trace-flags, you also need to specify --trace-notbelow or --trace-notabove.
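Because each position in the flag string switches one stage on or off, a flag value can be read as a bit mask whose set bits are combined. The following sketch decodes such a string using the descriptions from Table 1. It is an illustration only: the function name describe_trace_flags is hypothetical and not part of Valgrind.

```c
#include <stdio.h>
#include <string.h>

/* Stage descriptions from Table 1, leftmost flag first. */
static const char *const stage_names[8] = {
    "conversion into IR",
    "after initial opt",
    "after instrumentation",
    "after second opt",
    "after tree building",
    "selecting insns",
    "after reg-alloc",
    "final assembly",
};

/* Prints the stages enabled by an 8-character string of '0' and '1'.
   Returns the number of enabled stages, or -1 if the string is malformed.
   Hypothetical helper for illustration; not a Valgrind API. */
int describe_trace_flags(const char *flags)
{
    int enabled = 0;

    if (flags == NULL || strlen(flags) != 8)
        return -1;
    for (int i = 0; i < 8; i++)
        if (flags[i] != '0' && flags[i] != '1')
            return -1;

    for (int i = 0; i < 8; i++) {
        if (flags[i] == '1') {
            printf("show %s\n", stage_names[i]);
            enabled++;
        }
    }
    if (enabled == 0)
        printf("show summary profile only\n");
    return enabled;
}
```

For example, passing "10000001" to this helper reports the two stages this article cares about most: the conversion into IR and the final assembly.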

With these values, you can see all the transformations performed by Valgrind and related instrumentation tools. But here we are interested only in the first "disassembly" and the final "assembly" steps. We will explore these next.

How --trace-flags works

Here, I'll describe how to use the --trace-flags option step by step, using as an example a trace of the FMADD activity that concerns us.

Valgrind's first step, which is the conversion into IR, is tool-independent. But for the next step, showing the final assembly, it helps to not have any tool do instrumentation so that the final assembly is clearer. In our case, we do not care too much about the instrumentation and the optimizations it makes. Therefore, I add the --tool=none option, so that no tool (memcheck, by default) adds its own instructions. The resulting command is:

$ ./vg-in-place -q --tool=none --trace-flags=10000000 --trace-notbelow=999999 ./tst 2>&1 | less

The command produces many blocks of code that do not interest us. The block relevant for us is the main function in the ./tst module. To find it, we re-run the previous command, replacing the arbitrary value in --trace-notbelow=999999 with the SB (superblock) number that the previous run displayed for main:

SB 1237 (evchecks 6200) [tid 1] 0x400634 main /root/valgrind/tst+0x400634

The SB number for main is 1237. We use this number to skip all superblocks before main. Therefore, our new command is:

$ ./vg-in-place -q --tool=none --trace-flags=10000000 --trace-notbelow=1237 ./tst 2>&1 | less
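If the output of the first run has been saved, the superblock number can also be extracted programmatically instead of by eye. The sketch below assumes that every superblock header has the same layout as the single SB line shown earlier, which is an assumption based only on that sample. The function name parse_sb_header is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Extracts the superblock number and function name from a header such as
   "SB 1237 (evchecks 6200) [tid 1] 0x400634 main /root/valgrind/tst+0x400634".
   The layout is assumed from that one sample line. Returns 1 on success,
   0 if the line does not match or the name does not fit in fn. */
int parse_sb_header(const char *line, unsigned *sb, char *fn, size_t fn_size)
{
    unsigned evchecks, tid;
    unsigned long addr;
    char name[128];

    if (sscanf(line, " SB %u (evchecks %u) [tid %u] %lx %127s",
               sb, &evchecks, &tid, &addr, name) != 5)
        return 0;
    if (strlen(name) >= fn_size)
        return 0;
    strcpy(fn, name);
    return 1;
}
```

A caller would feed each line of the saved trace to parse_sb_header and stop at the first one whose function name is main. The number stored in sb is then the value to pass to --trace-notbelow.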

We want to look for the fmadd in the output from the block that's relevant for us. Before the patch, the fmadd-related block looked like this:

(arm64) 0x400670: fmadd s0, s0, s1, s2

------ IMark(0x400670, 4, 0) ------
t18 = Shr32(GET:I32(888),0x16:I8)
t19 = Or32(And32(Shl32(t18,0x1:I8),0x2:I32),And32(Shr32(t18,0x1:I8),0x1:I32))
t17 = AddF32(t19,GET:F32(352),MulF32(t19,GET:F32(320),GET:F32(336)))
PUT(320) = V128{0x0000}
PUT(320) = t17
PUT(272) = 0x400674:I64

With the FMADD support, the output changed to:

(arm64) 0x400670: fmadd s0, s0, s1, s2
------ IMark(0x400670, 4, 0) ------
t18 = Shr32(GET:I32(888),0x16:I8)
t19 = Or32(And32(Shl32(t18,0x1:I8),0x2:I32),And32(Shr32(t18,0x1:I8),0x1:I32))
t17 = MAddF32(t19,GET:F32(320),GET:F32(336),GET:F32(352))
PUT(320) = V128{0x0000}
PUT(320) = t17
PUT(272) = 0x400674:I64

Compare the actual addition instruction from before and after the application of the patch. Before, the addition was:

t17 = AddF32(t19,GET:F32(352),MulF32(t19,GET:F32(320),GET:F32(336)))

After applying the patch, the corresponding code is:

t17 = MAddF32(t19,GET:F32(320),GET:F32(336),GET:F32(352))

The trace shows us that MAddF32 is used instead of AddF32 and MulF32, as desired.

Assembly code with the --trace-flags option

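As described earlier, the trace flags are also useful for viewing the final assembly language code. For this task, we'll use the four flags in the value 10000111, which show the conversion into IR, instruction selection, the result of register allocation, and the final assembly: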

$ ./vg-in-place --tool=none --trace-flags=10000111 --trace-notbelow=1237 -q ./tst 2>&1 | less

Looking through the output for MAddF32, we can see the following assembly code:

-- t79 = MAddF32(Or32(And32(Shl32(t72,0x1:I8),0x2:I32),And32(Shr32(t72,0x1:I8),0x1:I32)),GET:F32(320),GET:F32(336),GET:F32(352))
ldr %vD128(S-reg), 320(x21)
ldr %vD129(S-reg), 336(x21)
ldr %vD130(S-reg), 352(x21)

...

msr fpcr, %vR139
ffmadd %vD131(S-reg), %vD128(S-reg), %vD129(S-reg), %vD130(S-reg)
mov(d) %vD79, %vD131

R refers to a general-purpose register and D refers to a SIMD/FP register. The three ldr lines load registers D128 through D130, and the next-to-last line of this snippet shows them being used as the source operands of the fmadd, with D131 receiving the result. We can look at the MAddF32 instruction we saw earlier in the VEX IR:

t17 = MAddF32(t19,GET:F32(320),GET:F32(336),GET:F32(352))

and compare it to the resulting assembly code:

ldr %vD128(S-reg), 320(x21)
ldr %vD129(S-reg), 336(x21)
ldr %vD130(S-reg), 352(x21)
ffmadd %vD131(S-reg), %vD128(S-reg), %vD129(S-reg), %vD130(S-reg)

The comparison tells us, for instance, that the first value argument, GET:F32(320), which follows the rounding mode t19 and was loaded into the D128 SIMD/FP register, became the second operand of the fmadd (the first operand, D131, is the destination). This was very helpful during our debugging because it revealed when the operands or their order were wrong. The example here demonstrates how informative and fine-grained the --trace-flags option is. We can look at the instruction emitted without having to care about register allocation, or look afterward to see the actual assembly code generated.

Conclusion

I hope this article has helped you to understand better how Valgrind works, how developers are improving it, and how you can use --trace-flags to discover precisely what your program and Valgrind do at a low level.
