Time traveling is an extremely useful tool for developers that found a bug in their code. Beyond wanting to travel to a future where the bug is fixed, developers will usually try to slow down time with a debugger, so they can examine where things go wrong. But what if even with a debugger, it is still too late and the bug has already happened? Wouldn’t it be nice to be able to go back in time and see it happening in real time?
Let’s talk about time travel debugging. In this post, I’ll go over the following commands:
record
record stop
record save
record full restore
reverse-next
(rn
)reverse-step
(rs
)set exec-direction
reverse-continue
(rc
)
As a final note, there are many implementations on many different debuggers, but this post will be focusing on GNU Debugger’s (GDB) record-full implementation. Other options are mentioned at the end, but to follow this tutorial, all you’ll need is GDB in Linux.
Why would I want to go back?
When I tell anyone about this feature, one of the most common questions I get is “why would I do that?”. There are many situations where the code that reports an error is after all the code that could generate that error in the first place. Take the following toy code as an example:
1 │ #include <stdio.h>
2 │ #include <stdlib.h>
3 │ int bubble_sort(int *vec, int size) {
4 │ for (int i = 0; i < size; i++) {
5 │ for (int j = i; j < size-1; j++) {
6 │ if (vec[j] >= vec[j+1]) {
7 │ int tmp = vec[j];
8 │ vec[j] = vec[j+1];
9 │ vec[j+1] = tmp;
10 │ }
11 │ }
12 │ }
13 │ }
14 │
15 │ int sorted(int* vec, int size) {
16 │ for (int i = 0; i < size-1; i++) {
17 │ if (vec[i] > vec[i+1])
18 │ return 0;
19 │ }
20 │ return 1;
21 │ }
22 │
23 │ int main() {
24 │ int a[6] = {0, 3, 1, 5, 4, 2};
25 │ int b[6] = {5, 0, 3, 1, 2, 4};
26 │ int c[6] = {5, 4, 3, 2, 1, 0};
27 │ int d[6] = {3, 0, 4, 1, 2, 5};
28 │ int e[6] = {0, 4, 5, 1, 2, 3};
29 │ int** v = malloc (sizeof(int *) * 6);
30 │ int size = 6;
31 │ int fail = 0;
32 │ v[0] = a;
33 │ v[1] = b;
34 │ v[2] = c;
35 │ v[3] = d;
36 │ v[4] = e;
37 │ v[5] = NULL;
38 │
39 │ while (*v != NULL) {
40 │ bubble_sort (*v, size);
41 │ if (!sorted(*v, size))
42 │ fail++;
43 │ v++;
44 │ }
45 │
46 │ if (fail)
47 │ printf ("Oh no! there were errors!\n");
48 │
49 │ return 0;
50 │ }
This little toy shows a very common pattern in programming in general, where you’ll first do a lot of processing, and later verify if the result is correct. This pattern is useful and sometimes necessary, since data can interact in complex ways, but it can make debugging an issue incredibly difficult. Let’s run this program and hope it doesn’t have any issues:
$ gcc sort.c -o sort -g3
$ ./sort
Oh no! there were errors!
Oh oops, who could have seen this coming! In one of the iterations, the inferior (i.e., the program we’re debugging) thinks the array isn’t sorted. We could go forward and check each iteration, but we don’t know if the problem is sorting the array or checking if it is sorted, and we don’t know which of the 5 iterations is incorrect. The fastest way to find the problem is setting a breakpoint at line 42, meaning we will stop at the failing iteration:
(gdb) break 42
Breakpoint 1 at 0x40141f: file sort.c, line 42.
(gdb) run
Starting program: /home/gwenthekween/sort
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Breakpoint 1, main () at sort.c:42
42 fail++;
(gdb) print **v@6
$2 = {4, 2, 0, 1, 3, 5}
Stopping at the breakpoint, we can confirm that the array really is unsorted, so we don’t have to worry about debugging the sorted function, but there isn’t much more we can do. We could try to find some unique way to identify the iteration and go from the start, but that isn’t always possible (like in this case), so without the ability to go back, all we can do is check all 5 invocations of bubble_sort
searching for the one which produces the incorrect result. This is one of the easiest use cases to see how time travel debugging can be used, because—now that we know we are in a faulty iteration—we could time travel to the start of the iteration and only analyze the incorrect one.
Other use cases include unreliable bugs—so that you can stop when the problem happens and study what non-deterministic thing changes between executions—and unexpected side effects—in case you skip a function and realize later that it is the reason a bug is happening.
I’m convinced. How do I time travel?
First we need to restart the inferior. Despite calling it “time travel”, what the record-full system actually does is record the execution of the inferior, which allows you to “rewind the tape” and see what was recorded in the past, rather than true time travel. That’s why it is more often called "reverse debugging".
With this in mind, let’s start again, and record the execution until right before the error is reported:
(gdb) start
Temporary breakpoint 1 at 0x401289: file sort.c, line 24.
Starting program: /home/gwenthekween/sort
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Temporary breakpoint 1, main () at sort.c:24
24 int a[6] = {0, 3, 1, 5, 4, 2};
(gdb) b 42
Breakpoint 2 at 0x40141f: file sort.c, line 42.
(gdb) record
(gdb) cont
Continuing.
Breakpoint 2, main () at sort.c:42
42 fail++;
(gdb)
Note, we have to start the inferior before asking to record because GDB resets the recording state between runs.
We’re back to where we were before, but now we can go back to the end of the bubble_sort
. For that, let’s look at the next 2 interesting commands: reverse-next
(rn
) and reverse-step
(rs
). Similar to their forward counterparts, these functions tell GDB to execute the inferior until we reach a different line, with next skipping function calls and step entering those calls. For example:
(gdb) list
37 v[5] = NULL;
38
39 while (*v != NULL) {
40 bubble_sort (*v, size);
41 if (!sorted(*v, size))
42 fail++;
43 v++;
44 }
45
46 if (fail)
(gdb) reverse-next
41 if (!sorted(*v, size))
(gdb) rs
bubble_sort (vec=0x7fffffffdd50, size=6) at sort.c:13
13 }
In this case, we’re next-ing over the call to sorted, and then step-ing into the call to bubble_sort
. Notice how we can’t really see which function we’ll be stepping into, but rather which function we just skipped. This is unfortunately a limitation of the information GDB has to print the current location. I’ll talk about the major limitations closer to the end. If you plan on issuing many commands moving backwards, you can set the execution direction to be reverse
, instead of prefixing commands with reverse-
. To show one more neat feature, let’s continue to the start of bubble_sort
:
(gdb) set exec-direction reverse
(gdb) break bubble_sort
Breakpoint 3 at 0x401141: file sort.c, line 4.
(gdb) cont
Continuing.
Breakpoint 3, bubble_sort (vec=0x7fffffffdd50, size=6) at sort.c:4
4 for (int i = 0; i < size; i++) {
(gdb) set exec-direction forward
Notice that we asked GDB to place a breakpoint in code that we have executed before, and GDB recognized that it was hit anyway. This also works for watchpoints and such. Ideally, a user shouldn’t care if the execution is going forwards or backwards when issuing commands for inferior execution. At this point, you’re free to continue debugging as normal, and finding the problem on line 5 (j
should be initialized to 0).
One final interesting set of instructions for particularly complex bugs, or ones that refuse to present themselves in the development environment, are the commands record save
and record full restore
. Since we have the full execution history in memory, GDB is able to write everything that happened and load later. This means you can grab the execution log in one machine and debug in another, or send the execution log to a coworker. However, this feature has been very broken between GDB 10 and GDB 15.1 (the latest release),but a fix is already merged and will be in the next version of GDB (likely 15.2).
Also be aware that GDB expects the recording to be from a CPU with a similar architecture, and the same registers available. I’m not sure what happens if you load a recording that attempts to modify registers your machine doesn’t have, but I can only imagine that the best case scenario would be GDB crashing.
Which leads us nicely into the next section...
You mentioned limitations?
Yes, I did. Reverse debugging is all well and good in theory, but each implementation comes with its drawbacks, and record full
has a few very important ones. First, there are memory considerations, since all the changes are saved in memory. By default, GDB keeps the latest 200 thousand instructions, each of which consuming at least 96 bytes, usually more due to implementation details (explained in this talk if you’re curious). This means that to fill the default instruction size, you need at least 18 Gb of ram in the machine that is running GDB. To avoid stopping your system entirely, you may want to reduce the maximum number of recorded instructions.
The second big limitation is spotty support. Most notably, we can’t record multi-threaded programs. GDB won’t complain if you try, and will do its best, but this will only make a mess of the recorded history. Beyond this, there are still a few architectures that record full
doesn’t support—most notably RISC-V—and GDB won’t let you record at all if you try. Also, unfortunately, x86_64 is only partially supported. If you try to record the execution of an AVX, AVX2, or AVX512 instruction (so using optimized libc functions or similar) GDB will say it can’t understand them and refuse to move forward. I have been working on supporting them, and hope to have some land on GDB 16, but that will depend on many factors and might not happen.
Thirdly, there is a very noticeable slowdown when recording, due to it being completely implemented in software. It is hard to calculate the slowdown, as memory usage is very large, swap and caching issues may happen and so on, but as a benchmark, I tried recording a simple program that calculates the 35th fibonacci number, and it was slowed down by 130 thousand times, while the internet seems to agree on 50 thousand times. My best suggestion is to get as close to the error location as possible, to minimize the effect. We GDB developers are keenly aware of this, but it is a layered problem to solve, and unfortunately not a priority yet.
Finally, the user experience is lacking. I have already highlighted that it’s hard to know which function you’ll be stepping into or what you’ll be undoing, since GDB doesn’t have a way to ask “what was the previously executed line” to the recording. We also have issues with missing commands, like function-call-history
, which is implemented for record btrace
but not record full
. There has been some recent movement in this area, though, with improvements to error messages and some commands finally being implemented.
If most of these limitations are too important for you, there are other options that manage almost everything. The only limitation that can't be worked around is the user experience, since every option I have experience with still uses GDB as a front end, so it still rules over the user experience.
What are the other options?
As I mentioned at the very beginning of this post, there are many implementations for reverse debugging. I know of 2 other major implementations: GDB’s record btrace
and RR
.
btrace
has the advantage of also being built into GDB (so no extra programs or configuration is needed), the slowdown is manageable, and it can handle multi-threaded programs; however, it is dependent on an Intel hardware feature, so it supports very few CPUs. btrace
is also only able to record the program counter, so while you will be able to see the path your software took, you won’t know what the memory looked like in the past. To use btrace
recording, just switch the record command to record btrace
. Everything else should work the same, with the exception of record save
and record full restore
, which is not implemented at all for btrace
.
RR
(record and replay) is a different program that is able to quickly record the execution of almost any program, but only in a select few microarchitectures. Will Cohen has a blog post explaining how to operate it, so I won’t go into details. Will’s post misses one important limitation, which is that if some part of the program depends on non-deterministic instructions, like RdRand
, RR may see an inconsistent state and crash.
Conclusion
In this post, I explained why and how to use GDB’s reverse debugging facility. I hope this post was informative and gave you a new way to solve those pesky problems that always crop up in our perfect software. If you are in need of more debugging tips, check out how to make full use of breakpoints, or how to debug lambdas in GDB.