
Almost every computing device you own today has multiple processors, but pretty much every piece of software written today uses only one of them at a time. Why? Partly, it's the programming languages, which were designed way back when one processor was the default. Mostly, though, it's because programmers aren't good at writing software where things can happen in any order or at any time.

In this article, we'll examine two examples of failed concurrency assumptions, drawn from programs in glibc's test suite that weren't running reliably.

Ready! Go! Set!

Consider this first example, which has been simplified:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified per-thread record; the real test carries more state. */
struct thread_info
{
  pthread_t thread_id;
};

static struct thread_info thread[2];

void *thread_fcn(void *arg)
{
  printf("Here!\n");
  char *p = (char *) malloc(sizeof(char) * 2);
  free(p);
  return NULL;
}

int main(int argc, char *argv[])
{
  for (int i = 0; i < 2; ++i) {
    pthread_create(&thread[i].thread_id, NULL, &thread_fcn, &thread[i]);
  }
  for (int i = 0; i < 2; ++i) {
    pthread_join(thread[i].thread_id, NULL);
  }
  exit(EXIT_SUCCESS);
}

This test aims to ensure threads initialize themselves properly the first time they call malloc, even if other threads are concurrently using malloc's internal data. The first time a thread calls malloc, it has to be assigned an arena, or pool of memory, to work with.
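You can watch this arena assignment happen with glibc's own reporting interface. The following is a quick sketch, not part of the test: malloc_info() (glibc-specific, with XML output whose exact format isn't stable) dumps the allocator's arenas, and on a typical multi-core system a second thread's first malloc causes a second arena to appear.

#include <malloc.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker (void *arg)
{
  /* This thread's first malloc attaches it to an arena; on a typical
     multi-core system glibc creates a fresh one rather than sharing
     the main arena. */
  free (malloc (16));
  return NULL;
}

int main (void)
{
  free (malloc (16));            /* the main thread uses the main arena */

  pthread_t t;
  pthread_create (&t, NULL, worker, NULL);
  pthread_join (t, NULL);

  /* glibc-specific: dump the allocator's view of its arenas as XML.
     Expect more than one <heap> element after the second thread ran. */
  malloc_info (0, stdout);
  return 0;
}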

The test would use gdb, the GNU Project debugger, to break just before the malloc() in thread_fcn, then run the first thread until it was inside malloc() (and had acquired malloc's internal lock), then run the second thread and verify that it properly handled a locked arena. The test usually runs successfully, but sometimes it doesn't.

The failing case relied on two programming errors, and it took the combination of the two to trigger the bug. The first is understandable: the program assumes that the malloc calls it makes are the first calls for each thread, but printf() itself calls malloc! That wasn't the case in older versions of glibc, where the buffer printf needed was memory-mapped instead, but modern glibc allocates it via malloc when it's first needed.
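You can see the hidden allocation for yourself. Here's a quick glibc-specific sketch, not part of the test: mallinfo2(), available since glibc 2.33, reports the allocator's bytes-in-use counter, and on a modern glibc the first printf bumps it when stdio allocates its output buffer.

#include <malloc.h>
#include <stdio.h>

int main (void)
{
  /* Snapshot the allocator's "bytes in use" counter, call printf once,
     and snapshot it again.  The first printf allocates stdio's output
     buffer via malloc, so the counter grows.  The exact number is an
     implementation detail. */
  size_t before = mallinfo2 ().uordblks;
  printf ("Here!\n");
  size_t after = mallinfo2 ().uordblks;
  fprintf (stderr, "printf allocated about %zu bytes\n", after - before);
  return 0;
}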

On its own, that wouldn't cause the bug, because the second thread is the one that needs to perform the initialization under test. However, the program also assumes that the first thread runs first and the second thread runs second! If the second thread runs first, it calls printf, which calls malloc and initializes that thread's malloc state before gdb can stop everything. Then the first thread runs, and gdb pauses it inside malloc. When the second thread runs again, it has already initialized malloc and never triggers gdb's breakpoints.

It turns out that unless you explicitly put in code to force one thread to wait for the other, there is no way to know—much less assume—which will be scheduled first.

Solution: If thread execution order matters, you have to force that order with synchronization primitives. Beyond that, anything that ensures the first call to malloc in the second thread really is the first will do: either move gdb's breakpoint earlier in the function or remove the printf.
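To make the first point concrete, here is a minimal sketch, not the actual glibc test, of forcing an order between two threads with a POSIX semaphore: the second thread doesn't touch malloc until the first thread has finished its allocation and posted the semaphore.

#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

static sem_t first_done;           /* posted once the first thread has used malloc */

static void *first_thread (void *arg)
{
  char *p = malloc (2);            /* this thread's first malloc */
  free (p);
  sem_post (&first_done);          /* now the second thread may proceed */
  return NULL;
}

static void *second_thread (void *arg)
{
  sem_wait (&first_done);          /* blocks until first_thread has allocated */
  char *p = malloc (2);            /* guaranteed to happen second */
  free (p);
  return NULL;
}

int main (void)
{
  pthread_t t1, t2;
  sem_init (&first_done, 0, 0);
  pthread_create (&t1, NULL, first_thread, NULL);
  pthread_create (&t2, NULL, second_thread, NULL);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  sem_destroy (&first_done);
  return 0;
}

A barrier or condition variable works just as well; the point is that the ordering comes from a primitive, not from the order in which the threads were created.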

Counting is easy, until it isn't

This example is a bit more complex. In this case, there are two threads that are running in lockstep. One thread changes the parameters of the other and verifies that it handles that change correctly. Here's the example code, which, as before, has been simplified:

static void
sigsegv_handler (int sig, siginfo_t *info, void *ctx_void)
{
  siglongjmp (jmpbuf, 1);
}
static void *
threadproc (void *ctx)
{
  while (1)
    {
      futex ((int *) &ftx, FUTEX_WAIT, 1, NULL, NULL, 0);
      while (atomic_load (&ftx) != 2)
        {
          if (atomic_load (&ftx) >= 3)
            return NULL;
        }
      xmodify_ldt (1, &desc1, sizeof (desc1));
      /* If ftx == 2, set it to zero.  If ftx == 100, quit.  */
      if (atomic_fetch_add (&ftx, -2) != 2)
        return NULL;
    }
}
  . . .
  for (int i = 0; i < 5; i++)
    {
      if (sigsetjmp (jmpbuf, 1) != 0)
        continue;
      /* Make sure the thread is ready after the last test. */
      while (atomic_load (&ftx) != 0)
        ;
      xmodify_ldt (0x11, &desc2, sizeof (desc2));

      /* Arm the thread.  */
      ftx = 1;
      futex ((int*) &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);

      asm volatile ("mov %0, %%ss" : : "r" (0x7));

      /* Fire up thread modify_ldt call.  */
      atomic_store (&ftx, 2);

      while (atomic_load (&ftx) != 0)
        ;

      /* On success, modify_ldt will segfault us synchronously and we will
         escape via siglongjmp.  */
      support_record_failure ();
    }

The general idea of this test is that the modify_ldt syscall might also switch stacks if some of its fields are left uninitialized, avoiding a segfault. The threadproc and the main program are there to switch the LDT back and forth. The core of the test is thus:

  1. Main thread loads valid SS segment.
  2. Main thread stores a 2 in ftx to start the subthread.
  3. Main thread waits for subthread to set ftx to 0.
  4. Subthread invalidates the SS segment via its modify_ldt call.
  5. When the kernel next schedules the main thread, it segfaults, calls the handler, and starts the loop all over again.

(Note: the specific operations above are interesting, but not relevant to the problem we're about to discover.)

The problem arises when something else in the system happens at just the wrong time. Note the code that synchronizes the two threads: the subthread waits on the shared futex while its value is 1, and the main thread sets it to 1 and then issues a FUTEX_WAKE. What happens if the FUTEX_WAKE fails? How can it fail?

Consider what happens if the subthread is interrupted between its futex call and its while loop, before the second iteration of the main loop. Suppose the main thread took a fault before it could set ftx to 2 (due to, for example, a timer tick, peripheral interrupt, or any other outside influence) but had time to recover from that first fault, so ftx is still 1 by the time the subthread is scheduled again. The main thread's while loop is waiting for ftx to become 0, but ftx is 1, so it spins. The subthread, which never saw ftx reach 2, sits in its own while loop waiting for 2, but ftx is 1, so it spins too.

Deadlock, because two threads don't agree on how to count to two. Good thing binary only uses zero and one, yes?

Solution: Consider the many possible global execution orderings (interleavings of operations). Check for futex errors and loop until the calls do what you want them to do, check for the loop starting with the wrong ftx value, and force the main thread to pause for a moment just before loading SS to give the subthread a better chance of being scheduled. This isn't perfect, because there's still a small chance that the test won't exercise the core feature, but it does prevent the deadlock.
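To illustrate the "check and retry" part of that fix, here is a minimal sketch, not the actual patch, of a more defensive futex handshake, assuming the raw Linux futex syscall interface the test already uses: the waiter rechecks the variable and retries after EINTR, EAGAIN, or a spurious wakeup, and the waker stores the new value before waking everyone (FUTEX_WAKE's third argument is the number of waiters to wake).

#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block until *addr is no longer 'expected'.  FUTEX_WAIT can return
   early (EINTR on a signal, EAGAIN if the value already changed, or a
   spurious wakeup), so recheck the value and call it again until the
   value really has moved on. */
static void futex_wait_while (atomic_int *addr, int expected)
{
  while (atomic_load (addr) == expected)
    syscall (SYS_futex, (int *) addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Publish a new value, then wake every thread waiting on the futex. */
static void futex_set_and_wake (atomic_int *addr, int value)
{
  atomic_store (addr, value);
  syscall (SYS_futex, (int *) addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

Pairing every state change with a wake, and every wait with a recheck, removes the window where both threads can end up spinning on values the other will never write.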

Conclusion

Parallel programming isn't always parallel, and your program isn't always the only thing running. You need to consider the ramifications of external influence on the timing of your various threads, and always assume the worst possible time for things to happen. Don't assume the kernel runs threads in the order you started them. Don't assume your thread won't be interrupted at random times. Finally, the order in happens expect don't assume anything you.