The classic 1984 movie Ghostbusters offered an important safety tip for all of us:

"Don't cross the streams." - "Why not?" - "It would be bad." - "I’m fuzzy on the whole good/bad thing. What do you mean, 'bad'?" - "Try to imagine all life as you know it stopping instantaneously and every molecule in your body exploding at the speed of light." - "Right. That’s bad. Okay. All right. Important safety tip. Thanks..."

Similarly, in computing, there are cases where data crossing between different instruction streams through memory can have a comparable effect on a software application - "all execution as we know it stopping instantaneously".

This is due to the performance optimizations that both hardware and software implement to reorder and eliminate memory accesses. Ignoring these "memory access reordering" issues can result in extremely problematic debugging scenarios.

The "bug from hell" was a scenario in which the OpenJDK runtime's parallel garbage collector very occasionally crashed: one thread's write would signal that a data structure had been updated, but that signal became visible before the actual update writes (to the same data structure), and as a result other threads would end up reading invalid values. We're going to take a deeper look at this scenario to understand exactly what went on in this notorious issue.

The "Bug"

Memory accesses on today's computers are slow, expensive operations, so optimizing compilers attempt to eliminate and reorder them to improve performance. Modern processors do the same. Within a single sequence of instructions (a thread), the compiler and the processor hardware have enough information to ensure that the original unoptimized and the new optimized instruction sequences yield the same end result. However, other threads running on separate processors (those observing the external memory accesses) may be misled by these reordered memory operations and incorrectly infer that something is in a particular state when it is not.

For "the bug from hell", statements similar to the following C++ code were executed by a thread:

  object->pointer = valid_pointer_value;
  object->flag = marked_as_valid;

Most of us (who are not already immersed in these issues) would expect the statements above to be executed in the order specified by the program; however, the statements were actually reordered in the thread as:

  object->flag = marked_as_valid;
  object->pointer = valid_pointer_value;

Other threads would then execute something like the following:

  tmp = object->flag;
  if (tmp == marked_as_valid){

Occasionally, these other threads would see the marked_as_valid value in the flag field and fetch the pointer field before it had been updated with the current value. The incorrect, stale pointer value would then cause the program to crash.

This change in program behavior due to variations in relative timing is called a race condition, and since the window for this particular race condition was quite small, the problem was very hard to replicate and debug. Debugging race conditions is typically difficult, due to the difficulty of reproducing the conditions necessary to trigger them; this is made even more confounding when the source code differs from the actual (reordered) instructions executed.

The Solutions

Totally disabling the software and hardware optimizations is not an option - the performance benefits of reordering memory accesses are too great to ignore. Instead, steps need to be taken to use memory access reordering safely, while still enforcing ordering on the memory operations where required. High-level languages such as C++, C, and Java now dictate what is and is not allowed when performing memory reordering, and they each have libraries for concurrent access to data structures intended to be shared between threads. Other software projects prone to encounter these situations (such as the Linux kernel) typically include low-level constructs to enforce ordering of memory operations.

High-level language (C11, C++, and Java) memory models and concurrent libraries

Early standards for C, C++, and Java did not clearly specify the allowed behavior of multiple threads sharing memory. This poorly defined, or even undefined, behavior made it difficult to write portable code that would work across a wide range of runtime environments. The standards, however, have evolved over time to address this issue.

Standards for these languages now explicitly allow the software and hardware to reorder and eliminate memory operations for regular memory accesses, while special keywords, types, and methods (Java synchronized methods and C++ std::atomic) can be used for data shared between threads. Additionally, language extensions such as OpenMP can simplify writing code that exploits multiple processors.

Concurrent programming libraries have been developed for both languages (for Java: java.util.concurrent, and for C++: Boost), and each provides useful idioms commonly needed in multithreaded programs. In most cases it is preferable to use these libraries for communication between threads rather than building something custom, aka "rolling your own." Many pitfalls can be avoided by using these libraries, and care should be taken to use the right features of each, because, like the "bug from hell", debugging issues caused by race conditions is (often extremely) non-trivial.

Software approach for older software environments

Some software environments (such as older C compilers) do not have a proper memory model for programmers to follow. The Linux kernel started development without such a model, and has had to address memory access reordering problems for device drivers and multiprocessor machines for quite some time. The kernel defines special operations that ensure the compiler does not move, eliminate, or replicate memory accesses: WRITE_ONCE() and READ_ONCE() - i.e., perform the memory operation exactly once and ensure that the compiler does not change the memory access order.

Control over the compiler can be achieved using compiler hints and settings, but additional steps are needed to ensure that the hardware does not change the memory access order. Hardware ordering is specified using memory barriers: operations that ensure all memory accesses performed by the processor before the barrier instruction are completed before any of the memory accesses after the barrier may begin.

Thus (to solve the problematic example), the "bug from hell" code might be changed to something like this:

  WRITE_ONCE(object->pointer, valid_pointer_value);
  mb(); /* memory barrier */
  WRITE_ONCE(object->flag, marked_as_valid);

and this:

  tmp = READ_ONCE(object->flag);
  if (tmp == marked_as_valid){
    mb(); /* memory barrier */

For performance optimization, there are additional variations on the memory barrier operation that affect only certain subsets of memory operations. As just a single example of such a variation, the write memory barrier orders only writes to memory - read operations may be moved past the barrier, while write operations may not.

With great power comes great responsibility

At the end of Ghostbusters, the proton pack streams were crossed to provide the necessary capabilities that a single stream could not provide. Similarly, multithreaded programming can provide capabilities beyond what a single threaded program can provide - just be sure to follow the safety tips.

Further Reading

The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit, ISBN 0123705916 / 9780123705914, 2008.

Memory Ordering in Modern Microprocessors, by Paul E. McKenney

Java memory model

Java concurrent classes (java.util.concurrent)

C11 memory model

Boost C++ concurrent libraries


Linux Kernel memory-barriers.txt

Peter Sewell's Relaxed-Memory Concurrency webpage

Last updated: June 13, 2023