When writing parallel or multi-threaded programs, programmers have to deal with parallelism and concurrency. The two are related concepts but are not the same. In this article, we will review the differences between them and outline a few programming abstractions for both (in particular, atomic data types, Transactional Memory, and task-based parallelism). Red Hat Developer Toolset 1.1 ships with GCC-4.7, which provides (experimental) support for these particular features. Finally, we will give a short outlook on future features proposed for inclusion in the C/C++ standards and considered for upstream GCC.
Concurrent execution refers to situations in which several (logical) processes or threads execute at the same time and are not guaranteed to be independent; for example, they could communicate with each other, wait for other threads to make progress, or could have to execute operations mutually exclusive with the execution of other threads' operations. In contrast, parallel execution refers to several processes performing independent operations that, informally, do not have to consider what the other parallel parts are doing.
Nonetheless, parallelism is related to concurrency in that a typical parallel program will also contain concurrent pieces of code (e.g., to merge the results of parallel computations into a single output value). Also, to benefit from parallelism in hardware, concurrent code often tries to execute as much as possible in parallel (see Amdahl's Law).
C11 and C++11 introduced standardized support for multi-threading in C/C++. On the concurrency side, this includes (1) a memory model that defines the behavior of multi-threaded programs, (2) atomic data types that can be safely accessed by concurrent threads, and (3) several synchronization primitives such as locks and condition variables. There are no programming abstractions specifically aimed at just parallelism yet, but there is full support for creating threads, and C++11 also offers simple ways to launch asynchronously executing tasks.
Let's discuss some of those programming abstractions as well as two additional abstractions in more detail: Transactional Memory and abstractions for task-based parallelism. The focus is on C++11, but most of the abstractions are also supported in similar form by C11.
Concurrency
When several threads can access shared data concurrently, programmers have to specify how these threads synchronize those accesses. The problem is that the individual accesses of one thread could be interleaved with the execution of steps of another thread. For example, if two threads both want to increment a variable's value, then both of them first read it and both then try to store an increased value back; one of the increments is effectively lost because the later write does not take the other thread's increment into account. To work correctly, each thread needs to read from and store to the variable in one atomic step (i.e., appearing as one indivisible operation to other threads).
Next, here are two programming abstractions that provide atomicity for concurrent accesses by different threads:
- C++11 atomic data types are a low-level abstraction meant to expose the synchronization primitives that current hardware provides. Each instance of such a type can be accessed atomically, but they do not allow for accessing several instances in one atomic step.
- Transactional Memory, in contrast, allows programs to access any number of memory locations in one atomic step (i.e., in a transaction). It is a higher-level abstraction, and easier to use correctly.
Memory model
Both abstractions are based on the C++11 memory model, which defines the semantics of multi-threaded executions, in particular how normal, non-atomic memory accesses can be ordered by synchronization based on atomic memory accesses. It also serves as a unified abstraction for the memory models on the different hardware architectures: It is a simpler model than several hardware memory models, and allows for writing portable yet efficient concurrent code.
In a nutshell, the model defines allowed orders of the operations in a particular execution of a multi-threaded program. It defines a happens-before relation for an execution of a program, which is essentially derived from (1) single-thread program order (i.e., which operations get executed first by a particular thread), (2) which values threads actually read in this particular execution, and (3) ordering constraints enforced by certain kinds of synchronization in the program.
For a particular program, the compiler ensures that all executions will follow some valid happens-before relation: It will be consistent with the semantics of the program as specified by C++11, and the code generated by the compiler enforces this happens-before relation on top of the memory model of the selected hardware architecture.
One of the most important requirements on a C++11 multi-threaded program is that it must be free of data races. Informally, a program has a data race if it can execute accesses that are not properly synchronized. More specifically, programs must not contain non-atomic accesses by at least two threads that (1) target the same memory location, (2) are not ordered by happens-before, and (3) contain at least one write access. Since version 4.8, upstream GCC contains ThreadSanitizer, which can find data races in executions of programs.
Discussing the memory model in detail is beyond the scope of this article. For a deep dive, I would recommend the formalization by Batty et al.; even if the formal approach might come with a steeper learning curve, it is an efficient way to fully understand a subject as complex as the memory model. The team behind this formalization also provides a browser-based tool that shows all the allowed executions of small pieces of concurrent C++11 code (and whether the code is actually data-race-free); this can be very useful to really understand why a certain synchronization idiom works -- or why it does not.
Atomic data types
As mentioned previously, atomic data types (atomics for short) are a low-level abstraction for atomic accesses to single memory locations (e.g., a single variable). For example, to instantiate and increment an atomic integer variable twice, we can write:
std::atomic<int> a(0);
a = a + 1;
a.store(a.load() + 1);
Both increments are equivalent, and both actually contain two separate atomic operations: the load and the store. This means that if another thread also tries to increment the variable concurrently, one of the updates can be lost as explained previously. To avoid such situations, atomics also provide a few atomic read-modify-write operations such as a fetch-and-add:
a.fetch_add(1);
Using this operation concurrently in two threads will not lead to lost increments of the variable, because now the load and store of each increment will execute as one atomic step.
Note that while atomics are a relatively simple way to implement shared counters like in our current example, using them can be very difficult in more complex use cases. They are powerful enough to allow for implementing every concurrent algorithm, but this quickly becomes a subject matter for experts. Thus, often it is easier to use a higher-level abstraction.
Transactional Memory
Transactional Memory (TM for short) is a programming abstraction that allows programmers to declare that a piece of code containing several operations on several memory locations must execute atomically. Actually enforcing this requirement during the execution of the program is the job of a generic TM implementation, which is not application-specific and resides in the compiler and a TM runtime library provided by the compiler.
GCC supports TM language constructs for C/C++ since version 4.7. Since version 4.8, it can also make use of the hardware support for TM that Intel's Haswell CPUs provide; support for the hardware-TM facilities of S/390 and POWER ISA 2.07 has been included in upstream GCC and has been proposed for backporting to GCC-4.8. The language constructs supported by GCC follow a draft specification maintained by Study Group 5 of the ISO C++ committee. The core piece of those constructs is the transaction statement:
__transaction_atomic { if (x < 10) y++; }
In this example, the load from x and the potential load and store on y are guaranteed to execute as an atomic transaction, that is, as one indivisible step from the perspective of all other transactions. No special annotations or data types are required for the data accessed in a transaction; although there are some restrictions on the operations that can be executed in a transaction (e.g., no I/O currently), code such as loops, conditional execution, function calls, etc. is allowed in transactions.
This yields a very powerful programming abstraction that gives somewhat stronger atomicity guarantees than locks. It is also easier to use than locks because programmers do not have to follow a locking convention (i.e., which lock protects which data) -- the generic TM implementation takes care of this. Furthermore, transactions can be easily nested within each other without requiring programmers to think about the deadlock risks that arise when nesting critical sections protected by locks -- the TM implementation will avoid deadlocks when executing transactions. Early user studies conducted by academia suggest that using transactions can indeed lead to simpler programs with fewer errors than using locks.
Note that even though transactions are guaranteed to execute virtually atomically, this does not mean that they need to be executed sequentially by the TM implementation. TM implementations like GCC's will strive to execute transactions in parallel, and typically succeed at that for transactions that access disjoint memory locations. Transactional code often executes somewhat slower than normal sequential code, but often scales well to larger numbers of synchronizing threads. Hardware support for TM can decrease these overheads significantly.