That the "free lunch is over" may have become something of a cliche in the IT industry, but the fact is that lately, the increase in cycles per second has been mostly realized by adding more processing units rather than by other techniques. While multiprocessor mainframes and supercomputers existed essentially since the dawn of computing, this may be the first time ever that the only machines without multicore processors may be those in USB fridges and electric toothbrushes.

Figure: Involute of a circle - parallel curves.

This changes the way that even ordinary programs need to be written. The parallelism inherent in the computation can no longer be extracted automatically from the instruction stream, but the programmer herself needs to make it explicit.

Historically, there have been several approaches to this end.

One possibility is to simply spawn a thread per task, be it a raw POSIX thread or something a bit more high-level, such as boost::thread or std::thread (for C++ and C++11, respectively). Other possibilities include using MPI, OpenMP, or TBB (Threading Building Blocks).
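
To illustrate the first approach, here is a minimal thread-per-task sketch using C++11's std::thread; the process_item function and the items vector are made up for the example:

#include <thread>
#include <vector>

void process_item (int item) { /* ... do some work on item ... */ }

void process_all (std::vector<int> const &items) {
  std::vector<std::thread> workers;
  for (int item: items)
    workers.emplace_back (process_item, item); /* One thread per task. */
  for (auto &worker: workers)
    worker.join (); /* Wait for all of them to finish. */
}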

Which of the above is best will naturally depend on the particular use case. On a distributed-memory system, one would inevitably use MPI; on a shared-memory system, any of these solutions might make sense.

TBB in particular is a relatively young and relatively high-level C++ template library. Using TBB, a user expresses their problem in terms of separate tasks, and then lets TBB schedule tasks to threads running on available processors.

Parallel tasks in TBB

TBB presents both high-level, easy-to-use constructs, and low-level ones, where a user can tune and configure the parallelization approach to their liking.

On the high end of the abstraction scale, TBB provides tbb::parallel_for and tbb::parallel_reduce, which automatically split a workload across several tasks in a configurable way. For example, here is tbb::parallel_for with an inline C++11 lambda, which makes the construct very comfortable to use:

Items items[...];
size_t nitems = sizeof (items) / sizeof (*items);
some_init (items, nitems);

tbb::parallel_for (tbb::blocked_range<size_t> (0, nitems),
                   [&] (tbb::blocked_range<size_t> const &r) {
                     for (size_t i = r.begin (); i != r.end (); ++i)
                       do_stuff (items[i]);
                   });

For less straightforward loops, TBB has two more tools: tbb::parallel_do and tbb::parallel_pipeline.

The former is useful for loops where the iteration space is not a simple linear block like the one above. One can pass arbitrary input iterators to tbb::parallel_do and, using a feeder argument that is optionally passed to the body, schedule additional items for processing. E.g.:

tbb::parallel_do (std::begin (items), std::end (items),
                  [] (Item &item, tbb::parallel_do_feeder<Item> &feeder) {
                    do_stuff (item);
                    for (auto const &v: item.more ())
                      feeder.add (v);
                  });

Parallel pipeline constructs allow expressing pipelines of tasks, starting with pure producers, through filters, to pure consumers:

tbb::parallel_pipeline
  (1024, /* Maximum number of items "in flight" at any one time. */

   /* A pure producer; it should call fc.stop () once there is nothing
      more to produce. */
   tbb::make_filter<void, Item>
     (tbb::filter::serial_in_order,
      [] (tbb::flow_control &fc) -> Item { Item item; /* ... */ return item; })

   & tbb::make_filter<Item, Item>
     (tbb::filter::parallel,
      [] (Item &item) -> Item { /* ... */ return item; })

   & tbb::make_filter<Item, Other>
     (tbb::filter::parallel,
      [] (Item &item) -> Other { /* ... */ return Other (); })

   /* A pure consumer. */
   & tbb::make_filter<Other, void>
     (tbb::filter::serial_out_of_order,
      [] (Other &other) { /* ... */ }));

There's also tbb::parallel_sort for sorting ranges delimited by random-access iterators.
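
Its usage is about as simple as it gets; a brief sketch, assuming a vector of doubles that we want sorted:

#include <tbb/parallel_sort.h>
#include <functional>
#include <vector>

void sort_values (std::vector<double> &values) {
  /* Ascending sort, parallelized across the available cores. */
  tbb::parallel_sort (values.begin (), values.end ());

  /* An overload also accepts a comparator, here sorting in descending order. */
  tbb::parallel_sort (values.begin (), values.end (), std::greater<double> ());
}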

At the highest level, there is tbb::flow::graph and the related classes for creating arbitrary message-passing graphs, whose description would probably be enough for a blog article of its own.
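
Still, to give at least a taste, here is a minimal sketch of a two-node graph; the doubler and printer nodes are made up for the example:

#include <tbb/flow_graph.h>
#include <iostream>

void run_graph () {
  tbb::flow::graph g;

  /* Transform each incoming integer; any number may be processed in parallel. */
  tbb::flow::function_node<int, int> doubler
    (g, tbb::flow::unlimited, [] (int v) { return 2 * v; });

  /* Consume the results one at a time. */
  tbb::flow::function_node<int, tbb::flow::continue_msg> printer
    (g, tbb::flow::serial, [] (int v) {
       std::cout << v << "\n";
       return tbb::flow::continue_msg ();
     });

  tbb::flow::make_edge (doubler, printer);
  for (int i = 0; i < 10; ++i)
    doubler.try_put (i);
  g.wait_for_all (); /* Block until all messages have been processed. */
}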

At the lower level, TBB provides the possibility to work directly with tasks, task groups, and the scheduler. The above-mentioned high-level tools all use this low-level layer to schedule the work to be done, and any user can do the same if they wish to hand-tune their tasks. Working directly in this layer allows one to schedule follow-up tasks directly (thus avoiding some scheduler overhead), to reuse tasks for follow-up work, or to set task priority and affinity.
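
A convenient entry point into this layer is tbb::task_group, which lets one spawn tasks explicitly and wait for them; a rough sketch, using the usual Fibonacci toy example:

#include <tbb/task_group.h>

long fib (long n) {
  if (n < 2)
    return n;

  long a, b;
  tbb::task_group tg;
  tg.run ([&] { a = fib (n - 1); }); /* Spawn a child task... */
  tg.run ([&] { b = fib (n - 2); }); /* ...and another one. */
  tg.wait (); /* Wait until both have finished. */
  return a + b;
}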

At the lowest level, it is even possible to work with individual threads by hand. TBB contains native thread wrappers whose interface is based on C++11 and which are, as such, actually called std::thread rather than tbb::thread. Working at the level of threads instead of tasks might be useful, e.g., if you need to create non-computation threads. An example might be a dedicated user-interface thread that keeps your application responsive while TBB does the number crunching.

Other tools

The majority of this article is devoted to tools for task-based parallelism, but there's quite a bit more in TBB.

There are a number of concurrency primitives. First and foremost, there are mutexes of many types (briefly: tbb::mutex; tbb::recursive_mutex if you need to acquire the same mutex multiple times from the same thread; tbb::spin_mutex for brief periods of active waiting; tbb::queuing_mutex for fair scheduling; and their reader-writer variants). Mutexes provide a scoped-lock mechanism for exception-safe locking:

struct Item {
  typedef tbb::mutex mutex_t;
  mutex_t mutex;
};

void foo (Item &item) {
  Item::mutex_t::scoped_lock lock (item.mutex);
  /* Work with item. */
} /* At this point, the lock is released automatically. */

There's also tbb::atomic, which one would use for implementing lock-free algorithms based on compare-and-swap, fetch-and-add, and other atomic operation primitives. Those are wrapped in convenient C++ interfaces:

tbb::atomic <unsigned> c = 7;
c += 19;
c.fetch_and_add (19); /* Also adds 19, but returns the value held before the addition. */

TBB also contains template classes implementing concurrent containers.

tbb::concurrent_unordered_map and tbb::concurrent_unordered_set are both modeled on similarly-named C++11 template classes. Both allow concurrent iteration and insertion, without the user having to handle the necessary locking themselves.
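
As a quick sketch, here is a map being filled concurrently from within a tbb::parallel_for; the names vector and the position-based indexing are made up for the example:

#include <tbb/concurrent_unordered_map.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <string>
#include <utility>
#include <vector>

void build_index (std::vector<std::string> const &names,
                  tbb::concurrent_unordered_map<std::string, size_t> &index) {
  tbb::parallel_for (tbb::blocked_range<size_t> (0, names.size ()),
                     [&] (tbb::blocked_range<size_t> const &r) {
                       for (size_t i = r.begin (); i != r.end (); ++i)
                         /* Concurrent insertion is safe without extra locking. */
                         index.insert (std::make_pair (names[i], i));
                     });
}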

Another useful container is tbb::concurrent_vector, which allows safe concurrent growing and pushing back, but doesn't allow erasing or inserting at all. It also doesn't keep its elements contiguous, so all elements have to be accessed strictly through iterators or the indexing interfaces. Thus one shouldn't rely on pointer arithmetic, which has de facto worked for std::vector since C++98, with C++11 making it reliable de iure as well.
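
A brief sketch of concurrent growth; the interesting() predicate is made up for the example:

#include <tbb/concurrent_vector.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

bool interesting (int value) { return value % 2 == 0; } /* Made-up predicate. */

void collect (std::vector<int> const &input, tbb::concurrent_vector<int> &output) {
  tbb::parallel_for (tbb::blocked_range<size_t> (0, input.size ()),
                     [&] (tbb::blocked_range<size_t> const &r) {
                       for (size_t i = r.begin (); i != r.end (); ++i)
                         if (interesting (input[i]))
                           output.push_back (input[i]); /* Safe concurrent growth. */
                     });
  /* The elements are not contiguous: walk the result through operator[]
     or iterators, not through pointer arithmetic. */
}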

Finally, TBB contains tbb::concurrent_queue and tbb::concurrent_bounded_queue for concurrent FIFO access. These would be used when tbb::parallel_do, pipelines, or even flow graphs can't be used for whatever reason, and it's still necessary to have a producer-consumer relationship between parts of the program. The two queue templates differ from std::queue in subtle, yet critical ways.

tbb::concurrent_queue is non-blocking; because of that, it doesn't have a size() interface, only unsafe_size(), and instead of pop() it offers try_pop(), which returns a bool indicating whether anything was actually popped. tbb::concurrent_bounded_queue, on the other hand, is blocking: the push() and pop() members can block if an attempt is made to push into a queue whose capacity limit has been reached, or to pop from an empty queue. It also has a signed size() interface, which returns a negative number if there are pending pop() calls and nothing in the queue.
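
To make the difference concrete, here is a minimal producer-consumer sketch using the bounded variant; the capacity and the sentinel value are arbitrary choices for the example:

#include <tbb/concurrent_queue.h>
#include <functional>
#include <thread>

void produce (tbb::concurrent_bounded_queue<int> &queue) {
  for (int i = 0; i < 1000; ++i)
    queue.push (i); /* Blocks while the queue is at capacity. */
  queue.push (-1); /* Sentinel telling the consumer to stop. */
}

void consume (tbb::concurrent_bounded_queue<int> &queue) {
  for (;;) {
    int item;
    queue.pop (item); /* Blocks while the queue is empty. */
    if (item == -1)
      break;
    /* ... process item ... */
  }
}

void run () {
  tbb::concurrent_bounded_queue<int> queue;
  queue.set_capacity (64); /* Throttle the producer to 64 in-flight items. */
  std::thread producer (produce, std::ref (queue));
  std::thread consumer (consume, std::ref (queue));
  producer.join ();
  consumer.join ();
}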

Further reading

TBB is a fairly extensive library, and we don't have the space to cover it all in detail.

Luckily, there's a wealth of information in the Intel® Developer Zone. The links above lead directly to the relevant sections of the user guide and the reference manual, which I recommend for further perusal. Red Hat also ships Doxygen-generated manuals in a package named tbb-doc.

There was also a two-part article here on the Developer Blog by Torvald Riegel, which presented an overview of the parallelism and concurrency tools available in C and C++.

Last updated: January 12, 2023