What’s new in OpenMP 5.0

A new version of the OpenMP standard, 5.0, was released in November 2018 and brings several new constructs to the users. OpenMP is an API consisting of compiler directives and library routines for high-level parallelism in C, C++, and Fortran programs. The upcoming version of GCC adds support for some parts of this newest version of the standard.

This article highlights some of the latest features, changes, and “gotchas” to look for in the OpenMP standard.

Task reductions

The new version of the standard allows the reduction clause on the taskloop construct and adds two new clauses: task_reduction for the taskgroup construct and in_reduction for the task and taskloop constructs. Previously, variables could be reduced only across threads in parallel regions, across teams in teams regions, or across SIMD lanes. Now reductions can also span explicit tasks: on these constructs, the reduction variables are privatized, and many copies of them can be created.

For task reductions, the implementation can choose from various approaches. For example, it can create an array of privatized copies of the variables, with one element per thread, and initialize those copies at the start of the taskgroup. Alternatively, it can create the privatized copies lazily, when a task first references them in an in_reduction clause, or even more lazily, when a task first accesses them. It can also choose something else entirely. In every case, the variables need to be reduced by the time the taskgroup construct finishes.

The reduction clause on the taskloop construct is allowed only if the nogroup clause is not specified. Because there is then an implicit taskgroup around the tasks, the clause acts both as a task_reduction clause on that implicit taskgroup and as an in_reduction clause for the individual explicit tasks the construct creates. In addition, a reduction clause specified on a parallel or worksharing construct can have a task modifier; such a reduction is then usable in the in_reduction clause on task and taskloop constructs encountered during execution of the parallel or worksharing construct.
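
As a rough illustration of the task modifier, the sketch below puts reduction(task, +:r) on a parallel construct and lets explicit tasks join that reduction through in_reduction. The function name sum_squares and the workload are made up for this illustration. It assumes a compiler with OpenMP 5.0 task reduction support (for example, GCC 9 with -fopenmp), and it computes the same value serially if the pragmas are ignored.

```cpp
// Hypothetical example: sum of i*i for i in [0, n) using explicit tasks.
long sum_squares(int n) {
  long r = 0;
  // The task modifier makes this reduction usable by in_reduction clauses
  // on tasks encountered inside the parallel region.
  #pragma omp parallel reduction(task, +:r)
  {
    // One thread creates the tasks; any thread of the team may run them.
    #pragma omp single
    for (int i = 0; i < n; ++i) {
      // Each task adds into a privatized copy of r.
      #pragma omp task in_reduction(+:r)
      r += static_cast<long>(i) * i;
    }
  }
  return r;
}
```

Calling sum_squares(10) should give the same result as the plain serial loop, because all privatized copies are combined into r by the end of the parallel region.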

Here’s an example of the reduction clause on a taskloop construct:

int foo () {
  int r = 0;
  #pragma omp taskloop reduction (+:r)
  for (int i = 0; i < 128; i++)
    r += work (i);
  return r;
}

Inside the tasks created for the taskloop construct (each of which can handle one or more iterations), references to the r variable might refer to a privatized copy of that variable. The copy is initialized at some point before the task first accesses it, generally by using the reduction initializer (to 0, in this case). By the time the taskloop construct finishes, all privatized copies must have been combined into the original r using the reduction combiner (in this case, omp_out += omp_in). The implementation can choose whether to use locking, use atomic operations, or reduce serially when all the tasks are finished.
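
To make the roles of the initializer and the combiner more concrete, here is a minimal sketch with a user-defined reduction on a taskloop. The Stats type, the merge_into helper, the merge_stats reduction identifier, and the collect_even function are all hypothetical names invented for this illustration, and the sketch assumes a compiler that accepts user-defined reductions on taskloop.

```cpp
// Hypothetical aggregate used only for this illustration.
struct Stats {
  long sum;
  long count;
};

// Hypothetical helper that folds one partial result into another.
inline void merge_into(Stats &out, const Stats &in) {
  out.sum += in.sum;
  out.count += in.count;
}

// Combiner: fold omp_in into omp_out.
// Initializer: every privatized copy starts out zeroed.
#pragma omp declare reduction(merge_stats : Stats : merge_into(omp_out, omp_in)) \
    initializer(omp_priv = Stats())

// Sums and counts the even numbers in [0, n).
Stats collect_even(int n) {
  Stats s = {0, 0};
  #pragma omp parallel
  #pragma omp single
  {
    // Tasks created here update privatized copies of s.
    #pragma omp taskloop reduction(merge_stats : s)
    for (int i = 0; i < n; ++i) {
      if (i % 2 == 0) {
        s.sum += i;
        s.count += 1;
      }
    }
  }
  return s;
}
```

The loop body never mentions the combiner directly; the implementation applies it when it merges the privatized copies back into s.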

Task reductions work even with global variables, which are privatized when needed, as shown in the following example. The taskgroup construct establishes a task reduction for the r variable, and any tasks with a corresponding in_reduction clause participate in that reduction. (Note that the arguments of the in_reduction clause and of the corresponding task_reduction or reduction clause must be the same.) The taskloop construct in the foo function shows that the reduction clause there acts similarly, except that r referenced directly in the tasks created by the taskloop participates in the reduction as well.

int r;
void bar (int i) {
  #pragma omp task in_reduction (+:r)
  r += work (i, 0);
  #pragma omp task in_reduction (+:r)
  r += work (i, 1);
}
int foo () {
  #pragma omp taskgroup task_reduction (+:r)
  bar (0);
  #pragma omp taskloop reduction (+:r)
  for (int i = 1; i < 4; ++i)
    { bar (i); r += i; }
  return r;
}

!= conditions and C++ range loops

Older versions of the standard required the loop condition to use only the <, >, <=, or >= comparisons. The new version of OpenMP also allows != as a loop condition, as shown in the example below, but in that case the increment expression must increment or decrement the iteration variable by exactly 1, so that the compiler can determine the loop direction at compile time. Users of random access iterators, especially in C++, are used to writing != instead of < or >. Because the compiler can figure out at compile time whether the loop is incrementing or decrementing, it can transform the loop to use a < or > comparison instead, so this change is mainly syntactic sugar.

#pragma omp for
for (auto iv = something.begin (); iv != something.end (); ++iv)
/* ... */;
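
For a complete function built around the same pattern, the following sketch sums a std::vector with an iterator loop that uses != as its condition. The function name sum_with_iterators is hypothetical, and the reduction clause is added here only so that the example produces a checkable result.

```cpp
#include <vector>

// Hypothetical example: iterator loop with a != condition and a step of 1.
long sum_with_iterators(const std::vector<int> &values) {
  long total = 0;
  #pragma omp parallel for reduction(+:total)
  for (auto it = values.begin(); it != values.end(); ++it)
    total += *it;
  return total;
}
```

Because the iterator is advanced with ++it, the compiler can tell that the loop counts upward and treat the condition like it < values.end().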

The OpenMP 5.0 standard adds support for many of the C++11, C++14, and C++17 features, and the C++ range for is one of them. So you can use this:

#pragma omp parallel for
for (auto x : vec)
/* ... */;

and the compiler will split the loop iterations among the threads of the parallel region.
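
A slightly fuller sketch of a range-based loop under OpenMP follows. The function name scale_in_place is hypothetical. Note that it iterates by reference (auto &x), because a by-value loop variable as in the snippet above would only modify a copy of each element.

```cpp
#include <vector>

// Hypothetical example: multiply every element by factor, in parallel.
void scale_in_place(std::vector<double> &vec, double factor) {
  #pragma omp parallel for
  for (auto &x : vec)
    x *= factor;  // each iteration touches a distinct element
}
```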

Host teams construct

In OpenMP 4.5, the teams construct was allowed only when immediately lexically nested inside the target construct for offloading. The OpenMP 5.0 standard also allows teams to be used for host parallelization, which is especially useful on NUMA systems where communication between different NUMA nodes can be expensive. The different teams then don’t participate in the normal synchronization that is done inside parallel regions. You can distribute work between the different NUMA nodes using the distribute construct and, within each NUMA node, parallelize using parallel constructs and possibly simd constructs.
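
A minimal sketch of host teams might look like the following. The function name sum_host_teams and the choice of two teams (standing in for two NUMA nodes) are assumptions made for this illustration. The sketch also assumes the function is called from outside any parallel region, since a host teams construct may not be nested inside one.

```cpp
// Hypothetical example: sum an array using a league of host teams.
long sum_host_teams(const int *data, int n) {
  long total = 0;
  // distribute splits the iterations among the teams; parallel for then
  // splits each team's share among that team's threads.
  #pragma omp teams distribute parallel for num_teams(2) reduction(+:total)
  for (int i = 0; i < n; ++i)
    total += data[i];
  return total;
}
```

No target construct is involved, so all of this runs on the host.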

Iterators in the depend clause

In the depend clause, artificial iterators can be used to create at runtime a variable number of depend clauses. In the following code:

#pragma omp task depend(iterator (i=0:64:16, long j=v1:v2), in: arr[i][j])
;

if the v1 variable has a value of 2 and the v2 variable has the value 4, then the above code is handled at runtime like this:

#pragma omp task depend(in: arr[0][2], arr[0][3], arr[16][2], arr[16][3]) \
                 depend(in: arr[32][2], arr[32][3], arr[48][2], arr[48][3])

Here, the artificial variable i has type int and j has type long, and both are in scope only within the depend clause; i iterates from 0 (inclusive) to 64 (exclusive) with step 16, and j iterates from v1 (inclusive) to v2 (exclusive) with step 1.
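
To show such a clause doing real work, the sketch below has one consumer task wait for several producer tasks through a single depend clause with an iterator. The function name sum_after_producers, the cells array, and the sizes are hypothetical and chosen only for this illustration.

```cpp
// Hypothetical example: a consumer task that depends on four producers.
int sum_after_producers() {
  int cells[8] = {0};
  int total = 0;
  #pragma omp parallel
  #pragma omp single
  {
    // Producers: one task per even index, each writing its own cell.
    for (int k = 0; k < 8; k += 2) {
      #pragma omp task depend(out: cells[k]) shared(cells)
      cells[k] = k + 1;
    }
    // Consumer: the iterator expands to dependences on cells[0], cells[2],
    // cells[4], and cells[6], so this task runs after all the producers.
    #pragma omp task depend(iterator(i=0:8:2), in: cells[i]) shared(cells, total)
    for (int k = 0; k < 8; k += 2)
      total += cells[k];
    #pragma omp taskwait
  }
  return total;
}
```

Without the iterator, the consumer would need one hand-written list item per producer, which is not possible when the number of producers is known only at runtime.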

Atomic construct changes

In OpenMP 4.5, atomic constructs were either relaxed (when no extra clause was specified) or sequentially consistent (with the seq_cst clause). OpenMP 5.0 allows you to specify other memory-ordering behaviors explicitly (with the relaxed, acquire, release, and acq_rel clauses) and to change the default that applies when none is specified, through the new requires directive. Additionally, you can specify a hint clause (although GCC currently just parses it and then ignores it). You can also specify a memory-order clause on the flush construct.

#pragma omp requires atomic_default_mem_order(seq_cst)
// All atomic constructs in this translation unit will be
// sequentially consistent unless specified otherwise.

#pragma omp atomic update // This will now be seq_cst
i += 1;
#pragma omp atomic capture relaxed // This will be relaxed
v = j *= 2;
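
As a small self-contained illustration of an explicit memory-order clause, the sketch below counts matching elements with a relaxed atomic update. The function name count_even is hypothetical. Relaxed ordering is assumed to be sufficient here because the counter is read only after the implicit barrier at the end of the parallel loop.

```cpp
// Hypothetical example: count the even values in data[0..n).
int count_even(const int *data, int n) {
  int hits = 0;
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    if (data[i] % 2 == 0) {
      // Only atomicity is needed; no ordering with other memory accesses.
      #pragma omp atomic update relaxed
      hits += 1;
    }
  }
  return hits;
}
```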

Various new combined constructs

As syntactic sugar, several new combined constructs are now supported, especially to save typing when using OpenMP tasking. These include:

#pragma omp parallel master
#pragma omp parallel master taskloop
#pragma omp parallel master taskloop simd
#pragma omp master taskloop
#pragma omp master taskloop simd
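
As a sketch of what the shortened form saves, the following function uses parallel master taskloop where one would otherwise write a parallel construct, a master construct, and a taskloop construct as three nested directives. The function name fill_squares is hypothetical.

```cpp
#include <vector>

// Hypothetical example: fill out[i] with i*i using tasks.
void fill_squares(std::vector<long> &out) {
  const int n = static_cast<int>(out.size());
  // Starts a team, lets the master thread create the loop tasks, and lets
  // the whole team execute them.
  #pragma omp parallel master taskloop
  for (int i = 0; i < n; ++i)
    out[i] = static_cast<long>(i) * i;
}
```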

omp_pause_resource

OpenMP 5.0 has two new API calls, omp_pause_resource and omp_pause_resource_all, through which users can ask the library to release resources (threads, offloading device data structures, etc.). This can allow, for example, the use of fork without an immediate exec when OpenMP directives have been used before and will be used in the child as well.
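
The fork scenario might be sketched as follows. This assumes a POSIX system (for fork and waitpid) and a runtime library that provides the OpenMP 5.0 pause routines. The function names parallel_sum and fork_after_pause are hypothetical, and the OpenMP runtime call is guarded so that the code still builds when OpenMP is not enabled.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: sum of 1..n computed in a parallel region.
static long parallel_sum(int n) {
  long total = 0;
  #pragma omp parallel for reduction(+:total)
  for (int i = 1; i <= n; ++i)
    total += i;
  return total;
}

// Uses OpenMP in the parent, releases the runtime's resources, then forks
// and uses OpenMP again in the child. Returns true if both sums are right.
bool fork_after_pause(int n) {
  const long expected = static_cast<long>(n) * (n + 1) / 2;
  const bool parent_ok = parallel_sum(n) == expected;
#ifdef _OPENMP
  // A nonzero result means the request was not honored.
  if (omp_pause_resource_all(omp_pause_hard) != 0)
    return false;
#endif
  pid_t pid = fork();
  if (pid < 0)
    return false;
  if (pid == 0)  // child: no exec, just more OpenMP work
    _exit(parallel_sum(n) == expected ? 0 : 1);
  int status = 0;
  if (waitpid(pid, &status, 0) < 0)
    return false;
  return parent_ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The pause is requested from outside any parallel region, after the parent's OpenMP work has finished and before the fork.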

Data sharing changes

GCC 9 includes a change that isn’t actually new in OpenMP 5.0. OpenMP 4.0 introduced, likely by mistake, a change where const-qualified variables were no longer predetermined as shared. The intended effect is that you can specify those variables in a shared clause (the likely motivation for the change), but there is also an undesirable effect: when the default(none) clause is used, const-qualified variables must be specified explicitly in some data-sharing clause if they are used inside the construct, whereas previously they didn’t have to be. At the time of older GCC releases, I was hoping to get this change reversed, but it has been agreed that this is not going to change. Users encountering this have various options; see OpenMP data sharing for more details.
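
One possible way to write code that is affected by this is sketched below. The function name scaled_sum and the scale constant are hypothetical. The sketch lists the const-qualified variable in a firstprivate clause, on the assumption that this form is accepted both by compilers that still treat such variables as predetermined shared and by compilers that follow the newer rule.

```cpp
// Hypothetical example: default(none) with a const-qualified local.
long scaled_sum(const int *data, int n) {
  const int scale = 3;
  long total = 0;
  // With the newer rule, scale must appear in some data-sharing clause here.
  #pragma omp parallel for default(none) shared(data, n) firstprivate(scale) \
      reduction(+:total)
  for (int i = 0; i < n; ++i)
    total += static_cast<long>(scale) * data[i];
  return total;
}
```

For a small scalar like this, the per-thread copy made by firstprivate costs little; for large const objects, other approaches may be preferable.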

Trying it out

The OpenMP 5.0 standard is available and includes many more new features than those described above. Also, see the OpenMP 5.0 Reference Guide. The upcoming version of GCC (GCC 9) is going to support the features described above and various others (for C and C++ only, for now), but many other new OpenMP 5.0 features will be implemented only in later GCC versions. See OpenMP 5.0 support for GCC 9 for details on exactly what is implemented in GCC 9 and what will be implemented in later GCC releases.
