New features in OpenMP

OpenMP is an API consisting of compiler directives and library routines for high-level parallelism in C and C++, as well as Fortran. Version 5.0 of OpenMP was released in November 2018 and version 5.1 in November 2020. This article discusses the new features from OpenMP 5.0 that are implemented in GCC 11, along with some new OpenMP 5.1 features.

OpenMP 5.0 features

Let's start with features that were added in the OpenMP 5.0 standard version.

Support for non-rectangular collapsed loops

Before OpenMP 5.0, all OpenMP looping constructs (worksharing loops, simd, distribute, taskloop, and combined or composite constructs based on those) were required to be rectangular. This means that the lower bound, upper bound, and increment expressions of all the associated loops in the loop nest had to be invariant with respect to the outermost loop. OpenMP 5.0 still requires all the increment expressions to be loop-invariant, but allows some cases where the lower and upper bound expressions of the inner loops can be based on a single outer-loop iterator.

There are restrictions to this new feature, however: The bound expressions of an inner loop may reference at most one outer-loop iterator, and they need to resolve to the form a * outer + b, where a and b are loop-invariant expressions. If the inner and referenced outer loops have different increments, there are further restrictions that make it possible to cheaply compute the number of iterations of the collapsed loop nest before the loop. In addition, non-rectangular loops must not have schedule or dist_schedule clauses specified. This allows the implementation to choose any iteration distribution it prefers.

The following triangular loop is an example:

#pragma omp for collapse(2)
for (int i = 0; i < 100; i++)
  for (int j = 0; j < i; j++)
    arr[i][j] = compute (i, j);

But a non-rectangular loop can also be much more complex:

#pragma omp distribute parallel for simd collapse(4)
for (int i = 0; i < 20; i++)
  for (int j = a; j >= g + i * h; j -= n)
    for (int k = 0; k < i; k++)
      for (int l = o * j; l < p; l += q)
        arr[i][j][k][l] = compute (i, j, k, l);

The easiest implementation is by computing a rectangular hull of the loop nest and doing nothing inside of the combined loop body for iterations that wouldn't be run by the original loop. For example, for the first loop in this section, the implementation would be:

#pragma omp for collapse(2)
for (int i = 0; i < 100; i++)
  for (int j = 0; j < 100; j++)
    if (j < i)
      arr[i][j] = compute (i, j);

Unfortunately, such an implementation can cause a significant work imbalance where some threads do no real work at all. Therefore, except for non-combined non-rectangular simd constructs, GCC 11 computes an accurate number of iterations before the loop. In the case of loop nests with just one loop dependent on outer-loop iterators, it uses Faulhaber's formula, with adjustments for the fact that some values of the outer iterator might result in no iterations of the inner loop. This way, as long as the loop body performs roughly the same amount of work for each iteration, the work is distributed evenly.
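
To make the counting concrete, here is a rough sketch (with a hypothetical run_chunk helper, not GCC's internal code) of how the triangular loop above can be divided: it has 0 + 1 + ... + 99 = 4950 logical iterations in total, and each logical iteration number can be mapped back to its (i, j) pair in closed form, so a thread can be handed a contiguous chunk without scanning the iteration space:

#include <math.h>

extern int compute (int, int);

/* Execute logical iterations [start, end) of the triangular loop
   "for (i = 0; i < 100; i++) for (j = 0; j < i; j++)".  Outer iteration i
   contributes i inner iterations, so (i, j) has logical iteration number
   i * (i - 1) / 2 + j, which is inverted below using the quadratic formula
   (a real implementation would also guard against rounding issues).  */
static void
run_chunk (long start, long end, int arr[][100])
{
  for (long t = start; t < end; t++)
    {
      int i = (int) ((1.0 + sqrt (8.0 * t + 1.0)) / 2.0);
      int j = (int) (t - (long) i * (i - 1) / 2);
      arr[i][j] = compute (i, j);
    }
}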

Conditional lastprivate

In OpenMP, the lastprivate clause can be used to retrieve the value of the privatized variable that was assigned in the last iteration of the loop. The lastprivate clause with the conditional modifier works as a fancy reduction: it chooses the value from the thread (or team, SIMD lane, or task) that assigned the variable in the highest logical iteration. For example:

#pragma omp parallel for lastprivate(conditional:v)
for (int i = 0; i < 1024; i++)
  if (cond (i))
    v = compute (i);
result (v);

For this construct to work, the privatized variable must be modified only by storing directly to it; it shouldn't be modified through pointers or inside other functions. This allows the implementation to find those stores easily and adjust each store to remember the logical iteration that performed it. This feature is already implemented in GCC 10.
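
As a rough sketch of the idea (the v_priv and v_last names are hypothetical and this is not GCC's actual lowering), each thread can keep a private copy of the variable together with the highest logical iteration that stored it; after the loop, the copy recorded with the highest iteration wins:

extern int cond (int);
extern int compute (int);
extern void result (int);

void
conditional_lastprivate_sketch (void)
{
  int v = 0;
  long best_iter = -1;            /* highest iteration that stored v overall */
  #pragma omp parallel
  {
    int v_priv = 0;               /* per-thread copy of v */
    long v_last = -1;             /* last logical iteration this thread stored it in */
    #pragma omp for nowait
    for (int i = 0; i < 1024; i++)
      if (cond (i))
        {
          v_priv = compute (i);   /* the user's store */
          v_last = i;             /* remember which iteration performed it */
        }
    #pragma omp critical
    if (v_last > best_iter)       /* keep the value from the latest iteration */
      {
        best_iter = v_last;
        v = v_priv;
      }
  }
  result (v);
}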

Inclusive and exclusive scan support

OpenMP 5.0 added support for implementing parallel prefix sums (otherwise known as cumulative sums or inclusive and exclusive scans). This support allows C++17 std::inclusive_scan and std::exclusive_scan to be parallelized using OpenMP. The syntax is built upon the reduction clause with a special modifier and a new directive that divides the loop body into two halves. For example:

#pragma omp parallel for reduction (inscan, +:r)
for (int i = 0; i < 1024; i++)
  {
    r += a[i];
    #pragma omp scan inclusive(r)
    b[i] = r;
  }

The implementation can then split the loop body into its two halves, creating not just one privatized variable per thread but a full array for the entire construct. After evaluating one of the halves of the user code for all iterations (which half depends on whether the scan is inclusive or exclusive), an efficient parallel computation of the prefix sum can be performed on the privatized array, and finally the other half of the user code can be evaluated by all threads. The syntax allows the code to work properly even when the OpenMP pragmas are ignored. This feature is implemented in GCC 10.
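
For comparison, an exclusive scan swaps the two halves of the body and uses the exclusive modifier on the scan directive, so b[i] receives only the sum of the elements preceding a[i] (syntax as defined by OpenMP 5.0):

#pragma omp parallel for reduction (inscan, +:r)
for (int i = 0; i < 1024; i++)
  {
    b[i] = r;
    #pragma omp scan exclusive(r)
    r += a[i];
  }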

Declare variant support and meta-directives

In OpenMP 5.0, some direct calls can be redirected to specialized alternative implementations based on the OpenMP context from which they are called. The specialization can be done based on which OpenMP constructs the call site is lexically nested in. The OpenMP implementation can then select the correct alternative based upon the implementation vendor, the CPU architecture and ISA flags for which the code is compiled, and so on. Here is an example:

void foo_parallel_for (void);
void foo_avx512 (void);
void foo_ptx (void);
#pragma omp declare variant (foo_parallel_for) \
match (construct={parallel,for},device={kind("any")})
#pragma omp declare variant (foo_avx512) \
match (device={isa(avx512bw,avx512vl,"avx512f")})
#pragma omp declare variant (foo_ptx) match (device={arch("nvptx")})
void foo (void);

If foo is called directly from within the lexical body of a worksharing loop that is lexically nested in a parallel construct (including the combined parallel for), the call will be replaced by a call to foo_parallel_for. If foo is called from code compiled for the previously mentioned AVX-512 ISAs, foo_avx512 will be called instead. And finally, if foo is called from code running on NVIDIA PTX, the compiler will call foo_ptx instead.

A complex scoring system, including user-specified scores, decides which variant is used when multiple variants match. This construct is partially supported in GCC 10 and fully supported in GCC 11. The OpenMP 5.0 specification also allows meta-directives using similar syntax, where one of several different OpenMP directives is selected depending on the OpenMP context in which the meta-directive appears.
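
The example above does not show a meta-directive itself; a minimal sketch of what one could look like, using the metadirective syntax from the OpenMP 5.0 specification (the particular selector and fallback here are purely illustrative), is:

#pragma omp metadirective \
    when (device={arch("nvptx")}: teams distribute parallel for) \
    default (parallel for)
for (int i = 0; i < 1024; i++)
  a[i] = work (i);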

The loop construct

In OpenMP 4.5, the various looping constructs prescribed how the implementation should divide the work: the programmer specified whether the work should be divided between teams in the league of teams, between threads in the parallel region, across SIMD lanes in a simd construct, and so on. OpenMP 5.0 offers a new loop construct that is less prescriptive and leaves the implementation more freedom in how it actually divides the work. Here's an example:

#pragma omp loop bind(thread) collapse(2)
for (int i = 0; i < 1024; i++)
  for (int j = 0; j < 1024; j++)
    a[i][j] = work (i, j);

The bind clause is required on orphaned loop constructs and specifies the binding of the construct: whether the iterations are distributed across teams (bind(teams)), across the threads of the innermost enclosing parallel region (bind(parallel)), or executed by the encountering thread (bind(thread)). If the pragma is lexically nested in an OpenMP construct that makes the binding obvious, the bind clause can be omitted. The implementation is allowed to use extra threads to execute the iterations. The loop construct is implemented in GCC 10.
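
For instance, when the loop construct is closely nested in a parallel construct, the binding is clear from the context and the bind clause can be left out (a minimal sketch):

#pragma omp parallel
#pragma omp loop collapse(2)
for (int i = 0; i < 1024; i++)
  for (int j = 0; j < 1024; j++)
    a[i][j] = work (i, j);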

There are restrictions on which OpenMP directives can appear in the body of the loop, and no OpenMP API calls can be used there. These restrictions were imposed so that the user program can't observe and rely on how the directive is actually implemented. Restrictions on work scheduling have been added in OpenMP 5.1, which is discussed next.

OpenMP 5.1 features

In OpenMP 5.1, C++ programs can specify OpenMP directives using C++11 attributes, in addition to the older use of pragmas. Two examples using attributes follow:

[[omp::directive (parallel for, schedule(static))]]
for (int i = 0; i < 1024; i++)
  a[i] = work (i);

[[omp::sequence (directive (parallel, num_threads(16)), \
                 directive (for, schedule(static, 32)))]]
for (int i = 0; i < 1024; i++)
  a[i] = work (i);

OpenMP 5.1 added a scope directive: all threads encountering it execute the body of the construct, and private and reduction clauses can be applied to it. For example:

#pragma omp scope private (i) reduction(+:r)
{
  i = foo ();
  r += i;
}

Unless the nowait clause is present on the directive, there is an implicit barrier at the end of the region.

OpenMP 5.1 has new assume, interop, dispatch, error, and nothing directives. Loop transformation directives were also added. The master construct was deprecated and replaced by the new masked construct (see the sketch after the list below). There are many new API calls, including:

  • omp_target_is_accessible
  • omp_get_mapped_ptr
  • omp_calloc
  • omp_aligned_alloc
  • omp_realloc
  • omp_set_num_teams
  • omp_set_teams_thread_limit
  • omp_get_max_teams
  • omp_get_teams_thread_limit
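
As an example of the migration away from the deprecated master construct mentioned above, here is a minimal sketch of the new masked construct (prepare and work are placeholder functions; the filter clause selects which thread of the team executes the block, and without it thread 0 does, matching the old master behavior):

#pragma omp parallel
{
  #pragma omp masked filter(2)
  prepare ();            /* executed only by the thread with thread number 2 */

  #pragma omp barrier    /* masked, like master, has no implied barrier */
  work ();               /* executed by all threads of the team */
}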

The OpenMP API features history appendix covers all changes, including deprecated features.

Try it out

The specifications for both OpenMP 5.0 and OpenMP 5.1 are available at openmp.org/specifications/, in both PDF and HTML formats. The latest version of GCC (GCC 11) supports the features described in this article and various others, this time not just for C and C++ but, for many features, also for Fortran. Several other new OpenMP features will be implemented only in later GCC versions.

Last updated: April 29, 2021