What is new in OpenMP 4.5

A new version of the OpenMP standard, 4.5, was released in November 2015 and brings several new constructs to users. OpenMP is an API consisting of compiler directives and library routines for high-level parallelism in C, C++ and Fortran programs. The upcoming version of GCC adds support for this newest version of the standard.

This post highlights some of the latest features, changes, and "gotchas" to look out for.

Taskloop construct

Diving right in, the 'taskloop' construct was newly added in version 4.5 and as its name implies, allows dividing iterations of a loop into tasks. It can optionally wait on completion of those tasks, and each created task is assigned one or more iterations of the loop. For example:

#pragma omp taskloop num_tasks (32)
  for (long l = 0; l < 1024; l++)
    do_something (l);

The above code will create 32 tied tasks in a new taskgroup, and a reasonable implementation will assign 32 iterations to each task. As there is an implicit taskgroup around it, the encountering task will wait for all these tasks to complete. In OpenMP 4.0 and earlier, to achieve a similar effect, one would create the tasks manually, as in the following example:

#pragma omp taskgroup
  for (int tmp = 0; tmp < 32; tmp++)
    #pragma omp task
      for (long l = tmp * 32; l < tmp * 32 + 32; l++)
        do_something (l);

When there are multiple collapsed loops, or if C++ iterators are used in the loop, handling this manually would be harder, so using the taskloop construct greatly simplifies the source.

Instead of specifying the number of tasks that should be created, the grainsize clause can be used. This specifies how many iterations each task should have (from the specified grain size to less than twice that value), and the implementation will compute the number of tasks automatically. Alternatively, if both the num_tasks and grainsize clauses are missing, the implementation will choose some reasonable default.
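As a rough sketch of the grainsize variant: the function name scale_with_grainsize, its parameters, and the grain size of 64 are illustrative assumptions, not anything prescribed by the standard. The taskloop is placed inside parallel and single so that one thread creates the tasks and the whole team can execute them.

```c
/* Hypothetical helper: multiplies every element of v by factor.
   grainsize (64) asks for tasks of at least 64 and fewer than 128
   iterations each; the runtime derives the number of tasks itself.  */
void
scale_with_grainsize (double *v, long n, double factor)
{
#pragma omp parallel
#pragma omp single
#pragma omp taskloop grainsize (64)
  for (long i = 0; i < n; i++)
    v[i] *= factor;
}
```

When compiled without OpenMP support the pragmas are ignored and the loop simply runs serially, with the same result.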

Another tasking change in OpenMP 4.5 is the addition of task priorities, which can be specified for a task in the priority clause. As you might expect, the runtime should prefer scheduling tasks with higher priorities over scheduling tasks with lower priorities.
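A minimal sketch of the priority clause follows. The function run_prioritized and the priority values 1 and 10 are made up for illustration. The priority is only a scheduling hint, so the code must not rely on the higher-priority task actually running first; in addition, the OMP_MAX_TASK_PRIORITY environment variable caps the usable priority values, so the hint may have no visible effect unless it is set.

```c
/* Hypothetical example: two independent tasks, the second one carrying
   a higher priority hint.  Correctness does not depend on the order.  */
void
run_prioritized (int *low, int *high)
{
#pragma omp parallel
#pragma omp single
  {
#pragma omp task priority (1)
    *low = 1;
#pragma omp task priority (10)
    *high = 2;
#pragma omp taskwait
  }
}
```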

Doacross parallelism

In some cases it is desirable to parallelize loops that have some inter-iteration dependencies. This is typically alright as long as the work that can be performed in parallel is sufficiently expensive that the threads will not spend most of their runtime waiting on each other.

OpenMP 4.0 and earlier offered the 'ordered' construct. In a loop construct marked with an ordered clause, the body of the ordered construct is executed in lexical iteration order. In OpenMP 4.5, it is possible to use ordered directives (without a body) that express dependencies on earlier lexical iterations of the loop, and to mark the place where this iteration has computed all the data that other iterations might depend on. For example:

#pragma omp for ordered(2)
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      {
        a[i][j] = foo (i, j);
        #pragma omp ordered depend (sink: i - 1, j) depend (sink: i, j - 1)
        b[i][j] = bar (a[i][j], b[i - 1][j], b[i][j - 1]);
        #pragma omp ordered depend (source)
        baz (a[i][j], b[i][j]);
      }

In the example above, only the outer loop is distributed among threads in the team. The first ordered directive tells the runtime to wait for the completion of the specified earlier iterations, and the last ordered directive marks the point where the iteration has computed all the data that other iterations might depend on.

In this case, the compiler will ignore the depend (sink: i, j - 1) dependency, because only the outer loop is participating in the work-sharing, and the inner loop iterations are performed in lexical order; therefore, by the time i, j iteration is performed, the i, j - 1 iteration is guaranteed to be finished.

In addition to allowing depend clause on the ordered directive, OpenMP 4.5 also allows threads and simd clauses on the ordered construct and allows #pragma omp ordered simd to be used in SIMD loops. This marks a code region that is executed in lexical order within the SIMD loop, for small portions of code that should not be vectorized in an otherwise vectorizable loop.
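The following sketch shows one plausible use of ordered simd. The function accumulate_by_bucket and its parameters are hypothetical. Squaring the inputs can be vectorized, while the update of totals may hit the same bucket from several SIMD lanes and is therefore kept in lexical order.

```c
/* Hypothetical example: add the square of x[i] to totals[bucket[i]].
   bucket[i] is assumed to be a valid index into totals.  */
void
accumulate_by_bucket (const int *bucket, const float *x, float *totals, int n)
{
#pragma omp simd
  for (int i = 0; i < n; i++)
    {
      float y = x[i] * x[i];      /* vectorizable part */
#pragma omp ordered simd
      totals[bucket[i]] += y;     /* lanes may collide, so run in order */
    }
}
```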

Data sharing changes

Also included in the latest update: C++ references are now allowed in privatization clauses, where previously they were allowed only in the shared clause.

In C++ methods, it is now possible to privatize accessible non-static data members of the object on which the method is invoked, as long as they are accessed in the corresponding OpenMP region using an id-expression that denotes them, rather than explicit this->member dereference.

The reduction clause in C and C++ allows array sections, so C/C++ arrays can now be reduced without wrapping them into structures or classes and defining user-defined reductions. The linear clause is now allowed on the loop construct, where previously it was allowed only on SIMD constructs.
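A small histogram is one way to picture an array-section reduction. The names histogram and NBINS are assumptions made for this sketch, and the input values are assumed to be non-negative. Each thread gets a zero-initialized private copy of counts[0:NBINS], and the copies are added to the original array at the end of the loop.

```c
#define NBINS 4   /* hypothetical number of bins */

/* Hypothetical histogram: counts points to at least NBINS elements and
   is reduced as the array section counts[0:NBINS].  */
void
histogram (const int *values, int n, int *counts)
{
#pragma omp parallel for reduction (+: counts[0:NBINS])
  for (int i = 0; i < n; i++)
    counts[values[i] % NBINS]++;
}
```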

Offloading changes

Offloading is likely the part of the OpenMP standard that has changed most, consequently introducing some source level incompatibilities - these being mostly related to non-explicit mapping of scalar variables, including pointers, in C/C++ target regions.

In OpenMP 4.0, these variables, unless mentioned explicitly in map clauses on the target construct, were implicitly mapped tofrom - meaning that the host side value of a pointer would be copied to target (usually not really useful there), target value of a pointer copied back to host (also usually undesirable), or lastly no copying occurs if the variable is already mapped.

In OpenMP 4.5, unless the defaultmap(tofrom: scalar) clause is used, scalar variables are implicitly privatized, as if the newly allowed firstprivate clause on the target construct is used for them - meaning their value is always copied to the target region, and never copied back.

In many cases this will work even with code written for OpenMP 4.0, but the exception is mainly if you need to get a scalar value back from the device at the end of the target region. For example, the following OpenMP 4.0 code will no longer work:

double foo () {
  double sum = 0.0;
  #pragma omp target map(array[0:N])
  #pragma omp teams distribute parallel for simd reduction(+:sum)
    for (int i = 0; i < N; i++)
      sum += array[i];
  return sum;
}

The failure will occur because sum will be firstprivate in the region and the host value will not be modified. To port this code to OpenMP 4.5, one needs to either add the defaultmap(tofrom: scalar) clause to the target construct, or use an explicit map(tofrom: sum).
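A possible port using the explicit map(tofrom: sum) might look as follows. To keep the sketch self-contained, the array and its length are passed as parameters, and the function name sum_on_device is made up here.

```c
/* Hypothetical OpenMP 4.5 port of the summation example: sum is mapped
   explicitly, so the reduced value is copied back to the host.  */
double
sum_on_device (const double *array, int n)
{
  double sum = 0.0;
#pragma omp target map (to: array[0:n]) map (tofrom: sum)
#pragma omp teams distribute parallel for simd reduction (+: sum)
  for (int i = 0; i < n; i++)
    sum += array[i];
  return sum;
}
```

If no offloading device is available, the region is expected to fall back to the host and produce the same result.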

Pointers are implicitly mapped as if they appear in map(ptr[:0]) clauses - thus the host pointer is translated to the corresponding device pointer if the pointed object is already mapped, or NULL otherwise.
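As an illustration of the implicit pointer mapping, consider this sketch, in which the function scale_on_device and its parameters are hypothetical. The enclosing target data region maps the pointed-to array, and the inner target region mentions the pointer without any map clause.

```c
/* Hypothetical example of implicit pointer translation.  */
void
scale_on_device (double *data, int n, double factor)
{
#pragma omp target data map (tofrom: data[0:n])
  {
    /* No map clause: data behaves as if map (data[:0]) were given, so
       it is translated to the device copy created above.  n and factor
       are scalars and therefore implicitly firstprivate.  */
#pragma omp target
    for (int i = 0; i < n; i++)
      data[i] *= factor;
  }
}
```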

Mapping of C++ references has been clarified, and it is now possible to map structure elements individually. Support for asynchronous offloading has been added through the nowait and depend clauses on the target construct, whereas in OpenMP 4.0, all target regions were synchronous - the host task would be waiting until the offloading is finished, and one had to use host tasking manually to achieve asynchronous offloading.
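Asynchronous offloading could be sketched like this. The function name and the host-side work are invented for the example. Two target regions are launched with nowait and ordered through depend clauses, the host does unrelated work in the meantime, and taskwait waits for both target tasks before the function returns.

```c
/* Hypothetical example: computes x[i] = 2 * x[i] + 1 in two chained
   asynchronous target regions while the host sums 0 .. n-1.  */
long
double_then_increment_async (double *x, int n)
{
  long host_sum = 0;

#pragma omp target map (tofrom: x[0:n]) nowait depend (out: x[0:n])
  for (int i = 0; i < n; i++)
    x[i] *= 2.0;

  /* Starts only after the first target task has finished.  */
#pragma omp target map (tofrom: x[0:n]) nowait depend (inout: x[0:n])
  for (int i = 0; i < n; i++)
    x[i] += 1.0;

  /* Independent host work; it must not touch x until the taskwait.  */
  for (int i = 0; i < n; i++)
    host_sum += i;

#pragma omp taskwait
  return host_sum;
}
```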

The target construct is now treated as an implicit task region, and the new target enter data and target exit data constructs were added. This means that the mapping and un-mapping of variables can now be also performed synchronously or asynchronously, and in separate functions or methods - e.g. it is possible to enter data in C++ constructor and exit data in C++ destructor.
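The split between mapping and un-mapping can be sketched in plain C with three separate functions, playing the roles a constructor and destructor would play in C++. All function names here are hypothetical.

```c
/* Hypothetical "constructor": copy buf[0:n] to the device and keep it
   mapped after the function returns.  */
void
buffer_to_device (double *buf, int n)
{
#pragma omp target enter data map (to: buf[0:n])
}

/* Hypothetical "destructor": copy the data back and un-map it.  */
void
buffer_from_device (double *buf, int n)
{
#pragma omp target exit data map (from: buf[0:n])
}

/* Works on the device copy; assumes buffer_to_device was called
   beforehand, since buf is only translated, not mapped, here.  */
void
increment_on_device (double *buf, int n)
{
#pragma omp target
  for (int i = 0; i < n; i++)
    buf[i] += 1.0;
}
```

A caller would invoke buffer_to_device once, run any number of target regions such as increment_on_device, and finish with buffer_from_device.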

The declare target directive has been extended, and it is now possible to mark global variables (e.g. large arrays) for deferred mapping. Device memory routines have been added for explicit allocation, de-allocation and memory transfers between host and offloading device, and clauses for interaction with native device implementations have been added: use_device_ptr clause to target data construct and is_device_ptr clause to target construct.

GCC 6 currently supports offloading to Intel Xeon Phi Knights Landing, some OpenMP 4.5 target constructs can be offloaded to AMD HSA GPGPUs, and OpenMP offloading to NVIDIA PTX is in the works (OpenACC offloading to NVIDIA PTX is already supported).

Miscellaneous changes

Some other smaller, miscellaneous changes in OpenMP 4.5 include:

Fortran 2003 support has been improved.

The hint clause on critical construct and new hinted lock API routines were added to allow applications to inform the runtime about contention and desirability of speculation.
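A sketch of the hint clause on a critical construct follows. The function count_even and the critical section name even_count are made up. The constant omp_lock_hint_contended comes from omp.h, and a critical construct that carries a hint needs a name. A reduction would be the more natural tool for this particular loop; the critical section is used only to show the syntax.

```c
#ifdef _OPENMP
#include <omp.h>   /* provides omp_lock_hint_contended */
#endif

/* Hypothetical example: counts the even elements of v, telling the
   runtime that the critical section is expected to be contended.  */
long
count_even (const int *v, int n)
{
  long count = 0;
#pragma omp parallel for
  for (int i = 0; i < n; i++)
    if (v[i] % 2 == 0)
      {
#pragma omp critical (even_count) hint (omp_lock_hint_contended)
        count++;
      }
  return count;
}
```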

The if clause has been extended so that when using combined and composite constructs it is now possible to specify to which construct(s) the if clause applies, and it is possible to specify different if clauses for different constructs from which the combined or composite construct is composed.
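For instance, on the combined target parallel for construct, separate if clauses might be written as below. The function add_one and both thresholds are arbitrary assumptions chosen for the sketch: small inputs stay on the host, and very small inputs are not parallelized at all.

```c
/* Hypothetical example: offload only large arrays, and create a
   parallel team only when there is enough work for it.  */
void
add_one (float *v, int n)
{
#pragma omp target parallel for map (tofrom: v[0:n]) \
    if (target: n > 100000) if (parallel: n > 1000)
  for (int i = 0; i < n; i++)
    v[i] += 1.0f;
}
```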

In #pragma omp declare simd directive's linear clauses it is now possible to specify through val, uval and ref modifiers whether references are linear, or whether the referenced variable values are linear.

Query functions for thread affinity have been added.

OpenMP 4.5 also contains various other smaller changes, clarifications and bug fixes, too many to list here; this post summarizes some of the more notable ones.

Trying it out

The OpenMP 4.5 standard is available from https://openmp.org/wp-content/uploads/openmp-4.5.pdf, and it is going to be supported in the upcoming GCC 6 (for C and C++ only).

Fortran support remains at the OpenMP 4.0 level in GCC 6, which is scheduled to be released around April this year.

Last updated: March 15, 2023