OpenMP 4.0 support in Developer Toolset 3 Beta — Parallel programming extensions for today’s architectures

In this article, we’ll take a look at OpenMP 4.0, the latest version of the OpenMP parallel programming extensions for C, C++ and Fortran. OpenMP 4.0 support is available out of the box in GCC v4.9.1, which Red Hat Enterprise Linux developers can obtain via Red Hat Developer Toolset v3.0 (currently at beta release).

For a thorough backgrounder in parallelism and concurrency programming concepts, see Torvald Riegel’s earlier articles (part 1 and part 2). In this article, we’ll instead dig into the nuts and bolts of what OpenMP v4 provides to developers, and how it works in practice in GCC.

OpenMP v4.0

The OpenMP 4.0 standard was released in July 2013 and includes various enhancements compared to the OpenMP v3.1 support shipped in RHEL 7’s system compiler and in Developer Toolset v2.1 and earlier. These enhancements include SIMD constructs, device constructs, enhanced CPU affinity support, task dependencies, task groups, user-defined reductions, construct cancellation and various other smaller changes. We’ll talk about each of these enhancements in turn below.

As with older versions of OpenMP, to enable OpenMP support in GCC one should use the -fopenmp compiler option during both compilation and linking.

SIMD

SIMD constructs were added to help the compiler vectorize performance critical loops. For example, in the following testcase:

int foo (int *p, int *q) {
  int i, r = 0;
  #pragma omp simd reduction(+:r) aligned(p,q:32)
  for (i = 0; i < 1024; i++) {
    p[i] = q[i] * 2;
    r += p[i];
  }
  return r;
}

the new pragma directive tells the compiler that there are no loop-carried lexical backward data dependencies that would prevent vectorization, hints that both the “p” and “q” pointers are 32-byte aligned, and requests that the “r” variable be privatized and used to compute a reduction: each SIMD lane computes its own partial sum, and at the end those partial sums are combined.

The SIMD constructs can be combined with various other constructs, so a loop can, for example, be parallelized and vectorized at the same time, and certain functions can be declared with an additional pragma to request the creation of extra version(s) that process multiple arguments simultaneously:

#pragma omp declare simd simdlen(8) notinbranch uniform(y)
int bar (int x, int y) { return x * y; }
int foo (int *p, int *q) {
  int i, r = 0;
  #pragma omp parallel for simd reduction(+:r) aligned(p, q:32) schedule(static, 32)
  for (i = 0; i < 1024; i++) {
    p[i] = bar (q[i], 2);
    r += p[i];
  }
  return r;
}

In the above example, for the i?86/x86_64 architectures, GCC 4.9 creates three extra versions of bar: one for SSE2, one for AVX and one for AVX2. Each can process 8 “x” values in one call, passed in one or more vector registers; “y” is passed in as a scalar, and the return value is again a vector. The combined construct parallelizes the loop in 32-iteration chunks spread across CPU threads, and each chunk is then vectorized.

Device Constructs

Device constructs allow offloading of certain regions of code to specialized accelerator devices. In GCC 4.9, the OpenMP 4.0 device constructs are recognized, but no accelerator devices are supported yet, so those regions are executed using host fallback on the host CPU. Work is underway in the upstream GCC project, targeting GCC 5, to support offloading to Intel MIC accelerator cards, NVidia PTX and eventually AMD HSA as well.

CPU affinity

GCC 4.8 and earlier supported CPU affinity to some extent, e.g. through the GOMP_CPU_AFFINITY environment variable and the boolean OMP_PROC_BIND environment variable. GCC 4.9 adds the much more precise OpenMP 4.0 affinity model: OMP_PROC_BIND now accepts a binding policy, and the new OMP_PLACES environment variable allows description of the CPU topology.
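For instance (a hedged sketch, assuming a machine with eight logical CPUs numbered 0–7), the following environment settings describe four places of two hardware threads each, spread the outer parallel region’s threads across those places, and keep any nested threads close to their parent:

```
OMP_PLACES="{0,1},{2,3},{4,5},{6,7}"
OMP_PROC_BIND="spread,close"
```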

Task Dependencies and Groups

Tasks in OpenMP 4.0 have been enhanced so that it is possible to describe dependencies between child tasks of the same parent task, e.g. where a variable is shared by a number of tasks and one of those tasks needs to wait until all tasks writing to that variable are complete. Tasks can also be grouped into task groups, where the end of the task group region waits for all the tasks from the task group to complete.

subroutine dep
  integer :: x
  x = 1
  !$omp parallel
    !$omp single
      !$omp taskgroup
        !$omp task shared (x) depend(out: x)
          x = 2
        !$omp end task
        !$omp task shared (x) depend(in: x)
          if (x.ne.2) call abort
        !$omp end task
        !$omp task shared (x) depend(in: x)
          if (x.ne.2) call abort
        !$omp end task
      !$omp end taskgroup
    !$omp end single
  !$omp end parallel
end subroutine dep

Here, the first task is a writer to x; the other two tasks can’t be scheduled until it completes, but can then run simultaneously.

User-Defined Reductions

In OpenMP 3.0, only basic arithmetic reductions were possible in C/C++ (Fortran additionally allowed a couple of intrinsics). OpenMP 3.1 added support for min and max intrinsic reductions for C/C++ developers. In OpenMP 4.0, however, users can define their own reductions, for both arithmetic types and classes or structures, by specifying a combiner operation as well as, optionally, an initializer operation. For example:

struct S
{
  int s;
  void foo (S &x) { s += x.s; }
  S (const S &x) { s = 0; }
  S () { s = 0; }
  ~S ();
};

#pragma omp declare reduction (foo: S: omp_out.foo (omp_in)) initializer (omp_priv (omp_orig))

defines a user-defined foo reduction on class S. When this is used as:

int bar ()
{
  S s;
  #pragma omp parallel for reduction (foo: s)
  for (int i = 0; i < 64; i++)
    s.s += i;
  return s.s;
}

each thread will have its own private S object and perform its partial sum on it; at the end, the foo method will be called on the original “s” variable once per thread, with a reference to that thread’s private copy. User-defined reductions may, of course, also be used together with SIMD constructs or device constructs.

Construct Cancellation

Some constructs – parallel, for, taskgroup and sections – can be cancelled in OpenMP 4.0, as long as the cancellation construct is lexically within the construct being cancelled and a few other conditions are met. As C++ exceptions must not be thrown through the OpenMP constructs, this can sometimes be useful to avoid doing unnecessary work once some exception has been raised and caught in the region. As an alternative example, when using tasks to search for something, if a particular task succeeds in finding the one required result, it is possible to cancel the entire taskgroup. Here’s an example using C++ exceptions:

void foo () {
  const std::exception *exc = NULL;
  #pragma omp parallel shared(exc)
  {
    #pragma omp for
    for (int i = 0; i < N; i++) {
      #pragma omp cancellation point for
      try { something_that_might_throw (); }
      catch (const std::exception *e) {
        #pragma omp atomic write
        exc = e;
        #pragma omp cancel for
      }
    }
    if (exc) {
      #pragma omp cancel parallel
    }
  }
  if (exc) {
    // throw exc.
  }
}

In this case, exceptions are caught in the loop construct and stored atomically into a shared variable, and the throwing thread then continues to the wait at the end of the loop construct. Other threads continue executing something_that_might_throw() until that call returns. Upon starting the next iteration, however, the cancellation point construct tells them to bypass the rest of the iterations of the worksharing construct.

Wrap-Up

That completes this brief walk through the major new OpenMP features Red Hat Enterprise Linux developers can find in Red Hat Developer Toolset 3.0 Beta. We’re always happy to receive your feedback and questions, so feel free to add a comment or drop us an email or tweet!

 
