MPI plus OpenMP Tasks / OmpSs

This Resource Pack helps developers create efficient, capability-scale applications using MPI combined with either OpenMP Tasks or OmpSs, whether to enhance existing (perhaps MPI-only) software or to start new projects.

This page has the following sections:

  1. Motivation and Strategy
  2. Industrial Relevance
  3. Best Practice Guide
  4. Applications and Kernels
  5. Resource Pack

Motivation and Strategy

OmpSs (https://pm.bsc.es/ompss) is a task-based parallel programming model developed at the Barcelona Supercomputing Center (BSC). OmpSs also serves as a test bench for improving the OpenMP tasking model: in particular, it extends OpenMP with new directives, clauses and semantics to support asynchronous parallelism.

There are conceptual issues with calling MPI functions from OmpSs (or OpenMP) tasks. They stem from the asynchronous execution of tasks by the runtime, which contrasts with the order of operations assumed by MPI. For example, if one task executes a (blocking) MPI_Send and another task the matching (blocking) MPI_Recv, deadlocks can easily arise if the tasks are executed in a different order than intended. Moreover, such deadlocks may appear only intermittently, since the execution order can change from one run to another.
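
The following fragment is a minimal sketch of this hazard (the two-rank exchange, the buffer names and the message size are illustrative, and MPI_THREAD_MULTIPLE support is assumed):

    #include <mpi.h>

    /* Each rank creates one sending and one receiving task.  If, on both
     * ranks, the send task happens to run (and block) before the receive
     * task, and the message is large enough to use a rendezvous protocol,
     * the exchange can deadlock, but only for some task schedules.        */
    void exchange(double *sendbuf, double *recvbuf, int n, int peer)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task   /* may run before or after the receive task */
            MPI_Send(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);

            #pragma omp task
            MPI_Recv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

            #pragma omp taskwait
        }
    }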

MPI offers a way around these issues through the non-blocking variants of these functions, MPI_Isend and MPI_Irecv. However, the asynchrony of task-based execution also comes into play: since the actual communication can complete after the issuing task has already finished, the communication buffer may by then have been deallocated by the task or overwritten by other operations.
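
As a small illustration of this second hazard (the peer rank and message size are placeholders), the following task returns before its non-blocking send completes, leaving MPI with a dangling buffer:

    #include <mpi.h>

    /* Illustrative only; do NOT do this.  The task-local buffer goes out of
     * scope when the task ends, while the message may still be in flight.   */
    void broken_isend(int peer)
    {
        #pragma omp task
        {
            double buf[256];                             /* task-local buffer */
            for (int i = 0; i < 256; ++i) buf[i] = i;    /* fill the message  */
            MPI_Request req;
            MPI_Isend(buf, 256, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
            /* No MPI_Wait before the task ends: undefined behaviour. */
        }
    }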

A crude way to resolve these issues is to introduce synchronization points. A more elegant way, natural for task-based programming models, is to add auxiliary data dependencies to the tasks; this enforces finer-grained, point-to-point synchronization. Details of this approach can be found in the INTERTWinE Best Practice Guide on MPI + OmpSs.
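
A minimal sketch of this idea, assuming a dummy sentinel variable and a simple neighbour exchange (all identifiers are illustrative):

    #include <mpi.h>

    static int comm_sentinel;    /* carries no data; used only for ordering */

    /* One communication task per exchange: both transfers are posted
     * non-blocking and completed before the task ends, so the buffers stay
     * valid, while the inout dependence on the sentinel keeps successive
     * communication tasks in program order without a global barrier.       */
    void halo_exchange(double *sendbuf, double *recvbuf, int n, int peer)
    {
        #pragma omp task depend(in: sendbuf[0:n]) depend(out: recvbuf[0:n]) \
                         depend(inout: comm_sentinel)
        {
            MPI_Request req[2];
            MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
    }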

Strategy

Each of these programming models focuses on exploiting a different level of parallelism. MPI provides explicit communication primitives to exploit inter-process parallelism (inter- or intra-node), while OpenMP Tasks / OmpSs exploits parallelism within a process (always intra-node). Mixing MPI and OpenMP Tasks / OmpSs in the same application is a complex undertaking, as features of the two models can easily interact in unexpected ways, resulting in deadlocks, incorrect results, fatal errors or performance issues.

As in the case of MPI + OpenMP Threads, the correct level of thread support must be requested during MPI initialization. In particular, MPI_Init_thread should be used instead of MPI_Init (see the Best Practice Guide for MPI + OpenMP). Currently, MPI and OpenMP Tasks / OmpSs are completely decoupled. For instance, the OmpSs runtime does not know whether an MPI call will block, and likewise, the MPI runtime does not know whether a thread calling an MPI primitive is running inside a task created by the OmpSs runtime.
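
A minimal initialization sketch, assuming that any worker thread may issue MPI calls from inside a task (so the full MPI_THREAD_MULTIPLE level is requested and checked):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Request full thread support and verify what the library grants;
         * not every MPI installation provides MPI_THREAD_MULTIPLE.        */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "Insufficient MPI thread support (got %d)\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... task-parallel application code ... */
        MPI_Finalize();
        return 0;
    }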

To avoid such issues and to make the coupling smooth, we have produced a Best Practice Guide and several real-world test cases implementing this combination.

Although the two programming models do not (at the moment) offer any common services, there are mechanisms and strategies that can be employed to combine them safely, such as enclosing MPI communication primitives inside OpenMP / OmpSs tasks and letting the OpenMP / OmpSs dependency system manage the synchronization between them (see the sketch below). This methodology facilitates the overlap of communication and computation phases, generally improving application performance.
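
The following sketch illustrates the overlap (compute_interior, compute_boundary and the buffer names are hypothetical helpers, not taken from any of the discussed codes): only the boundary update depends on the received halo, so the runtime may execute the interior update while the message is still arriving.

    #include <mpi.h>

    void compute_interior(double *u, int n);                        /* assumed */
    void compute_boundary(double *b, int nb, const double *h, int hn);

    void step(double *halo, int hn, double *interior, int ni,
              double *boundary, int nb, int peer)
    {
        #pragma omp task depend(out: halo[0:hn])
        MPI_Recv(halo, hn, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                 /* communication task   */

        #pragma omp task depend(inout: interior[0:ni])
        compute_interior(interior, ni);              /* overlaps the receive */

        #pragma omp task depend(in: halo[0:hn]) depend(inout: boundary[0:nb])
        compute_boundary(boundary, nb, halo, hn);    /* waits for the halo   */
    }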

Industrial Relevance

MPI and OpenMP are both mature and widely supported approaches to programming parallel computers. They have an open governance model, with involvement from leading HPC vendors. They are widely supported by compilers and by languages such as Fortran, C and C++, and are generally available in HPC and (to a lesser extent) cloud computing environments. Availability on Windows platforms is limited.

OpenMP tasks, included in the standard since version 3.0, have not seen as wide an adoption as the traditional worksharing-based OpenMP model of earlier versions. In fact, most of the OpenMP code found in real applications relies exclusively on the parallelization of loops (data parallelism), rather than on other parallelization techniques such as task parallelism.

OmpSs is a programming model, a forerunner of OpenMP tasks that grew out of the StarSs model at BSC, which aims to extend the OpenMP tasking model with richer support for asynchronous parallelism (tasks). OmpSs also offers a different approach to the use of accelerator devices (such as GPUs and FPGAs), based on leveraging existing native kernels (e.g. CUDA, OpenCL).

The introduction of data dependencies (the 'depend' clause) and of extensions for accelerator devices (the 'target' directive) has extended the potential of OpenMP tasks / OmpSs by increasing the parallelization opportunities and the composability between the different phases of a program.
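
A brief sketch of these two extensions working together (the array and the scaling operation are illustrative; if no device is available the target region falls back to the host):

    /* A dependent host task produces the array; a deferred target task
     * (depend + nowait) then consumes it on the accelerator, letting the
     * runtime compose the host and device phases through dependencies.   */
    void scale_on_device(double *a, int n)
    {
        #pragma omp task depend(out: a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = (double)i;                          /* produce on the host   */

        #pragma omp target teams distribute parallel for \
                map(tofrom: a[0:n]) depend(inout: a[0:n]) nowait
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0;                               /* consume on the device */

        #pragma omp taskwait                           /* wait for both tasks   */
    }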

Best Practice Guide

The combination of MPI and OpenMP Tasks / OmpSs can form an effective symbiosis in the decomposition of parallel algorithms, allowing computation and communication phases to overlap in a simple and efficient way (see the figure below). However, as the interaction between these two programming models is still relatively new, we provide guidance in the INTERTWinE Best Practice Guide, together with examples of various kernels and applications, on how to facilitate and benefit from this coupling.

Figure: Computation and communication overlap using a double-buffer technique. In this scenario both phases may be implemented with tasks.
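
A minimal sketch of the double-buffer pattern from the figure, assuming hypothetical exchange_halo and compute routines and the sentinel variable introduced earlier: while one buffer is being computed on, the other is being exchanged, and the roles swap each iteration.

    void exchange_halo(double *buf, int n, int peer);   /* assumed MPI halo swap */
    void compute(double *buf, int n);                   /* assumed local update  */
    static int comm_sentinel;                           /* orders comm. tasks    */

    /* The depend clauses refer to the buffer addresses, so the runtime
     * serializes tasks that touch the same buffer and overlaps the rest.   */
    void time_loop(double *bufA, double *bufB, int n, int peer, int steps)
    {
        double *work = bufA;     /* buffer currently being computed on */
        double *comm = bufB;     /* buffer currently being exchanged   */

        for (int it = 0; it < steps; ++it) {
            #pragma omp task depend(inout: comm[0:n]) depend(inout: comm_sentinel)
            exchange_halo(comm, n, peer);                /* communication phase */

            #pragma omp task depend(inout: work[0:n])
            compute(work, n);                            /* computation phase   */

            double *tmp = work; work = comm; comm = tmp; /* swap roles */
        }
        #pragma omp taskwait
    }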

Applications and Kernels

The INTERTWinE team has ported several real-world applications and kernels to illustrate good practice for MPI plus OmpSs / OpenMP Tasks, along with developers' commentary:

  1. The TAU kernel ‘linsolv’ is an example of a very typical approach to hybridisation: adding OpenMP tasks to computational kernels running in different MPI processes. Various strategies with single and parallel task producers, as well as taskloops with different task granularities, have been evaluated. For the conducted tests, the hybrid MPI + OpenMP parallelization yields weaker performance than the pure MPI parallelization, and no clear advantage of using explicit tasks instead of threads could be observed. It is anticipated that for more complex applications or much larger test cases the hybrid parallelization strategy could be more advantageous. [TAU]
  2. The BAR N-Body benchmark approximates the evolution of a system of bodies in which each body continuously interacts with every other body, combining OmpSs tasks with MPI. [BAR N-Body]
  3. In the Ludwig application, we have attempted to reduce the amount of synchronization required within the halo-exchange procedure, as well as to overlap communication and propagation in this phase, using OpenMP tasks / OmpSs with the MPI_THREAD_MULTIPLE mode. This approach is beneficial for large-scale computations, but at the price of increased application complexity, since significantly more communication requests must be handled. [Ludwig (OMP Tasks)] [Ludwig (OmpSs)]
  4. The iPIC3D code employs OpenMP tasks in the multithreaded MPI + OpenMP Threads version, aiming to reduce the amount of synchronization in the halo exchange and to overlap the communication of particles with the computation of their trajectories. This combination brings some performance overhead when few nodes are used; we have investigated this issue and prepared a performance analysis report. However, the MPI + OpenMP Tasks combination reduces the total execution time in the field-dominant regime for large-scale calculations. As in Ludwig, this combination also raises a concern regarding the increased complexity of the code. [iPIC3D Guide] [iPIC3D Report]
  5. In GraphBLAS, we study the possibility of accommodating OpenMP / OmpSs tasks in the MPI version of the Preconditioned Conjugate Gradient (PCG) solver, accelerated with a preconditioner based on an incomplete LU factorization (ILU0). Additionally, we determine the best-performing strategy for the number of leaves per task in the Task Dependency Graph (TDG). Both the MPI + OpenMP Tasks and MPI + OmpSs variants consistently outperform the pure MPI version. [GraphBLAS (OMP Tasks)] [GraphBLAS (OmpSs)]

Software to support this Resource Pack can be downloaded from GitHub. [GitHub]

Resource Pack

The INTERTWinE MPI and OpenMP Tasks / OmpSs Resource Pack contains the following:

  1. INTERTWinE Best Practice Guide for programming with MPI and OmpSs.
  2. INTERTWinE developers' commentary on several real-world software applications, to illustrate good practice for MPI plus OpenMP Tasks / OmpSs:
    1. The TAU kernel ‘linsolv’ implements several (iterative) methods to find an approximate solution of the linear system Ax=b, where A is a sparse block matrix [Guide, Source Code]
    2. The BAR N-Body benchmark approximates the evolution of a system of bodies in which each body continuously interacts with every other body, combining OmpSs tasks with MPI [Source Code]
    3. Ludwig, a versatile code for the simulation of Lattice-Boltzmann (LB) models in 3D on cubic lattices [Guide (OMP Tasks), Source Code] [Guide (OmpSs), Source Code]
    4. iPIC3D, a Particle-in-Cell (PIC) code for the simulation of space plasmas in space weather applications, such as the interaction between the solar wind and the Earth’s magnetic field [Guide, Report, Source Code]
    5. GraphBLAS, a suite of iterative solvers for sparse linear systems [Guide (OMP Tasks), Source Code] [Guide (OmpSs), Source Code]

For more details, please consult our deliverables:

  1. D5.4 Final report on application/kernel plans, evaluations and benchmark suite releases
  2. D5.3 Performance evaluation report
  3. D5.2 Interim report on application/kernel plans, evaluations and benchmark suite releases