Communication Libraries plus OpenMP Threads

This Resource Pack helps developers create efficient, capability-scale applications using MPI/GASPI together with OpenMP Threads, either to enhance existing (perhaps MPI-only) software or to support new projects.

This page has the following sections: Motivation and Strategy, Industrial Relevance, Best Practice Guide, Tutorials, Example codes, Utilities, Applications / kernels, Benchmarks, and Resource pack.

Motivation and Strategy

Combining MPI (http://mpi-forum.org/) or GASPI (http://gaspi.de) for inter-node communication with OpenMP (http://www.openmp.org/) for intra-node shared-memory parallelism is standard practice in hybrid programming on large HPC systems. There are two main motivations for combining these programming models:

  1. Reduction in memory footprint, both in the application and in the MPI library (e.g. communication buffers).
  2. Improved performance, especially at high core counts where the scalability of pure MPI runs out.

Strategy

A combination of MPI and OpenMP Threads is used for two main reasons, both of which may be important for a given application code. First, to reduce memory requirements (compared with MPI alone), since data structures that are replicated between MPI processes need only be replicated once (or a small number of times) per node, rather than once per core. Second, to improve performance at high core counts, where the scalability of pure MPI deteriorates.

Combining MPI and OpenMP Threads can produce better performance in a number of ways, for example through better load balance, reduced communication, or a relatively straightforward exploitation of additional levels of parallelism.

However, MPI and OpenMP Threads represent distinct parallelisation strategies, and integrating the two efficiently can be challenging. Furthermore, a simplistic integration of MPI and OpenMP Threads can introduce additional synchronisation points into an application, which in turn inhibit performance and scalability.

Many applications use the simplest style of MPI and OpenMP Threads programming, in which all MPI library calls are made outside of OpenMP parallel regions. As the number of cores per node continues to increase, this approach may limit scalability with increasing numbers of OpenMP threads, so exploring other strategies, in which computation and communication take place simultaneously in different OpenMP threads, is likely to benefit these applications on future hardware platforms.
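
As a rough illustration of such a strategy, the sketch below shows a ‘funneled’ time step in C: the primary thread performs the halo communication inside the parallel region while the other threads update interior points that do not depend on the halo data. The routine names and the interior/boundary split are illustrative assumptions rather than code from any of the applications discussed on this page.

    #include <mpi.h>
    #include <omp.h>

    /* Illustrative application routines; names and the interior/boundary
       split are assumptions for this sketch, error handling omitted.      */
    extern int n_interior, n_boundary;
    void post_and_wait_halo_exchange(void);  /* MPI_Isend/Irecv/Waitall on the halos */
    void update_point(int i);                /* stencil update of one grid point     */

    void timestep(void)
    {
        #pragma omp parallel
        {
            /* Funneled style: only the primary thread calls MPI, but it does
               so inside the parallel region (needs MPI_THREAD_FUNNELED).     */
            #pragma omp master
            post_and_wait_halo_exchange();

            /* Meanwhile the other threads update interior points, which do not
               depend on halo data; 'nowait' removes the implicit barrier so
               nobody waits for the primary thread at the end of this loop.    */
            #pragma omp for schedule(static) nowait
            for (int i = 0; i < n_interior; ++i)
                update_point(i);

            /* Both the communication and the interior updates must be done. */
            #pragma omp barrier

            /* Boundary points that depend on the freshly received halos. */
            #pragma omp for schedule(static)
            for (int i = n_interior; i < n_interior + n_boundary; ++i)
                update_point(i);
        }
    }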

Since the GASPI programming model traditionally targets multi-threaded or task-based applications, we take here the rather classic approach of replacing MPI communication with GASPI communication and study its effects on the example of the TAU linsolv kernel.
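
As a rough indication of what such a replacement involves, the following sketch (written against the GPI-2 implementation of GASPI) turns a two-sided exchange into a one-sided gaspi_write_notify on the sending side and a gaspi_notify_waitsome on the receiving side. It is a minimal sketch rather than code from the TAU linsolv kernel: the segment, queue and notification identifiers are arbitrary choices, and initialisation, segment creation and error checking are left out.

    #include <GASPI.h>

    /* One-sided exchange sketch: rank 0 writes 'size' bytes from its segment
       into rank 1's segment and attaches a notification; rank 1 waits for the
       notification instead of posting a matching receive.  GASPI initialisation
       and segment creation are assumed to have been done already; segment ID,
       offsets, queue and notification ID are illustrative, and error checking
       of the return codes is omitted.                                          */
    void one_sided_exchange(gaspi_rank_t rank, gaspi_size_t size)
    {
        const gaspi_segment_id_t      seg = 0;
        const gaspi_queue_id_t        q   = 0;
        const gaspi_notification_id_t nid = 0;

        if (rank == 0) {
            /* Remote write plus notification (remote-completion signal). */
            gaspi_write_notify(seg, 0,      /* local segment and offset  */
                               1,           /* destination rank          */
                               seg, 0,      /* remote segment and offset */
                               size,
                               nid, 1,      /* notification id and value */
                               q, GASPI_BLOCK);
            gaspi_wait(q, GASPI_BLOCK);     /* local completion of the queue */
        } else if (rank == 1) {
            gaspi_notification_id_t first;
            gaspi_notification_t    old;
            /* Block until the notification arrives, then reset it. */
            gaspi_notify_waitsome(seg, nid, 1, &first, GASPI_BLOCK);
            gaspi_notify_reset(seg, first, &old);
            /* The data now sits in the local segment at offset 0. */
        }
    }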

Industrial Relevance

MPI and OpenMP are both mature and widely supported approaches to programming parallel computers. They have an open governance model, with involvement from leading HPC vendors. They are widely supported by compilers and by languages such as Fortran, C and C++, and are generally available on HPC systems and (to a lesser extent) cloud computing environments. Availability on Windows platforms is limited.

The combination of MPI and OpenMP Threads is already an important feature of many HPC applications used in industry. It is becoming a standard technique for achieving better scalability on modern cluster architectures, which have increasing numbers of CPU cores per node. Industrial application areas that use large-scale HPC facilities, such as structural mechanics, computational fluid dynamics and seismic analysis, are the main domains in which MPI plus OpenMP Threads is currently found. However, exploiting MPI and OpenMP efficiently requires skilled software engineers with a track record of parallel programming on cluster-scale resources.

Best Practice Guide

The Best Practice Guide on MPI + OpenMP by INTERTWinE discusses the motivations for combining MPI and OpenMP in more detail and explains why and how these potential benefits can be realised in application codes. It also discusses the possible downsides of MPI + OpenMP programs, covering software engineering issues and performance pitfalls.

Five different styles of MPI + OpenMP program are discussed in the Best Practice Guide; a sketch of how the corresponding MPI thread-support levels are requested follows the list:

  • Master-only: all MPI communication takes place in the sequential part of the OpenMP program (no MPI in parallel regions).
  • Funneled: all MPI communication takes place through the same (master) thread but can be inside parallel regions.
  • Serialized: MPI calls can be made by any thread, but only one thread makes MPI calls at any one time.
  • Multiple: MPI communication can take place simultaneously in more than one thread.
  • Asynchronous Tasks: MPI communications can take place from every thread, and MPI calls take place inside OpenMP tasks.
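
Which style can be used safely depends on the thread-support level requested from, and granted by, the MPI library: MPI_THREAD_FUNNELED is sufficient for the Master-only and Funneled styles, MPI_THREAD_SERIALIZED for the Serialized style, and MPI_THREAD_MULTIPLE for the Multiple and task-based styles. A minimal sketch of requesting and checking the level in C:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Request the level matching the chosen style:
           MPI_THREAD_FUNNELED   - Master-only and Funneled,
           MPI_THREAD_SERIALIZED - Serialized,
           MPI_THREAD_MULTIPLE   - Multiple and task-based styles. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        /* The library may grant less than requested, so always check. */
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
            MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
        }

        /* ... hybrid MPI + OpenMP work ... */

        MPI_Finalize();
        return 0;
    }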

For the final concept of calling MPI from OpenMP tasks, see also the section on MPI+OmpSs and the Best Practice Guide on MPI+OmpSs by INTERTWinE; most of that discussion readily applies to OpenMP tasks as well.
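
As a rough sketch of this task-based style (assuming MPI_THREAD_MULTIPLE has been granted; the buffer sizes and neighbour table are illustrative assumptions, not taken from the Guide), each neighbour's messages are posted from a separate OpenMP task, so that any idle thread can drive part of the communication:

    #include <mpi.h>
    #include <omp.h>

    /* Illustrative buffers and neighbour table -- assumptions for this sketch. */
    #define NNEIGH 4
    #define HALO   1024
    extern double send_buf[NNEIGH][HALO], recv_buf[NNEIGH][HALO];
    extern int neighbour[NNEIGH];

    /* Needs MPI_THREAD_MULTIPLE, since several tasks may call MPI at once. */
    void exchange_with_tasks(MPI_Comm comm)
    {
        MPI_Request req[2 * NNEIGH];

        #pragma omp parallel
        #pragma omp single
        {
            for (int n = 0; n < NNEIGH; ++n) {
                /* One task per neighbour; each task writes only its own
                   pair of request slots, so no locking is needed.        */
                #pragma omp task firstprivate(n) shared(req)
                {
                    MPI_Irecv(recv_buf[n], HALO, MPI_DOUBLE, neighbour[n], 0,
                              comm, &req[2 * n]);
                    MPI_Isend(send_buf[n], HALO, MPI_DOUBLE, neighbour[n], 0,
                              comm, &req[2 * n + 1]);
                }
            }
            /* Independent computation could be spawned as further tasks here,
               overlapping with the communication tasks.                       */
            #pragma omp taskwait   /* all sends and receives have been posted */
        }

        /* Complete the messages (done here by a single thread for simplicity). */
        MPI_Waitall(2 * NNEIGH, req, MPI_STATUSES_IGNORE);
    }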

For migrating MPI(-only) codes to GASPI, consult the Best Practice Guide on MPI and GASPI by INTERTWinE.

Tutorials

Online tutorial material for hybrid MPI+OpenMP programming:

Example codes

Simple example codes written using MPI+OpenMP:

Utilities

Applications / kernels

The INTERTWinE team has ported several real-world applications and kernels to illustrate good practice for MPI/GASPI plus OpenMP Threads, and has verified the concept of MPI Finepoints. These results are provided along with developers’ commentary:

  1. The TAU kernel ‘linsolv’ is an example of a very typical approach to hybridisation: adding OpenMP Threads to the computational kernels running in different MPI/GASPI processes. Our example shows that there is some overhead (computing time associated with OpenMP Threads), so the return on investment is only realised at larger scale, where the level of inter-process communication is significant and can be reduced by exploiting threaded parallelism on multicore nodes. The case study can be downloaded here [MPI, GASPI].
  2. In the applications Ludwig and iPIC3D we illustrate a less common approach to adding OpenMP Threads to an MPI application, in order to reduce the amount of synchronisation that is required, in this case in the ‘halo swap’ stage of the domain decomposition calculations. The method relies on the MPI_THREAD_MULTIPLE support level to allow threads within each MPI process to collaborate in communications through the MPI library (see the sketch after this list). This technique can lead to significant benefits for large-scale calculations, though the implementation can increase algorithmic complexity, with associated risks. The two case studies can be downloaded here [iPIC3D] [Ludwig].
  3. We expose both the EPCC benchmarks and Ludwig to MPI Finepoints, an attempt to improve thread concurrency within the MPI library. The main concept behind Finepoints is to create multiple actors (threads) that contribute to large MPI operations. MPI Finepoints are designed to be used with persistent, partitioned buffers, in which each thread manages only the ownership of its own part of the buffer, for instance ensuring that its part is valid. This case study is available here [MPI Finepoints].
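
The thread-collaborative halo swap referred to in item 2 can be sketched roughly as follows; the slicing of the halo and the per-thread tag scheme are illustrative assumptions rather than the actual Ludwig or iPIC3D code:

    #include <mpi.h>
    #include <omp.h>

    /* Each OpenMP thread exchanges its own contiguous slice of the halo with
       the given neighbour, so no single thread serialises the communication.
       Requires MPI_THREAD_MULTIPLE, and assumes both ranks run the same number
       of threads so that the per-thread tags match up.                         */
    void threaded_halo_swap(double *send, double *recv, int halo_len,
                            int neighbour, MPI_Comm comm)
    {
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();
            int nthreads = omp_get_num_threads();

            /* Split the halo into one slice per thread. */
            int chunk = (halo_len + nthreads - 1) / nthreads;
            int start = tid * chunk;
            int count = 0;
            if (start < halo_len)
                count = (start + chunk <= halo_len) ? chunk : halo_len - start;

            if (count > 0) {
                MPI_Request req[2];
                /* Distinct tag per thread so the slices match between ranks. */
                MPI_Irecv(recv + start, count, MPI_DOUBLE, neighbour, tid,
                          comm, &req[0]);
                MPI_Isend(send + start, count, MPI_DOUBLE, neighbour, tid,
                          comm, &req[1]);
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            }
        }   /* implicit barrier: the whole halo has been swapped */
    }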

Source code to support this resource pack can be downloaded from GitHub (MPI) and GitHub (GASPI).

Benchmarks

Links to benchmark codes that can be used to assess the performance of hybrid MPI+OpenMP programs:

Resource pack

The INTERTWinE MPI and OpenMP Threads Resource Pack contains the following:

  1. INTERTWinE Best Practice Guide for programming with MPI and OpenMP Threads.
  2. INTERTWinE developers' commentary on several real-world software applications, to illustrate good practice for MPI plus OpenMP Threads:
    1. The TAU kernel ‘linsolv’ implements several (iterative) methods to find an approximate solution of the linear system Ax = b, where A is a sparse block matrix [Guide (MPI), Source Code (MPI)], [Guide (GASPI), Source Code (GASPI)].
    2. Ludwig, a versatile code for the simulation of Lattice-Boltzmann (LB) models in 3D on cubic lattices [Guide, Source Code].
    3. iPIC3D, a Particle-in-Cell (PIC) code for the simulation of space plasmas in space weather applications, in particular the interaction between the solar wind and the Earth’s magnetic field [Guide, Source Code].
    4. MPI Finepoints study using both the EPCC benchmarks and Ludwig. Code is available upon request. [Guide]

For more details, please consult our deliverables:

  1. D5.4 Final report on application/kernel plans, evaluations and benchmark suite releases
  2. D5.3 Performance evaluation report
  3. D5.2 Interim report on application/kernel plans, evaluations and benchmark suite releases

Last updated: 08 Nov 2018 at 18:02