Resources for developers and application users
Hybrid parallel programming
Combining two or more parallel programming models within a single application is known as hybrid parallel programming. While combining MPI (http://mpi-forum.org/) with basic OpenMP (http://www.openmp.org/) is quite common practice nowadays, the INTERTWinE project extends the concept of hybrid programming to other programming models. The work ranges from improving the interfaces of individual programming models, through best practice guides and courses on the current state of the art, to developing software components that improve interoperability, such as the Directory/Cache service and the Resource Manager.
Below, you can find introductory advice on combining some of the programming models, with links to further reading.
MPI + OpenMP
Combining MPI (http://mpi-forum.org/), for inter-node communication, with OpenMP (http://www.openmp.org/), for intra-node shared-memory parallelism, is quite standard practice in hybrid programming on large HPC systems. There are two main motivations for this combination of programming models:
- Reduction in memory footprint, both in the application and in the MPI library (e.g. communication buffers).
- Improved performance, especially at high core counts where pure MPI scalability runs out.
A program that intends to use multiple threads together with MPI should call MPI_Init_thread instead of MPI_Init to initialise the MPI library. The call to MPI_Init_thread requests one of the following levels of thread support:
• MPI_THREAD_SINGLE – Only one thread will execute.
• MPI_THREAD_FUNNELED – The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are funneled to the main thread).
• MPI_THREAD_SERIALIZED – The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are serialized).
• MPI_THREAD_MULTIPLE – Multiple threads may call MPI, with no restrictions.
Although some MPI library implementations, or older installations of them, do not support the full thread safety implied by MPI_THREAD_MULTIPLE, the two prevalent MPI code bases from which many implementations are derived, MPICH (https://www.mpich.org/) and Open MPI (https://www.open-mpi.org/), already provide good support for this feature.
The Best Practice Guide on MPI + OpenMP by INTERTWinE discusses the motivations for combining MPI and OpenMP in more detail and explains why and how these potential benefits can be realized in application codes. It also discusses the possible downsides of MPI + OpenMP programs, covering software engineering issues and performance pitfalls.
Five different styles of MPI + OpenMP program discussed in the Best Practice Guide are:
- Master-only: all MPI communication takes place in the sequential part of the OpenMP program (no MPI in parallel regions).
- Funneled: all MPI communication takes place through the same (master) thread but can be inside parallel regions.
- Serialized: MPI calls can be made by any thread, but only one thread makes MPI calls at any one time.
- Multiple: MPI communication can take place simultaneously in more than one thread.
- Asynchronous Tasks: MPI calls are made from inside OpenMP tasks and can therefore take place from any thread.
For the final style, calling MPI from OpenMP tasks, see also the section on MPI + OmpSs and the Best Practice Guide on MPI + OmpSs by INTERTWinE. Most of that discussion applies readily to OpenMP tasks as well.
Links to benchmark codes that can be used to assess the performance of hybrid MPI+OpenMP programs:
- EPCC OpenMP/MPI micro-benchmark suite, https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openmpmpi-micro-benchmark
- HOMB: simple Laplace solver benchmark, https://sourceforge.net/projects/homb/
- ASC Coral benchmarks contain several hybrid applications, https://asc.llnl.gov/CORAL-benchmarks/
- NAS Parallel Benchmarks – multizone versions use hybrid MPI + OpenMP, https://www.nas.nasa.gov/publications/npb.html
- Some of the NERSC Trinity mini-applications are hybrid, http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/
- The PRACE UEABS suite contains several hybrid MPI + OpenMP applications, http://www.prace-ri.eu/ueabs/
Simple example codes written using MPI + OpenMP:
- Jacobi example in Fortran, http://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/using-openmp-with-mpi/fortran-mpi-openmp-example/
- Utility code to test process and thread affinity on Linux systems, https://github.com/olcf/XC30-Training/blob/master/affinity/Xthi.c
Online tutorial material for hybrid MPI+OpenMP programming:
- SC13 Tutorial on hybrid programming, http://www.openmp.org/press-release/sc13-tutorial-hybrid-mpi-openmp-parallel-programming/
- EPCC/INTERTWinE tutorial on Advanced OpenMP, including material on hybrid programming, performance tuning, and tips, tricks and gotchas, http://www.archer.ac.uk/training/course-material/2016/08/160802_AdvOpenMP_Bristol/index.php
- Hybrid programming tutorial and video from CSCS: http://www.speedup.ch/workshops/w39_2010/slides/SpeedupTutorial2010Stringfellow.pdf https://www.youtube.com/watch?v=TiQRPMBBmDs
MPI + OmpSs
OmpSs (https://pm.bsc.es/ompss) is a task-based parallel programming model developed at the Barcelona Supercomputing Center. OmpSs also serves as a test bench for improvements to the OpenMP tasking model. In particular, it extends OpenMP with new directives, clauses and semantics to support asynchronous parallelism.
As in the case of MPI + OpenMP, the correct level of thread support should be requested during MPI initialization. In particular, MPI_Init_thread should be used instead of MPI_Init (see the section MPI + OpenMP).
There are conceptual issues with calling MPI functions from the tasks of OmpSs (or OpenMP). They stem from the asynchronous execution of tasks in the runtime, contrasted with the order of operations that MPI assumes. For example, if one task executes the (blocking) MPI_Send function and another task the (blocking) MPI_Recv function, one can easily arrive at a deadlock if the tasks are executed in a different order than intended. Moreover, such deadlocks may appear only in some runs, since the execution order can change from one run to another.
The way around these issues offered by MPI is to use the non-blocking variants of these functions, MPI_Isend and MPI_Irecv. However, the asynchrony of task-based execution again comes into play: since the actual communication can happen after the task has already finished, one can arrive at situations in which the communication buffer has been deallocated by the task or overwritten by other operations.
A crude way to resolve these issues is to introduce synchronization points. A more elegant way, natural for task-based programming models, is to add auxiliary data dependencies to the tasks, which enforce finer point-to-point synchronization. Details of this approach are given in the Best Practice Guide on MPI + OmpSs by INTERTWinE.
MPI + GASPI
MPI (http://mpi-forum.org/) has been the de facto standard for writing parallel programs for distributed memory systems for more than two decades.
Global Address Space Programming Interface (GASPI, http://www.gaspi.de/) is a modern specification of a compact API for the development of parallel applications, which aims at a paradigm shift from bulk-synchronous two-sided communication patterns towards an asynchronous communication and execution model. GPI-2 (http://www.gpi-site.com/gpi2/) represents an open-source implementation of the GASPI standard.
The GASPI standard promotes the use of one-sided communication, where one side, the initiator, has all the relevant information (what, where from, where to, how much, etc.) for performing the data movement. The benefit of this is decoupling the data movement from the synchronization between processes. It enables the processes to put or get data from remote memory, without engaging the corresponding remote process, or having a synchronization point for every communication request. However, some form of synchronization is still needed in order to allow the remote process to be notified upon the completion of an operation.
GASPI provides so-called weak synchronization primitives, which update a notification on the remote side. The notification semantics is complemented with routines that wait for an update of a single notification or of a set of notifications. GASPI allows for thread-safe handling of notifications, providing an atomic function for resetting a local notification with a given ID (this returns the notification value before the reset). The notification procedures are one-sided and involve only the local process.
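A hedged sketch of this mechanism, using the GPI-2 C API (the segment and notification IDs here are illustrative choices, and a real program would be started on at least two ranks under gaspi_run), might look as follows:

```c
#include <GASPI.h>
#include <stdio.h>

/* Sketch: rank 0 writes data into a segment on rank 1 and attaches
   notification 0; rank 1 waits for that notification and resets it. */
int main(void)
{
    gaspi_rank_t rank;
    const gaspi_segment_id_t seg = 0;

    gaspi_proc_init(GASPI_BLOCK);
    gaspi_proc_rank(&rank);
    gaspi_segment_create(seg, 1 << 20, GASPI_GROUP_ALL,
                         GASPI_BLOCK, GASPI_MEM_INITIALIZED);

    if (rank == 0) {
        /* One call moves the data and sets the remote notification. */
        gaspi_write_notify(seg, 0,          /* local segment, offset        */
                           1, seg, 0,       /* target rank, segment, offset */
                           sizeof(double),  /* size of the transfer         */
                           0, 1,            /* notification ID, value       */
                           0, GASPI_BLOCK); /* queue, timeout               */
        gaspi_wait(0, GASPI_BLOCK);         /* flush the queue              */
    } else if (rank == 1) {
        gaspi_notification_id_t first;
        gaspi_notification_t    old_val;
        /* Wait for notification 0, then reset it atomically;
           the reset returns the value the notification held. */
        gaspi_notify_waitsome(seg, 0, 1, &first, GASPI_BLOCK);
        gaspi_notify_reset(seg, first, &old_val);
        printf("data arrived, notification value %u\n", (unsigned)old_val);
    }

    gaspi_proc_term(GASPI_BLOCK);
    return 0;
}
```

Note how the data transfer and the remote notification are combined in a single gaspi_write_notify call, while rank 1 never participates in the transfer itself, only in the weak synchronization.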
A program that uses GASPI exclusively for its communication is launched with the gaspi_run utility, the GASPI analogue of MPI's mpirun launcher.
GASPI can also be introduced gradually into an existing MPI-based code, moving only certain communication duties to GASPI. In this case, GASPI can inherit the environment from MPI, and the hybrid MPI + GASPI program is launched in the same way as the pure MPI code.
Details on combining GASPI with MPI as well as a case study can be found in the Best Practice Guide on MPI + GASPI by INTERTWinE.