MPI Sessions

The MPI Sessions proposal introduces a new conceptual framework and new operations that permit independent initialisation and use of MPI by multiple concurrent components. The INTERTWinE project has already been instrumental in defining the semantics for this new concept and the new operations in this proposal. Read more about this concept below.

Modularity, closely related to the principle of separation of concerns, is extremely valuable in all forms of engineering, during both the design and the construction of any large, complex project.

In software, modularity encourages us to break large, complex applications into smaller, focused components that interact via well-defined interfaces but are otherwise independent. Each component can then be designed independently and built using whatever methods best suit its purpose. Cross-cutting concerns, that is, anything that creates additional, unwanted dependencies between these components, always cause trouble, and the amount of trouble escalates rapidly as the number of affected components increases.

In high-performance computing (HPC), the de facto standard that enables applications to scale out and use an entire supercomputer is the Message Passing Interface (MPI). However, MPI is itself a cross-cutting concern: every component of a complex application that needs to communicate between nodes in a supercomputer depends on MPI, and every use of MPI depends on shared resources and, ultimately, on a single point of entry to MPI, namely MPI_INIT (or MPI_INIT_THREAD), and on a single object provided by MPI, namely the built-in communicator MPI_COMM_WORLD.
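
To make this single point of entry concrete, here is the conventional (pre-Sessions) usage pattern in C; the example is illustrative and is not taken from the proposal text.

    #include <mpi.h>
    #include <stdio.h>

    /* Conventional MPI usage: one global initialisation call for the whole
     * process and one built-in communicator shared by every component. */
    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* single point of entry */

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* every component implicitly */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* shares this one communicator */
        printf("rank %d of %d\n", rank, size);

        MPI_Finalize();                        /* single, global point of exit */
        return 0;
    }

Any library or component that also needs MPI must coordinate with the main program over who calls MPI_Init and MPI_Finalize, which is exactly the cross-cutting dependency described above.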

Breaking the cross-cutting dependencies introduced by MPI involves permitting independent initialisation of MPI by any thread/task at any time. It also requires stronger isolation and compartmentalisation of resource usage inside the MPI library. Together, these changes make it easier to deal with challenges like mixing programming models and fault tolerance.

Many of the candidate Exascale applications are compositions of separate codes, each of which deals with a different aspect of the overall problem. Different expertise and knowledge are needed for each of these components and each one may need to use the hardware resources differently to achieve optimal performance. For example, some components will want to make use of many threads (and few processes) per shared-memory node, whereas others will work better with many processes (and few, if any, threads). Restricting MPI so that it can only be initialised once means there is only one opportunity to choose the thread support level for the entire application; a compromise must be sought between components that work best when single-threaded and other components that require multi-threaded support for correctness or for optimal performance.
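
As an illustration (again not taken from the proposal), the thread support level is requested once, in a single call, on behalf of the whole application:

    #include <mpi.h>
    #include <stdio.h>

    /* The thread support level is chosen once, at initialisation, for the
     * entire application; every component has to live with the result. */
    int main(int argc, char **argv)
    {
        int provided;

        /* A multi-threaded component needs MPI_THREAD_MULTIPLE ...          */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        /* ... but a single-threaded component would often prefer a lower
         * level such as MPI_THREAD_SINGLE, which typically performs better.
         * Only one of them can have its way.                                */
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "multi-threaded components cannot run safely\n");
        }

        MPI_Finalize();
        return 0;
    }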

It is widely expected that reliability and fault tolerance will be major issues for Exascale systems. Isolating and compartmentalising resource usage can lead to the creation of containment domains for faults, allowing partial failure with well-defined boundaries for knock-on or cascading effects. This is potentially sufficient for continued operation in fallback-style fault tolerance approaches. Independent initialisation of MPI allows the application to obtain, initialise, and connect additional replacement resources, if desired, enabling run-through-stabilisation-style fault tolerance schemes.

The MPI Sessions Proposal

The general usage model for an MPI Session is as follows:

  • Obtain a Session handle using MPI_Session_init, which is a local operation that initiates a link to the local resource manager;
  • Use the Session handle to query the runtime using MPI_Session_get_names, which returns an array of string names representing the available process sets;
  • Use the Session handle and a process set name to create an MPI_Group using MPI_Group_create_from_session;
  • Use the MPI_Group to create an MPI communicator using MPI_Comm_create_from_group, or an MPI Window using MPI_Win_create_from_group, or an MPI File using MPI_File_open_from_group.
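
A minimal sketch of this flow in C is shown below. The function names are those listed above; the argument lists are assumptions made purely for illustration, since the proposal was still being refined at the time of writing and the exact signatures may differ.

    #include <mpi.h>

    /* Sketch of the MPI Sessions usage model described above.
     * NOTE: argument lists are illustrative assumptions, not a final API. */
    void component_setup(void)
    {
        MPI_Session session;
        MPI_Group   group;
        MPI_Comm    comm;
        char      **set_names;
        int         n_sets;

        /* 1. Obtain a Session handle (a local operation). */
        MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

        /* 2. Query the runtime for the names of the available process sets. */
        MPI_Session_get_names(session, &set_names, &n_sets);

        /* 3. Create an MPI_Group from a chosen process set name. */
        MPI_Group_create_from_session(session, set_names[0], &group);

        /* 4. Create a communicator from the group; windows and files can be
         *    created analogously with MPI_Win_create_from_group and
         *    MPI_File_open_from_group. */
        MPI_Comm_create_from_group(group, "my-component", MPI_INFO_NULL,
                                   MPI_ERRORS_RETURN, &comm);
    }

Each component of an application can execute this sequence independently, at any time, without relying on MPI_INIT or on MPI_COMM_WORLD.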

Additional work in this area, as part of this project, will focus on creating a proof-of-concept implementation of the new MPI Sessions functions and on describing other potential use cases in more detail.

Our work on Tasks and persistent collectives has also been presented to the MPI Forum.
