Tasks and persistent collectives
The INTERTWinE project has been instrumental in creating the MPI Forum proposal for persistent collective operations. Read more about this work below.
In task-based runtimes, computational work is modelled as discrete tasks and data is modelled as dependencies between tasks. In shared-memory, tasks are implicitly supplied with all the data locally available to their execution context and location, in addition to the explicit task dependencies. For example, tasks can access global variables held in the context of the process in which they are executed. This fact simplifies the scheduling of tasks within shared-memory nodes, because no data needs to be moved from the location of one task to the location of another.
Extending task-based runtimes from being limited to one shared-memory node, so that they can take advantage of multiple nodes, can be done in two ways. One way is to build a distributed-memory task scheduler that automatically moves data between nodes, according to the task's dependencies, while taking account of a deep understanding of the costs and benefits of each possible scheduling choice. Moving data takes time and hardware resources; to include a data-movement cost function in every task scheduling decision would be complex and time-consuming, adding to the overhead of the task-based runtime. Another way is to permit shared-memory-only tasks to communicate with remote tasks on other nodes, exactly the way traditional MPI processes communicate with each other.
This second method, allowing tasks to communicate, has profound implications on both programming models. Task-based runtimes implicitly assume that their tasks are, in some sense, “pure functions” - with few or no side-effects and free from behaviours like blocking local execution waiting for a remote action or state-change. However, MPI communication can, and often will, block local execution in many cases - in fact, all “non-local” operations depend on a remote state or a remote action. On the other hand, MPI implicitly assumes a fixed number of communication endpoints with fixed locations in the network fabric. However, tasks can be created and destroyed throughout the lifetime of the program.
The first of these conflicts, between task-based and MPI-based programming, requires that all MPI communication from within tasks must be non-blocking, including initialising, starting and completing each operation. Without modifying MPI, this prevents the use of MPI_WAIT (and its variants), as well as all single-sided synchronisation options. Fortunately, MPI defines MPI_TEST (and variants) as well as non-blocking and persistent versions of all point-to-point functions and non-blocking versions of all collective functions and most file I/O functions. The MPI Forum is also considering proposals for persistent collective operations, and non-blocking creation functions for communicator and file objects in MPI.
The second of these conflicts requires that the destination “location” for each MPI communication is known at the latest when the MPI function is called inside a task, even if that is before the task-based runtime has scheduled the other task that will participate in the communication. This restricts the usage of MPI communication within tasks to situations where the communication pattern is known in advance. The topology of where the other task will be executed must be given to MPI so that the message can be correctly delivered. This early-binding lends itself to persistent communication in MPI.
Persistent communication in MPI involves initialising an operation once, then starting and completing that operation multiple times, and finally destroying the operation once at the end to tidy up resources. The initialisation step (and the expectation of subsequent repeated re-use) allows MPI to plan ahead and use communication resources more efficiently. When MPI_TEST is used for completion, all of the component steps in persistent communication are non-blocking and so are permissible within tasks.
Supporting persistent communication using MPI within tasks that are scheduled locally by different task-based runtime instances on each node, should be the obvious way to extend task-based programming beyond a single node to encompass multiple nodes, and perhaps entire Exascale machines when they become available. So what are the problems?
The main problem is that MPI only defines persistent versions of point-to-point communication functions, and that definition does not take full advantage of early-binding. Persistent point-to-point operations do not match during initialisation but instead match each time they are started. This makes it much more difficult to plan ahead and improve performance beyond the equivalent non-blocking versions of these functions. The proposal to add persistent collective operations to MPI addresses both of these aspects: it extends the concept of persistence to new types of operation and requires that these operations are matched once during initialisation, not each time they are started.
The INTERTWinE project has been instrumental in creating the MPI Forum proposal for persistent collective operations and carefully designing their semantics to support this, and other similar, use-cases. Work by the INTERTWinE project, in collaboration with the University of Auburn, the University of Alabama at Birmingham, the University of Tennessee at Chattanooga, and Intel Corp, has resulted in a reference implementation of all the new proposed functions in Open MPI. The collaboration is currently working to build proof-of-concept implementations of persistent collective operations that out-perform corresponding nonblocking and blocking collective operations by leveraging work done during the EPiGRAM project by the Technical University of Vienna.
Our work on MPI Sessions has also been presented to the MPI Forum.