MPI Endpoints and MPI Fine-points
How should MPI interact with threads and tasks? The concept of having multiple communication contexts, or “endpoints”, in each MPI process has a compelling justification: higher communication performance can be achieved by using multiple separate “lanes”. On the other hand, bundling pieces of data from separate lanes together can reduce the number of completion notifications that have to be handled. There is a trade-off between waiting for the bundle to be complete before sending it and sending each part of the bundle as soon as it is ready, which means handling the arrival of multiple smaller pieces.
For large amounts of data, combining multiple network hardware resources increases the available bandwidth, which decreases the time to deliver data. For small messages, separate network “lanes”, with isolated hardware and software resources, can carry separate virtual traffic, for example, messages using different communicators or different tags. Separating out virtual traffic decreases the length of each incoming message queue, which decreases the latency for small messages.
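As a toy illustration of the queue-length argument above, the following Python sketch spreads incoming small messages across lanes and reports the length of each lane's queue. The function name `queue_lengths` and the tag-modulo mapping are assumptions made for this example; they do not describe how any particular MPI library assigns traffic to lanes.

```python
from collections import defaultdict


def queue_lengths(tags, num_lanes):
    """Assign each incoming message to a lane by its tag and count the queue per lane.

    Hypothetical model: lane = tag % num_lanes. Real libraries may separate
    traffic by communicator, tag, or some other criterion.
    """
    lanes = defaultdict(int)
    for tag in tags:
        lanes[tag % num_lanes] += 1
    return [lanes[i] for i in range(num_lanes)]


tags = list(range(64))             # 64 small messages with distinct tags
one_lane = queue_lengths(tags, 1)  # everything lands in a single queue
four_lanes = queue_lengths(tags, 4)
```

With 64 messages carrying distinct tags, one lane holds all 64 entries, whereas four lanes hold 16 each, so in this model a matching search has a quarter as many entries to scan.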
However, can all this be achieved simply by improving the implementation of MPI libraries? Does the application interface defined by the MPI Standard need to change?
The current Endpoints proposal to the MPI Forum describes a new function to be added to MPI, which exposes this concept of communication endpoints to the users of MPI. Recent work in the MPICH library suggests that the performance benefits of using multiple hardware and software resources might be achievable using separate communicators. The remaining argument supporting this addition is that it may improve programmability for hybrid “MPI + threads” (and possibly “MPI + tasks”) applications. This contention needs to be fully investigated and demonstrated before further progress can be made in the MPI Forum on this proposal.
The Fine-points idea is a rival proposal for how to expose the concept of multiple “endpoints” to users of MPI. There is a fundamental difference at the conceptual level between these different ways of presenting “endpoints” to users of MPI. In the Endpoints proposal, threads (or tasks) become the entities that can use MPI to communicate with other threads (or tasks). The Fine-points proposal retains the idea that it is the whole MPI process that communicates with other MPI processes, irrespective of the presence or number of threads (or tasks). With Fine-points, the aggregation of all the data being sent from one MPI process to another is described by a single message, which is created and managed cooperatively by multiple threads (or tasks). With Endpoints, the data moving between MPI processes is broken up into many small pieces, each of which becomes its own message. These smaller messages are sent by, and received by, individual Endpoints. The Fine-points API expresses concurrent effort by multiple threads (or tasks) on a single large message, whereas the Endpoints API expresses concurrent effort by multiple threads (or tasks) on many small messages.
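The conceptual difference described above can be made concrete with a small thread-based sketch. It is a toy model in Python, not either proposed API: the class `PartitionedSend`, the functions `endpoints_style` and `finepoints_style`, and the list `wire` standing in for the network are all hypothetical names invented for this illustration.

```python
import threading


class PartitionedSend:
    """Toy model of one shared message that several threads fill cooperatively."""

    def __init__(self, num_partitions, wire):
        self.buffer = [None] * num_partitions
        self.remaining = num_partitions
        self.lock = threading.Lock()
        self.wire = wire  # plain list standing in for the network

    def ready(self, index, data):
        # Each thread writes only its own partition, then reports it ready.
        self.buffer[index] = data
        with self.lock:
            self.remaining -= 1
            is_last = self.remaining == 0
        if is_last:
            # The thread that completes the bundle sends one aggregated message.
            self.wire.append(list(self.buffer))


def run_threads(worker, num_threads):
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def endpoints_style(num_threads):
    """Every thread sends the data it produced as its own small message."""
    wire, wire_lock = [], threading.Lock()

    def worker(i):
        with wire_lock:
            wire.append([i * i])

    run_threads(worker, num_threads)
    return wire


def finepoints_style(num_threads):
    """All threads contribute partitions to a single message."""
    wire = []
    message = PartitionedSend(num_threads, wire)
    run_threads(lambda i: message.ready(i, i * i), num_threads)
    return wire
```

For four threads, `endpoints_style(4)` places four one-element messages on the wire, in whatever order the threads ran, while `finepoints_style(4)` places a single four-element message. The sketch models only the simplest policy, sending once the bundle is complete; it does not model transferring partitions early as they become ready.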
The INTERTWinE project has evaluated both Endpoints and Fine-points for use in hybrid (MPI + threads and MPI + tasks) applications.
Endpoints can be integrated in several different ways into an application that uses either threads or tasks within a node and MPI between nodes. An archetypal HPC program regularly exchanges messages with a neighbourhood of other processes. Depending on how the computational work is split up in each process, the data needed for each outgoing message may be produced by a single local thread or by many local threads. This affects the choice of how to assign Endpoints to local threads, because each message can use only one Endpoint.
Creating and assigning as many Endpoints as there are neighbour processes (so that each neighbour process is sent one message using one Endpoint) often results in higher load imbalance, because some messages are much bigger than others and some messages consist of contributions from many more local threads than others. Creating and assigning as many Endpoints as there are local threads (so that each thread is responsible for sending the data it produces) often results in higher communication overheads, because many more messages are sent by each process. For example, in a regular domain-decomposition code, the corner-to-corner messages are much smaller than the edge-to-edge or the face-to-face messages. It is likely that a single thread will produce the data for each corner-to-corner message but that many threads will produce the data for each face-to-face message.
With Endpoints, either each portion of a face-to-face message must be sent independently, or no portion of the message can be sent until all portions are ready. Neither of these choices significantly increases the performance of the applications we tested, and both introduce additional complexity into the code rather than making programming easier.
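The domain-decomposition example can be worked through with some simple arithmetic. The Python sketch below assumes a cubic subdomain of n × n × n cells with a halo one cell wide, and assumes the subdomain is divided among the local threads in slabs along one axis; none of these choices comes from the applications tested, and the function names are hypothetical.

```python
def halo_messages(n, num_threads):
    """List (kind, cells, contributing_threads, count) for one cubic subdomain.

    Assumptions: n*n*n cells, halo width 1, work split into z-slabs with one
    slab per thread, so anything that runs along z is touched by every thread.
    """
    return [
        ("face spanning all slabs", n * n, num_threads, 4),
        ("face inside one slab", n * n, 1, 2),
        ("edge spanning all slabs", n, num_threads, 4),
        ("edge inside one slab", n, 1, 8),
        ("corner", 1, 1, 8),
    ]


def message_counts(n, num_threads):
    """Messages per process with one Endpoint per neighbour vs one per thread."""
    rows = halo_messages(n, num_threads)
    per_neighbour = sum(count for _, _, _, count in rows)
    per_thread = sum(threads * count for _, _, threads, count in rows)
    return per_neighbour, per_thread


counts = message_counts(64, 4)
```

For n = 64 and four threads, this model gives 26 messages per process with one Endpoint per neighbour and 50 with one Endpoint per thread, while a face message carries 4096 cells against a single cell for a corner message. Under these assumptions the per-thread count grows as 18 + 8T for T threads, which illustrates both the size imbalance and the growth in message count described above.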
The prototype implementation of the Fine-points API was created by Sandia National Laboratories. Measurements of this API, for example using the hybrid programming benchmark suite, show that it can provide higher communication performance, but only for very large messages (larger than 4 GiB). None of the applications being considered as part of the INTERTWinE project (e.g. Ludwig or iPIC3D) require messages of this size, even when extrapolating usage characteristics to possible future Exascale machines and unusually large input data-sets.
Both Endpoints and Fine-points show some promise, but the question of how MPI should interact with threads and tasks is still an open research topic.