INTERTWinE Resource Manager

This Resource Pack provides a technology preview of the INTERTWinE Resource Manager, which improves scalability when multiple parallel runtime libraries are used within the same application.

This page has the following sections:

  1. Motivation
  2. Industry/ Academic Relevance
  3. Common Scenarios
  4. Resource Manager Overview
  5. Best Practice Guides
  6. Applications and Kernels
  7. Resource Pack

Motivation

INTERTWinE is committed to helping application developers write scalable code for the world’s largest supercomputers, by combining existing programming models (hybridisation) to future-proof applications for next-generation HPC systems.

This Resource Pack helps developers create efficient, effective and scalable applications, improving the resource utilization of the system while also facilitating the use of multiple parallel libraries within the same application.

Industry/ Academic Relevance

Current near-term and mid-term High Performance Computing (HPC) architecture trends all suggest that the first generation of Exascale computing systems will consist of distributed memory nodes, where each node is powerful and contains a large number of (possibly heterogeneous) compute cores. The number of parallel threads of execution will likely be on the order of 10^8 or 10^9, split between multiple layers of hardware and therefore offering multiple levels of parallelism.

With the first Exascale hardware likely to appear in the next 5-7 years, the emergence of a single parallel programming model addressing all these levels of parallelism seems unachievable, at least within that timeframe. We must expect, therefore, that the majority of Exascale applications will make use of combinations of existing programming APIs, where each API is well standardized and specific to one or two levels of parallelism. This is already a well-established practice in the HPC community, with many applications combining, for example, a message-passing interface (MPI), a shared-memory parallel programming model (such as OpenMP) and multi-threaded specialized libraries.

Concurrent accesses to the CPU cores by several uncoordinated threads, belonging to different libraries and to the main application, increase the number of context switches, pollute the caches and degrade the overall performance of the application. To avoid this situation, most parallel applications are restricted to using only the sequential versions of these libraries. However, this becomes a severe limitation when trying to exploit the huge hardware concurrency that will be available on Exascale systems, where all the potential levels of parallelism available to the application have to be leveraged.

Common Scenarios

A frequent scenario in HPC programs consists of a sequential application that relies on a specific library, such as the BLAS (Basic Linear Algebra Subprograms) routines, to execute a number of operations. These subprograms are usually aggregated in a highly optimized supporting library (e.g. ATLAS or Intel MKL). It is also very common for the specialized library to implement (among other optimizations) a parallelized version of the subprograms.

A sequential application calling a parallelized implementation of linear algebra subprograms may suffer from under-subscription, as the sequential part may also need to compute significant portions of code that are not parallelized. These sequential phases may cause considerable periods of time during which most CPUs are unused.

To overcome this under-subscription problem, the application developer can parallelize the main code of the application. However, in this new scenario, the parallelized application may incur the over-subscription problem, as the threads used in the main code and those used in the parallel supporting library may compete for, and run on, the same CPUs.
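
To make the over-subscription concrete, the following is a minimal C sketch, assuming an OpenMP-parallelized main code and a multithreaded CBLAS implementation (e.g. Intel MKL or a threaded ATLAS build); the matrix sizes and function name are illustrative only. Every application thread calls a dgemm that may itself spawn library worker threads, so the total thread count can far exceed the number of CPUs:

    #include <stddef.h>
    #include <cblas.h>   /* CBLAS interface of a multithreaded BLAS (MKL, ATLAS, ...) */

    #define N  2048      /* illustrative matrix dimension      */
    #define NB 16        /* illustrative number of row blocks  */

    /* Hypothetical blocked update: the main code is parallelized with OpenMP,
     * and each thread calls dgemm on one block of rows.  If the BLAS library
     * is itself multithreaded, every call spawns additional worker threads,
     * so threads from the application and the library compete for the same
     * CPUs -- the over-subscription problem described above.                */
    void blocked_update(const double *A, const double *B, double *C)
    {
        #pragma omp parallel for
        for (int b = 0; b < NB; b++) {
            size_t row = (size_t)b * (N / NB);
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        N / NB, N, N,
                        1.0, A + row * N, N,
                             B, N,
                        1.0, C + row * N, N);
        }
    }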

Other traditional hybrid applications combine the MPI and OpenMP programming models to perform parallel computations (using the fork-join approach of OpenMP) while communications (MPI) occur in a sequential part of the code. This common computing pattern can be improved by using tasks to also perform the communications, breaking the rigid sequence of fork-join computation, sequential communication, fork-join computation and so on. The problem with this approach is that programmers may experience deadlocks when invoking synchronous MPI services inside an OpenMP task, since the call blocks the thread (and the associated CPU) until the corresponding service has completed.
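
A minimal sketch of this taskified pattern is shown below, assuming MPI + OpenMP tasks (the buffer name, neighbour rank and boundary update are illustrative, and MPI must have been initialized with MPI_THREAD_MULTIPLE support). The blocking receive inside a task shows where the deadlock risk comes from:

    #include <mpi.h>

    /* Taskified communication + computation, instead of the rigid fork-join /
     * sequential-communication pattern.  Requires MPI_Init_thread() with
     * MPI_THREAD_MULTIPLE.                                                   */
    void exchange_and_update(double *halo, int n, int neighbour)
    {
        #pragma omp parallel
        #pragma omp single
        {
            /* Communication expressed as a task ...                          */
            #pragma omp task depend(out: halo[0:n])
            {
                /* The blocking call stalls the executing thread (and its CPU)
                 * until the message arrives.  If every OpenMP thread ends up
                 * blocked inside MPI, no thread is left to run the tasks that
                 * would trigger the matching sends: a deadlock.              */
                MPI_Recv(halo, n, MPI_DOUBLE, neighbour, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }

            /* ... so dependent computation can be scheduled as soon as the
             * data is available, overlapping with other independent tasks.  */
            #pragma omp task depend(in: halo[0:n])
            for (int i = 0; i < n; i++)
                halo[i] *= 0.5;    /* stand-in for the real boundary update */
        }
    }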

Resource Manager Overview

The main goal of the Resource Manager (RM) is to coordinate the access to CPUs by the different runtime systems running inside the same application and node (as described in the previous section). This situation naturally arises when a parallel application uses a library that is parallelized with a different programming model and/or runtime. To coordinate the access to the CPU resources, the INTERTWinE project proposes four different APIs implementing the Resource Manager concept. These APIs are divided into two groups, as shown in the figure above, which depicts a specific example of a parallel OmpSs application invoking kernels from three different parallel libraries. On the one hand, the Native Offload (NO) and Resource Enforcement (RE) APIs (including also the augmented OpenCL offloading API) are designed to be used directly by application and library developers (to avoid over-subscription issues). On the other hand, the Task Pause/Resume (TPR) and Dynamic Resource Sharing (DRS) APIs are designed to be used directly by parallel runtimes and communication libraries (to avoid under-subscription issues).

Best Practice Guides

Details about combining different task-based runtime systems, as well as some useful case studies, can be found in the Best Practice Guide on OpenMP/ OmpSs/ StarPU plus Multi-threaded Libraries Interoperable Programs.

The INTERTWinE project has also produced three Best Practice Guides following the MPI + X scheme: the Best Practice Guide for hybridizing pure MPI applications with tasks; the Best Practice Guide for Writing MPI and OmpSs Interoperable Programs; and the Best Practice Guide for Hybrid MPI and OpenMP Programming.

Applications and Kernels

In this release of the resource pack, we select applications/ kernels that have the potential to improve their performance by employing the INTERTWinE Resource Manager. We therefore blend simple applications/ kernels that already benefit from the Resource Manager functionalities with ones that could do so. We begin with the BAR benchmarks. The BAR Cholesky benchmark performs the Cholesky decomposition by distributing the matrix blocks to OmpSs tasks and then further splitting these blocks into subblocks, which are run as StarPU tasks. The BAR MxM benchmark computes a matrix-matrix multiplication by distributing the matrix blocks to StarPU tasks, which are further split into subblocks that are run as OmpSs tasks. Both kernels make use of the Resource Manager Dynamic Resource Sharing API. The BAR N-Body benchmark is an astrophysical simulation in which each body represents a galaxy or an individual star, and the bodies attract each other through the gravitational force. The BAR Heat simulation uses an iterative Gauss-Seidel method to solve the heat equation, a parabolic partial differential equation that describes the distribution of heat (or variation in temperature) in a given region over time. Both benchmarks show the benefit of the Resource Manager Task Pause/Resume API. [BAR Cholesky, BAR MxM, BAR N-Body, BAR Heat]
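
As an indication of the block/task decomposition that kernels like BAR Cholesky rely on, the following is a minimal single-runtime sketch of a tiled Cholesky factorization using OpenMP tasks with dependences (the tile layout, tile size and function name are illustrative; the actual BAR benchmark additionally splits each block across a second runtime, StarPU or OmpSs):

    #include <cblas.h>
    #include <lapacke.h>

    #define NT 8      /* illustrative number of tile rows/columns */
    #define TS 256    /* illustrative tile size                   */

    /* Tiled right-looking Cholesky factorization (lower triangular).
     * A is an NT x NT array of pointers, each pointing to a row-major
     * TS x TS tile.  Every BLAS/LAPACK call on a tile becomes a task,
     * and the depend clauses on the tile pointers express the data flow. */
    void tiled_cholesky(double *A[NT][NT])
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < NT; k++) {
            #pragma omp task depend(inout: A[k][k])
            LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', TS, A[k][k], TS);

            for (int i = k + 1; i < NT; i++) {
                #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
                cblas_dtrsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, TS, TS, 1.0, A[k][k], TS, A[i][k], TS);
            }

            for (int i = k + 1; i < NT; i++) {
                #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
                cblas_dsyrk(CblasRowMajor, CblasLower, CblasNoTrans,
                            TS, TS, -1.0, A[i][k], TS, 1.0, A[i][i], TS);

                for (int j = k + 1; j < i; j++) {
                    #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                                TS, TS, TS, -1.0, A[i][k], TS, A[j][k], TS,
                                1.0, A[i][j], TS);
                }
            }
        }
    }

With this decomposition, the dependences between tiles are enough to expose the concurrency of the factorization; the BAR benchmarks additionally hand the sub-block work to a second runtime, which is where the Dynamic Resource Sharing API comes in.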

Looking at the Hierarchical Matrix Factorization (HierMatFact) benchmark, there are three possible approaches to parallelizing this algorithm, including OpenMP parallel loop pragmas. The most promising, however, is the one using OmpSs tasks, which leverages the actual concurrency of the algorithm. In addition, following the hierarchical nature of linear algebra algorithms, the algorithm calls BLAS subprograms from a multithreaded library such as Intel MKL. Coupling OmpSs with a multithreaded library, where neither is able to see the other running on the same compute node, leads to resource oversubscription and, therefore, poor performance. [HierMatFact]
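
In the absence of a coordination layer such as the Resource Manager, the usual workaround is to force the BLAS library to run sequentially inside each task. A minimal sketch, using OpenMP tasks as a stand-in for OmpSs and Intel MKL's standard thread-control call (block count, block size and names are illustrative):

    #include <mkl.h>    /* provides cblas_dgemm and mkl_set_num_threads */

    #define TS 512      /* illustrative block size */

    /* Each task calls a BLAS kernel.  Forcing MKL to run sequentially avoids
     * over-subscription, but also throws away all the parallelism inside the
     * library -- exactly the restriction the Resource Manager aims to lift.  */
    void blocked_gemm(int nb, double *A[], double *B[], double *C[])
    {
        mkl_set_num_threads(1);    /* one MKL thread per call */

        #pragma omp parallel
        #pragma omp single
        for (int b = 0; b < nb; b++) {
            #pragma omp task firstprivate(b)
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        TS, TS, TS, 1.0, A[b], TS, B[b], TS, 0.0, C[b], TS);
        }
    }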

Software to support this resource pack can be downloaded from GitHub.

Resource Pack

The INTERTWinE Preliminary Resource Manager Resource Pack contains the following:

  1. Best Practice Guides:
    1. Best Practice Guide for hybridizing pure MPI applications with tasks
    2. Best Practice Guide on OpenMP/ OmpSs/ StarPU plus Multi-threaded Libraries Interoperable Programs
    3. Best Practice Guide for Writing MPI and OmpSs Interoperable Programs
    4. Best Practice Guide for Hybrid MPI and OpenMP Programming
  2. INTERTWinE developers' commentary on several real-world software applications and kernels:
    1. The BAR Cholesky and MxM benchmarks demonstrate the use of the Resource Manager Dynamic Resource Sharing API. [Source Code (BAR Cholesky), Source Code (BAR MxM)]
    2. The BAR N-Body and Heat benchmarks make use of the Resource Manager Task Pause/Resume API. [Source Code (BAR N-Body), Source Code (BAR Heat)]
    3. The Hierarchical Matrix Factorization benchmark computes LU factorizations of hierarchical matrices, using a parallel linear algebra library. [Guide, Source Code (OmpSs), Source Code (OpenMP)]

For more details, please consult our deliverables:

  1. D5.4 Final report on application/ kernel plans, evaluations and benchmark suite releases
  2. D5.3 Performance evaluation report
  3. D5.2 Interim report on application/kernel plans, evaluations and benchmark suite releases