Best Practice Guide for OpenMP/OmpSs/StarPU + Multi-threaded Libraries

05 Apr 2017
A Best Practice Guide to exploiting task-parallelism from within a task-based runtime system concurrently with the use of a multi-threaded numerical library

The BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) define standard interfaces for dense linear algebra (DLA) operations that, over the past decades, have improved the efficiency and portability of complex scientific and engineering codes. The BLAS and LAPACK application programming interfaces (APIs) specify the parameters and array storage layout for a collection of basic and advanced DLA routines, but leave the implementation details open to experts in numerical methods and/or high performance computing.

While legacy versions of BLAS and LAPACK exist, attaining high performance generally requires machine-specific implementations such as Intel MKL, IBM ESSL, AMD ACML or NVIDIA CUBLAS. Some of the BLAS routines within these libraries are multi-threaded, using fine-grain loop-parallelism to run in parallel on a multicore/multisocket server. In addition, for many years the default approach to exploiting thread-level parallelism in LAPACK was to cast most of the computations in terms of the BLAS and to rely on a multi-threaded implementation of the latter.

Extracting parallelism by means of a task-parallel runtime has recently been reported as a competitive approach to parallelizing DLA operations on multicore/multisocket servers. Compared with the conventional strategy of relying on a multi-threaded BLAS, exploiting task-parallelism reduces the overhead due to thread synchronization (barriers) and exposes a larger amount of (coarser-grain) parallelism, yielding a more scalable solution.
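To make the contrast concrete, here is a minimal sketch of the task-parallel structure, assuming a square matrix product split into tiles where each output tile is one independent task. It uses Python's standard `ThreadPoolExecutor` as a stand-in for a task-based runtime; the function names `matmul_tile` and `tiled_matmul` are hypothetical and do not come from the guide, and the tile kernel is plain Python rather than a BLAS call. It illustrates only how the work is decomposed, not the performance of a real runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_tile(A, B, i0, j0, tile, n):
    """Body of one task: compute a single tile of C = A * B.

    In a real code this is where a BLAS kernel (e.g. a GEMM) would be called.
    """
    rows = range(i0, min(i0 + tile, n))
    cols = range(j0, min(j0 + tile, n))
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in cols]
            for i in rows]


def tiled_matmul(A, B, tile=2, workers=4):
    """Multiply two square matrices with one task per output tile.

    Assumption: A and B are n x n lists of lists. The thread pool plays the
    role of the task-based runtime's worker threads.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit every tile up front; the tiles are independent, so no
        # barrier is needed between them.
        futures = {}
        for i0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                futures[(i0, j0)] = pool.submit(
                    matmul_tile, A, B, i0, j0, tile, n)
        # Gather results and copy each tile into its place in C.
        for (i0, j0), fut in futures.items():
            block = fut.result()
            for di, row in enumerate(block):
                C[i0 + di][j0:j0 + len(row)] = row
    return C
```

The only synchronization point is the final gathering of results, whereas a loop-parallel kernel would typically synchronize its threads at the end of every parallel loop.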

As we approach the first generation of Exascale computing systems, we can expect each node to contain a large number of (possibly heterogeneous) compute cores, supporting anything from a few dozen to thousands of threads per node. Taking advantage of this vast amount of thread concurrency will therefore require a combination of coarse-grain task-parallelism and fine-grain loop-parallelism, which in turn calls for efficient interoperability between task-parallel runtimes and multi-threaded libraries in scientific and engineering applications.

The INTERTWinE Best Practice Guide for Writing OpenMP/OmpSs/StarPU + Multi-threaded Libraries Interoperable Programs is aimed at application developers who plan to exploit task-parallelism from within a task-based runtime system concurrently with the use of a multi-threaded numerical library to execute the runtime tasks. In particular, the document pays special attention to the interoperability issues that arise due to oversubscription when exploiting thread-level parallelism simultaneously from within the task-based runtimes OpenMP/OmpSs/StarPU and the MKL numerical library from Intel.
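The oversubscription problem can be illustrated with simple arithmetic: if every runtime worker thread runs a task that calls a library which itself spawns several threads, the total thread count is the product of the two. The sketch below is a hedged illustration under that assumption; the function names are hypothetical and are not taken from the guide, and the actual behaviour of a given runtime and library depends on their configuration.

```python
def oversubscription_factor(cores, runtime_workers, library_threads):
    """Ratio of requested software threads to available cores.

    Assumes each of the runtime's worker threads executes a task in which
    the numerical library starts `library_threads` threads of its own.
    A value above 1.0 means the node is oversubscribed.
    """
    if cores < 1 or runtime_workers < 1 or library_threads < 1:
        raise ValueError("all arguments must be positive")
    return (runtime_workers * library_threads) / cores


def library_threads_per_task(cores, runtime_workers):
    """Largest per-task library thread count that avoids oversubscription.

    Splits the cores evenly among the runtime workers and never returns
    fewer than one thread per task.
    """
    if cores < 1 or runtime_workers < 1:
        raise ValueError("all arguments must be positive")
    return max(1, cores // runtime_workers)
```

For example, on a hypothetical 16-core node, 16 runtime workers each calling a library that uses 16 threads request 256 threads, a factor of 16; with 4 workers, the budget allows 4 library threads per task. With Intel MKL, a per-task limit of this kind is commonly applied through the `MKL_NUM_THREADS` environment variable, although how the runtime's own worker count is set differs between OpenMP, OmpSs and StarPU.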

Download the guide here.

Last updated: 05 Apr 2017 at 10:35