Exascale Applications Workshop Agenda

Here is the provisional agenda for the European Exascale Applications Workshop (as of 29th March).

Thursday 19th April 2018
13:15-13:45 Registration and light lunch
13:45-14:00 Welcome and overview of Co-Design within the INTERTWinE project (Dr Valeria Bartsch, Fraunhofer ITWM)
14:00-16:00 Session 1: Asynchronous Execution (chair: Dr George Beckett, EPCC)
16:00-16:15 Coffee
16:15-17:45 Session 2: Interoperability (chair: Dr Mark Bull, EPCC)
Friday 20th April 2018
08:30-09:00 Coffee and pastries
09:00-10:30 Session 3: Usage of Libraries (chair: Dr Mirko Rahn, Fraunhofer ITWM)
10:30-11:00 Coffee
11:00-12:30 Session 4: Distributed Tasks (chair: Dr Jakub Šístek, University of Manchester)
12:30-13:00 Final discussion

Session 1: Asynchronous execution (chair: Dr George Beckett, EPCC)

Most current parallel APIs focus on one or maybe two layers of parallelism in hardware--such as distributed memory, processor cores, or vectors. However, as the scale of supercomputers grows, applications increasingly need to exploit all the levels of parallelism to realise performance improvement, and they do this by incorporating multiple parallel APIs (so-called hybridisation) into the software.

However, at the transition between APIs, performance can easily be lost, due to artificial or unnecessary synchronisation. For example, in a multi-threaded region described as a parallel loop, it is implicitly the case that all threads need to complete their calculations before the next step can begin--even though, for many algorithms, useful work could be done as soon as data is finalised for any of the threads.

Application developers are increasingly realising that a clean progression from one phase of an application to another comes at a cost in performance, and are beginning to explore options for relaxing or eliminating unnecessary synchronisation, focusing only on what is needed to support data flow and avoiding a classical breakdown of an application into blocks of computation with hard synchronisation between them.

In this session, we will learn about some of the most promising avenues: replacing threaded blocks with loosely coupled tasks; overlapping computation and communication phases; and initiating data analysis or I/O while a simulation is in progress.
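
As a rough illustration of the synchronisation issue described above (not taken from any of the talks), the following Python sketch contrasts a parallel loop that waits for every chunk before post-processing with a task-style version that post-processes each chunk as soon as its result is ready. The function names compute_chunk, post_process, run_with_barrier and run_as_tasks are hypothetical and exist only for this example; a thread pool stands in for whatever tasking runtime an application would actually use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def compute_chunk(chunk_id):
    # Hypothetical stand-in for the numerical work done on one chunk of data.
    return chunk_id, sum(i * i for i in range(chunk_id * 1000))


def post_process(chunk_id, value):
    # Hypothetical follow-on step that needs only this chunk's result.
    return f"chunk {chunk_id}: {value}"


def run_with_barrier(n_chunks):
    # Parallel-loop style: all chunks must finish before any post-processing starts.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(compute_chunk, range(n_chunks)))
    return [post_process(c, v) for c, v in results]


def run_as_tasks(n_chunks):
    # Task style: each chunk is post-processed as soon as its own data is final.
    processed = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(compute_chunk, c) for c in range(n_chunks)]
        for future in as_completed(futures):
            chunk_id, value = future.result()
            processed.append(post_process(chunk_id, value))
    return processed


if __name__ == "__main__":
    # Both versions produce the same set of results; only the ordering
    # and the amount of waiting differ.
    print(sorted(run_as_tasks(8)) == sorted(run_with_barrier(8)))
```

In the task-style version the order of results depends on completion order, so code that consumes them has to tolerate non-deterministic arrival.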

"Improving the interoperability between MPI and OmpSs-2" (Vicenç Beltran, BSC)

Abstract: In the context of the INTERTWinE project, BSC has been working on improving the interoperability between MPI and task-based programming models, to facilitate the development of high-performance hybrid applications. To that end, we have extended MPI with a new threading model called MPI_TASK_MULTIPLE that improves the usability of MPI operations inside tasks. In this talk we explain how this new MPI threading model leverages the OmpSs-2 task pause/resume API to avoid deadlocks and improve application performance.

"Decoupling data computation from processing to support high performance data analytics" (Nick Brown, EPCC)

Abstract: There are numerous HPC codes that not only perform computation, but also analyse their values to generate higher level information from this raw data. Atmospheric science is one of these domains, and traditionally computation has been paused whilst higher level diagnostic values are derived from the raw prognostic fields. In our approach we share the cores of a processor between simulation and analytics/IO, typically with one core of the processor performing the analytics and servicing the remaining simulation cores, which asynchronously fire and forget their raw data over to be processed. By decoupling these two aspects, the theory is that they can both progress fairly independently, and this can hide the heavy communication and IO costs of analytics from the simulation side of things. In our work we have realised that a task-based paradigm, but one that is explicitly aware of different ranks, is advantageous in structuring the code and handling the non-deterministic nature of data arrival and processing.

"CaffeGPI—single-sided communication for scalable Deep Learning" (Janis Keuper, Fraunhofer ITWM)

Abstract: Deep Neural Networks (DNN) are currently of great interest in research and application. The training of these networks is a compute intensive and time consuming task. To reduce training times to a bearable amount at reasonable cost we extend the popular Caffe toolbox for DNN with an efficient distributed memory communication pattern.

To achieve good scalability we emphasize the overlap of computation and communication and prefer fine-granular synchronization patterns over global barriers. To implement these communication patterns we rely on the “Global address space Programming Interface” version 2 (GPI-2) communication library. This interface provides a light-weight set of asynchronous one-sided communication primitives supplemented by non-blocking fine-granular data synchronization mechanisms. Therefore, CaffeGPI is the name of our parallel version of Caffe. First benchmarks demonstrate better scaling behavior compared with other extensions, e.g., the Intel Caffe. Even within a single symmetric multiprocessing machine with four graphics processing units, CaffeGPI scales better than the standard Caffe toolbox. These first results demonstrate that the use of standard High Performance Computing (HPC) hardware is a valid cost saving approach to train large DNNs. I/O is another bottleneck when working with DNNs in a standard parallel HPC setting, which we will consider in more detail in a forthcoming paper.

"Asynchronous Execution in DLR's CFD Solvers" (Thomas Gerhold, DLR)

Abstract: In recent years, DLR have developed a prototype of TAU with shared memory parallelization by dynamic allocation of tasks to the threads, which enables asynchronous execution of tasks and overlap of communication and computation by using GASPI. With the experience gained, the development of a TAU successor has been started, which is based (so far) on a static task model. Both approaches are presented in the talk.

Session 2: Interoperability (chair: Dr Mark Bull, EPCC)

Current near-term and mid-term architecture trends all suggest that the first generation of Exascale computing systems will consist of distributed memory nodes, where each node is powerful, and contains a large number of (possibly heterogeneous) compute cores. The number of parallel threads of execution will likely be on the order of 10^8 or 10^9, split between multiple layers of hardware parallelism (e.g. nodes, cores, hardware threads and SIMD lanes).

The “silver bullet” of parallel programming models is a single API that can address all these parallel threads across all these hardware layers with maximum performance, yet can also be highly productive and intuitive for the programmer. Given the timescales involved in inventing and standardising a programming API, producing robust implementations, and porting large scale applications, and with the first Exascale hardware likely to appear in the next 5-7 years, we are forced to conclude that this “silver bullet” is in practice unachievable in the short to medium term. 

For a small number of pioneer Exascale applications, it may be possible to invest the effort in porting them to highly specialised, low-level, system-specific APIs. For the larger second wave of applications that follows, this effort, and the resulting lack of portability and sustainability, will be unacceptable. We must expect, therefore, that the majority of Exascale applications will make use of combinations of existing, though possibly enhanced, programming APIs, where each API is well standardised, and is specific to one or two layers of hardware parallelism.

Although there remains room for improvement in individual programming models and their implementations, the main challenges lie in interoperability between APIs, both at the specification level and at the implementation level.

"Shared Notifications in GASPI" (Christian Simmendinger, T-Systems SfR)

GASPI is a PGAS communication library which is based on the concept of one-sided, notified communication. The synchronization context here is bundled together with a one-sided message such that a communication target becomes able to test for completion of the received one-sided communication.

Traditionally the GASPI programming model has been aimed at multithreaded or task-based applications. In order to support a migration of legacy applications (with a flat MPI communication model) towards GASPI, we have extended the concept of shared MPI windows towards a notified communication model in which the processes sharing a common window become able to see all one-sided and notified communication targeted at this window.

Besides the possibility of entirely avoiding node-internal communication and making use of a much improved overlap of communication and computation, the model of notified communication in GASPI shared windows will allow legacy SPMD applications to transition towards an asynchronous dataflow model.

"GASPI Shared Notifications in iPIC3D" (Dana Akhmetova, KTH)

In our previous work, we have already experimented with GASPI and have ported a number of real-world applications to this model. The new implementations showed positive results and performed faster than their initial (MPI+OpenMP) versions, at least on a large number of cores. In this study, we experiment with a new GASPI feature called shared notifications in iPIC3D, a large plasma physics code for space weather applications written in C++ and MPI+OpenMP. To our knowledge, the GASPI shared notifications have never been used before. We believe that they provide a promising new set of features to further pave the way to the Exascale era for scientific production codes.

"Chameleon: A dense linear algebra software for heterogeneous architectures" (Olivier Aumage, Inria)

Chameleon is a C library providing parallel algorithms to perform BLAS/LAPACK operations, fully exploiting modern architectures. The Chameleon dense linear algebra software relies on sequential task-based algorithms where sub-tasks of the overall algorithms are submitted to a runtime system. Such a system is a layer between the application and the hardware which handles the scheduling and the effective execution of tasks on the processing units. A runtime system such as StarPU is able to automatically manage data transfers between non-shared memory areas (CPUs-GPUs, distributed nodes). This kind of implementation paradigm makes it possible to design high-performing linear algebra algorithms on very different types of architecture: laptops, many-core nodes, CPUs-GPUs, multiple nodes.

Session 3: Usage of libraries (chair: Dr Mirko Rahn, Fraunhofer ITWM)

Numerical libraries underlie many scientific and engineering applications, accelerating application development via reduced coding effort, providing standard interfaces that enhance code readability, and improving performance portability across a range of architectures. Some well-known examples include Intel's MKL and DAAL, IBM's ESSL and parallel ESSL, and NVIDIA's CUDA ecosystem. These libraries are often multi-threaded, allowing a seamless exploitation of the parallel resources of a multicore processor.

Unfortunately, the control these libraries offer over the processor resources (mainly, cores and memory) is very limited. As a result, when the kernels from these libraries are invoked from a runtime system running inside the same application, the result is often resource oversubscription (or undersubscription), and performance degradation. 

This session aims to highlight interoperability issues between parallel scientific applications, runtime systems and multithreaded libraries. In this line, we will learn about a few recent efforts:

  • The design, implementation and usage of the Resource Manager (RM), which is composed of four different APIs to dynamically share CPU resources between different runtime systems running inside the same application. These four APIs are designed mainly as a low-level mechanism, used by runtime system and parallel library developers, to improve the resource utilization of the system as well as facilitate the coordinated use of multiple parallel libraries from the same application, improving application developer productivity.
  • The design of malleable thread-level kernels for dense linear algebra which, contrary to the conventional behaviour of this type of library, can add threads on-the-fly to the execution of a numerical kernel. In general, this results in higher performance for standard dense matrix factorizations when accelerated via the application of look-ahead techniques, but also in better workload balancing for hierarchical matrix factorizations that are decomposed into a collection of tasks of very different costs.

"Cooperative Resource Allocation between Multiple Runtime Systems" (Olivier Aumage, Inria)

This talk introduces INTERTWinE's Resource Manager (RM). The RM is composed of four different APIs designed to dynamically share CPU and accelerator resources between different runtime systems running inside the same application, avoiding both over-subscription and under-subscription issues. These four APIs are designed mainly as a low-level mechanism used by runtime system and parallel library developers. Application developers are only expected to define high-level requirements and policies that will be enforced by the runtime systems. The four APIs presented not only improve the resource utilization of the system but can also facilitate the coordinated use of multiple parallel libraries from the same application, improving application developer productivity. Moreover, some of the RM APIs can also be leveraged to improve the productivity and efficiency of popular hybrid programming techniques that combine message passing and task-based programming models.

"Development of Weather and Climate Applications for Future HPC Systems" (Olivier Marsden, ECMWF)

Weather and climate prediction applications have historically been among the driving forces in the development of the high-performance computing sector. Increased computational performance is a necessary step towards future improvements of forecast reliability and accuracy.

The use of future high-performance computing environments appears to be increasingly dependent on high levels of parallelism and of code flexibility. Weather forecasting applications will likely need to rethink aspects of their design and methodology in order to achieve flexibility and increased parallelism. This talk will give a view of some of the work being undertaken at the European Centre for Medium-Range Weather Forecasts (ECMWF) to tackle these points. In particular, we will present the ESCAPE project's dwarf development, along with the design of a weather-prediction focused library upon which many of the dwarves are built. We will also discuss work on achieving performance portability of atmospheric physics code, with examples of IFS Fortran code transformed at a high level to leverage GPUs thanks to OpenACC.

"Parallelizing Dense Linear Algebra Operations with Malleable Thread-Level Libraries" (Enrique S. Quintana Ortí, Universitat Jaume I)

We will review several novel techniques for overcoming load-imbalance when implementing so-called static look-ahead and dynamic runtime-based parallelization in relevant matrix factorizations for the solution of linear systems and the computation of the singular value decomposition (SVD). One technique promotes worker sharing (WS) between tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. An additional technique allows a fast task to alert the slower task of completion, enforcing the early termination (ET) of the latter, and producing a smooth transition of the factorization procedure into the next stage.

These mechanisms are instantiated via a new malleable thread-level implementation of the Basic Linear Algebra Subprograms (BLAS), and their benefits will be illustrated via implementations of several key matrix factorizations for dense and hierarchical matrices.

Session 4: Distributed tasks (chair: Dr Jakub Šístek, University of Manchester)

Task-based models enable programmers to decouple system-, resource- and task management from user code. By abstracting issues such as scheduling, memory management and fault tolerance from the application code itself, the programmers can concentrate on the algorithm rather than on the low level, complex details of parallelization. These programming models allow for a significant reduction of synchronization in applications and, thus, can potentially deliver superior performance when compared with traditional models. For these reasons, the task-based models are considered to be a suitable base for building programming models for upcoming Exascale machines. This session highlights distributed tasks in task-based runtime systems. We will hear about recent developments of the PaRSEC runtime and its applications, as well as the directory-cache API and an application.

"Directory Cache" (Mirko Rahn, Fraunhofer ITWM)

The Directory Cache defines a generic API that abstracts distributed memory as a single shared address space. Such an API allows the runtimes to be completely independent from the physical representation of data, as well as from the type of storage used. The API relies on a generic, abstract multilevel data representation, promoting asynchronous data transfers and automatic caching.

More information about the directory cache: https://www.intertwine-project.eu/about-intertwine/directory-cache

"Application Example Running on Top of GPI-Space Integrating D/C" (Tiberiu Rotaru, Fraunhofer ITWM)

GPI-Space is a parallel programming development platform developed at the Fraunhofer Institute for Industrial Mathematics. The main concept behind this is the separation of domain and HPC knowledge. Currently, GPI-Space is used for running many workflow-driven applications from different domains (http://www.gpi-space.de/).

In this talk we report on the integration of GPI-Space with the current Directory/Cache prototype implementation and on experiments with running a real application on top of the resulting setup. As a sample application we used Splotch, a publicly available rendering software for exploration and visual discovery in particle-based datasets coming from astronomical observations or numerical simulations.

"PLASMA/DPLASMA" (George Bosilca, University of Tennessee)

PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. PaRSEC is the underlying runtime and dynamic scheduling engine of DPLASMA, a numerical linear algebra library for dense matrices.

More information about PaRSEC: icl.utk.edu/projectsdev/parsec 

Last updated: 03 May 2018 at 13:20