Mini-project: Fault-tolerant EDAT
EDAT and all that
We have previously posted about Event Driven Asynchronous Tasks. It is a task-based model in which the programmer is explicitly aware of the distributed nature of their code. Tasks are submitted which depend on events, and events are fired either locally or from another process. Once all a task’s dependencies have arrived it runs – a task may submit any number of new tasks, and fire any number of events.
The EDAT model enables programmers to effectively harness the power of exascale machines, but making use of 109 cores is not the only challenge facing developers of exascale applications. With so many individual hardware components, the probability of a hardware failure becomes too high to ignore.
Preparing to fail
Fault tolerance is certainly not a new concept in high-performance computing. Technologies such as fault-tolerant MPI already exist, but they are focused at a low-level. They require the application developer to write their code specifically to be fault tolerant, and this adds significant complexity. If there’s one thing high-performance software doesn’t need, it’s more complexity!
A consequence of the low level at which these technologies operate is that they cannot themselves capture the state of a computation. As a result a common approach is to checkpoint the computation at a particular point (checkpoint meaning save everything to disk). In the event of a failure, the checkpoint is reloaded and the computation carries on from that point. Unfortunately, writing such large files to disk all at once costs a lot of time/energy/money, so usually it’s done somewhat sparingly, if at all. That means that if the worst does happen, and a checkpoint restart is required, significant progress is often lost. We wanted to take a more fine-grained approach, and to remove the burden of making code fault tolerant from the developer, leading to more efficient, resilient, and more easily maintainable software. Wouldn’t that be nice?
The ACID test
Databases sit at the back end of many of the world’s most high-profile IT systems, they’re ubiquitous. It should come as no surprise then, that fault tolerance is a big deal in the database world, and HPC could learn a thing or two. One of the methods used by databases is a transactional model of interactions, which is ACID compliant. It’s what now?
- Atomicity The state of a database changes if and only if a transaction successfully completes. If a transaction is unsuccessful, that state of the database does not change.
- Consistency Any transaction can only transition the database from one valid state to another, the state of the database is always valid.
- Isolation Transactions that take place concurrently must leave the database in the same valid state as if they had occurred sequentially.
- Durability The state of the database will be recoverable, even in the event of a total power failure.
Okay, sounds great, but what’s that got do with EDAT? Well, we can apply the same approach. Whatever computation the application is doing, it has some state, and every EDAT task is a transaction.
In this fault tolerant EDAT model, events are the medium by which tasks interact with the global state of the computation. To ensure atomicity, we prevent events from being fired from one process to another until the task has successfully completed. This also ensures consistency as events are never fired from an unsuccessful task – if a task fails its events are discarded, and it has no impact on the global state of the computation. It can be safely rerun, without fear of dependencies being sent twice. Isolation is trivially achieved, as the EDAT model is already asynchronous and tasks only begin running once all their dependencies are met, but the order in which dependencies arrive is not important. Durability is perhaps the simplest requirement conceptually, but actually presents the biggest challenge for implementation. It requires every state-change to be written to disk. In EDAT that means everything that could change the state of the computation must be written to disk. Every time an event is fired from a process, an event arrives at a process, or a task is scheduled a write occurs. If the node fails, this record of state can be read, and the computation restarted from almost exactly where it was. The only lost progress is in tasks which happened to be running at the time!
There is an important caveat in all of this. The runtime handles all the above requirements, with no intervention from the application developer. They need only tell EDAT that they want to run with resilience turned on. However, the developer does need to ensure that their application is fully encapsulated in events and tasks. For performance reasons it’s tempting to pass pointers to shared memory rather than whole arrays, but any changes to data stored at that address would constitute an atomicity-breaking side effect!
Of course, there’s no such thing as a free lunch. There is an overhead involved with turning resilience on. There are two modes of operation available, the first which covers the ACI of ACID compliance, and would allow rerunning of a failed task – the overhead incurred at this level is modest. The second mode of operation is fully ACID compliant, meaning it writes a ledger of all events and tasks to disk. Unfortunately, this is where we hit performance difficulties. In principle it should be quicker and easier to make small updates to the ledger file, rather than writing out a full checkpoint at more distant intervals. In practice however, this is the mode of operation which current HPC filesystems expect, and for which they are optimised – everyone knows that writing to disk is expensive and should be done as infrequently as possible after all!
This serialization problem poses a challenge certainly, but not an insurmountable one.