International Journal of Networking and Computing
Online ISSN : 2185-2847
Print ISSN : 2185-2839
ISSN-L : 2185-2839
Special Issue on Workshop on Advances in Parallel and Distributed Computational Models 2021
Task-Level Resilience: Checkpointing vs. Supervision
Jonas PosnerLukas ReitzClaudia Fohry
Author information
JOURNAL OPEN ACCESS

2022 Volume 12 Issue 1 Pages 47-72

Details
Abstract

With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by task-based parallel programming implemented with work stealing. At the task level, resilience can be provided by two principal approaches, namely checkpointing and supervision. For both, particular algorithms have been worked out recently. They perform local recovery and continue the program execution on a reduced set of resources. The checkpointing algorithms regularly save task descriptors explicitly, while the supervision algorithms exploit their natural duplication during work stealing and may be coupled with steal tracking to minimize the number of task re-executions. Thus far, the two groups of algorithms have been targeted at different task models: checkpointing algorithms at dynamic independent tasks, and supervision algorithms at nested fork-join programs. This paper transfers the most advanced supervision algorithm to the dynamic independent tasks model, thus enabling a comparison between checkpointing and supervision. Our comparison includes experiments, running time predictions, and simulations of job set executions. Results consistently show typical resilience overheads below 1% for both approaches. The overheads are lower for supervision in practically relevant cases, but checkpointing takes over for order millions of processes.

Content from these authors
© 2022 International Journal of Networking and Computing
Previous article Next article
feedback
Top