As the capability and component count of systems increase, their mean time between failures (MTBF) decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multilevel checkpointing is a promising solution. However, while multilevel checkpointing is successful on today’s machines, it is not expected to be sufficient for extreme-scale systems, which are predicted to have memory sizes and failure rates that are orders of magnitude larger.
To address this problem, we designed a checkpointing system that combines the benefits of asynchronous (non-blocking) and multilevel checkpointing, and we modeled the resulting system. Our experiments showed that our system can improve efficiency by 1.1 to 2.0× on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
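The general idea of combining the two techniques can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes two ordinary directories standing in for fast node-local storage and the PFS, and the names `checkpoint_async`, `flush_to_pfs`, and `run_demo` are hypothetical. The application blocks only on the fast local write, while a background thread copies the checkpoint to the slower level.

```python
import os
import shutil
import tempfile
import threading


def checkpoint_async(state: bytes, step: int, local_dir: str, pfs_dir: str) -> threading.Thread:
    """Write a checkpoint locally, then flush it to the 'PFS' directory in the background."""
    name = f"ckpt_{step:06d}.bin"
    local_path = os.path.join(local_dir, name)

    # Level 1 (blocking): the application waits only for this fast local write.
    with open(local_path, "wb") as f:
        f.write(state)

    def flush_to_pfs() -> None:
        # Level 2 (non-blocking): copy under a temporary name, then rename, so a
        # partially written file is never mistaken for a complete checkpoint.
        tmp_path = os.path.join(pfs_dir, name + ".tmp")
        shutil.copyfile(local_path, tmp_path)
        os.replace(tmp_path, os.path.join(pfs_dir, name))

    worker = threading.Thread(target=flush_to_pfs)
    worker.start()
    # The caller should join the thread before deleting or overwriting the local copy.
    return worker


def run_demo() -> bytes:
    # Temporary directories stand in for node-local storage and the PFS (assumption).
    with tempfile.TemporaryDirectory() as local_dir, tempfile.TemporaryDirectory() as pfs_dir:
        worker = checkpoint_async(b"application state", 1, local_dir, pfs_dir)
        # The application would continue computing here while the flush runs.
        worker.join()
        with open(os.path.join(pfs_dir, "ckpt_000001.bin"), "rb") as f:
            return f.read()
```

In this sketch the time the application spends blocked depends only on the local write, which is the property that lets a lower-bandwidth PFS be tolerated, provided each flush finishes before the next checkpoint needs the same resources.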
Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", International Conference for High Performance Computing, Networking, Storage and Analysis 2012 (SC12), LLNL-CONF-554431.