As high-performance computing systems increase in size, checkpointing to the parallel file system becomes prohibitively expensive. Multilevel checkpointing may solve this challenge by using lightweight checkpoints to handle the most common failures and relying on parallel file system checkpoints only for less common but more severe failures. To evaluate this approach in a large-scale, production system context, we developed the Scalable Checkpoint/Restart (SCR) library, which checkpoints to storage on the compute nodes in addition to the parallel file system. Through experiments and modeling, we show that multilevel checkpointing benefits existing systems, and we find that the benefits increase on larger systems. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. Our approach improves machine efficiency by up to 35%, while reducing the load on the parallel file system by a factor of two.
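To make the idea of mixing cheap and expensive checkpoints concrete, the following is a minimal sketch of a two-level schedule. It is not the library's API or its actual policy: the function names, the level labels, and the fixed interval `pfs_every` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def checkpoint_level(step: int, pfs_every: int = 10) -> str:
    """Return the (hypothetical) storage level for checkpoint number `step`.

    `step` is 1-based. Every `pfs_every`-th checkpoint is written to the
    parallel file system; all others stay in node-local storage.
    """
    if step < 1 or pfs_every < 1:
        raise ValueError("step and pfs_every must be >= 1")
    return "pfs" if step % pfs_every == 0 else "node_local"


def schedule(num_checkpoints: int, pfs_every: int = 10) -> list:
    """List the storage level chosen for checkpoints 1..num_checkpoints."""
    return [checkpoint_level(s, pfs_every) for s in range(1, num_checkpoints + 1)]


# Count how many of 100 checkpoints land on each level.
counts = Counter(schedule(100, pfs_every=10))
print(counts["node_local"], counts["pfs"])  # prints "90 10"
```

With this assumed interval, only one checkpoint in ten touches the parallel file system. A real policy would pick the interval from measured failure rates and checkpoint costs rather than a fixed constant.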
You can obtain the latest stable version from our Software page.
Kathryn Mohror, Adam Moody, Greg Bronevetsky, and Bronis R. de Supinski, "Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System," in IEEE Transactions on Parallel and Distributed Systems, LLNL-JRNL-564721, 25(9):2255-2263, Sept. 2014.
Kathryn Mohror, Adam Moody, and Bronis R. de Supinski, "Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library," LLNL-PROC-540391, FTXS'12, Boston, MA, June 25, 2012.
Dries Kimpe, Kathryn Mohror, Adam Moody, Brian Van Essen, Maya Gokhale, Kamil Iskra, Rob Ross, Bronis R. de Supinski, "Integrated In-System Storage Architecture for High Performance Computing," LLNL-CONF-557032, ROSS'12, Venice, Italy, June 29, 2012.
Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," LLNL-CONF-427742, Supercomputing 2010, New Orleans, LA, November 2010.
Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski, "Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System," Lawrence Livermore National Laboratory Technical Report, LLNL-TR-440491, July 2010.