As this three-part news series explains, LLNL is striving to create a computing ecosystem that operates at exascale speeds (more than 10^18 calculations per second) to carry out its national security and science missions an order of magnitude faster than today’s high performance computing (HPC) systems. The Livermore Computing (LC) Division is developing software, including Flux, OpenZFS, and SCR, to support these systems.
Checkpoint/restart is a method by which an application periodically saves checkpoints, or snapshots of its data, so that should a failure occur, the application can restart from that point in time rather than from the beginning of the job. Livermore scientists have developed the Scalable Checkpoint/Restart (SCR) library to dramatically improve this process and have found that with this framework, jobs run more efficiently, recover more data upon failure, and reduce the load on resources.
SCR enables an application to write its checkpoints to fast, local storage tiers that can handle data faster than a parallel file system; leveraging these fast tiers can enable applications to run 100 to 1,000 times faster than if they write checkpoints directly to the parallel file system. SCR then moves the checkpoint data to the parallel file system in the background while the application continues running the computation.
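The idea of writing to a fast tier and draining to a slower one in the background can be illustrated with a toy sketch. This is not SCR code: the class `TieredCheckpointer`, its methods, and the file naming are hypothetical, and two temporary directories merely stand in for node-local storage and the parallel file system.

```python
import shutil
import tempfile
import threading
from pathlib import Path


class TieredCheckpointer:
    """Toy two-tier checkpoint writer (hypothetical; not the SCR API)."""

    def __init__(self, fast_dir, slow_dir):
        self.fast_dir = Path(fast_dir)  # stands in for fast, node-local storage
        self.slow_dir = Path(slow_dir)  # stands in for the parallel file system
        self._flushes = []

    def write(self, step, payload):
        # The application blocks only for the fast-tier write.
        fast_path = self.fast_dir / f"ckpt_{step:06d}.bin"
        fast_path.write_bytes(payload)
        # The copy to the slow tier runs in the background
        # while the caller returns to its computation.
        flush = threading.Thread(
            target=shutil.copy2,
            args=(fast_path, self.slow_dir / fast_path.name),
        )
        flush.start()
        self._flushes.append(flush)
        return fast_path

    def wait(self):
        # Block until every background copy has finished.
        for flush in self._flushes:
            flush.join()
        self._flushes.clear()


def demo():
    with tempfile.TemporaryDirectory() as fast, tempfile.TemporaryDirectory() as slow:
        ckpt = TieredCheckpointer(fast, slow)
        for step in range(3):
            ckpt.write(step, f"state at step {step}".encode())
        ckpt.wait()
        return sorted(p.name for p in Path(slow).iterdir())
```

In this sketch the caller pays only for the fast write; a real library would also have to protect fast-tier data against node loss, which the toy version ignores.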
Because SCR allows for checkpoints to be written faster, an application can checkpoint more frequently so that it has saved more work if a failure occurs. Should a failure occur, the application is able to restart from these fast-tier checkpoints, which also saves time. As an example, for a laser–plasma interaction code the Laboratory was running, the application’s checkpoints initially took 20–40 minutes and were made every few hours, but with SCR, the checkpoints took 10 seconds and were made every 15 minutes.
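To put rough numbers on that example, the short calculation below estimates the share of wall-clock time spent checkpointing and the average work lost per failure. It rests on assumptions not stated in the article: "20–40 minutes" is taken as 30 minutes, "every few hours" as 3 hours, the quoted interval is treated as the full checkpoint period, and failures are assumed equally likely at any moment within a period.

```python
MINUTE = 60  # seconds


def checkpoint_overhead(cost_s, period_s):
    """Fraction of wall-clock time spent writing checkpoints."""
    return cost_s / period_s


def expected_lost_work(period_s):
    """Average work lost per failure if failures fall uniformly within a period."""
    return period_s / 2


# Assumed midpoints for the "before" case; the "after" figures are from the article.
scenarios = {
    "without SCR": {"cost_s": 30 * MINUTE, "period_s": 3 * 60 * MINUTE},
    "with SCR": {"cost_s": 10, "period_s": 15 * MINUTE},
}

for label, s in scenarios.items():
    overhead = checkpoint_overhead(s["cost_s"], s["period_s"])
    lost_min = expected_lost_work(s["period_s"]) / MINUTE
    print(f"{label}: overhead {overhead:.1%}, expected lost work {lost_min:.1f} min")
```

Under these assumptions, checkpointing overhead falls from roughly 17 percent of run time to about 1 percent, and the average work lost to a failure falls from about 90 minutes to 7.5 minutes.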
Kathryn Mohror, Data Analysis group leader in LLNL's Center for Applied Scientific Computing, notes that the SCR application programming interface (API) is platform agnostic and is designed to be easy to integrate with applications. SCR is a simple wrapper around an application’s existing checkpoint/restart code. It provides performance portability for applications while abstracting away the details of the underlying storage.
“That means once SCR is integrated into an application, it can be used across a variety of HPC systems without further changes to the application code,” Mohror says.
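The wrapper pattern Mohror describes can be sketched in a few lines. The example is illustrative only and does not use the SCR API: `CheckpointManager` and its methods are hypothetical names, and a local directory stands in for whatever storage the library would choose. The point is that the application's own `save_state` and `load_state` functions stay unchanged while the manager decides when to checkpoint and where each file lives.

```python
import json
from pathlib import Path


class CheckpointManager:
    """Hypothetical stand-in for a checkpoint library (not the real SCR API)."""

    def __init__(self, storage_dir, every_n_steps):
        self.storage_dir = Path(storage_dir)
        self.every_n_steps = every_n_steps

    def need_checkpoint(self, step):
        # The library, not the application, decides when to checkpoint.
        return step % self.every_n_steps == 0

    def route_file(self, name):
        # The library, not the application, decides where the file lives.
        return self.storage_dir / name

    def latest_checkpoint(self):
        files = sorted(self.storage_dir.glob("ckpt_*.json"))
        return files[-1] if files else None


# The application's existing checkpoint/restart code, left unchanged.
def save_state(path, state):
    Path(path).write_text(json.dumps(state))


def load_state(path):
    return json.loads(Path(path).read_text())


def run(manager, total_steps):
    # On startup, resume from the newest checkpoint if one exists.
    latest = manager.latest_checkpoint()
    state = load_state(latest) if latest else {"step": 0, "value": 0}
    for step in range(state["step"] + 1, total_steps + 1):
        state = {"step": step, "value": state["value"] + step}  # the "computation"
        if manager.need_checkpoint(step):
            save_state(manager.route_file(f"ckpt_{step:06d}.json"), state)
    return state
```

Moving this toy application to a different storage layout would mean changing only the manager's configuration, which is the kind of portability the quote refers to.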
Mohror explains how Livermore will apply SCR in its plans for exascale computing. “SCR is part of the Laboratory’s strategy for supporting applications on exascale platforms, where we expect hierarchical storage to play a large role in enabling our applications to ingest and generate data with high performance,” Mohror says. “SCR will manage data movement for applications through the storage hierarchies and enable applications to use upcoming platforms without code modifications.”
The Livermore SCR team is collaborating with Argonne National Laboratory on a checkpoint/restart library called VeloC, which is an Exascale Computing Project-funded effort. The two teams are combining the approaches and code from SCR and Argonne’s Fault Tolerance Interface project, which is similar to SCR.
SCR research has been funded primarily by the Advanced Simulation and Computing (ASC) Program and Livermore’s Laboratory Directed Research and Development program.
- In the image above, an application makes SCR API calls around its existing checkpoint/restart code (shown in the blue boxes). SCR manages storing checkpoints, protecting the checkpoints from loss, and moving them as needed.
- SCR: Scalable Checkpoint/Restart for MPI
- SCR on GitHub
- VIDEO: SCR framework: Accelerating resilience and I/O for supercomputing applications