Distinguished Member of Technical Staff Kathryn Mohror advances the state of the art in I/O and data management and serves as a leader within the greater HPC community.
Topic: Fault Tolerance and Resilience
The Association for Computing Machinery's (ACM) Special Interest Group on High Performance Computing (SIGHPC) has awarded Kathryn Mohror its prestigious Emerging Woman Leader in Technical Computing (EWLTC) Award.
Bugs, broken codes, or system failures require added time for troubleshooting and increase the risk of data loss. LLNL has addressed failure recovery by developing the Scalable Checkpoint/Restart (SCR) framework.
With SCR, jobs run more efficiently, recover more work upon failure, and reduce load on critical shared resources.
Our researchers will be well represented at the virtual SIAM Conference on Computational Science and Engineering (CSE21) on March 1–5. SIAM, the Society for Industrial and Applied Mathematics, has an international community of more than 14,500 individual members.
Highlights include recent LDRD projects, Livermore Tomography Tools, our work with the open-source software community, fault recovery, and CEED.
“If applications don’t read and write files in an efficient manner,” system software developer Elsa Gonsiorowski warns, “entire systems can crash.”
Application-level resilience is emerging as an alternative to traditional fault tolerance approaches because it achieves fault tolerance at a lower cost.
Working on world-class supercomputers at a U.S. national laboratory was not what Edgar Leon, a native of Mexico, envisioned when he began preparing for university.
These techniques emulate the behavior of anticipated future architectures on current machines to improve performance modeling and evaluation.