Topic: Fault Tolerance and Resilience

Our researchers will be well represented at the virtual SIAM Conference on Computational Science and Engineering (CSE21) on March 1–5. SIAM, the Society for Industrial and Applied Mathematics, has an international community of more than 14,500 individual members.

News Item

In this issue featuring LLNL's R&D 100 Award winners from 2019, the versatile Scalable Checkpoint/Restart framework offers more reliable simulation performance.

News Item

Of the Lab's four R&D 100 Award winners for 2019, two are open-source software projects: SCR and Spack.

News Item

In the third of three news features, LLNL-developed SCR software will contribute to the efficiency and effectiveness of the HPC leadership-class systems of tomorrow.

News Item

This open-source file system framework supports hierarchical HPC storage systems by using node-local burst buffers.

Project

“If applications don’t read and write files in an efficient manner,” system software developer Elsa Gonsiorowski warns, “entire systems can crash.”

People Highlight

Application-level resilience is emerging as an alternative to traditional fault tolerance approaches because it provides fault tolerance at a lower cost.

Project

Working on world-class supercomputers at a U.S. national laboratory was not what Edgar Leon, a native of Mexico, envisioned when he began preparing for university.

People Highlight

To overcome the shortcomings of the analytical and architectural approaches to performance modeling and evaluation, we are developing techniques that emulate the behavior of anticipated future architectures on current machines.

Project

With SCR, jobs run more efficiently, recover more work upon failure, and reduce load on critical shared resources.

Project

Kathryn Mohror develops tools that give researchers the information they need to tune their programs and maximize results. After all, says Kathryn, “It’s all about getting the answers more quickly.”

People Highlight