
“Pruning” Sources of Nondeterminism in Large-Scale Applications

Thursday, May 21, 2015

Large-scale applications designed to run on massively parallel supercomputers break a problem into tiny parts that are solved simultaneously for better performance, but the concurrent nature of these activities can create challenges for the programmers who write these programs and the researchers who run them. For instance, what happens when two operations attempt to write results to the same memory destination at the same time? If the code does not specify the order in which these actions should happen, the end result of the calculation may depend on which operation completes first. Such a situation, appropriately termed a data race, is one of the most notorious sources of nondeterminism in parallel programming.

Computer scientist Dong Ahn explains, “Nondeterminism in and of itself is not bad. It is an essential attribute of parallel computing, but using it in a safe and intentional manner is difficult.” Unintended use of nondeterminism can be time consuming to diagnose, and it often masks other problems in a code, making troubleshooting in general more difficult. Reproducibility, the ability to repeat program executions with the same numerical result or code behavior, is highly desirable when debugging large applications. “Debugging is like peeling an onion,” says Ahn. “We have to eliminate the unknowns one by one.” Ahn is a member of Computation’s Development Environment Group (DEG), which devises effective code-development tools for parallel computing, including—through the PRUNER project—methods of detecting, controlling, and eliminating sources of nondeterminism in large simulations.

Many of Livermore’s massive physics codes are transitioning to hybrid Message Passing Interface (MPI)–OpenMP models to take advantage of more diverse parallelism options on today’s and tomorrow’s high-performance computing systems. Unfortunately, porting an application with many scientific packages and hundreds of thousands, or even millions, of lines of code provides many opportunities for data races to be introduced. Enter ARCHER, the most mature of the PRUNER tools. A precise and portable tool for identifying race conditions in codes that use the OpenMP programming model, ARCHER was developed by DEG and Center for Applied Scientific Computing computer scientists in collaboration with researchers at the University of Utah and RWTH Aachen University in Germany.

Data race detection tools use either static or dynamic methods. Static detection is efficient but often generates false alarms, which then have to be checked by a programmer. Dynamic detection avoids false alarms but uses more computing resources and can miss data races. ARCHER combines the best of static and dynamic analysis techniques to minimize computing and programmer resources while maximizing accuracy. During several tests of ARCHER on the Cab cluster using small and medium-sized codes, the researchers found that the tool demonstrated much higher accuracy and five times better computational performance than the best available race-detection tool. ARCHER even found three previously unreported data races within the AMG2013 code, which is based on the Hypre linear solver library.

The finding not only underscored the tool’s usefulness but also prompted some detective work. The team made a connection between the newly discovered data races and a troubling unresolved bug in the million-plus-line HYDRA code, which uses Hypre in one of its packages. Because the problem only happened intermittently and at large scales, it had been arduous to debug, and HYDRA’s programmers had eventually rewritten code rather than continue troubleshooting. ARCHER’s developers suspected that the races were at least partially responsible for HYDRA’s crashes, and subsequent testing confirmed this.

Catching other culprits of nondeterminism requires other tools. “MPI introduces some nondeterminism, too,” notes Ahn. “MPI applications can take different computational paths and have different computational results depending on the order in which messages are received.” Ahn and his DEG colleagues are developing a method called ReMPI for recording and replaying program execution to aid in nondeterminism debugging of parallel applications. The replay mechanism uses the recorded traces to ensure that every MPI process observes the same state as the recorded run. ReMPI has a modest storage footprint, unlike other record-and-replay approaches, because it records only MPI messages that arrive in a different order than predicted—a small proportion of the total number of messages.

In the coming years, the DEG group anticipates a growing need for reproducibility tools, especially as Livermore researchers begin preparing applications to run on machines such as Sierra that rely more heavily on graphics processing units and that, says Ahn, will produce “an explosion of parallelism.” Thanks to projects such as PRUNER, researchers can be confident that the right tools will be on hand. “Every minute we can save for the scientists means more time for scientific discovery,” observes Ahn. “That’s why reproducibility debugging tools are important.”