AutomaDeD is a tool that automatically diagnoses performance and correctness faults in MPI applications. It builds timing and control-flow models of the processes in an MPI program and, when a failure occurs, analyzes these models probabilistically to find the origin of the failure. The tool outputs the MPI processes ranked by their degree of abnormality, together with the regions of code where faults first manifest. A code region is either an MPI call or the computation between two MPI calls.
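To make the idea of a per-process timing and control-flow model concrete, the following is a minimal sketch and not AutomaDeD's actual implementation. It assumes a hypothetical trace format, a list of (region name, seconds) pairs in execution order, where each region is an MPI call or the computation between two MPI calls. The function name build_model and the region names are illustrative only.

```python
from collections import defaultdict


def build_model(trace):
    """Build a simple timing and control-flow model from one process's trace.

    trace: list of (region, seconds) pairs in execution order (assumed format).
    Returns (probs, mean_time):
      probs[src][dst] is the observed probability of moving from src to dst,
      mean_time[region] is the average time spent in that region.
    """
    counts = defaultdict(lambda: defaultdict(int))
    times = defaultdict(list)

    # Timing part: collect every observed duration per region.
    for region, seconds in trace:
        times[region].append(seconds)

    # Control-flow part: count transitions between consecutive regions.
    for (src, _), (dst, _) in zip(trace, trace[1:]):
        counts[src][dst] += 1

    probs = {
        src: {dst: n / sum(out.values()) for dst, n in out.items()}
        for src, out in counts.items()
    }
    mean_time = {region: sum(v) / len(v) for region, v in times.items()}
    return probs, mean_time


# Hypothetical trace of a single MPI process.
trace = [
    ("MPI_Init", 0.01),
    ("compute_1", 0.50),
    ("MPI_Send", 0.02),
    ("compute_1", 0.70),
    ("MPI_Send", 0.02),
    ("MPI_Finalize", 0.01),
]
probs, mean_time = build_model(trace)
print(probs["MPI_Send"])
print(mean_time["compute_1"])
```

In this sketch the control-flow structure is captured by the transition probabilities and the timing behavior by the per-region averages; a real tool would gather the trace by intercepting MPI calls at run time rather than from a hand-written list.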
AutomaDeD has two major functions: (1) identifying abnormal MPI tasks and code regions, and (2) finding the least-progressed task. The first is accomplished by outlier detection (via scalable clustering) on the runtime models. As its baseline for normal behavior, outlier detection uses a training set, models of previous runs, or, within the faulty run, models of normally behaving processes. The second is accomplished by ordering MPI processes according to their relative progress, which the tool infers by combining probabilistic analysis of the control-flow models with loop progress analysis.
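The ranking of processes by abnormality can be illustrated with a small sketch. This is not the tool's scalable clustering algorithm; it swaps in a simpler nearest-neighbor distance score to show the general idea of detecting outliers within a single faulty run. It assumes each process model has been reduced to a hypothetical dictionary of mean time per code region, and the names distance and rank_by_abnormality are illustrative.

```python
import math


def distance(a, b):
    """Euclidean distance between two models given as {region: mean_seconds}."""
    regions = set(a) | set(b)
    return math.sqrt(sum((a.get(r, 0.0) - b.get(r, 0.0)) ** 2 for r in regions))


def rank_by_abnormality(models, k=1):
    """Rank MPI ranks from most to least abnormal.

    models: {mpi_rank: {region: mean_seconds}} (assumed representation).
    The score of a process is the distance to its k-th nearest neighbor,
    so a process that is far from all others gets a high score.
    """
    if len(models) < 2:
        raise ValueError("need at least two process models to compare")
    scores = {}
    for rank, model in models.items():
        dists = sorted(
            distance(model, other) for r, other in models.items() if r != rank
        )
        scores[rank] = dists[min(k, len(dists)) - 1]
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical models: rank 3 spends far longer in MPI_Send than its peers.
models = {
    0: {"compute_1": 0.50, "MPI_Send": 0.02},
    1: {"compute_1": 0.52, "MPI_Send": 0.02},
    2: {"compute_1": 0.49, "MPI_Send": 0.02},
    3: {"compute_1": 0.51, "MPI_Send": 4.00},
}
ranking = rank_by_abnormality(models)
print(ranking[0])  # the most abnormal rank
```

Comparing each process only against its peers in the same run corresponds to the case described above in which models of normally behaving processes serve as the baseline; with a training set or previous runs, the comparison would instead be made against those stored models.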
Publications
Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Saurabh Bagchi, Todd Gamblin, "Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications," International Conference on Parallel Architectures and Compilation Techniques (PACT 2012), Minneapolis, MN, Sep 2012. LLNL-PROC-548642.
Ignacio Laguna, Todd Gamblin, Bronis R. de Supinski, Saurabh Bagchi, Greg Bronevetsky, Dong H. Ahn, Martin Schulz, Barry Rountree, "Large Scale Debugging of Parallel Tasks with AutomaDeD," ACM/IEEE Conference for High Performance Computing, Networking, Storage and Analysis (SC 2011), Seattle, WA, Nov 2011. LLNL-CONF-486911.
Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, Martin Schulz, "AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks," IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010), Chicago, IL, Jun-Jul 2010. LLNL-CONF-426270.