Today’s largest supercomputers contain millions of cores and cost hundreds of millions of dollars. A single fault or bug that disables just one process can halt an entire parallel application. Diagnosing and fixing such failures has traditionally been time-consuming and expensive. Earlier debugging tools were not designed for extreme-scale systems and can take minutes to perform a single debugging operation, and a typical debug session involves tens to hundreds of such operations. The Stack Trace Analysis Tool (STAT) reduces the number of processes that require in-depth analysis by grouping the processes of a parallel application according to their behavioral patterns.
STAT is an analytic tool designed and developed by computer scientists Greg Lee, Dong Ahn, Bronis de Supinski, Matthew LeGendre, and Martin Schulz, with collaborators at the University of Wisconsin at Madison and the University of New Mexico. STAT works by grouping similar processes at suspicious points in an application’s execution. It quickly and automatically identifies anomalies and outliers—processes that cannot be grouped or whose behavior is substantially different—because they often indicate flawed execution. STAT achieves this grouping by dynamically examining the state of each process and extracting the sequence of function calls (known as call stacks) that led to the current point of execution. In this way, STAT can relate the state of the processes to each other.
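The grouping idea can be illustrated with a minimal Python sketch. This is not STAT's implementation or API: the function names (group_by_call_stack, find_outliers), the max_size threshold, and the sample traces are all hypothetical, and each call stack is assumed to be available already as a list of function names keyed by process rank.

```python
from collections import defaultdict


def group_by_call_stack(stacks):
    """Map each distinct call stack to the sorted list of ranks that share it."""
    groups = defaultdict(list)
    for rank, frames in stacks.items():
        groups[tuple(frames)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


def find_outliers(groups, max_size=1):
    """Return (stack, ranks) pairs for groups with at most max_size members."""
    small = [(stack, ranks) for stack, ranks in groups.items()
             if len(ranks) <= max_size]
    # Smallest groups first, ties broken by rank list.
    return sorted(small, key=lambda item: (len(item[1]), item[1]))


# Hypothetical traces: seven ranks wait in a collective, one is still computing.
stacks = {rank: ["main", "solve", "MPI_Allreduce"] for rank in range(8)}
stacks[5] = ["main", "solve", "compute_flux"]

groups = group_by_call_stack(stacks)
outliers = find_outliers(groups)
for stack, ranks in outliers:
    print(" > ".join(stack), "ranks:", ranks)
```

In this toy case the eight processes collapse into two groups, and the single-member group is the natural place to start looking. Treating small groups as suspicious is a simplifying assumption of the sketch; a real application may legitimately have small groups, such as a dedicated I/O rank.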
STAT offers varying levels of detail in the call stacks, from coarser function granularity to specific source-code line numbers. Because it gathers stack traces across the entire application, it provides a global view of what every process is doing. These stack traces are merged to reduce the problem search space, so users can identify a small yet representative subset of tasks on which to apply heavyweight analysis.
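One way to picture the merge step is as a prefix tree over call stacks, with the level of detail controlling how finely processes are separated. The Python sketch below works under stated assumptions: the tree layout, the function names (merge_stacks, equivalence_classes, representatives), the with_lines flag, and the sample data are hypothetical and do not reflect STAT's actual data structures or interface.

```python
def merge_stacks(stacks, with_lines=False):
    """Merge per-rank call stacks into a prefix tree.

    Each node is {"ranks": set of ranks passing through it,
                  "children": {frame label: node}}.
    Frames are (function, line) pairs; with_lines selects line granularity.
    """
    root = {"ranks": set(), "children": {}}
    for rank, frames in stacks.items():
        node = root
        node["ranks"].add(rank)
        for function, line in frames:
            label = f"{function}:{line}" if with_lines else function
            node = node["children"].setdefault(
                label, {"ranks": set(), "children": {}})
            node["ranks"].add(rank)
    return root


def equivalence_classes(node, path=()):
    """Yield (path, ranks) for the ranks whose stack ends at each node."""
    ending = set(node["ranks"])
    for child in node["children"].values():
        ending -= child["ranks"]
    if ending:
        yield path, sorted(ending)
    for label, child in sorted(node["children"].items()):
        yield from equivalence_classes(child, path + (label,))


def representatives(root):
    """Pick the lowest rank from each class as its representative."""
    return sorted(ranks[0] for _, ranks in equivalence_classes(root))


# Hypothetical traces: (function, source line) frames for five ranks.
stacks = {
    0: [("main", 12), ("solve", 40), ("MPI_Waitall", 88)],
    1: [("main", 12), ("solve", 40), ("MPI_Waitall", 88)],
    2: [("main", 12), ("solve", 44), ("MPI_Waitall", 88)],
    3: [("main", 12), ("solve", 44), ("MPI_Waitall", 88)],
    4: [("main", 12), ("io_write", 7)],
}

coarse = representatives(merge_stacks(stacks))
fine = representatives(merge_stacks(stacks, with_lines=True))
print("function granularity:", coarse)
print("line granularity:", fine)
```

At function granularity the five ranks fall into two classes, so two representatives are enough for heavyweight analysis. At line granularity the ranks waiting in solve split by call site, giving three. Choosing the lowest rank in each class is an arbitrary policy for the sketch; any member of a class would serve.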
STAT can not only identify a process that is stuck at a single location in the code but also pinpoint the exact task causing the hang. It also derives the relative execution progress of each application task, which helps determine which processes are problematic. The tool has run on a wide range of supercomputer platforms, including Linux clusters and the IBM BlueGene family of machines, and it has shipped with the Cray XT, XE, and XK high-performance computers.
On the Sequoia system at LLNL, STAT proved critical in quickly identifying application errors, system software bugs, and hardware faults that occurred at extreme scales, in some cases on more than one million cores, during the installation and commissioning of the machine. STAT is also expected to work on the even larger machines due to roll out over the next several years.