The scale of today's fastest supercomputers surpasses the capabilities of even the most advanced debuggers. For instance, Lawrence Livermore National Laboratory's Sequoia has 1.6 million cores, far more than full-featured parallel debuggers can handle. With future architectures, this gap will only grow wider. To help fill it, we developed the Stack Trace Analysis Tool (STAT), which identifies groups of processes in a parallel application that exhibit similar behavior. A single representative of each group can then be examined with a full-featured debugger like TotalView or DDT for more in-depth analysis.
A video describing STAT and demonstrating its use is available on YouTube.
The Stack Trace Analysis Tool gathers and merges stack traces from a parallel application's processes. The tool produces call graphs: 2D spatial and 3D spatial-temporal; the graphs encode calling behavior of the application processes in the form of a prefix tree. The 2D spatial call prefix tree represents a single snapshot of the entire application (Figure 1). The 3D spatial-temporal call prefix tree represents a series of snapshots from the application taken over time (Figure 2). In these graphs, the nodes are labeled by function names. The directed edges, showing the calling sequence from caller to callee, are labeled by the set of tasks that follow that call path. Nodes that are visited by the same set of tasks are assigned the same color, giving a visual reference to the various equivalence classes.
Figure 1. A 2D spatial call prefix tree.
Figure 2. A 3D spatial-temporal call prefix tree.
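The merging described above can be illustrated with a minimal Python sketch. This is not STAT's implementation; the function names (merge_stack_traces, equivalence_classes) and the sample traces are hypothetical, and the tree is represented simply as a mapping from each call path to the set of tasks that followed it. Call paths that share the same task set fall into one equivalence class, which is what the node coloring conveys.

```python
from collections import defaultdict


def merge_stack_traces(traces):
    """Merge per-task stack traces into a call prefix tree.

    `traces` maps a task rank to its stack as a list of function names,
    outermost frame first. The result maps each call path (a tuple of
    function names) to the set of tasks whose stack passes through it.
    """
    tree = defaultdict(set)
    for rank, frames in traces.items():
        # Every prefix of a stack is a node in the prefix tree.
        for depth in range(1, len(frames) + 1):
            tree[tuple(frames[:depth])].add(rank)
    return dict(tree)


def equivalence_classes(tree):
    """Group call paths by the exact set of tasks that reached them."""
    classes = defaultdict(list)
    for path, tasks in tree.items():
        classes[frozenset(tasks)].append(path)
    return dict(classes)


# Hypothetical snapshot: task 1 is computing while the others wait in a barrier.
traces = {
    0: ["main", "do_work", "MPI_Barrier"],
    1: ["main", "do_work", "compute"],
    2: ["main", "do_work", "MPI_Barrier"],
}
tree = merge_stack_traces(traces)
classes = equivalence_classes(tree)
for tasks, paths in classes.items():
    print(sorted(tasks), [" -> ".join(p) for p in paths])
```

In this toy snapshot, the paths main and main -> do_work are shared by all three tasks, while the leaves split the tasks into {0, 2} and {1}; a debugger would then only need to examine one task from each class.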
STAT is also capable of gathering stack traces with more fine-grained information, such as the program counter or the source file and line number of each frame. The latter information can be fed into a static code analysis engine to derive the temporal order of the MPI tasks (Figure 3). This analysis traverses from the root of the tree towards the leaves, at each step determining the partial ordering of sibling nodes in the tree by analyzing the control flow of the source code. For straight-line code, this means that one task has made more progress than another if it has executed past the other task's program point, i.e., if it is at a greater line number. The ordering is partial because two tasks in different branches of an if-else are incomparable. In cases where the program points are within a loop, STAT can extract the loop ordering variable from the application processes and further delineate tasks by execution progress. This analysis can be useful for identifying the culprit in a deadlocked or livelocked application, where the problematic task has often made either the least or the most progress through the code, leaving the remaining tasks stuck in a barrier or blocked pending a message. Note: This feature is still a prototype. Please contact Greg Lee for an experimental version.
Figure 3. STAT's temporal ordering analysis engine indicates that task 1 has made the least progress. In this example, task 1 is stuck in a compute cycle, while the other tasks are blocked in MPI communication, waiting for task 1.
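The partial ordering used by the temporal analysis can be sketched in Python under strong simplifying assumptions. This is a toy model, not STAT's analysis engine: each task's program point is reduced to a hypothetical (branch, line) pair, points are comparable only when they carry the same branch label, and loops are ignored entirely. A real control-flow analysis would also recognize that a point before an if-else precedes both of its arms.

```python
def compare_progress(a, b):
    """Compare two program points within the same function.

    Each point is a hypothetical (branch, line) pair, where `branch`
    labels the control-flow region the task is in and `line` is the
    source line number. Returns -1, 0, or 1 when the points are
    comparable, or None when they lie in different branches.
    """
    branch_a, line_a = a
    branch_b, line_b = b
    if branch_a != branch_b:
        return None  # e.g. the two arms of an if-else: incomparable
    return (line_a > line_b) - (line_a < line_b)


def least_progress(points):
    """Return the tasks that are not known to be ahead of any other task."""
    minimal = []
    for task, point in points.items():
        is_ahead_of_someone = any(
            compare_progress(other, point) == -1
            for other_task, other in points.items()
            if other_task != task
        )
        if not is_ahead_of_someone:
            minimal.append(task)
    return sorted(minimal)


# Straight-line code: task 1 is at an earlier line than tasks 0 and 2.
straight_line = {0: ("body", 42), 1: ("body", 30), 2: ("body", 42)}
print(least_progress(straight_line))  # [1]

# Different branches of an if-else: neither task is behind the other.
branching = {0: ("if", 10), 1: ("else", 20)}
print(least_progress(branching))  # [0, 1]
```

Because the order is only partial, the result is a set of minimal tasks rather than a single answer; in the straight-line case it narrows to task 1, mirroring the situation in Figure 3.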
We have built a graphical user interface (GUI) to run STAT and to visualize STAT's outputted call prefix trees (Figure 4). This GUI provides a variety of operations to help focus on particular call paths and tasks of interest. It can also be used to identify the various equivalence classes and includes an interface to attach a heavyweight debugger to the representative subset of tasks.
The STAT GUI is available on all CHAOS x86_64 machines and BlueGene systems in /usr/local/bin/stat-gui. Man pages are also available (man stat-gui). Full documentation can be found in /usr/local/tools/stat/doc/ and in the resources section.
Figure 4. A screenshot of the STAT GUI.
The toolbar on the left allows access to STAT's core operations on the application:
Figure 5. The STAT GUI attach dialog.
The attach dialog allows you to select the application to attach to. Note: You will want to attach to the job launcher process (srun on CHAOS and BlueGene/Q systems or mpirun on BlueGene/P systems).
By default, the attach dialog searches the localhost for the job launcher process, but you may specify an alternative hostname in the "Search Remote Host" text entry field. Thus, you may attach STAT to a batch job from a login node. On CHAOS systems, the appropriate host is usually the lowest numbered node in your allocation. On BlueGene/P systems, there is usually a single dedicated node (e.g., rzdawndev4 or dawn13), and on BlueGene/Q, it is one of the dedicated nodes (e.g., seqlac5, seqlac6, vulcanlac5, or vulcanlac6). In general, you may be able to run `squeue -j <your_slurm_job_id> -tr -o "%.7i %B"` to find the appropriate node.
When you left click on a node in the graph, you will get a pop-up window that lists the function name and the full set of tasks that took that call path (Figure 6). Right clicking on a node provides a pop-up menu with the same options.
Figure 6. The Node pop-up window.
The pop-up window has several buttons that allow you to manipulate the graph, allowing you to focus on areas of interest. Each button is defined as follows:
Figure 7. The Source View window. The colored arrows correspond to the nodes in Figure 4.
The main window also has several tree manipulation options (note all of these operate on the current, visible state of the tree):
STAT is available on all CHAOS x86_64 machines and BlueGene systems in /usr/local/bin/stat-cl. Man pages are also available (man stat-cl), and a usage message will be printed by executing stat-cl -h. Full documentation can be found in /usr/local/tools/stat/doc/ and in the resources section. The "-c" option is recommended to gather a comprehensive set of stack traces with varying levels of detail.
An example using STAT on a job submitted through mxterm is provided below:
rzmerl16@lee218:srun -n 64 mpi_ringtopo &
rzmerl13, MPI task 1 of 64 stalling
rzmerl16@lee218:ps
PID TTY TIME CMD
25695 pts/1 00:00:00 bash
25742 pts/1 00:00:00 srun
25795 pts/1 00:00:00 ps
You will need to invoke STAT on the PID of the srun (or mpirun on BG/P) process:
rzmerl16@lee218:stat-cl -c 25742
Attaching to job launcher and launching tool daemons...
Tool daemons launched and connected!
Attaching to application...
Application already paused... ignoring request to pause
Detaching from application...
Results written to /g/g0/lee218/src/STAT/stat_results/mpi_ringtopo.0222
STAT can also be invoked on batch jobs if you know the host name and the PID of the srun process. On LLNL Linux clusters, this will typically be on the lowest numbered node that you can see in the "squeue" output. Further, you can invoke STAT from a login node by providing the <hostname>:<pid>:
rzmerl156@lee218:squeue | grep lee218
24119 pdebug test.msu lee218 R 0:01 2 rzmerl[11-12]
rzmerl156@lee218:squeue -j 24119 -tr -o "%.7i %B"
rzmerl156@lee218:rsh rzmerl11 ps x | grep srun
2882 ? Sl 0:00 srun -n 32 mpi_ringtopo
rzmerl156@lee218:stat-cl -c rzmerl11:2882
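The manual lookup above could be scripted. The following Python sketch parses text in the `ps x` format shown above to find the job launcher's PID and builds the <hostname>:<pid> argument for stat-cl. The helper names (find_launcher_pids, stat_target) and the sample text are hypothetical; the sketch only processes a string and does not itself run rsh, ps, or stat-cl.

```python
def find_launcher_pids(ps_output, launcher="srun"):
    """Return the PIDs in `ps x` output whose command is `launcher`.

    Assumes the usual `ps x` column layout: PID TTY STAT TIME COMMAND.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header and any line too short to hold a command.
        if len(fields) >= 5 and fields[0].isdigit() and fields[4] == launcher:
            pids.append(int(fields[0]))
    return pids


def stat_target(host, pid):
    """Format the <hostname>:<pid> argument that stat-cl accepts."""
    return f"{host}:{pid}"


# Sample text in the format of the `ps x` output shown above.
sample = """\
  PID TTY      STAT   TIME COMMAND
 2870 ?        S      0:00 -bash
 2882 ?        Sl     0:00 srun -n 32 mpi_ringtopo
"""
pids = find_launcher_pids(sample)
print(stat_target("rzmerl11", pids[0]))  # rzmerl11:2882
```

The printed value is what would follow `stat-cl -c` on the command line. If more than one launcher process is found on the node, choosing the right PID is left to the user.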
Once STAT has completed execution, it will print the message "Results written to <directory>". This results directory contains several files, including merged stack traces in AT&T dot format (with .dot extensions). These files can be fed into the stat-view GUI for analysis.
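To locate the merged stack trace files after a run, a small Python sketch such as the following could be used. The function name find_dot_files is hypothetical, and because the exact layout inside the results directory is not described here, the search is recursive and matches only on the .dot extension.

```python
from pathlib import Path


def find_dot_files(results_dir):
    """Return all .dot files under `results_dir`, sorted by path."""
    root = Path(results_dir)
    if not root.is_dir():
        return []  # no results directory yet
    return sorted(root.rglob("*.dot"))


# stat_results is created in the current working directory by a STAT run.
for path in find_dot_files("stat_results"):
    print(path)
```

Each path printed is a candidate file to open with stat-view.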
STAT can be used in conjunction with the IO Watchdog utility, which monitors application output to detect hangs. To enable STAT with the IO Watchdog, add the following to the file $HOME/.io-watchdogrc:
timeout = 20m
actions = STAT, kill
You will then need to run your application with the srun --io-watchdog option:
% srun --io-watchdog mpi_application
When STAT is invoked, it will create a stat_results directory in the current working directory, as it would in a typical STAT run. The outputted .dot files can then be viewed with stat-view. For more details about using IO Watchdog, refer to the IO Watchdog README file in /usr/local/tools/io-watchdog/README.
Dong H. Ahn, Michael J. Brim, Bronis R. de Supinski, Todd Gamblin, Gregory L. Lee, Barton P. Miller, Adam Moody, and Martin Schulz, "Efficient and Scalable Retrieval Techniques for Global File Properties," International Parallel & Distributed Processing Symposium, Boston, MA, May 2013.
Dong H. Ahn, Bronis R. de Supinski, Ignacio Laguna, Gregory L. Lee, Ben Liblit, Barton P. Miller, and Martin Schulz, "Scalable Temporal Order Analysis for Large Scale Debugging," Supercomputing 2009, Portland, Oregon, November 2009.
Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Matthew Legendre, Barton P. Miller, Martin Schulz, and Ben Liblit, "Lessons Learned at 208K: Towards Debugging Millions of Cores," SuperComputing 2008, Austin, Texas, November 2008.
Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Barton P. Miller, and Martin Schulz, "Benchmarking the Stack Trace Analysis Tool for BlueGene/L," International Conference on Parallel Computing (ParCo), Aachen and Jülich, Germany, September 2007.
Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz, "Stack Trace Analysis for Large Scale Applications," International Parallel & Distributed Processing Symposium, Long Beach, California, March 2007.
For questions about usage on LC machines, please contact Greg Lee (OUN: lee218).