TotalView Part 3:
Debugging Parallel Programs

Author: Blaise Barney, Lawrence Livermore National Laboratory UCRL-MI-133316

Part 3 Contents

  1. Process/Thread Groups
  2. Debugging Threaded Codes
    1. Overview
    2. Finding Thread Information
    3. Selecting a Thread
    4. Execution Control for Threaded Programs
    5. Viewing and Modifying Thread Data
  3. Debugging OpenMP Codes
    1. Overview
    2. Debugging OpenMP Programs
  4. Debugging MPI Codes
    1. Overview
    2. Starting an MPI Debug Session
    3. Selecting an MPI Process
    4. Controlling MPI Process Execution
    5. Viewing and Modifying Multi-process Data
    6. Displaying Message Queue State
  5. Debugging Hybrid Codes
    1. Overview
    2. Debugging Hybrid Programs
  6. Batch System Debugging
  7. Topics Not Covered
  8. References and More Information
  9. Exercise 3



Preface



Process/Thread Groups


TotalView P/T Groups:

Types of P/T Groups:

Selecting P/T Groups:

Important:



Debugging Threaded Codes

Overview

General Threads Model:

Supported Platforms:

Important Differences:


Finding Thread Information

Root Window:

Process Window:


Selecting a Thread

By Diving:

By Thread Navigation Buttons:

Differentiating Threads:


Execution Control for Threaded Programs

Three Scopes of Influence:

Synchronous vs. Asynchronous:

Warning: With asynchronous thread control, unexpected program behavior (such as hanging) can occur if some threads step or run while others are stopped, particularly in library routines. Pressing CTRL-C may cancel the command that caused the hang.

Thread-specific Breakpoints:


Viewing and Modifying Thread Data

Laminated Variables:

In the Kernel:



Debugging OpenMP Codes

Overview

OpenMP Threads Model:

Supported Platforms:

Supported Features:


Debugging OpenMP Programs

Just Like Threads (sorta):

Setting the Number of Threads:

Code Transformation:

Master Thread vs. Worker Threads:

Example OpenMP Session:

  1. Master thread Stack Trace Pane showing original routine (highlighted) and the outlined routine above it
  2. Process/thread status bars differentiating threads
  3. Master thread Stack Frame Pane showing shared variables
  4. Worker thread Stack Trace Pane showing outlined routine.
  5. Worker thread Stack Frame Pane, in this case showing both private and shared variables
  6. Root Window showing all threads
  7. Threads Pane showing all threads plus selected thread

Execution Control:

Warning: Asynchronous execution, that is, single stepping or running one OpenMP thread while others are stopped, can lead to unexpected program behavior (such as hanging). Pressing CTRL-C may cancel the command that caused the hang.

Viewing and Modifying Data:

Manager Threads:



Debugging MPI Codes

Overview

Multi-Process:

Supported Platforms:


Starting an MPI Debug Session

Just a Little Bit Different:

Example:

  1. Start TotalView with the parallel task manager process. Note that the order of arguments and executables is important, and differs between platforms.

    Examples:

    MVAPICH on Linux under SLURM    totalview srun -a -n 16 -p pdebug myprog
    IBM AIX                         totalview poe -a myprog -procs 4 -rmpool 0
    SGI                             totalview mpirun -a myprog -np 16
    Sun                             totalview mprun -a myprog -np 16
    MPICH                           mpirun -np 16 -tv myprog

  2. The Root Window and Process Window will appear as usual; however, it is the manager process that will be loaded, not your program. Start the manager process by typing g in the Process Window or by:

    Process Window >  Process Menu  >  Go 

  3. A dialog window will then appear, notifying you that this is a parallel job and asking whether you wish to stop the job now. Click Yes (see below). Note: if you click No, the job will begin executing immediately, before you have a chance to set breakpoints, etc.

  4. TotalView will then acquire the MPI tasks which are running under the manager process. When this is done, the Process Window will default to displaying the state information and source for MPI task 0. You are now ready to begin debugging your program.
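
The platform differences in step 1 can be summarized in a small helper. This is only an illustrative sketch: the function name `totalview_command` and its platform keys are hypothetical, and the argument orders are copied from the examples in step 1 rather than from any launcher's documentation, so check them against your own system before relying on them.

```python
def totalview_command(platform, program, ntasks, partition="pdebug"):
    """Build the launch command for one of the example platforms in step 1.

    The platform keys ("slurm", "aix", "sgi", "sun", "mpich") are
    hypothetical labels chosen for this sketch.
    """
    n = str(ntasks)
    if platform == "slurm":
        # TotalView first, then the launcher, then -a and the launcher's arguments
        return ["totalview", "srun", "-a", "-n", n, "-p", partition, program]
    if platform == "aix":
        # Here the program name precedes the launcher options
        return ["totalview", "poe", "-a", program, "-procs", n, "-rmpool", "0"]
    if platform == "sgi":
        return ["totalview", "mpirun", "-a", program, "-np", n]
    if platform == "sun":
        return ["totalview", "mprun", "-a", program, "-np", n]
    if platform == "mpich":
        # MPICH is the odd one out: mpirun starts TotalView through -tv
        return ["mpirun", "-np", n, "-tv", program]
    raise ValueError(f"unknown platform: {platform!r}")


if __name__ == "__main__":
    for name in ("slurm", "aix", "sgi", "sun", "mpich"):
        print(name.ljust(6), " ".join(totalview_command(name, "myprog", 16)))
```

Returning a list of arguments rather than a single string keeps the result usable with `subprocess.run` without shell quoting concerns.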


Selecting an MPI Process

By Diving:

By Process Navigation Buttons:


Example:


Controlling MPI Process Execution

Starting and Stopping Processes:


Warning: If you use accelerator keys to control execution, be sure to type the right key. It is fairly common to use a process-level command by accident instead of a group-level command (and vice versa), for example, typing g instead of G.

Holding and Releasing Processes:

Breakpoints and Barrier Points:

Warning About Single Process Commands:


Viewing and Modifying Multi-process Data

Laminated Variables:


Displaying Message Queue State

Types of Messages Displayed:

Actions:

Message Queue Graph:

Notes:




Debugging Hybrid Codes

Overview

What are "Hybrid" Codes?

Nothing New (Just More of It):

Supported Platforms:


Debugging Hybrid Programs

Starting a Hybrid Code Debug Session:

Tying it All Together:

Example:



Batch System Debugging


Why Debug in Batch?

Using LC's mxterm Utility:

Attaching to a Running Batch Job:

    If you have a batch job that is already running, you can start TotalView on one of the cluster's login nodes and then attach to it.

  1. Log in to the cluster where your job is running

  2. Set up your X11 display environment

  3. Determine where your job is running by using a command such as mjstat or squeue. For example:

    cab669% mjstat | grep joeuser
    331894   joeuser        2 pbatch    R            10:15  cab430
    
    cab669% squeue | grep user2
    329921    pbatch    pmin0   user2   R    9:39:59      4 cab[756,816-817,863]
    

    Note that for multi-node, parallel MPI jobs:

    • mjstat only shows the node where the MPI manager task (srun) is running
    • squeue will show all nodes, but the first node in the list is where the MPI manager process is running.

  4. Start TotalView alone: totalview

  5. When the Session Manager dialog box appears (below), select A running program (attach):

  6. An Attach to running program(s) dialog box will then appear (below):
    1. Click on the H+ button to add a host
    2. An Add Host dialog box will appear. Enter the name of the node obtained from the mjstat or squeue command above. Then click OK.

  7. The contents of the Attach to running program(s) dialog box will change after a connection is made to the specified node (below):
    1. Click on the name of your executable in the process list. If it is an MPI job, click on the srun process.
    2. Click on the Start Session button.

  8. A Process Window will then appear with the selected executable now attached to TotalView. If you are running an MPI job, it will be the manager task. You can now debug as usual.
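
Step 3 notes that the first node in the squeue list is where the MPI manager process runs. As a minimal sketch of picking that node out of the compressed list, the helpers below parse a line shaped like the squeue example above. The function names are hypothetical, and the code assumes the node list is the last whitespace-separated column and uses the simple `prefix[a,b-c]` form shown in the example; other layouts are not handled.

```python
import re


def first_node(nodelist):
    """Return the first host named in a compressed node list.

    Handles only the simple form shown in the example, such as
    "cab430" or "cab[756,816-817,863]".
    """
    match = re.match(r"([^\[,]+)(?:\[([^\]]+)\])?", nodelist.strip())
    if match is None:
        raise ValueError(f"unrecognized node list: {nodelist!r}")
    prefix, ranges = match.group(1), match.group(2)
    if ranges is None:
        # Single node, nothing to expand
        return prefix
    # First comma-separated entry, and the low end if that entry is a range
    first = ranges.split(",")[0].split("-")[0]
    return prefix + first


def manager_node_from_squeue_line(line):
    """Assume the node list is the last column, as in the example output."""
    return first_node(line.split()[-1])


if __name__ == "__main__":
    sample = "329921    pbatch    pmin0   user2   R    9:39:59      4 cab[756,816-817,863]"
    print(manager_node_from_squeue_line(sample))
```

The resulting name is what would be entered in the Add Host dialog box in step 6.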



Topics Not Covered


TotalView includes a number of other features and functions not covered in this tutorial. A partial list of these appears below. Please consult the TotalView Documentation for more information.






This concludes TotalView Part 3




References and More Information


The most useful documentation and reference material comes from TotalView's vendor, Rogue Wave Software, Inc. You can download it from the TotalView section of their website.

If you already have TotalView installed, the same documentation comes with the installation and is available from the install directory and by using TotalView's "Help" menu.