Open|SpeedShop


  1. Overview
  2. Platforms and Locations
  3. Quick Start
  4. Output
  5. Compiling and Linking
  6. Run-time Options
  7. Troubleshooting
  8. Documentation and References


Overview

Open|SpeedShop is a comprehensive, open source performance analysis tool framework that integrates the most common performance analysis steps in one tool. Open|SpeedShop supports:
  • Program Counter Sampling
  • Callstack Analysis
  • Hardware Performance Counters
  • MPI Profiling and Tracing
  • I/O Profiling and Tracing
  • Floating Point Exception Analysis
  • Memory Function Tracing
  • Pthreads Function Tracing
  • NVIDIA CUDA Event Tracing

All analysis is performed on unmodified binaries, and Open|SpeedShop can be used with serial, MPI, and threaded codes. Open|SpeedShop performance data collection is built around "experiments": each experiment collects a specific type of performance data.

Three user interface options are provided: a graphical user interface, a command line interface, and a Python scripting API.

The Component Based Tool Framework (CBTF) is a new, experimental implementation of Open|SpeedShop that improves tool scalability and adds new features. It is a younger product still under development, but should be stable on LLNL machines.

Platforms and Locations

    Platform       Location                           Notes
    x86_64 Linux   /usr/global/tools/openspeedshop/   Multiple versions are available. Load with Dotkit.
    BG/Q           /usr/global/tools/openspeedshop/   Multiple versions are available. Load with Dotkit.

Quick Start

  1. Determine which Dotkit version of Open|SpeedShop you want to load, and then load that Dotkit package. Note that at LC, the CBTF version has been made available. For example:

    % use -l openss
    performance/profile ----------
       openss-mvapich2 - Open|Speedshop (Version 2.1 CBTF for x8664)
              openss21 - Open|Speedshop (Version 2.1)
                openss - Open|Speedshop (Version 2.1 CBTF for x8664)
    % use -l cbtf
    performance/profile ----------
         cbtf-mvapich2 - Open|Speedshop (Version 2.1 CBTF for x8664)
                  cbtf - Open|Speedshop (Version 2.1 CBTF for x8664)
       openss-mvapich2 - Open|Speedshop (Version 2.1 CBTF for x8664)
                openss - Open|Speedshop (Version 2.1 CBTF for x8664)
    % use openss
    Prepending: openss (ok)

  2. Determine which experiment you want to run, based upon the type(s) of performance data you are interested in collecting. The available experiments are shown in the table below.

    Experiment Description
    pcsamp Periodic sampling of the program counter gives a low-overhead view of where time is being spent in the user application.
    usertime Periodic sampling of the call path allows the user to view inclusive and exclusive time spent in application routines. It also allows the user to see which routines called which routines. Several views are available, including the "hot" path.
    hwc Hardware events (including clock cycles, graduated instructions, instruction and data cache and TLB misses, and floating-point operations) are counted at the machine instruction, source line, and function levels.
    hwcsamp Similar to hwc, except that sampling is based on time, not PAPI event overflows. Up to six events may be sampled during the same experiment.
    hwctime Similar to hwc, except that call path sampling is also included.
    io Accumulated wall-clock durations of input/output (I/O) system calls: read, readv, write, writev, open, close, dup, pipe, creat and others. Shows call paths for each unique I/O call path.
    iop* Lightweight I/O profiling: accumulated wall-clock durations of I/O system calls (read, readv, write, writev, open, close, dup, pipe, creat and others), but individual call information is not recorded.
    iot Similar to io, except that more information is gathered, such as bytes moved, file names, etc.
    mpi Captures the time spent in and the number of times each MPI function is called. Shows call paths for each unique MPI call path.
    mpip* Lightweight MPI profiling: captures the time spent in and the number of times each MPI function is called. Shows call paths for each unique MPI call path, but individual call information is not recorded.
    mpit Records each MPI function call event with specific data for display using a GUI or a command line interface (CLI). The trace format option displays the data for each call, showing its start and end times.
    fpe Finds where each floating-point exception occurred. A trace collects each exception with its exception type and the call stack contents. These measurements are exact, not statistical.
    mpiotf Writes an MPI call trace to Open Trace Format (OTF) files to allow viewing with Vampir or converting to the formats of other tools.
    mem* Captures the time spent in and the number of times each memory function is called. Shows call paths for each memory function's unique call path.
    pthreads* Captures the time spent in and the number of times each POSIX thread function is called. Shows call paths for each POSIX thread function's unique call path.
    cuda* Captures the NVIDIA CUDA events that occur during application execution and reports the time spent in each event, along with its arguments, in an event-by-event trace.

    * Only available in Open|SpeedShop using CBTF collection mechanism (currently under development)

    Usage Notes:

    • Each experiment maps to an Open|SpeedShop command: "oss" prepended to the experiment name. For example, to run the "fpe" experiment, you would use the command "ossfpe".
    • Most experiments have options. To get additional information on a command, simply enter the command name with no arguments. You can also enter the command followed by "help" or "--help".
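    The experiment-to-command mapping can be sketched in shell (a few experiment names from the table above; illustrative only):

    ```shell
    # Each experiment name gains an "oss" prefix to form its convenience command.
    for exp in pcsamp usertime hwc io mpi; do
      echo "oss${exp}"
    done
    # → osspcsamp, ossusertime, osshwc, ossio, ossmpi (one per line)
    ```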

  3. Run the experiment of choice on your application, providing any required arguments and any Open|SpeedShop options. The general syntax for MPI codes at LC and examples are shown below.
    osscmd "srun [srun options] usercmd [usercmd options]" [osscmd options]
    Example: collect PAPI hardware counter events for matmult executable running with 4 MPI tasks. Uses the default PAPI event PAPI_TOT_CYC.
    % osshwc "srun -n4 matmult"

    Example: collect multiple PAPI hardware counter events plus program counter sampling for matmult executable running with 4 MPI tasks. Specifies L1, L2 and L3 cache misses in addition to CPU time provided by PC sampling.
    % osshwcsamp "srun -n4 matmult" PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM
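    Note that the entire srun command line is passed to the oss* script as a single quoted argument. A sketch of how such an invocation is assembled (using the example names above):

    ```shell
    # Build the osshwcsamp invocation: the inner srun command must stay one quoted argument.
    inner='srun -n4 matmult'
    events='PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM'
    echo "osshwcsamp \"${inner}\" ${events}"
    # → osshwcsamp "srun -n4 matmult" PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM
    ```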

  4. Open|SpeedShop output consists of text written to stdout and a database file. These are discussed in the Output section below.


Output

As your program executes, Open|SpeedShop will write status/diagnostic information to stdout. Upon program completion, a report will also be produced on stdout. The format and content of the report depend upon the Open|SpeedShop experiment that was run.

Open|SpeedShop will also create a database file in the working directory, named after the executable and the experiment (for example, matmult-hwcsamp.openss).

The database file is used for analysis with the Open|SpeedShop GUI.
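The default naming convention can be sketched as follows (the pattern is inferred from the matmult-hwcsamp.openss example later in this section; actual names may differ, e.g. by carrying a sequence number):

```shell
# Hypothetical: the default database name combines the executable and experiment names.
exe=matmult
experiment=hwcsamp
echo "${exe}-${experiment}.openss"
# → matmult-hwcsamp.openss
```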

Examples of both types of output are shown below.

% osshwcsamp "srun -n4 matmult" PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM

[openss]: hwcsamp experiment using input papi event: "PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM".
[openss]: hwcsamp experiment using the hwc experiment default sampling_rate: "100".
[openss]: hwcsamp experiment calling openss.
[openss]: Setting up offline raw data directory in /p/lscratche/blaise/offline-oss
[openss]: Running offline hwcsamp experiment using the command:
"srun -n4 /collab/usr/global/tools/openspeedshop/oss-dev/x8664/oss_offline_v2.1u4/bin/ossrun -c hwcsamp ./matmult"

Output from matmult executable removed for clarity

[openss]: Converting raw data from /p/lscratche/blaise/offline-oss into temp file X.0.openss

Processing raw data for matmult ...
Processing processes and threads ...
Processing performance data ...
Processing symbols ...
Resolving symbols for /g/g0/blaise/matmult/matmult
Resolving symbols for /usr/local/tools/mvapich-gnu-1.2/lib/shared/libmpich.so.1.0
Resolving symbols for /lib64/libc-2.12.so
Resolving symbols for /lib64/libpthread-2.12.so
Resolving symbols for /usr/lib64/libpsm_infinipath.so.1.14
Resolving symbols for /usr/lib64/libinfinipath.so.4.0
Updating database with symbols ... 
Finished ... 

[openss]: Restoring and displaying default view for:
[openss]: The restored experiment identifier is:  -x 1

Exclusive    % of CPU  papi_l1_tcm  papi_l2_tcm  papi_l3_tcm  Function (defining location)
 CPU time        Time                                      
29.130000   75.978091  6737152983  1648401202   554678152  multiply_matrices_ (matmult: matmult.f90,25)
 5.770000   15.049557    58094130    25937153     4460313  ips_ptl_poll (libpsm_infinipath.so.1.14)
 1.260000    3.286385     6211521     4328113      572851  __psmi_poll_internal (libpsm_infinipath.so.1.14)
 1.180000    3.077726     4660014     3480615      262262  psm_mq_wait (libpsm_infinipath.so.1.14)
 0.440000    1.147626     4010322     1844752      282742  __GI___sched_yield (libc-2.12.so: syscall-template.S,82)
 0.270000    0.704225     1280815      958226       79788  MAIN__ (matmult: matmult.f90,39)
 0.160000    0.417319    10697294     5052727     1891272  ipath_dwordcpy (libinfinipath.so.4.0)
 0.050000    0.130412       12954        7680        1581  initialize_ (matmult: matmult.f90,4)
 0.010000    0.026082     2297350      559612      186424  MPID_PSM_RecvComplete (libmpich.so.1.0: psmrecv.c,73)
 0.010000    0.026082     2317920      568133      192646  MPID_PSM_Send (libmpich.so.1.0: psmsend.c,36)
 0.010000    0.026082     1117412      515398      270530  pthread_spin_lock (libpthread-2.12.so: pthread_spin_lock.c,35)
 0.010000    0.026082      893876      490998      136812  psmi_amsh_long_reply (libpsm_infinipath.so.1.14)
 0.010000    0.026082       41437       31068        2099  ips_spio_transfer_frame (libpsm_infinipath.so.1.14)
 0.010000    0.026082     2315250      569334      193507  ips_proto_flow_enqueue (libpsm_infinipath.so.1.14)
 0.010000    0.026082     2297885      559894      186956  ips_proto_process_packet_inner (libpsm_infinipath.so.1.14)
 0.010000    0.026082     1114033      512588      270345  ipath_dwordcpy_safe (libinfinipath.so.4.0)
38.340000  100.000000  6834515196  1693817493   563668280  Report Summary

To view the database output, find the relevant *.openss file in your working directory, and then call the Open|SpeedShop GUI with that file. For example:

% openss matmult-hwcsamp.openss

The GUI will then appear, displaying the results of the experiment. Screenshots for several different experiments are shown below. Users should consult the Open|SpeedShop documentation for details.

Manager and Profile Summary Windows

Thread Bar Chart

Function Bar Chart

3D Visualization

Profile Summary for PAPI_L2_DCM Event

Function Bar Chart for PAPI_L2_DCM Event

Compiling and Linking

Open|SpeedShop experiments operate on executable binaries, so there are no special compilation or linking requirements. Users just need to ensure they load an appropriate Open|SpeedShop Dotkit package; see step 1 under the Quick Start instructions.

Run-time Options

All Open|SpeedShop experiment commands have options. To view the available options, simply enter the name of the experiment command by itself, or call it with the --help flag.

In addition to these, there are a number of environment variables that can be used to direct run-time behavior. Some of these are optional, and some may be required. For details, see Open|SpeedShop documentation. For convenience, key environment variables are listed in the table below, reproduced from the Open|SpeedShop User's Guide.

Environment Variable Description

OPENSS_RAWDATA_DIR
    Used on cluster systems where a /tmp file system is unique to each node. It specifies the location of a shared file system path, which is required for O|SS to save the raw data files on distributed systems.
    Format: OPENSS_RAWDATA_DIR=shared file system path
    Example: export OPENSS_RAWDATA_DIR=/lustre4/fsys/userid

OPENSS_ENABLE_MPI_PCONTROL
    Activates MPI_Pcontrol function recognition; otherwise, MPI_Pcontrol function calls will be ignored by O|SS.

OPENSS_DATABASE_ONLY
    When running the Open|SpeedShop convenience scripts, only create the database file and do NOT put out the default report. Used to reduce the size of batch output files if the user is not interested in the default report.

OPENSS_RAWDATA_ONLY
    When running the Open|SpeedShop convenience scripts, only gather the performance information into the OPENSS_RAWDATA_DIR directory; do NOT create the database file and do NOT put out the default report.

OPENSS_DB_DIR
    Specifies the path where O|SS will build the database file. On a file system without file locking enabled, the SQLite component cannot create the database file; this variable specifies a path to a file system with locking enabled. This is usually needed on Lustre file systems that don't have locking enabled.
    Format: OPENSS_DB_DIR=file system path
    Example: export OPENSS_DB_DIR=/opt/filesys/userid

OPENSS_MPI_IMPLEMENTATION
    Specifies the MPI implementation in use by the application; only needed for the mpi, mpit, and mpiotf experiments. The currently supported MPI implementations are: openmpi, lampi, mpich, mpich2, mpt, lam, mvapich, mvapich2. For Cray, IBM, and Intel MPI implementations, use mpich2. In most cases, O|SS can auto-detect the MPI in use.
    Example: export OPENSS_MPI_IMPLEMENTATION=openmpi
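A hypothetical environment setup for an MPI experiment on a cluster, combining the variables above (all paths are placeholders for site-specific file systems):

```shell
# Hypothetical setup; substitute your own shared and lock-enabled file system paths.
export OPENSS_RAWDATA_DIR=/lustre4/fsys/userid   # shared file system for raw data
export OPENSS_DB_DIR=/opt/filesys/userid         # file system with locking for the SQLite database
export OPENSS_MPI_IMPLEMENTATION=mvapich2        # usually unnecessary; O|SS can auto-detect
echo "$OPENSS_RAWDATA_DIR $OPENSS_MPI_IMPLEMENTATION"
# → /lustre4/fsys/userid mvapich2
```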


Troubleshooting

  • Open|SpeedShop is a complex toolkit, and as such, troubleshooting problems may be difficult for the average user.
  • The most common problem is forgetting to load the Open|SpeedShop environment using the "use openss-packagename" command.
  • Most problems, if not easily resolved, should be reported to the LC Hotline.

Documentation and References

The most important Open|SpeedShop links are listed below. Searching the web will find additional Open|SpeedShop documentation and presentations hosted by third parties.