All analysis is performed on unmodified binaries, and the tool can be used on serial, MPI, and threaded codes. Open|SpeedShop performance data collection is built around "experiments"; each experiment collects a specific type of performance data.
Three user interface options are provided: a graphical user interface, a command-line interface, and a Python scripting API.
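For example, once the tool is loaded (see the Dotkit step below), the graphical and command-line interfaces are typically launched with:

% openss          # graphical user interface
% openss -cli     # interactive command line interface

Consult the Open|SpeedShop documentation for details on the Python scripting interface.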
The Component Based Tool Framework (CBTF) is a new, experimental implementation of Open|SpeedShop that improves tool scalability and adds new features. It is a younger product still under development, but should be stable on LLNL machines.
|x86_64 Linux||/usr/global/tools/openspeedshop/||Multiple versions are available. Load with Dotkit.|
|BG/Q||/usr/global/tools/openspeedshop/||Multiple versions are available. Load with Dotkit.|
% use -l openss
performance/profile ----------
 openss-mvapich2 - Open|Speedshop (Version 2.1 CBTF for x8664)
 openss21 - Open|Speedshop (Version 2.1)
 openss - Open|Speedshop (Version 2.1 CBTF for x8664)

% use -l cbtf
performance/profile ----------
 cbtf-mvapich2 - Open|Speedshop (Version 2.1 CBTF for x8664)
 cbtf - Open|Speedshop (Version 2.1 CBTF for x8664)
 openss-mvapich2 - Open|Speedshop (Version 2.1 CBTF for x8664)
 openss - Open|Speedshop (Version 2.1 CBTF for x8664)

% use openss
Prepending: openss (ok)
|pcsamp||Periodic sampling of the program counter gives a low-overhead view of where time is being spent in the user application.|
|usertime||Periodic sampling of the call path allows the user to view inclusive and exclusive time spent in application routines. It also allows the user to see which routines called which routines. Several views are available, including the "hot" path.|
|hwc||Hardware events (including clock cycles, graduated instructions, instruction and data cache and TLB misses, floating-point operations) are counted at the machine instruction, source line and function levels.|
|hwcsamp||Similar to hwc, except that sampling is based on time, not PAPI event overflows. Up to six events may be sampled during the same experiment.|
|hwctime||Similar to hwc, except that call path sampling is also included.|
|io||Accumulated wall-clock durations of input/output (I/O) system calls: read, readv, write, writev, open, close, dup, pipe, creat and others. Shows the call path for each unique I/O call path.|
|iop*||Lightweight I/O profiling: accumulated wall-clock durations of I/O system calls (read, readv, write, writev, open, close, dup, pipe, creat and others), but individual call information is not recorded.|
|iot||Similar to io, except that more information is gathered, such as bytes moved, file names, etc.|
|mpi||Captures the time spent in and the number of times each MPI function is called. Shows the call path for each unique MPI call path.|
|mpip*||Lightweight MPI profiling: captures the time spent in and the number of times each MPI function is called. Shows the call path for each unique MPI call path, but individual call information is not recorded.|
|mpit||Records each MPI function call event with specific data for display using a GUI or a command line interface (CLI). The trace format option displays the data for each call, showing its start and end times.|
|fpe||Finds where each floating-point exception occurred. A trace collects each exception with its exception type and the call stack contents. These measurements are exact, not statistical.|
|mpiotf||Writes an MPI call trace to Open Trace Format (OTF) files to allow viewing with Vampir or converting to formats of other tools.|
|mem*||Captures the time spent in and the number of times each memory function is called. Shows the call path for each memory function's unique call path.|
|pthreads*||Captures the time spent in and the number of times each POSIX thread function is called. Shows the call path for each POSIX thread function's unique call path.|
|cuda*||Captures the NVIDIA CUDA events that occur during application execution and reports the time spent in each event, along with its arguments, in an event-by-event trace.|
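Each experiment is run through a convenience script named "oss" plus the experiment name (osspcsamp, ossusertime, ossio, and so on). For example, the pcsamp, usertime, and io experiments could be run on the same 4-task matmult job used in the examples below (the srun options and executable are placeholders):

% osspcsamp "srun -n4 matmult"       # program counter sampling
% ossusertime "srun -n4 matmult"     # call path profiling
% ossio "srun -n4 matmult"           # I/O call tracing

The general syntax is: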
osscmd "srun [srun options] usercmd [usercmd options]" [osscmd options]Example: collect PAPI hardware counter events for matmult executable running with 4 MPI tasks. Uses the default PAPI event PAPI_TOT_CYC.
% osshwc "srun -n4 matmult"
Example: collect multiple PAPI hardware counter events plus program counter sampling for the matmult executable running with 4 MPI tasks. Specifies L1, L2, and L3 cache misses in addition to the CPU time provided by PC sampling.
% osshwcsamp "srun -n4 matmult" PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM
Open|SpeedShop will also create a database file in the working directory. The file will be named:

[executable]-[experiment].openss

The database file is used for analysis with the Open|SpeedShop GUI.
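For example, the hwcsamp run of matmult shown above leaves a database file in the working directory (the listing below is illustrative):

% ls *.openss
matmult-hwcsamp.openss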
Examples of both types of output are shown below.
% osshwcsamp "srun -n4 matmult" PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM [openss]: hwcsamp experiment using input papi event: "PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM". [openss]: hwcsamp experiment using the hwc experiment default sampling_rate: "100". [openss]: hwcsamp experiment calling openss. [openss]: Setting up offline raw data directory in /p/lscratche/blaise/offline-oss [openss]: Running offline hwcsamp experiment using the command: "srun -n4 /collab/usr/global/tools/openspeedshop/oss-dev/x8664/oss_offline_v2.1u4/bin/ossrun -c hwcsamp ./matmult" Output from matmult executable removed for clarity [openss]: Converting raw data from /p/lscratche/blaise/offline-oss into temp file X.0.openss Processing raw data for matmult ... Processing processes and threads ... Processing performance data ... Processing symbols ... Resolving symbols for /g/g0/blaise/matmult/matmult Resolving symbols for /usr/local/tools/mvapich-gnu-1.2/lib/shared/libmpich.so.1.0 Resolving symbols for /lib64/libc-2.12.so Resolving symbols for /lib64/libpthread-2.12.so Resolving symbols for /usr/lib64/libpsm_infinipath.so.1.14 Resolving symbols for /usr/lib64/libinfinipath.so.4.0 Updating database with symbols ... Finished ... [openss]: Restoring and displaying default view for: /g/g0/blaise/tau/workshop/matmult/matmult-hwcsamp.openss [openss]: The restored experiment identifier is: -x 1 Exclusive % of CPU papi_l1_tcm papi_l2_tcm papi_l3_tcm Function (defining location) CPU time Time in seconds. 29.130000 75.978091 6737152983 1648401202 554678152 multiply_matrices_ (matmult: matmult.f90,25) 5.770000 15.049557 58094130 25937153 4460313 ips_ptl_poll (libpsm_infinipath.so.1.14) 1.260000 3.286385 6211521 4328113 572851 __psmi_poll_internal (libpsm_infinipath.so.1.14) 1.180000 3.077726 4660014 3480615 262262 psm_mq_wait (libpsm_infinipath.so.1.14) 0.440000 1.147626 4010322 1844752 282742 __GI___sched_yield (libc-2.12.so: syscall-template.S,82) 0.270000 0.704225 1280815 958226 79788 MAIN__ (matmult: matmult.f90,39) 0.160000 0.417319 10697294 5052727 1891272 ipath_dwordcpy (libinfinipath.so.4.0) 0.050000 0.130412 12954 7680 1581 initialize_ (matmult: matmult.f90,4) 0.010000 0.026082 2297350 559612 186424 MPID_PSM_RecvComplete (libmpich.so.1.0: psmrecv.c,73) 0.010000 0.026082 2317920 568133 192646 MPID_PSM_Send (libmpich.so.1.0: psmsend.c,36) 0.010000 0.026082 1117412 515398 270530 pthread_spin_lock (libpthread-2.12.so: pthread_spin_lock.c,35) 0.010000 0.026082 893876 490998 136812 psmi_amsh_long_reply (libpsm_infinipath.so.1.14) 0.010000 0.026082 41437 31068 2099 ips_spio_transfer_frame (libpsm_infinipath.so.1.14) 0.010000 0.026082 2315250 569334 193507 ips_proto_flow_enqueue (libpsm_infinipath.so.1.14) 0.010000 0.026082 2297885 559894 186956 ips_proto_process_packet_inner (libpsm_infinipath.so.1.14) 0.010000 0.026082 1114033 512588 270345 ipath_dwordcpy_safe (libinfinipath.so.4.0) 38.340000 100.000000 6834515196 1693817493 563668280 Report Summary
To view the database output, find the relevant *.openss file in your working directory, and then call the Open|SpeedShop GUI with that file. For example:
% openss matmult-hwcsamp.openss
The GUI will then appear, displaying the results of the experiment. Several example screenshots for different experiments are shown below. Users should consult the Open|SpeedShop documentation for details.
Manager and Profile Summary Windows
Thread Bar Chart
Function Bar Chart
Profile Summary for PAPI_L2_DCM Event
Function Bar Chart for PAPI_L2_DCM Event
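The same database can also be examined without the GUI by using the command line interface. A minimal sketch, assuming the standard Open|SpeedShop CLI commands:

% openss -cli -f matmult-hwcsamp.openss
openss>> expview          # display the default report for the restored experiment
openss>> exit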
In addition to these options, a number of environment variables can be used to direct run-time behavior. Some are optional, and some may be required. For details, see the Open|SpeedShop documentation. For convenience, key environment variables are listed in the table below, reproduced from the Open|SpeedShop User's Guide.
|OPENSS_RAWDATA_DIR||Used on cluster systems where a /tmp file system is unique on each node. It specifies the location of a shared file system path, which is required for O|SS to save the raw data files on distributed systems. Syntax: OPENSS_RAWDATA_DIR=shared file system path. Example: export OPENSS_RAWDATA_DIR=/lustre4/fsys/userid|
|OPENSS_ENABLE_MPI_PCONTROL||Activates MPI_Pcontrol function recognition; otherwise, MPI_Pcontrol function calls are ignored by O|SS.|
|OPENSS_DATABASE_ONLY||When running the Open|SpeedShop convenience scripts, only create the database file and do NOT print the default report. Used to reduce the size of batch output files when the user is not interested in the default report.|
|OPENSS_RAWDATA_ONLY||When running the Open|SpeedShop convenience scripts, only gather the performance information into the OPENSS_RAWDATA_DIR directory; do NOT create the database file and do NOT print the default report.|
|OPENSS_DB_DIR||Specifies the path where O|SS will build the database file. On a file system without file locking enabled, the SQLite component cannot create the database file; this variable is used to specify a path to a file system with locking enabled for database file creation. This is usually needed on Lustre file systems that do not have locking enabled. Syntax: OPENSS_DB_DIR=file system path. Example: export OPENSS_DB_DIR=/opt/filesys/userid|
|OPENSS_MPI_IMPLEMENTATION||Specifies the MPI implementation in use by the application; only needed for the mpi, mpit, and mpiotf experiments. The currently supported MPI implementations are openmpi, lampi, mpich, mpich2, mpt, lam, mvapich, and mvapich2. For Cray, IBM, and Intel MPI implementations, use mpich2. In most cases, O|SS can auto-detect the MPI in use. Syntax: OPENSS_MPI_IMPLEMENTATION=MPI implementation name. Example: export OPENSS_MPI_IMPLEMENTATION=openmpi|
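Putting several of these together, a minimal batch-script sketch might look like the following. The file system paths are the examples from the table above, the srun options are placeholders, and OPENSS_MPI_IMPLEMENTATION is often unnecessary because O|SS can normally auto-detect the MPI:

% export OPENSS_RAWDATA_DIR=/lustre4/fsys/userid    # shared file system for raw data files
% export OPENSS_DB_DIR=/opt/filesys/userid          # file system with locking for the database file
% export OPENSS_MPI_IMPLEMENTATION=mvapich2         # only needed for mpi, mpit and mpiotf experiments
% ossmpi "srun -n4 matmult"                         # MPI tracing experiment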