Using the Sequoia and Vulcan BG/Q Systems

Blaise Barney, Lawrence Livermore National Laboratory LLNL-WEB-608955

Table of Contents

  1. Abstract
  2. Evolution of IBM's Blue Gene Architectures
  3. What Is Sequoia?
  4. Hardware
    1. Sequoia System General Configuration
    2. BG/Q versus BG/P
    3. BG/Q Scaling Architecture
    4. BG/Q Compute Chip, A2 Processor, QPX Unit
    5. BG/Q Compute Card and I/O Card
    6. BG/Q Node Card
    7. BG/Q Midplanes and Racks
    8. BG/Q Networks
    9. Login Nodes
    10. Visualization Clusters
  5. Accessing LC's BG/Q Machines
  6. Accounts, Allocations and Banks
  7. Software and Development Environment
  8. Compilers
    1. Overview
    2. IBM Compilers
    3. GNU Compilers and binutils
    4. Optimization
    5. Cross-Compilation Caveats
    6. Static vs. Dynamically Linked Libraries
    7. Miscellaneous
  9. MPI
    1. Overview
    2. MPI Compiler Commands
    3. Alternative MPI Library Builds
    4. Level of Thread Support
    5. Message Routing and Protocols
    6. MPI Environment Variables
    7. MPMD Programs
  10. OpenMP and Pthreads
  11. Running Jobs
    1. System Configuration Details
    2. Number of Nodes, Tasks and Threads
    3. The srun Command
    4. Batch Jobs
    5. Interactive Jobs
    6. Uncertainty Quantification (UQ) Jobs
    7. CNK Environment Variables
  12. Memory Considerations
  13. Transactional Memory and Speculative Execution
  14. Using the QPX Floating-point Unit
  15. Partitions, Mapping and Personality
  16. Math Libraries
  17. Parallel I/O
  18. HPSS Archival Storage
  19. Debugging
    1. Core Files
    2. addr2line
    3. CoreProcessor Tool
    4. STAT for Corefiles
    5. TotalView
    6. STAT
  20. Performance Analysis Tools
    1. What's Available?
    2. gprof
    3. HPCToolkit (Rice)
    4. HPC Toolkit (IBM)
    5. mpitrace
    6. memP
    7. mpiP
    8. Open|SpeedShop
    9. PAPI
    10. TAU
    11. VampirTrace / Vampir
    12. Valgrind
  21. Documentation, Help and References
  22. Exercise


Abstract


This tutorial is intended for users of Livermore Computing's Sequoia BlueGene/Q systems. It begins with a brief history leading up to the BG/Q architecture. Configuration information for LC's BG/Q systems is presented, followed by detailed information on the BG/Q hardware architecture, including the PowerPC A2 processor, quad FPU, compute, I/O, login and service nodes, midplanes, racks and the 5D Torus network. Topics relating to the software development environment are covered, followed by detailed usage information for BG/Q compilers, MPI, OpenMP and Pthreads. Math libraries, environment variables, transactional memory, speculative execution, system configuration information, and specifics on running both batch and interactive jobs are presented. The tutorial concludes with a discussion of BG/Q debugging and performance analysis tools.

Level/Prerequisites: Intended for those who are new to developing parallel programs in the IBM BG/Q environment. A basic understanding of parallel programming in C or Fortran is required. Familiarity with MPI and OpenMP is desirable. The material covered by EC3501 - Introduction to Livermore Computing Resources would also be useful.



Evolution of IBM's Blue Gene Architectures

This section provides a brief history of the IBM Blue Gene architecture.

1982-1998: QCDSP Predecessor:
  • The QCDSP (Quantum Chromodynamics on Digital Signal Processors) Project, led by Columbia University and 4 other collaborating institutions, designed and built systems that incorporated features later seen in the Blue Gene architecture:
    • Built with thousands of commodity, off-the-shelf processors
    • Low cost, low power consumption, small footprint
    • Torus network

  • Multiple systems built, ranging up to 600 Gflops

  • Gordon Bell Prize for Most Cost Effective Supercomputer in '98. Award presented at SC98 conference.

  • More information: http://phys.columbia.edu/~cqft/

December 1999: QCDOC and Blue Gene Projects Begin

  • QCDOC Project (QCD On a Chip):
    • Follow-on to QCDSP project
    • Built several multi-Tflop systems between 1999-2005
    • Custom application-specific integrated circuit (ASIC) built by IBM
    • IBM PowerPC 440 core (same as Blue Gene/L)
    • 6-dimensional torus network
    • GNU and IBM XL C/C++ compilers
    • More information: http://phys.columbia.edu/~cqft/

  • IBM Research announced a five-year, $100M (US) effort to build a petaflop-scale supercomputer to attack problems such as protein folding. It was called the "Blue Gene" project.

  • The Blue Gene project had two primary goals:
    • Advance the understanding of protein folding via large-scale biomolecular simulations that would be orders of magnitude larger than those possible with current technology
    • Explore novel ideas in massively parallel machine architecture and software

  • Also projected to be the world's fastest supercomputer.

  • The press release is available HERE.

November 2001: IBM and LLNL Blue Gene Research Partnership

  • IBM announced a research partnership with NNSA's Lawrence Livermore National Laboratory (LLNL) to expand IBM's Blue Gene research project.

  • Jointly design a new supercomputer based on the Blue Gene architecture.

  • Called Blue Gene/L, the machine was to be 15 times faster, consume 15 times less power per computation and be 50 to 100 times smaller than current supercomputers.

  • Intended to address an important subset of computational problems that can be easily divided to run on many tens of thousands of processors.

  • The press release is available HERE.

November 2002: DOE Awards Blue Gene/L Contract to IBM

  • At the SC2002 conference, Energy Secretary Spencer Abraham announced that the Department of Energy (DOE) had awarded IBM a $290 million contract to build the two fastest supercomputers in the world, with a combined peak speed of 460 trillion calculations per second (teraflops).

  • The first system would be ASC Purple, with a peak speed of 100 teraflops and over 12,000 processors.

  • The second system would be Blue Gene/L, with a peak speed of 360 teraflops and 130,000 processors.

  • The press release is available HERE.

2003 - June 2004: Blue Gene/L Development

  • Blue Gene development progresses within IBM Rochester and IBM T.J. Watson Research Center

  • June 2003: First-pass chips (DD1) completed. (Limited to 500 MHz).

  • November 2003: 512-node DD1 system achieves 1.4 Tflop Linpack for #73 on the Top500 List. A 32-node prototype folds proteins live on the demo floor at SC2003 conference.

  • February 2, 2004: Second-pass (DD2) BG/L chips achieve the 700 MHz design target.

  • June 2004: 4-rack, 4096 node DD1 prototype achieves 11.7 Tflop Linpack for #4 position on the Top500 List. A 2-rack, 2048 node DD2 system achieves 8.7 Tflop Linpack for #8 position.

November 2004 - November 2007: Blue Gene/L Ranks #1

  • First production Blue Gene/L system ranks #1 on the Top500 List seven consecutive times.

  • November 2004: IBM Rochester beta Blue Gene system achieves 71 Tflop Linpack for #1 position on the Top500 List. The 16-rack, 16384 node system later moved to Lawrence Livermore National Laboratory for installation.

  • June 2005: 32-rack, 32,768 node LLNL system achieves 137 Tflop Linpack for #1 position on the Top500 List again. The press release is available HERE

  • 2005: Complete 64-rack, 65,536 node LLNL system installed. Installations of other BG/L systems occur at ASTRON, AIST, Argonne, SDSC, Edinburgh, ...

  • November 2005: 65K node LLNL system achieves 281 Tflop Linpack for #1 position again, and retains this slot through November 2007. IBM T.J. Watson system is ranked #2 at 91 Tflop Linpack.

  • November 2007: Expanded LLNL BG/L system (106,496 nodes) achieves 478 Tflops Linpack to remain #1 until the next Top500 List in June 2008.

Blue Gene/P:

  • The Blue Gene/P architecture is a follow-on to Blue Gene/L.

  • Similar to Blue Gene/L, but has 4 cores per node instead of 2.

  • November 2007: The first Blue Gene/P system appeared on the Top500 List at position #2 with 167 Tflops. Installed at Forschungszentrum Juelich in Germany.

  • LLNL's Dawn system (2009-2013) was a 500 Tflop Blue Gene/P machine with 36,864 compute nodes and 147,456 cores.

Blue Gene/Q and Sequoia:

  • February 2009: The Department of Energy awards IBM a contract to build a 20 petaflop, next generation Blue Gene system at LLNL. The new architecture is called Blue Gene/Q, a follow on to the BG/P architecture.

  • Although similar to BG/P, there are significant differences and improvements.

  • News and press releases:

  • June 2012: Sequoia, the first production BG/Q system, debuted on the Top500 List at #1 with a 16.3 petaflop Linpack and 20.1 petaflop peak. Three other BG/Q systems also debuted in the top 10 positions on this list.

  • News and press releases:
[Figure: 8192-node, 400 Gflop QCDSP machine at Columbia University. Source: http://phys.columbia.edu/~cqft/]

[Figure: Blue Gene project logo]

[Figure: Blue Gene/L banner and artist's conception]

[Figure: Installation of LLNL's BG/L machine]

[Figure: LLNL's BG/L machine]

[Figure: LLNL's Dawn BG/P machine]

[Figure: LLNL's Sequoia BG/Q machine]


What Is Sequoia?


Overview:
  • Sequoia is a 20 Petaflop IBM BG/Q system sited at the Lawrence Livermore National Laboratory in Livermore, CA.

  • 98,304 nodes with 16 cores/node; 1,572,864 total cores

  • 4 hardware threads per core; 64 per node; 6,291,456 total threads

  • 64-bit, IBM PowerPC A2 processor

  • 1.6 petabytes of memory; 16 GB/node

  • 5-dimensional Torus network

  • 96 refrigerator-sized racks:
    • ~3400 square feet
    • ~4500 lbs. per rack = 216 tons

  • 7.9 MWatts total power

  • Extremely power efficient - Green500 List comparisons:
    BG/L:   205 MFLOPS/W
    Dawn BG/P:   367 MFLOPS/W
    Sequoia BG/Q:   2,177 MFLOPS/W

  • Water cooled

  • ASC Tri-lab, classified, capability resource

  • Unclassified Sequoia BG/Q systems:
    • vulcan: Unclassified, 24,576 node 5 Petaflop BG/Q production system
    • rzuseq: Unclassified, 512 node BG/Q development system







Hardware

Sequoia System General Configuration

Nodes:

Networks:

Lustre File System:

Vulcan Configuration:

BG/Q Systems Summary:



BG/Q versus BG/P

BG/Q - Similar to, But Significantly Different from, BG/P:

Hardware

BG/Q Scaling Architecture

Chip to Full System:

  • 16 cores, 16 quad-FPUs, memory controllers, networks, etc on single chip
  • 1 chip plus memory = a single compute (or I/O) card
  • 32 compute cards comprise a "drawer-like" node card
  • 8 I/O cards comprise an I/O drawer
  • 16 node cards form a midplane
  • 2 midplanes + I/O drawers comprise a rack
  • Multiple racks connect to complete the system. In Sequoia's case, 96 racks.

    [Figure: BG/Q scaling architecture, chip to full system]


Hardware

BG/Q Compute Chip

BG/Q Compute Chip:
  • Physical characteristics:
    • 360 mm² in size (~19 mm x 19 mm)
    • Cu-45 SOI technology
    • ~1.47 B transistors
    • 11 metal layers

  • Total of 18 symmetric cores:
    • All cores are 4-way multithreaded, 64-bit, run at 1.6 GHz, and have a dedicated L1 cache and quad FPU.
    • 16 user cores
    • 1 system core reserved for operating system functions
    • 1 redundant core to increase manufacturing yield if there is a defective core

  • L1 caches:
    • 4 KB L1 prefetch cache; 32 x 128-byte lines
    • 16 KB L1 instruction cache; 64-byte lines; 4-way associative
    • 16 KB L1 data cache; 64-byte lines; 8-way associative
    • Dedicated to each core

  • L2 cache:
    • 32 MB
    • Shared between all cores
    • Split into 16 slices of 2 MB each, and connected with a crossbar switch to provide sufficient bandwidth to cores.
    • Multiversioned cache - supports transactional memory, speculative execution, rollback.
    • Supports atomic operations
    • 16-way set associative

  • Crossbar switch:
    • Connects L2 slices with cores
    • Peak aggregate read bandwidth of 409.6 GB/s and write bandwidth of 204.8 GB/s

  • Memory controllers:
    • Two DDR3 memory controllers
    • Each connects to external DDR3 memory
    • Peak aggregate bandwidth of 42.7 GB/s

  • Networking:
    • 5D Torus topology integrating MPI point-to-point, collectives, barriers, remote put/get
    • DMA
    • 10 bidirectional links to neighboring chips
    • Each link has a peak of 4 GB/s (2 GB/s send + 2 GB/s receive)
    • Additional 11th link for I/O

  • External (file) I/O
    • Compute nodes connect to I/O nodes over an 11th serial I/O link @ 2 GB/s
    • I/O nodes communicate to external network over PCIe Gen2 x8 interface (4 GB/s transmit + 4 GB/s receive)
    • At LLNL, Sequoia's I/O nodes communicate over the PCIe interface to QDR Infiniband cards.

PowerPC A2 Core:

  • Built upon IBM's Power Edge of Network (PowerEN) processor design:
    • Merges network and server attributes to create a wire-speed processor: "not a network endpoint that consumes data, but an inline processor that filters or modifies data and sends it on".
    • Strong emphasis on low power consumption design
    • Hybrid design employing "massive" multithreading capabilities, integrated I/O and unique special-purpose accelerators for compression, cryptography, pattern matching, XML and Ethernet.
    • IBM PowerEN presentation slides
    • IBM PowerEN White Paper

  • Implements 64-bit PowerISA instruction set

  • Optimized for aggregate throughput:
    • 4-way simultaneously multi-threaded (SMT)
    • 2-way concurrent issue: 1 XU + 1 FPU
    • In-order dispatch, execution, completion

  • L1 Instruction / Data cache = 16 KB / 16 KB. 64 byte lines.

  • 32 x 4 x 64-bit GPRs (32 general-purpose registers per hardware thread, 4 threads)

  • Dynamic branch prediction

  • Connected to its own private prefetch unit

  • 1.6 GHz

  • Detailed information available in the A2 Processor User's Manual.

Quad Floating-point Unit:

  • Each core has an associated quad floating-point unit - increased from the double FPU in BG/P.

  • Also referred to as the QPX floating-point unit: QPX = Quad Processing eXtension

  • A type of single instruction multiple data (SIMD) vector hardware

  • Four register files (RF), one per double-precision pipeline. Each register file contains 32 registers, each 64 bits wide.

    QPX Registers

  • Supports SIMD vector operations on:
    • 4-wide double or single precision floating-point numbers (single precision is converted to double)
    • 2-wide complex numbers

  • Instruction extensions to PowerISA

  • Peak throughput: 8 floating-point ops (4xFMA) + load + store per cycle

  • Supports "a multitude" (per IBM) of data alignments

  • Usage information covered in the Using the QPX Floating-point Unit section.
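  • As an illustration only (not taken from IBM documentation), the hedged sketch below shows how a simple "y = a*x + y" loop might use QPX vector intrinsics with the IBM XL compilers. It assumes the array length is a multiple of 4, that x and y are 32-byte aligned, and that the compiler provides the vector4double type and the vec_splats/vec_ld/vec_madd/vec_st intrinsics (the function name daxpy_qpx is made up for this example):

      /* Hedged sketch: 4-wide daxpy using QPX intrinsics (IBM XL C).
         Assumes n is a multiple of 4 and x, y are 32-byte aligned.        */
      void daxpy_qpx(int n, double a, double *x, double *y)
      {
          vector4double va, vx, vy;
          int i;

          va = vec_splats(a);                        /* replicate a into all 4 slots */
          for (i = 0; i < n; i += 4) {
              vx = vec_ld(0, &x[i]);                 /* load 4 doubles from x        */
              vy = vec_ld(0, &y[i]);                 /* load 4 doubles from y        */
              vy = vec_madd(va, vx, vy);             /* fused multiply-add: a*x + y  */
              vec_st(vy, 0, &y[i]);                  /* store 4 results back to y    */
          }
      }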
[Figure: BG/Q compute chip photo]

[Figure: BG/Q compute chip schematic]

[Figure: IBM PowerPC A2 processor core]

[Figure: BG/Q quad floating-point unit]



Hardware

BG/Q Compute Card and I/O Card (Nodes)

Similarities:
  • Physically, compute nodes and I/O nodes are virtually identical.

  • Both are comprised of the same components:
    • One BG/Q compute chip
    • 72 SDRAM chips = 16 GB DDR3
    • Connectors to power, JTAG and 5D Torus network

Differences:

  • Main differences are due to how the nodes are used (function)

  • I/O nodes perform all I/O requests on behalf of compute nodes; compute nodes do not perform I/O directly.

  • Compute nodes are connected to other compute nodes by the 5D torus network; I/O nodes are not.

  • I/O nodes are the only connection to the "outside world". At LLNL, this is by means of a PCIe IB adapter card to a QDR Infiniband network.

  • I/O nodes number only a fraction of the compute nodes (1:32, 1:64, 1:128). At LLNL the ratio is 1:128.

  • Compute nodes run a lightweight kernel, the Compute Node Kernel (CNK), which provides a Linux-like environment. I/O nodes run a full Linux kernel.

  • Compute nodes are located inside the rack; I/O nodes are housed in drawers external to the main rack (usually).

  • Compute nodes are water cooled; I/O nodes are air-cooled.
[Figure: BG/Q compute card]

[Figure: BG/Q I/O drawer with 8 I/O cards]



Hardware

BG/Q Node Card

Characteristics:

Hardware

BG/Q Midplanes and Racks

Midplanes:
  • 16 node cards plug in from both sides to comprise a 512 node midplane.

  • Two midplanes comprise a 1024 node rack

  • Network communications within and between midplanes are conducted over the 5D Torus.

  • Each midplane includes a service card (not shown), required for control, monitoring and diagnostic functions.

Racks:

  • Each rack measures 48"w x 52"d x 83"h and weighs approx. 4,500 lbs (with coolant and one I/O drawer).

  • Includes redundant power supply units

  • Network communications between racks are conducted over torus cables.

  • Controlled from the service node(s) over 1 Gb Control Ethernet / JTAG network

  • Up to four I/O drawers can be conveniently mounted on top of a rack

  • Multiple racks comprise an entire system - scalable up to 512 racks.

    [Figure: BG/Q racks]

[Figure: 1024-node BG/Q rack showing midplanes]



Hardware

BG/Q Networks

5D Torus:
  • Interconnects all compute nodes; every compute node is connected to each of its ten nearest neighbors.

  • Integrates MPI point-to-point, collective and barrier operations into a single network. BG/P and BG/L used three different networks.

  • Peak Bandwidth: 40 GB/s per node (10 links * 2 GB/s * 2-way/bidirectional)

  • Measured Performance - using IBM's low-level, optimized messaging interface (PAMI) software:
    • All-to-all: 97% of peak
    • Bisection: > 93% of peak
    • Nearest-neighbor: 98% of peak
    • Collective: Floating point reductions at 94.6% of peak

  • Hardware Latency: 80 ns nearest neighbor; 3 us worst case (96 rack system)

  • DMA (Direct Memory Access) engine offloads work of injecting and receiving packets from the core - leaves more cycles for the core to do computations

  • Electrical signaling within a midplane; optical links between midplanes through link chips that convert electrical to optical.

  • Hardware assisted barriers and floating point support for collectives in network

  • Single pass floating point reductions at near link bandwidth; bit reproducible

  • The torus also includes an extra 11th link - the I/O link:
    • Used to ship I/O from compute nodes to I/O nodes
    • To match I/O bandwidth to the external file system, only some compute nodes have the I/O link attached to an I/O node
    • Typically, an I/O node is connected to two compute nodes

  • 5D torus dimensions are determined by the number and placement of the system's racks/midplanes. Some example configurations are shown in the adjacent figures.

  • The mapping of user MPI tasks to the 5D torus is discussed later in the Partitions, Mapping and Personality section.

Functional LAN / SAN:

  • For Sequoia, this is a Lustre based, quad data rate (QDR) Infiniband storage area network (SAN)

  • This is a completely new version of Lustre:
    • ZFS based
    • Developed by LLNL with Whamcloud/Intel

  • Connects I/O, login and service nodes to Lustre storage OSS and MDS nodes via an Infiniband switching network.

  • Also connects to other LC network infrastructure and other IB clusters (later expansion).

  • Compute nodes are not connected. All interactions between the compute nodes and the outside world are carried through the I/O nodes.

  • SAN Infiniband hardware from Mellanox

  • See the diagram at right for additional architecture details

1 Gb Control Ethernet / JTAG:

  • JTAG = Joint Test Action Group - refers to an IEEE 1149.1 interface for control, monitoring, and debugging.

  • The Control/JTAG network grants the Service Node(s) direct access to all nodes. It is used for system boot, debug, and monitoring.

  • It also enables the Service Node(s) to provide run-time non-invasive reliability, availability, and serviceability (RAS) support.

    [Figure: BG/Q networks (generic)]

[Figure: Sequoia 5D Torus network]

[Figure: Midplane, 512 nodes, 4x4x4x4x2 torus]

[Figure: Sequoia SAN architecture]



Hardware

Login Nodes

Description:



Hardware

Visualization Clusters

Max:

Surface:



Accessing LC's BG/Q Machines



Accounts, Allocations and Banks

Accounts:

Allocations and Banks:



Software and Development Environment


Summary:

Login Nodes:

Compute Node Kernel (CNK):

I/O Node Kernel:

Batch Systems:

File Systems:

Dotkit Packages:

Compilers:

Math Libraries

Debuggers and Performance Analysis Tools:

Visualization Software and Compute Resources:

X11 Issues:

Web Browsers, PDF Viewers:

Man Pages:

Known Problems:



Compilers


Overview:

IBM Compilers:

IBM Compiler Options:

GNU Compilers:

GNU binutils and Other Utilities - BG/Q Versions:

Optimization:

Cross-Compilation Caveats:

Static vs. Dynamically Linked Libraries:

Miscellaneous

Mixed Language Compilations:

Manually Building an MPI Executable:

See the documentation



MPI


Overview:

MPI Compiler Commands:

Alternative MPI Library Builds

Level of Thread Support

Message Routing and Protocols

MPI Environment Variables

MPMD Programs:



OpenMP and Pthreads


OpenMP:

Pthreads:



Running Jobs

System Configuration Details

First Things First:

System Configuration/Status Information:

Livermore Computing Homepage
computing.llnl.gov
MyLC User Portal
mylc.llnl.gov

Basic Configuration Commands:



Running Jobs

Number of Nodes, Tasks and Threads

BG/Q Execution Options:

Considerations and Questions to Ask:

Valid Block Sizes:

A Few Recommendations



Running Jobs

The srun Command

Overview:

Other Usage Notes:



Running Jobs

Batch Jobs

Only SLURM:

Batch Partitions/Queues and Policies:

Job Control Scripts:

Submitting Your Job:

Interacting With Your Jobs:

Killing Batch Jobs:



Running Jobs

Interactive Jobs

Using the pdebug Pool:

Running Jobs

Uncertainty Quantification (UQ) Jobs

Overview:

Running Jobs

CNK Environment Variables



Memory Considerations


BG/Q Memory Architecture:
  • BG/Q nodes have four hierarchical levels of memory:
    • L1 Prefetch Cache
    • L1 Data Cache, L1 Instruction Cache
    • L2 Cache
    • DDR3 Main Memory

  • The schematic at right, and the table below, describe these memory components.

    Component             Size   Location  Latency (processor clocks)  Additional Information
    L1 Prefetch Cache     4 KB   On-chip   24                          32 lines; 128-byte line size; identifies and prefetches memory access patterns
    L1 Instruction Cache  16 KB  On-chip   3                           64-byte line size; 4-way set associative
    L1 Data Cache         16 KB  On-chip   6                           64-byte line size; 8-way set associative
    L2 Cache              32 MB  On-chip   82                          128-byte line size; 16-way set associative; 16 x 2 MB slices; crossbar switch connected
    DDR3 Main Memory      16 GB  Off-chip  >350                        128-byte line size

  • The L2 Cache supports atomic operations, speculative execution/threads, and transactional memory with rollback. These topics are discussed further in the Transactional Memory and Speculative Execution section.

  • All threads are pinned to a core so affinity is guaranteed

  • Because of the crossbar switch, there are no non-uniform memory access (NUMA) effects.

  • For additional information on the BG/Q memory architecture, see Chapter 4 in the Blue Gene/Q Application Development Redbook.

[Figure: BG/Q compute chip memory hierarchy]
Memory Constraints:



Transactional Memory and Speculative Execution


What Are They and Why Use Them?

Using BG/Q Transactional Memory (TM):

Using BG/Q Thread-level Speculative Execution (SE):



Using the QPX Floating-point Unit


Overview:

Automatic SIMDization:

Vector Intrinsic Functions:



Partitions, Mapping and Personality


Partitions:
  • The physical dimensions of the BG/Q 5D torus network depend upon the number of racks/midplanes and how they are arranged on the floor. Some example configurations are shown in the adjacent table.
    • Sequoia's 96-rack 5D torus is 16 x 12 x 16 x 16 x 2

  • The five dimensions are referred to as A, B, C, D and E.
    • The A, B, C, D dimensions are physically cabled between midplanes
    • The E dimension is always 2 and exists internally within a node card

  • The entire 5D torus network can be (and usually is) partitioned into blocks.

  • Large blocks:
    • Occupy one or more complete, 512-node midplanes
    • Always a multiple of 512 nodes
    • Can be a wrapped torus in all 5 dimensions
    • Valid block sizes form a rectangle of midplanes. For a list of valid block sizes, see the discussion under Number of Nodes, Tasks and Threads.

  • Small blocks:
    • Restricted to a single midplane or smaller
    • Occupy one or more node cards
    • Always a multiple of 32 nodes (32, 64, 128, 256)
    • Cannot be a wrapped torus in all 5 dimensions

  • The network in a block is isolated from other blocks:
    • No network interference from other blocks
    • MPI jobs do not compete with each other for network communications

  • Blocks require an I/O node to function: the ratio of I/O nodes to compute nodes defines the smallest block, which is typically 64 or 128 nodes. In the case of Sequoia, it is 128.

  • Sub-block jobs:
    • New with BG/Q
    • Small blocks are dynamically subdivided by the job scheduler into yet smaller rectangular blocks
    • Allows multiple jobs from multiple users to run within a single block
    • Any job shape, between a single node (1x1x1x1x1) and the entire midplane (4x4x4x4x2), is valid for a sub-block job
    • Caveat: all sub-block jobs share the enclosing block's I/O node(s). This can result in I/O contention between jobs
    • MPI interference between jobs is prevented
[Figure: BG/Q torus dimensions for example configurations]

[Figure: BG/Q sub-blocks]

Partition Usage Notes:

Mapping:

Personality:

Math Libraries


ESSL:

IBM's Mathematical Acceleration Subsystem (MASS) Libraries:

FFTW:

LAPACK, ScaLAPACK, BLAS, BLACS:



Parallel I/O


Lustre Architecture:

Usage Information:



HPSS Archival Storage


HPSS Storage:

Access Methods and Usage:

Quotas:

Additional Information:



Debugging


Core Files:


addr2line:


CoreProcessor Tool:
  • This Perl script tool can analyze, sort and view text core files. It can also attach to hung processes for deadlock determination.

  • Works even if the operating system / node is completely dead because it uses the JTAG network

  • Location: /bgsys/drivers/ppcfloor/coreprocessor/bin/coreprocessor.pl

  • To get usage information: coreprocessor.pl -h

  • Normally used in GUI mode, so set DISPLAY then:
    coreprocessor.pl -c=/g/g5/joeuser/rundir -b=/g/g5/joeuser/rundir/a.out
    where -c is the directory containing your core files and -b is the name/path of your executable.
    For non-GUI mode see: coreprocessor.pl -h

  • Usage is fairly simple and straightforward
    1. Select the preferred Group Mode, in this case, "Ungrouped w/Traceback". Other options are shown HERE.
    2. Select an item/routine in the corefile list
    3. Select a core file from the Common nodes pane
    4. The file/line which failed is displayed (if compiled with -g)
    5. The selected corefile then appears in the bottom pane

  • For jobs with many core files, the most practical Group Mode is "Stack Traceback (condensed)". This mode groups all similar core files into a single stack trace. The number of corefiles sharing the same stack trace is displayed next to each routine. Example available HERE.

  • Additional usage information and documentation: see the Coreprocessor chapter in the IBM Blue Gene/Q System Administration redbook.


[Figure: Coreprocessor utility]


STAT for Corefiles:

  • The Stack Trace Analysis Tool (STAT), discussed in more detail in the STAT debugging section, can also be used to debug BG/Q lightweight core files.

  • Usage:
    • Use the core_stack_merge command to merge the lightweight core files produced by a crashed application into STAT .dot format files. For example:
      core_stack_merge -x myapplication -c core.*
    • Two output files will be produced, named myapplication.dot and myapplication_line.dot.
    • Then use the stat-view command on your _line.dot file to view the call graph prefix tree. For example:
      stat-view myapplication_line.dot
    • The application's call graph tree represents the global state of the crashed program. A simple example is provided here:
    • Note: you can also use your .dot file with the stat-view command, except it is missing the line number information.

  • If your job is hung, and it doesn't use a built-in signal handler to catch SIGSEGV signals, you can force it to terminate and dump core files by using the kill_job command to send a SIGSEGV signal to it. For example:
    /bgsys/drivers/ppcfloor/hlcs/bin/kill_job --id bg_jobid -s SIGSEGV
    where bg_jobid is the BG/Q jobid - not the SLURM jobid.

  • How to determine the BG/Q jobid:
    • Include --runjob-opts="--verbose INFO" as an option to your srun command when you start the job.
    • Otherwise, you will need to contact the LC Hotline and request that a BG/Q system admin use a DB2 query to get the BG/Q jobid.

  • Additional information about using STAT to debug BG/Q lightweight corefiles can be found on the LC internal (requires authentication) wiki at: https://lc.llnl.gov/confluence/display/BGQ/Debugging#Debugging-BlueGeneLightweightCoreFileDebugging.


Debugging

TotalView

Overview:
  • Important: This section only covers the very basics on getting TotalView started on Blue Gene systems. Please see the following also:

  • The TotalView parallel debugger is available on all LC Blue Gene systems.

  • LC's license allows users to run TotalView up to the full size of the system. However, practically speaking, the usable limit is currently between 4K and 8K compute nodes.

  • Most of the relevant commands should already be in your path from /usr/local/bin:

    Command Description
    totalview TotalView Graphical User Interface
    totalviewcli TotalView Command Line Interface
    srun MPI job launch command. See the srun section for details.
    mxterm A script that pops open an xterm window from a batch job so that a user can perform debug sessions interactively. Simply type mxterm for usage information.

  • Debugging batch partition jobs: you need to use the mxterm script (or something equivalent)

  • mxterm syntax:

    mxterm [#nodes] [#tasks] [#minutes] [msub arguments ...]

  • mxterm usage:
    • Login to a front-end login node and make sure that your Xwindows environment is set up properly. You can verify this by launching a simple X application like xclock or xterm
    • Issue the mxterm command with your specific parameters. Note that the #tasks argument is ignored, but you still need to enter a dummy value. If you're not sure of the syntax, just enter the mxterm command without arguments and hit return. A usage summary will display - available HERE.
    • mxterm will then automatically generate and submit a batch script for you (in the background).
    • You will be provided with the usual job id# which you can then use to monitor your job in the queue.
    • When your job begins executing, an xterm window will appear on your desktop machine.
    • In your new xterm window, launch your parallel job under TotalView's control. For example:

      totalview srun -a -N8 -n128 a.out

    • Once TotalView's opening windows appear, you can then begin your debug session.

  • Attaching to an already running/hung batch job:
    1. You must first find where your job's srun process is running. It will be on one of the front-end nodes - but most likely NOT the front-end node you are logged into. Two easy ways to do this are shown below:

      Method 1:
      % squeue -o "%i %u %B"
      JOBID USER EXEC_HOST
      22030 joeuser vulcanlac5
      22040 swltest vulcanlac5
      22039 swltest vulcanlac6
      Shows the jobid, user and front-end node where that job's srun process is running.

      Method 2:
      % scontrol show job 73963 | grep BatchHost
         BatchHost=vulcanlac5
      If you know the jobid, you can use this variation of the scontrol command and grep the output for the front-end node where your srun process is running.

    2. Assuming that your Xwindows environment is setup correctly, launch totalview: totalview &
    3. After TotalView's opening windows appear, select the New Program window
      • Click on the Attach to process button.
      • Click on the Add Host... button
      • In the Add New Host dialog box, enter the name of the front-end node where your srun process is located, and click the OK button.
    4. You should now see a list of your processes - select the parent srun process and then click OK.
    5. TotalView will then attach to your srun process. You will probably need to Halt your srun process. After it stops, you can proceed to debug as normal.

Floating Point Exception Debugging:

  • TotalView supports floating point exception debugging on Blue Gene.

  • With the -qflttrap IBM compiler options, an offending task will generate a SIGFPE UNIX signal when it hits a specified exception.

  • Under TotalView a SIGFPE signal will cause your job to stop immediately. You can then use TotalView to perform a root cause analysis.

  • Syntax of the XL compiler's floating point exception trap options (see the compiler man page and IBM Compiler documentation for more info):

    -qflttrap=suboption1:suboption2:suboptionN

  • The suboptions determine what types of floating-point exception conditions to detect at run time. The suboptions are:

    Suboption    Description
    enable       Turn on checking for the specified exceptions
    inexact      Detect and trap on floating-point inexact, if exception checking is enabled
    invalid      Detect and trap on floating-point invalid operations
    nanq         Detect and trap all quiet not-a-number (NaN) values
    overflow     Detect and trap on floating-point overflow
    qpxstore     Detect and trap Not a Number (NaN) or infinity values in Quad Processing eXtension (QPX) vectors
    underflow    Detect and trap on floating-point underflow
    zerodivide   Detect and trap on floating-point division by zero

  • For example:

    -qflttrap=enable:inexact:invalid:nanq:overflow:underflow:zerodivide

Enabling floating point exception trapping can significantly degrade program performance. For more information, see the discussion on "Performance Impact" under https://lc.llnl.gov/confluence/display/BGQ/Floating+Point+Exceptions. (requires authentication).
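As a minimal, hypothetical illustration (the file name fpe_demo.c and the program itself are made up for this example), the program below divides by zero at run time. Compiled with -g and -qflttrap=enable:zerodivide (for example, mpixlc -g -qflttrap=enable:zerodivide -o fpe_demo fpe_demo.c) and run under TotalView's control, the offending task should receive a SIGFPE and stop at the faulting statement:

      /* fpe_demo.c - hedged sketch of a trappable floating point exception */
      #include <stdio.h>

      int main(void)
      {
          volatile double denom = 0.0;      /* volatile keeps the division at run time */
          double x = 1.0 / denom;           /* raises SIGFPE when zerodivide trapping is enabled */
          printf("x = %f\n", x);
          return 0;
      }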

TotalView Scalable Early Access (SEA) Program:

  • The DOE Tri-labs and Rogue Wave Software (TotalView vendor) are engaged in a collaboration to produce a more scalable, commercial grade, parallel debugger for use by DOE Tri-Lab users.

  • The purpose of this program is two-fold:
    • To assist Tri-Lab users in debugging those errors that emerge at a large process count (i.e., a scale that current production versions of TotalView cannot comfortably handle);
    • To gather early customer feedback on the project's direction before the improvements are folded into the production line of TotalView.

  • TotalView developers are very interested in collecting early end-user experiences, such as usability concerns, TotalView scalability, and performance realized on real-world field problems.

  • Users who wish to try out this version of TotalView should see the documentation located at: https://lc.llnl.gov/confluence/display/RWS/TotalView+Scalable+Early+Access+%28SEA%29+Program (internal wiki - requires OTP authentication).

  • Contact: Dong Ahn (ahn1@llnl.gov)


Debugging

STAT

Overview:
  • The Stack Trace Analysis Tool (STAT) gathers and merges stack traces from a parallel application's processes.

  • Primarily intended to attach to a hung job, and quickly identify where the job is hung.

  • The output from STAT consists of 2D spatial and 3D spatial-temporal graphs. These graphs encode calling behavior of the application processes in the form of a prefix tree. Example of a STAT 2D spatial graph shown on right.

  • Graph nodes are labeled by function names. The directed edges show the calling sequence from caller to callee, and are labeled by the set of tasks that follow that call path. Nodes that are visited by the same set of tasks are assigned the same color.

  • STAT is also capable of gathering stack traces with more fine-grained information, such as the program counter or the source file and line number of each frame.

  • A GUI is provided for viewing and analyzing the STAT output graphs

  • Location:
    • /usr/local/bin/stat-gui - GUI
    • /usr/local/bin/stat-cl - command line
    • /usr/local/bin/stat-view - viewer for DOT format output files
    • /usr/local/tools/stat - install directory, documentation

  • Using the STAT GUI for parallel jobs:
    • Assuming that you have a running job that is hung, and that you are logged into a BG/Q front-end "lac" node, use the stat-gui command to start the STAT GUI.
    • After it appears, it will display your srun processes on the node you are logged into. By default, it selects the parent srun process. Click the "Attach" button if this is correct.
    • If you don't see any srun processes, that means they are running on the "other" lac login node. Just type the other login node's name in the STAT GUI's "Search Remote Host" box, as shown in the above example.
    • After a few moments, a graph depicting the state of your job will appear, allowing you to determine where your job is hung. Example:
    • Additional functionality for STAT can be found by consulting the "More information" links below.

  • More information:


Performance Analysis Tools

What's Available?

    The following performance analysis tools are available on LC's BG/Q platforms. These tools cover the full range of performance tuning: tracing, profiling, MPI, threads, and hardware event counters. Each is discussed in more detail in following sections.

    Tool Description
    gprof: Standard Unix profiling utility that includes an application's routine call graph.
    HPCToolkit (Rice University): Comprehensive, integrated suite of tools for parallel program performance analysis. Based on sampling for lower overhead. Serial, multithreaded and/or multiprocess codes.
    HPC Toolkit (IBM): Includes several components that can be used to trace and profile MPI programs, capture hardware events, and graphically visualize results. Serial, MPI, threaded, and hybrid applications.
    mpitrace: Lightweight profiling and tracing library for MPI applications. Includes Hardware Performance Monitoring (HPM) statistics.
    memP: Lightweight memory profiling library for MPI applications.
    mpiP: Lightweight profiling library for MPI routines in applications.
    Open|SpeedShop: An open source performance analysis tool framework that includes the most common performance analysis steps in one integrated tool. Comprehensive performance analysis for sequential, multithreaded, and MPI applications.
    PAPI: Performance Application Programming Interface (PAPI). Standardized, cross-platform API for obtaining hardware counter statistics.
    TAU: The TAU Performance System is an integrated, portable suite of performance analysis tools for the analysis of large-scale parallel applications.
    VampirTrace / Vampir: VampirTrace is an open source tracing library. It can generate Open Trace Format (OTF) trace files and profiling data for MPI, OpenMP, pthreads and PAPI events. Vampir is an OTF trace file viewer.
    Valgrind: Valgrind is a suite of simulation-based debugging and profiling tools. The Memcheck tool detects a comprehensive set of memory errors, including reads and writes of unallocated or freed memory and memory leaks.



Performance Analysis Tools

gprof

Overview:

  • Standard, text-based, Unix profiling utility that includes an application's routine call graph.

  • Can be used with C/C++ and Fortran.

  • gprof displays the following information:
    • The parent of each procedure.
    • An index number for each procedure.
    • The percentage of CPU time taken by that procedure and all procedures it calls (the calling tree).
    • A breakdown of time used by the procedure and its descendents.
    • The number of times the procedure was called.
    • The direct descendents of each procedure.

  • Example:

  • Location: /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gprof. The one in /usr/bin is for the front-end nodes.

  • Using gprof
    • Compile your program with the -pg option. If your compilation includes the -c option (to produce a *.o file), then you will also need to include -pg during the link/load step (a complete example sequence is shown after the examples below).
    • Run the program. When it completes you should have a file called gmon.out which contains runtime statistics. If you are running a parallel program you will have multiple files differentiated by the process id which created them, such as gmon.out.0 gmon.out.1 gmon.out.2, etc.
    • For serial users, view the profile statistics with gprof by typing gprof at the shell prompt in the same directory that you ran the program. By default, gprof will look for a file called gmon.out and display the statistics contained in it.
    • For parallel users, view the profile statistics with gprof by typing gprof followed by the name of your executable and the gmon.out.X files you wish to view. You may view any single file or any combination.
    • Examples:

      gprof myprog gmon.out.0
      gprof myprog gmon.out.0 gmon.out.1 gmon.out.2
      gprof myprog gmon.out.*
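
    • A complete, hypothetical build-and-profile sequence (program name, file names and task counts are illustrative only) might look like:

      mpixlc -g -pg -o myprog myprog.c          (compile and link with -pg)
      srun -N1 -n4 myprog                       (run; writes gmon.out.0 ... gmon.out.3)
      gprof myprog gmon.out.* > profile.txt     (merge the per-task files and save the report)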
      

  • Notes:
    • When more than one gmon.out input file is specified, the resulting gprof report is a merge of the multiple inputs.
    • In most cases, you will want to redirect the output of gprof from stdout to a file, for example:
      gprof myprog gmon.out.12 > output.txt

  • More information: gprof man page


Performance Analysis Tools

HPCToolkit

Overview:

  • HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the largest supercomputers.

  • Uses low overhead statistical sampling of timers and hardware performance counters to collect accurate measurements of a program's work, resource consumption, and inefficiency and attributes them to the full calling context in which they occur.

  • Works with C/C++ and Fortran, applications that are either statically or dynamically linked.

  • Supports measurement and analysis of serial codes, threaded codes (pthreads, OpenMP), MPI, and hybrid (MPI + threads) parallel codes.

  • Primary components and their relationships:

    • hpcrun: collects accurate and precise calling-context-sensitive performance measurements for unmodified fully optimized applications at very low overhead (1-5%). It uses asynchronous sampling triggered by system timers and performance monitoring unit events to drive collection of call path profiles and optionally traces.

    • hpcstruct: To associate calling-context-sensitive measurements with source code structure, hpcstruct analyzes fully optimized application binaries and recovers information about their relationship to source code. In particular, hpcstruct relates object code to source code files, procedures, loop nests, and identifies inlined code.

    • hpcprof: overlays call path profiles and traces with program structure computed by hpcstruct and correlates the result with source code. hpcprof/mpi handles thousands of profiles from a parallel execution by performing this correlation in parallel. hpcprof and hpcprof/mpi generate a performance database that can be explored using the hpcviewer and hpctraceviewer user interfaces.

    • hpcviewer: a graphical user interface that interactively presents performance data in three complementary code-centric views (top-down, bottom-up, and flat), as well as a graphical view that enables one to assess performance variability across threads and processes. hpcviewer is designed to facilitate rapid top-down analysis using derived metrics that highlight scalability losses and inefficiency rather than focusing exclusively on program hot spots.

    • hpctraceviewer: a graphical user interface that presents a hierarchical, time-centric view of a program execution. The tool can rapidly render graphical views of trace lines for thousands of processors for an execution tens of minutes long, even on a laptop. hpctraceviewer's hierarchical graphical presentation is quite different from that of other tools - it renders execution traces at multiple levels of abstraction by showing activity over time at different call stack depths.

  • Location: /usr/global/tools/hpctoolkit/bgqos_0

  • Using HPCToolkit:
    • Due to its multi-component and sophisticated nature, usage instructions for HPCToolkit are beyond the scope of this document. A few hints are provided below.
    • Be sure to use the LC dotkit package for HPCToolkit. The command use -l will list all available packages. Find the one of interest and then load it - for example: use hpctoolkit
    • Consult the User's Manual and other HPCToolkit documentation - links provided below.
    • Note the Blue Gene instructions, where applicable, in the documentation, as some things are done differently for these architectures.

  • More information:


Performance Analysis Tools

IBM HPC Toolkit





    NOTE: IBM's HPC Toolkit is not currently available on LC's BG/Q clusters. Usage information will be added here when/if it becomes available.






Performance Analysis Tools

mpitrace

Overview:
  • The mpitrace library can be used to profile an application and report:
    • MPI routines called - number of calls, average message size (bytes) and aggregate time spent
    • MPI routine call sites - the address where MPI routines are called, number of calls and aggregate time spent
    • The number of torus network hops from sender to destination for each message
    • Hardware Performance Monitor (HPM) counts - for selected hardware events
    • The application's heap memory footprint
    • Time spent at each source code statement - the number of "ticks" per line

  • The mpitrace library can also be used to trace MPI events during a program's execution. All or selected MPI events are saved to a binary events.trc file for later viewing with the traceview viewer.

  • Implemented via wrappers around MPI calls

  • Note: this useful library is an internal (non-product) tool provided to LC by Bob Walkup from IBM.

  • Locations:
    • /usr/local/tools/mpitrace/lib/ - Main version with full functionality
    • /usr/local/tools/mpitrace/lite/ - Lite version with a smaller memory footprint, but reduced functionality
    • /usr/local/tools/mpitrace/pthreads/ - HPM version for applications that use Pthreads
    • /usr/local/tools/traceview/ - traceview GUI source code. The traceview executable in this directory is provided for MS Windows platforms.

  • Compiling and linking:
    • Compile as usual, but using the -g flag is required for reporting call sites and source line ticks statistics.
    • Then link as shown below. Note that the examples shown are for using the main version. Use the paths shown above for the Lite and Pthreads versions.

      Basic MPI profiling and tracing:
        -L/usr/local/tools/mpitrace/lib -lmpitrace
        -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm

      MPI profiling, tracing + HPM profiling:
        -L/usr/local/tools/mpitrace/lib -lmpihpm
        -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm

      MPI + OpenMP profiling, tracing + HPM profiling:
        -L/usr/local/tools/mpitrace/lib -lmpihpm_smp
        -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm

  • Instrumented/selective tracing: Only trace those parts of the program contained within trace start/stop calls. Syntax for trace start/stop calls:

    Fortran:
      call trace_start()
      do work + mpi ...
      call trace_stop()

    C:
      void trace_start(void);
      void trace_stop(void);

      trace_start();
      do work + mpi ...
      trace_stop();

    C++:
      extern "C" void trace_start(void);
      extern "C" void trace_stop(void);

      trace_start();
      do work + mpi ...
      trace_stop();
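
  • As a hedged, self-contained C sketch (the program is made up for this example; link with the basic mpitrace line shown above), only the region between trace_start() and trace_stop() below would be recorded in the events.trc trace file:

      #include <mpi.h>

      void trace_start(void);   /* provided by the mpitrace library */
      void trace_stop(void);

      int main(int argc, char **argv)
      {
          int rank, sum = 0;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          trace_start();                        /* begin tracing this phase only */
          MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
          trace_stop();                         /* stop tracing                  */

          MPI_Finalize();
          return 0;
      }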

  • Instrumented/selective profiling and source line "ticks" profiling: See the mpitrace documentation for instructions.

  • Running:
    • A number of environment variables control how profiling and tracing is performed. Some of these are described in the table below. Please consult the mpitrace documentation for additional details not covered here.

      Environment Variable   Description (default shown in parentheses)
      PROFILE_BY_CALL_SITE   Set to yes to obtain the call site for every MPI function call. Requires compiling with the -g flag. (Default: no)
      TRACE_SEND_PATTERN     Set to yes to collect information about the number of hops for point-to-point communication on the torus network. (Default: no)
      SAVE_ALL_TASKS         Set to yes to produce an output file for every MPI rank. By default, output files are only produced for MPI rank 0 and the ranks having the minimum, median, and maximum times in MPI. (Default: no)
      SAVE_LIST              Specify a list of MPI ranks that will produce an output file. By default, output files are only produced for MPI rank 0 and the ranks having the minimum, median, and maximum times in MPI. Example: setenv SAVE_LIST 0,32,64,128,256,512 (Default: unset)
      TRACEBACK_LEVEL        In cases where there are deeply nested layers on top of MPI, you may want to profile higher up the call chain. Set this variable to an integer value above zero indicating how many levels above the MPI calls profiling should take place. (Default: 0)
      TRACE_DIR              Specify the directory where output files should be written. (Default: working directory)
      HPM_GROUP              Set to an integer value indicating which predefined hardware counter group to use. Hardware counter groups are listed in the file /usr/local/tools/mpitrace/CounterGroups. (Default: 0)
      HPM_PROFILE            Set to yes to turn on HPM profiling. The executable needs to have been linked with an HPM library. (Default: unset)
      HPM_SCOPE              Set to process or thread to aggregate hardware counter statistics at the process or thread level. See the documentation for an explanation. (Default: node)
      TRACE_ALL_TASKS        For jobs that have more than 256 tasks, setting this to yes will cause all tasks to be traced. Can cause problems for large, long running jobs (too much data). (Default: no)
      TRACE_ALL_EVENTS       Set to yes to trace all MPI events. This is used if you don't explicitly instrument your source code with trace start/stop routine calls yourself. (Default: no)
      TRACE_MAX_RANK         Specifies the maximum task rank that should be profiled. Can be used to override the default of 255 (256 tasks). (Default: 255)
      SWAP_BYTES             The event trace file is binary, and therefore it is sensitive to byte order. Trace files are written in little endian format by default. Setting this variable to yes will produce a big endian binary trace output file. (Default: no)

  • Output:
    • MPI profiling: The default is to produce plain text files of MPI data for MPI rank 0, and the ranks that had the minimum, median, and maximum times in MPI. Files are named mpi_profile.#.rank where # is a unique number for each job. The file for MPI rank 0 also contains a summary of data from all other MPI ranks.
    • HPM profiling: similar to MPI profiling, except the files are named hpm_process_summary.#.rank
    • MPI tracing: A single binary trace data file called events.trc is produced. Intended to be viewed with the traceview GUI utility.
    • The number of profiling output files produced, and the data they contain, can be modified by setting the environment variables in the above table.
    • Examples:

  • Tracing caveats:
    • Tracing large, long running executables can generate a huge output file, even to the point of being useless.
    • Tracing incurs overhead and increases a job's runtime.
    • Optimized code may produce misleading or erroneous trace results.

  • Documentation:
    • Located in /usr/local/tools/mpitrace on BG/Q systems
    • MPI_Wrappers_for_BGQ.pdf - Main documentation
    • README.mpitrace - LC specific notes
    • CounterGroups - HPM counter groups


Performance Analysis Tools

memP

Overview:
  • memP is a locally developed, lightweight, parallel heap profiling library based on the mpiP MPI profiling tool.

  • Primary feature is to identify the heap allocation that causes an MPI task to reach its memory in use high water mark (HWM).

  • Two types of memP reports:
    1. Summary Report: Generated from within MPI_Finalize, this report describes the memory HWM of each task over the run of the application. This can be used to determine which task allocates the most memory and how this compares to the memory of other tasks.
    2. Task Report: Based on specific criteria, a report can be generated for each task, that provides a snapshot of the heap memory currently in use, including the amount allocated at specific call sites.

  • Location: /usr/local/tools/memP

  • Using memP:
    • Load the memP dotkit package with the command use memp
    • Compile with the recommended BG/Q flags and link your application with the required libraries:
      -Wl,-zmuldefs -L/usr/local/tools/memp/lib -lmemP
    • Examples:

      mpixlc -g -Wl,-zmuldefs -o myprog myprog.c -L/usr/local/tools/memP/lib -lmemP
      mpixlf77 -g -Wl,-zmuldefs -o myprog myprog.f -L/usr/local/tools/memP/lib -lmemP 

    • Optional: set the MEMP environment variable to specify the type of output you desire, if other than the default, summary text file. See the "Output Options" discussion below.
    • Then run your MPI application as usual. You can verify that memP is working by the header and trailer output it sends to stdout, and output file generation following execution.

  • Output Options:
    • By default, a single text summary report showing the top HWM tasks will be produced:
    • Other options exist to produce reports on a per task basis, display call sites where the HWM is reached, set HWM thresholds, generate stack traces, and more.
    • XML format reports that can be viewed via an LC utility are also an option:
    • For details, see the "More information" link below.

  • More information: http://memp.sourceforge.net/


Performance Analysis Tools

mpiP

Overview:
  • mpiP is a lightweight profiling library for MPI applications.
    • Software developed by LLNL
    • Collects only statistical information about MPI routines, generating much less data than tracing tools
    • Captures and stores information local to each task (local memory and disk)
    • Uses communication only at the end of the application to merge results from all tasks into one output file

  • mpiP provides statistical information about a program's MPI calls:
    • Percent of a task's time attributed to MPI calls
    • Where each MPI call is made within the program (callsites)
    • Top 20 callsites
    • Callsite statistics (for all callsites)

  • Location: /usr/local/tools/mpip

  • Using mpiP:
    • Involves little more than compiling with the -g flag and linking with the mpiP library.
    • Examples:

      mpixlc -g -o myprog myprog.c -L/usr/local/tools/mpip/lib -lmpiP 
      mpixlf77 -g -o myprog myprog.f -L/usr/local/tools/mpip/lib -lmpiP 

    • After compiling, run your application as usual. You can verify that mpiP is working by the header and trailer output it sends to stdout, and the creation of a single output file (see "Output" below).

  • Output:
    • After your application completes, mpiP will write its output file to the current directory. The output file name will have the format of myprog.N.XXXXX.mpiP where N=#MPI tasks and XXXXX=collector task process id.
    • mpiP's output file is divided into 5 sections:
      1. Environment Information
      2. MPI Time Per Task
      3. Callsites
      4. Aggregate Times of Top 20 Callsites
      5. Callsite Statistics
    • Example:

  • More information: http://mpip.sourceforge.net


Performance Analysis Tools

Open|SpeedShop

Overview:
  • Open|SpeedShop is an open source performance analysis tool framework that integrates the most common performance analysis steps all in one tool.

  • Primary functionality:
    • Sampling Experiments
    • Support for Callstack Analysis
    • Hardware Performance Counters
    • MPI Profiling and Tracing
    • I/O Profiling and Tracing
    • Floating Point Exception Analysis

  • Instrumentation options include:
    • Unmodified application binaries
    • Offline and online data collection
    • Attach to running applications

  • Four user interface options:
    • Graphical user interface
    • Command line
    • Batch
    • Python scripting API

  • Designed to be modular and extensible. Supports several levels of plug-ins which allow users to add their own performance experiments.

  • Runs on Linux-based platforms - currently IA64, IA32, EM64T, AMD64, IBM PowerPC, Cray XT/XE and IBM Blue Gene.

  • Open|SpeedShop development is hosted by the Krell Institute. The infrastructure and base components are released as open source code primarily under LGPL.

  • Location: /usr/global/tools/openspeedshop/

  • Using Open|SpeedShop:
    • Due to its multi-component and sophisticated nature, usage instructions for Open|SpeedShop are beyond the scope of this document. A few hints are provided below.
    • Be sure to use the LC dotkit package for Open|SpeedShop. The command use -l will list all available packages. Find the one of interest and then load it - for example: use openss.
    • Consult the User's Guide and other Open|SpeedShop documentation - links provided below. A minimal sampling sketch appears after this list.
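
    • For illustration only, a sketch using the osspcsamp convenience script (program counter sampling). The job size, program name and database file name are hypothetical, and BG/Q-specific launch details should be checked against the Open|SpeedShop documentation:

      use openss
      osspcsamp "srun -N 32 -n 512 ./myprog"     # run the job under pc sampling
      openss -f myprog-pcsamp.openss             # view the resulting database in the GUI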

  • More information:


Performance Analysis Tools

PAPI

Overview:
  • Performance Application Programming Interface (PAPI) is an industry-standard, cross-platform API for obtaining hardware performance counter statistics, such as:
    • Branching, conditional, unconditional
    • Cache requests, hits, misses, L1, L2, L3
    • Stores, conditional, success, fail
    • Instruction counting
    • Loads, prefetches
    • Cycle stalls
    • Floating point operations
    • TLB operations
    • Hardware interrupts

  • Hardware events are recorded by making calls to the PAPI API routines for the events of interest.

  • There are two groups of events:

    • Preset Events: Standard API set of over 100 CPU events for application performance tuning. Application developers can access these events through the PAPI high-level API. A list of these routines can be found at http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:PAPI_presets.3.

    • Native Events: Platform specific events that extend beyond the Preset Event set. Require using PAPI's low-level API - generally intended for experienced programmers and tool developers.

  • Originally, the API focused on CPU events, but the more recent PAPI-C (PAPI Component) API includes other machine components such as network interface cards, power monitors and I/O units.

  • Both C and Fortran calling interfaces are provided

  • On BG/Q, PAPI interfaces to a subset of IBM's BGPM (Blue Gene Performance Monitoring) API. BGPM includes over 400 events grouped into 5 categories, which map to the hardware unit where they are counted:
    • Processor Unit
    • L2 Unit
    • I/O Unit
    • Network Unit
    • CNK (compute node kernel) Unit

  • Location: /usr/local/tools/papi

  • Using PAPI:
    • Using PAPI in an application typically requires a few simple steps: include the event definitions, initialize the PAPI library, set up event counters, and link with the PAPI library (a minimal sketch appears after this list).
    • Documentation on how to use the API can be found at http://icl.cs.utk.edu/papi. See the Documentation section.
    • A useful PAPI getting started tutorial is available at http://www.cisl.ucar.edu/css/staff/rory/papi/papi.php.
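
    • As an illustration of those steps, a minimal sketch using two common preset events. The include/library subdirectories under /usr/local/tools/papi are assumptions - check the installation, and use the papi_avail utility to confirm which preset events are supported on BG/Q:

      #include <stdio.h>
      #include <papi.h>

      int main(void)
      {
          int eventset = PAPI_NULL;
          long long counts[2];

          /* initialize the PAPI library */
          if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
              return 1;

          /* create an event set and add two preset events */
          PAPI_create_eventset(&eventset);
          PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles       */
          PAPI_add_event(eventset, PAPI_FP_OPS);    /* floating point ops */

          PAPI_start(eventset);
          /* ... code section to be measured ... */
          PAPI_stop(eventset, counts);              /* read and stop counters */

          printf("cycles = %lld   fp ops = %lld\n", counts[0], counts[1]);
          return 0;
      }

      Compile and link (paths are assumptions):

      mpixlc -g -o papi_test papi_test.c -I/usr/local/tools/papi/include \
             -L/usr/local/tools/papi/lib -lpapi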

  • More information:
    • PAPI website: http://icl.cs.utk.edu/papi
    • IBM BGPM API documentation - see the install directory at /bgsys/drivers/ppcfloor/bgpm/docs/html/index.html.
    • BG/Q native events (from the installation documentation):


Performance Analysis Tools

TAU

Overview:
  • The TAU (Tuning and Analysis Utilities) Performance System is an integrated, portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java and Python. It is developed by the Performance Research Lab at the University of Oregon.

  • Profiling: shows how much time was spent in each routine

  • Tracing: shows when and where events take place

  • TAU instrumentation is used to accomplish both profiling and tracing. Three different methods:
    • Binary "rewriting"
    • Compiler directed
    • Source transformation (both automatic and selective)

  • Ability to use PAPI hardware counters

  • Graphical representation of profiling/tracing via TAU's ParaProf GUI tool.

  • Location: /usr/global/tools/tau/bgqos_0

  • Quickstart for TAU Profiling:
    1. Load the TAU environment: use tau
    2. Decide what you want to instrument by selecting the appropriate TAU stub makefile. These are named according to the metrics they record, and they are located in the bgq/lib subdirectory of your TAU installation. For example, in:
      /usr/global/tools/tau/bgqos_0/tau-2.21.3/bgq/lib
      you will see makefile stubs such as:
      Makefile.tau-bgqtimers-mpi-pdt                    Makefile.tau-bgqtimers-pdt
      Makefile.tau-bgqtimers-mpi-pdt-openmp-opari       Makefile.tau-bgqtimers-pdt-openmp-opari
      Makefile.tau-bgqtimers-papi-mpi-pdt               Makefile.tau-bgqtimers-pthread-pdt
      Makefile.tau-bgqtimers-papi-mpi-pdt-openmp-opari  Makefile.tau-depthlimit-bgqtimers-mpi-pdt
      Makefile.tau-bgqtimers-papi-pdt                   Makefile.tau-param-bgqtimers-mpi-pdt
      Makefile.tau-bgqtimers-papi-pdt-openmp-opari      Makefile.tau-phase-bgqtimers-papi-mpi-pdt
      Makefile.tau-bgqtimers-papi-pthread-pdt
    3. Set TAU_MAKEFILE to the full pathname of the makefile stub you choose. For example:
      setenv TAU_MAKEFILE /usr/global/tools/tau/bgqos_0/tau-2.21.3/bgq/lib/Makefile.tau-bgqtimers-mpi-pdt
    4. Compile your program substituting the appropriate TAU compiler script for your usual compiler: tau_cxx.sh (C++), tau_cc.sh (C), tau_f90.sh (F90), tau_f77.sh (F77). These scripts should be in your path after following the setup instructions above.
    5. Run your program
    6. When finished, you should have files called profile.NNN, one per MPI task.
    7. View the output using the pprof (text) or paraprof (GUI) tools. Simply issue the command and it will look for the relevant profile.NNN files to display. A condensed sketch of the whole sequence appears after this list.
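
  • The whole quickstart sequence, condensed into one sketch (the makefile stub, node/task counts and program name are illustrative only):

      use tau
      setenv TAU_MAKEFILE /usr/global/tools/tau/bgqos_0/tau-2.21.3/bgq/lib/Makefile.tau-bgqtimers-mpi-pdt
      tau_cc.sh -g -o myprog myprog.c     # or tau_f90.sh, tau_cxx.sh, tau_f77.sh
      srun -N 8 -n 8 ./myprog             # run as usual; writes profile.NNN files
      pprof                               # text summary
      paraprof                            # GUI browser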

  • Note: TAU is a very full featured toolkit that cannot be covered here adequately. See the more information links below to get a better feel for what this performance analysis package can do.

  • More information:


Performance Analysis Tools

VampirTrace / Vampir

Overview:
  • VampirTrace is an open source, performance analysis tool set and library used to instrument, trace and profile parallel applications. Developed at TU-Dresden, in collaboration with the KOJAK project at JSC/FZ Julich.

  • Supports applications using:
    • MPI
    • OpenMP
    • Pthreads
    • GPU accelerators

  • Trace events can include:
    • Application's routine/function calls
    • MPI calls
    • User defined events
    • PAPI performance counters
    • I/O
    • Memory allocations

  • Instrumentation options include:
    • Fully automatic - performed via compiler wrappers
    • Manual using the VampirTrace API (see the sketch following this list)
    • Fully automatic using the TAU instrumentor
    • Runtime binary instrumentation using Dyninst
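
  • As a minimal sketch of the manual instrumentation option (the region and function names are hypothetical; the VT_USER_START/VT_USER_END macros come from the VampirTrace user API and require compiling with -DVTRACE through a VampirTrace compiler wrapper):

      #include "vt_user.h"

      void solver_step(void)
      {
          VT_USER_START("solver_step");   /* begin a user-defined region */
          /* ... work to be traced ... */
          VT_USER_END("solver_step");     /* end the region */
      }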

  • Vampir is a proprietary trace visualizer developed by the Center for Information Services and High Performance Computing (ZIH) at TU Dresden. It is used to graphically display the Open Trace Format (OTF) output produced by VampirTrace.

  • Location:
    • VampirTrace: /usr/global/tools/vampirtrace/bgqos_0

  • Quickstart for basic usage at LC:
    1. Load the VampirTrace environment: use vampirtrace-bgq
    2. Compile / link your code using one of the VampirTrace compiler wrappers: vtCC, vtc++, vtcc, vtcxx, vtf77 or vtf90. You need to tell the wrapper which native compiler to use. For example:

      vtcc -vt:cc mpixlc -o hello mpi_hello.c

    3. Set desired environment variables - there are many choices. For example, to do both profiling and tracing, and to prefix the output files with the name of the code:

      setenv VT_MODE STAT:TRACE
      setenv VT_FILE_PREFIX hello

    4. Run the executable
    5. View the output using the Vampir GUI:

      use vampir
      vampir hello.otf

      NOTE: As of December, 2012, Vampir is only installed on the following LC systems: cab, edge, hera, sierra, rzmerl, rzzeus.

  • Output:
    • Profile data is written to a plain text file named a.prof.txt by default. Use the VT_FILE_PREFIX environment variable to name it something different. Example:

                                          excl. time  incl. time
      *excl. time  incl. time      calls      / call      / call  name
          0.186s      0.186s           1     0.186s      0.186s   MPI_Finalize
          0.123s      0.123s     4033.75    30.459us    30.459us  MPI_Recv
         94.592ms     0.687s           1    94.592ms     0.687s   main
         53.345ms    53.345ms       2000    26.672us    26.672us  MPI_Ssend
         51.888ms    51.888ms       2000    25.944us    25.944us  MPI_Waitall
         47.471ms    47.471ms    2033.75    23.341us    23.341us  MPI_Send
         32.281ms    32.281ms       4000     8.070us     8.070us  MPI_Irecv
         29.833ms    29.833ms       1000    29.832us    29.832us  MPI_Sendrecv
      

    • Tracing data is written to an Open Trace Format (OTF) file named a.otf by default. Use the VT_FILE_PREFIX environment variable to name it something different.
    • Note: VampirTrace may create other output files that are not of viewing interest, particularly if the files are not merged into the two default files mentioned above.

  • Note: VampirTrace and Vampir are full featured tools that cannot be covered here adequately. See the more information links below to get a better feel for what they can do.

  • More information:


Performance Analysis Tools

Valgrind

Overview:
  • NOTE: Valgrind is not currently available on the LC BG/Q systems. Usage information will be added here when/if it becomes available.

  • The Valgrind tool suite provides a number of debugging and profiling tools that help you make your programs faster and more correct.

  • The Valgrind distribution currently includes the following tools:

    • Memcheck: a memory error detector. It helps you make your programs, particularly those written in C and C++, more correct.

    • Cachegrind: a cache and branch-prediction profiler. It helps you make your programs run faster.

    • Callgrind: a call-graph generating cache profiler. It has some overlap with Cachegrind, but also gathers some information that Cachegrind does not.

    • Helgrind: a thread error detector. It helps you make your multi-threaded programs more correct.

    • DRD: also a thread error detector. It is similar to Helgrind but uses different analysis techniques and so may find different problems.

    • Massif: a heap profiler. It helps you make your programs use less memory.

    • DHAT: a different kind of heap profiler. It helps you understand issues of block lifetimes, block utilisation, and layout inefficiencies.

    • SGcheck: an experimental tool that can detect overruns of stack and global arrays. Its functionality is complementary to that of Memcheck: SGcheck finds problems that Memcheck can't, and vice versa.

    • BBV: an experimental SimPoint basic block vector generator. It is useful to people doing computer architecture research and development.

  • Valgrind is also an instrumentation framework for building dynamic analysis tools - you can also use it to build your own tools.

  • For more information visit valgrind.org


Documentation, Help and References


Local Documentation - General and BG/Q Specific:

Help - LC Hotline:

  • The LC Hotline staff provide walk-in, phone and email assistance weekdays 8:00am - noon, 1:00pm - 4:45pm.

  • Walk-in Consulting
    • On-site users can visit the LC help desk consultants in Building 453, Room 1103. Note that this is a Q-clearance area.

  • Phone:
    • (925) 422-4531 - Main number
    • 422-4532 - Direct phone line for technical consulting help
    • 422-4533 - Direct phone line for support help (accounts, passwords, forms, etc)

  • Email
    • Technical Help:
      OCF: lc-hotline@llnl.gov
      SCF: lc-hotline@pop.llnl.gov
    • Support:
      OCF: lc-support@llnl.gov
      SCF: lc-support@pop.llnl.gov

Help - BG/Q Specific:

  • Sequoia Users Meeting: third Thursday each month. Held in B451 White Room from 3:00-4:00pm. Web conference and phone dial-in numbers available - contact the LC Hotline.

  • "BG/Q Virtual Water Cooler" telecon every Thursday (except 3rd) from 3:00- 4:00pm. Intended to be an open user forum discussion regarding the Seq/Vulcan/rzuseq systems. Available for consulting with domain experts on topics such as porting codes, system status, jobs scheduling, file systems, etc.

References:






This completes the tutorial.

Evaluation Form       Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.
