Using the Dawn BG/P System

Blaise Barney, Lawrence Livermore National Laboratory LLNL-WEB-412512

NOTE: All of LC's IBM BG/P systems have been retired; however, their information is retained here for archival purposes.

Table of Contents

  1. Abstract
  2. Evolution of IBM's Blue Gene Architectures
  3. What Is Dawn?
  4. Hardware
    1. Dawn General Configuration
    2. BG/P versus BG/L
    3. BG/P's Scaling Architecture
    4. PowerPC 450 Processor and BG/P ASIC
    5. BG/P Compute Card and I/O Card
    6. BG/P Node, Link and Service Cards
    7. BG/P Midplanes and Racks
    8. BG/P Networks
    9. Login Nodes, Service Nodes and Disk
  5. Accessing LC's BG/P Machines
  6. Software and Development Environment
  7. Compilers
  8. MPI
    1. mpirun
    2. Execution Modes
  9. OpenMP and Pthreads
  10. Aligning Data and SIMD Instructions
  11. Math Libraries
  12. Miscellaneous Software
  13. Running Jobs
    1. System Configuration Details
    2. Batch Job Specifics
    3. Interactive Job Specifics
  14. I/O and HPSS Storage
  15. Using HTC Mode
  16. Debugging
    1. Core Files
    2. addr2line
    3. getstack
    4. Other First Pass Debugging Tips
    5. CoreProcessor Tool
    6. TotalView
    7. STAT
  17. Performance Analysis Tools
    1. What's Available?
    2. mpitrace
    3. peekperf, peekview
    4. Xprofiler
    5. Hardware Performance Monitoring (HPM)
    6. gprof
    7. mpiP
    8. PAPI
    9. TAU
  18. References and More Information
  19. Exercise


Abstract


This tutorial is intended for users of Livermore Computing's Dawn BlueGene/P system. It begins with a brief history leading up to the BG/P architecture. Dawn's configuration is presented, followed by detailed information on the BG/P hardware architecture, including the PowerPC 450, BG/P ASIC, double FPU, compute, I/O, login and service nodes, midplanes, racks and the 5 BG/P networks. Topics relating to the software development environment are covered, followed by detailed usage information for BG/P compilers, MPI, OpenMP and Pthreads. Data alignment, math libraries, system configuration information, and specifics on running both batch and interactive jobs are presented. The tutorial concludes with a discussion on BG/P debugging and performance analysis tools.

Level/Prerequisites: Intended for those who are new to developing parallel programs in the IBM BG/P environment. A basic understanding of parallel programming in C or Fortran is assumed. The material covered by EC3501 - Introduction to Livermore Computing Resources would also be useful.



Evolution of IBM's Blue Gene Architectures

This section provides a brief history of the IBM Blue Gene architecture.

1982-1998 QCDSP Predecessor:
  • The QCDSP (Quantum Chromodynamics on Digital Signal Processors) project, led by Columbia University and four other collaborating institutions, designed and built systems that incorporated features later seen in the Blue Gene architecture:
    • Built with thousands of commodity, off-the-shelf processors
    • Low cost, low power consumption, small footprint
    • Torus network

  • Multiple systems built, ranging up to 600 Gflops

  • Gordon Bell Prize for Most Cost Effective Supercomputer in 1998; the award was presented at the SC98 conference.

  • Follow-on QCDOC Project (QCD On a Chip) built several multi-Tflop systems

  • More information: http://phys.columbia.edu/~cqft/

December 1999:

  • IBM Research announced a five-year, $100M (US) effort to build a petaflop-scale supercomputer to attack problems such as protein folding. It was called the "Blue Gene" project.

  • The Blue Gene project had two primary goals:
    • Advance the understanding of the mechanisms behind protein folding via large-scale simulation
    • Explore novel ideas in massively parallel machine architecture and software

  • The project's intention was to enable biomolecular simulations orders of magnitude larger than the technology of the time permitted.

  • It was also projected to become the world's fastest supercomputer.

  • The press release is available HERE.
8192 Node QCDSP Machine
8192 node, 400 Gflop QCDSP at Columbia University
Source: http://phys.columbia.edu/~cqft/


Blue Gene Project Logo

November 2001:

November 2002:

  • At the SC2002 conference, Energy Secretary Spencer Abraham announced that The Department of Energy (DOE) had awarded IBM a $290 million contract to build the two fastest supercomputers in the world with a combined peak speed of 460 trillion calculations per second (teraflops).

  • The first system would be ASC Purple, with a peak speed of 100 teraflops and over 12,000 processors.

  • The second system would be Blue Gene/L, with a peak speed of 360 teraflops and 130,000 processors.

  • The press release is available HERE.

2003 - June 2004:

  • Blue Gene development progresses within IBM Rochester and IBM T.J. Watson Research Center

  • June 2003: First-pass chips (DD1) completed. (Limited to 500 MHz).

  • November 2003: 512-node DD1 system achieves 1.4 Tflop Linpack for #73 on top500.org. 32-node prototype folds proteins live on the demo floor at SC2003 conference.

  • February 2, 2004: Second-pass (DD2) BG/L chips achieve the 700 MHz design target.

  • June 2004: 4-rack, 4096 node DD1 prototype achieves 11.7 Tflop Linpack for #4 position on top500.org. A 2-rack, 2048 node DD2 system achieves 8.7 Tflop Linpack for #8 position.

November 2004 - November 2007:

  • First production Blue Gene/L system ranks #1 on the Top500.org list seven consecutive times.

  • November 2004: IBM Rochester beta Blue Gene system achieves 71 Tflop Linpack for #1 position on top500.org. The 16-rack, 16,384 node system was later moved to Lawrence Livermore National Laboratory for installation.

  • June 2005: 32-rack, 32,768 node LLNL system achieves 137 Tflop Linpack for #1 position on top500.org again. The press release is available HERE

  • 2005: Complete 64-rack, 65,536 node LLNL system installed. Installation of other BG/L systems occurs at Astron, AIST, Argonne, SDSC, Edinburgh, ...

  • November 2005: 65K node LLNL system achieves 281 Tflop Linpack for the #1 position again, and retains this slot through November 2007. The IBM T.J. Watson system is ranked #2 at 91 Tflop Linpack.

  • November 2007: Expanded LLNL BG/L system (106,496 nodes) achieves 478 Tflops Linpack to remain #1 until the next Top500 list in June 2008.

Blue Gene/P:

  • The Blue Gene/P architecture is a follow-on to Blue Gene/L

  • November 2007: The first Blue Gene/P system appeared on the Top500 at position #2 with 167 Tflops. Installed at Forschungszentrum Juelich in Germany.

  • Similar to Blue Gene/L, but has 4 cores per node instead of 2.

  • LLNL's Dawn system is a 500 Tflop Blue Gene/P machine with 36,864 compute nodes and 147,456 cores.

Blue Gene/Q and Sequoia:

  • The next generation Blue Gene architecture is Blue Gene/Q. The first production system, called Sequoia, will be installed at LLNL in 2012.

  • Configuration:
    • 20 Petaflops
    • 98,304 nodes
    • 1.6 million cores
    • 1.6 petabytes of memory
    • 96 refrigerator-sized racks occupying 3,422 square feet
    • 17 times more power efficient than BG/L

  • State-of-the-art switching infrastructure that takes advantage of advanced fiber optics at all levels

  • News and press releases:

     



Blue Gene/L Banner

Blue Gene/L Artist Conception

BGL Installation with Cables
Installation of LLNL's BG/L Machine


November 2007 Top 500 list with BGL as #1

LLNL BG/L System
LLNL's BG/L Machine


What Is Dawn?


Overview:

BG/P - Similar to BG/L:

Photos:



Hardware

Dawn General Configuration

Nodes:
  • The Dawn BG/P system consists of five different node types:
    • Compute nodes
    • I/O nodes
    • Login/Front-end nodes
    • HTC nodes
    • Service nodes

  • Compute Nodes: Comprise the heart of a system. This is where user jobs run.
    • 36,864 IBM PowerPC 450 nodes
    • 4 cores/node; total of 147,456 cores
    • 850 MHz clock, 32-bit architecture
    • 4 GB memory/node; 147.5 TB total
    • Double floating point units
    • Minimal, Linux-like kernel OS (CNK - compute node kernel)
    • Interconnects: torus, collective, global barrier

  • I/O Nodes: Dedicated to performing all I/O requests by compute nodes - not available to users directly.
    • 576 IBM PowerPC 450 nodes - same node type as compute nodes
    • 1:64 ratio with compute nodes
    • Connected to compute nodes via the collective network.
    • Connected to outside world via 10 Gb ethernet
    • Full Linux kernel OS

  • Login/Front-end Nodes: These are where users login, compile and interact with the batch system.
    • Different architecture than core BG/P nodes
    • At LC: IBM JS22 blades (14 total): IBM POWER6, 8 cores @ 4.0 GHz
    • Full 64-bit Linux
    • 8 GB memory per node
    • Connected to I/O nodes and outside world via 10 Gb ethernet

  • HTC Nodes: "High Throughput Computing" resource for short running serial and parallel jobs.

  • Service Nodes: Reserved for BG/P system management functions.
    • System boot, machine partitioning, system performance measurements, system health monitoring, etc.
    • Use IBM DB2 databases to store system state and performance information
    • Different than core BG/P nodes
    • At LC: IBM P6-550, P6-520 (3 nodes total); Full 64-bit Linux
    • Connected to outside world via 10 Gb Ethernet
    • Connected to BG/P nodes by private 1 Gb Ethernet network for JTAG hardware access

Internal Networks:

  • Networks: As with all BG/P systems, Dawn includes 5 networks used to connect the various BG/P nodes in different ways (these are discussed in detail later):
    • 3D Torus
    • Collective
    • Global Barrier/Interrupt
    • 10Gbit Functional Ethernet
    • 1Gbit Control Ethernet/JTAG

External Connectivity:

  • Dawn is connected to other LC systems and resources over a common 10 Gb Ethernet switch network:
    • Other LC clusters
    • Archival HPSS storage
    • Lustre parallel file systems
    • Other WAN resources, file systems
Dawn General Configuration

BG/P General Configuration

Above image has been reproduced by LLNL with the permission of International Business Machines Corporation from IBM Redbooks® publication SG24-7287: IBM System Blue Gene Solution: Blue Gene/P Application Development (http://www.redbooks.ibm.com/abstracts/sg247287.html?Open). © International Business Machines Corporation 2007, 2008. All rights reserved.

Dawn Configuration
Dawn External and Service Networks



BG/P versus BG/L

Other Differences:



Hardware

BG/P Scaling Architecture

Chip to Full System:

Hardware

PowerPC 450 Processor and BG/P ASIC

BG/P ASIC:

PowerPC 450:
  • Follow-on to BG/L PPC440 processor - very similar

  • 32-bit architecture at 850 MHz

  • Single integer unit

  • Single load/store unit

  • L1 Cache:
    • 32 KB instruction
    • 32 KB data
    • 32-byte lines, 64-way set-associative
PowerPC 440 block diagram
Source: commons.wikimedia.org/wiki/File:PowerPC_440.png

"Double Hummer" Floating-point Unit:

  • Each core has an associated double floating-point unit - virtually identical to the double FPU in BG/L.

  • 32 64-bit primary floating-point registers in FPU0 and 32 64-bit secondary floating-point registers in FPU1.

  • Standard PowerPC instructions (lfd, fadd, fmadd, fadds, fdiv, ...) execute on FPU0

  • Parallel SIMD instructions (lfpdx, fpadd, fpmadd, fpre...) for 64-bit (double precision) floating-point numbers exploit both FPUs

  • Loads and stores in both single and double precision, plus parallel load/stores

  • Instruction set extensions for complex number arithmetic

  • The dual-pipeline FPU can simultaneously execute two fused multiply-add instructions per machine cycle, each of which counts as 2 FLOPs. Thus, each processor unit (PPC450 + FPU) has a peak performance of 4 FLOPs per cycle. Peak performance for a 4-core node is 16 FLOPs per cycle @ 850 MHz, or 13.6 Gflops total (the arithmetic is worked out below this list).

  • Best performance is achieved with data aligned on 16-byte boundaries. See the Aligning Data section for more information.

  • Example DAXPY performance comparison between double and single FPU use available HERE.
Double Floating-point unit
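As a quick check on the peak numbers quoted above, the arithmetic works out as follows (using only the clock rate and FPU width given in this section):

    2 FMA pipelines x 2 FLOPs per FMA   =  4 FLOPs/cycle per core
    4 FLOPs/cycle   x 4 cores           = 16 FLOPs/cycle per node
    16 FLOPs/cycle  x 850 MHz           = 13.6 Gflops per node
    13.6 Gflops     x 36,864 nodes      = ~501 Tflops for all of Dawn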

Performance Counters:



Hardware

BG/P Compute Card and I/O Card

Primary Components:

Main Differences Between Compute and I/O Nodes:



Hardware

BG/P Node, Link and Service Cards

Node Card:

Link Card:

Service Card:



Hardware

BG/P Midplanes and Racks

Midplanes:

Racks:



Hardware

BG/P Networks

3D Torus:
  • Backbone for MPI point-to-point communications

  • Interconnects all compute nodes; every compute node is connected to each of its six nearest neighbors (an MPI topology-mapping sketch follows this list).

  • Bandwidth: 5.1 GB/s per node (6 links/node * 425 MB/s per link per direction, counting both directions)

  • Latency (MPI): 3 us for one hop, 10 us to the farthest node

  • Adaptive/dynamic cut-through hardware routing

  • DMA (Direct Memory Access) engine offloads the work of injecting and receiving packets from the core - leaving more cycles for the core to do computations

  • Additional Torus network graphic HERE

BG/P 3D Torus Network
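An application can expose this topology through the standard MPI Cartesian topology routines, which (with reordering enabled) may allow the MPI library to place ranks so that Cartesian neighbors are also torus neighbors. A minimal sketch in C using only standard MPI calls - whether and how ranks are actually remapped is up to the MPI implementation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int np, rank, coords[3], xminus, xplus;
        int dims[3]    = {0, 0, 0};
        int periods[3] = {1, 1, 1};   /* wrap around in all three dimensions (torus) */
        MPI_Comm torus;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        /* Factor the job into a 3D grid and build a periodic Cartesian communicator. */
        MPI_Dims_create(np, 3, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* allow reorder */, &torus);

        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);

        /* Ranks one hop away in the X dimension - natural point-to-point partners. */
        MPI_Cart_shift(torus, 0, 1, &xminus, &xplus);

        printf("rank %d at (%d,%d,%d): x-neighbors %d and %d\n",
               rank, coords[0], coords[1], coords[2], xminus, xplus);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }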

Global Collective:
  • Interconnects all compute and I/O nodes

  • One-to-all and all-to-all MPI broadcasts

  • MPI Reduction operations

  • Also used to pass I/O requests from compute nodes to I/O nodes

  • Bandwidth: 5.1 GB/s per node (3 links/node * 850 MB/s per link per direction, counting both directions)

  • Latency (MPI): 5 us for a one-way tree traversal
BG/P Global Collective Network

Global Barrier / Interrupt:
  • As the number of tasks grows, the cost of a simple software MPI barrier becomes prohibitive. This network provides an efficient solution.

  • Connects to all compute nodes

  • Low latency network for MPI barrier and interrupts

  • Latency (MPI): 1.3 us one way to reach 72K nodes

  • Four links per node

  • Bandwidth: 3.4 GB/s per node (4 links/node * 425 MB/s per link per direction, counting both directions). However, bandwidth is not generally a concern for this network since it isn't intended to transport data.
BG/P Global Barrier/Interrupt Network

10 Gb Ethernet:
  • Connects all I/O nodes to the 10 Gb Ethernet switch, which in turn provides access to external file systems.

  • Compute nodes are not connected. All interactions between the compute nodes and the outside world are carried through the I/O nodes.

  • Optical fiber
BG/P 10 Gb Ethernet Network

1 Gb Control Ethernet / JTAG:
  • JTAG = Joint Test Action Group - refers to an IEEE 1149.1 interface for control, monitoring, and debugging.

  • The Control/JTAG network grants the service node direct access to all nodes. It is used for system boot, debug, and monitoring.

  • It also enables the Service Node to provide run-time non-invasive reliability, availability, and serviceability (RAS) support and non-invasive access to performance counters.
BG/P 1 Gb Control Ethernet Network



Hardware

Login Nodes, Service Nodes and Disk

Login Nodes:

Service Nodes:

  • Primary service node:
    • IBM Power 550
    • POWER6, 8 cores @ 4.2 GHz
    • 64-bit, Linux OS

  • Secondary service nodes:
    • IBM Power 520
    • POWER6, 2 cores @ 4.2 GHz
    • 64-bit, Linux OS

Disk:

  • IBM DS5000 Storage Systems (shown)
    • 56 Terabytes
    • File systems for login nodes and service nodes

  • Lustre parallel file systems
    • dawn: 2.4 PB mounted as /p/lscratch2
    • rzdawndev: 1.1 PB mounted as /p/lscratcha
    • udawn: 1.4 PB mounted as /p/lscratchd
IBM Power 550 Service Nodes
Service and Login Nodes

IBM Disk array


Accessing LC's BG/P Machines


The instructions below summarize the basics for connecting to LC's BG/P systems. Additional access related information can be found at computing.llnl.gov/access.

dawn
  • From LLNL:
    • Need to be logged into an SCF network machine
    • ssh dawn or connect to dawn via local SSH application
    • Userid: LC username
    • Password: PIN + OTP token code -or- static LC SCF password
  • From LANL/Sandia:
    • Login and authenticate on local Securenet attached machine
    • ssh -l lc_userid dawn[1-14].llnl.gov
    • No password required
  • From Other/Internet:
    • Login and authenticate on local Securenet attached machine
    • ssh -l lc_userid dawn[1-14].llnl.gov
    • Password: PIN + OTP token code -or- static LC SCF password

rzdawndev
  • From LLNL:
    • Need to be logged into a machine that is not part of the OCF Collaboration Zone (CZ)
    • ssh rzgw or connect to rzgw via local SSH application
    • Userid: LC username
    • Password: PIN + CRYPTOCard token code
    • Then, ssh rzdawndev
    • Userid: LC username
    • Password: PIN + OTP token code
  • From LANL/Sandia:
    • Start LLNL VPN client on local machine and authenticate to VPN with your LLNL OUN and PIN + OTP token code
    • ssh rzgw or connect to rzgw via local SSH application
    • Userid: LC username
    • Password: PIN + CRYPTOCard token code
    • Then, ssh rzdawndev
    • Userid: LC username
    • Password: PIN + OTP token code
  • From Other/Internet:
    • Start LLNL VPN client on local machine and authenticate to VPN with your LLNL OUN and PIN + OTP token code
    • ssh rzgw or connect to rzgw via local SSH application
    • Userid: LC username
    • Password: PIN + CRYPTOCard token code
    • Then, ssh rzdawndev
    • Userid: LC username
    • Password: PIN + OTP token code

udawn
  • From LLNL:
    • Need to be logged into an OCF network machine
    • ssh udawn or connect to udawn via local SSH application
    • Userid: LC username
    • Password: PIN + OTP token code
  • From LANL/Sandia:
    • Login and authenticate on local unclassified network machine
    • ssh udawn.llnl.gov or connect to udawn.llnl.gov via local SSH application
    • Userid: LC username
    • Password: PIN + OTP token code
    • Linux example: ssh -l lc_userid udawn.llnl.gov
  • From Other/Internet:
    • Login and authenticate on local unclassified network machine
    • ssh udawn.llnl.gov or connect to udawn.llnl.gov via local SSH application
    • Userid: LC username
    • Password: PIN + OTP token code
    • Linux example: ssh -l lc_userid udawn.llnl.gov



Software and Development Environment


Summary:

Login Nodes:

Compute Node Kernel (CNK):
  • This is an open source, light-weight, 32-bit, Linux-like operating system that runs on the compute nodes. Available at wiki.bg.anl-external.org/index.php/Main_Page.

  • Minimal kernel:
    • Similar to the CNK on BG/L
    • Signal handling
    • Sockets
    • Starting/stopping jobs
    • Ships I/O calls to the I/O nodes over the Collective network
    • Missing some system calls, such as fork(), execve(), and system() (a defensive run-time check is sketched after this list)

  • Support for MPI, OpenMP and Pthreads

  • Communicates with the outside world through the Compute I/O Daemon (CIOD) over the Collective network.

  • Supported/unsupported system calls:
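  • Because calls such as fork() and system() are not provided by CNK, portable codes sometimes guard them at run time rather than assuming they work. A minimal sketch in C, assuming (as is typical of light-weight kernels) that an unsupported call simply fails with errno set; consult the CNK documentation for the exact behavior:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int rc;

        errno = 0;
        rc = system("hostname");   /* relies on fork()/exec(), which CNK does not provide */
        if (rc == -1) {
            /* Expected on CNK: fall back gracefully instead of aborting. */
            fprintf(stderr, "system() unavailable (errno=%d); skipping shell-out\n", errno);
        }
        return 0;
    }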

I/O Node Kernel:

  • Full, embedded, 32-bit Linux kernel running on the I/O nodes

  • Includes BusyBox "tiny" versions of common UNIX utilities

  • Provides the only connection to the outside world for the compute nodes through the CIOD

  • Performs all I/O requests for the compute nodes.

  • Also performs system calls not handled by the compute nodes

  • Provides support for debuggers and tools through its Tool Daemon

  • Parallel file systems supported:
    • Network File System (NFS)
    • Parallel Virtual File System (PVFS)
    • IBM General Parallel File System (GPFS)
    • Lustre File System
Compute Node Kernel

I/O Node Kernel
The above images have been reproduced by LLNL with the permission of International Business Machines Corporation from IBM Redbooks® publication SG24-7287: IBM System Blue Gene Solution: Blue Gene/P Application Development (http://www.redbooks.ibm.com/abstracts/sg247287.html?Open). © International Business Machines Corporation. 2007, 2008. All rights reserved.

Batch Systems:

File Systems:

Dotkit Packages:

Compilers:

Math Libraries:

Debuggers and Performance Analysis Tools:

Man Pages:



Compilers


IBM Compilers:

GNU Compilers:

Compiler Syntax:

Compiler Commands:

IBM Compiler Options:

Optimization:

Cross-Compilation Caveats:

Shared Library Constraints:

Mixed Language Compilations:

Other GNU Commands - BG/P Versions:

Miscellaneous:

See the IBM Documentation - Really!



MPI


Overview:

Compiler Commands:

Running Your Job - the mpirun Command:

mpirun Notes:

Execution Modes:

  • MPI programs execute differently, depending upon the execution mode specified.

  • SMP Mode
    • 1 MPI task per node
    • Up to 4 Pthreads/OpenMP threads
    • Full node memory available
    • All resources dedicated to single kernel image
    • This is the default mode

  • Virtual Node Mode
    • 4 MPI tasks per node
    • No threads
    • Each task gets its own copy of the kernel
    • Each task gets 1/4th of the node memory
    • Network resources split in fourths
    • L3 Cache split in half and 2 cores share each half
    • Memory bandwidth split in half

  • Dual Mode
    • Hybrid of the Virtual Node and SMP modes
    • 2 MPI tasks per node
    • Up to 2 threads per task
    • Each task gets its own copy of the kernel
    • 1/2 of the node memory per task

  • The execution mode is specified with the -mode flag when launching your job with the mpirun command. For example:

    mpirun -mode smp
    mpirun -mode vn
    mpirun -mode dual

  • Available memory: dependent upon the execution mode used. In the figure below:
      text  = code
      data  = initialized static variables
      bss   = uninitialized static variables
      heap  = dynamically allocated variables
      stack = stack variables

MPI Execution Modes
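To see how a given mode lays out tasks and threads, a small hybrid MPI/OpenMP test program is handy. The sketch below uses only standard MPI and OpenMP calls (nothing BG/P-specific is assumed):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, np;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        /* Each MPI task reports the number of OpenMP threads it runs. */
        #pragma omp parallel
        {
            #pragma omp master
            printf("task %d of %d: %d OpenMP threads\n",
                   rank, np, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Launched with -mode smp, each node runs one task with up to 4 threads; with -mode vn, four single-threaded tasks; with -mode dual, two tasks with up to 2 threads each. Set OMP_NUM_THREADS to match (the -env option to mpirun is commonly used for this; see mpirun -h).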

MPI Documentation:



OpenMP and Pthreads


OpenMP:

Pthreads:



Aligning Data and SIMD Instructions


This section discusses how to take advantage of the BG/P double floating-point unit.
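For illustration, the sketch below obtains 16-byte aligned arrays with the standard posix_memalign() call and, under the IBM XL compilers, asserts the alignment to the compiler with the __alignx() built-in so that SIMD (double-FPU) loads and stores can be generated. Treat the built-in and any related compiler options as details to verify against the IBM XL documentation for your compiler version:

    #define _XOPEN_SOURCE 600   /* for posix_memalign() */
    #include <stdlib.h>

    #define N 1024

    void daxpy(double a, const double *x, double *y)
    {
        int i;
    #ifdef __IBMC__
        __alignx(16, x);   /* assert to the XL compiler that x and y are  */
        __alignx(16, y);   /* 16-byte aligned, enabling double-FPU SIMD   */
    #endif
        for (i = 0; i < N; i++)
            y[i] += a * x[i];
    }

    int main(void)
    {
        double *x, *y;
        int i;

        /* Request 16-byte aligned storage to satisfy double-FPU alignment rules. */
        posix_memalign((void **)&x, 16, N * sizeof(double));
        posix_memalign((void **)&y, 16, N * sizeof(double));

        for (i = 0; i < N; i++) { x[i] = (double)i; y[i] = 0.0; }
        daxpy(2.0, x, y);

        free(x);
        free(y);
        return 0;
    }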

Double FPU Requirements:

Alignment Exceptions:

Checking for Double FPU Use:



Math Libraries


ESSL:

MASS, MASSV:

FFTW:

LAPACK:



Miscellaneous Software


Python:



Running Jobs

System Configuration Details

First Things First:

System Configuration/Status Information:

Basic Configuration Commands:



Running Jobs

Batch Job Specifics

SLURM Only:

Batch Partitions/Queues and Policies:

Submitting Batch Jobs:

Interacting With Your Jobs:

Killing Batch Jobs:



Running Jobs

Interactive Job Specifics

Different Than Other LC Machines:

I/O and HPSS Storage


Parallel I/O:

Large File Support:

HPSS Storage:



Using HTC Mode


High Throughput Computing (HTC) Mode:

Running Serial HTC Jobs:

Running Parallel MPI HTC Jobs:



Debugging


Core Files:

addr2line:

getstack:

Other "First Pass" Debugging Tips:

CoreProcessor Tool:
  • This Perl script tool can analyze, sort and view text core files. It can also attach to hung processes for deadlock determination.

  • Location: /bgsys/drivers/ppcfloor/tools/coreprocessor/coreprocessor.pl

  • To get usage information: coreprocessor.pl -h

  • Normally used in GUI mode, so set DISPLAY then:
    coreprocessor.pl -c=/g/g5/joeuser/rundir -b=a.out
    where -c is the directory containing your core files and -b is the name/path of your executable.

  • Usage is fairly simple and straightforward:
    1. Select the preferred Group Mode, in this case, "Ungrouped w/Traceback". Other options are shown HERE.
    2. Select an item/routine in the corefile list
    3. Select a core file from the Common nodes pane
    4. The selected corefile then appears in the bottom pane
    5. The file/line which failed is displayed

  • For jobs with many core files, the most practical Group Mode is "Stack Traceback (condensed)". This mode groups all similar core files into a single stack trace. The number of corefiles sharing the same stack trace is displayed next to each routine. Example available HERE.

  • For non-GUI mode see: coreprocessor.pl -h
Coreprocessor utility

TotalView:

Floating Point Exception Debugging:

STAT:



Performance Analysis Tools

What's Available?

mpitrace Library:

peekperf, peekview:

Xprofiler:

Hardware Performance Monitoring (HPM):

gprof:

mpiP:

PAPI:

TAU:


This completes the tutorial.

Evaluation Form: Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.




References and More Information