Using ASC Purple

presented by

Blaise Barney
Livermore Computing

NOTE: All of LC's IBM POWER systems have been retired; however, their information is being retained here for archival purposes.


Table of Contents

  1. Abstract
  2. ASC Purple Background
  3. Hardware
    1. Configuration
    2. POWER5 Processor
    3. p5 575 Node and Frame
    4. High Performance Switch (HPS) Network
    5. GPFS Parallel File System
  4. Accounts
  5. Access
  6. User Environment Topics
  7. Software and Development Environment
  8. Parallel Operating Environment (POE) Overview
  9. Compilers
  10. MPI
  11. Running on Purple Systems
    1. Important Differences
    2. Understanding Your System Configuration
    3. Setting POE Environment Variables
    4. Invoking the Executable
    5. Monitoring Job Status
    6. Interactive Job Specifics
    7. Batch Job Specifics
    8. More On SLURM
    9. Optimizing CPU Usage
    10. Large Pages
    11. RDMA
  12. Debugging With TotalView
  13. Misc - Recommendations, Known Problems, Etc.
  14. References and More Information
  15. Exercise


Abstract


This tutorial provides an introduction to using Livermore Computing's (LC) ASC Purple systems. The intended audience is primarily those who are new to the IBM POWER architecture and to computing in LC's HPC environment. Those who are already experienced with computing in LC's HPC environment, especially users of LC's POWER-based systems (such as ASC White), will find that they are already familiar with a substantial portion of these materials.

The tutorial begins by providing a brief background of ASC Purple and the configuration of LC's Purple systems. The primary hardware components of Purple are then presented, including IBM's POWER5 processor, p5 575 node and frame, HPS switch, and GPFS parallel I/O architecture. After covering the hardware-related topics, a brief discussion on how to obtain an account and access the Purple systems follows. Software topics are then discussed, including the LC development environment, IBM's Parallel Operating Environment (POE), compilers, MPI implementations, and how to run both batch and interactive parallel jobs. Debugging and performance-related tools/topics are briefly discussed; however, detailed usage of these tools is beyond the scope of this presentation and is covered in other tutorials and LC documentation. The tutorial concludes with several LC-specific and miscellaneous topics. A lab exercise using LC's unclassified Purple system follows the presentation.

Level/Prerequisites: Intended for those who are new to developing parallel programs in LC's IBM POWER environment. A basic understanding of parallel programming in C or Fortran is assumed. The material covered by EC3501 - Introduction to Livermore Computing Resources would also be useful.



ASC Purple Background

The history of ASC Purple is really the history of two separate, but interrelated timelines: the evolution of IBM's POWER architecture, and the NNSA's ASC program.

IBM's POWER Architectures:

NNSA's ASC Program:

ASC Purple Timeline:



Hardware

Configuration

Primary Components:

Topology:

LC's Purple Systems:



Hardware

POWER5 Processor

POWER5 Basics:

Chip Modules:

Multiple Modules:

ASC Purple Chips and Modules:
  • ASC Purple compute nodes are p5 575 nodes, which differ from standard p5 nodes in that only one core of each dual-core chip is active.

  • With only one active CPU per chip, the entire L2 and L3 cache is dedicated to that CPU. This design benefits scientific HPC applications by providing better CPU-memory bandwidth.

  • ASC Purple nodes are built from Dual-chip Modules (DCMs). Each node has a total of eight DCMs. A photo showing these appears in the next section below.
p5 575 DCM


Hardware

p5 575 Node and Frame

p5 575 Node Characteristics:

ASC Purple Frames:

  • Like LC's other POWER systems, ASC Purple nodes and switch hardware are housed in frames. An example frame used for the p5 575 compute nodes is shown at right.

  • Frame characteristics:
    • Redundant frame power supply
    • Air cooling
    • Concurrent (hot swappable) node maintenance
    • Monitoring and control from a single point via the Hardware Management Console (HMC)

  • ASC Purple frames can hold up to twelve p5 575 nodes.

  • Some frames are used solely for switch hardware - stages 2 and 3.

  • Managed via the Hardware Management Console/cluster
ASC Purple frame


Hardware

High Performance Switch (HPS) Network

Quick Intro:

Topology:

Switch Network Characteristics:

Switch Drawer:

Switch Board:

  • The switch board is really the heart of the HPS network. The main features of the switch board are listed below.

  • There are 8 logical Switch Chips, each of which is connected to 4 other Switch Chips to form an internal 4x4 crossbar switch.

  • A total of 32 ports, controlled by Link Driver Chips on riser cards, are used to connect to nodes and/or other switch boards.

  • Depending upon how the Switch Board is used, it will be called a Node Switch Board (NSB) or Intermediate Switch Board (ISB):
    • NSB: 16 ports are configured for node connections. The other 16 ports are configured for connections to switch boards in other frames.
    • ISB: all ports are used to cascade to other switch boards.
    • Practically speaking, the distinction between an NSB and ISB is only one of topology. An ISB is just located higher up in the network hierarchy.

  • Switch-node connections are by copper cable. Switch-switch connections can be either copper or optical fiber cable.

  • Minimal hardware latency: approximately 59 nanoseconds to cross each Switch Chip.

  • Some simple example configurations using both NSB and ISB switch boards are shown below. The number "4" refers to the number of ports connecting each ISB to each NSB.
HPS Switch Board

Switch Network Interface (SNI): SNI diagram

Switch Application Performance:



Hardware

GPFS Parallel File System

Overview:

LC Configuration Details:



Accounts

Note: This section represents a subset of the information available on LC's HPC accounts web pages located at computing.llnl.gov/accounts. Please consult those pages for forms and additional details.

How to Obtain an Account on uP and/or Purple:

Capability Computing:



Access

Note: This section represents a subset of the information available on LC's HPC access web pages located at computing.llnl.gov/access. Please consult those pages for additional details.

Summary:

One Time Passwords (OTP):

DCE Passwords:

Login Nodes:

Tri-lab Login Exceptions:

SSH:

Internet Access Services:

Web Page Access:

  • The majority of LC's user oriented web pages are publicly available without restriction over the Internet. Accessing these pages does not require any special account or password authentication. These pages are on LLNL's unrestricted ("green") network.

  • Web pages on LLNL's unrestricted network have been approved for public access after passing through a Review and Release process and receiving a UCRL number.

  • However, some user web pages are considered "internal" and may only be viewed by those who have the necessary authentication. These pages are on the restricted (yellow) network.

  • Pages on the restricted network may contain vendor-confidential information, site-confidential information, or simply material of no general interest to non-LLNL people, and have not gone through the Review and Release process.

  • Accessing restricted web pages requires being on-site at LLNL, or having an appropriate Internet Service Account such as VPN, VPN-C or OTS.

  • When attempting to access an internal web page, you will typically see a rerouting message and a password dialog box, such as the one shown at right.
Internal Web Page Access

SecureNet:



User Environment Topics


This section briefly covers a number of topics that will be of interest to users who are new to LC's HPC environment and Purple systems in particular. Existing LC users will already be familiar with most of these topics. Additional details can be found by searching LC's computing web pages at computing.llnl.gov and also by consulting the LC Resources tutorial.

Topics covered include:

Login Files:

Home Directories:

Temporary File Systems:

Archival Storage:

File Transfer:

File Sharing:

File Interchange System (FIS):

Mail:

Help and Documentation:



Software and Development Environment


The software and development environment for ASC Purple systems is very similar to that shared by other LC systems. Topics relevant to Purple are discussed below. For more information about topics shared by all LC systems, see the Introduction to LC Resources tutorial and search the LC Home Page.

AIX Operating System:

Parallel Environment:

Compilers:

IBM Math Libraries:

Batch System:

Software Tools:

Video and Graphics Services:



Parallel Operating Environment (POE) Overview


Most of what you'll do on any parallel IBM AIX POWER system will be under IBM's Parallel Operating Environment (POE) software. This section provides a quick overview. Other sections provide the details for actually using POE.

PE vs POE:

Types of Parallelism Supported:

Interactive and Batch:

Typical Usage Progression:

A Few Miscellaneous Words About POE:

Some POE Terminology:



Compilers


Compilers and Compiler Scripts:

Compiler Syntax:

Common Compiler Invocation Commands:
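
For illustration, here are some of the usual IBM compiler invocation commands, in both serial and POE/MPI wrapper forms. This is a general IBM list offered as an assumption, not necessarily the exact set installed on Purple; myprog.f90 is an illustrative file name.

    # Serial invocations:      xlc, xlC, xlf, xlf90   (thread-safe forms append _r, e.g. xlc_r)
    # MPI (POE) wrapper forms: mpcc_r, mpCC_r, mpxlf_r, mpxlf90_r
    mpxlf90_r -O2 -q64 -o myprog myprog.f90    # example 64-bit MPI Fortran 90 compile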

Compiler Options:

32-bit versus 64-bit:

Optimization:

Miscellaneous:

See the IBM Documentation - Really!



MPI


IBM's MPI Library:

Usage Notes:

Programming and Performance Considerations:



Running on Purple Systems

Important Differences

For those who are familiar with LC's other IBM systems, note that there are a few very important differences between those systems and Purple systems. These differences are briefly discussed here for visibility, and covered in more detail later as needed.

Large Pages:

SLURM:

RDMA:

Simultaneous Multi-Threading (SMT):

POE Co-Scheduler:



Running on Purple Systems

Understanding Your System Configuration

First Things First:

System Configuration/Status Information:

LC Configuration Commands:

IBM Configuration Commands:
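
For example, standard AIX commands such as the following report node configuration details. This is a general AIX illustration, not necessarily the tutorial's own list.

    oslevel -r              # report the AIX version and maintenance level
    prtconf                 # summarize processors, memory, and devices
    lscfg | more            # list installed hardware devices
    lsattr -El sys0         # show system attributes, such as real memory
    lslpp -L | grep ppe     # check installed Parallel Environment (ppe) filesets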



Running on Purple Systems

Setting POE Environment Variables

In General:
Note: At LC, POE does not behave exactly as documented by IBM. This is mostly due to LC's use of LCRM and SLURM.

How to Set POE Environment Variables:

POE, Moab and SLURM:

Basic Interactive POE Environment Variables:

Example Basic Interactive Environment Variable Settings:
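
As an illustration, a minimal set of interactive settings in csh syntax might look like the following. The task and node counts are arbitrary examples, and MP_RMPOOL is shown pointing at the pdebug pool discussed later, not as a recommended value.

    setenv MP_PROCS 8          # total number of parallel tasks
    setenv MP_NODES 1          # number of nodes to run on
    setenv MP_RMPOOL pdebug    # pool/partition to run in (example value)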

Other Common/Useful POE Environment Variables:

LLNL Preset POE Environment Variables:



Running on Purple Systems

Invoking the Executable

Syntax:
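
A sketch of the common POE invocation, with program and option names used only for illustration:

    poe myprog [program_args] [poe_options]

    # For example, specifying resources on the command line rather than
    # through environment variables:
    poe myprog -procs 8 -nodes 1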

Multiple Program Multiple Data (MPMD) Programs:

Using POE with Serial Programs:

POE Error Messages:



Running on Purple Systems

Monitoring Job Status



Running on Purple Systems

Interactive Job Specifics

The pdebug Interactive Pool/Partition:

Insufficient Resources:

Killing Interactive Jobs:

Running on Purple Systems

Batch Job Specifics

Moab Workload Manager:

Submitting Batch Jobs:

Quick Summary of Common LCRM Batch Commands:

Batch Jobs and POE Environment Variables:

Logging Into Batch Nodes:

Note: For serial and other non-MPI jobs, you will need to put a "dummy" POE command in your batch script if you want to be able to log in to a batch node while your job is running on it. Something as simple as poe true or poe hostname will work. Put this command as the first executable command in your job script.
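
As a sketch, a serial batch script using this trick might look like the following. The script contents and file names are illustrative; only the placement of the dummy poe command matters.

    #!/bin/csh
    # Dummy POE command placed first so that logins to the batch node
    # are permitted while the job runs:
    poe hostname
    # ... serial (non-MPI) work follows ...
    ./my_serial_app

Such a script would then be submitted in the usual way, for example with LCRM's psub command.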

Killing Batch Jobs:



Running on Purple Systems

More On SLURM

SLURM and a few SLURM commands have been briefly discussed in several places already in this tutorial. This section provides a concise summary of useful/important SLURM information for Purple systems.

SLURM Architecture:

SLURM Commands:
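
For illustration, a few generic SLURM commands with common options (not an exhaustive or Purple-specific list; <jobid> is a placeholder):

    sinfo                  # show partitions and node states
    squeue                 # list jobs in the queue
    squeue -u $USER        # list only your own jobs
    scancel <jobid>        # cancel (signal) a job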

SLURM Environment Variables:

Miscellaneous:

Additional Information:



Running on Purple Systems

Optimizing CPU Usage

SMP Nodes:

Effectively Using Available CPUs:

When Not to Use All CPUs:



Running on Purple Systems

Large Pages

Large Page Overview:

Large Pages and Purple:

How to Enable Large Pages:
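
On AIX generally, an application can request large page data in a few ways, sketched below as a general AIX illustration. LC may have had site-specific defaults and policies; myprog is an illustrative name.

    # Mark an existing executable to use large page data:
    ldedit -blpdata myprog

    # Or link the executable with the large page option:
    mpcc_r -o myprog myprog.c -blpdata

    # Or request large page data at run time via the AIX loader control variable
    # ("Y" makes large pages mandatory, "M" advisory):
    setenv LDR_CNTRL LARGE_PAGE_DATA=Y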

When NOT to Use Large Pages:

Miscellaneous Large Page Info:



Running on Purple Systems

RDMA

What is RDMA?

How to Use RDMA:
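
As a sketch, RDMA (bulk message transfer) in IBM's Parallel Environment is typically enabled through POE environment variables such as the following. The threshold value is an arbitrary example, and LC may preset these for you.

    setenv MP_USE_BULK_XFER yes           # enable RDMA / bulk transfer
    setenv MP_BULK_MIN_MSG_SIZE 65536     # example message-size threshold (bytes)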



Debugging With TotalView


TotalView windows

The Very Basics:

  1. Be sure to compile your program with the -g option (a compile sketch appears after this list).

  2. When starting TotalView, specify the poe process and then use TotalView's -a option for your program and any other arguments (including POE arguments). For example:
    totalview poe -a myprog -procs 4
  3. TotalView will then load the poe process and open its Root and Process windows as usual. Note that the poe process appears in the Process Window.

  4. Use the Go command in the Process Window to start poe with your executable.

  5. TotalView will then attempt to acquire your partition and load your job. When it is ready to run your job, you will be prompted about stopping your parallel job (below). In most cases, answering yes is the right thing to do.

    TotalView Prompt

  6. Your executable should then appear in the Process Window. You are now ready to begin debugging your parallel program.

  7. For debugging in batch, see Batch System Debugging in LC's TotalView tutorial.
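
For step 1 above, a compile sketch using the IBM thread-safe compiler wrappers (myprog.c and myprog.f are illustrative file names):

    mpcc_r  -g -o myprog myprog.c     # C
    mpxlf_r -g -o myprog myprog.f     # Fortran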

A Couple of LC-Specific Notes:

TotalView and Large Pages:



Misc - Recommendations, Known Problems, Etc.


Purple Performance Recommendations:

Performance Related POE Environment Variables:

DAT Times:

Parallel I/O Warnings:




This completes the tutorial.

Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.




References and More Information