Livermore Computing Resources and Environment

Author: Blaise Barney, Lawrence Livermore National Laboratory UCRL-MI-133316

Table of Contents

  1. Abstract
  2. Organization
  3. Terminology
  4. Hardware
    1. Systems Summary
    2. IBM BG/Q Systems
    3. Intel Xeon Systems
    4. CORAL Systems
    5. Future Systems
    6. Typical LC Linux Cluster
    7. Interconnects
    8. Facilities, Machine Room Tours, Photos
  5. Accounts
  6. Accessing LC Systems
    1. Passwords and Authentication
    2. Access Methods
    3. Where to Login
    4. A Few More Words About SSH
    5. Remote Access Services
    6. SecureNet
  7. File Systems
    1. Home Directories and Login Files
    2. /usr/workspace File Systems
    3. Temporary File Systems
    4. Parallel File Systems
    5. Archival HPSS Storage
    6. /usr/gapps, /usr/gdata File Systems
    7. Quotas
    8. Purge Policies
    9. Backups
    10. File Transfer and Sharing
    11. File Interchange Service (FIS)
  8. System Configuration and Status Information
  9. Exercise 1
  10. Software and Development Environment Overview
    1. Development Environment Group (DEG)
    2. TOSS Operating System
    3. Software Lists
    4. Modules and Dotkit
    5. Atlassian Tools - Confluence, JIRA, etc.
  11. Compilers
  12. Debuggers
  13. Performance Analysis Tools
  14. Graphics Software and Resources
  15. Running Jobs
    1. Where to Run?
    2. Batch Versus Interactive
    3. Starting Jobs - srun
    4. Interacting With Jobs
    5. Other Topics of Interest
  16. Batch Systems
  17. Miscellaneous Topics
    1. Clusters With GPUs
    2. Big Data at LC
    3. Green Data Oasis
    4. Security Reminders
  18. Where to Get Information & Help
  19. Exercise 2




Abstract


This is the second tutorial in the "Livermore Computing Getting Started" workshop. It provides an overview of Livermore Computing's (LC) supercomputing resources and how to use them effectively. As such, it is definitely intended as a "getting started" document for new users or for those who want to know "in a nutshell" what supercomputing at LC is all about from a practical user's perspective. It is also intended to provide essential, practical information for those planning to attend the other tutorials in this workshop.

A wide variety of topics is covered in what is, hopefully, a logical progression, starting with a description of the LC organization, a summary of the available supercomputing hardware resources, how to obtain an account, and how to access LC systems. Important aspects of the user environment are then addressed, such as the user's home directory, various files and file systems, how to transfer/share files, quotas, archival storage, and getting system status/configuration information. A brief description of the software development environment (compilers, debuggers, and performance tools), a summary of video and graphics services, and the basics of how to run jobs follow. Several miscellaneous topics are discussed. Finally, this tutorial concludes with a discussion of where to obtain more information and help. Note: This tutorial only provides an overview of using LC's Slurm/Moab batch systems - these topics are covered in the EC4045 "Slurm and Moab" tutorial.

Level/Prerequisites: This tutorial is geared to new users of LC systems and might actually be considered a prerequisite for using LC systems and attending other tutorials that describe parallel programming on LC systems in more detail.



Organization


What Is Livermore Computing?

History of Livermore Computing:



Terminology


"The acronyms can be a bit overwhelming"
- Excerpt from a workshop attendee evaluation form

DISCLAIMER: All information presented today is subject to change! This information was current as of June 2018.


Hardware

Systems Summary

Mix of Resources:

Primary Systems:

Peak Comparisons:



Hardware

IBM Blue Gene/Q Systems

 BG/Q users should consult the "Additional Information" references below. This tutorial does not cover much of the unique BG/Q architecture and environment.
Overview:
  • Unique BG/Q architecture - some key features:
    • 64-bit, 16 PowerPC A2 cores @1.6 GHz per node
    • 4 hardware threads per core
    • 5-Dimensional Torus network
    • Extremely power efficient
    • Water cooling
    • Transactional "rollback" memory in hardware

  • seq:
    • Sequoia is a 20 Pflop, classified BG/Q machine with 98,304 compute nodes and 1,572,864 cores.
    • Sequoia was ranked as the world's most powerful computer from June through November 2012. The NNSA press release is available HERE.
    • Shared among Tri-lab users
    • Accounts provided through the Advanced Technology Computing Campaign (ATCC) proposal process.

  • vulcan:
    • Vulcan is a 5 Pflop BG/Q system in the unclassified Collaboration Zone (CZ)
    • Identical architecture to seq - just smaller
    • Mostly LLNL ASC/M&IC/HPCIC and PSAAP university users

  • rzuseq:
    • rzuseq is a 512-node system in the unclassified Restricted Zone (RZ)
    • Not a production machine
    • Identical architecture to seq and vulcan - just smaller

Additional Information:

 Sequoia BG/Q Tutorial: computing.llnl.gov/tutorials/bgq
 Highly recommended for all BG/Q users, due to BG/Q's unique architecture and environment.




Sequoia BG/Q System



Vulcan BG/Q System

System Details:



Hardware

Intel Xeon Systems

Overview:
  • The majority of LC's systems are Intel Xeon-based Linux clusters, and include the following processor architectures:
    • Intel Xeon 18-core E5-2695 v4 (Broadwell)
    • Intel Xeon 8-core E5-2670 (Sandy Bridge - TLCC2), with/without NVIDIA GPUs
    • Intel Xeon 12-core E5-2695 v2 (Ivy Bridge)

  • Mix of resources:
    • 8, 12 and 18 core processors
    • OCF and SCF
    • ASC, M&IC, VIZ
    • Capacity, Grand Challenge, visualization, testbed
    • Several GPU enabled clusters

  • 64-bit architecture

  • TOSS operating system stack (TOSS 2 and TOSS 3)

  • InfiniBand interconnect

  • Hyper-threading enabled (2 threads/core)

  • Vector/SIMD operations

  • For detailed hardware information, please see the "Additional Information" references below.

Additional Information:




Quartz Intel Cluster



Zin Intel Cluster

System Details:



Hardware

CORAL Systems

CORAL:

  • CORAL = Collaboration of Oak Ridge, Argonne, and Livermore

  • A first-of-its-kind U.S. Department of Energy (DOE) collaboration between the NNSA's ASC Program and the Office of Science's Advanced Scientific Computing Research program (ASCR).

  • CORAL is the next major phase in the DOE's scientific computing roadmap and path to exascale computing.

  • Will culminate in three ultra-high performance supercomputers at Lawrence Livermore, Oak Ridge, and Argonne national laboratories.

  • Will be used for the most demanding scientific and national security simulation and modeling applications, and will enable continued U.S. leadership in computing.

  • The three CORAL systems are:
    • LLNL: Sierra
    • ORNL: Summit
    • ANL: Aurora

  • LLNL and ORNL systems are being delivered in the 2017-18 timeframe. The Argonne system's planned delivery (revised) is in 2021.


CORAL Early Access (EA) Systems:
  • In preparation for the final-delivery Sierra systems, LLNL has implemented three "early access" systems, one on each network:
    • ray - OCF-CZ
    • rzmanta - OCF-RZ
    • shark - SCF

  • Primary purpose is to provide platforms where Tri-lab users can begin porting and preparing for the hardware and software that will be delivered with the final Sierra systems.

  • Similar to the final-delivery Sierra systems, but they use the previous generation of IBM POWER processors and NVIDIA GPUs.

  • IBM Power Systems S822LC Server:
    • Hybrid architecture using IBM POWER8+ processors and NVIDIA Pascal GPUs.

  • IBM POWER8+ processors:
    • 2 per node (dual-socket)
    • 10 cores/socket; 20 cores per node
    • 8 SMT threads per core; 160 SMT threads per node
    • Clock: due to adaptive power management options, the clock speed can vary depending upon the system load. At LC, speeds can vary from approximately 2 GHz to 4 GHz.

  • NVIDIA GPUs:
    • 4 NVIDIA Tesla P100 (Pascal) GPUs per compute node (not on login/service nodes)
    • 3584 CUDA cores per GPU; 14,336 per node

  • Memory:
    • 256 GB DDR4 per node
    • 16 GB HBM2 (High Bandwidth Memory 2) per GPU; 732 GB/s peak bandwidth

  • NVLINK 1.0:
    • Interconnect for GPU-GPU and CPU-GPU shared memory
    • 4 links per GPU with 160 GB/s total bandwidth

  • NVRAM:
    • 1.6 TB NVMe PCIe SSD per compute node (CZ ray system only)

  • Network:
    • Mellanox 100 Gb/s Enhanced Data Rate (EDR) InfiniBand
    • One dual-port 100 Gb/s EDR Mellanox adapter per node

  • Parallel File System: IBM Spectrum Scale (GPFS)
    • ray: 1.3 PB
    • rzmanta: 431 TB
    • shark: 431 TB

  • Batch System: IBM Spectrum LSF


CORAL EA Ray Cluster

Sierra Systems:
  • Sierra is a classified, 125 petaflop, IBM Power Systems AC922 hybrid architecture system consisting of IBM POWER9 nodes with NVIDIA Volta GPUs. Sierra is a Tri-lab resource sited at Lawrence Livermore National Laboratory.

  • Unclassified Sierra systems are similar, but smaller, and include:
    • lassen - a 20 petaflop system located on LC's CZ.
    • rzansel - a 1.5 petaflop system located on LC's RZ.

  • IBM Power Systems AC922 Server:
    • Hybrid architecture using IBM POWER9 processors and NVIDIA Volta GPUs.

  • IBM POWER9 processors (compute nodes):
    • 2 per node (dual-socket)
    • 22 cores/socket; 44 cores per node
    • 4 SMT threads per core; 176 SMT threads per node
    • Clock: due to adaptive power management options, the clock speed can vary depending upon the system load. At LC, speeds can vary from approximately 2.0 to 3.1 GHz.

  • NVIDIA GPUs:
    • 4 NVIDIA Tesla V100 (Volta) GPUs per compute, login, and launch node
    • 5120 CUDA cores per GPU; 20,480 per node

  • Memory:
    • 256 GB DDR4 per compute node
    • 16 GB HBM2 (High Bandwidth Memory 2) per GPU; 900 GB/s peak bandwidth

  • NVLINK 2.0:
    • Interconnect for GPU-GPU and CPU-GPU shared memory
    • 6 links per GPU with 300 GB/s total bandwidth

  • NVRAM:
    • 1.6 TB NVMe PCIe SSD per compute node

  • Network:
    • Mellanox 100 Gb/s Enhanced Data Rate (EDR) InfiniBand
    • One dual-port 100 Gb/s EDR Mellanox adapter per node

  • Parallel File System: IBM Spectrum Scale (GPFS)

  • Batch System: IBM Spectrum LSF

  • Warm-water cooled compute nodes


Sierra


Hardware

Future Systems

Advanced Technology Systems (ATS):

  • Supercomputers dedicated to the largest and most complex calculations critical to stockpile stewardship; "capability computing".

  • Typically include leading-edge/novel architecture components, custom engineering

  • Shared across the Tri-labs; accounts granted to projects via a formal proposal process

  • ATS-3 "Crossroads": Will be sited at LANL

  • ATS-4 "El Capitan": Will be sited at LLNL

Commodity Technology Systems (CTS):

  • Robust, cost-effective systems to meet the day-to-day simulation workload needs of the ASC program; "work-horse, capacity computing"

  • Common Tri-Lab procurement with platforms delivered to all three labs; accounts handled independently by each lab.

  • CTS-1 systems are currently in production at all three labs.

  • CTS-2: TBA




Trinity



Sierra



Hardware

Typical LC Linux Cluster

Basic Components:

Nodes:

Frames / Racks:

Scalable Unit:



Hardware

Interconnects

Primary components:

Topology:

Performance:

Hardware

Facilities, Machine Room Tours, Photos

Facilities:
  • Most of LC's computing resources are located in the Livermore Computing Complex (LCC), building 453, and in buildings 451 and 654. The LCC was formerly known as the Terascale Simulation Facility (TSF).

  • Map available HERE

  • LCC highlights:
    • Four-story office tower with 121,600 square feet for 285 offices, a visualization theater, a 150-seat auditorium, and several conference rooms on each floor.
    • Machine room with 48,000 square feet of unobstructed computer room floor
    • 30 megawatts of machine power capacity
    • Mechanical cooling system with cooling towers boasting total capacity of 12,600 gallons per minute, a chiller plant with total capacity of 7,200 tons, and air handlers with a total capacity of 2,720,000 cubic feet per minute
    • 3,600-gallon-per-minute, closed-loop, liquid-cooling system for Sequoia that can cool up to 9.6 megawatts.

  • LC's building 654 comprises 6,000 square feet of computer floor space and is scalable up to 7.5 MW (B654 schematic drawing).

  • Additional reading/viewing:

Machine Room Tours:

  • LLNL hosts can request tours of the B453 machine room for visitors and groups. Hosts are responsible for providing Administrative Escorts (AE) and ensuring AE policies/rules are followed.

  • Tour participants must be US citizens

  • For Livermore Computing Complex Building 453 tour information, please contact hpc-tours@llnl.gov.

  • Summer students: Weekly tours are offered for summer students (US citizens). See the Lab Events Calendar for details and registration: ebb.llnl.gov.

Machine Photos:



Accounts


Overview:
  • The process for obtaining an LC account varies, depending upon factors such as:
    • Lab employee?
    • Collaborator (non-employee)?
    • Foreign national?
    • Classified or unclassified?

  • It also involves more than one account processing system:

  • Because things can get a little complex, you should consult the LC accounts documentation at: hpc.llnl.gov/accounts.

  • One Time Password (OTP) Tokens:
    • For OCF accounts, you will receive, via US mail, an RSA One-Time Password (OTP) token. Instructions on how to activate and use this token are included with your account notification email.
    • For OCF RZ accounts, you will also receive an RZ RSA OTP token.
    • For SCF accounts, you will be asked to visit the LC Hotline to obtain your OTP token and set up your PIN.

  • Required training: All account requests require completion of online training before they are activated.

  • Annual Renewal: Accounts are subject to annual revalidations and completion of online training.

  • Foreign national accounts require additional processing and take longer to set up.

  • Virtual Private Network (VPN) Account: may also be required for remote access. Discussed later under Remote Access Services.

  • Questions? Contact the LC Hotline: (925) 422-4533 lc-support@llnl.gov


Accessing LC Systems

Passwords and Authentication

One-time Passwords (OTP):

OCF Collaboration Zone (CZ) or Restricted Zone (RZ)?



Accessing LC Systems

Access Methods

SSH Required:
OCF Access - Collaboration Zone (CZ):
  • Simply use SSH to the cluster login - for example:

    ssh quartz.llnl.gov

  • Authenticate with your LC username and PIN + OTP RSA token

  • Works the same from inside or outside the LLNL network

  • LANL / Sandia:
    • Begin on a LANL/Sandia iHPC login node. For example:
      ihpc-login.sandia.gov
      ihpc-gate1.lanl.gov
    • Then use the ssh -l LCusername command to log in, where LCusername is your LC username. No password required. For example:

      ssh -l joeuser quartz.llnl.gov



OCF Access - Restricted Zone (RZ):
  • From inside LLNL:
    • You must be inside the RZ or LLNL institutional network. Access from the CZ is not permitted.
    • SSH to the gateway machine rzgw.llnl.gov
    • Authenticate with your LC username and RZ PIN + OTP token
    • Then SSH to the desired RZ cluster
    • Authenticate with your LC username and RZ PIN + OTP token (note: SSH keys can be used to bypass this step).

  • From outside LLNL:
    • Must have a Remote Access Service account (discussed later) already set up - usually VPN.
    • First, start up and authenticate to your Remote Access Service account. If you are using LLNL's VPN, use your LLNL OUN (Official User Name) and your PIN + OTP RSA token.
    • SSH to the gateway machine rzgw.llnl.gov
    • Authenticate with your LC username and RZ PIN + OTP token
    • Then SSH to the desired RZ cluster
    • Authenticate with your LC username and RZ PIN + OTP token (note: SSH keys can be used to bypass this step). A combined session sketch appears after this list.

  • LANL / Sandia:
    • Begin on a LANL/Sandia iHPC login node:
      • Sandia - start from ihpc.sandia.gov
      • LANL - start from ihpc-gate1.lanl.gov
    • Then use the ssh -l LCusername command to log in to the RZ gateway node, where LCusername is your LC username. For example:

      ssh -l joeuser rzgw.llnl.gov

    • Authenticate with your RZ PIN + OTP token
    • On rzgw: kinit sandia-username@dce.sandia.gov or kinit lanl-username@lanl.gov
    • Enter Sandia/LANL kerberos password
    • Then ssh to desired RZ machine. No password required.
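
  Putting the "From outside LLNL" steps together, a minimal session sketch might look like the following (this assumes an active LLNL VPN session; "rzcluster" is a hypothetical placeholder for the RZ cluster you want to reach):

      ssh rzgw.llnl.gov        # authenticate with LC username and RZ PIN + OTP token
      ssh rzcluster            # authenticate again with RZ PIN + OTP (or use SSH keys)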




SCF Access:
  • Within the SCF at LLNL:
    • Simply ssh to the cluster login and authenticate with your PIN + OTP RSA token

  • LANL / Sandia:
    • Authenticate on a designated LANL/Sandia machine locally using the kinit -f command.
      Be sure to specify the -f option for a forwardable credential.
    • For LANL only: connect to the LANL gateway machine: ssh red-wtrw
    • Then use the ssh -l LCusername command to log in, where LCusername is your LC username. No password required. For example:

      ssh -l joeuser seq.llnl.gov

  • From other classified DOE sites over SecureNet:
    • Use SSH with your PIN + OTP RSA token
    • RSA/DSA key authentication is disabled


SSH Examples:

CZ and RZ Access Methods:

Web Page Access:



Accessing LC Systems

Where to Login

Login Nodes

Cluster Login

Logging Into Compute Nodes:



Accessing LC Systems

Remote Access Services

Services Available:

For Help, Software Downloads and More Information:



Accessing LC Systems

A Few More Words About SSH

OpenSSH:

RSA/DSA Authentication (SSH Keys):

SSH Timeouts:

SSH and X11:

  • If you are logged into an LC cluster from your desktop, and are running applications that generate graphical displays, you will need to have X11 set up on your desktop.

  • Linux: automatic - nothing special needs to be done in most cases

  • Macs: you'll need X server software installed. XQuartz is commonly used (http://www.xquartz.org/).

  • Windows: you'll need X server software installed. LLNL provides X-Win32, which can be downloaded/installed from your desktop's LANDesk Management software. Xming is a popular, free X server available for non-LLNL systems.

  • Helpful Hints:

    • X-Win32 setup instructions for LLNL: https://hpc.llnl.gov/manuals/access-lc-systems/x-win32-configuration

    • It's usually not necessary to define your DISPLAY variable in an SSH session between LC hosts. It should be picked up automatically.

    • Make sure your X server is set up to allow tunneling/forwarding of X11 connections BEFORE you connect to the LC host.

    • Often, you need to supply the -X or -Y flag to your ssh command to enable X11 forwarding.

    • You may also try setting the two parameters below in your .ssh/config file (a fuller example appears after this list):

      ForwardX11=yes
      ForwardX11Trusted=yes

    • Use the verbose option to troubleshoot problems:

      ssh -v [other options] [host]
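
  As an illustration, a minimal ~/.ssh/config entry that enables X11 forwarding for LC hosts might look like the sketch below (the host pattern is only an example; adjust it to the systems you actually use):

      Host *.llnl.gov
          ForwardX11 yes
          ForwardX11Trusted yes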



Need SSH?

More Information:



Accessing LC Systems

SecureNet





File Systems

Home Directories and Login Files

Home Directories:

LC's Login Files:

Master Dot Files:

Architecture Specific Dot Files:

Operating System Specific Dot Files:

A Few Hints:

Need a New Copy?



File Systems

/usr/workspace File Systems



File Systems

Temporary File Systems

Useful Commands:
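
  For example, the standard Linux utilities below can be used to see how full a temporary file system is and how much of it your own files consume (the paths shown are illustrative only):

      df -h /tmp             # capacity and free space of the file system
      du -sh /tmp/$USER      # total size of your files under that path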



File Systems

Parallel File Systems

Overview:

Linux Parallel File Systems - Lustre:

LC Parallel File Systems Summary:



File Systems

Archival HPSS Storage

Access Methods and Usage:

Recommendation: Initiate your file transfers from one of LC's special purpose clusters, which have been optimized for high-speed data movement to storage:
oslic on the OCF-CZ
rzslic on the OCF-RZ
cslic on the SCF
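
For example, a directory can be archived directly into HPSS with htar from one of these clusters (a hedged sketch; "myproject" is a placeholder directory name):

      ssh oslic.llnl.gov
      htar -cvf myproject.tar ./myproject    # create the archive directly in HPSS
      htar -tvf myproject.tar                # list the archive's contents in HPSS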

Additional Information:



File Systems

/usr/gapps, /usr/gdata File Systems

Overview:



File Systems

Quotas

Home Directories:

Exceeding quota:
  • A warning appears in your login messages if usage exceeds 90% of quota
  • Heed quota warnings - there is a risk of data loss if the quota is exceeded!
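
  A hedged example of checking your current usage and limits with the standard quota command:

      quota -v    # report usage and quota limits for your home directory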

Other File Systems:



File Systems

Purge Policies

Temporary files - don't forget:


File Systems

Backups

Online .snapshot directories

Livermore Computing System Backups

Archival HPSS Storage



File Systems

File Transfer and Sharing

File Transfer Tools:

  • There are a number of ways to transfer files - depending upon what you want to do.

  • hopper - A powerful, interactive, cross-platform tool that allows users to transfer and manipulate files and directories by means of a graphical user interface. Users can connect to and manage resources using most of the major file transfer protocols, including FTP, SFTP, SSH, NFT, and HTAR. See the hopper web pages (https://hpc.llnl.gov/software/hopper), the hopper man page, or use the hopper -readme command for more information.

  • ftp - Is available for file transfer between LC machines. The ftp client at LC is an optimized parallel ftp implementation. It can be used to transfer files with machines outside LLNL if the command originates from an LLNL machine and the foreign host will permit it. FTP to LC machines from outside LLNL is not permitted unless the user is connected via an appropriate Remote Access service such as OTS or VPN. Documentation is available via the ftp man page or the FTP Usage Guide (https://hpc.llnl.gov/manuals/ezstorage/ftp).

  • scp - (secure copy) is available on all LC machines. Example:

    scp thisfile user@host2:thatfile

  • sftp - Performs ftp-like operations over encrypted ssh.

  • MyLC - Livermore Computing's user portal provides a mechanism for transferring files between your desktop machine and your home directory on an LC machine. See the "utilities" tab. Available at mylc.llnl.gov.

  • nft - (Network File Transfer) is LC's utility for persistent file transfer with job tracking. This is a command line utility that assumes transfers with storage and has a specific syntax. Documentation is available via its man page or the NFT Reference Manual (https://hpc.llnl.gov/manuals/ezstorage/nft).

  • htar - Is highly optimized for creation of archive files directly into HPSS, without having to go through the intermediate step of first creating the archive file on local disk storage, and then copying the archive file to HPSS via some other process such as ftp. The program uses multiple threads and a sophisticated buffering scheme in order to package member files into in-memory buffers, while making use of the high-speed network striping capabilities of HPSS. Syntax resembles that of the UNIX tar command. Documentation is available via its man page or the HTAR Reference Manual (https://hpc.llnl.gov/manuals/ezstorage/htar).

  • hsi - Hierarchical Storage Interface. HSI is a utility that communicates with HPSS via a user-friendly interface that makes it easy to transfer files and manipulate files and directories using familiar UNIX-style commands. HSI supports recursion for most commands as well as CSH-style support for wildcard patterns and interactive command line and history mechanisms. Documentation is available via its man page or the HSI website (http://www.mgleicher.us/). A short usage sketch follows this list.

  • Tri-lab high bandwidth file transfers over SecureNet:
    • All three Labs support wrapper scripts for enhanced data transfer between sites - classified side only.
    • Three different protocols can be used: hsi, htar and pftp.
    • Transfers can be from host to storage or host to host
    • Commands are given names that are self-explanatory - see the accompanying image at right.
    • At LLNL, these scripts are located in /usr/local/bin
    • For additional information, please see https://aces.sandia.gov/hpss_info (requires Sandia authentication)
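
  To make one of these tools concrete, the hedged sketch below shows hsi being used for simple transfers to and from HPSS storage (the file name is a placeholder):

      hsi put results.tar            # copy a local file into HPSS
      hsi ls                         # list files in your HPSS home directory
      hsi get results.tar            # retrieve the file from HPSS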

File Sharing Rules:

  • User home directories are required to be accessible to the user only. No group or world sharing is permitted without Associate Director approval.

  • Group sharing is permitted in /usr/workspace group directories.

  • Group sharing is permitted in Lustre directories.

  • The collaborative /usr/gapps file systems permit group sharing. World sharing is permitted with Associate Director approval.

Hopper


MyLC


Tri-lab SCF File Transfers

Give and Take Utilities:

Anonymous FTP server:



File Systems

File Interchange Service (FIS)

Usage:



System Configuration and Status Information


First Things First:


LC Homepage: hpc.llnl.gov

MyLC User Portal: mylc.llnl.gov

System Configuration Information:

System Configuration Commands:

System Status Information:

  • LC Homepage:
    • hpc.llnl.gov (User Portal toggle) - just look on the main page for the System Status links.
    • The same links appear under the Hardware menu.
    • Unclassified systems only

  • MyLC Portal:
    • mylc.llnl.gov
    • Several portlets provide system status information:
      • machine status
      • login node status
      • scratch file system status
      • enclave status
    • Classified MyLC is at: https://lc.llnl.gov/lorenz/

  • Machine status email lists:
    • Provide the most timely status information for system maintenance, problems, and system changes/updates
    • ocf-status and scf-status cover all machines on the OCF / SCF
    • Additionally, each machine has its own status list - for example:
      sierra-status@llnl.gov

  • Login banner & news items - always displayed immediately after logging in
    • Login banner includes basic configuration information, announcements and news items. Example login banner HERE.
    • News items (unread) appear at the bottom of the login banner. For usage, type news -h.



Exercise 1

Logging In, Basic Configuration and File Systems Information

Overview:
  • Login to an LC cluster with X11 forwarding enabled
  • Test X11
  • Identify and SSH to other login nodes
  • Familiarize yourself with the cluster's configuration
  • Try the mxterm utility to access compute nodes
  • Learn where/how to obtain hardware, OS and other configuration information for LC clusters
  • Review basic file system info
  • Try moving files to the HPSS storage system
  • View file system status information

GO TO THE EXERCISE HERE

    Approx. 20-25 minutes



Software and Development Environment Overview


Development Environment Group (DEG):

TOSS Operating System:

Software Lists, Documentation and Downloads:

Modules:

On LC's older TOSS 2 clusters, some software applications use both Modules and Dotkit.
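
As a quick illustration, typical Modules commands look like the following (the package name and version are examples only; use "module avail" to see what a given cluster actually provides):

      module avail                  # list available software packages
      module load intel/18.0.1      # load a specific package/version
      module list                   # show currently loaded modules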

Dotkit:

Atlassian Tools:



Compilers

General Information

Available Compilers and Invocation Commands:
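
As a hedged illustration only (actual compiler names, versions, and MPI wrapper scripts vary by cluster and are listed in LC's compiler documentation), typical invocation commands look like this:

      icc -O2 -o hello hello.c              # Intel C compiler
      gcc -O2 -o hello hello.c              # GNU C compiler
      mpicc -O2 -o hello_mpi hello_mpi.c    # MPI wrapper around the default C compiler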

Compiler Versions and Defaults:

Compiler Options:

Compiler Documentation:

Optimizations:

Floating-point Exceptions:

Precision, Performance and IEEE 754 Compliance:

Mixing C and Fortran:



Debuggers

Debuggers

 This section only touches on selected highlights. For more information, users will definitely need to consult the relevant documentation mentioned below. Also, please consult the "Development Environment Software" web page located at https://hpc.llnl.gov/software/development-environment-software.

TotalView:
  • TotalView is probably the most widely used debugger for parallel programs. It can be used with C/C++ and Fortran programs and supports all common forms of parallelism, including Pthreads, OpenMP, MPI, accelerators and GPUs.

  • Starting TotalView for serial codes: simply issue the command:

      totalview myprog

  • Starting TotalView for interactive parallel jobs:

    • Some special command line options are required to run a parallel job through TotalView under SLURM. You need to run srun under TotalView, and then specify the -a flag followed by 1) srun options, 2) your program, and 3) your program flags (in that order). The general syntax is:

      totalview srun -a -n #processes -p pdebug myprog [prog args]

    • To debug an already running interactive parallel job, simply issue the totalview command and then attach to the srun process that started the job.

    • Debugging batch jobs is covered in LC's TotalView tutorial and in the "Debugging in Batch" section below.

  • Documentation:

DDT:
  • DDT stands for "Distributed Debugging Tool", a product of Allinea Software Ltd.

  • DDT is a comprehensive graphical debugger designed specifically for debugging complex parallel codes. It is supported on a variety of platforms for C/C++ and Fortran, and can be used to debug multi-process MPI programs and multi-threaded programs, including OpenMP.

  • Currently, LC has a limited number of fixed and floating licenses for OCF and SCF Linux machines.

  • Usage information: see LC's DDT Quick Start information located at: https://hpc.llnl.gov/software/development-environment-software/allinea-ddt

  • Documentation: see the vendor website: http://www.allinea.com
DDT screenshot

STAT - Stack Trace Analysis Tool:

Debugging in Batch: mxterm / sxterm:

Other Debuggers:

A Few Additional Useful Debugging Hints:



Performance Analysis Tools


We Need a Book!

Memory Correctness Tools:

Profiling, Tracing and Performance Analysis:

Beyond LC:



Graphics Software and Resources


Graphics Software:

Consulting:

Video Production:

Visualization Machine Resources:

PowerWalls:

Contacts & More Information:



Running Jobs

Where to Run?

  This section only provides a general overview of running jobs on LC systems. Details associated with running jobs are covered in depth in other LC tutorials at https://hpc.llnl.gov/training/tutorials (Slurm and Moab, MPI, BG/Q, OpenMP, Pthreads, etc.).

Determining Your Job's Requirements:

Getting Machine Configuration Information:

Job Limits:

Accounts and Banks:

Serial vs Parallel:

Dedicated Application Time (DAT):



Running Jobs

Batch Versus Interactive

Interactive Jobs (pdebug):

Batch Jobs (pbatch):

 This section only provides a quick summary of batch usage on LC's Linux and BG/Q clusters. For details, see the "Slurm and Moab" and "Sequoia and Vulcan BG/Q" tutorials. For CORAL/Sierra systems, see the Sierra tutorial.



Running Jobs

Starting Jobs - srun

 This section provides an overview of starting jobs on Linux and BG/Q systems. For additional details, see the "Slurm and Moab" and "Sequoia and Vulcan BG/Q" tutorials. For CORAL/Sierra systems, see the Sierra tutorial.

The srun command:
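
A minimal interactive launch might look like the following sketch (the process count, partition, and program name are placeholders):

      srun -n 16 -p pdebug myprog    # launch 16 tasks of myprog in the pdebug partition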

srun options:



Running Jobs

Interacting With Jobs

 This section provides a quick summary of commands used to interact with jobs on Linux and BG/Q systems. For details, see the "Slurm and Moab" and "Sequoia and Vulcan BG/Q" tutorials. For CORAL/Sierra systems, see the Sierra tutorial.

Monitoring Jobs and Displaying Job Information:

Holding / Releasing Jobs:

Modifying Jobs:

Terminating / Canceling Jobs:
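
For reference, the corresponding native Slurm commands for the topics above are sketched below (the job ID is a placeholder; Moab and LSF equivalents are covered in the batch system tutorials):

      squeue -u $USER          # monitor your running and queued jobs
      scontrol hold 123456     # hold a pending job
      scontrol release 123456  # release a held job
      scancel 123456           # cancel a running or queued job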



Running Jobs

Other Topics of Interest

Optimizing Core Usage:

Diskless Nodes:

Process/Thread Binding to Cores:

Vectorization:

Hyper-threading:

Clusters Without an Interconnect (serial and single-node jobs):



Batch Systems


LC Batch Systems



Miscellaneous Topics

Clusters With GPUs




Big Data at LC




Green Data Oasis (GDO)



Security Reminders

Just a Few Reminders... For the Full Story:

Where to Get Information & Help


LC Hotline:

LC Users Home Page: hpc.llnl.gov

  • hpc.llnl.gov: LC maintains extensive web documentation for all systems and also for computing in the LC environment:

  • A few highlights:
    • Accounts - how to request an account; forms
    • Access Information - how to access and login to LC systems
    • Training - online tutorials and workshops
    • Compute Platforms - complete list with details for all LC systems
    • Machine Status shows current OCF machine status with links to detailed information such as MOTD, currently running jobs, configuration, announcements, etc.
    • Software - including compilers, tools, debuggers, visualization, math libs
    • Running Jobs - including using LC's workload managers
    • Documentation - user manuals for a range of topics, technical bulletins, user meetings slides
    • Getting Help - how to contact the LC Hotline

  • Some web pages are password protected. If prompted to enter a userid/password, use your OTP login.

  • Some web pages may only be accessed from LLNL machines or by using one of the LC Remote Access Services covered previously.

Lorenz User Dashboard: mylc.llnl.gov

  • Provides a wealth of real-time information in a user-friendly dashboard

  • Simply enter "mylc" into your browser's address bar. The actual URL is: https://lc.llnl.gov/lorenz/mylc/mylc.cgi


Login Banner:

  • Login banner / MOTD may be very important!
    • News topics for LC, for the login system
    • Some configuration information
    • Useful references and contact information
    • System status information
    • Quota and password expiration warnings also appear when you log in

News Items:

  • News postings on each LC System:
    • Unread news items appear with login messages
    • news -l - list all news items
    • news -a - display content of all news messages
    • news -n - lists unread messages
    • news -s - shows number of unread items
    • news item - shows specified news item
    • You can also list/read the files in /var/news on any system. This is useful when you're searching for a topic you've already read and can't remember the news item name. You can also "grep" on these files.

  • Also accessible from hpc.llnl.gov and Lorenz.

Machine Email Lists:

  • Machine status email lists exist for all LC machines

  • Provide important, timely information not necessarily announced elsewhere

  • ocf-status@llnl.gov and scf-status@llnl.gov are general lists for all users

  • Plus each machine has its own list, for example: zin-status@llnl.gov.

  • The LC Hotline initially populates a list with subscribers, but you can subscribe/unsubscribe yourself anytime using the listserv.llnl.gov website.

LC User Meeting:

  • When held, it is usually scheduled for the first Tuesday of the month at 9:30 am

  • Building 132 Auditorium (or as otherwise announced)

  • Agenda and viewgraphs are posted on the LC Home Page (hpc.llnl.gov): see "Documentation" and look for "User Meeting Viewgraphs". Note that these are LLNL internal web pages.





Exercise 2

Compiling, Running, Job and System Status Information

Overview:
  • Get information about running and queued jobs
  • Get compiler information
  • Compile and run serial programs
  • Compile and run parallel MPI and OpenMP programs, both interactively and in batch
  • Check hyper-threading
  • Get online system status information (and more)

GO TO THE EXERCISE HERE




This completes the tutorial.

      Please complete the online evaluation form.

Where would you like to go now?





Author: Blaise Barney, Livermore Computing. Always interested in comments/corrections!