Linux Clusters Overview

Author: Blaise Barney, Lawrence Livermore National Laboratory UCRL-WEB-150348

Table of Contents

  1. Abstract
  2. Background of Linux Clusters at LLNL
  3. Cluster Configurations and Scalable Units
  4. LC Linux Cluster Systems
  5. Intel Xeon Hardware Overview
  6. AMD Opteron Hardware Overview
  7. Infiniband Interconnect Overview
  8. Software and Development Environment
  9. Compilers
    1. General Information
    2. Intel Compilers
    3. PGI Compilers
    4. GNU Compilers
  10. Exercise 1
  11. MPI
  12. Running Jobs
    1. Overview
    2. Batch Versus Interactive
    3. Starting Jobs - srun
    4. Terminating Jobs
    5. Displaying Queue and Job Status Information
    6. Optimizing CPU Usage
    7. Memory Considerations
    8. Vectorization and Hyper-threading
    9. Process and Thread Binding
  13. Debugging
  14. Tools
  15. Exercise 2
  16. References and More Information



Abstract


This tutorial is intended to be an introduction to using LC's Linux clusters. It begins by providing a brief historical background of Linux clusters at LC, noting their success and adoption as a production, high performance computing platform. The primary hardware components of LC's Linux clusters are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware related information.

After covering the hardware related topics, software topics are discussed, including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are noted. Available debuggers and performance related tools/topics are briefly discussed, however detailed usage is beyond the scope of this tutorial. A lab exercise using one of LC's Linux clusters follows the presentation.

Level/Prerequisites: This tutorial is intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is required. The material covered by the following tutorials would also be helpful:
EC3501: Livermore Computing Resources and Environment
EC4045: Moab and SLURM



Background of Linux Clusters at LLNL


The Linux Project:

Alpha Linux Clusters:

PCR Clusters:

MCR Cluster...and More:

Which Led To Thunder...

  • In September, 2003 the RFP for LC's first IA-64 cluster was released. Proposal from California Digital Corporation, a small local company, was accepted.

  • 1024 node system comprised of 4-CPU Itanium 2 "Madison Tiger4" nodes

  • Thunder debuted as #2 on the Top500 Supercomputers list in June, 2004. Peak performance was 22.9 Tflops.

  • For more information about Thunder see:

The Peloton Systems:

  • In early 2006, LC launched its Opteron/Infiniband Linux cluster procurement with the release of the Peloton RFP.

  • Appro www.appro.com was awarded the contract in June, 2006.

  • Peloton clusters were built in 5.5 Tflop "scalable units" (SU) of ~144 nodes

  • All Peloton clusters used AMD dual-core Socket F Opterons:
    • 8 cpus per node
    • 2.4 GHz clock
    • Option to upgrade to 4-core Opteron "Deerhound" later (not taken)

  • The six Peloton systems represented a mix of resources: OCF, SCF, ASC, M&IC, Capability and Capacity:

    SystemNetworkNodesCPUs/CoresTflops
    Atlas OCF 1,152 9,216 44.2
    Minos SCF 864 6,912 33.2
    Rhea SCF 576 4,608 22.1
    Zeus OCF 288 2,304 11.1
    Yana OCF 83 640 3.1
    Hopi SCF 76 608 2.9

  • The last Peloton clusters were retired in June 2012.
Atlas Peloton Cluster

And Then, TLCC and TLCC2:

  • In July, 2007 the Tri-laboratory Linux Capacity Cluster (TLCC) RFP was released.

  • The TLCC procurement represents the first time that the Department of Energy/National Nuclear Security Administration (DOE/NNSA) has awarded a single purchase contract that covers all three national defense laboratories: Los Alamos, Sandia, and Livermore. Read the announcement HERE.

  • The TLCC architecture is very similar to the Peloton architecture: Opteron multi-core processors with an Infiniband interconnect. The primary difference is that TLCC clusters are quad-core instead of dual-core.

  • TLCC clusters were/are:

    SystemNetworkNodesCPUs/CoresTflops
    Juno SCF 1,152 18,432 162.2
    Hera OCF 864 13,824 127.2
    Eos SCF 288 4,608 40.6

  • In June, 2011 the TLCC2 procurement was announced, as a follow-on to the successful TLCC systems. Press releases:

  • The TLCC2 systems are LC's current "capacity" compute platform, consisting of multiple Intel Xeon E5-2670 (Sandy Bridge EP), QDR Infiniband based clusters:

    SystemNetworkNodesCPUs/CoresTflops
    Zin SCF 2,916 46,656 961.1
    Cab OCF-CZ 1,296 20,736 426.0
    Rzmerl OCF-RZ 162 2,592 53.9
    Pinot SNSI 162 2,592 53.9

And Now? What Next?

  • LC has continued to procure Linux clusters for various purposes, which follow the same basic architecture as the TLCC2 systems:
    • 64-bit, x86 architecture (Intel)
    • multi-core, multi-socket
    • QDR Infiniband interconnect

  • As of June, 2015, future TLCC-type procurements are currently in progress. They will be called Commodity Technology Systems (CTS), and are expected to begin arriving in mid-2016.
Juno, Eos TLCC Clusters

Zin TLCC2 Cluster


Cluster Configurations and Scalable Units


Basic Components:

Nodes:

Frames / Racks:

Scalable Unit:



LC Linux Cluster Systems

LC Linux Clusters Summary

Photos:



Intel Xeon Hardware Overview

Intel Xeon Processor

Xeon 5500 and 5600 Processors:


Xeon 5600 Processor Chip and Die (Image source: Intel)
Intel Nehalem Microarchitecture (Image source: wikipedia.org)

Xeon E5-2670 Processor:

Chipsets/Motherboards:



AMD Opteron Hardware Overview

AMD Opteron Processor

Basics:

Design Facts/Features Quad-core Processors: (at LC)

  • Socket F processor (multi-core) design

  • 64-bit architecture providing full 64-bit memory addressing

  • Clockspeed: 2.2 - 2.3 GHz

  • Floating point units: 4 results per clock cycle

  • Caches:

    Cache Quad-core
    Dedicated L1 Instruction 64 KB, 64-byte line, 2-way associative
    Dedicated L1 Data 64 KB, 64-byte line, 2-way associative
    Dedicated L2 512 KB
    Shared L3 2 MB

  • Direct Connect Architecture:

    • No front side bus - directly connects multiple processor cores, memory controller and I/O to the Socket F processor. Helps eliminate the bottlenecks inherent in a traditional front-side bus.

    • Integrated (on-die) memory controller. CPU-memory bandwidth is 10.7 GB/s per Socket F with DDR2-667 DIMMs.

    • HyperTransport interconnects (3) directly connect sockets to I/O subsystems, other chipsets, and other sockets (off-chip CPUs). Provide 8 GB/s bandwidth per link (4 GB/s each direction). Aggregate bandwidth of 24 GB/s per processor (shared by all cores).

  • Full support for Intel's SIMD vectorization instructions (SSE, SSE2, SSE3...)

  • Reduced power consumption:
    • Energy efficient DDR2 memory
    • AMD dynamic power management technology

  • AMD Virtualization: disparate applications can coexist on same system. Each application's environment (operating system, middleware, communications, etc.) is represented as a virtual machine.

  • Design features common to most modern processors:
    • Enhanced branch prediction capabilities
    • Speculative / out-of-order execution
    • Superscalar
    • Pipelined architecture
Opteron Diagram

Chipsets/Motherboards:

  • As with other vendor processors, AMD Opterons are commonly combined with other components to make a complete system.

  • Motherboards for Opteron systems are manufactured by a number of vendors. LC's machines use the SuperMicro H8QM8-2 motherboard for both dual-core and quad-core machines.

  • Key components:
    • Four AMD Opteron Socket F processors:
      • Quad-core: model 8354 (TLCC clusters)
    • 16 DIMM sockets supporting DDR2 667/533/400 MHz SDRAM.
      • Quad-core: 32 GB/node (TLCC clusters)
    • AMD-8132 Chipset (PCI-X bridge)
    • nVidia MCP55 Pro Chipset (I/O Hub)
    • Two PCI-X 100 MHz slots
    • Two PCI-X 133/100 MHz slots
    • One PCI-Express x16 slot
    • One PCI-Express x8 slot
    • Two 1GB ethernet ports
    • Remote server management hardware
    • Ports/controllers for SCSI, SATA, IDE, USB, keyboard, mouse, serial, parallel, etc. (most of these not used by LC machines)

  • Photo at right, block diagram below



Infiniband Interconnect Overview

Infiniband Interconnect

Primary components:

Topology:

Performance:

Software and Development Environment


This section only provides a summary of the software and development environment for LC's Linux clusters. Please see the Introduction to LC Resources tutorial for details.

TOSS Operating System:

Batch Systems:

File Systems:

Dotkit Packages:

Compilers, Tools, Graphics and Other Software:

Man Pages:



Compilers

General Information

Available Compilers and Invocation Commands:

Versions, Defaults and Paths:

Floating-point Exceptions:

Precision, Performance and IEEE 754 Compliance:

Mixing C and Fortran:

Large Static Data:

Compiler Documentation and Man Pages:



Compilers

Intel Compilers

General Information:

Compiler Invocation Commands:

Common / Useful Options:



Compilers

PGI Compilers

General Information:

Compiler Invocation Commands:

Common / Useful Options:



Compilers

GNU Compilers

General Information:

Compiler Invocation Commands:

Common / Useful Options:



Linux Clusters Overview Exercise 1

Getting Started

Overview:
  • Login to an LC cluster using your workshop username and OTP token
  • Copy the exercise files to your home directory
  • Familiarize yourself with the cluster's configuration
  • Familiarize yourself with available compilers
  • Build and run serial applications
  • Compare compiler optimizations

GO TO THE EXERCISE HERE



MPI


This section discusses general MPI usage information for LC's Linux clusters. For information on MPI programming, please consult the LC MPI tutorial.

MVAPICH

General Info:

MPI Build Scripts:

Running MPI Jobs:

Open MPI

General Information:

Running Jobs

Overview

Big Differences:

Job Limits:



Running Jobs

Batch Versus Interactive

Interactive Jobs:

Batch Jobs:

Quick Summary of Common Batch Commands:



Running Jobs

Starting Jobs - srun

The srun command:

srun options:

Clusters Without an Interconnect - Additional Notes:



Running Jobs

Terminating Jobs

Interactive:

Batch:



Running Jobs

Displaying Queue and Job Status Information

Overview:

Running Jobs

Optimizing CPU Usage

Clusters with an Interconnect:

Clusters Without an Interconnect:



Running Jobs

Memory Considerations

64-bit Architecture Memory Limit:

LC's Diskless Linux Clusters:

Compiler, Shell, Pthreads and OpenMP Limits:

MPI Memory Use:

Large Static Data:

Opterons Only: NUMA and Infiniband Adapter Considerations:

  • Each Opteron node has 4 Socket F processors - either dual-core or quad-core. Each Socket F is directly connected to its own local DIMMs with a bandwidth of 10.7 GB/s.

  • Shared memory access to the DIMMs of other sockets occurs over the HyperTransport links at 8 GB/s (4 GB/s each direction).

  • To improve performance, LC "pins" MPI tasks to a specific CPU. This promotes memory locality and reduction of shared memory accesses across multiple HyperTransport links to DIMMs on other sockets.

  • Only 1 socket is directly connected to the Infiniband adapter. This means that communications from other sockets need to traverse more than one HyperTransport link to get to the IB adapter, with the possibility of degraded bandwidth. Unofficial tests suggest up to 30%.
Opteron memory and Infiniband schematic


Running Jobs

Vectorization and Hyper-threading

Vectorization:

Hyper-threading:



Running Jobs

Process and Thread Binding

Process/Thread Binding to Cores:

Debugging


This section only touches on selected highlights. For more information users will definitely need to consult the relevant documentation mentioned below. Also, please consult the "Supported Software and Computing Tools" web page located at computing.llnl.gov/code/content/software_tools.php.

TotalView:
  • TotalView is probably the most widely used debugger for parallel programs. It can be used with C/C++ and Fortran programs and supports all common forms of parallelism, including pthreads, openMP, MPI, accelerators and GPUs.

  • Starting TotalView for serial codes: simply issue the command:

      totalview myprog

  • Starting TotalView for interactive parallel jobs:

    • Some special command line options are required to run a parallel job through TotalView under SLURM. You need to run srun under TotalView, and then specify the -a flag followed by 1)srun options, 2)your program, and 3)your program flags (in that order). The general syntax is:

      totalview srun -a -n #processes -p pdebug myprog [prog args]

    • To debug an already running interactive parallel job, simply issue the totalview command and then attach to the srun process that started the job.

    • Debugging batch jobs is covered in LC's TotalView tutorial and in the "Debugging in Batch" section below.

  • Documentation:
Small TotalView screen shot
DDT:
  • DDT stands for "Distributed Debugging Tool", a product of Allinea Software Ltd.

  • DDT is a comprehensive graphical debugger designed specifically for debugging complex parallel codes. It is supported on a variety of platforms for C/C++ and Fortran. It is able to be used to debug multi-process MPI programs, and multi-threaded programs, including OpenMP.

  • Currently, LC has a limited number of fixed and floating licenses for OCF and SCF Linux machines.

  • Usage information: see LC's DDT Quick Start information located at: https://computing.llnl.gov/?set=code&page=ddt

  • Documentation:
Small ddt screen shot

STAT - Stack Trace Analysis Tool:

Other Debuggers:

Debugging in Batch: mxterm:

Core Files:

A Few Additional Useful Debugging Hints:



Tools


We Need a Book!

Memory Correctness Tools:

Profiling, Tracing and Performance Analysis:

Beyond LC:



Linux Clusters Overview Exercise 2

Parallel Programs and More

Overview:
  • Login to an LC workshop cluster, if you are not already logged in
  • Build and run parallel MPI programs
  • Build and run parallel Pthreads programs
  • Build and run parallel OpenMP programs
  • Build and run a parallel benchmarks
  • Build and run an MPI message passing bandwidth test
  • Try hyper-threading
  • Obtain online machine status information

GO TO THE EXERCISE HERE






This completes the tutorial.

      Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.

Where would you like to go now?



References and More Information