Linux Clusters Overview

Author: Blaise Barney, Lawrence Livermore National Laboratory UCRL-WEB-150348

Table of Contents

  1. Abstract
  2. Background of Linux Clusters at LLNL
  3. Cluster Configurations and Scalable Units
  4. LC Linux Cluster Systems
  5. Intel Xeon Hardware Overview
  6. Infiniband Interconnect Overview
  7. Software and Development Environment
  8. Compilers
  9. Exercise 1
  10. MPI
  11. Running Jobs
    1. Overview
    2. Batch Versus Interactive
    3. Starting Jobs - srun
    4. Interacting With Jobs
    5. Optimizing CPU Usage
    6. Memory Considerations
    7. Vectorization and Hyper-threading
    8. Process and Thread Binding
  12. Debugging
  13. Tools
  14. Exercise 2
  15. References and More Information



Abstract


This tutorial is intended to be an introduction to using LC's Linux clusters. It begins by providing a brief historical background of Linux clusters at LC, noting their success and adoption as a production, high performance computing platform. The primary hardware components of LC's Linux clusters are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware related information.

After covering the hardware-related topics, software topics are discussed, including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are noted. Available debuggers and performance-related tools/topics are briefly discussed; however, detailed usage is beyond the scope of this tutorial. A lab exercise using one of LC's Linux clusters follows the presentation.

Level/Prerequisites: This tutorial is intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is required. The material covered by the following tutorials would also be helpful:
EC3501: Livermore Computing Resources and Environment
EC4045: Moab and SLURM



Background of Linux Clusters at LLNL


The Linux Project:

Alpha Linux Clusters:

PCR Clusters:

MCR Cluster...and More:

Which Led To Thunder...

  • In September 2003, the RFP for LC's first IA-64 cluster was released. The proposal from California Digital Corporation, a small local company, was accepted.

  • A 1,024-node system composed of 4-CPU Itanium 2 "Madison Tiger4" nodes.

  • Thunder debuted as #2 on the Top500 Supercomputers list in June 2004, with a peak performance of 22.9 Tflops.

  • For more information about Thunder see:

The Peloton Systems:

  • In early 2006, LC launched its Opteron/Infiniband Linux cluster procurement with the release of the Peloton RFP.

  • Appro (www.appro.com) was awarded the contract in June 2006.

  • Peloton clusters were built in 5.5 Tflop "scalable units" (SU) of ~144 nodes.

  • All Peloton clusters used AMD dual-core Socket F Opterons:
    • 8 CPUs per node
    • 2.4 GHz clock
    • Option to upgrade to 4-core Opteron "Deerhound" later (not taken)

  • The six Peloton systems represented a mix of resources: OCF, SCF, ASC, M&IC, Capability and Capacity:

    System   Network   Nodes   Cores   Tflops
    Atlas    OCF       1,152   9,216     44.2
    Minos    SCF         864   6,912     33.2
    Rhea     SCF         576   4,608     22.1
    Zeus     OCF         288   2,304     11.1
    Yana     OCF          83     640      3.1
    Hopi     SCF          76     608      2.9

  • The last Peloton clusters were retired in June 2012.
[Photo: Atlas Peloton Cluster]

And Then, TLCC and TLCC2:

  • In July 2007, the Tri-laboratory Linux Capacity Cluster (TLCC) RFP was released.

  • The TLCC procurement represents the first time that the Department of Energy/National Nuclear Security Administration (DOE/NNSA) has awarded a single purchase contract that covers all three national defense laboratories: Los Alamos, Sandia, and Livermore.

  • The TLCC architecture is very similar to the Peloton architecture: Opteron multi-core processors with an Infiniband interconnect. The primary difference is that TLCC clusters are quad-core instead of dual-core.

  • TLCC clusters were/are:

    System   Network   Nodes    Cores   Tflops
    Juno     SCF       1,152   18,432    162.2
    Hera     OCF         864   13,824    127.2
    Eos      SCF         288    4,608     40.6

  • In June 2011, the TLCC2 procurement was announced as a follow-on to the successful TLCC systems. Press releases:

  • The TLCC2 systems consist of multiple Intel Xeon E5-2670 (Sandy Bridge EP), QDR Infiniband-based clusters:

    System   Network   Nodes    Cores   Tflops
    Zin      SCF       2,916   46,656    961.1
    Cab      OCF-CZ    1,296   20,736    426.0
    Rzmerl   OCF-RZ      162    2,592     53.9
    Pinot    SNSI        162    2,592     53.9

  • Additionally, LC procured other Linux clusters similar to TLCC2 systems for various purposes.

And Now? What Next?

  • As of June 2015, Tri-lab procurements as follow-ons to the TLCC2 systems are in progress. They are now called Commodity Technology Systems (CTS).

  • CTS-1 systems started to become available in late 2016 - early 2017. These systems are based on Intel Broadwell E5-2695 v4 processors, with 36 cores per node, 128 GB of node memory, and an Intel Omni-Path 100 Gb/s interconnect. They include:

    System    Network   Nodes    Cores    Tflops
    Agate     SCF          48    1,728      58.1
    Borax     OCF-CZ       48    1,728      58.1
    Jade      SCF       2,688   96,768   3,251.4
    Mica      SCF         384   13,824     530.8
    Quartz    OCF-CZ    2,688   96,768   3,251.4
    RZGenie   OCF-RZ       48    1,728      58.1
    RZTopaz   OCF-RZ      768   27,648     464.5
    RZTrona   OCF-RZ       20      720      24.2

  • CTS-2 systems are expected to start becoming available in the 2019-2020 time frame.
[Photos: Juno and Eos TLCC Clusters; Zin TLCC2 Cluster; Quartz CTS-1 Cluster]


Cluster Configurations and Scalable Units


Basic Components:

Nodes:

Frames / Racks:

Scalable Unit:



LC Linux Cluster Systems

LC Linux Clusters Summary

Photos:



Intel Xeon Hardware Overview

Intel Xeon Processor

Xeon E5-2670 Processor:

Xeon E5-2695 v4 Processor:

Additional Information:



Infiniband Interconnect Overview

Interconnects

Primary components:

Topology:

Performance:

Software and Development Environment


This section only provides a summary of the software and development environment for LC's Linux clusters. Please see the Introduction to LC Resources tutorial for details.

TOSS Operating System:

Batch Systems:

File Systems:

Dotkit:

Modules:

On LC's TOSS 2 clusters, some software applications use both Modules and Dotkit.
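
For example, loading a package with each system might look like the following (a minimal sketch; the package and module names shown are hypothetical placeholders, and the names actually available vary by cluster):

    # Dotkit: list available packages, then load one
    use -l
    use mypackage           # hypothetical Dotkit package name

    # Modules: list available modulefiles, then load one
    module avail
    module load mypackage   # hypothetical module name
    module list             # show currently loaded modules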

Compilers, Tools, Graphics and Other Software:



Compilers

General Information

Available Compilers and Invocation Commands:

Compiler Versions and Defaults:

Compiler Options:

Compiler Documentation:

Optimizations:

Floating-point Exceptions:

Precision, Performance and IEEE 754 Compliance:

Mixing C and Fortran:



Linux Clusters Overview Exercise 1

Getting Started

Overview:
  • Login to an LC cluster using your workshop username and OTP token
  • Copy the exercise files to your home directory
  • Familiarize yourself with the cluster's configuration
  • Familiarize yourself with available compilers
  • Build and run serial applications
  • Compare compiler optimizations

GO TO THE EXERCISE HERE



MPI


This section discusses general MPI usage information for LC's Linux clusters. For information on MPI programming, please consult the LC MPI tutorial.

MVAPICH

General Info:

Compiling:

Running:

Documentation:



Open MPI

General Information:

Compiling:

Running:

Documentation:



MPI Build Scripts

Note: you may need to load a dotkit/module for the desired MPI implementation, as discussed above. If you do not, you will get the default implementation.
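
For example, selecting an MPI implementation and then compiling with its build scripts might look like this (a minimal sketch; the module name is a hypothetical placeholder, and the available names depend on what is installed on the cluster):

    # load the desired MPI implementation first (name is illustrative)
    module load mvapich2-intel

    # then compile with the corresponding MPI build scripts
    mpicc  -o hello_mpi hello_mpi.c     # C
    mpif90 -o hello_mpi hello_mpi.f90   # Fortran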



Level of Thread Support



Running Jobs

Overview

Big Differences:

Job Limits:



Running Jobs

Batch Versus Interactive

Interactive Jobs (pdebug):

Batch Jobs (pbatch):

This section only provides a quick summary of batch usage on LC's clusters. For details, see the Moab and SLURM Tutorial.

Quick Summary of Common Batch Commands:
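
For example, a typical batch workflow might look like the following (a minimal sketch using common Moab commands; the job script name and job id are placeholders, and the full command set is covered in the Moab and SLURM Tutorial):

    msub myjob.cmd       # submit a batch script
    showq                # display queued and running jobs
    checkjob 123456      # show details for a specific job
    canceljob 123456     # cancel the job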



Running Jobs

Starting Jobs - srun

The srun command:
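
As an illustration, launching a parallel job with srun might look like the following (a minimal sketch; the node/task counts and partition name are placeholders, and the full option list is in the srun man page):

    # run 32 tasks across 2 nodes in the pdebug partition
    srun -N2 -n32 -p pdebug myprog

    # within a batch script, srun inherits the allocation of the batch job
    srun -n32 myprog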

srun options:

Clusters Without an Interconnect - Additional Notes:



Running Jobs

Interacting With Jobs

This section only provides a quick summary of commands used to interact with jobs. For additional information, see the Moab and SLURM Tutorial.

Displaying Job Information:

Holding / Releasing Jobs:

Modifying Jobs:

Terminating / Canceling Jobs:
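
A few commonly used SLURM commands for these operations are sketched below (the job id and field values are placeholders; Moab equivalents and full details are in the Moab and SLURM Tutorial):

    squeue -u $USER                                   # display your jobs
    scontrol hold 123456                              # hold a pending job
    scontrol release 123456                           # release a held job
    scontrol update JobId=123456 TimeLimit=02:00:00   # modify a job attribute
    scancel 123456                                    # terminate/cancel a job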



Running Jobs

Optimizing CPU Usage

Clusters with an Interconnect:

Clusters Without an Interconnect:



Running Jobs

Memory Considerations

64-bit Architecture Memory Limit:

LC's Diskless Linux Clusters:

Compiler, Shell, Pthreads and OpenMP Limits:

MPI Memory Use:

Large Static Data:



Running Jobs

Vectorization and Hyper-threading

Vectorization:

Hyper-threading:



Running Jobs

Process and Thread Binding

Process/Thread Binding to Cores:

Debugging


This section only touches on selected highlights. For more information, users will definitely need to consult the relevant documentation mentioned below. Also, please consult the "Supported Software and Computing Tools" web page located at computing.llnl.gov/code/content/software_tools.php.

TotalView:
  • TotalView is probably the most widely used debugger for parallel programs. It can be used with C/C++ and Fortran programs and supports all common forms of parallelism, including pthreads, OpenMP, MPI, accelerators and GPUs.

  • Starting TotalView for serial codes: simply issue the command:

      totalview myprog

  • Starting TotalView for interactive parallel jobs:

    • Some special command line options are required to run a parallel job through TotalView under SLURM. You need to run srun under TotalView, and then specify the -a flag followed by 1) srun options, 2) your program, and 3) your program flags (in that order). The general syntax is:

      totalview srun -a -n #processes -p pdebug myprog [prog args]
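
      For instance, a concrete invocation following this syntax might be (illustrative values: 4 processes in the pdebug partition):

        totalview srun -a -n 4 -p pdebug myprog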

    • To debug an already running interactive parallel job, simply issue the totalview command and then attach to the srun process that started the job.

    • Debugging batch jobs is covered in LC's TotalView tutorial and in the "Debugging in Batch" section below.

  • Documentation:

[Screen shot: TotalView]

DDT:
  • DDT stands for "Distributed Debugging Tool", a product of Allinea Software Ltd.

  • DDT is a comprehensive graphical debugger designed specifically for debugging complex parallel codes. It is supported on a variety of platforms for C/C++ and Fortran. It can be used to debug multi-process MPI programs and multi-threaded programs, including OpenMP.

  • Currently, LC has a limited number of fixed and floating licenses for OCF and SCF Linux machines.

  • Usage information: see LC's DDT Quick Start information located at: https://computing.llnl.gov/?set=code&page=ddt

  • Documentation:

[Screen shot: DDT]

STAT - Stack Trace Analysis Tool:

Other Debuggers:

Debugging in Batch: mxterm:

Core Files:

A Few Additional Useful Debugging Hints:



Tools


We Need a Book!

Memory Correctness Tools:

Profiling, Tracing and Performance Analysis:

Beyond LC:



Linux Clusters Overview Exercise 2

Parallel Programs and More

Overview:
  • Login to an LC workshop cluster, if you are not already logged in
  • Build and run parallel MPI programs
  • Build and run parallel Pthreads programs
  • Build and run parallel OpenMP programs
  • Build and run parallel benchmarks
  • Build and run an MPI message passing bandwidth test
  • Try hyper-threading
  • Obtain online machine status information

GO TO THE EXERCISE HERE






This completes the tutorial.

Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.

Where would you like to go now?



References and More Information