Linux Clusters Overview

Author: Blaise Barney, Lawrence Livermore National Laboratory
UCRL-WEB-150348 | LLNL-MI-726637

Table of Contents

  1. Abstract
  2. Background of Linux Clusters at LLNL
  3. Cluster Configurations and Scalable Units
  4. LC Linux Cluster Systems
  5. Intel Xeon Hardware Overview
  6. Infiniband Interconnect Overview
  7. Software and Development Environment
  8. Compilers
  9. Exercise 1
  10. MPI
  11. Running Jobs
    1. Overview
    2. Batch Versus Interactive
    3. Starting Jobs - srun
    4. Interacting With Jobs
    5. Optimizing CPU Usage
    6. Memory Considerations
    7. Vectorization and Hyper-threading
    8. Process and Thread Binding
  12. Debugging
  13. Tools
  14. Exercise 2
  15. GPU Clusters
    1. Available GPU Clusters
    2. Hardware Overview
    3. GPU Programming APIs
      1. CUDA
      2. OpenMP
      3. OpenACC
      4. OpenCL
    4. Compiling
      1. CUDA
      2. OpenMP
      3. OpenACC
    5. Misc. Tips & Tools
  16. References and More Information



Abstract


This tutorial is intended to be an introduction to using LC's Linux clusters. It begins by providing a brief historical background of Linux clusters at LC, noting their success and adoption as a production, high performance computing platform. The primary hardware components of LC's Linux clusters are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware-related information.

After covering the hardware-related topics, software topics are discussed, including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are noted. Available debuggers and performance-related tools/topics are briefly discussed; however, detailed usage is beyond the scope of this tutorial. A lab exercise using one of LC's Linux clusters follows the presentation.

Level/Prerequisites: This tutorial is intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is required. The material covered by the following tutorials would also be helpful:
EC3501: Livermore Computing Resources and Environment
EC4045: Moab and SLURM



Background of Linux Clusters at LLNL


The Linux Project:

Alpha Linux Clusters:

PCR Clusters:

MCR Cluster...and More:

Which Led To Thunder...

  • In September 2003, the RFP for LC's first IA-64 cluster was released. The proposal from California Digital Corporation, a small local company, was accepted.

  • A 1,024-node system composed of 4-CPU Itanium 2 "Madison Tiger4" nodes

  • Thunder debuted as #2 on the Top500 Supercomputers list in June 2004. Peak performance was 22.9 Tflops.

  • For more information about Thunder see:

The Peloton Systems:

  • In early 2006, LC launched its Opteron/Infiniband Linux cluster procurement with the release of the Peloton RFP.

  • Appro (www.appro.com) was awarded the contract in June 2006.

  • Peloton clusters were built in 5.5 Tflop "scalable units" (SU) of ~144 nodes

  • All Peloton clusters used AMD dual-core Socket F Opterons:
    • 8 cpus per node
    • 2.4 GHz clock
    • Option to upgrade to 4-core Opteron "Deerhound" later (not taken)

  • The six Peloton systems represented a mix of resources: OCF, SCF, ASC, M&IC, Capability and Capacity:

    System   Network   Nodes   Cores   Tflops
    Atlas    OCF       1,152   9,216   44.2
    Minos    SCF         864   6,912   33.2
    Rhea     SCF         576   4,608   22.1
    Zeus     OCF         288   2,304   11.1
    Yana     OCF          83     640    3.1
    Hopi     SCF          76     608    2.9

  • The last Peloton clusters were retired in June 2012.
Atlas Peloton Cluster

And Then, TLCC and TLCC2:

  • In July 2007, the Tri-laboratory Linux Capacity Cluster (TLCC) RFP was released.

  • The TLCC procurement represents the first time that the Department of Energy/National Nuclear Security Administration (DOE/NNSA) has awarded a single purchase contract that covers all three national defense laboratories: Los Alamos, Sandia, and Livermore.

  • The TLCC architecture is very similar to the Peloton architecture: Opteron multi-core processors with an Infiniband interconnect. The primary difference is that TLCC clusters are quad-core instead of dual-core.

  • TLCC clusters were/are:

    System   Network   Nodes   Cores    Tflops
    Juno     SCF       1,152   18,432   162.2
    Hera     OCF         864   13,824   127.2
    Eos      SCF         288    4,608    40.6

  • In June 2011, the TLCC2 procurement was announced as a follow-on to the successful TLCC systems. Press releases:

  • The TLCC2 systems consist of multiple Intel Xeon E5-2670 (Sandy Bridge EP), QDR Infiniband-based clusters:

    System   Network   Nodes   Cores    Tflops
    Zin      SCF       2,916   46,656   961.1
    Cab      OCF-CZ    1,296   20,736   426.0
    Rzmerl   OCF-RZ      162    2,592    53.9
    Pinot    SNSI        162    2,592    53.9

  • Additionally, LC procured other Linux clusters similar to TLCC2 systems for various purposes.

    And Now? What Next?

    • As of June 2015, Tri-lab procurements as follow-ons to the TLCC2 systems are in progress. They are now called Commodity Technology Systems (CTS).

    • CTS-1 systems started to become available in late 2016 - early 2017. These systems are based on Intel Broadwell E5-2695 v4 processors, 36 cores per node, 128 GB node memory, with Intel Omni-Path 100 Gb/s interconnect. They include:

      System    Network   Nodes   Cores    Tflops
      Agate     SCF          48    1,728      58.1
      Borax     OCF-CZ       48    1,728      58.1
      Jade      SCF       2,688   96,768   3,251.4
      Mica      SCF         384   13,824     530.8
      Quartz    OCF-CZ    2,688   96,768   3,251.4
      RZGenie   OCF-RZ       48    1,728      58.1
      RZTopaz   OCF-RZ      768   27,648     464.5
      RZTrona   OCF-RZ       20      720      24.2

    • CTS-2 systems are expected to start becoming available in the 2019-2020 time frame.
    Juno and Eos TLCC Clusters

    Zin TLCC2 Cluster

    Quartz CTS-1 Cluster


    Cluster Configurations and Scalable Units


    Basic Components:

    Nodes:

    Frames / Racks:

    Scalable Unit:



    LC Linux Cluster Systems

    LC Linux Clusters Summary

    Photos:



    Intel Xeon Hardware Overview

    Intel Xeon Processor

    Xeon E5-2670 Processor:

    Xeon E5-2695 v4 Processor:

    Additional Information:



    Infiniband Interconnect Overview

    Interconnects

    Primary components:

    Topology:

    Performance:

    Software and Development Environment


    This section only provides a summary of the software and development environment for LC's Linux clusters. Please see the Introduction to LC Resources tutorial for details.

    TOSS Operating System:

    Batch Systems:

    File Systems:

    Dotkit:

    Modules:

    On LC's TOSS 2 clusters, some software applications use both Modules and Dotkit.

    Compilers, Tools, Graphics and Other Software:



    Compilers

    General Information

    Available Compilers and Invocation Commands:

    Compiler Versions and Defaults:

    Compiler Options:

    Compiler Documentation:

    Optimizations:

    Floating-point Exceptions:

    Precision, Performance and IEEE 754 Compliance:

    Mixing C and Fortran:



    Linux Clusters Overview Exercise 1

    Getting Started

    Overview:
    • Log in to an LC cluster using your workshop username and OTP token
    • Copy the exercise files to your home directory
    • Familiarize yourself with the cluster's configuration
    • Familiarize yourself with available compilers
    • Build and run serial applications
    • Compare compiler optimizations

    GO TO THE EXERCISE HERE



    MPI


    This section discusses general MPI usage information for LC's Linux clusters. For information on MPI programming, please consult the LC MPI tutorial.

    MVAPICH

    General Info:

    Compiling:

    Running:

    Documentation:



    Open MPI

    General Information:

    Compiling:

    Running:

    Documentation:



    MPI Build Scripts

    Note: you may need to load a dotkit/module for the desired MPI implementation, as discussed above. Failing to do so will result in the default implementation being used.



    Level of Thread Support



    Running Jobs

    Overview

    Big Differences:

    Job Limits:



    Running Jobs

    Batch Versus Interactive

    Interactive Jobs (pdebug):

    Batch Jobs (pbatch):

     This section only provides a quick summary of batch usage on LC's clusters. For details, see the Moab and SLURM Tutorial.

    Quick Summary of Common Batch Commands:



    Running Jobs

    Starting Jobs - srun

    The srun command:

    srun options:

    Clusters Without an Interconnect - Additional Notes:



    Running Jobs

    Interacting With Jobs

     This section only provides a quick summary of commands used to interact with jobs. For additional information, see the Moab and SLURM Tutorial.

    Displaying Job Information:

    Holding / Releasing Jobs:

    Modifying Jobs:

    Terminating / Canceling Jobs:



    Running Jobs

    Optimizing CPU Usage

    Clusters with an Interconnect:

    Clusters Without an Interconnect:



    Running Jobs

    Memory Considerations

    64-bit Architecture Memory Limit:

    LC's Diskless Linux Clusters:

    Compiler, Shell, Pthreads and OpenMP Limits:

    MPI Memory Use:

    Large Static Data:



    Running Jobs

    Vectorization and Hyper-threading

    Vectorization:

    Hyper-threading:



    Running Jobs

    Process and Thread Binding

    Process/Thread Binding to Cores:

    Debugging


    This section only touches on selected highlights. For more information, users will need to consult the relevant documentation mentioned below. Also, please consult the "Supported Software and Computing Tools" web page located at computing.llnl.gov/code/content/software_tools.php.

    TotalView:
    • TotalView is probably the most widely used debugger for parallel programs. It can be used with C/C++ and Fortran programs and supports all common forms of parallelism, including Pthreads, OpenMP, MPI, accelerators and GPUs.

    • Starting TotalView for serial codes: simply issue the command:

        totalview myprog

    • Starting TotalView for interactive parallel jobs:

      • Some special command line options are required to run a parallel job through TotalView under SLURM. You need to run srun under TotalView, and then specify the -a flag followed by 1) srun options, 2) your program, and 3) your program flags (in that order). The general syntax is:

        totalview srun -a -n #processes -p pdebug myprog [prog args]

      • To debug an already running interactive parallel job, simply issue the totalview command and then attach to the srun process that started the job.

      • Debugging batch jobs is covered in LC's TotalView tutorial and in the "Debugging in Batch" section below.

    • Documentation:
    DDT:
    • DDT stands for "Distributed Debugging Tool", a product of Allinea Software Ltd.

    • DDT is a comprehensive graphical debugger designed specifically for debugging complex parallel codes. It is supported on a variety of platforms for C/C++ and Fortran. It can be used to debug multi-process MPI programs and multi-threaded programs, including OpenMP.

    • Currently, LC has a limited number of fixed and floating licenses for OCF and SCF Linux machines.

    • Usage information: see LC's DDT Quick Start information located at: https://computing.llnl.gov/?set=code&page=ddt

    • Documentation:

    STAT - Stack Trace Analysis Tool:

    Other Debuggers:

    Debugging in Batch: mxterm:

    Core Files:

    A Few Additional Useful Debugging Hints:



    Tools


    We Need a Book!

    Memory Correctness Tools:

    Profiling, Tracing and Performance Analysis:

    Beyond LC:



    Linux Clusters Overview Exercise 2

    Parallel Programs and More

    Overview:
    • Log in to an LC workshop cluster, if you are not already logged in
    • Build and run parallel MPI programs
    • Build and run parallel Pthreads programs
    • Build and run parallel OpenMP programs
    • Build and run parallel benchmarks
    • Build and run an MPI message passing bandwidth test
    • Try hyper-threading
    • Obtain online machine status information

    GO TO THE EXERCISE HERE




    GPU Clusters

    Available GPU Clusters

    Current Systems: (Mar 2017)

    Future Systems: (2017 and beyond)

    • As part of the multi-lab CORAL procurement, LLNL will be siting the Sierra system, expected to be available in late 2017.

    • A few Sierra stats:
      • Workload performance 5-7 times that of Sequoia
      • 120-150 petaflops peak
      • IBM POWER 9 nodes with NVIDIA Volta GPUs
      • 800 GB SSD NVRAM per node
      • 2.1-2.7 PB memory
      • Mellanox IB interconnect
      • GPFS parallel file system of 120 PB with 1.0 TB/s aggregate bandwidth

    • CORAL Early Access (EA) (IBM POWER 8 with NVIDIA Pascal GPUs) has been delivered, consisting of the Ray, RZManta and Shark clusters. These clusters are covered at: https://lc.llnl.gov/confluence/display/CORALEA/CORAL+EA+Systems

    • For more information on CORAL/Sierra see: https://asc.llnl.gov/coral-info


    GPU Clusters

    Hardware Overview

    Linux Clusters:
    • From a hardware (and software) perspective, LC's GPU clusters are essentially the same as LC's other Linux clusters:
      • Intel Xeon processor based
      • InfiniBand interconnect
      • Diskless, rack mounted "thin" nodes
      • TOSS operating system/software stack
      • Login nodes, compute nodes, gateway (I/O) and management nodes
      • Compute nodes organized into partitions (pbatch, pdebug, pviz, etc.)
      • Same compilers, debuggers and tools

    • The only significant difference is that compute nodes (or a subset of compute nodes in the case of Max) include GPU hardware.

    Surface GPU Cluster

    GPU Boards:

    GPU Chip - Primary Components (Tesla K20, K40, K80):

    SMX Details (Tesla K20, K40, K80):

    • The SMX unit is where the GPU's computational operations are performed.

    • The number of SMX units per GPU chip varies, based on the type of GPU:
      • Max: 14
      • Surface: 15
      • Rzhasgpu: 13

    • Each SMX unit has 192 single-precision CUDA cores

    • Each CUDA core has one floating-point and one integer arithmetic unit; fully pipelined

    • Each SMX has 64 double-precision (DP) units for double-precision math

    • Each SMX has 32 special function units (SFUs) for fast approximate transcendental operations

    • Memory:
      • GK110: 64 KB; divided between shared memory and L1 cache; configurable
      • GK210: 128 KB; divided between shared memory and L1 cache; configurable
      • 48 KB Read-Only Data Cache; both GK110 and GK210

    • Register File:
      • GK110: 65,536 x 32-bit
      • GK210: 131,072 x 32-bit

    • Other hardware:
      • Instruction Cache
      • Warp Scheduler
      • Dispatch Units
      • Texture Filtering Units (16)
      • Load/Store Units (32)
      • Interconnect Network
    NVIDIA GK210 GPU SMX Block Diagram

    Additional Reading:



    GPU Clusters

    GPU Programming APIs
    CUDA

    Overview:

    Basic Approach:

    1. Identify which sections of code will be executed by the Host (CPU) and which sections by the Device (GPU).
      • Device code is typically computationally intensive and able to be executed by many threads simultaneously in parallel.
      • Host code is usually everything else

    2. Create kernels for the Device code:
      • A kernel is a routine executed on the GPU as an array of threads in parallel
      • Kernels are called from the Host
      • Kernel syntax is similar to standard C/C++, but includes some CUDA extensions.
      • All threads execute the same code
      • Each thread has a unique ID
      • Example comparing a standard C routine with the equivalent CUDA C kernel:

        Standard C routine:

        void vecAdd(double *a, double *b, double *c, int n)
        {
          for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
        }

        CUDA C kernel:

        __global__ void vecAdd(double *a, double *b, double *c, int n)
        {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) c[i] = a[i] + b[i];
        }

      • Notes:
        __global__ indicates a GPU kernel routine; must return void
        blockIdx.x, blockDim.x, threadIdx.x are read-only built-in variables used to compute each thread's unique ID as an index into the vectors
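        For example, with a block size (blockDim.x) of 256, the thread with threadIdx.x = 3 in block blockIdx.x = 2 computes i = 2*256 + 3 = 515, and therefore operates on element 515 of the vectors.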

    3. Allocate space for data on the Host
      • Use malloc as usual in your Host code
      • Initialize data (typically)
      • Helpful convention: prepend Host variables with h_ to distinguish them from Device variables. Example:
        h_a = (double*)malloc(bytes);
        h_b = (double*)malloc(bytes);
        h_c = (double*)malloc(bytes);

    4. Allocate space for data on the Device
      • Done in Host code, but actually allocates Device memory
      • Use CUDA routine such as cudaMalloc
      • Helpful convention: prepend Device variables with d_ to distinguish them from Host variables. Example:
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);

    5. Copy data from the Host to the Device
      • Done in Host code
      • Use CUDA routine such as cudaMemcpy
      • Example:
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    6. Set the number of threads to use; threads per block (blocksize) and blocks per grid (gridsize):
      • Can have significant impact on performance; need to experiment
      • Optimal settings vary by GPU, application, memory, etc.
      • One tool (a spreadsheet) that can assist is the "CUDA Occupancy Calculator". Available from NVIDIA - just google it.
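
      • A minimal sketch of computing the launch configuration for the vecAdd kernel above (the block size of 256 is only an illustrative value, not a recommendation):

        int blockSize = 256;                              // threads per block (illustrative choice)
        int gridSize = (n + blockSize - 1) / blockSize;   // round up so gridSize * blockSize >= n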

    7. Launch kernel from the Host to run on the Device
      • Called from Host code but actually executes on Device
      • Uses a special syntax clearly showing that a kernel is being called
      • Need to specify the gridsize and blocksize from previous step
      • Arguments depend upon your kernel
      • Example:
        vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    8. Copy results back from the Device to the Host
      • Done in Host code
      • Use CUDA routine such as cudaMemcpy
      • Example:
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    9. Deallocate Host and/or Device space if no longer needed:
      • Done in Host code
      • Example:
         // Release host memory
        free(h_a);
        free(h_b);
        free(h_c);
        // Release device memory
        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        

    10. Continue processing on Host, making additional data allocations, copies and kernel calls as needed.
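
    Putting the steps above together, a minimal end-to-end sketch of the vecAdd example is shown below. Error checking is omitted, and the block size of 256 is only an illustrative value:

        #include <stdio.h>
        #include <stdlib.h>

        // CUDA C kernel (Device code); see step 2
        __global__ void vecAdd(double *a, double *b, double *c, int n)
        {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) c[i] = a[i] + b[i];
        }

        int main(void)
        {
          int n = 100000;
          size_t bytes = n * sizeof(double);

          // Step 3: allocate and initialize Host data
          double *h_a = (double*)malloc(bytes);
          double *h_b = (double*)malloc(bytes);
          double *h_c = (double*)malloc(bytes);
          for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2.0 * i; }

          // Step 4: allocate Device data
          double *d_a, *d_b, *d_c;
          cudaMalloc(&d_a, bytes);
          cudaMalloc(&d_b, bytes);
          cudaMalloc(&d_c, bytes);

          // Step 5: copy data from the Host to the Device
          cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
          cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

          // Step 6: choose threads per block and blocks per grid (illustrative values)
          int blockSize = 256;
          int gridSize = (n + blockSize - 1) / blockSize;

          // Step 7: launch the kernel on the Device
          vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

          // Step 8: copy results back from the Device to the Host
          cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
          printf("c[1] = %f\n", h_c[1]);

          // Step 9: release Device and Host memory
          cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
          free(h_a); free(h_b); free(h_c);

          return 0;
        }

    This sketch would typically be placed in a .cu file and built with nvcc (see the Compiling section below).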

    Documentation and Further Reading:

     NVIDIA CUDA Zone: https://developer.nvidia.com/cuda-zone
     Recommended - one stop shop for everything CUDA.



    GPU Clusters

    GPU Programming APIs
    OpenMP

    Overview:

    Basic Approach:

    1. Identify which sections of code will be executed by the Host (CPU) and which sections by the Device (GPU).
      • Device code is typically computationally intensive and able to be executed by many threads simultaneously in parallel.
      • Host code is usually everything else

    2. Determine which data must be exchanged between the Host and Device

    3. Apply OpenMP directives to parallel regions and/or loops. In particular, use the OpenMP 4.5 "Device" directives.

    4. Apply OpenMP Device directives and/or directive clauses that define data transfer between Host and Device, as shown in the sketch below.
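
    A minimal sketch of the vecAdd example using OpenMP 4.5 device directives (this assumes a compiler with OpenMP offload support; the map clauses define the Host-Device data transfer):

        void vecAdd(double *a, double *b, double *c, int n)
        {
          // Offload the loop to the Device; map(to:) copies a and b to the Device,
          // map(from:) brings c back to the Host
          #pragma omp target teams distribute parallel for map(to: a[0:n], b[0:n]) map(from: c[0:n])
          for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
        }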

    Documentation and Further Reading:

     OpenMP.org official website: http://openmp.org
     Recommended. Standard specifications, news, events, links to education resources, forums.



    GPU Clusters

    GPU Programming APIs
    OpenACC

    Overview:

    Basic Approach:

    1. Identify which sections of code will be executed by the Host (CPU) and which sections by the Device (GPU).
      • Device code is typically computationally intensive and able to be executed by many threads simultaneously in parallel.
      • Host code is usually everything else

    2. Determine which data must be exchanged between the Host and Device

    3. Apply OpenACC directives to parallel regions and/or loops.

    4. Apply OpenACC directives and/or directive clauses that define data transfer between Host and Device, as shown in the sketch below.
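
    A minimal sketch of the same vecAdd example using OpenACC directives (this assumes an OpenACC-capable compiler, such as PGI; the copyin/copyout clauses define the Host-Device data transfer):

        void vecAdd(double *a, double *b, double *c, int n)
        {
          // Offload the loop to the Device; copyin moves a and b to the Device,
          // copyout brings c back to the Host
          #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
          for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
        }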

    Documentation and Further Reading:

     OpenACC.org official website: http://www.openacc-standard.org/
     Recommended. Standard specifications, news, events, links to education, software and application resources.



    GPU Clusters

    GPU Programming APIs
    OpenCL

    Overview:
     OpenCL is not discussed further in this document. Some LC specific usage information can be found at: https://computing.llnl.gov/vis/open_cl.shtml

    Documentation and Further Reading:

     Khronos Group website: https://www.khronos.org/opencl
     Recommended. Standard specifications, software, resources, documentation, forums


    GPU Clusters

    Compiling
    CUDA

    The following instructions apply to LC's Linux GPU clusters surface, max and rzhasgpu.

    C/C++:

    Fortran:

    Getting Help:

    C/C++

    Fortran:



    GPU Clusters

    Compiling
    OpenMP

    The following instructions apply to LC's Linux GPU clusters surface, max and rzhasgpu.

    Current Situation at LC:

    Using GPU Enabled Clang:



    GPU Clusters

    Compiling
    OpenACC

    The following instructions apply to LC's Linux GPU clusters surface, max and rzhasgpu.

    Load Required Modules:

    Compile Using a PGI Compiler Command:

    Getting Help:



    GPU Clusters

    Misc. Tips & Tools

    The following information applies to LC's Linux GPU clusters surface, max and rzhasgpu.

    Running Jobs:

    GPU Hardware Info:

    Debuggers, Tools:






    This completes the tutorial.

          Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.




    References and More Information