Linux Clusters Overview

Author: Blaise Barney, Lawrence Livermore National Laboratory
UCRL-WEB-150348  LLNL-MI-726637

Table of Contents

  1. Abstract
  2. Background of Linux Clusters at LLNL
  3. Cluster Configurations and Scalable Units
  4. LC Linux Cluster Systems
  5. Intel Xeon Hardware Overview
  6. Infiniband Interconnect Overview
  7. Software and Development Environment
  8. Compilers
  9. Exercise 1
  10. MPI
  11. Running Jobs
    1. Overview
    2. Batch Versus Interactive
    3. Starting Jobs - srun
    4. Interacting With Jobs
    5. Optimizing CPU Usage
    6. Memory Considerations
    7. Vectorization and Hyper-threading
    8. Process and Thread Binding
  12. Debugging
  13. Tools
  14. Exercise 2
  15. GPU Clusters
    1. Available GPU Clusters
    2. Hardware Overview
    3. GPU Programming APIs
      1. CUDA
      2. OpenMP
      3. OpenACC
      4. OpenCL
    4. Compiling
      1. CUDA
      2. OpenMP
      3. OpenACC
    5. Misc. Tips & Tools
  16. References and More Information



Abstract


This tutorial is intended to be an introduction to using LC's Linux clusters. It begins by providing a brief historical background of Linux clusters at LC, noting their success and adoption as a production, high performance computing platform. The primary hardware components of LC's Linux clusters are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware related information.

After covering the hardware related topics, software topics are discussed, including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are noted. Available debuggers and performance-related tools/topics are briefly discussed; however, detailed usage is beyond the scope of this tutorial. A lab exercise using one of LC's Linux clusters follows the presentation.

Level/Prerequisites: This tutorial is intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is required. The material covered by the following tutorials would also be helpful:
EC3501: Livermore Computing Resources and Environment
EC4045: Slurm and Moab



Background of Linux Clusters at LLNL


The Linux Project:

Alpha Linux Clusters:

PCR Clusters:

MCR Cluster...and More:

Which Led To Thunder...

  • In September, 2003 the RFP for LC's first IA-64 cluster was released. The proposal from California Digital Corporation, a small local company, was accepted.

  • A 1,024-node system composed of 4-CPU Itanium 2 "Madison Tiger4" nodes

  • Thunder debuted as #2 on the Top500 Supercomputers list in June, 2004. Peak performance was 22.9 Tflops.

  • For more information about Thunder see:

The Peloton Systems:

  • In early 2006, LC launched its Opteron/Infiniband Linux cluster procurement with the release of the Peloton RFP.

  • Appro (www.appro.com) was awarded the contract in June, 2006.

  • Peloton clusters were built in 5.5 Tflop "scalable units" (SU) of ~144 nodes

  • All Peloton clusters used AMD dual-core Socket F Opterons:
    • 8 CPUs per node
    • 2.4 GHz clock
    • Option to upgrade to 4-core Opteron "Deerhound" later (not taken)

  • The six Peloton systems represented a mix of resources: OCF, SCF, ASC, M&IC, Capability and Capacity:

    System   Network   Nodes   Cores   Tflops
    Atlas    OCF       1,152   9,216   44.2
    Minos    SCF         864   6,912   33.2
    Rhea     SCF         576   4,608   22.1
    Zeus     OCF         288   2,304   11.1
    Yana     OCF          83     640    3.1
    Hopi     SCF          76     608    2.9

  • The last Peloton clusters were retired in June 2012.
Photo: Atlas Peloton Cluster

And Then, TLCC and TLCC2:

  • In July, 2007 the Tri-laboratory Linux Capacity Cluster (TLCC) RFP was released.

  • The TLCC procurement represents the first time that the Department of Energy/National Nuclear Security Administration (DOE/NNSA) has awarded a single purchase contract that covers all three national defense laboratories: Los Alamos, Sandia, and Livermore.

  • The TLCC architecture is very similar to the Peloton architecture: Opteron multi-core processors with an Infiniband interconnect. The primary difference is that TLCC clusters are quad-core instead of dual-core.

  • TLCC clusters were/are:

    System   Network   Nodes   Cores    Tflops
    Juno     SCF       1,152   18,432   162.2
    Hera     OCF         864   13,824   127.2
    Eos      SCF         288    4,608    40.6

  • In June, 2011 the TLCC2 procurement was announced, as a follow-on to the successful TLCC systems. Press releases:

  • The TLCC2 systems consist of multiple Intel Xeon E5-2670 (Sandy Bridge EP), QDR Infiniband based clusters:

    System   Network   Nodes   Cores    Tflops
    Zin      SCF       2,916   46,656   961.1
    Cab      OCF-CZ    1,296   20,736   426.0
    Rzmerl   OCF-RZ      162    2,592    53.9
    Pinot    SNSI        162    2,592    53.9

  • Additionally, LC procured other Linux clusters similar to TLCC2 systems for various purposes.

Commodity Technology Systems (CTS):

  • CTS systems are the follow-on to TLCC2 systems.

  • CTS-1 systems became available in late 2016 - early 2017. These systems are based on Intel Broadwell E5-2695 v4 processors, 36 cores per node, 128 GB node memory, with Intel Omni-Path 100 Gb/s interconnect. They include:

    System    Network   Nodes   Cores    Tflops
    Agate     SCF          48    1,728      58.1
    Borax     OCF-CZ       48    1,728      58.1
    Jade      SCF       2,688   96,768   3,251.4
    Mica      SCF         384   13,824     530.8
    Quartz    OCF-CZ    2,688   96,768   3,251.4
    RZGenie   OCF-RZ       48    1,728      58.1
    RZTopaz   OCF-RZ      768   27,648     464.5
    RZTrona   OCF-RZ       20      720      24.2

  • CTS-2 systems are expected to start becoming available in the 2019-2020 time frame.
Photo: Juno and Eos TLCC clusters

Photo: Zin TLCC2 cluster

Photo: Quartz CTS-1 cluster


Cluster Configurations and Scalable Units


Basic Components:

Nodes:

Frames / Racks:

Scalable Unit:



LC Linux Cluster Systems

LC Linux Clusters Summary

Photos:



Intel Xeon Hardware Overview

Intel Xeon Processor

Xeon E5-2670 Processor:

Xeon E5-2695 v4 Processor:

Additional Information:



Infiniband Interconnect Overview

Interconnects

Primary components:

Topology:

Performance:

Software and Development Environment


This section only provides a summary of the software and development environment for LC's Linux clusters. Please see the Introduction to LC Resources tutorial for details.

TOSS Operating System:

Batch Systems:

File Systems:

Modules:

Dotkit:

Compilers, Tools, Graphics and Other Software:



Compilers

General Information

Available Compilers and Invocation Commands:

Compiler Versions and Defaults:

Compiler Options:

Compiler Documentation:

Optimizations:

Floating-point Exceptions:

Precision, Performance and IEEE 754 Compliance:

Mixing C and Fortran:



Linux Clusters Overview Exercise 1

Getting Started

Overview:
  • Login to an LC cluster using your workshop username and OTP token
  • Copy the exercise files to your home directory
  • Familiarize yourself with the cluster's configuration
  • Familiarize yourself with available compilers
  • Build and run serial applications
  • Compare compiler optimizations

GO TO THE EXERCISE HERE



MPI


This section discusses general MPI usage information for LC's Linux clusters. For information on MPI programming, please consult the LC MPI tutorial.

MVAPICH

General Info:

Compiling:

Running:

Documentation:



Open MPI

General Information:

Compiling:

Running:

Documentation:



Intel MPI:



MPI Build Scripts

Note: you may need to load a dotkit/module for the desired MPI implementation, as discussed above. If you do not, you will get the default implementation.
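
For example, a typical sequence might look like the following. The module name shown is illustrative only; the wrapper script names depend on which MPI implementation and compiler are loaded:

      module load mvapich2            # load the desired MPI implementation (name varies by version)
      mpicc  -o myprog myprog.c       # build a C MPI program with the wrapper script
      mpif90 -o myprog myprog.f90     # build a Fortran MPI program with the wrapper script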



Level of Thread Support



Running Jobs

Overview

Big Differences:

Job Limits:



Running Jobs

Batch Versus Interactive

Interactive Jobs (pdebug):

Batch Jobs (pbatch):

 This section only provides a quick summary of batch usage on LC's clusters. For details, see the Slurm and Moab Tutorial.



Running Jobs

Starting Jobs - srun

The srun command:

srun options:

Clusters Without an Interconnect - Additional Notes:



Running Jobs

Interacting With Jobs

 This section only provides a quick summary of commands used to interact with jobs. For additional information, see the Slurm and Moab tutorial.

Monitoring Jobs and Displaying Job Information:

Holding / Releasing Jobs:

Modifying Jobs:

Terminating / Canceling Jobs:



Running Jobs

Optimizing CPU Usage

Clusters with an Interconnect:

Clusters Without an Interconnect:



Running Jobs

Memory Considerations

64-bit Architecture Memory Limit:

LC's Diskless Linux Clusters:

Compiler, Shell, Pthreads and OpenMP Limits:

MPI Memory Use:

Large Static Data:



Running Jobs

Vectorization and Hyper-threading

Vectorization:

Hyper-threading:



Running Jobs

Process and Thread Binding

Process/Thread Binding to Cores:

Debugging


This section only touches on selected highlights. For more information, users will need to consult the relevant documentation mentioned below. Also, please consult the "Supported Software and Computing Tools" web page located at computing.llnl.gov/code/content/software_tools.php.

TotalView:
  • TotalView is probably the most widely used debugger for parallel programs. It can be used with C/C++ and Fortran programs and supports all common forms of parallelism, including pthreads, OpenMP, MPI, accelerators and GPUs.

  • Starting TotalView for serial codes: simply issue the command:

      totalview myprog

  • Starting TotalView for interactive parallel jobs:

    • Some special command line options are required to run a parallel job through TotalView under SLURM. You need to run srun under TotalView, and then specify the -a flag followed by 1) srun options, 2) your program, and 3) your program flags (in that order). The general syntax is:

      totalview srun -a -n #processes -p pdebug myprog [prog args]

    • To debug an already running interactive parallel job, simply issue the totalview command and then attach to the srun process that started the job.

    • Debugging batch jobs is covered in LC's TotalView tutorial and in the "Debugging in Batch" section below.

  • Documentation:
Screenshot: TotalView
DDT:
  • DDT stands for "Distributed Debugging Tool", a product of Allinea Software Ltd.

  • DDT is a comprehensive graphical debugger designed specifically for debugging complex parallel codes. It is supported on a variety of platforms for C/C++ and Fortran. It can be used to debug multi-process MPI programs and multi-threaded programs, including OpenMP.

  • Currently, LC has a limited number of fixed and floating licenses for OCF and SCF Linux machines.

  • Usage information: see LC's DDT Quick Start information located at: https://computing.llnl.gov/?set=code&page=ddt

  • Documentation:
Screenshot: DDT

STAT - Stack Trace Analysis Tool:

Other Debuggers:

Debugging in Batch: mxterm:

Core Files:

A Few Additional Useful Debugging Hints:



Tools


We Need a Book!

Memory Correctness Tools:

Profiling, Tracing and Performance Analysis:

Beyond LC:



Linux Clusters Overview Exercise 2

Parallel Programs and More

Overview:
  • Login to an LC workshop cluster, if you are not already logged in
  • Build and run parallel MPI programs
  • Build and run parallel Pthreads programs
  • Build and run parallel OpenMP programs
  • Build and run a parallel benchmark
  • Build and run an MPI message passing bandwidth test
  • Try hyper-threading
  • Obtain online machine status information

GO TO THE EXERCISE HERE




GPU Clusters

Available GPU Clusters

Intel Systems: (May 2018)

IBM Sierra Systems: (May 2018 and beyond)



GPU Clusters

Hardware Overview

Linux Clusters:
  • From a hardware (and software) perspective, LC's GPU clusters are essentially the same as LC's other Linux clusters:
    • Intel Xeon processor based
    • InfiniBand interconnect
    • Diskless, rack mounted "thin" nodes
    • TOSS operating system/software stack
    • Login nodes, compute nodes, gateway (I/O) and management nodes
    • Compute nodes organized into partitions (pbatch, pdebug, pviz, etc.)
    • Same compilers, debuggers and tools

  • The only significant difference is that compute nodes (or a subset of compute nodes in the case of Max) include GPU hardware.

  • Photos: Surface GPU cluster; Pascal GPU cluster

GPU Boards:

GPU Chip - Primary Components (Tesla K20, K40, K80):

SMX Details (Tesla K20, K40, K80):

    Note: For details on NVIDIA's Pascal GPU (used in LC's Pascal cluster), see the NVIDIA Tesla P100 Whitepaper.

  • The SMX unit is where the GPU's computational operations are performed.

  • The number of SMX units per GPU chip varies, based on the type of GPU:
    • Max: 14
    • Surface: 15
    • Rzhasgpu: 13

  • Each SMX unit has 192 single-precision CUDA cores

  • Each CUDA core has one floating-point and one integer arithmetic unit, both fully pipelined

  • 64 double-precision (DP) units per SMX are used for double-precision math

  • 32 special function units (SFU) per SMX are used for fast approximate transcendental operations

  • Memory:
    • GK110: 64 KB; divided between shared memory and L1 cache; configurable
    • GK210: 128 KB; divided between shared memory and L1 cache; configurable
    • 48 KB Read-Only Data Cache; both GK110 and GK210

  • Register File:
    • GK110: 65,536 x 32-bit
    • GK210: 131,072 x 32-bit

  • Other hardware:
    • Instruction Cache
    • Warp Scheduler
    • Dispatch Units
    • Texture Filtering Units (16)
    • Load/Store Units (32)
    • Interconnect Network
Figure: NVIDIA GK210 GPU SMX block diagram

Additional Reading:



GPU Clusters

GPU Programming APIs
CUDA

Overview:

Basic Approach:

  1. Identify which sections of code will be executed by the Host (CPU) and which sections by the Device (GPU).
    • Device code is typically computationally intensive and can be executed by many threads in parallel.
    • Host code is usually everything else.

  2. Create kernels for the Device code:
    • A kernel is a routine executed on the GPU as an array of threads in parallel
    • Kernels are called from the Host
    • Kernel syntax is similar to standard C/C++, but includes some CUDA extensions.
    • All threads execute the same code
    • Each thread has a unique ID
    • Example comparing a standard C routine with the equivalent CUDA C kernel (CUDA extensions appear only in the kernel version):

      Standard C Routine:

      void vecAdd(double *a, double *b, double *c, int n)
      {
        for (int i = 0; i < n; ++i)
          c[i] = a[i] + b[i];
      }

      CUDA C Kernel:

      __global__ void vecAdd(double *a, double *b, double *c, int n)
      {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
      }

    • Notes:
      • __global__ indicates a GPU kernel routine; it must return void
      • blockIdx.x, blockDim.x and threadIdx.x are read-only built-in variables used to compute each thread's unique ID as an index into the vectors

  3. Allocate space for data on the Host
    • Use malloc as usual in your Host code
    • Initialize data (typically)
    • Helpful convention: prepend Host variables with h_ to distinguish them from Device variables. Example:
      h_a = (double*)malloc(bytes);
      h_b = (double*)malloc(bytes);
      h_c = (double*)malloc(bytes);

  4. Allocate space for data on the Device
    • Done in Host code, but actually allocates Device memory
    • Use CUDA routine such as cudaMalloc
    • Helpful convention: prepend Device variables with d_ to distinguish them from Host variables. Example:
      cudaMalloc(&d_a, bytes);
      cudaMalloc(&d_b, bytes);
      cudaMalloc(&d_c, bytes);

  5. Copy data from the Host to the Device
    • Done in Host code
    • Use CUDA routine such as cudaMemcpy
    • Example:
      cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

  6. Set the number of threads to use; threads per block (blocksize) and blocks per grid (gridsize):
    • Can have significant impact on performance; need to experiment
    • Optimal settings vary by GPU, application, memory, etc.
    • One tool (a spreadsheet) that can assist is the "CUDA Occupancy Calculator", available from NVIDIA.

  7. Launch kernel from the Host to run on the Device
    • Called from Host code but actually executes on Device
    • Uses a special syntax clearly showing that a kernel is being called
    • Need to specify the gridsize and blocksize from previous step
    • Arguments depend upon your kernel
    • Example:
      vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

  8. Copy results back from the Device to the Host
    • Done in Host code
    • Use CUDA routine such as cudaMemcpy
    • Example:
      cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

  9. Deallocate Host and/or Device space if no longer needed:
    • Done in Host code
    • Example:
       // Release host memory
      free(h_a);
      free(h_b);
      free(h_c);
      // Release device memory
      cudaFree(d_a);
      cudaFree(d_b);
      cudaFree(d_c);
      

  10. Continue processing on Host, making additional data allocations, copies and kernel calls as needed. A complete minimal example assembling these steps appears below.
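
The steps above can be assembled into a single minimal program. The sketch below is illustrative only: the array size, block size and printed element are arbitrary choices, and error checking of the CUDA calls is omitted for brevity.

      #include <stdio.h>
      #include <stdlib.h>
      #include <cuda_runtime.h>

      // Step 2: the kernel, executed on the Device by many threads in parallel
      __global__ void vecAdd(double *a, double *b, double *c, int n)
      {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique thread ID
        if (i < n) c[i] = a[i] + b[i];
      }

      int main(void)
      {
        int n = 100000;
        size_t bytes = n * sizeof(double);

        // Step 3: allocate and initialize Host arrays
        double *h_a = (double*)malloc(bytes);
        double *h_b = (double*)malloc(bytes);
        double *h_c = (double*)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2.0 * i; }

        // Step 4: allocate Device arrays
        double *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);

        // Step 5: copy input data from the Host to the Device
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Step 6: choose a blocksize and compute a gridsize that covers n elements
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;

        // Step 7: launch the kernel on the Device
        vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

        // Step 8: copy the result back from the Device to the Host
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[42] = %f\n", h_c[42]);

        // Step 9: release Device and Host memory
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
      }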

Documentation and Further Reading:

 NVIDIA CUDA Zone: https://developer.nvidia.com/cuda-zone
 Recommended - one stop shop for everything CUDA.



GPU Clusters

GPU Programming APIs
OpenMP

Overview:

Basic Approach:

  1. Identify which sections of code will be executed by the Host (CPU) and which sections by the Device (GPU).
    • Device code is typically computationally intensive and can be executed by many threads in parallel.
    • Host code is usually everything else.

  2. Determine which data must be exchanged between the Host and Device

  3. Apply OpenMP directives to parallel regions and/or loops. In particular, use the OpenMP 4.5 "Device" directives.

  4. Apply OpenMP Device directives and/or directive clauses that define data transfer between Host and Device (a minimal sketch follows this list).
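
As an illustration only (not specific to any particular LC compiler), a vector add offloaded with OpenMP 4.5 device directives might look like the sketch below; the target construct runs the loop on the Device, and the map clauses express the Host/Device data transfers:

      // Offload the loop to the Device; map clauses define the data transfers
      void vecAdd(double *a, double *b, double *c, int n)
      {
        #pragma omp target teams distribute parallel for \
                    map(to: a[0:n], b[0:n]) map(from: c[0:n])
        for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
      }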

Documentation and Further Reading:

 OpenMP.org official website: http://openmp.org
 Recommended. Standard specifications, news, events, links to education resources, forums.



GPU Clusters

GPU Programming APIs
OpenACC

Overview:

Basic Approach:

  1. Identify which sections of code will be executed by the Host (CPU) and which sections by the Device (GPU).
    • Device code is typically computationally intensive and can be executed by many threads in parallel.
    • Host code is usually everything else.

  2. Determine which data must be exchanged between the Host and Device

  3. Apply OpenACC directives to parallel regions and/or loops.

  4. Apply OpenACC directives and/or directive clauses that define data transfer between Host and Device (a minimal sketch follows this list).
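
As an illustration only, the same vector add expressed with OpenACC directives might look like the sketch below; the parallel loop directive offloads the loop, and the copyin/copyout clauses define the Host/Device data transfers:

      // Offload the loop to the Device; copyin/copyout define the data transfers
      void vecAdd(double *a, double *b, double *c, int n)
      {
        #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
        for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
      }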

Documentation and Further Reading:

 OpenACC.org official website: http://www.openacc-standard.org/
 Recommended. Standard specifications, news, events, links to education, software and application resources.



GPU Clusters

GPU Programming APIs
OpenCL

Overview:
 OpenCL is not discussed further in this document, as it is not commonly used or supported at LC.

Documentation and Further Reading:

 Khronos Group website: https://www.khronos.org/opencl
 Recommended. Standard specifications, software, resources, documentation, forums


GPU Clusters

Compiling
CUDA

The following instructions apply to LC's TOSS 3 Linux GPU clusters:

C/C++:

Fortran:

Getting Help:



GPU Clusters

Compiling
OpenMP

Current Situation for LC's Linux Clusters:



GPU Clusters

Compiling
OpenACC

Load Required Modules:

Compile Using a PGI Compiler Command:

Getting Help:



GPU Clusters

Misc. Tips & Tools

The following information applies to LC's Linux GPU clusters surface, max and rzhasgpu.

Running Jobs:

GPU Hardware Info:

Debuggers, Tools:






This completes the tutorial.

      Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.




References and More Information