This tutorial is intended to be an introduction to using LC's Linux
clusters. It begins by providing a brief historical background of Linux clusters
at LC, noting their success and adoption as a production, high performance
computing platform. The primary hardware components of LC's Linux clusters
are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware related information.
After covering the hardware related topics, software topics are discussed,
including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are
noted. Available debuggers and performance-related tools/topics are briefly discussed; however, detailed usage is beyond the scope of this tutorial.
A lab exercise using one of LC's Linux clusters follows the presentation.
Level/Prerequisites: This tutorial is intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is required. The material covered by the following tutorials would also be helpful:
EC3501: Livermore Computing Resources and Environment
EC4045: Moab and SLURM
Background of Linux Clusters at LLNL
The Linux Project:
LLNL first began experimenting with Linux clusters in 1999-2000 in a
partnership with Compaq and Quadrics to port Quadrics software to Linux.
The Linux Project was started for several reasons:
Cost: price-performance analysis demonstrated that near-commodity
hardware in clusters running Linux could be more cost-effective
than proprietary solutions;
Focus: the decreasing importance of high-performance computing
(HPC) relative to commodity purchases was making it more difficult to
convince proprietary systems vendors to implement HPC-specific features;
Control: it was believed that by controlling the OS in-house,
Livermore Computing could better support its customers;
Community: the platform created could be leveraged by the general community.
The objective of this effort was to apply LC's scalable systems strategy
(the "Livermore Model") to commodity hardware running the open source
Linux operating system:
Based on SMP compute nodes attached to a high-speed, low-latency interconnect.
Uses OpenMP to exploit SMP parallelism within a node and MPI
to exploit parallelism between nodes.
Provides a parallel filesystem with a POSIX interface.
Application toolset: C, C++ and Fortran compilers, scalable
MPI/OpenMP GUI debugger, performance analysis tools.
Peloton clusters were built in 5.5 Tflop "scalable units" (SU) of ~144 nodes each.
All Peloton clusters used AMD dual-core Socket F Opterons:
8 CPUs per node
2.4 GHz clock
Option to upgrade to 4-core Opteron "Deerhound" later (not taken)
The six Peloton systems represented a mix of resources: OCF, SCF,
ASC, M&IC, Capability and Capacity.
The last Peloton clusters were retired in June 2012.
Atlas Peloton Cluster
And Then, TLCC and TLCC2:
In July, 2007 the Tri-laboratory Linux Capacity Cluster (TLCC) RFP was released.
The TLCC procurement represents the first time that the Department of
Energy/National Nuclear Security Administration (DOE/NNSA) has awarded a
single purchase contract that covers all three national defense
laboratories: Los Alamos, Sandia, and Livermore.
The TLCC architecture is very similar to the Peloton architecture: Opteron
multi-core processors with an Infiniband interconnect. The primary
difference is that TLCC clusters are quad-core instead of dual-core.
LC's TLCC clusters included Juno and Eos (pictured below).
In June, 2011 the TLCC2 procurement was announced as a follow-on to the
successful TLCC systems.
The TLCC2 systems consist of multiple Intel Xeon E5-2670 (Sandy Bridge EP), QDR
InfiniBand based clusters, such as Zin (pictured below).
Additionally, LC procured other Linux clusters similar to TLCC2 systems
for various purposes.
And Now? What Next?
As of June 2015, Tri-lab procurements as follow-ons to the TLCC2 systems
are in progress. They are now called Commodity Technology Systems (CTS).
CTS-1 systems started to become available in late 2016 - early 2017. These
systems are based on Intel Broadwell E5-2695 v4 processors, 36 cores per node,
128 GB node memory, with an Intel Omni-Path 100 Gb/s interconnect. They include
Quartz (pictured below) and Jade.
CTS-2 systems are expected to start becoming available in the 2019-2020 timeframe.
Juno, Eos TLCC Clusters
Zin TLCC2 Cluster
Quartz CTS-1 Cluster
Cluster Configurations and Scalable Units
Currently, LC has several types of production Linux clusters
based on different processor architectures.
All of LC's Linux clusters differ in their configuration details; however,
they share the same basic hardware building blocks:
Nodes
Frames / racks
High speed interconnect (most clusters)
Other hardware (file systems, management hardware, etc.)
The basic building block of a Linux cluster is the node. A node is
essentially an independent computer. Key features:
Self-contained, diskless, multi-core computer.
Low form-factor - Cluster nodes are very thin to minimize the space they occupy.
Rack Mounted - Nodes are mounted compactly in a drawer fashion to
facilitate maintenance, reduced footprint, etc.
Remote Management - There is no keyboard, mouse, monitor or other
device typically used to interact with a computer. All node management
occurs over the network from a "management" node.
Nodes are typically configured into 4 types, according to their function:
Compute nodes - Nodes that run user jobs. The majority of nodes
in a cluster. Compute nodes are typically split into one of
two partitions: batch or interactive/debug.
Login nodes - One or more per cluster. This is where you login
for access to the compute nodes. Login nodes are also used
to build your applications and control cluster jobs.
Gateway (I/O) nodes - These nodes are dedicated
fileservers. They connect the compute nodes to essential file systems
which are mounted on disk storage devices, such as Lustre OSTs.
The number of these nodes varies per cluster.
Administrative/management nodes - Used by system administrators to
manage the entire cluster. Not accessible to users.
Examples:
Single compute node - TLCC2
Single compute node - CTS-1
Frames / Racks:
Frames are the physical cabinets that hold most of a cluster's components:
Nodes of various types
Other network components
Parallel file system disk resources (usually in separate racks)
Frames vary in size and appearance between the different Linux clusters at LC.
Power and console management - frames include hardware and software that
allow system administrators to perform most tasks remotely.
Example images below:
Frames - TLCC2
Frames - CTS-1
The basic building block of LC's production Linux clusters is called a
"Scalable Unit" (SU). An SU consists of:
Nodes (compute, login, management, gateway)
First stage switches that connect to each node directly
Miscellaneous management hardware
Frames sufficient to house all of the hardware
Additionally, second stage switch hardware is needed to connect multi-SU
clusters (not shown).
The number of nodes in an SU depends upon the type of switch hardware being
used. For example:
QLogic = 162 nodes
Intel Omni-Path = 192 nodes
Multiple SUs are combined to create a cluster. For example:
2 SU = 324 / 384 nodes
4 SU = 648 / 768 nodes
8 SU = 1296 / 1536 nodes
The SU design is meant to:
Standardize configuration details across the enterprise
Easily "grow" clusters in incremental units
Leverage procurements and reduce costs across the Tri-labs
An example of a 2 SU cluster is shown below for illustrative purposes.
Note that a frame holding the second level switch hardware is not shown.
LC Linux Cluster Systems
LC Linux Clusters Summary
The table below summarizes the key characteristics of LC's Linux clusters.
Note that some systems are limited access and not Generally Available (GA).
Several examples of LC's Linux clusters, past and present, are shown below:
QLogic QDR 18-864 port: all used ports connect to first-stage switches
via fiber optic cabling.
Example image below:
QLogic 1st and 2nd Stage Switches (back)
Two-stage, federated, bidirectional, fat-tree.
The number of second stage switches depends upon the number
of scalable units (SUs) that comprise the cluster and the type of switch
hardware used.
2688-way Interconnect Quartz/Jade - 14 SU
The inter-node bandwidth measurements below were taken on live, heavily loaded
LC machines using a simple MPI non-blocking test code, with one task on each of
two nodes. Not all systems are represented. Your mileage may vary.
Intel Xeon Clusters with QDR QLogic (Muir, Ansel)
Intel Xeon Clusters with QDR QLogic (TLCC2)
Intel Xeon Clusters with Intel Omni-Path (CTS-1)
Software and Development Environment
This section only provides a summary of the software and development
environment for LC's Linux clusters. Please see the
Introduction to LC Resources tutorial for details.
TOSS Operating System:
All LC Linux clusters use TOSS (Tri-Laboratory Operating System Stack).
The primary components of TOSS include:
Red Hat Enterprise Linux (RHEL) distribution with modifications to
support targeted HPC hardware and cluster computing
RHEL kernel optimized for large scale cluster computing
OpenFabrics Enterprise Distribution InfiniBand software stack including
MVAPICH and OpenMPI libraries
Moab workload manager and SLURM resource manager
Integrated Lustre and Panasas parallel file system software
Scalable cluster administration tools
Cluster monitoring tools
C, C++ and Fortran90 compilers (GNU, Intel, PGI)
Testing software framework for hardware and operating system
SLURM - LC's Simple Linux Utility for Resource Management:
SLURM is the native job scheduling system on each cluster
The SLURM resource manager on one cluster does not communicate with
the SLURM resource managers on other clusters
The Dotkit system provides a convenient, uniform way to select among
multiple versions of software installed on the LC systems.
Dotkit is similar to Linux modules, and is being replaced by modules on
LC's newer TOSS 3 systems.
Many LC software applications require that you load a particular dotkit
"package" in order to execute the software, see the man pages, get the desired
version, etc.
After you login to a particular machine, try the
use -l command to see available dotkit packages.
List available packages: use -l
Load a package: use package_name
Unload a package: unuse package_name
List loaded packages: use
Search available packages: use -l keyword
Read package help info: use -h package_name
Display package contents: use -hv package_name
Show default packages: dpkg-defaults (look for asterisks)
LC's Linux clusters support the standard Modules package.
Modules are replacing Dotkit on LC's newer TOSS 3 systems.
On LC's TOSS 2 clusters, some software applications use both Modules and Dotkit.
List available modules: module avail
Load a module: module add|load modulefile
Unload a module: module rm|unload modulefile
List loaded modules: module list
Read module help info: module help modulefile
Display module contents: module display|show modulefile
For more information see:
The module man page
Compilers, Tools, Graphics and Other Software:
The table below lists and provides links to the majority of software available through LC or related organizations.
LINMath, the Livermore Interactive Numerical Mathematical Software Access Utility, is a Web-based access utility for math library software. The LINMath Web site also has pointers to packages available from external sources.
Center for Applied Scientific Computing (CASC) Software
A wide range of software available for download from LLNL's CASC. Includes mathematical software, language tools, PDE software frameworks, visualization, data analysis, program analysis, debugging, and benchmarks:
Available Compilers and Invocation Commands:
The table below summarizes compiler availability and invocation commands
on LC Linux clusters.
Note that parallel compiler commands are actually LC scripts that
ultimately invoke the corresponding serial compiler.
The IEEE floating point standard defines several exceptions (FPEs)
when the result of a floating point operation is unclear or undesirable:
overflow: an operation's result is too large to be represented as a
float. Can be trapped, or else returned as a +/- infinity.
underflow: an operation's result is too small to be represented as a
normalized float. Can be trapped, or else represented as
a denormalized float (zero exponent w/ non-zero fraction) or zero.
divide-by-zero: attempting to divide a float by zero. Can be trapped,
or else returned as a +/- infinity.
inexact: result was rounded off. Can be trapped or returned
as rounded result.
invalid: an operation's result is ill-defined, such as 0/0 or the
sqrt of a negative number. Can be trapped or returned as NaN (not a number).
By default, the Xeon processors used at LC mask/ignore FPEs.
Programs that encounter FPEs will not terminate abnormally but will
instead continue execution, potentially producing wrong results.
Compilers differ in their ability to handle FPEs, and each provides some
means of controlling FPE handling through its options:
Intel C, C++, Fortran: provides the -fp-trap* options.
PGI C, C++, Fortran: offer the convenient -Ktrap* options to trap FPEs.
GNU Fortran: offers the -ffpe-trap=* options.
GNU C/C++: using the feenableexcept() routine may help. See the man
page for details. A simple sketch is shown below.
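Below is a minimal sketch, assuming GNU C on Linux (the file name fpe_trap.c and the particular set of trapped exceptions are illustrative), of using feenableexcept() to unmask FPEs so they raise SIGFPE instead of being silently ignored:

#define _GNU_SOURCE           /* feenableexcept() is a GNU extension */
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    /* Unmask the most common FPEs; any later occurrence raises
       SIGFPE instead of silently producing inf or NaN. */
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

    double x = 1.0, y = 0.0;
    double z = x / y;         /* now aborts with SIGFPE */
    printf("z = %f\n", z);    /* not reached */
    return 0;
}

Compiled with cc fpe_trap.c -lm, the program aborts at the divide instead of continuing with infinity.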
Precision, Performance and IEEE 754 Compliance:
Typically, most compilers do not guarantee IEEE 754 compliance
for floating-point arithmetic unless it is explicitly specified by a
compiler flag. This is because compiler optimizations are performed
at the possible expense of precision.
Unfortunately, for most programs, adhering to IEEE floating-point
arithmetic adversely affects performance.
If you are not sure whether your application needs this, try
compiling and running your program both with and without it to
evaluate the effects on both performance and precision.
See the relevant compiler documentation for details.
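As a hedged illustration (the Intel compiler's -fp-model precise option is one way to request stricter IEEE semantics; the file names are hypothetical), you might build and time both variants:

icc -O2 -fp-model precise -o prog_ieee prog.c    # favor IEEE 754 semantics
icc -O2 -o prog_fast prog.c                      # default, more aggressive optimization
time ./prog_ieee
time ./prog_fast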
Mixing C and Fortran:
If you are linking C/C++ and FORTRAN code together, and need to explicitly
specify the FORTRAN or C/C++ libraries on the link line, LC provides a
general recommendation and example in the
/usr/local/docs/linux.basics file. See the
"MIXING C AND FORTRAN" section.
All of the other issues involved with mixed language programming apply (a short sketch follows the list), such as:
Column-major vs. row-major array ordering
Routine name differences - appended underscores
Arguments passed by reference versus by value
Common blocks vs. extern structs
Memory alignment differences
File I/O - Fortran unit numbers vs. C/C++ file pointers
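As a minimal sketch of the name-mangling and pass-by-reference issues (the routine vecsum and its Fortran source are hypothetical): on Linux, Fortran compilers typically append an underscore to external names and pass all arguments by reference, so a C caller might look like this:

/* Fortran side, for reference:
 *       subroutine vecsum(n, x, total)
 *       integer n
 *       double precision x(n), total
 */
#include <stdio.h>

extern void vecsum_(int *n, double *x, double *total);  /* note trailing underscore */

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double total = 0.0;
    int    n = 4;

    vecsum_(&n, x, &total);   /* every argument is passed by reference */
    printf("total = %f\n", total);
    return 0;
}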
See also: /usr/local/docs/openmpi.basics on LC's TOSS 2 clusters.
MPI Build Scripts
LC-developed MPI compiler wrapper scripts are used to compile MPI programs.
They automatically perform some error checks, include the appropriate
MPI #include files, link to the necessary MPI libraries, and pass options to
the underlying compiler (a short usage sketch follows below).
Note: you may need to load a dotkit/module for the desired MPI implementation,
as discussed above. Failing to do this will result in getting the default
implementation.
For additional information:
See the man page (if it exists)
Issue the script name with the -help option
View the script yourself directly
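For instance, a typical build-and-run sequence using one of these wrapper scripts (the file and executable names are hypothetical):

mpicc -O2 -o hello_mpi hello_mpi.c
srun -n4 -ppdebug ./hello_mpi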
MPI Build Scripts

MVAPICH 1.2 (TOSS 2 only):
  mpicc          gcc - GNU
  mpigcc         gcc - GNU
  mpiicc         icc - Intel
  mpipgcc        pgcc - PGI
  mpiCC          g++ - GNU
  mpig++         g++ - GNU
  mpiicpc        icpc - Intel
  mpipgCC        pgCC - PGI
  mpif77         g77 - GNU
  mpigfortran    gfortran - GNU
  mpiifort       ifort - Intel
  mpipgf77       pgf77 - PGI
  mpipgf90       pgf90 - PGI

MVAPICH2:
  mpicc          C compiler of dotkit package loaded
  mpicxx         C++ compiler of dotkit package loaded
  mpif77         Fortran77 compiler of dotkit package loaded
  mpif90         Fortran90 compiler of dotkit package loaded

Open MPI:
  mpicc                  C compiler of dotkit package loaded
  mpiCC mpic++ mpicxx    C++ compiler of dotkit package loaded
  mpif77                 Fortran77 compiler of dotkit package loaded
  mpif90                 Fortran90 compiler of dotkit package loaded
Level of Thread Support
MPI libraries vary in their level of thread support (a query sketch follows this list):
MPI_THREAD_SINGLE - Level 0: Only one thread will execute.
MPI_THREAD_FUNNELED - Level 1:
The process may be multi-threaded, but only
the main thread will make MPI calls - all MPI calls are funneled
to the main thread.
MPI_THREAD_SERIALIZED - Level 2:
The process may be multi-threaded, and
multiple threads may make MPI calls, but only one at a time: calls are not
made concurrently from two distinct threads (all MPI calls are serialized).
MPI_THREAD_MULTIPLE - Level 3:
Multiple threads may call MPI with no restrictions.
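A program can request a level and check what the library actually provides with the standard MPI_Init_thread() call; a minimal sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request FUNNELED; the library returns the level it actually supports. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED)
        printf("Warning: MPI provides only thread level %d\n", provided);

    MPI_Finalize();
    return 0;
}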
Job limit information for ALL machines is available at mylc.llnl.gov. Click
on any machine name in the "My Accounts" section.
An easy way to determine the job limits for a machine where you are logged
in is to use the command news job.lim.machinename,
where machinename is the name of the machine you are logged into.
Further discussion and a summary table of job limits for all production
machines are available in the Queue Limits section of the Moab tutorial.
Batch Versus Interactive
Interactive Jobs (pdebug):
Most LC clusters have a pdebug partition that permits users to
run "interactively" from a login node.
Your job is launched from the login node command line using the
srun command - covered in the
Starting Jobs section.
The job then runs on pdebug compute node(s) - NOT on the login node.
stdin, stdout, and stderr are handled to make it appear that the job is
running locally on the login node.
As the name pdebug implies, interactive jobs should
be short, small debugging jobs, not production runs:
Shorter time limit
Fewer nodes permitted
There is usually a "good neighbor" policy in effect - don't monopolize the partition.
Some clusters may have additional partitions permitting interactive jobs.
Although the pdebug partition is generally associated with interactive use, it
can also be used for debugging jobs submitted with a batch script (next).
Batch Jobs (pbatch):
This section only provides a quick summary of batch usage on
LC's clusters. For details, see the Moab and SLURM Tutorial.
Typically, most of a cluster's compute nodes are configured into a pbatch
partition, which is intended for production work:
Longer time limits
Larger number of nodes per job
Limits enforced by batch system rather than "good neighbor" policy
The pbatch partition is managed by the SLURM and Moab schedulers
Batch jobs must be submitted in the form of a job control script
with the msub command. Examples:
msub myjobscript -q ppdebug -a mic
msub myjobscript -l walltime=45:00
Example Moab job control script:
##### Example Moab Job Control Script
##### These lines are for Moab
#MSUB -l nodes=16
#MSUB -l partition=zin
#MSUB -l walltime=2:00:00
#MSUB -A phys
#MSUB -q pbatch
#MSUB -m be
#MSUB -o /p/lscratch3/joeuser/par_solve/myjob.out
##### These are shell commands
# Display job information for possible diagnostic use
echo "Job id = $SLURM_JOBID"
# Run the job
srun -n256 a.out
After successfully submitting a job, you may then check its progress and
interact with it (hold, release, alter, terminate) by means of other batch
commands - discussed in the Interacting With Jobs section of the Moab and
SLURM tutorial.
Some clusters have additional partitions permitting batch jobs.
Interactive use of pbatch nodes is facilitated by using the
mxterm command - discussed in the
Where to Login section.
Interactive debugging of batch jobs is possible - covered in the Debugging in Batch section below.
Interactive srun jobs launched from the command line
should normally be terminated with a SIGINT (CTRL-C):
The first CTRL-C will report the state of the tasks
A second CTRL-C within one second will terminate the tasks
For batch jobs, the mjobctl -c and
canceljob commands can be used.
More information/examples: see Canceling Jobs in the Moab and SLURM tutorial.
Optimizing CPU Usage
Clusters with an Interconnect:
Fully utilizing the cores on a node requires that you use the right
combination of srun and Moab options, depending upon
what you want to do and which type of machine you are using.
MPI only: for example, if you are running on a cluster that has
16 cores per node, and you want your job to use all 16 cores on 4 nodes (16 MPI
tasks per node), then you would do something like:
Interactive: srun -n64 -ppdebug a.out
Batch:
#MSUB -l nodes=4
srun -n64 a.out
MPI with Threads:
If your MPI job uses POSIX or OpenMP threads within each node, you will
need to calculate how many cores will be required in addition to the
number of tasks.
For example, running on a cluster having 16 cores per node, an 8-task
job where each task creates 4 OpenMP threads would need a total of 32 cores, or
2 nodes: 8 tasks * 4 threads / 16 cores per node = 2 nodes.
Interactive: srun -N2 -n8 -ppdebug a.out
Batch:
#MSUB -l nodes=2
srun -N2 -n8 a.out
Don't forget that the default MPI on LC's Linux clusters is not
thread-safe, so the master thread must perform all MPI calls.
You can include multiple srun commands within your batch
job command script. For example, suppose that you were conducting a
scalability run on a 16 core/node Linux cluster.
You could allocate the maximum number of nodes
that you would use with #MSUB -l nodes= and then have a
series of srun commands that use varying numbers of nodes:
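A hedged sketch of such a script (the node counts and the executable name a.out are illustrative) for a 16-core/node cluster:

#MSUB -l nodes=16
srun -N1 -n16 a.out
srun -N2 -n32 a.out
srun -N4 -n64 a.out
srun -N8 -n128 a.out
srun -N16 -n256 a.out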
Clusters without an Interconnect:
The issues for CPU utilization are different for clusters without a
switch for several reasons:
Jobs can be serial
Nodes may be shared with other users
Jobs are limited to one node
On these systems, the more important issue is over-utilization of cores
rather than under-utilization.
IMPORTANT: For batch jobs, you need to specify how many cores your
job will require. The way to do this is with the Moab ttc= option:
#MSUB -l ttc=#cpus
This allows Moab to effectively schedule the node
for other jobs if you don't use all of a node's cores.
This is particularly important for MPI jobs and threaded codes.
Failure to do this can/will result in:
The batch scheduler assuming by default that your job requires only
Other jobs being scheduled on the node you're using because the
scheduler thinks your job only needs one core.
Poor performance and jobs taking longer to run than expected because
of oversubscription. Jobs may be killed because they will take longer
than the time limit specified.
Upsetting other users because you've oversubscribed the node.
Wasting the time and effort of LC support staff because of upset
users calling about unexpected job behavior.
Important notes for Agate, Borax and RZTrona users: Please see the
"Running on Serial Clusters"
section of the Moab tutorial for important
information about running MPI jobs on these clusters.
64-bit Architecture Memory Limit:
Because LC's Linux clusters employ a 64-bit architecture, 16 exabytes of
memory can be addressed - about 4 billion times more than the 4 GB limit
of 32-bit architectures. By current standards, this is virtually unlimited memory.
In reality, machines are usually configured with only GBs of memory, so any
address access that exceeds physical memory will result (on most systems)
in paging and degraded performance.
However, LC machines are an exception to this because they have no
local disk - see below.
LC's Diskless Linux Clusters:
LC's Linux clusters are configured with diskless compute nodes. This has
very important implications for programs that exceed physical memory. For
example, most compute nodes have 16-64 GB of physical memory.
Because compute nodes don't have disks, there is no virtual (swap) memory,
which means there is no paging. Programs that exceed physical memory will
terminate with an OOM (out of memory) error and/or segmentation fault.
Compiler, Shell, Pthreads and OpenMP Limits:
Compiler data structure limits are in effect, but may be handled
differently by different compilers.
Shell stack limits: most are set to "unlimited" by default.
Pthreads stack limits apply, and may differ between compilers.
OpenMP stack limits apply, and may differ between compilers.
MPI Memory Use:
All MPI implementations require additional memory use. This varies
between MPI implementations and between versions of any given implementation.
The amount of memory used increases with the number of MPI tasks.
Determining how much memory the MPI library uses can be accomplished
by using various tools, such as TotalView's Memscape feature.
Large Static Data:
If your executable contains a large amount of static data, LC recommends
that you compile with the -mcmodel=medium option.
This flag allows the Data and .BSS sections of your executable to extend
beyond a default 2 GB limit.
True for all compilers (see the respective compiler man page).
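For example (the file name is hypothetical; the flag is accepted by the GNU, Intel and PGI compilers on x86-64):

gcc -mcmodel=medium -o big_static big_static.c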
Vectorization and Hyper-threading
Historically, the Xeon architecture has provided support for
SIMD (Single Instruction Multiple Data) vector instructions through
Intel's Streaming SIMD Extensions (SSE, SSE2, SSE3, SSE4) instruction sets.
AVX - the Advanced Vector Extensions instruction set (2011) improved on SSE
instructions by increasing the size of the vector registers from 128 bits to
256 bits. AVX2 (2013) offered further improvements, such as fused multiply-add
(FMA) instructions, effectively doubling peak floating-point throughput.
The primary purpose of these instructions is to increase CPU
throughput by performing operations on vectors of data elements,
rather than on single data elements. For example:
Sandy Bridge-EP (TLCC2) processors support SSE and AVX instructions.
Broadwell (CTS-1) processors support SSE, AVX and AVX2.
To take advantage of the potential performance improvements offered
by vectorization, all you need to do is compile with the appropriate
compiler flags, for example -mavx and -mavx2 (C/C++ only).
Vectorization is performed on eligible loops. Note that
not all loops are able to be vectorized. A number of
factors can prevent the compiler from vectorizing a loop, including:
GOTOs in or out of the loop
Data dependence, such as needing a value from a previous iteration
Complex coding (difficult loop analysis)
To view/confirm that a loop has been vectorized, use your compiler's
vectorization reporting flags. You can also generate an assembler file and
look at the instructions used. For example:
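A sketch of the assembly approach (file names are hypothetical): AVX instructions use the 256-bit ymm registers, so their presence in the generated assembly is a good sign the loop was vectorized:

gcc -O3 -mavx2 -S -o myloop.s myloop.c
grep ymm myloop.s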
On TOSS 2 clusters, users need to explicitly enable hyper-threads
using the SLURM --enable-hyperthreads
option of the srun command.
Example: on a 16-core node, check to see if hyper-threading is
turned on - it isn't by default, so only 16 cores are shown. Then
execute the same command using srun's
--enable-hyperthreads flag. Now 32 "cores" appear.
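A sketch of one way to make such a check, using nproc to count the CPUs available to a job step (illustrative; a 16-core node is assumed):

srun -N1 -n1 -ppdebug nproc
16
srun -N1 -n1 -ppdebug --enable-hyperthreads nproc
32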
mpifind - reports on how processes and threads are bound to cores.
Documentation at https://lc.llnl.gov/confluence/display/TLCC2/mpifind
Example output for a 4-process job, each with 4 threads. Output on the left is
with OMP_PROC_BIND unset, and on the right, with OMP_PROC_BIND=TRUE.
This section only touches on selected highlights. For more information, users will definitely need to consult the relevant documentation mentioned below. Also, please consult the "Supported Software and Computing Tools" web page.
TotalView is probably the most widely used debugger for parallel programs.
It can be used with C/C++ and Fortran programs and supports all common
forms of parallelism, including pthreads, OpenMP, MPI, accelerators and GPUs.
Starting TotalView for serial codes: simply issue the command:
totalview myprog
Starting TotalView for interactive parallel jobs:
Some special command line options are required to run a parallel job
through TotalView under SLURM. You need to run srun under TotalView,
and then specify the -a flag followed by 1) srun options,
2) your program, and 3) your program flags (in that order). The general
syntax is:
totalview srun -a -n #processes -p pdebug myprog [prog args]
To debug an already running interactive parallel job, simply issue the
totalview command and then attach
to the srun process that started the job.
Debugging batch jobs is covered in LC's
TotalView tutorial and in the "Debugging in Batch"
DDT stands for "Distributed Debugging Tool", a product of Allinea Software.
DDT is a comprehensive graphical debugger designed specifically for
debugging complex parallel codes. It is supported on a variety of platforms
for C/C++ and Fortran. It can be used to debug multi-process MPI
programs and multi-threaded programs, including OpenMP.
Currently, LC has a limited number of fixed and floating licenses for OCF and
SCF Linux machines.
Local copies of documentation under /usr/local/docs/ddt
STAT - Stack Trace Analysis Tool:
The Stack Trace Analysis Tool gathers and merges stack traces from a
parallel application's processes.
STAT is particularly useful for debugging hung programs.
It produces two kinds of call graphs: 2D spatial and 3D spatial-temporal.
The 2D spatial call prefix tree represents a single snapshot of the
entire application (see image).
The 3D spatial-temporal call prefix tree represents a series of
snapshots from the application taken over time.
In these graphs, the nodes are labeled by function names. The directed
edges, showing the calling sequence from caller to callee, are labeled by
the set of tasks that follow that call path. Nodes that are visited by the
same set of tasks are assigned the same color, giving a visual reference to
the various equivalence classes.
This tool should be in your default path as:
/usr/local/bin/stat-gui - GUI
/usr/local/bin/stat-cl - command line
/usr/local/bin/stat-view - viewer for DOT format
DDD: GNU DDD debugger is a graphical front-end for command-line debuggers
such as GDB, DBX, WDB, Ladebug, JDB, XDB, the Perl debugger, the bash
debugger, or the Python debugger.
Debugging in Batch: mxterm:
Debugging batch parallel jobs on LC production clusters
is fairly straightforward. The main idea is that you need to submit a batch
job that gets your partition allocated and running.
Once you have your partition, you can login to any of the nodes within it,
and then start running as though you were in the interactive pdebug partition.
For convenience, LC has developed a utility called
mxterm, which makes the process even easier.
How to use mxterm:
If you are on a Windows PC, start your X11 application
Make sure you enable X11 tunneling for your ssh session
ssh and login to your cluster
Issue the command as follows:
mxterm #nodes #tasks #minutes
where:
#nodes = number of nodes your job requires
#tasks = number of tasks your job requires
#minutes = how long you need to keep your partition for debugging
This will submit a batch job for you that will open an xterm when it
starts to run.
After the xterm appears, cd to the directory with
your source code and begin your debug session.
This utility does not have a man page; however, you can view the
usage information by simply typing the name of the command.
It is quite likely that your shell's core file size setting limits
the size of core files so that they are inadequate for debugging,
especially with TotalView.
To check your shell's limit settings, use either the
limit (csh/tcsh) or
ulimit -a (sh/ksh/bash) command.
To override your default core file size setting, use one of the following commands:
limit coredumpsize 64
ulimit -c 64
Some users have complained that for many-process jobs, they actually
don't want core files or only want small core files because normal core
files can fill up their disk space. The limit
(csh/tcsh) or ulimit -c (sh/ksh/bash) commands can
be used as shown above to set smaller or zero sizes.
A Few Additional Useful Debugging Hints:
Add the sinfo and
squeue commands to your
batch scripts to assist in diagnosing problems. In particular, these
commands will document which nodes your job is using.
Also add the -l option to your
srun command so that
output statements are prepended with the task number that created them.
Be sure to check the exit status of all I/O operations when reading
or writing files in Lustre. This will allow you to detect any I/O problems
with the underlying OST servers.
If you know/suspect that there are problems with particular nodes,
you can use the srun -x option to skip these nodes.
srun -N12 -x "cab40 cab41" -ppdebug myjob
We Need a Book!
The subject of "Tools" for Linux cluster applications is far too broad and
deep to cover here. Instead, a few pointers are being provided for those
who are interested in further research.
Valgrind's Memcheck tool detects a comprehensive set of memory errors, including reads and writes of unallocated or freed memory and memory leaks.
TotalView: Allows you to stop execution when heap API problems occur, list memory leaks, paint allocated and deallocated blocks, identify dangling pointers, hold onto deallocated memory, graphically browse the heap, identify the source line and stack backtrace of an allocation or deallocation, summarize heap use by routine, filter and dump heap information, and review memory usage by process or by library.
The memP tool provides heap profiling through the generation of two reports: a summary of the heap high-water-mark across all processes in a parallel job as well as a detailed task-specific report that provides a snapshot of the heap memory currently in use, including the amount allocated at specific call sites.
Intel Inspector: Primarily a thread correctness tool, but memory debugging features are included.
Profiling, Tracing and Performance Analysis:
Open|SpeedShop is a comprehensive performance tool set with a unified look and feel that covers most important performance analysis steps. It offers various different interfaces, including a flexible GUI, a scripting interface, and a Python class. Supported experiments include profiling using PC sampling, inclusive and exclusive execution time analysis, hardware counter support, as well as MPI, I/O, and floating point exception tracing. All analysis is applied on unmodified binaries and can be used on codes with MPI and/or thread parallelism.
TAU is a robust profiling and tracing tool from the University of Oregon that includes support for MPI and OpenMP. TAU provides an instrumentation API, but source code can also be automatically instrumented and there is support for dynamic instrumentation as well. TAU is generally viewed as having a steep learning curve, but experienced users have applied the tool with good results at LLNL. TAU can be configured with many feature combinations. If the features you are interested in are not available in the public installation, please request the appropriate configuration through the hotline. TAU developer response is excellent, so if you are encountering a problem with TAU, there is a good chance it can be quickly addressed.
HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the largest supercomputers. It uses low overhead statistical sampling of timers and hardware performance counters to collect accurate measurements of a program's work, resource consumption, and inefficiency and attributes them to the full calling context in which they occur.
HPCToolkit works with C/C++/Fortran applications that are either statically or dynamically linked. It supports measurement and analysis of serial codes, threaded codes (pthreads, OpenMP), MPI, and hybrid (MPI + threads) parallel codes.
mpiP: A lightweight MPI profiling library that provides time spent in MPI functions by callsite and stacktrace. This tool is developed and maintained at LLNL, so support and modifications can be quickly addressed. New run-time functionality can be used to generate mpiP data without relinking through the srun-mpip and poe-mpip scripts on Linux and AIX systems.
Displays call graph profile data. The gprof command is useful in identifying how a program consumes CPU resources. Gprof does simple function profiling and requires that the code be built and linked with -pg. For parallel programs, in order to get a unique output file for each process, you will need to set the undocumented environment variable GMON_OUT_PREFIX to some non-null string. For example:
setenv GMON_OUT_PREFIX 'gmon.out.'`/bin/uname -n`
PGI profiler - pgprof is a tool which analyzes data generated during execution of specially compiled programs. This information can be used to identify which portions of a program will benefit the most from performance tuning.
PAPI: Portable hardware performance counter library.
A PAPI-based performance profiler that measures hardware performance events of an application without having to instrument the application.
The Intel VTune Amplifier tool is a performance analysis tool for finding hotspots in serial and multithreaded codes. Note the installation on LC machines does not include the advanced hardware analysis capabilities.
The Intel Profiler tool is built into the Intel compiler along with a simple GUI to display the collected results.
Vampir / Vampirtrace
Full featured trace file visualizer and library for generating trace files for parallel programs.
Beyond LC, the web offers endless opportunities for discovering tools
that aren't available here.
In many cases, users can install tools in their own directories if LC
doesn't provide them.
Linux Clusters Overview Exercise 2
Parallel Programs and More
Login to an LC workshop cluster, if you are not already logged in
Build and run parallel MPI programs
Build and run parallel Pthreads programs
Build and run parallel OpenMP programs
Build and run parallel benchmarks
Build and run an MPI message passing bandwidth test