CHAOS: Linux from Livermore
For System Administrators
Open source cluster
administration tools for Linux seldom scale up to the size of the
clusters now common at Livermore Computing.
So the customized CHAOS environment includes several extra, locally
developed tools to promote efficient administration of very large
clusters.

ConMan - provides console management.
Every node in an LC Linux cluster is accessible via a serial console
connection because the console is often the only way to (re)configure
the node or check error messages if networking or other serious
problems arise.
But of course hundreds of tightly packed clustered nodes do not have
literal monitors and keyboards attached to display and manipulate their
consoles. And they often reside in buildings distant from their
system administration staff.
So the CHAOS environment offers ConMan,
a customized console-management program designed to
maintain persistent connections to many console
devices and support multiple simultaneous users remotely. ConMan now:
- supports local serial devices and remote terminal servers
with the TELNET protocol.
- maps symbolic names onto physical console devices
and logs output from specified consoles into a file for later review.
- connects to virtual consoles in monitor
(read-only, for continuous logging),
interactive (read-write, for executing commands),
or broadcast (write-only) mode.
- allows "joining" with existing clients if a console is
already in use, or instead "stealing" the console's privileges.
- allows scripts to execute across multiple consoles.
Recent versions of ConMan are available through the OCF web site.
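As a rough illustration of the monitor (read-only) mode described above, the sketch below tags each line of console output with its symbolic console name and a timestamp before appending it to a log file. The log format here is invented for illustration; it is not ConMan's actual format.

```python
import datetime

def log_console_line(console, line):
    """Tag one line of console output with its symbolic console name
    and a timestamp, as a monitor-mode logger might record it.
    (Illustrative format only -- not ConMan's actual log layout.)"""
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return "%s <%s> %s" % (stamp, console, line.rstrip())

def monitor(console, stream, logfile):
    """A monitor-mode (read-only) client: continuously append tagged
    console output to a log file for later review."""
    for line in stream:
        logfile.write(log_console_line(console, line) + "\n")
```

In this model an interactive client would additionally write to the console device, and a broadcast client would only write, fanning one input out to many consoles.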

Genders - facilitates cluster configuration management.
A standard practice among LC system administrators is to codify all
changes made to a cluster in a way that allows the changes to be
quickly reapplied after a fresh installation.
Genders facilitates this by enabling identical scripts to perform
different functions depending on their context.
Genders is a simple, static, flat-file database (plain-text file)
that represents the
layout of a whole cluster.
A Genders file (usually in /etc/genders) contains a list of
node-name/attribute-list pairs, and a copy resides on each node in
the cluster. Scripts then perform Genders-file lookups to configure
each of many nodes appropriately. Genders also includes an rdist distfile
preprocessor to expand attribute macros, so a central repository of
system files can even propagate correctly to multiple clusters using
rdist.
Among the current Genders software tools are:
- a query tool that lists all nodes with a specified attribute
(useful as a conditional test in scripts).
- an rdist preprocessor that can quickly redistribute appropriate
configuration-file variations when a Genders file changes.
- a Perl script that expresses network routing schemes (for load
balancing) in a single configuration file.
- C and Perl APIs to query Genders files or, with the help of PDSH
(below), to target a command to just those nodes that share a common
Genders attribute.
Recent versions of Genders are available through the OCF web site.
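The flat-file lookup described above can be sketched in a few lines of Python. The syntax assumed here (a node name followed by a comma-separated attribute list, with optional `attr=value` pairs and `#` comments) follows the description in the text, and the query helper mimics the kind of attribute test a configuration script might perform; it is a sketch, not the Genders library's actual API.

```python
def parse_genders(text):
    """Parse a Genders-style flat file: each non-comment line is a node
    name followed by a comma-separated attribute list (attr or
    attr=value).  Returns {node: {attr: value_or_None}}."""
    db = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        node, _, attrs = line.partition(" ")
        entry = db.setdefault(node.strip(), {})
        for attr in attrs.split(","):
            name, sep, value = attr.strip().partition("=")
            if name:
                entry[name] = value if sep else None
    return db

def nodes_with(db, attr):
    """List all nodes carrying a given attribute -- the kind of query a
    script uses as a conditional test ("am I a login node?")."""
    return sorted(n for n, attrs in db.items() if attr in attrs)
```

A script running identically on every node can then branch on whether its own hostname appears in, say, `nodes_with(db, "ntpserver")`.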

IPMI - the Intelligent Platform Management Interface is a standard
specification, developed by Dell, HP, Intel, and NEC, for remotely
monitoring and managing the physical status (temperature, air flow,
power) of computer nodes.
IPMI is implemented by hardware vendors at the chip level, so
application users are often unaware of it.
It relies on a "baseboard management controller" (BMC),
a small processor that supports IPMI separately from each node's
main CPU and its operating system.
CHAOS system administrators use several locally developed software
tools to take advantage of IPMI features to check or control nodes
on LC's large Linux clusters (all are shared as open source):
- runs as a daemon to manage and monitor the "baseboard
management controller" (BMC) timer, which enables several system
timeout functions as well as resetting after an operating system crash.
- implements the IPMI ping (path checking) protocol, as well
as the Remote Management Control Protocol. Both are used mostly to
debug IPMI over local area networks.
- works in conjunction with PowerMan (see below)
to remotely control compute-node power supplies by using IPMI.

PAM - (Pluggable Authentication Modules for Linux)
simplifies both user authentication and the management of
resource limits for Linux clusters.
Strictly speaking, PAM is a suite of shared libraries designed
to provide middleware that bridges "access applications"
(such as FTP, LOGIN, PASSWD, and SSH) and authentication mechanisms
(Kerberos, DCE, RSA SecurID tokens).
All Red Hat Linux systems come with PAM by default because this
makes authentication management much easier for system administrators.
In addition, CHAOS systems that use LCRM to control batch jobs and
to manage job resources have taken advantage of PAM to simplify
the use of resource limits as well.
Starting in February 2006, system administrators no longer need
to modify local LCRM configuration files to alter resource limits
because LCRM now lets PAM manage four CHAOS limits dynamically:
core size, stack size, number of open file descriptors, and
number of processes allowed for any single user.
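For reference, the four limits listed correspond to standard Linux per-process resource limits. The Python sketch below only reads a process's current soft and hard values for each one; PAM itself applies such limits at session start through its modules, which this sketch does not attempt to model.

```python
import resource

# The four CHAOS limits the text says LCRM now lets PAM manage,
# mapped to their standard Linux resource-limit constants:
CHAOS_LIMITS = {
    "core size":        resource.RLIMIT_CORE,
    "stack size":       resource.RLIMIT_STACK,
    "open descriptors": resource.RLIMIT_NOFILE,
    "processes":        resource.RLIMIT_NPROC,
}

def report_limits():
    """Return {limit name: (soft, hard)} for the current process."""
    return {name: resource.getrlimit(rlim)
            for name, rlim in CHAOS_LIMITS.items()}
```

Raising the soft value of any of these up to its hard limit is a `resource.setrlimit` call away, which is what makes per-session, PAM-driven adjustment practical.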

PDSH - executes commands on remote hosts in parallel.
The Parallel Distributed Shell (PDSH) utility is a multithreaded
remote shell client for system administrators,
similar to IBM's DSH tool but with better error handling.
PDSH offers several remote shell services including RSH, SSH, and
Kerberos IV. It is designed to gracefully handle the usual kinds of
node problems, such as when a target node is down or slow to respond.
Using PDSH lets a system administrator:
- execute commands across all nodes
of a large cluster as if it were a single machine
(simple commands can execute on over 1000 nodes of a typical cluster
in less than 2 seconds),
- run small MPI jobs in parallel across
the QsNet interconnect, which is helpful for trying parallel test
cases or interconnect diagnostics on a new system that still lacks
a regular resource manager.
Recent versions of PDSH are available through the OCF web site.
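The multithreaded fan-out pattern PDSH implements, one worker per remote command with a timeout so a down or slow node cannot stall the whole sweep, can be sketched as follows. For simplicity this sketch runs each command locally; a real client would wrap the command in rsh or ssh, and the fan-out width and timeout values are illustrative defaults, not PDSH's.

```python
import concurrent.futures
import subprocess

def fanout(nodes, make_argv, timeout=10, fanout_width=32):
    """Run one command per node in parallel, collecting a per-node
    (return code, output) result.  A node that hangs simply times out;
    the rest of the sweep continues."""
    def run_one(node):
        try:
            proc = subprocess.run(make_argv(node), capture_output=True,
                                  text=True, timeout=timeout)
            return node, proc.returncode, proc.stdout.strip()
        except (subprocess.TimeoutExpired, OSError) as err:
            return node, None, str(err)   # dead/slow node: record and move on

    with concurrent.futures.ThreadPoolExecutor(max_workers=fanout_width) as pool:
        return {node: (rc, out) for node, rc, out in pool.map(run_one, nodes)}
```

With ssh substituted in (`make_argv = lambda n: ["ssh", n, "uptime"]`), the same loop becomes a minimal parallel remote shell.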

PowerMan - manages clustered-system power controllers.
Power management for a large cluster poses challenges much like those
posed by console management (above). To minimize circuit loads or
focus repairs, a system administrator may want to boot an entire cluster,
just one rack, or even an individual node. And as with consoles,
remote power control is important when the cluster resides in a
different building than the staff.
PowerMan therefore provides a remote, command-line interface for
a wide variety of power-control and monitoring devices,
through a TCP network connection.
There is no standard protocol for a power-control device interface,
so PowerMan offers a flexible configuration that can adapt to almost
any hardware. PowerMan can query both plug and power supply output
status, and it can power-on, power-off, power-cycle, and hard-reset
individual nodes or node ranges in a CHAOS cluster.
Where hardware allows, PowerMan can also flag nodes needing service
and gather out-of-band temperature data.
Recent versions of PowerMan are available through the OCF web site.

WHATSUP - quickly detects and reports which nodes are currently
up and down within the Linux cluster where you run it.
When executed (by any user, not just administrators) with no options,
WHATSUP summarizes the count and name list of up nodes, followed by
the count and name list of down nodes, and then exits.
You can optionally report only up nodes (--up), only down nodes
(--down), or node lists not summarized but instead separated by
commas (--comma), returns (--newline), or blanks (--space).
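Cluster tools like this commonly print node lists in a compressed ranged form (for example, mcr[0-2,5] instead of four separate names). A purely illustrative sketch of that compression, assuming node names consist of a prefix plus a numeric suffix:

```python
import re

def compress(nodes):
    """Collapse names like mcr0,mcr1,mcr2,mcr5 into the ranged form
    mcr[0-2,5].  Names without a numeric suffix pass through as-is."""
    groups = {}
    for node in nodes:
        m = re.fullmatch(r"(\D+)(\d+)", node)
        if not m:
            groups.setdefault(node, [])
            continue
        groups.setdefault(m.group(1), []).append(int(m.group(2)))

    parts = []
    for prefix, nums in sorted(groups.items()):
        if not nums:
            parts.append(prefix)
            continue
        nums.sort()
        ranges, start, prev = [], nums[0], nums[0]
        for n in nums[1:]:
            if n != prev + 1:          # gap: close the current run
                ranges.append((start, prev))
                start = n
            prev = n
        ranges.append((start, prev))
        body = ",".join(str(a) if a == b else "%d-%d" % (a, b)
                        for a, b in ranges)
        parts.append("%s[%s]" % (prefix, body))
    return ",".join(parts)
```

The inverse expansion is equally mechanical, which is why ranged hostlists work well as arguments to cluster-wide commands.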

YACI - installs the operating system on cluster nodes.
Livermore Linux clusters currently run with a full copy of the
CHAOS/Red-Hat operating system on every node.
The alternative, using a root file system shared across nodes,
posed concerns about the performance and reliability of network file
servers (as well as complications regarding the integrity of Red Hat
software updates).
Maintaining multiple copies on multiple nodes, however, means that
deploying or upgrading the operating system requires a scalable way to
transfer (or "image") the operating system onto a large number of nodes.
No available open source technique (such as VA SystemImager or LUI)
met local needs. So the CHAOS staff developed YACI (Yet Another
Cluster Installer).
A YACI installation begins by creating a disk partition and placing
the needed files in a staging partition. These are then converted into
compressed TAR images ("tarballs"). Cluster installation continues by
installing management node(s) from a YACI CD-ROM, then network booting
a stand-alone image onto the remaining nodes. Finally, each stand-alone
image partitions its local disk and deploys the "tarballs" of the
original disk partition.
Once the management node(s) are configured, YACI can install CHAOS on
the remaining 1152 nodes of LC's MCR cluster in about 50 minutes.
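The image-and-deploy steps above amount to packing a staging tree into a compressed tarball and unpacking it onto each node's freshly partitioned disk. A minimal sketch of that general technique (illustrating the idea, not YACI's actual code):

```python
import pathlib
import tarfile

def make_image(staging_dir, image_path):
    """Pack a staging tree into a compressed tarball ("tarball" in the
    text) suitable for distribution to many nodes."""
    with tarfile.open(image_path, "w:gz") as tar:
        tar.add(staging_dir, arcname=".")

def deploy_image(image_path, target_dir):
    """Unpack the tarball onto a target file system, standing in for a
    node's freshly partitioned local disk."""
    pathlib.Path(target_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(image_path, "r:gz") as tar:
        tar.extractall(target_dir)
```

Because the image is a plain compressed archive, distributing it scales with whatever transfer mechanism is available (network boot, multicast, or staged copies), which is what makes the approach viable at thousand-node scale.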
Recent versions of YACI are available through the OCF web site.