
UCRL-WEB-200040

CHAOS: Linux from Livermore


For System Administrators

Open source cluster administration tools for Linux seldom scale up to the size of the clusters now common at Livermore Computing. So the customized CHAOS environment includes several extra, locally developed tools to promote efficient administration of very large LC clusters:

ConMan
provides console management.

Every node in an LC Linux cluster is accessible via a serial console connection because the console is often the only way to (re)configure the node or check error messages if networking or other serious problems occur. But of course hundreds of tightly packed clustered nodes do not have literal monitors and keyboards attached to display and manipulate their consoles. And they often reside in buildings distant from their system administration staff.

So the CHAOS environment offers ConMan, a customized console-management program that maintains persistent remote connections to many console devices for many simultaneous users. ConMan now:

  • supports local serial devices and remote terminal servers with the TELNET protocol.
  • maps symbolic names onto physical console devices and logs output from specified consoles into a file for later review (see the log-scanning sketch after this list).
  • connects to virtual consoles in monitor (read-only, for continuous logging), interactive (read-write, for executing commands), or broadcast (write-only) mode.
  • allows "joining" with existing clients if a console is already in use, or instead "stealing" the console's privileges.
  • allows scripts to execute across multiple consoles in parallel.
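
Because ConMan logs each console's output to its own file, routine post-processing is easy to script. Below is a minimal Python sketch of that idea; the log directory, file naming, and error strings are assumptions for illustration (actual locations depend on the local ConMan configuration), not part of ConMan itself.

    import glob
    import os

    LOG_DIR = "/var/log/conman"            # hypothetical per-console log directory
    PATTERNS = ("Oops", "panic", "MCE")    # strings worth flagging (illustrative)

    def scan_console_logs(log_dir=LOG_DIR):
        """Return {console_name: [matching lines]} for all logs in log_dir."""
        hits = {}
        for path in glob.glob(os.path.join(log_dir, "*.log")):
            console = os.path.splitext(os.path.basename(path))[0]
            with open(path, errors="replace") as log:
                matches = [line.rstrip() for line in log
                           if any(p in line for p in PATTERNS)]
            if matches:
                hits[console] = matches
        return hits

    if __name__ == "__main__":
        for console, lines in sorted(scan_console_logs().items()):
            print("%s: %d suspicious line(s)" % (console, len(lines)))
            for line in lines:
                print("    " + line)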

Recent versions of ConMan are available through the OCF web site


http://www.llnl.gov/linux/conman


Genders
facilitates cluster configuration management.

A standard practice among LC system administrators is to codify all changes made to a cluster in a way that allows the changes to be quickly reapplied after a fresh installation. Genders facilitates this by enabling identical scripts to perform different functions depending on their context.

Genders is a simple, static, flat-file database (plain-text file) that represents the layout of a whole cluster. A Genders file (usually in /etc/genders) contains a list of node-name/attribute-list pairs, and a copy resides on each node in the cluster. Scripts then perform Genders-file lookups to configure each of many nodes appropriately. Genders also includes an rdist distfile preprocessor to expand attribute macros, so a central repository of system files can even propagate correctly to multiple clusters using this technique.
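
Because the file format is so simple, scripts in any language can consult it directly. The Python sketch below illustrates the node-name/attribute-list layout and a NODEATT-style attribute lookup; the parsing shown here (no hostlist ranges, no validation) is an assumption for illustration, not the behavior of the real Genders tools or APIs.

    def parse_genders(path="/etc/genders"):
        """Return {node: {attr: value-or-None}} from a genders-style file."""
        nodes = {}
        with open(path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()   # drop comments and blanks
                if not line:
                    continue
                fields = line.split(None, 1)           # "nodename attr1,attr2=val,..."
                node = fields[0]
                attrs = {}
                if len(fields) > 1:
                    for item in fields[1].split(","):
                        name, _, value = item.partition("=")
                        attrs[name] = value or None
                nodes.setdefault(node, {}).update(attrs)
        return nodes

    def nodes_with_attr(nodes, attr, value=None):
        """List node names carrying an attribute (optionally a specific value)."""
        return sorted(n for n, attrs in nodes.items()
                      if attr in attrs and (value is None or attrs[attr] == value))

    if __name__ == "__main__":
        # Example: list all nodes with a (hypothetical) "login" attribute,
        # the kind of test a configuration script might branch on.
        db = parse_genders()
        print(",".join(nodes_with_attr(db, "login")))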

Among the current Genders software tools are:

  • NODEATT--
    a query tool that lists all nodes with a specified attribute (useful as a conditional test in scripts).
  • DIST2--
    an rdist preprocessor that can quickly redistribute appropriate configuration file variations when a Genders file changes.
  • CROUTE--
    a Perl script that expresses network routing schemes (for load balancing) in a single configuration file.
  • C and Perl APIs--
    to query Genders files or, with the help of PDSH (below), to target a command to just those nodes that share a common Genders attribute.

Recent versions of Genders are available through the OCF web site


http://www.llnl.gov/linux/genders


Intelligent Platform Management Interface (IPMI)
is a standard specification, developed by Dell, HP, Intel, and NEC, for remotely monitoring and managing the physical status (temperature, air flow, power) of computer nodes. IPMI is implemented by hardware vendors at the chip level, so application users are often unaware of it. It relies on a "baseboard management controller" (BMC), a small processor that supports IPMI independently of each node's main CPU and its operating system.

CHAOS system administrators use several locally developed software tools to take advantage of IPMI features to check or control nodes on LC's large Linux clusters (all are shared as open source):

BMC-WATCHDOG
runs as a daemon that manages and monitors the baseboard management controller (BMC) watchdog timer, which supports several system timeout functions, including automatically resetting a node after an operating system crash.
IPMIPING
implements the IPMI ping (path checking) protocol, as well as the Remote Management Control Protocol (RMCP). Both are used mostly to debug IPMI over local area networks (a packet-level sketch follows this list).
IPMIPOWER
works in conjunction with PowerMan (see below) to remotely control compute-node power supplies by using IPMI.
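
To make the IPMIPING idea concrete, here is a minimal Python sketch that sends one RMCP/ASF Presence Ping to UDP port 623 and checks for a Presence Pong. The packet layout follows the published RMCP/ASF format as best understood here, and the target hostname is an assumption; treat this as a debugging illustration, not a substitute for the IPMIPING tool.

    import socket
    import struct

    RMCP_PORT = 623

    def rmcp_presence_ping(host, timeout=2.0, tag=0x01):
        """Return True if the BMC at 'host' answers with a Presence Pong."""
        # RMCP header: version 6, reserved, sequence 0xFF (no ACK), class 6 (ASF)
        rmcp = struct.pack("BBBB", 0x06, 0x00, 0xFF, 0x06)
        # ASF body: IANA 4542, type 0x80 (Presence Ping), tag, reserved, data len 0
        asf = struct.pack(">IBBBB", 4542, 0x80, tag, 0x00, 0x00)
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        try:
            sock.sendto(rmcp + asf, (host, RMCP_PORT))
            data, _ = sock.recvfrom(1024)
        except socket.timeout:
            return False
        finally:
            sock.close()
        # A Presence Pong carries ASF message type 0x40 at offset 8.
        return len(data) >= 9 and data[8] == 0x40

    if __name__ == "__main__":
        print(rmcp_presence_ping("node1-ipmi"))   # hypothetical BMC hostname
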
PAM
(Pluggable Authentication Modules for Linux) simplifies both user authentication and the management of resource limits for Linux clusters.

Strictly speaking, PAM is a suite of shared libraries designed to provide middleware that bridges "access applications" (such as FTP, LOGIN, PASSWD, and SSH) and authentication mechanisms (Kerberos, DCE, RSA SecurID tokens). All Red Hat Linux systems come with PAM by default because it makes authentication management much easier for system administrators.

In addition, CHAOS systems that use LCRM to control batch jobs and to manage job resources have taken advantage of PAM to simplify the use of resource limits as well. Starting in February 2006, system administrators no longer need to modify local LCRM configuration files to alter resource limits, because LCRM now lets PAM manage four CHAOS limits dynamically: core size, stack size, number of open file descriptors, and number of processes allowed for any single user.
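
For reference, a process can check the limits it actually inherited. The Python sketch below simply reads the four limits named above through the standard resource module; it is an observation aid only and does not interact with PAM or LCRM.

    import resource

    # The four limits that PAM manages dynamically for LCRM jobs.
    LIMITS = [
        ("core size (bytes)",     resource.RLIMIT_CORE),
        ("stack size (bytes)",    resource.RLIMIT_STACK),
        ("open file descriptors", resource.RLIMIT_NOFILE),
        ("processes per user",    resource.RLIMIT_NPROC),
    ]

    def fmt(value):
        """Render RLIM_INFINITY as 'unlimited' for readability."""
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    if __name__ == "__main__":
        for label, which in LIMITS:
            soft, hard = resource.getrlimit(which)
            print("%-22s soft=%s  hard=%s" % (label, fmt(soft), fmt(hard)))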

PDSH
executes commands on remote hosts in parallel.

The Parallel Distributed Shell (PDSH) utility is a multithreaded remote shell client for system administrators, similar to IBM's DSH tool but with better error handling. PDSH offers several remote shell services, including RSH, SSH, and Kerberos IV. It is designed to gracefully handle the usual kinds of node problems, such as when a target node is down or slow to respond. Using PDSH lets a system administrator:

  • execute commands across all nodes of a large cluster as if it were a single machine (simple commands can execute on over 1000 nodes of a typical cluster in less than 2 seconds), and
  • run small MPI jobs in parallel across the QsNet interconnect, which is helpful for trying parallel test cases or interconnect diagnostics on a new system that still lacks a regular resource manager.
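
The core idea behind the first capability is a bounded fan-out of remote shells. The Python sketch below illustrates only that idea, under the assumption that key-based SSH works to each node; it is not PDSH, and it omits PDSH's timeouts, hostlist handling, and careful error reporting. The node names are invented.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run_on(node, command):
        """Run 'command' on 'node' via ssh; return (node, exit status, output)."""
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, command],
            capture_output=True, text=True)
        return node, proc.returncode, proc.stdout.strip()

    def pdsh_like(nodes, command, fanout=32):
        """Run 'command' on every node, at most 'fanout' connections at once."""
        with ThreadPoolExecutor(max_workers=fanout) as pool:
            for node, status, out in pool.map(lambda n: run_on(n, command), nodes):
                prefix = node if status == 0 else "%s (rc=%d)" % (node, status)
                print("%s: %s" % (prefix, out))

    if __name__ == "__main__":
        pdsh_like(["node%d" % i for i in range(1, 9)], "uname -r")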

Recent versions of PDSH are available through the OCF web site


http://www.llnl.gov/linux/pdsh


PowerMan
manages clustered-system power controllers.

Power management for a large cluster poses challenges very like those posed by console management (above). To minimize circuit loads or focus repairs, a system administrator may want to boot an entire cluster, just one rack, or even an individual node. And as with consoles, remote power control is important when the cluster resides in a different building than the staff.

PowerMan therefore provides a remote, command-line interface for a wide variety of power-control and monitoring devices, through a TCP network connection. There is no standard protocol for a power-control device interface, so PowerMan offers a flexible configuration that can adapt to almost any hardware. PowerMan can query both plug and power supply output status, and it can power-on, power-off, power-cycle, and hard-reset individual nodes or node ranges in a CHAOS cluster. Where hardware allows, PowerMan can also flag nodes needing service and gather out-of-band temperature data.
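
Because each vendor's controller speaks its own protocol, PowerMan's real work is in its per-device configuration scripts. The Python sketch below shows only the general shape of such an interaction over TCP; the host name, port, and ASCII command set ("on N", "off N", "status N") are invented for illustration and do not correspond to PowerMan's configuration language or to any particular device.

    import socket

    class PowerController:
        """Talk to one (imaginary) network power controller over TCP."""

        def __init__(self, host, port=23, timeout=5.0):
            self.sock = socket.create_connection((host, port), timeout)

        def command(self, text):
            """Send one ASCII command line and return the raw reply."""
            self.sock.sendall((text + "\r\n").encode("ascii"))
            return self.sock.recv(4096).decode("ascii", "replace").strip()

        def power_on(self, plug):
            return self.command("on %d" % plug)

        def power_off(self, plug):
            return self.command("off %d" % plug)

        def status(self, plug):
            return self.command("status %d" % plug)

    if __name__ == "__main__":
        pdu = PowerController("rack1-pdu")      # hypothetical controller name
        for plug in range(1, 9):                # query eight plugs
            print(plug, pdu.status(plug))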

Recent versions of PowerMan are available through the OCF web site


http://www.llnl.gov/linux/powerman


WHATSUP
quickly detects and reports which nodes are currently up and down within the Linux cluster where you run it. When executed (by any user, not just administrators) with no options, WHATSUP prints the count and name list of up nodes, followed by the count and name list of down nodes, and then exits. Options let you report only up nodes (--up), only down nodes (--down), or list the nodes separated by commas (--comma), newlines (--newline), or blanks (--space) instead of the default summary.
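
A rough approximation of this behavior is easy to sketch. The Python example below counts a node as "up" if it accepts a TCP connection on the SSH port within a short timeout, a much weaker test than the one the real WHATSUP tool uses; the node names and the separator argument (mimicking --comma, --newline, and --space) are illustrative assumptions.

    import socket

    def is_up(node, port=22, timeout=0.5):
        """Crude liveness test: does the node accept a TCP connection?"""
        try:
            socket.create_connection((node, port), timeout).close()
            return True
        except OSError:
            return False

    def whatsup(nodes, sep="\n"):
        """Print up/down counts and name lists, whatsup-style."""
        up = [n for n in nodes if is_up(n)]
        down = [n for n in nodes if n not in up]
        print("up:   %d" % len(up))
        print(sep.join(up))
        print("down: %d" % len(down))
        print(sep.join(down))

    if __name__ == "__main__":
        # sep="," mimics --comma; " " mimics --space; "\n" mimics --newline.
        whatsup(["node%d" % i for i in range(1, 5)], sep=",")
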
YACI
installs the operating system on cluster nodes.

Livermore Linux clusters currently run with a full copy of the CHAOS/Red Hat operating system on every node. The alternative, a root file system shared across nodes, raised concerns about the performance and reliability of network file servers (as well as complications regarding the integrity of Red Hat "packaging"). Keeping a copy on every node, however, means that deploying or upgrading the operating system requires a scalable way to transfer (or "image") the operating system onto a large number of nodes. No available open source tools (such as VA System Imager and LUI) met local needs, so the CHAOS staff developed YACI (Yet Another Cluster Installer).

A YACI installation begins by creating a disk partition, placing the needed files in it as a staging area, and converting them into compressed TAR images ("tarballs"). Cluster installation continues by installing the management node(s) from a YACI CD-ROM and then network booting a stand-alone image onto the remaining nodes. Finally, each stand-alone image partitions its node's local disk and unpacks the tarballs made from the original staging partition. Once the management node(s) are configured, YACI can install CHAOS on the remaining 1152 nodes of LC's MCR cluster in about 50 minutes.
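
The tarball step itself is ordinary archive handling. The Python sketch below illustrates just that piece, packing a staged root tree into a compressed image and unpacking it onto a prepared file system; the paths are invented, and real installs also handle partitioning, boot loaders, and per-node configuration, which this skips.

    import tarfile

    STAGING = "/tftpboot/staging/chaos"     # hypothetical staged root tree
    IMAGE   = "/tftpboot/images/chaos.tgz"  # hypothetical compressed image
    TARGET  = "/mnt/newroot"                # freshly made local file system

    def build_image(staging=STAGING, image=IMAGE):
        """Pack the staged tree into a gzip-compressed tarball."""
        with tarfile.open(image, "w:gz") as tar:
            tar.add(staging, arcname=".")

    def deploy_image(image=IMAGE, target=TARGET):
        """Unpack the tarball onto the node's local disk."""
        with tarfile.open(image, "r:gz") as tar:
            tar.extractall(path=target, numeric_owner=True)

    if __name__ == "__main__":
        build_image()
        deploy_image()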

Recent versions of YACI are available through the OCF web site


http://www.llnl.gov/linux/yaci



