CHAOS: Linux from Livermore
Local support for a high-performance interconnect (internal network)
among CHAOS cluster nodes developed in several stages.
First, the Linux Project ported the Quadrics QsNet device drivers
and related software from Compaq Alpha chips running Compaq's
proprietary Tru64 version of UNIX to the same chips but running
Red Hat Linux. As a result, QsNet under Linux not only ran
reliably but also slightly outperformed the original Tru64 version
(with a maximum bandwidth of 210 Mbyte/s).
Second, the Linux Project staff shifted focus to Intel chips and
locally modified Red Hat Linux for QsNet support.
Quadrics, for their part, released most of their software under an
open source license and concentrated their business on Linux platforms.
Meanwhile, the LC collaborators modified the system kernel used
locally (now called CHAOS) to once again support QsNet in three ways:
- LC added (improved) device drivers for Quadrics Elan3.
- LC included the Quadrics software environment to run
parallel jobs across a cluster (such as libelan,
a low-level library of message-passing functions).
- LC packaged with CHAOS the Quadrics MPING ping-pong
test, as part of a basic MPI test suite.
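A ping-pong test of the kind mentioned in the last item sends a message to a peer, waits for the peer to send it straight back, and times many such round trips to estimate latency and bandwidth. The sketch below illustrates only that measurement idea. It is not MPING, and it uses a local socket pair between two threads rather than MPI or QsNet, so its numbers say nothing about an interconnect. All function names and the default message size are hypothetical choices made for this illustration.

```python
import socket
import threading
import time


def recv_exact(sock, n):
    """Read exactly n bytes from sock, looping over partial reads."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def echo_peer(sock, msg_size, rounds):
    """The 'pong' side: receive each full message and send it back."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, msg_size))


def ping_pong(msg_size=65536, rounds=200):
    """Return (seconds per round trip, Mbyte/s) for a local ping-pong run.

    Assumption: a socket pair inside one process stands in for the
    two cluster nodes that a real interconnect test would use.
    """
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo_peer, args=(b, msg_size, rounds))
    peer.start()

    payload = b"x" * msg_size
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)          # ping
        recv_exact(a, msg_size)     # wait for the pong
    elapsed = time.perf_counter() - start

    peer.join()
    a.close()
    b.close()

    round_trip = elapsed / rounds
    # Each round moves the message twice (out and back); 1 Mbyte = 10**6 bytes.
    bandwidth = 2 * msg_size * rounds / elapsed / 1e6
    return round_trip, bandwidth


if __name__ == "__main__":
    rt, bw = ping_pong()
    print(f"round trip: {rt * 1e6:.1f} microseconds, bandwidth: {bw:.1f} Mbyte/s")
```

Small messages mostly expose latency, while large messages approach the peak bandwidth of the link, so tests of this kind are usually repeated over a range of message sizes.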
The QsNet interconnect is now available on some LC CHAOS-based
Linux clusters (such as Thunder, Lilac, and ALC).
Besides its direct benefits, it enables other, higher-level system
features, such as a scalable parallel file system (next section).
The Elan Communication Library (libelan, mentioned above) helps
optimize MPI behavior on Linux clusters with the Quadrics switch.
Twenty-one environment variables (most begin with the characteristic
string LIBELAN_) control this library. A current list of these
variables and their specific roles is posted on an (open-network only)
web site. They allow you to manage the impact of the library to:
- work around application hangs caused by communication problems,
- improve code performance under Linux (CHAOS),
- handle large amounts of message-passing memory, and
- enable Elan library support for MPI debugging.
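Because most of these variables share the LIBELAN_ prefix, one quick way to see which of them are active in a job's environment is to filter the environment by that prefix. The following is a minimal sketch under that assumption. The function name is hypothetical, no specific variable names are assumed, and the few variables that do not begin with LIBELAN_ will not be found this way.

```python
import os


def libelan_settings(environ=None, prefix="LIBELAN_"):
    """Return a dict of the variables whose names start with prefix.

    environ defaults to the current process environment; passing a
    plain dict instead makes the function easy to try out.
    """
    env = os.environ if environ is None else environ
    return {name: env[name] for name in sorted(env) if name.startswith(prefix)}


if __name__ == "__main__":
    settings = libelan_settings()
    if not settings:
        print("No LIBELAN_ variables are set in this environment.")
    for name, value in settings.items():
        print(f"{name}={value}")
```

Running this inside a batch script just before the parallel launch records the library settings that the job actually inherited, which can help when comparing runs.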
INFINIBAND (OPENIB) SUPPORT:
In 2003 an ASC PathForward project began to promote commercial
support for a much faster interconnect called Infiniband.
The national laboratories worked with industrial partners and
open-source software efforts in a collaboration (partly ASC funded).
By 2005, the first high-performance computing (HPC) release of
OpenIB became available.
By 2006, LC began installing clusters (the Peloton procurement,
involving machines such as Atlas and Zeus (OCF), and Rhea
and Minos (SCF)) that featured an Infiniband internal network.
CHAOS evolved to support this switch innovation.
By May 2007, CHAOS version 3.2 was developed specifically for
such clusters and was deployed (exclusively) on them, with OpenIB