CHAOS: Linux from Livermore
Scalable Parallel File System
One major service goal of CHAOS is to enable a special-purpose file
system with these unusual properties:
- allows several processes to successfully read from or write to
(parts of) the same file at the same time (parallel I/O), including
the case where not all parts of "a file" are contiguous on a single disk.
- can grow very large, spanning many disks, with no loss of functionality.
- is available to all nodes in a cluster, even a very large cluster,
at the same time (and perhaps later,
to nodes spread across many clusters,
as with LC's NFS-mounted common home directories).
- allows jobs to readily access their own files, even when distributed
across devices in a file system, while protecting each job's files
from interference by other jobs running at the same time (with
access authentication suitable for LLNL secure networks).
This goal has been approached in three phases, each of which yielded
some technical refinements that helped enable the next phase.
- Compaq Partnership (2000)--
sought to port the Petal/Frangipani research file system to Compaq
(Alpha-chip) Linux nodes, taking advantage of earlier work on
high-bandwidth transport of file-system information.
- ASCI PathForward (2001)--
collaborated with Cluster File Systems Inc. to develop an experimental
open-source, distributed, object-based file system with the properties
specified above, for Linux nodes that use Intel chips.
- Lustre Lite (current phase)--
refocused the previous effort on providing a practical, scaled-down
version of a POSIX-conformant parallel file system for single clusters
(that is, not globally mounted),
implemented at Livermore Computing.
- One aspect of this work involved moving beyond the usual
"redundant array of inexpensive disks" (RAID) by designing distributed
file storage using a set of virtual disks or "Object Storage Targets"
(OSTs). Each OST is a self-managed CPU/drive combination.
Another aspect of this work involved recruiting BlueArc Corporation to
build a suitable underlying "storage appliance" to implement the OST design.
Lustre Lite has been deployed on LC Linux/CHAOS clusters,
and even cross-mounted among multiple LC clusters.
Some scalability and reliability problems persist (see below).
User information for the current Lustre implementation, which compares
the actual features of LC's Lustre file systems with those of IBM's
GPFS and points out known pitfalls involving directory names
and interactions with MPI-IO, is available in the
online I/O Guide for LC.
REFINEMENTS WITH CHAOS 3.2:
Deploying CHAOS 3.2 (and later), which includes Lustre 1.4.8,
starting in May 2007 addressed three serious operational problems with
earlier Lustre versions--
- Page Cache Flush--
Under CHAOS 3.2, the SLURM epilog script that runs
immediately after every job script now flushes all
Lustre pages (clean and dirty) from the page cache.
This guarantees that the next job starts with all memory available and
with no interference from delayed I/O.
- Assertion Failures--
Under CHAOS 3.2 (Lustre 1.4.8), any Lustre assertion failure on a
compute node causes the node to panic and its jobs to terminate completely.
Previous Lustre versions allowed nodes with assertion failures to
lapse into a strange, partly failed state.
- FLOCK and FCNTL--
Under CHAOS 3.2, system calls to FLOCK(2) and FCNTL(2) to
lock Lustre files always return an error.
This may affect some MPI-IO and HDF5 software.
Previously, separate tasks running on different client nodes could each
use FLOCK or FCNTL to obtain an "exclusive" lock on the same file
at the same time, clearly an operational mistake.