
UCRL-WEB-200040

CHAOS: Linux from Livermore


Scalable Parallel File System

GOALS:
One major service goal of CHAOS is to enable a special-purpose file system with these unusual properties:

  • PARALLEL--
    allows several processes to read from or write to (parts of) the same file at the same time (parallel I/O), including the case where the parts of "a file" are not contiguous on a single disk (see the MPI-IO sketch after this list).
  • SCALABLE--
    can grow very large, including many disks, with no loss of functionality or performance.
  • GLOBAL--
    is available to all nodes in a cluster, even a very large cluster, at the same time (and perhaps later, to nodes spread across many clusters, as with LC's NFS-mounted common home directories).
  • SECURE--
    allows jobs to readily access their own files, even if distributed across devices in a file system, while protecting each job's files from interference by other jobs running at the same time (with access authentication suitable for LLNL secure networks).
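
The PARALLEL property is the one most easily seen in code. The sketch below is a generic, heavily simplified illustration (not LC-supplied code): every MPI task writes its own nonoverlapping block of one shared file through MPI-IO, so no single node funnels the traffic. The file name and block size are placeholders, not LC conventions.

      /* Sketch only: each MPI rank writes its own block of one shared file. */
      #include <mpi.h>
      #include <string.h>

      #define BLOCK (1<<16)                /* 64 KiB per rank (example value) */

      int main(int argc, char **argv)
      {
          int rank;
          char buf[BLOCK];
          MPI_File fh;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          memset(buf, 'a' + (rank % 26), BLOCK);   /* fill with rank-specific data */

          /* All ranks open the same file and write concurrently; a parallel
             file system services these writes without serializing them.     */
          MPI_File_open(MPI_COMM_WORLD, "shared.dat",       /* placeholder name */
                        MPI_MODE_CREATE | MPI_MODE_WRONLY,
                        MPI_INFO_NULL, &fh);
          MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                            MPI_BYTE, MPI_STATUS_IGNORE);
          MPI_File_close(&fh);

          MPI_Finalize();
          return 0;
      }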

PHASES:
This goal has been approached in three phases, each of which yielded some technical refinements that helped enable the next phase.

  • Compaq Partnership (2000)--
    sought to port the Petal/Frangipani research file system to Compaq (Alpha-chip) Linux nodes, taking advantage of earlier QsNet work for high-bandwidth transport of file-system information.
  • ASCI PathForward (2001)--
    collaborated with Cluster File Systems Inc. to develop an experimental open-source, distributed, object-based file system with the properties specified above, for Linux nodes that use Intel chips.
  • Lustre Lite (current phase)--
    refocused the previous effort on providing a practical, scaled-down version of a POSIX-conformant parallel file system for single (that is, not globally mounted) clusters implemented at Livermore Computing.
    • One aspect of this work involved moving beyond the usual "redundant array of inexpensive disks" (RAID) approach by designing distributed file storage around a set of virtual disks, or "Object Storage Targets" (OSTs), instead. Each OST is a self-managed CPU/drive combination (see the striping sketch after this list).
    • Another aspect of this work involved recruiting Blue Arc Corporation to build a suitable underlying "storage appliance" to implement the OST design. Lustre Lite has been deployed on LC Linux/CHAOS clusters, and even cross-mounted among multiple LC clusters. Some scalability and reliability problems persist (see below).
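
The OST design distributes each file's bytes across several storage targets, normally in a round-robin (RAID-0-style) pattern. The sketch below is purely conceptual, not Lustre code: it computes, for a few example byte offsets, which OST would hold the data and where, assuming an example stripe size of 1 MiB spread over 4 OSTs.

      /* Conceptual sketch (not Lustre code): map file offsets onto OSTs
         for round-robin striping with stripe size S over N OSTs.        */
      #include <stdio.h>

      int main(void)
      {
          const long long S = 1 << 20;   /* stripe size: 1 MiB (example)          */
          const int N = 4;               /* file striped across 4 OSTs (example)  */
          long long offsets[] = { 0, 500000, 1048576, 5242880, 10485761 };

          for (int i = 0; i < 5; i++) {
              long long b = offsets[i];
              long long stripe = b / S;                     /* which stripe       */
              int ost = (int)(stripe % N);                  /* OST holding it     */
              long long local = (stripe / N) * S + (b % S); /* offset in object   */
              printf("file offset %10lld -> OST %d, object offset %lld\n",
                     b, ost, local);
          }
          return 0;
      }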

USAGE ADVICE:
User information for the current Lustre implementation is available in the Lustre section of the online I/O Guide for LC. That section compares the actual features of LC's Lustre file systems with those of IBM's GPFS and points out known pitfalls involving directory names and interactions with MPI-IO.

REFINEMENTS WITH CHAOS 3.2:
The deployment of CHAOS 3.2 (and later), which includes Lustre 1.4.8, began in May 2007 and addressed three serious operational problems with earlier Lustre versions--

  • Page Cache Flush--
    Under CHAOS 3.2, the SLURM epilog script that always executes immediately after your job script now flushes Lustre pages (clean and dirty) from the page cache after every job. This guarantees that the next job starts with all memory available and with no interference from delayed I/O.
  • Assertion Failures--
    Under CHAOS 3.2 (Lustre 1.4.8), any Lustre assertion failure on a compute node causes the node to panic and its jobs to terminate completely. Previous Lustre versions allowed nodes with assertion failures to lapse into a strange, partly failed state.
  • FLOCK and FCNTL--
    Under CHAOS 3.2, system calls to FLOCK(2) and FCNTL(2) to lock Lustre files always return an error. This may affect some MPI-IO and HDF5 software (a sketch for detecting the condition at run time follows this list). Previously, separate tasks running on different clients could use FLOCK or FCNTL to simultaneously obtain exclusive locks on the same file, which was clearly an operational mistake.
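
Codes that rely on POSIX file locking can probe for this condition at run time rather than failing mysteriously. The sketch below is a generic illustration, not LC-supplied code; the lock-file path is a placeholder.

      /* Sketch: probe whether fcntl() locking works on a given file.
         On Lustre under CHAOS 3.2 the lock request always fails.      */
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
          const char *path = "testlock.dat";         /* placeholder path */
          int fd = open(path, O_RDWR | O_CREAT, 0600);
          if (fd < 0) { perror("open"); return 1; }

          struct flock fl;
          memset(&fl, 0, sizeof(fl));
          fl.l_type = F_WRLCK;       /* request an exclusive write lock  */
          fl.l_whence = SEEK_SET;    /* l_start = l_len = 0: whole file  */

          if (fcntl(fd, F_SETLK, &fl) == -1) {
              /* Taken on CHAOS 3.2 Lustre; MPI-IO or HDF5 layers that
                 assume locking works must be configured to avoid it.   */
              fprintf(stderr, "file locking unavailable: %s\n", strerror(errno));
          } else {
              fl.l_type = F_UNLCK;   /* locking works; release the lock  */
              fcntl(fd, F_SETLK, &fl);
              printf("file locking available\n");
          }
          close(fd);
          return 0;
      }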


