I/O Guide for LC
High-performance input and output (I/O), including parallel I/O, is crucial for the overall success of large-scale computer simulations, such as the ASC simulations on which LC's production computing resources are focused. Planning for effective, scalable I/O by application codes is important because:
- Expensive computer time spent performing I/O is lost to the primary task of simulating physical processes. When wall-clock time for I/O grows beyond 25% of a code's total run time, for example, the code's I/O techniques are usually the cause of serious overall performance problems.
- Simulation data, and the time and cycles spent creating them, are wasted if not well managed and effectively saved for future reuse (such as visualization).
- Even with LC's large local disks and storage media, I/O can be a bottleneck in high-performance computing unless applications use the most appropriate techniques for reading and writing their data. The separation of compute nodes and I/O nodes can make this bottleneck worse.
LLNL's Scalable I/O Project has developed an "end-to-end model of the I/O stack" to make clear the layers that data pass through from an application code to physical storage:
[Diagram: the end-to-end I/O stack, with layers running from Data Models and Formats at the top down to physical storage.]
This guide provides an alphabetical glossary of just those key I/O terms needed to understand the later treatments of LC-related I/O tools and techniques. One major section explains the features and local use of the Hierarchical Data Format (HDF), designed to promote efficient large-scale I/O. (HDF use by the I/O-intensive FLASH hydrodynamics code is also analyzed.) Portable, parallel I/O with the MPI library is discussed as well, both in terms of general data-access routines and local implementation constraints. Another section contrasts the General Parallel File System (GPFS, which supports parallel I/O on LC IBM clusters) with NFS, also in use on LC machines. A separate discussion introduces the design features (such as data and metadata separation) and local implementation details (such as file system names and sizes) of Lustre, the open source parallel file system that LC deploys on its Linux/CHAOS (Clustered High Availability Operating System) clusters.
For help, contact the LC Hotline at 925-422-4531 or via e-mail (OCF: email@example.com, SCF: firstname.lastname@example.org).
This section provides an alphabetical set of brief explanations for the unusual technical terms that appear throughout this guide as I/O issues and features are discussed. The glossary here is intended to make the text of this manual easier to use, not to offer a comprehensive I/O dictionary.
- Disk striping
- distributes file data across multiple disks for speed and safety. The amount of consecutive data stored on each disk is the "striping unit" or "stripe width," which may be the block size (as on IBM's GPFS), or multiple blocks, or just a few bits. "Declustering" is sometimes a synonym for disk striping. See also the Lustre Striping section below for some relevant user tools.
- Diskless nodes
- are compute nodes in a Linux/CHAOS cluster that have no local hard disks. At LC, the Peloton machines (such as Atlas and Zeus on OCF, Rhea and Minos on SCF) have diskless nodes. The advantage is that there are fewer disk drives to fail, thus increasing overall reliability. But diskless nodes have no swap space, so any application that runs out of memory on those nodes will be terminated by the CHAOS OOM (Out Of Memory) killer. Also, /tmp and /var/tmp on diskless nodes use RAM, not disk. So CHAOS purges those file systems completely between jobs on diskless nodes to reclaim the memory used. Jobs there must use HPSS, Lustre, or /nfs/tmp* for output to survive after the job ends.
- F/b ratio
- measures the effectiveness of an I/O system. F is the rate of executing floating-point operations and b is the rate of performing I/O, so F/b = 1 means one bit of I/O occurs for every floating-point operation. For example, a code that sustains 10^11 floating-point operations per second while averaging 10^9 bits per second of I/O runs at F/b = 100. While F/b = 1 is sometimes thought to be the ideal for supercomputers, real-life F/b ratios are often closer to 100 for scientific applications and closer to 10 for I/O-intensive applications.
- GPFS
- is IBM's commercial General Parallel File System. This is the file system installed for parallel I/O on LC's IBM/AIX massively parallel production computers. (See the GPFS at LC section below for local implementation details.)
- HDF
- is Hierarchical Data Format, a standard way to organize files internally and a supporting I/O library. Both are designed to promote efficient large-scale I/O for scientific applications (see the HDF section below for LC's local implementation details).
- Lustre
- is an open source parallel file system that is installed for parallel I/O on LC's Linux/CHAOS massively parallel production computers. (See the Lustre section below for LC's design constraints, a feature comparison with GPFS, and local implementation details.)
- Parallel file system
- is a file system specifically designed to allow simultaneous reads and writes to nonoverlapping regions of the same (logical) file, simultaneous reads and writes of different files, and distribution (striping) of file data across several I/O nodes or disks (or both), especially for large files. GPFS and Lustre are examples of parallel file systems in use at LC.
- Parallel I/O subsystem
- is a way to transfer data in parallel between compute nodes and dedicated I/O nodes within the same massively parallel machine. The parallel I/O subsystem takes advantage of the machine's high-speed internal switch to handle small requests efficiently, yet it can scatter I/O operations among many nodes to efficiently distribute large files, too. I/O occurs internally across the parallel I/O subsystem, then externally across high-bandwidth channels to mass-storage servers.
- RAID
- is a redundant array of inexpensive (or independent) disks. RAID technology provides high reliability for stored data by striping the data across several disks in a way that uses more disk space than without striping but maintains parity so that lost data can be reconstructed even if one disk in the array fails. LC's globally mounted NFS disks (such as for the global home directories) use RAID.
Hierarchical Data Format (HDF) is a (specification for a) file format and a supporting I/O library for storing technical data. HDF is designed to promote efficient large-scale I/O for scientific applications running in high-performance computing environments. Hierarchical Data Format 5 (called HDF5) replaces an earlier, and incompatible, attempt to meet similar goals (called HDF4). For more information, visit the HDF Group's HDF5 home page.
HDF5 is available for both AIX (IBM Unix) and Linux operating systems, and most features are supported on both systems.
Your program manipulates HDF5 files by using calls to the HDF5 I/O library. On LC Linux machines, HDF5 installations are in /usr/local/tools/hdf5*. The 1.6.5 and 1.8.1 versions are maintained for GNU, Intel, PGI, and PathScale compilers. On AIX machines, current HDF5 versions are in /usr/local/tools/hdf5.
The HDF5 tools reside in a /bin directory several layers (depending on the host) below /usr/local/tools/hdf5/hdf5-version on each LC production machine. See the HDF Operations section for details on using the HDF5 library and tools locally. See I/O Analysis of FLASH below for discussion of the benefits and pitfalls of trying to use HDF5 to support a hydrodynamics code that performs extensive I/O.
A tutorial, user guide, and other HDF5 documentation are available online at
HDF5 files are binary containers for efficiently holding scientific data in an organized way, with explicit supporting metadata to facilitate later reuse.
HDF5 files consist of:
Groups that behave like UNIX directories (they organize data hierarchically). Groups can contain other child groups or point to other groups. Every HDF5 group has three attributes that overtly declare its:
- NAME (one "root group" is always named "/").
- PATHNAME (called its "OBJ-XID").
- IMMEDIATE PARENT GROUP.
Datasets within groups that behave like UNIX files within directories, except that they too have overt structures and supporting metadata. Every HDF5 dataset includes these features:
- Dataspace. A way to overtly declare the number and range of the dimensions needed for the subsequent scientific data.
- Datatype. A way to specify how to interpret the data, such as array, compound, or atomic. "Atomic" HDF5 datasets can have their byte order, size, and sign declared, along with any of these subtypes: INTEGER, FLOAT, STRING, TIME, BITFIELD, OPAQUE, OBJECT REFERENCE, REGION REFERENCE, ENUM(ERATION).
- Data. The actual "lowest level" output from or input to your application program (optionally empty).
Because HDF5 files are hierarchically organized and encoded with overt attributes, they can be represented by and manipulated using XML (the ISO standard "eXtensible Markup Language"). The NCSA HDF group is developing a comprehensive suite of standards and tools for using XML with HDF5. For details see
Your program manipulates HDF5 files by using calls to the HDF5 I/O library. Among the most important operations on HDF5 files are file creation, file display, hyperslab selection, parallel HDF5, and HDF5 support in IDL.
The include file hdf5.h (for C) or the module HDF5 (for Fortran) contains definitions and declarations that you must use in any program that invokes the HDF5 library. A call to routine H5Fcreate (C) or h5fcreate.f (Fortran) creates a new HDF5 file, returns its file identifier, and lets you specify its:
- File name.
- Access mode (to control reads and writes).
- Creation property list (to control metadata; defaults available).
- Access property list (to control methods of performing I/O).
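To make the sequence concrete, here is a minimal C sketch (the file name example.h5 and dataset name /dset are hypothetical; HDF5 1.8 API) that creates a file, overtly declares a dataspace and datatype, and writes one small dataset, touching each of the file features described above:

```c
#include "hdf5.h"

int main(void)
{
    /* Create the file; H5F_ACC_TRUNC overwrites any existing file, and the
       two H5P_DEFAULT arguments accept the default creation and access
       property lists described above. */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    /* Dataspace: overtly declare a 2-D extent of 4 x 6 elements. */
    hsize_t dims[2] = {4, 6};
    hid_t space = H5Screate_simple(2, dims, NULL);

    /* Datatype and dataset: 32-bit big-endian integers in the root group. */
    hid_t dset = H5Dcreate(file, "/dset", H5T_STD_I32BE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Data: fill and write the array; H5T_NATIVE_INT describes the
       in-memory type, and the library converts to the file datatype. */
    int data[4][6];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++)
            data[i][j] = i * 6 + j;
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```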
Once you have created, expanded, or altered an HDF5 file, you can display its contents (groups, attributes, etc.) in human-readable form by invoking any of several software tools provided for this purpose. A few important HDF5 tools are h5copy, h5dump, and h5ls. The HDF5 tools reside in a /bin directory several layers below /usr/local/tools/hdf5/hdf5-version on each LC production machine. Each HDF5 tool, if run without options, displays several screens of text summarizing its usage syntax and available options, then ends. One of the most helpful tools is H5DUMP, which outputs an ASCII text display in Backus-Naur Form by default, or encoded in XML if you request it with the -x option. (See also the comments on using the IDL library for HDF5 output, below.)
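Run on the hypothetical example.h5 file created above, h5dump would produce a display along these lines (an abbreviated sketch, not verbatim output):

```
HDF5 "example.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
      (0,0): 0, 1, 2, 3, 4, 5,
      ...
      }
   }
}
}
```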
One way to read existing HDF5 files is by invoking H5Sselect_hyperslab (C) or h5sselect_hyperslab.f (Fortran), which extracts a "hyperslab" from an HDF5 dataset. A hyperslab can be a logically contiguous set of points, or a regular pattern of points or blocks even if noncontiguous.
Hyperslab selection from HDF5 datasets is so flexible that you can read from a dataset with one size, shape, and datatype, and then write into a dataset with a different size, shape, and datatype. For example, you can read blocks from a 2-D array of 32-bit floats and then write that data into a contiguous sequence of 64-bit floats at a specified offset in a 1-D array.
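Here is a minimal C sketch of that flexibility (again using the hypothetical example.h5 file from above; HDF5 1.8 API): a 2 x 3 hyperslab of the 4 x 6 dataset is read into a flat 1-D memory buffer, which works because only the number of selected elements must match between the file and memory dataspaces.

```c
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen(file, "/dset", H5P_DEFAULT);
    hid_t filespace = H5Dget_space(dset);

    /* Select a 2 x 3 hyperslab starting at row 1, column 2; the NULL
       stride and block arguments default to 1 in each dimension. */
    hsize_t start[2] = {1, 2};
    hsize_t count[2] = {2, 3};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* Memory dataspace: a flat 1-D buffer for the same six elements. */
    hsize_t mdims[1] = {6};
    hid_t memspace = H5Screate_simple(1, mdims, NULL);

    int buf[6];
    H5Dread(dset, H5T_NATIVE_INT, memspace, filespace, H5P_DEFAULT, buf);

    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```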
A parallel HDF5 API is supported for C and Fortran but not for C++. Parallel I/O on HDF5 files always involves the MPI concept of a "communicator," a specified set of processes that pass messages to each other. For parallel HDF5 I/O, each process in an MPI communicator invokes H5Pcreate (C) or h5pcreate.f (Fortran) to create an "access template" and obtain a file's access property list, and invokes H5Pset_fapl_mpio (C) or h5pset_fapl_mpio.f (Fortran) to initiate parallel I/O access.
With parallel HDF5 I/O:
- All parts of the file are accessible by all MPI processes.
- All objects in the file are accessible by all processes.
- Multiple processes can write to the same dataset (or, optionally, to individual datasets).
The HDF5 documentation referenced at the beginning of this section includes annotated programming examples (in C and Fortran) of performing parallel I/O on HDF5 files.
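In outline, such a program follows the C pattern sketched below (hypothetical file and dataset names; assumes an MPI-enabled HDF5 1.8 build): every process creates the access template, the file and dataset are created collectively, and each process writes its own row of one shared dataset, here with collective transfer requested as well.

```c
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Access template: request MPI-IO access across the communicator. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* All processes create (open) the same file collectively. */
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One shared dataset with one row per process. */
    hsize_t dims[2] = {(hsize_t)nprocs, 8};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "/rows", H5T_NATIVE_INT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each process selects its own row of the shared dataset... */
    hsize_t start[2] = {(hsize_t)rank, 0};
    hsize_t count[2] = {1, 8};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* ...and asks for collective transfer on the write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    int row[8];
    for (int j = 0; j < 8; j++)
        row[j] = rank * 8 + j;
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, row);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Pclose(fapl);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```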
On all production machines, LC offers a licensed commercial library and tools together called Interactive Data Language (IDL). IDL is really a general data exploration and visualization language designed for writing high-level data-analysis programs much more compactly than with C or Fortran, but HDF5 is one of four "self-describing scientific data formats" that IDL routines can read and query (but not write). IDL acknowledges the following limitations when reading HDF5 files:
- No datatype conversion (until after the data is read).
- Only the topmost HDF5 error message is printed from the stack.
- No support for variable-length or opaque datatypes.
- No property-interface support.
- No writes.
On LC machines, the IDL libraries reside in /usr/global/tools/ITT/idl71/bin/bin.linux.x86_64. The documentation for using "Callable IDL" is available in Chapter 16 of /usr/global/tools/ITT/docs/edg.pdf. (See also the comments on using native HDF5 tools for output instead, in the File Display section above.)
A parallel I/O interface allows programs with many processes on many nodes to coordinate their I/O read and write operations for greater efficiency. This parallel interface is called MPI-IO.
Implementing successful parallel I/O, including MPI-IO, requires the underlying support of a parallel file system (such as IBM's GPFS or Linux-oriented Lustre). This diagram shows why a traditional file system causes expensive inefficiency when users attempt parallel I/O without proper hardware support.
One inefficient approach concentrates all read and write requests at a single I/O node (a). Another allows several nodes to read or write data (b), but only to separate (logically distinct) files that have to be somehow merged later. A parallel file system (c) not only supports I/O from many nodes at the same time, but also transfers the data to (different parts of) a single logical file, even if the file is "striped" across multiple physical disks (for safety, convenience, or speed).
With MPI-IO, data moves between processes and files using a variety of specific read and write calls, all variations on MPI_FILE_READ and MPI_FILE_WRITE. You select from these many routines to address three independent aspects of parallel data access: positioning, synchronism, and coordination.
MPI-IO data access routines provide three types of positioning (which you can mix safely within the same program):
- Explicit Offsets. Explicit offset routines perform data access at a file position that you give explicitly as an argument (no file pointer used). All such routine names end in _AT (such as MPI_FILE_WRITE_AT).
- Individual File Pointers. Each I/O operation with a file pointer leaves the pointer pointing to the next data item after the one last accessed by the operation (example: MPI_FILE_WRITE unqualified).
- Shared File Pointers. These behave much like individual file pointers (above), but each routine ends in _SHARED (noncollective) or in _ORDERED (collective).
MPI-IO offers both blocking and nonblocking I/O routines (the synchronism aspect):
- Blocking. Blocking I/O calls will not return until their I/O request is completed. Blocking is the default approach (e.g., with MPI_FILE_WRITE or MPI_FILE_WRITE_AT); special routines request nonblocking I/O (next).
- Nonblocking. Nonblocking I/O calls start an I/O operation but do not wait for it to complete, which can allow data transfer to proceed simultaneously with computation if the hardware permits. To confirm that your data was actually read or written, however, you must use a separate "request complete" call (such as MPI_WAIT). Nonblocking versions of MPI routines all have names of the form MPI_FILE_Iaaa (where I is for "immediate," such as MPI_FILE_IWRITE).
MPI data access routines may be noncollective or collective, a measure of their dependence on other members of their process group (the coordination aspect):
- Noncollective. Completion of a noncollective call depends only on the calling process itself. The default MPI routines above (such as MPI_FILE_WRITE) perform noncollective data access.
- Collective. Completion of a collective call (made by all members of a process group) may depend on the activity of every process making the call. Collective calls sometimes perform better than noncollective ones, however, because the library can optimize them automatically. Collective MPI routines end in _ALL (such as MPI_FILE_WRITE_ALL) or in the pair _ALL_BEGIN/_ALL_END (such as the explicit-offset pair MPI_FILE_WRITE_AT_ALL_BEGIN and MPI_FILE_WRITE_AT_ALL_END). The noncollective MPI_FILE_aaa_SHARED pointer routines map to the collective routines called MPI_FILE_aaa_ORDERED.
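To make these naming patterns concrete, here is a short C sketch (the file name out.dat is hypothetical) combining two of the variants: an explicit-offset, blocking, collective write, followed by an explicit-offset, nonblocking, noncollective write completed with MPI_WAIT.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    int buf[4] = {rank, rank, rank, rank};
    MPI_Offset offset = (MPI_Offset)rank * 4 * sizeof(int);

    /* Explicit offset + blocking + collective: each process writes its own
       block of the shared file, and the library may merge the requests
       into fewer, larger operations. */
    MPI_File_write_at_all(fh, offset, buf, 4, MPI_INT, MPI_STATUS_IGNORE);

    /* Explicit offset + nonblocking + noncollective: start the write,
       overlap it with computation, then confirm completion. */
    MPI_Request req;
    MPI_Status status;
    MPI_File_iwrite_at(fh, offset, buf, 4, MPI_INT, &req);
    /* ...computation could proceed here... */
    MPI_Wait(&req, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```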
For more details on MPI-IO routines, on routine naming patterns for handling every combination of positioning, synchronism, and coordination, and for a summary of MPI-IO data access conventions in general, consult the MPI documentation available at:
This section summarizes and compares MPI-IO issues, problems, and implementation constraints that specifically affect local users of LC production machines.
Successfully performing MPI-IO on NFS-mounted file systems requires that NFS be at version 3 and that each NFS shared directory be mounted with the "no attribute caching" (noac) option enabled. However, all NFS-mounted file systems on LC production machines (such as /nfs/tmpn or the global home directories) are installed with attribute caching enabled (so noac does not appear in their attribute list in /etc/fstab). This means that attempts to perform parallel MPI-IO to LC's NFS-mounted disks will fail. (NFS normally caches modified file pages on each client node that performs a write, without promptly updating the master copy on the file server. When multiple parallel clients write to the same file, this shortcut means that NFS will probably not correctly update the master copy.) Furthermore, when a globally mounted NFS file system (such as those supporting LC's common home directories) is flooded with MPI-IO traffic, service slows, often dramatically, not only on the machine running the MPI code but on all machines on which that file system is mounted. This is not a responsible use of shared computing resources.
LC users of the vendor-independent MPICH libraries for MPI should note that the MPICH ROMIO implementation of parallel I/O is not standard compliant in the way it manages file handles, which are used for nonblocking I/O requests.
On LC IBM machines, the environment variable MP_SINGLE_THREAD is an optimization flag. At LC, it is set to no by default, a setting that assumes multiple message-passing threads and can improve the performance of the threaded MPI library. If this flag is set to yes, then your program cannot use MPI-IO.
Tests show that for some MPI-IO operations GPFS is much more efficient than Lustre, while for other operations Lustre performs much better than GPFS. See the MPI-IO Interactions with Lustre section below for a discussion of the difficulties of predicting how MPI-IO operations tuned to one parallel file system will behave when moved to a different parallel file system.
A file system is the software (or sometimes, the software with the collection of data that it manages) that allows you as a user to manipulate hierarchically organized, access-controlled files and directories rather than just the countless raw blocks of data that comprise them (on disk). This section compares the LC-installed IBM General Parallel File System (GPFS), designed for large-scale parallel I/O, with traditional file systems and offers basic usage advice for GPFS.
See the section above on MPI-IO for a diagram that shows why some parallel file system (such as IBM's GPFS) is necessary to support efficient parallel I/O from many processes running concurrently on many separate compute nodes. See also MPI-IO Issues at LC about how careless parallel I/O can not only hurt your code's performance but also undermine I/O service for many users on many machines at once.
The Network File System (NFS) is a widely distributed commercial file-system product designed to minimize file transfers by letting multiple computers access a collection of files as if it were local to them. LC uses NFS extensively, but NFS is not designed to:
- Let multiple (parallel) processes simultaneously read from or write to the same file from different compute nodes.
- Scale across many I/O servers to avoid I/O bandwidth bottlenecks.
- Transparently balance incoming I/O data across all disks in the file system with a built-in striping algorithm (but RAID disks do provide behind-the-scenes striping at LC).
To support these special goals, LC has installed GPFS on its AIX IBM SP machines (but not on its Linux machines, even though GPFS is also available for Linux). The parallel file system on LC Linux/CHAOS machines is Lustre, discussed in its own section below.
This table summarizes the most interesting, user-relevant differences between the more familiar NFS file system on the one hand and the less familiar but more parallel-friendly GPFS on the other.
|File System Features||NFS||GPFS|
|Example at LC||/nfs/tmpn||/p/gscratch*|
|Primary role||Share files among machines||Fast parallel I/O for large files|
|Easy to scale?||No||Yes|
|Network needed||Any TCP/IP network||Only IBM SP "switch"|
|Access control method||UNIX permission bits (CHMOD)||UNIX permission bits (CHMOD)|
|Block size||256 byte||4 Mbyte|
|Stripe width||Depends on RAID||256 Kbyte|
|Maximum file size||2 Gbyte (larger with NFS v3)||26 Gbyte|
|Uses client buffering?||Yes||Yes|
|Uses server buffering?||Yes||No|
|Lock managed by||File server||Requesting compute node|
|Purged at LC?||Home, no; tmp, yes||Yes|
|Supports file quotas?||Yes||No|
Using GPFS (at LC) effectively and appropriately requires storing your files so that you avoid needlessly clogging the file system and especially so that you avoid losing valuable data to the GPFS file purge. The different purge policies that apply to NFS, GPFS (AIX), and Lustre (Linux) file systems are delineated in the File Purge Policy section of EZFILES. A summary of the purge policy for IBM (GPFS) systems is also available on those systems by running news PurgePolicy.ibm.
If you make use of GPFS on LC's AIX machines, you may need to coordinate your disk space needs with other users on the same machine. Every LC AIX machine therefore offers a system file called /usr/local/etc/pfs_status.machinename that reports for each available parallel file system its current total size, space already used, percentage used, percentage of possible inodes (roughly, files) used, and an ordered list of users and their current space usage (in both Tbytes and number of files).
FLASH is an adaptive-mesh parallel hydrodynamics code developed at the University of Chicago's Center for Astrophysical Thermonuclear Flashes, a DOE Academic Strategic Alliance Program (ASAP) center. FLASH simulates astrophysical thermonuclear flashes (such as supernovae and x-ray bursts) in two or three dimensions. Written in Fortran90, this code uses MPI for interprocess communication, relies on HDF5 for handling output data, and solves the compressible Euler equations on a block-structured adaptive mesh.
Because I/O is important for overall FLASH performance, this is an excellent test case for I/O optimization strategies (each FLASH run often generates 0.5 Tbyte of data, and I/O sometimes takes up as much as half of the total FLASH run time on 1024 processors).
The FLASH I/O benchmark tests FLASH's I/O performance independently of using the entire code. It sets up the same data structures as FLASH, fills them with dummy data, and then performs I/O through the HDF5 interface (or alternatives). The benchmark tests I/O performance on three kinds of files:
- Checkpoint files are used to restart after a failed run. These files store all variables, the tree structure, the current simulation step, and the number of steps. Computational blocks account for more than 95% of the data written during each checkpoint, and 24 separate I/O operations (one per variable) are needed to write all of the computational blocks.
- Plot files are used for visualization runs. Once again, a separate I/O operation per variable is involved, but not all variables are stored and precision is reduced to 4-byte reals instead of 8-byte reals.
- Plot files with corners are similar to plot files but add an extra step to generate a 9-by-9-by-9 interpolated block instead of the normal 8-by-8-by-8 block (to facilitate subsequent visualization).
FLASH performs I/O in this way to minimize the memory needed (a buffer to hold all of the variables for one single write would be very large) and because later data analysis is greatly aided by storing each variable separately.
Possible general I/O optimization strategies (most related to HDF use by FLASH) that are currently under study at LLNL and other ASC sites include:
- Storage Density. To store each variable in a separate record, single variables are extracted from the array of blocks, where the values are not contiguous in memory. FLASH extracts these values using the "hyperslab" feature of the HDF5 library.
- Record Size. The small records that FLASH writes at the beginning of its output may be as expensive as the large chunk that it dumps at the end. Packing small records, either in the code itself or by instructing the HDF5 library to buffer them before writing to disk, could significantly improve performance.
- Write Calls. FLASH issues only a single call to H5Dwrite for each variable stored. But within the library, HDF could make one compound MPI object to address the data or it could issue many separate write calls itself. Setting the "data transfer property" to use collective I/O should force HDF to use the first strategy instead of the second, perhaps making a significant difference in overall I/O performance.
- Two-Phase I/O. Experiments on an ASC system showed a fivefold increase in I/O rate by using two-phase I/O: first, collect output across processors into a buffer; then, write one large contiguous chunk of memory to disk. This requires careful interaction among FLASH, the HDF5 library, and the MPI_File_open command, and the portability of that interaction remains to be tested on other machines (HDF5 features are known to vary among platforms).
- Split I/O. Normally the metadata for an XML-encoded HDF5 file resides in the same file as the data stored. Splitting the metadata into a file separate from the FLASH data itself might improve I/O, especially for situations (such as writing checkpoint files) where the likelihood of ever reading the file later is small so reading inefficiencies can be ignored.
A parallel file system is part of any complete massively parallel computing environment. (In fact, failure to use an available parallel file system and instead running parallel I/O to a traditional global file system such as /nfs/tmpn will degrade I/O performance for all users across all the machines that share that traditional file system.)
In general terms, such a parallel file system:
- Mounts on every compute and login node across the cluster that it serves.
- Stores very large files efficiently, such as application-code data sets or restart dumps of runs that encounter trouble.
- Uses high-speed local communication paths to move data quickly to minimize I/O delays during code execution.
At LC, a parallel file system tailored to LLNL's specific computational needs and resource design policies must also:
- Scale up to effectively serve clusters with over 1,000 nodes (and eventually those with over 10,000 nodes).
- Rely on open source software (to maximize vendor flexibility and encourage collaboration with university researchers worldwide).
- Be independent of any single brand of storage-device hardware (so LC is able to change hardware vendors as new design features become available and make the most of hardware funds).
LC's (collaborative) attempt to develop a practical parallel file system that meets these criteria is called Lustre (for "Linux Cluster"). The technical description of Lustre appears at the Lustre Web site:
The subsections below compare Lustre with GPFS, describe the unusual implementation features that Lustre includes (as installed for production use on LC Linux machines), and explain how to cope with the currently known pitfalls or complexities that Lustre presents to users. LC's point of contact for users needing technical advice about the local Lustre file systems is Richard Hedges (email@example.com).
LLNL deploys both GPFS and Lustre parallel file systems (to support AIX and Linux/CHAOS compute clusters, respectively). Here is a feature-by-feature comparison of these two parallel file systems.
|File System Features||Lustre||GPFS|
|Hardware||Many brands||IBM only (AIX and Linux clusters)|
|Software||GNU public license, open source||IBM proprietary|
|Switches allowed||TCP/IP (Ethernet), Quadrics Elan 3 or 4, InfiniBand||Storage Area Network (SAN), Network Shared Disk, or combination, InfiniBand|
|Networking protocol||Lnet (based on Sandia's Portals API)||IBM proprietary|
|File locking||Intent based (request + reason sent together)||Token passing|
|Lock granularity||Byte range||Byte range|
|Data/metadata operations||Separated, by different servers||Together, by the same servers|
|Scalability strategy||(1) Object storage targets manage data moves to actual disks, (2) metadata servers manage namespace||(1) Same disks attached to all nodes, (2) storage nodes manage both data and metadata|
File systems at or near their capacity often show degraded performance, higher I/O error rates, or sometimes complete service failure. To make service more predictable and reliable, LC intentionally destroys (purges) files on at-risk file systems intended for temporary storage (especially the large NFS-mounted temporary file systems and the GPFS and Lustre parallel file systems).
The different purge policies that apply to NFS, GPFS (AIX), and Lustre (Linux) file systems are delineated in the File Purge Policy section of EZFILES. A summary of the purge policy for GPFS systems is available on those systems by running news PurgePolicy.ibm. A summary of the purge policy for Lustre systems is available on those systems by running news PurgePolicy.linux.
At LC, the installed parallel file systems (Lustre for Linux/CHAOS machines and GPFS for IBM/AIX machines) have file-system names that generalize more easily and emphasize the temporary nature of data placed on these devices. The file systems are mounted across multiple clusters for greater convenience with less need to move files between like machines.
Each LC parallel file system has a name of the form /p/lscratch* or /p/gscratch*, where
- l (lowercase el) indicates a Lustre (Linux/CHAOS) file system,
- g indicates a GPFS (IBM/AIX) file system,
- ocfletter is the unique one-letter identifier for OCF systems (a, b, c, etc.) that replaces the asterisk, and
- scfnumber is the unique one-digit identifier for SCF systems (1, 2, 3, etc.) that replaces the asterisk.
For example, /p/lscratcha is an OCF Lustre file system, while /p/gscratch1 is an SCF GPFS file system.
Two key design features distinguish Lustre's implementation from other parallel file systems:
- Division of labor. Lustre separates a file's metadata (its attributes and status information) from the file's data (its "content" that a program wants to get or put). Each is handled by its own separate server(s) for efficiency.
- Object approach. All actual file I/O is mediated by one or more object storage targets (OSTs), on which files seem to reside but which really mask the character of the underlying physical storage devices (could be multiple disks or other "file systems" of various sizes or brands). This approach promotes flexibility, reliability, and scalability.
A Lustre metadata server (MDS; there are usually several for speed and failover redundancy) manages all namespace operations relevant to your file, such as assigning or updating references to the file's name, owner, permissions, access (conflict-control) locks, and (apparent) location on an OST (or striped across several OSTs). The MDS does not, however, participate in data transfers between the program and storage devices. (Those occur directly between the client node and the OSTs involved to save overhead.) Nevertheless, LC Lustre users sometimes encounter slow metadata performance (such as slow response to ls when executed with options like -l).
Each Lustre OST is a server devoted to file I/O operations. It handles reads and writes of file data, but it talks to the MDS only if namespace changes for the file are needed. OSTs can fill up as if they were disks and can share files by striping (as if they were RAID disks), but files actually reside on lower-level physical devices managed indirectly by device drivers. These underlying storage devices are sometimes called object-based disks (OBDs) to emphasize how much their physical traits are hidden from your application program: they can be generic or customized and branded, can include nondisk storage, and can be upgraded to newer technology below an OST without disturbing that OST's consistent, reliable service to your program I/O requests. OSTs provide not only storage abstraction, but also modular expandability: more OSTs can be added to an existing Lustre file system to expand the pool of logical locations for incoming files.
One drawback of allowing different brands and kinds of physical storage device to hide below the OST level is that your application program writing to Lustre may receive different exit or return codes at different times. Users often interpret these return-code differences as file system errors, when they usually just reflect subtle between-brand differences allowed within the POSIX specification. If you use Lustre extensively, change your application to overtly check I/O return codes so that you can appropriately ignore unimportant differences yet still detect file-corrupting genuine I/O errors.
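As a minimal sketch of such overt checking (using the POSIX write call; the function name checked_write is illustrative), a robust writer retries interrupted and short writes and treats only genuine failures as errors:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write len bytes, retrying short or interrupted writes; return -1 only
   on a genuine I/O error. */
ssize_t checked_write(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)   /* interrupted, not an error: retry */
                continue;
            fprintf(stderr, "write failed: %s\n", strerror(errno));
            return -1;            /* genuine, file-corrupting error */
        }
        done += (size_t)n;        /* short write: continue where we left off */
    }
    return (ssize_t)done;
}
```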
At LLNL, Lustre uses either the Quadrics Elan or the InfiniBand network switch (depending on the Linux cluster where it is installed). Network independence is another Lustre design strength, facilitated by its use of the Portals protocol stack, an abstract approach to networking originally developed at Sandia National Laboratories (now available as open source software) and called Lnet as implemented as part of Lustre at LLNL.
For each Linux/CHAOS production cluster at LC running Lustre, the chart below shows (as of December 2010) the name(s) of the public parallel file system(s) by which computer users access Lustre disk space on that cluster and the current total size of each parallel file system. For a current listing of mounted file systems, run bdf. For maximum bandwidths of the lscratch file systems, see OCF maximum Lustre bandwidths and SCF maximum Lustre bandwidths.
[Table: Lustre file systems, listing each Linux/CHAOS cluster's public parallel file system name(s) and current total size (December 2010).]
This section briefly describes known Lustre usage (operational) issues or pitfalls and suggests ways to cope with each one.
Using Lustre effectively and appropriately requires archiving your files so that you avoid needlessly clogging the parallel file system and, especially, so that you avoid losing valuable data in a Lustre file purge. Once files are purged from Lustre they cannot be recovered, so use archival storage (see EZSTORAGE) to protect your important content. The different purge policies that apply to NFS, GPFS (AIX), and Lustre (Linux) file systems are delineated in the File Purge Policy section of EZFILES. A summary of the purge policy for Lustre systems is available on those systems by running news PurgePolicy.linux.
Each implementation of MPI-IO (parallel I/O using the MPI library) depends for success, and certainly for whatever scalability it offers, on the underlying parallel file system that performs its requested I/O operations. On AIX systems, MPI-IO is strongly affected by IBM's GPFS support, and likewise on LC's Linux/CHAOS clusters, where Lustre supports MPI-IO requests. (Warning: Attempting MPI-IO to a standard shared file system such as /nfs/tmpn, or worse, to your common home directory, will severely degrade I/O performance for all users of that file system across all machines where it is mounted.)
MPI-IO efficiency varies greatly depending on its underlying parallel file system for three reasons:
- File system software (middleware) often reorganizes (to suit itself) how programmatic I/O operations appear to the actual hardware that services them. For example, multiple small noncontiguous file requests may coalesce into one large(r) I/O step to reduce network traffic.
- File-locking is crucial for reliable simultaneous reads from or writes to (different parts of) one file, yet locking availability and grain size vary from one parallel file system to another.
- Management operations (open, close, resize) depend on each file system's API for their implementation (and hence for their efficiency) details.
The features of a parallel file system most likely to influence how well MPI-IO works and how easily it scales up as the number of nodes grows large include:
- Just how the file system supports noncontiguous I/O to distributed files.
- The system's consistency semantics, that is, just when data and metadata are locally cached and when changes quickly propagate to all clients (after you write to a file, for example).
- Whether (and how) client-independent (across-node) handles (references to files) are available.
Some MPI-relevant features are advertised by parallel file system vendors, while others are hidden or even proprietary. So as a user, you should expect significant and sometimes inexplicable differences in MPI-IO performance as you move your applications from one parallel file system to another (even within the LC computing environment). Sometimes simple changes from one login node to another cause major differences in resource contention and hence in file-transfer rates. See also the Lustre Striping section below.
Because of the volume of material involved and the computationally high overhead for parallel file operations, LC does not back up its Lustre file systems. Should power failures or other unscheduled hardware problems occur, all of your data residing on any Lustre file system could be lost with no possibility of recovery. Also, LC currently does not provide redundant (failover) OSSs, so each OSS failure makes some data unavailable until hardware is repaired or replaced.
Hence, you should thoughtfully move or copy important Lustre files to duplicate (and safer) locations yourself. Moving large numbers of small files can be tedious and error-prone, however, and can badly congest the network. LC provides a special software tool, called HTAR, specifically designed to efficiently transfer very large file sets either directly to archival storage or (if you request) to another file system on another LC machine. Using HTAR to self-backup your Lustre files thus benefits you as well as other users (who avoid the congestion you could cause by using slower manual transfers).
HTAR resembles traditional TAR in many ways, but unlike TAR it does not require duplicate local disk space to create its target archive file (a great benefit). To take full advantage of HTAR's efficient backup potential, consult the feature explanations and examples in the HTAR Reference Manual.
Like many high-performance storage systems, Lustre uses disk striping to improve I/O speed: the system automatically divides the data to be stored into "stripes" and spreads those stripes across (some) available storage locations so that they can be processed in parallel. Because Lustre places files on logical OSTs, which manage the physical disk interactions hidden from the user, user data is striped across multiple OSTs to improve performance and to better balance the storage load.
Lustre is fairly fault tolerant (compared to GPFS and NFS), and it continues to operate even if one (or more) specific OST goes offline. Striping data widely across OSTs works against this reliability, of course. If any portion of your data resides on an OST that is down, attempted access of that file returns an I/O error until the faulty OST returns to service. So, it may be important to know which OSTs a particular file is spread across, or even to influence that spread. Two utilities (see below) address this need on LC machines served by Lustre.
Stripe width is the number of devices (or, in the case of Lustre, the number of OSTs) across which a file is divided. LC assigns different default stripe widths to Lustre on different computers to take advantage of different storage resources as well as differences among each machine's "object storage servers" (OSSs, the Linux nodes that communicate with OSTs). BlueGene/L, for example, runs two OST processes on each OSS node. The current default Lustre stripe width for all systems is two.
The lfs find utility reports on the distribution of file segments across Lustre OSTs. To change the default Lustre striping characteristics for a new (empty, not yet written) file or for a directory (so that new files written to it inherit those characteristics), use lfs setstripe; run lfs help setstripe or see man lfs for details.
Some strategy considerations:
- Creating a very large file (for example, a large TAR bundle of already large files) on a small number of OSTs results in distinctly suboptimal performance. Stripe such a file over more OSTs to use a larger fraction of the available storage devices (or simply avoid TARing files that are already quite large).
- If your application program writes one file per process, then letting Lustre place these small separate files on different OSTs round-robin (the default) will beneficially balance the load on the underlying storage devices. Striping over many (or all) OSTs here degrades performance.
- If your application program instead has all of its parallel processes write to different parts of a single shared file, then you will probably need to help Lustre widely distribute this load. In this case, striping the big shared file over many (or all) OSTs is probably very desirable.