I/O Guide for LC

High-performance input and output (I/O), including parallel I/O, is crucial for the overall success of large-scale computer simulations, such as the ASC simulations on which LC's production computing resources are focused. Planning for effective, scalable I/O by application codes is important because:

LLNL's Scalable I/O Project has developed an "end-to-end model of the I/O stack" to make clear the layers that data pass through from an application code to physical storage:

[Figure: End-to-End I/O Stack, from application-level data models and formats down to physical storage]

This guide provides an alphabetical glossary of just those key I/O terms needed to understand the later treatments of LC-related I/O tools and techniques. One major section explains the features and local use of the Hierarchical Data Format (HDF), designed to promote efficient large-scale I/O. (HDF use by the I/O-intensive FLASH hydrodynamics code is also analyzed.) Portable, parallel I/O with the MPI library is discussed as well, both in terms of general data-access routines and local implementation constraints. A separate discussion introduces the design features (such as data and metadata separation) and local implementation details (such as file system names and sizes) of Lustre, the open source parallel file system that LC deploys on its Linux/TOSS (Tri-Lab Operating System Stack) clusters.

For help, contact the LC Hotline at 925-422-4531 or via e-mail (OCF: lc-hotline@llnl.gov, SCF: lc-hotline@pop.llnl.gov).

I/O Glossary

This section provides an alphabetical set of brief explanations for the unusual technical terms that appear throughout this guide as I/O issues and features are discussed. The glossary here is intended to make the text of this manual easier to use, not to offer a comprehensive I/O dictionary.

Disk striping
distributes file data across multiple disks for speed and safety. The amount of consecutive data stored on each disk is the "striping unit" or "strip width," which may be the block size, or multiple blocks, or just a few bits. "Declustering" is sometimes a synonym for disk striping. See also the Lustre Striping section below for some relevant user tools.
Diskless nodes
are compute nodes in a Linux/TOSS cluster that have no local hard disks. The advantage is that there are fewer disk drives to fail, thus increasing overall reliability. But diskless nodes have no swap space, so any application that runs out of memory on those nodes will be terminated by the TOSS OOM (Out Of Memory) killer. Also, /tmp and /var/tmp on diskless nodes use RAM, not disk. So TOSS purges those file systems completely between jobs on diskless nodes to reclaim the memory used. Jobs there must use HPSS, Lustre, or /nfs/tmp* for output to survive after the job ends.
F/b ratio
measures the effectiveness of an I/O system. F is the rate of executing floating-point operations and b is the rate of performing I/O (so F/b=1 means one bit of I/O occurs for every floating-point operation). While F/b=1 is sometimes thought to be the ideal for supercomputers, real-life F/b ratios are often closer to 100 for scientific applications and closer to 10 for I/O intensive applications.
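For example, by this definition a machine sustaining 1 teraflop/s with F/b = 100 performs about 10 gigabits of I/O per second.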
GPFS
is IBM's commercial General Parallel File System.
HDF
is Hierarchical Data Format, a standard way to organize files internally and a supporting I/O library. Both are designed to promote efficient large-scale I/O for scientific applications (see the HDF section below for LC's local implementation details).
Lustre
is an open source parallel file system that is installed for parallel I/O on LC's Linux massively parallel production computers. (See the Lustre section below for LC's design constraints and local implementation details.)
Parallel file system
is a file system specifically designed to allow simultaneous reads and writes to nonoverlapping regions of the same (logical) file, simultaneous reads and writes of different files, and distribution (striping) of file data across several I/O nodes or disks (or both), especially for large files. The Lustre parallel file system is used at LC.
Parallel I/O subsystem
is a way to transfer data in parallel between compute nodes and dedicated I/O nodes within the same massively parallel machine. The parallel I/O subsystem takes advantage of the machine's high-speed internal switch to handle small requests efficiently, yet it can scatter I/O operations among many nodes to efficiently distribute large files, too. I/O occurs internally across the parallel I/O subsystem, then externally across high-bandwidth channels to mass-storage servers.
RAID
is a redundant array of inexpensive (or independent) disks. RAID technology provides high reliability for stored data by striping the data across several disks in a way that uses more disk space than without striping but maintains parity so that lost data can be reconstructed even if one disk in the array fails. LC's globally mounted NFS disks (such as for the global home directories) use RAID.

Hierarchical Data Format

HDF Features

Hierarchical Data Format (HDF) is a (specification for a) file format and a supporting I/O library for storing technical data. HDF is designed to promote efficient large-scale I/O for scientific applications running in high-performance computing environments. Hierarchical Data Format 5 (called HDF5) replaces an earlier, and incompatible, attempt to meet similar goals (called HDF4). For more information, visit the HDF Group's HDF5 home page.

HDF Availability

HDF5 is available for LC's Linux operating systems, and most features are supported. Your program manipulates HDF5 files by using calls to the HDF5 I/O library. On LC Linux machines, HDF5 installations are in /usr/local/tools/hdf5*. Versions are maintained for GNU, Intel, PGI, and PathScale compilers.

The HDF5 tools reside in a /bin directory several layers (depending on the host) below /usr/local/hdf5-version on each LC production machine. See the HDF Operations section for details on using the HDF5 library and tools locally. See I/O Analysis of FLASH below for discussion of the benefits and pitfalls of trying to use HDF5 to support a hydrodynamics code that performs extensive I/O.

A tutorial, user guide, and other HDF5 documentation are available online at http://www.hdfgroup.org/HDF5/doc/doc-info.html.

HDF5 File Structure

HDF5 files are binary containers for efficiently holding scientific data in an organized way, with explicit supporting metadata to facilitate later reuse.


HDF5 files consist of:

Groups that behave like UNIX directories (they organize data hierarchically). Groups can contain other child groups or point to other groups. Every HDF5 group has three attributes that overtly declare its:

Datasets within groups that behave like UNIX files within directories, except that they too have overt structures and supporting metadata. Every HDF5 dataset includes these features:

XML Role

Because HDF5 files are hierarchically organized and encoded with overt attributes, they can be represented by and manipulated using XML (the W3C standard "eXtensible Markup Language"). The NCSA HDF group is developing a comprehensive suite of standards and tools for using XML with HDF5. For details see http://www.hdfgroup.org/HDF5/XML/.

HDF Operations

Your program manipulates HDF5 files by using calls to the HDF5 I/O library. Among the most important operations on HDF5 files are file creation, file display, hyperslab selection, parallel HDF5, and HDF5 support in IDL.

File Creation

The include file hdf5.h (for C) or the module HDF5 (for Fortran) contains definitions and declarations that you must use in any program that invokes the HDF5 library. A call to routine H5Fcreate (C) or h5fcreate_f (Fortran) creates a new HDF5 file, returns its file identifier, and lets you specify its:
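
For orientation, here is a minimal C sketch (assuming the HDF5 1.8-style API and default property lists) that creates a new file and populates it with one group and one dataset; the file, group, and dataset names and the array dimensions are illustrative assumptions.

    #include "hdf5.h"

    int main(void)
    {
        /* Create a new HDF5 file, truncating any existing file of the same name. */
        hid_t file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* Groups behave like directories: create /results inside the file. */
        hid_t group_id = H5Gcreate(file_id, "/results", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Datasets behave like files: store a 4 x 6 array of doubles in the group. */
        hsize_t dims[2] = {4, 6};
        hid_t space_id  = H5Screate_simple(2, dims, NULL);
        hid_t dset_id   = H5Dcreate(group_id, "temperature", H5T_NATIVE_DOUBLE,
                                    space_id, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        double data[4][6] = {{0.0}};
        H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* Release identifiers in reverse order of creation. */
        H5Dclose(dset_id);
        H5Sclose(space_id);
        H5Gclose(group_id);
        H5Fclose(file_id);
        return 0;
    }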

File Display

Once you have created, expanded, or altered an HDF5 file, you can display its contents (groups, attributes, etc.) in human-readable form by invoking any of several software tools provided for this purpose. A few important HDF5 tools are h5copy, h5dump, and h5ls. The HDF5 tools reside in a /bin directory several layers below /usr/local/tools/hdf5/hdf5-version on each LC production machine. Each HDF5 tool, if run without options, displays several screens of text summarizing its usage syntax and available options, then ends. One of the most helpful tools is h5dump, which outputs an ASCII text display in Backus-Naur Form by default, or encoded in XML if you request it with the -x option. (See also the comments on using the IDL library for HDF5 output, below.)

Hyperslab Selection

One way to read existing HDF5 files is by invoking H5Sselect_hyperslab (C) or h5sselect_hyperslab_f (Fortran), which extracts a "hyperslab" from an HDF5 dataset. A hyperslab can be a logically contiguous set of points, or a regular pattern of points or blocks even if noncontiguous.

Hyperslab selection from HDF5 datasets is so flexible that you can read from a dataset with one size, shape, and datatype, and then write into a dataset with a different size, shape, and datatype. For example, you can read blocks from a 2-D array of 32-bit floats and then write that data into a contiguous sequence of 64-bit floats at a specified offset in a 1-D array.
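
As an illustration, here is a minimal C sketch (assuming the HDF5 1.8-style API) that selects a 2 x 4 hyperslab from an existing two-dimensional dataset and reads it into a buffer; the file name, dataset path, offsets, and counts are illustrative assumptions.

    #include "hdf5.h"

    int main(void)
    {
        hid_t file_id = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset_id = H5Dopen(file_id, "/results/temperature", H5P_DEFAULT);

        /* File dataspace: select a 2 x 4 block starting at row 1, column 2. */
        hid_t filespace   = H5Dget_space(dset_id);
        hsize_t offset[2] = {1, 2};
        hsize_t count[2]  = {2, 4};
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);

        /* Memory dataspace: where the selected data lands in the buffer. */
        hsize_t dims[2] = {2, 4};
        hid_t memspace  = H5Screate_simple(2, dims, NULL);

        double buf[2][4];
        H5Dread(dset_id, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, buf);

        H5Sclose(memspace);
        H5Sclose(filespace);
        H5Dclose(dset_id);
        H5Fclose(file_id);
        return 0;
    }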

Parallel HDF5

A parallel HDF5 API is supported for C and Fortran but not for C++. Parallel I/O on HDF5 files always involves the MPI concept of a "communicator," a specified set of processes that pass messages to each other. For parallel HDF5 I/O, each process in an MPI communicator invokes H5Pcreate (C) or h5pcreate_f (Fortran) to create an "access template" and obtain a file's access property list, and invokes H5Pset_fapl_mpio (C) or h5pset_fapl_mpio_f (Fortran) to initiate parallel I/O access.
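
A minimal C sketch of that sequence follows (assuming an HDF5 library built with parallel support and compiled with the MPI wrappers); the file name is an illustrative assumption, and the collective dataset operations are omitted.

    #include <mpi.h>
    #include "hdf5.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Access template: every rank in the communicator makes these calls. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* Collective file creation across the communicator. */
        hid_t file_id = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... collective dataset creation and writes would go here ... */

        H5Fclose(file_id);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }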

With parallel HDF5 I/O:

The HDF5 documentation referenced at the beginning of this section includes annotated programming examples (in C and Fortran) of performing parallel I/O on HDF5 files.

HDF5 Support in IDL

On all production machines, LC offers a licensed commercial library and tool set called Interactive Data Language (IDL). IDL is really a general data exploration and visualization language designed for writing high-level data-analysis programs much more compactly than with C or Fortran, but HDF5 is one of four "self-describing scientific data formats" that IDL routines can read and query (but not write). IDL acknowledges the following limitations when reading HDF5 files:

On LC machines, the IDL libraries reside in /usr/global/tools/ITT/idl71/bin/bin.linux.x86_64. The documentation for using "Callable IDL" is available in Chapter 16 of /usr/global/tools/ITT/docs/edg.pdf. (See also the comments on using native HDF5 tools for output instead, in the File Display section above.)



MPI-IO

A parallel I/O interface allows programs with many processes on many nodes to coordinate their I/O read and write operations for greater efficiency. This parallel interface is called MPI-IO.

Implementing successful parallel I/O, including MPI-IO, requires the underlying support of a parallel file system. This diagram shows why a traditional file system causes expensive inefficiency when users attempt parallel I/O without proper hardware support.

[Figure: File system efficiencies for parallel I/O]

One inefficient approach concentrates all read and write requests at a single I/O node (a). Another allows several nodes to read or write data (b), but only to separate (logically distinct) files that have to be somehow merged later. A parallel file system (c) not only supports I/O from many nodes at the same time, but also transfers the data to (different parts of) a single logical file, even if the file is "striped" across multiple physical disks (for safety, convenience, or speed).

MPI-IO Data Access

With MPI-IO, data moves between processes and files using a variety of specific read and write calls, all variations on MPI_FILE_READ and MPI_FILE_WRITE. You select from these many routines to address three independent aspects of parallel data access: positioning, synchronism, and coordination.


MPI-IO data access routines provide three types of positioning (which you can mix safely within the same program): explicit offsets (routines with _AT in their names, such as MPI_FILE_READ_AT), individual file pointers (one per process per file handle, used by the default routines such as MPI_FILE_READ), and shared file pointers (one per file per communicator, used by routines ending in _SHARED or _ORDERED).


MPI-IO offers both blocking and nonblocking I/O routines. Blocking. Blocking I/O calls do not return until their I/O request has completed; blocking is the default approach (e.g., with MPI_FILE_WRITE or MPI_FILE_WRITE_AT), and special routines request nonblocking I/O. Nonblocking. Nonblocking I/O calls start an I/O operation but do not wait for it to complete, which can allow data transfer to overlap with computation if the hardware permits. To confirm that your data was actually read or written, however, you must use a separate "request complete" call (such as MPI_WAIT). Nonblocking versions of MPI routines all have names of the form MPI_FILE_Iaaa (where I stands for "immediate," as in MPI_FILE_IWRITE).
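
As a hedged sketch of the nonblocking pattern (assuming an MPI-2 compliant implementation; the file name, buffer size, and offsets are illustrative), a program starts an immediate write and later confirms its completion:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "async.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Position each rank's individual file pointer at a nonoverlapping offset. */
        double buf[4096] = {0.0};
        MPI_File_seek(fh, (MPI_Offset)rank * sizeof(buf), MPI_SEEK_SET);

        /* Immediate (nonblocking) write; the I in IWRITE is for "immediate." */
        MPI_Request request;
        MPI_File_iwrite(fh, buf, 4096, MPI_DOUBLE, &request);

        /* ... computation can overlap with the I/O here ... */

        /* Confirm that the data was actually written before reusing buf. */
        MPI_Wait(&request, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }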


MPI data access routines may be noncollective or collective, a measure of their dependence on other members of their process group. Noncollective. Noncollective call completion depends only on the calling process itself. The default MPI routines above (such as MPI_FILE_WRITE) perform noncollective data access. Collective. Completion of a collective call (made by all members of a process group) may depend on the activity of every process making the call. But sometimes collective calls perform better than noncollective ones because they can be automatically optimized. Collective MPI routines end in _ALL (such as MPI_FILE_WRITE_ALL) or in the pair _ALL_BEGIN/END (such as the explicit-offset pair MPI_FILE_WRITE_AT_ALL_BEGIN and MPI_FILE_WRITE_AT_ALL_END). The noncollective MPI_FILE_aaa_SHARED pointer routines map to the collective routines called MPI_FILE_aaa_ORDERED.
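
To make the naming pattern concrete, here is a minimal C sketch of a collective, explicit-offset write in which every rank writes one nonoverlapping block of a shared file; the file name and block size are illustrative assumptions.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank fills a block and writes it at its own explicit offset. */
        int buf[1024];
        for (int i = 0; i < 1024; i++)
            buf[i] = rank;
        MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);

        /* Collective, explicit-offset variant (the noncollective form is
           MPI_FILE_WRITE_AT). */
        MPI_File_write_at_all(fh, offset, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }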

For more details on MPI-IO routines, on routine naming patterns for handling every combination of positioning, synchronism, and coordination, and for a summary of MPI-IO data access conventions in general, consult the MPI documentation available at:


MPI-IO Issues at LC

This section summarizes and compares MPI-IO issues, problems, and implementation constraints that specifically affect local users of LC production machines.

NFS Incompatible with MPI-IO

Successfully performing MPI-IO on NFS-mounted file systems requires that NFS is at version 3 and that each NFS shared directory is mounted with the "no attribute caching" (noac) option enabled. However, all NFS-mounted file systems on LC production machines (such as /nfs/tmpn or the global home directories) are installed with attribute caching enabled (so noac is not set and does not appear in their mount options in /etc/fstab). This means that attempts to perform parallel MPI I/O to LC's NFS-mounted disks will fail. (NFS normally caches modified file pages on each client node that performs a write, without promptly updating the master copy on the file server. When multiple parallel clients write to the same file, this shortcut means that NFS will probably not correctly update the master copy.) Furthermore, when a globally mounted NFS file system (such as those supporting LC's common home directories) is flooded with MPI I/O traffic, service slows, often dramatically, not only on the machine running the MPI code but on all machines on which that file system is mounted. This is not a responsible use of shared computing resources.

ROMIO Problems

LC users of the vendor-independent MPICH libraries for MPI should note that the MPICH ROMIO implementation of parallel I/O is not standard compliant in the way it manages file handles, which are used for nonblocking I/O requests.

Portability Issues

See the MPI-IO Interactions with Lustre section below for a discussion of the difficulties of predicting how MPI-IO operations tuned to one parallel file system will behave when moved to a different parallel file system.

I/O Analysis of FLASH

What Is FLASH?

FLASH is an adaptive-mesh parallel hydrodynamics code developed at the University of Chicago's Center for Astrophysical Thermonuclear Flashes. FLASH simulates astrophysical thermonuclear flashes (such as supernovae and x-ray bursts) in two or three dimensions. Written in Fortran 90, the code uses MPI for interprocess communication, relies on HDF5 for handling output data, and solves the compressible Euler equations on a block-structured adaptive mesh.

Because I/O is important for overall FLASH performance, this is an excellent test case for I/O optimization strategies (each FLASH run often generates 0.5 Tbyte of data, and I/O sometimes takes up as much as half of the total FLASH run time on 1024 processors).

What Is the FLASH I/O Benchmark?

The FLASH I/O benchmark tests FLASH's I/O performance independently of using the entire code. It sets up the same data structures as FLASH, fills them with dummy data, and then performs I/O through the HDF5 interface (or alternatives). The benchmark tests I/O performance on three kinds of files:

FLASH performs I/O in this way to minimize the memory needed (a buffer to hold all of the variables for one single write would be very large) and because later data analysis is greatly aided by storing each variable separately.

Current I/O Issues

Possible general I/O optimization strategies (most related to HDF use by FLASH) that are currently under study at LLNL and other ASC sites include:

Lustre Parallel File System

Lustre Goals and Roles

A parallel file system is part of any complete massively parallel computing environment. (In fact, failure to use an available parallel file system and instead running parallel I/O to a traditional global file system such as /nfs/tmpn will degrade I/O performance for all users across all the machines that share that traditional file system.)

In general terms, such a parallel file system:

At LC, a parallel file system tailored to LLNL's specific computational needs and resource design policies must also:

LC's (collaborative) attempt to develop a practical parallel file system that meets these criteria is called Lustre. The technical description of Lustre appears at the Lustre Web site.

The subsections below describe the unusual implementation features that Lustre includes (as installed for production use on LC Linux machines), and explain how to cope with the currently known pitfalls or complexities that Lustre presents to users. LC's point of contact for users needing technical advice about the local Lustre file systems is Richard Hedges (hedges1@llnl.gov).

Lustre Configuration


LLNL deploys Lustre parallel file systems. Some of the file system features include:

Lustre File System Features
  Hardware:                  Many brands
  Software:                  GNU Public License, open source
  Switches allowed:          TCP/IP (Ethernet), Quadrics Elan 3 or 4, InfiniBand
  Networking protocol:       LNet (based on Sandia's Portals API)
  File locking:              Intent based (request + reason sent together)
  Lock granularity:          Byte range
  Data/metadata operations:  Separated, by different servers
  Scalability strategy:      (1) Object storage targets manage data moves to actual disks; (2) metadata servers manage namespace
  POSIX compliant:           Yes


File systems at or near their capacity often show degraded performance, higher I/O error rates, or sometimes complete service failure. To make service more predictable and reliable, LC intentionally destroys (purges) files on at-risk file systems intended for temporary storage (especially the large NFS-mounted temporary file systems and Lustre parallel file systems).

The different purge policies that apply to NFS and Lustre file systems are delineated in the File Purge Policy section of EZFILES. A summary of the purge policy for Lustre systems is available on those systems by running news PurgePolicy.linux.


At LC, the installed Lustre parallel file systems have file-system names that generalize more easily and emphasize the temporary nature of data placed on these devices. The file systems are mounted across multiple clusters for greater convenience with less need to move files between like machines.

Each LC parallel file system has a name of the form



l (lowercase el) indicates a Lustre (Linux/TOSS) file system,

ocfletter is a unique one-letter identifier for OCF systems (a, b, c, etc.), and

scfnumber is a unique one-digit identifier for SCF systems (1, 2, 3, etc.).

Lustre LLNL Implementation

LLNL's Lustre Strategy

Two key design features distinguish Lustre's implementation from other parallel file systems:

A Lustre metadata server (MDS; there are usually several for speed and failover redundancy) manages all namespace operations relevant to your file, such as assigning or updating references to the file's name, owner, permissions, access (conflict-control) locks, and (apparent) location on an OST (or striped across several OSTs). The MDS does not, however, participate in data transfers between the program and storage devices. (Those occur directly between the client node and the OSTs involved to save overhead.) Nevertheless, LC Lustre users sometimes encounter slow metadata performance (such as slow response to ls when executed with options like -l).

Each Lustre OST is a server devoted to file I/O operations. It handles reads and writes of file data, but it talks to the MDS only if namespace changes for the file are needed. OSTs can fill up as if they were disks and can share files by striping (as if they were RAID disks), but files actually reside on lower-level physical devices managed indirectly by device drivers. These underlying storage devices are sometimes called object-based disks (OBDs) to emphasize how much their physical traits are hidden from your application program: they can be generic or customized and branded, can include nondisk storage, and can be upgraded to newer technology below an OST without disturbing that OST's consistent, reliable service to your program I/O requests. OSTs provide not only storage abstraction, but also modular expandability: more OSTs can be added to an existing Lustre file system to expand the pool of logical locations for incoming files.

One drawback of allowing different brands and kinds of physical storage device to hide below the OST level is that your application program writing to Lustre may receive different exit or return codes at different times. Users often interpret these return-code differences as file system errors, when they usually just reflect subtle between-brand differences allowed within the POSIX specification. If you use Lustre extensively, change your application to overtly check I/O return codes so that you can appropriately ignore unimportant differences yet still detect file-corrupting genuine I/O errors.
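
For example, a write wrapper along the following lines (a sketch using POSIX calls; the helper name is illustrative) distinguishes retryable conditions, such as short writes and interrupted calls, from genuine I/O errors:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: treat short writes and interruptions as retryable, real errors as fatal. */
    static int checked_write(int fd, const char *buf, size_t len)
    {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;                          /* interrupted: just retry */
                fprintf(stderr, "write failed: %s\n", strerror(errno));
                return -1;                             /* genuine I/O error */
            }
            buf += n;                                  /* short write: write the rest */
            len -= (size_t)n;
        }
        return 0;
    }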

At LLNL, Lustre uses either the Quadrics Elan or the InfiniBand network switch (depending on the Linux cluster where it is installed). Network independence is another Lustre design strength, facilitated by its use of the Portals protocol stack, an abstract approach to networking originally developed at Sandia National Laboratories (now available as open source software) and implemented within Lustre at LLNL as LNet.

LLNL's Lustre Service

For each Linux production cluster at LC running Lustre, consult the LC Lustre File System Summary and Maximum Bandwidths for information about disk space and total file system space. For a current listing of mounted file systems, run bdf.

Lustre Operational Issues

This section briefly describes known Lustre usage (operational) issues or pitfalls and suggests ways to cope with each one.

Lustre Purge Policy

Using Lustre effectively and appropriately requires storing (archivally) your files so that you avoid needlessly clogging the parallel file system and especially so that you avoid losing valuable data in case of a Lustre file purge. Once files are purged from Lustre they cannot be recovered, so use archival storage (see EZSTORAGE) to protect your important content. The different purge policies that apply to NFS and Lustre file systems are delineated in the File Purge Policy section of EZFILES. A summary of the purge policy for Lustre systems is available on those systems by running news PurgePolicy.linux.

MPI-IO Interaction with Lustre

Each implementation of MPI-IO (parallel I/O using the MPI library) depends for success, and certainly for whatever scalability it offers, on the underlying parallel file system that performs its requested I/O operations. On LC's Linux clusters, Lustre supports MPI-IO requests. (Warning: Attempting MPI-IO to a standard shared file system such as /nfs/tmpn, or worse, to your common home directory, will severely degrade I/O performance for all users of that file system across all machines where it is mounted.)

MPI-IO efficiency varies greatly depending on its underlying parallel file system for three reasons:

The features of a parallel file system most likely to influence how well MPI-IO works and how easily it scales up as the number of nodes grows large include:

Some MPI-relevant features are advertised by parallel file system vendors, while others are hidden or even proprietary. So as a user, you should expect significant and sometimes inexplicable differences in MPI-IO performance as you move your applications from one parallel file system to another (even within the LC computing environment). Sometimes simple changes from one login node to another cause major differences in resource contention and hence in file-transfer rates. See also the Lustre Striping section below.

Lustre Backup Policy

Because of the volume of material involved and the computationally high overhead for parallel file operations, LC does not back up its Lustre file systems. Should power failures or other unscheduled hardware problems occur, all of your data residing on any Lustre file system could be lost with no possibility for recovery. Also, LC currently does not provide redundant (failover) OSSs, so each OSS failure makes some data unavailable until hardware is repaired or replaced.

Hence, you should thoughtfully move or copy important Lustre files to duplicate (and safer) locations yourself. Moving large numbers of small files can be tedious, error prone, and a major source of network congestion, however. LC provides a special software tool, called HTAR, specifically designed to efficiently transfer very large file sets either directly to archival storage or (if you request) to another file system on another LC machine. Using HTAR to back up your Lustre files yourself thus benefits you as well as other users (who avoid the congestion you could cause by using slower manual transfers).

HTAR resembles traditional TAR in many ways, but unlike TAR it does not require duplicate local disk space to create its target archive file, a great benefit. To take full advantage of HTAR's efficient backup potential, consult the feature explanations and examples in the HTAR Reference Manual.

Lustre Striping

Like many high-performance storage systems, Lustre uses disk striping to improve I/O speed: the system automatically divides the data to be stored into "stripes" and spreads those stripes across (some) available storage locations so that they can be processed in parallel. Because Lustre places files on logical OSTs, which manage the physical disk interactions hidden from the user, user data is striped across multiple OSTs to improve performance and to better balance the storage load.

Lustre is fairly fault tolerant (compared to NFS), and it continues to operate even if one (or more) specific OST goes offline. Striping data widely across OSTs works against this reliability, of course. If any portion of your data resides on an OST that is down, attempted access of that file returns an I/O error until the faulty OST returns to service. So, it may be important to know which OSTs a particular file is spread across, or even to influence that spread. Two utilities (see below) address this need on LC machines served by Lustre.

Stripe width is the number of devices (or, in the case of Lustre, the number of OSTs) across which a file is divided. LC assigns different default stripe widths to Lustre on different computers to take advantage of different storage resources as well as differences among each machine's "object storage servers" (OSSs, the Linux nodes that communicate with OSTs). The current default Lustre stripe width for all systems is two.

The lfs find utility reports on the distribution of file segments across Lustre OSTs. To change the default Lustre striping characteristics for a new (empty, not yet written) file or for a directory (so that new files written to it inherit those characteristics), use lfs setstripe (run lfs help setstripe for usage). See man lfs for details.

Some strategy considerations: