New ‘Unify’ File Systems Deliver Fast I/O Performance over Distributed Storage

Monday, November 26, 2018

It’s no secret that high performance computing (HPC) systems are growing in complexity and capability. As LLNL brings online Sierra, one of the world’s fastest supercomputers, computer scientists are investigating performance efficiency improvements at all levels of next-generation HPC architectures.

LLNL’s Center for Applied Scientific Computing (CASC) and Livermore Computing divisions recently joined forces to tackle performance drag caused by input/output (I/O) workloads. Kathryn Mohror, head of CASC’s Data Analysis Group, explains, “Our applications are ingesting and producing more data at a higher rate than ever before.”

Mohror leads a team developing a file system framework called Unify. The project team includes LLNL’s Adam Moody, Danielle Sikich, Ned Bass, and Cameron Stanavige, along with external collaborators from Florida State University and Oak Ridge National Laboratory.

Breaking through the Bottlenecks

Bottlenecks can occur when I/O operations are performed to a shared resource, typically a parallel file system. Recently, burst buffers have been introduced into HPC systems as temporary storage located between the compute nodes and the parallel file system. Applications can quickly perform their “bursts” of I/O to and from the burst buffers, and data that should be stored permanently can be transferred to the parallel file system in the background while the application continues its compute operations.
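The burst-buffer pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real burst buffer is implemented: two temporary directories stand in for the burst buffer and the parallel file system, and the names `write_burst`, `drain_async`, and `run_demo` are hypothetical.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def write_burst(burst_dir: Path, name: str, payload: bytes) -> Path:
    """Write a burst of data to the fast tier and return the file's path."""
    path = burst_dir / name
    path.write_bytes(payload)
    return path


def drain_async(src: Path, pfs_dir: Path) -> threading.Thread:
    """Copy a file to the slower, permanent tier on a background thread."""
    worker = threading.Thread(target=shutil.copy2, args=(src, pfs_dir / src.name))
    worker.start()
    return worker


def run_demo() -> bool:
    """Return True if the drained copy matches what the application wrote."""
    # Assumption: temporary directories stand in for the two storage tiers.
    with tempfile.TemporaryDirectory() as burst, tempfile.TemporaryDirectory() as pfs:
        burst_dir, pfs_dir = Path(burst), Path(pfs)
        payload = b"step-1 state" * 1000

        # 1. The application performs its burst of I/O to the fast tier.
        checkpoint = write_burst(burst_dir, "ckpt_0001.bin", payload)

        # 2. The transfer to permanent storage starts in the background.
        worker = drain_async(checkpoint, pfs_dir)

        # 3. The application continues computing while the copy proceeds.
        _ = sum(i * i for i in range(100_000))

        # 4. Wait for the drain to finish before the fast tier is torn down.
        worker.join()
        return (pfs_dir / checkpoint.name).read_bytes() == payload
```

The point of the sketch is the ordering: the application blocks only on the fast write, and the slow transfer overlaps with computation.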

Burst buffers greatly improve the I/O performance of applications and reduce the load on the parallel file system. Two designs of burst buffers exist today: shared and node-local. Shared burst buffers are implemented as storage locations that are shared by multiple compute nodes and jobs. They relieve some of the pressure on the parallel file system but can suffer from contention problems depending on allocation and scheduling policies.

Node-local burst buffers are located on individual compute nodes and are not shared, which can result in scalable I/O performance without contention. However, node-local burst buffers present challenges to applications that want to share data across compute nodes in a job. The Unify project addresses these challenges and achieves the best of both worlds: high, scalable performance and data sharing across compute nodes.

The idea for Unify was born from an earlier research prototype, which Mohror’s team presented at the 2016 Supercomputing Conference. The first Unify production file system is now under development for the Exascale Computing Project. This Unify file system delivers high performance because it utilizes node-local storage, and it eliminates inter-job interference because the file system is isolated to the compute nodes used by an individual job. A Unify API transfers data that needs to be stored permanently to the parallel file system. This temporary file storage strategy streamlines job execution. “You keep only the data you need to keep,” summarizes Mohror.

Figure: When the user runs a job, Unify provides a shared namespace across node-local burst buffers allocated to the job. An API transfers the data the user wants to keep back to the parallel file system.
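To make the idea of a shared namespace over node-local storage concrete, here is a toy, in-memory Python model. It is a sketch under stated assumptions and does not reflect Unify’s actual design or API: the class `JobNamespace`, its methods, and the example paths are all hypothetical, and plain dictionaries stand in for node-local burst buffers and the parallel file system.

```python
class JobNamespace:
    """Toy model: one shared path index over private per-node stores."""

    def __init__(self, node_ids):
        # Each node in the job gets its own private store (hypothetical).
        self.stores = {node: {} for node in node_ids}
        # Shared index: which node currently holds each path.
        self.owner = {}

    def write(self, node, path, data):
        """Store data on the writer's own node and record it in the index."""
        previous = self.owner.get(path)
        if previous is not None and previous != node:
            # The path moved to a new writer; drop the stale copy.
            del self.stores[previous][path]
        self.stores[node][path] = data
        self.owner[path] = node

    def read(self, reader, path):
        """Any node in the job can read any path via the shared index."""
        if reader not in self.stores:
            raise KeyError(f"node {reader} is not part of this job")
        return self.stores[self.owner[path]][path]

    def persist(self, paths, parallel_fs):
        """Copy only the selected paths to permanent storage."""
        for path in paths:
            parallel_fs[path] = self.stores[self.owner[path]][path]


# Usage: node 0 writes, node 1 reads, and only one file is kept.
job = JobNamespace(node_ids=[0, 1])
job.write(0, "/job/ckpt.0", b"state-a")
job.write(1, "/job/scratch.tmp", b"temporary")
shared_read = job.read(1, "/job/ckpt.0")

parallel_fs = {}
job.persist(["/job/ckpt.0"], parallel_fs)
```

In this model the scratch file never reaches `parallel_fs`; it disappears with the job’s namespace, which mirrors the “keep only the data you need to keep” strategy.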

Mix and Match

The Unify framework also supports different kinds of workloads. General-purpose file systems must support every kind of I/O workload, so parallel data operations can end up being performed serially. This problem is compounded in data-intensive scientific applications.

“By designing file systems for specific I/O workloads, we can reduce the need for expensive file system operations and speed up I/O overall,” says Mohror. “We want to provide a range of specialized Unify file systems so users can choose what is best for their needs.”

Like much of LLNL’s HPC performance improvement software, Unify is open source. The first Unify file system, UnifyCR (for checkpoint/restart workloads), is already available on GitHub. The team is working on another file system in the Unify “family” designed for machine learning workloads, in which large data sets need to be distributed quickly. Additional Unify file systems are in development, using LLNL’s Catalyst HPC cluster for testing.