High Performance Computing: PSE Milepost: HPSS & DISK-IO

White PSE Milepost (Excerpt)

Disk Input/Output and HPSS Tests


ASCI applications need reliable, easy-to-use high performance input/output systems whose throughput rates and capacity are in balance with the computational speed and memory size of the ASCI platforms. FY01 performance requirements for the 10-30 TeraOp systems include 10 Gigabytes/second disk I/O.

Overview of Milepost Tests

Applications Tests

For the application codes used in the Code Capability Demonstration, the tests will demonstrate that the platform I/O and file system infrastructure can manage the capture and movement of simulation restart dumps and visualization files for subsequent analysis. For each code in the demonstration, the Code Capability Status and Run Record form will capture evidence of the I/O to GPFS and HPSS.

HPSS Tests

The High Performance Storage System (HPSS) milepost demonstrations were created to show ASCI 12 TeraOp platform-to-archive data transfer functionality and capabilities at scale and under load. They were also designed to show that ASCI platform applications and code developers can depend on the stable delivery of these key services. These tests demonstrate end-to-end I/O from GPFS disk to HPSS disk using NFT/Endeavor or FTP/PFTP (a single FTP interface is planned that will transparently invoke PFTP as required for the highest data transfer rates on large files).

Disk I/O Tests

The General Parallel File System (GPFS) is a high-performance, scalable file system designed for massively parallel computing. GPFS addresses the ASCI program's need for the large-scale data storage and retrieval that tera-scale computation requires. The objective of the milepost I/O tests is to demonstrate I/O from ASCI applications to GPFS disk, and this portion of the testing entails an in-depth study of that disk I/O. I/O stress tests exercise the GPFS file system through the POSIX, MPI-IO, and HDF5 interfaces over a period of 2 continuous hours. They show the performance, stability, and functionality of GPFS, HDF5, MPI-IO, and HPSS for the I/O patterns most important to key ASCI applications, and they track the quality and performance levels of these I/O resources.
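The idea of a timed stress test through the POSIX interface can be illustrated with a minimal sketch. This is not the milepost test suite: the function names `measure_write_throughput` and `stress`, the block and file sizes, and the single-process design are all assumptions made for illustration. The actual tests ran in parallel and also covered the MPI-IO and HDF5 interfaces.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_bytes=8 * 1024 * 1024,
                             block_bytes=1024 * 1024):
    """Write total_bytes to a scratch file in `directory`; return MB/s.

    Hypothetical helper: uses only POSIX-style calls (write, fsync).
    """
    block = b"\0" * block_bytes
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        written = 0
        while written < total_bytes:
            # os.write may write fewer bytes than requested, so count
            # what was actually written and never exceed total_bytes.
            written += os.write(fd, block[: total_bytes - written])
        os.fsync(fd)  # include the flush to storage in the timing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)  # remove the scratch file
    return written / elapsed / 1e6


def stress(directory, duration_s, **kwargs):
    """Repeat the write test until duration_s has elapsed.

    Returns the list of throughput samples (MB/s), one per pass, so
    that rate variation over the run can be tracked.
    """
    samples = []
    deadline = time.monotonic() + duration_s
    while True:
        samples.append(measure_write_throughput(directory, **kwargs))
        if time.monotonic() >= deadline:
            break
    return samples
```

Under these assumptions, a 2-hour run would call `stress(path_to_file_system, 7200)` with a directory on the file system under test. A single process like this measures only one client's write rate, so its numbers say nothing about aggregate parallel throughput.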

PSE Involvement and Contribution

HPSS Milepost/PSE Involvement

The High Performance Storage System (HPSS) is a large collaborative software development project, begun in 1993 as a Cooperative Research and Development Agreement (CRADA) between government and industry. The HPSS collaboration is based on the premise that no single organization has the experience and resources to meet all the extreme challenges represented by the growing storage system I/O, capacity, and functionality imbalances present in high-performance computing environments such as ASCI.

Scalable I/O Project Collaboration and Coordination

One fundamental activity of the Scalable I/O Project within the Problem Solving Environment is to help platform partners, ISVs, and researchers improve the quality of their parallel file systems and mid-level parallel I/O systems (such as MPI-IO). The SIOP determines requirements for these products, measures their performance and reliability, advises the platform partners and researchers about how these systems are really used, and educates users about how to get the most benefit from them. Moreover, by keeping track of trends in high performance computing hardware and software architectures and by projecting the evolution of user needs, the SIOP devises plans to guide the long-term development of future I/O systems.

Another major activity of this project is to ensure that higher-level I/O libraries make effective use of the lower levels. In particular, the layers of the Tri-Lab high-level I/O model (e.g., SAF API, CDMlib, HDF5, MPI-IO, industry file systems) must work well together on all ASCI platforms. We will continue to perform end-to-end testing of these layers and help to analyze and improve the overall performance as seen by the applications.

To ensure that ASCI needs will be met, close collaboration is required with developers of file systems, parallel HDF5 libraries, and MPI-IO libraries and with developers of high-level data management libraries, as well as with application developers. Considerable efforts have been expended within the PSE (in collaboration with IBM and LLNL systems administration staff) to characterize, diagnose, and enhance the parallel I/O systems, in particular the GPFS file systems and the MPI-IO implementation from IBM.

Benefit to the ASCI Program

The most obvious and direct benefits of PSE's involvement in, and funding of, the HPSS Project are made evident by the PSE Milepost runs. These runs demonstrated that HPSS provides ASCI codes and code developers with a scalable, stable, high-performance archive in which the results, images, and work products generated by ASCI multi-TeraOp machines can be stored. This archive not only provides permanent and safe storage of the fruits of the ASCI computational investment, but also frees up platform disk resources, allowing codes and users to continue generating stockpile simulations while data moves to more cost-effective storage media.

The ASCI Program's extreme storage requirements demand the performance, security, and scalability provided by HPSS. The ability of the ASCI Program to directly and immediately influence HPSS requirements based on unique ASCI user and security needs cannot be overstated. At the same time, the ASCI Program benefits directly by leveraging the substantial investment and expertise that IBM, Sun, StorageTek, and other HPSS development partners bring to the HPSS product and its functionality. This collaboration and its product serve ASCI users around the clock, 365 days a year, at all three ASCI laboratories.

The quality of GPFS, in terms of deliverable performance and stability, has benefited greatly from extensive ASCI testing. The IBM platforms at LLNL are large SP systems with the largest GPFS file systems in existence. The explicit PSE I/O testing, together with the stresses of production workloads, has subjected GPFS to its most comprehensive performance and reliability tests to date. Beyond identifying, characterizing, and tracking the resolution of GPFS problems, we have collaborated with IBM on performance enhancements (e.g., data shipping in GPFS and MPI-IO).

For more information, contact:

HPSS Contacts:
George Richmond, richmond4@llnl.gov, 925-423-9833

Disk I/O Contacts: