High Performance Computing: Scalable I/O Project

Disk I/O Test Definitions

Data Patterns Important to ASCI Applications

  • segmented data pattern: each client writes and reads a contiguous portion of a single common file. This is the "bread and butter" data pattern for the GPFS filesystem.
  • strided data pattern: each client writes and reads to interleaved portions of a single common file.
  • file per process: each client writes and reads contiguous data to its own private file. For the tests discussed here, the private files reside in a directory common to all clients in a shared (parallel) filesystem.
  • empty files: this type of test measures the overhead of opening a file in parallel.
  • randomized 1k blocks: 1k blocks are randomly assigned to the clients for writing to a single file. Designed as a worst-case data pattern for parallel I/O of densely covered files.
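The byte ranges each client touches under the segmented and strided patterns can be sketched as follows. This is an illustrative sketch only; the function names and example sizes are ours, not taken from the test suite.

```python
# Sketch of the per-client byte ranges for two of the data patterns above.
# All names and example sizes here are illustrative, not from the test code.

MB = 1024 * 1024

def segmented_extent(rank, per_client_bytes):
    """Segmented: client `rank` owns one contiguous region of the common file."""
    start = rank * per_client_bytes
    return (start, start + per_client_bytes)

def strided_blocks(rank, nclients, block_bytes, nblocks_per_client):
    """Strided: blocks are dealt round-robin, so client `rank` owns
    blocks rank, rank + nclients, rank + 2*nclients, ..."""
    return [(b * block_bytes, (b + 1) * block_bytes)
            for b in range(rank, rank + nclients * nblocks_per_client, nclients)]

# Client 2 of 4, 512 MB per client, segmented (one contiguous extent):
print(segmented_extent(2, 512 * MB))   # (1073741824, 1610612736)

# Client 0 of 4, 1 MB blocks, 3 blocks per client, strided (every 4th block):
print(strided_blocks(0, 4, MB, 3))
```

The file-per-process pattern is the segmented extent with each client's region placed in its own file rather than at an offset in a common file.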

Direct GPFS (POSIX interface) tests

  • segmented data pattern: vary number of clients, clients per node, and individual transfer size.
  • strided data pattern: vary number of clients, and individual transfer size. The parameters here were chosen for their relevance to interpreting the MPI-IO discontiguous (randomized 1k blocks) test.
  • file per process: one client per node.
  • empty file

MPI-IO tests to GPFS

  • randomized 1k blocks: The discontiguous test specified in the ASCI White Statement of Work proposes write performance for such a pattern as a measure of the effectiveness of the MPI-IO implementation.
  • segmented data pattern: performance characteristics should be identical to the direct GPFS segmented pattern tests.
  • strided data pattern: differs from direct GPFS pattern in that the data may be reorganized into larger transfers in the MPI-IO implementation.
  • empty file

HDF5 tests to GPFS

  • segmented data pattern

I/O stress test

The I/O stress test exercised the GPFS file system (through both the POSIX and the MPI-IO interfaces) over a period of 2 continuous hours in order to assess reliability under conditions of heavy use. A mix of concurrent reads, writes, file creations and deletions, and status operations was performed, drawing on a variety of the test types listed above.

Disk I/O Test Results

These tests were run on LLNL's large IBM SP systems running Mohonk system software (AIX version 4.3.3, PSSP 3.2 and GPFS v1.3) as part of the PSE 1.1 Milepost. In arithmetic expressions K denotes 1024, M denotes 1024K, and G denotes 1024M.

1. POSIX interface to GPFS
1.1. Strided pattern: Using 1 client process per node on N=128 nodes, create, write and read a single common file (size = 512MB*N). Each process accesses 512MB, but the file is divided into blocks of size b bytes, with b ranging from 512KB to 512MB (increasing as indicated in the table). The blocks are distributed round-robin to the processes: p0 writes and reads blocks 0, N, 2N, 3N, etc. The size of each individual transfer is the same as the size of the block.

The test with 16MB blocksize (and 128 nodes) is the pattern that MPI-IO hands off to GPFS for a 128 node instance of the discontiguous test.
Block size 512KB 1MB 4MB 16MB 512MB
MB/s write 91 199 742 1542 1700
MB/s read 244 781 2942 2796 2482

1.2. Strided pattern: Same as 1.1 with N=128, 256 and blocksize=8MB.

The measurement for N=256 (and 8MB blocksize) is the pattern that MPI-IO hands off to GPFS for a 256 node run of the discontiguous test.
Nodes 128 256
MB/s write 1131 1275
MB/s read 3039 2770

1.3. Segmented pattern, varying transfer size: Using 1 client process per node on 64 nodes, create, write, and read a single common file (size = 512MB*64 = 32GB). Each process accesses 512MB, contiguous in the file; for each process p in 0..63, p's first byte is at offset p*512MB. [Most important transfer size = 512KB, the stripe size of the file system.]
Transfer Size (KB) 256 512
MB/s write 1951 1909
MB/s read 2332 2476

1.4. Segmented pattern, varying number of clients: Using 1 client process per node, varying the number of nodes n (n in 1,10,20,...), create, write and read a single common file (size = 512MB*n). Each process accesses 512MB, contiguous in the file; for each process p in 0..(n-1), p's first byte is at offset p*512MB. The size of each individual transfer is 512KB.
Nodes        1   10   20   30   40   50   60   70   80   90  100  110  120  128  256
MB/s write 178 1319 1517 1584 1682 1621 1908 1946 1828 1871 1919 1928 1884 1867 2016
MB/s read  302 1070 1516 1836 2271 2287 2438 2483 2455 2482 2429 2463 2456 2504 2552

1.5. Segmented pattern, varying clients per node: Varied the number c of client processes per node (c=1,2,4,8,16), using 64 nodes in each instance, create, write and read back a single common file (size = 32GB). Each process wrote and read back 16/c separate blocks of 32M contiguous bytes, where the stride between successive blocks written/read by a process is 2GB/c. Processes were distributed round-robin to nodes. The size of each individual transfer was 512KB.
Clients/node 1 2 4 8 16
MB/s write 1888 1520 1491 1158 751
MB/s read 2583 1951 1240 1096 892
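The geometry of test 1.5 has a useful invariant: because each of the 64*c processes writes 16/c blocks of 32MB, the total data volume stays at 32GB no matter how many clients run per node, so the rows above compare like with like. A quick consistency check (a sketch using our own variable names):

```python
# Consistency check for the test 1.5 geometry (illustrative names).
# 64 nodes, c clients per node; each process writes 16/c blocks of 32 MB.

MB = 1024 * 1024
GB = 1024 * MB

def total_bytes(c, nodes=64, blocks_scale=16, block_bytes=32 * MB):
    processes = nodes * c
    blocks_per_process = blocks_scale // c
    return processes * blocks_per_process * block_bytes

# The file size is the same 32 GB for every clients-per-node setting tested:
for c in (1, 2, 4, 8, 16):
    assert total_bytes(c) == 32 * GB
print("all clients-per-node settings cover the same 32 GB file")
```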

1.6. Empty files: Same as 1.4, but the file size is 0 bytes; all processes participate in opening and closing the file, but only p0 writes or reads. (128 nodes * 16 processes/node)
Create/write 0.776
Read 0.034

1.7. File per process: Similar to 1.4, except that each process accesses a separate 512MB file, all in a single directory.
Nodes 1 10 30 60 90 128 256
MB/s write 177 1461 1526 1666 1721 1707 1772
MB/s read 293 1197 1836 2625 2731 2700 2657

2. IBM's MPI-IO library over GPFS:
2.1. Small, randomly assigned blocks: Using 1 client process per node on N nodes, create, write and read a single common file (size = 512MB*N). Each process accesses 512MB, but the file is divided into blocks of 1K bytes, with each block randomly assigned to one of the client processes. Each transfer is a collective operation, involving enough blocks to cover a large contiguous area of the file (2GB).
Nodes 128
MB/s write 606
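The link back to tests 1.1 and 1.2 follows from the collective transfer size: if each collective covers a 2GB contiguous region split evenly among the N clients, each client writes one contiguous chunk of 2GB/N, which is exactly the block size of the strided pattern handed off to GPFS. A small sketch of this arithmetic (our own function name, assuming an even split):

```python
# Per-client contiguous chunk when a 2 GB collective region is split
# evenly among N clients (illustrative arithmetic, assuming an even split).
MB = 1024 * 1024
GB = 1024 * MB

def handoff_block_size(nclients, region_bytes=2 * GB):
    return region_bytes // nclients

print(handoff_block_size(128) // MB)  # 16 -> the 16 MB strided case in test 1.1
print(handoff_block_size(256) // MB)  # 8  -> the 8 MB strided case in test 1.2
```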

2.2. Strided pattern, varying block size: Similar to 1.1, but for MPI-IO using multiblock transfers and collective writes and reads. In this example, MPI-IO reorganizes the strided pattern into another strided pattern with a 16 MB block size.

32 node, 1 client per node.
Block size 1KB 4KB 16KB 64KB 256KB 1MB 2MB 4MB 8MB
MB/s write 384 731 814 819 833 921 952 940 893
MB/s read 901 1030 1034 1115 1187 532 682 1199 1422

128 node, 1 client per node, ip mode.
Block size 1KB 4KB 16KB 64KB 256KB 1MB 4MB
MB/s write 525 528 566 559 553 603 651
MB/s read 1098 1148 1196 1256 1396 1504 1502
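The reorganization described in 2.2 is the standard two-phase collective I/O idea: small strided blocks are first assigned to aggregator processes, each of which then issues one large contiguous transfer. A simplified sketch of the assignment step (our own function names; the internals of IBM's implementation are not shown here):

```python
# Simplified two-phase reorganization sketch (illustrative, not IBM's code).
# Phase 1: each small strided block is assigned to the aggregator whose
# contiguous file region contains it.  Phase 2 (not shown) would be the
# large contiguous write issued by each aggregator.
KB = 1024
MB = 1024 * KB

def aggregator_for_block(block_index, block_bytes, region_bytes):
    """Which aggregator's contiguous region holds this block's file offset?"""
    return (block_index * block_bytes) // region_bytes

# 1 KB strided blocks regrouped into 16 MB contiguous aggregator regions:
owners = [aggregator_for_block(i, KB, 16 * MB)
          for i in range(0, 64 * 1024, 16 * 1024)]
print(owners)  # blocks 0, 16384, 32768, 49152 fall in regions [0, 1, 2, 3]
```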

2.3. Segmented Pattern: Same as 1.4, but using MPI-IO independent writes and reads.

Nodes 10 30
MB/s write 1324 1546
MB/s read 1273 1503

Nodes 16 32 64 96 128
MB/s write 1410 1496 1657 1672 1870
MB/s read 1393 1795 2289 2289 2338

2.4. Empty Files: Same as 1.6, but using MPI-IO open/close instead of POSIX. Open 4 byte file: 96 nodes * 16 clients/node = 1536 clients: less than 1 second.

3. NCSA's HDF5 over IBM's MPI-IO library over GPFS:
Similar to GPFS Test 1.4, but using HDF5 open/write/read/close instead of POSIX. The number of HDF5 datasets was equal to the number of nodes used (and only a single processor per node was used). Within each HDF5 dataset a segmented pattern was used, and each dataset contained 2047.5 MB (current parallel HDF5 cannot handle datasets >= 2GB). The size of each individual transfer was 512KB. No results are shown for reads because the tests ran out of time before the read tests completed, but no problems were encountered, aside from the slowness.
Nodes 30 128
Create/write 65.7 MB/s 77.4 MB/s
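The 2047.5 MB dataset size follows from the constraints stated above: it is the largest whole number of 512KB transfers that stays strictly under the 2GB dataset limit. A quick arithmetic check (our own sketch):

```python
# Check that 2047.5 MB is the largest whole number of 512 KB transfers
# strictly below the 2 GB parallel-HDF5 dataset limit (illustrative sketch).
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

transfer = 512 * KB
limit = 2 * GB
ntransfers = (limit - 1) // transfer   # largest count strictly under 2 GB
dataset_bytes = ntransfers * transfer

print(ntransfers)          # 4095 transfers of 512 KB per dataset
print(dataset_bytes / MB)  # 2047.5 MB, as used in the test
```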

PSE Involvement and Contribution

Scalable I/O Project Collaboration and Coordination

To ensure that ASCI needs will be met, close collaboration is required with developers of file systems, parallel HDF5 libraries, and MPI-IO libraries and with developers of high-level data management libraries, as well as with application developers. The PSE will work closely with the following external collaborators:

  • Developers of partner-supplied parallel file systems (IBM, SGI, and Intel).
  • Developers of research parallel file systems.
  • Developers of MPI-IO libraries (partners, ISVs and Argonne National Laboratory).
  • Developers of high-level data management libraries (HDF5 from NCSA, SAF, CDMlib).
  • Developers of scalable global parallel file systems (support for Panasas PathForward project).

The PSE provides high-performance I/O services to the following internal collaborators:

  • ASCI VIEWS Scientific Data Management, particularly the Data Models and Formats group

Considerable effort has been expended within the PSE (in collaboration with IBM and LLNL systems administration staff) to characterize, diagnose, and enhance the parallel I/O systems, in particular the GPFS file systems and the MPI-IO implementation from IBM. A substantial portion of the parallel I/O support activity within the PSE falls into a familiar sequence of events:

  • Perform large scale test to characterize performance or stability.
  • Observe anomaly. (GPFS crash, test hangs, performance issue, etc.)
  • Characterize circumstances of anomaly. Develop repeatable demonstration of the anomaly.
  • Work with system administration staff and IBM personnel to diagnose problem.
  • Work with system administration staff and IBM personnel to test fix.
  • Iterate process starting at appropriate stage.

Initial stabilization and tuning of GPFS on the ASCI system
The initial delivery of GPFS with the ASCI Blue system was the first GPFS installation of such large scale. This required significant testing and characterization before production users were allowed access. Below are descriptions of six instances that required major efforts of the type described above.

MPI-IO/GPFS working group
The ASCI white statement of work described a framework for collaboration between LLNL and IBM in order to define appropriate performance requirements for the MPI-IO layer on top of GPFS. The working group held teleconferences as needed to discuss the details of and progress in achieving the requirements. In addition to defining the requirements and measuring those levels of performance, two research papers (Europar 2000 and SC2001) resulted from the collaboration.

Diagnosis of switch microcode bug
In the spring of 1999, we began testing the IBM implementation of MPI-IO in earnest. Initial comparisons of the IBM implementation's performance against the portable ROMIO implementation revealed a severe performance problem. It was eventually tracked down to a bug in the switch microcode that stalls IP communication when there is moderately heavy User Space (US) traffic. MPI-IO running in User Space produces significant US traffic across the switch, while GPFS/VSD produces significant IP traffic; this mix is what triggers the problem. The situation in which nodes make significant simultaneous use of US and IP communication appears to be uncommon: most SP users will never encounter the problem outside MPI-IO, but some certainly could. When the new microcode was installed and tested, performance improvements of as much as a factor of 50 were observed in the performance tests being used.

GPFS hangs for large nodecount test: PMR 25642/APAR IY04923
In the winter of 1999-2000 we found that some of the large scale (64 and 128 node) tests would cause GPFS on Blue to hang and require a restart. Subsequent to receiving an efix from IBM, yet larger tests (up to 256 nodes) were run successfully.

GPFS data shipping bugs
Summer 2000: GPFS 1.3 contains a new feature referred to as GPFS data shipping, which was designed and implemented specifically to support optimized MPI-IO operations. We began testing this in earnest in the May 2000 timeframe. A number of GPFS crashes and hangs were traced to the new portions of the filesystem code. Dumps taken upon GPFS hangs or crashes provided many details for resolving the issues. Several sessions running the offending test cases under close scrutiny by IBM developers provided more insight.

MPI-IO Memory allocation bug
Fall 2000: For tests involving the strided data pattern on large files, some failures of the MPI-IO implementation were noted. As a check the same test cases were run using the ROMIO implementation and those failures were not present. Further investigation included writing the same data pattern, but using different datatype descriptions to define the pattern. The problem was tracked to the memory allocation algorithm for datatypes in the IBM implementation. The problem was resolved and tested.

Diagnosis/tuning of GPFS 1.3: GPFS block non-coalescence
Summer - Fall 2000: In scaling tests for GPFS 1.3 on snow, write performance was observed to peak for tests using three nodes and then drop off to approximately one half of the peak value. Testing eventually led to the diagnosis that GPFS file blocks, which were split into smaller packets for transmission over the switch, were not being coalesced before being written, causing GPFS to perform a read-modify-write for the second half of each GPFS block. The queue_depth tuning parameter on the SSA arrays was changed from 40 to 8 and finally to 2 to alleviate the problem. This dramatically reduced the non-coalesced writes and restored the scaling performance.

To quantify the performance improvements, a battery of MPI-IO tests was run. The new tuning parameter improved write performance for the three most important data patterns:

  • Segmented pattern for >3 nodes: improvement by a factor of about 2x.
  • Strided pattern with 8 nodes: ~30% improvement.
  • Discontiguous pattern (randomly assigned 1k blocks): ~10% improvement.

Benefit to ASCI Program

The deliverable performance and stability of GPFS have benefited greatly from extensive ASCI testing. The IBM platforms at LLNL are large SP systems with the largest GPFS filesystems in existence. The explicit PSE I/O testing, as well as the stresses of production workloads, has subjected GPFS to its most comprehensive performance and reliability tests to date. Beyond identifying, characterizing, and tracking resolution of GPFS problems, we have collaborated with IBM on performance enhancements (e.g., data shipping in GPFS and MPI-IO).

PSE Milepost: HPSS & Disk I/O