Direct GPFS (POSIX interface) tests
MPI-IO tests to GPFS
HDF5 tests to GPFS
I/O stress test
The I/O stress test exercised the GPFS file system (through both the POSIX and the MPI-IO interfaces) continuously over a period of 2 hours in order to assess reliability under conditions of heavy use. A mix of concurrent reads, writes, file creations and deletions, and status operations was performed, drawing on a variety of the test types listed above.
These tests were run on LLNL's large IBM SP systems running Mohonk system software (AIX version 4.3.3, PSSP 3.2, and GPFS v1.3) as part of the PSE 1.1 Milepost. In arithmetic expressions, K denotes 1024, M denotes 1024K, and G denotes 1024M.
1. POSIX interface to GPFS
1.1. Strided pattern: Using 1 client process per node on N=128 nodes, create, write and read a single common file (size = 512MB*N). Each process accesses 512MB, but the file is divided into blocks of size b bytes, with b ranging from 512KB to 512MB (increasing as indicated in the table). The blocks are distributed round-robin to the processes: p0 writes and reads blocks 0, N, 2N, 3N, etc. The size of each individual transfer is the same as the size of the block.
The test with 16MB block size (and 128 nodes) is the pattern that MPI-IO hands off to GPFS for a 128-node instance of the discontiguous test.
Block size    512KB    1MB    4MB   16MB   512MB
MB/s write       91    199    742   1542    1700
MB/s read       244    781   2942   2796    2482
1.2. Strided pattern: Same as 1.1 with N=128, 256 and blocksize=8MB.
The measurement for N=256 (and 8MB blocksize) is the pattern that MPI-IO hands off to GPFS for a 256 node run of the discontiguous test.
Nodes         128    256
MB/s write   1131   1275
MB/s read    3039   2770
1.3. Segmented pattern, varying transfer size: Using 1 client process per node on 64 nodes, create, write, and read a single common file (size = 512MB*64). Each process accesses 512MB, which is contiguous in the file: for each process p in 0..63, p's first byte is located at offset p*512MB. [Most important transfer size = 512KB, the stripe size of the file system.]
Transfer size (KB)    256    512
MB/s write           1951   1909
MB/s read            2332   2476
1.4. Segmented pattern, varying number of clients: Using 1 client process per node, varying the number of nodes N (N in 1,10,20,...), create, write and read a single common file (size = 512MB*N). Each process accesses 512MB, which is contiguous in the file: for each process p in 0..(N-1), p's first byte is located at offset p*512MB. The size of each individual transfer was 512KB.
Nodes         1    10    20    30    40    50    60    70    80    90   100   110   120   128   256
MB/s write  178  1319  1517  1584  1682  1621  1908  1946  1828  1871  1919  1928  1884  1867  2016
MB/s read   302  1070  1516  1836  2271  2287  2438  2483  2455  2482  2429  2463  2456  2504  2552
1.5. Segmented pattern, varying clients per node: Varying the number c of client processes per node (c=1,2,4,8,16), using 64 nodes in each instance, create, write and read back a single common file (size = 32GB). Each process wrote and read back 16/c separate blocks of 32MB contiguous bytes, where the stride between successive blocks written/read by a process is 2GB/c. Processes were distributed round-robin to the nodes. The size of each individual transfer was 512KB.
Clients/node     1     2     4     8    16
MB/s write    1888  1520  1491  1158   751
MB/s read     2583  1951  1240  1096   892
1.6. Empty files: Same as 1.4, but the file size is 0 bytes; all processes participate in opening and closing the file, but only p0 writes or reads. (128 nodes * 16 processes/node.)
Create/write   0.776 seconds
Read           0.034 seconds
1.7. File per process: Similar to 1.4, except that each process accesses a separate 512MB file, all in a single directory.
Nodes         1    10    30    60    90   128   256
MB/s write  177  1461  1526  1666  1721  1707  1772
MB/s read   293  1197  1836  2625  2731  2700  2657

2. IBM's MPI-IO library over GPFS

2.1. Small, randomly assigned blocks: Using 1 client process per node on N nodes, create, write and read a single common file (size = 512MB*N). Each process accesses 512MB, but the file is divided into blocks of 1K bytes, with each block randomly assigned to one of the client processes. Each transfer is a collective operation, involving enough blocks to cover a large contiguous area of the file (2GB).
Nodes         128
MB/s write    606
2.2. Strided pattern, varying block size: Similar to 1.1, but for MPI-IO using multiblock transfers and collective writes and reads. In this example, MPI-IO reorganizes the strided pattern into another strided pattern with a 16 MB block size.
32 nodes, 1 client per node.

Block size   1KB   4KB   16KB   64KB   256KB   1MB   2MB   4MB   8MB
MB/s write   384   731    814    819     833   921   952   940   893
MB/s read    901  1030   1034   1115    1187   532   682  1199  1422
128 nodes, 1 client per node, IP mode.

Block size   1KB   4KB   16KB   64KB   256KB   1MB   4MB
MB/s write   525   528    566    559     553   603   651
MB/s read   1098  1148   1196   1256    1396  1504  1502
2.3. Segmented Pattern: Same as 1.4, but using MPI-IO independent writes and reads.
Nodes        10    30
MB/s write  1324  1546
MB/s read   1273  1503

Nodes        16    32    64    96   128
MB/s write  1410  1496  1657  1672  1870
MB/s read   1393  1795  2289  2289  2338
2.4. Empty files: Same as 1.6, but using MPI-IO open/close instead of POSIX. Opening a 4-byte file with 96 nodes * 16 clients/node = 1536 clients took less than 1 second.

3. NCSA's HDF5 over IBM's MPI-IO library over GPFS

Similar to GPFS Test 1.4, but using HDF5 open/write/read/close instead of POSIX. The number of HDF5 datasets was equal to the number of nodes used (and only a single process per node was used). Within each HDF5 dataset a segmented pattern was used, and each dataset contained 2047.5MB (current parallel HDF5 cannot handle datasets >= 2GB). The size of each individual transfer was 512KB. No results are shown for reads because the tests ran out of time before the read tests completed, but no problems were encountered aside from the slowness.
Nodes            30          128
Create/write     65.7 MB/s   77.4 MB/s
Read             (not completed)
To ensure that ASCI needs will be met, close collaboration is required with developers of file systems, parallel HDF5 libraries, and MPI-IO libraries and with developers of high-level data management libraries, as well as with application developers. The PSE will work closely with the following external collaborators:
- Developers of partner-supplied parallel file systems (IBM, SGI, and Intel).
- Developers of research parallel file systems.
- Developers of MPI-IO libraries (partners, ISVs and Argonne National Laboratory).
- Developers of high-level data management libraries (HDF5 from NCSA, SAF, CDMlib).
- Developers of scalable global parallel file systems (support for Panasas PathForward project).
The PSE provides high-performance I/O services to the following internal collaborators:
- ASCI VIEWS Scientific Data Management, particularly the Data Models and Formats group
Considerable efforts have been expended within the PSE (in collaboration with IBM and LLNL systems administration staff) to characterize, diagnose, and enhance the parallel I/O systems, in particular the GPFS file systems and the MPI-IO implementation from IBM. A substantial portion of the activity in support of parallel I/O within the PSE falls into a familiar sequence of events:
- Perform large scale test to characterize performance or stability.
- Observe anomaly. (GPFS crash, test hangs, performance issue, etc.)
- Characterize circumstances of anomaly. Develop repeatable demonstration of the anomaly.
- Work with system administration staff and IBM personnel to diagnose problem.
- Work with system administration staff and IBM personnel to test fix.
- Iterate process starting at appropriate stage.
Initial stabilization and tuning of GPFS on the ASCI system
The initial delivery of GPFS with the ASCI blue system was the first GPFS installation at such a large scale, and it required significant testing and characterization before production users were allowed access. Below are descriptions of six instances that required major efforts of the type described.
MPI-IO/GPFS working group
The ASCI white statement of work described a framework for collaboration between LLNL and IBM in order to define appropriate performance requirements for the MPI-IO layer on top of GPFS. The working group held teleconferences as needed to discuss the details of and progress in achieving the requirements. In addition to defining the requirements and measuring those levels of performance, two research papers (Europar 2000 and SC2001) resulted from the collaboration.
Diagnosis of switch microcode bug
In the spring of 1999, we began testing the IBM implementation of MPI-IO in earnest. In initial comparisons of the performance of the IBM implementation to the portable ROMIO implementation, a severe performance problem was noted. The problem was eventually tracked down to a bug in the switch microcode that stalls IP communication when there is moderately heavy US (User Space) traffic. MPI-IO in User Space produces significant US traffic across the switch, while GPFS/VSD produces significant IP traffic; this mix is what triggered the problem. The situation in which nodes make significant simultaneous use of US and IP is not very common, so most SP users would never experience the problem outside MPI-IO, though some certainly could. When the new microcode was installed and tested, performance improvements of as much as a factor of 50 were observed in the performance tests being used.
GPFS hangs for large nodecount test: PMR 25642/APAR IY04923
In the winter of 1999-2000 we found that some of the large-scale (64- and 128-node) tests would cause GPFS on blue to hang and require a restart. After receiving an Efix from IBM, yet larger tests (up to 256 nodes) were run successfully.
GPFS data shipping bugs
Summer 2000: GPFS 1.3 contains a new feature referred to as GPFS data shipping, which was designed and implemented specifically to support optimized MPI-IO operations. We began testing this in earnest in the May 2000 timeframe. A number of GPFS crashes and hangs were traced to the new portions of the filesystem code. Dumps taken upon GPFS hangs or crashes provided many details for resolving the issues. Several sessions running the offending test cases under close scrutiny by IBM developers provided more insight.
MPI-IO memory allocation bug
Fall 2000: For tests involving the strided data pattern on large files, some failures of the MPI-IO implementation were noted. As a check, the same test cases were run using the ROMIO implementation, and those failures were not present. Further investigation included writing the same data pattern but using different datatype descriptions to define the pattern. The problem was tracked to the memory allocation algorithm for datatypes in the IBM implementation. The problem was resolved and tested.
Diagnosis/tuning of GPFS 1.3: GPFS block non-coalescence
Summer - Fall 2000: In scaling tests for GPFS 1.3 on snow, write performance was observed to peak for tests using three nodes and then drop off to approximately one half of the peak value. Testing eventually led to the diagnosis that GPFS file blocks, which were split into smaller packets for transmission over the switch, were not being coalesced before being written, causing GPFS to perform a read-modify-write for the second half of each GPFS block. The queue_depth tuning parameter on the SSA arrays was changed from 40 to 8 and finally to 2 to alleviate the problem. This dramatically reduced the non-coalesced writes and restored the scaling performance.
To quantify the performance improvements, a battery of MPI-IO tests was run. The new tuning parameter led to improved write performance for the three most important data patterns:
- Segmented pattern for >3 nodes: improved by a factor of about 2x.
- Strided pattern with 8 nodes: ~30% improvement.
- Discontiguous pattern (randomly assigned 1K blocks): ~10% improvement.
The quality of GPFS, in terms of deliverable performance and stability, has greatly benefited from extensive ASCI testing. The IBM platforms at LLNL are large SP systems with the largest GPFS filesystems in existence. The explicit PSE I/O testing, as well as the stresses of production workloads, has subjected GPFS to its most comprehensive performance and reliability tests to date. Beyond identifying, characterizing, and tracking resolution of GPFS problems, we have collaborated with IBM on performance enhancements (e.g., data shipping in GPFS and MPI-IO).