Sphinx: Integrated Parallel Microbenchmark Suite
Sphinx, an integrated parallel microbenchmark suite, consists of a harness for running performance tests and extensive tests of MPI, Pthreads and OpenMP. It was adapted from the Special Karlsruhe MPI (SKaMPI) Benchmark suite [1] by Bronis R. de Supinski and other members of the PSE/ASDE project, including John May and Bor Chan. LLNL adaptations include extensive tests of the Pthreads interface [2] and ongoing integration of the LLNL OpenMP Performance Suite.
In addition, several new MPI tests have been added, primarily focusing on the performance of collective operations, including the first widely available tests that accurately measure the operation latency of fan-out collective operations such as MPI_Bcast [3]. Sphinx was a significant aspect of the ASCI PSE Milepost Run Time Systems Performance Testing; a paper on this effort should be available in the near future. The entire suite is implemented in C and has been run on a wide variety of platforms.
The Sphinx test harness provides a flexible mechanism for running performance tests. The action being measured, such as a message pingpong, is accessed through a function pointer. Different threads or tasks can execute different functions, which supports the measurement of highly complex parallel actions. The Sphinx harness provides a flexible mechanism for varying parameters, or independent variables, of the action. The harness times repeated calls (iterations in Sphinx terminology) of the action.
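The function-pointer design described above can be illustrated with a minimal sketch. It is not taken from the Sphinx sources: the names action_fn, count_calls and time_action are hypothetical, and the C11 timespec_get call stands in for whatever timer the real harness uses.

```c
#include <time.h>

/* Hypothetical type for a measured action; not Sphinx's actual signature. */
typedef void (*action_fn)(void *arg);

/* Stand-in action that only counts its calls. A real action would be
 * something like a message pingpong. */
void count_calls(void *arg)
{
    int *counter = arg;
    (*counter)++;
}

/* Time `iterations` back-to-back calls of `action` and return the mean
 * wall-clock seconds per call. Assumes iterations > 0. */
double time_action(action_fn action, void *arg, int iterations)
{
    struct timespec start, end;
    int i;

    timespec_get(&start, TIME_UTC);
    for (i = 0; i < iterations; i++)
        action(arg);
    timespec_get(&end, TIME_UTC);

    return ((double)(end.tv_sec - start.tv_sec)
            + (double)(end.tv_nsec - start.tv_nsec) * 1e-9) / iterations;
}
```

Because the harness sees only the pointer, each thread or task could be handed a different action without any change to the timing loop.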
Sphinx measures several timings (repetitions in Sphinx terminology) and outputs their arithmetic mean for a given set of parameters for the action. Timing stops when the standard deviation of the repetitions falls below a user-determined percentage of their mean, provided that a minimum number of repetitions has been measured. Since this cut-off may never be achieved, the harness guarantees test termination through a user-specified maximum number of repetitions.
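The stopping rule can be sketched as follows. This is an illustration under stated assumptions rather than the Sphinx implementation: all names (timing_fn, measurement, replay, measure_until_settled) are hypothetical, the population standard deviation is assumed, and timings are assumed to be non-negative. The comparison is done on squared values so that no square root is needed.

```c
#include <stdbool.h>

/* Hypothetical source of one timing (one repetition). */
typedef double (*timing_fn)(void *ctx);

typedef struct {
    double mean;      /* arithmetic mean of the repetitions taken */
    int repetitions;  /* number of repetitions actually measured */
    bool settled;     /* true if the cut-off was reached */
} measurement;

/* Deterministic stand-in timing source that replays a fixed array. */
typedef struct { const double *values; int count; int next; } replay;

double replay_timing(void *ctx)
{
    replay *r = ctx;
    return r->values[r->next++ % r->count];
}

/* Repeat until stddev < cutoff_percent% of the mean (after at least
 * min_reps repetitions) or until max_reps repetitions have been taken. */
measurement measure_until_settled(timing_fn take_timing, void *ctx,
                                  int min_reps, int max_reps,
                                  double cutoff_percent)
{
    measurement m = {0.0, 0, false};
    double sum = 0.0, sum_sq = 0.0;

    while (m.repetitions < max_reps) {
        double t = take_timing(ctx);
        sum += t;
        sum_sq += t * t;
        m.repetitions++;
        m.mean = sum / m.repetitions;

        if (m.repetitions >= min_reps) {
            double variance = sum_sq / m.repetitions - m.mean * m.mean;
            double limit = cutoff_percent / 100.0 * m.mean;
            if (variance < 0.0)
                variance = 0.0;  /* guard against rounding error */
            /* stddev < limit is equivalent to variance < limit^2 */
            if (variance < limit * limit) {
                m.settled = true;
                break;
            }
        }
    }
    return m;
}
```

With a 5 percent cut-off, a steady source settles as soon as the minimum number of repetitions is reached, while a source alternating between 1.0 and 3.0 runs to the maximum and comes back with settled set to false.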
Results that do not achieve the cut-off are annotated as “UNSETTLED” in the output. Sphinx includes a facility that can optionally be used to correct for harness overhead including that of the function call to the action routines. Results that have anomalous corrections are annotated “UNRELIABLE” in the output. Anomalous corrections include the overhead exceeding the measurement value; complex tests, such as our accurate tests of the operation latency of fan-out collectives, can have more complicated anomalous correction conditions.
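A minimal sketch of the simplest anomalous-correction case, where the overhead exceeds the measurement, might look like the following. The names correction, subtract_overhead and annotation are hypothetical, and the more complicated conditions used by the fan-out collective tests are not modeled.

```c
#include <stdbool.h>

typedef struct {
    double corrected;  /* measurement with harness overhead removed */
    bool unreliable;   /* true when the correction is anomalous */
} correction;

/* Subtract the separately measured harness overhead from a measurement.
 * Only the "overhead exceeds the measurement" anomaly is checked here. */
correction subtract_overhead(double measured, double overhead)
{
    correction c;
    c.unreliable = overhead > measured;
    c.corrected = measured - overhead;
    return c;
}

/* Annotation to attach to the output line for this result. */
const char *annotation(correction c)
{
    return c.unreliable ? "UNRELIABLE" : "";
}
```

A measurement of 10.0 with an overhead of 1.0 yields a corrected value of 9.0 and no annotation; a measurement of 1.0 with an overhead of 2.0 is flagged.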
Sphinx is highly portable. The primary portability issue involves the use of processor bindings in the Pthreads tests. The mechanism for binding threads to processors varies widely between platforms, both in interface and in capabilities. Sphinx includes a module that abstracts most of these issues. However, care needs to be exercised, as several platforms have failed to exhibit appropriate binding behavior. Another possible portability snag is that the code currently must be linked with both the MPI and Pthreads libraries. Future changes will support a “no Pthreads” version, and a “no MPI” version may also be implemented. A further portability issue is the current Makefile mechanism, which will probably be replaced by an autoconf script in the future.
Sphinx is implemented in C. It includes tests of three popular parallelization mechanisms: MPI, Pthreads and OpenMP. The OpenMP tests are excluded by default; they are included only if the macro _HAVE_OMP_ is defined. Generally, the choice of which tests are available in the executable is decided by the makefile target chosen.
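The effect of the _HAVE_OMP_ macro can be pictured with a small sketch of a compile-time test table. Only the macro name comes from the description above; the table layout, the test names and the helper functions are hypothetical and are not how Sphinx necessarily organizes its tests.

```c
#include <stddef.h>

/* Hypothetical table entry describing one available test. */
typedef struct {
    const char *name;
    const char *group;
} test_entry;

const test_entry available_tests[] = {
    {"pingpong", "MPI"},
    {"mutex_lock", "Pthreads"},
#ifdef _HAVE_OMP_
    /* Compiled in only when the build defines _HAVE_OMP_. */
    {"parallel_for", "OpenMP"},
#endif
};

/* Number of tests compiled into this executable. */
size_t available_test_count(void)
{
    return sizeof available_tests / sizeof available_tests[0];
}

/* Returns 1 if the OpenMP tests were compiled in, 0 otherwise. */
int openmp_tests_enabled(void)
{
#ifdef _HAVE_OMP_
    return 1;
#else
    return 0;
#endif
}
```

Building with the macro defined on the compiler command line (for example -D_HAVE_OMP_) adds the OpenMP entry; without it the table holds only the MPI and Pthreads entries.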
The MPI tests cover the full range of MPI collective communications and include pingpong tests of a variety of MPI send and receive combinations; other MPI tests are planned. The suite also includes tests of most Pthreads functions and of all OpenMP constructs, plus the auxiliary OpenMP locking functions.
1. R.H. Reussner, “User Manual of SKaMPI, Special Karlsruher MPI-Benchmark,” Tech. Report, University of Karlsruhe, 1998.
2. B.R. de Supinski and J. May, “Benchmarking Pthreads Performance,” Proc. of the 1999 Intl. Conf. on Parallel and Distributed Processing Techniques and Applications, 1999, pp. 1985-1991.
3. B.R. de Supinski and N. Karonis, “Accurately Measuring Broadcasts in a Computational Grid,” Proc. of the 8th Intl. Symp. on High Performance Distributed Computing, 1999, pp. 29-37.