
Simple MPI Performance Measurements II

Introduction | Benchmarks | General Test Conditions | Results | Summary | Algorithms | PVM Results


Introduction

The information presented herein is a series of simple MPI performance measurements on some large systems at the NASA Advanced Supercomputing (NAS) Division at NASA Ames Research Center. Additional MPI performance measurements (MPI Performance Measurements and Simple MPI Performance Measurements I) are also available.

The work described was performed as part of the AMTI/CSC support contract with the NAS division at the NASA Ames Research Center and is reproduced here with the permission of NASA.


Benchmarks

Three benchmarks are used to make these MPI performance measurements:

LBW—Latency and Bandwidth Test

The LBW benchmark attempts to measure the point-to-point message passing latency and bandwidth. The test uses two MPI processes that repeatedly exchange messages. The time to send/receive a small message is a measure of the latency in the message passing system. Exchanging large messages is used to determine the bandwidth of the message passing system. The size of the message is user-specifiable, so the test can be used to profile the message passing bandwidth as a function of message size.
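The arithmetic behind these two measurements can be sketched as follows. This is an illustrative helper, not the actual lbw source; the factor of two reflects the send/receive exchange performed on each pass.

```python
def lbw_metrics(elapsed_s, num_times, msg_size_bytes):
    """Derive latency and bandwidth from an LBW-style timing loop.

    Each pass is one send/receive exchange, so a message crosses
    the link twice per pass.
    """
    transfers = 2 * num_times                # total messages moved
    latency_s = elapsed_s / transfers        # per-message time (small messages)
    bandwidth = transfers * msg_size_bytes / elapsed_s  # bytes/s (large messages)
    return latency_s, bandwidth

# Hypothetical numbers: 1000 exchange passes of 1 MB messages in 8.4 s
lat, bw = lbw_metrics(8.4, 1000, 1_000_000)
```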

The LBW command line is:

lbw -n num_times [-B|-L] [-b buff_size] [-s sync/async] [-h] [-a]

where

-n num_times
number of times to repeat the send-recv message passing sequence.
-B|-L
select bandwidth (-B) or latency (-L) test; default is the bandwidth test.
-b buff_size
message buffer size in bytes.
-s sync/async
synchronous/asynchronous message passing style; sync is the default.
-h
print usage line.
-a
print out info line for each MPI process.

Top


ATA—All-To-All Test

The ATA benchmark performs a sequence of calls to the MPI_Alltoall() library routine. The MPI_Alltoall() routine exchanges a data buffer with all the other MPI processes. This routine can generate considerable message traffic and is meant to model an MPI application that is message-passing intensive. Both bandwidth and latency test types are supported, with a user-specifiable buffer size for either.
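The per-process bandwidth figures reported below amount to the following bookkeeping. This is a sketch only: what counts as one process's traffic per MPI_Alltoall() call is an assumption here, and the actual ata accounting may differ.

```python
def ata_bandwidth_per_process(elapsed_s, num_times, bytes_per_call):
    """Per-process ATA bandwidth: the bytes one process moves across all
    MPI_Alltoall() calls, divided by the elapsed wall-clock time."""
    return num_times * bytes_per_call / elapsed_s

# Hypothetical numbers: 4000 all-to-all calls, 100,000 bytes/process/call, 3.7 s
bw = ata_bandwidth_per_process(3.7, 4000, 100_000)
```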

The ATA command line is:

ata -n num_times [-B|-L] [-b buff_size] [-h] [-a]

where

-n num_times
number of times to repeat the ATA message passing sequence.
-B|-L
bandwidth (-B) or latency (-L) test selection; default is bandwidth test.
-b buff_size
buffer size in bytes.
-h
print usage line to <stdout>.
-a
print out info line for each MPI process.

Top


STS—Some-To-Some Test

The STS benchmark emulates the message passing sequences that occur in a number of large scientific and engineering applications that rely on a domain decomposition approach to parallel execution.

Each MPI process sends and receives a set of randomly sized messages. The selection of message passing pairs is made randomly. The total number of source-destination pairs is determined from the product of the total number of processes and the "average" number of messages handled by each process. The latter value is user specifiable via a command-line option. The basic idea is to set up a relatively sparse collection of message passing pairs as compared to the full set of communicating pairs as happens in the ATA test.

Provision is also made to support a double-sided style of message passing where the list of source-destination pairs is doubled in length by including pairs constructed by reversing the source and destination process ranks of the original (single-sided) list. Each such reversed pair is assigned a message length that is randomly selected, i.e., (usually) different from the message size of the original pair.
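The pair-list construction described above can be sketched like this. It is an illustration only, with hypothetical names and an arbitrary message-size range; the actual sts generator may differ in details (e.g., whether self-pairs are excluded).

```python
import random

def build_pairs(num_procs, avg_msgs, double_sided=False, seed=12345):
    """Build a random source-destination pair list in the STS style.

    The single-sided list holds num_procs * avg_msgs randomly chosen
    (src, dst, length) triples.  For double-sided runs, each pair is
    appended again with src and dst reversed and a fresh random length.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_procs * avg_msgs):
        src = rng.randrange(num_procs)
        dst = rng.randrange(num_procs)          # self-pairs not excluded here
        pairs.append((src, dst, rng.randrange(1, 262144)))
    if double_sided:
        for src, dst, _length in list(pairs):
            pairs.append((dst, src, rng.randrange(1, 262144)))
    return pairs

single = build_pairs(16, 6)                     # 96 pairs
double = build_pairs(16, 6, double_sided=True)  # 192 pairs
```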

The STS command line is:

sts -n num_times -m num_msgs [-S|-D] [-s sync|async] [-r rand_seed] [-v] [-a] [-h]

where

-n num_times
number of times to repeat the SomeToSome operation.
-m num_msgs
average number of messages sent by each MPI process.
-S|-D
single (-S) or double (-D) sided message passing.
-s sync/async
synchronous/asynchronous message passing style; async is the default.
-r rand_seed
optional random number seed (4 byte integer).
-v
verbose flag for listing of the intermediate interconnection map and buffer sizes.
-h
print usage line.
-a
print out info line for each MPI process.

Top


General Test Conditions

All tests were conducted in the midst of regular production, so the results should reflect more nearly what an actual user might achieve in normal use.

To eliminate the usual 10–20% variation in run times that is common on some systems, each test was run three to five times and the shortest run time of the set was chosen as the "best" run time.

Finally, the num_times input option for each test was chosen so that each test would run for at least approximately 5 minutes of wall-clock time. The intent was to minimize the statistical effects of any slow first passes that might occur, e.g., while the processors "heat up" their caches or allocate memory for internal buffers.
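The run-selection procedure above amounts to nothing more than taking the minimum over a small set of repeated timings:

```python
def best_run_time(times_s, min_runs=3):
    """Return the "best" (shortest) wall-clock time from repeated runs,
    mirroring the test procedure described above."""
    if len(times_s) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(times_s)}")
    return min(times_s)

# Hypothetical timings (seconds) showing the usual run-to-run variation
best = best_run_time([312.4, 305.9, 341.0, 308.2])
```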

Top


Results

MPI benchmark results are provided for the following systems: Lomax (O3K), IBM01, Ranma, Vabcluster, Appro, Altix, and Chapman.

See the Algorithms section for detailed pseudo-code descriptions of the actual algorithms used in the benchmarks. See the Summary section for a comparison of the MPI performance of all the machines tested.

Top


Lomax (O3K)

The current Lomax system is a 512-CPU Origin 3800 system that was procured as an upgrade replacement for the 400-MHz Origin 2000 Lomax system. The MPI performance data for the Origin 2000 system are presented in Part I of this series.

Lomax LBW Test
Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
238.7 4.1
Message Size Bandwidth
40 9.79
400 39.7
1000 66.8
10000 158.5
50000 204.2
100000 224.1
500000 237.7
1000000 238.5
2500000 240.6
5000000 235.5
10000000 180.1
15000000 153.2
20000000 149.5
Lomax Test Conditions
  • MIPS R12000 Processor, Rev.: 3.5
    512 400 MHz IP35 processors
    256 GB memory
    Instruction cache size: 32 KB
    Data cache size: 32 KB
    Secondary instruction/data cache size: 8 MB
  • IRIX 64 6.5
  • MIPSPro Fortran Compiler, Release 7.3.1.1m
  • Message Passing Toolkit 1.5.3.0
  • Date: June 2002
Lomax ATA Test
Processes Bandwidth
(10⁶ bytes/s/process)
2 223.5
4 150.5
8 122.1
16 107.7
24 100.7
32 87.6
48 61.9
64 67.0
96 36.6
128 42.1
192 46.5
256 32.7
384 40.0

Lomax bandwidth versus message size.

Lomax STS Test*
  Single-Sided Double-Sided
Processes Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
2 112.0 5.4 112.7 12.1
4 99.8 6.7 96.9 15.6
8 79.4 9.3 70.8 22.0
16 63.8 12.6 64.6 24.8
32 50.0 14.9 54.1 27.2
64 42.9 18.5 40.9 39.1
128 27.3 28.2 28.6 54.6
256 21.8 36.1 20.8 74.7
384 22.9 34.0 23.2 67.8
* num_msgs=6

Top

Because the STS benchmark is new, a few comments on its results are worth making. The full output for the 384-process single-sided run is listed below.

mpirun -prefix "[%g] " -np 384 sts -m 6 -n 8000 -S
[0]
[0]  -- ****** SomeToSome MPI Benchmark, version 1.0b ****** --
[0]  Message Passing Style:                 Single Sided
[0]  Global Statistics:
[0]  Number of processes:                384
[0]  Number of messages/process:           6
[0]  Number of passes:                  8000
[0]  Avg. buffer size/process:        786224 bytes
[0]  Avg. message size/ process:      131037 bytes
[0]  Total bytes transferred:    2415302.750 10e+06 bytes
[0]  Total time:                     274.388 s
[0]  Bandwidth total:               8802.524 10e+06 bytes/s
[0]  Per pass statistics:
[0]  Total bytes/process               0.786 10e+06 bytes
[0]  Total time:                      34.298 ms
[0]  Bandwidth/process:               22.923 10e+06 bytes/s/process

It took 8000 iterations to reach about 5 minutes of overall run time. During the 274.4 seconds of run time, about 2.42 × 10¹² bytes were transferred. This yielded an aggregate message passing bandwidth of 8.8 × 10⁹ bytes/s.

During a single pass, the per-process bandwidth was 22.9 × 10⁶ bytes/s/process, with a single message passing sequence, over all 384 processors, taking about 34 ms. So, for a message passing application with a computational duty cycle of as little as a second, the message passing overhead would be under 4%. This implies that such an application, running with a balanced computational load, should show reasonable scaling speedups for up to several hundred processes.
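These figures follow directly from the run output above; the one-second compute cycle is an assumed duty cycle, not something the benchmark measures:

```python
# Global statistics from the 384-process single-sided STS run
total_bytes = 2_415_302.750e6          # total bytes transferred
total_time_s = 274.388                 # total run time (s)
passes = 8000

aggregate_bw = total_bytes / total_time_s        # ~8.8e9 bytes/s overall
per_pass_s = total_time_s / passes               # ~34 ms per pass

# Overhead for a hypothetical application with a 1 s compute cycle
compute_s = 1.0
overhead = per_pass_s / (compute_s + per_pass_s) # just over 3%, i.e. under 4%
```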

Top


IBM01

A small IBM SP system was installed at NAS as part of a Multi-Level Parallelism (MLP) demonstration project. The system uses the Regatta class architecture and is composed of four nodes with each node being an SMP with 8 CPUs. Each CPU is a Power4 IBM processor running at 1.3 GHz. All 8 CPUs in each SMP share approximately 7.6 GB of physical memory.

One interesting aspect of this system is that there are several different methods used to interconnect the CPUs. These different methods are listed below:

IBM01 Intranode
<default> SP switch
shmem Shared Memory Interface
IBM01 Internode
<default> Fast Ethernet
s1 SP switch 1
s2 SP switch 2
sa Combined SP switches 1 and 2

The basic LBW results for these various interconnection methods are listed in the following tables.

IBM01 LBW Intranode
Interconnect Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
<default> 263.9 30.6
shmem 1979.9 4.6
IBM01 LBW Internode
Interconnect Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
<default> 11.9 89.0
s1 157.1 66.5
s2 157.0 66.8
sa 167.7 67.6

IBM01 LBW Test: Bandwidth vs. Message Size
Bandwidth
(10⁶ bytes/s/process)
Message Size Intranode
<default>
Intranode
shmem
Internode
sa
40 1.3 8.6 0.6
400 10.8 78.0 4.7
1000 27.8 177.0 10.8
10000 82.1 567.8 34.9
50000 195.5 1273.1 100.4
100000 199.5 1804.4 120.2
500000 268.0 2225.6 159.5
1000000 263.9 2024.5 168.2
2500000 244.1 1500.5 173.1
5000000 256.3 1343.7 175.6
10000000 245.1 1475.4 176.5
15000000 272.8 1348.9 176.8
20000000 272.9 1476.4 177.2

LBW Intranode <default> bandwidth versus message size.
LBW Intranode shmem bandwidth versus message size.

LBW Internode sa, bandwidth versus message size.

Top

Results from the ATA and STS benchmarks were gathered for both intranode and internode cases. In the intranode case, data were gathered for both the <default> and the shmem interconnects. Based on the internode LBW results, the inter-node ATA and STS tests utilized only the combined bandwidth sa switch.

IBM01 ATA Intranode
Processes <default> shmem
2 242.2 727.6
4 147.9 554.7
8 112.0 257.0
IBM01 ATA Internode
Processes sa
2 242.2
4 147.9
8 112.0
16 112.0
24 112.0
32 112.0

IBM01 STS Test Intranode, <default>*
  Single-Sided Double-Sided
Processes Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
2 116.2 5.4 129.3 9.6
4 90.2 8.2 94.5 16.3
8 83.9 9.9 75.9 20.3
* num_msgs = 6

IBM01 STS Test Intranode, shmem*
  Single-Sided Double-Sided
Processes Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
2 874.0 0.9 741.7 1.8
4 519.3 1.2 472.4 2.8
8 294.8 2.4 284.1 4.8
* num_msgs = 6

IBM01 STS Test Internode, sa*
  Single-Sided Double-Sided
Processes Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
2 72.6 9.6 74.8 17.6
4 61.1 11.7 66.4 24.0
8 57.6 14.7 55.0 25.5
16 40.5 18.9 42.0 37.8
32 22.2 34.8 23.4 62.8
* num_msgs = 6
IBM01 Test Conditions
  • IBM SP Regatta Class System
    4 nodes × 8 Power4 1.3-GHz CPUs
    7.6 GB memory per node
  • AIX 5.1
  • XL Fortran Version 07.01.0001.0002
  • Date: July 2002

Top


Ranma

A small SUN Fire 880 server was installed as a replacement for the older (partially populated) SUN Enterprise E-10000 system, Simak. See Paper I for the MPI performance results for Simak.

This section presents the MPI performance data from all three MPI benchmarks for the Ranma system.

Ranma LBW Test
Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
915.1 3.5
Message Size Bandwidth
40 11.3
400 75.4
1000 155.7
10000 539.1
50000 748.1
100000 813.8
500000 861.8
1000000 842.9
2500000 845.7
5000000 844.7
10000000 821.6
15000000 768.0
20000000 793.0
Ranma Test Conditions
  • SUN Fire 880 server
    8 sparcv9 900-MHz processors
    16384 MB of shared main memory
  • SUNOS 5.9
  • Sun WorkShop 6 update 2,
    Fortran 95 6.2 2001/05/15
  • HPC 5 Toolkit; SUN MPI 4.1
  • Date: December 2002
Ranma ATA Test
Processes Bandwidth
(10⁶ bytes/s/process)
2 640.9
4 430.7
8 280.2

Ranma LBW Bandwidth vs. Message Size

Ranma STS Test*
  Single-Sided Double-Sided
Processes Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
2 410.4 2.07 415.4 4.40
4 210.2 3.53 241.0 6.58
8 150.6 4.59 133.5 11.4
* num_msgs=6

Top

Finally, a message passing interference test was run that attempted to load up Ranma with multiple runs of the two-processor LBW test. The idea was to see how the shared memory subsystem would perform as an increasing message passing load was imposed on it. Each LBW test was run for 10 minutes or longer so that there would be ample time for all the LBW runs to overlap.

The following table lists the LBW bandwidth for each set of LBW runs.

Ranma LBW Message Passing Load Test
LBW Runs Processes Bandwidth
(10⁶ bytes/s/process)
Memory Efficiency*
1 2 840.7 100.0
2 4 739.5 88.0
3 6 614.2 73.1
4 8 530.9 63.1
*The Memory Efficiency column is the bandwidth at each load expressed as a
percentage of the single-run bandwidth.
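As a cross-check, the Memory Efficiency column can be recomputed from the bandwidth values (for the Ranma table, the two-run row works out to 88.0):

```python
def memory_efficiency(bandwidths):
    """Each loaded bandwidth as a percentage of the single-run value."""
    base = bandwidths[0]
    return [round(100.0 * b / base, 1) for b in bandwidths]

eff = memory_efficiency([840.7, 739.5, 614.2, 530.9])   # Ranma load-test data
```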

As a point of comparison, the same interference experiment was performed on 16 CPUs of the SGI O3K Lomax. The results are listed in the following table.

SGI O3K Lomax LBW Message Passing Load Test
LBW Runs Processes Bandwidth
(10⁶ bytes/s/process)
Memory Efficiency*
1 2 237.8 100.0
2 4 237.9 100.0
3 6 236.5 99.5
4 8 247.6 104.0
5 10 237.8 100.0
6 12 226.7 95.3
7 14 225.6 94.9
8 16 221.3 93.1
*The Memory Efficiency column is the bandwidth at each load expressed as a
percentage of the single-run bandwidth.

Top


Vabcluster

The Vabcluster at Langley Research Center consists of 8 Intel PCs connected with an Ethernet network. The MPI benchmark tests were ported to the cluster, and some simple initial performance measurements were made. The results are listed below.

Vabcluster LBW Test
Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
10.2 46.2
Message Size Bandwidth
40 0.864
400 4.75
1000 6.95
10000 9.83
50000 10.2
100000 10.3
500000 10.2
1000000 10.1
2500000 10.2
5000000 10.2
10000000 10.2
15000000 10.2
20000000 10.2
Vabcluster Test Conditions
  • Eight Xeon 2.4-GHz PCs
    1 GB of memory per processor
    Ethernet Network
  • GNU/Linux 2.4.18-14
  • Intel Fortran compiler for 32-bit applications,
    Version 7.1, Build 20030307Z
  • MPICH 1.2.5
  • Date: August 2003
Vabcluster ATA Test
Processes Bandwidth
(10⁶ bytes/s/process)
2 10.6
4 3.39
6 2.00

Vabcluster LBW Bandwidth vs. Message Size

Top


Appro

Appro is a PC cluster-type system based on AMD Opteron processors and produced by APPRO International, Milpitas, CA. This "loaner" system included 16 nodes, with each node having two AMD Opteron processors. The clock rate for each processor was 1.6 GHz. MPI benchmark data for this system are listed in the following tables.

Appro LBW Test
Intra-node Inter-node
Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
331. 18.0 64.6 60.0
Message Size Inter-node Bandwidth
40 0.67
400 5.55
1000 11.0
10000 53.3
50000 62.1
100000 64.8
APPRO Test Conditions
  • 16 nodes, each node containing two 1595-MHz
    AMD Opteron Processors
    Main memory size per node: 2 GB
  • Linux 2.4.19-SMP
  • pgf90 Fortran Compiler, version 5.0-2, The Portland Group, Inc.
  • LAM/MPI 1.2, LINUX ch_p4 channel
  • Date: September 2003
Appro ATA Test
Processes Bandwidth
(10⁶ bytes/s/process)
2 67.3
4 37.3
8 29.0
12 25.8

APPRO LBW Bandwidth vs. Message Size

Top


Altix

Altix is a large SGI Altix system procured as a follow-on to the cluster of Origin 2000 and 3000 systems. The system consists of 512 Intel Itanium 2 1500-MHz processors. A full set of MPI benchmark runs was made on the system, and the results are presented below.

Altix LBW Test
Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
1873. 1.5
Bandwidth vs. Buffer Size (10⁶ bytes/s/process)
Message Size LBW Bandwidth ATA Bandwidth
(8 Processors)
40 19.1 10.2
400 67.6 35.4
1000 152 69.6
10000 983 135
50000 1266 141
100000 1506 140
500000 1740 187
1000000 1615. 141
2500000 846 141
5000000 801 143
10000000 784 143
15000000 773 142
20000000 777 135
Altix Test Conditions
  • 512 x 1500 MHz Itanium 2 Rev. 5 Processors
    Main memory size: 1008.35 GB
  • LINUX 2.4.21-sgi230ssibetab7
  • Intel Fortran Itanium Compiler for Itanium-based
    applications, Version 7.1, Build 20030814
  • SGI MPI 4.3 MPT 1.8 06/06/03 16:45:05
  • Date: December 2003
Altix ATA Test
Processes Bandwidth
(10⁶ bytes/s/process)
2 1745.
4 478.
8 137.
16 72.3
24 45.7
32 32.7
48 32.0
64 23.3
96 23.1
128 16.0
192 14.1
256 12.5
384 4.8
448 4.6

Altix Bandwidth vs. Buffer Size

Altix STS Test*
  Single-Sided Double-Sided
Processes Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
2 862.3 0.7 863.1 2.2
4 323.7 2.9 282.2 6.1
8 131.5 6.3 115.2 14.9
16 61.7 12.5 43.6 36.2
24 44.5 17.0 28.8 53.4
32 31.1 24.2 19.7 78.8
48 29.4 25.1 16.0 94.0
64 22.5 33.8 16.7 93.4
96 24.9 31.0 15.7 99.6
128 18.1 42.5 22.8 69.4
192 16.2 48.9 15.3 102.4
256 12.5 63.0 17.2 92.2
384 14.6 53.0 17.6 89.4
448 13.9 56.7 15.1 105.0
* num_msgs = 6

Top

A few observations on these STS results are worth making. Beyond roughly 100 processes, the per-process bandwidth falls off only gradually, and it is this good MPI message passing scalability that makes it possible for many MPI applications to scale up to large numbers of processors. Even for problems with a compute cycle time as low as one second, the total some-to-some message passing run time falls in the 5-10% overhead range (i.e., 50-100 milliseconds of message passing at the end of each 1000-millisecond compute cycle). So, a parallel efficiency of 90+ percent is not unreachable even for total process counts in the hundreds.
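The efficiency arithmetic above, taking the worst case of the quoted communication range, is simply:

```python
compute_ms = 1000.0   # one compute cycle
comm_ms = 100.0       # upper end of the 50-100 ms some-to-some cost quoted above
efficiency_pct = 100.0 * compute_ms / (compute_ms + comm_ms)
# ~90.9%: "90+ percent" parallel efficiency even at the high end of the range
```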

One of the newer features added to the ATA benchmark is the ability to measure the latency of the all-to-all operation. The following table lists this ATA latency along with the average total time and percentage overhead (ratio of latency to total time) for an all-to-all operation. All times in this table are in milliseconds of wall-clock run time.

Altix ATA Latency
Processes Total Time (ms) Latency (ms) Percent Overhead
2 0.11 0.0042 3.67
4 0.84 0.013 1.59
8 5.83 0.039 0.67
16 22.1 0.12 0.54
24 52.5 0.28 0.53
32 97.9 0.36 0.37
48 150 0.61 0.41
64 274 1.55 0.57
96 415 1.66 0.40
128 802 4.57 0.57
192 1359 4.27 0.31
256 2042 8.80 0.43
384 8075 15.5 0.19
448 9795 14.1 0.14

From these data it appears that the basic overhead associated with a call to MPI_Alltoall() is small compared to the time spent actually moving large messages. To be sure, the internal timing will be somewhat different for transmitting trivial (40 byte) buffers compared to those used for the Bandwidth measurements (100,000 bytes) which are the source of the data in the Total Time column in the ATA Latency table above. Nevertheless, to a first approximation it seems that the overhead in setting up and executing the message passing sequences inside of a call to the MPI_Alltoall() routine are small when compared to the time that it takes to move the amounts of data that are usually transported during an all-to-all operation.
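The Percent Overhead column in the table above is just the ratio of the two time columns (both in milliseconds, per the note above), e.g. for the 8-process row:

```python
def percent_overhead(total_ms, latency_ms):
    """Fraction of an all-to-all call attributable to fixed latency,
    expressed as a percentage of the total per-call time."""
    return round(100.0 * latency_ms / total_ms, 2)

row8 = percent_overhead(5.83, 0.039)   # 8-process Altix ATA row
```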

Finally, it is worth noting that all of these Altix MPI benchmark results are derived from data that showed a remarkable variability in performance amongst otherwise identical measurement runs. For example, the following is a listing of the raw bandwidth data for the 128 process ATA runs:

   turing[125] => grep Band run.ata.128.o*
   run.ata.128.o2296: Bandwidth:      4.719 10e+06 bytes/sec/proc
   run.ata.128.o2311: Bandwidth:     15.799 10e+06 bytes/sec/proc
   run.ata.128.o2402: Bandwidth:     15.955 10e+06 bytes/sec/proc
   run.ata.128.o2424: Bandwidth:     12.324 10e+06 bytes/sec/proc
   run.ata.128.o2443: Bandwidth:      9.248 10e+06 bytes/sec/proc

Similarly for the 96 process (single-sided) runs for the STS benchmark:

   turing[126] => grep Band run.sts.96.o*
   run.sts.96.o2145: Bandwidth / proc:   24.920 10e+06 Bytes/sec/proc
   run.sts.96.o2211: Bandwidth / proc:   15.205 10e+06 Bytes/sec/proc
   run.sts.96.o2262: Bandwidth / proc:   12.524 10e+06 Bytes/sec/proc
   run.sts.96.o2324: Bandwidth / proc:   15.735 10e+06 Bytes/sec/proc
   run.sts.96.o2387: Bandwidth / proc:   21.078 10e+06 Bytes/sec/proc
   run.sts.96.o2456: Bandwidth / proc:   12.049 10e+06 Bytes/sec/proc

This level of variability is fairly typical of all the various benchmark runs. Even the simplest, two-process LBW runs showed some variability:

   run.lbw.o1775: Bandwidth:                         1860.422 10e+06 Bytes/sec
   run.lbw.o1776: Bandwidth:                         1855.908 10e+06 Bytes/sec
   run.lbw.o1777: Bandwidth:                         1869.490 10e+06 Bytes/sec
   run.lbw.o1778: Bandwidth:                         1859.643 10e+06 Bytes/sec
   run.lbw.o1779: Bandwidth:                         1080.807 10e+06 Bytes/sec
   run.lbw.o1780: Bandwidth:                         1873.485 10e+06 Bytes/sec
   run.lbw.o1781: Bandwidth:                         1866.966 10e+06 Bytes/sec
   run.lbw.o1782: Bandwidth:                         1865.982 10e+06 Bytes/sec

Note that the bandwidth for job run.lbw.o1779 is only about 60% of that of the other, otherwise identical runs. All of these jobs were run using a 4-CPU allocation (to allow room for shells, the mpirun command, etc.). If similar two-process jobs were run in larger allocations (e.g., 64 or 128 CPUs), the bandwidth variability approached the 2-4x level evidenced in the ATA and STS examples above.
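The run-to-run spread in the 128-process ATA listing above can be quantified simply:

```python
# Raw per-run bandwidths (10^6 bytes/s/proc) from the 128-process ATA listing
runs_ata_128 = [4.719, 15.799, 15.955, 12.324, 9.248]
spread = max(runs_ata_128) / min(runs_ata_128)   # fastest vs. slowest run
# roughly 3.4x between otherwise identical runs
```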

Top


Chapman

An SGI system added to the NAS Origin cluster, named Chapman, contains 1024 600-MHz IP35 MIPS R14000 processors. A full set of MPI benchmark measurements were completed on the system; the resulting performance data are presented below.

Chapman LBW Test
Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
224. 3.3
Bandwidth vs. Buffer Size (10⁶ bytes/s/process)
Message Size LBW Bandwidth ATA Bandwidth
(8 Processors)
40 12.3 3.0
400 47.7 18.1
1000 78.3 35.3
10000 178 95.3
50000 199 104
100000 213 103
500000 224 81.5
1000000 224 72.3
2500000 224 67.6
5000000 182 67.4
10000000 156 65.8
15000000 202 67.3
20000000 155 64.7
Chapman Test Conditions
  • 600-MHz IP35 MIPS R14000 processors.
262144 MB main memory.
  • IRIX64 chapman 6.5-nwokah-root-SN1XXL 05051820 IP35
  • MIPSpro Compilers: Version 7.4.0.0m
  • SGI MPI 4.2 MPT 1.7 01/31/03 13:16:13
  • Date: December 2003
Chapman ATA Test
Processes Bandwidth
(10⁶ bytes/s/process)
2 198
4 135
8 104
16 85.9
24 70.6
32 59.9
48 50.8
64 47.2
96 40.7
128 35.3
192 29.5
256 24.1
384 12.5
448 6.8
512 2.8
640 2.1
768 1.9
896 1.6

Chapman Bandwidth vs. Buffer Size

Chapman STS Test*
  Single-Sided Double-Sided
Processes Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
2 105.3 5.9 110.7 17.2
4 77.7 12.0 68.3 25.4
8 65.7 12.6 45.1 38.0
16 51.3 15.0 37.3 42.4
24 48.1 19.9 35.7 43.1
32 37.8 20.6 29.1 53.2
48 35.8 21.9 28.9 52.2
64 34.7 23.8 26.4 59.0
96 32.5 26.2 23.0 67.9
128 29.4 26.2 20.3 77.8
192 26.5 30.0 18.8 83.4
256 24.1 32.5 18.2 86.9
384 22.9 33.8 17.2 91.4
448 20.2 39.1 17.4 91.2
512 20.3 38.5 14.5 107.7
640 20.8 37.7 15.4 102.6
768 21.4 36.3 15.1 104.5
896 17.6 44.3 15.2 103.3
* num_msgs = 6

 

Chapman ATA Latency
Processes Total Time (ms) Latency (ms) Percent Overhead
2 1.01 0.024 2.38
4 2.97 0.048 1.60
8 7.73 0.105 1.35
16 18.62 0.206 1.10
24 34.0 0.317 0.93
32 53.4 0.423 0.79
48 94.6 0.667 0.71
64 136. 0.887 0.65
96 236. 1.40 0.59
128 363. 1.95 0.54
192 651. 3.16 0.48
256 1061. 4.30 0.41
384 3068. 7.17 0.23
448 6585. 8.52 0.13
512 18377. 10.1 0.05
640 30380. 13.7 0.05
768 39798. 17.3 0.04
896 54380. 21.1 0.04

Top


Summary

LBW
System Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)

Lomax (O3K)

238.7 4.1
IBM SP (shmem) 1979.9 4.6
Ranma 915.4 3.5
Vabcluster 10.2 46.2
Appro (inter-node) 64.6 60.0
Altix 1873 1.5
Chapman 224. 3.3

 

ATA (4 Processors)
System Bandwidth
(10⁶ bytes/s/process)
Lomax (O3K) 150.5
IBM SP (shmem) 554.7
Ranma 430.7
Vabcluster 3.39
Appro (inter-node) 37.3
Altix 478
Chapman 135

 

STS (16 Processors)
System Bandwidth
(10⁶ bytes/s/process)
Per Pass
Run Time (ms)
Lomax (O3K) 63.8 12.6
IBM SP (sa) 40.5 18.9
Ranma 150.6 4.59
Altix 61.7 12.5
Chapman 51.3 15.0

Top


Algorithms

The basic message passing algorithm for each MPI benchmark test is listed below.

LBW

Synchronous:

 t0 = MPI_Wtime()
    do i = 1, num_times
       if (my_rank .eq. SRC_RANK) then
          call MPI_Send (out_buf, msg_size, ..., DEST_RANK, ...)
          call MPI_Recv (in_buf, msg_size, ..., DEST_RANK, ...)
       else
          call MPI_Recv (in_buf, msg_size, ..., SRC_RANK, ...)
          call MPI_Send (out_buf, msg_size, ..., SRC_RANK, ...)
       endif
    enddo
    t1 = MPI_Wtime()

Asynchronous:

 t0 = MPI_Wtime()
    do i = 1, num_times
       if (my_rank .eq. SRC_RANK) then
          call MPI_Irecv (in_buf, msg_size, ..., DEST_RANK, ...)
          call MPI_Send (out_buf, msg_size, ..., DEST_RANK, ...)
          call MPI_Wait (...)
       else
          call MPI_Irecv (in_buf, msg_size, ..., SRC_RANK, ...)
          call MPI_Send (out_buf, msg_size, ..., SRC_RANK, ...)
          call MPI_Wait (...)
       endif
    enddo
    t1 = MPI_Wtime()

Top

ATA

    t1 = MPI_Wtime()
    do i = 1, num_times / 2
       call MPI_Alltoall (inbuf, ...., outbuf, ...)
       call MPI_Alltoall (outbuf, ..., inbuf, ....)
    end do
    t2 = MPI_Wtime()

STS

    t1 = MPI_Wtime()
    do inx = 1, num_times
       call MPI_Sometosome (num_ifaces, iface_list, iface_cnt ...)
    enddo
    t2 = MPI_Wtime()

The iface_list array contains num_ifaces source-destination pairs with the message lengths in the iface_cnt array. The two message passing styles in the MPI_Sometosome() routine are illustrated below.

Asynchronous:

*   Issue irecvs for all of this rank's receives.

    do inx = 1, num_ifaces
       src_rank  = iface_list(SRC, inx)
       dest_rank = iface_list(DST, inx)
       count     = iface_cnt(inx)
       if (my_rank .eq. dest_rank) then
          call MPI_Irecv (recv_buffer, count, ...,
                          src_rank, ...)
       endif
    enddo

*   Issue sends for all of this rank's sends.

    do inx = 1, num_ifaces
       src_rank  = iface_list(SRC, inx)
       dest_rank = iface_list(DST, inx)
       count     = iface_cnt(inx)
       if (my_rank .eq. src_rank) then
          call MPI_Send (send_buffer, count, ..., 
                         dest_rank, ...)
       endif
    enddo

*   Wait for all incoming messages.

    call MPI_Waitall (...)

Synchronous:

*   Issue send/recv pairs for each interface.

    do inx = 1, num_ifaces
       src_rank  = iface_list(SRC, inx)
       dest_rank = iface_list(DST, inx)
       count     = iface_cnt(inx)
       if (my_rank .eq. src_rank) then
          call MPI_Send (send_buffer, count, ..., 
                         dest_rank, ...)
       endif
       if (my_rank .eq. dest_rank) then
          call MPI_Recv (recv_buffer, count, ..., 
                         src_rank, ...)
       endif
    enddo

Top


PVM Results

The following table lists the PVM bandwidth and latency on each of the NAS Origin cluster's component systems.

PVM Bandwidth and Latency on NAS Origin Cluster Component Systems
System Clock Rate
(MHz)
Bandwidth
(10⁶ bytes/s/process)
Latency
(µs)
Dawn 250 48.0 28.2
Hopper 250 66.5 27.9
Steger 250 62.2 31.2
Lomax 400 83.1 19.6
Chapman 600 68.7 22.1
Test Conditions
  • MIPSpro Compilers: Version 7.4
  • PVM 3.3.10 (SGI Version)
  • Date: June 2003
Top


    High Performance Computing at LLNL    Lawrence Livermore National Laboratory

    Last modified January 27, 2006
    UCRL-WEB-218462