
Simple MPI Performance Measurements I

Contents: Introduction | Benchmarks | General Test Conditions | Results | Summary | Algorithms


Introduction

This page presents a series of simple MPI performance measurements made on some large systems at the NASA Advanced Supercomputing (NAS) Division at NASA Ames Research Center. Additional MPI performance measurements (MPI Performance Measurements and Simple MPI Performance Measurements II) are also available.

The work described was performed as part of the MRJ support contract with the NAS division at the NASA Ames Research Center and is reproduced here with the permission of NASA.


Benchmarks

Two benchmarks were used to make these MPI performance measurements: LBW, which measures point-to-point latency and bandwidth with a two-process ping-pong (MPI_Send/MPI_Recv), and ATA, which measures the per-process bandwidth of an all-to-all exchange (MPI_Alltoall) over increasing numbers of processes. Both programs are written in Fortran 90.


General Test Conditions

Unless noted otherwise in the per-machine test conditions, bandwidths are reported in units of 10^6 bytes per second per process and latencies in microseconds (µs). The hardware, operating system, compiler, and MPI library used for each set of measurements are listed with the corresponding results.

Results

MPI benchmark results are provided for the following systems:

  • Babbage
  • Bright
  • Davinci
  • Newton
  • Simak
  • SGI Origin2000/3000 systems: Turing, Heisenberg, Lomax, Grendel, and Chapman
  • VonNeuman
  • Whitney

See the Algorithms section for detailed pseudo-code descriptions of the actual algorithms used in the benchmarks. See the Summary section for a comparison of the MPI performance of all the machines tested.



Babbage

Babbage LBW Test

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  33.0                               57.2

Test Conditions
  • IBM SP-2, 162 "wide" nodes
    POWER2 67 MHz processor
    128 MB memory per node
  • AIX 2.0
  • poe 2.4
Babbage ATA Test

  Processes   Bandwidth (10^6 bytes/s/process)
      2       18.7
      4       20.2
      8       20.3
     16       19.6
     24       19.2
     32       19.1


Bright

The version of MPI on Bright supports two distinct flavors of MPI message passing. The first is the familiar socket-based communication channel. The second uses a shared memory (multitasking) approach to share buffers and other MPI library data structures. Executables of both the ATA and LBW benchmarks were generated and run for both flavors of MPI.

Bright LBW Test

                  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  Socket            34.0                              530.5
  Shared Memory   1132                                 61.0

Test Conditions
  • 32 CPU CRAY SV1e
    10.0 ns clock
    32 GB memory
  • Unicos 10.0.1.0
  • Message Passing Toolkit 1.4.0.0
  • Cray Fortran F90, Version 3.5.0.1
  • Date: August 2001
Bright ATA Test (bandwidth in 10^6 bytes/s/process)

  Processes   Socket   Shared Memory
      2        24.7       444.3
      4        18.3       455.4
      8        10.6       451.4
     16         4.32      350.2
     24         2.39      221.3
     32         N/A        N/A



Davinci

Davinci LBW Test

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  64.0                               19.5

Test Conditions
  • 8 CPU SGI Power Challenge
    90 MHz R8000 MIPS processors
    2 GB memory
  • IRIX 64 6.2
  • f77, Version 6.2
  • MPI 2.0
Davinci ATA Test

  Processes   Bandwidth (10^6 bytes/s/process)
      2       29.9
      4       42.5
      8       34.0


Newton

The version of MPI on the Newton cluster supports two distinct flavors of MPI message passing. The first is the familiar p4 (socket-based) communication channel. The second uses a multitasking approach to share buffers and other MPI library data structures. This second approach is only supported for intra-node message passing. Executables of both the ATA and LBW benchmarks were generated and run for the socket and multitasking MPI message passing styles.

Newton LBW Test

                  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  Socket            14.5                             1847
  Multitasking     663                                 92.0

Test Conditions
  • 4 CPU CRAY J-90
    10.0 ns clock
    1 GB memory
  • Unicos 9.0.2.1
  • Message Passing Toolkit 1.1.0.0
  • Date: August 1997
Newton ATA Test (bandwidth in 10^6 bytes/s/process)

  Processes   Socket   Multitasking
      2        4.24       101.7
      4        3.14       117.3
      8        2.11       128.4



Simak

Here are the LBW and ATA test results from a small SUN E-10000 system.

Simak LBW Test

           Bandwidth (10^6 bytes/s/process)   Latency (µs)
  P4         65.9                             182.4
  SHARED     61.0                             180.6
  SHMEM      90.0                              15.9

Test Conditions
  • 16 Ultra SPARC 333.6-MHz processors
    2 GB memory
  • SunOS 5.6
  • WorkShop Compilers 4.2
  • MPICH 1.2
  • Date: November 2000
Simak ATA Test (bandwidth in 10^6 bytes/s/process)

  Processes     P4     SHARED   SHMEM
      2        47.8     44.1     83.0
      4        34.9     31.6     75.2
      8        12.5     14.4     N/A
     16         3.45     4.24    N/A

Note that these MPI performance measurements were made with the 1.2 release of MPICH. Three different configurations were used:

  Channel   configure options
  P4        -device=ch_p4
  SHARED    -device=ch_p4 -comm=shared
  SHMEM     -device=ch_shmem

Additionally, the SHMEM version of MPICH seemed to have some significant problems with the ATA test, especially for CPU counts greater than or equal to 8. In almost all cases, the code would hang in one of the MPI_Alltoall() calls and never return. Only very small iteration counts would succeed, and because of the resulting very short overall run times, the measured bandwidth varied greatly.



SGI Origin2000/3000 Systems

Turing

Turing LBW Test (MPI 3.0)

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  125.5                              12.3

Test Conditions
  • MIPS R10000 Processor, Revision: 2.6
    64 195-MHz IP27 processors
    16 GB memory
  • IRIX 64 6.4
  • MIPSPro Fortran Compiler, Release 7.20
  • MPI 3.0
  • Date: July 1998
Turing ATA Test (MPI 3.0)

  Processes   Bandwidth (10^6 bytes/s/process)
      2       40.3
      4       38.9
      8       33.4
     16       28.7
     24       26.8
     32       26.0

 

Turing LBW Test (MPT 1.2)

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  121.5                              12.6

Test Conditions
  • MIPS R10000 Processor, Revision: 2.6
    64 195-MHz IP27 processors
    16 GB memory
  • IRIX 64 6.4
  • MIPSPro Fortran Compiler, Release 7.20
  • Message Passing Toolkit 1.2
  • Date: July 1998
Turing ATA Test (MPT 1.2)

  Processes   Bandwidth (10^6 bytes/s/process)
      2       48.7
      4       36.9
      8       32.2
     16       26.8
     24       21.1
     32       14.3

For comparison, here are MPI performance measurements with MPICH, using both the socket (ch_p4) and shared memory (ch_shmem) communication channels.

Turing LBW Test with MPICH

                  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  Socket            13.0                             308
  Shared Memory     49.4                              31.1

Test Conditions
  • SGI Origin 2000
    MIPS R10000 Processor Chip Revision: 2.6
    64 195-MHz IP27 processors
    16 GB memory
  • IRIX 64 6.4
  • MPICH 1.1.0
  • Date: February 1998
Turing ATA Test with MPICH (bandwidth in 10^6 bytes/s/process)

  Processes   Socket   Shared Memory
      2        7.93       31.0
      4        5.09       35.1
      8        2.38       33.2



Heisenberg

Heisenberg LBW Test

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  139.2                              11.8

Test Conditions
  • MIPS R10000 Processor, Revision: 3.4
    128 250-MHz IP27 processors
    32 GB memory
  • IRIX 64 6.4
  • MIPSPro Fortran Compiler, Release 7.2.1
  • MPI 3.1.x.14 (MPT 1.2.1.0.2)
  • Date: July 1998
Heisenberg ATA Test

  Processes   Bandwidth (10^6 bytes/s/process)
      2       60.6
      4       43.3
      8       23.2
     16       14.4
     24        6.68
     32        4.74
     48        2.32
     64        1.76
     96        1.04
    128        0.682

The principal difference between Turing and Heisenberg that affects MPI performance is the speed of their R10000 processors: Turing uses 195 MHz parts and Heisenberg uses 250 MHz parts. The following tables list the ratios of the bandwidth and latency measurements between Heisenberg and Turing.

Heisenberg/Turing Performance Ratios, LBW Test

  Bandwidth ratio   Latency ratio
  1.15              0.937

Test Conditions
  • Turing
    • MIPS R10000 195-MHz IP27 processors
    • IRIX 64 6.4
    • MIPSPro Fortran Compiler, Release 7.2
    • Message Passing Toolkit 1.2
    • Date: July 1998
  • Heisenberg
    • MIPS R10000 250-MHz IP27 processors
    • IRIX 64 6.5SE
    • MIPSPro Fortran Compiler, Release 7.2.1
    • MPI 3.1.x.14 (MPT 1.2.1.0.2)
    • Date: July 1998
Heisenberg/Turing Performance Ratios, ATA Test

  Processes   Bandwidth ratio
      2       1.23
      4       1.17
      8       0.720
     16       0.537
     24       0.317
     32       0.331

The ratio of clock rates between the two systems is 250/195 = 1.28. Simple dimensional analysis suggests that quantities like bandwidth should scale up by this ratio and quantities like latency should scale down by it: all other things being equal, bandwidths should increase by a factor of 1.28 and latencies should decrease by the same factor.
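
A quick worked check of this reasoning, using only the ratios quoted in the tables above (H = Heisenberg, T = Turing; BW and L are the LBW bandwidth and latency):

    \[
      \frac{250\ \mathrm{MHz}}{195\ \mathrm{MHz}} \approx 1.28
      \quad\Longrightarrow\quad
      \text{expected: } \frac{BW_H}{BW_T} \approx 1.28, \qquad
      \frac{L_H}{L_T} \approx \frac{1}{1.28} \approx 0.78
    \]
    \[
      \text{observed: } \frac{BW_H}{BW_T} = 1.15 \;\; (+15\%\ \text{vs. the expected}\ +28\%), \qquad
      \frac{L_H}{L_T} = 0.937 \;\; (-6\%\ \text{vs. the expected}\ -22\%)
    \]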

Clearly, things are not equal. The LBW bandwidth increases by only about half of the clock-rate speedup, and the latency decreases by only about a third of what one might expect from the clock-rate improvement.

The situation is even more exaggerated for the ATA test. The two-process case shows a very nice improvement, over 80% of the clock-rate speedup. At higher process counts, however, the speedup turns into a slowdown. The 32-process case (which is nowhere near utilizing the full capabilities of either system) shows a slowdown by about a factor of three: 14.3 on Turing versus 4.74 on Heisenberg. Given the extreme nature of this decrease in bandwidth, it seems likely that some other difference between the two systems, quite apart from the increase in clock rate, is responsible for the comparatively poor performance of the ATA test on Heisenberg.



Lomax

Lomax LBW Test (300 MHz)

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  129.1                              12.6

Test Conditions
  • MIPS R12000 Processor, Rev.: 2.3
    512 300-MHz IP27 processors
    192 GB memory
  • IRIX 64 6.5
  • MIPSPro Fortran Compiler, Release 7.2.1.1m
  • MPI 3.2.0.0
  • Date: January 2000
Lomax ATA Test (300 MHz)

  Processes   Bandwidth (10^6 bytes/s/process)
      2       112
      4        64.4
      8        47.9
     16        37.3
     24        34.1
     32        25.0
     48        18.9
     64        17.8
     96        14.0
    128         8.8
    192         1.8
    256         1.3

 

Lomax LBW Test (400 MHz)

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  140.8                              10.2

Test Conditions
  • MIPS R12000 Processor, Rev.: 3.5
    512 400-MHz IP27 processors
    192 GB memory
  • IRIX 64 6.5
  • MIPSPro Fortran Compiler, Release 7.30
  • Message Passing Toolkit 1.4.0.0
  • Date: September 2000
Lomax ATA Test (400 MHz)

  Processes   Bandwidth (10^6 bytes/s/process)
      2       144
      4        78.3
      8        45.3
     16        28.0
     24        25.5
     32        23.9
     48        23.8
     64        22.3
     96        21.2
    128        21.2
    192        18.6
    256        16.2
    384        11.6



Grendel

Grendel (SN1) LBW Test

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  232.5                              7.1

Test Conditions
  • MIPS R12000 Processor, Revision: 3.5
    32 400-MHz IP35 processors
    8 GB memory
  • IRIX 64 6.5-ALPHA
  • MIPSpro Compilers: Version 7.3.1.1m
  • MPI 3.2.0.7
  • Date: November 2000
Grendel (SN1) ATA Test

  Processes   Bandwidth (10^6 bytes/s/process)
      2       246.5
      4       144.0
      8        85.6
     16        58.5
     24        49.9
     32        47.5


Chapman

Chapman (SN1) LBW Test

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  253.3                              4.5

Test Conditions
  • MIPS R12000 Processor, Revision: 3.5
    1024 400-MHz IP35 processors
    256 GB memory
  • IRIX 6.5-nwokah-root-SN1XXL 10261819
  • MIPSpro Compilers: Version 7.3.1.2m
  • MPT 1.5.2.0; MPI 3.3.0.5
  • ciotti "enhanced" interconnection topology
  • Because of the enormous time delays in
    getting MPI programs started and stopped, the
    iteration counts were adjusted upwards to ensure
    at least 5 minutes of program run time.
  • The ATA individual buffer size was reduced
    to 50 KB because of memory allocation problems.
  • Date: November 2001
Chapman (SN1) ATA Test

  Processes   Bandwidth (10^6 bytes/s/process)
      2       229.0
      4       157.2
      8       124.2
     16       106.3
     32        79.2
     64        69.5
    128        65.2
    256        56.6
    512        41.3
   1000         5.75

 

Chapman (600 MHz) LBW Test

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  289.6                              3.8

Test Conditions
  • MIPS R12000 Processor, Revision: 3.5
    1024 600-MHz IP35 processors
    Main memory size: 65536 MB
    Instruction cache size: 32 KB
    Primary data cache size: 32 KB
    Secondary data cache size: 8 MB
  • IRIX 64 chapman 6.5.15n.TS.TLB.fix 03081700 IP35
  • MIPSpro Compilers: Version 7.3.1.1m
  • Message Passing Toolkit 1.5.2.0; MPI 3.3.0.5
  • ciotti "enhanced" interconnection topology
  • The ATA individual buffer size was reduced to 50 KB
    because of memory allocation problems for the larger
    CPU count cases.
  • Because of the enormous time delays in getting MPI
    programs started and stopped, the iteration counts
were adjusted upwards to ensure at least 5 minutes of
    program run time. In addition, the initial pass of the
    MPI_Alltoall() loop was excluded from the loop timing to
    exclude the time that is spent allocating memory in the
    large transfer buffers.
  • Despite all of the above precautions, otherwise
    identical runs for some process counts showed extensive
    run-time variations, on the order of 2-4x. These process
    counts are marked with an asterisk (*) in the ATA test table.
  • Date: April 2002
Chapman (600 MHz) ATA Test

  Processes   Bandwidth (10^6 bytes/s/process)
      2       241.9
      4       158.2
      8       126.6
     16        73.8 *
     24       105.1 *
     32        28.6 *
     48        27.1 *
     64        27.8 *
    128        64.1 *
    192        60.3 *
    256        59.1 *
    384        58.1
    512        55.4
    640        51.8
    768        49.7
    896        50.4
   1000        50.0



VonNeuman

The version of MPI on VonNeuman supports two distinct flavors of MPI message passing. The first is the familiar p4 (socket-based) communication channel. The second uses a multitasking approach to share buffers and other MPI library data structures. This second approach is only supported for intranode message passing. Executables of both the ATA and LBW benchmarks were generated and run for the socket and multitasking MPI message passing styles.

VonNeuman LBW Test

                  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  Socket            20.9                              745
  Multitasking    1768                                 27.4

Test Conditions
  • 16 CPU Cray C-90
    4.1670 ns clock
    8 GB memory
  • Unicos 8.0.4.2
  • Message Passing Toolkit 1.1.0.0
VonNeuman ATA Test (bandwidth in 10^6 bytes/s/process)

  Processes   Socket   Multitasking
      2       10.0        122.5
      4        4.7        127.6
      8        0.46        16.9
     16        N/A          N/A

Note that the 16 process ATA case for socket MPI message passing failed repeatedly in a manner very reminiscent of how this test failed on Newton. I was unable to get a successful run in four separate attempts.



Whitney

Whitney LBW Test

  Bandwidth (10^6 bytes/s/process)   Latency (µs)
  6.65                               225

Test Conditions
  • PC Cluster
  • 36 x 200 MHz Pentium Pro PCs
  • 128 MB RAM memory/PC
  • Red Hat Linux release 4.1 (Vanderbilt)
  • MPICH 1.1.0 (ch_p4)
  • Date: June 1998
Whitney ATA Test

  Processes   Bandwidth (10^6 bytes/s/process)
      2       2.62
      4       1.75
      8       0.89
     16       0.46
     24       0.34
     32       0.073


Summary

LBW

  System                      Bandwidth (10^6 bytes/s/process)   Latency (µs)
  Babbage                        33.0                               57.2
  Bright - socket                34.0                              530.5
  Bright - shmem               1132                                 61.0
  Davinci                        64.0                               19.5
  Grendel (SN1)                 232.5                                7.1
  Heisenberg                    139.4                               11.8
  Lomax (300 MHz)               129.1                               12.6
  Lomax (400 MHz)               140.8                               10.2
  Newton - socket                14.5                             1847
  Newton - multitasking         663                                 92.0
  Simak - shmem                  90.0                               15.9
  Turing                        125.5                               12.3
  Turing - MPICH/socket          13.0                              308
  Turing - MPICH/ch_shmem        49.4                               31.1
  VonNeuman - socket             20.9                              745
  VonNeuman - multitasking     1768                                 27.4
  Whitney                         6.65                             225

ATA (4 processes)

  System                      Bandwidth (10^6 bytes/s/process)
  Babbage                        20.2
  Bright - socket                18.3
  Bright - shmem                455.4
  Davinci                        42.5
  Grendel (SN1)                 144.0
  Heisenberg                     35.1
  Lomax (300 MHz)                64.4
  Lomax (400 MHz)                78.3
  Newton - socket                 3.14
  Newton - multitasking         117.3
  Simak - shmem                  75.2
  Turing                         38.9
  Turing - MPICH/socket           5.09
  Turing - MPICH/ch_shmem        35.1
  VonNeuman - socket              4.7
  VonNeuman - multitasking      127.6
  Whitney                         1.75



Algorithms

The algorithms for both the LBW and ATA programs are given below in pseudo-code:

    program lbw

    if (master) then
       obtain num_transfers and test_type from the command line
    endif

    broadcast num_transfers and test_type to other processes

    if (test_type == latency) then
       msg_size = 40/8
    elseif (test_type == bandwidth) then
       msg_size = bufsize
    endif

    t0 = MPI_Wtime()
    do i = 1, num_transfers
       if (master) then
          call MPI_Send (buf, msg_size, ...)
          call MPI_Recv (buf, msg_size, ...)
       else
          call MPI_Recv (buf, msg_size, ...)
          call MPI_Send (buf, msg_size, ...)
       endif
    enddo
    t1 = MPI_Wtime()

    if (test_type == latency) then
       latency = 1.0e+6 * (t1 - t0) / (2 * num_transfers)
    elseif (test_type == bandwidth) then
       total_bytes = 8 * msg_size * num_transfers * 2
       bandwidth = (1.0e-6 * total_bytes) / (t1 - t0)
    endif

    print out results

    call exit (0)
    end
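
For readers who want to try the measurement themselves, here is a minimal, self-contained Fortran 90/MPI sketch of the LBW ping-pong. It is an illustration only, not the original benchmark: the buffer size, transfer count, and test type are hard-coded assumptions (the actual program reads num_transfers and test_type from the command line), and it must be run with at least two MPI processes.

    program lbw_sketch
    ! Minimal LBW ping-pong sketch; the parameters below are assumed, not the originals
    use mpi
    implicit none
    integer, parameter :: bufsize = 125000                    ! 10**6 bytes of 8-byte words (assumed)
    integer, parameter :: num_transfers = 1000                ! assumed transfer count
    character(len=*), parameter :: test_type = 'bandwidth'    ! or 'latency' (assumed)
    double precision :: buf(bufsize)
    double precision :: t0, t1, total_bytes, bandwidth, latency
    integer :: status(MPI_STATUS_SIZE)
    integer :: ierr, rank, i, msg_size

    call MPI_Init (ierr)
    call MPI_Comm_rank (MPI_COMM_WORLD, rank, ierr)
    buf = 1.0d0

    if (test_type == 'latency') then
       msg_size = 40/8                                        ! 40-byte message for the latency test
    else
       msg_size = bufsize
    endif

    t0 = MPI_Wtime()
    do i = 1, num_transfers
       if (rank == 0) then                                    ! master sends first, then waits for the echo
          call MPI_Send (buf, msg_size, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
          call MPI_Recv (buf, msg_size, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, status, ierr)
       else if (rank == 1) then                               ! echo process receives, then sends back
          call MPI_Recv (buf, msg_size, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
          call MPI_Send (buf, msg_size, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
       endif
    enddo
    t1 = MPI_Wtime()

    if (rank == 0) then
       if (test_type == 'latency') then
          ! each transfer is a round trip, so divide by 2*num_transfers for one-way latency
          latency = 1.0d+6 * (t1 - t0) / (2 * num_transfers)
          print *, 'latency (us) =', latency
       else
          ! 8-byte words, moved in both directions over the measurement interval
          total_bytes = 8.0d0 * msg_size * num_transfers * 2
          bandwidth = 1.0d-6 * total_bytes / (t1 - t0)
          print *, 'bandwidth (10^6 bytes/s) =', bandwidth
       endif
    endif

    call MPI_Finalize (ierr)
    end program lbw_sketch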
    
    program ata

    if (master) then
       obtain nsteps from the command line 
    endif

    broadcast nsteps to other processors

    t1 = MPI_Wtime()
    do i = 1, nsteps / 2
       call MPI_Alltoall (inbuf, ...., outbuf, ...)
       call MPI_Alltoall (outbuf, ..., inbuf, ....)
    end do
    t2 = MPI_Wtime()

    tot_bytes = 8 * msg_size * (num_procs - 1) * nsteps
    bandwidth = 1.0e-06 * tot_bytes / (t2 - t1)

    print out results

    call exit (0)
    end
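
Likewise, here is a minimal Fortran 90/MPI sketch of the ATA test. The per-destination message size (50 KB, the reduced buffer size mentioned for Chapman) and the step count are assumed values chosen for illustration; the actual program reads nsteps from the command line. The variable msg_words plays the role of msg_size in the pseudo-code above, i.e., the per-destination message length in 8-byte words.

    program ata_sketch
    ! Minimal ATA (MPI_Alltoall) sketch; the parameters below are assumed, not the originals
    use mpi
    implicit none
    integer, parameter :: msg_words = 6250     ! 50 KB of 8-byte words per destination (assumed)
    integer, parameter :: nsteps = 100         ! assumed step count (even, so the loop below runs nsteps exchanges)
    double precision, allocatable :: inbuf(:), outbuf(:)
    double precision :: t1, t2, tot_bytes, bandwidth
    integer :: ierr, rank, num_procs, i

    call MPI_Init (ierr)
    call MPI_Comm_rank (MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size (MPI_COMM_WORLD, num_procs, ierr)

    allocate (inbuf(msg_words * num_procs), outbuf(msg_words * num_procs))
    inbuf = 1.0d0
    outbuf = 0.0d0

    t1 = MPI_Wtime()
    do i = 1, nsteps / 2
       ! each pair of calls performs two all-to-all exchanges,
       ! ping-ponging the data between the two buffers
       call MPI_Alltoall (inbuf,  msg_words, MPI_DOUBLE_PRECISION, &
                          outbuf, msg_words, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)
       call MPI_Alltoall (outbuf, msg_words, MPI_DOUBLE_PRECISION, &
                          inbuf,  msg_words, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)
    enddo
    t2 = MPI_Wtime()

    if (rank == 0) then
       ! per-process bytes sent: msg_words 8-byte words to each of the other
       ! (num_procs - 1) processes, nsteps times
       tot_bytes = 8.0d0 * msg_words * (num_procs - 1) * nsteps
       bandwidth = 1.0d-6 * tot_bytes / (t2 - t1)
       print *, 'per-process bandwidth (10^6 bytes/s) =', bandwidth
    endif

    deallocate (inbuf, outbuf)
    call MPI_Finalize (ierr)
    end program ata_sketch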
    




Last modified January 27, 2006
UCRL-WEB-218462