Message Passing Interface (MPI)

Author: Blaise Barney, Lawrence Livermore National Laboratory UCRL-MI-133316

Table of Contents

  1. Abstract
  2. What is MPI?
  3. LLNL MPI Implementations and Compilers
  4. Getting Started
  5. Environment Management Routines
  6. Exercise 1
  7. Point to Point Communication Routines
    1. General Concepts
    2. MPI Message Passing Routine Arguments
    3. Blocking Message Passing Routines
    4. Non-blocking Message Passing Routines
  8. Exercise 2
  9. Collective Communication Routines
  10. Derived Data Types
  11. Group and Communicator Management Routines
  12. Virtual Topologies
  13. A Brief Word on MPI-2 and MPI-3
  14. Exercise 3
  15. References and More Information
  16. Appendix A: MPI-1 Routine Index


Abstract


The Message Passing Interface Standard (MPI) is a message passing library standard based on the consensus of the MPI Forum, which has over 40 participating organizations, including vendors, researchers, software library developers, and users. The goal of the Message Passing Interface is to establish a portable, efficient, and flexible standard for message passing that will be widely used for writing message passing programs. As such, MPI is the first standardized, vendor-independent message passing library. The advantages of developing message passing software using MPI closely match the design goals of portability, efficiency, and flexibility. MPI is not an IEEE or ISO standard, but it has, in fact, become the "industry standard" for writing message passing programs on HPC platforms.

The goal of this tutorial is to teach those unfamiliar with MPI how to develop and run parallel programs according to the MPI standard. The primary topics that are presented focus on those which are the most useful for new MPI programmers. The tutorial begins with an introduction, background, and basic information for getting started with MPI. This is followed by a detailed look at the MPI routines that are most useful for new MPI programmers, including MPI Environment Management, Point-to-Point Communications, and Collective Communications routines. Numerous examples in both C and Fortran are provided, as well as a lab exercise.

The tutorial materials also include more advanced topics such as Derived Data Types, Group and Communicator Management Routines, and Virtual Topologies. However, these are not actually presented during the lecture, but are meant to serve as "further reading" for those who are interested.

Level/Prerequisites: This tutorial is ideal for those who are new to parallel programming with MPI. A basic understanding of parallel programming in C or Fortran is required. For those who are unfamiliar with Parallel Programming in general, the material covered in EC3500: Introduction To Parallel Computing would be helpful.



What is MPI?


An Interface Specification:

Programming Model:

Reasons for Using MPI:

History and Evolution: (for those interested)

Documentation:



LLNL MPI Implementations and Compilers




IBM BG/Q Clusters:



MVAPICH

General Info:

Compiling:

Running:

Documentation:



Open MPI

General Information:

Compiling:

Running:

Documentation:



MPI Build Scripts

If a different MPI implementation is desired, it must be selected before using the build scripts. Failing to do this will result in getting the default MVAPICH 1.2 implementation.



Level of Thread Support



Getting Started


General MPI Program Structure:

Header File:

Format of MPI Calls:

Communicators and Groups:

Rank:

Error Handling:



Environment Management Routines


This group of routines is used for interrogating and setting the MPI execution environment, and covers an assortment of purposes, such as initializing and terminating the MPI environment, querying a rank's identity, querying the MPI library's version, etc. Most of the commonly used ones are described below.

MPI_Init


MPI_Comm_size


MPI_Comm_rank


MPI_Abort


MPI_Get_processor_name


MPI_Get_version


MPI_Initialized


MPI_Wtime


MPI_Wtick


MPI_Finalize



Examples: Environment Management Routines
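
A minimal "Hello World" sketch in C, assuming a standard MPI installation, that exercises several of the routines described above (variable names are illustrative, not taken from the original example programs):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int numtasks, rank, len;
        char hostname[MPI_MAX_PROCESSOR_NAME];

        /* Initialize the MPI execution environment */
        MPI_Init(&argc, &argv);

        /* Get the number of tasks and this task's rank in MPI_COMM_WORLD */
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Get the name of the processor this task is running on */
        MPI_Get_processor_name(hostname, &len);

        printf("Hello from task %d of %d on %s\n", rank, numtasks, hostname);

        /* Shut down the MPI environment */
        MPI_Finalize();
        return 0;
    }

Built with an MPI wrapper compiler (typically mpicc) and launched with the site's MPI launcher, each task prints one line identifying its rank.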



MPI Exercise 1

Getting Started

Overview:
  • Login to an LC cluster using your workshop username and OTP token
  • Copy the exercise files to your home directory
  • Familiarize yourself with LC's MPI compilers
  • Write a simple "Hello World" MPI program using several MPI Environment Management routines
  • Successfully compile your program
  • Successfully run your program - several different ways

GO TO THE EXERCISE HERE

    Approx. 20 minutes



Point to Point Communication Routines

General Concepts

First, a Simple Example:

  • The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
    1. Inscribe a circle in a square
    2. Randomly generate points in the square
    3. Determine the number of points in the square that are also in the circle
    4. Let r be the number of points in the circle divided by the number of points in the square
    5. PI ~ 4 r
    6. Note that the more points generated, the better the approximation

  • Serial pseudo code for this procedure:

    npoints = 10000
    circle_count = 0
    
    do j = 1,npoints
      generate 2 random numbers between 0 and 1
      xcoordinate = random1
      ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
    end do
    
    PI = 4.0*circle_count/npoints
    

  • Leads to an "embarrassingly parallel" solution:
    • Break the loop iterations into chunks that can be executed by different tasks simultaneously.
    • Each task executes its portion of the loop a number of times.
    • Each task can do its work without requiring any information from the other tasks (there are no data dependencies).
    • Master task receives results from other tasks using send/receive point-to-point operations.

  • Pseudo code solution: note the changes introduced for parallelism (dividing the work among tasks and adding master/worker communication).

    npoints = 10000
    circle_count = 0
    
    p = number of tasks
    num = npoints/p
    
    find out if I am MASTER or WORKER 
    
    do j = 1,num 
      generate 2 random numbers between 0 and 1
      xcoordinate = random1
      ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
    end do
    
    if I am MASTER
      receive from WORKERS their circle_counts
      compute PI (use MASTER and WORKER calculations)
    else if I am WORKER
      send to MASTER circle_count
    endif

    Example MPI Program in C:   mpi_pi_reduce.c
    Example MPI Program in Fortran:   mpi_pi_reduce.f

  • Key Concept: Divide work between available tasks which communicate data via point-to-point message passing calls.
Figure: One method of determining PI
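
The linked example programs (mpi_pi_reduce.c / mpi_pi_reduce.f) collect the partial results with MPI_Reduce; the sketch below instead follows the pseudo code literally, using point-to-point send/receive calls. It is a minimal illustration, not one of the original examples; the random-number handling and all names are illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define MASTER  0
    #define NPOINTS 10000

    int main(int argc, char *argv[]) {
        int numtasks, rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each task generates its share of random points and counts hits */
        int num = NPOINTS / numtasks;
        int circle_count = 0;
        srand(rank + 1);                       /* different seed per task */
        for (i = 0; i < num; i++) {
            double x = (double)rand() / RAND_MAX;   /* point in unit square */
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)               /* inside quarter circle */
                circle_count++;
        }

        if (rank == MASTER) {
            /* Master collects each worker's count with point-to-point receives */
            int total = circle_count, worker_count;
            MPI_Status status;
            for (i = 1; i < numtasks; i++) {
                MPI_Recv(&worker_count, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
                total += worker_count;
            }
            printf("PI is approximately %f\n", 4.0 * total / (num * numtasks));
        } else {
            /* Workers send their counts to the master */
            MPI_Send(&circle_count, 1, MPI_INT, MASTER, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }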

Types of Point-to-Point Operations:

Buffering: Blocking vs. Non-blocking:

Order and Fairness:



Point to Point Communication Routines

MPI Message Passing Routine Arguments

MPI point-to-point communication routines generally have an argument list that takes one of the following formats:

Buffer

Program (application) address space that references the data that is to be sent or received. In most cases, this is simply the name of the variable to be sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand: &var1

Data Count

Indicates the number of data elements of a particular type to be sent.

Data Type

For reasons of portability, MPI predefines its elementary data types. The table below lists those required by the standard.

C Data Types

MPI_CHAR                      signed char
MPI_WCHAR                     wchar_t (wide character)
MPI_SHORT                     signed short int
MPI_INT                       signed int
MPI_LONG                      signed long int
MPI_LONG_LONG_INT
MPI_LONG_LONG                 signed long long int
MPI_SIGNED_CHAR               signed char
MPI_UNSIGNED_CHAR             unsigned char
MPI_UNSIGNED_SHORT            unsigned short int
MPI_UNSIGNED                  unsigned int
MPI_UNSIGNED_LONG             unsigned long int
MPI_UNSIGNED_LONG_LONG        unsigned long long int
MPI_FLOAT                     float
MPI_DOUBLE                    double
MPI_LONG_DOUBLE               long double
MPI_C_COMPLEX
MPI_C_FLOAT_COMPLEX           float _Complex
MPI_C_DOUBLE_COMPLEX          double _Complex
MPI_C_LONG_DOUBLE_COMPLEX     long double _Complex
MPI_C_BOOL                    _Bool
MPI_INT8_T                    int8_t
MPI_INT16_T                   int16_t
MPI_INT32_T                   int32_t
MPI_INT64_T                   int64_t
MPI_UINT8_T                   uint8_t
MPI_UINT16_T                  uint16_t
MPI_UINT32_T                  uint32_t
MPI_UINT64_T                  uint64_t
MPI_BYTE                      8 binary digits
MPI_PACKED                    data packed or unpacked with MPI_Pack()/MPI_Unpack()

Fortran Data Types

MPI_CHARACTER                 character(1)
MPI_INTEGER                   integer
MPI_INTEGER1                  integer*1
MPI_INTEGER2                  integer*2
MPI_INTEGER4                  integer*4
MPI_REAL                      real
MPI_REAL2                     real*2
MPI_REAL4                     real*4
MPI_REAL8                     real*8
MPI_DOUBLE_PRECISION          double precision
MPI_COMPLEX                   complex
MPI_DOUBLE_COMPLEX            double complex
MPI_LOGICAL                   logical
MPI_BYTE                      8 binary digits
MPI_PACKED                    data packed or unpacked with MPI_Pack()/MPI_Unpack()

Notes:

  • Programmers may also create their own data types (see Derived Data Types).
  • MPI_BYTE and MPI_PACKED do not correspond to standard C or Fortran types.
  • Some implementations may include additional elementary data types (MPI_LOGICAL2, MPI_COMPLEX32, etc.). Check the MPI header file.

Destination

An argument to send routines that indicates the process where a message should be delivered. Specified as the rank of the receiving process.

Source

An argument to receive routines that indicates the originating process of the message. Specified as the rank of the sending process. This may be set to the wild card MPI_ANY_SOURCE to receive a message from any task.

Tag

Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0-32767 can be used as tags, but most implementations allow a much larger range than this.

Communicator

Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.

Status

For a receive operation, indicates the source of the message and the tag of the message. In C, this argument is a pointer to a predefined structure MPI_Status (ex. stat.MPI_SOURCE stat.MPI_TAG). In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex. stat(MPI_SOURCE) stat(MPI_TAG)). Additionally, the actual number of elements received is obtainable from Status via the MPI_Get_count routine.

Request

Used by non-blocking send and receive operations. Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique "request number". The programmer uses this system assigned "handle" later (in a WAIT type routine) to determine completion of the non-blocking operation. In C, this argument is a pointer to a predefined structure MPI_Request. In Fortran, it is an integer.
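
To make the argument list concrete, the following C fragment (not one of the original examples) labels each argument of a blocking send/receive pair and then queries the returned Status with MPI_Get_count. It assumes at least two tasks; names are illustrative:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, count;
        double data[100];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* buffer=data, count=100, type=MPI_DOUBLE, dest=1, tag=7, communicator */
            MPI_Send(data, 100, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* source and tag could also be MPI_ANY_SOURCE / MPI_ANY_TAG */
            MPI_Recv(data, 100, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD, &status);

            /* Status records the actual source, tag, and message size */
            MPI_Get_count(&status, MPI_DOUBLE, &count);
            printf("Received %d doubles from task %d with tag %d\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }

        MPI_Finalize();
        return 0;
    }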


Point to Point Communication Routines

Blocking Message Passing Routines

The more commonly used MPI blocking message passing routines are described below.

MPI_Send


MPI_Recv


MPI_Ssend


MPI_Bsend


MPI_Buffer_attach
MPI_Buffer_detach


MPI_Rsend


MPI_Sendrecv


MPI_Wait
MPI_Waitany
MPI_Waitall
MPI_Waitsome


MPI_Probe


MPI_Get_count



Examples: Blocking Message Passing Routines
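
As a stand-in for the linked examples, here is a minimal two-task "ping" sketch in C using the blocking MPI_Send and MPI_Recv routines. It assumes exactly two tasks; names are illustrative:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, tag = 1;
        char inmsg, outmsg = 'x';
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Task 0 sends first, then waits for the reply */
            MPI_Send(&outmsg, 1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
            MPI_Recv(&inmsg, 1, MPI_CHAR, 1, tag, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            /* Task 1 posts its receive first, then sends the reply */
            MPI_Recv(&inmsg, 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
            MPI_Send(&outmsg, 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
        }

        printf("Task %d done\n", rank);
        MPI_Finalize();
        return 0;
    }

Note the ordering: each blocking receive is matched by a blocking send on the partner task, so neither task waits forever.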



Point to Point Communication Routines

Non-blocking Message Passing Routines

The more commonly used MPI non-blocking message passing routines are described below.

MPI_Isend


MPI_Irecv


MPI_Issend


MPI_Ibsend


MPI_Irsend


MPI_Test
MPI_Testany
MPI_Testall
MPI_Testsome


MPI_Iprobe



Examples: Non-blocking Message Passing Routines
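
As a stand-in for the linked examples, the sketch below performs a nearest-neighbor exchange around a ring with MPI_Isend/MPI_Irecv and completes all four requests with MPI_Waitall; names are illustrative:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int numtasks, rank, prev, next, buf[2], tag1 = 1, tag2 = 2;
        MPI_Request reqs[4];
        MPI_Status stats[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Determine left and right neighbors in a ring */
        prev = (rank - 1 + numtasks) % numtasks;
        next = (rank + 1) % numtasks;

        /* Post non-blocking receives and sends; all four calls return immediately */
        MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

        /* ... computation could overlap the communication here ... */

        /* Block until all four operations have completed */
        MPI_Waitall(4, reqs, stats);

        printf("Task %d received %d from the left and %d from the right\n",
               rank, buf[0], buf[1]);
        MPI_Finalize();
        return 0;
    }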



MPI Exercise 2

Point-to-Point Message Passing

Overview:
  • Login to the LC workshop cluster, if you are not already logged in
  • Using your "Hello World" MPI program from Exercise 1, add MPI blocking point-to-point routines to send and receive messages
  • Successfully compile your program
  • Successfully run your program - several different ways
  • Try the same thing with nonblocking send/receive routines

GO TO THE EXERCISE HERE

    Approx. 20 minutes



Collective Communication Routines


Types of Collective Operations:

Scope:

Programming Considerations and Restrictions:


Collective Communication Routines

MPI_Barrier


MPI_Bcast


MPI_Scatter


MPI_Gather


MPI_Allgather


MPI_Reduce

MPI_Allreduce


MPI_Reduce_scatter


MPI_Alltoall


MPI_Scan



Examples: Collective Communications
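
A short C sketch (not one of the original examples) combining two of the collective routines above: the root broadcasts a value with MPI_Bcast, and every task contributes to a global sum with MPI_Reduce:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int numtasks, rank, n = 0, sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Root sets a value and broadcasts it to every task in the communicator */
        if (rank == 0) n = 100;
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Every task contributes its rank; root receives the global sum */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Broadcast value = %d, sum of ranks = %d\n", n, sum);

        MPI_Finalize();
        return 0;
    }

Note that every task in the communicator must call the collective routine; only the root supplies the meaningful input to MPI_Bcast and receives the result of MPI_Reduce.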



Derived Data Types


Derived Data Type Routines

MPI_Type_contiguous


MPI_Type_vector
MPI_Type_hvector


MPI_Type_indexed
MPI_Type_hindexed


MPI_Type_struct


MPI_Type_extent


MPI_Type_commit


MPI_Type_free



Examples: Contiguous Derived Data Type
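
A hedged C sketch of a contiguous derived type (a simplified variant of the kind of example this section links to; names are illustrative): one row of a 4x4 array is committed as a new datatype and sent as a single element.

    #include <stdio.h>
    #include <mpi.h>
    #define SIZE 4

    int main(int argc, char *argv[]) {
        int rank;
        float a[SIZE][SIZE] = {{ 1.0,  2.0,  3.0,  4.0},
                               { 5.0,  6.0,  7.0,  8.0},
                               { 9.0, 10.0, 11.0, 12.0},
                               {13.0, 14.0, 15.0, 16.0}};
        float b[SIZE];
        MPI_Status stat;
        MPI_Datatype rowtype;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One row = SIZE contiguous floats */
        MPI_Type_contiguous(SIZE, MPI_FLOAT, &rowtype);
        MPI_Type_commit(&rowtype);

        if (rank == 0) {
            /* Send row 1 of the array as a single element of the new type */
            MPI_Send(&a[1][0], 1, rowtype, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(b, SIZE, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &stat);
            printf("Task %d received %.1f %.1f %.1f %.1f\n", rank, b[0], b[1], b[2], b[3]);
        }

        MPI_Type_free(&rowtype);
        MPI_Finalize();
        return 0;
    }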




Examples: Vector Derived Data Type
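
A hedged C sketch of a vector (strided) derived type, with illustrative names: one column of a row-major 4x4 C array is described as a new datatype and sent as a single element.

    #include <stdio.h>
    #include <mpi.h>
    #define SIZE 4

    int main(int argc, char *argv[]) {
        int rank;
        float a[SIZE][SIZE] = {{ 1.0,  2.0,  3.0,  4.0},
                               { 5.0,  6.0,  7.0,  8.0},
                               { 9.0, 10.0, 11.0, 12.0},
                               {13.0, 14.0, 15.0, 16.0}};
        float b[SIZE];
        MPI_Status stat;
        MPI_Datatype columntype;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* A column of a row-major C array: SIZE blocks of 1 float, stride SIZE */
        MPI_Type_vector(SIZE, 1, SIZE, MPI_FLOAT, &columntype);
        MPI_Type_commit(&columntype);

        if (rank == 0) {
            /* Send column 1 as one element of the strided type */
            MPI_Send(&a[0][1], 1, columntype, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(b, SIZE, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &stat);
            printf("Task %d received %.1f %.1f %.1f %.1f\n", rank, b[0], b[1], b[2], b[3]);
        }

        MPI_Type_free(&columntype);
        MPI_Finalize();
        return 0;
    }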




Examples: Indexed Derived Data Type




Examples: Struct Derived Data Type



Group and Communicator Management Routines


Groups vs. Communicators:

Primary Purposes of Group and Communicator Objects:

  1. Allow you to organize tasks, based upon function, into task groups.

  2. Enable Collective Communications operations across a subset of related tasks.

  3. Provide basis for implementing user defined virtual topologies

  4. Provide for safe communications

Programming Considerations and Restrictions:


Group and Communicator Management Routines
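
A hedged C sketch of group and communicator management, with illustrative names: the MPI_COMM_WORLD group is split into two four-task groups, a new communicator is created for each, and a collective MPI_Allreduce then runs within each new communicator. It assumes exactly 8 tasks.

    #include <stdio.h>
    #include <mpi.h>
    #define NPROCS 8

    int main(int argc, char *argv[]) {
        int numtasks, rank, new_rank, sendbuf, recvbuf;
        int ranks1[4] = {0, 1, 2, 3}, ranks2[4] = {4, 5, 6, 7};
        MPI_Group orig_group, new_group;
        MPI_Comm new_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (numtasks != NPROCS) {
            if (rank == 0) printf("Must specify %d tasks. Terminating.\n", NPROCS);
            MPI_Finalize();
            return 0;
        }
        sendbuf = rank;

        /* Extract the original group handle from MPI_COMM_WORLD */
        MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

        /* Divide the tasks into two distinct groups based on rank */
        if (rank < NPROCS / 2)
            MPI_Group_incl(orig_group, NPROCS / 2, ranks1, &new_group);
        else
            MPI_Group_incl(orig_group, NPROCS / 2, ranks2, &new_group);

        /* Create a new communicator for the new group and use it collectively */
        MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
        MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);

        MPI_Group_rank(new_group, &new_rank);
        printf("rank= %d newrank= %d recvbuf= %d\n", rank, new_rank, recvbuf);

        MPI_Finalize();
        return 0;
    }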



Virtual Topologies


What Are They?

Why Use Them?

Example:


Virtual Topology Routines
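
A hedged C sketch of a Cartesian virtual topology, with illustrative names: 16 tasks are mapped onto a 4x4 non-periodic grid with MPI_Cart_create, and each task reports its grid coordinates.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int numtasks, rank, coords[2];
        int dims[2] = {4, 4}, periods[2] = {0, 0}, reorder = 1;
        MPI_Comm cartcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (numtasks == 16) {
            /* Map the 16 tasks onto a 4x4 non-periodic Cartesian grid */
            MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cartcomm);

            /* Each task queries its coordinates in the grid */
            MPI_Comm_rank(cartcomm, &rank);
            MPI_Cart_coords(cartcomm, rank, 2, coords);
            printf("rank= %d coords= %d %d\n", rank, coords[0], coords[1]);
        } else {
            if (rank == 0) printf("Must specify 16 tasks. Terminating.\n");
        }

        MPI_Finalize();
        return 0;
    }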



A Brief Word on MPI-2 and MPI-3


MPI-2:

MPI-3:

More Information on MPI-2 and MPI-3:



MPI Exercise 3

Your Choice

Overview:
  • Login to the LC workshop cluster, if you are not already logged in
  • Following the Exercise 3 instructions will take you through all sorts of MPI programs - pick any/all that are of interest.
  • The intention is to review the codes and see what's happening - not just compile and run.
  • Several codes provide serial examples for a comparison with the parallel MPI versions.
  • Check out the "bug" programs.

GO TO THE EXERCISE HERE






This completes the tutorial.

Evaluation Form: Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercises.




References and More Information



Appendix A: MPI-1 Routine Index