The Message Passing Interface Standard (MPI) is a message passing library
standard based on the consensus of the MPI Forum, which has over 40
participating organizations, including vendors, researchers, software library
developers, and users. The goal of the Message Passing Interface is to
establish a portable, efficient, and flexible standard for message passing
that will be widely used for writing message passing programs. As such, MPI
is the first standardized, vendor independent, message passing library. The
advantages of developing message passing software using MPI closely match the
design goals of portability, efficiency, and flexibility. MPI is not an
IEEE or ISO standard, but it has, in fact, become the "industry standard" for
writing message passing programs on HPC platforms.
The goal of this tutorial is to teach those unfamiliar with MPI how to
develop and run parallel programs according to the MPI standard. The primary
topics that are presented focus on those which are the most useful for
new MPI programmers. The tutorial begins with an introduction, background,
and basic information for getting started with MPI. This is followed by
a detailed look at the MPI routines that are most useful for new MPI
programmers, including MPI Environment Management, Point-to-Point
Communications, and Collective Communications routines. Numerous examples
in both C and Fortran are provided, as well as a lab exercise.
The tutorial materials also include more advanced topics such as
Derived Data Types, Group and Communicator Management Routines, and
Virtual Topologies. However, these are not actually presented during the
lecture, but are meant to serve as "further reading" for those who are
interested.
Level/Prerequisites: This tutorial is ideal for those who are new to parallel programming with MPI. A basic understanding of parallel programming in C or Fortran is required. For those who are unfamiliar with Parallel Programming in general, the material covered in
EC3500: Introduction To Parallel
Computing would be helpful.
What is MPI?
An Interface Specification:
MPI = Message Passing Interface
MPI is a specification
for the developers and users of message passing libraries. By itself, it is NOT
a library - but rather the specification of what such a library should be.
MPI primarily addresses the message-passing parallel programming
model: data is moved from the address space of one process to
that of another process through cooperative operations on each process.
Simply stated, the goal of the Message Passing Interface is to provide a
widely used standard for writing message passing programs. The interface
attempts to be:
Practical
Portable
Efficient
Flexible
The MPI standard has gone through a number of revisions, with the most
recent version being MPI-3.x
Interface specifications have been defined for C and Fortran90 language
bindings:
C++ bindings from MPI-1 are removed in MPI-3
MPI-3 also provides support for Fortran 2003 and 2008 features
Actual MPI library implementations differ in which version and features
of the MPI standard they support. Developers/users will need to be aware
of this.
Programming Model:
Originally, MPI was designed for distributed memory architectures, which
were becoming increasingly popular at that time (1980s - early 1990s).
As architecture trends changed, shared memory SMPs were combined over
networks creating hybrid distributed memory / shared memory systems.
MPI implementors adapted their libraries to handle both types of underlying
memory architectures seamlessly. They also adapted/developed ways of
handling different interconnects and protocols.
Today, MPI runs on virtually any hardware platform:
Distributed Memory
Shared Memory
Hybrid
The programming model, however, clearly remains a distributed memory model,
regardless of the underlying physical architecture of the machine.
All parallelism is explicit: the programmer is responsible for
correctly identifying parallelism and implementing parallel
algorithms using MPI constructs.
Reasons for Using MPI:
Standardization
- MPI is the only message passing library that can be considered a standard. It
is supported on virtually all HPC platforms. Practically, it has replaced
all previous message passing libraries.
Portability
- There is little or no need to modify your source code
when you port your application to a different platform that supports
(and is compliant with) the MPI standard.
Performance Opportunities
- Vendor implementations should be able to exploit native hardware features to
optimize performance. Any implementation is free to develop optimized
algorithms.
Functionality
- There are over 430 routines defined in MPI-3, which includes the
majority of those in MPI-2 and MPI-1.
Most MPI programs can be written using a dozen or fewer routines.
Availability
- A variety of implementations are available, both vendor and
public domain.
History and Evolution: (for those interested)
MPI has resulted from the efforts of numerous individuals and groups
that began in 1992. Some history:
1980s - early 1990s: Distributed memory, parallel computing
develops, as do a number of incompatible software tools for
writing such programs - usually with tradeoffs between portability,
performance, functionality and price. Recognition of the need
for a standard arose.
Apr 1992: Workshop on Standards for Message Passing in a
Distributed Memory Environment, sponsored by the Center for
Research on Parallel Computing, Williamsburg, Virginia.
The basic features essential to a standard message passing
interface were discussed, and a working group established to
continue the standardization process. Preliminary draft proposal
developed subsequently.
Nov 1992: Working group meets in Minneapolis. MPI draft
proposal (MPI1) from ORNL presented. Group adopts procedures
and organization to form the
MPI Forum.
It eventually comprised about 175 individuals from
40 organizations, including parallel computer vendors, software
writers, academia, and application scientists.
Nov 1993: Supercomputing 93 conference - draft MPI standard
presented.
May 1994: Final version of MPI-1.0 released
MPI-1.1 (Jun 1995)
MPI-1.2 (Jul 1997)
MPI-1.3 (May 2008).
1998: MPI-2 picked up where the first MPI specification left off, and
addressed topics which went far beyond the MPI-1 specification.
Although the MPI programming interface has been standardized, actual library
implementations will differ in which version and features of the standard they
support.
The way MPI programs are compiled and run on different platforms may also vary.
Currently, LC supports these MPI implementations:
MVAPICH - Linux clusters
Open MPI - Linux clusters
Intel MPI - Linux clusters
IBM BG/Q MPI - BG/Q clusters
IBM Spectrum MPI - Coral Early Access (Power8 + Pascal) clusters
A summary of each is provided below, along with links to additional detailed
information.
MVAPICH 1.2 (TOSS 2 clusters)
MPI-1 implementation that also includes support for MPI-I/O
Based on MPICH-1.2.7 MPI library from Argonne National Laboratory
Not thread-safe. All MPI calls should be made by the master thread in
a multi-threaded MPI program.
See /usr/local/docs/mpi.mvapich.basics for LC
usage details.
MVAPICH2 (TOSS 2 and TOSS 3 clusters)
Multiple versions available
MPI-2 and MPI-3 implementations based on MPICH MPI library from Argonne
National Laboratory. Versions 1.9 and later implement MPI-3 according to the
developer's documentation.
TOSS 3: Default MPI implementation
TOSS 2: Not the default - requires the "use" command to load the
selected dotkit package. For example:
use -l mvapich              (list available packages)
use mvapich2-intel-2.1      (use the package of interest)
MPI executables are launched using the SLURM srun command
with the appropriate options. For example, to launch an 8-process MPI job
split across two different nodes in the pdebug pool:
srun -N2 -n8 -ppdebug a.out
The srun command is discussed in detail in the
Running Jobs
section of the Linux Clusters Overview tutorial.
Open MPI
Be sure to load the same Open MPI dotkit/module that you used to
build your executable. If you are running a batch job, you will need
to load the dotkit/module in your batch script.
Launching an Open MPI job can be done using the following commands.
For example, to run a 48 process MPI job:
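For example, assuming the executable is named a.out, typical Open MPI launch commands for a 48-process job look like the following (the exact commands and options depend on the Open MPI version and LC's scheduler configuration, so check LC's documentation):

mpirun -np 48 a.out
srun -n48 a.out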
LC-developed MPI compiler wrapper scripts are used to compile MPI programs.
These scripts automatically perform some error checks, include the appropriate
MPI #include files, link to the necessary MPI libraries, and pass options to
the underlying compiler.
Note: you may need to load a dotkit/module for the desired MPI implementation,
as discussed above. Failing to do this will result in getting the default
implementation.
For additional information:
See the man page (if it exists)
Issue the script name with the -help option
View the script yourself directly
MPI Build Scripts

Implementation               Language    Script Name              Underlying Compiler
MVAPICH 1.2 (TOSS 2 only)    C           mpicc                    gcc - GNU
                                         mpigcc                   gcc - GNU
                                         mpiicc                   icc - Intel
                                         mpipgcc                  pgcc - PGI
                             C++         mpiCC                    g++ - GNU
                                         mpig++                   g++ - GNU
                                         mpiicpc                  icpc - Intel
                                         mpipgCC                  pgCC - PGI
                             Fortran     mpif77                   g77 - GNU
                                         mpigfortran              gfortran - GNU
                                         mpiifort                 ifort - Intel
                                         mpipgf77                 pgf77 - PGI
                                         mpipgf90                 pgf90 - PGI
MVAPICH2                     C           mpicc                    C compiler of dotkit package loaded
                             C++         mpicxx                   C++ compiler of dotkit package loaded
                             Fortran     mpif77                   Fortran77 compiler of dotkit package loaded
                                         mpif90                   Fortran90 compiler of dotkit package loaded
Open MPI                     C           mpicc                    C compiler of dotkit package loaded
                             C++         mpiCC, mpic++, mpicxx    C++ compiler of dotkit package loaded
                             Fortran     mpif77                   Fortran77 compiler of dotkit package loaded
                                         mpif90                   Fortran90 compiler of dotkit package loaded
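For example, compiling a C or Fortran source file with the wrappers listed above typically looks like this (the source and executable names here are placeholders only):

mpicc  -o hello_c  hello_mpi.c
mpif90 -o hello_f  hello_mpi.f90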
Level of Thread Support
MPI libraries vary in their level of thread support:
MPI_THREAD_SINGLE - Level 0: Only one thread will execute.
MPI_THREAD_FUNNELED - Level 1:
The process may be multi-threaded, but only
the main thread will make MPI calls - all MPI calls are funneled
to the main thread.
MPI_THREAD_SERIALIZED - Level 2:
The process may be multi-threaded, and
multiple threads may make MPI calls, but only one at a time. That is,
calls are not made concurrently from two distinct threads as all MPI calls
are serialized.
MPI_THREAD_MULTIPLE - Level 3:
Multiple threads may call MPI with no restrictions.
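As an illustration (a minimal sketch, not LC-specific), a program can request a particular level at startup with the standard MPI_Init_thread routine and check what the library actually provides; the requested level here is only an example:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided;

    /* ask for the highest level; the library reports what it can provide */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        printf("Warning: requested MPI_THREAD_MULTIPLE, got level %d\n", provided);

    MPI_Finalize();
    return 0;
}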
Format of MPI Calls:
C routines return the error code as the value of the function; Fortran routines
return it in the "ierr" argument. MPI_SUCCESS is returned if the call completed
successfully.
Communicators and Groups:
MPI uses objects called communicators and groups to define which
collection of processes may communicate with each other.
Most MPI routines require you to specify a communicator as an argument.
Communicators and groups will be covered in more detail later. For now,
simply use MPI_COMM_WORLD whenever a communicator is required -
it is the predefined communicator that includes all of your MPI processes.
Rank:
Within a communicator, every process has its own unique, integer
identifier assigned by the system when the process initializes. A rank
is sometimes also called a "task ID". Ranks are contiguous and begin
at zero.
Used by the programmer to specify the source and destination of
messages. Often used conditionally by the application to control
program execution (if rank=0 do this / if rank=1 do that).
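A minimal C sketch of this rank-based control flow (the work done in each branch is just a placeholder):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("Rank %d: doing the 'master' work\n", rank);   /* placeholder */
    else
        printf("Rank %d: doing the 'worker' work\n", rank);   /* placeholder */

    MPI_Finalize();
    return 0;
}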
Error Handling:
Most MPI routines include a return/error code parameter, as described
in the "Format of MPI Calls" section above.
However, according to the MPI standard, the default behavior of an MPI call is
to abort if there is an error. This means you will probably not be able to
capture a return/error code other than MPI_SUCCESS (zero).
The standard does provide a means to override this default error handler.
For details, consult the error handling section of the relevant MPI Standard
documentation located at http://www.mpi-forum.org/docs/.
The types of errors displayed to the user are implementation dependent.
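A minimal C sketch of overriding the default handler with the predefined MPI_ERRORS_RETURN handler, so that error codes can be examined instead of aborting (the broadcast call here is just an arbitrary example):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rc, len, value = 0;
    char errstring[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* replace the default MPI_ERRORS_ARE_FATAL handler so that errors
       are returned as codes instead of aborting the job */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* an ordinary call - its return code can now be checked meaningfully */
    rc = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, errstring, &len);
        printf("MPI_Bcast failed: %s\n", errstring);
    }

    MPI_Finalize();
    return 0;
}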
Environment Management Routines
This group of routines is used for interrogating and setting the MPI execution
environment, and covers an assortment of purposes, such as initializing and terminating the MPI environment, querying a rank's identity, querying the MPI
library's version, etc. Most of the commonly used ones are described below.
MPI_Init
Initializes the MPI execution environment. This function must be called
in every MPI program, must be called before any other MPI functions
and must be called only once in an MPI program. For C programs, MPI_Init
may be used to pass the command line arguments to all processes,
although this is not required by the standard and is implementation
dependent.
MPI_Comm_size
Returns the total number of MPI processes in the specified
communicator, such as MPI_COMM_WORLD. If the communicator is
MPI_COMM_WORLD, then it represents the number of MPI tasks
available to your application.
MPI_Comm_rank
Returns the rank of the calling MPI process within the specified communicator.
Initially, each process will be assigned a unique integer rank
between 0 and number of tasks - 1 within the communicator
MPI_COMM_WORLD. This rank is often referred to as a task ID.
If a process becomes associated with other communicators, it will have
a unique rank within each of these as well.
MPI_Abort
Terminates all MPI processes associated with the communicator. In
most MPI implementations it terminates ALL processes regardless of
the communicator specified.
MPI_Get_processor_name
Returns the processor name. Also
returns the length of the name. The buffer for "name" must be at
least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into
"name" is implementation dependent - may not be the same as the output
of the "hostname" or "host" shell commands.
MPI_Initialized
Indicates whether MPI_Init has been called - returns flag as either
logical true (1) or false(0). MPI requires that MPI_Init
be called once and only once by each process. This may pose a problem
for modules that want to use MPI and are prepared to call MPI_Init
if necessary. MPI_Initialized solves this problem.
MPI_Finalize
Terminates the MPI execution environment. This function should be
the last MPI routine called in every MPI program - no other MPI
routines may be called after it.
// required MPI include file
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
int numtasks, rank, len, rc;
char hostname[MPI_MAX_PROCESSOR_NAME];
// initialize MPI and check for success
rc = MPI_Init(&argc,&argv);
if (rc != MPI_SUCCESS) {
    printf("Error starting MPI program. Terminating.\n");
    MPI_Abort(MPI_COMM_WORLD, rc);
}
// get number of tasks
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
// get my rank
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
// this one is obvious
MPI_Get_processor_name(hostname, &len);
printf ("Number of tasks= %d My rank= %d Running on %s\n", numtasks,rank,hostname);
// do some work with message passing
// done with MPI
MPI_Finalize();
}
program simple
! required MPI include file
include 'mpif.h'
integer numtasks, rank, len, ierr
character(MPI_MAX_PROCESSOR_NAME) hostname
! initialize MPI
call MPI_INIT(ierr)
! get number of tasks
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
! get my rank
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
! this one is obvious
call MPI_GET_PROCESSOR_NAME(hostname, len, ierr)
print *, 'Number of tasks=',numtasks,' My rank=',rank,' Running on=',hostname
! do some work with message passing
! done with MPI
call MPI_FINALIZE(ierr)
end
MPI Exercise 1
Getting Started
Overview:
Login to an LC cluster using your workshop username and OTP token
Copy the exercise files to your home directory
Familiarize yourself with LC's MPI compilers
Write a simple "Hello World" MPI program using several MPI
Environment Management routines
Successfully compile your program
Successfully run your program - several different ways
The value of PI can be calculated in various ways. Consider the Monte Carlo
method of approximating PI:
Inscribe a circle with radius r in a square with side length of 2r
The area of the circle is πr² and the area of the square is 4r²
The ratio of the area of the circle to the area of the square is: πr² / 4r² = π / 4
If you randomly generate N points inside the square, approximately M = N * π / 4 of those points should fall inside the circle.
π is then approximated as:
N * π / 4 = M
π / 4 = M / N
π = 4 * M / N
Note that increasing the number of points generated improves the approximation.
Serial pseudo code for this procedure:
npoints = 10000
circle_count = 0
do j = 1,npoints
generate 2 random numbers between 0 and 1
xcoordinate = random1
ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
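A minimal serial C sketch of this procedure (it uses the unit square and the quarter circle, which give the same π/4 ratio; the seed and point count are arbitrary):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int npoints = 10000;
    int circle_count = 0;
    double x, y;

    srand(12345);                           /* arbitrary fixed seed */
    for (int j = 0; j < npoints; j++) {
        x = (double)rand() / RAND_MAX;      /* random point in the unit square */
        y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)           /* inside the quarter circle */
            circle_count++;
    }
    printf("Estimated PI = %f\n", 4.0 * circle_count / npoints);
    return 0;
}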
This leads to an "embarrassingly parallel" solution:
Break the loop iterations into chunks that can be executed by different
tasks simultaneously.
Each task executes its portion of the loop a number of times.
Each task can do its work without requiring any information
from the other tasks (there are no data dependencies).
The master task receives results from the other tasks using point-to-point
send/receive operations.
Pseudo code solution:
The changes made for parallelism are the task setup (p and num), the
per-task loop bound, and the MASTER/WORKER exchange at the end.
npoints = 10000
circle_count = 0
p = number of tasks
num = npoints/p
find out if I am MASTER or WORKER
do j = 1,num
generate 2 random numbers between 0 and 1
xcoordinate = random1
ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do
if I am MASTER
receive from WORKERS their circle_counts
compute PI (use MASTER and WORKER calculations)
else if I am WORKER
send to MASTER circle_count
endif
Key Concept: Divide work between
available tasks which communicate data via point-to-point message passing
calls.
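A minimal C sketch of the pseudo code above (not the official exercise solution; the random-number scheme and point count are illustrative only):

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define MASTER 0

int main(int argc, char *argv[]) {
    int npoints = 10000, circle_count = 0;
    int numtasks, rank, num, tag = 1;
    double x, y, pi;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    num = npoints / numtasks;       /* each task handles its share of the points */
    srand(rank + 1);                /* different random stream per task */

    for (int j = 0; j < num; j++) {
        x = (double)rand() / RAND_MAX;
        y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            circle_count++;
    }

    if (rank == MASTER) {
        /* master collects the counts from all workers and computes PI */
        int total = circle_count, worker_count;
        for (int i = 1; i < numtasks; i++) {
            MPI_Recv(&worker_count, 1, MPI_INT, i, tag, MPI_COMM_WORLD, &status);
            total += worker_count;
        }
        pi = 4.0 * total / (num * numtasks);
        printf("Estimated PI = %f\n", pi);
    } else {
        /* workers send their counts to the master */
        MPI_Send(&circle_count, 1, MPI_INT, MASTER, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}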
Types of Point-to-Point Operations:
MPI point-to-point operations typically involve message passing between
two, and only two, different MPI tasks. One task is performing a send
operation and the other task is performing a matching receive operation.
There are different types of send and receive routines used for
different purposes. For example:
Synchronous send
Blocking send / blocking receive
Non-blocking send / non-blocking receive
Buffered send
Combined send/receive
"Ready" send
Any type of send routine can be paired with any type of receive routine.
MPI also provides several routines associated with send - receive
operations, such as those used to wait for a message's arrival or
probe to find out if a message has arrived.
Buffering:
In a perfect world, every send operation would be perfectly
synchronized with its matching receive. This is rarely the case.
Somehow or other, the MPI implementation must be able to deal with
storing data when the two tasks are out of sync.
Consider the following two cases:
A send operation occurs 5 seconds before the receive is ready -
where is the message while the receive is pending?
Multiple sends arrive at the same receiving task which can only
accept one send at a time - what happens to the messages that are
"backing up"?
The MPI implementation (not the MPI standard) decides what happens to
data in these types of cases. Typically, a system buffer area
is reserved to hold data in transit. For example:
System buffer space is:
Opaque to the programmer and managed entirely by the MPI library
A finite resource that can be easy to exhaust
Often mysterious and not well documented
Able to exist on the sending side, the receiving side, or both
Something that may improve program performance because it allows
send - receive operations to be asynchronous.
User managed address space (i.e. your program variables) is called
the application buffer. MPI also provides for a user managed
send buffer.
Blocking vs. Non-blocking:
Most of the MPI point-to-point routines can be used in either blocking
or non-blocking mode.
Blocking:
A blocking send routine will only "return" after it is safe to modify
the application buffer (your send data) for reuse. Safe means that
modifications will not affect the data intended for the receive task.
Safe does not imply that the data was actually received - it may
very well be sitting in a system buffer.
A blocking send can be synchronous which means there is handshaking
occurring with the receive task to confirm a safe send.
A blocking send can be asynchronous if a system buffer is used to
hold the data for eventual delivery to the receive.
A blocking receive only "returns" after the data has arrived and
is ready for use by the program.
Non-blocking:
Non-blocking send and receive routines behave similarly - they will
return almost immediately. They do not wait for any communication
events to complete, such as message copying from user memory to
system buffer space or the actual arrival of message.
Non-blocking operations simply "request" the MPI library to perform
the operation when it is able. The user can not predict when that
will happen.
It is unsafe to modify the application buffer (your
variable space) until you know for a fact the requested
non-blocking operation was actually performed by the library.
There are "wait" routines used to do this.
Non-blocking communications are primarily used to overlap computation
with communication and exploit possible performance gains.
Blocking Send - Safe:

myvar = 0;
for (i=1; i<ntasks; i++) {
   task = i;
   MPI_Send (&myvar ... ... task ...);
   myvar = myvar + 2;        /* do some work */
}

Safe. Why? MPI_Send does not return until it is safe to modify myvar, so the
update cannot corrupt the message data.

Non-blocking Send - Unsafe:

myvar = 0;
for (i=1; i<ntasks; i++) {
   task = i;
   MPI_Isend (&myvar ... ... task ...);
   myvar = myvar + 2;        /* do some work */
   MPI_Wait (...);
}

Unsafe. Why? myvar is modified before MPI_Wait confirms that the non-blocking
send has completed, so the message may pick up the changed value.
Order and Fairness:
Order:
MPI guarantees that messages will not overtake each other.
If a sender sends two messages (Message 1 and Message 2) in succession
to the same destination, and both match the same receive, the receive
operation will receive Message 1 before Message 2.
If a receiver posts two receives (Receive 1 and Receive 2), in
succession, and both are looking for the same message, Receive 1 will
receive the message before Receive 2.
Order rules do not apply if there are multiple threads participating
in the communication operations.
Fairness:
MPI does not guarantee fairness - it's up to the programmer to
prevent "operation starvation".
Example: task 0 sends a message to task 2. However, task 1 sends
a competing message that matches task 2's receive. Only one of the
sends will complete.
Point to Point Communication Routines
MPI Message Passing Routine Arguments
MPI point-to-point communication routines generally have an argument list
that takes one of the following formats:
Buffer
Program (application) address space that references the data that is
to be sent or received. In most cases, this is simply the name of the variable
that is sent/received. For C programs, this argument is
passed by reference and usually must be prepended with an ampersand:
&var1
Data Count
Indicates the number of data elements of a particular type to be sent.
Data Type
For reasons of portability, MPI predefines its elementary data types.
Commonly used C types include MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, and
MPI_DOUBLE; commonly used Fortran types include MPI_CHARACTER, MPI_INTEGER,
MPI_REAL, MPI_DOUBLE_PRECISION, MPI_COMPLEX, and MPI_LOGICAL. MPI_BYTE (raw
bytes) and MPI_PACKED (data packed or unpacked with MPI_Pack() / MPI_Unpack())
are also predefined for both languages.
Notes:
Programmers may also create their own data types
(see Derived Data Types).
MPI_BYTE and MPI_PACKED do not correspond to standard C or
Fortran types.
Some implementations may include additional elementary data
types (MPI_LOGICAL2, MPI_COMPLEX32, etc.). Check the MPI header file.
Destination
An argument to send routines that indicates the process where a
message should be delivered. Specified as the rank of the receiving
process.
Source
An argument to receive routines that indicates the originating process
of the message. Specified as the rank of the sending process. This may
be set to the wild card MPI_ANY_SOURCE to receive a message from any task.
Tag
Arbitrary non-negative integer assigned by the programmer to uniquely
identify a message. Send and receive operations should match message
tags. For a receive operation, the wild card MPI_ANY_TAG can be used
to receive any message regardless of its tag. The MPI standard guarantees
that integers 0-32767 can be used as tags, but most implementations allow
a much larger range than this.
Communicator
Indicates the communication context, or set of processes for which the
source or destination fields are valid. Unless the programmer is
explicitly creating new communicators, the predefined communicator
MPI_COMM_WORLD is usually used.
Status
For a receive operation, indicates the source of the message and the
tag of the message. In C, this argument is a pointer to a predefined
structure MPI_Status (ex. stat.MPI_SOURCE stat.MPI_TAG).
In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex.
stat(MPI_SOURCE) stat(MPI_TAG)). Additionally, the actual number of
bytes received is obtainable from Status via the MPI_Get_count
routine. The constants MPI_STATUS_IGNORE and MPI_STATUSES_IGNORE can be
substituted if the message's source, tag, and size will not be queried later.
Request
Used by non-blocking send and receive operations. Since non-blocking
operations may return before the requested system buffer space is
obtained, the system issues a unique "request number". The programmer
uses this system assigned "handle" later (in a WAIT type routine)
to determine completion of the non-blocking operation. In C, this
argument is a pointer to a predefined structure MPI_Request. In Fortran,
it is an integer.
Point to Point Communication Routines
Blocking Message Passing Routines
The more commonly used MPI blocking message passing routines are described
below.
MPI_Send
Basic blocking send operation. Routine returns only after the
application buffer in the sending task is free for reuse. Note that
this routine may be implemented differently on different systems. The
MPI standard permits the use of a system buffer but does not require it.
Some implementations may actually use a synchronous send (discussed
below) to implement the basic blocking send.
MPI_Ssend
Synchronous blocking send: Send a message and block until the
application buffer in the sending task is free for reuse and the
destination process has started to receive the message.
MPI_Sendrecv
Send a message and post a receive before blocking. Will block until
the sending application buffer is free for reuse and until the receiving
application buffer contains the received message.
MPI_Wait (MPI_Waitany, MPI_Waitall, MPI_Waitsome)
MPI_Wait blocks until a specified non-blocking send or receive
operation has completed. For multiple non-blocking operations, the
programmer can specify any, all or some completions.
MPI_Probe
Performs a blocking test for a message. The "wildcards" MPI_ANY_SOURCE
and MPI_ANY_TAG may be used to test for a message from any source or
with any tag. For the C routine, the actual source and tag will be
returned in the status structure as status.MPI_SOURCE and
status.MPI_TAG. For the Fortran routine, they will be returned in
the integer array status(MPI_SOURCE) and status(MPI_TAG).
MPI_Get_count
Returns the source, tag and number of elements of datatype received. Can be
used with both blocking and non-blocking receive operations.
For the C routine, the actual source and tag will be
returned in the status structure as status.MPI_SOURCE and
status.MPI_TAG. For the Fortran routine, they will be returned in
the integer array status(MPI_SOURCE) and status(MPI_TAG).
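A minimal C sketch of a blocking exchange between two tasks using MPI_Send, MPI_Recv, and MPI_Get_count (run with at least two tasks; the message content is arbitrary):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int numtasks, rank, dest, source, count, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = source = 1;
        MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = source = 0;
        MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    if (rank < 2) {
        /* inspect the status structure of the message just received */
        MPI_Get_count(&Stat, MPI_CHAR, &count);
        printf("Task %d: received %d char(s) from task %d with tag %d\n",
               rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}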
Point to Point Communication Routines
Non-Blocking Message Passing Routines
The more commonly used MPI non-blocking message passing routines are described
below.
MPI_Isend
Identifies an area in memory to serve as a send buffer. Processing
continues immediately without waiting for the message to be copied out
from the application buffer. A communication request handle is
returned for handling the pending message status. The program should
not modify the application buffer until subsequent calls to MPI_Wait
or MPI_Test indicate that the non-blocking send has completed.
MPI_Irecv
Identifies an area in memory to serve as a receive buffer. Processing
continues immediately without actually waiting for the message to be
received and copied into the the application buffer. A communication
request handle is returned for handling the pending message status.
The program must use calls to MPI_Wait or MPI_Test to determine when the
non-blocking receive operation completes and the requested message is
available in the application buffer.
MPI_Issend
Non-blocking synchronous send. Similar to MPI_Isend(), except
MPI_Wait() or MPI_Test() indicates when the destination process has
received the message.
MPI_Test (MPI_Testany, MPI_Testall, MPI_Testsome)
MPI_Test checks the status of a specified non-blocking send or receive
operation. The "flag" parameter is returned logical true (1) if the
operation has completed, and logical false (0) if not. For multiple
non-blocking operations, the programmer can specify any, all or some
completions.
MPI_Iprobe
Performs a non-blocking test for a message. The "wildcards"
MPI_ANY_SOURCE and MPI_ANY_TAG may be used to test for a message
from any source or with any tag. The integer "flag" parameter is returned
logical true (1) if a message has arrived, and logical false (0) if not.
For the C routine, the actual source and tag will be
returned in the status structure as status.MPI_SOURCE and
status.MPI_TAG. For the Fortran routine, they will be returned in
the integer array status(MPI_SOURCE) and status(MPI_TAG).
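A minimal C sketch of a nearest-neighbor ring exchange using non-blocking calls and MPI_Waitall (the tags and message contents are arbitrary):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int numtasks, rank, next, prev, buf[2], tag1 = 1, tag2 = 2;
    MPI_Request reqs[4];
    MPI_Status stats[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* nearest neighbours in a ring */
    prev = (rank == 0) ? numtasks - 1 : rank - 1;
    next = (rank == numtasks - 1) ? 0 : rank + 1;

    /* post receives first, then sends; none of these calls block */
    MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

    /* do other work here, then wait for all four operations to complete */
    MPI_Waitall(4, reqs, stats);

    printf("Task %d: received %d from the left and %d from the right\n",
           rank, buf[0], buf[1]);

    MPI_Finalize();
    return 0;
}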
Collective Communication Routines
Types of Collective Operations:
Synchronization - processes wait until all members of the group
have reached the synchronization point.
Data Movement - broadcast, scatter/gather, all to all.
Collective Computation (reductions) - one member of the group
collects data from the other members and performs an operation
(min, max, add, multiply, etc.) on that data.
Scope:
Collective communication routines must involve all processes
within the scope of a communicator.
All processes are by default, members in the communicator MPI_COMM_WORLD.
Unexpected behavior, including program failure, can occur if even one
task in the communicator doesn't participate.
It is the programmer's responsibility to ensure that all processes
within a communicator participate in any collective operations.
Programming Considerations and Restrictions:
Collective communication routines do not take message tag arguments.
Collective operations within subsets of processes are accomplished
by first partitioning the subsets into new groups and then
attaching the new groups to new communicators (discussed in the
Group and Communicator
Management Routines section).
Can only be used with MPI predefined datatypes - not with MPI
Derived Data Types.
MPI-2 extended most collective operations to allow data movement
between intercommunicators (not covered here).
With MPI-3, collective operations can be blocking or non-blocking. Only
blocking operations are covered in this tutorial.
MPI_Barrier
Synchronization operation. Creates a barrier synchronization in a group.
Each task, when reaching the MPI_Barrier call, blocks until all tasks in the
group reach the same MPI_Barrier call. Then all tasks are free to proceed.
MPI_Gather
Data movement operation.
Gathers distinct messages from each task in the group to a single
destination task. This routine is the reverse operation of MPI_Scatter.
MPI_Allgather
Data movement operation.
Concatenation of data to all tasks in a group. Each task in the group,
in effect, performs a one-to-all broadcasting operation within the
group.
MPI_Reduce
Collective computation operation. Applies a reduction operation on all tasks
in the group and places the result in one task.
The predefined MPI reduction operations appear below. Users can also
define their own reduction functions by using the
MPI_Op_create routine.
MPI Reduction Operation                C Data Types                  Fortran Data Types
MPI_MAX      maximum                   integer, float                integer, real, complex
MPI_MIN      minimum                   integer, float                integer, real, complex
MPI_SUM      sum                       integer, float                integer, real, complex
MPI_PROD     product                   integer, float                integer, real, complex
MPI_LAND     logical AND               integer                       logical
MPI_BAND     bit-wise AND              integer, MPI_BYTE             integer, MPI_BYTE
MPI_LOR      logical OR                integer                       logical
MPI_BOR      bit-wise OR               integer, MPI_BYTE             integer, MPI_BYTE
MPI_LXOR     logical XOR               integer                       logical
MPI_BXOR     bit-wise XOR              integer, MPI_BYTE             integer, MPI_BYTE
MPI_MAXLOC   max value and location    float, double, long double    real, complex, double precision
MPI_MINLOC   min value and location    float, double, long double    real, complex, double precision
Note from the MPI_Reduce man page:
The operation is always assumed to be associative. All predefined operations are also assumed to be commutative. Users may define operations that are assumed to be associative, but not commutative. The "canonical" evaluation order of a reduction is determined by the ranks of the processes in the group. However, the implementation can take advantage of associativity, or associativity and commutativity in order to change the order of evaluation. This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating point addition. [Advice to implementors]
It is strongly recommended that MPI_REDUCE be implemented so that the same result be obtained whenever the function is applied on the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of processors. [End of advice to implementors]
MPI_Allreduce
Collective computation operation + data movement.
Applies a reduction operation and places the result in all tasks in the
group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
MPI_Reduce_scatter
Collective computation operation + data movement.
First does an element-wise reduction on a vector across all tasks in the
group. Next, the result vector is split into disjoint segments and
distributed across the tasks. This is equivalent to an MPI_Reduce
followed by an MPI_Scatter operation.
MPI_Alltoall
Data movement operation.
Each task in a group performs a scatter operation, sending a distinct
message to all the tasks in the group in order by index.
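As an illustration of a collective data-movement routine (a minimal sketch built on standard MPI calls, not an LC-specific example), the following program scatters the rows of an array from task 0 to every task in MPI_COMM_WORLD; it must be run with exactly 4 tasks:

#include "mpi.h"
#include <stdio.h>

#define SIZE 4

int main(int argc, char *argv[]) {
    int numtasks, rank;
    float sendbuf[SIZE][SIZE] = {
        { 1.0,  2.0,  3.0,  4.0},
        { 5.0,  6.0,  7.0,  8.0},
        { 9.0, 10.0, 11.0, 12.0},
        {13.0, 14.0, 15.0, 16.0} };
    float recvbuf[SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (numtasks == SIZE) {
        /* task 0 (the root) sends one row of sendbuf to each task */
        MPI_Scatter(sendbuf, SIZE, MPI_FLOAT,
                    recvbuf, SIZE, MPI_FLOAT, 0, MPI_COMM_WORLD);
        printf("rank= %d  results: %f %f %f %f\n",
               rank, recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
    } else {
        printf("Must specify %d tasks. Terminating.\n", SIZE);
    }

    MPI_Finalize();
    return 0;
}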
MPI also provides facilities for you to define your own data structures
based upon sequences of the MPI primitive data types. Such user defined
structures are called derived data types.
Primitive data types are contiguous. Derived data types allow you to
specify non-contiguous data in a convenient manner and to treat it as
though it was contiguous.
MPI provides several methods for constructing derived data types:
MPI_Type_contiguous
The simplest constructor. Produces a new data type by making count contiguous
copies of an existing data type.
MPI_Type_vector / MPI_Type_hvector
Similar to contiguous, but allows for regular gaps (stride) in the
displacements. MPI_Type_hvector is identical to
MPI_Type_vector except that stride is specified in bytes.
MPI_Type_indexed / MPI_Type_hindexed
An array of displacements of the input data type is provided as the map
for the new data type. MPI_Type_hindexed is identical to
MPI_Type_indexed except that offsets are specified in bytes.
MPI_Type_struct
The new data type is formed according to a completely defined map of the
component data types.
NOTE: This function is deprecated in MPI-2.0 and replaced by
MPI_Type_create_struct in MPI-3.0
MPI_Type_extent
Returns the size in bytes of the specified data type. Useful for
the MPI subroutines that require specification of offsets in bytes.
NOTE: This function is deprecated in MPI-2.0 and replaced by
MPI_Type_get_extent in MPI-3.0
MPI_Type_commit
Commits the new datatype to the system. Required for all user constructed
(derived) datatypes.
MPI_Type_free
Deallocates the specified datatype object. Use of this routine is
especially important to prevent memory exhaustion if many datatype
objects are created, as in a loop.
program struct
include 'mpif.h'
integer NELEM
parameter(NELEM=25)
integer numtasks, rank, source, dest, tag, i, ierr
integer stat(MPI_STATUS_SIZE)
type Particle
sequence
real*4 x, y, z, velocity
integer n, type
end type Particle
type (Particle) p(0:NELEM-1), particles(0:NELEM-1)
integer particletype, oldtypes(0:1) ! required variables
integer blockcounts(0:1), offsets(0:1), extent
tag = 1
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
! setup description of the 4 MPI_REAL fields x, y, z, velocity
offsets(0) = 0
oldtypes(0) = MPI_REAL
blockcounts(0) = 4
! setup description of the 2 MPI_INTEGER fields n, type
! need to first figure offset by getting size of MPI_REAL
call MPI_TYPE_EXTENT(MPI_REAL, extent, ierr)
offsets(1) = 4 * extent
oldtypes(1) = MPI_INTEGER
blockcounts(1) = 2
! define structured type and commit it
call MPI_TYPE_STRUCT(2, blockcounts, offsets, oldtypes, &
particletype, ierr)
call MPI_TYPE_COMMIT(particletype, ierr)
! task 0 initializes the particle array and then sends it to each task
if (rank .eq. 0) then
do i=0, NELEM-1
particles(i) = Particle ( 1.0*i, -1.0*i, 1.0*i, 0.25, i, mod(i,2) )
end do
do i=0, numtasks-1
call MPI_SEND(particles, NELEM, particletype, i, tag, &
MPI_COMM_WORLD, ierr)
end do
endif
! all tasks receive particletype data
source = 0
call MPI_RECV(p, NELEM, particletype, source, tag, &
MPI_COMM_WORLD, stat, ierr)
print *, 'rank= ',rank,' p(3)= ',p(3)
! free datatype when done using it
call MPI_TYPE_FREE(particletype, ierr)
call MPI_FINALIZE(ierr)
end
A group is an ordered set of processes. Each process in a group is
associated with a unique integer rank. Rank values start at zero and go
to N-1, where N is the number of processes in the group.
In MPI, a group is represented within system memory as an object.
It is accessible to the programmer only by a "handle". A group is always
associated with a communicator object.
A communicator encompasses a group of processes that may communicate with
each other. All MPI messages must specify a communicator. In the
simplest sense, the communicator is an extra "tag" that must be included
with MPI calls.
Like groups, communicators are represented within system memory as
objects and are accessible to the programmer only by "handles".
For example, the handle for the communicator that comprises all tasks
is MPI_COMM_WORLD.
From the programmer's perspective, a group and a communicator are one.
The group routines are primarily used to specify which processes should
be used to construct a communicator.
Primary Purposes of Group and Communicator Objects:
Allow you to organize tasks, based upon function, into task groups.
Enable Collective Communications operations across a subset of
related tasks.
Provide basis for implementing user defined virtual topologies
Provide for safe communications
Programming Considerations and Restrictions:
Groups/communicators are dynamic - they can be created and destroyed
during program execution.
Processes may be in more than one group/communicator. They will have
a unique rank within each group/communicator.
MPI provides over 40 routines related to groups, communicators,
and virtual topologies.
Typical usage:
Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group
Form new group as a subset of global group using MPI_Group_incl
Create new communicator for new group using MPI_Comm_create
Determine new rank in new communicator using MPI_Comm_rank
Conduct communications using any MPI message passing routine
When finished, free up new communicator and group (optional) using
MPI_Comm_free and MPI_Group_free
Group and Communicator Management Routines
Example: Create two different process groups for separate collective
communications exchanges. This requires creating new communicators as well;
a C sketch of the pattern is shown below.
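A minimal C sketch of this pattern, following the typical-usage steps listed above (the group split chosen here is only an assumption for illustration; the program must be run with exactly 8 tasks):

#include "mpi.h"
#include <stdio.h>

#define NPROCS 8

int main(int argc, char *argv[]) {
    int rank, new_rank, numtasks, sendbuf, recvbuf;
    int ranks1[4] = {0, 1, 2, 3}, ranks2[4] = {4, 5, 6, 7};
    MPI_Group orig_group, new_group;
    MPI_Comm new_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (numtasks != NPROCS) {
        if (rank == 0) printf("Must specify %d tasks. Terminating.\n", NPROCS);
        MPI_Finalize();
        return 0;
    }
    sendbuf = rank;

    /* extract the global group handle from MPI_COMM_WORLD */
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

    /* divide the tasks into two distinct groups based on rank */
    if (rank < NPROCS / 2)
        MPI_Group_incl(orig_group, NPROCS / 2, ranks1, &new_group);
    else
        MPI_Group_incl(orig_group, NPROCS / 2, ranks2, &new_group);

    /* create a new communicator for the new group, then use it collectively */
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
    MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);

    MPI_Group_rank(new_group, &new_rank);
    printf("rank= %d newrank= %d recvbuf= %d\n", rank, new_rank, recvbuf);

    /* free the communicator and group when done (optional but good practice) */
    MPI_Comm_free(&new_comm);
    MPI_Group_free(&new_group);

    MPI_Finalize();
    return 0;
}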
Intentionally, the MPI-1 specification did not address several
"difficult" issues. For reasons of expediency, these issues
were deferred to a second specification, called MPI-2 in 1998.
MPI-2 was a major revision to MPI-1 adding new functionality and
corrections.
Key areas of new functionality in MPI-2:
Dynamic Processes - extensions that remove the static process model
of MPI. Provides routines to create new processes after job startup.
One-Sided Communications - provides routines for one directional
communications. Include shared memory operations (put/get) and
remote accumulate operations.
Extended Collective Operations - allows for the
application of collective operations to inter-communicators
External Interfaces - defines routines that allow developers to
layer on top of MPI, such as for debuggers and profilers.
Additional Language Bindings - describes C++ bindings and discusses
Fortran-90 issues.
Parallel I/O - describes MPI support for parallel I/O.
MPI-3:
The MPI-3 standard was adopted in 2012, and contains significant extensions to
MPI-1 and MPI-2 functionality including:
Nonblocking Collective Operations - permits tasks in a collective to
perform operations without blocking, possibly offering performance improvements.
New One-sided Communication Operations - to better handle different
memory models.
Neighborhood Collectives - extends the distributed graph and Cartesian
process topologies with additional communication power.
Fortran 2008 Bindings - expanded from Fortran90 bindings
MPI_T Tool Interface - allows the MPI implementation to expose certain
internal variables, counters, and other state to the user (most likely
to performance tools).
Matched Probe - fixes an old bug in MPI-2 where one could not probe for
messages in a multi-threaded environment.