Many of the paths are symbolic links. The actual paths sometimes change for minor bug fixes and other maintenance. For MPICH and IBM's MPI, there are scripts that link in the necessary libraries and include directories. When accessing the library files and include directories explicitly, care must be taken that the -I and -L paths are consistent with the MPI used. Users are encouraged to use the standard MPICH or IBM scripts as they will automatically provide all required macro definitions, environment settings, include and library paths, and platform-specific libraries.
Two variations of the IBM MPI library are available, a threaded library and a signal library. (Note: The signal MPI library does not work on Power4 or Power5 systems.) The threaded library processes MPI calls in a separate, kernel-bound thread, while the signal library uses interrupts to ensure progress of MPI calls. The threaded library is thread-safe and is the default library used, whether or not the thread-safe compiler scripts (e.g., mpcc vs. mpcc_r) are used.
Note that the signal library yields slightly faster communication, but the compiled code is not thread-safe. Performance of the threaded library is comparable to that of the signal library if all MPI calls are performed in a single user thread and the environment variable MP_SINGLE_THREAD=yes.
In order to link with the signal library, you cannot use the thread-safe compiler scripts (e.g., mpcc_r) and you must set the environment variable LLNL_COMPILE_SINGLE_THREADED=TRUE.
Both libraries support two communication methods: User Space (US) and Internet Protocol (IP). US is an IBM OS bypass mechanism that provides user processes with fast access to communication hardware. Latencies with IP are about a factor of 5 higher than with US. Current configurations limit the number of US processes per node to the number of CPUs per node.
The IBM MPI libraries will use shared memory for on-node communication and the network interface for all off-node communication. The shared memory communication is enabled by setting the environment variable MP_SHARED_MEMORY=yes, which is the default setting in our current system configuration. Using the shared memory configuration provides slightly faster on-node communication at the cost of higher CPU overhead. Its impact on performance depends on the MPI calls that are used; generally, it will improve performance for blocking MPI calls, while codes that use nonblocking MPI calls can see performance degradation. Shared memory communication can be disabled by setting MP_SHARED_MEMORY=no.
The threaded MPI library can be used with C, C++, Fortran77, Fortran90, or Fortran95 codes.
Recall that in the LLNL default configuration, all the mp* compilation scripts are mapped to _r versions (e.g., mpcc and mpcc_r are equivalent).
The signal MPI library can be used with C, C++, Fortran77, Fortran90, or Fortran95 codes. Note: The signal MPI library does not work on Power4 or Power5 systems.
The following examples show that for both the compilation and load steps, the nonthreaded compile script must be used with the setting of the environment variable LLNL_COMPILE_SINGLE_THREADED=TRUE. In this case, the mp* scripts are not equivalent to the _r versions.
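A minimal sketch of the compile and load steps, assuming C source in a file named myprog.c (the file name is illustrative):

```shell
# Required so the non-thread-safe signal library is linked
setenv LLNL_COMPILE_SINGLE_THREADED TRUE

# Use the non-thread-safe script (mpcc, not mpcc_r) for both steps
mpcc -c myprog.c
mpcc -o myprog myprog.o
```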
The resulting executable may be run with poe, using environment variables or command-line arguments to set job parameters.
There are many environment variables that affect the performance tuning of the IBM MPI. The default user environment sets MP_EUILIB=us, MP_SHARED_MEMORY=yes, and a few other environment variables most often needed for MPI or other parallel programs. You may find additional settings, or overriding of the defaults, are necessary for optimal performance for some codes.
To execute your code with n nodes and p processes in the indicated pool with US communications, use the environment settings as shown in the following example:
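For example, under csh, a 4-node, 16-process US job in pool 0 might be set up as follows (the node, process, and pool values are illustrative):

```shell
setenv MP_NODES 4     # number of nodes
setenv MP_PROCS 16    # total number of MPI processes
setenv MP_RMPOOL 0    # pool to draw nodes from
setenv MP_EUILIB us   # User Space communication
poe ./a.out
```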
Alternatively, using poe command-line arguments:
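The same illustrative job parameters can be given directly on the poe command line:

```shell
poe ./a.out -nodes 4 -procs 16 -rmpool 0 -euilib us
```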
Note: TotalView needs to know that the job is actually a poe job, so starting TotalView on the executable directly, rather than on poe itself, does not work.
We support three versions of MPICH: a default version, a latest installed version, and an oldest supported version. All versions may be used with C, C++, Fortran77, or Fortran90 through MPICH compilation and run scripts that are accessed through symbolic links in /usr/local/bin, as described in the examples below. All versions are installed in /usr/local because it is assumed that users will commonly have /usr/local/bin included in their PATH environment variable.
The most stable recent version of MPICH is the default. The default version of MPICH is installed as /usr/local/mpi and is accessed through links in /usr/local/bin to the standard MPICH compilation and run scripts. The compilation scripts are mpicc, mpiCC, mpif77, and mpif90, and the run script is mpirun, which is used to execute programs created using the compilation scripts.
The latest version of MPICH is installed as /usr/local/new_mpi, and there are links in /usr/local/bin to its corresponding scripts. The compilation scripts are new_mpicc, new_mpiCC, new_mpif77, and new_mpif90, and new_mpirun is used to execute programs built with those scripts. When a new MPICH release becomes available, the previous latest release will become the default, and the new latest release will be installed as new_mpi.
The oldest version of MPICH is installed as /usr/local/old_mpi, and there are links in /usr/local/bin to its corresponding scripts. The compilation scripts are old_mpicc, old_mpiCC, old_mpif77, and old_mpif90, and old_mpirun is used to execute programs built with those scripts. When a new version of MPICH becomes the default version, the previous default becomes old_mpi.
Users should be aware that the installed versions of MPICH can vary across platforms or machines. In general, we try to keep the versions consistent, but there can be a lag in migrating a new version to all systems because of programmatic requests. Other than very short lags to update links across the full set of machines, these versions will be consistent across machines of the same platform, and, usually, across platforms.
On the Intel Linux Cluster we only support a default version, so the remarks here about old_ and new_ versions do not apply there.
We install the best MPICH abstract device interface (ADI) available for each platform. On the IBM systems, this is the MPL device, which is able to interface to poe and to make use of the SP switch in both US and IP mode.
Note: No thread-safe version of MPICH is available.
As stated above, /usr/local/bin contains soft links to the MPICH scripts for all the currently supported versions. The standard MPICH script names are linked to the default MPICH path. For example, /usr/local/bin/mpicc is a link to /usr/local/mpi/bin/mpicc. Other MPICH script names in /usr/local/bin are links to the additional MPICH versions that are supported, using the prefixes old_ and new_, so that these names are derived from the standard MPICH script names.
We use symbolic links so that different names can distinguish the different versions installed, because all MPICH versions provide the same script names, relative to their installation paths. For example, new_mpicc is a link to /usr/local/new_mpi/bin/mpicc, while mpicc is a link to /usr/local/mpi/bin/mpicc. Please note that the scripts for each version do differ, and cannot be used interchangeably; e.g., you cannot use mpirun to execute a program built with the old_mpicc script.
We have made site-specific modifications to the MPICH scripts in some cases. On the IBM SPs, the compilation scripts will automatically set the environment variable LLNL_COMPILE_SINGLE_THREADED=TRUE to prevent unintentional mixing of IBM's threaded MPL library with MPICH definitions.
The MPICH compilation scripts add configuration-specific macro definitions and automatically set the appropriate include directories and link in the appropriate libraries. Users are discouraged from accessing the MPI libraries and include files explicitly; they are subject to change with new versions of MPICH, and path names and MPI support libraries needed vary by platform. If explicit paths and libraries are required, consult the information in /usr/local/docs/MPI_Use_Summary on the platform you are using for more details on the paths and libraries needed.
Each MPICH compilation script is configured to use a specific C, C++, or Fortran compiler, typically the native compiler on the given platform. MPICH allows the user to change the compiler and linker/loader used by these scripts by defining appropriate environment variables, as described below. Note that you generally use the same command for both the compiler and the linker/loader, which requires setting a pair of MPICH environment variables (e.g., MPICH_CC=gcc and MPICH_CLINKER=gcc).
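For example, to make mpicc compile and link with gcc instead of the native compiler (variable names per the standard MPICH scripts):

```shell
setenv MPICH_CC gcc        # compiler used by mpicc
setenv MPICH_CLINKER gcc   # linker/loader used by mpicc
mpicc -o myprog myprog.c
```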
To determine what other definitions or paths are provided by the version of MPICH you are using, you may use the -compile_info or -link_info options to any of the MPICH compilation scripts such as mpicc to see the options used by these scripts. This can assist you in providing the MPICH-required options in your compile and link commands, if you are not using the MPICH scripts, to guarantee compatibility with MPICH.
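For example:

```shell
mpicc -compile_info   # print the full compile command mpicc would issue
mpicc -link_info      # print the full link command, including -L paths and libraries
```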
Executables built with the oldest or newest versions of MPICH should be run using the corresponding old_mpirun or new_mpirun, respectively, as there could be subtle differences in the runtime environments created by each.
Mellanox/OSU MVAPICH MPI is based on MPICH 1.2.7. Currently, there is no support for MPI one-sided communications.
MPI compiler wrapper scripts are available in /usr/local/bin/, which is in the default $PATH. These scripts mimic the familiar MPICH scripts in their functionality: they automatically include the appropriate MPI include files, link the necessary MPI libraries, and pass remaining switches to the underlying compiler.
Type [scriptname] -help for a list of command-line options. Scripts available are:
Note: See the Environment Variables page for environment variable settings used with MPI runs.
On IBM SP platforms, MPICH uses IBM's proprietary Message Passing Library (MPL), which supports both US and IP communication (see the IBM MPI section). Because no thread-safe version of IBM's MPL exists, MPICH cannot use the _r compilers. MPICH users must therefore set LLNL_COMPILE_SINGLE_THREADED=TRUE on IBM machines. Failure to set this environment variable can result in missing externals at load time or inappropriate mixing of MPICH and IBM's MPI definitions, which can generate illegal/bad communicator errors at run time. This environment setting is automatically set in the MPICH scripts, but it must be added explicitly if you are not using the scripts.
Note that the MPL MPICH mpirun provides SMP support. By default it will use n = ceiling(p/4) nodes, placing up to 4 tasks on each node, where p is the number of processes requested. The -nodes n option overrides this behavior, where n is the desired number of nodes. MPL MPICH will distribute the p tasks evenly across the n nodes, or complain if it cannot distribute them evenly. Note that mpirun also understands several IBM environment variables, such as MP_NODES and MP_TASKS_PER_NODE, to determine the number of nodes to use, but these must be consistent with the -np option on the mpirun command. If no -np option is specified, the default number of processes is 1.
The following examples demonstrate how to use the MPICH scripts on the IBM SPs. Most of the examples use the default script names, but the oldest or newest versions of MPICH supported are also available with the old_ or new_-prefixed names as indicated.
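For example, to compile and link with the default MPICH scripts (file names are illustrative):

```shell
mpicc  -c myprog.c         # C
mpiCC  -c myprog.C         # C++
mpif77 -c myprog.f         # Fortran77
mpif90 -c myprog.f90       # Fortran90
mpicc  -o myprog myprog.o  # link with the matching script
```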
Alternatively, to use the oldest or newest versions of MPICH:
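A sketch of the equivalent commands using the prefixed script names:

```shell
old_mpicc -o myprog myprog.c   # oldest supported MPICH
new_mpicc -o myprog myprog.c   # latest installed MPICH
```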
Alternatively, to use KAI as the C++ compiler on the IBM machines:
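A sketch, assuming the standard MPICH 1.x environment variable names for the C++ compiler and linker (MPICH_CCC and MPICH_CCLINKER); verify these against your MPICH version:

```shell
setenv MPICH_CCC KCC        # KAI C++ as the compiler
setenv MPICH_CCLINKER KCC   # KAI C++ as the linker/loader
mpiCC -o myprog myprog.C
```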
The resulting executable is run with mpirun, old_mpirun, or new_mpirun, as appropriate. Although MPICH-compiled executables can generally be run as serial jobs on most platforms, it is strongly recommended that MPL MPICH jobs be run only through the mpirun scripts.
To run with p processes or tasks on n nodes:
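For example (values illustrative):

```shell
mpirun -np 16 -nodes 4 ./myprog   # 16 tasks spread evenly across 4 nodes
mpirun -np 16 ./myprog            # default: ceiling(16/4) = 4 nodes, up to 4 tasks each
```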
Omit the -nodes option to run with the default of up to 4 tasks per node.
Using the -tv option to mpirun will start your executable under TotalView.
You need $OMPI/bin in your $PATH and $OMPI/lib in your $LD_LIBRARY_PATH. The ompi_info command lists various configuration settings, compilers used, available components, and more.
Compiler wrappers are provided. Users should never need to explicitly link against any OMPI libraries.
Fortran 77 Example
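A sketch of compiling and running a Fortran 77 program with the Open MPI wrappers, assuming the source is in a file named hello.f:

```shell
mpif77 -o hello hello.f
mpirun -np 4 ./hello
```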
By default, Open MPI will use the fastest networks available. On a Peloton system with InfiniBand, for example, shared memory will be used for communication between processes on the same node, and InfiniBand will be used for communication across nodes. A single-node MPI job will use shared memory by default. A multinode MPI job across nodes without InfiniBand will use TCP for communication. Networks can be explicitly selected via an MCA parameter, which will be discussed below.
Running Open MPI under SLURM
Within a batch script or interactive allocation, Open MPI automatically detects how many nodes are available and how many cores each node has. For example, in a two-node allocation on Atlas (8 cores per node):
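In such an allocation, mpirun can be invoked without an explicit process count; assuming the two-node, 8-core-per-node allocation described above, it would start 16 processes:

```shell
mpirun ./a.out   # one process per available core (16 in this allocation)
```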
Note that adjacent ranks are grouped onto the same node. The number of processes may be explicitly specified with the -np parameter. Also, ranks may be assigned in a round-robin fashion across available nodes using the -bynode parameter:
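For example:

```shell
mpirun -np 8 ./a.out            # explicit process count; adjacent ranks share a node
mpirun -np 16 -bynode ./a.out   # assign ranks round-robin across nodes
```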
Open MPI supports runtime configuration via MCA parameters. MCA parameters may be specified on the command line, in the shell environment, and/or in per-user and per-OMPI-installation configuration files. More information can be found in the OMPI FAQ:
As mentioned earlier, one way in which MCA parameters are useful is to select which network interconnects are used for communication. By default, the fastest network available is used; however, manually selecting a network (e.g., TCP) may be useful for debugging purposes.
Use shared memory and InfiniBand
Use only InfiniBand, no shared memory (even within one node)
Use only TCP
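The three selections above can be sketched with the btl MCA parameter (component names as in Open MPI 1.x; the self component is always included for loopback):

```shell
mpirun --mca btl sm,openib,self -np 16 ./a.out   # shared memory + InfiniBand
mpirun --mca btl openib,self -np 16 ./a.out      # InfiniBand only, even within a node
mpirun --mca btl tcp,self -np 16 ./a.out         # TCP only
```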
Several parameters are useful for running large-scale jobs. These include:
Instructs mpirun to spawn a separate thread for accepting management connections from spawned processes. Slightly improves startup times.
Increases the InfiniBand transmit timeout, which significantly reduces the occurrence of code 12 errors from the InfiniBand network.
Establishes TCP management connections and MPI-level communication connections between all MPI processes during initialization. Generally not needed, though it may help with some applications that communicate between every process in the MPI job.
An example mpirun command line:
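A sketch combining the settings above; the parameter names (oob_tcp_listen_mode, btl_openib_ib_timeout, mpi_preconnect_all) are assumptions based on Open MPI 1.2-era names and should be verified with ompi_info:

```shell
mpirun --mca oob_tcp_listen_mode listen_thread \
       --mca btl_openib_ib_timeout 20 \
       --mca mpi_preconnect_all 1 \
       -np 1024 ./a.out
```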
Some useful debugging parameters include:
Whether MPI_FINALIZE should show all MPI handles that were not freed.
MPI_FINALIZE will show up to N instances of memory allocated by MPI_ALLOC_MEM that was not freed by MPI_FREE_MEM.
Enable to prevent OMPI from actually freeing MPI objects when their handles are freed.
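A sketch using two of these settings; the parameter names (mpi_show_handle_leaks, mpi_no_free_handles) are assumptions based on Open MPI 1.x and can be confirmed with ompi_info --param mpi all:

```shell
mpirun --mca mpi_show_handle_leaks 1 \
       --mca mpi_no_free_handles 1 \
       -np 4 ./a.out
```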
Last modified October 21, 2009