Open Issues, Gotchas, and Recent Changes

MPI Task Busy-Waits | Large Pages | New Environment Variables



MPI task busy-waits on pending messages, affecting some multithreaded applications.

Affected Machines

AIX machines (uP, Purple)

Affected Applications

MPI programs that run more threads than there are cores on a node (an oversubscribed threads-to-cores ratio).


To minimize message latencies, the default behavior of MPI on AIX platforms is for tasks to poll, or spin wait, when blocked for a message. This approach works well when there is just one task (or thread) per CPU, but if an application has been designed to make use of otherwise "idle time" during message waiting by running additional threads, spin waits will destroy performance.

IBM's MPI provides the environment variable MP_WAIT_MODE to control what happens when a task is blocked waiting for a message. The default setting is poll, which means that the MPI task will spin wait for the message. To avoid polling, one can instead set this variable to nopoll. With this setting, the MPI task will poll for a short time (less than 1 millisecond) and then sleep until an interrupt indicates the message has arrived, or until a periodic timer goes off indicating it should check for the message. This is the setting to use when additional threads are being used to accomplish work while a task is blocked.

When using MP_WAIT_MODE=nopoll, the setting MP_CSS_INTERRUPT=yes should also be used so that arriving messages will generate an interrupt to wake up the MPI task. Otherwise, the MPI task will only check for messages periodically, as defined by the value of MP_POLLING_INTERVAL, whose default is 400,000 microseconds.
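For illustration, here is a minimal sketch (not taken from the original documentation or any particular application, and requiring at least two tasks) of the usage pattern described above: the main thread blocks in MPI_Recv while a helper pthread keeps computing. The routine do_background_work() and its workload are placeholders. Run with MP_WAIT_MODE=nopoll and MP_CSS_INTERRUPT=yes so the blocked receive sleeps instead of spin-waiting and competing with the helper thread for the CPU.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

/* Placeholder background work that should not be starved by a
   spin-waiting MPI task. */
static void *do_background_work(void *arg)
{
    double *sum = (double *)arg;
    for (long i = 1; i <= 100000000L; i++)
        *sum += 1.0 / (double)i;
    return NULL;
}

int main(int argc, char **argv)
{
    int rank, provided;
    double partial = 0.0, msg = 0.0;
    pthread_t worker;

    /* The helper thread makes no MPI calls, so FUNNELED support is enough. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_create(&worker, NULL, do_background_work, &partial);

    if (rank == 0) {
        msg = 42.0;
        MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* With the default MP_WAIT_MODE=poll, this blocked receive
           spin-waits and competes with the worker thread for the CPU;
           with MP_WAIT_MODE=nopoll and MP_CSS_INTERRUPT=yes it sleeps
           until the message arrives. */
        MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    pthread_join(worker, NULL);
    printf("rank %d: msg = %g, partial sum = %g\n", rank, msg, partial);
    MPI_Finalize();
    return 0;
}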

When compiling in 64-bit mode (-q64, or OBJECT_MODE=64), however, there is a defect in the nopoll implementation in current versions of IBM's MPI that causes waits at barriers to be implemented with polling, even though nopoll has been set. While we await a design change to fix this defect in the next release of MPI, there are two workarounds noted below.

In addition to poll and nopoll, MP_WAIT_MODE can also be set to sleep or yield. These settings, however, just modify the polling behavior slightly and so are not alternate options when nopoll behavior is desired. If you need to avoid spin waits, nopoll is the setting to use.


(a) Compile in 32-bit mode instead of 64-bit mode: either set the environment variable OBJECT_MODE=32, or add the -q32 option to compile and link commands and -X32 to ar and ranlib commands.

(b) Replace MPI_Barrier calls with calls to MPI_Allreduce, which is not affected by the nopoll defect (see the sketch below).
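Below is a minimal sketch of workaround (b). The helper name barrier_via_allreduce() is ours, and reducing a dummy integer is just one way to express the barrier; this is not code taken from IBM's MPI or from this document.

#include <mpi.h>

/* An allreduce of a dummy value provides the same completion guarantee
   as MPI_Barrier (no task leaves until every task has contributed),
   and MPI_Allreduce is not affected by the 64-bit nopoll defect. */
static void barrier_via_allreduce(MPI_Comm comm)
{
    int in = 0, out;
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, comm);
}

/* Usage: replace MPI_Barrier(MPI_COMM_WORLD) with
   barrier_via_allreduce(MPI_COMM_WORLD). */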

Large Pages

To use large pages under AIX, a bit flag within the XCOFF and XCOFF64 executable file headers must be set to record the large page data attribute of a program. The ldedit command provides the ability to set and unset the large page flag of an executable file without the need for source code changes, recompiling, or relinking:

% /usr/ccs/bin/ldedit -blpdata a.out

Alternatively, the LDR_CNTRL environment variable can be used to override the bit flag to either enable or disable the use of large pages:

% setenv LDR_CNTRL LARGE_PAGE_DATA=[Y|N|M]

where Y means the application uses large pages if available (small pages will be used if there are insufficient large pages), N means do not use large pages, and M means large page use is mandatory: a failure will occur when the application tries to allocate a large page and none are available.

Large pages are pinned in memory (they cannot be paged out or stolen) for the entire time an application executes. The large page memory pool is configured at boot time and is a limited system resource. A system administrator has the ability to control usage of the large page memory pool by user ID (for example, with the chuser and mkuser commands). This prevents unprivileged users from causing privileged large-page data applications to fail by exhausting the supply of large pages.

Large pages are a special-purpose performance improvement feature and are not recommended for general use. Large page usage provides performance value to a select set of applications: primarily long-running, memory-access-intensive applications that use large amounts of virtual memory. For example, uniprocessor performance for memory-bound kernels such as DAXPY increases significantly when using large pages, primarily because data prefetching of long vectors is more efficient with large pages.
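For reference, the kind of kernel referred to above looks like the following; this is a generic DAXPY sketch, not code from this document or from any benchmark suite.

/* DAXPY: y = a*x + y over long vectors.  The loop streams through two
   large arrays with unit stride, so performance is dominated by memory
   bandwidth and hardware prefetching -- the access pattern that the
   text says benefits from large pages. */
void daxpy(long n, double a, const double *x, double *y)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}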

Not all applications benefit from using large pages; some can be severely degraded by them. Applications that perform a large number of fork()s (such as shell scripts) are especially prone to performance degradation when large pages are used. Tests have shown a tenfold increase in shell script execution time when the LDR_CNTRL environment variable specifies large page usage. Consider marking specific executables to use large pages rather than using the LDR_CNTRL environment variable; this limits large page usage to the specific applications that benefit from it.

Consider the overall performance effect large pages may have on your system. While some specific applications may benefit from large page use, overall system performance may be degraded because the amount of 4 KB page storage available in the system is reduced. Consider using large pages only when your system has sufficient physical memory that reducing the number of 4 KB pages does not significantly impact overall system performance.

New Environment Variables

For 64-bit jobs, use the shared memory collective communication optimization.
Guarantee that packet acknowledgments are sent back after this many packets.
Takes the values yes and no. Yes means that the protocol will not terminate the job when it does not receive a packet acknowledgment within the timeout period.
Size of the retransmit buffers. Must be greater than or equal to 0; the default is 16K.
Number of retransmit buffers. Must be greater than or equal to 0; the default is 128. When a message is sent, it is completed from the send side when the last acknowledgment for the data packets arrives. However, at the expense of a copy, a send message may be completed immediately by copying the client send buffer into a temporary retransmit buffer. This temporary buffer is freed when the packet acknowledgments have all arrived. This helps performance when there is a large number of calls to MPI_Send for small messages. The two environment variables above control the size and the number of these retransmit buffers. While no restriction is placed on the count and size of these buffers, MPI initialization may fail with ENOMEM if too much memory is requested.
Message size at which to switch over to bulk (RDMA) transfer mode.

We appreciate any feedback, good or bad, that you wish to provide. Please report any feedback to


Last modified October 21, 2009