Linux+Elan Comprehensive List of Environment Variables

On the Quadrics systems, the MPI library (MPICH) is layered on top of the Elan communication library, known as libelan. Libelan provides optimized message-passing functions, one-sided communication, and collective operations that take advantage of the sophisticated hardware available in the QsNet network. Environment variables throughout this software stack can be set to tune or change the behavior of MPI applications. The most commonly used variables are described below, grouped into four classes by functionality. For more information about the Elan library, including additional environment variables not discussed here, please refer to the Quadrics Linux Cluster Elan Programming Manual.

  • Work-Arounds—Useful to try when an application hangs or crashes for unknown reasons.
  • Performance—Useful to tune communication performance of an application.
  • Memory—Useful for programs that require large amounts of message passing memory.
  • Debugging—Turns on debugging output in the MPI and Elan libraries.

Work-Arounds

Some applications (or particular problem inputs) may hang or crash when using the default optimization or protocol settings. The following environment variables can be used to enable or disable optimizations or to adjust protocol selection criteria in the communication layers. Setting these variables may bypass problems preventing your application from running. If one of these settings does get your application running, the underlying problem may lie in the system communication libraries, so please notify the LC Hotline. These variables should typically be used only as temporary work-arounds because they can disable valuable optimizations.

Work-Arounds Environment Variables

MPI_USE_LIBELAN
  Description: Flag indicating whether MPI should use Elan library optimizations in place of default MPICH functions, especially relevant to collectives. Note: This can also be used as a bitmask to select individual collectives and operations (e.g., if just all-to-all causes problems). Give --elan-help as an option to an MPI program for the bit definitions.
  Range: TRUE | FALSE | 1 | 0
  Default Elan3: 1
  Default Elan4: 1

LIBELAN_SHM_ENABLE
  Description: Flag indicating whether the Elan library can use shared memory to exchange messages and perform collectives.
  Range: TRUE | FALSE | 1 | 0
  Default Elan3: 1
  Default Elan4: 1

LIBELAN_TPORT_SHM_ENABLE
  Description: Flag indicating whether the Elan library can use shared memory to exchange messages.
  Range: TRUE | FALSE | 1 | 0
  Default Elan3: 1
  Default Elan4: 1

LIBELAN_GROUP_SHM_ENABLE
  Description: Flag indicating whether the Elan library can use shared memory to perform collectives.
  Range: TRUE | FALSE | 1 | 0
  Default Elan3: 1
  Default Elan4: 1

LIBELAN_SHM_BIGMSG
  Description: Message size below which data is eagerly buffered at the receiver when sending through shared memory; otherwise, data is sent synchronously.
  Range: [0, 4G]
  Default Elan3: 64k
  Default Elan4: 64k

My MPI program is hanging. What can I try?

MPI programs may hang when exchanging messages between MPI processes on the same node. These messages are exchanged through shared memory by default, which uses fixed-size FIFOs that can become blocked, leading to a deadlock. Many codes may be able to work around this issue by either disabling shared memory:

setenv LIBELAN_TPORT_SHM_ENABLE 0

or by increasing buffering at the receiver:

setenv LIBELAN_SHM_BIGMSG 2G

Performance

The behavior of the communication libraries, including protocol selection and method of waiting for a message, can be changed using several environment variables. Tuning these variables to values appropriate for your application may significantly improve your communication performance.

Performance Environment Variables

LIBELAN_TPORT_BIGMSG
  Description: Message size below which data is eagerly buffered at the receiver when sending through the network; otherwise, data is sent synchronously.
  Range: [0, 4G]
  Default Elan3: 64k
  Default Elan4: 32k

LIBELAN_TPORT_SMALLMSG
  Description: Message size below which data is inlined with message envelope.
  Range: Elan3 [0, 288]; Elan4 [0, 1920]
  Default Elan3: 288
  Default Elan4: 128

LIBELAN_SHM_BIGMSG
  Description: Message size below which data is eagerly buffered at the receiver when sending through shared memory; otherwise, data is sent synchronously.
  Range: [0, 4G]
  Default Elan3: 64k
  Default Elan4: 64k

LIBELAN_TOUCHBUF
  Description: Flag indicating whether Elan libraries will touch buffers allocated with mmap to ensure MMU tables are set on Elan NIC.
  Range: TRUE | FALSE | 1 | 0
  Default Elan3: N/A
  Default Elan4: TRUE

LIBELAN4_TOUCH_MALLOC
  Description: Flag indicating whether Elan libraries should memset buffers allocated with malloc to ensure MMU tables are set on Elan NIC.
  Range: TRUE | FALSE | 1 | 0
  Default Elan3: N/A
  Default Elan4: TRUE

LIBELAN4_MMAP_MALLOC
  Description: Flag indicating whether mmap() based malloc() should be enabled. Enabling this is equivalent to setting MALLOC_MMAP_MAX = 0 and MALLOC_TRIM_THRESHOLD = -1.
  Range: TRUE | FALSE | 1 | 0
  Default Elan3: N/A
  Default Elan4: FALSE on IA64/x86-64; TRUE otherwise

LIBELAN_WAITTYPE
  Description: Determines how processes wait for messages: polling, blocking, or poll-blocking; a number indicates the number of microseconds to poll before blocking.
  Range: POLL | WAIT | [0, 32K]
  Default Elan3: POLL
  Default Elan4: POLL

Quadrics has two basic data transfer protocols—asynchronous and synchronous. Asynchronous messages are eagerly sent to the receiver and are buffered in system memory when the receiver is not ready for them. Synchronous messages remain at the sender until the receiver posts a matching receive. Asynchronous transfer can reduce message latency by as much as 50%, but it becomes quite costly if large amounts of data must be buffered, as when transferring large messages or many small ones. In addition, for network transfers, Quadrics provides a third protocol that inlines very small messages inside the message envelope: a logically separate message that contains header information. Unless the application declares a message to be synchronous, one of these three protocols is chosen depending on the size of the message. Environment variables can be set to tune the byte boundaries used to select the protocol. In one case, an application cut communication costs in half by halving the default value of LIBELAN_TPORT_BIGMSG to force more messages to be sent synchronously.
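The size-based selection described above can be sketched in a few lines of Python. This is only an illustration, not the libelan implementation: the function name select_protocol is hypothetical, the default thresholds are the Elan4 defaults of LIBELAN_TPORT_SMALLMSG (128) and LIBELAN_TPORT_BIGMSG (32k, assumed to mean 32 * 1024 bytes), and both thresholds are assumed to be strict upper bounds ("below which").

```python
def select_protocol(size, smallmsg=128, bigmsg=32 * 1024):
    """Return the protocol a network message of `size` bytes would use.

    Assumptions (not confirmed by the Elan documentation):
    - smallmsg models LIBELAN_TPORT_SMALLMSG, bigmsg models LIBELAN_TPORT_BIGMSG
    - a message qualifies only when its size is strictly below the threshold
    """
    if size < smallmsg:
        return "inline"        # data travels inside the message envelope
    if size < bigmsg:
        return "asynchronous"  # eagerly sent and buffered at the receiver
    return "synchronous"       # stays at the sender until a receive is posted


for size in (64, 1024, 32 * 1024, 1 << 20):
    print(size, select_protocol(size))

# Halving the big-message threshold pushes mid-sized messages to synchronous.
print(20000, select_protocol(20000, bigmsg=16 * 1024))
```

Under this model, setting both thresholds to 0 makes every message size fall through to the synchronous case.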

The Quadrics network interface card has its own MMU for performing virtual-to-physical address translation to avoid consulting the kernel. The MMU tables must be established to read data from or copy data to application message buffers. Data copies are much more efficient when the mapping exists before the copy is attempted. If not, the MMU may thrash with faults while looking up translations on the fly. On Elan3 systems, this mapping will occur when the buffers are initialized, as with a memset() or an equivalent operation. On Elan4 systems, if LIBELAN_TOUCHBUF is enabled, the Elan libraries will intercept calls to mmap() in order to establish mapping tables at the point when memory is allocated. If mmap() is used infrequently, such pre-mapping is most efficient. However, if mmap() is called frequently, and especially if most of the memory is not used as message buffers, LIBELAN_TOUCHBUF should be disabled.

Processes may wait for message arrival in several ways: polling, blocking, or poll-blocking. LIBELAN_WAITTYPE selects the method of waiting for all blocking MPI calls. In polling, a process spins on a CPU in a while loop checking for message arrival. In blocking, a process sleeps until the message arrives and the OS wakes it up. Poll-blocking is a hybrid in which the process polls for a limited time before it blocks. Polling provides better latency; however, it also consumes CPU resources. Even on nodes completely dedicated to the application, there are background system threads that must run occasionally. Conceding the CPU while waiting for a message allows these system threads to do their work during the dead time, freeing up cycles for the application in the future. Applications that exchange many large messages, where latency is not so important, will probably benefit from blocking. It is difficult to generalize which is best for applications that send many small messages.
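The poll-blocking idea can be illustrated with a small Python sketch. It is a toy model using a threading.Event in place of real message arrival; the function name wait_for_message and its return values are hypothetical and do not correspond to any Elan or MPI API. The poll_us argument plays the role of the numeric LIBELAN_WAITTYPE value (microseconds to poll before blocking).

```python
import threading
import time


def wait_for_message(arrived, poll_us):
    """Spin on `arrived` for up to `poll_us` microseconds, then sleep.

    Returns "polled" if the event was seen while spinning, or "blocked"
    if the caller had to sleep until the event was set.
    """
    deadline = time.monotonic() + poll_us / 1_000_000
    while True:
        if arrived.is_set():             # polling: burn CPU, lowest latency
            return "polled"
        if time.monotonic() >= deadline:
            break
    arrived.wait()                       # blocking: yield the CPU until woken
    return "blocked"


# Message already present: seen during the polling phase.
early = threading.Event()
early.set()
print(wait_for_message(early, poll_us=1000))

# Message arrives after the polling window: the waiter ends up blocking.
late = threading.Event()
threading.Timer(0.05, late.set).start()
print(wait_for_message(late, poll_us=1000))
```

A very large poll_us approximates pure polling, and a poll_us of 0 approximates pure blocking after a single check.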

How can I send all messages synchronously?

setenv LIBELAN_TPORT_BIGMSG 0
setenv LIBELAN_TPORT_SMALLMSG 0
setenv LIBELAN_SHM_BIGMSG 0

Memory

The Quadrics Elan network interface card has 64 MB of local SDRAM memory. MPI uses this Elan memory, along with main memory, to store its internal data structures and to buffer unexpected messages. To ensure efficient MMU address translations for these regions, the MPI library calls upon allocators implemented in the Elan library, which manage segments of Elan and main memory pre-allocated and translated during initialization. Additionally, there are global allocators for both Elan and main memory. These specialized allocators distribute segments of memory with the same virtual address on all processes, which is important for collective operations. Environment variables can be used to set the size and base address (in Elan virtual address space) of the heap assigned to each allocator.

Memory Environment Variables

LIBELAN_ALLOC_SIZE
  Description: Defines the size of the main memory heap in bytes.
  Range: [0, 4G]
  Default Elan3: 200M
  Default Elan4: 128M

LIBELAN_ALLOC_BASE
  Description: Defines the Elan virtual address of the base of the main memory heap.
  Range: [0, 4G]
  Default Elan3: 0xe4100000
  Default Elan4: N/A

LIBELAN_ALLOCELAN_SIZE
  Description: Defines the size of the Elan memory heap in bytes.
  Range: [0, 4G]
  Default Elan3: 64M
  Default Elan4: 64M

LIBELAN_ALLOCELAN_BASE
  Description: Defines the Elan virtual address of the base of the Elan memory heap.
  Range: [0, 4G]
  Default Elan3: 0xe0100000
  Default Elan4: N/A

LIBELAN_GALLOC_SIZE
  Description: Defines the size of the global main memory heap in bytes.
  Range: [0, 4G]*
  Default Elan3: 64M
  Default Elan4: 64M

LIBELAN_GALLOCELAN_SIZE
  Description: Defines the size of the global Elan memory heap in bytes.
  Range: [0, <64M]*
  Default Elan3: 4M
  Default Elan4: 4M

*  On Elan3 systems, the global allocator space is taken from the corresponding main and Elan allocator spaces. In other words, LIBELAN_GALLOC_SIZE can be no larger than LIBELAN_ALLOC_SIZE, and LIBELAN_GALLOCELAN_SIZE can be no larger than LIBELAN_ALLOCELAN_SIZE. On Elan4 systems, the global allocators are independent of the main and Elan allocators.
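The Elan3 constraint in the footnote can be checked before launching a job. The following Python sketch is a hypothetical helper, not part of any Quadrics tool: parse_size and elan3_violations are illustrative names, the defaults are the Elan3 values listed above, and the k/M/G suffixes are assumed to be binary multiples (1024, 1024^2, 1024^3).

```python
UNITS = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}  # assumed binary multiples

# Elan3 defaults taken from the Memory variables listed above.
ELAN3_DEFAULTS = {
    "LIBELAN_ALLOC_SIZE": "200M",
    "LIBELAN_ALLOCELAN_SIZE": "64M",
    "LIBELAN_GALLOC_SIZE": "64M",
    "LIBELAN_GALLOCELAN_SIZE": "4M",
}


def parse_size(text):
    """Convert a size such as '64M', '959m', or '4096' to a byte count."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text, 0)  # plain decimal or 0x-prefixed hexadecimal


def elan3_violations(settings):
    """Return the Elan3 footnote constraints that `settings` would violate."""
    merged = {**ELAN3_DEFAULTS, **settings}
    size = {name: parse_size(value) for name, value in merged.items()}
    problems = []
    if size["LIBELAN_GALLOC_SIZE"] > size["LIBELAN_ALLOC_SIZE"]:
        problems.append("LIBELAN_GALLOC_SIZE exceeds LIBELAN_ALLOC_SIZE")
    if size["LIBELAN_GALLOCELAN_SIZE"] > size["LIBELAN_ALLOCELAN_SIZE"]:
        problems.append("LIBELAN_GALLOCELAN_SIZE exceeds LIBELAN_ALLOCELAN_SIZE")
    return problems


print(elan3_violations({}))                               # defaults
print(elan3_violations({"LIBELAN_GALLOC_SIZE": "256M"}))  # too large for 200M
```

The check applies only to Elan3; on Elan4 the global allocators are independent, so no such comparison is needed.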

I'm getting "Main memory exhausted" or "Exhausted Elan allocator space" errors. How do I increase memory?

Applications that require a lot of unexpected message buffering may need to increase the size of the Elan memory heaps. If this is the case, you may see one of the following error messages:

tportBuf: Main memory exhausted ...
elan3_alloc: Exhausted Elan allocator space

The following provides the maximum general setting for Elan3 systems:

setenv LIBELAN_ALLOC_BASE 0xc4000000
setenv LIBELAN_ALLOCELAN_BASE 0xc0000000
setenv LIBELAN_ALLOC_SIZE 959m

If these settings do not work, you may be able to increase things further, but this will depend on your application. In this case, you should contact the LC Hotline for support.
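One way to read the Elan3 maximum settings above is as an address-space layout. The short Python calculation below is an interpretation, not a statement from the Quadrics documentation: it assumes that "959m" means 959 MiB and that the Elan virtual address space is 32 bits wide, as the [0, 4G] ranges suggest.

```python
MIB = 1 << 20

elan_base = 0xC0000000   # LIBELAN_ALLOCELAN_BASE
main_base = 0xC4000000   # LIBELAN_ALLOC_BASE
main_size = 959 * MIB    # LIBELAN_ALLOC_SIZE 959m (assuming "m" means MiB)

gap = main_base - elan_base   # room left for the Elan heap below the main heap
main_end = main_base + main_size

print(f"room below main heap: {gap // MIB} MiB")
print(f"main heap ends at:    {main_end:#x}")
print(f"headroom under 4G:    {((1 << 32) - main_end) // MIB} MiB")
```

Under these assumptions, the gap between the two base addresses equals the default 64M Elan heap, and the main heap ends just short of the 4G limit, which would explain why there is little room to grow beyond these values.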

Debugging

The MPI and Elan libraries have internal debug logging, which is disabled by default. When debugging very difficult problems arising from the communication libraries, it may be useful to turn this on. This will only be necessary in severe problem cases, in which LC staff members working to assist you may ask you to enable this logging.

Debugging Environment Variables

LIBELAN_DEBUGFILE
  Description (Elan3): Enables debug logging in Elan libraries. Stored in the form ${RANK_ID}.dbg.
  Description (Elan4): Enables debug logging in Elan libraries and also sets the directory and filename prefix to write per-process debug files. Stored in the form ${LIBELAN_DEBUGFILE}${RANK_ID}.dbg.
  Range: Elan3 TRUE | FALSE; Elan4 <String>
  Default Elan3: NULL
  Default Elan4: NULL

LIBELAN_DEBUGSIG
  Description: Determines signal that causes process to dump Quadrics debug info.
  Range: [USR1, USR2]
  Default Elan3: NULL
  Default Elan4: NULL

LIBELAN_DEBUGFLAGS
  Description: Bitmask that determines which parts of the Elan libraries will include debug logging. Refer to /usr/include/elan/misc.h for the bit definitions.
  Range: <Bitmask>
  Default Elan3: 0
  Default Elan4: 0

How do I enable debug logging?

Use the following settings to fully enable debug logging:

setenv LIBELAN_DEBUGFILE /tmp/${LOGNAME}.${SLURM_JOBID}.
setenv LIBELAN_DEBUGSIG USR1

Use the debug version of the libraries by pointing the dynamic loader at them:

setenv LD_LIBRARY_PATH /usr/lib/qsnet/default/lib/dbg

Each process in your application will log data to the file named in LIBELAN_DEBUGFILE. For large jobs, these files should be written to local disk in /tmp or to a parallel file system so as not to overwhelm NFS. The above settings also enable one to force a job to dump debug files at a given instant by sending it the signal named in LIBELAN_DEBUGSIG. In this case, the following command will flush debug files:

scancel --signal=SIGUSR1 <LCRM_JOBID>
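To see which files to expect on Elan4, the naming form ${LIBELAN_DEBUGFILE}${RANK_ID}.dbg can be expanded by hand. The Python sketch below is illustrative only: the user name and job ID in the prefix are hypothetical, and rendering the rank as an unpadded decimal number is an assumption.

```python
def debug_file(prefix, rank):
    """Build a per-process log name: ${LIBELAN_DEBUGFILE}${RANK_ID}.dbg."""
    return f"{prefix}{rank}.dbg"


# Hypothetical expansion of /tmp/${LOGNAME}.${SLURM_JOBID}. for one job.
prefix = "/tmp/alice.12345."
for rank in range(4):
    print(debug_file(prefix, rank))
```

Because the rank is appended directly to the prefix, the trailing period in the LIBELAN_DEBUGFILE setting is what separates the job ID from the rank in the resulting file names.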

If the application still fails, then attempt to further isolate the problem by selecting bitmask values for more output. Setting LIBELAN_DEBUGFLAGS = -1 outputs everything, but this can generate very large log files. Setting LIBELAN_DEBUGFLAGS = 0x10000000 enables CRC checksums for each TPORT message to verify data integrity within the libelan libraries, at the cost of reduced bandwidth.
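The two LIBELAN_DEBUGFLAGS values mentioned above can be compared as bitmasks. This Python sketch assumes a 32-bit mask, which the source does not state; the constant name CRC_CHECK is hypothetical, and the remaining bit definitions live in /usr/include/elan/misc.h and are not reproduced here.

```python
CRC_CHECK = 0x10000000  # value given in the text for TPORT CRC checksums


def as_unsigned32(value):
    """Interpret `value` as a 32-bit mask (assumed width)."""
    return value & 0xFFFFFFFF


everything = as_unsigned32(-1)
print(f"{everything:#010x}")          # all bits set
print(bool(everything & CRC_CHECK))   # does -1 also turn on CRC checking?
print(CRC_CHECK.bit_length() - 1)     # position of the CRC bit
```

Read this way, a setting of -1 also enables the CRC checksums, so the bandwidth cost mentioned above would apply in addition to the larger log files.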

© Quadrics Ltd. 1998-2003, Used with permission. Quadrics documentation available at http://www.quadrics.com/documentation


High Performance Computing at LLNL    Lawrence Livermore National Laboratory

Last modified September 8, 2006
UCRL-MI-126792