News and Events

Speaker: Adam Moody, Computation/ICCD
Date: Monday, May 23, 2005, 10:00–11:00 a.m.
Location: B451 R1025 (White Room)
Subject: Programming for Optimal MPI Performance on LC's Linux/Quadrics Clusters
Contact: Terry Jones (423-9834)

Techniques that applications may employ to achieve optimal MPI performance on LC's Linux clusters using the Quadrics interconnect will be discussed. An overview of the system network architecture and the Quadrics interconnect will be presented, leading to a discussion of programming for the best performance in both point-to-point and collective communication. The effectiveness of these programming methods is illustrated through case examples of real applications. In several instances, communication operations completed faster by one or two orders of magnitude, leading to overall application speedups of two to three times. Finally, some advanced techniques will be covered that allow users to fine-tune the MPI run-time environment for their application.

Beta Test of New Quadrics MPI Library for Thunder Users

The beta test begins April 20, 2005. The duration of the beta test will depend on the types of problems encountered, but it is expected to last until around mid-May.

Background and Expected Benefits

Quadrics has distributed beta versions of new communication libraries that are expected to provide significant benefits for MPI programs on Thunder. Many collectives have been optimized for small messages: Allreduce and Reduce (~3 or fewer elements), Broadcast (~1K or less), and particularly Alltoall, Allgather, and Gather (~1K or less).

The overall benefit to your application will depend on the percentage of run time currently consumed by the improved collectives. For applications that use the improved collectives frequently for small messages, performance improvements of 25% or better are possible.
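
As a rough gauge of whether your application falls in this regime, you can time one of the optimized collectives at the message sizes your code actually uses. The following is a minimal, hypothetical sketch (not part of the official beta test materials); it times a 3-element MPI_Allreduce and can be built with the usual MPI compiler wrapper (e.g., mpicc).

#include <mpi.h>
#include <stdio.h>

/* Hypothetical timing sketch: measure the average cost of a 3-element
   MPI_Allreduce, the small-message regime targeted by the beta library.
   Run it once against the default library and once with the beta path
   in LD_LIBRARY_PATH, using the same task count. */
int main(int argc, char **argv)
{
    const int iters = 1000;
    double in[3] = {1.0, 2.0, 3.0}, out[3];
    double t0, t1;
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* line the tasks up before timing */
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Allreduce(in, out, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average 3-element Allreduce time: %g seconds\n", (t1 - t0) / iters);

    MPI_Finalize();
    return 0;
}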

Who Can Participate

All Thunder users with MPI applications may try out the beta software. We encourage anyone interested to give these new libraries a try.

How to Participate

We ask that all participants follow two guidelines:

  1. Please notify us of your experience. We are interested in scenarios where the library helped as well as scenarios in which it did not. We would like to obtain data directly comparing these new libraries to the defaults currently installed on the machine. For an apples-to-apples comparison, it is best to collect timing data for identical problem configurations—same input and same number of tasks but using different communication libraries.
  2. Because these are beta libraries, testers should be on the lookout for any bugs leading to hangs, crashes, or incorrect answers. We want to know about all problems you encounter. Of special note are the new reduction implementations. One optimization allows out-of-order processing of intermediate messages; this significantly improves latency but may lead to rounding differences. If your output differs significantly when using the new libraries compared to the same problem with the default libraries, please let us know (one way to check reduction results is sketched after this list).
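
The following hypothetical sketch (not part of the official beta test materials) illustrates one way to check for such rounding differences: it performs a floating-point Allreduce and prints the result with full precision, so the output from a run against the default library can be diffed against a run with the beta path in LD_LIBRARY_PATH.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical rounding-check sketch: print a reduction result with full
   precision so results from the default and beta libraries can be compared.
   Out-of-order combining may change the last bits of a floating-point sum. */
int main(int argc, char **argv)
{
    int rank, ntasks;
    double local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    local = 1.0 / (rank + 1);   /* arbitrary per-task value */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d tasks = %.17g\n", ntasks, sum);   /* %.17g shows every bit of a double */

    MPI_Finalize();
    return 0;
}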

Follow the directions below to run against the new library. You may run against the new libraries without rebuilding or relinking your program binary. To access these new libraries, just include the following path in your LD_LIBRARY_PATH environment variable:

/usr/local/tools/qsnet/beta

To be sure that your program will load the correct version, check that the beta path appears in the ldd output:

ldd <program_binary>

For example, in the tcsh shell, you might type:

setenv LD_LIBRARY_PATH /usr/local/tools/qsnet/beta
ldd myprogram

For sh:

export LD_LIBRARY_PATH=/usr/local/tools/qsnet/beta
ldd myprogram

If LD_LIBRARY_PATH includes other paths required for your program to run, you should prepend the beta path, e.g.,

setenv LD_LIBRARY_PATH /usr/local/tools/qsnet/beta:$LD_LIBRARY_PATH

Questions and Assistance

If you have any questions on this MPI Beta Test, please contact us (see Contacts list).

  • Questions on Quadrics/LLNL Beta Test Program: Terry Jones (MPI Lead), 423-9834.
  • Requests for MPI Assistance: Adam Moody (Linux MPI Support), 422-9006.
  • Requests for MPI Assistance (additional contact): Sheila Faulkner (Hotline, LC Trouble Reporting), 423-8471.

Speaker: D.K. Panda, Ohio State University
Date: Monday, December 6, 2004, 10:00-11:00 a.m.
Location: B451 R1025 (White Room)
Subject: Designing Next Generation High-Performance Clusters and Datacenters with InfiniBand
Contact: Terry Jones (423-9834)

The emerging InfiniBand Architecture (IBA) is generating a lot of excitement as an open interconnect standard for building next generation high-end systems in a radically different manner. This presentation will focus on research challenges and state-of-the-art solutions for designing HPC clusters and multi-tier datacenters with IBA.

For designing HPC clusters with IBA, the focus will be on issues related to designing scalable and high-performance implementations of the Message Passing Interface (MPI) standard (both MPI-1 and MPI-2). Issues, challenges, and solutions for designing efficient support for point-to-point communication, collective communication (broadcast, barrier, all-to-all, etc.), flow control, data types, and synchronization on clusters with different processors, PCI interfaces (PCI-X and PCI-Express), and networks (single-rail and multi-rail) will be presented. The presentation will be based on experiences designing MVAPICH (MPI-1 over VAPI) and MVAPICH2 (MPI-2 over VAPI), which are being used in many IBA clusters.

For datacenters, highlights will include issues, challenges, and solutions related to designing high-performance and scalable multi-tier datacenters with IBA. Performance benefits of the Sockets Direct Protocol (SDP) stack compared to the IP over IB (IPoIB) stack will be presented for various workloads. Impact of latency, throughput, and CPU utilization of the protocol stacks on the overall performance of the datacenter will be discussed. It will be shown how VAPI-level RDMA techniques can be used for providing strong coherency and reconfigurability with low overhead for designing next-generation datacenters with dynamic data.

Speaker: David Skinner, Lawrence Berkeley National Laboratory
Date: Tuesday, October 26, 2004, 10:00-11:00 a.m.
Location: B451 R1025 (White Room)
Subject: MPI Performance in a Production Environment: Toward Automatic Application Profiling
Contact: Terry Jones (423-9834)

This talk focuses on profiling and optimization of parallel scientific codes in a production environment. Most of the work presented is from data gathered on Seaborg, the IBM SP at NERSC, using tools designed to assist users and HPC managers in low overhead collection of performance information. Comparisons of different MPI libraries will be presented in addition to initial work done to characterize the diverse scientific workload currently running at NERSC.

There will be roundtable discussions with David on Tuesday afternoon. The discussions will provide an opportunity to cover any code-specific questions or obtain further information on the capabilities and usage of the profiling package. Let me know if you would like to sit in.

Speaker: Kathryn Mohror, CASC, ISCR, Portland State University
Date: Monday, August 23, 2004, 3:00 p.m.
Location: B451 R1025 (White Room)
Subject: Performance Tools for MPI-2 on Linux
Contact: John May (423-8102)

Programmers of message-passing codes for clusters of workstations face a daunting challenge in understanding the performance bottlenecks of their applications. This is largely due to the vast amount of performance data that is collected and the time and expertise necessary to use traditional parallel performance tools to analyze that data.

This talk reports on our recent efforts developing a performance tool for MPI applications on Linux clusters. Our target MPI implementations were LAM/MPI and MPICH2, both of which support portions of the MPI-2 Standard. We started with an existing performance tool and added support for non-shared file systems, MPI-2 one-sided communications, dynamic process creation, and MPI Object naming. We present results using the enhanced version of the tool to examine the performance of several applications. We describe a new performance tool benchmark suite we have developed, PPerfMark, and present results for the benchmark using the enhanced tool.

Speakers: David Klepacki and Charles Grassl, IBM
Date: Thursday–Friday, July 8–9, 2004
Location: B543 Auditorium
Subject: See below
Contact: Terry Jones (423-9834)

July 8 sessions (presenter) and presentation materials:

  • 8:30–10:00 a.m.: MPI Session 1 (David Klepacki). The IBM High Performance Toolkit Software and IBM's MPI Implementation [1.5 MB]
  • 10:00–10:30 a.m.: break
  • 10:30 a.m.–noon: MPI Session 2 (David Klepacki)
  • noon–1:30 p.m.: lunch
  • 1:30–3:00 p.m.: App Optimization 1 (Charles Grassl). POWER5 and HPS Programming Strategies: System Architecture [1.8 MB]
  • 3:00–3:30 p.m.: break
  • 3:30–5:00 p.m.: App Optimization 2 (Charles Grassl). POWER5 and HPS Programming Strategies: Resources and Special Effects [1.1 MB]

July 9 sessions (presenter) and presentation materials:

  • 8:30–10:00 a.m.: App Optimization 3 (Charles Grassl). POWER5 and HPS Programming Strategies: pSeries Optimization [930 KB]
  • 10:00–10:30 a.m.: break
  • 10:30 a.m.–noon: App Optimization 4 (Charles Grassl). POWER5 and HPS Programming Strategies [356 KB]

The IBM High Performance Computing Toolkit: David Klepacki
This talk will cover the performance analysis tools and libraries from the ACTC at Watson Research labs. In particular, it will cover the new profiling and tracing tools for MPI, as well as integration with its visualization tool, known as PeekPerf. The significance of these tools lies in their lower overheads combined with the ability to trace performance information back to the exact source code statements in a user-friendly, interactive manner. Also included will be the new OpenMP profiling tool, Pomprofiler, which is based on the newly proposed DPOMP interface for dynamic OpenMP instrumentation. This tool is also integrated into PeekPerf, and examples of its use will be provided.

IBM's MPI Implementation: David Klepacki
This talk provides a description of the IBM MPI Implementation, various protocols used, and key environment variables.

One-sided Communication Programming: David Klepacki
This talk explores one-sided communication on IBM pSeries systems. First, it demystifies the use of one-sided messaging as defined by the MPI-2 standard, complete with correct usage templates and performance characteristics on current Power4-based systems. Second, it introduces the use of SHMEM on IBM pSeries systems. Many codes employ the SHMEM programming paradigm from the Cray T3D/E days, which was found to have a significant performance advantage on that architecture. With the advent of BlueGene/L, whose communication architecture is similar to those Cray machines, there may be renewed interest in this programming paradigm. SHMEM can be used on today's Power4-based systems, but with certain limitations when not used with a Federation switch. The talk will cover these limitations and provide a comparison with the alternative one-sided communication methods.
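
For readers unfamiliar with the MPI-2 one-sided model mentioned above, the following is a minimal, hypothetical sketch (not drawn from the talk's own usage templates): rank 0 puts a value directly into a window exposed by rank 1, with MPI_Win_fence delimiting the access epoch. It requires at least two MPI tasks.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical MPI-2 one-sided sketch: rank 0 writes into rank 1's window.
   Run with at least two tasks. */
int main(int argc, char **argv)
{
    int rank;
    double buf = 0.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each task exposes one double through the window. */
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the access/exposure epoch */
    if (rank == 0) {
        double val = 3.14;
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                 /* close the epoch; the put is now visible */

    if (rank == 1)
        printf("task 1 received %g via MPI_Put\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}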

POWER5 and HPS Programming Strategies: Charles Grassl
This presentation will address strategies for programming and exploiting features of IBM Power4 and Power5 processor systems and the new High Performance Switch (HPS), also known as Federation.

Though the processor architecture and instruction sets for Power4 and Power5 processors have few changes from Power3 processors, the overall system design is much different from that of the 16-processor NightHawk-2 nodes in the ASC White system.

The Power4 processors are architecturally compatible with Power3 processors, but some features are different. The main features of concern are the depths of the functional units, the number of rename registers, the number of prefetch queues, and the size of the level 2 cache. These features are largely managed by the compilers, but it is useful for programmers to be aware of the compiler strategy and techniques.

The memory hierarchy of the Power4 and Power5 systems has changed: there is now a third level of cache, and memory itself is no longer strictly "flat" or "uniform." Additionally, these systems offer two memory page sizes. We now have the concepts of "memory affinity" and of large versus small memory pages. These features have ramifications that affect performance programming.

The new HPS switch nominally has four times the bandwidth of the previous switch, the SP Switch 2, also known as "Colony," used in the ASC White system. MPI message-passing latency is also much improved relative to the SP Switch 2.

In this tutorial, we will describe the latest C and Fortran compilers and their specific features relevant to the Power4 and Power5 processors. We will also describe the performance optimization facilities available in the C and Fortran compilers and the most effective tactics for leveraging them.

We will follow this with an overview of optimization techniques that will exploit the features of the Power4 and Power5 processors. We will also discuss the use, exploitation, and ramifications of memory affinity and of large memory pages.

We will also discuss message passing using the new HPS switch. The new switch has slightly different tuning characteristics from previous pSeries switches, and this tuning involves several new environment variables.

David Klepacki directs and manages the Advanced Computing Technology Center (ACTC) and is also Associate Director of the Deep Computing Institute at the IBM T.J. Watson Research Center in Yorktown Heights, NY. David Klepacki obtained a Ph.D. degree in physics from Purdue University and is a senior staff scientist at IBM Research with more than 20 years of experience in high-performance computing. David has worked in many areas including high-performance processor design, numerically intensive computation, computational physics, parallel computing, application benchmarking, and cluster computing.
Charles Grassl is a High-Performance Computing Technology Specialist associated with the Deep Computing and the Advanced Computing Technology Center (ACTC) groups. He earned a Ph.D. in computational physics from the University of Wisconsin. His research involved developing computational models for disordered magnetic systems. This work involved programming early vector supercomputers for solving large eigenvalue problems. Beginning with the early use of vector supercomputers, through multiprocessor vector systems and microprocessor based MPP systems, to current Linux clusters, Charles Grassl has worked on high-performance computing applications for over 25 years.

Speaker: Rich Graham, Los Alamos National Laboratory
Date: Tuesday, June 29, 2004, 10:00–11:00 a.m.
Location: B113 R1104 (Von Neumann Room)
Subject: Open MPI
Contact: Terry Jones (423-9834)

A large number of MPI implementations are currently available, many of which emphasize different aspects of high-performance computing. Each implementation is typically targeted toward a specific environment or intended to solve a specific research problem. The result is myriad incompatible MPI implementations, all of which require separate installation, and the combination of which presents significant logistical challenges for end users.

Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, FT-MPI, and MVAPICH projects, Open MPI is an all-new, production-quality MPI-2 implementation that is centered around component concepts.

Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture provides both a stable platform for third-party research and a mechanism for the run-time composition of independent software add-ons.

This talk presents a high-level overview of the goals, design, and implementation of Open MPI, along with performance data.

Speaker: Jesús Labarta, CEPBA Director, European Center for Parallelism of Barcelona, Technical University of Catalonia
Date: Tuesday, March 2, 2004, 10 a.m.
Location: B451 R1025 (White Room)
Subject: Analysis of AIX Traces with Paraver
Contact: Terry Jones (423-9834)

The AIX trace facility from IBM allows one to collect very low-level information on OS scheduling, system calls, resource allocation, system daemon activity, etc.

Under a collaboration contract between LLNL and CEPBA-UPC, we have developed a translator from the AIX trace format to Paraver. We are now able to use the flexibility and analysis power of Paraver to analyze the low-level detail captured by the AIX trace facility.

This talk will present the aix2prv translator, the kinds of information and views that can be obtained, and some examples of studies, such as analyzing the influence and impact of system interrupts on fine-grain parallel applications or discovering some details about the MPI internals.

Speaker: L.V. Kale, Professor, Department of Computer Science, University of Illinois, Urbana-Champaign
Date: Thursday, February 25, 2004, 9:00 a.m.
Location: B451 R1025 (White Room)
Subject: Adaptive Resource Management via Processor Virtualization—Charm++ and AMPI
Contact: Andy Yoo (422-3721)

Speaker: Michael Campbell
Date: Thursday, January 8, 2004, 10:00 a.m.–11:00 a.m.
Location: B451 R1025 (White Room)
Contact: Terry Jones (423-9834)

On Thursday, January 8, 2004, Michael Campbell of the Center for Simulation of Advanced Rockets at the University of Illinois at Urbana-Champaign will be the MPI Topics Colloquium speaker. Michael will be at LLNL to discuss recent efforts to profile and analyze high-performance-computing applications. In addition to the seminar, we will have roundtable discussions, and there are a couple of timeslots left during Thursday to meet with Michael. Contact Terry Jones if you would like to join these discussions.

Performance Profiling and Analysis for High Performance Computing at CSAR

Performance profiling and analysis in high performance computing is a problem almost as complex and diverse as today's supercomputing environment. Application scientists are faced with many different platforms, including large SMP machines, clusters of heterogeneous workstations, dedicated clustered workstations, and even clusters of clusters. The applications themselves are just as complex, with multiple, sometimes dynamic, parallelization schemes, multiple languages, and different design philosophies. Many quality profiling and analysis tools exist but are challenged to keep up with application and platform complexities. The CSAR rocket simulation code is one such complex application. The CSAR code interfaces multiple, independently developed, multi-language physics and computer science modules, includes multiple parallelization schemes, and runs on a variety of platforms. CSAR has taken a relatively seamless, "embedded" approach to monitoring and evaluating our code's performance on the various platforms on which it runs. In this talk I will discuss CSAR's solution for the collection and analysis of performance data for our integrated simulation code.

The BlueGene/L Supercomputer: Delivering Large Scale Parallelism
José E. Moreira, IBM T.J. Watson Research Center
March 2003
