News and Events
Speaker: Adam Moody, Computation/ICCD
Techniques that applications may employ to achieve optimal MPI performance on LC's Linux clusters using the Quadrics interconnect will be discussed. An overview of the system network architecture and the Quadrics interconnect will be presented, leading to discussion of programming for the best performance in both point-to-point and collective communication. The effectiveness of these programming methods is illustrated through case examples of real applications. In several instances, communication operations completed faster by one or two orders of magnitude, leading to overall application speedups of two or three. Finally, some advanced techniques will be covered, which allow users to fine tune the MPI run-time environment for their application.
Beta Test of New Quadrics MPI Library for Thunder Users
The beta test begins April 20, 2005. The duration of the beta test will depend on the types of problems encountered, but it is expected to last until around mid-May.
Background and Expected Benefits
Quadrics has distributed beta versions of new communication libraries that are expected to provide significant benefits for MPI programs on Thunder. Many collectives have been optimized for small messages: Allreduce and Reduce (~3 or fewer elements), Broadcast (~1K or less), and particularly Alltoall, Allgather, and Gather (~1K or less).
The overall benefit to your application will depend upon the percentage of time it currently spends in the improved collectives. For applications that frequently use the improved collectives for small messages, performance improvements of 25% or better are possible.
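As a rough illustration of how the fraction of time spent in collectives bounds the overall gain, here is an Amdahl's-law style estimate. This is only a sketch; the fraction and speedup factor used below are hypothetical, not figures from the announcement.

```python
# Sketch (Amdahl's-law style estimate; numbers below are hypothetical):
# if a fraction f of total runtime is spent in the improved collectives,
# and those collectives get faster by factor s, the overall speedup is:
def overall_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# e.g. an application spending 40% of its time in small-message collectives
# that run 4x faster would see roughly a 1.43x overall speedup:
print(round(overall_speedup(0.4, 4.0), 2))  # -> 1.43
```

The formula makes the announcement's caveat concrete: an application that spends little time in the improved collectives sees little overall benefit, no matter how much faster the collectives themselves become.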
Who Can Participate
All Thunder users with MPI applications may try out the beta software. We encourage anyone interested to give these new libraries a try.
How to Participate
We ask that all participants follow the guidelines below:
Follow the directions below to run against the new library. You may run against the new libraries without rebuilding or relinking your program binary. To access them, just include the following path in your LD_LIBRARY_PATH environment variable: /usr/local/tools/qsnet/beta
For example, in the tcsh shell, you might type:
setenv LD_LIBRARY_PATH /usr/local/tools/qsnet/beta
If LD_LIBRARY_PATH includes other paths required for your program to run, you should prepend the beta path, e.g.,
setenv LD_LIBRARY_PATH /usr/local/tools/qsnet/beta:$LD_LIBRARY_PATH
To be sure that your program will load the correct version, check that the beta path appears in the ldd output.
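For users of Bourne-style shells, a sketch of the same setup and the ldd check follows. The beta path is taken from the announcement; "a.out" is only a placeholder for your MPI program binary.

```shell
# Bourne-shell (sh/bash) equivalent of the tcsh setenv above; the beta
# path comes from the announcement, a.out stands in for your binary.
export LD_LIBRARY_PATH="/usr/local/tools/qsnet/beta:$LD_LIBRARY_PATH"
echo "$LD_LIBRARY_PATH"    # the beta directory should appear first
# Then confirm the loader will resolve the MPI libraries from the beta path:
#   ldd ./a.out | grep /usr/local/tools/qsnet/beta
```

Because the dynamic loader searches LD_LIBRARY_PATH directories in order, prepending (rather than appending) the beta directory ensures the beta libraries win over any system copies.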
Questions and Assistance
If you have any questions on this MPI Beta Test, please contact us (see Contacts list).
Speaker: Dhabaleswar K. (DK) Panda, Ohio State University
The emerging InfiniBand Architecture (IBA) is generating a lot of excitement as an open interconnect standard for building next generation high-end systems in a radically different manner. This presentation will focus on research challenges and state-of-the-art solutions for designing HPC clusters and multi-tier datacenters with IBA.
For designing HPC clusters with IBA, the focus will be on issues related to designing scalable and high-performance implementation of the Message Passing Interface (MPI) standard (both MPI-1 and MPI-2). Issues, challenges, and solutions for designing efficient support for point-to-point communication, collective communication (broadcast, barrier, all-to-all, etc.), flow control, data types, and synchronization on clusters with different processors, PCI interfaces (PCI-X and PCI-Express), and networks (single-rail and multiple-rails) will be presented. The presentation will be based on experiences designing MVAPICH (MPI-1 over VAPI) and MVAPICH2 (MPI-2 over VAPI), which are being used in many IBA clusters to extract performance with IBA.
For datacenters, highlights will include issues, challenges, and solutions related to designing high-performance and scalable multi-tier datacenters with IBA. Performance benefits of the Sockets Direct Protocol (SDP) stack compared to the IP over IB (IPoIB) stack will be presented for various workloads. Impact of latency, throughput, and CPU utilization of the protocol stacks on the overall performance of the datacenter will be discussed. It will be shown how VAPI-level RDMA techniques can be used for providing strong coherency and reconfigurability with low overhead for designing next-generation datacenters with dynamic data.
Speaker: David Skinner, Lawrence Berkeley National Laboratory
This talk focuses on profiling and optimization of parallel scientific codes in a production environment. Most of the work presented is from data gathered on Seaborg, the IBM SP at NERSC, using tools designed to assist users and HPC managers in low overhead collection of performance information. Comparisons of different MPI libraries will be presented in addition to initial work done to characterize the diverse scientific workload currently running at NERSC.
There will be roundtable discussions with David on Tuesday afternoon. The discussions will provide an opportunity to cover any code-specific questions or obtain further information on the capabilities and usage of the profiling package. Let me know if you would like to sit in.
Speaker: Kathryn Mohror, CASC/ISCR, Portland State University
Programmers of message-passing codes for clusters of workstations face a daunting challenge in understanding the performance bottlenecks of their applications. This is largely due to the vast amount of performance data that is collected and the time and expertise necessary to use traditional parallel performance tools to analyze that data.
This talk reports on our recent efforts developing a performance tool for MPI applications on Linux clusters. Our target MPI implementations were LAM/MPI and MPICH2, both of which support portions of the MPI-2 Standard. We started with an existing performance tool and added support for non-shared file systems, MPI-2 one-sided communications, dynamic process creation, and MPI Object naming. We present results using the enhanced version of the tool to examine the performance of several applications. We describe a new performance tool benchmark suite we have developed, PPerfMark, and present results for the benchmark using the enhanced tool.
The IBM High Performance Computing Toolkit: David Klepacki
IBM's MPI Implementation: David Klepacki
One-sided Communication Programming: David Klepacki
POWER5 and HPS Programming Strategies: Charles Grassl
Though the processor architectures and instruction sets of the Power4 and Power5 processors have changed little from the Power3, the overall system design differs greatly from that of the 16-processor NightHawk-2 nodes in the ASC White system.
The Power4 processors are architecturally compatible with Power3 processors, but some features differ. The main features of concern are the depths of the functional units, the number of rename registers, the number of pre-fetch queues, and the size of the level 2 cache. These features are largely managed by the compilers, but it is useful for programmers to be aware of the compiler strategies and techniques.
The memory hierarchy of the Power4 and Power5 systems is changed: there is now a third level of cache and memory itself is not strictly "flat" or "uniform." Additionally, these systems have two sizes of memory pages available. We now have the concepts of "memory affinity" and of large versus small memory pages. These features have ramifications that affect performance programming.
The new HPS switch nominally offers four times the bandwidth of the previous switch, the SP Switch 2 (also known as "Colony"), used in the ASC White system. MPI message-passing latency is also much improved relative to the SP Switch 2.
In this tutorial, we will describe the latest C and Fortran compilers and their features relevant to the Power4 and Power5 processors. We will also describe the performance optimization facilities available in the C and Fortran compilers and the most effective tactics for leveraging them.
We will follow this with an overview of optimization techniques that will exploit the features of the Power4 and Power5 processors. We will also discuss the use, exploitation, and ramifications of memory affinity and of large memory pages.
We will also discuss message passing using the new HPS switch. The new switch has slightly different tuning characteristics from previous pSeries switches, and this tuning involves several new environment variables.
David Klepacki directs and manages the Advanced Computing Technology Center (ACTC) and is also Associate Director of the Deep Computing Institute at the IBM T.J. Watson Research Center in Yorktown Heights, NY. He obtained a Ph.D. in physics from Purdue University and is a senior staff scientist at IBM Research with more than 20 years of experience in high-performance computing. He has worked in many areas, including high-performance processor design, numerically intensive computation, computational physics, parallel computing, application benchmarking, and cluster computing.
Speaker: Rich Graham, Los Alamos National Laboratory
A large number of MPI implementations are currently available, many of which emphasize different aspects of high-performance computing. Each implementation is typically targeted toward a specific environment or intended to solve a specific research problem. The result is myriad incompatible MPI implementations, all of which require separate installation and the combination of which presents significant logistical challenges for end users.
Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, FT-MPI, and MVAPICH projects, Open MPI is an all-new, production-quality MPI-2 implementation centered around component concepts.
Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture provides both a stable platform for third-party research and the run-time composition of independent software add-ons.
This talk presents a high-level overview of the goals, design, implementation, and performance of Open MPI.
Speaker: Jesús Labarta, CEPBA Director,
European Center for Parallelism of Barcelona,
Technical University of Catalonia
The AIX trace facility from IBM allows one to collect very low-level information on OS scheduling, system calls, resource allocation, system daemon activity, etc.
Under a collaboration contract between LLNL and CEPBA-UPC, we have developed a translator from the AIX trace format to the Paraver format. We are now able to use the flexibility and analysis power of Paraver to analyze the low-level information captured by the AIX trace facility.
This talk will present the aix2prv translator, the kinds of information and views that can be obtained, and some example studies, such as analyzing the impact of system interrupts on fine-grain parallel applications or uncovering details of the MPI internals.
Speaker: Laxmikant V. Kale, Professor, Department of Computer Science, University of Illinois at Urbana-Champaign
Speaker: Michael Campbell
On Thursday, January 8, 2004, Michael Campbell of the Center for Simulation of Advanced Rockets at the University of Illinois at Urbana-Champaign will be the MPI Topics Colloquium speaker. Michael will be at LLNL to discuss recent efforts to profile and analyze high-performance-computing applications. In addition to the seminar, we will have roundtable discussions, and there are a couple of timeslots left during Thursday to meet with Michael. Contact Terry Jones if you would like to join these discussions.
Performance Profiling and Analysis for High Performance Computing at CSAR
Performance profiling and analysis in high performance computing is a problem almost as complex and diverse as today's supercomputing environment. Application scientists are faced with many different platforms, including large SMP machines, clusters of heterogeneous workstations, dedicated clustered workstations, and even clusters of clusters. The applications themselves are just as complex, with multiple, sometimes dynamic, parallelization schemes, multiple languages, and different design philosophies. Many quality profiling and analysis tools exist but are challenged to keep up with application and platform complexities. The CSAR rocket simulation code is one such complex application. The CSAR code interfaces multiple independently developed, multi-language physics and computer science modules, includes multiple parallelization schemes, and runs on a variety of platforms. CSAR has taken a relatively seamless, "embedded" approach to monitoring and evaluating the code's performance on the various platforms on which it runs. In this talk I will discuss CSAR's solution for the collection and analysis of performance data for our integrated simulation code.
The BlueGene/L Supercomputer: Delivering Large Scale Parallelism
Last modified September 8, 2006