MPI Interactions with SMT and OpenMP
One of the new features available with POWER5-based systems (e.g., Purple and uP) is simultaneous multithreading (SMT). SMT is the ability of a single physical processor to concurrently dispatch instructions from more than one hardware thread. On ASC Purple, a node consisting of eight physical processors is configured as a logical 16-way by default. Two hardware threads can run on one physical processor at the same time.
Benchmarks have shown that SMT can provide performance boosts of up to 60%, although in some cases there is no benefit or even a slight degradation in performance. SMT works by overloading each physical CPU with the work from two threads. If one thread is stalled waiting for some resource, it may be possible to do useful work from the other thread, improving overall performance. Thus, SMT would benefit code that is memory latency bound, because work from one thread can be used to hide some of the latency from the other. But in situations where some resource, such as memory bandwidth or FPU cycles, is already maxed out, SMT will not provide a benefit.
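The two regimes can be illustrated with a pair of toy kernels. This is a minimal sketch, not code from the benchmarks above: the first loop is latency bound (every load depends on the previous one, so the processor stalls on memory and a second SMT thread could use the idle cycles), while the second is compute bound (independent multiply-adds keep the FPU busy, leaving little for a second thread).

```c
#include <stdio.h>
#include <stdlib.h>

/* Latency-bound kernel: pointer chasing around a ring. Each load
 * depends on the result of the previous one, so the CPU spends most
 * of its time stalled on memory -- exactly the situation where SMT
 * can schedule useful work from another hardware thread. */
static long ring_chase(long n) {
    long *next = malloc((size_t)n * sizeof *next);
    if (next == NULL)
        return -1;
    for (long i = 0; i < n; i++)
        next[i] = (i + 1) % n;  /* simple ring; a random permutation
                                   defeats the prefetcher even better */
    long pos = 0;
    for (long s = 0; s < n; s++)
        pos = next[pos];        /* serial dependence: load after load */
    free(next);
    return pos;                 /* n steps around an n-ring: back to 0 */
}

/* Compute-bound kernel: a multiply-add recurrence that keeps the FPU
 * pipeline full. With the FPU already saturated, a second SMT thread
 * on the same physical CPU has nothing spare to use. */
static double fma_loop(double x, long steps) {
    double acc = 0.0;
    for (long s = 0; s < steps; s++)
        acc = acc * 0.5 + x;    /* converges toward 2*x */
    return acc;
}
```

With these assumptions, `ring_chase` is the kind of memory-latency-bound code that benefits from SMT, and `fma_loop` is the kind of FPU-bound code that does not.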
For example, the memory bandwidth on a p575 node can be saturated with fewer than eight threads, so running the STREAM benchmark, which is memory bandwidth bound, with 16 threads on an SMT system will not improve performance over running it with eight. Similarly, blocked matrix multiplies use up all the FPU instruction bandwidth, so running more than one thread per physical CPU will be of no benefit to LINPACK. For more information on IBM's SMT capabilities, please consult Characterization of Simultaneous Multithreading (SMT) in POWER5 and General Programming Concepts: Writing and Debugging Programs—Simultaneous Multi-Threading. Finally, IBM's Fortran and C compilers support OpenMP directives.
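The STREAM kernel mentioned above is essentially the following "triad" loop; this is a minimal single-threaded sketch (the real benchmark uses much larger arrays, timing, and OpenMP threading):

```c
#include <stddef.h>

/* STREAM "triad" kernel: a[i] = b[i] + q*c[i].
 * Each iteration moves three memory streams (two reads, one write)
 * but performs only two flops, so the loop is limited by memory
 * bandwidth, not FPU throughput. Once the memory bus is saturated,
 * adding more threads -- SMT or otherwise -- cannot make it faster. */
void triad(double *a, const double *b, const double *c, double q, size_t n) {
    /* In the threaded benchmark this loop would carry
     * #pragma omp parallel for */
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```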
Case Study 1: UMT2K
SMT benefited both UMT2K and sPPM; for UMT2K, enabling SMT yields a gain of about 30%.
These are one-node results. But although the overall percent of peak drops a little at scale, the relative benefit of SMT remains the same even on full-machine runs. A similar benefit was found with sPPM.
Case Study 2: LINPACK
As mentioned above, there is no benefit in running 16 threads per node, rather than eight, for LINPACK. This brings up the question of whether having SMT enabled imposes any cost. The following results from running LINPACK on 247 nodes show that the cost is very minor.
As a result, Purple is configured with SMT enabled so that codes like UMT2K and sPPM can enjoy the benefit, while other codes are not seriously impacted.
Case Study 3: HYDRA
HYDRA incorporates a hydrodynamics package, a laser package, and an implicit Monte Carlo package. It can be configured with varying numbers of MPI tasks per node, as well as a variable "masters-to-workers" ratio.
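One way such a masters-to-workers ratio could partition the tasks on a node is sketched below. The function name, the grouping scheme, and the even-division requirement are illustrative assumptions for this note, not HYDRA's actual decomposition:

```c
#include <assert.h>

/* Hypothetical helper (not HYDRA's actual scheme): given the number
 * of MPI tasks on a node and a masters:workers ratio such as 1:3,
 * compute how many of those tasks act as masters. Tasks are grouped
 * into units of (ratio_m + ratio_w); each group contributes ratio_m
 * masters, with the rest serving as workers. */
static int masters_per_node(int tasks, int ratio_m, int ratio_w) {
    int group = ratio_m + ratio_w;
    assert(group > 0 && tasks % group == 0);  /* illustrative: require
                                                 an even division */
    return (tasks / group) * ratio_m;
}
```

For example, under these assumptions a 16-task node with a 1:3 ratio would run 4 masters and 12 workers.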
[adapted from Hybrid OpenMP and MPI Programming and Tuning, Yun (Helen) He and Chris Ding, Lawrence Berkeley National Laboratory, NUG2004 (June 24, 2004)]
Pure MPI Pro
Pure MPI Con
Pure OpenMP Pro
Pure OpenMP Con
Why Mixed OpenMP/MPI is Sometimes Slower
Last modified October 12, 2007