MPI Interactions with SMT and OpenMP

Simultaneous Multithreading

One of the new features available with POWER5-based systems (e.g., Purple and uP) is simultaneous multithreading (SMT). SMT is the ability of a single physical processor to concurrently dispatch instructions from more than one hardware thread. On ASC Purple, a node consisting of eight physical processors is configured as a logical 16-way by default. Two hardware threads can run on one physical processor at the same time.

Benchmarks have shown that SMT can provide performance boosts of up to 60%, although in some cases there is no benefit or even a slight degradation in performance. SMT works by overloading each physical CPU with the work from two threads. If one thread is stalled waiting for some resource, it may be possible to do useful work from the other thread, improving overall performance. Thus, SMT would benefit code that is memory latency bound, because work from one thread can be used to hide some of the latency from the other. But in situations where some resource, such as memory bandwidth or FPU cycles, is already maxed out, SMT will not provide a benefit.

For example, the memory bandwidth of a p575 node can be saturated with fewer than eight threads, so running the memory-bandwidth-bound STREAM benchmark with 16 threads will not improve performance over running it with eight. Similarly, blocked matrix multiplies use up all of the FPU instruction bandwidth, so running more than one thread per physical CPU is of no benefit to LINPACK. For more information on IBM's SMT capabilities, please consult Characterization of Simultaneous Multithreading (SMT) in POWER5 and General Programming Concepts: Writing and Debugging Programs—Simultaneous Multi-Threading. Finally, note that IBM's Fortran and C compilers support OpenMP directives.
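
The following OpenMP loop in C is a minimal sketch (not one of the benchmarks above) of how to probe this on a single node: run it with OMP_NUM_THREADS=8 (one thread per physical processor) and again with OMP_NUM_THREADS=16 (two SMT threads per processor). For a memory-streaming kernel like this one, the 16-thread run is unlikely to be faster, for the reasons given above. With IBM's compilers, the build line is typically something like "xlc_r -qsmp=omp".

    /* Illustrative only: time a streaming loop with 8 and then 16 threads
     * on one node to see whether SMT helps this kind of kernel. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 20000000              /* about 160 MB per array */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        if (!a || !b) return 1;

        for (int i = 0; i < N; i++)
            b[i] = (double)i;

        double t0 = omp_get_wtime();

        /* Work-sharing loop: each thread streams through its own chunk. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.5 * b[i];

        double t1 = omp_get_wtime();
        printf("%d threads: %.3f seconds, a[N-1] = %.1f\n",
               omp_get_max_threads(), t1 - t0, a[N - 1]);

        free(a);
        free(b);
        return 0;
    }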

Case Study 1: UMT2K

SMT benefited both UMT2K and sPPM. For UMT2K, there is about a 30% gain from using SMT.

Mode    Threads    % of Peak
ST      8          19.5
SMT     16         26.7

These are one-node results. Although the overall percent of peak drops a little at scale, the relative benefit of SMT remains the same even on full-machine runs. A similar benefit was found with sPPM.

Case Study 2: LINPACK

As mentioned above, there is no benefit in running 16 threads per node, rather than eight, for LINPACK. This raises the question of whether simply having SMT enabled imposes any cost. The following results, from LINPACK runs on 247 nodes, show that the cost is very minor.

Mode    Threads    Time       TFlops    % of Peak
ST      8          3341.65    12.31     82
SMT     8          3386.77    12.15     81
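
The "% of Peak" figures are consistent with Purple's hardware. Assuming (this is not stated above) 1.9 GHz POWER5 processors that can each complete four floating-point operations per cycle:

    peak per processor  =  1.9 GHz × 4 flops/cycle    =   7.6 GFlops
    peak per node       =  8 × 7.6 GFlops             =  60.8 GFlops
    peak for 247 nodes  =  247 × 60.8 GFlops          ≈  15.0 TFlops

    ST:   12.31 / 15.0  ≈  0.82  →  82% of peak
    SMT:  12.15 / 15.0  ≈  0.81  →  81% of peak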

As a result, Purple is configured with SMT enabled so that codes like UMT2K and sPPM can enjoy the benefit, while other codes are not seriously impacted.

Case Study 3: HYDRA

HYDRA incorporates a hydrodynamics package, a laser package, and an implicit Monte Carlo package. It can be configured with varying numbers of MPI tasks per node, as well as a variable "masters to workers" ratio.

Nodes    MPI Config    Masters:Workers    Run Time
4        16            16:1               10.69    .588
8        8             16:1                6.75    .597
16       4             15:1                6.02    .648
64       1             15:1                5.10    .660


[adapted from Hybrid OpenMP and MPI Programming and Tuning, Yun (Helen) He and Chris Ding, Lawrence Berkeley National Laboratory, NUG2004 (June 24, 2004)]

Pure MPI Pro

  • Portable to distributed and shared memory machines
  • Scales beyond one node
  • No data placement problem

Pure MPI Con

  • Difficult to develop and debug
  • High latency, low bandwidth
  • Explicit communication
  • Large granularity
  • Difficult load balancing

Pure OpenMP Pro

  • Easy to implement parallelism
  • Low latency, high bandwidth
  • Implicit communication
  • Coarse and fine granularity
  • Dynamic load balancing

Pure OpenMP Con

  • Only on shared memory machines
  • Scales only within one node
  • Possible data placement problem
  • No specific thread order

Why Hybrid?

  • The hybrid MPI/OpenMP paradigm is the software trend for clusters of SMP architectures.
  • Elegant in concept and architecture: use MPI across nodes and OpenMP within a node (a sketch follows this list).
  • Makes good use of shared-memory system resources (memory, latency, and bandwidth).
  • Avoids the extra overhead of MPI communication within a node.
  • OpenMP adds fine granularity (allowing larger MPI message sizes) and permits increased and/or dynamic load balancing.
  • Some problems have two-level parallelism naturally.
  • Some problems can only use a restricted number of MPI tasks.
  • Can have better scalability than either pure MPI or pure OpenMP.
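
The sketch below in C illustrates the paradigm (illustrative code only, not taken from any application discussed on this page): OpenMP threads share the work within each MPI task, and MPI_Allreduce combines the per-task results across nodes. A typical layout on Purple-like hardware would be one MPI task per node with OMP_NUM_THREADS set to the number of processors per node.

    /* Hybrid MPI/OpenMP sketch: OpenMP threads share work within a task,
     * MPI combines results across tasks (and hence across nodes). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* MPI_THREAD_FUNNELED: only the master thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 10000000;
        double local = 0.0;

        /* OpenMP parallelism within the node: threads split the loop. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < n; i++)
            local += 1.0 / (1.0 + (double)i);

        /* MPI parallelism across nodes: combine the per-task partial sums. */
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f, threads per task = %d\n",
                   global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }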

Why Mixed OpenMP/MPI Is Sometimes Slower

  • OpenMP has less scalability due to its implicit parallelism, while MPI allows multi-dimensional blocking.
  • All threads are idle except one during MPI communication.
  • Computation and communication must be overlapped for better performance (see the sketch after this list).
  • Critical sections are needed for shared variables.
  • Thread creation overhead.
  • Cache coherence and data placement issues.
  • Some problems have only one natural level of parallelism.
  • Pure OpenMP code can perform worse than pure MPI within a node.
  • Lack of optimized OpenMP compilers and libraries.
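
The following C sketch shows one common way to overlap computation with communication in a hybrid code (an illustrative 1-D halo-exchange pattern, not code from any application discussed on this page): the master thread posts nonblocking halo exchanges, all OpenMP threads update the interior while the messages are in flight, and the boundary cells are updated after MPI_Waitall.

    /* Overlap sketch: nonblocking halo exchange posted by the master thread,
     * interior update by all OpenMP threads, boundary update after Waitall. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define N 1024                       /* interior cells per MPI task */

    static double u[N + 2], unew[N + 2]; /* one ghost cell at each end */

    static void step(int left, int right)
    {
        MPI_Request req[4];

        /* Master thread only: start the halo exchange (MPI_THREAD_FUNNELED). */
        MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

        /* Interior update can proceed on all threads while messages travel. */
        #pragma omp parallel for
        for (int i = 2; i < N; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

        /* Finish communication, then update the two cells next to the ghosts. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        unew[1] = 0.5 * (u[0] + u[2]);
        unew[N] = 0.5 * (u[N - 1] + u[N + 1]);

        memcpy(u, unew, sizeof(u));
    }

    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i < N + 2; i++)
            u[i] = (double)rank;

        /* Neighbors in a non-periodic 1-D decomposition. */
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (int iter = 0; iter < 10; iter++)
            step(left, right);

        if (rank == 0)
            printf("u[1] = %f after 10 steps\n", u[1]);

        MPI_Finalize();
        return 0;
    }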

Last modified October 12, 2007