As computers have evolved, the factor limiting the performance of high-performance computing (HPC) applications has changed. The number of floating-point operations per second (flops) a machine could perform was once the primary bottleneck for many linear algebra-based codes. For the past decade or so, memory bandwidth, the rate at which a processor can read data from or write data to memory, has been considered the key performance limiter.
But for codes that use other solution methods, LLNL researchers are learning that the performance bottlenecks can be different. Diagnosing these performance bottlenecks will aid researchers in accurately evaluating application performance and in developing next-generation HPC systems and codes.
A study by LLNL computer scientists Ian Karlin and Steve Langer of the high-energy-density physics code HYDRA found that none of its code packages were actually reaching the bandwidth limit on LLNL's current roster of supercomputers. In fact, many packages stayed well below both the bandwidth and flops limits yet still performed poorly on some machines. For instance, considering bandwidth and flops alone, HYDRA would be expected to run 1.5 to 3 times slower per node on the IBM BlueGene/Q Sequoia machine than on a Linux cluster, but it actually ran closer to 5 times slower.
Karlin and Langer have been investigating one potential performance-limiting factor: how different machines and codes organize work. In most supercomputers, the microprocessor takes the instructions provided by the code and decodes them into work units for execution. Very broadly, these units fall into two categories: floating-point operations and integer operations. A floating-point operation is the work scientists want the machine to do, for instance, the calculation of mathematical equations. Integer operations include tasks such as data movement or value testing (if X = Y, then Z). The balance between the two kinds of work a machine can sustain depends on its architecture.
For example, Sequoia is designed to perform one integer operation for every floating-point operation, while many of Livermore's Linux clusters can perform 3 integer operations for every floating-point operation. Many codes have an integer-to-floating-point ratio between 2:1 and 4:1. If the code and the supercomputer do not have similar ratios, the code may run less efficiently on that system.
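To make the instruction mix concrete, the following is a minimal hypothetical sketch, not drawn from HYDRA or any Livermore code, of a simple update loop in C++. The comments mark where the hardware executes integer instructions (loop control, bounds tests, address arithmetic) alongside the floating-point arithmetic that represents the actual science.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical sketch: one pass of a simple 1-D update (assumes rhs is at least as long as u).
// The floating-point instructions are the "useful" scientific work; the integer
// instructions handle the loop counter, the bounds test, and the address
// computation behind every load and store.
void update(std::vector<double>& u, const std::vector<double>& rhs, double dt)
{
    for (std::size_t i = 1; i + 1 < u.size(); ++i) {   // integer: counter, compare, branch
        double lap = u[i - 1] - 2.0 * u[i] + u[i + 1];  // float: 3 ops; integer: address math
        u[i] += dt * (lap + rhs[i]);                    // float: 3 ops; integer: address math
    }
}
// Depending on how the compiler lowers such a loop, the resulting
// integer-to-floating-point instruction ratio can easily fall in the
// 2:1 to 4:1 range mentioned above.
```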
The researchers suspect that many of these codes are also being hampered by memory latency, the time delay between when data is requested by the processor and when it begins to arrive. Latency is largely driven by the physical distance between the processors and memory. However, a physics code can be written or modified to minimize the number of times the same piece of data must be retrieved from memory, reducing the impact of latency.
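One common way to reduce repeated retrievals is to carry recently used values in local variables so each memory location is loaded only once per pass. The sketch below is an illustrative restructuring of the hypothetical loop above, not a description of how HYDRA is written.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical sketch of restructuring for data reuse: the three neighbor values
// each iteration needs are kept in local variables (registers), so each element
// of u is loaded from memory once instead of up to three times, exposing fewer
// requests to memory latency.
void update_reuse(std::vector<double>& out, const std::vector<double>& u, double dt)
{
    if (u.size() < 3 || out.size() < u.size()) return;  // the sketch assumes compatible sizes
    double left = u[0], mid = u[1];
    for (std::size_t i = 1; i + 1 < u.size(); ++i) {
        double right = u[i + 1];                        // the only new load this iteration
        out[i] = mid + dt * (left - 2.0 * mid + right);
        left = mid;                                     // shift the reused values along
        mid  = right;
    }
}
```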
The researchers also found that codes experiencing performance limitations tended to have irregular data access patterns. For instance, unstructured mesh codes and algorithms often require a processor to perform more integer operations to find and retrieve data than other types of codes do. Indeed, comparisons showed that some of the code packages were reaching an integer-operation limit well before they hit a bandwidth limit. Irregular access patterns also lead to latency problems, because the processor cannot easily predict which data will be needed next and fetch it ahead of time.
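The extra integer work and the unpredictability of such patterns show up in a small hypothetical gather loop of the kind an unstructured-mesh code might contain; the function and variable names here are illustrative, not taken from any LLNL package.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical sketch of indirect (gather) access on an unstructured mesh.
// Each element must first load a list of node indices (pure integer work) and
// then fetch values through those indices; the addresses are unknown until the
// indices arrive, so the hardware cannot easily prefetch the data.
double element_sum(const std::vector<int>& node_ids,      // this element's node indices
                   const std::vector<double>& node_vals)  // values stored per mesh node
{
    double sum = 0.0;
    for (std::size_t n = 0; n < node_ids.size(); ++n) {
        int id = node_ids[n];      // extra integer load just to locate the data
        sum += node_vals[id];      // gather: access location is data-dependent
    }
    return sum;
}
```

On a structured mesh, by contrast, neighbor locations follow directly from simple index arithmetic, so both the integer overhead and the unpredictability are smaller.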
Some of the researchers’ findings are already informing next-generation HPC procurement decisions and broader co-design efforts.