As processors have become faster over the years, the cost of a prototypical “computing” operation, such as a floating point addition, has fallen rapidly. The relative cost of communicating data, on the other hand, has risen. For example, even on a high-end supercomputer, a floating point addition takes less than a quarter of a nanosecond (amortized, in a pipelined unit), a DRAM access takes 50 ns, and receiving data from another node takes thousands of nanoseconds. The comparison is also stark in terms of energy: currently, a floating point operation costs 30-45 picojoules (pJ), an off-chip 64-bit memory access costs 128 pJ, and remote data access over the network costs between 128 and 576 pJ. To optimize communication and overall application performance and to reduce energy costs, it is imperative to maximize data locality and minimize data movement, both on-node and off-node.
Profiling Tools
Using profiling tools, we can characterize different classes of applications. With these profiles, we can determine whether the performance of an application is bound by memory, computation, or communication. After preliminary characterization, we can utilize specialized profilers to measure specific phases of the application in detail.
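As a rough illustration of the preliminary characterization step, the sketch below takes the time a profile attributes to each category and reports which one dominates. It is a simplification under stated assumptions: the function name classify_bound and the timing numbers are hypothetical, and a real profiler would typically report hardware counters and overlapping phases rather than three clean totals.

```python
def classify_bound(phase_seconds):
    """Return the dominant category and each category's share of total time.

    phase_seconds maps a category name (e.g. "memory") to the seconds
    that a profile attributes to it. Hypothetical helper, not a real tool's API.
    """
    total = sum(phase_seconds.values())
    if total <= 0:
        raise ValueError("profile contains no measured time")
    shares = {name: t / total for name, t in phase_seconds.items()}
    dominant = max(shares, key=shares.get)
    return dominant, shares


# Made-up timings for a single application run (seconds).
profile = {"computation": 12.0, "memory": 30.0, "communication": 18.0}
bound, shares = classify_bound(profile)
print(f"application appears {bound}-bound")
for name, share in shares.items():
    print(f"  {name}: {share:.0%} of measured time")
```

A result like this only indicates where to look next; the dominant category is the one worth examining with a more specialized profiler.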
Mapping
In addition to profiling, we can predict the performance benefits of intelligently mapping applications onto the machine. Creating these mappings is time-consuming and difficult, because the general mapping problem is NP-hard. To address this, we plan to create a holistic communication profile by combining a variety of network and system measurements, such as the communication graph, network counters, and the average number of hops. This information can then be used alongside the minimum latency and peak bandwidth values to predict the performance of alternative mappings, either analytically, using general models developed from real-world tests, or through simulations of the network. Our long-term goal is to design, implement, and evaluate algorithms for the near-optimal mapping of tasks in a parallel application to the underlying network topology in order to improve performance.
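To make the idea of comparing alternative mappings concrete, the sketch below computes two simple quantities from a communication graph and a task-to-node mapping: hop-bytes (bytes sent multiplied by hops traveled, summed over all communicating pairs) and the byte-weighted average number of hops. Hop-bytes is a commonly used proxy for network load and is used here as an assumption, not as the project's actual model. The torus topology, the function names, and the message sizes are all hypothetical.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinates a and b on a torus.

    Each dimension has wraparound links, so the distance along a
    dimension of size d is the shorter of the two directions.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(comm_graph, mapping, dims):
    """Sum of (bytes * hops) over every communicating task pair."""
    return sum(
        nbytes * torus_hops(mapping[src], mapping[dst], dims)
        for (src, dst), nbytes in comm_graph.items()
    )


def average_hops(comm_graph, mapping, dims):
    """Byte-weighted average number of hops per byte sent."""
    total_bytes = sum(comm_graph.values())
    if total_bytes == 0:
        return 0.0
    return hop_bytes(comm_graph, mapping, dims) / total_bytes


# Hypothetical example: four tasks exchanging 100 bytes around a ring,
# placed on a one-dimensional torus with four nodes.
dims = (4,)
comm_graph = {(0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 0): 100}

in_order = {0: (0,), 1: (1,), 2: (2,), 3: (3,)}   # neighbors stay adjacent
scrambled = {0: (0,), 1: (2,), 2: (1,), 3: (3,)}  # tasks 1 and 2 swapped

for name, mapping in [("in_order", in_order), ("scrambled", scrambled)]:
    print(name, hop_bytes(comm_graph, mapping, dims),
          average_hops(comm_graph, mapping, dims))
```

In this toy case the in-order placement yields 400 hop-bytes (1.0 hops on average) while the scrambled placement yields 600 (1.5 hops on average). A metric like this is cheap to evaluate, which matters because the number of possible mappings grows factorially with the number of tasks; it ignores contention and routing, which is where network counters or simulation would come in.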