ExReDi | Computing

Because of the end of Dennard scaling, computing capability is increasing through more processing units, not faster clock speeds. Thus, future computer architectures will provide much more on-node parallelism, and applications must take advantage of this.

Unfortunately, standard low-order algorithms for problems in fluid and plasma dynamics already fail to use all of the floating-point capability available on-node; algorithms for these models are typically limited by memory bandwidth. In addition, with the massive increase in the number of components and with processors running close to threshold voltage, resilience is anticipated to become a problem.

As part of the Resilient Extreme-scale Solvers initiative in DOE ASCR, the Extreme Resilient Discretization project (ExReDi) was established to address these challenges for algorithms common in fluid and plasma simulations. We began with the perspective that one must consider the discretization and the solver together. In particular, high-order discretizations possess higher arithmetic intensities (the ratio of floating-point operations performed to bytes moved) and require fewer degrees of freedom to achieve a fixed level of accuracy.

As the roofline performance model in the figure shows, an algorithm needs a sufficiently high inherent arithmetic intensity to move out of the bandwidth-limited region of the roofline.

The roofline model shows the on-node performance limits set by the memory bandwidth and the processor capability. Low-order algorithms are fundamentally bandwidth limited. High-order algorithms have the potential to reach peak performance.
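The roofline bound can be stated compactly: attainable performance is the smaller of the processor's peak rate and the product of memory bandwidth and arithmetic intensity. The following is a minimal sketch of that bound. The node parameters (1000 GFLOP/s peak, 100 GB/s bandwidth) and the two arithmetic intensities are hypothetical values chosen only for illustration; they are not measurements from this project.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound in GFLOP/s for a kernel with the given
    arithmetic intensity (flops per byte moved)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the bandwidth roof meets the compute roof."""
    return peak_gflops / bandwidth_gbs


# Hypothetical node parameters (assumptions, not project data).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# Hypothetical arithmetic intensities for two kinds of kernels.
kernels = {
    "low-order (assumed 0.25 flop/byte)": 0.25,
    "high-order (assumed 16 flop/byte)": 16.0,
}

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"ridge point: {ridge} flop/byte")
    for name, ai in kernels.items():
        bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        regime = "bandwidth-limited" if ai < ridge else "compute-limited"
        print(f"{name}: at most {bound} GFLOP/s ({regime})")
```

With these assumed numbers, any kernel below 10 flop/byte sits under the sloped bandwidth roof, so the low-order example is capped at 25 GFLOP/s regardless of how fast the processor is, while the high-order example can in principle reach the 1000 GFLOP/s compute roof.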

In this project, we are considering high-order finite volume methods, primarily for hyperbolic problems; high-order particle-in-cell (PIC) methods for kinetic plasma and astrophysical simulations; high-order local discrete convolution methods for Poisson and Maxwell models; and low-memory, Full Approximation Storage (FAS) multigrid methods based on parallel segmental refinement. In addition, we are developing algorithm-based fault tolerance (ABFT) techniques for these algorithms in the context of block-structured adaptive mesh refinement grids. To demonstrate the advantages of our methods, we have also developed tools to measure arithmetic intensity and to test resiliency.
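To give a flavor of what algorithm-based fault tolerance means in general, the sketch below uses the classic checksum idea for a matrix-vector product: the sum of the entries of y = A x must equal the dot product of the column sums of A with x, so a mismatch signals a corrupted result. This is a generic textbook illustration, not the ABFT technique developed in this project; the function names, the tolerance, and the `fault` parameter used to inject a simulated error are all hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x using plain Python lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def column_sums(A):
    """Checksum row c with c[j] = sum over i of A[i][j]."""
    return [sum(col) for col in zip(*A)]


def checked_matvec(A, x, tol=1e-9, fault=None):
    """Compute y = A x and verify it against a checksum.

    fault is a hypothetical testing hook: a pair (i, delta) that adds
    delta to y[i] to mimic a soft error in the computed result.
    Returns (y, ok), where ok is False if the checksum test fails.
    """
    y = matvec(A, x)
    if fault is not None:
        i, delta = fault
        y[i] += delta  # simulated corruption of one output entry

    # Invariant: sum(y) == dot(column_sums(A), x), up to rounding.
    expected = sum(c * xj for c, xj in zip(column_sums(A), x))
    actual = sum(y)
    ok = abs(expected - actual) <= tol * max(1.0, abs(expected))
    return y, ok


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 3.0]]
    x = [1.0, 2.0]
    print(checked_matvec(A, x))                    # clean run
    print(checked_matvec(A, x, fault=(0, 0.5)))    # injected error
```

The check costs one extra dot product, which is small next to the matrix-vector product itself. A single checksum detects an error but does not locate it; schemes that also locate or correct errors need additional checksums.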