Co-Design: Deploying Leading-Edge Computing Capabilities


“Performance portability, fault tolerance, and accelerators - oh my!” | Simon McIntosh-Smith | May, 2016

Abstract: In this talk we’ll look at the University of Bristol’s work on developing performance-portable applications in the context of a range of relevant codes, including transport, hydrodynamics and diffusion. Our group is one of the first to perform in-depth and independent comparisons of emerging and established parallel programming APIs, including RAJA, Kokkos, OpenMP 4.x, OpenACC, OpenCL and CUDA. These studies have been performed at scale on large accelerated machines, including Titan and Piz Daint. We’ll also cover some relatively recent work of ours in which we have developed application-based fault tolerance techniques for the sparse iterative CG solver in a diffusion code.

IBM OpenMP4 Hack-a-thon Outbrief and Lessons Learned | Erik W. Draeger, Ian Karlin, Tom Scogland, David Richards, Jim Glosli, Holger Jones, David Poliakoff, Adam Kunen | January, 2016

Abstract: This white paper summarizes the lessons learned by the LLNL co-design team who attended a “hack-a-thon” at IBM Research in Nov, 2015 to gain early experience with pre-release versions of the IBM CORAL compilers that support the OpenMP 4.5 standard. The paper discusses the advances that were made between OpenMP 4.0 and 4.5, as well as the challenges that remain for application programmers wishing to adopt this new standard for use on heterogeneous node architectures.

Use Cases for Large Memory Appliance/Burst Buffer | Rob Neely, Ian Karlin, Bert Still, Adam Bertsch | January, 2014

Abstract: This presentation outlines some possible uses for large memory presented to the application in the form of a separate memory “appliance” or burst buffer. These use cases were gathered from an applications perspective, and prioritized based on the likelihood that our application teams at LLNL would adopt them. Pros and cons of each use case are briefly discussed, and we hope that this opens a dialogue in the co-design community to best utilize this emerging technology.

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application | Ian Karlin, Abhinav Bhatele, Jeff Keasler, Bradford L. Chamberlain, Jonathan Cohen, Zachary Devito, Riyaz Haque, Dan Laney, Edward Luke, Felix Wang, David Richards, Martin Schulz, Charles H. Still | May, 2013 | Presented at 2013 IEEE 27th International Symposium on Parallel and Distributed Processing

Abstract: Parallel machines are becoming more complex with increasing core counts and more heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we compare several implementations of LULESH, a proxy application for shock hydrodynamics, to determine strengths and weaknesses of different programming models for parallel computation. We focus on four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models. In evaluating these models, we focus on programmer productivity, performance and ease of applying optimizations.

A Case for Improved C++ Compiler Support to Enable Performance Portability in Large Physics Simulation Codes | Rich Hornung, Jeff Keasler, LLNL | April, 2013

Abstract: Vendors of modern computer platforms typically include the capacity of SIMD vector units in advertised theoretical peak system performance. However, the extensive, non-portable source code modifications required for compilers to generate SIMD vector instructions are not manageable in large simulation codes. Thus, realized performance is substantially less than one quarter to one eighth of machine capability on current platforms. In this paper, we argue that to increase efficiency of machine usage, compilers must improve their support for SIMD vectorization and other optimizations. We show that good vectorization is possible with current compiler technology but that it is impractical to achieve in large codes, due to maintenance and portability concerns. We enumerate specific corrective actions that compiler vendors could pursue to resolve some key issues. These compiler improvements would also allow us to exploit other important forms of parallelism in a portable manner through high-level software constructs.

ASC Co-design Proxy App Strategy | Mike Heroux, Rob Neely, Sriram Swaminarayan | Jan, 2013

Abstract: This position paper outlines a high-level strategy for the development of proxy applications. We start by proposing a common taxonomy (mini-apps, skeleton apps, and kernels - with proxy apps as an all-encompassing term). We also include a suggested template for documenting proxy applications so that it is easy for potential users to quickly determine the applicability of a given proxy application to the study being undertaken. We also include a proposed set of metrics that can be used to determine how representative a proxy application is compared to the full application or algorithm it is intended to represent. Finally, a basic set of minimal requirements for source code and data management is suggested.

National Security Applications Co-Design: A Framework for an ASC Tri-Lab Project | April, 2012

Abstract: This white paper outlines a framework for the launch of an ASC Tri-Lab Co-design project, with a broad description of the issues that must be addressed, along with a proposed set of coordination activities.

Survey of Novel Programming Models for Parallelizing Applications at Exascale | Rich Cook, Evi Dube, Ian Lee, Lee Nau, Charles Shereda, Felix Wang | November, 2011

Abstract: This technical report provides a survey of existing and emerging programming models that may address issues related to programming advanced architectures. Chapel, X10, Fortress, Cilk Plus, Intel Threading Building Blocks, UPC, AMPI, Charm++, OpenCL, and CUDA are all considered, with an eye toward the potential applicability of these to our ASC mission base.

Scientific Grand Challenges for National Security: The Role of Computing at the Extreme Scale | October, 2009

Abstract: This is a report from one of the ASCR Scientific Grand Challenges Workshops detailing the specific needs for exascale computing in national security.