Unlike typical computer programs, in which only a single stream of instructions executes at a time, supercomputer programs must coordinate potentially millions of tasks simultaneously. The multiphysics simulations that run on LLNL’s high performance machines divide models of the physical world into smaller chunks and distribute them across millions of processors. Each processor calculates the behavior of one of these small pieces, frequently sharing its results with the other processors through an ultra-low latency network.
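As a rough illustration of this division of labor (a minimal sketch using the open-source mpi4py library, not an excerpt from any Laboratory code), each process might advance its own chunk of a one-dimensional grid and trade boundary values with its neighbors at every step:

```python
# Minimal sketch (not Laboratory code): each MPI rank owns one chunk of a 1-D
# domain, advances a simple diffusion update, and exchanges boundary ("halo")
# values with its neighbors each step. Requires mpi4py and NumPy.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Local chunk of the global domain, padded with one ghost cell on each side.
local = np.zeros(100 + 2)
local[1:-1] = rank  # placeholder initial condition

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(10):
    # Exchange halo cells with neighboring ranks.
    comm.Sendrecv(sendbuf=local[1:2], dest=left,
                  recvbuf=local[-1:], source=right)
    comm.Sendrecv(sendbuf=local[-2:-1], dest=right,
                  recvbuf=local[0:1], source=left)
    # Explicit diffusion update on the interior cells of this rank's chunk.
    local[1:-1] += 0.25 * (local[:-2] - 2 * local[1:-1] + local[2:])
```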
The particular piece of the model being simulated on a given processor may affect that processor’s performance, so adaptation is key to many of the numerical algorithms used on high performance computing (HPC) systems; without it, many simulations would not be sufficiently accurate or even feasible. If the processors do not work efficiently, time in the simulated world slows to a crawl.
LLNL scientists spend many painstaking hours ensuring that simulations can run as fast as possible, and the difficulty of this task is growing as supercomputers become more complex. Porting production codes to new platforms is already time consuming, and today’s turnaround of four to six months may become years on next-generation machines.
Funded through the Department of Energy’s Early Career Research Program, computer scientist Todd Gamblin is leading an effort to mitigate this problem with a project designed to accelerate the adaptation of existing scientific simulation codes to next-generation exascale systems. Gamblin’s project leverages machine learning to develop algorithms that can predict the performance of application codes even when those codes adapt based on their inputs. Rather than requiring scientists or engineers to tune their codes manually with explicit instructions, machine learning would let codes “learn” from past executions.
“There is a desperate need for insightful modeling techniques that can simplify analyzing and tuning HPC applications,” says Gamblin.
Over the course of the five-year project, which began in 2014, Gamblin and his team will develop statistical models of HPC applications that can represent data-dependent and adaptive code behavior. Such models will help predict the performance of realistic applications and identify where performance problems originate in a system. In addition, the team will develop techniques to reduce the complexity of application models.
“We are treating performance analysis as a data science problem,” says Gamblin. “The ability to accurately measure, analyze, and visualize performance data will allow us to increase throughput of the entire computation center to do more complex science simulations that are important to the Laboratory’s missions.”
The statistical models under development will use both supervised and unsupervised machine-learning techniques to determine performance predictors and better characterize data at run time. Supervised learning methods, which learn from labeled examples of past performance, will be used to predict the performance of algorithms for single sets of inputs. Unsupervised models, on the other hand, look for structure in large data sets.
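As a rough sketch of what the supervised side could look like (using the scikit-learn library on synthetic data; the feature names and numbers are illustrative assumptions, not the project’s actual models or measurements), a regression model can be trained on past runs to predict the run time of a new input configuration:

```python
# Hedged sketch of supervised performance prediction: learn a mapping from
# a run's input features to its run time, using synthetic placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-run features: problem size, processor count, refinement level.
X = rng.uniform([1e4, 8, 1], [1e7, 4096, 6], size=(500, 3))
# Synthetic "measured" run times standing in for past executions.
y = X[:, 0] / X[:, 1] * (1 + 0.2 * X[:, 2]) + rng.normal(0, 50, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("Predicted run times for held-out configurations:", model.predict(X_test[:3]))
```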
“We will use unsupervised learning techniques such as correlation analysis and clustering to identify features that predict data-dependent performance of applications,” says Gamblin. “Combined, these models will be able to predict the performance of full-system, multiscale, adaptive applications.” Additional techniques will be developed for comparing the performance of different codes to one another.
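A correspondingly rough sketch of the unsupervised side (again with scikit-learn and synthetic numbers; the per-process measurements are hypothetical) might cluster processes by behavior and check which measurements correlate with run time:

```python
# Hedged sketch of unsupervised analysis: cluster per-process performance
# measurements and run a simple correlation analysis against run time.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical per-process measurements: work per zone, messages sent, run time.
samples = np.column_stack([
    rng.normal(1.0, 0.3, 1000),   # relative work per zone
    rng.poisson(40, 1000),        # messages per time step
    rng.normal(2.0, 0.5, 1000),   # seconds per time step
])

# Group processes with similar behavior; uneven groups can point to hot spots.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(samples))

# Which measurements track run time most closely?
corr = np.corrcoef(samples, rowvar=False)
print("cluster sizes:", np.bincount(labels))
print("correlation with run time:", corr[-1, :-1])
```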
To ensure the team’s methodology—called Modeling for Exascale Applications and Data—will work for full production applications, it will initially be run using proxy apps (smaller, more portable applications that mimic larger codes of interest). As an example, the LULESH proxy app models a very small but computationally intensive part of the Laboratory’s sophisticated ALE3D multiphysics code. While ALE3D uses many physics models that interface with a unified mesh and physics data model, LULESH uses a much simpler, smaller mesh.
“Currently, there is no way to validate proxy apps against production codes, and little or no work has been done on modeling the truly dynamic, data-dependent algorithms that dominate the workloads of integrated applications,” says Gamblin. “By using our models to validate proxy applications against codes, we will ensure that proxy apps can be trusted as representative of production code performance.”
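One simple form such a validation could take, sketched here with made-up numbers (the run times below are placeholders, not measurements of LULESH or ALE3D), is to run both the proxy and the production code across the same configurations and test whether the proxy ranks them the same way:

```python
# Hedged sketch of proxy-app validation: compare how a proxy app and a
# production code rank the same set of configurations. All numbers are
# synthetic placeholders used only to illustrate the comparison.
import numpy as np
from scipy.stats import spearmanr

configs = [64, 128, 256, 512, 1024]  # hypothetical processor counts
proxy_times = np.array([41.0, 22.3, 12.1, 7.4, 5.0])            # proxy app
production_times = np.array([390.0, 205.0, 118.0, 70.0, 47.0])  # production code

# Rank correlation: does the proxy order configurations the same way the
# production code does, even if absolute run times differ?
rho, _ = spearmanr(proxy_times, production_times)
print(f"rank correlation between proxy and production run times: {rho:.2f}")
```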
Gamblin will receive up to $2.5 million in funding over five years for this project. He says, “The idea is for these models to provide simulation developers with insights that allow them to quickly optimize the performance of their code, ensuring that applications can take full advantage of the performance of future exascale machines.”
Work such as this is imperative for LLNL to continue performing the cutting-edge scientific simulations it needs to understand the complex scientific processes that underpin its mission objectives. Thanks to work like Gamblin’s, Livermore will be ready to take scientific exploration to the next step when exascale systems come online.