This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with the processing power of more than 2 exaflops, or 2 quintillion (10¹⁸) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.

Previous: Prepping for performance | Next: Collaboration is key


The road to El Capitan isn’t a single lane. It’s more like a superhighway, with many specialized working groups solving an array of complicated challenges. While some of these pursuits may appear to travel in different lanes, they’re all heading in the same direction. “Working groups allow LLNL to work shoulder to shoulder with vendors to influence the technology ending up in El Capitan. They wield a surprisingly large amount of influence,” says Terri Quinn, Livermore Computing program lead.

Matt Leininger leads the Messaging Working Group, which is responsible for integrating the Hewlett Packard Enterprise (HPE) Slingshot interconnection network into the El Capitan ecosystem. Slingshot will enable large-scale calculations to be performed across many nodes. He explains, “The network interconnect is a critical technology for any of our high performance computing [HPC] platforms. Its ability to efficiently move data between multiple processing nodes and storage permits a wide variety of complex simulations and scientific workloads—from traditional HPC to machine learning and data analytics.”

Collaborating with Oak Ridge, Sandia, and Los Alamos National Laboratories, the Messaging Working Group focuses on network performance optimization for these workloads, including the necessary nonrecurring engineering around messaging processes. Additionally, Leininger notes, “Our team includes vendor partners from HPE and AMD [Advanced Micro Devices Inc.]. HPE provides the Slingshot interconnect, and AMD provides the APUs [accelerated processing units], so both vendors are integral to solving issues with messaging.”

Central to this effort are two software libraries related to the message passing interface (MPI) standard: the open-source Libfabric software and HPE’s proprietary MPI solution. Message passing helps different programs and processes work together on parallel computing architectures. Leininger points out, “It’s always good to support more than one MPI on an HPC system. This allows us to compare and contrast the two implementations and have an alternative for when bugs show up in one implementation.”
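To give a concrete sense of the message-passing model Leininger describes, the sketch below uses the standard MPI C API to combine partial results from every process. It is generic, illustrative MPI code compiled against any MPI library; it does not represent El Capitan’s specific Slingshot, Libfabric, or HPE MPI configuration.

```cpp
// Minimal MPI sketch: each rank contributes a local value, and every rank
// receives the global sum. The interconnect carries this exchange between
// nodes. Illustrative only; assumes a generic MPI installation.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each process holds a local partial result...
    double local = static_cast<double>(rank + 1);
    double global = 0.0;

    // ...and the reduction gathers contributions from all processes.
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) {
        std::printf("Sum over %d ranks: %f\n", size, global);
    }

    MPI_Finalize();
    return 0;
}
```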

To deploy the Slingshot interconnect, the group must manage the associated firmware, system software, automation processes, and more—all of which are first-generation technologies. El Capitan’s early access systems installed at the Livermore Computing Center have been extremely useful in this regard. “Most of our early access systems are being used to test on a single node or small number of nodes, so the next step will be testing the messaging on more and more nodes as El Capitan is deployed,” Leininger says. “What we’re doing will enable El Capitan to handle a class of workloads that was never possible before.”

Traveling beside the MPI communication lane on the El Capitan superhighway is a team dedicated to mathematical libraries. “These libraries are key to, and ubiquitous in, HPC modeling and simulation codes,” states computational scientist Ramesh Pankajakshan. “The arrival of GPUs and novel architectures like El Capitan’s have completely thrown a spanner in the works—in a good way. There is a lot more scope for making these libraries faster. It’s an open research topic.”

HPE and AMD have their own math libraries, and application teams may use other libraries such as hypre (linear solvers and multigrid methods), LAPACK (linear algebra), SUNDIALS (nonlinear solvers), or PETSc (numerical solvers). According to Pankajakshan, who leads LLNL’s contributions to El Capitan’s Math Libraries and Machine Learning Working Group, fine-tuning math libraries for complex codes on GPU- and APU-based systems requires “a lot of back and forth.” He adds, “We’re unique among the vendors’ customers because we have, say, applications with millions of lines of code written over 40 years, and we’re using cutting-edge architectures.”
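For readers unfamiliar with what these libraries do, the hedged sketch below solves a tiny dense linear system through LAPACK’s C interface (LAPACKE). Production codes of the kind Pankajakshan describes would more typically call GPU-resident vendor libraries or packages such as hypre or PETSc; this is only a small CPU-side illustration.

```cpp
// Hedged sketch: solve a 3x3 dense linear system A x = b using LAPACK's
// C interface (LAPACKE). Illustrative of what a math library call looks
// like, not of El Capitan's GPU-tuned library stack.
#include <lapacke.h>
#include <cstdio>

int main() {
    // 3x3 system stored in row-major order.
    double A[9] = { 4.0, 1.0, 0.0,
                    1.0, 3.0, 1.0,
                    0.0, 1.0, 2.0 };
    double b[3] = { 1.0, 2.0, 3.0 };   // right-hand side; overwritten with x
    lapack_int ipiv[3];                // pivot indices from the LU factorization

    // dgesv factors A with partial pivoting and solves for x in place.
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);

    if (info == 0) {
        std::printf("x = [%f, %f, %f]\n", b[0], b[1], b[2]);
    } else {
        std::printf("dgesv failed with info = %d\n", static_cast<int>(info));
    }
    return 0;
}
```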

El Capitan’s memory and storage setup affects how math libraries execute calculations. Pankajakshan notes, “Previously, moving data back and forth was computationally expensive, so you had to pick an execution space and stay there. With El Capitan in particular, and APUs in general, there are lots of opportunities. It’s always fun to work on the cutting edge.”
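To make the “execution space” idea concrete, the sketch below uses generic HIP managed memory so that the CPU and GPU operate on a single allocation without explicit copies. This is an illustrative assumption about APU-style programming in general, not a description of El Capitan’s actual memory system or of any specific library’s internals.

```cpp
// Hedged sketch of a shared execution space: CPU and GPU touch the same
// managed allocation, so no explicit hipMemcpy is needed. Generic HIP
// managed memory, used here only to illustrate the concept.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(double* x, double factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;   // GPU updates the shared array in place
}

int main() {
    const int n = 1024;
    double* x = nullptr;

    // One allocation visible to both host and device.
    hipMallocManaged(reinterpret_cast<void**>(&x), n * sizeof(double));
    for (int i = 0; i < n; ++i) x[i] = 1.0;       // CPU initializes

    scale<<<(n + 255) / 256, 256>>>(x, 2.0, n);    // GPU computes
    hipDeviceSynchronize();

    std::printf("x[0] = %f\n", x[0]);              // CPU reads the result
    hipFree(x);
    return 0;
}
```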


—Holly Auten & Meg Epperly