This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with the processing power of more than 2 exaflops, or 2 quintillion (10¹⁸) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.
Previous: It takes a village | Next: The right operating system
A supercomputer does not simply arrive in one piece ready to be plugged into a wall. The number and scale of components are staggering, to say nothing of the expertise required to assemble and connect them. How do the different types of processors work together? What type of floor will hold all that weight? How will the system draw electricity without crashing the power grid?
Then, even when the lights are blinking and the cooling system is humming, a researcher cannot simply press a button to begin running their scientific code. How will users log in? How will the code know which nodes to use, and when? Where will the massive amounts of simulation data go? How will administrators and users know if the system is working properly? What happens if another researcher has a completely different code and use case?
Questions like these are never far from the mind of Adam Bertsch, the Integration Project lead for El Capitan. He states, “My role is to bring all the various pieces together, both within Livermore Computing [LC] and with our vendor partners, to make sure that the system shows up, works, and gets accepted. My job is to make sure the facilities, hardware, and software teams have all the information about each other that they need—that they know what’s expected of them and that they’re ready.”
Essential Infrastructure
Bertsch served as the integration lead for LLNL’s 125-petaflop Sierra supercomputer, which came online in 2018, but every HPC system is different. For instance, LC teams must learn the ins and outs of vendors’ products: Sierra pairs IBM CPUs and NVIDIA GPUs with a Mellanox EDR InfiniBand interconnect, whereas El Capitan will use AMD APUs (accelerated processing units) with HPE/Cray’s advanced Slingshot interconnect. The differences extend to the number of cores and nodes, operating system versions, and power requirements.
Furthermore, Bertsch points out, “During Sierra’s preparation and deployment, we didn’t have a really large infrastructure project going on for months.” LLNL’s massive Exascale Computing Facility Modernization (ECFM) project was a prerequisite for siting El Capitan. Bertsch continues, “ECFM was very successful. [Project manager] Anna Maria Bailey and her team did a really great job bringing new power and cooling capabilities to the Lab as well as to the building that will house El Capitan. But we’re not done. Now we’re installing and testing the system-specific infrastructure, including how it works with the early access systems [EAS].”
Aligning the hardware delivery schedule with facilities and infrastructure upgrades is just one facet of integration dependency. “On the software side, we’re responsible for a large chunk of the system software configuration and implementation,” Bertsch explains. “So, we need to make sure we have onsite all of the hardware that’s required to bring up the software. That way, the software will be ready when the final hardware is hooked up. We aim to deploy El Capitan as efficiently as possible.”
The EAS machines play an important role in this readiness strategy. As smaller versions of El Capitan, the five systems—nicknamed RZNevada, RZVernal, Hetchy, Tenaya, and Tioga—have been in various stages of installation and testing since 2021. “We’ve had these systems onsite for a while,” Bertsch notes. “We use them to confirm the hardware setup, software stack, and application portability.”
Early Success and Future Benefits
One integration success has been getting HPE’s programming environment running on the Tri-Lab Operating System Stack (TOSS). Another is the extensive portability effort that enables scientific applications currently running on Sierra to run on the very different El Capitan EAS architecture. In addition, LLNL’s next-generation resource management and job scheduling software, Flux, is already running on Tioga. Bertsch says, “Application teams were able to start running on these new platforms much more quickly and effectively than they were with Sierra. And when their code is combined with Flux, scientists will be able to get the most out of El Capitan.”
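For readers curious about what scheduling work through Flux looks like in practice, the sketch below submits a job using Flux’s open-source Python bindings. It is a minimal illustration rather than El Capitan-specific guidance: the executable name, input file, and resource counts are hypothetical placeholders, and the script assumes it is run from inside an existing Flux instance.

    # Minimal sketch: submitting a simulation job through the Flux Python bindings.
    # The executable, input file, and resource counts below are placeholders.
    import os
    import flux
    from flux.job import JobspecV1

    # Connect to the enclosing Flux instance (assumes the script runs inside one).
    handle = flux.Flux()

    # Describe the work: 8 tasks across 2 nodes, one core and one GPU per task.
    jobspec = JobspecV1.from_command(
        command=["./my_app", "--input", "problem.yaml"],
        num_tasks=8,
        num_nodes=2,
        cores_per_task=1,
        gpus_per_task=1,
    )
    jobspec.cwd = os.getcwd()
    jobspec.environment = dict(os.environ)

    # Submit the job and print the identifier Flux assigns to it.
    jobid = flux.job.submit(handle, jobspec)
    print(f"Submitted job {jobid}")

The same kind of request can also be made from Flux’s command-line tools; the Python interface is shown here only because it keeps the example self-contained.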
As complicated as this exascale integration is, Bertsch is optimistic. He says, “The functionality that El Capitan is going to unlock for our users and for the programs is the most exciting aspect. The way people use computers to do their work will change.” Additionally, he sees benefits for the LC team that keeps the supercomputers operational 24/7/365. “Because we’re using TOSS, SysAdmins will be able to manage this new system the same way they manage our other systems. So, the transfer learning and onboarding will be much easier than it was with Sierra. When someone gets called in the middle of the night, they’ll be able to respond with what they already know,” Bertsch states. As for users who are accustomed to the Lab’s current HPC systems, he adds, “We’ll roll out Flux across the computing center, so researchers will be using the same tools they’re used to.”
Previous: It takes a village | Next: The right operating system
—Holly Auten & Meg Epperly