Road to El Capitan 1: It takes a village

This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with more than 2 exaflops of processing power, or more than 2 quintillion (10¹⁸) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.

Next: All the moving parts

The road to the NNSA’s first exascale-class supercomputer, El Capitan, began several years ago. The meticulous planning and strategic collaborations behind this undertaking are already bearing fruit. For instance, close coordination with the DOE Exascale Computing Project, the Office of Science Advanced Scientific Computing Research program, and the NNSA Advanced Simulation and Computing (ASC) program has paved the way for secure integration of national security applications with a robust, exascale-ready software stack. At LLNL, completion of the Exascale Computing Facility Modernization (ECFM) project, which finished on time in 2022 and under its $100 million budget, marked a major milestone in these preparations.

Partway through 2023, the road is growing steeper, with activity and focus intensifying for the Livermore Computing (LC) Program led by Terri Quinn, the LC Division led by Becky Springmeyer, and the Weapon Simulation and Computing (WSC) community led by Rob Neely. LC Chief Technology Officer Bronis de Supinski notes, “This is an exciting, and challenging, period with hardware deliveries, environment testing, and facilities completion. We’ve been connecting the 85-megawatt ECFM infrastructure to early access systems [EAS] since 2021, as well as configuring applications and system software to run on machines of this scale.”

El Capitan will be inherently different from Frontier, its sister system at Oak Ridge National Laboratory. The design combines Hewlett Packard Enterprise (HPE)/Cray’s advanced Slingshot interconnection network with processors from Advanced Micro Devices Inc. (AMD). LLNL is also the first supercomputing center to use HPE’s near-node local storage solution, called Rabbits. de Supinski adds, “We’ll use the ASC Tri-Lab Operating System Stack and the Livermore-developed Flux resource management software, and we have to make these work with HPE’s software from day one.”

According to de Supinski, significant efficiency gains in El Capitan’s architecture will come from AMD’s accelerated processing units, or APUs. He says, “All CPUs and GPUs will have equal access to memory in a truly unified memory space, so the challenges of moving data back and forth will be much lower. This memory architecture will increase our performance.”

These preparatory activities are under the purview of the El Capitan Center of Excellence, a collaboration of developers and experts at the NNSA labs, HPE, and AMD. de Supinski explains, “We like to deploy early hardware because we can fix any issues ahead of the larger system deployment. Through both collaborative and ad hoc meetings with the vendors and other partners, and even a ‘Rabbit hackathon,’ we’ve made a lot of progress.”

Within LLNL, siting El Capitan involves all of LC and a close partnership with WSC. de Supinski is quick to credit the work of others, stating, “People across the Lab are pulling in the same direction for what will be one of the best computing systems in the world. Managing a new partnership with vendors is challenging, but Terri Quinn’s experience combined with [WSC business manager] Kim Bosque’s finance background really make the process so much easier. I’m proud of LC’s fabulous hotline and tier-two customer support, collaboration with code teams, and respect for and from our users—not to mention our excellent facilities staff, system engineers, operators, and software and hardware experts. Working with such a great set of individuals makes doing anything worthwhile.”

—Holly Auten & Meg Epperly