Road to El Capitan 11: Industry investment

This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with the processing power of more than 2 exaflops, or 2 quintillion (10¹⁸) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.

Previous: Collaboration is key

Perhaps no other square mile of California hums with as much activity and purpose as Lawrence Livermore’s main campus. Amid the world-class experimental labs, specialized equipment, and manufacturing facilities, a workforce of more than 9,000 serves a national security mission through innovative R&D. One of the most exciting places these days is the Livermore Computing Center as the exascale El Capitan supercomputer and its unclassified siblings are brought online—a process that’s considerably more complicated than unboxing a laptop.

The Lab’s high performance computing (HPC) success depends in part on strategic relationships with U.S. commercial companies. From the early UNIVAC computers to today’s massively parallel systems, computer scientists and software engineers have worked closely with industry experts to design and field each generation of machines.

Livermore Computing Chief Technology Officer Bronis de Supinski says, “Success at huge tasks cannot be achieved alone, whether you are an individual or a big organization. It takes partnership. This maxim is true for building supercomputers, and Livermore Computing’s long history of strong, deep industry collaboration is born from our desire to succeed at that huge task.”

Hewlett Packard Enterprise (HPE) and Advanced Micro Devices Inc. (AMD) joined Livermore and the NNSA in the years-long El Capitan procurement and deployment. “The collaboration has worked extraordinarily well,” states HPE system architect Chris Brady. “This is the best customer relationship I’ve been a part of. It’s both pleasant and productive.”

El Capitan’s computational speed comes from AMD’s unique MI300A accelerated processing units (APUs), which combine CPUs and GPUs for higher-efficiency, higher-resolution simulations. Debuting in El Capitan, the APUs are bolstered by HPE’s Slingshot interconnect network, accelerator blades, and near-node storage components nicknamed “the Rabbits” (see the article “Storage in the exascale era” in this series). HPE also provides the liquid-cooled cabinets housing all of this hardware in LLNL’s primary machine room.

Terri Quinn, associate program director for Livermore Computing Systems and Environments, points out, “Our objective was to deliver to the NNSA the most capable computer possible within the given budget. To do this, we asked the vendors to build us a system that required them to push their technical capabilities hard. HPE and AMD have done this with El Capitan. They had the expertise and the courage to take risks.”

Aiming for the exascale threshold was a challenge for everyone involved. “The scale is huge. Everything is much more difficult, whether we’re upgrading firmware or figuring out how to run all the cables. For example, with smaller systems, five or six people can arrange and install the cables in a few days. This time, cabling took months,” Brady notes.

“We had 13 project teams working on different aspects of delivery and 144 milestones along the way. There was incredible complexity in all the tandem workstreams as well as the integration points between programming environments and system software,” explains Gina Norling, former HPE engineering program manager and now AMD’s senior engineering manager for the Center of Excellence HPC/AI.

Brick Stephenson, HPE director of program management for strategic customer engagements, adds, “The only way I can keep my head around the whole endeavor is to realize that I’m not the expert on any one thing. I don’t have to know all the answers; I just need to know who to loop in. Every day there’s a challenge you didn’t see coming. Transparency plus time equals trust.”

According to Keith Shields, HPE’s senior director of strategic program management, the key to a large-scale deployment is managing the small-scale aspects. He advises, “Break it into physical and time-bound components. Where do the power drops go? Where does the plumbing go? Then it’s not a scramble when the cabinets arrive.”

In addition, HPE staff drew on lessons learned with AMD when building the exascale Frontier supercomputer at Oak Ridge National Laboratory, says hardware project manager Randy Law. “Aspects of El Capitan were familiar because of Frontier, and what we discovered during that project benefited this one,” he states.

As prominent contributors to the global HPC community, these industry partners share LLNL and NNSA’s commitment to El Capitan’s success. Stephenson, who brought five decades of computing experience to the project, states, “I look forward to hearing about the new scientific discoveries this system will enable.” A relative HPC newcomer, field services engineer Jonathan Gutierrez notes, “This is a big job that takes lots of people. And assuming El Capitan debuts high on the Top500 list, I’ll be able to say, ‘I helped with that.’ We hope our excitement carries over to the users.”

The El Capitan project—which included siting three early access systems as well as the unclassified Tuolumne and RZAdams machines—highlights enduring connections between industry and national labs. Shields, who began his HPC career nearly 30 years ago as a software developer, routinely visited LLNL throughout the process. He notes, “I was born in Livermore and my dad was a Lab employee, so I’ve come full circle.”

Stephenson, too, has Livermore ties. He was a field engineer at the Lab in the 1970s before moving to Cray, which was later acquired by HPE. Gutierrez worked at both Sandia National Laboratories campuses before pivoting to an HPC career just a few years ago. He will continue to work onsite at the Livermore Computing Center, helping to run the day-to-day operations. “Learning from experienced folks has been wonderful,” Gutierrez says.

For many of LLNL’s vendor colleagues, bringing El Capitan online has been a professional high point. Brady even put his retirement plans on hold to see the project through. “It’s a really wonderful cap to my career,” he states. “And frankly, after El Capitan, what else could be as exciting?” Norling draws motivation from the project’s scale and difficulty, adding, “I have to pinch myself that I’m involved in this program. El Capitan has so many amazing technical aspects, but it comes down to the people. At the end of the day, it’s those relationships and connecting folks to make a solution that’s better than we could have ever produced on our own.”

—Holly Auten & Meg Epperly