This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with the processing power of more than 2 exaflops, or 2 quintillion (10¹⁸) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.
Supercomputers commissioned for the DOE’s national laboratories do not exist in a vacuum. The entire process—procurement to design to build to installation to testing to acceptance to operation—unfolds with input from, and close coordination of, a number of stakeholders. Teams representing all of the interested agencies ensure each step is completed properly and transferred smoothly to the next step. Each decision is made with collaborative considerations: How will this system serve the DOE’s mission? How will it enable multi-institutional teams to advance science? How will it push the leading edge of HPC technologies?
One such decision is about El Capitan’s operating system (OS), and the requirements are steep. A suitable OS for the NNSA’s first exascale machine needs to integrate fully with vendor software, function efficiently for users from multiple institutions, and streamline system administration. The chosen solution is the Advanced Simulation and Computing (ASC) program’s Tri-Lab Operating System Stack (TOSS).
The “Tri-Labs” are the NNSA’s national laboratories: Livermore, Los Alamos, and Sandia. Livermore Computing’s (LC) Jim Foraker, Systems Software and Security Development Group leader, explains, “TOSS meets everyone’s needs. It’s a Red Hat Linux distribution created here at Livermore to provide a common base environment for the commodity clusters at all three labs. And other organizations around the DOE complex, as well as NASA, are starting to use TOSS.”
‘In Charge of Our Own Destiny’
According to Trent D’Hooge, who serves as the system software lead for El Capitan, the right OS was undoubtedly going to be a custom solution. But what was the best way to fulfill the requirements? He says, “We made the choice to be in charge of our own destiny. We have full lifecycle support for TOSS, including release management, software packaging, quality assurance, configuration management, and much more. Choosing TOSS means we can run El Capitan for at least ten years.”
Foraker adds, “TOSS will simplify the user’s experience and the administration side of El Capitan. The benefits are huge.” The OS’s track record and robust functionality give LC’s system software experts confidence in this decision.
A sound decision still comes with challenges. TOSS needs to not only integrate with the AMD APUs, Hewlett Packard Enterprise (HPE)/Cray Slingshot interconnect, and Rabbit hardware, but also support system management and monitoring at El Capitan’s scale. Accordingly, El Capitan’s early access systems (EAS)—three of which rank in the top 200 HPC systems in the world—play a crucial role in software readiness. D’Hooge states, “The EAS machines are where we make sure all our system software supports the hardware that HPE provides. We’ve had success with Slingshot functionality already.”
Each of the five EAS is already running TOSS ahead of El Capitan. Foraker notes that scaling up is difficult and complicated, but the smaller systems are an important proving ground. “We can do this,” he says. “I think we’re in a place now where it’s becoming more and more obvious to everyone that we really can do this.”
—Holly Auten & Meg Epperly