TOSS | Computing

Capacity computing, the use of smaller and less expensive commercial-grade systems to run parallel problems with more modest computational requirements, allows the National Nuclear Security Administration’s (NNSA’s) more powerful supercomputers, or “capability” systems, to be dedicated to the larger, more complex calculations critical to stockpile stewardship. Reducing the total cost of ownership for robust and scalable high-performance computing (HPC) clusters is a significant challenge that affects many programs at Lawrence Livermore National Laboratory (LLNL).

In 2007, LLNL successfully led a first-of-a-kind effort to build a common capacity hardware environment, called the Tri-Lab Linux Capacity Clusters (TLCC1), at the three NNSA laboratories—Lawrence Livermore, Los Alamos, and Sandia. The TLCC1 experience proved that deploying a common hardware environment at all three sites greatly reduced the time and cost to deploy each HPC cluster.

TOSS: The right operating system to run capacity computing hardware

The Tri-Lab Operating System Stack (TOSS) was created to run on the TLCC systems from their inception. The goal of the TOSS project has been to increase efficiencies in the Advanced Simulation and Computing (ASC) Tri-Lab community with respect to both the utility and the cost of a common computing environment. The project delivers a fully functional cluster operating system, based on Red Hat Linux, capable of running Message Passing Interface (MPI) jobs at scale on hardware across the Tri-Lab complex.

TOSS provides a complete product with full lifecycle support. Well-defined processes for release management, packaging, quality assurance testing, configuration management, and bug tracking are used to ensure a production-quality software environment can be deployed across the Tri-Labs in a consistent and manageable fashion.

Building on TLCC1’s success, the second generation of Tri-Lab clusters (TLCC2) was deployed in 2011 and 2012. For TLCC2, LLNL and its laboratory and industry partners—Los Alamos, Sandia, Appro, Intel, QLogic, and Red Hat—sited 12 small- to large-scale commodity clusters for NNSA’s Advanced Simulation and Computing (ASC) and Stockpile Stewardship Program.

After their deployment, the TLCC2 clusters proved to be some of the most scalable, reliable, and cost-effective clusters that LLNL has ever brought into service. Users can run larger simulations with a higher job throughput than was previously possible on commodity systems. And with a consistent user environment, seamless software environment, and common operating system, the TLCC2 capacity computers made it easy for users to collaborate and resolve problems at any site.

Delivery of the third generation of Tri-Lab clusters, referred to as Commodity Technology Systems (CTS-1), began in April 2016. In support of this and subsequent large commodity system deployments, computer scientists at the three laboratories continued their partnership with Red Hat to support TOSS on systems of up to 10,000 nodes.

CTS-2, the fourth generation of Tri-Lab Linux clusters, began arriving in 2022, and deliveries continue to the present. These systems far surpass their predecessors in compute power. Dane, the largest system of this generation, has a peak performance of 10.6 petaflops.

Center-wide TOSS deployment improves user and staff experience

With the procurement of the El Capitan systems, Livermore Computing made the unprecedented decision to use TOSS as the operating system for its largest and most performant machines, including the 2.79-exaflops El Capitan. TOSS is now installed across the entire computer center, which gives users a more consistent environment and makes systems easier for administrators to maintain.

To reference TOSS, please cite the following paper: Edgar A. León, Trent D’Hooge, Nathan Hanford, Ian Karlin, Ramesh Pankajakshan, Jim Foraker, Chris Chambreau, and Matthew L. Leininger. TOSS-2020: A Commodity Software Stack for HPC. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC’20. IEEE Computer Society, November 2020.