Capacity computing—the use of smaller and less expensive commercial-grade systems to run parallel problems with more modest computational requirements—allows the National Nuclear Security Administration’s (NNSA’s) more powerful supercomputers, or “capability” systems, to be dedicated to the larger, more complex calculations critical to stockpile stewardship. Reducing the total cost of ownership for robust and scalable HPC clusters is a significant challenge that impacts many programs at LLNL.
In 2007, LLNL successfully led a first-of-a-kind effort to build a common capacity hardware environment, called the Tri-Lab Linux Capacity Clusters (TLCC1), at the three NNSA laboratories—Lawrence Livermore, Los Alamos, and Sandia. The TLCC1 experience proved that deploying a common hardware environment at all three sites greatly reduced the time and cost to deploy each HPC cluster.
The Tri-Lab Operating System Stack (TOSS) was created to run on the TLCC systems from their inception. The goal of the TOSS project has been to increase efficiencies in the NNSA Advanced Simulation and Computing (ASC) tri-lab community with respect to both the utility and the cost of a common computing environment. The project delivers a fully functional cluster operating system, based on Red Hat Linux, capable of running MPI jobs at scale on hardware across the tri-lab complex.
TOSS provides a complete product with full lifecycle support. Well-defined processes for release management, packaging, quality assurance testing, configuration management, and bug tracking are used to ensure a production-quality software environment can be deployed across the tri-lab in a consistent and manageable fashion.
Building on TLCC1’s success, the second generation of tri-lab clusters (TLCC2) was deployed in 2011 and 2012. For TLCC2, LLNL and its laboratory and industry partners—Los Alamos, Sandia, Appro, Intel, QLogic, and Red Hat—sited 12 small- to large-scale commodity clusters for NNSA’s Advanced Simulation and Computing (ASC) and Stockpile Stewardship Program.
Since their deployment, the TLCC2 clusters have proven to be some of the most scalable, reliable, and cost-effective clusters that LLNL has ever brought into service. Users can run larger simulations with higher job throughput than was previously possible on commodity systems. And with a consistent user environment, a common software stack, and a common operating system across sites, the TLCC2 capacity computers make it easy for users to collaborate and resolve problems at any site.
Delivery of the third generation of tri-lab clusters, referred to as Commodity Technology Systems, began in April 2016. In support of this and subsequent large commodity system deployments, computer scientists at the three laboratories are continuing their partnership with Red Hat to support TOSS on systems of up to 10,000 nodes.
By leveraging hardware and software designed for mass consumer markets, commodity systems can be constructed at a fraction of the cost of systems designed specifically for the much smaller HPC market. However, since HPC imposes different system requirements than the mass market, the challenge is in obtaining efficient performance from commodity systems, especially at large scale.
One challenge tri-lab computer scientists have been tackling is degraded application performance caused by resource contention with system management software. Although this software consumes only a small fraction of each node's time and resources, at large scale these brief interruptions can dramatically slow HPC applications by delaying synchronization across nodes.
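To see why a small per-node overhead becomes dramatic at scale, consider a simple probabilistic model (illustrative only, not from the article; the 0.1% figure is an assumption): in a bulk-synchronous application, a timestep cannot complete its global synchronization until every node finishes, so the step is delayed if any one node is interrupted.

```python
def prob_step_delayed(p_node: float, nodes: int) -> float:
    """Probability that at least one of `nodes` ranks is interrupted
    during a timestep, delaying the global synchronization.

    Assumes interruptions are independent across nodes -- a
    simplification for illustration.
    """
    return 1.0 - (1.0 - p_node) ** nodes

# With an assumed 0.1% chance of an OS interruption per node per timestep,
# delays are rare on one node but nearly certain across 10,000 nodes.
p = 0.001
for n in (1, 100, 10_000):
    print(f"{n:6d} nodes -> {prob_step_delayed(p, n):.4f}")
```

Under this toy model, noise that touches a single node one timestep in a thousand delays almost every timestep of a 10,000-node job, which is why desynchronized system daemons matter so much more on capability-scale commodity clusters.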
The objective of current efforts is to identify and reduce the interference (or noise) introduced by the system software on commodity clusters. By characterizing the major sources of noise and applying novel techniques to eliminate them or reduce their effect, these researchers aim to improve application performance and scalability. Enhancements will be integrated into TOSS and future commodity systems.
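A common way to characterize such noise is a fixed-work quantum (FWQ) style microbenchmark: time the same small unit of work repeatedly on an otherwise idle core, and treat samples well above the minimum as interference events. The sketch below is an illustrative Python version of the idea, not the tri-lab's actual measurement tooling; the 1.5x threshold is an arbitrary choice for demonstration.

```python
import time

def fixed_work(n: int = 20_000) -> int:
    # A deterministic unit of work; each run should take roughly the
    # same time in the absence of interference.
    acc = 0
    for i in range(n):
        acc += i * i
    return acc

def measure_noise(iterations: int = 500):
    """Time the same work repeatedly; samples far above the minimum
    suggest interference (noise) from other system activity."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fixed_work()
        samples.append(time.perf_counter() - t0)
    baseline = min(samples)
    noise_events = [s for s in samples if s > 1.5 * baseline]
    return baseline, noise_events

baseline, noise = measure_noise()
print(f"baseline {baseline * 1e6:.1f} us, {len(noise)} noisy samples")
```

Plotting the sample distribution from a run like this reveals both the frequency and the duration of interruptions, which helps attribute them to specific daemons, kernel threads, or interrupt handlers.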
To reference TOSS, please cite the following paper: Edgar A. León, Trent D’Hooge, Nathan Hanford, Ian Karlin, Ramesh Pankajakshan, Jim Foraker, Chris Chambreau, and Matthew L. Leininger. TOSS-2020: A Commodity Software Stack for HPC. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC’20. IEEE Computer Society, November 2020.