CTS-1 has delivered a two- to four-fold improvement in different performance areas over our past commodity systems.
Matt Leininger, Livermore Computing

Commodity Clusters

High performance computing (HPC) is a cornerstone of the National Nuclear Security Administration (NNSA) Stockpile Stewardship program. Bolstering computing at NNSA’s three national security labs is essential to ensure the safety, security, and reliability of the nation’s aging nuclear deterrent without testing. NNSA’s new capacity computing systems, called the Commodity Technology Systems-1 (CTS-1), are its third joint procurement under the Advanced Simulation and Computing (ASC) program. These computing clusters provide the needed computing capacity for NNSA’s day-to-day scientific work at the three labs managing the nation’s nuclear deterrent.

Under the CTS-1 contract, Penguin Computing, a Silicon Valley–based developer of high-performance Linux cluster computing systems, has furnished the labs with multiple systems, ranging in size from a few hundred to several thousand nodes. The contract also includes options to purchase additional systems specialized for HPC simulation, visualization, machine learning, or data sciences.

Penguin Computing has delivered over 25 petaflops of “capacity” computing capability to Los Alamos, Sandia, and Lawrence Livermore national labs. These commodity technology systems are designed to run a large number of jobs simultaneously on a single system. Advances in computational technology, enabled in part by NNSA’s ASC program, have brought down the cost of HPC systems from approximately $100 million per teraflop (trillion floating point operations per second) in 1995 to less than $5,000 per teraflop today, a reduction by a factor of more than 20,000. Each successive generation of computing system has provided greater computing power and energy efficiency.
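The cost comparison above can be checked with a few lines of arithmetic. The sketch below uses only the two dollar figures quoted in the text; the function and constant names are illustrative, and because today's cost is given as "less than $5,000," the result is a lower bound on the actual reduction.

```python
# Figures quoted in the article, in US dollars per teraflop.
COST_PER_TERAFLOP_1995 = 100_000_000
# The article says "less than $5,000", so this is an upper bound on today's cost.
COST_PER_TERAFLOP_TODAY = 5_000


def reduction_factor(old_cost: float, new_cost: float) -> float:
    """Return how many times cheaper new_cost is than old_cost."""
    if new_cost <= 0:
        raise ValueError("new_cost must be positive")
    return old_cost / new_cost


factor = reduction_factor(COST_PER_TERAFLOP_1995, COST_PER_TERAFLOP_TODAY)
print(f"Cost per teraflop fell by a factor of at least {factor:,.0f}")
```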

This tri-lab procurement model reduces costs through economies of scale based on standardized hardware and software environments at the three labs. The effort helps increase operational efficiencies and facilitates collaborations that benefit our nation’s security, support academia, and advance the technology that promotes American economic competitiveness. This strategy also allows NNSA’s more powerful advanced technology system supercomputers to be dedicated to the largest and most complex calculations critical to stockpile stewardship.

Deliveries under the procurement began in April 2016 and will continue through at least 2019. Each system is built of scalable units (SUs), and each SU represents approximately 230 teraflops of computing power. SUs are designed to be connected, much like Legos, to create more powerful systems.

Matt Leininger, Livermore Computing’s deputy for advanced technology projects, explains, “The SU concept was developed at Livermore and has been used in procurements for over a decade. This approach provides a flexible arrangement such that a vendor can seamlessly deliver both large and small machines.” Penguin has delivered 1-SU, 2-SU, 4-SU, 6-SU, and up to 14-SU systems to Livermore and several medium-sized machines of 2–8 SUs to Sandia and Los Alamos.
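To give a rough sense of scale for the system sizes mentioned above, the following sketch multiplies the SU count by the approximate 230 teraflops per SU quoted earlier. It assumes peak capability adds up linearly across SUs, which is a simplification for illustration rather than a published specification; the function name is hypothetical.

```python
# Approximate peak per scalable unit (SU), as quoted in the article.
TERAFLOPS_PER_SU = 230


def system_peak_teraflops(num_sus: int, teraflops_per_su: int = TERAFLOPS_PER_SU) -> int:
    """Estimate peak teraflops for a system built from num_sus scalable units.

    Assumption: peak capability scales linearly with the number of SUs.
    """
    if num_sus < 1:
        raise ValueError("a system needs at least one SU")
    return num_sus * teraflops_per_su


# System sizes the article says were delivered to Livermore.
for num_sus in (1, 2, 4, 6, 14):
    teraflops = system_peak_teraflops(num_sus)
    print(f"{num_sus:>2}-SU system: ~{teraflops:,} teraflops (~{teraflops / 1000:.2f} petaflops)")
```

Under this assumption, the largest 14-SU system comes to roughly 3.2 petaflops.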

The new computing clusters run multiple jobs faster than previous commodity systems, in line with industry trends toward better throughput rather than faster per-core performance. Even so, notes Leininger, “We tend to see a two- to four-fold improvement in different performance areas over the older TLCC2 systems.” For instance, CTS-1’s 100-gigabit-per-second Intel Omni-Path high-speed network performs about twice as well as the network in the TLCC2 (Tri-Lab Linux Capacity Cluster 2) systems.

Other features of the new machines are virtually invisible to users. The machines can be cooled with either water or air and have an option to use higher-voltage power for greater power efficiency than past clusters. Rather than having one power supply for each node as in many past clusters, the Penguin systems supply a whole cluster rack with one large power shelf. This change is intended to increase reliability and make power supply maintenance easier.

“Power and cooling really matter when it comes to total cost of ownership,” says Leininger. “They may not seem exciting, but these are very practical issues that help the new machines run better.”

Livermore’s CTS-1 clusters have been deployed in support of institutional computing and NNSA’s Life Extension Program. Such systems enable investigations into technical issues related to aging weapons systems. These investigations are critical to ensuring the safety, security, and reliability of the nuclear weapons in the stockpile as they age well beyond their intended deployment life.