Integrating the Sequoia Supercomputer
Ensuring the safety and efficacy of the nation’s nuclear stockpile in the absence of testing requires sophisticated hardware and software. Demands on computer systems used for weapons assessments and certification continue to grow as the nuclear stockpile moves further from the test base against which simulations are calibrated. Simulation capabilities are also strained because weapons behavior spans such a wide range of timescales, from detonation, which happens on the micro- to millisecond scale, to nuclear fusion, which lasts mere femtoseconds. Sequoia, an IBM BlueGene/Q supercomputer with 1.6 million cores and a peak performance of 20 petaFLOP/s (20 quadrillion floating-point operations per second), aims to supply the next level of performance necessary for stockpile stewardship simulations and calculations. It is also an ambitious leap forward in computing at LLNL.
While software experts were beginning code optimization and development work for Sequoia, engineering and facilities teams were preparing to install the new system in the two-level space previously occupied by ASC Purple at Livermore’s Terascale Simulation Facility. The teams first created a three-dimensional model of the computer room to determine how to best use the available space, with consideration for sustainability and energy efficiency. They then made changes to the facility on a coordinated schedule to avoid interrupting existing operations. Changes included gradually increasing the air and chilled-water temperatures to meet Sequoia’s requirements and to save energy.
Sequoia is equipped with energy-efficient features such as a novel 480-volt electrical distribution system for reducing energy losses and a water-cooling system that is more than twice as energy efficient as standard air cooling. Incorporating such systems into the space necessitated two significant modifications to IBM’s facility specifications. First, the facilities team designed innovative in-floor power distribution units to minimize congestion and reduce, by a factor of four, the conduit distribution equipment bridging the utilities and rack levels. Second, although IBM specified stainless-steel pipe for the cooling infrastructure, the team selected more economical polypropylene piping that met durability requirements and relaxed water-treatment demands. The polypropylene piping also contributed to the building’s already impressive “green” credentials as a Leadership in Energy and Environmental Design gold-rated facility. Facilities manager Anna Maria Bailey notes, “Sequoia’s electrical, mechanical, and structural requirements are demanding. Preparing for Sequoia really pushed us to think creatively to make things work.”
Integrating any first-of-its-kind computer system is challenging, but Sequoia’s integration was Livermore’s most grueling in recent history because of the machine’s size and complexity. System testing and stabilization spanned a full 14 months, but the integration schedule itself was unusually tight, leaving a mere 5 weeks between delivery of the last racks in April 2012 and the deadline for completing testing with Linpack, a performance benchmark used to rank the world’s fastest computers. Issues ranged from straightforward inconveniences, such as paint chips in the water system during pump commissioning and faulty adhesives on a bulk power module gasket, to more puzzling errors, such as intermittent electrical problems caused by connections bent during rack installation. Sequoia’s cooling infrastructure also presented some initial operational challenges, including uneven node-card cooling and false tripping of a leak-detection monitor.
A more serious manufacturing defect was encountered during the final integration phase. During intentionally aggressive thermal cycling as part of the Linpack testing, the team experienced a high volume of uncorrectable and seemingly random memory errors. In effect, compute cards were failing at alarming rates. The integration team began removing cards while IBM performed random dye-injection leak tests. The tests revealed that the solder attaching chips to their compute cards was, in some instances, exhibiting microscopic cracks. Investigation revealed that unevenly applied force during manufacturing tests had damaged the solder on a portion of Sequoia’s cards. These cracks had widened during thermal cycling, overheating the memory controllers beneath.
The Livermore team overcame this and other integration hurdles with assistance from IBM computer scientists and BlueGene/Q’s unusually sophisticated error-detection control system software. Within 40 days of detecting the memory errors, IBM and Livermore troubleshooters had pinpointed the cause and replaced all 25,000 faulty cards. Although the system was accepted in December 2012, IBM continued to work with Livermore to fine-tune Sequoia hardware and software until the machine converted to classified operations in April 2013, demonstrating a notable level of dedication and partnership in machine deployment.
None of the integration challenges prevented Sequoia from completing, with only hours to spare, a 23-hour Linpack benchmark run at 16.324 petaFLOP/s and assuming the lead position on the Top500 Supercomputer Sites list for a 6-month period in 2012. Kim Cupps, a division leader in Livermore Computing, observes, “We’re proud that Sequoia was named the world’s fastest computer, but really, what’s important about the machine is what we can do with it. When we hear people talk about the work they’re doing on the machine and what they’ve accomplished, that’s what makes all the work worthwhile.”
The speed, scaling, and versatility that Sequoia has demonstrated to date are impressive indeed. For a few months prior to the transition to classified work and the access limitation that entails, university and national laboratory researchers conducted unclassified basic-science research on Sequoia. Over the course of the science runs, Sequoia repeatedly set new world records for core usage and speed.