Changing the Magic Behind the Supercomputing Curtain

Thursday, October 15, 2015

LLNL’s world-class supercomputers support the Lab’s national security mission by running large, dynamic simulations. LLNL’s largest current machine, the Sequoia Blue Gene/Q system, consists of an impressive 1.6 million processor cores. In recent years, uncertainty quantification (UQ) studies have become increasingly critical for NNSA’s stockpile stewardship program. UQ studies evaluate the impact, or lack thereof, of input parameters on the outcomes of an ensemble of simulations. Sequoia was originally designed with UQ workloads in mind: it was intended to support hundreds of large, concurrent simulations, each using thousands of processors. However, the needs of the UQ program have evolved to require thousands or even millions of smaller simulation runs, which stresses the Sequoia system in fundamentally different ways.

Sequoia was designed to run no more than 192 concurrent jobs. When users initially tried to run UQ ensembles, they quickly hit system limits at many levels, causing their ensembles to fail. Tracking so many small jobs required more processes and more memory than was available on Sequoia’s front-end nodes, and the resource manager did not support distributing this workload to other nodes. The system was effectively limited by the capacity of the front-end nodes, not by the 1.6 million compute cores hungry for work behind them. Researchers were able to complete their ensembles by submitting many smaller batches of jobs over the course of months, but this increased the turnaround time for UQ studies and did not leverage the full processing power of Sequoia.

Livermore Computing worked extensively with UQ teams to address the system limitations that prevented more than 192 simultaneous runs on Sequoia. After an involved system debugging effort, including several dedicated system tests and the addition of 224 GB of memory to each job scheduling node, Sequoia was able to handle 10,000 UQ jobs at once, but users wanted more. One ambitious user expressed the desire to run one million jobs at once.

To satisfy this demand, Livermore computer scientists Todd Gamblin and John Gyllenhaal, along with computational scientists David Domyancic, Rao Nimmakayala, Richard Klein, Tammy Dahlgren, and Scott Brandon, took on the challenge of streamlining UQ calculations on Sequoia.

Rather than redesigning Sequoia’s entire resource management system, which could have taken years, Todd and John co-developed a tool called CRAM, which packs many small application instances into a single run. It accomplishes this by virtualizing the Message Passing Interface (MPI) used by LLNL’s HPC applications. CRAM changes the magic behind the supercomputer curtain by making many small MPI-based jobs appear to the system as one large job, much as it is more efficient to deliver 30 passengers in a single bus than in 30 single-passenger cars. CRAM works seamlessly and requires no modifications to existing applications; developers need only relink their MPI code with the CRAM library.
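This kind of virtualization can be built on the standard MPI profiling (PMPI) interface, which lets a library interpose on MPI calls at link time. The sketch below is a minimal conceptual illustration of that general technique, not CRAM’s actual source code; the fixed ranks-per-job mapping and the virtual_world variable are assumptions made for the example.

    /* Conceptual sketch of MPI virtualization via the PMPI profiling
     * interface. Each packed job is handed a sub-communicator carved
     * out of MPI_COMM_WORLD and treats it as its own "world". */

    #include <mpi.h>

    static MPI_Comm virtual_world = MPI_COMM_NULL;  /* the world this job sees */

    int MPI_Init(int *argc, char ***argv)
    {
        int err = PMPI_Init(argc, argv);            /* call the real MPI_Init */
        if (err != MPI_SUCCESS) return err;

        int world_rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Assumed mapping for illustration: 1,000 ranks per packed job.
         * A real tool would read this from a packed job description. */
        int job_id = world_rank / 1000;

        /* Ranks sharing a job_id end up in the same sub-communicator. */
        return PMPI_Comm_split(MPI_COMM_WORLD, job_id, world_rank,
                               &virtual_world);
    }

    /* Other interposed wrappers (MPI_Comm_rank, MPI_Comm_size, MPI_Bcast,
     * and so on) would substitute virtual_world wherever the application
     * passes MPI_COMM_WORLD, so each small job sees only its own ranks. */

Because the interposition happens in the library, the application itself is unchanged, which is why relinking is the only step developers need to take.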

To bring CRAM into production use, the team was allocated LLNL Technology Base (TechBase) funds. This allowed them to integrate CRAM into LLNL’s UQ Pipeline (UQP), a workflow system used to orchestrate large UQ studies. The UQP team, in close collaboration with Todd and John, re-engineered the way the pipeline dispatches batches of jobs. With these changes, the pipeline launches jobs through CRAM’s interface, and CRAM can be used with any of the codes already in the pipeline. UQ applications can now harness the full power of Sequoia.

The pairing of CRAM and the UQP is seamless: researchers run their applications unmodified in the pipeline, Sequoia’s hardware and software see the workload as the single, large runs they were designed for, and the UQP transparently runs much larger UQ studies for its users. For many users, the new capability has arrived just in time.

“I’ve seen the issues that UQ supercomputer users have been running into,” said John. “There are many customers who want this functionality. What drives us is that they want it now.”

This summer, the group ran a scaling test of the CRAM-enabled UQP using a small test application, launching 1.6 million runs on Sequoia in 40 minutes. Previously, submitting this many concurrent jobs would have overwhelmed Sequoia or any other large supercomputer. Even if the resource manager and front-end nodes could handle the load, the job launch overhead would have caused this set of runs to take four and a half days to complete. Now that such ensembles can be run in less than an hour, clients are knocking at the group’s door to leverage this capability for their work.
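To put those figures in perspective (taking the quoted numbers at face value), four and a half days is roughly 6,480 minutes, so completing the same ensemble in a single 40-minute packed run is about a 160-fold reduction in turnaround:

    4.5 days ≈ 6,480 minutes; 6,480 / 40 ≈ 160x faster
    4.5 days ≈ 389,000 seconds; 389,000 / 1,600,000 ≈ 0.24 seconds of launch overhead per run

The second line shows the implied cost of launching each run individually: even a fraction of a second of per-job launch overhead, multiplied across 1.6 million runs, adds up to days.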

“A tremendous amount of effort was devoted to optimizing the supercomputer for large jobs,” said Tammy Dahlgren, a computational scientist who worked to integrate CRAM into the UQP. “We are molding the new use case into the way the supercomputer was originally designed so a suite of many small jobs will run seamlessly.”

Current UQP efforts include working out the kinks identified during the test run. The team is refining CRAM’s integration based on lessons from that initial run and on experience using the tools with a large, multiphysics application. More work in the UQP will be needed to automate the assessment, error recovery, and rescheduling of individual simulation runs using the CRAM enhancements.

The CRAM integration work and initial test run results will be presented during a poster session at the Supercomputing 2015 conference (SC15). CRAM has been released as open source software so that other sites can leverage this capability. It is available from GitHub.