Running extremely large ensembles of small jobs for uncertainty quantification

Cramming Sequoia Full of Jobs for Uncertainty Quantification

Monday, June 2, 2014

Just a few years ago, scaling simulation codes to run on thousands of cores was the pinnacle of supercomputing prowess. With the arrival of Sequoia, a 1.5-million-core BlueGene/Q (BG/Q) system, using a million cores is now commonplace. In April 2014, over 50 application runs used the entire machine: 98,304 16-core nodes all crunching numbers in unison. Each parallel application run on Sequoia is called a job, and Sequoia was designed to run a small number of very large jobs. Over 900 jobs that month used 64,000 cores or more, more than the entire core count of LLNL’s next largest supercomputer, Zin. Recently, Sequoia users have discovered another very interesting use for all those cores: running extremely large ensembles of small jobs as part of statistical studies in the area of uncertainty quantification (UQ).

Over the past decade, UQ has played an increasingly central role in NNSA’s stockpile stewardship program and in other national security missions. UQ techniques are used to measure the sensitivity of simulations to input variability, thus helping scientists understand the likelihood of certain outcomes when some inputs are not exactly known.

Sequoia has more than enough compute power for UQ jobs, but large ensembles have stressed the machine in unanticipated ways. Specifically, Sequoia’s job management system was not designed to run more than 192 simultaneous jobs. The first attempts to run large UQ ensembles failed because the system resource manager could not scale to the job counts needed for UQ, but the BG/Q team at Livermore Computing (LC)—LLNL’s Dave Fox, Adam Bertsch, Py Watson, Lance Weems, Teresa Kamakea, and John Gyllenhaal and IBM/LLNL’s Roy Musselman and Paul Szepietowski—rose to meet the challenge.

Sequoia UQ users indicated that they wanted to run 10,000 jobs at the same time (while also confiding that a colleague was advocating running one million jobs). LC agreed to squash scaling bugs to push the limits as far as possible, and some of the early barriers were easily overcome.

Linux, the operating system on Sequoia’s front-end nodes, has safeguards to prevent users from overwhelming the system by launching huge numbers of processes (colloquially known as a “fork bomb”). Removing these safeguards allowed more than 192 jobs to be launched. At just over 512 jobs, however, the control system became so busy and unresponsive that another set of safeguards concluded it had hung and killed it before it could finish launching all of the jobs.
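The per-user process cap is one such safeguard. As a minimal illustration (assuming a Linux front-end node; this is not the actual configuration change LC made), the limit can be inspected from Python:

```python
import resource

# RLIMIT_NPROC caps how many processes a single user may create; it is
# the kind of "fork bomb" safeguard that initially limited how many
# jobs could be launched from the front-end nodes at once.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"per-user process limit: soft={soft}, hard={hard}")
```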

IBM quickly reproduced and fixed this scalability issue, allowing runs of up to 3,000 simultaneous jobs before the system panicked. UQ users were then asked to try 1,500 simultaneous jobs, well below the tested limit. After two days of successful runs, all 1,500 jobs were assigned to a single batch scheduling node, exhausting its memory. After the first batch scheduling node failed, the batch system helpfully moved all of the jobs to the next working batch scheduling node, causing a cascade of crashes. Only after purchasing additional memory and installing 256 GB in each batch scheduling node were the UQ users able to push their jobs through. With the extra memory, the system would panic only beyond 3,000 simultaneous jobs, and UQ users pushed through more than 40,000 16-node jobs in April with this additional headroom.

One of the challenges of operating one of the world’s largest supercomputers is that the vendor cannot test the full-scale system in advance. Subtle bugs frequently manifest only at scale, and LC staff must be able to handle issues the vendor did not foresee. In this case, IBM could only test its system with up to 2,048 jobs. Working with IBM, LC staff tracked down the cause of the system panic. With this fix, users can now run over 20,000 simultaneous jobs on Sequoia.

And what about that user who wanted to run one million jobs? Although the 10,000-job goal was surpassed, the system resource manager would need to be rewritten to scale further. Following a BG/Q status update, LLNL researcher Todd Gamblin mentioned that he might have a solution. Simulations running on Sequoia use the Message Passing Interface (MPI) to send messages over the machine’s network. Building on tools developed as part of his research on MPI performance measurement, Gamblin developed Cram, a tool that allows many small application instances to run inside a single large job. Each instance runs a separate sub-problem with its own inputs and working directory. LLNL’s Ares code has already implemented a similar capability at the application level, but Cram virtualizes the MPI interface and can be applied transparently to any application, without even modifying the source code.
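Conceptually, Cram partitions one large MPI job into many independent instances, each of which sees only its own sub-communicator in place of MPI_COMM_WORLD. The sketch below illustrates that splitting idea with mpi4py; it is not Cram’s implementation (Cram intercepts the MPI calls so the application itself is unmodified), and the instance size, directory layout, and run_instance stand-in are illustrative assumptions.

```python
import os

from mpi4py import MPI


def run_instance(comm, instance_id):
    # Stand-in for one small application run. Under Cram, the unmodified
    # application simply sees `comm` wherever it asked for MPI_COMM_WORLD.
    print(f"instance {instance_id}: rank {comm.Get_rank()} of {comm.Get_size()}")


world = MPI.COMM_WORLD
ranks_per_instance = 16                 # assumed size of each small UQ job
instance_id = world.Get_rank() // ranks_per_instance

# Split the single large MPI job into many independent sub-communicators,
# one per instance (color = instance_id, key = global rank).
sub_comm = world.Split(instance_id, world.Get_rank())

# Give each instance its own working directory and inputs (illustrative).
workdir = f"instance.{instance_id}"
os.makedirs(workdir, exist_ok=True)
os.chdir(workdir)

run_instance(sub_comm, instance_id)
sub_comm.Free()
```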

Because Cram packs everything into a single MPI job, it can run one million jobs while burdening the resource manager with the bookkeeping overhead of only one. In May, LC was able to run 1.5 million simultaneous jobs on Sequoia using the tool, one job per core on the machine. Surprisingly, the main bottleneck was creating the file that describes to Cram the jobs it should run. Done naively, creating the file requires 1.5 million command invocations on the front-end nodes, which can take up to 54 hours. By adding a Python interface to Cram, the file could be created with a single Python script, and file creation time for a million-job run dropped to less than 4 minutes. Once the file exists, it takes only 2 minutes to launch 1.5 million jobs from it on Sequoia. Users will now be able to use all the cores on Sequoia, no matter how many UQ jobs they need.
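The time savings come from replacing 1.5 million separate command invocations with one Python process that packs every job description in a loop. A sketch of such a script follows; the cram module, CramFile class, and pack() arguments are assumed names for illustration and may not match Cram’s actual Python interface.

```python
import os

# Hypothetical sketch: the `cram` module, CramFile class, and pack()
# arguments are assumptions for illustration; consult the Cram
# documentation for the real Python interface.
from cram import CramFile

njobs = 1_000_000
ranks_per_job = 1                      # e.g., one job per core

cf = CramFile("ensemble.cram", "w")
for i in range(njobs):
    # Each packed instance gets its own working directory, command-line
    # arguments, and environment.
    cf.pack(ranks_per_job, f"run.{i:07d}",
            ["input.deck", f"--sample={i}"], dict(os.environ))
cf.close()
```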

The team’s results were presented at the SciComP 2014 meeting.