Specialized Workloads

The Laboratory’s multifaceted pandemic response required urgent yet forward-thinking solutions. With a designated portion of CARES Act funding, the Laboratory recognized the need to support several projects simultaneously. This mandate meant a combination of repurposing, upgrading, and purchasing HPC resources. “For expediency, we began by leveraging what we already had,” states senior principal HPC strategist Matt Leininger.

As 2020 unfolded, LC brought online a range of hardware solutions to enhance scalability and collaboration, such as increasing the number of central processing units (CPUs) and GPUs available for complex computational workflows. In addition to processing power, storage and memory capacity were increased in anticipation of large datasets and cross-machine operations. [See Essential Investments (on the main COVID-19 R&D page) and Research Meets DevOps (on the Open to the Research Community page).]

“These investments address varying needs of the Lab’s scientific workloads. The upgrades are a productivity win and prepare us for an even more rapid response next time,” states former principal HPC strategist Ian Karlin.

The Mammoth (left), Ruby (top right), and Corona (bottom right) supercomputing clusters play key roles in Livermore’s COVID-19 R&D projects and are specialized for different scientific workloads.

Although Sierra is Livermore’s best-known supercomputer, other systems in LLNL’s HPC facility are specialized for this type of R&D work. Corona, a CPU–GPU hybrid cluster named for a solar eclipse, has gained hundreds of GPUs since it came online in 2019 and now has more than 1,600 in total. Karlin notes, “Corona has expanded researchers’ ability to run molecular dynamics simulations with greater throughput.”

The VAST file system was upgraded to increase both its capacity (now 5.2 petabytes) and its bandwidth, so that it provides storage and high input/output performance to Corona and other unclassified machines. Trent D’Hooge, LC’s deputy division leader for operations, explains, “Certain types of workloads fare better on VAST. It writes to small files very quickly, whereas our Lustre file system works better with larger files.”

Furthermore, all machines on LC’s open network—which enables collaborations with external researchers—are connected to VAST. D’Hooge adds, “Upgrading the file system benefits all of the COVID-19 work, which involves writing and converting many files and often spans multiple machines.”

Besides enhancements to existing systems, two computing clusters were acquired in 2020. The Mammoth system was an entirely new purchase intended for genetic sequence analytics and other bioinformatics projects. According to Leininger, genomic analysis had been running on other machines but not as efficiently as researchers wanted. “This new CPU-based system was tailored to these workloads,” he states. “Mammoth maximizes memory capacity and storage per node, which makes it well-suited for database searches and tasks like comparing genomes.” The cluster also supports graph analytics and ML tasks.

Another 2020 purchase was the 6.1-petaflop Ruby cluster, which has more than 1,500 nodes on a CPU-only architecture and is used for basic science discovery projects beyond COVID-19. “Parts of multiple COVID-19 research workflows are most efficient on CPUs,” says Karlin.

Altogether, and bolstered by VAST’s increased file system capacity, the upgraded HPC systems provide complementary capabilities to the antiviral drug design pipeline—from genetic sequencing on Mammoth to docking and scoring tasks on Ruby to molecular dynamics calculations on Corona and other GPU-based machines.
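
As a rough illustration, the stage-to-cluster pairing described above could be written down as a small lookup table. This is only a sketch: the `PIPELINE` list and the `cluster_for` helper are hypothetical names introduced here, not part of any LC tooling, and the real workflows also use other GPU-based machines for molecular dynamics.

```python
# Illustrative only: PIPELINE and cluster_for are hypothetical names.
# The stage-to-cluster pairing is taken from the paragraph above.
PIPELINE = [
    # (pipeline stage, cluster, architecture as described in the article)
    ("genetic sequencing", "Mammoth", "CPU-based, high memory and storage per node"),
    ("docking and scoring", "Ruby", "CPU-only"),
    ("molecular dynamics", "Corona", "CPU-GPU hybrid"),
]


def cluster_for(stage: str) -> str:
    """Return the cluster the article associates with a pipeline stage."""
    for name, cluster, _arch in PIPELINE:
        if name == stage:
            return cluster
    raise KeyError(f"no cluster listed for stage: {stage!r}")


if __name__ == "__main__":
    # Print the pipeline in order, one stage per line.
    for name, cluster, arch in PIPELINE:
        print(f"{name:>20} -> {cluster} ({arch})")
```

Keeping the stages in an ordered list rather than a dictionary preserves the order in which the pipeline moves from sequencing to simulation.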

While much of Computing’s COVID-19 R&D has been performed remotely, hardware delivery and installation require physical labor and on-location oversight—all under strict safety protocols. “We had to figure out which resources the projects needed, then acquire the machines and execute during a pandemic,” says Karlin, who calls the effort herculean.