This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with the processing power of more than 2 exaflops, or 2 quintillion (10¹⁸) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.

Previous: The right operating system | Next: A framework for complex workflows


Imagine the not-too-distant future. You’re ready to run your 3D multiphysics application on the NNSA’s first exascale supercomputer. El Capitan revs into action, processing more than a quintillion calculations per second thanks to the incredibly complex coordination between its sophisticated hardware and system software. Your code hits a snag and has to restart, but fortunately there’s a checkpoint in place and it continues on almost seamlessly. Meanwhile, your application and its checkpoints are generating data, and as Livermore Computing (LC) computer scientist Brian Behlendorf observes, “That tremendous amount of data has to go somewhere.”

“Somewhere” means a file system and the crucial I/O (input/output) process of storing and retrieving application data. Behlendorf, who serves as the I/O lead for El Capitan, explains that the machine is designed with two tiers of storage: a large-capacity tier alongside a near-node solution called a Rabbit. “Workloads are changing and stressing traditional file systems. The Lab’s applications vary widely and require different types of storage for different computing paradigms,” he notes. “El Capitan will have the global Lustre file system that’s shared between all the nodes as well as HPE’s [Hewlett Packard Enterprise] dynamically configurable Rabbits.”

Introducing the Rabbit Node

HPE’s Rabbits are storage hardware installed directly into a supercomputer. Because they are physically located close to compute nodes, they can move data very quickly, which makes them ideal for an application’s immediate storage needs. A Rabbit can be configured as its own Lustre file system that outperforms the comparatively slower global file system, or as local block storage so the application can run as efficiently as possible. Furthermore, each Rabbit is equipped with its own processor and can execute I/O and data analysis tools in containers. Behlendorf states, “This Rabbit hardware is new, innovative, and flexible. I’m excited to build it from the ground up with LC and HPE.”

When you run your application on El Capitan, a tightly choreographed I/O dance is happening. You submit your job to the Flux resource manager, specifying the amount of storage the application needs. Flux determines which compute and storage resources to use, and sends that request to the HPE software governing the Rabbit. The HPE software, which runs in a Kubernetes container environment, configures the Rabbit file system for your application. When the application completes, the HPE software migrates your data from the Rabbit to the large-capacity tier, freeing up the Rabbit for the next job. If you need to re-run your application, the data can move back to the Rabbit. More complex workflows—such as those with machine learning algorithms or in situ data analysis—can take advantage of this hierarchical strategy of containers and local shared storage.
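In practice, this choreography would be driven from the job script. The sketch below is hypothetical: the `#DW` directives and the `$DW_JOB_ckpt` mount-point variable are assumptions modeled on DataWarp-style data-workflow directives rather than confirmed El Capitan syntax, and `my_multiphysics_app` is a placeholder; only the `flux batch` and `flux run` commands are real Flux tooling.

```shell
#!/bin/bash
# Hypothetical batch script sketch -- directive syntax is an assumption,
# not documented El Capitan usage.

# Ask for Rabbit storage configured as a per-job Lustre file system.
#DW jobdw type=lustre capacity=10TiB name=ckpt

# After the job exits, migrate results from the Rabbit to the
# large-capacity (global Lustre) tier.
#DW copy_out source=$DW_JOB_ckpt/results destination=/p/lustre/$USER/results

# Run the application; it writes checkpoints to the Rabbit-backed mount
# point while it runs, instead of hammering the global file system.
flux run -N 64 ./my_multiphysics_app --checkpoint-dir "$DW_JOB_ckpt"
```

Submitted with `flux batch`, a script along these lines would let Flux carve out the requested Rabbit storage before the application starts and trigger the copy-out stage after it completes.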

Storing More than Application Data

In addition to application teams, LC is working with others to adapt user-level analysis tools for Rabbits. One pilot effort involves running the SCR (Scalable Checkpoint/Restart) software directly on a Rabbit node instead of on a compute node. For SCR project lead Kathryn Mohror, the Rabbits’ potential for running these types of programs is just as exciting as the Rabbits’ ability to handle application data. She explains, “Rabbits offer a flexible, temporary area for file sharing that’s faster than the larger file system and offer the ability to run our tools on the Rabbit itself. What if we leveraged those capabilities for tools like SCR for checkpointing or Ascent for in situ data visualization?”

When SCR captures checkpointing data, it decides whether to reduce, move, or delete that data as the application continues. Currently, this process happens on the compute node, which can lead to interference because, as Mohror points out, “Everything SCR does competes for resources with the application running on the same node.” If the SCR activities are moved to the Rabbit, the application could run faster without interference from SCR—or any similarly positioned software—getting in the way.

The strategy promises to be a win for the compute center, too. “With SCR or other data analysis tools running on a separate processor from the main application, they will have more opportunities to reduce the data. For example, we’ll have the luxury of running a compression algorithm that might take a little longer. We can be smarter about how much data ends up on the large-capacity file system, which benefits everyone in the center,” says Mohror.
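Mohror’s compression scenario is easy to sketch. The snippet below uses Python’s standard `zlib` purely as a stand-in for whatever compressors LC’s tools actually run: a fast, low-effort setting represents what an application could afford inline on a compute node, while a slower, higher-effort setting represents the “luxury” pass that a Rabbit’s separate processor makes affordable.

```python
import zlib

# Stand-in for checkpoint data: repetitive, simulation-style bytes.
checkpoint = b"density=1.0 pressure=2.5 temp=300.0\n" * 10_000

# Fast, low-ratio compression -- roughly what an application might afford
# inline on a compute node, where every cycle competes with the simulation.
fast = zlib.compress(checkpoint, level=1)

# Slower, higher-ratio compression -- affordable when it runs on the
# Rabbit's own processor, off the application's critical path.
thorough = zlib.compress(checkpoint, level=9)

print(f"raw:     {len(checkpoint):>8} bytes")
print(f"level 1: {len(fast):>8} bytes")
print(f"level 9: {len(thorough):>8} bytes")
```

The payoff is exactly the one Mohror describes: spending extra compressor time on a separate processor means less data ultimately lands on the shared large-capacity file system.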

Production Readiness

LC installed the first Rabbit prototype into the Hetchy early access system (EAS) last year, and the I/O team has been kicking the metaphorical tires ever since. Three more EAS machines are in the process of receiving Rabbits for additional testing of the associated firmware, hardware, and software.

“Having Hetchy available for this work has been tremendously valuable, and now we’re able to exercise the basic I/O functionality on it as well as begin prototyping the SCR integration,” Behlendorf explains. “Long term, Hetchy will be used to stage updates for El Capitan, before new software gets to the big system. In the meantime, it’s our primary development environment for testing integration and production-ready efforts.”

Behlendorf praises the close coordination between HPE and LC. He says, “Integrating Rabbits into the EAS, and ultimately into El Capitan, is a huge co-design effort. Together we’re making sure the software stack runs correctly in our environment and on HPE’s hardware.” As just two examples of this teamwork, James Corbett from LC’s Workflow Enablement group spearheaded the building of new capabilities into Flux to accommodate Rabbits (see the next article in this series), while Chris Morrone from LC’s Systems Software and Security group led the effort to stand up the Kubernetes environment.

Behlendorf is confident this I/O strategy and Rabbit technology will be ready when El Capitan comes online next year, stating, “The core functionality is there. We’ll scale it up.”


—Holly Auten & Meg Epperly