This article is part of a series about Livermore Computing’s efforts to stand up the NNSA’s first exascale supercomputer. El Capitan will come online in 2024 with the processing power of more than 2 exaflops, or 2 quintillion (2 × 10^18) calculations per second. The system will be used for predictive modeling and simulation in support of the stockpile stewardship program.
Previous: Storage in the exascale era
Livermore Computing (LC) has a long history of developing state-of-the-art resource scheduling software that allows users to run and manage their simulation codes on HPC clusters. However, Slurm and similar traditional technologies have not kept pace with increasing system scale and with the growing interplay between subsystems, such as the interactions between compute clusters and file systems.
Over a decade ago, LLNL led a collaborative effort to develop Slurm. “Perfectly reasonable choices made at the beginning of the Slurm project started to conflict with the way HPC technologies and use cases were evolving. Slurm’s scheduler was designed and optimized for one node corresponding to one core, and it has a strictly two-level hierarchy for jobs and job steps. Additionally, Slurm limits how many jobs can run on a given system,” explains computer scientist Tom Scogland. “Over time, we could envision, and had already begun developing, other technologies with more expressive resources.”
The LLNL-developed Flux resource and job management framework aims to fill these gaps. Flux expands the scheduler’s view beyond the single dimension of “nodes” and enables new resource types, schedulers, and services that can be deployed as data centers continue to evolve. Flux is easy to run, allows users to compose complex workflows, and supports very large numbers of concurrent jobs.
Flux has been a user-level tool for years, and now it’s an effective cluster-level tool. “We are providing more capable resource management through hierarchical, multi-level management and scheduling schemes,” says Becky Springmeyer, LC Division leader. “Users benefit from schedulers that have deeper knowledge of network, I/O, and power interconnections, along with the ability to dynamically shape running work. A major challenge in designing Flux was making sure its framework is general and extensible enough to support resources and use cases that are only now emerging in research.”
One of these emerging use cases is El Capitan’s novel node-level storage (see the previous article in this series). James Corbett of LC’s Workflow Enablement Group is integrating Flux with Hewlett Packard Enterprise (HPE) hardware and software, including testing and writing the plugins and code that optimize Flux for the early access systems (EAS). Flux is the primary resource manager on four of El Capitan’s EAS machines as well as LLNL’s Corona supercomputer.
Rabbits and Flux
Compared to traditional resource management software, Flux makes smarter placement decisions and offers greater flexibility and more opportunity for adaptation. These capabilities help scientific researchers more effectively harness a supercomputer’s power. For example, by considering a holistic view of a data center’s input/output (I/O) bandwidth capability and utilization, Flux avoids the “perfect storm” of I/O operations that can occur when a naïve scheduler places I/O-intensive work without regard to I/O availability. This is especially important when using near-node storage like HPE’s Rabbits.
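The difference between naïve and I/O-aware placement can be illustrated with a toy model. The sketch below is not Flux code and does not reflect Flux's actual scheduling algorithm; the `place` function, the job names, and the bandwidth figures are all hypothetical, chosen only to show how counting I/O bandwidth as a schedulable resource changes which jobs start.

```python
from typing import List, Tuple

# Hypothetical job record: (name, nodes, requested I/O bandwidth in GB/s)
Job = Tuple[str, int, float]


def place(jobs: List[Job], free_nodes: int, io_capacity: float,
          io_aware: bool) -> Tuple[List[str], float]:
    """Start jobs in queue order while they fit.

    Returns the names of the started jobs and their combined I/O demand.
    """
    started: List[str] = []
    io_in_use = 0.0
    for name, nodes, io in jobs:
        if nodes > free_nodes:
            continue  # not enough nodes for this job
        if io_aware and io_in_use + io > io_capacity:
            continue  # hold the job until bandwidth frees up
        free_nodes -= nodes
        io_in_use += io
        started.append(name)
    return started, io_in_use


# Made-up workload: two I/O-heavy jobs and one compute-bound job.
queue: List[Job] = [
    ("checkpoint-a", 4, 60.0),
    ("checkpoint-b", 4, 60.0),
    ("compute-c", 4, 5.0),
]

naive_started, naive_io = place(queue, free_nodes=12, io_capacity=100.0, io_aware=False)
aware_started, aware_io = place(queue, free_nodes=12, io_capacity=100.0, io_aware=True)
```

In this toy case the naïve pass starts all three jobs and oversubscribes the assumed 100 GB/s of bandwidth, while the I/O-aware pass holds back the second checkpoint job and keeps demand under the limit.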
Flux handles each batch job as a complete instance of the framework with all the functionality of the system-level resource manager. Because each job can act as a full Flux instance running hierarchically inside the primary instance, users can customize Flux at different nested levels within their jobs. For example, a user desiring to launch many small high-throughput jobs could submit a large, long-running parent job, then load a specialized scheduler—one that is streamlined for high throughput—inside it. Panning outward in scale, “parent” schedulers operating at a larger granularity can move resources between “child” jobs as bottlenecks occur and employ pluggable schedulers for resource types like Rabbits and other novel architectures.
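The nesting described above can be sketched in a few lines of Python. This is a simplified illustration, not the Flux API: the `Instance` class and its `spawn_child`, `submit`, and `move_nodes` methods are hypothetical names, and the only resource modeled is a node count. It shows a parent instance handing part of its allocation to child instances, a child scheduling its own small jobs without consulting the parent, and the parent moving idle nodes between children when one becomes a bottleneck.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Instance:
    """Hypothetical stand-in for one scheduler instance; not the Flux API."""
    name: str
    free_nodes: int
    parent: Optional["Instance"] = field(default=None, repr=False)
    children: List["Instance"] = field(default_factory=list)
    running: Dict[str, int] = field(default_factory=dict)

    def spawn_child(self, name: str, nodes: int) -> "Instance":
        """Carve out nodes for a nested instance that schedules independently."""
        if nodes > self.free_nodes:
            raise ValueError(f"{self.name}: only {self.free_nodes} nodes free")
        self.free_nodes -= nodes
        child = Instance(name=name, free_nodes=nodes, parent=self)
        self.children.append(child)
        return child

    def submit(self, job: str, nodes: int) -> bool:
        """Start a job if it fits in this instance; the parent is not consulted."""
        if nodes > self.free_nodes:
            return False
        self.free_nodes -= nodes
        self.running[job] = nodes
        return True

    def move_nodes(self, src: "Instance", dst: "Instance", nodes: int) -> None:
        """Parent-level rebalancing: shift idle nodes from one child to another."""
        if src not in self.children or dst not in self.children:
            raise ValueError("both instances must be children of this one")
        if nodes > src.free_nodes:
            raise ValueError(f"{src.name}: only {src.free_nodes} idle nodes")
        src.free_nodes -= nodes
        dst.free_nodes += nodes


system = Instance("system", free_nodes=16)
throughput = system.spawn_child("high-throughput", nodes=4)
simulation = system.spawn_child("simulation", nodes=8)

# The child fills its own allocation with single-node jobs; two do not fit.
started = [throughput.submit(f"task-{i}", nodes=1) for i in range(6)]

# The parent relieves the bottleneck by moving idle nodes across children.
system.move_nodes(simulation, throughput, nodes=2)
retried = [throughput.submit(f"task-{i}", nodes=1) for i in (4, 5)]
```

A real nested instance carries its own scheduler and services; the sketch keeps only the bookkeeping needed to show how resources flow down and across the hierarchy.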
Testing of Flux’s integration with the Rabbits is under way on two EAS machines, and the sum is proving greater than the parts. “El Capitan users will be able to leverage the Rabbits not just as a network burst buffer or even a node burst buffer, but also as per-job configurable storage resources. You’ll have flexibility—whether you want node-like storage, your own personal mini-Lustre file system, or some combination of the two,” says Scogland, Flux’s scheduling and research lead.
Feedback from developers and users has significantly contributed to Flux’s current features and its development roadmap. To ease the learning curve of moving to this new technology, LC’s Ryan Day created wrapper functions that transform Slurm commands for Flux. Additionally, Corbett notes, “Besides being able to manage the Rabbits closely, users will like the Python libraries Flux offers for manipulating jobs. This feature is already popular among users who write workflows.” Scogland adds, “The license on the Flux APIs (LGPL3) is attractive to other developers, which means that if you dynamically link to Flux, you don’t have to use the same license. Even closed or proprietary software can link to Flux, and our active and growing API ecosystem can accommodate a range of use cases.”
Winner of a 2021 R&D 100 Award and part of the DOE Exascale Computing Project's Software Technology portfolio, Flux is open source and available to HPC centers around the world via GitHub. After Flux’s capabilities are fully demonstrated on El Capitan, LC and the development team will be primed for collaboration with the HPC and artificial intelligence communities, expanding Flux’s scope to reach even more complex workflows and specialized hardware technologies.
—Holly Auten & Meg Epperly