Computing centers that house dozens of large-scale systems with unique capabilities must have a method for efficiently scheduling their resources for use. In the case of Livermore Computing (LC), those resources include extremely large systems, such as the exascale El Capitan, as well as myriad smaller support systems for generating, visualizing, analyzing, and storing data that are critical to fulfilling LLNL’s national security missions.
LC has a long history of developing state-of-the-art software—including Slurm and its predecessors—that allows users to run and manage simulation codes across multiple clusters. However, traditional resource and job management approaches cannot keep up with the challenges of increasing system scales and interplays, such as those that occur between compute clusters and file systems.
Flux is a next-generation resource management framework that expands the scheduler’s view beyond the single dimension of “nodes.” Instead of simply developing a replacement for older solutions, Flux offers a framework that enables new resource types, schedulers, and workflow services to be deployed as HPC centers continue to evolve.
This comprehensive, modular software suite provides hierarchical resource management and graph-based scheduling features that improve the performance, portability, flexibility, and manageability of both traditional and complex scientific workflows on many types of computing systems—in the cloud, at remote locations, on a laptop, or on next-generation architectures like El Capitan. Responding to the rise of hybrid computing configurations, Flux also supports container-based workflows, in which an application code and its dependencies are packaged for different environments.
Smarter Workloads
A resource manager tracks and monitors the hardware deployed in the HPC center, then arbitrates access as customers submit work they would like to run. The job-scheduling algorithms must not only determine when and where resources that meet the user-specified requirements will be available but also implement an allocation policy. Job placement in both space and time is critical to achieving efficient execution and getting the most work done for the time, power, and money spent.
Flux addresses this issue by making smarter placement decisions and by offering greater flexibility and more opportunity for adaptation than current resource management software. These solutions help scientific researchers and computing users more effectively harness the power of LC capabilities. For example, with a holistic view of the center’s input/output (I/O) bandwidth capability and utilization, Flux avoids the “perfect storm” of I/O operations that can occur when a naïve scheduler organizes I/O-intensive work without regard to I/O availability.
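The idea behind I/O-aware placement can be illustrated with a small sketch. This is a conceptual toy, not Flux's implementation: the job fields, the greedy policy, and the 100 GB/s file-system capacity are all illustrative assumptions. The point is that a scheduler tracking bandwidth as well as nodes will defer a job rather than oversubscribe the file system.

```python
# Conceptual sketch (not Flux's actual scheduler): track I/O bandwidth
# alongside node counts, and defer jobs that would oversubscribe the
# file system. All job names and numbers below are illustrative.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int
    io_gbps: float  # peak I/O bandwidth the job is expected to drive

def schedule(jobs, free_nodes, io_capacity_gbps):
    """Greedily start jobs, respecting both node and I/O bandwidth limits."""
    started, deferred = [], []
    io_in_use = 0.0
    for job in jobs:
        fits_nodes = job.nodes <= free_nodes
        fits_io = io_in_use + job.io_gbps <= io_capacity_gbps
        if fits_nodes and fits_io:
            started.append(job.name)
            free_nodes -= job.nodes
            io_in_use += job.io_gbps
        else:
            deferred.append(job.name)  # wait rather than trigger an I/O storm

    return started, deferred

jobs = [Job("checkpoint-heavy", nodes=16, io_gbps=80.0),
        Job("analysis", nodes=8, io_gbps=30.0),
        Job("compute-bound", nodes=32, io_gbps=5.0)]
started, deferred = schedule(jobs, free_nodes=64, io_capacity_gbps=100.0)
# "analysis" is deferred: 80 + 30 GB/s would exceed the 100 GB/s capacity,
# even though enough nodes are free to run it.
```

A scheduler blind to the I/O dimension would start all three jobs at once and let them contend for bandwidth; the holistic view trades a short delay for predictable performance.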
Flux also supports hierarchical workflows via nested Flux instances, in which each level acts as its own resource manager. With recursive launch, users can leverage these instances to run smaller jobs for more efficient scheduling of a complex workflow. These capabilities simplify workflow management by reducing scheduling overhead, ultimately improving performance at large scales.
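The benefit of nesting can also be sketched in a few lines. The class and method names below are illustrative, not Flux's API; the sketch only shows the structural idea that a parent instance makes a handful of coarse decisions while child instances schedule their own small jobs in parallel.

```python
# Conceptual sketch of hierarchical scheduling (illustrative names, not
# Flux's API): a parent instance carves its node allocation into child
# instances, each of which schedules its own jobs independently.
class Instance:
    def __init__(self, nodes):
        self.nodes = nodes
        self.children = []

    def spawn_child(self, nodes):
        """Create a nested instance from part of this instance's allocation."""
        assert nodes <= self.nodes
        self.nodes -= nodes
        child = Instance(nodes)
        self.children.append(child)
        return child

    def run_jobs(self, n_jobs):
        # Each child owns its resources outright, so these jobs are
        # scheduled without touching a single central system queue.
        return [f"job-{i} on {self.nodes} nodes" for i in range(n_jobs)]

parent = Instance(nodes=128)
ensemble = [parent.spawn_child(nodes=32) for _ in range(4)]
# 4 children x 250 jobs each: the parent made only 4 scheduling decisions,
# and the remaining 1,000 were handled in parallel by the children.
results = [child.run_jobs(250) for child in ensemble]
```

Pushing fine-grained decisions down the hierarchy is what keeps scheduling overhead from growing with the number of small jobs in an ensemble.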
Users can customize their scheduling policies and seamlessly use their preferred programming tools with this framework. Kicking off a Flux-managed workload and monitoring its progress are possible with just a few commands.
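A minimal session might look like the following. The commands shown are part of Flux's documented command-line interface, but the script and program names are placeholders, and available options vary by version (see `flux help`):

```shell
flux batch -N 4 ./workflow.sh   # submit a batch script that runs in a nested Flux instance on 4 nodes
flux jobs                       # list jobs and their states in the current instance
flux run -n 8 ./simulate        # launch an 8-task parallel job and wait for it to finish
```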
Building a Community
Flux is open source and available to HPC centers around the world. The team welcomes research collaborations and regularly leads hands-on tutorials as well as workshops at premier HPC conferences. In 2021 Flux won an R&D 100 Award, and in 2024, the project joined the High Performance Software Foundation (HPSF). Through the foundation, Department of Energy labs can work with academia and industry to advance a common, affordable software stack for HPC.
