UCRL-WEB-201386

SLURM Reference Manual


SLURM Goals

SLURM was developed specifically to meet locally important criteria for a helpful, efficient way to manage compute resources on large (Linux/CHAOS) clusters. A cluster resource manager (such as LoadLeveler on LC's IBM ASC machines or the Resource Management System (RMS) from Quadrics) has three primary purposes:

  • Allocate nodes--
    give users access (perhaps even exclusive access) to compute nodes for some specified time range so their job(s) can run.
  • Control job execution--
    provide the underlying mechanisms to start, run, cancel, and monitor the state of parallel (or serial) jobs on the nodes allocated.
  • Manage contention--
    reconcile competing requests for limited resources, usually by managing a queue of pending jobs.
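
As a rough illustration of how these three roles fit together, the following minimal Python sketch models a toy resource manager. It is not SLURM's design or API: the names ToyResourceManager, Job, submit, finish, and cancel are hypothetical, and strict first-in, first-out ordering is assumed only because it is the simplest possible contention policy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes_needed: int
    state: str = "PENDING"   # PENDING -> RUNNING -> DONE or CANCELLED
    nodes: list = field(default_factory=list)


class ToyResourceManager:
    """Hypothetical illustration only; not SLURM's design or API."""

    def __init__(self, node_names):
        self.free = list(node_names)   # nodes not allocated to any job
        self.pending = deque()         # FIFO queue of waiting jobs
        self.jobs = {}                 # job name -> Job, for monitoring

    def submit(self, job):
        self.jobs[job.name] = job
        self.pending.append(job)
        self._schedule()

    def _schedule(self):
        # Manage contention: strict FIFO, so the job at the head of the
        # queue blocks every job behind it until enough nodes are free.
        while self.pending and self.pending[0].nodes_needed <= len(self.free):
            job = self.pending.popleft()
            # Allocate nodes: each node goes to exactly one job at a time.
            job.nodes = [self.free.pop(0) for _ in range(job.nodes_needed)]
            # Control job execution: here reduced to a state change.
            job.state = "RUNNING"

    def finish(self, name, state="DONE"):
        job = self.jobs[name]
        if job.state == "PENDING":
            self.pending.remove(job)   # job never started; drop it from the queue
        self.free.extend(job.nodes)    # return its nodes to the free pool
        job.nodes = []
        job.state = state
        self._schedule()               # freed nodes may let waiting jobs start

    def cancel(self, name):
        self.finish(name, state="CANCELLED")
```

With four nodes, a three-node job starts at once, while a later two-node job and the one-node job queued behind it both wait until the first job finishes, even though one node sits idle in the meantime. A real resource manager would also enforce time limits and reject requests larger than the cluster; this sketch does neither, so an oversized request would block the queue indefinitely.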

At LC, an adequate cluster resource manager needs to meet two general requirements:

  • Scalable--
    It must operate well on clusters with as many as several thousand nodes, including cases where the nodes are heterogeneous (with different hardware or configuration features).
  • Portable--
    It must ultimately support jobs on clusters that have different operating systems or versions, different architectures, different vendors, and different interconnect networks. Linux/CHAOS is, however, the intended first home for this software.

Any LC resource manager must also meet two additional, locally important, requirements:

  • Compatible with LCRM (DPCS)--
    Since a resource manager is neither a complex scheduler nor a complete batch system with across-cluster accounting and reporting features, it must support and work well within such a larger, more comprehensive job-control framework. At LC, the Livermore Computing Resource Management system (formerly called DPCS) provides that framework (see also the next section).
  • Compatible with QsNet--
    Since LC's Linux Project has already refined QsNet as its preferred high-speed interconnect for Linux/CHAOS clusters, an adequate resource manager must also allocate Quadrics QsNet resources along with compute nodes. Conversely, interconnect independence and the ability to easily support other brands of interconnect (such as Myrinet) are important too. Such independence allows great flexibility in pursuing new hardware configurations in future clusters.

Finally, to fit well into LC's emerging CHAOS environment, a resource manager should ideally have these three beneficial properties as well:

  • Fault Tolerant--
    Innovative scientific computing systems are often much less stable than routine business clusters, so a good local resource manager should recover well from many kinds of system failures (without terminating its workload), including failure of the node where its own control functions execute.
  • Open Source--
    The software (source code) should be freely sharable under the GNU General Public License, as with other nonproprietary CHAOS components.
  • Modular--
    An approach that clearly separates high-level job-scheduling functions from low-level cluster-administration functions allows for easier changes in scheduling policy without having to sacrifice working, familiar cluster-resource tools or features.

No commercial (or existing open source) resource manager meets all of these needs. So, since 2001, Livermore Computing, in collaboration with Linux NetworX and Brigham Young University, has developed and refined the "Simple Linux Utility for Resource Management" (SLURM).

