SLURM Reference Manual
SLURM was developed specifically to meet locally important criteria
for a helpful, efficient way to manage compute resources on large
clusters.
The primary threefold
purpose of a cluster resource manager
(such as LoadLeveler on LC's IBM ASC machines or the
Resource Management System (RMS) from Quadrics) is to:
- Allocate nodes--
give users access (perhaps even exclusive access) to compute nodes
for some specified time range so their job(s) can run.
- Control job execution--
provide the underlying mechanisms to start, run, cancel, and monitor
the state of parallel (or serial) jobs on the nodes allocated.
- Manage contention--
reconcile competing requests for limited resources, usually by managing
a queue of pending jobs.
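As a rough illustration of the three functions listed above, the following
toy Python sketch models a pool of nodes and a FIFO queue of pending jobs.
It is not SLURM code. The class and method names (ToyResourceManager,
submit, finish, state) are hypothetical and exist only for this example,
and strict FIFO ordering is an assumed policy chosen for simplicity.

```python
from collections import deque


class ToyResourceManager:
    """Toy model of a cluster resource manager (hypothetical, not SLURM)."""

    def __init__(self, node_names):
        self.free_nodes = set(node_names)  # nodes not allocated to any job
        self.pending = deque()             # FIFO queue of (job_id, nodes_needed)
        self.running = {}                  # job_id -> set of allocated nodes

    def submit(self, job_id, nodes_needed):
        """Queue a job, then start whatever can run (manage contention)."""
        self.pending.append((job_id, nodes_needed))
        self._schedule()

    def finish(self, job_id):
        """End or cancel a running job and release its nodes."""
        self.free_nodes |= self.running.pop(job_id)
        self._schedule()

    def state(self, job_id):
        """Report the state of a job (monitoring)."""
        if job_id in self.running:
            return "RUNNING"
        if any(queued_id == job_id for queued_id, _ in self.pending):
            return "PENDING"
        return "UNKNOWN"

    def _schedule(self):
        """Strict FIFO: start the head job while enough nodes are free."""
        while self.pending and self.pending[0][1] <= len(self.free_nodes):
            job_id, count = self.pending.popleft()
            # Exclusive allocation: each node goes to exactly one job.
            self.running[job_id] = {self.free_nodes.pop() for _ in range(count)}


rm = ToyResourceManager(["n1", "n2", "n3", "n4"])
rm.submit("A", 3)                # starts at once on 3 of the 4 nodes
rm.submit("B", 2)                # only 1 node is free, so B waits
state_b_before = rm.state("B")   # "PENDING"
rm.finish("A")                   # releases 3 nodes, so B can start
state_b_after = rm.state("B")    # "RUNNING"
```

In this sketch a job that asks for more nodes than the cluster has would
block the queue indefinitely; a real resource manager would reject or
bypass such a request.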
At LC, an adequate cluster resource manager needs to meet two
general requirements:
- Scalable--
It must operate well on clusters with as many as several thousand nodes,
including cases where the nodes are heterogeneous (with different
hardware or configuration features).
- Portable--
It must ultimately support jobs on clusters that have different
operating systems or versions, different architectures, different vendors,
and different interconnect networks.
Linux/CHAOS is, of course, the intended first home for this software.
Any LC resource manager must also meet two
additional, locally important, requirements:
- Compatible with LCRM (DPCS)--
Since a resource manager is neither a complex scheduler nor a complete batch
system with across-cluster
accounting and reporting features, it must support and
work well within such a larger, more comprehensive job-control
framework.
At LC, the Livermore Computing Resource Management system (LCRM, formerly
DPCS) provides that framework
(see also the next section).
- Compatible with QsNet--
Since LC's Linux Project has already refined
QsNet as its preferred
high-speed interconnect for Linux/CHAOS clusters, an adequate resource
manager must also allocate Quadrics QsNet resources along with compute
nodes.
Conversely, interconnect independence
and the ability to easily support other brands of interconnect
(such as Myrinet) are important too. Such independence allows great
flexibility in pursuing new hardware configurations in future clusters.
Finally, to fit well into LC's emerging CHAOS environment, a resource
manager should ideally have these three very beneficial
extra properties as well:
- Fault Tolerant--
Innovative scientific computing systems are often much less stable than
routine business clusters, so a good local resource manager should
recover well from many kinds of system failures (without terminating
its workload), including failure of the node where its own control
functions run.
- Open Source--
The software (source code) should be freely sharable under the GNU
General Public License, as with other nonproprietary CHAOS components.
- Modular--
An approach that clearly separates high-level job-scheduling
functions from low-level cluster-administration functions allows
for easier changes in scheduling policy without having to sacrifice
working, familiar cluster-resource tools or features.
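The separation of scheduling policy from low-level mechanism described
above can be sketched in a few lines of Python. This is only an
illustration under assumed names: fifo_policy, smallest_first_policy, and
start_jobs are hypothetical and do not correspond to any SLURM interface.
The point is that the launching mechanism stays unchanged when the policy
function is swapped.

```python
def fifo_policy(pending):
    """Policy: order jobs by submission sequence number."""
    return sorted(pending, key=lambda job: job["seq"])


def smallest_first_policy(pending):
    """Policy: order jobs by node count, ties broken by submission order."""
    return sorted(pending, key=lambda job: (job["nodes"], job["seq"]))


def start_jobs(pending, free_nodes, policy):
    """Mechanism: start jobs in policy order while enough nodes remain.

    Jobs that do not fit are skipped. The mechanism does not know why the
    policy chose a given order. Returns the names of the started jobs.
    """
    started = []
    for job in policy(pending):
        if job["nodes"] <= free_nodes:
            free_nodes -= job["nodes"]
            started.append(job["name"])
    return started


jobs = [
    {"name": "big", "nodes": 3, "seq": 0},
    {"name": "small", "nodes": 1, "seq": 1},
    {"name": "medium", "nodes": 2, "seq": 2},
]

# Same mechanism, two different policies, on a 4-node cluster.
fifo_started = start_jobs(jobs, 4, fifo_policy)
small_started = start_jobs(jobs, 4, smallest_first_policy)
```

With the sample jobs above, the FIFO policy starts "big" and "small",
while the smallest-first policy starts "small" and "medium"; start_jobs
itself is identical in both cases.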
No commercial (or existing open source) resource manager meets all
of these needs.
So since 2001 Livermore Computing, in collaboration with Linux NetworX
and Brigham Young University, has developed and refined the
"Simple Linux Utility for Resource Management" (SLURM).