SLURM Reference Manual
SLURM is LC's locally developed C-language Simple Linux Utility for
Resource Management. SLURM is a job- and compute-resource manager
that can run reliably and efficiently on Linux (CHAOS) clusters as
large as several thousand nodes. Its features suit it to large-scale,
high-performance computing environments, and its design avoids known
weaknesses (such as inflexibility or fault intolerance) in available
commercial resource management products for supercomputers.
This manual summarizes the specific service goals that SLURM
was developed to meet, and explains the roles that it plays (relative
to the Livermore Computing Resource Management (LCRM/DPCS) system,
for example) on LC machines.
Key to SLURM's operation are two software daemons: one (SLURMCTLD)
controls the job queue and resource allocations, while the other
(SLURMD) shepherds executing jobs on each compute node.
Sections below explain the features and subsystems of each SLURM daemon.
Additional sections tell how the use of "plugin modules" makes SLURM
easily adaptable to many hardware situations, and introduce the five
utility programs that give SLURM its direct user interface.
SRUN is the SLURM utility central to launching, assigning resources to,
and guiding the execution of parallel jobs managed by SLURM, both
interactively and through batch queues. Hence, the five ways to use
SRUN (its "modes"),
SRUN's complex I/O redirection support,
and the often-elaborate interaction among the many
SRUN options receive careful attention in several subsections devoted
to that tool. SRUN also interacts with a set of special SLURM
environment variables (like those used for job management by IBM's POE),
explained in another subsection.
Detailed and customizable monitoring of SRUN-submitted jobs is
provided by SQUEUE, whose options we also compare and illustrate with
annotated output cases.
Likewise, to plan SRUN use you can monitor SLURM-managed
nodes by executing or customizing a
separate SLURM tool called SINFO, with its own section below.
Checkpoint support using SCONTROL is introduced as well,
along with the ability to signal or kill jobs or their steps
by using SCANCEL.
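As a rough orientation to how these utilities divide the work, the following Python sketch builds (but does not require) command lines for SRUN, SQUEUE, SINFO, and SCANCEL. The helper names (srun_cmd, squeue_cmd, sinfo_cmd, scancel_cmd, run_if_available) are hypothetical and exist only for this illustration; the options shown (-N, -n, -u, -s) are common ones, and the later sections on each tool remain the authority on what a given SLURM version accepts.

```python
import shutil
import subprocess


def srun_cmd(nodes, tasks, program):
    """Build an srun command: -N is the node count, -n the task count."""
    return ["srun", "-N", str(nodes), "-n", str(tasks)] + list(program)


def squeue_cmd(user=None):
    """Build an squeue command, optionally limited to one user's jobs (-u)."""
    cmd = ["squeue"]
    if user is not None:
        cmd += ["-u", user]
    return cmd


def sinfo_cmd():
    """Build a plain sinfo command to report on SLURM-managed nodes."""
    return ["sinfo"]


def scancel_cmd(job_id, signal=None):
    """Build an scancel command; with -s it sends a signal instead of killing."""
    cmd = ["scancel"]
    if signal is not None:
        cmd += ["-s", signal]
    cmd.append(str(job_id))
    return cmd


def run_if_available(cmd):
    """Run cmd only if its executable is on PATH; return its stdout or None."""
    if shutil.which(cmd[0]) is None:
        return None  # SLURM is not installed on this machine
    return subprocess.run(cmd, capture_output=True, text=True).stdout


if __name__ == "__main__":
    # Print the command lines only; the job ID 1234 is a made-up example.
    for cmd in (sinfo_cmd(), srun_cmd(2, 4, ["hostname"]),
                squeue_cmd(), scancel_cmd(1234, "USR1")):
        print(" ".join(cmd))
```

The order in the example mirrors a typical session: check node status with SINFO, launch with SRUN, watch the job with SQUEUE, then signal or cancel it with SCANCEL.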
Other Relevant Documentation.
SLURM development is part of LC's larger CHAOS open-source operating
system project, as explained in the separate
CHAOS Reference Manual.
For a summary of known, significant differences between LC's Linux
machines and those running AIX or Tru64 UNIX, see the
Linux Differences guide.
And for general advice on managing (batch) jobs on LC production
machines, consult the examples and comparisons in the basic
On LC machines where Moab has replaced LCRM as the "workload
manager," Moab manages batch jobs among
clusters while SLURM manages compute resources within each cluster.
LC-local information on how Moab and SLURM interact appears in the
Moab at LC guide,
which also links to special Moab-SLURM configuration manuals
for system administrators published by both LLNL and Cluster