
UCRL-WEB-201386

SLURM Reference Manual


Introduction

SLURM is LC's locally developed C-language Simple Linux Utility for Resource Management. SLURM is a job- and compute-resource manager that can run reliably and efficiently on Linux (CHAOS) clusters as large as several thousand nodes. Its features suit it to large-scale, high-performance computing environments, and its design avoids known weaknesses (such as inflexibility or fault intolerance) in available commercial resource management products for supercomputers.

Overview.
This manual summarizes the specific service goals that SLURM was developed to meet, and explains the roles that it plays (relative to the Livermore Computing Resource Management (LCRM/DPCS) system, for example) on LC production machines. Key to SLURM's operation are two software daemons: one (SLURMCTLD) controls the job queue and resource allocations, while the other (SLURMD) shepherds executing jobs on each compute node. Sections below explain the features and subsystems of each SLURM daemon. Additional sections tell how the use of "plugin modules" makes SLURM easily adaptable to many hardware situations, and introduce the five utility programs that give SLURM its direct user interface.

SLURM Tools.
SRUN is the SLURM utility central to launching, assigning resources to, and guiding the execution of parallel jobs managed by SLURM, both interactively and through batch queues. Hence, the five ways to use SRUN (its "modes"), SRUN's complex I/O redirection support, and the often-elaborate interaction among the many SRUN options receive careful attention in several subsections devoted to that tool. SRUN also interacts with a set of special SLURM environment variables (like those used for job management by IBM's POE), explained in another subsection. Detailed and customizable monitoring of SRUN-submitted jobs is provided by SQUEUE, whose options we also compare and illustrate with annotated output cases. Likewise, to plan SRUN use you can monitor SLURM-managed nodes by executing or customizing a separate SLURM tool called SINFO, with its own section below. Checkpoint support using SCONTROL is introduced as well, along with the ability to signal or kill jobs or their steps by using SCANCEL.
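As a rough orientation to how these tools fit together, the sketch below walks through one plausible sequence: check node status, launch a small parallel job, monitor it, and cancel it. The helper function slurm_try, the DRY_RUN variable, and the job ID 1234 are hypothetical names introduced only for this illustration. The options shown (-N for node count and -n for task count on srun, -u for user on squeue) are common ones, but the options available on a given LC machine may differ, so treat this as an assumption-laden sketch and check the tool-specific sections for details.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper (not part of SLURM): by default it only
# prints the command; set DRY_RUN=0 to actually execute it on a cluster.
DRY_RUN="${DRY_RUN:-1}"

slurm_try() {
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# 1. Inspect SLURM-managed nodes before planning a job.
slurm_try sinfo

# 2. Launch a small parallel job: 4 tasks spread across 2 nodes.
slurm_try srun -N 2 -n 4 hostname

# 3. Monitor the current user's jobs.
slurm_try squeue -u "$(id -un)"

# 4. Cancel a job by ID (1234 is a placeholder, not a real job).
slurm_try scancel 1234
```

Run as written, the script only prints the commands it would issue, which makes it safe to try on a machine without SLURM installed.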

Other Relevant Documentation.
SLURM development is part of LC's larger CHAOS open-source operating system project, as explained in the separate CHAOS Reference Manual. For a summary of known, significant differences between LC's Linux machines and those running AIX or Tru64 UNIX, see the Linux Differences guide. And for general advice on managing (batch) jobs on LC production machines, consult the examples and comparisons in the basic EZJOBCONTROL guide.

On LC machines where Moab has replaced LCRM as the "workload manager," Moab manages batch jobs among clusters while SLURM manages compute resources within clusters. LC-local information on how Moab and SLURM interact appears in the Moab at LC guide, one section of which also links to special Moab-SLURM configuration manuals for system administrators published by both LLNL and Cluster Resources, Inc.

