

SLURM Reference Manual


SLURM fills a crucial but mostly hidden role in running large parallel programs on large clusters.

Most users who run batch jobs at LC use job-control utilities (such as PSUB or PALTER) that talk to the Livermore Computing Resource Management system (LCRM, formerly called DPCS), LC's locally designed metabatch system. LCRM:

  • Provides a common user interface for batch-job submittal across all LC machines and clusters.
  • Monitors resource use across machines and clusters.
  • Implements bank-based fair-share scheduling policy, again, across all LC production machines.
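In practice, users reach LCRM through the job-control utilities named above. A minimal hedged sketch of a typical session follows; exact option syntax varies by machine and is deliberately omitted here (consult the PSUB and PALTER documentation at LC):

```shell
# Hedged sketch of LCRM job-control usage; options are placeholders,
# not a definitive invocation.
psub myscript          # submit the batch script "myscript" to LCRM
palter <jobid> ...     # modify attributes of a queued job (placeholder arguments)
```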

To carry out its scheduling decisions, LCRM relies on the native resource manager on each machine or cluster where it assigns batch jobs to run. The basic duties of such a native resource manager are to:

  • Get and share information on resource (chiefly node) availability.
  • Allocate compute resources (chiefly, nodes or processors).
  • Shepherd jobs as their tasks execute.

On IBM AIX machines, LoadLeveler traditionally served as the native resource manager. On LC's non-AIX machines, LCRM has relied on one of three other native resource managers to provide low-level job control:

  • RMS (Resource Management System), used on "capability" clusters (devoted to one or two users at a time).
  • TBS (Trivial Batch System, an LC-developed replacement for the formerly widespread Network Queueing System or NQS).
  • SLURM (introduced here for managing Linux clusters and still evolving to meet specific LC needs).
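SLURM's own user interface centers on the SRUN command for launching parallel tasks. A minimal hedged sketch of an interactive launch appears below; the node and task counts are illustrative only, and the option names follow standard SLURM usage:

```shell
# Launch 4 tasks across 2 nodes under SLURM; "hostname" simply
# reports which nodes the tasks landed on. Counts are illustrative.
srun -N 2 -n 4 hostname
```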

The key differences among these alternatives appear in this table:


                  RMS                        TBS                        SLURM
Proprietary?      Yes                        No, open source            No, open source
Used on:          Machines with QsNet        Interconnect independent   Interconnect independent
                  interconnect
Suited for:       Capability clusters        Capacity clusters          Either, with CHAOS
Node allocation:  Whole nodes allocated      Multiple jobs per node     Either possible
                  to jobs
