
UCRL-WEB-200040

CHAOS: Linux from Livermore


Parallel Resource Manager (SLURM)

The primary purpose of a cluster resource manager (such as LoadLeveler on LC's IBM ASC machines or the Resource Management System (RMS) from Quadrics) is threefold:

  • Allocate nodes--
    give users access (perhaps even exclusive access) to compute nodes for some specified time range so their job(s) can run.
  • Control job execution--
    provide the underlying mechanisms to start, run, cancel, and monitor the state of parallel (or serial) jobs on the nodes allocated.
  • Manage contention--
    reconcile competing requests for limited resources, usually by managing a queue of pending jobs.

At LC, an adequate cluster resource manager needs to meet two general requirements:

  • Scalable--
    It must operate well on clusters with as many as several thousand nodes, including cases where the nodes are heterogeneous (with different hardware or configuration features).
  • Portable--
    It must ultimately support jobs on clusters that have different operating systems or versions, different architectures, different vendors, and different interconnect networks. Linux/CHAOS is, of course, the intended first home for this software, but in summer 2006 LC began installing the same (vendor-independent) resource manager on both Linux/CHAOS and AIX clusters.

Any LC resource manager must also meet two additional, locally important, requirements:

  • Compatible with LCRM--
    Since a resource manager is neither a complex scheduler nor a complete batch system with cross-cluster accounting and reporting features, it must support and work well within a larger, more comprehensive job-control framework. At LC, the Livermore Computing Resource Management (LCRM) system, formerly called the Distributed Production Control System (DPCS), provides that metabatch framework.
  • Compatible with QsNet--
    Since LC's Linux Project has already refined QsNet as its preferred high-speed interconnect for Linux/CHAOS clusters, an adequate resource manager must also allocate Quadrics QsNet resources along with compute nodes. (A flexible resource manager will be interconnect-independent.)

Finally, to fit well into the emerging CHAOS environment, a resource manager at LC should ideally have two additional, especially beneficial properties:

  • Fault Tolerant--
    Innovative scientific computing systems are often much less stable than routine business clusters, so a good local resource manager should recover well from many kinds of system failures (without terminating its workload), including failure of the node where its own control functions execute.
  • Open Source--
    The software (source code) should be freely sharable under the GNU General Public License, as with other nonproprietary CHAOS components.

No commercial (or existing open source) resource manager meets all nine of these needs, so since 2001 Livermore Computing, in collaboration with Linux NetworX and Brigham Young University, has developed and refined the "Simple Linux Utility for Resource Management" (SLURM). The summary of requirements above gives a good profile of SLURM's role and design strategy, but it says little about how SLURM actually works.

This diagram shows SLURM's architecture (from the system point of view):

      SRUN    -|        -------------
               |        |           |
      SCANCEL -|--------| SLURMCTLD |-------- SCONTROL
               |        |           |
      SQUEUE  -|        -------------
               |              |
      SINFO   -|     -------------------
                     |        |        |
                  SLURMD   SLURMD   SLURMD
                  (...compute nodes...)

At the center is SLURM's centralized work manager, the control daemon SLURMCTLD (with a duplicate backup for reliability, not shown). Along the bottom are the SLURMD daemons residing on every compute node, each of which runs jobs locally as a remote shell. (On BlueGene/L, compute nodes can execute only a single process, so the SLURMD daemon runs instead on one of the BlueGene/L "front end nodes," but it fills the same role.) User tools (left side) allocate resources and start jobs (SRUN) on SLURM-managed nodes, terminate them (SCANCEL), report job status (SQUEUE), and separately report current node and partition status (SINFO). The administrative tool SCONTROL (right side) monitors and modifies configurations and job states. These SLURM components were tested on an LC Linux system during 2002, then deployed for public use with the release of CHAOS 1.2 across all LC Linux clusters (that had a suitable switch) in the fall of 2003.
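
For example, a typical interaction with these user tools might look like the following sketch (the job ID 1234 and the program a.out are hypothetical):

      % srun -N 2 -n 4 ./a.out &    # launch a 4-task job on 2 nodes
      % squeue                      # check the job's state and progress
      % scontrol show job 1234      # inspect the job's full settings
      % scancel 1234                # terminate SLURM job 1234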

From the user point of view, SRUN is the central SLURM tool. SRUN offers over 65 command-line options that you can combine to provide the following (a few example command lines appear after this list):

  • five ways to execute your parallel jobs. These include an interactive way much like using POE on IBM/AIX machines as well as "local batch" (without LCRM) and "global batch" (with LCRM) alternatives.
  • elaborate control over resource allocation. You can specify not only the total number of nodes for your job but also the CPUs/process (multithreading), processes/CPU (overcommitment), and even specific node ranges or hostnames to use or avoid, and you can bind tasks to CPUs or memory.
  • fine-grained I/O management. You can separately redirect your job's input, output, and standard error to (or from) specified files on a per-job, per-step, per-node, or per-task basis.
  • detailed influence on your job's working features. You can change your job's reported name, default path, debug level, imposed "constraints" (much like PSUB), the verbosity of SRUN messages about it, and the type of MPI invoked.
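
Here are some hedged examples of combining these options on an SRUN command line (the program myapp and the node name lcn14 are hypothetical, and the exact option set varies by SLURM version, so consult srun --help on your cluster):

      % srun -N 4 -n 16 ./myapp               # 16 tasks spread across 4 nodes
      % srun -n 8 --cpus-per-task=2 ./myapp   # multithreading: 2 CPUs per task
      % srun -N 8 -n 32 --overcommit ./myapp  # overcommitment: allow more tasks than CPUs
      % srun -N 2 --exclude=lcn14 ./myapp     # avoid a specific node
      % srun -n 4 --output=out.%t ./myapp     # per-task output files (out.0, out.1, ...)
      % srun -n 4 --job-name=mytest ./myapp   # change the job's reported name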

On CHAOS machines, jobs submitted to SLURM using SRUN (either as a stand-alone utility or executed within an LCRM script) can be monitored for progress and resource use with the SQUEUE reporting tool. SQUEUE thus fills the role for CHAOS and SLURM that SPJSTAT fills for AIX and LoadLeveler on IBM machines. And like SPJSTAT, SQUEUE reports jobs by their SLURM-assigned "local" job ID rather than their LCRM JID (even if they have one). SQUEUE also lets users request customized job-status reports, in which they can specify both the job features reported (from a list of 24) and the order in which reported jobs are sorted.
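
A few hedged SQUEUE examples (the user name jsmith is hypothetical, and the format letters shown are a sampling, not the full list of 24):

      % squeue -u jsmith               # report only jobs owned by user jsmith
      % squeue -o "%i %j %u %T %M %D"  # custom report: job ID, name, user,
                                       #   state, time used, and node count
      % squeue -S "-M"                 # sort reported jobs by time used, descending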

Likewise, on CHAOS machines, compute resources managed by SLURM can be monitored for features or availability with the SINFO reporting tool. SINFO thus fills the role for CHAOS that LLSTATUS fills for AIX on IBM machines. Like LLSTATUS, by default SINFO reports broadly on all node partitions, but you can focus on specific nodes or node sets if you wish. And like SQUEUE, SINFO offers customization options to change not only the node properties reported but also the order or format of columns shown in SINFO output.
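
Some illustrative SINFO invocations (the node list lcn[1-8] is hypothetical):

      % sinfo                     # summarize all partitions (the default)
      % sinfo -N                  # report node-oriented rather than partition-oriented lines
      % sinfo -n lcn[1-8]         # focus on a specific set of nodes
      % sinfo -o "%P %a %D %T"    # custom columns: partition, availability,
                                  #   node count, and node state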

On BlueGene/L only, an additional SLURM tool called SMAP shows the topological distribution of jobs among nodes (because job geometry is important on that machine's unusual architecture).

More details on SLURM, including how its subsystems interact with each other, how users interact with SLURM, the many specialized job-control features offered by the SRUN tool, and the customization possibilities for SQUEUE, SINFO, and SMAP output, appear in the SLURM Reference Manual. In 2006, LC began replacing LoadLeveler with SLURM for resource management even on its AIX machines. For an AIX/CHAOS and LoadLeveler/SLURM cross-comparison matrix, see the "SLURM and Operating Systems" section of the SLURM Reference Manual.


