SLURM Reference Manual
SLURM's central control daemon is called SLURMCTLD.
Unlike the Portable Batch System daemon, SLURMCTLD is
multi-threaded, so some threads can handle problems without
delaying service to the normal jobs that also need attention.
SLURMCTLD runs on a single management node (with a fail-over spare
copy elsewhere for safety), reads the SLURM configuration file,
and maintains state information on:
- nodes (the basic compute resource),
- partitions (logically disjoint sets of nodes),
- jobs (or resource allocations to run jobs for a
time period), and
- job steps (parallel tasks within a job).
Job steps are not supported on
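The four kinds of state records above can be sketched as simple data structures. This is a hypothetical illustration only: SLURMCTLD's real records are C structures with many more fields, and the field names and defaults here are assumptions, not SLURM's actual definitions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A basic compute resource tracked by SLURMCTLD."""
    name: str
    state: str = "IDLE"  # e.g. IDLE, ALLOCATED, DOWN (illustrative values)

@dataclass
class Partition:
    """A logically disjoint set of nodes with its own limits."""
    name: str
    nodes: list = field(default_factory=list)
    max_time: int = 60   # hypothetical per-partition time limit (minutes)

@dataclass
class Job:
    """A resource allocation granted to run for a time period."""
    job_id: int
    partition: str
    num_nodes: int
    time_limit: int      # minutes

@dataclass
class JobStep:
    """A set of parallel tasks launched within a job's allocation."""
    job_id: int
    step_id: int
    tasks: int
```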
The SLURMCTLD daemon in turn consists of three software subsystems,
each with a specific role:
- Node Manager
- monitors the state and configuration of each node in
the cluster. It receives asynchronous state-change messages from each
compute node's SLURMD daemon, and it also actively polls those
daemons periodically for status reports.
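The Node Manager's two update paths (asynchronous state-change messages plus periodic polling of quiet nodes) can be sketched as follows. The class, method names, and poll-interval logic here are illustrative assumptions, not SLURM's actual API:

```python
import time

class NodeManager:
    """Tracks per-node state from two sources: pushed messages and polls."""

    def __init__(self, nodes, poll_interval=60.0):
        self.state = {n: "UNKNOWN" for n in nodes}
        self.last_heard = {n: 0.0 for n in nodes}
        self.poll_interval = poll_interval

    def on_state_change(self, node, new_state, now=None):
        """Handle an asynchronous message pushed by a node's SLURMD."""
        self.state[node] = new_state
        self.last_heard[node] = now if now is not None else time.time()

    def poll(self, query_fn, now=None):
        """Actively query any node not heard from within poll_interval.

        query_fn(node) stands in for the status request sent to SLURMD;
        a None reply is treated as an unresponsive (DOWN) node.
        """
        now = now if now is not None else time.time()
        for node in self.state:
            if now - self.last_heard[node] >= self.poll_interval:
                reply = query_fn(node)
                self.state[node] = reply if reply is not None else "DOWN"
                self.last_heard[node] = now
```

Recently heard-from nodes are skipped by the poll, so routine traffic stays proportional to the number of quiet nodes rather than the whole cluster.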
- Partition Manager
- groups nodes into disjoint sets (partitions)
and assigns job limits and access controls to each partition.
The partition manager also allocates nodes to jobs (at the request
of the Job Manager, below) based on job and partition properties.
SCONTROL is the (privileged) user utility that can alter partition properties.
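A minimal sketch of the Partition Manager's allocation role, granting idle nodes to a job subject to a partition's limits. The structure and the single `max_nodes` limit are assumptions for illustration; real partitions carry many more limits and access controls:

```python
class PartitionManager:
    """Groups nodes into disjoint partitions and allocates them to jobs."""

    def __init__(self):
        self.partitions = {}  # name -> {"nodes": [...], "max_nodes": int}
        self.busy = set()     # nodes currently allocated to some job

    def create(self, name, nodes, max_nodes):
        self.partitions[name] = {"nodes": list(nodes), "max_nodes": max_nodes}

    def allocate(self, partition, num_nodes):
        """Pick idle nodes for a job, honoring the partition's job limit.

        Returns the chosen node list, or None if the request cannot
        be satisfied right now.
        """
        part = self.partitions[partition]
        if num_nodes > part["max_nodes"]:
            return None  # request exceeds the partition's per-job limit
        idle = [n for n in part["nodes"] if n not in self.busy]
        if len(idle) < num_nodes:
            return None  # not enough idle nodes at the moment
        chosen = idle[:num_nodes]
        self.busy.update(chosen)
        return chosen

    def release(self, nodes):
        """Return a finished job's nodes to the idle pool."""
        self.busy.difference_update(nodes)
```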
- Job Manager
- accepts job requests (from
SRUN or a metabatch
system like LCRM), places them in a priority-ordered queue,
and reviews that queue periodically or when any state change might
allow a new job to start. Qualifying jobs are allocated resources, and
that information is passed to SLURMD on the relevant nodes so the
job can execute. When all nodes assigned to a job report that their
work is done, the Job Manager revises its records and reviews the
pending-job queue again.