

CHAOS: Linux from Livermore

Diskless Node Support

Starting with CHAOS 3.2 (spring, 2007), this operating system supports compute clusters (such as Atlas, Zeus, Rhea, Hopi, and Yana) whose nodes have no local hard disks. The file systems that normally reside on each local disk are served instead from NFS-mounted remote disks.

Enabling diskless clusters has both advantages and disadvantages for users.

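The prime user benefit of diskless clusters is reliability. With no local disks to fail on thousands of separate nodes, the cluster is up more often and repair costs are reduced.

The disadvantages are two side effects of running without local disks, both of which affect how jobs use memory: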

(1) Diskless nodes have no swap space. An application that runs out of memory (usually 16 Gbyte/node) cannot swap to disk. Normally the CHAOS Out Of Memory (OOM) killer terminates such applications (unless memory overcommit is disabled). SLURM jobs ended by the OOM killer receive a characteristic message in STDERR, with the format

     slurmd[host]: taskn: [name] terminated by OOM killer
     slurmd[host]: taskn: VmSize: xxxM RSS: yyyM

where

     host  is the node where the killed task was running,
     n     is the task ID of the terminated task,
     name  specifies the process that was killed,
     xxx   is the virtual memory size of the killed process, and
     yyy   is the resident size of that process when it was killed.
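
To check a job's saved STDERR for this message, a short script can search for the two lines described above. The following is a minimal sketch, not an LC-provided tool: the function name find_oom_kills and the record field names are hypothetical, the regular expressions assume the exact spacing shown in the format above, and the script assumes that the VmSize/RSS line follows the "terminated" line from the same host and that the "M" suffix means Mbytes.

```python
import re

# Patterns mirror the two-line format documented above. The exact
# whitespace in real slurmd output is an assumption.
KILL_RE = re.compile(
    r"slurmd\[(?P<host>[^\]]+)\]: task(?P<task>\d+): "
    r"\[(?P<name>[^\]]+)\] terminated by OOM killer"
)
SIZE_RE = re.compile(
    r"slurmd\[(?P<host>[^\]]+)\]: task\d+: "
    r"VmSize: (?P<vmsize>\d+)M RSS: (?P<rss>\d+)M"
)


def find_oom_kills(stderr_text):
    """Return one record (dict) per OOM-killed task found in stderr_text."""
    kills = []
    last_by_host = {}  # most recent kill record seen for each host
    for line in stderr_text.splitlines():
        m = KILL_RE.search(line)
        if m:
            rec = {
                "host": m["host"],
                "task": int(m["task"]),
                "name": m["name"],
                "vmsize_mb": None,  # filled in if the size line is found
                "rss_mb": None,
            }
            kills.append(rec)
            last_by_host[m["host"]] = rec
            continue
        m = SIZE_RE.search(line)
        if m and m["host"] in last_by_host:
            # Assumption: the size line belongs to the latest kill on that host.
            rec = last_by_host[m["host"]]
            rec["vmsize_mb"] = int(m["vmsize"])
            rec["rss_mb"] = int(m["rss"])
    return kills
```

Calling find_oom_kills() on the contents of a job's error file returns an empty list if the job was not ended by the OOM killer. Otherwise each record shows which process on which node exceeded memory, and how large it was at the time.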

(2) Temporary file systems use RAM, not disk. This means that any files in /tmp or /var/tmp use real memory on the node, rather than disk space. If you delete files from these temporary file systems, CHAOS reclaims the memory used. Also, CHAOS purges these file systems completely between jobs on diskless nodes. To preserve files between jobs, therefore, you must use HPSS archival storage, the Lustre parallel file system, or one of the /nfs/tmpn file systems.
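
Because files in /tmp and /var/tmp consume node memory and vanish between jobs, a job script may want to measure how much it has written there and copy the results to persistent storage before it exits. The sketch below illustrates one way to do this with the Python standard library. The function names tree_size_bytes and stage_out are hypothetical, the byte total is only an estimate of the memory actually held (it counts file contents, not file-system overhead), and the destination directory is whatever persistent location you choose, such as a Lustre or /nfs/tmpn directory.

```python
import os
import shutil


def tree_size_bytes(root):
    """Sum the sizes of regular files under root (symbolic links are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                pass  # file disappeared or is unreadable; ignore it
    return total


def stage_out(src_dir, dest_dir):
    """Copy src_dir's contents into dest_dir, then delete src_dir.

    Deleting the temporary copy lets the node reclaim the memory it used.
    Requires Python 3.8 or later for dirs_exist_ok.
    """
    shutil.copytree(src_dir, dest_dir, dirs_exist_ok=True)
    shutil.rmtree(src_dir)
```

A job could call tree_size_bytes() on its own subdirectory of /tmp periodically to see how much of the node's memory its temporary files occupy, and call stage_out() as its last step so that nothing needed later is lost in the purge between jobs.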
