
UCRL-WEB-200040

CHAOS: Linux from Livermore


For Users of NUMA Nodes

Many of the Linux/CHAOS clusters that LC bought starting in late 2006 (such as Atlas and Zeus) have hardware with "nonuniform memory access" (sometimes called nonuniform memory architecture), abbreviated NUMA in either case. Starting with version 3.2, released in 2007, CHAOS includes extra features that enable users to work more effectively with NUMA hardware. This section explains basic NUMA concepts and CHAOS policy regarding NUMA resources. It then introduces (in separate subsections) three user tools to help manage those resources.

NUMA Hardware:
Linux/CHAOS clusters with NUMA hardware have addressable nodes (such as atlas36 or atlas1151), each made up of four "dual-core sockets" (components with two CPUs each), as shown here:

                  -----------------------------
     addressable  | cpu0 | cpu0 | cpu0 | cpu0 |
     node         | cpu1 | cpu1 | cpu1 | cpu1 |
     (atlas1151)  -----------------------------
                   node0 |node1 |node2 |node3  ---dual-core sockets
                                                  = NUMA nodes

Each socket functions as a kind of subnode (here 0 through 3) with its own CPUs and local memory. But because the (two) CPUs share their memory with different access rates (that is, nonuniformly), their socket is a "NUMA node." Each addressable node on LC clusters with NUMA hardware contains four such NUMA nodes.
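
As an aside for programmers, this layout can also be discovered at run time with the open-source libnuma library. The short C sketch below is only an illustration: the file name is made up, it assumes libnuma and its numa.h header are installed on the node, and numa_node_of_cpu() requires a newer libnuma release. It simply reports how many NUMA nodes and CPUs the kernel sees and which NUMA node each CPU belongs to; on an Atlas node such as atlas1151 it should report eight CPUs spread across NUMA nodes 0 through 3.

     /* numa_layout.c (hypothetical name) -- sketch: report one node's NUMA topology.
        Assumes the libnuma library and its numa.h header are installed;
        compile with:  cc numa_layout.c -lnuma                                      */
     #include <stdio.h>
     #include <numa.h>

     int main(void)
     {
         if (numa_available() < 0) {                 /* -1 means no NUMA support    */
             printf("NUMA is not supported on this machine\n");
             return 1;
         }

         int max_node = numa_max_node();             /* highest NUMA node number    */
         int ncpus    = numa_num_configured_cpus();  /* CPUs known to the kernel    */
         printf("NUMA nodes 0..%d, %d CPUs\n", max_node, ncpus);

         /* numa_node_of_cpu() is present only in newer libnuma releases */
         for (int cpu = 0; cpu < ncpus; cpu++)
             printf("cpu%d is on NUMA node %d\n", cpu, numa_node_of_cpu(cpu));

         return 0;
     }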

This diagram shows in more detail the nonuniformity of memory access within each NUMA node:


     |-----one socket, one NUMA node-------|
     -----------------     -----------------
     |     CPU0      |     |     CPU1      |
     |               |     |               |
     -----------------     -----------------
     |   I   |   D   |     |   I   |   D   |   L1 (fast)
     | cache | cache |     | cache | cache |
     |       |       |     |       |       |
     ------------------- -------------------
     |    L2 cache   | | | |    L2 cache   |   L2 (slower)
     |               | | | |               |
     ----------------  | | -----------------
                       | |
               -------------------
               |  crossbar/bus   |             Remote memory,
               -------------------             local to other CPU
                                               (slowest)

For each CPU, access to its instruction (I) and data (D) L1 caches is faster than access to its local L2 cache. Each CPU can also reach the other CPU's memory (via the crossbar/bus), but that memory is "remote" or "foreign." On machines (such as LC's Thunder) where the remote-memory access rate equals the local-memory access rate, the hardware is called SMP (symmetric multiprocessor). On machines (such as Atlas or Zeus) where the remote-memory access rate is slower than the local-memory access rate, the hardware is called NUMA.
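
To make the local/remote distinction concrete, the hedged C sketch below (again assuming libnuma is installed; the file name is made up) places one buffer on the process's preferred NUMA node and one on another NUMA node of the same addressable node. Writing to the second buffer crosses the interconnect, which is why it is slower on NUMA hardware such as Atlas or Zeus but not on SMP hardware such as Thunder.

     /* local_remote.c (hypothetical name) -- sketch: place memory locally vs. remotely.
        Assumes libnuma; compile with:  cc local_remote.c -lnuma                    */
     #include <stdio.h>
     #include <string.h>
     #include <numa.h>

     int main(void)
     {
         if (numa_available() < 0)
             return 1;

         size_t size = 64UL * 1024 * 1024;                  /* 64-MB test buffers   */
         int local   = numa_preferred();                    /* process's own node   */
         int remote  = (local + 1) % (numa_max_node() + 1); /* some other node      */

         char *near = numa_alloc_local(size);           /* memory on the local node */
         char *far  = numa_alloc_onnode(size, remote);  /* memory on a remote node  */
         if (near == NULL || far == NULL)
             return 1;

         memset(near, 1, size);   /* local access: full speed                       */
         memset(far,  1, size);   /* remote access: crosses the bus, slower on NUMA
                                     hardware (but not on SMP hardware)             */

         printf("local buffer on node %d, remote buffer on node %d\n", local, remote);
         numa_free(near, size);
         numa_free(far,  size);
         return 0;
     }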

Memory Policy:
The default memory policy under CHAOS 3.2 or later on LC machines with NUMA hardware is to allocate memory to a process from its local NUMA node. This makes it important that processes not be rescheduled to different CPUs (and certainly not to different NUMA nodes) during long-running jobs. So CHAOS now automatically enables "CPU affinity" for SRUN-launched jobs on LC machines with NUMA hardware. CPU affinity keeps a process from moving between CPUs, thus avoiding the performance overhead of transferring the process's working set between different CPU caches. The three tools described in the subsections below are available on NUMA-hardware machines to query and adjust this CHAOS memory policy and the CPU-affinity default setting.
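
For background on what "CPU affinity" means at the operating-system level (this is only a generic Linux illustration, not one of the LC tools described in the subsections below), the sketch here uses the standard sched_getaffinity() and sched_setaffinity() calls to report which CPUs a process may run on and then pin the process to a single CPU.

     /* affinity.c (hypothetical name) -- sketch: query and set a process's CPU affinity.
        Uses standard Linux scheduler calls; compile with:  cc affinity.c           */
     #define _GNU_SOURCE
     #include <stdio.h>
     #include <sched.h>

     static void show_mask(const char *label, const cpu_set_t *mask)
     {
         printf("%s:", label);
         for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
             if (CPU_ISSET(cpu, mask))
                 printf(" cpu%d", cpu);
         printf("\n");
     }

     int main(void)
     {
         cpu_set_t mask;

         sched_getaffinity(0, sizeof(mask), &mask);   /* pid 0 = calling process    */
         show_mask("allowed CPUs before", &mask);

         CPU_ZERO(&mask);                             /* pin this process to cpu0,  */
         CPU_SET(0, &mask);                           /* so it cannot be moved      */
         if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
             perror("sched_setaffinity");

         sched_getaffinity(0, sizeof(mask), &mask);
         show_mask("allowed CPUs after", &mask);
         return 0;
     }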


