On May 23, LLNL computer scientist Maya Gokhale and colleagues Bo (Ivy) Peng and Jacob Wahlgren from the KTH Royal Institute of Technology in Sweden received a second-prize Research Poster Award at the ISC High Performance conference (ISC23) in Hamburg, Germany. Their research studies the effects of disaggregating memory systems—in other words, splitting the accessible memory into a local component and a remote component—in high performance computing (HPC).
“This work was motivated by our measurements within Livermore Computing on present-day HPC clusters. In the job data surveyed, we found that one or two nodes might use most of their available memory, leaving most of the memory in the remaining nodes unused,” says Gokhale, who is part of the Parallel Systems Group in LLNL’s Center for Applied Scientific Computing.
Memory capacity in HPC systems has increased over the past decade. In homogeneous clusters, memory is split evenly, with every node receiving the same amount. However, when running scientific code, many jobs don’t require all the available memory in most of their nodes. As a result, these resources sit unused, representing a potential source of savings in both money and energy through disaggregated memory systems.
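The savings argument can be illustrated with a toy provisioning model. The node counts and memory figures below are hypothetical examples, not measurements from the study: if only a couple of nodes in a job ever approach their memory limit, each node could be provisioned with far less local memory, with a shared pool absorbing the peaks.

```python
# Toy model of provisioning savings from memory pooling.
# All numbers are hypothetical illustrations, not data from the study.

def provisioned_memory(per_node_peaks, local_gib):
    """Total memory needed when each node has `local_gib` of local memory
    and any overflow beyond that is served from a shared pool."""
    local_total = local_gib * len(per_node_peaks)
    pool = sum(max(0, peak - local_gib) for peak in per_node_peaks)
    return local_total + pool

# Peak memory use (GiB) per node for a hypothetical 8-node job:
# two nodes are memory-hungry, the rest use little.
peaks = [120, 110, 30, 25, 28, 22, 31, 27]

homogeneous = 128 * len(peaks)  # every node gets 128 GiB regardless of need
disaggregated = provisioned_memory(peaks, local_gib=32)

print(homogeneous)    # 1024 GiB provisioned, mostly sitting idle
print(disaggregated)  # 422 GiB: 32 GiB x 8 local, plus 166 GiB pooled
```

In this sketch the disaggregated configuration provisions less than half the memory of the homogeneous one while still covering every node's peak, which is the intuition behind pooling the underused capacity.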
To investigate the potential for this architecture, Gokhale, Peng, and Wahlgren examined how sharing these resources using a disaggregated memory system impacts computational applications.
A Pool of Memory
“Suppose we provision each node with a much smaller amount of memory, and at times when they need more, they can use the memory of a memory server,” explains Gokhale. “Such a server, accessed through the high performance Compute Express Link (CXL) remote memory interface, can house many terabytes of memory, and other nodes can attach to it and use it when they need more capacity than they already have.”
CXL is an open standard for interconnecting computing resources, like memory. In addition to being a more efficient use of resources, a CXL-attached memory pool has a second benefit: allowing multi-tiering, or the use of different types of memory, by separating local resources from the remote, pooled resources.
Some computational applications are more sensitive to memory pooling than others. Lacking access to an actual disaggregated memory server, the researchers developed an emulation platform to explore which types of HPC computational jobs could benefit from using a remote memory pool.
They found that this type of disaggregation increases cost effectiveness without degrading performance in three out of seven applications, even when up to 75% of the memory used came from the pooled resources. However, the pool’s reduced bandwidth and increased latency slowed some applications, as did interference from other, unrelated applications sharing the pool.
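The varying sensitivity can be pictured with a simple average-access-time model. The latency figures below are hypothetical placeholders, not the study's measurements: the more of an application's memory traffic that lands on the slower remote tier, the larger its average access time relative to an all-local configuration.

```python
# Toy average-memory-access-time model for a local/remote memory split.
# Latency values are hypothetical illustrations, not measured figures.

def slowdown(remote_fraction, local_ns=100.0, remote_ns=300.0):
    """Average access time with a given remote fraction,
    relative to keeping every access local."""
    avg = (1 - remote_fraction) * local_ns + remote_fraction * remote_ns
    return avg / local_ns

for frac in (0.0, 0.25, 0.50, 0.75):
    print(f"{frac:.0%} remote -> {slowdown(frac):.2f}x average access time")
```

Under this sketch, average access time rises linearly with the remote fraction; a compute-bound application that touches memory infrequently would feel little of this penalty, which is consistent with some applications tolerating a large pooled share while others degrade.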
The team plans to dedicate future work to investigating how to mitigate this performance degradation—such as by implementing scheduling policies in which an application determines its own allocation needs—and how to best use multi-tiering.
—Anashe Bandari