Abstract
As high performance applications constantly evolve, their memory requirements grow
rapidly. As a result, the demand for more memory on the compute nodes of the large
clusters running those applications rises continuously. However, an individual compute
node has limits in terms of memory capacity. Running several processes with different
computational and memory requirements on a cluster typically creates fluctuating
workloads across the compute nodes. Hence, some nodes use most of their memory while
others are left with unused memory that could be exploited by the nodes under heavy
memory workload.
Consequently, remote memory management has become a subject of research for many
organizations, which have implemented various techniques for reading and writing data
in remote memory. Although using remote memory effectively increases the total memory
available to a compute node, accessing data remotely can severely degrade performance,
because the data must travel through the network interconnect of the cluster.
Furthermore, the software APIs that give processes access to remote memory can be
complex, and they typically leave the responsibility for remote memory allocation and
for fair sharing of remote memory among processes to the processes themselves, which
becomes complicated when many processes run simultaneously on the same compute node.
In this MSc thesis, we present the Page Migration System (PMS), which monitors the main
memory usage of the compute nodes of a cluster and moves infrequently accessed data of
a process from the memory of a node under heavy memory workload to the unused memory
of a remote node of the same cluster with a lighter memory workload. The key features of
PMS are that it transparently moves the LRU pages of processes to remote memory and
that it applies a fairness algorithm when choosing pages among the many processes
running on the same node. Moreover, remote memory is mapped on the local node, allowing
the OS to cache remote data: a read or write on remote memory actually occurs only on a
cache miss. This cacheability improves performance by reducing network transfers when
misses are few. Finally, the system can return memory pages to the local node if the
overall memory usage of the node drops or if the access frequency of those pages
increases.
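To make the page selection policy concrete, the following is a minimal sketch in C of
one plausible fairness heuristic: among all local pages, pick the least recently used
page of the process that currently holds the most local pages, so that no single
process is drained of memory before the others. The data structures and names here are
illustrative assumptions, not the actual PMS implementation.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical per-page metadata; these fields are illustrative,
     * not the real PMS data structures. */
    struct page_info {
        int      pid;          /* owning process                      */
        uint64_t last_access;  /* logical timestamp of last access    */
        bool     remote;       /* currently placed on a remote node?  */
    };

    /* Count how many local pages a given process still holds. */
    static int local_pages_of(const struct page_info *p, int n, int pid) {
        int count = 0;
        for (int i = 0; i < n; i++)
            if (!p[i].remote && p[i].pid == pid)
                count++;
        return count;
    }

    /* Fairness heuristic: among local pages, prefer the LRU page of the
     * process with the largest local share; ties go to the oldest page. */
    static int pick_victim(const struct page_info *p, int n) {
        int victim = -1, victim_share = -1;
        uint64_t victim_ts = UINT64_MAX;
        for (int i = 0; i < n; i++) {
            if (p[i].remote)
                continue;
            int share = local_pages_of(p, n, p[i].pid);
            if (share > victim_share ||
                (share == victim_share && p[i].last_access < victim_ts)) {
                victim = i;
                victim_share = share;
                victim_ts = p[i].last_access;
            }
        }
        return victim;
    }

    int main(void) {
        /* Toy snapshot: pid 1 holds three local pages, pid 2 holds one,
         * so the LRU page of pid 1 should be chosen for migration. */
        struct page_info pages[] = {
            { 1, 40, false }, { 1, 10, false },
            { 1, 25, false }, { 2,  5, false },
        };
        int v = pick_victim(pages, 4);
        printf("victim: page %d (pid %d, last_access %llu)\n",
               v, pages[v].pid, (unsigned long long)pages[v].last_access);
        return 0;
    }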
We evaluate PMS using several benchmarks that stress memory access. We use benchmarks
that perform raw serial accesses over arrays of around a gigabyte in size, causing
frequent cache evictions and thus moving more data through the network; this lets us
measure the performance drop a process suffers from memory access in the worst-case
scenario. We also run cache blocking benchmarks that exploit temporal locality, and we
show that performance improves by reducing the number of operations on remote memory.
Finally, we observe the behaviour and performance of real HPC applications running on
PMS.
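As an illustration of why the cache blocking benchmarks perform better, the sketch below
contrasts a naive traversal with a blocked one in C. The constants and structure are
illustrative assumptions rather than the benchmarks actually used in the evaluation:
blocking finishes work on a small tile while its data is still cached, which under PMS
translates into fewer cache misses and therefore fewer network transfers.

    #include <stdio.h>
    #include <stdlib.h>

    #define N     1024   /* matrix dimension (illustrative)            */
    #define BLOCK 64     /* tile size chosen to fit in cache           */

    /* Naive column-major traversal of a row-major matrix: each access
     * touches a different cache line, so evictions are frequent and,
     * under PMS, evicted remote pages must cross the network again. */
    static long sum_naive(const int *a) {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i * N + j];
        return s;
    }

    /* Blocked traversal: finish a BLOCK x BLOCK tile before moving on,
     * reusing cached (possibly remote) data while it is still resident. */
    static long sum_blocked(const int *a) {
        long s = 0;
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int ii = 0; ii < N; ii += BLOCK)
                for (int j = jj; j < jj + BLOCK; j++)
                    for (int i = ii; i < ii + BLOCK; i++)
                        s += a[i * N + j];
        return s;
    }

    int main(void) {
        int *a = malloc((size_t)N * N * sizeof *a);
        if (!a) return 1;
        for (int i = 0; i < N * N; i++) a[i] = i % 7;
        printf("naive:   %ld\n", sum_naive(a));   /* same result,      */
        printf("blocked: %ld\n", sum_blocked(a)); /* fewer cache misses */
        free(a);
        return 0;
    }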