Abstract |
HPC systems keep growing in size to meet the ever-increasing demand for
performance and computational resources. They are also becoming more
heterogeneous, as diverse resources such as accelerators are added to further
improve their computational power. The resources of such systems are shared
among many users, sometimes a few thousand, who execute a broad spectrum of
applications from all scientific fields.
This raises two issues. The first is the diversity of the software stack
required by the various applications, together with the need for complete
platform isolation among users. The second is the management of the many
resources and of the various jobs submitted by the users.
Additionally, applications that seek to lower their completion time rely on advanced
parallelism to exploit the system's resources, which increases the pressure on
system interconnects. The problem of managing the resources of a large and complex
HPC system is tackled with dedicated middleware, often called a Resource and Job
Management System.
To address platform isolation and provide a more flexible software stack,
Virtual Machine or container technology is often employed.
Furthermore, increasing the communication locality of an application reduces its
communication overhead, which lowers its completion time and can also relieve
pressure on system interconnects. Apart from the communication
cost, the completion time of an application can be further improved by reducing
the overhead of job abortions caused by node failures.
In this thesis we present three extensions to Slurm, a Resource and Job
Management System that is widely adopted in HPC systems. The extensions use the above
methods to tackle the aforementioned issues. The first extension adds support for
FPGA-based accelerators, allowing users to select nodes with specific FPGA-based
accelerators and making Slurm better suited to heterogeneous environments. The second
extension enables Slurm to run workloads in Virtual Machines. Compared to similar
approaches, this extension maintains a simple user interface and integrates
the VM management into Slurm.
The final extension implements a topology- and fault-aware process placement
approach that reduces the communication cost while also taking node failures
into account. It follows a common approach in the literature, which models
process placement as a graph mapping or graph embedding problem. The system's
topology and the application's communication pattern are expressed as two
separate graphs, and the placement is derived by solving the corresponding
graph mapping problem.
Unlike the common approach, the proposed approach also takes node failures
into account and attempts to avoid paths that include faulty nodes.
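As an illustration only, the graph mapping formulation can be sketched with a simple greedy heuristic. This is not the algorithm implemented in the Slurm extension. The function names, the input format and the one-process-per-node restriction are all assumptions made for the example. The communication graph is given as pairwise traffic volumes (assumed positive), the topology as an adjacency dictionary, and faulty nodes are excluded both as placement targets and as intermediate hops.

```python
from collections import deque

def hop_distances(topology, source, faulty):
    """BFS hop counts from source, never routing through faulty nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in topology[node]:
            if nxt not in dist and nxt not in faulty:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def place_processes(comm, topology, compute_nodes, faulty=frozenset()):
    """Greedy sketch: map each process to its own healthy compute node.

    comm: {(p, q): traffic} for undirected process pairs (hypothetical format)
    topology: adjacency dict covering switches and compute nodes
    """
    nbrs = {}
    for (p, q), w in comm.items():
        nbrs.setdefault(p, {})[q] = w
        nbrs.setdefault(q, {})[p] = w
    free = sorted(n for n in compute_nodes if n not in faulty)
    if len(free) < len(nbrs):
        raise ValueError("not enough healthy nodes")
    dist = {n: hop_distances(topology, n, faulty) for n in free}
    placement, unplaced = {}, set(nbrs)
    while unplaced:
        def pull(p):
            # Traffic between p and the processes that are already placed.
            return sum(w for q, w in nbrs[p].items() if q in placement)
        # Next process: strongest tie to the placed set, then heaviest overall.
        proc = min(unplaced,
                   key=lambda p: (-pull(p), -sum(nbrs[p].values()), p))
        def cost(node):
            # Traffic-weighted hop count to already placed partners.
            return sum(w * dist[node].get(placement[q], float("inf"))
                       for q, w in nbrs[proc].items() if q in placement)
        best = min(free, key=cost)
        placement[proc] = best
        free.remove(best)
        unplaced.remove(proc)
    return placement

# Two switches (s0, s1) with three compute nodes each; n1 has failed.
topology = {
    "s0": ["s1", "n0", "n1", "n2"], "s1": ["s0", "n3", "n4", "n5"],
    "n0": ["s0"], "n1": ["s0"], "n2": ["s0"],
    "n3": ["s1"], "n4": ["s1"], "n5": ["s1"],
}
comm = {(0, 1): 10, (2, 3): 10, (1, 2): 1}
nodes = ["n0", "n1", "n2", "n3", "n4", "n5"]
placement = place_processes(comm, topology, nodes, faulty={"n1"})
```

In this toy case the heavily communicating pairs (0, 1) and (2, 3) each end up under a single switch, and the failed node is never selected. A greedy heuristic of this kind gives no optimality guarantee; it only shows how the two graphs and the fault information enter the mapping.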
Finally, we evaluate the proposed topology- and fault-aware approach
in two different environments, with and without node failures.
The results show a notable decrease in the overall completion time of MPI
jobs in both environments. Compared to Slurm's default process placement, the
proposed approach significantly reduces MPI job abortions and lowers the overall
completion time by 10% to 31% across different MPI applications.
|