E-Locus - Institutional Repository of the University of Crete - Process placement optimizations and heterogeneity extensions to the Slurm resource manager

Identifier

000426193

Title

Process placement optimizations and heterogeneity extensions to the Slurm resource manager

Alternative Title

Process placement optimization and extensions for heterogeneous systems in the Slurm resource management software

Author

Βάρδας, Ιωάννης Γ.

Thesis advisor

Κατεβαίνης, Μανόλης

Reviewer

Μπίλας, Άγγελος
Πρατικάκης, Πολύβιος

Abstract

HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. They are also becoming more heterogeneous, as diverse resources such as accelerators are added to further improve their computational power. The resources of such systems are shared among many users, whose number can reach a few thousand, and who execute a broad spectrum of applications from all scientific fields. This raises two issues. The first is the diversity of the software stacks required by the various applications, together with the need for complete platform isolation among users. The second is the management of the many resources and of the various jobs submitted by users. Additionally, applications that seek to lower their completion time rely on advanced parallelism to exploit the system's resources, which increases pressure on system interconnects. The problem of managing the resources of a complex and sizeable HPC system is tackled with special middleware, often called a Resource and Job Management System. To deal with platform isolation and provide a more flexible software stack, Virtual Machine or container technology is often employed. Furthermore, increasing the communication locality of applications reduces their communication overhead, which lowers completion times and can also relieve pressure on system interconnects. Apart from the communication cost, the completion time of an application can be further improved by reducing the overhead of job abortions due to node failures. In this thesis we present three extensions to Slurm, a Resource and Job Management System widely adopted in HPC systems, that use the above methods to tackle the aforementioned issues.
The first extension adds support for FPGA-based accelerators to Slurm, allowing users to select nodes with specific FPGA-based accelerators and making Slurm better suited to heterogeneous environments. The second extension enables Slurm to run workloads in Virtual Machines. Compared to similar approaches, this extension maintains a simple user interface and integrates VM management into Slurm. The final extension implements a topology- and fault-aware process placement approach that reduces communication cost while also taking node failures into account. It follows a common approach which, according to the literature, models process placement as a graph mapping or graph embedding problem: the system's topology and the application's communication pattern are expressed as two separate graphs, and the mapping is derived by solving the corresponding graph mapping problem. Unlike the common approach, the proposed approach also takes node failures into account and attempts to avoid paths that include faulty nodes. Finally, we evaluate the proposed topology- and fault-aware approach in two different environments, with and without node failures. The results of our evaluation show a notable decrease in the overall completion time of MPI jobs in both environments. Compared to the default process placement of Slurm, the proposed approach significantly reduces both MPI job abortions and the overall completion time, by 10% up to 31% for different MPI applications.
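The graph mapping formulation mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm: it is an exhaustive search that is only feasible for toy inputs, and the function names (`hop_distances`, `mapping_cost`, `best_mapping`), the hop-count cost model, and the ring topology are all assumptions made for illustration. Failed nodes are excluded both as placement targets and as intermediate hops on paths.

```python
from collections import deque
from itertools import permutations


def hop_distances(topology, failed):
    """BFS hop counts between healthy nodes; paths never cross failed nodes."""
    healthy = [n for n in topology if n not in failed]
    dist = {}
    for src in healthy:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in topology[cur]:
                if nxt not in failed and nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    queue.append(nxt)
        dist[src] = seen
    return dist


def mapping_cost(mapping, traffic, dist):
    """Sum of (traffic volume x hop distance) over communicating process pairs."""
    total = 0
    for (p, q), volume in traffic.items():
        a, b = mapping[p], mapping[q]
        if b not in dist[a]:
            return float("inf")  # no fault-free path between the two nodes
        total += volume * dist[a][b]
    return total


def best_mapping(topology, traffic, failed=frozenset()):
    """Try every placement of processes on healthy nodes (toy sizes only)."""
    procs = sorted({p for pair in traffic for p in pair})
    dist = hop_distances(topology, failed)
    healthy = sorted(dist)
    best, best_cost = None, float("inf")
    for nodes in permutations(healthy, len(procs)):
        mapping = dict(zip(procs, nodes))
        cost = mapping_cost(mapping, traffic, dist)
        if cost < best_cost:
            best, best_cost = mapping, cost
    return best, best_cost


if __name__ == "__main__":
    # Hypothetical 5-node ring and a 4-process ring communication pattern.
    ring = {f"n{i}": [f"n{(i - 1) % 5}", f"n{(i + 1) % 5}"] for i in range(5)}
    traffic = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}
    print(best_mapping(ring, traffic))
    print(best_mapping(ring, traffic, failed={"n4"}))
```

In this toy setup, marking a node as failed turns the ring into a line, so the best achievable communication cost rises even though no process is placed on the failed node. Practical systems replace the exhaustive search with graph partitioning or other heuristics, since the general mapping problem is NP-hard.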

Language

English

Subject

Fault aware

HPC

Topology mapping

Resource management

Process placement

Issue date

2019-11-22

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses