Post-graduate theses

Identifier 000426193
Title Process placement optimizations and heterogeneity extensions to the Slurm resource manager
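Alternative Title Βελτιστοποίηση τοποθέτησης διεργασιών και επεκτάσεις για ετερογενή συστήματα στο λογισμικό διαχειρισμού πόρων Slurm [Process placement optimization and extensions for heterogeneous systems in the Slurm resource management software]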
Author Βάρδας, Ιωάννης Γ.
Thesis advisor Κατεβαίνης, Μανόλης
Reviewer Μπίλας, Άγγελος
Πρατικάκης, Πολύβιος
Abstract HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. By adding diverse resources such as accelerators to further improve their computational power, they also become more heterogeneous. The resources of such systems are shared among many users, sometimes a few thousand, who execute a broad spectrum of applications from all scientific fields. This raises two issues. The first is the diversity of the software stack required by the various applications, together with the need for complete platform isolation among users. The second is the management of the many resources and of the various jobs submitted by the users. Additionally, applications that seek to lower their completion time rely on advanced parallelism to exploit the system's resources, which increases the pressure on system interconnects. The problem of managing the resources of a complex and sizeable HPC system is tackled with special middleware, often called a Resource and Job Management System. To deal with platform isolation and provide a more flexible software stack, Virtual Machine or container technology is often employed. Furthermore, increasing the communication locality of applications reduces their communication overhead, which lowers completion times and can also relieve pressure on system interconnects. Apart from the communication cost, the completion time of an application can be further improved by reducing the overhead of job abortions caused by node failures. In this thesis we present three extensions to Slurm, a Resource and Job Management System widely adopted in HPC systems; the extensions use the above methods to tackle the aforementioned issues.
The first extension provides Slurm with support for FPGA-based accelerators, allowing users to select nodes with specific FPGA-based accelerators and making Slurm better suited to heterogeneous environments. The second extension enables Slurm to run workloads in Virtual Machines. Compared to similar approaches, it maintains a simple user interface and integrates VM management into Slurm. The final extension implements a topology- and fault-aware process placement approach that reduces communication cost while also taking node failures into account. It follows a common approach from the literature that models process placement as a graph mapping or graph embedding problem: the system's topology and the application's communication pattern are expressed as two separate graphs, and the mapping is derived by solving the corresponding graph mapping problem. Unlike the common approach, the proposed one also takes node failures into account and attempts to avoid paths that include faulty nodes. Finally, we evaluate the proposed approach in two environments, with and without node failures. The results show a notable decrease in the overall completion time of MPI jobs in both environments. Compared to the default process placement of Slurm, the proposed approach significantly reduces both MPI job abortions and overall completion time, by 10% to 31% for different MPI applications.
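The graph-mapping formulation in the abstract can be illustrated with a small sketch. This is not the thesis's algorithm or any Slurm API: the function names (`hop_distances`, `mapping_cost`, `place_processes`), the one-process-per-node restriction, the hop-count cost model, and the exhaustive search are all assumptions made for illustration. Exhaustive search is only feasible for toy inputs; a practical mapper would use a heuristic.

```python
from collections import deque
from itertools import permutations


def hop_distances(topology, source, faulty):
    """BFS hop counts from `source`, never routing through faulty nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            if neighbor not in dist and neighbor not in faulty:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def mapping_cost(mapping, comm, dist):
    """Sum of traffic volume times hop distance over all communicating pairs.

    Pairs with no fault-free path between their nodes cost infinity.
    """
    return sum(
        volume * dist[mapping[p]].get(mapping[q], float("inf"))
        for (p, q), volume in comm.items()
    )


def place_processes(comm, topology, faulty=frozenset()):
    """Try every placement of processes on healthy nodes; keep the cheapest.

    comm:     {(process_a, process_b): positive traffic volume}
    topology: {node: [neighbor nodes]} (hypothetical adjacency-list input)
    Returns (mapping, cost); mapping is None if no finite-cost placement exists.
    """
    procs = sorted({p for pair in comm for p in pair})
    healthy = sorted(n for n in topology if n not in faulty)
    if len(procs) > len(healthy):
        raise ValueError("not enough healthy nodes for all processes")
    dist = {n: hop_distances(topology, n, faulty) for n in healthy}

    best_mapping, best_cost = None, float("inf")
    for nodes in permutations(healthy, len(procs)):
        mapping = dict(zip(procs, nodes))
        cost = mapping_cost(mapping, comm, dist)
        if cost < best_cost:
            best_mapping, best_cost = mapping, cost
    return best_mapping, best_cost


if __name__ == "__main__":
    # Five nodes in a ring; node 1 has failed.
    ring = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
    traffic = {("a", "b"): 10, ("b", "c"): 5, ("a", "c"): 1}
    print(place_processes(traffic, ring, faulty={1}))
```

In this sketch the fault awareness enters in two places: faulty nodes are excluded as placement targets, and distances are computed only over paths that avoid them, so a failure can lengthen the effective distance between two healthy nodes.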
Language English
Subject Fault aware
HPC
Topology mapping
Resource management
Process placement
Issue date 2019-11-22
Collection   Faculty/Department--Faculty of Sciences and Engineering--Department of Computer Science--Post-graduate theses
  Type of Work--Post-graduate theses
Permanent Link https://elocus.lib.uoc.gr//dlib/b/4/1/metadata-dlib-1574421863-485182-13570.tkl

Digital Documents
No permission to view document.
It won't be available until: 2020-05-22