
Post-graduate theses

Identifier 000443510
Title Hierarchical shared address space MPI Collectives for multi/many-core processors
Alternative Title Ιεραρχικά και κοινού εύρους διευθύνσεων MPI Collectives, για multi/many-core επεξεργαστές
Author Κατεβαίνης - Μπίτζος, Γεώργιος Ε.
Thesis advisor Μπίλας, Άγγελος
Μαραζάκης, Μανώλης
Reviewer Πρατικάκης, Πολύβιος
Μαρκάτος, Ευάγγελος
Abstract Performance portability is at the forefront of MPI's feature set. With the ever-growing adoption of multi/many-core CPUs, obstacles arise that threaten to throttle performance and restrict scaling, jeopardizing this objective. To sustain high performance on these platforms, we implement in Open MPI the XPMEM-based Hierarchical Collectives (XHC) component for intra-node collectives, which addresses the challenges inherent to modern processors. XHC constructs a multi-level hierarchy that conforms to the processor's topological features and dictates the algorithms' data-movement patterns. Data is exchanged over memory, without redundant copies, via XPMEM-created user-level shared address space mappings. Synchronization is also realized through memory, using highly efficient lock-free techniques. We provide implementations of the Broadcast, Barrier, and Allreduce collective operations and evaluate them on three machines with high core counts and varying topologies, including multiple CPU sockets and/or NUMA nodes. OSU's MPI microbenchmarks, as well as real-world HPC applications, are used for comparisons with Open MPI's built-in components. We offer notable insight into the workings of microbenchmarks and the core factors that affect their results. Finally, we present statistics on the usage of collective primitives in our HPC applications of choice. In microbenchmarks, our proposed collectives achieve speedups over the next best option of up to 3x, 2x, and 8x for the Broadcast, Barrier, and Allreduce operations, respectively. In miniAMR we attain up to 4.8x speedup, and in CNTK we reduce the training time of the AlexNet deep neural network by 8%. The results of our experimental evaluation clearly highlight the necessity of techniques such as topology awareness and single-copy data transfers when seeking maximum efficiency from MPI collectives on multi/many-core processors.
Language English
Subject Topology
Issue date 2021-11-26
Collection School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses
Type of Work--Post-graduate theses
