
Post-graduate theses

Identifier 000443510
Title Hierarchical shared address space MPI Collectives, for multi/many-core processors
Alternative Title Ιεραρχικά και κοινού εύρους διευθύνσεων MPI Collectives, για multi/many-core επεξεργαστές
Author Κατεβαίνης - Μπίτζος, Γεώργιος Ε.
Thesis advisor Μπίλας, Άγγελος
Μαραζάκης, Μανώλης
Reviewer Πρατικάκης, Πολύβιος
Μαρκάτος, Ευάγγελος
Abstract Performance portability is at the forefront of MPI's feature set. With the ever-growing adoption of multi/many-core CPUs, obstacles arise that threaten to throttle performance and restrict scaling, jeopardizing this objective. To sustain the expectation of high performance on these platforms, we implement XPMEM-based Hierarchical Collectives (XHC), an Open MPI component for intra-node collectives that addresses the challenges inherent to modern processors. XHC constructs a multi-level hierarchy that conforms to the processor's topological features and dictates the algorithms' data movement patterns. Data is exchanged over memory, without any redundant copies, via XPMEM-created user-level shared address space mappings. Synchronization is also realized through memory, using highly efficient lock-free techniques. We provide implementations of the Broadcast, Barrier, and Allreduce collective operations, and evaluate them on three machines with high core counts and varying topologies, including multiple CPU sockets and/or NUMA nodes. OSU's MPI microbenchmarks, as well as real-world HPC applications, are used to produce comparisons with Open MPI's built-in components. We offer notable insight into the workings of microbenchmarks and the core factors that affect their results. Finally, we present statistics on the usage of collective primitives in our chosen HPC applications. In microbenchmarks, our proposed collectives achieve speedups over the next best option of up to 3x, 2x, and 8x for the Broadcast, Barrier, and Allreduce operations, respectively. In miniAMR we attain up to 4.8x speedup, and in CNTK we reduce the training time of the AlexNet deep neural network by 8%. The results of our experimental evaluation clearly highlight the necessity of techniques such as topology awareness and single-copy data transfers when seeking maximum efficiency from MPI collectives on multi/many-core processors.
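To make the abstract's two core mechanisms concrete, two short C sketches follow. Both are illustrative only, written under assumed names and simplifications; they do not reproduce XHC's actual source.

Sketch 1 uses the standard XPMEM user-level API (xpmem_make/xpmem_get/xpmem_attach) to show single-copy data movement: the exporting rank publishes a region of its address space once, and a peer attaches that region into its own address space and copies from it directly, so the payload crosses memory exactly once. The out-of-band exchange of the segment id (e.g. over MPI) and most error handling are elided; function names are hypothetical.

    /* Sketch 1: single-copy transfer over an XPMEM mapping (illustrative). */
    #include <string.h>
    #include <xpmem.h>

    /* Exporter: make a buffer attachable by other processes on the node. */
    xpmem_segid_t expose_buffer(void *buf, size_t len)
    {
        return xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0666);
    }

    /* Importer: attach the exporter's buffer and copy from it directly. */
    int fetch_single_copy(xpmem_segid_t segid, size_t len, void *dst)
    {
        xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDONLY,
                                      XPMEM_PERMIT_MODE, NULL);
        if (apid == -1)
            return -1;

        struct xpmem_addr at = { .apid = apid, .offset = 0 };
        void *src = xpmem_attach(at, len, NULL);
        if (src == (void *)-1) {
            xpmem_release(apid);
            return -1;
        }

        memcpy(dst, src, len);  /* the single copy */

        xpmem_detach(src);
        xpmem_release(apid);
        return 0;
    }

Sketch 2 shows one common form of lock-free synchronization through shared memory, using C11 atomics: a parent in the hierarchy release-stores a generation number after writing the payload, and children acquire-load it, which guarantees the payload written before the post is visible once the wait returns. Field and function names are illustrative, not XHC's internals.

    /* Sketch 2: lock-free flag synchronization via C11 atomics (illustrative). */
    #include <stdatomic.h>

    typedef struct {
        _Atomic unsigned seq;  /* parent bumps this to publish a new round */
    } sync_flag_t;

    /* Parent: publish the payload, then release-store the new generation. */
    static inline void flag_post(sync_flag_t *f, unsigned gen)
    {
        atomic_store_explicit(&f->seq, gen, memory_order_release);
    }

    /* Child: spin until the expected generation appears; the acquire load
     * pairs with the parent's release store. */
    static inline void flag_wait(const sync_flag_t *f, unsigned gen)
    {
        while (atomic_load_explicit(&f->seq, memory_order_acquire) != gen)
            ;  /* busy-wait; production code would add backoff */
    }

Conceptually, primitives of this kind would be composed along the topology-derived hierarchy the abstract describes, with a leader at each level moving data and signaling its group through such flags.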
Language English
Subject Topology
Issue date 2021-11-26
Collection School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses
Type of Work--Post-graduate theses
Permanent Link https://elocus.lib.uoc.gr//dlib/0/f/7/metadata-dlib-1636543304-321411-13080.tkl
