Abstract
Today, effectively utilizing multiple heterogeneous accelerators within applications and high-level Machine Learning (ML) frameworks like TensorFlow, PyTorch, and Caffe presents notable challenges across four key aspects: (a) sharing heterogeneous accelerators, (b) allocating available resources elastically during application execution, (c) providing the required performance for latency-critical tasks, and (d) protecting application data under spatial sharing.
In this dissertation, we introduce a novel runtime system designed to decouple applications from the intricacies of heterogeneous accelerators within a single server. Our approach provides a client-side API that allows applications to be written once, without considering low-level details such as the number or type of accelerators. By leveraging our system, applications are relieved of accelerator selection, memory allocation, and memory management operations. A backend service, referred to as the server, handles these tasks; it is shared among all applications and has four primary features.
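To make this decoupling concrete, the sketch below shows what application code could look like against such a client-side API. The names task_create, task_arg, task_submit, and task_wait are hypothetical placeholders rather than the actual interface, and the stub bodies exist only to keep the example self-contained; the point is that no accelerator, device memory, or stream appears in application code.

    // Illustrative client-side view; the real work (accelerator selection,
    // device allocation, data movement) happens inside the shared server.
    #include <cstddef>
    #include <cstdio>

    struct task { const char *kernel; };            // opaque handle in the real system

    task *task_create(const char *kernel) { return new task{kernel}; }    // stub
    void  task_arg(task *, void *, std::size_t) {}                        // stub
    void  task_submit(task *t) { std::printf("submit %s\n", t->kernel); } // stub
    void  task_wait(task *t)   { delete t; }                              // stub

    int main() {
        float in[1024] = {}, out[1024] = {};
        task *t = task_create("vector_scale");
        task_arg(t, in,  sizeof(in));               // only host buffers are named
        task_arg(t, out, sizeof(out));
        task_submit(t);                             // no device or stream appears here;
        task_wait(t);                               // the server decides all of that
    }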
First, the server defers the assignment of a task to an accelerator until the latest feasible moment, unlike current approaches that bind an application to an accelerator during its initialization phase. After the assignment decision, and just before task execution, the server transfers the necessary data to the designated accelerator. This dynamic task assignment and lazy data placement enable adaptation to changes in application load.
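The idea of late binding is illustrated below in a simplified, hypothetical form: tasks are queued without a device assignment, and an accelerator is picked only at dispatch time, right before the data copy and kernel launch would occur. The least-loaded policy is illustrative, not the server's actual scheduling algorithm.

    // Late binding sketch: no task names a device until it is dispatched.
    #include <cstdio>
    #include <queue>
    #include <vector>

    struct Task   { int id; };                      // host buffers, kernel name, ...
    struct Device { int id; int outstanding; };     // per-accelerator load counter

    int pick_device(const std::vector<Device> &devs) {
        int best = 0;                               // least-loaded accelerator wins
        for (int i = 1; i < (int)devs.size(); ++i)
            if (devs[i].outstanding < devs[best].outstanding) best = i;
        return best;
    }

    int main() {
        std::vector<Device> devs{{0, 0}, {1, 0}};   // e.g., two GPUs
        std::queue<Task> ready;
        for (int id : {1, 2, 3}) ready.push({id}); // submitted, not yet placed

        while (!ready.empty()) {
            Task t = ready.front(); ready.pop();
            int d = pick_device(devs);              // decision made at the last moment
            ++devs[d].outstanding;
            // copy t's inputs to device d here, then launch the kernel on d
            std::printf("task %d -> device %d\n", t.id, d);
        }
    }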
Second, to ensure that latency-critical GPU applications achieve the desired performance under time-sharing, the server revokes the execution of long-running kernels. Our revocation mechanism stops a task by prematurely terminating the ongoing GPU kernel without preserving any state and replays it later. The server uses a runtime scheduler that prioritizes latency-critical tasks over batch tasks and instructs the revocation mechanism when to kill a running kernel.
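One way such kill-and-replay revocation can be realized is sketched below, assuming a device-visible flag that batch kernels poll; the dissertation's actual mechanism may differ. The kernel writes values that depend only on the thread index, so replaying it from scratch is safe, mirroring the fact that a revoked kernel is later replayed from its original inputs.

    // Kill-and-replay sketch: the host raises a flag, the kernel exits without
    // saving state, and the same kernel is relaunched later.
    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ int revoke_flag = 0;                 // raised by the host scheduler

    __global__ void batch_kernel(float *data, int n) {
        volatile int *flag = &revoke_flag;          // re-read the flag on every check
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            if (*flag) return;                      // abandon the kernel; keep no state
            data[i] = i * 2.0f;                     // idempotent, so a replay is safe
        }
    }

    int main() {
        int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        batch_kernel<<<64, 256>>>(d, n);            // long-running batch task

        int one = 1;                                // a latency-critical task arrives:
        cudaMemcpyToSymbol(revoke_flag, &one, sizeof(int));   // revoke the batch kernel
        cudaDeviceSynchronize();
        // ... launch and complete the latency-critical kernel here ...

        int zero = 0;                               // later, replay the batch kernel
        cudaMemcpyToSymbol(revoke_flag, &zero, sizeof(int));
        batch_kernel<<<64, 256>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        std::printf("done\n");
    }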
Third, to facilitate spatial accelerator sharing across applications, the server establishes multiple streams for GPUs and command queues for FPGAs. For FPGAs, the server loads multi-kernel bitstreams and can (re)program the FPGA with the bitstream required by each application task. While spatial accelerator sharing improves accelerator utilization and application response time compared to time-sharing, it comes at the expense of data isolation.
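For GPUs, stream-based spatial sharing can be pictured as in the following sketch: each client application gets its own CUDA stream, so kernels from different applications may overlap on the same device. The two kernels and sizes are illustrative.

    // Spatial sharing sketch: one CUDA stream per client application.
    #include <cuda_runtime.h>

    __global__ void app_a_kernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    __global__ void app_b_kernel(float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] *= 2.0f;
    }

    int main() {
        int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        cudaStream_t sa, sb;                        // one stream per client application
        cudaStreamCreate(&sa);
        cudaStreamCreate(&sb);

        app_a_kernel<<<(n + 255) / 256, 256, 0, sa>>>(a, n);  // may run concurrently
        app_b_kernel<<<(n + 255) / 256, 256, 0, sb>>>(b, n);  // with the kernel above

        cudaStreamSynchronize(sa);
        cudaStreamSynchronize(sb);
        cudaStreamDestroy(sa);
        cudaStreamDestroy(sb);
        cudaFree(a);
        cudaFree(b);
    }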
Finally, GPU spatial sharing lacks protection because all applications share a single accelerator address space, leaving application data exposed to other applications and compromising the feasibility of sharing in broad multi-user settings. To resolve this issue, we design and implement a software-based sandboxing approach that applies bitwise instructions to the virtual assembly code of kernels. Our approach requires no extra or specialized hardware units and supports ML frameworks that use closed-source, domain-specific libraries.
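The sandboxing idea can be illustrated at the source level as below; the system itself applies the equivalent bitwise AND/OR transformation to memory accesses in the kernel's virtual assembly (PTX). The region size, mask value, and confine helper are illustrative, and the sketch assumes each application's region is aligned to its size.

    // Sandboxing sketch: every address is forced into the application's own
    // region with bitwise AND/OR, so a stray pointer cannot reach other data.
    #include <cstdint>
    #include <cuda_runtime.h>

    #define SANDBOX_MASK 0xFFFFFFull                // 16 MiB region size minus one (example)

    __device__ std::uintptr_t confine(std::uintptr_t base, std::uintptr_t p) {
        // keep only the in-region offset bits of p, then rebase into the sandbox
        return (base & ~SANDBOX_MASK) | (p & SANDBOX_MASK);
    }

    __global__ void sandboxed_copy(float *region, float *dst, const float *src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float *d = (float *)confine((std::uintptr_t)region, (std::uintptr_t)(dst + i));
            const float *s = (const float *)confine((std::uintptr_t)region,
                                                    (std::uintptr_t)(src + i));
            *d = *s;                                // both accesses stay inside the region
        }
    }

    int main() {
        int n = 1 << 20;                            // 1 M floats = 4 MiB
        char *raw;
        cudaMalloc(&raw, 2 * (SANDBOX_MASK + 1));   // over-allocate, then align the region
        std::uintptr_t aligned = ((std::uintptr_t)raw + SANDBOX_MASK) & ~SANDBOX_MASK;
        float *region = (float *)aligned;

        sandboxed_copy<<<(n + 255) / 256, 256>>>(region, region + n, region, n);
        cudaDeviceSynchronize();
        cudaFree(raw);
    }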
To minimize the porting effort for existing CUDA applications, we examine interception of CUDA API calls at various levels, i.e., the driver, runtime, and high-level library functions. We show that intercepting only the CUDA runtime and driver libraries is adequate to run complex ML frameworks, such as Caffe and PyTorch. Additionally, this level of interception is more robust than those used by previous approaches because it requires handling fewer and much simpler functions.
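As an illustration of runtime-level interception, the sketch below shows a minimal LD_PRELOAD shim for a single call, cudaMalloc: it resolves the real entry point with dlsym(RTLD_NEXT, ...) and could redirect the request to the shared server. This is one common way to intercept such calls, not necessarily the exact mechanism of our implementation.

    // Interception sketch: an LD_PRELOAD shim that wraps one runtime call.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                             // for RTLD_NEXT
    #endif
    #include <cstdio>
    #include <cstddef>
    #include <dlfcn.h>

    typedef int cudaError_t;                        // same width as the runtime's enum

    extern "C" cudaError_t cudaMalloc(void **devPtr, std::size_t size) {
        using fn_t = cudaError_t (*)(void **, std::size_t);
        static fn_t real =
            reinterpret_cast<fn_t>(dlsym(RTLD_NEXT, "cudaMalloc"));  // real entry point

        std::fprintf(stderr, "[shim] cudaMalloc(%zu bytes)\n", size);
        // a complete shim would route the request to the shared server here
        return real(devPtr, size);
    }

    // build (illustrative): g++ -shared -fPIC shim.cpp -ldl -o libshim.so
    // run:                  LD_PRELOAD=./libshim.so ./ml_application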
We use Caffe, TensorFlow, PyTorch, and Rodinia to demonstrate and evaluate the proposed runtime system in an accelerator-rich server environment with GPUs, FPGAs, and CPUs. Our results show that applications using our system can safely share accelerators without any modifications, at low overhead, and with latency guarantees.