
Doctoral theses

Identifier 000462946
Title Transparent spatial sharing of multiple and heterogeneous accelerators
Alternative Title Διαφανής χωρικός διαμοιρασμός πολλαπλών και ετερογενών επιταχυντών
Author Παυλιδάκης, Εμμανουήλ Ι.
Thesis advisor Μπίλας, Άγγελος
Reviewer Κατεβαίνης, Μανώλης
Πρατικάκης, Πολύβιος
Παπαευσταθίου, Βασίλειος
Βασιλειάδης, Γεώργιος
Κοσμίδης, Λεωνίδας
Καρακώστας, Βασίλειος
Abstract Today, effectively utilizing multiple heterogeneous accelerators within applications and high-level Machine Learning (ML) frameworks such as TensorFlow, PyTorch, and Caffe presents notable challenges across four key aspects: (a) sharing heterogeneous accelerators, (b) allocating available resources elastically during application execution, (c) providing the required performance for latency-critical tasks, and (d) protecting application data under spatial sharing. In this dissertation, we introduce a novel runtime system designed to decouple applications from the intricacies of heterogeneous accelerators within a single server. Our approach provides a client-side API that allows applications to be written once, without considering low-level details such as the number or type of accelerators. Applications are thus relieved of accelerator selection, memory allocation, and memory management. These tasks are handled by a backend service, referred to as the server, which is shared among all applications and offers four primary features.

First, the server defers the assignment of a task to an accelerator until the latest feasible moment, unlike current methods that bind an application to an accelerator at initialization. Once a task has been assigned, and just before it executes, the server transfers the necessary data to the designated accelerator. Together, this dynamic task assignment and lazy data placement allow the system to adapt to changes in application load. Second, to ensure that latency-critical GPU applications achieve the desired performance under time-sharing, the server revokes the execution of long-running kernels. Our revocation mechanism stops a task by prematurely terminating the ongoing GPU kernel, without preserving any state, and replays it later. A runtime scheduler prioritizes latency-critical tasks over batch tasks and instructs the revocation mechanism when to kill a running kernel. Third, to facilitate spatial accelerator sharing across applications, the server establishes multiple streams for GPUs and command queues for FPGAs. For FPGAs, the server loads multi-kernel bitstreams and can (re)program the FPGA with the bitstream required by each application task. While spatial accelerator sharing improves accelerator utilization and application response time compared to time-sharing, it comes at the expense of data isolation. Fourth, GPU spatial sharing lacks protection because all applications share a single accelerator address space, leaving each application's data exposed to the others; this compromises the feasibility of sharing in broad multi-user settings. To resolve this issue, we design and implement a software-based sandboxing approach that applies bitwise instructions to the virtual assembly code of kernels. Our approach requires no extra or special-purpose hardware and supports ML frameworks that use closed-source domain-specific libraries.

To minimize the porting effort for existing CUDA applications, we examine the interception of CUDA API calls at various levels, i.e., driver, runtime, and high-level library functions. We show that intercepting only the CUDA runtime and driver libraries is adequate to run complex ML frameworks such as Caffe and PyTorch. Additionally, this level of interception is more robust than those used by previous approaches because it requires handling fewer and much simpler functions. We use Caffe, TensorFlow, PyTorch, and Rodinia to demonstrate and evaluate the proposed runtime system in an accelerator-rich server environment with GPUs, FPGAs, and CPUs. Our results show that applications using our system can safely share accelerators without modification, at low overhead, and with latency guarantees.
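
The bitwise sandboxing idea above can be sketched in plain C. Assuming each application is assigned a contiguous region whose size is a power of two and whose base is aligned to that size (assumptions of this sketch, not details stated in the abstract), every address a kernel computes can be confined to its region with one AND and one OR before it is dereferenced; the dissertation applies the equivalent rewriting to the kernels' virtual assembly. The names and constants below are hypothetical.

    #include <stdint.h>

    /* Hypothetical sandbox layout: SANDBOX_SIZE is a power of two, so
     * (SANDBOX_SIZE - 1) is a contiguous low-bit mask; the region base
     * must be SANDBOX_SIZE-aligned for the OR to splice in cleanly. */
    #define SANDBOX_SIZE (1ULL << 30)   /* example: 1 GiB per application */

    /* Confine an arbitrary 64-bit address to [base, base + SANDBOX_SIZE):
     * keep only the in-region offset bits, then splice in the region base.
     * The same AND/OR pair can be emitted before every load and store. */
    static inline uint64_t sandbox_addr(uint64_t base, uint64_t addr)
    {
        return base | (addr & (SANDBOX_SIZE - 1));
    }

Under this scheme an out-of-region pointer is remapped into the application's own region rather than reaching another application's data, which preserves isolation at the cost of silently wrapping wild accesses.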
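Likewise, the runtime-level API interception described above is commonly realized with an LD_PRELOAD interposer. The sketch below is illustrative only, not the dissertation's code: it wraps the real CUDA runtime entry point cudaMalloc, the point where a shared backend service could account for, defer, or redirect allocations. Only cudaMalloc's signature is taken from the CUDA runtime; the logging stands in for the server-side logic.

    /* shim.c -- minimal CUDA runtime interposer (illustrative sketch).
     * Build: gcc -shared -fPIC shim.c -o libshim.so -ldl
     * Run:   LD_PRELOAD=./libshim.so ./cuda_app
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stddef.h>

    typedef int cudaError_t;   /* stands in for the runtime's enum */

    /* Intercept device allocations, then forward to the real runtime
     * entry point resolved with dlsym(RTLD_NEXT, ...). */
    cudaError_t cudaMalloc(void **devPtr, size_t size)
    {
        static cudaError_t (*real_cudaMalloc)(void **, size_t);
        if (!real_cudaMalloc)
            real_cudaMalloc = (cudaError_t (*)(void **, size_t))
                              dlsym(RTLD_NEXT, "cudaMalloc");
        fprintf(stderr, "[shim] cudaMalloc(%zu bytes)\n", size);
        return real_cudaMalloc(devPtr, size);
    }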
Language English
Subject GPU Memory Protection
Heterogeneity
Preemption
Runtime System
Scheduling
Ανάκληση εργασιών (Task revocation)
Επιταχυντές (Accelerators)
Ετερογένεια (Heterogeneity)
Προστασία μνήμης επιταχυντών (Accelerator memory protection)
Σύστημα χρόνου εκτέλεσης (Runtime system)
Χρόνο-προγραμματισμός διεργασιών (Process scheduling)
Issue date 2024-03-22
Collection   School/Department--School of Sciences and Engineering--Department of Computer Science--Doctoral theses
  Type of Work--Doctoral theses
Permanent Link https://elocus.lib.uoc.gr//dlib/6/c/5/metadata-dlib-1709224678-490548-32067.tkl
