Abstract
Today, effectively utilizing multiple heterogeneous accelerators within applications and high-level Machine Learning (ML) frameworks like TensorFlow, PyTorch, and Caffe presents notable challenges across four key aspects: (a) sharing heterogeneous accelerators, (b) allocating available resources elastically during application execution, (c) providing the required performance for latency-critical tasks, and (d) protecting application data under spatial sharing.
In this dissertation, we introduce a novel runtime system designed to decouple applications from the intricacies of heterogeneous accelerators within a single server. Our approach provides a client-side API that allows applications to be written once, without considering low-level details such as the number or type of accelerators. By leveraging our system, applications are relieved of accelerator selection, memory allocation, and memory management operations. A backend service, referred to as the server, handles these tasks; it is shared among all applications and has four primary features.
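To make this decoupling concrete, the sketch below shows what application code could look like against such a client-side API. The names task_create, task_arg, task_submit, and task_wait are hypothetical placeholders rather than the actual interface, and the stub bodies exist only to keep the example self-contained; the point is that no accelerator, device memory, or stream appears in application code.

    // Illustrative client-side view; the real work (accelerator selection,
    // device allocation, data movement) happens inside the shared server.
    #include <cstddef>
    #include <cstdio>

    struct task { const char *kernel; };            // opaque handle in the real system

    task *task_create(const char *kernel) { return new task{kernel}; }    // stub
    void  task_arg(task *, void *, std::size_t) {}                        // stub
    void  task_submit(task *t) { std::printf("submit %s\n", t->kernel); } // stub
    void  task_wait(task *t)   { delete t; }                              // stub

    int main() {
        float in[1024] = {}, out[1024] = {};
        task *t = task_create("vector_scale");
        task_arg(t, in,  sizeof(in));               // only host buffers are named
        task_arg(t, out, sizeof(out));
        task_submit(t);                             // no device or stream appears here;
        task_wait(t);                               // the server decides all of that
    }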
First, the server defers the assignment of a task to an accelerator until the latest feasible moment, unlike current approaches that bind an application to an accelerator during its initialization phase. After the assignment decision, and just before task execution, the server transfers the necessary data to the designated accelerator. This dynamic task assignment and lazy data placement enable adaptation to changes in application load.
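The idea of late binding is illustrated below in a simplified, hypothetical form: tasks are queued without a device assignment, and an accelerator is picked only at dispatch time, right before the data copy and kernel launch would occur. The least-loaded policy is illustrative, not the server's actual scheduling algorithm.

    // Late binding sketch: no task names a device until it is dispatched.
    #include <cstdio>
    #include <queue>
    #include <vector>

    struct Task   { int id; };                      // host buffers, kernel name, ...
    struct Device { int id; int outstanding; };     // per-accelerator load counter

    int pick_device(const std::vector<Device> &devs) {
        int best = 0;                               // least-loaded accelerator wins
        for (int i = 1; i < (int)devs.size(); ++i)
            if (devs[i].outstanding < devs[best].outstanding) best = i;
        return best;
    }

    int main() {
        std::vector<Device> devs{{0, 0}, {1, 0}};   // e.g., two GPUs
        std::queue<Task> ready;
        for (int id : {1, 2, 3}) ready.push({id}); // submitted, not yet placed

        while (!ready.empty()) {
            Task t = ready.front(); ready.pop();
            int d = pick_device(devs);              // decision made at the last moment
            ++devs[d].outstanding;
            // copy t's inputs to device d here, then launch the kernel on d
            std::printf("task %d -> device %d\n", t.id, d);
        }
    }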
Second, to ensure that latency-critical GPU applications achieve the desired performance under time-sharing, the server revokes the execution of long-running kernels. Our revocation mechanism stops a task by prematurely terminating the ongoing GPU kernel without preserving any state and replays it later. The server uses a runtime scheduler that prioritizes latency-critical tasks over batch tasks and instructs the revocation mechanism when to kill a running kernel.
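One way such kill-and-replay revocation can be realized is sketched below, assuming a device-visible flag that batch kernels poll; the dissertation's actual mechanism may differ. The kernel writes values that depend only on the thread index, so replaying it from scratch is safe, mirroring the fact that a revoked kernel is later replayed from its original inputs.

    // Kill-and-replay sketch: the host raises a flag, the kernel exits without
    // saving state, and the same kernel is relaunched later.
    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ int revoke_flag = 0;                 // raised by the host scheduler

    __global__ void batch_kernel(float *data, int n) {
        volatile int *flag = &revoke_flag;          // re-read the flag on every check
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            if (*flag) return;                      // abandon the kernel; keep no state
            data[i] = i * 2.0f;                     // idempotent, so a replay is safe
        }
    }

    int main() {
        int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        batch_kernel<<<64, 256>>>(d, n);            // long-running batch task

        int one = 1;                                // a latency-critical task arrives:
        cudaMemcpyToSymbol(revoke_flag, &one, sizeof(int));   // revoke the batch kernel
        cudaDeviceSynchronize();
        // ... launch and complete the latency-critical kernel here ...

        int zero = 0;                               // later, replay the batch kernel
        cudaMemcpyToSymbol(revoke_flag, &zero, sizeof(int));
        batch_kernel<<<64, 256>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        std::printf("done\n");
    }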
Third, to facilitate spatial accelerator sharing across applications, the server establishes multiple streams for GPUs and command queues for FPGAs. For FPGAs, the server loads multi-kernel bitstreams and can (re)program the FPGA with the bitstream required by each application task. While spatial accelerator sharing improves accelerator utilization and application response time compared to time-sharing, it comes at the expense of data isolation.
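For GPUs, stream-based spatial sharing can be pictured as in the following sketch: each client application gets its own CUDA stream, so kernels from different applications may overlap on the same device. The two kernels and sizes are illustrative.

    // Spatial sharing sketch: one CUDA stream per client application.
    #include <cuda_runtime.h>

    __global__ void app_a_kernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    __global__ void app_b_kernel(float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] *= 2.0f;
    }

    int main() {
        int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        cudaStream_t sa, sb;                        // one stream per client application
        cudaStreamCreate(&sa);
        cudaStreamCreate(&sb);

        app_a_kernel<<<(n + 255) / 256, 256, 0, sa>>>(a, n);  // may run concurrently
        app_b_kernel<<<(n + 255) / 256, 256, 0, sb>>>(b, n);  // with the kernel above

        cudaStreamSynchronize(sa);
        cudaStreamSynchronize(sb);
        cudaStreamDestroy(sa);
        cudaStreamDestroy(sb);
        cudaFree(a);
        cudaFree(b);
    }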
Finally, GPU spatial sharing lacks protection because all applications share a single accelerator address space, leaving application data exposed to other applications and compromising the feasibility of sharing in broad multi-user settings. To resolve this issue, we design and implement a software-based sandboxing approach that applies bitwise instructions to the virtual assembly code of kernels. Our approach requires no extra or specialized hardware units and supports ML frameworks that use closed-source, domain-specific libraries.
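The sandboxing idea can be illustrated at the source level as below; the system itself applies the equivalent bitwise AND/OR transformation to memory accesses in the kernel's virtual assembly (PTX). The region size, mask value, and confine helper are illustrative, and the sketch assumes each application's region is aligned to its size.

    // Sandboxing sketch: every address is forced into the application's own
    // region with bitwise AND/OR, so a stray pointer cannot reach other data.
    #include <cstdint>
    #include <cuda_runtime.h>

    #define SANDBOX_MASK 0xFFFFFFull                // 16 MiB region size minus one (example)

    __device__ std::uintptr_t confine(std::uintptr_t base, std::uintptr_t p) {
        // keep only the in-region offset bits of p, then rebase into the sandbox
        return (base & ~SANDBOX_MASK) | (p & SANDBOX_MASK);
    }

    __global__ void sandboxed_copy(float *region, float *dst, const float *src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float *d = (float *)confine((std::uintptr_t)region, (std::uintptr_t)(dst + i));
            const float *s = (const float *)confine((std::uintptr_t)region,
                                                    (std::uintptr_t)(src + i));
            *d = *s;                                // both accesses stay inside the region
        }
    }

    int main() {
        int n = 1 << 20;                            // 1 M floats = 4 MiB
        char *raw;
        cudaMalloc(&raw, 2 * (SANDBOX_MASK + 1));   // over-allocate, then align the region
        std::uintptr_t aligned = ((std::uintptr_t)raw + SANDBOX_MASK) & ~SANDBOX_MASK;
        float *region = (float *)aligned;

        sandboxed_copy<<<(n + 255) / 256, 256>>>(region, region + n, region, n);
        cudaDeviceSynchronize();
        cudaFree(raw);
    }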
To minimize the porting effort for existing CUDA applications, we examine interception of CUDA API calls at various levels, i.e., the driver, runtime, and high-level library functions. We show that intercepting only the CUDA runtime and driver libraries is adequate to run complex ML frameworks, such as Caffe and PyTorch. Additionally, this level of interception is more robust than those used by previous approaches because it requires handling fewer and much simpler functions.
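As an illustration of runtime-level interception, the sketch below shows a minimal LD_PRELOAD shim for a single call, cudaMalloc: it resolves the real entry point with dlsym(RTLD_NEXT, ...) and could redirect the request to the shared server. This is one common way to intercept such calls, not necessarily the exact mechanism of our implementation.

    // Interception sketch: an LD_PRELOAD shim that wraps one runtime call.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                             // for RTLD_NEXT
    #endif
    #include <cstdio>
    #include <cstddef>
    #include <dlfcn.h>

    typedef int cudaError_t;                        // same width as the runtime's enum

    extern "C" cudaError_t cudaMalloc(void **devPtr, std::size_t size) {
        using fn_t = cudaError_t (*)(void **, std::size_t);
        static fn_t real =
            reinterpret_cast<fn_t>(dlsym(RTLD_NEXT, "cudaMalloc"));  // real entry point

        std::fprintf(stderr, "[shim] cudaMalloc(%zu bytes)\n", size);
        // a complete shim would route the request to the shared server here
        return real(devPtr, size);
    }

    // build (illustrative): g++ -shared -fPIC shim.cpp -ldl -o libshim.so
    // run:                  LD_PRELOAD=./libshim.so ./ml_application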
We use Caffe, TensorFlow, PyTorch, and Rodinia to demonstrate and evaluate the proposed runtime system in an accelerator-rich server environment with GPUs, FPGAs, and CPUs. Our results show that applications using our system can safely share accelerators without any modifications, at low overhead, and with latency guarantees.