Abstract
The end of the processor performance race at the start of the current century signaled
the beginning of the multicore era. To harness the benefits of multiple CPU cores for a single
application, programmers must now use parallel programming models. Semiconductor
trends hint that processors within the next decade will integrate hundreds of cores
on a single chip; architectures will be heterogeneous, with a few strong (and power-hungry)
cores and many weak (and power-efficient) ones; caches will be only partially coherent, or not coherent at all.
As the new manycore era approaches, finding a productive and efficient programming model
able to scale on such architectures is a major challenge.
Dependency-aware, task-based programming models have gained a significant following.
The programmer provides a serial program, split into small functions (tasks) that run
to completion, along with information about the data on which the tasks will operate. A
runtime system analyzes the dependencies and schedules and executes the tasks in parallel.
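To make this model concrete, the following is a minimal, hypothetical sketch (in Python, not the thesis's actual API; all names are illustrative) of how a runtime can derive dependencies from per-task read/write annotations. Each spawned task declares the data it reads and writes, and the runtime infers read-after-write, write-after-write, and write-after-read dependencies:

```python
from collections import defaultdict

class TaskRuntime:
    """Toy dependency-aware task runtime: tasks declare the data they
    read and write; the runtime derives dependencies between tasks and
    executes them in a valid order (program order here; a real runtime
    would run independent tasks in parallel)."""

    def __init__(self):
        self.tasks = []  # list of (function, reads, writes)

    def spawn(self, fn, reads=(), writes=()):
        self.tasks.append((fn, set(reads), set(writes)))

    def run(self):
        last_writer = {}              # datum -> index of last writing task
        readers = defaultdict(list)   # datum -> readers since last write
        order = []
        for i, (fn, reads, writes) in enumerate(self.tasks):
            deps = set()
            for d in reads:
                if d in last_writer:          # read-after-write
                    deps.add(last_writer[d])
            for d in writes:
                if d in last_writer:          # write-after-write
                    deps.add(last_writer[d])
                deps.update(readers[d])       # write-after-read
            for d in reads:
                readers[d].append(i)
            for d in writes:
                last_writer[d] = i
                readers[d] = []
            order.append((i, sorted(deps)))
        # Program order is one valid topological order of the dependencies.
        for i, _ in order:
            self.tasks[i][0]()
        return order
```

A producer/consumer pair illustrates the idea: a task that writes datum "A" followed by a task that reads "A" yields a read-after-write dependency, so the consumer is ordered after the producer.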
Despite their increasing popularity, these programming models are not ready to scale
to emerging manycore architectures, as they primarily target today's homogeneous, cache-coherent
multicores. Their runtime implementations appear to be neither scalable nor suitable
for heterogeneous, less coherent architectures.
Our thesis delves into the parallel programming challenges that lie ahead in the coming
decade. We consider two major problems. First, how should a parallel runtime system be
designed so that it can scale well on a manycore processor ten years from now?
And second, how can we implement and evaluate such runtime system designs, given that
no such manycore processors are currently available?
Towards the first problem, we enhance an existing task-based model to support nested
parallelism and pointer-based, irregular data structures. We then design and implement Myrmics,
a runtime system that implements this programming model. Myrmics is specifically
designed to run on future, heterogeneous, non-coherent processors and to scale well using
novel, distributed algorithms and policies for hierarchical memory management, dependency
analysis and task scheduling. Our experimental results show that Myrmics scales
well compared to reference, hand-tuned MPI baselines, while the overheads of automatic
parallelization remain modest (10–30%). We verify that many of our proposed algorithms and
policies are promising.
Towards the second problem, we create a heterogeneous 520-core FPGA prototype modeled
faithfully after current predictions for manycore processors. We use it to evaluate the
Myrmics runtime system. The FPGA prototype is based on Formic, a new printed circuit
board that we design specifically for scalable systems. We estimate that our prototype runs
code 50,000 times faster than software simulators for similar core counts.