Abstract
As the core count of modern architectures increases, so does the probability
that an individual processor in the system manifests some kind of error.
Moreover, as core counts increase, novel architectures emerge that lack
traditional cache-coherence mechanisms in order to mitigate synchronization
overheads. Finally, high core counts dictate the use of parallel programming
to take advantage of all the resources of the system. To facilitate parallel
programming, several task-based programming models have appeared. The main
advantage of task-based programming models is that they allow programmers to
split a program into tasks and define a dataflow of operations. The Myrmics
runtime system takes this approach a step further: by defining task memory
footprints and using dependency analysis, it is able to parallelize the
computation automatically.
In this thesis, we present FT-Myrmics, an extension of the Myrmics runtime
system for automatic fault tolerance. FT-Myrmics provides transparent fault
tolerance against transient errors. As a baseline for performance, we
implemented a full checkpointing solution on top of Myrmics.
Apart from using full checkpointing, we take advantage of the well-defined
task boundaries provided by the Myrmics programming model to avoid unnecessary
checkpointing. Since we know the exact memory footprint and the type of each
argument, we can avoid checkpointing read-only arguments by enforcing that
they cannot be overwritten, using hardware assistance.
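
The idea of checkpointing only the writable part of a task's footprint can be illustrated with a minimal Python sketch. This is not the Myrmics or FT-Myrmics API: the names `run_task`, `IN`, `INOUT`, `TransientFault` and `make_flaky_accumulate` are hypothetical, and a read-only `memoryview` stands in here, purely as a software assumption, for the hardware-assisted write protection described above.

```python
IN, INOUT = "in", "inout"  # hypothetical access modes, not Myrmics names


class TransientFault(Exception):
    """Stands in for a transient error detected while a task runs."""


def run_task(regions, task_fn, footprint, max_retries=3):
    """Run task_fn, checkpointing only the regions it may write.

    regions:   dict mapping region name -> bytearray
    footprint: dict mapping region name -> IN or INOUT
    Returns the number of bytes that were checkpointed.
    """
    # Read-only (IN) regions are skipped: they cannot change, so there is
    # nothing to restore after a fault.
    checkpoint = {name: bytes(regions[name])
                  for name, mode in footprint.items() if mode == INOUT}
    for _ in range(max_retries + 1):
        # Software stand-in for hardware write protection: IN regions are
        # exposed as read-only views, so a write raises TypeError.
        views = {name: (memoryview(regions[name]) if mode == INOUT
                        else memoryview(regions[name]).toreadonly())
                 for name, mode in footprint.items()}
        try:
            try:
                task_fn(views)
            finally:
                for view in views.values():
                    view.release()
            return sum(len(saved) for saved in checkpoint.values())
        except TransientFault:
            # Roll back only the writable regions, then re-execute the task.
            for name, saved in checkpoint.items():
                regions[name][:] = saved
    raise RuntimeError("task did not complete within the retry budget")


def make_flaky_accumulate():
    """Build a task computing acc[i] += a[i] that faults once, mid-update."""
    calls = {"n": 0}

    def task(views):
        calls["n"] += 1
        for i in range(len(views["a"])):
            views["acc"][i] = views["acc"][i] + views["a"][i]
            if calls["n"] == 1 and i == 0:
                raise TransientFault()  # fault after a partial write

    return task
```

In this sketch, re-executing the accumulate task without the rollback would add the first element twice; restoring the single writable region before the retry is what makes re-execution safe, while the read-only input is never copied.
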
We evaluate the fault-tolerant FT-Myrmics system on a representative set of
benchmarks, using the Formic prototype 512-core processor emulator. We find
that the performance overhead of distributed task checkpointing causes
slowdowns between 1.1x and 5x, depending on the size of the checkpoint.