Abstract
At the beginning of the 21st century, the processor industry made a fundamental
shift towards multicore architectures, in order to address the diminishing
returns in single-thread performance despite increasing transistor counts, and
to overcome the severe power problems of clock-frequency scaling. Semiconductor
technology trends indicate that the era of power- and energy-constrained
manycore architectures has now arrived. Technology projections show that the energy
consumed by data movement and communication will dominate the corresponding
budget of future computing systems; unnecessary data movement will therefore
subtract a significant energy margin from computation.
The most popular communication model for multicore and manycore architectures
is shared memory: threads or processes that run concurrently on different
cores communicate and exchange data by accessing the same global memory locations.
However, accesses to off-chip memory are slow, so processor designs
employ a hierarchy of faster on-chip memories to speed up memory operations.
Memory hierarchies today are based on two dominant schemes: (i) multilevel
coherent caches, and (ii) software-managed local memories (scratchpads).
Caches manage the memory hierarchy transparently, using hardware replacement
policies, and communication happens implicitly, through cache-coherence protocols
that trigger data transfers between caches. Scratchpad memories are controlled
by the programmer or the runtime software, and communication happens explicitly,
through programmable DMA engines that perform the data transfers.
This thesis proposes architectural support in the memory hierarchy to enable
the software to control data locality; we design programmable hardware primitives
that allow runtime software to orchestrate communication and reduce the associated
energy consumption.
We demonstrate a hybrid cache/scratchpad memory hierarchy that provides
unified hardware support for both implicit communication, via cache-coherence,
and explicit communication, via fast virtualized inter-processor communication
hardware primitives. We also introduce Epoch-based Cache Management
(ECM), which allows software to assign priorities to cache lines in order to guide
the cache replacement policy and, in effect, to manage locality. Moreover, we
design the Explicit Bulk Prefetcher (EBP), a programmable prefetch engine that
allows software to prefetch data accurately and ahead of time, in order to hide
memory latency and improve cache locality. Furthermore, we propose a set of hardware
primitives for Software Guided Coherence (SGC) in non-cache-coherent systems,
which allow runtime software to orchestrate the fetching of the most up-to-date
version of data from the appropriate cache(s) and to maintain coherence at
software-object granularity.
We evaluate our proposed hardware primitives by comparing them against
directory-based cache-coherence with hardware prefetching. Our experimental results
for explicit communication show that we can improve performance by 10% to
40% and, at the same time, reduce the energy consumption of on-chip communication
by 35% to 70%, owing to a significant reduction in on-chip traffic, by factors of
2 to 4. Moreover, we exploit a task-based programming system to guide the hardware,
and show that our proposed hardware primitives in cache-coherent systems (ECM,
EBP) improve performance by an average of 20%, inject 25% less on-chip traffic
on average, and reduce the energy consumption in the components of the memory
hierarchy by an average of 28%. Our hardware support for non-cache-coherent systems
(ECM, SGC) improves performance by an average of 14%, injects 41% less
on-chip traffic on average, and reduces the energy consumption in the components
of the memory hierarchy by an average of 44%.