Abstract |
As the number of processors per chip increases, so does the need for efficient, high-speed communication support, without which applications cannot exploit the numerous cores available in today's chip multiprocessors. Although explicit communication mechanisms such as RDMA can be used, implicit migration of data among the cores provides a simple and intuitive programming model and significantly reduces the programming effort in large-scale systems. This approach, however, introduces the cache coherence problem: multiple copies of the same data must be kept consistent. One solution is to use directory-based coherence protocols, which scale better than broadcast protocols because they reduce the volume of messages exchanged.
In this thesis, a directory-based cache coherence protocol is implemented in a four-core FPGA-based prototype developed at the CARV (Computer Architecture and VLSI Systems) laboratory of FORTH (Foundation for Research and Technology - Hellas). The implemented protocol supports up to 16 processors and is integrated with the existing system, which also provides RDMA and special hardware support for synchronization and explicit management of cache memories. Our main finding is that the area overhead of the coherent system relative to the non-coherent one is only 4% in terms of logic. We evaluate our protocol using custom software micro-benchmarks that emulate common operations found in parallel applications, such as locks and barriers. A matrix multiplication algorithm and a producer-consumer benchmark were also developed for the evaluation. Our results show that our design scales for the matrix multiplication algorithm, achieving a speedup that ranges between 1.96 and 3.74.
|