Abstract |
In recent decades, technology scaling has slowed, mainly because of limits imposed by
rising power consumption. To keep improving performance,
hardware architects have turned to energy-efficient processors, including some based
on the open-source RISC-V Instruction Set Architecture (ISA), which promise energy efficiency and
high performance on multi-core chips.
This thesis contributes the design and implementation of a new approach for
interprocessor stream communication across RISC-V Coherence Islands. Traditionally,
coherence islands communicate memory-to-memory over TCP/IP or Remote Direct
Memory Access (RDMA) interconnects. Writing and reading data to and from memory at
the endpoints increases latency and consumes processor cycles. In our work, by contrast,
communication takes place directly between a core and another (remote) node, which can be
either a core or a memory. In particular, we propose a new Streaming Cache that resides next to
the Level 1 Cache (L1 Cache) and uses the same fast interface for communication with the core.
We split the Streaming Cache into two logical parts: a) the producer, an outgoing streaming
cache that handles streaming data departing from the node; b) the consumer, an incoming
streaming cache that handles streaming data arriving at the node. Effectively, in the proposed
streaming framework, instead of moving data across the main memory of the endpoints, data
of both the producer and the consumer can be accessed with the same latency as the L1 Cache.
To improve performance, we use read-once/store-once cache policies in the Streaming
Cache, which immediately recycle the space of already accessed streaming data. Furthermore,
a Prefetcher fetches data from the (remote) node before they are needed, thus reducing the
cost of read accesses, while write accesses take advantage of a Write-Combiner, which
combines neighboring data and sends them to the (remote) node. In our work, accesses to
streaming data are recognized using virtual addresses, without the need to extend the ISA.
We implemented the proposed system in SystemVerilog, as an extension of the CVA6
(formerly ARIANE) single-core RISC-V CPU. We built the Incoming and Outgoing schemes of
the Streaming Cache, each with four (4) contexts (hardware streams) to support virtualization,
and we tightly coupled them with the Load/Store Unit (LSU) of ARIANE. We also built
communication logic at the edges that sends/receives data over an AXI-4 interconnect.
We synthesized our design for a Xilinx Zynq UltraScale+ MPSoC Field Programmable Gate
Array (FPGA). The Incoming logic of our design utilizes 16839 Look-Up Tables (LUTs), 7506
Registers and 8 Block Random Access Memories (BRAMs), and operates at 275 MHz, while the
Outgoing logic utilizes 23606 LUTs, 8615 Registers and 8 BRAMs, and operates at 210 MHz.
We performed behavioral simulations of our RTL design in order 1) to verify the streaming
functionality when coupled with the RISC-V cores and 2) to evaluate its performance. In our
preliminary evaluations, we stream data from/to the main memory of the ARIANE core, first using
the traditional memory hierarchy and second using our optimized streaming cache. The
promising results highlight the performance gains from the stream-optimized cache policies
of our design, which almost completely eliminates the latency of the network
interconnect in our indicative hand-written bare-metal benchmark programs.
|
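
As a rough illustration of the read-once policy and prefetching summarized above, the following Python sketch models an incoming stream buffer at the behavioral level. It is not the SystemVerilog implementation described in this thesis: the class name `ReadOnceStreamBuffer`, the slot count, and the use of a Python iterable to stand in for the remote node are all assumptions made for the example.

```python
from collections import deque


class ReadOnceStreamBuffer:
    """Toy behavioral model of a read-once incoming stream buffer.

    Hypothetical names and sizes; this is not the thesis's RTL design.
    """

    def __init__(self, capacity, remote_source):
        self.capacity = capacity            # number of slots (assumed value)
        self.slots = deque()                # words fetched but not yet read
        self.remote = iter(remote_source)   # stands in for the remote node
        self.exhausted = False              # True once the remote has no more data

    def prefetch(self):
        """Fill free slots from the remote node ahead of demand."""
        fetched = 0
        while not self.exhausted and len(self.slots) < self.capacity:
            try:
                self.slots.append(next(self.remote))
                fetched += 1
            except StopIteration:
                self.exhausted = True
        return fetched

    def read(self):
        """Return the next word; its slot is recycled immediately."""
        if not self.slots:
            self.prefetch()                 # demand miss: fetch now
        if not self.slots:
            raise EOFError("stream is empty")
        word = self.slots.popleft()         # read-once: slot freed on access
        self.prefetch()                     # refill the freed slot
        return word
```

For example, `ReadOnceStreamBuffer(4, range(10))` returns the ten words in order through successive `read()` calls while never holding more than four of them at once, which is the space-recycling behavior the read-once policy is meant to provide.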