Abstract
In recent decades, both research and industry have turned to High Performance
Computing (HPC) for their ever-increasing computational needs. In an attempt to
provide a high-performance communication framework for European
supercomputers, under the EU-funded ExaNeSt and RED-SEA projects, we design a
novel Remote Direct Memory Access (RDMA) engine, capable of low latency (less than
0.5 μs) and high throughput communication (100 Gb/s).
In this thesis, we design the Quality of Service (QoS) hardware of our RDMA
engine. Transfers are segmented into blocks to enable selective retransmission and
multi-path routing, and to avoid per-packet acknowledgment overheads. Small
transfers can bypass the RDMA-DRAM path to further reduce latency. We schedule
transfers at block level based on a user-defined priority, support end-to-end flow
control, and provide network multi-pathing and congestion management options.
We also implement a completion notification engine in hardware. We expose 2048
virtual channels to users, supporting multiple outstanding data transfer requests.
Finally, we introduce a novel way of collectively polling the status of multiple channels.
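
As an illustration of block-level, priority-based scheduling, the following minimal software sketch segments each transfer into blocks and always serves the next block of the highest-priority transfer, interleaving transfers of equal priority. It is a behavioral model under stated assumptions, not the RTL design: the block size, the convention that a lower number means higher priority, and the names `BlockScheduler`, `submit` and `next_block` are hypothetical.

```python
import heapq
from itertools import count

BLOCK_SIZE = 16 * 1024  # hypothetical block size in bytes (not taken from the thesis)


class BlockScheduler:
    """Behavioral model: one block is scheduled at a time, by priority."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        # Heap entries: (priority, seq, transfer_id, next_offset, size).
        # Assumption: a lower priority value is served first.
        self._heap = []
        # Monotonic counter: breaks ties so equal-priority transfers alternate.
        self._seq = count()

    def submit(self, transfer_id, size, priority):
        """Register a transfer of `size` bytes with a user-defined priority."""
        if size <= 0:
            raise ValueError("transfer size must be positive")
        heapq.heappush(self._heap, (priority, next(self._seq), transfer_id, 0, size))

    def next_block(self):
        """Return (transfer_id, offset, length) of the next block, or None."""
        if not self._heap:
            return None
        priority, _, tid, offset, size = heapq.heappop(self._heap)
        length = min(self.block_size, size - offset)
        if offset + length < size:
            # Blocks remain: requeue the transfer behind its equal-priority peers.
            heapq.heappush(
                self._heap, (priority, next(self._seq), tid, offset + length, size)
            )
        return tid, offset, length
```

Requeuing a transfer after each block is what lets a newly submitted high-priority transfer overtake a long low-priority one at the next block boundary.
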
Our register-transfer-level (RTL) hardware implementation is pipelined in order
to achieve higher clock and message rates (1 operation/clock cycle, or 150 MOP/s in
our FPGA implementation), while maintaining a low latency of 4 clock cycles for single
block transfers. To further reduce latency, we implement 32 scheduling queues in
shared space that support one enqueue and one dequeue operation per clock cycle,
as well as back-to-back dequeue operations.
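
One common way to keep several queues in shared space is to store them as linked lists over a single pool of slots, with a free list that recycles released slots. The sketch below models that technique in software. Whether the hardware uses exactly this organization is an assumption, and the pool size and the names `SharedQueues`, `enqueue` and `dequeue` are hypothetical; only the queue count of 32 comes from the text above.

```python
class SharedQueues:
    """FIFO queues kept as linked lists inside one shared pool of slots."""

    def __init__(self, num_queues=32, num_slots=64):
        # num_slots is a hypothetical pool size chosen for illustration.
        self.data = [None] * num_slots
        # Initially every slot is chained into the free list: 0 -> 1 -> ... -> -1.
        self.next = list(range(1, num_slots)) + [-1]
        self.free_head = 0
        self.head = [-1] * num_queues  # -1 marks an empty queue
        self.tail = [-1] * num_queues

    def enqueue(self, q, item):
        """Append `item` to queue `q`; return False if the pool is full."""
        slot = self.free_head
        if slot == -1:
            return False
        self.free_head = self.next[slot]  # take a slot from the free list
        self.data[slot] = item
        self.next[slot] = -1
        if self.head[q] == -1:
            self.head[q] = slot           # queue was empty
        else:
            self.next[self.tail[q]] = slot
        self.tail[q] = slot
        return True

    def dequeue(self, q):
        """Remove and return the oldest item of queue `q`, or None if empty."""
        slot = self.head[q]
        if slot == -1:
            return None
        item = self.data[slot]
        self.head[q] = self.next[slot]
        if self.head[q] == -1:
            self.tail[q] = -1
        # Return the slot to the free list so any queue can reuse it.
        self.data[slot] = None
        self.next[slot] = self.free_head
        self.free_head = slot
        return item
```

Sharing one pool means a busy queue can use capacity that idle queues do not need, at the cost of pointer updates on every operation.
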
We synthesized our design for the Zynq UltraScale+ MPSoC. The RDMA QoS
engine uses 13.3K look-up tables (LUTs), 5.1K registers and 23 BRAM blocks (848
kbits). The maximum frequency achieved on this FPGA was 150 MHz, but this can be
further improved, especially in a VLSI implementation.
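
As a quick check of how the reported figures relate, assuming the pipeline runs at the 150 MHz clock reported for the FPGA and accepts one operation per cycle, the message rate and the 4-cycle single-block latency work out as:

\[
R = f_{\mathrm{clk}} \times 1\,\tfrac{\text{operation}}{\text{cycle}} = 150 \times 10^{6}\ \text{operations/s} = 150\ \text{MOP/s},
\qquad
t_{\mathrm{block}} = \frac{4\ \text{cycles}}{150\ \text{MHz}} \approx 26.7\ \text{ns}.
\]

This covers only the pipeline of the QoS engine itself; end-to-end completion times measured from software are necessarily larger.
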
Extensive functional verification tests were performed using the Vivado Design
Suite. In simulation, the QoS engine developed in this thesis completed 100K
outstanding transfers of varying size, up to 1 MB. Additionally, we integrated our QoS
implementation with the RDMA send unit in another simulated test-bench, issuing 5K
transfers of up to 256 KB (256 packets) each, which the design also completed
successfully. In these tests, we examined every possible transfer type, including
congestion-managed and fast-path flows, as well as completion notifications.
The design was implemented on the Zynq's FPGA and performance
measurements were taken from user-level programs on the Zynq's ARM Cortex-A53 core.
Completion time for small transfers of up to 512 bytes was measured at 360 ns for
intra-node, BRAM-to-BRAM transfers (excluding network and DRAM latencies), ten
times lower than the latency of the ExaNeSt RDMA, a previous implementation on the
same MPSoC that used the ARM Cortex-R5 co-processor for QoS support. Moreover, we
significantly improved the achievable transfer rate, reaching the theoretical
maximum (line) throughput with transfers as small as 16 KB, whereas the
previous implementation required 4 MB transfers to do so. Finally, although
the RDMA engine is optimized for and tested using AXI processor interconnects, it can
also be connected to PCI or CHI host-processor interconnects.