E-Locus - Institutional Repository of the University of Crete

School of Sciences and Engineering, Department of Computer Science: Post-graduate theses

Identifier

000449573

Title

Hardware support for quality of service in an RDMA engine

Alternative Title

Hardware support for the quality of service of an engine for remote direct memory accesses

Author

Μπάρτζης, Σωκράτης Δ.

Thesis advisor

Κατεβαίνης, Μανόλης
Χρυσός, Νικόλαος

Reviewer

Πρατικάκης, Πολύβιος
Παπαευσταθίου, Βασίλειος

Abstract

In recent decades, both research and industry have turned to High Performance Computing (HPC) to meet their ever-increasing computational needs. To provide a high-performance communication framework for European supercomputers, under the EU-funded ExaNeSt and RED-SEA projects, we design a novel Remote Direct Memory Access (RDMA) engine capable of low-latency (under 0.5 μs) and high-throughput (100 Gb/s) communication. In this thesis, we design the Quality of Service (QoS) hardware of this RDMA engine.

Transfers are segmented into blocks, which enables selective retransmission and multi-path routing and avoids per-packet acknowledgment overheads. Small transfers can bypass the RDMA-DRAM path to further minimize latency. Transfers are scheduled at block level according to a user-defined priority; the engine also supports end-to-end flow control and offers network multi-pathing and congestion-management options. A completion notification engine is likewise implemented in hardware. We expose 2048 virtual channels to users, supporting multiple outstanding data transfer requests. Finally, we introduce a novel way of collectively polling the status of multiple channels.

Our register-transfer-level (RTL) hardware implementation is pipelined to achieve higher clock and message rates (1 operation per clock cycle, or 150 MOP/s in our FPGA implementation) while maintaining a low latency of 4 clock cycles for single-block transfers. To further reduce latency, we implement 32 scheduling queues that share a common storage space; together they support one enqueue and one dequeue operation per clock cycle, as well as back-to-back dequeue operations.

We synthesized our design for the Zynq UltraScale+ MPSoC. The RDMA's QoS engine uses 13.3K Look-Up Tables (LUTs), 5.1K registers and 23 BRAM blocks (848 kbits). The maximum frequency achieved on this FPGA was 150 MHz, but this can be further improved, especially in a VLSI implementation.
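The block-level priority scheduling and the shared-space queues described above can be illustrated with a small behavioral model. This is a minimal sketch under stated assumptions, not the thesis's RTL: mapping the 32 queues to priority levels (lower index = higher priority), sizing the shared pool at 2048 entries (one per virtual channel), the 16 KB block size, and the names `Transfer`, `SharedQueues` and `schedule` are all hypothetical. It shows how several FIFO queues can share one storage array through linked lists and a free list, and how a partly sent transfer re-enters its queue after each block so that equal-priority transfers interleave.

```python
"""Behavioral sketch (not the thesis RTL) of block-level priority scheduling
over several FIFO queues that share one storage pool."""

from dataclasses import dataclass

NUM_QUEUES = 32          # assumption: one queue per priority level
POOL_SLOTS = 2048        # assumption: one shared entry per virtual channel
BLOCK_SIZE = 16 * 1024   # hypothetical block size in bytes


@dataclass
class Transfer:
    channel: int
    size: int       # bytes still to be sent
    priority: int   # assumption: 0 is the highest priority


class SharedQueues:
    """NUM_QUEUES linked-list FIFOs carved out of one shared slot array."""

    def __init__(self, num_queues=NUM_QUEUES, slots=POOL_SLOTS):
        self.data = [None] * slots
        # The free list initially chains every slot in order; -1 ends a list.
        self.next = list(range(1, slots)) + [-1]
        self.free = 0
        self.head = [-1] * num_queues
        self.tail = [-1] * num_queues

    def enqueue(self, q, item):
        if self.free == -1:
            raise MemoryError("shared pool exhausted")
        slot = self.free                  # pop a slot from the free list
        self.free = self.next[slot]
        self.data[slot] = item
        self.next[slot] = -1
        if self.tail[q] == -1:            # queue was empty
            self.head[q] = slot
        else:                             # link after the current tail
            self.next[self.tail[q]] = slot
        self.tail[q] = slot

    def dequeue(self, q):
        slot = self.head[q]
        if slot == -1:
            return None                   # queue is empty
        item = self.data[slot]
        self.head[q] = self.next[slot]
        if self.head[q] == -1:
            self.tail[q] = -1
        self.data[slot] = None
        self.next[slot] = self.free       # push the slot back on the free list
        self.free = slot
        return item

    def nonempty(self):
        return [q for q in range(len(self.head)) if self.head[q] != -1]


def schedule(transfers, block_size=BLOCK_SIZE):
    """Return (channel, block_bytes) pairs in issue order: always serve the
    highest-priority non-empty queue; a partly sent transfer goes back to the
    tail of its queue, so equal-priority transfers alternate block by block."""
    queues = SharedQueues()
    for t in transfers:
        queues.enqueue(t.priority, t)
    issued = []
    while True:
        ready = queues.nonempty()
        if not ready:
            return issued
        q = min(ready)                    # lower index = higher priority
        t = queues.dequeue(q)
        block = min(block_size, t.size)
        issued.append((t.channel, block))
        t.size -= block
        if t.size > 0:
            queues.enqueue(q, t)          # re-queue the remainder
```

For example, `schedule([Transfer(1, 40_000, 3), Transfer(2, 20_000, 3), Transfer(7, 1_000, 0)])` issues channel 7's single block first and then alternates 16 KB blocks between channels 1 and 2. Sharing one pool lets a busy queue use slots that idle queues are not holding, which fixed-size per-queue FIFOs cannot do; each enqueue or dequeue touches only a constant number of pointer and data entries.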
Extensive functional verification tests were performed using the Vivado Design Suite. In simulation, the QoS engine developed in this thesis completed 100K outstanding transfers of varying sizes, up to 1 MB. Additionally, we integrated our QoS implementation with the RDMA send unit in a second simulated test-bench and issued 5K transfers of up to 256 KB (256 packets), all of which the design also completed successfully. These tests exercised every possible transfer type, including congestion-managed and fast-path flows, as well as completion notifications.

The design was implemented on the Zynq's FPGA, and performance measurements were taken from user-level programs running on the Zynq's ARM Cortex-A53 core. Completion time for small transfers of up to 512 bytes was measured at 360 ns for intra-node, BRAM-to-BRAM transfers (excluding network and DRAM latencies). This is ten times lower than the latency of the ExaNeSt RDMA, a previous implementation on the same MPSoC that used the ARM Cortex-R5 co-processor for QoS support. Moreover, we significantly improved the achievable transfer rate: the theoretical maximum (line) throughput is reached with transfers as small as 16 KB, whereas the previous implementation required 4 MB transfers. Finally, although the RDMA engine is optimized for and tested with AXI processor interconnects, it can also be connected to PCI or CHI host-processor interconnects.
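Several of the figures above can be cross-checked with simple arithmetic. The sketch below assumes binary units (1 KB = 1024 bytes, 1 MB = 1024 KB) and ignores headers, acknowledgments and every other protocol overhead; `serialization_us` is a hypothetical helper, not part of the thesis. At 150 MHz one clock cycle lasts about 6.67 ns, so the 4-cycle single-block latency is roughly 26.7 ns, and one operation per cycle gives 150 Mop/s. At 100 Gb/s a 16 KB transfer occupies the link for only about 1.31 μs, against roughly 335.5 μs for 4 MB; since a transfer reaches line rate only once its wire time dominates the fixed per-transfer cost, reaching it at 16 KB points to a much smaller overhead to amortize.

```python
# Back-of-the-envelope checks on figures quoted in the abstract.
# Assumes binary units (1 KB = 1024 B) and ignores all protocol overheads.

CLOCK_HZ = 150e6        # maximum FPGA clock reported for the QoS engine
LINE_RATE_BPS = 100e9   # target link throughput, 100 Gb/s
LATENCY_CYCLES = 4      # pipeline latency for a single-block transfer
OPS_PER_CYCLE = 1       # one operation per clock cycle

cycle_ns = 1e9 / CLOCK_HZ                             # about 6.67 ns
single_block_latency_ns = LATENCY_CYCLES * cycle_ns   # about 26.7 ns
message_rate_mops = CLOCK_HZ * OPS_PER_CYCLE / 1e6    # 150 Mop/s


def serialization_us(size_bytes, rate_bps=LINE_RATE_BPS):
    """Time in microseconds to clock size_bytes onto a link at rate_bps."""
    return size_bytes * 8 / rate_bps * 1e6


if __name__ == "__main__":
    print(f"cycle: {cycle_ns:.2f} ns, single-block latency: {single_block_latency_ns:.1f} ns")
    print(f"message rate: {message_rate_mops:.0f} Mop/s")
    for label, size in [("512 B", 512), ("16 KB", 16 * 1024), ("4 MB", 4 * 1024 * 1024)]:
        print(f"{label}: {serialization_us(size):.3f} us on the wire")
```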

Language

English

Subject

Data transmission

FPGA

HPC communication

Networks

Quality of service

Remote direct memory access

Network interfaces

Field-programmable gate array