E-Locus - Institutional Repository of the University of Crete

Home Collections School/Department School of Sciences and Engineering Department of Computer Science Doctoral theses

Doctoral theses

Current Record: 74 of 121

[Add to Basket]

Identifier

000375276

Title

VLSI Micro-Architectures for High-Radix Crossbars

Alternative Title

Αρχιτεκτονικές VLSI για σταυραγωγούς υψηλού σθένους.

Author

Πασσάς, Γεώργιος Αύγουστος

Thesis advisor

Κατεβαίνης Μανόλης
Πνευματικάτος Δ.

Abstract

The crossbar is the most popular switch for digital systems such as Internet routers, clusters, and multiprocessors (on-chip, as well asmultichip). However, because the cost of the crossbar growswith the square of the radix thereof, and because of past implementations in various technologies, it is widely believed that the crossbar is not scalable to radices beyond 32 or 64, and that for higher radices more complicated networks are needed, where the crossbar is the basic building block. In this thesis, we scale the crossbar to radices well beyond 100 by crafting novel VLSI micro-architectures and their detailed CMOS layouts. As a case study, we laid out a 128×128×24Gb/s crossbar, interconnecting 128 1mm2 “user tiles” in a single hop, using just 16mm2 of silicon in 90nm CMOS. The crossbar is 32bi t s wide, runs at 750MHz, and consumes 7Wat t s. In router systems, the user tiles will containmemory implementing combined queueing at the inputs and outputs of the crossbar, plus a small part of logic for port control. We show that this architecture is the best among a range of known router memory architectures (e.g. totally shared memory, solely input queueing, or crosspoint queueing), for two reasons: (i) It gives top performance using only a modest speedup on either the crossbar or the memories, independent of radix; and (ii) it partitions the memory space only linearly with the radix, thus yielding: (a) High SRAM density by using few, large, and area efficient blocks; and (b) highmemory space utilization through flexible sharing among flows. In chip multiprocessors, the user tiles will contain cache or local memory, plus a small part of logic for the processor. When traffic is global and heavy, such a system is competitive to the popular mesh-centric systems, owing to the simplified routing and load balancing of the crossbar. We made high radix crossbars feasible by developing novel VLSI micro-architectures for both their datapath and their control path. We implement the datapath using trees of multiplexor gates, as tristate buses are slowed down by intrinsically large parasitic capacitances, and we show that highly concentrated trees are more area efficient by further reducing the parasitic capacitance of their internal wires. Moreover, we contribute an experimental analysis showing that: (i) The area of the crossbar is gate limited for all practical values of its radix N and its width W, thus growing as O(N2W), not as O(N2W2), which would have been the case had area been wire limited, as is commonly believed in the literature; and (ii) the delay of the crossbar is dominated by the parasitics of wires, and because wire length growswith the perimeter of the crossbar, delay grows as O(NpW), not asO(logN), which would have been the case had delay been gate limited, as is commonly believed in the literature. Next, we propose novel pipelines to cope with the delay of the interconnect. Finally, we demonstrate that modern EDA tools can be guided to exploit the abundance of wiring resources through custom, but algorithmic placement of gates. For the control path, we study the architecture of iSLIP, which is the most popular parallel matching crossbar scheduler. In particular, we study a traditional iSLIP architecture that implements the matching decision of each input and each output of the crossbar in a separate arbiter block, and communicates the matching decisions between the input and the output arbiters through global arbiterto- arbiter links. First, we show that this architecture is expensive because the arbiter-to-arbiter links take up O(N4) area. Thus, a r adi x-128 iSLIP scheduler occupies 14mm2, where the arbiter-to-arbiter links account for more than 50%. Next, by observing that the wiring of an arbiter fits in O(NlogN) area, we propose a novel architecture that inverts the locality of wires by orthogonally interleaving the input with the output arbiters, thus lowering the wiring area of the scheduler down to O(N2log 2N). Using this architecture, the r adi x-128 iSLIP scheduler becomes gate limited, fitting in 7mm2, which is a 50% reduction compared to the traditional. For a higher radix of 256, area is reduced by almost an order of magnitude. Finally, the running time of the proposed scheduler is less than 10ns, thus allowing operation with aminimum packet as small as 30By tes at a 24Gb/s line rate.

Language

English

Subject

Crossbar

Scheduler

VLSI

Σταυραγωγός

Χρονοπρογραμματιστής

Issue date

2012-06-14

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Doctoral theses