Abstract |
The crossbar is the most popular switch for digital systems such as Internet routers, clusters, and
multiprocessors (on-chip, as well asmultichip). However, because the cost of the crossbar growswith
the square of the radix thereof, and because of past implementations in various technologies, it is
widely believed that the crossbar is not scalable to radices beyond 32 or 64, and that for higher radices
more complicated networks are needed, where the crossbar is the basic building block. In this thesis,
we scale the crossbar to radices well beyond 100 by crafting novel VLSI micro-architectures and their
detailed CMOS layouts.
As a case study, we laid out a 128×128×24Gb/s crossbar, interconnecting 128 1mm2 “user tiles” in a
single hop, using just 16mm2 of silicon in 90nm CMOS. The crossbar is 32bi t s wide, runs at 750MHz,
and consumes 7Wat t s.
In router systems, the user tiles will containmemory implementing combined queueing at the inputs
and outputs of the crossbar, plus a small part of logic for port control. We show that this architecture
is the best among a range of known router memory architectures (e.g. totally shared memory, solely
input queueing, or crosspoint queueing), for two reasons: (i) It gives top performance using only a
modest speedup on either the crossbar or the memories, independent of radix; and (ii) it partitions
the memory space only linearly with the radix, thus yielding: (a) High SRAM density by using few,
large, and area efficient blocks; and (b) highmemory space utilization through flexible sharing among
flows. In chip multiprocessors, the user tiles will contain cache or local memory, plus a small part of
logic for the processor. When traffic is global and heavy, such a system is competitive to the popular
mesh-centric systems, owing to the simplified routing and load balancing of the crossbar.
We made high radix crossbars feasible by developing novel VLSI micro-architectures for both their
datapath and their control path. We implement the datapath using trees of multiplexor gates, as tristate
buses are slowed down by intrinsically large parasitic capacitances, and we show that highly concentrated
trees are more area efficient by further reducing the parasitic capacitance of their internal
wires. Moreover, we contribute an experimental analysis showing that: (i) The area of the crossbar is
gate limited for all practical values of its radix N and its width W, thus growing as O(N2W), not as
O(N2W2), which would have been the case had area been wire limited, as is commonly believed in
the literature; and (ii) the delay of the crossbar is dominated by the parasitics of wires, and because
wire length growswith the perimeter of the crossbar, delay grows as O(NpW), not asO(logN), which
would have been the case had delay been gate limited, as is commonly believed in the literature. Next,
we propose novel pipelines to cope with the delay of the interconnect. Finally, we demonstrate that
modern EDA tools can be guided to exploit the abundance of wiring resources through custom, but
algorithmic placement of gates.
For the control path, we study the architecture of iSLIP, which is the most popular parallel matching
crossbar scheduler. In particular, we study a traditional iSLIP architecture that implements the
matching decision of each input and each output of the crossbar in a separate arbiter block, and communicates
the matching decisions between the input and the output arbiters through global arbiterto-
arbiter links. First, we show that this architecture is expensive because the arbiter-to-arbiter links
take up O(N4) area. Thus, a r adi x-128 iSLIP scheduler occupies 14mm2, where the arbiter-to-arbiter
links account for more than 50%. Next, by observing that the wiring of an arbiter fits in O(NlogN)
area, we propose a novel architecture that inverts the locality of wires by orthogonally interleaving the
input with the output arbiters, thus lowering the wiring area of the scheduler down to O(N2log 2N).
Using this architecture, the r adi x-128 iSLIP scheduler becomes gate limited, fitting in 7mm2, which
is a 50% reduction compared to the traditional. For a higher radix of 256, area is reduced by almost
an order of magnitude. Finally, the running time of the proposed scheduler is less than 10ns, thus
allowing operation with aminimum packet as small as 30By tes at a 24Gb/s line rate.
|