Critical ALU Path Optimization and Implementation in a BiCMOS Process for Gigahertz Range Processors. Matthew W. Ernest Electrical, Computer and Systems Engineering Dept. Rensselaer Polytechnic Institute. Overview. Motivation Parallel Prefixes and Carry Types HBT Digital Circuits
Critical ALU Path Optimization and Implementation in a BiCMOS Process for Gigahertz Range Processors Matthew W. Ernest Electrical, Computer and Systems Engineering Dept. Rensselaer Polytechnic Institute
Overview • Motivation • Parallel Prefixes and Carry Types • HBT Digital Circuits • Pseudo-carry Adder • Future Directions
Motivation “Speed has always been important otherwise one wouldn't need the computer.” -Seymour Cray • Ubiquity • Simplicity • Complexity
Parallel Prefixes
Given: $x_0, x_1, x_2, \ldots, x_k$
Find: $x_0,\; x_0 \otimes x_1,\; x_0 \otimes x_1 \otimes x_2,\; \ldots,\; x_0 \otimes x_1 \otimes x_2 \otimes \cdots \otimes x_k$
• The set of problems covering sequences of operations where terms are added in order to the result of the previous operation • Carry computation is an application of parallel prefix theory
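The prefix problem above can be sketched in a few lines of Python. This is only an illustration, assuming the operator is associative; the function names `prefix_serial` and `prefix_log_depth` are made up for this sketch and the log-depth variant is the generic Kogge-Stone style scan, not a circuit from these slides.

```python
def prefix_serial(xs, op):
    """Reference version: fold left, keeping every intermediate result."""
    out = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out


def prefix_log_depth(xs, op):
    """Same results in about log2(len(xs)) rounds; each round is fully parallel."""
    ys = list(xs)
    d = 1
    while d < len(ys):
        # Position i >= d combines the partial result d places to its left with its own.
        ys = [ys[i] if i < d else op(ys[i - d], ys[i]) for i in range(len(ys))]
        d *= 2
    return ys
```

Both functions agree for any associative operator, including non-commutative ones such as string concatenation, which is the property carry computation relies on.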
Carry Types: Carry Select • Compute possible results in parallel • Select when actual carry-in available • Requires internal carry for blocks, e.g. ripple • Delay: $O(f(n/b) + b)$, min. $O(n^{1/2})$ • Area: $O(f(n/b) \cdot b + b)$, approx. $2n$ • Affected by block sizing
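A behavioural sketch of carry select, assuming ripple carry inside each block as the slide suggests. The function names, the default block size of 4, and the little-endian bit lists are choices made for this example only.

```python
def ripple_add(a_bits, b_bits, cin):
    """Ripple-carry add of little-endian bit lists; returns (sum_bits, carry_out)."""
    s, c = [], cin
    for a, b in zip(a_bits, b_bits):
        s.append(a ^ b ^ c)
        c = (a & b) | (c & (a ^ b))
    return s, c


def carry_select_add(a, b, n=32, block=4, cin=0):
    """Add two n-bit integers; returns (sum mod 2**n, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    out, c = [], cin
    for lo in range(0, n, block):
        blk_a, blk_b = a_bits[lo:lo + block], b_bits[lo:lo + block]
        # Both candidate results are computed without waiting for the real carry-in.
        s0, c0 = ripple_add(blk_a, blk_b, 0)
        s1, c1 = ripple_add(blk_a, blk_b, 1)
        # The actual carry-in only drives the selection.
        s_sel, c = (s1, c1) if c else (s0, c0)
        out.extend(s_sel)
    return sum(bit << i for i, bit in enumerate(out)), c
```

In hardware the two candidate sums of every block are produced at the same time, so the remaining serial path is one selection per block.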
Carry Types: Carry look-ahead • Carry-out can be “generated” at current position or carry-in “propagated” • Delay: $O(1)$ • Area: $O(n^2)$ • High fan-in/fan-out
Carry Types: Block carry look-ahead • A block propagates a carry if all bits in the block propagate a carry • A block generates a carry if a bit generates a carry and all succeeding bits propagate • Delay: O(log n) • Area: O(n log n)
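The block generate and propagate rules can be written as a single combining operator on (G, P) pairs. The sketch below is a software model under that assumption; `combine` and `carries` are illustrative names, and the log-depth scan stands in for whatever tree a real design would use.

```python
def combine(hi, lo):
    """Merge two adjacent blocks given as (generate, propagate); hi is more significant."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    # Generates if hi generates, or lo generates and hi propagates it.
    # Propagates only if both blocks propagate.
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def carries(a, b, n, cin=0):
    """Carry out of every bit position 0..n-1 using a log-depth scan of (G, P) pairs."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1) for i in range(n)]
    d = 1
    while d < n:
        gp = [gp[i] if i < d else combine(gp[i], gp[i - d]) for i in range(n)]
        d *= 2
    # After the scan, entry i describes the block made of bits 0..i.
    return [g | (p & cin) for g, p in gp]
```

Because `combine` is associative, the pairs can be merged in any grouping, which is what allows the $O(\log n)$ delay quoted above.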
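Carry vs. Pseudo-carry
$C_{out} = G_n + P_n G_{n-1} + \cdots + P_n P_{n-1} \cdots P_0 C_{in}$
If $G = A \cdot B$ and $P = A + B$ then $G = G \cdot P$
$C_{out} = P_n G_n + P_n G_{n-1} + \cdots + P_n P_{n-1} \cdots P_0 C_{in}$
$C_{out} = P_n (G_n + G_{n-1} + \cdots + P_{n-1} \cdots P_0 C_{in})$
$C_{out} = P_n H_n$
$H_n = G_n + G_{n-1} + \cdots + P_{n-1} \cdots P_0 C_{in}$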
Carry vs. Pseudo-carry • Redundant terms create factorization opportunities • Factorization moves terms from critical paths to non-critical paths • Multiple paths can be parallelized • Products with fewer terms lead to implementations with smaller, faster gates
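The factorization can be checked exhaustively for a small width. This is a minimal sketch, assuming $G = A \cdot B$ and $P = A + B$ as stated in the derivation; the helper names are invented for the example and lists are indexed with bit 0 first.

```python
from itertools import product


def carry_out(g, p, cin):
    """C = G_n + P_n*G_{n-1} + ... + P_n*...*P_0*Cin, evaluated by rippling upward."""
    c = cin
    for gi, pi in zip(g, p):
        c = gi | (pi & c)
    return c


def pseudo_carry(g, p, cin):
    """H_n = G_n + G_{n-1} + P_{n-1}*G_{n-2} + ... + P_{n-1}*...*P_0*Cin."""
    carry_into_top = carry_out(g[:-1], p[:-1], cin)
    return g[-1] | carry_into_top


def check(width=4):
    """True if C_out == P_n * H_n for every operand pair and carry-in of this width."""
    for bits in product((0, 1), repeat=2 * width + 1):
        a, b, cin = bits[:width], bits[width:2 * width], bits[-1]
        g = [x & y for x, y in zip(a, b)]
        p = [x | y for x, y in zip(a, b)]
        if carry_out(g, p, cin) != (p[-1] & pseudo_carry(g, p, cin)):
            return False
    return True
```

The check would fail if P were defined as $A \oplus B$, since the step $G = G \cdot P$ needs the inclusive OR.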
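Deriving Block Pseudo-carry from Block Carry Look-ahead Terms • Pseudo-carries can be generated in blocks like carries
Block Generate: $G^{i \cdot j}_0 = G^i_j + P^i_j G^i_{j-1} + \cdots + P^i_j P^i_{j-1} P^i_{j-2} \cdots G^i_0$
If $G = A \cdot B$ and $P = A + B$ then $G = G \cdot P$
$G^{i \cdot j}_0 = P^i_j G^i_j + P^i_j G^i_{j-1} + \cdots + P^i_j P^i_{j-1} P^i_{j-2} \cdots G^i_0$
$G^{i \cdot j}_0 = P^i_j (G^i_j + G^i_{j-1} + \cdots + P^i_{j-1} P^i_{j-2} \cdots G^i_0)$
$H^{i \cdot j}_0 = G^i_j + G^i_{j-1} + \cdots + P^i_{j-1} P^i_{j-2} \cdots G^i_0$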
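Generalized Pseudocarry Equations
$H^2_s = G^1_{s+1} + G^1_s$
$H^{i+j}_s = H^j_{s+i} + I^j_{s+i-1} \cdot H^i_s$
$H^{i+j+k}_s = H^k_{s+i+j} + I^k_{s+i+j-1} \cdot H^j_{s+i} + I^k_{s+i+j-1} \cdot I^j_{s+i-1} \cdot H^i_s$
$I^{p+q}_t = I^q_{t+p} \cdot I^p_t$
$I^{p+q+r}_t = I^r_{t+q+p} \cdot I^q_{t+p} \cdot I^p_t$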
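The two-way combining rule can be spot-checked numerically. The sketch below rests on an interpretation that the slides do not spell out: the superscript is read as a block width and the subscript as its starting bit, with $H^w_s$ the pseudo-carry of bits $s$ to $s+w-1$ (no carry-in) and $I^w_t = P_t P_{t+1} \cdots P_{t+w-1}$. The function names and random test sizes are arbitrary.

```python
import random


def H(g, p, s, w):
    """Assumed meaning of H^w_s: G of the top bit OR the carry out of the bits below it."""
    c = 0
    for k in range(s, s + w - 1):
        c = g[k] | (p[k] & c)
    return g[s + w - 1] | c


def I(p, t, w):
    """Assumed meaning of I^w_t: AND of the w propagate bits starting at position t."""
    return int(all(p[t:t + w]))


def check_two_way(n=12, trials=2000, seed=1):
    """True if H^{i+j}_s == H^j_{s+i} + I^j_{s+i-1} * H^i_s on random operands."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(0, 1) for _ in range(n)]
        b = [rng.randint(0, 1) for _ in range(n)]
        g = [x & y for x, y in zip(a, b)]
        p = [x | y for x, y in zip(a, b)]
        i, j = rng.randint(1, 4), rng.randint(1, 4)
        s = rng.randint(0, n - i - j)
        lhs = H(g, p, s, i + j)
        rhs = H(g, p, s + i, j) | (I(p, s + i - 1, j) & H(g, p, s, i))
        if lhs != rhs:
            return False
    return True
```

Note that the propagate group starts one position below the upper block. That shift is the visible trace of having factored the top propagate term out of each pseudo-carry.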
Generating Sums Using Pseudocarry • Sum with pseudo-carry no more complex than sum with carry • Other look-ahead features still apply, e.g. Han-Carlson “every other carry”
$S_n = A_n \oplus B_n \oplus C_{n-1}$
If $T_n = A_n \oplus B_n$ and $C_m = P_m \cdot H_m$ then $S_n = T_n \oplus P_{n-1} H_{n-1}$
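A small software model of the sum equation, assuming a carry-in of 0 and computing the pseudo-carries serially for clarity (a real adder would use the look-ahead tree). The function name `add_with_pseudocarry` is hypothetical.

```python
def add_with_pseudocarry(a, b, n=32):
    """Add two n-bit integers using S_k = T_k xor (P_{k-1} * H_{k-1}); returns (sum, carry_out)."""
    g = [((a >> k) & (b >> k)) & 1 for k in range(n)]
    p = [((a >> k) | (b >> k)) & 1 for k in range(n)]
    t = [((a >> k) ^ (b >> k)) & 1 for k in range(n)]
    h, c_prev = [], 0                    # c_prev holds C_{k-1}; the carry-in is assumed 0
    for k in range(n):
        h.append(g[k] | c_prev)          # H_k = G_k + C_{k-1}
        c_prev = p[k] & h[k]             # C_k = P_k * H_k
    s = [t[0]] + [t[k] ^ (p[k - 1] & h[k - 1]) for k in range(1, n)]
    return sum(bit << k for k, bit in enumerate(s)), p[n - 1] & h[n - 1]
```

The only extra work compared with a carry-based sum is the AND with $P_{n-1}$, which is available early and therefore sits off the critical path.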
Adder comparison
CSel PCLA Ripple CLA Bits C B A
32 32 12 12 9 6 5
64 64 20 16 12 7 6
HBT Digital Circuits • Exponential I/V relationship leads to high gain and fast switching • Vertical arrangement allows critical dimensions to be smaller with tighter tolerances • Traditionally high DC power consumption: compare increasing leakage and switching currents for FETs
Current Steering Logic • Constant current source equals combined emitter currents • Ratio of current through each transistor is an exponential function of the base voltage difference • Difference in currents at collector converted to difference in voltage on pull-up resistors
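The current split can be illustrated with the ideal bipolar differential-pair relation. This is a textbook idealization (base currents and parasitics ignored), and the tail current, load resistance and thermal voltage below are placeholder values, not figures from these slides.

```python
import math


def steer(v_diff, i_tail=1e-3, r_load=250.0, v_t=0.02585):
    """Ideal differential pair: returns (i1, i2, differential output voltage).

    v_diff is the base voltage difference in volts; i_tail, r_load and v_t are
    illustrative defaults chosen for this sketch.
    """
    i1 = i_tail / (1.0 + math.exp(-v_diff / v_t))   # share steered to transistor 1
    i2 = i_tail - i1                                # the rest flows in transistor 2
    v_out = (i1 - i2) * r_load                      # difference across the pull-up resistors
    return i1, i2, v_out
```

With these assumptions the two currents always sum to the tail current, and their ratio is `exp(v_diff / v_t)`, so a few hundred millivolts of input difference steers essentially all of the current to one side.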
Single-ended vs. Double-ended • Single-ended: limited to simple functions; large fan-in • Double-ended: any function of inputs; fan-in limited by supply voltage
Look-ahead gate w/ fully differential logic [schematic; differential signals $H_{n-2}$, $H_{n-1}$, $H_n$, $I_{n-1}$, $I_n$]
Mixed input look-ahead gates [schematic; inputs $H_n$, $H_{n-1}$, $I_n$ and reference $V_r$] • $I_n (H_n + H_{n-1}) + \overline{I_n} \cdot H_n$ • $H_n + I_n \cdot H_{n-1}$ • Two series-gated levels for three inputs
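Mixed input look-ahead gates [schematic; inputs $H_n$, $H_{n-1}$, $H_{n-2}$, $I_n$, $I_{n-1}$] • $I_n I_{n-1} (H_n + H_{n-1} + H_{n-2}) + I_n \overline{I_{n-1}} (H_n + H_{n-1}) + \overline{I_n \cdot I_{n-1}} \cdot H_n$ • $H_n + I_n \cdot H_{n-1} + I_n \cdot I_{n-1} \cdot H_{n-2}$ • Three series-gated levels for five inputs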
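The two gate expressions can be compared against their look-ahead forms with a truth table. The placement of the complements is an assumption here, since overbars do not survive in the slide text: the three-input gate is taken as $I_n(H_n + H_{n-1}) + \overline{I_n} H_n$, and the five-input gate uses $I_n \overline{I_{n-1}}$ in its second term and $\overline{I_n I_{n-1}}$ in its third. Function names are illustrative.

```python
from itertools import product


def gate3(i_n, h_n, h_n1):
    """Series-gated form of the three-input gate (assumed complement on the second I_n)."""
    return (i_n & (h_n | h_n1)) | ((1 - i_n) & h_n)


def gate5(i_n, i_n1, h_n, h_n1, h_n2):
    """Series-gated form of the five-input gate (assumed complement placement)."""
    both = i_n & i_n1
    return ((both & (h_n | h_n1 | h_n2))
            | (i_n & (1 - i_n1) & (h_n | h_n1))
            | ((1 - both) & h_n))


def check():
    """True if both gates equal the look-ahead sums H_n + I_n*H_{n-1} (+ I_n*I_{n-1}*H_{n-2})."""
    ok3 = all(gate3(i, h, h1) == (h | (i & h1))
              for i, h, h1 in product((0, 1), repeat=3))
    ok5 = all(gate5(i, i1, h, h1, h2) == (h | (i & h1) | (i & i1 & h2))
              for i, i1, h, h1, h2 in product((0, 1), repeat=5))
    return ok3 and ok5
```

Under this reading each input combination selects exactly the OR of the H terms that should reach the output, which is how a series-gated current tree evaluates the function.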
Pseudocarry Blocks [diagram: sixteen $H^2_s$ blocks, five $H^6_s$ blocks, then $H^{18}_s$, $H^{14}_s$ and $H^{32}_s$]
Pseudocarry Tree Oscillator [block diagram: Select (0/1), inputs A and B, Cin, Cout]
Carry Tree High-speed Output: 2 x 165 ps
Breakdown of measured delay • Total measured delay = 165 ps • Devices: 71% • Wire C: 12% • Resistor model and temperature: the remaining 11% and 6%
Loaded vs. unloaded toggling • At design time, $f_T$ peak at 1.2 mA/µm² but limit at 2 mA/µm² • For some devices, max. frequency when driving a load can occur above the $f_T$ peak current • Models supported this, and there was no reason at the time to doubt them • However, models are never qualified above $f_T$ peak current!
Cadence internal parasitic methods • Approximates all capacitance as polynomial function of distance between conductors • Cannot extract RC and capacitance between conductors at the same time: killer for differential wiring! • Convenient, but window of usability small and shrinking
QuickCap capacitance extraction • Field solving with floating random walk method • Accuracy almost wholly a function of run time: 4x run time gives ½ the error • Random walks are independent, so parallelization is near perfect
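The "4x run time, ½ error" rule is the usual behaviour of a Monte Carlo estimate, whose statistical error falls as one over the square root of the number of samples. The sketch below shows that scaling on a toy estimator (the mean of uniform random numbers); it is not a floating random walk solver and has nothing to do with QuickCap's internals.

```python
import random


def spread_of_estimate(samples_per_run, runs=400, seed=0):
    """Standard deviation of a Monte Carlo mean across many independent runs."""
    rng = random.Random(seed)
    means = []
    for _ in range(runs):
        total = sum(rng.random() for _ in range(samples_per_run))
        means.append(total / samples_per_run)
    centre = sum(means) / runs
    return (sum((m - centre) ** 2 for m in means) / runs) ** 0.5


def error_ratio(base_samples=100):
    """Spread with base_samples divided by spread with four times as many samples."""
    return spread_of_estimate(base_samples) / spread_of_estimate(4 * base_samples)
```

`error_ratio()` comes out close to 2, and because every run is independent the samples could be split across machines without changing the result.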
Cadence/QuickCap Design Flow • Extract physical data from layout • Compute RC with QuickCap • Extract netlist from schematic • Combine to simulate with Spectre
Partial manual extraction with QuickCap • Identify main wires of oscillation paths: approx. dozen pairs • QuickCap extraction for each wire-ground cap. and cap. between pair • Add RC-ladder for each pair by hand to schematic and simulate
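For a feel of what an RC ladder adds to a path, the Elmore delay gives a quick first-order estimate. This is a swapped-in hand calculation, not the Spectre simulation used in the slides, and the function name, segment count and example values are hypothetical.

```python
def elmore_delay(r_total, c_total, segments=10, c_load=0.0):
    """First-order (Elmore) delay of a uniform RC ladder, in seconds.

    r_total and c_total are the wire's total resistance (ohms) and capacitance
    (farads), split evenly over `segments` series-R / shunt-C sections.
    """
    r = r_total / segments
    c = c_total / segments
    delay, r_upstream = 0.0, 0.0
    for _ in range(segments):
        r_upstream += r              # resistance between the driver and this node
        delay += r_upstream * c      # each capacitor charges through its upstream resistance
    return delay + r_upstream * c_load
```

With one segment the estimate is simply R·C, and as the number of segments grows it approaches R·C/2, which is why a lumped single-RC model overestimates distributed wire delay.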
Simulation with Parasitic Extraction (all values in ps)
Feedback path | w/o parasitics | QuickCap parasitic cap. | COEFGEN parasitic cap. | Raphael parasitic cap. | QuickCap parasitic RC
Cin | 100 | 121 | 128 | 131 | 135
A1 | 103 | 123 | 130 | 129 | 137
A31 | 108 | 127 | 129 | 132 | 141
Pseudo-carry Tree configured as Ring Oscillator [block diagram: operand patterns 00...00 and 11...11, Sel 0, Sel 1, inputs A and B, Cin, Cout]
Carry Tree High-speed Outputs 16 x 146 ps
Comparisons of published adders
Reference | Type | Size | Gate Del. | Time
ZIMM96 | Carry | 32 | 5 | -
STEL96 | Adder | 64(32) | 12.5(12?) | -
WANG97 | Adder | 32 | 3 | 2.7 ns
CHAN98 | Adder | 64(32) | 27(19.5) | -
SILB98 | Fixed | 64 | - | 550 ps
AIPP99 | Adder | 64 | - | 660 ps
SAGE01 | Adder | 32[16x2] | - | <500 ps
MATH01 | Adder | 64 | - | 482 ps
STAS01 | Adder | 64 | - | 440 ps
LEE02 | Adder | 64 | | 900 ps
VANA02 | ALU | 32 | 8 | <200 ps
Cascode Output Stage • Eliminates Miller capacitance between input and output • Reduces Cjc and Cjs on outputs • Shortens rise time, but increases delay
“Wide/Short” gate with dotted emitter/collector • Shorter trees lead to lower supply voltages • Wider trees reduce ratio of emitter-followers to terms computed, lowering total current • More inputs per look-ahead gate means fewer look-ahead levels • Elimination of single-ended inputs on critical H signals allows faster switching with reduced swing
Even wider look-ahead gate Width limited by • Accumulated Cjc and Cjs of dotted-and node • Saturation vs. breakdown • Fan-out loading from inputs and interconnect
Conclusions • 32-bit addition depth reduced to 5 gates in the fabricated circuit; 4- and 3-gate-depth circuits designed. • Gate to compute 3-way look-ahead fabricated. Up to 8-way look-ahead designed. • Carry delay for 32-bit addition measured at 146 ps. • QuickCap technology file for 5HP brings simulated results within 11% of measured.