A 64b Adder Using Self-Calibrating Differential Output Prediction Logic

Advanced VLSI, Fall 2006 class presentation
Paper: "A 64b Adder Using Self-Calibrating Differential Output Prediction Logic," ISSCC 2006
Authors: K. H. Chong and Larry McMurchie, Dept. of Electrical Engineering, University of Washington; Carl Sechen, Electrical Engineering Dept., University of Texas at Dallas
Presented by: A. Jahanshahi, Dept. of Electrical and Computer Engineering, University of Tehran
Supervisor: M. Fakhraei

Outline
• History of Output Prediction Logic
• Introduction to Output Prediction Logic (OPL)
  • Fastest digital logic technique
• Self-calibrating Differential OPL (DOPL)
  • Twice as fast and uses half the energy compared to domino logic
• High-speed, power-efficient 64b adder architecture
  • Valency-3 Kogge-Stone sparse tree
  • DOPL-specific 3b carry-select units
  • Only 5 logic levels
• Measurement results

History
• OPL was first introduced in 2000 [1], with a speedup of 2X to 3X over optimized static CMOS, and up to 5X when applied to wide-input NOR gates [1].
• Differential OPL (DOPL) is 5 times faster than optimized static CMOS and nearly 2 times faster than OPL and domino [2].
• To date, several successful chips have been reported [3-7].

Output Prediction Logic
[Figure: four-gate chains (Gate 1 to Gate 4); the OPL chain is enabled by clk1 to clk4 and all gate outputs are predicted to be 1]
• Both static and OPL gates are inherently inverting
• In the worst case for static CMOS, the output of every gate in a critical path must fully transition from 1 to 0, or 0 to 1
• OPL reduces the worst case by predicting that all outputs are 1
• On any critical path, only every other gate will have to transition (pull down)
• If consecutive gates pull down, then this is not a critical path, since the first pull-down event does not cause the second
• Critical path delay will be reduced by at least 50%
• Speedup > 2X by skewing the gates for pull-down (pd) transitions
• How to achieve high inputs and high outputs on inverting gates?
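The "every other gate" argument above can be illustrated with a minimal sketch. It assumes the simplest possible critical path, a chain of inverters, and the function name opl_chain_pulldowns is made up for this example. It only counts which gates must leave their predicted value of 1; it does not model timing.

```python
def opl_chain_pulldowns(first_input, n_gates):
    """Evaluate a chain of inverting gates whose outputs are all predicted to be 1.

    Returns the final output of each gate and the indices of the gates that
    had to pull down (final value 0), i.e. whose prediction was wrong.
    Assumption: each gate is modeled as a plain inverter.
    """
    outputs = []
    value = first_input
    for _ in range(n_gates):
        value = 1 - value          # inverting gate
        outputs.append(value)
    pulled_down = [i for i, v in enumerate(outputs) if v == 0]
    return outputs, pulled_down


outputs, pulled_down = opl_chain_pulldowns(first_input=1, n_gates=4)
print(outputs)      # [0, 1, 0, 1]
print(pulled_down)  # [0, 2]
```

In this model a static CMOS chain would toggle all four outputs in its worst case, while the predicted chain only pulls down gates 0 and 2.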

Three Types of Single-Rail OPL Gates
[Figure: OPL-static, OPL-pseudo, and OPL-dynamic gate schematics, each with inputs a, b, c, clock clk, and output out]
• In all 3 cases, when gates are not enabled (clk = 0), the output will be high even if inputs are high
• Enable the gates (clk = 1) when inputs have arrived

OPL Clock Separations
[Figure: four-gate chain clocked by clk1 to clk4, with a timing diagram of clk(i-1), clk(i), clk(i+1) showing the clock separation t_i]
• Predicted 1's are maintained by delaying the clocks
• One fast pull-down event every TWO adjacent clock separations!
  • for ANY critical path
• Separations are small, less than an inverter delay, and can be produced, for example, by reduced-swing logic [3]

Delay vs. Clock Separation
[Figure: delay (ns) vs. clock separation (ns), with IN, OUT, and CLK waveforms between GND and VDD for three regimes]
• Clock too early
• Delay optimal
• Clock blocking
• Red: OPL-static NOR3 chain
• Blue point is the optimally sized static CMOS NOR3 chain
• Robust: +/- 30% around the nominal separation of 0.14 ns gives > 2X speedup [2]
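As a small worked example of the robustness figure above, the sketch below computes the separation window implied by +/- 30% around 0.14 and a clock schedule for a four-gate chain. The unit (ns) is taken from the plot axis, the uniform spacing between neighbouring clocks is an assumption, and both function names are invented for this illustration.

```python
def separation_window(nominal_ns=0.14, tolerance=0.30):
    """Lower and upper clock separation for a +/- tolerance band around nominal."""
    low = round(nominal_ns * (1 - tolerance), 3)
    high = round(nominal_ns * (1 + tolerance), 3)
    return low, high


def clock_schedule(n_gates, separation_ns):
    """Enable time of each gate's clock, assuming a uniform separation."""
    return [round(i * separation_ns, 3) for i in range(n_gates)]


print(separation_window())      # (0.098, 0.182)
print(clock_schedule(4, 0.14))  # [0.0, 0.14, 0.28, 0.42]
```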

OPL-Differential Gates
• Drawback of true differential gates is that one side or the other will have a tall stack of devices
• In differential domino, in the worst case, every stack on a particular signal path will have to discharge
• In OPL-differential, at most every other stack on a critical signal path will have to discharge: 2X speedup

Diff. Domino vs. OPL-Diff.
• Delays (ns) for chains of 10 gates (FO of 4) in TSMC 0.18um
• Static CMOS chains are optimally sized
• Domino and OPL-differential use the same size transistors

Improved OPL-Differential
[Figure: two OPL-differential gate schematics with pMOS devices P1 to P4, clock clk, and outputs OUT and OUTB]
• S. Kio, L. McMurchie, and C. Sechen, “Application of Output Prediction Logic to Differential CMOS,” Proc. of IEEE Computer Society Annual Workshop on VLSI, Orlando, FL, April 19-20, 2001
• tfall(OUT) [relation symbol lost in extraction] tfall(OUTB) needed for contention-free evaluation

Self-Calibrating Differential OPL
[Figure: five DOPL levels (1st to 5th), each with its own clk, driven from clk_ref through a T-gate and a buffer tree]
• Dual output gates: use a completion detector to produce a downstream clock
• Ideally should feed to the next level
• But, DOPL gates are too fast!
• If a DOPL gate evaluates slower (faster) than expected, the downstream clock will be delayed (sped up) to compensate

Clock Skew Reduction
[Figure: two rows of five DOPL levels (1st to 5th) sharing a buffer tree, clk_ref, and T-gate]
• DOPL circuits are levelized
• Completion detector outputs for each level are tied together
• Cannot use static CMOS NAND2's due to contention

pMOS Dynamic NAND2 Completion Detector
[Figure: DOPL1 and DOPL2 gates whose differential outputs out1 to out4 drive the evaluate devices of the completion detector, with Reset devices and output clk]
• Minimizes crowbar current
• Fast, monotonic rising clock edge
• Power consumption is comparable to that of an inverter
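The logic function of the completion detector can be sketched behaviorally as follows. This is only a stateless truth-table model under stated assumptions: both DOPL outputs are predicted high before evaluation, reset = 1 holds the detector output low, and the function name is hypothetical. It does not capture the dynamic (charge-holding) behavior or the crowbar-current properties listed above.

```python
def completion_detector(out, outb, reset):
    """Behavioral model of a NAND2-style completion detector.

    out, outb: differential DOPL outputs, both 1 until the gate evaluates.
    reset:     1 during the reset phase (detector output forced low).
    Returns 1 (rising downstream clock) once either rail has pulled down.
    """
    if reset:
        return 0
    return int(not (out and outb))


print(completion_detector(1, 1, reset=0))  # 0: gate has not evaluated yet
print(completion_detector(0, 1, reset=0))  # 1: OUT pulled down
print(completion_detector(1, 0, reset=0))  # 1: OUTB pulled down
```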

Low Skew Inverter Generates Reset Signals
[Figure: five DOPL levels with Reset(1) to Reset(n) signals generated from clk_ref through a T-gate and buffer tree; low-skew inverter with inputs in1 to in4 and clk]
• Crowbar current is minimized:
  • Reset goes low slightly before DOPL evaluates
  • Reset goes high slightly after DOPL pre-charges

Self-Calibrating DOPL Floorplan
[Figure: floorplan with clk_reference, mesh buffer trees, low-skew inverter, completion detectors, and 1st to 5th level DOPL rows]
• A level may span multiple rows

64b Adder Architecture
[Figure: sparse carry tree producing carries c1, c4, c7, c10, c13, c16]
• Valency-3 Kogge-Stone sparse carry tree
• log₃N levels for every 3rd carry, but the challenge is in efficiently producing the “missing” pairs of carries

3b Carry-Select Units
[Figure: carry-select units for sum bits S1-S0, S4-S2, and S7-S5; each MUX chooses between the sums precomputed for carry-in 1 and carry-in 0, selected by Cin, C2, and C5; C8 feeds the next unit]
• Valency-3 Kogge-Stone sparse tree quickly generates every 3rd carry
  • log₃N levels
• Use a carry-select CLA adder to output sums when this “quick” carry arrives
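The combination of a sparse valency-3 prefix tree with carry-select units can be sketched as a bit-level behavioral model. This is an illustration under simplifying assumptions, not the gate-level design from the paper: groups are uniform 3-bit slices starting at bit 0 (the slides show a different first group), the carry-in is folded into the first group, and the names sparse_ks3_add and self_check are invented here.

```python
import random


def sparse_ks3_add(a, b, cin=0, width=64, group=3):
    """Add a + b + cin using per-group carry-select and a valency-3 prefix tree."""
    sums0, sums1, gp = [], [], []
    for lo in range(0, width, group):
        n = min(group, width - lo)             # last group may be narrower
        m = (1 << n) - 1
        x, y = (a >> lo) & m, (b >> lo) & m
        sums0.append((x + y) & m)              # group sum if carry-in is 0
        sums1.append((x + y + 1) & m)          # group sum if carry-in is 1
        gp.append(((x + y) >> n, int((x ^ y) == m)))  # (generate, propagate)

    g0, p0 = gp[0]
    gp[0] = (g0 | (p0 & cin), 0)               # fold cin into the first group

    span = 1
    while span < len(gp):                      # one valency-3 level per pass
        nxt = []
        for i in range(len(gp)):
            g, p = gp[i]
            for k in (1, 2):                   # merge two lower neighbours
                j = i - k * span
                if j >= 0:
                    gj, pj = gp[j]
                    g, p = g | (p & gj), p & pj
            nxt.append((g, p))
        gp, span = nxt, span * 3

    total, carry = 0, cin
    for idx, lo in enumerate(range(0, width, group)):
        chosen = sums1[idx] if carry else sums0[idx]   # carry-select MUX
        total |= chosen << lo
        carry = gp[idx][0]                     # carry out of this group
    return total, carry


def self_check(trials=200, seed=0):
    """Compare the model against Python integer addition on random operands."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = rng.getrandbits(64), rng.getrandbits(64), rng.getrandbits(1)
        s = a + b + c
        if sparse_ks3_add(a, b, c) != (s & (2**64 - 1), s >> 64):
            return False
    return True
```

For example, sparse_ks3_add(2**64 - 1, 1) returns (0, 1): every group selects its carry-in-1 sum and the final carry out is 1.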

64b Adder Layout and Photomicrograph
• IBM 130nm 1.2V process (8RF): area = 264um × 180um
• Auto placed and routed

Energy Consumption
• Energy per operation: 29.5 pJ for the IBM 130nm 1.2V process
• Since energy is proportional to CV², we can conservatively estimate the energy consumption for a 90nm 1.1V process (17.2 pJ, as reported in the Summary)
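The scaling estimate can be reproduced with a short calculation. The sketch assumes that capacitance scales linearly with feature size, which is not stated on the slide but does reproduce the 17.2 pJ figure quoted in the Summary; the function name scale_energy is made up for this example.

```python
def scale_energy(energy_pj, old_nm, old_vdd, new_nm, new_vdd):
    """Scale energy as C * V^2, assuming C is proportional to feature size."""
    cap_ratio = new_nm / old_nm
    volt_ratio = (new_vdd / old_vdd) ** 2
    return energy_pj * cap_ratio * volt_ratio


estimate = scale_energy(29.5, old_nm=130, old_vdd=1.2, new_nm=90, new_vdd=1.1)
print(round(estimate, 1))  # 17.2
```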

Measured Results
• R. Zlatanovici and B. Nikolic, “Power-Performance Optimal 64-bit Carry-Lookahead Adders,” Proc. ESSCIRC, Sept. 2003, pp. 321-324.
• S. Sun, Y. Han, X. Guo, K. H. Chong, L. McMurchie, and C. Sechen, “409ps 4.7 FO4 64b Adder Based on Output Prediction Logic in 0.18um CMOS,” Proc. IEEE Comp. Soc. Annual Symp. on VLSI (ISVLSI), 11-12 May 2005, pp. 52-58.
• S. Perri, P. Corsonello, and G. Staino, “A Low Power Sub-Nanosecond Standard-Cells Based Adder,” Proc. IEEE ICECS 2003, pp. 296-299.

VDD vs. Delay Curve for 64b DOPL Adder

Summary
• Developed self-calibrating differential output prediction logic (DOPL)
  • Twice as fast as domino logic, and half the energy
• Developed hybrid 64b adder architecture, consisting of a valency-3 Kogge-Stone sparse tree and DOPL-specific 3b carry-select units
• 64b adder implemented using the 130nm 1.2V IBM process (8RF)
  • Nominal measured delay 238ps (3.9 FO4)
  • Best measured delay 215ps (3.5 FO4)
  • Fastest 64b adder reported by nearly 2X
• DOPL is a great candidate for scaling
• Energy: 29.5 pJ (conservatively scales to 17.2 pJ for a 90nm process)
  • Competitive with fast static CMOS adders

References
• [1] L. McMurchie, S. Kio, G. Yee, T. Thorp, and C. Sechen, “Output Prediction Logic: A High Performance CMOS Design Technique,” Proc. Int. Conf. on Computer Design (ICCD), Austin, TX, September 17-20, 2000.
• [2] S. Kio, L. McMurchie, and C. Sechen, “Application of Output Prediction Logic to Differential CMOS,” Proc. IEEE Workshop on VLSI, pp. 57-65, April 2001.
• [3] S. Sun, Y. Han, X. Guo, K. H. Chong, L. McMurchie, and C. Sechen, “409ps 4.7 FO4 64b Adder Based on Output Prediction Logic in 0.18um CMOS,” Proc. IEEE Comp. Soc. Annual Symp. on VLSI (ISVLSI), pp. 52-58, 11-12 May 2005.
• [4] X. Guo and C. Sechen, “A High Throughput Divider Implementation,” Proc. IEEE CICC, Paper 15.2, Sept. 2005.
• [5] R. Zlatanovici and B. Nikolic, “Power-Performance Optimal 64-bit Carry-Lookahead Adders,” Proc. ESSCIRC, pp. 321-324, Sept. 2003.
• [6] S. Sun, L. McMurchie, and C. Sechen, “A High-Performance 64-bit Adder Implemented in Output Prediction Logic,” Proc. Conf. on Advanced Research in VLSI (ARVLSI), pp. 213-222, 14-16 March 2001.
• [7] “High Speed Redundant Adder and Divider in Output Prediction Logic,” Proc. IEEE Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design, 2005.
