
Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction
Yan Lin and Lei He
EE Department, UCLA
Partially supported by NSF. Address comments to lhe@ee.ucla.edu

Outline
• Review and Motivation
• Chip-level Vdd-level Assignment Algorithms
• Experimental Results
• Conclusions

FPGA Power Reduction
• Existing FPGAs are power inefficient compared to ASICs [kussy, ISLPED’98]
• Power aware FPGA CAD algorithms for existing FPGA architectures
  • CAD algorithms to minimize power-delay product [Lamoureux et al, ICCAD’03]
  • Configuration inversion for leakage reduction [Anderson et al, FPGA’04]
• Power efficient FPGA circuits and architectures
  • Dual-Vdd and Vdd-programmable FPGA logic blocks [Li et al, FPGA’04] [Li et al, DAC’04]
  • Vdd-programmable FPGA interconnects [Li et al, ICCAD’04] [Gayasen et al, FPL’04] [Anderson et al, ICCAD’04]

Vdd-programmable Interconnects [Li et al, ICCAD’04]
[Figure: routing switch circuits, with the power transistor labeled]
• Conventional routing switch
• Vdd-programmable switch
  • Vdd selection for used switch
  • Power-gating unused switch
    • Reduce leakage by 300X
• Configurable Vdd-level conversion
  • Avoid excessive leakage when low-Vdd switch drives high-Vdd switches
  • Segment based Vdd-level converter insertion (SLC)
  • Area overhead
    • 35% area overhead for MCNC benchmark circuits
  • Leakage overhead
    • 29% leakage overhead for MCNC benchmark circuits

Previous Approaches w/o LCs
• [Gayasen et al, FPL’04]
  • Level converters inserted at CLB inputs (outputs)
  • All the routing trees driven by (driving) the source (sink) CLB have the same Vdd-level as the source (sink) CLB
    • Lacking in flexibility
  • A path-based Vdd-level assignment is performed for CLBs and interconnects
• [Anderson et al, ICCAD’04]
  • VT drop of NMOS is used to generate low-Vdd
  • Positive feedback PMOS is used to tolerate low-Vdd switch driving high-Vdd switches
    • Alternative design of level converter
    • Still has delay and power penalty

Our Major Contributions
• Proposed two ways to avoid using level converters in interconnects
  • Tree based level converter insertion (TLC)
    • All the switches in one routing tree have same Vdd-level
  • Dual-Vdd tree based level converter insertion (dTLC)
    • Only high-Vdd switch drives low-Vdd switches in one tree
• Proposed a few Vdd-level assignment algorithms
  • Sensitivity based algorithms
    • TLC-S and dTLC-S for TLC and dTLC, respectively
  • Linear programming (LP) based algorithm
    • dTLC-LP for dTLC

Problem Formulations
• Assign Vdd-level to each interconnect switch to minimize interconnect power
  • Meet the delay target Tspec
• Vdd-level converters
  • are removed within interconnects
  • are inserted at CLB inputs/outputs and can be used when needed
• Tree based LC insertion (TLC)
  • allows one type of Vdd-level within one routing tree
• Dual-Vdd tree based LC insertion (dTLC)
  • allows a high-Vdd switch to drive low-Vdd switches, but not vice versa

Outline
• Review and Motivation
• Chip-level Vdd-level Assignment Algorithms
• Experimental Results
• Conclusions

Delay & Power Model with Dual-Vdd
• Interconnect power
  • Dynamic power
  • Leakage power is pre-characterized using SPICE
• To incorporate dual-Vdd into timing analysis
  • Pre-characterize the intrinsic delay and effective driving resistance of switch using SPICE
  • Calculate routing delay using Elmore delay model
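The timing model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Switch` class, the `SWITCH_MODEL` table and all numeric values are hypothetical placeholders rather than SPICE data from the talk, and each buffered switch is treated as a single lumped RC stage, so its Elmore delay reduces to intrinsic delay plus effective resistance times load capacitance.

```python
from dataclasses import dataclass, field

# Hypothetical pre-characterized values per Vdd level:
# (intrinsic delay in ns, effective driving resistance in kOhm).
# kOhm * pF = ns, so the units below are consistent.
SWITCH_MODEL = {"high": (0.10, 1.0), "low": (0.15, 1.5)}


@dataclass
class Switch:
    name: str
    vdd: str = "high"            # "high" or "low"
    wire_cap: float = 0.05       # pF, lumped capacitance of the driven segment (assumed)
    input_cap: float = 0.01      # pF, input capacitance of this switch (assumed)
    fanout: list = field(default_factory=list)


def stage_delay(sw):
    """Delay of one buffered stage: intrinsic delay + R_eff * C_load."""
    d0, r_eff = SWITCH_MODEL[sw.vdd]
    load = sw.wire_cap + sum(child.input_cap for child in sw.fanout)
    return d0 + r_eff * load


def arrival_times(root, t_in=0.0, out=None):
    """Accumulate stage delays from the tree source down to every switch."""
    out = {} if out is None else out
    t_here = t_in + stage_delay(root)
    out[root.name] = t_here
    for child in root.fanout:
        arrival_times(child, t_here, out)
    return out
```

Switching a stage from `"high"` to `"low"` only changes the table row it reads, which is how a dual-Vdd assignment would feed into the timing analysis in this sketch.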

Chip-level Assignment Algorithms
• Tree based level converter insertion (TLC)
  • Sensitivity based algorithm TLC-S
• Dual-Vdd tree based level converter insertion (dTLC)
  • Sensitivity based algorithm dTLC-S
  • Linear programming (LP) based algorithm dTLC-LP

Sensitivity Based Algorithm TLC-S
• Iterative assignment
  • Assign low-Vdd to the ‘untried’ tree with maximum power sensitivity in each iteration
  • Reject the assignment if critical path increases
  • Iteration terminates after all trees are ‘tried’
• Power sensitivity
  • The power reduction by changing Vdd from high-Vdd to low-Vdd
  • Power includes both dynamic and leakage power
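The iteration above can be sketched as a greedy loop. The function name `tlc_s`, its parameters and the callable `critical_path_delay` are hypothetical stand-ins for a real timing analyzer, not names from the talk. The sketch also assumes that each tree's power sensitivity does not depend on other trees, so sorting once is equivalent to picking the untried tree with maximum sensitivity in every iteration, and that "critical path increases" means exceeding the delay target `t_spec`.

```python
def tlc_s(trees, power_saving, critical_path_delay, t_spec):
    """Greedy tree-level low-Vdd assignment (illustrative sketch).

    trees: iterable of routing-tree ids
    power_saving: dict tree id -> power reduction when the tree goes low-Vdd
    critical_path_delay: callable(set of low-Vdd tree ids) -> delay (assumed helper)
    t_spec: delay target
    """
    low = set()
    # Try each tree exactly once, largest power sensitivity first.
    for tree in sorted(trees, key=lambda t: power_saving[t], reverse=True):
        low.add(tree)
        if critical_path_delay(low) > t_spec:
            low.remove(tree)  # reject: the assignment violates the delay target
    return low
```

A real flow would re-run incremental timing analysis inside `critical_path_delay`; here it is left abstract on purpose.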

Sensitivity Based Algorithm dTLC-S
• A “candidate switch” is defined as a switch that satisfies either condition
  • It does not drive any switch
  • Low-Vdd has been assigned to all of its fanout switches
• Iterative assignment
  • Assign low-Vdd to a candidate switch with maximum power sensitivity in each iteration
  • Reject assignment if critical path increases
  • Iteration terminates when there is no candidate switch
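A minimal sketch of the candidate-switch loop follows. The names `dtlc_s`, `fanout`, `power_saving` and `critical_path_delay` are hypothetical. Two further assumptions are made because the slide does not specify them: a rejected switch is never retried, and power sensitivities are treated as fixed numbers rather than being updated as neighbors change.

```python
def dtlc_s(fanout, power_saving, critical_path_delay, t_spec):
    """Greedy switch-level low-Vdd assignment (illustrative sketch).

    fanout: dict switch -> list of switches it drives
    power_saving: dict switch -> power reduction when the switch goes low-Vdd
    critical_path_delay: callable(set of low-Vdd switches) -> delay (assumed helper)
    """
    low, rejected = set(), set()
    while True:
        # A candidate drives no switch, or drives only low-Vdd switches.
        # all() over an empty fanout list is True, which covers the first case.
        candidates = [s for s in fanout
                      if s not in low and s not in rejected
                      and all(c in low for c in fanout[s])]
        if not candidates:
            return low
        best = max(candidates, key=lambda s: power_saving[s])
        low.add(best)
        if critical_path_delay(low) > t_spec:
            low.remove(best)      # reject the assignment
            rejected.add(best)    # assumption: do not retry this switch
```

Because a switch only becomes a candidate once everything it drives is already low-Vdd, the result never contains a low-Vdd switch driving a high-Vdd switch, which is the dTLC constraint.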

LP Based Algorithm dTLC-LP: Overview
Single-Vdd placed and routed netlist → Chip-level Time Slack Allocation → Net-level Bottom-up Assignment → Refinement → Dual-Vdd netlist

dTLC-LP: Single-Net Estimation
• Slack is represented in multiples of a unit delay
  • The unit is the delay increase of an interconnect segment caused by changing its Vdd from high-Vdd to low-Vdd
• An example
  [Figure: routing tree with switches b1 to b4 driving sink1 and sink2, shown under three slack allocations: (s1=2, s2=1), (s1=2, s2=3) and (s1=1, s2=1)]

dTLC-LP: Single-Net Estimation (Cont.)
• Given the allocated slacks, estimate number of low-Vdd switches
  • sik: Slack for kth sink in ith routing tree
  • lik: Number of switches in the path from source to kth sink in ith tree
  • SLij: Set of sinks in the fanout cone of jth switch in ith tree
• An example
  [Figure: routing tree from the source to three sinks; sink k is labeled with sk/lk, and a switch is labeled with Min(sk/lk)]
• Theorem: The estimation gives a lower bound of number of low-Vdd switches that can be achieved
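The estimation formula itself did not survive in this transcript, so the sketch below is a reconstruction under assumptions rather than the authors' exact expression. It assumes that each switch j contributes min(1, min over k in SLij of sik/lik) and that the contributions are summed over the tree, with slack measured in unit delay increases. The function and argument names are hypothetical.

```python
def estimate_low_vdd_switches(cone_sinks, slack, path_len):
    """Estimate the number of low-Vdd switches in one routing tree.

    cone_sinks: dict switch -> sinks in its fanout cone (SLij)
    slack: dict sink -> allocated slack in unit delay increases (sik)
    path_len: dict sink -> number of switches from source to sink (lik)
    """
    total = 0.0
    for sinks in cone_sinks.values():
        # The tightest sink in the fanout cone limits this switch.
        ratio = min(slack[k] / path_len[k] for k in sinks)
        total += min(1.0, ratio)  # assumed cap: a switch counts at most once
    return total
```

For a tree where b1 drives b2 (to sink1) and b3 (to sink2), with both paths two switches long and slacks of 2 and 1 units, the per-switch terms are 0.5, 1.0 and 0.5.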

dTLC-LP: Full-chip Time Slack Allocation
• Objective function
  • fs(i): transition density of ith tree
  • Fn(i): estimated number of low-Vdd switches in ith tree
  • Directly minimizes dynamic power
  • May also help minimize leakage power, which depends exponentially on Vdd-level
• Constraints
  • Net-based timing constraints
    • For PIs and POs
    • For edges corresponding to routing
    • For edges other than routing
  • Upper bound for useful slack
  • Constraints due to transforming min function to linear function
• Theorem: The time slack allocation problem is an LP problem
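The slide's equations are missing from this transcript, so the block below only illustrates the standard way a min function is turned into linear constraints. It assumes the objective maximizes the transition-density weighted estimate of low-Vdd switches and introduces a hypothetical auxiliary variable f_ij for the jth switch of the ith tree. The cap at 1 is also an assumption.

```latex
\begin{align*}
\max \quad & \sum_i f_s(i) \sum_j f_{ij} \\
\text{s.t.} \quad & f_{ij} \le 1 && \forall i, j \\
& f_{ij} \le \frac{s_{ik}}{l_{ik}} && \forall k \in SL_{ij} \\
& \text{net-based timing constraints on } s_{ik}
\end{align*}
```

Because the objective pushes every f_ij upward, at the optimum each f_ij equals the smaller of 1 and the minimum of s_ik/l_ik over its fanout cone, so the min never has to appear explicitly and every constraint stays linear in the slack variables.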

dTLC-LP: Overview
Single-Vdd placed and routed netlist → Chip-level Time Slack Allocation → Net-level Bottom-up Assignment → Refinement → Dual-Vdd netlist

dTLC-LP: Net-level Bottom-up Assignment
• Perform bottom-up assignment within each tree to leverage the allocated slacks
• Bottom-up assignment
  • Assign low-Vdd to switches in the routing tree in a bottom-up fashion
  • Slack is reduced by one unit (the delay increase of a segment switched from high-Vdd to low-Vdd) in each step
  • Stop when no slack is left
• Theorem: the bottom-up assignment is optimal
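One possible reading of the bottom-up step is sketched below. The names `bottom_up_assign`, `fanout` and `cone_sinks` are hypothetical, slack is assumed to be an integer number of unit delay increases, and a switch is assumed to be eligible only when everything it drives is already low-Vdd and every sink in its fanout cone still has at least one unit of slack.

```python
def bottom_up_assign(root, fanout, cone_sinks, slack):
    """Assign low-Vdd inside one routing tree, leaves first (illustrative sketch).

    root: source-side switch of the tree
    fanout: dict switch -> list of switches it drives
    cone_sinks: dict switch -> sinks in its fanout cone
    slack: dict sink -> allocated slack in unit delay increases
    Returns (set of low-Vdd switches, remaining slack per sink).
    """
    slack = dict(slack)  # work on a copy so the caller's allocation is untouched
    low = set()

    def visit(sw):
        # Visit every child first; the list forces all subtrees to be processed.
        children_low = [visit(child) for child in fanout[sw]]
        if all(children_low) and all(slack[k] >= 1 for k in cone_sinks[sw]):
            for k in cone_sinks[sw]:
                slack[k] -= 1  # one unit of slack is consumed on each affected path
            low.add(sw)
            return True
        return False

    visit(root)
    return low, slack
```

With sink slacks of 2 and 1 units on a tree where b1 drives b2 and b3, the two leaf switches become low-Vdd and b1 stays high-Vdd because the second sink has no slack left.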

dTLC-LP: Overview
Single-Vdd placed and routed netlist → Chip-level Time Slack Allocation → Net-level Bottom-up Assignment → Refinement → Dual-Vdd netlist

Outline
• Review and Motivation
• Modeling and Problem Formulations
• Chip-level Vdd-level Assignment Algorithms
• Experimental Results
• Conclusions

Experimental Setting
• Cluster-based Island Style FPGA Structure
  • 100% buffered interconnects, subset switch block
  • Uniform length 4 for all wire segments
• ITRS 100nm technology
• Use VPR [Betz-Rose-Marquardt] for placement and routing
• Use fpgaEva-LP2 [Lin et al, FPGA’05] for power calculation
  • Considering short-circuit power, glitch power and input vector
  • 8% average error compared to SPICE simulation

Interconnect Power Comparison between TLC-S, dTLC-S and dTLC-LP
[Figure: bar chart of interconnect power (watt), split into leakage power and dynamic power, for TLC-S, dTLC-S and dTLC-LP]
• dTLC-S and dTLC-LP achieve 6.7% and 6.9% less interconnect power compared to TLC-S, respectively
• Interconnect power breakdown
  • TLC-S, dTLC-S and dTLC-LP have almost the same leakage
  • dTLC-S and dTLC-LP achieve 13.8% and 15.8% less interconnect dynamic power compared to TLC-S, respectively

dTLC-LP compared to SLC and h2lLCi
[Figure: % of VddL switches and interconnect power (watt) versus critical path delay (ns) for dTLC-LP, SLC and h2lLCi]
• SLC [Li et al, ICCAD’04]
  • Segment based level converter inserted in interconnects
  • Sensitivity based assignment algorithm
• h2lLCi [Gayasen et al, FPL’04]
  • All the routing trees driven by the source CLB have the same Vdd-level as the source CLB
  • Path based assignment algorithm
• dTLC-LP, SLC and h2lLCi achieve 77.54%, 74.70% and 41.80% low-Vdd switches w/o relaxing Tspec
• At different delays, dTLC-LP achieves
  • The highest number of low-Vdd switches
  • The lowest power consumption

Runtime Comparison between TLC-S, dTLC-S and dTLC-LP
[Figure: runtime (s) of TLC-S, dTLC-S and dTLC-LP on MCNC benchmarks alu4, apex2, apex4, elliptic, ex1010, frisc, pdc, s38417 and s38584]
• TLC-S runs the fastest
• dTLC-S versus dTLC-LP
  • dTLC-S runs 3X faster than dTLC-LP
  • But achieves similar power consumption

Conclusions and Future Work
• Proposed two ways to avoid using level converters in Vdd-programmable interconnects
  • Tree based level converter insertion (TLC)
  • Dual-Vdd tree based level converter insertion (dTLC)
• Developed chip-level dual-Vdd assignment algorithms w/o level converters
  • Sensitivity based algorithms TLC-S and dTLC-S
  • LP based algorithm dTLC-LP
• Developed dTLC-LP that reduces interconnect power by 64%
• Developed dTLC-S that obtains slightly smaller power reduction with 3X speedup compared to dTLC-LP
• Future work
  • Extend chip-level Vdd-level assignment to interconnects using wire segments of different lengths
  • Allocate time slack to logic blocks and interconnects in a uniform fashion

Thank you!
