# Explaining The Gap Between ASIC and Custom Power: A Custom Perspective

##### Presentation Transcript

1. Explaining The Gap Between ASIC and Custom Power: A Custom Perspective. Andrew Chang (Cadence Design Systems\*) and William J. Dally (Computer Systems Laboratory, Stanford University). \*Work done while the author was at Stanford.

2. Design Tradeoffs: Power vs. Performance
   - (1) Move to a more energy-efficient operating point: more energy efficient with custom
   - [Figure: power vs. performance plot with points labeled 1, 2, and 3]

3. Design Tradeoffs: Power vs. Performance
   - (1) Move to a more energy-efficient operating point: more energy efficient with custom
   - (2) Trade performance for power: larger range with custom
   - [Figure: power vs. performance plot with points labeled 1, 2, and 3]

4. Design Tradeoffs: Power vs. Performance
   - (1) Move to a more energy-efficient operating point: more energy efficient with custom
   - (2) Trade performance for power: larger range with custom
   - (3) Move to a different power vs. performance curve: more architectural choice with custom
   - [Figure: power vs. performance plot with points labeled 1, 2, and 3]

5. Dynamic Power Dissipation: $P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$
   - Reduce $V_{dd}$: static, dynamic, voltage islands, power gating
   - Reduce $\alpha$ and/or $f$: clock gating, block enables, bus encoding, glitch identification and elimination
   - Reduce $E_{circuit}$: engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
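The quadratic dependence of dynamic power on supply voltage can be illustrated with a minimal Python sketch. The function name and every numeric value below (activity factor, capacitance, voltages, frequency) are hypothetical placeholders chosen for illustration, not figures from the talk.

```python
def dynamic_power(alpha, c_farads, vdd_volts, f_hz):
    """Dynamic power: activity factor * switched capacitance * Vdd^2 * frequency."""
    return alpha * c_farads * vdd_volts ** 2 * f_hz

# Hypothetical operating point (illustrative values only).
baseline = dynamic_power(alpha=0.1, c_farads=1e-9, vdd_volts=1.8, f_hz=100e6)

# Halving Vdd at fixed activity, capacitance, and frequency quarters the power.
half_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd_volts=0.9, f_hz=100e6)

# Halving the activity factor (for example via clock gating) only halves it.
half_alpha = dynamic_power(alpha=0.05, c_farads=1e-9, vdd_volts=1.8, f_hz=100e6)

print(f"baseline:   {baseline * 1e3:.2f} mW")
print(f"half Vdd:   {half_vdd * 1e3:.2f} mW")
print(f"half alpha: {half_alpha * 1e3:.2f} mW")
```

In practice a lower supply voltage also slows the circuit, so frequency would not stay fixed; the sketch isolates the voltage term only.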

6. Static Power Dissipation
   - $P_{static} = V_{dd}\,(I_{sub} + I_{ox})$
   - $I_{sub} = K_1 W e^{-V_t/(n V_\theta)} \left(1 - e^{-V_{gs}/V_\theta}\right)$
   - $I_{ox} = K_2 W \left(V_{gs}/t_{ox}\right)^2 e^{-\alpha t_{ox}/V_{gs}}$
   - $K_1$, $K_2$, $n$, and $\alpha$ are experimentally determined
   - Reduce $V_{dd}$: static, dynamic, voltage islands, power gating
   - Increase effective $V_t$: substituting high-threshold devices, transistor stacking, static and active body bias
   - Reduce effective $W$: reduce the number and size of devices in the design
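The static power model above can be written out as a short Python sketch to show why raising the threshold voltage is so effective. The fitting constants and device parameters are hypothetical (the slide says only that they are experimentally determined), and the thermal voltage is taken as roughly 25.9 mV, its value near room temperature.

```python
import math

V_THETA = 0.0259  # thermal voltage kT/q in volts, near room temperature

def subthreshold_current(k1, w, vt, vgs, n, v_theta=V_THETA):
    """I_sub = K1 * W * exp(-Vt / (n * V_theta)) * (1 - exp(-Vgs / V_theta))."""
    return k1 * w * math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-vgs / v_theta))

def gate_oxide_current(k2, w, vgs, tox, alpha):
    """I_ox = K2 * W * (Vgs / tox)^2 * exp(-alpha * tox / Vgs)."""
    return k2 * w * (vgs / tox) ** 2 * math.exp(-alpha * tox / vgs)

def static_power(vdd, i_sub, i_ox):
    """P_static = Vdd * (I_sub + I_ox)."""
    return vdd * (i_sub + i_ox)

# Hypothetical fitting constants and device parameters (not from the talk).
K1, N, W, VGS = 1e-6, 1.5, 1.0, 1.2

low_vt = subthreshold_current(K1, W, vt=0.30, vgs=VGS, n=N)
high_vt = subthreshold_current(K1, W, vt=0.40, vgs=VGS, n=N)
reduction = low_vt / high_vt
print(f"Raising Vt by 100 mV cuts I_sub by about {reduction:.1f}x")
```

Because the threshold voltage sits inside an exponential, the reduction factor depends only on the size of the shift, `n`, and the thermal voltage, not on `K1` or `W`.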

7. Which Design Is More Efficient?
   - 0.7 µm CMOS, 173 MHz chip with 460K transistors
   - 0.18 µm CMOS, 10 kHz chip with 640K transistors

8. Which Design Is More Efficient?
   - 0.7 µm CMOS, 173 MHz chip with 460K transistors
     - Vdd (typ) = 3.3 V, Vdd (min) = 1.1 V
   - 0.18 µm CMOS, 10 kHz chip with 640K transistors
     - Vdd (max) = 1.8 V, Vdd (min) = 0.18 V

9. Which Design Is More Efficient?
   - 0.7 µm CMOS, 173 MHz chip with 460K transistors
     - Vdd (typ) = 3.3 V, Vdd (min) = 1.1 V
     - Power = 845 mW
   - 0.18 µm CMOS, 10 kHz chip with 640K transistors
     - Vdd (max) = 1.8 V, Vdd (min) = 0.18 V
     - Power = 1.6 mW

10. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

12. Defining Ebit: $E_{bit} = C_{bit} V_{dd}^2$, with $C_{bit} = 4 \times 2\,\mathrm{fF}/\mu\mathrm{m} \times W_{min}$
    - Energy needed to write a 1-bit SRAM cell
    - Approximates minimum useful capacitance
    - The ratio of Ebit to the energy for a range of circuits remains largely constant with technology scaling
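As a worked illustration of the Ebit definition, the sketch below computes Cbit and Ebit and then expresses an arbitrary energy as a multiple of Ebit. The 2 fF/µm constant and the factor of 4 come from the slide; the minimum width of 0.28 µm, the 1.8 V supply, and the 1 pJ example energy are hypothetical inputs, since the transcript gives no value for Wmin.

```python
FF_PER_UM = 2.0  # capacitance per micron of device width, in fF/um (from the slide)

def c_bit_ff(w_min_um):
    """Cbit = 4 * 2 fF/um * Wmin, returned in femtofarads."""
    return 4 * FF_PER_UM * w_min_um

def e_bit_fj(w_min_um, vdd):
    """Ebit = Cbit * Vdd^2, returned in femtojoules (fF * V^2 = fJ)."""
    return c_bit_ff(w_min_um) * vdd ** 2

def to_ebits(energy_fj, w_min_um, vdd):
    """Express an energy as a multiple of Ebit for the given process and supply."""
    return energy_fj / e_bit_fj(w_min_um, vdd)

# Hypothetical process parameters (Wmin is an assumption, not from the talk).
W_MIN_UM, VDD = 0.28, 1.8

print(f"Cbit = {c_bit_ff(W_MIN_UM):.2f} fF")
print(f"Ebit = {e_bit_fj(W_MIN_UM, VDD):.2f} fJ")
print(f"1 pJ = {to_ebits(1000.0, W_MIN_UM, VDD):.0f} Ebits")
```

Normalizing by Ebit removes most of the dependence on process and supply voltage, which is what lets designs in different technologies be compared on one scale.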

13. Technology Scaling for Ebit
    - χ is a normalized unit of distance equal to the M1 pitch

    | Technology | µm² | χ² |
    |---|---|---|
    | 0.5 µm | 58 | 18 |
    | 0.18 µm | 5.7 | 18 |

14. Technology Scaling for NAND2
    - χ is a normalized unit of distance equal to the M1 pitch
    - [Figure: NAND2 layout with inputs A and B and output YN; dimensions 4χ = 2.24 µm and 8χ = 4.48 µm]
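The normalized distance unit (the M1 pitch, written χ here) can be recovered from the NAND2 dimensions and used to convert areas back to square microns. The sketch below does that arithmetic; it assumes the figures on the preceding Ebit scaling slide read as an 18 χ² area occupying 5.7 µm² in the 0.18 µm process.

```python
# M1 pitch in microns, derived from the NAND2 dimension 4 chi = 2.24 um.
CHI_UM = 2.24 / 4

def chi2_to_um2(area_chi2, chi_um=CHI_UM):
    """Convert an area in normalized chi^2 units to square microns."""
    return area_chi2 * chi_um ** 2

nand2_um2 = chi2_to_um2(4 * 8)  # NAND2 footprint: 4 chi by 8 chi
cell_um2 = chi2_to_um2(18)      # 18 chi^2 area from the Ebit scaling slide

print(f"chi        = {CHI_UM:.2f} um")
print(f"NAND2      = {nand2_um2:.2f} um^2")
print(f"18 chi^2   = {cell_um2:.2f} um^2")
```

The 18 χ² area comes out at about 5.6 µm², close to the 5.7 µm² quoted for the 0.18 µm process, which supports reading the two slides with the same χ.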

15. Applying Ebit

16. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

18. Effect of Architecture
    - Intel Pentium-4: design style custom; 2600 MHz; 55M transistors
    - NVIDIA GeForceFX: design style ASIC; 400 MHz; 125M transistors

19. Effect of Architecture
    - Intel Pentium-4: design style custom; 2600 MHz; 55M transistors; ~60 W
    - NVIDIA GeForceFX: design style ASIC; 400 MHz; 125M transistors; ~20 W

20. Effect of Architecture. ASIC Architecture: 6x Efficiency
    - Intel Pentium-4: design style custom; 2600 MHz; 55M transistors; ~60 W: 5 GFlops & 5 Gbs
    - NVIDIA GeForceFX: design style ASIC; 400 MHz; 125M transistors; ~20 W: 10 GFlops & 13 GBs

21. Custom Circuits: 9x (7x) Efficiency
    - Intel Pentium-4: design style custom; 2600 MHz; 55M transistors; ~60 W: 5 GFlops & 5 Gbs; Vdd = 1.3 V
    - NVIDIA GeForceFX: design style custom; 400 MHz; 125M transistors; ~3 W: 10 GFlops & 13 GBs; Vdd = 0.65 V

22. Combined Architecture and Circuits: 40x+ Improvement, but 1.5 Years vs. 3+ Years
    - Intel Pentium-4: design style custom; 2600 MHz; 55M transistors; ~60 W: 5 GFlops & 5 Gbs; Vdd = 1.3 V
    - NVIDIA GeForceFX: design style custom; 400 MHz; 125M transistors; ~3 W: 10 GFlops & 13 GBs; Vdd = 0.65 V
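The 6x, 7x, and 40x figures on these slides can be reproduced from the quoted GFlops and wattage alone. The sketch below assumes that "efficiency" means GFlops per watt, that the 60 W and 5 GFlops figures belong to the Pentium-4, and that the 20 W (ASIC) and 3 W (custom) figures belong to the GeForceFX; variable names are illustrative.

```python
def gflops_per_watt(gflops, watts):
    """Energy efficiency as throughput per unit power."""
    return gflops / watts

p4 = gflops_per_watt(5, 60)           # Pentium-4, custom design
gfx_asic = gflops_per_watt(10, 20)    # GeForceFX as built (ASIC)
gfx_custom = gflops_per_watt(10, 3)   # GeForceFX if built with custom circuits

arch_gain = gfx_asic / p4             # effect of architecture alone
circuit_gain = gfx_custom / gfx_asic  # effect of custom circuits alone
combined = gfx_custom / p4            # both together

print(f"architecture: {arch_gain:.1f}x")
print(f"circuits:     {circuit_gain:.1f}x")
print(f"combined:     {combined:.1f}x")
```

Under these assumptions the architecture gain is 6x, the circuit gain is about 6.7x (matching the parenthesized 7x), and the combined gain is 40x. The 9x figure is not reproduced by these numbers, and the transcript does not give its derivation.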

23. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

25. ASIC vs. Custom
    - ASIC Methods
      - Provide only coarse-grain control (100K+ gates), but require much less effort and historically scale with complexity
    - Custom Methods
      - Offer fine-grain control (individual transistors & gates), but require large effort and scale poorly with complexity
      - Exploit design structure
      - Exploit circuit techniques

26. Custom Methods Emphasize Fine-Grain Manual Control + Custom Library: operation and performance characterized for the specific case

27. ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library

28. ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library: operation and performance characterized for the typical/generic case

29. ASIC Focus on 100K+ Gates: Lost Opportunities to Exploit Structure
    - Designs reuse similar basic building blocks
    - Building blocks: 1-10K gates, not 100K+ gates
      - 64-bit adder: 1K gates
      - 64x64 register file: 2K gates
      - 64x64 multiplier: 20K gates
    - Opportunities to exploit these structures are lost when the design is viewed in large chunks

30. Different Architectures, Similar Building Blocks
    - 1998 “MAP” 64b Microprocessor, 5M transistors (MIT/Stanford): XCVRS, Bus, EX, RF, SRAM
    - 2002 “Imagine” 32b Stream Processor, 22M transistors (Stanford): XCVRS, Bus, EX, RF, SRAM
    - [Figure: chip floorplans; labels include Bank 0, Bank 1, LTLB, EMI, Memory Switch, NIF/Router, Cluster Switch, and CLST 0 to CLST 2]

31. Significant Structure Exists Within 100K Gates
    - 1998 “MAP” 64b Microprocessor, 5M transistors (MIT/Stanford): XCVRS, Bus, EX, RF, SRAM
    - 2002 “Imagine” 32b Stream Processor, 22M transistors (Stanford): XCVRS, Bus, EX, RF, SRAM
    - [Figure: chip floorplans; labels include Bank 0, Bank 1, LTLB, EMI, Memory Switch, NIF/Router, Cluster Switch, and CLST 0 to CLST 2]

32. Energy of 100K-gate Equivalent • ASIC (N2) = 1400K Ebits (typ) • Custom Logic = 424K Ebits* • SRAM (small) = 1085K Ebits • SRAM (med) = 155K Ebits • SRAM (large) = 50K Ebits *Based on data extracted from Intel McKinley

33. Exploiting Circuit Techniques
    - Custom circuits more efficient
      - Reduced parasitics
      - 1.7x circuit techniques and flops
      - 1.4x libraries
      - 1.4x due to engineering interconnects
    - Subthreshold circuits
      - Low performance but ultra-low power
      - Requires architecture, gates, memories, CAD tools
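The transcript does not say how the three custom-circuit factors combine. Assuming they compose multiplicatively, a short check shows that their product lands close to the ratio of ASIC to custom logic energy quoted on the previous slide (1400K vs. 424K Ebits per 100K-gate equivalent). The dictionary keys are descriptive labels only.

```python
import math

# Per-technique efficiency factors quoted on the slide.
factors = {
    "circuit techniques and flops": 1.7,
    "libraries": 1.4,
    "engineered interconnects": 1.4,
}

# Assumption: the factors are independent and multiply.
product = math.prod(factors.values())

# Ratio of ASIC to custom logic energy for a 100K-gate equivalent, in Ebits.
asic_over_custom = 1400 / 424

print(f"product of factors: {product:.2f}x")
print(f"ASIC / custom:      {asic_over_custom:.2f}x")
```

Both values are roughly 3.3x, which suggests, though the slides do not state it, that the three factors are meant as a breakdown of the ASIC-to-custom logic gap.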

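34. Relating Power to Performance: CV/I, $I_{dsat}$, $t_{FO4}$
    - $I_{dsat} = K_3 L_{eff}^{-0.5}\, t_{ox}^{-0.8}\, (V_{gs} - V_t)^{1.25}$
    - $t_{FO4} = K_4 \left[ C_{eff} V_{dd} / I_{dsat} \right]$, with $K_4 \approx 13.5$

35. Relating Power to Performance: Relating $V_{dd}$ and $V_t$ to $t_{FO4}$
    - $I_{dsat} = K_3 L_{eff}^{-0.5}\, t_{ox}^{-0.8}\, (V_{gs} - V_t)^{1.25}$
    - $t_{FO4} = K_4 \left[ C_{eff} V_{dd} / I_{dsat} \right]$, with $K_4 \approx 13.5$

36. Relating Power to Performance: Correlation to Reported Foundry Data
    - $I_{dsat} = K_3 L_{eff}^{-0.5}\, t_{ox}^{-0.8}\, (V_{gs} - V_t)^{1.25}$
    - $t_{FO4} = K_4 \left[ C_{eff} V_{dd} / I_{dsat} \right]$, with $K_4 \approx 13.5$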
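The saturation-current and FO4-delay expressions can be combined to see how delay grows as the supply is lowered. The sketch below assumes the gate is driven to the full supply (Vgs = Vdd), so that the process constants K3, K4, Leff, tox, and Ceff cancel in a ratio of delays. The threshold voltage of 0.35 V is a hypothetical value; the 1.3 V and 0.65 V supplies echo the Pentium-4 and custom GeForceFX figures earlier in the talk.

```python
def idsat(k3, leff, tox, vgs, vt):
    """I_dsat = K3 * Leff^-0.5 * tox^-0.8 * (Vgs - Vt)^1.25."""
    return k3 * leff ** -0.5 * tox ** -0.8 * (vgs - vt) ** 1.25

def t_fo4(ceff, vdd, i_dsat, k4=13.5):
    """t_FO4 = K4 * Ceff * Vdd / I_dsat."""
    return k4 * ceff * vdd / i_dsat

def relative_fo4_delay(vdd, vdd_ref, vt):
    """FO4 delay at vdd relative to vdd_ref, assuming Vgs = Vdd.

    Only valid above threshold; the model does not cover subthreshold operation.
    """
    if vdd <= vt or vdd_ref <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    # Arbitrary unit constants: they cancel in the ratio.
    at_vdd = t_fo4(1.0, vdd, idsat(1.0, 1.0, 1.0, vdd, vt))
    at_ref = t_fo4(1.0, vdd_ref, idsat(1.0, 1.0, 1.0, vdd_ref, vt))
    return at_vdd / at_ref

VT = 0.35  # hypothetical threshold voltage in volts
slowdown = relative_fo4_delay(vdd=0.65, vdd_ref=1.3, vt=VT)
print(f"FO4 delay at 0.65 V is {slowdown:.2f}x the delay at 1.3 V")
```

Halving the supply cuts the switched energy by 4x, at the cost of a slowdown that depends strongly on how close the new supply sits to the threshold voltage.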

37. Achievable Power Improvement, Assuming 50/50 Split of Logic and Memory
    - 130 nm µP assumes 80% dynamic and 20% static
    - 90 nm µP assumes 50% dynamic and 50% static

38. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

40. 16b 1024 point FFT
    - Generally, $k N \log N$ operations (complex multiplies) with pre-computation
    - Radix-2, Radix-4, etc. implementations
    - Decimation in time and/or decimation in frequency
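For a concrete sense of the operation count, the sketch below counts butterflies in a textbook radix-2 FFT, where each butterfly involves one complex twiddle-factor multiply (some of them trivial). This corresponds to k = 1/2 with a base-2 logarithm; it is a generic count, not the count for any specific implementation discussed in the talk.

```python
def radix2_butterflies(n):
    """Number of butterflies in a radix-2 FFT of size n (n must be a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return (n // 2) * stages

def direct_dft_multiplies(n):
    """Complex multiplies in a direct O(N^2) DFT, for comparison."""
    return n * n

N = 1024
print(f"radix-2 FFT: {radix2_butterflies(N)} butterflies")
print(f"direct DFT:  {direct_dft_multiplies(N)} complex multiplies")
```

For N = 1024 this gives 10 stages of 512 butterflies, or 5120 in total, against 1,048,576 multiplies for the direct transform.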

41. Range of Implementations
    - MIT FFT (2005): 0.18 µm CMOS, 628K transistors, 10 kHz; architecture and subthreshold circuits, 180 mV operation
    - Spiffee (1999): 0.7 µm CMOS, 460K transistors, 173 MHz; cached FFT architecture and algorithm, 1.1 V operation
    - SA-1100 (1999): 0.35 µm CMOS, 2.6M transistors, 74 MHz; commercial embedded processor, custom circuits, 1.5 V operation
    - Imagine (2003): 0.15 µm CMOS, 22M transistors, 232 MHz; streaming media processor, tiled standard cells, 1.2 V operation
    - Stratix IS25F627C8 (2005): 0.13 µm CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272 MHz; commercial FPGA co-processor
    - Intel P4 (2003): 0.13 µm CMOS, 3 GHz, SSE; commercial general-purpose processor, custom circuits, 1.5 V operation
    - TI ‘C6416 (2003): 0.13 µm CMOS, 720 MHz; commercial digital signal processor

42. Ebit Energy 16b 1024 point FFT

43. Ebit Energy 16b 1024 point FFT

44. Which Design Is More Efficient?
    - 0.7 µm CMOS, 173 MHz chip with 460K transistors
      - Vdd (typ) = 3.3 V, Vdd (min) = 1.1 V
      - Power = 845 mW
    - 0.18 µm CMOS, 10 kHz chip with 640K transistors
      - Vdd (max) = 1.8 V, Vdd (min) = 0.18 V
      - Power = 1.6 mW

45. Which Design Is More Efficient? Depends on the Metric!
    - 0.7 µm CMOS, 173 MHz chip with 460K transistors
      - Vdd (typ) = 3.3 V, Vdd (min) = 1.1 V
      - Power = 845 mW
      - EDP 143x better
    - 0.18 µm CMOS, 10 kHz chip with 640K transistors
      - Vdd (max) = 1.8 V, Vdd (min) = 0.18 V
      - Power = 1.6 mW
      - Absolute energy 6x better
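Why the answer depends on the metric can be seen with a small sketch comparing energy and energy-delay product (EDP) for a fast, power-hungry design and a slow, low-power one. The power and runtime values below are entirely hypothetical and are not the figures for the two chips on the slide, whose per-FFT runtimes are not given in the transcript.

```python
def energy(power_w, time_s):
    """Energy per task in joules: power times runtime."""
    return power_w * time_s

def edp(power_w, time_s):
    """Energy-delay product in joule-seconds: energy times runtime."""
    return energy(power_w, time_s) * time_s

# Hypothetical designs (illustrative values only).
fast = {"power_w": 1.0, "time_s": 1e-4}   # high power, short runtime
slow = {"power_w": 1e-3, "time_s": 2e-2}  # low power, long runtime

energy_ratio = energy(**fast) / energy(**slow)  # > 1 means slow design wins
edp_ratio = edp(**slow) / edp(**fast)           # > 1 means fast design wins

print(f"slow design uses {energy_ratio:.0f}x less energy per task")
print(f"fast design has {edp_ratio:.0f}x better EDP")
```

Energy alone rewards the slow design, while EDP penalizes its long runtime a second time and so favors the fast one; this is the same pattern as the slide's 6x energy advantage for one chip and 143x EDP advantage for the other.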