Explaining The Gap Between ASIC and Custom Power: A Custom Perspective

Presentation Transcript

  1. Explaining The Gap Between ASIC and Custom Power: A Custom Perspective
     Andrew Chang, Cadence Design Systems*
     William J. Dally, Computer Systems Laboratory, Stanford University
     * Work done while author was at Stanford

  2. Design Tradeoffs: Power vs. Performance
     1. Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom
     [Figure: Power vs. Performance plot with moves 1, 2, and 3 marked]

  3. Design Tradeoffs: Power vs. Performance
     1. Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom
     2. Trade Performance for Power: Larger Range w/ Custom
     [Figure: Power vs. Performance plot with moves 1, 2, and 3 marked]

  4. Design Tradeoffs: Power vs. Performance
     1. Move to More Energy Efficient Operating Point: More Energy Efficient w/ Custom
     2. Trade Performance for Power: Larger Range w/ Custom
     3. Move to Different Power vs. Performance Curve: More Architectural Choice with Custom
     [Figure: Power vs. Performance plot with moves 1, 2, and 3 marked]

  5. Dynamic Power Dissipation
     $P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$
     • Reduce $V_{dd}$
       • Static, dynamic, voltage islands, power gating
     • Reduce $\alpha$ and/or $f$
       • Clock gating, block enables, bus encoding, glitch identification and elimination
     • Reduce $E_{circuit}$
       • Engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
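
The dynamic power expression on this slide can be explored numerically. The following is a minimal sketch, not taken from the talk: the function name and every numeric input (activity factor, capacitance, voltages, frequency) are hypothetical values chosen only to show the quadratic dependence on Vdd and the linear dependence on the activity factor.

```python
def dynamic_power(alpha, c_farads, vdd, freq_hz):
    """Dynamic power in watts: P_dyn = alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * freq_hz


# Hypothetical operating points (illustrative values, not from the slides).
baseline = dynamic_power(alpha=0.10, c_farads=1e-9, vdd=1.30, freq_hz=400e6)
half_vdd = dynamic_power(alpha=0.10, c_farads=1e-9, vdd=0.65, freq_hz=400e6)
half_alpha = dynamic_power(alpha=0.05, c_farads=1e-9, vdd=1.30, freq_hz=400e6)

print(f"baseline:      {baseline:.4f} W")
print(f"half Vdd:      {half_vdd:.4f} W ({baseline / half_vdd:.1f}x lower)")
print(f"half activity: {half_alpha:.4f} W ({baseline / half_alpha:.1f}x lower)")
```

Halving Vdd at a fixed frequency cuts dynamic power by 4x, while halving the activity factor (for example through clock gating) cuts it by 2x. In practice a lower Vdd usually also forces a lower frequency, which this sketch holds constant.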

  6. Static Power Dissipation
     $P_{static} = V_{dd} (I_{sub} + I_{ox})$
     $I_{sub} = K_1 W e^{-V_t / (n V_\theta)} (1 - e^{-V_{gs} / V_\theta})$
     $I_{ox} = K_2 W (V_{gs} / t_{ox})^2 e^{-\alpha t_{ox} / V_{gs}}$
     with $K_1$, $K_2$, $n$, and $\alpha$ experimentally determined
     • Reduce $V_{dd}$
       • Static, dynamic, voltage islands, power gating
     • Increase effective $V_t$
       • Substituting high-threshold devices, transistor stacking, static and active body bias
     • Reduce effective $W$
       • Reduce number and size of devices in design
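
To see why raising the effective Vt is such a strong lever, the subthreshold term can be evaluated as a ratio so that the fitted constant K1 and the width W cancel. This is a rough sketch under stated assumptions: the second voltage in the exponent is read as the thermal voltage (about 25.9 mV near room temperature), and the slope factor `n = 1.5` and both threshold values are hypothetical, since the slide says these parameters are determined experimentally.

```python
import math


def isub_relative(vt, vgs, n=1.5, v_theta=0.0259):
    """Subthreshold leakage divided by (K1 * W), following the slide's form.

    n and v_theta are assumed illustrative values, not data from the talk.
    """
    return math.exp(-vt / (n * v_theta)) * (1.0 - math.exp(-vgs / v_theta))


# Hypothetical low-Vt and high-Vt devices at the same bias.
low_vt = isub_relative(vt=0.30, vgs=1.2)
high_vt = isub_relative(vt=0.40, vgs=1.2)
reduction = low_vt / high_vt

print(f"Leakage reduction from +100 mV of Vt: {reduction:.1f}x")
```

Because the dependence is exponential, the ratio depends only on the Vt difference, the slope factor, and the thermal voltage.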

  7. Which Design Is More Efficient? • 0.7um CMOS 173MHz chip w/ 460K T’s • 0.18um CMOS 10kHz chip w/ 640K T’s

  8. Which Design Is More Efficient? • 0.7um CMOS 173MHz chip w/ 460K T’s • Vdd (typ) = 3.3V, Vdd (min) = 1.1V • 0.18um CMOS 10kHz chip w/ 640K T’s • Vdd (max) = 1.8V, Vdd (min) = 0.18V

  9. Which Design Is More Efficient? • 0.7um CMOS 173MHz chip w/ 460K T’s • Vdd (typ) = 3.3V, Vdd (min) = 1.1V • Power = 845mW • 0.18um CMOS 10kHz chip w/ 640K T’s • Vdd (max) = 1.8V, Vdd (min) = 0.18V • Power = 1.6mW

  10. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  12. Defining Ebit
      $E_{bit} = C_{bit} V_{dd}^2$
      $C_{bit} = 4 \times 2\,\mathrm{fF/\mu m} \times W_{min}$
      • Energy needed to write a 1-bit SRAM cell
      • Approximates minimum useful capacitance
      • The ratio of Ebit to the energy for a range of circuits remains largely constant with technology scaling
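
The Ebit definition translates directly into a few lines of arithmetic. The sketch below is illustrative only: the function names, the minimum width of 0.28 µm, the 1.8 V supply, and the 1 pJ example operation are all hypothetical inputs; only the 4 x 2 fF/µm x Wmin form comes from the slide.

```python
def c_bit_farads(w_min_um, cap_per_um_farads=2e-15):
    """C_bit = 4 * (2 fF/um) * W_min, with W_min in micrometers."""
    return 4 * cap_per_um_farads * w_min_um


def e_bit_joules(w_min_um, vdd):
    """E_bit = C_bit * Vdd^2."""
    return c_bit_farads(w_min_um) * vdd ** 2


def energy_in_ebits(energy_joules, w_min_um, vdd):
    """Express an absolute energy as a multiple of E_bit."""
    return energy_joules / e_bit_joules(w_min_um, vdd)


# Hypothetical process parameters (not from the slides).
e_bit = e_bit_joules(w_min_um=0.28, vdd=1.8)
print(f"E_bit = {e_bit * 1e15:.2f} fJ")
print(f"1 pJ operation = {energy_in_ebits(1e-12, 0.28, 1.8):.0f} Ebits")
```

Dividing a measured energy by Ebit is what allows designs in different technologies and at different supply voltages to be compared on one scale.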

  13. Technology Scaling for Ebit
      Technology    µm²    χ²
      0.5 µm        58     18
      0.18 µm       5.7    18
      • χ is a normalized unit of distance equal to the M1 pitch

  14. Technology Scaling for NAND2
      • χ is a normalized unit of distance equal to the M1 pitch
      [Figure: NAND2 cell with inputs A and B and output YN; cell dimensions 4χ = 2.24 µm and 8χ = 4.48 µm]
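
The normalized unit can be recovered from the NAND2 dimensions and checked against the 18-unit-squared, 5.7 µm² entry on the preceding scaling slide. This is a small consistency check, assuming (the slide does not say so explicitly) that the NAND2 dimensions refer to the 0.18 µm process; the variable names are my own.

```python
# Dimensions quoted on the NAND2 slide, in micrometers.
nand2_short_side_um = 2.24  # stated as 4 normalized units
nand2_long_side_um = 4.48   # stated as 8 normalized units

# One normalized unit (the M1 pitch), derived from the short side.
unit_um = nand2_short_side_um / 4

# Cross-check the long side and the 18-unit-squared area from the scaling slide.
long_side_check_um = 8 * unit_um
area_18_units_um2 = 18 * unit_um ** 2

print(f"normalized unit = {unit_um:.2f} um")
print(f"8 units = {long_side_check_um:.2f} um")
print(f"18 square units = {area_18_units_um2:.2f} um^2")
```

The derived area of roughly 5.6 µm² is close to the 5.7 µm² listed for 0.18 µm, which supports reading both slides as describing the same process.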

  15. Applying Ebit

  16. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  18. Effect of Architecture
      • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors
      • NVIDIA GeForceFX: Design Style: ASIC, 400MHz, 125M Transistors

  19. Effect of Architecture
      • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors, ~60 Watts
      • NVIDIA GeForceFX: Design Style: ASIC, 400MHz, 125M Transistors, ~20 Watts

  20. Effect of Architecture: ASIC Architecture: 6x Efficiency
      • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors, ~60 Watts: 5GFlops & 5 Gbs
      • NVIDIA GeForceFX: Design Style: ASIC, 400MHz, 125M Transistors, ~20 Watts: 10GFlops & 13 GBs

  21. Custom Circuits: 9x (7x) Efficiency
      • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors, ~60 Watts: 5GFlops & 5 Gbs, Vdd = 1.3V
      • NVIDIA GeForceFX: Design Style: Custom, 400MHz, 125M Transistors, ~3 Watts: 10GFlops & 13 GBs, Vdd = 0.65V

  22. Combined Architecture and Circuits: 40x+ Improvement, but 1.5 Years vs. 3+ Years
      • Intel Pentium-4: Design Style: Custom, 2600MHz, 55M Transistors, ~60 Watts: 5GFlops & 5 Gbs, Vdd = 1.3V
      • NVIDIA GeForceFX: Design Style: Custom, 400MHz, 125M Transistors, ~3 Watts: 10GFlops & 13 GBs, Vdd = 0.65V
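
The 6x and 40x headline figures follow from the GFlops-per-watt numbers quoted on these slides. The sketch below redoes that arithmetic; the dictionary layout and function name are my own, and only the GFlops and wattage values come from the slides.

```python
# GFlops and approximate power as quoted on the slides.
designs = {
    "Pentium-4 (custom)": {"gflops": 5.0, "watts": 60.0},
    "GeForceFX (ASIC)": {"gflops": 10.0, "watts": 20.0},
    "GeForceFX (custom circuits)": {"gflops": 10.0, "watts": 3.0},
}


def gflops_per_watt(design):
    """Energy efficiency as throughput per watt."""
    return design["gflops"] / design["watts"]


p4 = gflops_per_watt(designs["Pentium-4 (custom)"])
gfx_asic = gflops_per_watt(designs["GeForceFX (ASIC)"])
gfx_custom = gflops_per_watt(designs["GeForceFX (custom circuits)"])

architecture_gain = gfx_asic / p4      # ASIC architecture vs. Pentium-4
circuit_gain = gfx_custom / gfx_asic   # custom circuits vs. ASIC, same architecture
combined_gain = gfx_custom / p4        # both effects together

print(f"architecture: {architecture_gain:.1f}x")
print(f"circuits:     {circuit_gain:.1f}x")
print(f"combined:     {combined_gain:.1f}x")
```

The circuit-only ratio of about 6.7x appears to correspond to the parenthesized 7x on the custom-circuits slide; the 9x figure is not derivable from the numbers shown here.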

  23. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  25. ASIC vs. Custom
      • ASIC Methods
        • Provide only coarse-grain control (100K+ gates), but require much less effort and historically scale with complexity
      • Custom Methods
        • Offer fine-grain control (individual transistors & gates), but require large effort and scale poorly with complexity
        • Exploit Design Structure
        • Exploit Circuit Techniques

  26. Custom Methods Emphasize Fine-Grain Manual Control + Custom Library

  27. Custom Methods Emphasize Fine-Grain Manual Control + Custom Library: Operation and Performance Characterized for the Specific Case

  28. ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library

  29. ASIC Methods Substitute Coarse-Grain Control Automation + Generic Library: Operation and Performance Characterized for the Typical/Generic Case

  30. ASIC Focus on 100K+ Gates: Lost Opportunities to Exploit Structure
      • Designs reuse similar basic building blocks
      • Building blocks: 1-10K gates, not 100K+ gates
        • 64-bit adder: 1K gates
        • 64x64 RF: 2K gates
        • 64x64 multiplier: 20K gates
      • Opportunities to exploit these structures are lost when the design is viewed in large chunks

  31. Different Architectures, Similar Building Blocks
      • 1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford): XCVRS, Bus, EX, RF, SRAM
      • 2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford): XCVRS, Bus, EX, RF, SRAM
      [Figure: die plots; labeled regions include Bank 0, Bank 1, LTLB, EMI, Memory Switch, NIF/Router, Cluster Switch, and CLST 0-2]

  32. Significant Structure Exists Within 100K-gates
      • 1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford): XCVRS, Bus, EX, RF, SRAM
      • 2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford): XCVRS, Bus, EX, RF, SRAM
      [Figure: die plots; labeled regions include Bank 0, Bank 1, LTLB, EMI, Memory Switch, NIF/Router, Cluster Switch, and CLST 0-2]

  33. Energy of 100K-gate Equivalent • ASIC (N2) = 1400K Ebits (typ) • Custom Logic = 424K Ebits* • SRAM (small) = 1085K Ebits • SRAM (med) = 155K Ebits • SRAM (large) = 50K Ebits *Based on data extracted from Intel McKinley

  34. Exploiting Circuit Techniques
      • Custom circuits more efficient
        • Reduced parasitics
        • 1.7x circuit techniques and flops
        • 1.4x libraries
        • 1.4x due to engineering interconnects
      • Subthreshold Circuits
        • Low performance but ultra-low power
        • Requires Architecture, Gates, Memories, CAD Tools
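
One way to read the three factors on this slide is as independent multipliers. The slide does not state that they compose this way, so the check below is an assumption on my part; it compares their product with the ASIC-to-custom ratio implied by the 1400K and 424K Ebit figures on the 100K-gate energy slide.

```python
# Individual custom-circuit gains listed on the slide.
factors = {
    "circuit techniques and flops": 1.7,
    "libraries": 1.4,
    "engineered interconnects": 1.4,
}

# Assumption: the gains are independent and multiply.
combined_factor = 1.0
for gain in factors.values():
    combined_factor *= gain

# Ratio implied by the 100K-gate energy figures (in thousands of Ebits).
asic_kebits = 1400.0
custom_kebits = 424.0
measured_ratio = asic_kebits / custom_kebits

print(f"product of factors: {combined_factor:.2f}x")
print(f"ASIC / custom:      {measured_ratio:.2f}x")
```

Both come out near 3.3x, which suggests the listed factors are meant to account for the gap between ASIC and custom logic energy.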

  35. Relating Power to Performance: CV/I, Idsat, tFO4
      $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
      $t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)

  36. Relating Power to Performance: Relating Vdd and Vt to tFO4
      $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
      $t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)

  37. Relating Power to Performance: Correlation to Reported Foundry Data
      $I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$
      $t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$ ($K_4 \approx 13.5$)
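
The two expressions above give a quick way to estimate how gate delay responds to supply scaling. The sketch below holds Leff, tox, Ceff, K3, and K4 constant so they cancel in the ratio, and takes Vgs equal to Vdd. The threshold of 0.3 V is a hypothetical value, and the two supply voltages simply reuse the 1.3 V and 0.65 V points mentioned earlier in the talk.

```python
def idsat_relative(vgs, vt):
    """Saturation current up to a constant: (Vgs - Vt)^1.25."""
    if vgs <= vt:
        raise ValueError("model only applies above threshold (Vgs > Vt)")
    return (vgs - vt) ** 1.25


def tfo4_relative(vdd, vt):
    """FO4 delay up to a constant: Vdd / Idsat, with Vgs = Vdd."""
    return vdd / idsat_relative(vdd, vt)


VT_ASSUMED = 0.3  # hypothetical threshold voltage in volts

slowdown = tfo4_relative(0.65, VT_ASSUMED) / tfo4_relative(1.3, VT_ASSUMED)
energy_saving = (1.3 / 0.65) ** 2  # from the Vdd^2 term in dynamic energy

print(f"delay increase at 0.65 V vs 1.3 V: {slowdown:.2f}x")
print(f"dynamic energy reduction:          {energy_saving:.1f}x")
```

Under these assumptions delay grows by less than 2x while switching energy falls by 4x, which is the kind of trade the power-versus-performance slides describe. The model is only meaningful above threshold, hence the guard.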

  38. Achievable Power Improvement (Assuming 50/50 split of Logic and Memory)

  39. Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

  40. Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)

  41. Achievable Power Improvement (Assuming 50/50 Split of Logic and Memory)
      • 130nm µP assumes 80% Dynamic and 20% Static
      • 90nm µP assumes 50% Dynamic and 50% Static

  42. Talk Outline • Normalized Metric: Ebit • Effect of Architecture • ASIC vs. Custom • Building Blocks • Achievable Energy Efficiency • 16b 1024 FFT Example • Answer to “Which Design is More Efficient”

  44. 16b 1024 point FFT
      • Generally, $k N \log N$ operations (complex multiplies) with pre-computation
      • Radix-2, Radix-4, etc. implementations
      • Decimation in time and/or decimation in frequency
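
The operation count can be made concrete with a textbook recursive radix-2 decimation-in-time FFT that counts its twiddle multiplications. This is an illustrative sketch, not any of the implementations compared in the talk: it computes twiddle factors on the fly rather than precomputing them, uses floating point instead of 16-bit data, and counts trivial multiplies (by 1) as well.

```python
import cmath
import math


def fft_radix2(x, counter):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2.

    counter is a one-element list incremented once per twiddle multiply.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2], counter)
    odd = fft_radix2(x[1::2], counter)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # one complex multiply
        counter[0] += 1
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


n = 1024
counter = [0]
signal = [complex(i % 7, 0) for i in range(n)]  # arbitrary test input
spectrum = fft_radix2(signal, counter)

expected = (n // 2) * int(math.log2(n))
print(f"complex multiplies counted: {counter[0]}, (N/2) log2 N = {expected}")
```

For this radix-2 formulation the constant k is 1/2 with a base-2 logarithm, giving 5120 complex multiplies for N = 1024. Higher radices and skipping trivial twiddles reduce the count further.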

  45. Range of Implementations
      • MIT FFT (2005): 0.18um CMOS, 628K T’s, 10kHz. Architecture and subthreshold circuits, 180mV operation
      • Spiffee (1999): 0.7um CMOS, 460K T’s, 173MHz. Cached FFT architecture and algorithm, 1.1V operation
      • SA-1100 (1999): 0.35um CMOS, 2.6M T’s, 74MHz. Commercial embedded processor, custom circuits, 1.5V operation
      • Imagine (2003): 0.15um CMOS, 22M T’s, 232MHz. Streaming media processor, tiled standard cells, 1.2V operation
      • Stratix IS25F627C8 (2005): 0.13um CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272MHz. Commercial FPGA co-processor
      • Intel P4 (2003): 0.13um CMOS, 3GHz, SSE. Commercial general-purpose processor, custom circuits, 1.5V operation
      • TI ‘C6416 (2003): 0.13um CMOS, 720MHz. Commercial digital signal processor

  46. Ebit Energy 16b 1024 point FFT

  47. Ebit Energy 16b 1024 point FFT

  48. Which Design Is More Efficient? • 0.7um CMOS 173MHz chip w/ 460K T’s • Vdd (typ) = 3.3V, Vdd (min) = 1.1V • Power = 845mW • 0.18um CMOS 10kHz chip w/ 640K T’s • Vdd (max) = 1.8V, Vdd (min) = 0.18V • Power = 1.6mW

  49. Which Design Is More Efficient? Depends on the Metric!
      • 0.7um CMOS 173MHz chip w/ 460K T’s
        • Vdd (typ) = 3.3V, Vdd (min) = 1.1V
        • Power = 845mW
        • EDP 143x better
      • 0.18um CMOS 10kHz chip w/ 640K T’s
        • Vdd (max) = 1.8V, Vdd (min) = 0.18V
        • Power = 1.6mW
        • Absolute energy 6x better
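
For reference, the two metrics named on this slide can be written out using their standard definitions. The symbols $P$ (average power) and $t_{FFT}$ (time to complete one FFT) are my own notation, not the talk's. The slides do not list the per-FFT latency of either chip, so the 143x and 6x figures cannot be reproduced from the power numbers alone.

```latex
% Energy per FFT: average power times the time to complete one transform
E_{FFT} = P \cdot t_{FFT}

% Energy-delay product: penalizes slow designs by weighting energy with latency
EDP = E_{FFT} \cdot t_{FFT} = P \cdot t_{FFT}^{2}
```

Because EDP grows with the square of latency, a slow subthreshold design can win on absolute energy per FFT while a much faster design still wins on EDP, which is the point this slide makes.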