
Literature Review

Literature review of "Measuring the Gap Between FPGAs and ASICs," Ian Kuon and Jonathan Rose, University of Toronto, IEEE TCAD/ICAS, February 2007. Henry Chen, February 26, 2010.



  1. Literature Review • Measuring the Gap Between FPGAs and ASICs • Ian Kuon, Jonathan Rose, University of Toronto • IEEE TCAD/ICAS, February 2007 • Henry Chen, February 26, 2010

  2. Introduction • Trade-offs between FPGAs and standard-cell ASICs • Decreased NRE, design time • Increased silicon area, power; decreased performance • FPGA inefficiencies known and accepted, but largely unquantified

  3. Previous Comparisons • Jones et al. (1986): MPGAs to standard cells • 1.5‒2.6x area, ~1.1x delay • Estimates based on only 5 circuits • Brown et al. (1992): FPGAs to MPGAs • 8‒12x area, ~3x delay • Optimistic FPGA gate counting? • Anecdotal evidence • Doesn’t consider “hard” macros (multipliers, memories) • Combine for FPGAs to standard cells • 12‒38x area, ~3.4x delay • Dated; based on (questionable?) extractions

  4. Previous Comparisons (2000s) • Zuchowski et al. (2002): LUT to ASIC gate (0.25μm‒90nm) • ~1/45 gate density, 12‒14x delay, ~500x dynamic power • Unexplained process-dependent density/power variation • Dependent on gates implemented per LUT • Wilton et al. (2005): Partial programmable replacement • 88x area, 2x delay • Single logic module • Compton & Hauck (2007): FPGA apps. to standard-cell • Avg 7.2x area • Scaled FPGA 0.15μm to 0.18μm standard-cell

  5. Methodology • Implement in both FPGA and standard-cell • Altera Stratix II FPGA: TSMC 90nm multi-Vt, 1.2V • Standard-cell: ST CMOS090 90nm, dual-Vt, 1.2V • Empirical results from 23 benchmarks • Rejected if different synthesis tools resulted in >5% register count deviation • Mix of logic, memory, DSP • Analyze gains from FPGA’s DSP and memory blocks • Exclude I/Os • Have device data from Altera
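
The benchmark filter on this slide can be sketched as a simple relative-difference check. The slide does not say which register count serves as the baseline, so measuring the deviation against the ASIC count is an assumption here, and the function names are hypothetical.

```python
def register_deviation(fpga_regs: int, asic_regs: int) -> float:
    """Relative difference in register count between the two flows.

    Assumption: the deviation is measured against the ASIC register count;
    the slide only states the 5% threshold.
    """
    if asic_regs <= 0:
        raise ValueError("asic_regs must be positive")
    return abs(fpga_regs - asic_regs) / asic_regs


def accept_benchmark(fpga_regs: int, asic_regs: int, threshold: float = 0.05) -> bool:
    """Keep a benchmark only if the register counts differ by at most `threshold`."""
    return register_deviation(fpga_regs, asic_regs) <= threshold


# Illustrative (made-up) register counts:
print(accept_benchmark(1040, 1000))  # 4% deviation, kept
print(accept_benchmark(900, 1000))   # 10% deviation, rejected
```

A check like this guards against comparing two implementations that the synthesis tools have turned into structurally different circuits.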

  6. Implementations • FPGA • Altera-provided CAD flow • Speed/area balanced optimization; optimize critical paths for performance, otherwise optimize area • Automatic DSP, memory block inference • Set to mimic effects of high resource utilization • ASIC • Synopsys/Cadence synthesis/PAR flow • Free to choose from high/standard-Vt cells • Timing-driven placement; target 75‒85% utilization • Emphasized performance in compiled memories

  7. Area Comparison • ASIC • Post PAR’d core area • Include memory macros • FPGA • Count only silicon area for used resources • Include surrounding routing resources • Count full block area even if only partially used • Area data from Altera
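
A minimal sketch of the FPGA area accounting described on this slide: only used resources are counted, and a partially used block is charged at its full area. The resource names and per-block areas below are placeholders, not Altera data, and the per-block area is assumed to already include the surrounding routing.

```python
import math


def fpga_used_area(usage: dict, block_area: dict) -> float:
    """Sum the silicon area of used FPGA resources.

    usage:      resource type -> number of blocks used (may be fractional)
    block_area: resource type -> area of one block, routing included
    A partially used block is rounded up and counted in full.
    """
    return sum(math.ceil(amount) * block_area[kind] for kind, amount in usage.items())


def area_gap(fpga_area: float, asic_area: float) -> float:
    """Area ratio FPGA / ASIC, both in the same units."""
    return fpga_area / asic_area


# Hypothetical design: 120 logic blocks, 2.5 memory blocks, a quarter of a DSP block.
usage = {"logic": 120.0, "memory": 2.5, "dsp": 0.25}
block_area = {"logic": 1.0, "memory": 4.0, "dsp": 20.0}  # arbitrary units
print(fpga_used_area(usage, block_area))  # 120*1 + 3*4 + 1*20 = 152.0
```

Rounding partial blocks up is what makes the FPGA estimate pessimistic, as the later caveats slide notes.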

  8. Area Comparison Results • Logic only: 35x avg (17‒54x) • Logic + DSP: 25x avg (12‒58x) • Logic + Memory: 33x avg (19‒70x) • Logic + Memory + DSP: 18x avg (9.5‒26x)

  9. Impact of Hard Macros on Area • Smaller area penalty for designs using hard macros • Hard macro close to ASIC implementation (plus programmable interface & routing)

  10. Area Comparison Caveats • Pessimistic FPGA area estimation; count full resource area even if only partially used (~5‒10% reduction) • ASIC density may decrease for larger designs, while FPGAs are designed to handle large designs

  11. Delay Comparison • Altera Quartus II / Synopsys PrimeTime SI • Static timing analysis to extract max. clock frequency • Compare for different FPGA speed grades • FPGAs are binned for performance • ASICs tend to be designed for worst-case

  12. Delay Comparison Results (Fastest Speed Grade) • Logic only: 3.4x avg (1.9‒5.0x) • Logic + DSP: 3.5x avg (2.4‒4.7x) • Logic + Memory: 3.5x avg (2.8‒4.3x) • Logic + Memory + DSP: 3.0x avg (2.6‒3.5x)

  13. Delay Comparison Results (Slowest Speed Grade) • Logic only: 4.6x avg (2.5‒6.7x) • Logic + DSP: 4.6x avg (3.0‒6.3x) • Logic + Memory: 4.8x avg (3.8‒5.7x) • Logic + Memory + DSP: 4.1x avg (3.8‒4.7x)

  14. Impact of Hard Macros on Delay • Almost no benefit—sometimes penalty! • Fixed positions in FPGA; extra routing to use • Fixed architecture; some apps. may not use efficiently

  15. Power Comparison • Altera Quartus II Power Analyzer / Synopsys PrimePower • Compare power, not energy consumption • FPGAs slower; need more time or parallelism • Implement for highest speed possible • Simulate at same operating frequency, voltage • Measure only core power • Assume constant toggle rates for all nets in design • Meaningful test vectors not available for all designs • FPGA static power consumption scaled by used fraction
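
The last bullet, scaling FPGA static power by the used fraction, could look like the sketch below. It assumes the fraction is the ratio of used silicon area to total device area, which the slide does not spell out; the function name and the numbers are illustrative only.

```python
def scaled_static_power(device_static_w: float, used_area: float, total_area: float) -> float:
    """Attribute a share of the whole-device static power to one design.

    Assumption: the "used fraction" is used area / total device area.
    """
    if total_area <= 0 or not 0 <= used_area <= total_area:
        raise ValueError("need 0 <= used_area <= total_area and total_area > 0")
    return device_static_w * used_area / total_area


# A design occupying a quarter of a device that leaks 2 W in total:
print(scaled_static_power(2.0, 25.0, 100.0))  # 0.5
```

Without this scaling, a small benchmark on a large device would be charged for leakage in logic it never uses.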

  16. Power Comparison Results • Logic only: 14x avg (5.7‒52x) • Logic + DSP: 12x avg (7.5‒16x) • Logic + Memory: 14x avg (12‒16x) • Logic + Memory + DSP: 7.1x avg (5.3‒8.3x)

  17. Impact of Hard Macros on Power • Slight benefit—primarily from area savings? • Less area and interconnect

  18. Power Consumption Caveats • May be disproportionate power in FPGA clock network • “Overdesigned” for tested circuits • Could have small incremental power increase • ASIC clock network would have to grow with designs

  19. Static Power Comparison • Unable to draw useful conclusions about static power • 87x for typical silicon, typical temp. (25°C) • 5.4x for worst-case silicon, worst-case temp. (85°C) • Had to scale worst-case silicon temp. characterization • Subthreshold leakage is process-dependent • Little information on leakage estimate factors • Different processes from different foundries • Some correlation between static power and area gap (correlation coefficient ~0.8) • Hard macros likely reduced static power penalty

  20. Conclusions • Disparity hard to quantify; very application dependent • Avg. max/min gap ratio 3x; max/min gap ratio range 1.3‒9.1x • All-LUT designs avg. 35x area, 3.4‒4.6x delay, 14x power • 119x area, 47.6x power gap for equal performance (assuming ideal parallelization) • Hard macros reduce area and power, but have little performance benefit • Avg. 18x area, 3‒4.1x delay, 7.1x power • 54x area, 21.3x power for equal performance
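
The arithmetic behind these figures can be checked with a short sketch. The equal-performance numbers are the area and power gaps multiplied by the fastest-speed-grade delay gap, since an ideally parallelized FPGA design needs that many copies to match ASIC throughput. The 3x average and 1.3‒9.1x range appear to be the ratio of the largest to the smallest gap within each benchmark category. That reading is an inference from the numbers on the area, fastest-grade delay, and power result slides, not something the slides state, and the names below are hypothetical.

```python
from statistics import mean

# Average gaps (FPGA / ASIC) from the conclusions slide; delay is the
# fastest-speed-grade figure.
AVERAGE_GAPS = {
    "logic only": {"area": 35.0, "delay": 3.4, "power": 14.0},
    "logic + memory + DSP": {"area": 18.0, "delay": 3.0, "power": 7.1},
}

# (min, max) gaps per category from the result slides.
GAP_RANGES = [
    (17, 54), (12, 58), (19, 70), (9.5, 26),          # area
    (1.9, 5.0), (2.4, 4.7), (2.8, 4.3), (2.6, 3.5),   # delay, fastest grade
    (5.7, 52), (7.5, 16), (12, 16), (5.3, 8.3),       # dynamic power
]


def equal_performance_gaps(gaps: dict) -> tuple:
    """Scale the area and power gaps by the delay gap (ideal parallelization)."""
    return (round(gaps["area"] * gaps["delay"], 1),
            round(gaps["power"] * gaps["delay"], 1))


def spread_summary(ranges: list) -> tuple:
    """Mean, smallest, and largest max/min ratio across categories."""
    ratios = [high / low for low, high in ranges]
    return round(mean(ratios), 1), round(min(ratios), 1), round(max(ratios), 1)


for name, gaps in AVERAGE_GAPS.items():
    print(name, equal_performance_gaps(gaps))  # (119.0, 47.6) then (54.0, 21.3)
print(spread_summary(GAP_RANGES))              # (3.0, 1.3, 9.1)
```

The largest spread, 9.1x, comes from dynamic power in the logic-only designs (5.7x to 52x), which illustrates how strongly the gap depends on the application.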

  21. References • Jones, Jr., H. S., Nagle, P. R., Nguyen, H. T., “A Comparison of Standard Cell and Gate Array Implementations in a Common CAD System,” Proc. IEEE CICC, 1986, pp. 228‒232 • Brown, S. D., Francis, R., Rose, J., Vranesic, Z., Field-Programmable Gate Arrays, Norwell, MA: Kluwer, 1992 • Zuchowski, P. S., Reynolds, C. B., Grupp, R. J., Davis, S. G., Cremen, B., Troxel, B., “A Hybrid ASIC and FPGA Architecture,” Proc. ICCAD, Nov. 2002, pp. 187‒194 • Wilton, S. J., Kafafi, N., Wu, J. C. H., Bozman, K. A., Aken’Ova, V., Saleh, R., “Design Considerations for Soft Embedded Programmable Logic Cores,” IEEE JSSC, vol. 40, no. 2, pp. 485‒497, Feb. 2005 • Compton, K., Hauck, S., “Automatic Design of Area-Efficient Configurable ASIC Cores,” IEEE Trans. Comp., vol. 56, no. 5, pp. 662‒672, May 2007
