A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Hadi Parandeh-Afshar, Alessandro Cevrero, Panagiotis Athanasopoulous, Philip Brisk, Yusuf Leblebici, Paolo Ienne. LAP and LSM, EPFL; UCR.

Presentation Transcript


  1. A Flexible DSP Block to Enhance FPGA Arithmetic Performance • Hadi Parandeh-Afshar (LAP, EPFL), Alessandro Cevrero (LSM and LAP, EPFL), Panagiotis Athanasopoulous (LSM and LAP, EPFL), Philip Brisk (UCR), Yusuf Leblebici (LSM, EPFL), Paolo Ienne (LAP, EPFL) • École Polytechnique Fédérale de Lausanne (EPFL), University of California, Riverside (UCR) • first_name.last_name@epfl.ch, first_name@cs.ucr.edu

  2. Motivation and contribution • New DSP block for high-performance FPGAs • Increased flexibility: bypassable partial product generator (PPG), programmable compressor tree • Enhance FPGA arithmetic performance

  3. Motivation and contribution • DSP blocks cannot accelerate multi-operand addition • Fused multiply-addition operations cannot use current DSP blocks in a single cycle • Data flow transformations automatically expose compressor trees [Verma et al., TCAD 08] • Figure: arithmetic transformations, panels (a) and (b)

  4. Outline • Related work • Limitations • DSP Block Architecture • Experimental methodology • Results • Conclusions

  5. FPGA commentary • Logic cells with dedicated addition circuitry and fast carry chains • Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., ASPDAC 08, DATE 08, FPL 09] • IP cores [Xilinx, Altera] • FP cores [Beauchamp et al., TVLSI 08] • DSP blocks [Altera Stratix III-IV]

  6. FPGA commentary • Logic cells with dedicated addition circuitry and fast carry chains • Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., DATE 08, ASPDAC 08, FPL 09] • IP cores [Xilinx, Altera] • FP cores [Beauchamp et al., TVLSI 08] • DSP blocks [Altera Stratix III-IV]

  7. Field Programmable Compressor Tree (FPCT) • User-configurable multi-operand adder • Compressor tree + bypassable CPA [Cevrero et al., FPGA 08, TRETS 09] • Figure: CSlices chained through carry-in and carry-out; 128 = 8 × 16 input bits, 48 = 8 × 6 output bits
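Slide 7 describes the FPCT as a compressor tree followed by a bypassable carry-propagate adder (CPA). The idea can be sketched in software as repeated 3:2 (carry-save) compression followed by one final addition. This is a minimal illustration under stated assumptions, not the FPCT's actual CSlice structure: the function names are hypothetical, operands are assumed non-negative, and the 3:2 compressor is a generic textbook choice.

```python
from typing import List, Tuple


def compress_3_2(a: int, b: int, c: int) -> Tuple[int, int]:
    """Carry-save step: reduce three operands to a sum word and a carry word."""
    partial_sum = a ^ b ^ c                      # bitwise sum, carries ignored
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bits moved to their weight
    return partial_sum, carry


def compressor_tree_add(operands: List[int]) -> int:
    """Add non-negative integers with 3:2 compression, then one final CPA.

    Hypothetical helper for illustration only.
    """
    words = list(operands)
    while len(words) > 2:
        reduced = []
        full = len(words) - len(words) % 3       # number of words in full groups of three
        for i in range(0, full, 3):
            reduced.extend(compress_3_2(words[i], words[i + 1], words[i + 2]))
        reduced.extend(words[full:])             # leftover words pass through unchanged
        words = reduced
    return sum(words)                            # final carry-propagate addition
```

Only the last step propagates carries across the full word; every earlier level works bit position by bit position, which is why compressor trees suit multi-operand addition.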

  8. FPCT limitations • PPG in soft logic • Figure: 9x9-bit signed multiplier [Baugh-Wooley]; a soft-logic 9x9-bit PPG (81 LUTs) drives the FPCT over 82 wires; 18-bit output

  9. FPCT limitations • PPG in soft logic • Low input utilization for multipliers • Figure: 9x9-bit signed multiplier [Baugh-Wooley]; a soft-logic 9x9-bit PPG (81 LUTs) drives the FPCT over 82 wires (64% input utilization); CSlices C0 to C6; 18-bit output
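Slides 8 and 9 refer to a 9x9-bit signed Baugh-Wooley partial product generator. The sketch below shows the textbook Baugh-Wooley formulation, assuming two's-complement inputs; the function names are hypothetical, and the slides do not say how the correction constants are wired, so the 83 terms produced here need not match the 82 wires in the figure.

```python
from typing import List, Tuple


def baugh_wooley_ppg(a: int, b: int, n: int = 9) -> List[Tuple[int, int]]:
    """Return (bit, weight) pairs whose weighted sum mod 2**(2n) equals a*b.

    a and b are n-bit two's-complement values in [-2**(n-1), 2**(n-1) - 1].
    Hypothetical helper for illustration only.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    terms = []
    for i in range(n):
        for j in range(n):
            bit = a_bits[i] & b_bits[j]
            # Terms involving exactly one sign bit are complemented.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            terms.append((bit, i + j))
    # Correction constants of the textbook formulation.
    terms.append((1, n))
    terms.append((1, 2 * n - 1))
    return terms


def signed_product(terms: List[Tuple[int, int]], n: int = 9) -> int:
    """Sum the weighted bits and reinterpret the 2n-bit result as signed."""
    total = sum(bit << weight for bit, weight in terms) % (1 << (2 * n))
    if total >= 1 << (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

For n = 9 the generator produces 81 partial-product bits, one per AND term, which lines up with the 81 LUTs quoted on the slide. The quoted 64% utilization is consistent with 82 wires driving 128 FPCT inputs, since 82 / 128 ≈ 0.64.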

  10. DSP block architecture • Figure: FPCT (8 CSlices) with 128 input bits and 48 output bits

  11. DSP block architecture • Two 9x9 signed PPGs • One modified to support larger multipliers • Hard compression circuits 'A' and 'B' • Efficient synthesis of large multipliers • Figure: PPG* and PPG feed hard compression circuits A and B and two ½-FPCTs (4 CSlices each)

  12. DSP block architecture • Two 9x9 signed PPGs • One modified to support larger multipliers • Hard compression circuits 'A' and 'B' • Efficient synthesis of large multipliers • Figure: PPG* and PPG feed fixed logic (A) and fixed logic (B) and two ½-FPCTs (4 CSlices each); CSlices C1 to C4 labeled

  13. DSP block architecture • Two 9x9 signed PPGs • One modified to support larger multipliers • Hard compression circuits 'A' and 'B' • Efficient synthesis of large multipliers • Only 8% larger than the traditional FPCT in 90nm CMOS (ARTISAN cell library with TSMC process) • Figure: PPG* and PPG feed hard compression circuits A and B and two ½-FPCTs (4 CSlices each)
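Slides 11 to 13 mention efficient synthesis of large multipliers from 9x9 PPGs without showing the decomposition. As a generic illustration, an 18x18-bit signed product can be assembled from four 9x9 sub-products. The function names are hypothetical, and this is standard arithmetic rather than the block's documented mapping. Note that three of the four sub-products involve an unsigned half, which may be one reason a modified PPG is needed, although the slides do not say so.

```python
from typing import Tuple


def split_18(x: int) -> Tuple[int, int]:
    """Split an 18-bit two's-complement value into (signed high 9 bits, unsigned low 9 bits)."""
    low = x & 0x1FF   # unsigned, 0..511
    high = x >> 9     # arithmetic shift keeps the sign, -256..255
    return high, low


def mul18_from_9x9(a: int, b: int) -> int:
    """Build an 18x18 signed product from four 9x9 sub-products (illustrative sketch)."""
    ah, al = split_18(a)
    bh, bl = split_18(b)
    hh = ah * bh      # signed x signed
    hl = ah * bl      # signed x unsigned
    lh = al * bh      # unsigned x signed
    ll = al * bl      # unsigned x unsigned
    # Each sub-product is shifted to its weight; the final sum is a multi-operand addition.
    return (hh << 18) + ((hl + lh) << 9) + ll
```

The last line adds several shifted operands, which is the kind of multi-operand addition a compressor tree is meant to handle.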

  14. Experimental methodology • Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06] • Define a preplaced soft IP core: F* • Same area and I/O as our DSP block • Figure: IP blocks placed between input pins and output pins

  15. Experimental methodology • Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06] • Define a preplaced soft IP core: F* • Same area and I/O as our DSP block • Replace our DSP block with F* • Map benchmark on Stratix II • Extract F* delay • Estimate proposed DSP block delay • ASIC design flow (90nm CMOS) • Figure: F* blocks placed between input pins and output pins

  16. Experimental methodology • Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06] • Define a preplaced soft IP core: F* • Same area and I/O as our DSP block • Replace our DSP block with F* • Map benchmark on Stratix II • Extract F* delay • Estimate proposed DSP block delay • ASIC design flow (90nm CMOS) • For each proposed DSP block in the circuit • Subtract delay of F* • Add proposed DSP block delay • Figure: New-DSP blocks placed between input pins and output pins
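The delay adjustment on slide 16 (subtract the F* delay, add the proposed DSP block delay, once per block) can be written as a few lines of arithmetic. This is a minimal sketch: the function name, parameter names, and the numbers in the usage note are hypothetical, and it assumes the adjustment is applied to the blocks lying on the measured critical path.

```python
from typing import Iterable


def adjusted_critical_path_ns(measured_path_ns: float,
                              f_star_delays_ns: Iterable[float],
                              proposed_dsp_delay_ns: float) -> float:
    """Swap each F* delay on the path for the proposed DSP block delay.

    Hypothetical helper; all delays are in nanoseconds.
    """
    total = measured_path_ns
    for f_star_delay in f_star_delays_ns:
        total -= f_star_delay            # subtract delay of F*
        total += proposed_dsp_delay_ns   # add proposed DSP block delay
    return total
```

For example, with made-up values, a measured path of 10.0 ns that crosses two F* blocks of 2.5 ns and 2.0 ns, and an ASIC-estimated block delay of 1.5 ns, `adjusted_critical_path_ns(10.0, [2.5, 2.0], 1.5)` evaluates 10.0 - 2.5 + 1.5 - 2.0 + 1.5.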

  17. Results • Figure: critical path delay (ns) for Ternary, GPC [Parandeh-Afshar et al., ASPDAC 08], Stratix II DSP block, FPCT w/ soft PPG, and proposed DSP block

  18. Results • Figure: normalized area (to Stratix II DSP block area) for Stratix II DSP block, FPCT w/ soft PPG, and proposed DSP block

  19. Conclusion • New DSP block proposed • Accelerates multiplication and multi-operand addition • More flexibility • Competitive with Stratix II DSP block • Intended to replace the compressor tree in existing DSP blocks • Only 8% area overhead with respect to the original FPCT
