A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Hadi Parandeh-Afshar (LAP, EPFL), Alessandro Cevrero (LSM and LAP, EPFL), Panagiotis Athanasopoulos (LSM and LAP, EPFL), Philip Brisk (UCR), Yusuf Leblebici (LSM, EPFL), Paolo Ienne (LAP, EPFL)
École Polytechnique Fédérale de Lausanne (EPFL); University of California, Riverside (UCR)
{first_name.last_name}@epfl.ch, first_name@cs.ucr.edu

Motivation and contribution
• New DSP block for high-performance FPGAs
• Increased flexibility
  • Bypassable PPG (partial product generator)
  • Programmable compressor tree
• Enhance FPGA arithmetic performance

Motivation and contribution
• DSP blocks cannot accelerate multi-operand addition
• Fused multiply-add operations cannot use current DSP blocks in a single cycle
• Data-flow transformations automatically expose compressor trees [Verma et al., TCAD 08]
[Figure: panels (a) and (b), arithmetic transformations on a datapath with inputs E1, E2, M1, M2, S1, S2]

Outline
• Related work
• Limitations
• DSP block architecture
• Experimental methodology
• Results
• Conclusions

FPGA commentary
• Logic cells with dedicated addition circuitry and fast carry chains
• Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., ASPDAC 08, DATE 08, FPL 09]
• IP cores [Xilinx, Altera]
• FP cores [Beauchamp et al., TVLSI 08]
• DSP blocks [Altera Stratix III-IV]

Field Programmable Compressor Tree (FPCT)
• User-configurable multi-operand adder
• Compressor tree + bypassable CPA (carry-propagate adder) [Cevrero et al., FPGA 08, TRETS 09]
• 8 CSlices chained through carry-in/carry-out: 16 × 8 = 128 input bits, 6 × 8 = 48 output bits
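To make the "compressor tree + CPA" idea concrete, here is a minimal Python sketch of multi-operand addition. It is an assumption-laden illustration, not the FPCT's actual CSlice structure: it uses plain 3:2 compressors (full adders applied bitwise) to reduce the operands to two rows, then a single carry-propagate addition. The function names `compress_3_2` and `multi_operand_add` are hypothetical.

```python
def compress_3_2(x, y, z):
    """Bitwise 3:2 compressor: three rows in, a sum row and a carry row out."""
    s = x ^ y ^ z                             # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1    # per-bit carry, shifted to its weight
    return s, c


def multi_operand_add(operands, width):
    """Add many operands modulo 2**width using carry-save reduction."""
    mask = (1 << width) - 1
    rows = [op & mask for op in operands]
    # Compressor tree: keep reducing groups of three rows to two rows.
    while len(rows) > 2:
        nxt = []
        for k in range(0, len(rows) - 2, 3):
            s, c = compress_3_2(rows[k], rows[k + 1], rows[k + 2])
            nxt += [s & mask, c & mask]
        nxt += rows[len(rows) - len(rows) % 3:]   # rows left over this round
        rows = nxt
    # Final carry-propagate adder (CPA) on the remaining one or two rows.
    return sum(rows) & mask


if __name__ == "__main__":
    ops = [13, 200, 7, 99, 1, 64, 255, 31]
    print(multi_operand_add(ops, 16), sum(ops))
```

Only the last step propagates carries across the word; every earlier level has constant depth regardless of operand width, which is why a compressor tree is attractive for multi-operand addition.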

FPCT limitations
• PPG implemented in soft logic
• Low input utilization for multipliers
Example: 9x9-bit signed multiplier [Baugh-Wooley]
• Soft-logic 9x9-bit PPG (81 LUTs)
• 82 wires into the FPCT: 64% input utilization
• 18-bit output
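As an illustration of what a Baugh-Wooley PPG produces, the following Python sketch generates the partial-product bits for a signed n x n multiplication and sums them. It assumes the common textbook formulation (complement the terms that involve exactly one sign bit, then add constant 1s at bit positions n and 2n-1); the slide's exact wire assignment may differ, and the function names are hypothetical.

```python
def baugh_wooley_bits(a, b, n=9):
    """Return (weight, bit) pairs whose weighted sum is a*b modulo 2**(2n)."""
    mask = (1 << n) - 1
    ua, ub = a & mask, b & mask               # two's-complement bit patterns
    abit = [(ua >> i) & 1 for i in range(n)]
    bbit = [(ub >> j) & 1 for j in range(n)]
    bits = []
    for i in range(n):
        for j in range(n):
            p = abit[i] & bbit[j]             # one AND gate per bit pair
            if (i == n - 1) != (j == n - 1):  # exactly one operand bit is a sign bit
                p ^= 1                        # complement that partial product
            bits.append((i + j, p))
    bits.append((n, 1))                       # constant correction bits
    bits.append((2 * n - 1, 1))
    return bits


def signed_multiply(a, b, n=9):
    """Sum the partial-product bits and interpret the result as a signed 2n-bit value."""
    total = sum(bit << w for w, bit in baugh_wooley_bits(a, b, n))
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


if __name__ == "__main__":
    print(signed_multiply(-256, 255), -256 * 255)
```

In hardware, the summation inside `signed_multiply` is the job of the compressor tree and final adder; the PPG is only the AND/complement stage, which is why 81 bit-level products are needed for a 9x9 multiplier.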

DSP block architecture
• Starting point: FPCT (8 CSlices) with 128 inputs and 48 outputs
• Two 9x9 signed PPGs
  • One modified (PPG*) to support larger multipliers
• Hard compression circuits ‘A’ and ‘B’ (fixed logic)
• Efficient synthesis of large multipliers
• Only 8% larger than the traditional FPCT in 90nm CMOS (ARTISAN cell library with TSMC process)
[Figure: block diagram with 128 inputs, PPG and PPG*, fixed logic blocks A and B, and two ½-FPCTs (4 CSlices each)]

Experimental methodology
• Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06]
• Define a preplaced soft IP core: F*
  • Same area and I/O as our DSP block
• Replace our DSP block with F*
• Map benchmark on Stratix II
• Extract F* delay
• Estimate proposed DSP block delay
  • ASIC design flow (90nm CMOS)
• For each proposed DSP block in the circuit
  • Subtract delay of F*
  • Add proposed DSP block delay
[Figure: FPGA floorplan with input pins, output pins, and preplaced IP blocks replaced first by F* and then by the new DSP block]
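The last step of the VEB flow is simple arithmetic, sketched below in Python. This is only an illustration of the subtract-and-add rule stated on the slide; the function name, parameter names, and all delay values are hypothetical and are not results from the presentation.

```python
def adjusted_path_delay(mapped_delay_ns, f_star_delays_ns, dsp_delay_ns):
    """Correct a path delay reported for the F* placeholder design.

    mapped_delay_ns  -- path delay reported after mapping with F* cores
    f_star_delays_ns -- extracted delay of each F* instance on that path
    dsp_delay_ns     -- proposed DSP block delay from the ASIC flow
    """
    delay = mapped_delay_ns
    for f_star_delay in f_star_delays_ns:
        delay -= f_star_delay     # remove the soft placeholder's delay
        delay += dsp_delay_ns     # add the hard block's estimated delay
    return delay


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the calculation.
    print(adjusted_path_delay(10.0, [3.0, 3.0], 1.5))
```

Routing delay to and from the block stays in the total because F* occupies the same area and I/O positions as the proposed block, which is the point of the placeholder.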

Results: critical path delay (ns)
[Figure: bar chart comparing Ternary, Stratix II DSP block, proposed DSP block, GPC [Parandeh-Afshar et al., ASPDAC 08], and FPCT w/ soft PPG]

Results: normalized area (relative to Stratix II DSP block area)
[Figure: bar chart comparing Stratix II DSP block, proposed DSP block, and FPCT w/ soft PPG]

Conclusion
• New DSP block proposed
• Accelerates multiplication and multi-operand addition
• More flexibility
• Competitive with the Stratix II DSP block
• Intended to replace the compressor tree in existing DSP blocks
• Only 8% area overhead with respect to the original FPCT
