
Alessandro Cevrero 1,2

Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA. Alessandro Cevrero 1,2. Panagiotis Athanasopoulos 1,2. Hadi Parandeh-Afshar 2. Philip Brisk 2. Frank K. Gurkaynak 1. Ajay K. Verma 2. Yusuf Leblebici 1.


Presentation Transcript


  1. Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA. Alessandro Cevrero 1,2, Panagiotis Athanasopoulos 1,2, Hadi Parandeh-Afshar 2, Philip Brisk 2, Frank K. Gurkaynak 1, Ajay K. Verma 2, Yusuf Leblebici 1, Paolo Ienne 2. 16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008.

  2. Motivation and Contribution. Goal: improve FPGA performance for arithmetic circuits. Field Programmable Counter Array (FPCA) [Brisk et al., DAC 2007]: programmable IP core to accelerate compressor trees; hybrid FPGA/FPCA device. Contributions: • Completely new FPCA architecture • Reduced routing delay • More flexibility and better mapping • Simplified integration process

  3. FPGA Commentary. Logic cells with dedicated addition circuitry and fast carry chains; support for ternary addition [Altera Stratix II/III, Xilinx Virtex-5]. Parallel accumulation uses adder trees. ASIC designers use compressor trees! Compressor tree synthesis on FPGAs via GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008, DATE 2008]: faster than ternary adder trees. IP cores: DSP48, BlockRAM, etc. [Xilinx, Altera]; FP cores [Beauchamp et al., TVLSI 2008]; mismatches in bitwidth limit gains [Kuon and Rose, FPGA 2006, TCAD 2007].
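
To make the GPC mapping mentioned on this slide more concrete: a generalized parallel counter (GPC) takes input bits of several adjacent weights and outputs their weighted sum in binary. The following is a minimal Python sketch of that behavior only. The function name `gpc` and the list-of-columns input format are assumptions made for illustration; they are not taken from the cited papers.

```python
def gpc(columns):
    """Model a generalized parallel counter (illustrative sketch).

    columns[i] is the list of input bits that carry weight 2**i.
    Returns the output bits of the weighted sum, least significant first.
    """
    # Weighted sum of all input bits.
    total = sum(bit << weight
                for weight, bits in enumerate(columns)
                for bit in bits)
    # Largest value the inputs could produce; this fixes the output width.
    max_total = sum(len(bits) << weight for weight, bits in enumerate(columns))
    width = max(1, max_total.bit_length())
    return [(total >> i) & 1 for i in range(width)]
```

For example, `gpc([[1, 1, 1], [1, 1, 1]])` sums three bits of weight 1 and three bits of weight 2 (3 + 6 = 9) and returns `[1, 0, 0, 1]`. With a single column of three bits, the same function behaves as a 3:2 counter (a full adder).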

  4. Methodology and Solution. FPCA: programmable compressor tree. • Transform circuit to merge disparate addition and multiplication operations to expose compressor trees [Verma and Ienne, ICCAD 2004] • Synthesize compressor tree onto FPCA [Brisk et al., DAC 2007] • Map everything else onto traditional FPGA (standard approach) • Integrate FPGA+FPCA onto same die (ongoing research at EPFL)

  5. Previous Work. Initial FPCA architecture [Brisk et al., DAC 2007]: • Routing network delay: performance bottleneck • Poor area utilization: many resources unused; large counters implement the functionality of smaller counters • “Pitch matching” problem: FPCA routing channels must align with FPGA routing channels; leads to unnecessarily large counters

  6. Recurring Patterns in Compressor Tree Synthesis. [Figure: 15 bits into a 15:4 counter, 4 bits into a 4:3 counter, 3 bits into a 3:2 counter, 2 bits into a CPA.] New FPCA architecture: • Counter Slice (CSlice): compress one column at a time; propagate carry bits to neighboring CSlices • Eliminates FPGA-style routing network: no routing delay between counters; pitch matching problem disappears
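
The 15:4, 4:3, 3:2, CPA pattern on this slide can be illustrated with a small behavioral model. The sketch below is one reading of that pattern, not the authors' hardware or tool flow: each column of up to 15 bits passes through an m:n counter, each output bit moves to the column matching its weight, and the final two rows go to a carry-propagate adder. All function names are hypothetical, and the CPA is modeled with plain integer arithmetic.

```python
def counter(bits):
    """m:n counter: return the number of set input bits in binary, LSB first."""
    count = sum(bits)
    width = max(1, len(bits).bit_length())
    return [(count >> j) & 1 for j in range(width)]


def compress_stage(columns, max_inputs):
    """Apply one counter per column; output bit j moves to column i + j."""
    out = [[] for _ in range(len(columns) + max_inputs.bit_length())]
    for i, bits in enumerate(columns):
        assert len(bits) <= max_inputs, "column too tall for this counter"
        for j, b in enumerate(counter(bits) if bits else []):
            out[i + j].append(b)
    return out


def to_columns(operands, width):
    """Arrange the operands as columns of equal-weight bits."""
    return [[(x >> i) & 1 for x in operands] for i in range(width)]


def cpa(columns):
    """Final carry-propagate addition of the (at most two) remaining rows.

    Modeled with integer arithmetic rather than a gate-level adder.
    """
    assert all(len(bits) <= 2 for bits in columns)
    return sum(sum(bits) << i for i, bits in enumerate(columns))


def compressor_tree_sum(operands, width):
    """Sum up to 15 operands with 15:4, 4:3, and 3:2 stages, then a CPA."""
    columns = to_columns(operands, width)
    for max_inputs in (15, 4, 3):
        columns = compress_stage(columns, max_inputs)
    return cpa(columns)
```

After the 15:4 stage each column holds at most four bits (its own sum bit plus carries from three lower columns), which is why a 4:3 counter follows, and likewise a 3:2 counter after that. As a check, `compressor_tree_sum([255] * 15, 8)` equals `sum([255] * 15)`.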

  7. FPCA v2.0 Area Utilization. [Figure: CSlice architecture. Four adjacent CSlices, each with a 15:4 counter, a 4:3 counter, a 3:2 counter, and a CPA stage, producing outputs S(i+3), S(i+2), S(i+1), S(i); labeled "Configurable GPC".]

  8. FPCA v2.0 Mapping Heuristic. FPCA synthesis heuristic: • Map columns of input bits onto FPCA • Minimize the height of the compressor tree • Avoid vertical configurations, when possible. [Figure: multi-FPCA configurations, vertical and horizontal, with routing delay between FPCAs.]

  9. CSlice Synthesis. FPCA synthesis: 90nm Artisan standard cell library. • Rank-3 CSlices used in experiments • 8 CSlices per FPCA: similar to dimensions of a DSP block in current FPGAs; simplifies integration process • DFFs store configuration bitstream • Semi-custom design: standard cells are predominant. [Figure: CSlice v2.0, rank-3, with 16 input bits per CSlice.]

  10. FPCA Delay Extraction. [Figure: sum operations between input pins and output pins, each replaced by an F* instance standing in for an FPCA.] Methodology: • Each FPCA instance is replaced with an F* instance (same I/O) • Extract the delay between F* instances • Combine these delays with the combinational delay extracted for the FPCA. Steps: • Define a pre-placed soft IP core, F*: same dimensions and I/O as FPCA; map onto Stratix II FPGA; extract critical path delay • Replace all sum operations with F* • Map compressor tree onto FPCA: configuration DFF values set to constant values, not optimized; measure critical path delay • For each compressor tree in the circuit: subtract delay of F*; add FPCA delay
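
The final step of this methodology (subtract the F* delay, add the FPCA delay, for each compressor tree) amounts to simple arithmetic. The sketch below shows one way to express it, assuming the per-tree delays are already available as numbers in nanoseconds. The function name, the tuple layout, and the choice to pass only the trees on the critical path are assumptions for illustration; the slides do not specify how the authors scripted this.

```python
def estimate_critical_path(delay_with_fstar_ns, trees):
    """Estimate the FPGA+FPCA critical path delay (illustrative sketch).

    delay_with_fstar_ns: critical path delay measured on the FPGA with every
        sum operation replaced by the placeholder core F*.
    trees: one (fstar_delay_ns, fpca_delay_ns) pair per compressor tree
        considered (assumed here to lie on the critical path).
    """
    estimate = delay_with_fstar_ns
    for fstar_delay_ns, fpca_delay_ns in trees:
        # Swap the placeholder's delay for the delay extracted for the FPCA.
        estimate = estimate - fstar_delay_ns + fpca_delay_ns
    return estimate
```

With made-up numbers (not results from the presentation), `estimate_critical_path(10.0, [(3.0, 1.0), (2.0, 0.5)])` gives 6.5.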

  11. Experimental Results. Comparison: • GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008] • FPCA mapping (6 FPCAs per device). [Chart: speedup labels 1.60x and 2.40x.]

  12. Conclusion. • New FPCA architecture: hardwired connections between counters; counters of multiple sizes organized into CSlices; carry chains between CSlices • Avg./max. speedups of 1.60x/2.40x compared to GPC mapping. Future Work: • Add pipeline registers to FPCA: increase latency, increase clock frequency, throughput • Demonstrator chip taped out in October 2007: returned from the foundry in January 2008; PCBs ready next week; measure power consumption, clock frequency, I/O interface, etc.

  13. Demonstrator Chip
