
Coarse and Fine Grain Programmable Overlay Architectures for FPGAs




  1. Coarse and Fine Grain Programmable Overlay Architectures for FPGAs Alex Brant Advisor: Guy Lemieux University of British Columbia

  2. Outline • Motivation • Contributions • Prior Work • ZUMA FPGA Overlay • CARBON-Razor Overlay • Summary

  3. Motivation - 1 • FPGA Overlays • FPGA designs that can be further programmed by the user • What are the benefits? • Ease of use (simpler languages, tools, etc.) • Optimized for particular problem domains • Open access to architecture & CAD • User-configured logic added to fixed FPGA bitstream • Dynamic reconfiguration on any device • Portability between vendors and devices

  4. Motivation - 2 Fine Grain Overlay – ZUMA • FPGA-like architecture • Compatible with VTR CAD tools • “Virtual” FPGA for portability of designs • Open source for research and applications • Implements fine grain part of MALIBU architecture • Generic implementation has high area overhead • Overcome by utilizing low level FPGA resources, implementing more efficient structures

  5. Motivation - 3 Coarse Grain Overlay – CARBON • Array of time-multiplexed ALUs • Fast compile • High density • Efficient mapping of word oriented circuits • Implements coarse grain part of MALIBU • Time-multiplexing limits overall performance • Performance gained using overclocking with error tolerance (CARBON-Razor)

  6. Contributions • Area efficient implementation of fine grain routing and logic with LUTRAMs • Area efficient 2-stage local routing network and configuration controller • Extension of Razor error tolerance from pipelined processors to 2D processing arrays • Design of an overclockable coarse grain FPGA overlay with in-circuit error correction

  7. Publications • ZUMA: An Open FPGA Overlay Architecture, Alexander Brant and Guy G.F. Lemieux (FCCM 2012) • Pipeline Frequency Boosting: Hiding Dual-Ported Block RAM Latency using Intentional Clock Skew, Alexander Brant, Ameer Abdelhadi, Aaron Severance, Guy G.F. Lemieux (FPT 2012) • CARBON-Razor: An Error-Tolerant Coarse Grain FPGA (in preparation)

  8. Outline • Motivation • Contributions • Prior Work • ZUMA FPGA Overlay • CARBON-Razor Overlay • Summary

  9. FPGA Architecture • Implements any logic function

  10. MALIBU Architecture • Hybrid coarse/fine grain FPGA • Time-multiplexed ALU (CG) combined with FPGA cluster • CG passes data to neighbors through memories

  11. MALIBU Hybrid FPGA • CGs are run on fast system clock (e.g. > 1 GHz) • System clock / Schedule length = User clock rate • Advantages: • Greater density from time-multiplexing • Ability to trade off area against speed • Compiles up to 300x faster than normal FPGA • Better performance for word-oriented circuits
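The clock relation on this slide can be illustrated with a small calculation. This is only a sketch: the function name `user_clock_mhz` and the example numbers (a 1000 MHz system clock, schedule lengths of 8 and 4) are assumptions chosen for illustration, not figures from the presentation.

```python
def user_clock_mhz(system_clock_mhz: float, schedule_length: int) -> float:
    """User clock rate = system clock / schedule length (relation from the slide)."""
    if schedule_length < 1:
        raise ValueError("schedule length must be at least one timeslot")
    return system_clock_mhz / schedule_length


# Hypothetical numbers: a 1 GHz system clock shared over 8 timeslots.
slow = user_clock_mhz(1000.0, 8)
# Halving the schedule length (using more ALUs in parallel) doubles the user clock,
# which is the area/speed trade-off listed among the advantages.
fast = user_clock_mhz(1000.0, 4)
```

A shorter schedule spreads the circuit over more coarse grain units, so the user clock rises at the cost of area.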

  12. Razor Timing Error Tolerance • Works with feed-forward pipeline circuits • Detects timing errors by capturing data a second time with a delayed clock • Tolerates errors by stalling pipeline one cycle

  13. Razor Timing Error Example • Data captured in main FF

  14. Razor Timing Error Example • Data captured in main FF • Fraction of cycle later, data captured by shadow latch

  15. Razor Timing Error Example • Data captured in main FF • Fraction of cycle later, data captured by shadow latch • Main FF and Shadow latch are compared

  16. Razor Timing Error Example • Data captured in main FF • Fraction of cycle later, data captured by shadow latch • Main FF and Shadow latch are compared • If different, shadow data loaded to main FF, pipeline is stalled

  17. Razor Timing Error Example • Data captured in main FF • Fraction of cycle later, data captured by shadow latch • Main FF and Shadow latch are compared • If different, shadow data loaded to main FF, pipeline is stalled • If not, pipelining proceeds normally
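The capture, compare and stall sequence in this example can be sketched as a behavioural model. This is a simplified illustration, not the circuit from the presentation: the names `razor_capture` and `run_pipeline`, the single-stage model, and the timing numbers are all assumptions. Data that arrives after the main clock edge is missed by the main FF but still caught by the shadow latch, provided it arrives within the extra shadow delay.

```python
def razor_capture(new_value, old_value, arrival_delay, period, shadow_delay):
    """Model one Razor stage for one cycle.

    Returns (value held by the main FF after recovery, error flag).
    """
    # Main FF samples at the clock edge; late data means it keeps the stale value.
    main = new_value if arrival_delay <= period else old_value
    # Shadow latch samples a fraction of a cycle later, on the delayed clock.
    shadow = new_value if arrival_delay <= period + shadow_delay else old_value
    error = main != shadow
    if error:
        main = shadow  # shadow data is loaded into the main FF
    # Note: data later than period + shadow_delay is NOT detected by this scheme.
    return main, error


def run_pipeline(items, period, shadow_delay, initial=0):
    """Feed (value, arrival_delay) pairs through one stage; count cycles used."""
    held, outputs, cycles = initial, [], 0
    for value, delay in items:
        held, error = razor_capture(value, held, delay, period, shadow_delay)
        cycles += 2 if error else 1  # an error stalls the pipeline one cycle
        outputs.append(held)
    return outputs, cycles


# Hypothetical timing: the second item misses the main edge but not the shadow latch.
outputs, cycles = run_pipeline([(1, 0.8), (2, 1.1), (3, 0.9)], period=1.0, shadow_delay=0.25)
```

In this run the results are still correct, and the one late arrival costs a single extra cycle.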

  18. Outline • Motivation • Contributions • Prior Work • ZUMA FPGA Overlay • CARBON-Razor Overlay • Summary

  19. ZUMA Overlay • Island style FPGA architecture, implemented on an FPGA • Initially implemented in generic Verilog • High area overhead, 125+ host LUTs for each ZUMA LUT (eLUT) • Area efficiency improvements: • Implementation of routing and logic with FPGA LUTRAMs • Design of efficient 2-stage local interconnect

  20. ZUMA Layout • Figure: one tile of the ZUMA architecture

  21. Details – LUTRAM • Figure: reprogrammable LUTRAM in Xilinx and Altera devices

  22. Details – LUTRAM Multiplexer • A LUTRAM can implement a larger MUX than a normal LUT, and needs no extra configuration memory • Figure labels: 6-LUT configured as a 4-to-1 MUX; 6-LUT in RAM mode configured as a 6-to-1 MUX
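One way to see why a LUTRAM needs no extra configuration memory is to compute the LUT contents that make a 6-input LUT pass one chosen input straight through: the selection is stored in the 64 truth-table bits themselves. The sketch below is illustrative only; `mux_init` and `lut_eval` are hypothetical helper names, and the bit ordering (input i drives address bit i) is an assumption rather than a vendor convention.

```python
def mux_init(select: int, k: int = 6) -> int:
    """Truth table (as an integer) that makes a k-LUT output its input `select`."""
    if not 0 <= select < k:
        raise ValueError("select must name one of the k inputs")
    init = 0
    for address in range(1 << k):
        # The entry at each address is simply the value of the selected address bit.
        if (address >> select) & 1:
            init |= 1 << address
    return init


def lut_eval(init: int, inputs) -> int:
    """Look up the LUT output for a list of input bits (inputs[i] is address bit i)."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return (init >> address) & 1


# Reprogramming the LUT contents re-routes the MUX: here input 3 is passed through.
init = mux_init(3)
out = lut_eval(init, [0, 1, 0, 1, 0, 0])
```

A normal 6-LUT used as a MUX spends two of its inputs on select lines driven from separate configuration bits, which is why it tops out at 4-to-1.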

  23. Details – Local Routing Crossbar • Figure: two-stage (I+N) x (k*N) crossbar used in the ZUMA logic cluster

  24. Results • Both Xilinx and Altera versions implemented • Our generic version is 125-150 LUTs per eLUT • Area overhead as low as 40 Host LUTs per eLUT with improvements • Compared to previous work (vFPGA) on 4-LUT host, overhead reduced 3x with same parameters

  25. Outline • Motivation • Contributions • Prior Work • ZUMA FPGA Overlay • CARBON-Razor Overlay • Summary

  26. CARBON Overlay • FPGA implementation of MALIBU CG • Modifications to support FPGA block RAMs • Critical Path is Memory to ALU to Memory

  27. CARBON-Razor • Razor is applied to the CARBON overlay • Error tolerance on memory to memory critical path • How to do it: • Shadow registers → apply to CARBON memories • CARBON schedule → 1-3 extra timeslots for error recovery • Stall propagation → extend from 1D pipeline (Razor) to 2D array (CARBON)

  28. CARBON-Razor Memory • Shadow register paired with RAM • Stratix memory mode allows read-back of previously written data

  29. 2D Error Propagation • Can’t propagate errors to entire chip fast enough • We can propagate it one tile per cycle • Error propagation logic can then combine multiple errors into one stall region

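  30. 2D Error Propagation Example • Error at tile at cycle 0 • Each cycle, stall propagates to nearest neighbors • Figure: tile grid, each tile labeled with the cycle in which it stalls (cycle 0 shown)

  31. 2D Error Propagation Example • Error at tile at cycle 0 • Each cycle, stall propagates to nearest neighbors • Figure: tile grid, each tile labeled with the cycle in which it stalls (cycles 0 to 1 shown)

  32. 2D Error Propagation Example • Error at tile at cycle 0 • Each cycle, stall propagates to nearest neighbors • Figure: tile grid, each tile labeled with the cycle in which it stalls (cycles 0 to 2 shown)

  33. 2D Error Propagation Example • Error at tile at cycle 0 • Each cycle, stall propagates to nearest neighbors • Figure: tile grid, each tile labeled with the cycle in which it stalls (cycles 0 to 3 shown)

  34. 2D Error Propagation Example • Error at tile at cycle 0 • Each cycle, stall propagates to nearest neighbors • Figure: tile grid, each tile labeled with the cycle in which it stalls (cycles 0 to 4 shown)

  35. 2D Error Propagation Example • Error at tile at cycle 0 • Each cycle, stall propagates to nearest neighbors • Figure: tile grid, each tile labeled with the cycle in which it stalls (cycles 0 to 4 shown)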

  36. Stall Propagation Logic • When an error is detected at a CG: • Instruction schedule stalls • Memories in CG load from shadow register • Any writes from neighbor captured in shadow register • Next cycle: • Schedule resumes • Neighbor’s write performed from shadow register • 4 neighbors stall, unless they stalled last cycle • Stall region continues in expanding diamond shaped wave
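The expanding diamond can be reproduced with a short simulation. This is a sketch under stated assumptions: a single error on a rectangular grid of tiles, and a reading of the rule "4 neighbors stall, unless they stalled last cycle" in which each tile only needs to know which tiles stalled in the cycle before the current wavefront. The function name `simulate_stall_wave` and the 5x5 grid are hypothetical.

```python
def simulate_stall_wave(rows, cols, origin, cycles):
    """Return {tile: cycle in which it stalls} for a single error at `origin`."""
    stall_cycle = {origin: 0}
    previous, current = set(), {origin}
    for t in range(1, cycles + 1):
        nxt = set()
        for (r, c) in current:
            # Each stalled tile passes the stall to its 4 nearest neighbors...
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                tile = (r + dr, c + dc)
                inside = 0 <= tile[0] < rows and 0 <= tile[1] < cols
                # ...unless that neighbor belonged to the wave one step earlier.
                if inside and tile not in previous:
                    nxt.add(tile)
        for tile in nxt:
            stall_cycle[tile] = t
        previous, current = current, nxt
    return stall_cycle


def render(stall_cycle, rows, cols):
    """Text grid of stall cycles, with '.' for tiles the wave has not reached."""
    return "\n".join(
        " ".join(str(stall_cycle.get((r, c), ".")) for c in range(cols))
        for r in range(rows)
    )


# Hypothetical 5x5 array with the error in the centre tile.
wave = simulate_stall_wave(5, 5, (2, 2), 4)
grid = render(wave, 5, 5)
```

Each tile stalls exactly once, at a cycle equal to its Manhattan distance from the erroring tile, so one cycle of history per tile is enough to keep the wave from bouncing back.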

  37. CARBON Schedule Extension • We add 1-3 cycles of slack to schedule • Allows margin of safety • Speedup determined by the change in FMAX and in schedule length (SL) • If no hard deadline is needed (e.g. when used as compute accelerator), average extension of schedule can be used to find speedup • $\text{Speedup} = \dfrac{F_{\text{MAX-Razor}} \times SL_{\text{Base}}}{F_{\text{MAX-Base}} \times SL_{\text{Razor}}}$
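A worked instance of the speedup ratio may help. The numbers below (200 MHz and 250 MHz clocks, a 10-slot schedule, one extra slot or an average of 0.4 extra slots) are invented for illustration and are not results from the presentation; `razor_speedup` is a hypothetical helper name.

```python
def razor_speedup(fmax_base, fmax_razor, sl_base, sl_razor):
    """Speedup = (FMAX-Razor * SLBase) / (FMAX-Base * SLRazor)."""
    return (fmax_razor * sl_base) / (fmax_base * sl_razor)


# Hard deadline: the schedule is always extended by the full slack (here 1 slot).
hard = razor_speedup(fmax_base=200.0, fmax_razor=250.0, sl_base=10, sl_razor=10 + 1)

# No hard deadline: use the average extension actually consumed by stalls
# (assumed here to be 0.4 slots per pass through the schedule).
soft = razor_speedup(fmax_base=200.0, fmax_razor=250.0, sl_base=10, sl_razor=10 + 0.4)
```

With these assumed inputs the overclocking gain of 1.25x is reduced by the longer schedule, and reduced less when only the average extension is charged.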

  38. Results • Performance compared between CARBON and CARBON-Razor for 4 benchmarks • Maximum performance found by pushing clock speed and shadow register delay • Average speedup increases to 14% with no hard deadline

  39. Contributions • Area efficient implementation of FPGA routing and logic with LUTRAMs • Area efficient 2-stage local routing network and configuration controller • Extension of Razor error tolerance from pipelined processors to 2D processing arrays • Design of an overclockable coarse grain FPGA overlay with in-circuit error correction

  40. Summary • Fine Grain Overlay – ZUMA • FPGA-like architecture, compatible with VTR CAD tools • High area overhead implementing fine grain structures • Overcome by utilizing FPGA resources, implementing alternate structures • Area reduced to 40 host LUTs per eLUT, 3x improvement • Coarse Grain Overlay – CARBON • Fast compile, efficient mapping of word oriented circuits • Time-multiplexing decreases overall performance • Performance gained using overclocking with error tolerance • Speedup of 13% on average compared to baseline design

  41. Thank you

  42. ZUMA Config Controller

  43. LUTRAM Crossbar

  44. CARBON Razor Timing • Shadow register latches correct data if delay is sufficient

  45. CARBON-Razor Stall Logic

  46. CARBON-Razor Test
