
Warp Processors

Warp Processors. Roman Lysecky, Greg Stitt, Frank Vahid, Warp Processors, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.11 n.3, p.659-681, July 2006. Motivation. Wish to overcome barriers for FPGA acceleration: Integrating tools to SW flows





Presentation Transcript


  1. Warp Processors. Roman Lysecky, Greg Stitt, Frank Vahid, Warp Processors, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.11 n.3, p.659-681, July 2006

  2. Motivation
     • Wish to overcome barriers to FPGA acceleration:
       • Integrating tools into SW flows
       • Non-conformance to the standard binary concept
     • Aim to make FPGAs invisible to the SW developer:
       • Dynamically determine critical regions
       • Re-implement them as custom HW
       • Communicate between HW and SW

  3. System Overview
     • Step 1: Initially execute the application in SW only
     • Step 2: Profile the application to determine critical regions
     • Step 3: Partition critical regions to HW
     • Step 4: Program the configurable logic and update the SW binary
     • Step 5: The partitioned application's speed "warps" as the accelerator takes over the critical region

  4. Profiling
     • Typical profilers instrument code
       • Change behaviour
       • Require extra tools
     • Warp profiler monitors instruction addresses seen on the instruction memory bus
       • Maintains a cache of 16 8-bit entries to store backward-branch frequencies
       • Maintains relative frequencies
       • Accurately selects kernels within 10 saturations
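The profiler described above can be modelled in a few lines of Python. This is only an illustrative sketch: the slide gives the cache size (16 entries, 8-bit counters) and says relative frequencies are maintained, but the eviction rule and the "halve every counter on saturation" rule below are assumptions, and the class and method names are hypothetical.

```python
class BranchFrequencyCache:
    """Toy model of a 16-entry cache of 8-bit backward-branch counters."""

    def __init__(self, entries=16, counter_bits=8):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1  # 255 for 8-bit counters
        self.counts = {}      # branch target address -> counter value
        self.saturations = 0  # number of times any counter saturated

    def observe(self, branch_addr, target_addr):
        """Record one taken branch seen on the instruction bus."""
        # Only backward branches (loop back-edges) are counted.
        if target_addr >= branch_addr:
            return
        if target_addr not in self.counts:
            if len(self.counts) >= self.entries:
                # ASSUMPTION: evict the least frequently seen entry.
                coldest = min(self.counts, key=self.counts.get)
                del self.counts[coldest]
            self.counts[target_addr] = 0
        self.counts[target_addr] += 1
        if self.counts[target_addr] >= self.max_count:
            # ASSUMPTION: on saturation, halve all counters so that
            # relative frequencies are preserved within 8 bits.
            for addr in self.counts:
                self.counts[addr] >>= 1
            self.saturations += 1

    def hottest(self, n=1):
        """Return the n loop targets with the highest counters."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]
```

Feeding the model 1000 back-edges to one loop and 100 to another leaves the first loop ranked highest, even though its counter has saturated and been halved several times along the way.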

  5. On-Chip CAD
     • On-chip CAD module implemented on a separate ARM7 processor
     • In multi-processor environments, only one CAD module is necessary
     • Stages:
       • Decompilation
       • Partitioning
       • Behavioral & RT Synthesis
       • JIT FPGA Compilation

  6. Decompilation
     • Used to recover high-level constructs
       • e.g. loops, if statements, arrays
     • Decompiles the critical region into a CDFG
       • Intermediate code creation
       • High-level construct recovery
       • Map into statements/expressions
     • Uses techniques to undo compiler optimizations
       • Loop re-rolling
       • Strength promotion
       • Compare-with-zero optimization

  7. Partitioning
     • Determines which software kernels are most suitable for implementation in HW
     • Uses a heuristic (assumed here to be the 0-1 knapsack heuristic) to choose kernels that maximize speedup while reducing energy
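One common greedy heuristic for the 0-1 knapsack problem can be sketched as follows. The slide itself only assumes a knapsack heuristic is used, so this is not the authors' algorithm; the `Kernel` fields, the benefit metric (cycles saved) and the area budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    cycles_saved: int  # hypothetical estimate of the benefit of moving to HW
    area: int          # hypothetical estimate of fabric area needed (must be > 0)


def select_kernels(kernels, area_budget):
    """Greedy 0-1 knapsack: take kernels by benefit per unit area while they fit."""
    chosen = []
    remaining = area_budget
    ranked = sorted(kernels, key=lambda k: k.cycles_saved / k.area, reverse=True)
    for k in ranked:
        if k.area <= remaining:
            chosen.append(k.name)
            remaining -= k.area
    return chosen


kernels = [Kernel("fir", 900, 30), Kernel("crc", 500, 10), Kernel("idct", 400, 40)]
print(select_kernels(kernels, area_budget=50))  # ['crc', 'fir']
```

A greedy pass like this is not guaranteed to be optimal, but it runs in O(n log n) time, which suits a CAD flow that has to execute on a small on-chip processor.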

  8. Behavioral & RT Synthesis
     • Converts the CDFG for each critical kernel to a HW circuit description
     • Then converts it into netlist format

  9. JIT Compilation: Logic Synthesis
     • Optimizes the hardware circuit
     • Creates an acyclic graph of the Boolean logic network
     • Nodes correspond to any simple 2-input logic gate
     • Uses the Riverside on-chip minimizer (ROCM), a simple two-level logic minimizer
     • Traverses the graph in breadth-first manner, applying logic minimization at each node
     • 15x faster & 3x less memory than Espresso-II
     • 2% increase in circuit size

  10. JIT Compilation: Technology Mapping
      • Maps hardware onto the CLBs and LUTs of the RCLF
      • 3-phase greedy hierarchical graph-clustering algorithm
        • Breadth-first traversal of the input acyclic graph creates 3-input 1-output LUT nodes
        • Breadth-first traversal combines nodes where possible to form the final 3-input 2-output LUTs
        • Traverses the graph a final time and packs LUTs into CLBs
      • 25x faster than commercial algorithms
      • Only minimally impacts circuit delay

  11. JIT Compilation: Placement
      • Places the network of CLBs onto the configurable logic
      • Greedy dependency-based positional algorithm
        • Places critical-path nodes on a single horizontal row of the RCLF
        • Analyzes dependencies between placed and unplaced nodes
        • Based on dependencies, places a node above (input to a placed node) or below (uses output from a placed node)
        • Attempts to utilize routing resources between adjacent CLBs
      • Superimposes and aligns the relative placement onto the RCLF

  12. JIT Compilation: Routing
      • Rips up illegal routes and adjusts the routing costs of the entire routing resource graph
      • Uses the general approach of VPR's routability-driven router
      • Allows both overuse of routing resources and illegal routes
      • Constructs a routing conflict graph
        • Two routes conflict when both pass through a switch matrix and assigning them the same channel would result in illegal routing
      • Uses a vertex coloring algorithm to assign routing channels
      • If any route cannot be assigned a legal channel, rips up, re-adjusts costs, and reroutes
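The channel-assignment step can be illustrated with a greedy vertex coloring of the conflict graph. This is a minimal sketch, not the router from the paper: the slide does not say which coloring algorithm is used, so the "most conflicts first" ordering is an assumption, and the function and argument names are hypothetical.

```python
def assign_channels(routes, conflicts, num_channels):
    """Greedily color the routing conflict graph.

    routes: list of route names.
    conflicts: iterable of (route_a, route_b) pairs that may not share a channel.
    Returns (assignment, unassigned), where unassigned lists the routes that
    would have to be ripped up and rerouted.
    """
    neighbours = {r: set() for r in routes}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)

    assignment = {}
    unassigned = []
    # ASSUMPTION: handle the most constrained routes first.
    for r in sorted(routes, key=lambda r: len(neighbours[r]), reverse=True):
        used = {assignment[n] for n in neighbours[r] if n in assignment}
        free = [c for c in range(num_channels) if c not in used]
        if free:
            assignment[r] = free[0]
        else:
            unassigned.append(r)  # no legal channel: candidate for rip-up
    return assignment, unassigned
```

With three mutually conflicting routes and only two channels, one route comes back in `unassigned`, which corresponds to the rip-up and reroute case in the last bullet.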

  13. JIT Compilation: Binary Updater
      • Used to allow SW to communicate with the accelerated HW kernel
      • Replaces the original SW instructions for the loop with a jump to HW initialization code
      • Enables the HW through a memory-mapped register
      • Shuts down the microprocessor into a power-down sleep mode
      • HW asserts a completion signal that raises a SW interrupt to wake up the microprocessor
      • Jumps back to the end of the SW loop
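The patching sequence above can be mimicked on a toy instruction list. Everything concrete here is an assumption made for illustration: the mnemonics (`JUMP`, `STORE`, `SLEEP`, `NOP`), the placement of the stub at the end of the program, the register address, and overwriting the rest of the loop body with `NOP` are not taken from the source.

```python
def patch_loop(program, loop_start, loop_end, hw_enable_addr):
    """Return a patched copy of `program` (a list of instruction strings).

    The loop occupying indices [loop_start, loop_end) is replaced by a jump
    to a HW-invocation stub appended after the original code.
    """
    patched = list(program)
    stub_addr = len(patched)

    # Redirect the loop entry to the stub; the old loop body becomes dead code.
    patched[loop_start] = f"JUMP {stub_addr}"
    for i in range(loop_start + 1, loop_end):
        patched[i] = "NOP"  # ASSUMPTION: remaining loop instructions are blanked

    patched += [
        f"STORE 1, [{hex(hw_enable_addr)}]",  # enable HW via memory-mapped register
        "SLEEP",                              # sleep until the completion interrupt
        f"JUMP {loop_end}",                   # resume just after the original loop
    ]
    return patched


original = ["MOV", "LOOP_A", "LOOP_B", "BNE 1", "RET"]
for line in patch_loop(original, loop_start=1, loop_end=4, hw_enable_addr=0x8000):
    print(line)
```

Because the stub ends with a jump to the instruction after the loop, the rest of the software runs unchanged, which is what keeps the accelerator invisible to the developer.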

  14. W-FPGAs
      [Figure: W-FPGA block diagram showing the DADG & LCH, registers Reg0-Reg2, a 32-bit MAC, and the routing-oriented configurable logic fabric]
      • Data Address Generator (DADG)
      • Loop Control Hardware (LCH)
      • Multiplier-Accumulator (MAC)
      • All memory accesses are handled through the DADG
      • LCH provides zero loop overhead

  15. W-FPGAs: Routing-Oriented Configurable Logic Fabric
      [Figure: configurable logic fabric as a grid of CLBs and switch matrices (SMs), connected to the DADG, LCH and 32-bit MAC]
      • RCLF consists of an array of CLBs surrounded by switch matrices for routing between CLBs
      • Routing between CLBs is handled by the switch matrices
      • An SM can route signals to one of its 4 neighbouring SMs or to an SM two rows/columns apart

  16. W-FPGAs: Configurable Logic Blocks
      [Figure: CLB with inputs a-f, two LUTs, outputs o1-o4, and direct connections to adjacent CLBs, shown within the grid of CLBs and SMs]
      • Incorporates two 3-input 2-output LUTs
      • Equivalent to four 3-input 1-output LUTs with fixed internal routing
      • Reduces mapping complexity to increase speed
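To make the "3-input 2-output LUT" idea concrete, the sketch below models one such LUT as two 8-entry truth tables that share the same three inputs, and configures it as a full adder. The class name and the bit ordering of the tables are assumptions for illustration, not the W-FPGA configuration format.

```python
class Lut3x2:
    """A 3-input, 2-output LUT: two 8-bit truth tables sharing inputs a, b, c."""

    def __init__(self, table_o1, table_o2):
        # Bit i of each table is the output for input index i = (a<<2)|(b<<1)|c.
        self.tables = (table_o1 & 0xFF, table_o2 & 0xFF)

    def evaluate(self, a, b, c):
        index = (a << 2) | (b << 1) | c
        return tuple((table >> index) & 1 for table in self.tables)


# Full adder: output 1 is the sum (a XOR b XOR c), output 2 is the carry.
full_adder = Lut3x2(table_o1=0x96, table_o2=0xE8)
print(full_adder.evaluate(1, 1, 0))  # (0, 1): sum 0, carry 1
```

Each output behaves like an independent 3-input 1-output LUT, so a CLB holding two of these corresponds to the four single-output LUTs mentioned in the slide.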

  17. W-FPGAs: Switch Matrices
      [Figure: switch matrix with short channels 0-3 and long channels 0L-3L on each side, shown within the grid of CLBs and SMs]
      • All nets are routed using only a single pair of channels throughout the CLF
      • Each short channel is associated with a single long channel
      • Designed for fast, lean JIT FPGA routing

  18. W-FPGAs
      • Lean place & route tools on the RCLF can execute 10x faster using 18x less memory than existing tools
      • The fabric results in lower clock frequencies for large circuits
      • Inclusion of the DADG and MAC helps offset the low frequency

  19. Results: Benchmarks

  20. Results: Single Critical Region

  21. Results: Overall Speedup (max 4 critical regions)

  22. Results

  23. Implementation with MicroBlaze
      [Figure: base MicroBlaze system (instruction and data BRAMs on the i_lmb/d_lmb buses via lmb_cntrl, opb bus with opb_ddr and uartlite) extended with a profiler (prof_intf), a W-FPGA and its interface, and a second MicroBlaze (ROCCAD) with its own instruction/data BRAM for dynamic partitioning]
