Warp Processors

Roman Lysecky, Greg Stitt, Frank Vahid, Warp Processors, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.11 n.3, p.659-681, July 2006

Motivation
• Wish to overcome barriers to FPGA acceleration:
  • Integrating FPGA tools into SW flows
  • Non-conformance to the standard binary concept
• Aim to make FPGAs invisible to the SW developer:
  • Dynamically determine critical regions
  • Re-implement them as custom HW
  • Communicate between HW and SW

System Overview
1. Initially execute application in SW only
2. Profile application to determine critical regions
3. Partition critical regions to HW
4. Program configurable logic and update SW binary
5. Partitioned application’s speed “warps” as accelerator takes over critical region

Profiling
• Typical profilers instrument code
  • Change behaviour
  • Require extra tools
• Warp profiler monitors instruction addresses seen on instruction memory bus
  • Maintains cache of 16 8-bit entries to store backward branch frequencies
  • Maintains relative frequencies
  • Accurately selects kernels within 10 saturations
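The profiler bullets above can be illustrated with a small software model. This is a minimal sketch, not the hardware from the paper: the class name `LoopProfiler`, keying entries by branch target address, halving all counters when one saturates, and evicting the least-frequent entry are all assumptions made for illustration. Only the 16-entry, 8-bit sizing comes from the slide.

```python
class LoopProfiler:
    """Toy model of a non-intrusive backward-branch frequency cache."""

    def __init__(self, entries=16, counter_bits=8):
        self.entries = entries                    # 16 entries (from the slide)
        self.max_count = (1 << counter_bits) - 1  # 8-bit counters saturate at 255
        self.counts = {}       # branch target address -> counter (assumed keying)
        self.saturations = 0

    def observe(self, prev_addr, addr):
        """Record one instruction fetch as seen on the instruction bus."""
        if addr >= prev_addr:
            return  # sequential fetch or forward branch: ignored
        if addr not in self.counts:
            if len(self.counts) >= self.entries:
                # Assumed replacement policy: evict the least-frequent entry.
                victim = min(self.counts, key=self.counts.get)
                del self.counts[victim]
            self.counts[addr] = 0
        self.counts[addr] += 1
        if self.counts[addr] >= self.max_count:
            # Assumed saturation handling: halve every counter so that
            # relative frequencies are preserved.
            for key in self.counts:
                self.counts[key] >>= 1
            self.saturations += 1

    def hottest(self, n=1):
        """Return the n most frequent backward-branch targets."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]


def run_trace(profiler, trace):
    """Feed a list of fetched instruction addresses to the profiler."""
    prev = trace[0]
    for addr in trace[1:]:
        profiler.observe(prev, addr)
        prev = addr
```

In this model a function return also looks like a backward branch; a real profiler would need to distinguish the two, which the sketch does not attempt.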

On-Chip CAD
• On-chip CAD module implemented on separate ARM7 processor
• In multi-processor environments, only one CAD module is necessary
• Stages:
  • Decompilation
  • Partitioning
  • Behavioral & RT Synthesis
  • JIT FPGA Compilation

Decompilation
• Used to recover high-level constructs, e.g. loops, if statements, arrays
• Decompiles critical region into a control/data flow graph (CDFG)
  • Intermediate code creation
  • High-level construct recovery
  • Map into statements/expressions
• Uses techniques to undo compiler optimizations
  • Loop re-rolling
  • Strength promotion
  • Compare-with-zero optimization
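As an illustration of strength promotion, the sketch below turns a shift-and-add sequence (the result of a compiler's strength reduction) back into a single multiplication, which a synthesis tool could then map to a multiplier. The slides name the technique but give no algorithm, so the tuple-based expression format and the function names `linear_coeff` and `strength_promote` are hypothetical.

```python
def linear_coeff(expr, var):
    """Return k if expr equals k * var using only shl/add/sub, else None.

    Expressions are a variable name (str) or a tuple (op, left, right).
    """
    if expr == var:
        return 1
    if isinstance(expr, tuple):
        op, left, right = expr
        if op == "shl" and isinstance(right, int):
            k = linear_coeff(left, var)
            return None if k is None else k << right
        if op in ("add", "sub"):
            kl = linear_coeff(left, var)
            kr = linear_coeff(right, var)
            if kl is None or kr is None:
                return None
            return kl + kr if op == "add" else kl - kr
    return None


def strength_promote(expr, var):
    """Rewrite a shift/add tree over var as one multiply when possible."""
    k = linear_coeff(expr, var)
    return expr if k is None else ("mul", var, k)


# (x << 3) + (x << 1) is promoted back to x * 10.
reduced = ("add", ("shl", "x", 3), ("shl", "x", 1))
promoted = strength_promote(reduced, "x")
```

Expressions that mix in other variables are returned unchanged, since the sketch only recognises multiples of a single variable.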

Partitioning
• Determines which software kernels are most suitable for implementation in HW
• Uses a heuristic [assumed to be the 0-1 knapsack heuristic] to choose kernels that maximize speedup while reducing energy
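A greedy approximation to 0-1 knapsack gives a feel for how such a partitioner might pick kernels. This is a sketch under assumptions: the slide itself only assumes a knapsack heuristic, and the field names `saved_cycles` and `area`, the ranking by benefit per unit area, and the function name `partition` are all hypothetical.

```python
def partition(kernels, area_budget):
    """Greedily pick kernels for HW by estimated benefit per unit area.

    kernels: list of dicts with 'name', 'saved_cycles', and 'area' (> 0).
    Returns the names of the chosen kernels in selection order.
    """
    ranked = sorted(
        kernels,
        key=lambda k: k["saved_cycles"] / k["area"],
        reverse=True,
    )
    chosen, used = [], 0
    for kernel in ranked:
        # Take the kernel only if it still fits in the remaining fabric area.
        if used + kernel["area"] <= area_budget:
            chosen.append(kernel["name"])
            used += kernel["area"]
    return chosen


candidates = [
    {"name": "a", "saved_cycles": 900, "area": 300},
    {"name": "b", "saved_cycles": 500, "area": 100},
    {"name": "c", "saved_cycles": 200, "area": 250},
]
selected = partition(candidates, area_budget=400)
```

A greedy ranking is not guaranteed to be optimal, but it needs very little time and memory, which suits a CAD flow that runs on-chip.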

Behavioral & RT Synthesis
• Converts CDFG for each critical kernel to HW circuit description
• Then converts into netlist format

JIT Compilation: Logic Synthesis
• Optimizes hardware circuit
• Creates acyclic graph of Boolean logic network
  • Nodes correspond to any simple 2-input logic gate
• Uses Riverside on-chip minimizer (ROCM), a simple two-level logic minimizer
  • Traverses graph in breadth-first manner, applying logic minimization at each node
  • 15x faster & 3x less memory than Espresso-II
  • 2% increase in circuit size

JIT Compilation: Technology Mapping
• Maps hardware onto CLBs and LUTs of RCLF
• 3-phase greedy hierarchical graph-clustering algorithm
  • Phase 1: breadth-first traversal of input acyclic graph creates 3-input 1-output LUT nodes
  • Phase 2: breadth-first traversal combines nodes where possible to form final 3-input 2-output LUTs
  • Phase 3: traverses graph a final time, packs LUTs into CLBs
• 25X faster than commercial algorithms
• Only minimally impacts circuit delay
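The first phase can be sketched as a greedy pass that absorbs single-fanout 2-input gates into their consumer as long as the merged node needs at most three inputs. This is only an illustration of the clustering idea, not the actual on-chip mapper: the netlist format, the merge rule, and the name `map_to_lut3` are assumptions.

```python
def map_to_lut3(gates, outputs):
    """Greedily cluster 2-input gates into 3-input 1-output LUT nodes.

    gates: dict gate_name -> (input_a, input_b), in topological order
           (every gate appears after the gates that drive it).
    outputs: names of signals that leave the circuit.
    Returns dict lut_root -> set of signals feeding that LUT.
    """
    # Count how many consumers each signal has.
    fanout = {}
    for in_a, in_b in gates.values():
        for signal in (in_a, in_b):
            fanout[signal] = fanout.get(signal, 0) + 1
    for signal in outputs:
        fanout[signal] = fanout.get(signal, 0) + 1

    luts = {}
    for gate, inputs in gates.items():
        support = set(inputs)
        for src in inputs:
            # Absorb a driver only if this gate is its sole consumer and
            # the merged node still fits in a 3-input LUT.
            if src in luts and fanout[src] == 1:
                merged = (support - {src}) | luts[src]
                if len(merged) <= 3:
                    support = merged
                    del luts[src]
        luts[gate] = support
    return luts


# A chain of three 2-input gates collapses into two 3-input LUT nodes.
netlist = {"n1": ("a", "b"), "n2": ("n1", "c"), "n3": ("n2", "d")}
mapped = map_to_lut3(netlist, outputs=["n3"])
```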

JIT Compilation: Placement
• Places network of CLBs onto configurable logic
• Greedy dependency-based positional algorithm
  • Places critical path nodes on single horizontal row of RCLF
  • Analyzes dependencies between placed/unplaced nodes
  • Based on dependencies, places each node above (input to placed node) or below (uses output from placed node)
  • Attempts to utilize routing resources between adjacent CLBs
  • Superimposes and aligns relative placement onto RCLF

JIT Compilation: Routing
• Uses general approach of VPR’s routability-driven router
  • Allows both overuse of routing resources and illegal routes
  • Rips up illegal routes, adjusts routing costs of entire routing resource graph
• Constructs routing conflict graph
  • Two routes conflict when both pass through a switch matrix and assigning them the same channel would result in illegal routing
• Uses vertex coloring algorithm to assign routing channels
  • If any routes cannot be assigned a legal channel, rips up, re-adjusts, and reroutes
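Channel assignment by vertex coloring can be sketched with a simple greedy coloring of the conflict graph. The slides do not say which coloring algorithm is used, so the most-conflicts-first ordering, the data layout, and the name `assign_channels` are assumptions.

```python
def assign_channels(conflicts, num_channels):
    """Greedy vertex coloring of a routing conflict graph.

    conflicts: dict route -> set of routes it conflicts with (symmetric).
    Returns (assignment, unrouted), where assignment maps each route to a
    channel index and unrouted lists routes with no legal channel left.
    """
    assignment, unrouted = {}, []
    # Color the most constrained routes first.
    order = sorted(conflicts, key=lambda r: len(conflicts[r]), reverse=True)
    for route in order:
        taken = {assignment[n] for n in conflicts[route] if n in assignment}
        free = [ch for ch in range(num_channels) if ch not in taken]
        if free:
            assignment[route] = free[0]
        else:
            unrouted.append(route)  # candidate for rip-up and reroute
    return assignment, unrouted


graph = {
    "r1": {"r2", "r3", "r4"},
    "r2": {"r1", "r3"},
    "r3": {"r1", "r2"},
    "r4": {"r1"},
}
channels, failed = assign_channels(graph, num_channels=3)
```

Routes returned in `unrouted` correspond to the case in the last bullet: they would be ripped up and rerouted before coloring is attempted again.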

JIT Compilation: Binary Updater
• Used to allow SW to communicate with accelerated HW kernel
• Replaces original SW instructions for loop with a jump to HW init. code
  • Enables HW with memory-mapped register
  • Puts microprocessor into power-down sleep mode
  • HW asserts completion signal to cause SW interrupt to wake up microprocessor
  • Jumps back to end of SW loop

W-FPGAs
[Figure: W-FPGA architecture showing DADG & LCH, registers Reg0-Reg2, 32-bit MAC, and the routing-oriented configurable logic fabric]
• Data Address Generator (DADG)
• Loop Control Hardware (LCH)
• Multiplier-Accumulator (MAC)
• All memory accesses handled through DADG
• LCH for zero loop overhead

W-FPGAs: Routing-oriented Configurable Logic Fabric
• RCLF consists of array of CLBs surrounded by switch matrices (SMs), which handle routing between CLBs
• Each SM can route signals to one of its 4 neighbouring SMs or to SMs two rows/columns apart
[Figure: configurable logic fabric as a grid of CLBs and SMs, alongside the DADG, LCH, and 32-bit MAC]

W-FPGAs: Configurable Logic Blocks
• Each CLB incorporates two 3-input 2-output LUTs
  • Equivalent to four 3-input 1-output LUTs with fixed internal routing
• Reduces mapping complexity to increase speed
[Figure: CLB with inputs a-f, two LUTs, outputs o1-o4, and connections to adjacent CLBs and SMs]
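To see why a 3-input 2-output LUT behaves like two 3-input 1-output LUTs sharing the same inputs, it helps to model one as a pair of 8-entry truth tables. This is a toy model, not the fabric's actual configuration format; the class name `Lut3x2` and the full-adder example are illustrative choices, not taken from the slides.

```python
class Lut3x2:
    """A 3-input LUT with two outputs, stored as two 8-entry truth tables."""

    def __init__(self, f0, f1):
        # Enumerate all 8 input combinations once, as configuration would.
        self.tables = [
            [fn((i >> 2) & 1, (i >> 1) & 1, i & 1) & 1 for i in range(8)]
            for fn in (f0, f1)
        ]

    def eval(self, a, b, c):
        """Look up both outputs for the input bits a, b, c."""
        index = (a << 2) | (b << 1) | c
        return self.tables[0][index], self.tables[1][index]


# One LUT implements a full adder: output 0 is the sum, output 1 the carry.
full_adder = Lut3x2(
    lambda a, b, c: a ^ b ^ c,
    lambda a, b, c: (a & b) | (a & c) | (b & c),
)
sum_bit, carry_bit = full_adder.eval(1, 1, 0)
```

Two such LUTs per CLB give the four 1-output functions mentioned above, with the input sharing fixed in advance rather than left to the mapper.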

W-FPGAs: Switch Matrices
• All nets are routed using only a single pair of channels throughout the CLF
• Each short channel is associated with a single long channel
• Designed for fast, lean JIT FPGA routing
[Figure: switch matrix with short channels 0-3 and long channels 0L-3L on each side]

W-FPGAs
• Lean place & route tools on RCLF can execute 10X faster using 18X less memory than existing tools
• Results in lower clock frequencies for large circuits
• Inclusion of DADG and MAC helps offset the lower clock frequency

Results: Benchmarks

Results: Single Critical Region

Results: Overall Speedup (max 4 critical regions)

Results

Implementation with MicroBlaze
[Figure: block diagram of the base MicroBlaze system with the warp additions. Labels: MicroBlaze, Instr. (BRAM), Data (BRAM), i_lmb, d_lmb, lmb_cntrl, opb, opb_ddr, uartlite, profiler, prof_intf, W-FPGA, W-FPGA Interface, MicroBlaze (ROCCAD) with Instr/Data (BRAM), Dynamic Partitioning]
