
Warp Processor: A Dynamically Reconfigurable Coprocessor


Presentation Transcript


  1. Warp Processor: A Dynamically Reconfigurable Coprocessor
Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, Motorola/Freescale
Contributing Ph.D. students: Roman Lysecky (2005, now asst. prof. at U. Arizona), Greg Stitt (Ph.D. 2006), Kris Miller (MS 2007), David Sheldon (3rd yr PhD), Scott Sirowy (1st yr PhD)

  2. Outline
• Intro and Background: Warp Processors
• Work in progress under SRC 3-yr grant
  • Parallelized-computation memory access
  • Deriving high-level constructs from binaries
  • Case studies
  • Using commercial FPGA fabrics
  • Application-specific FPGA
• Other ongoing related work
  • Configurable cache tuning
Frank Vahid, UC Riverside

  3. Intro: Partitioning to FPGA
• Custom ASIC coprocessors are known to speed up software kernels
  • Energy advantages too (e.g., Henkel’98, Rabaey’98, Stitt/Vahid’04)
• Power savings even on FPGA (Stitt/Vahid IEEE D&T’02, IEEE TECS’04)
  • Con: more silicon (~10x), less power savings
  • Pro: platform fully programmable, mass-produced
(Figure: application on processor + ASIC versus processor + FPGA)

  4. Intro: FPGA vs. ASIC Coprocessor – FPGA Surprisingly Competitive
• FPGA: 34% energy savings versus the ASIC’s 48% (Stitt/Vahid IEEE D&T’02, IEEE TECS’04)
• 70% energy savings & 5.4x speedup vs. 200 MHz MIPS (Stitt/Vahid DATE’05)

  5. FPGA – Why (Sometimes) Better than Software: Hardware for Bit Reversal
C code for bit reversal:
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Compiled binary (excerpt):
sll $v1, $v0, 0x10
srl $v0, $v0, 0x10
or  $v0, $v1, $v0
srl $v1, $v0, 0x8
and $v1, $v1, $t5
sll $v0, $v0, 0x8
and $v0, $v0, $t4
or  $v0, $v1, $v0
srl $v1, $v0, 0x4
and $v1, $v1, $t3
sll $v0, $v0, 0x4
and $v0, $v0, $t2
...
• On the processor: requires between 32 and 128 cycles
• On the FPGA: requires only 1 cycle (speedup of 32x to 128x)
• Other big reason hardware wins: concurrency
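The slide’s shift-and-mask sequence can be checked directly in software; a minimal, self-contained sketch (the function name `bit_reverse32` is ours, not from the slides):

```c
#include <stdint.h>

/* 32-bit reversal exactly as on the slide: swap halves, then bytes,
   nibbles, bit-pairs, and finally adjacent bits. */
static inline uint32_t bit_reverse32(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```

In an FPGA the same function is just wiring between input and output bits, which is why it completes in a single cycle.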

  6. Warp Processing – Dynamic Partitioning of SW Kernels to FPGA
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption
(Architecture: µP with I$/D$, profiler, FPGA, and dynamic partitioning module (DPM))

  7. Warp Processors – Dynamic Partitioning
• Advantages vs. compile-time partitioning (where traditional partitioning is done)
  • No special compilers
  • Completely transparent
  • Separates function and architecture for architectures having FPGAs
  • Avoids the complexities of supporting different FPGAs
  • Potentially brings FPGA advantages to ALL software

  8. Warp Processing Steps (On-Chip CAD)
• The DPM (CAD) takes the standard binary through: Decompilation → Partitioning → RT Synthesis → JIT FPGA Compilation (Logic Synthesis → Technology Mapping/Packing → Placement → Routing) → Binary Updater
• Outputs: a hardware bitstream programmed into the WCLA (FPGA) and an updated software binary for the µP

  9. Warp Processing – Partitioning
• Applications spend much time in a small amount of code
  • 90-10 rule
  • Observed 75-4 rule for MediaBench, NetBench
• Potentially large performance/energy benefits from implementing critical regions in hardware
• Use profiling results to identify critical regions

  10. Warp Processing – Decompilation
• Synthesis from a binary has a challenge: high-level information (e.g., loops, arrays) is lost during compilation
• Solution – recover the high-level information: decompilation
Original C code:
long f( short a[10] ) {
  long accum = 0;
  for (int i = 0; i < 10; i++) { accum += a[i]; }
  return accum;
}
Corresponding assembly:
  Mov reg3, 0
  Mov reg4, 0
loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld  reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
Decompilation stages – control/data flow graph creation, data flow analysis, control structure recovery, function recovery, array recovery – successively rebuild:
long f( long reg2 ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += mem[reg2 + (reg3 << 1)]; }
  return reg4;
}
and finally:
long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; }
  return reg4;
}
• Almost identical representations

  11. Warp Processing – Decompilation
• Earlier study (FPGA 2005): synthesis after decompilation is often quite similar to synthesis from the original source
  • Almost identical performance, small area overhead

  12. Warp Processing – RT Synthesis
• Maps decompiled DFG operations to hardware library components
  • Adders, comparators, multiplexors, shifters
• Creates a Boolean expression for each output bit in the dataflow graph, e.g., for a 32-bit adder:
  r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
  r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = ...
(Figure: small DFG mapped to a 32-bit adder and a 32-bit comparator)
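The per-bit expressions above generalize to the standard ripple-carry form; a small software sketch that evaluates them (function name is ours):

```c
#include <stdint.h>

/* Evaluate the slide's per-bit adder equations:
   sum[i]   = (a[i] xor b[i]) xor carry[i-1]
   carry[i] = majority(a[i], b[i], carry[i-1])        */
static uint32_t ripple_add32(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        sum |= ((ai ^ bi ^ carry) & 1u) << i;          /* sum bit i   */
        carry = (ai & bi) | (ai & carry) | (bi & carry); /* next carry */
    }
    return sum;
}
```

With carry initially 0, the first iteration reduces to carry[0] = a[0] and b[0], matching the slide.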

  13. Warp Processing – JIT FPGA Compilation
• Existing FPGAs require complex CAD tools
  • FPGAs designed to handle large arbitrary circuits, ASIC prototyping, etc.
  • Require long execution times and large memory usage: roughly 1–30 minutes and 10–60 MB per stage (logic synthesis, technology mapping, placement, routing), versus about 1–10 s and 0.5–3.6 MB for the on-chip tools
  • Not suitable for dynamic on-chip execution
• Solution: develop a custom CAD-oriented FPGA (WCLA – Warp Configurable Logic Architecture)
  • Careful simultaneous design of FPGA and CAD
  • FPGA features evaluated for impact on CAD
  • Add architecture features for SW kernels
  • Enables development of fast, lean JIT FPGA compilation tools

  14. Warp Configurable Logic Architecture (WCLA) (DATE’04)
• Data address generators (DADG) and loop control hardware (LCH)
  • Provide fast loop execution
  • Support memory accesses with a regular access pattern
• Integrated 32-bit multiplier-accumulator (MAC)
  • Frequently found within critical SW kernels
(Architecture: ARM + profiler + I$/D$ + DPM, with a configurable logic fabric fronted by the DADG & LCH, registers Reg0–Reg2, and the 32-bit MAC)

  15. Warp Configurable Logic Architecture (WCLA) (DATE’04)
• CAD-specialized configurable logic fabric
  • Simplified switch matrices
    • Directly connected to the adjacent CLB
    • All nets are routed using only a single pair of channels
    • Allows for efficient routing
  • Simplified CLBs
    • Two 3-input, 2-output LUTs
    • Each CLB connected to the adjacent CLB to simplify routing of carry chains
• Currently being prototyped by Intel (scheduled for 2006 Q3 shuttle)

  16. Warp Processing – Logic Synthesis
• ROCM – Riverside On-Chip Minimizer
  • Two-level minimization tool, working on the on-set and dc-set
  • Combination of approaches from Espresso-II [Brayton et al., 1984; Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979]
  • Single expand phase instead of multiple expand/reduce/irredundant iterations
  • Eliminates the need to compute the off-set – reduces memory usage
  • On average only 2% larger than the optimal solution
On-Chip Logic Minimization, DAC’03
A Codesigned On-Chip Logic Minimizer, CODES+ISSS’03
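To make the expand idea concrete, here is a toy single-pass expand over explicit minterm lists. This is our own illustration, far simpler than ROCM, but it shows the key trick: a cube is validated against the on-set plus dc-set directly, so the off-set is never computed or stored.

```c
#include <stdint.h>

#define NVARS 3

/* Toy two-level cube: input m is covered when the care bits match,
   i.e. (m & ~dontcare) == (value & ~dontcare). */
struct cube { uint8_t value, dontcare; };

static int in_set(uint8_t m, const uint8_t *set, int n) {
    for (int i = 0; i < n; i++)
        if (set[i] == m) return 1;
    return 0;
}

/* A cube is legal if every minterm it covers is in on-set + dc-set,
   so the off-set never needs to be materialized. */
static int cube_ok(struct cube c, const uint8_t *on, int non,
                   const uint8_t *dc, int ndc) {
    for (uint8_t m = 0; m < (1 << NVARS); m++)
        if ((m & ~c.dontcare) == (c.value & ~c.dontcare) &&
            !in_set(m, on, non) && !in_set(m, dc, ndc))
            return 0;
    return 1;
}

/* Single expand pass (cf. ROCM's single expand phase): greedily raise
   each care literal to don't-care if the enlarged cube stays legal. */
static struct cube expand(struct cube c, const uint8_t *on, int non,
                          const uint8_t *dc, int ndc) {
    for (int v = 0; v < NVARS; v++) {
        struct cube t = c;
        t.dontcare |= (uint8_t)(1 << v);
        if (cube_ok(t, on, non, dc, ndc)) c = t;
    }
    return c;
}

/* Example: on-set {000,001,011}, dc-set {010}; expanding the cube for
   minterm 000 raises variables 0 and 1, giving the single cube x2'. */
static uint8_t demo_expand(void) {
    const uint8_t on[] = {0, 1, 3};
    const uint8_t dc[] = {2};
    struct cube c = {0, 0};
    c = expand(c, on, 3, dc, 1);
    return c.dontcare;
}
```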

  17. Warp Processing – Technology Mapping
• ROCTM – Technology Mapping/Packing
  • Decompose hardware circuit into a DAG
    • Nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.)
  • Hierarchical bottom-up graph clustering algorithm
    • Breadth-first traversal combining nodes to form single-output LUTs
  • Combine LUTs with common inputs to form final 2-output LUTs
  • Pack LUTs in which the output from one LUT is an input to a second LUT
Dynamic Hardware/Software Partitioning: A First Approach, DAC’03
A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE’04

  18. Warp Processing – Placement
• ROCPLACE – Placement
  • Dependency-based positional placement algorithm
  • Identify the critical path, placing critical nodes in the center of the CLF
  • Use dependencies between the remaining CLBs to determine placement
  • Attempt to use adjacent-CLB routing whenever possible
Dynamic Hardware/Software Partitioning: A First Approach, DAC’03
A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE’04

  19. Warp Processing – Routing
• ROCR – Riverside On-Chip Router
  • Requires much less memory than VPR, as the resource graph is smaller
  • 10x faster execution time than VPR (timing-driven)
  • Produces circuits with a critical path 10% shorter than VPR (routability-driven)
Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC’04

  20. Experiments with Warp Processing
• Warp processor
  • ARM/MIPS plus our fabric (WCLA), profiler, and DPM
  • Riverside on-chip CAD tools map the critical region to the configurable fabric
  • Requires less than 2 seconds on a lean embedded processor to perform synthesis and JIT FPGA compilation
• Traditional HW/SW partitioning
  • ARM/MIPS plus Xilinx Virtex-E FPGA
  • Manually partitioned software using VHDL
  • VHDL synthesized using Xilinx ISE 4.1

  21. Warp Processors – Performance Speedup (Most Frequent Kernel Only)
• Average kernel speedup of 41x, vs. 21x for Virtex-E (baseline: SW-only execution)
• WCLA simplicity results in faster HW circuits

  22. Warp Processors – Performance Speedup (Overall, Multiple Kernels)
• Average speedup of 7.4x (baseline: SW-only execution)
• Energy reduction of 38% – 94%
• Assuming a 100 MHz ARM, and the fabric clocked at the rate determined by synthesis

  23. Warp Processors – Results: Execution Time and Memory Requirements
• Xilinx ISE: 9.1 s, 60 MB
• DPM (CAD): 0.2 s, 3.6 MB
• DPM (CAD) on a 75 MHz ARM7: 1.4 s, 3.6 MB

  24. Outline
• Intro and Background: Warp Processors
• Work in progress under SRC 3-yr grant
  • Parallelized-computation memory access
  • Deriving high-level constructs from binaries
  • Case studies
  • Using commercial FPGA fabrics
  • Application-specific FPGA
• Other ongoing related work
  • Configurable cache tuning

  25. 1. Parallelized-Computation Memory Access
• Problem: kernel computation can be parallelized, but may hit a memory bottleneck
  • Parallelism can’t be exploited if data isn’t available
• Solution: use more advanced memory/compilation methods
(Figure: three parallel adders computing B[i] = A[i] + c[i], B[i+1] = A[i+1] + c[i+1], B[i+2] = ...)

  26. 1. Parallelized-Computation Memory Access
• Method 1: distribute data among FPGA block RAMs
  • Block RAMs are concurrently accessible, so memory accesses are parallelized
(Figure: data from main memory split across two block RAMs feeding parallel adders)
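A software model of the banking idea (even/odd interleaving is our assumed layout; a real design would choose the split to match the kernel's access pattern):

```c
#define BANK_N 8

/* Even indices go to bank 0, odd to bank 1, so A[i] and A[i+1]
   live in different block RAMs and can be read in the same cycle. */
struct banks { int bank0[BANK_N / 2], bank1[BANK_N / 2]; };

static void interleave(const int *a, struct banks *b) {
    for (int i = 0; i < BANK_N; i++) {
        if (i % 2 == 0) b->bank0[i / 2] = a[i];
        else            b->bank1[i / 2] = a[i];
    }
}

/* One "cycle": both banks are read concurrently (i must be even). */
static int pair_sum(const struct banks *b, int i) {
    return b->bank0[i / 2] + b->bank1[i / 2];   /* A[i] + A[i+1] */
}

/* Demo: interleave {1..8} and fetch a pair in a single "cycle". */
static int demo_pair_sum(int i) {
    const int a[BANK_N] = {1, 2, 3, 4, 5, 6, 7, 8};
    struct banks b;
    interleave(a, &b);
    return pair_sum(&b, i);
}
```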

  27. 1. Parallelized-Computation Memory Access
• Method 2: smart buffers (Najjar 2004)
  • Memory structure optimized for the application’s access patterns
  • Takes advantage of data reuse
  • Speedups of 2x to 10x compared to hardware without smart buffers
(Figure: RAM holding A[0], A[1], ..., A[8] streamed through a smart buffer into the datapath; each iteration window kills consumed elements and fetches new ones)
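The windowing behavior can be sketched in software (window size 3, the struct, and all names are our choices, not Najjar's implementation):

```c
#define WIN 3

/* Smart-buffer sketch: keep a WIN-element window over a RAM stream,
   reusing WIN-1 elements per iteration and fetching only one new word. */
struct sbuf { int win[WIN]; int next; };

static void sbuf_fill(struct sbuf *s, const int *ram) {
    for (int i = 0; i < WIN; i++) s->win[i] = ram[i];
    s->next = WIN;
}

/* Advance one iteration: "kill" the oldest element, fetch one new one. */
static void sbuf_step(struct sbuf *s, const int *ram) {
    for (int i = 0; i < WIN - 1; i++) s->win[i] = s->win[i + 1];
    s->win[WIN - 1] = ram[s->next++];
}

/* Demo: sum of the window after `steps` iterations over {10..15}. */
static int demo_window_sum(int steps) {
    const int ram[] = {10, 11, 12, 13, 14, 15};
    struct sbuf s;
    sbuf_fill(&s, ram);
    for (int k = 0; k < steps; k++) sbuf_step(&s, ram);
    return s.win[0] + s.win[1] + s.win[2];
}
```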

  28. 2. Deriving High-Level Constructs from Binaries
• Problem: some binary features are unsuitable for synthesis
  • Loops unrolled by an optimizing compiler, or pointers
  • Previous decompilation techniques didn’t consider these: features acceptable for sw-to-sw translation are not acceptable for synthesis
• Solution – new decompilation techniques
  • Convert pointers to arrays
  • Reroll loops
  • Others
Original source: for (int i = 0; i < 3; i++) accum += a[i];
Unrolled loop in the binary:
Ld reg2, 100(0)
Add reg1, reg1, reg2
Ld reg2, 100(1)
Add reg1, reg1, reg2
Ld reg2, 100(2)
Add reg1, reg1, reg2
Rerolled loop (recovered):
for (int i = 0; i < 3; i++)
  reg1 += array[i];
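A toy version of the rerolling check (ours, not the paper's algorithm): verify that the load offsets in the unrolled body form a constant stride, and only then replace the repeated copies with one rolled loop.

```c
/* Return the constant stride of the unrolled loads, or 0 if the
   offsets are irregular and the body cannot be rerolled. */
static int reroll_stride(const int *offs, int n) {
    int d = offs[1] - offs[0];
    for (int i = 2; i < n; i++)
        if (offs[i] - offs[i - 1] != d) return 0;
    return d;
}

/* The rerolled form of "Ld reg2, a[i]; Add reg1, reg1, reg2" x n. */
static int rolled_sum(const int *a, int n) {
    int reg1 = 0;
    for (int i = 0; i < n; i++) reg1 += a[i];
    return reg1;
}

/* Demo: offsets 100(0), 100(1), 100(2) have stride 1, so the three
   unrolled copies collapse into one loop over a[0..2]. */
static int demo_reroll(void) {
    const int offs[3] = {0, 1, 2};
    const int a[3]    = {4, 5, 6};
    if (reroll_stride(offs, 3) != 1) return -1;
    return rolled_sum(a, 3);
}

/* Irregular offsets must be rejected rather than rerolled. */
static int demo_bad_stride(void) {
    const int offs[3] = {0, 1, 4};
    return reroll_stride(offs, 3);
}
```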

  29. 2. Deriving High-Level Constructs from Binaries
• Recent study of decompilation robustness in the presence of compiler optimizations and across instruction sets (DATE’04, ICCAD’05)
  • Energy savings of 77%/76%/87% for MIPS/ARM/MicroBlaze

  30. 3. Case Studies
• Compare warp processing (binary-level) versus compiler-based (C-level) partitioning for real examples
• H.264 study (w/ Freescale)
  • Highly-optimized proprietary C code
  • Results of 2-month study: competitive
  • Also learned that simple C-coding guidelines improve synthesis, whether done from binary or source; presently developing guidelines
• More examples: IBM (server), others...

  31. 4. Using Commercial FPGA Fabrics
• Can warp processing utilize commercial FPGAs?
• Approach 1: “virtual FPGA” – map our warp fabric onto a commercial fabric
  • Collaboration with Xilinx
  • Initial results: 6x performance overhead, 100x area overhead
  • Main problem is routing
• Investigating better methods (one-to-one mapping)

  32. 5. Application-Specific FPGA
• Commercial FPGAs intended for ASIC prototyping
  • Huge range of possible designs
  • Generality causes loss of efficiency
• Propose to investigate application-specific FPGAs
  • Put on an ASIC next to custom circuits and a microprocessor
  • FPGA tuned to a particular circuit, but still general and reprogrammable
    • Supports late changes, modifications to standards, etc.
  • Customize CLB size, # of inputs, routing resources
  • Coarse-grained components – multiply-accumulate, RAM, etc.
  • Use a retargetable CAD tool
• Expected results: smaller, faster FPGAs
(Figure: general CLB-only fabric vs. a DSP-tuned fabric with MACs and RAM blocks)

  33. 5. Application-Specific FPGA
• Initial results: performance improvements up to 300%, area reductions up to 90%

  34. Outline
• Intro and Background: Warp Processors
• Work in progress under SRC 3-yr grant
  • Parallelized-computation memory access
  • Deriving high-level constructs from binaries
  • Case studies
  • Using commercial FPGA fabrics
  • Application-specific FPGA
• Other ongoing related work
  • Configurable cache tuning

  35. Configurable Cache Tuning
• Developed
  • Runtime-configurable cache (ISCA 2003): ways, line size, and total size are configurable
  • Configuration heuristics (DATE 2004, ISLPED 2005) – 60% memory-access energy savings
• Present focus: dynamic tuning

  36. Summary
• Basic warp technology
  • Developed 2002–2005 (NSF, and 1-year CSR grants from SRC)
  • Uses binary synthesis and FPGAs
  • Conclusion: feasible technology, much potential
• Ongoing work (SRC)
  • Improve and validate effectiveness of binary synthesis
  • Examine FPGA implementation issues
• Extensive future work to develop robust warp technology
