Portability for FPGA Applications—Warp Processing and SystemC Bytecode

Frank Vahid
Dept. of CS&E, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine

Contributing Ph.D. Students
• Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona)
• Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville)
• Scotty Sirowy (current)
• David Sheldon (current)
• Chen Huang (current)

This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx.

Portable Applications on PCs
• One binary (x86 binary), multiple platforms: Pentium, Opteron, Atom, Dual Core
• How? Why?

Portable Applications on PCs
• Standard software binary
• Dynamic software binary translation
[Figure: “ecosystem” of applications, tools, and architectures around the x86 binary. The x86 binary runs on an x86 µP; SW binary translation turns it into a VLIW binary for a VLIW µP.]

Meanwhile, Circuits on FPGAs Show Large Speedups
• Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, RAW, …

FPGAs Entering Computing Mainstream
• AMD Opteron
• Intel QuickAssist
• Cray, SGI
• Mitrionics
• IBM Cell (research)
• Xilinx, Altera
[Figures: Xilinx Virtex II Pro (source: Xilinx); SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs)]

Circuits on FPGAs are Software Binaries
• Microprocessor binaries (instructions): bits loaded into program memory
• FPGA “binaries” (circuits), aka "bitstream": bits loaded into LUTs and SMs
• Both are "software", not "hardware" (Sep 2007 IEEE Computer)
[Figure: bit strings (01110100..., 0010...) loaded into processors and into an FPGA]

“Portable Applications” + “FPGAs”
• Standard software binary
• Dynamic translation
• “Warp Processing”: SW binary translation extended to produce an FPGA binary
[Figure: “ecosystem” of applications, tools, and architectures. The x86 binary runs on an x86 µP; SW binary translation produces a VLIW binary for a VLIW µP, or an FPGA binary for an FPGA beside the x86 µP.]

Warp Processing

Platform: µP with I Mem and D$, profiler, FPGA, and on-chip CAD (Dynamic Part. Module, DPM).

Software binary used in every step:

    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

1. Initially, software binary loaded into instruction memory.

2. Microprocessor executes instructions in software binary.

3. Profiler monitors instructions (the repeating add/beq stream) and detects critical regions in binary. Critical loop detected.

4. On-chip CAD reads in critical region.

5. On-chip CAD decompiles critical region into control data flow graph (CDFG). Recover loops, arrays, subroutines, etc.; needed to synthesize good circuits.

    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

6. On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit (a tree of adders in the figure).

7. On-chip CAD maps circuit onto FPGA (CLBs and SMs).

8. On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more. >10x speedups for some apps compared with software-only execution ("Warp speed, Scotty").

    Mov reg3, 0
    Mov reg4, 0
  loop:
    // instructions that interact with FPGA
    Ret reg4
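Step 3 can be pictured with a small model of the profiler. The following is a minimal sketch, assuming the profiler simply counts taken backward branches per branch address and reports the most frequent one; the table size, the type `Profiler`, and the function names are hypothetical, and the actual profiler hardware may work differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16  /* hypothetical table size */

/* One counter per distinct backward-branch address seen so far. */
typedef struct {
    uint32_t branch_addr[PROFILE_SLOTS];
    uint32_t count[PROFILE_SLOTS];
    size_t   used;
} Profiler;

void profiler_init(Profiler *p) { p->used = 0; }

/* Called for each taken branch; offset < 0 means a backward (loop) branch. */
void profiler_observe(Profiler *p, uint32_t addr, int32_t offset)
{
    assert(p != NULL);
    if (offset >= 0) return;                 /* forward branch: not a loop */
    for (size_t i = 0; i < p->used; i++) {
        if (p->branch_addr[i] == addr) { p->count[i]++; return; }
    }
    if (p->used < PROFILE_SLOTS) {           /* new loop; dropped if table is full */
        p->branch_addr[p->used] = addr;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Returns the address of the most frequent backward branch, or 0 if none. */
uint32_t profiler_critical_loop(const Profiler *p)
{
    uint32_t best_addr = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best_addr = p->branch_addr[i];
        }
    }
    return best_addr;
}
```

In the example binary, the `Beq reg3, 10, -5` instruction has a negative offset, so each loop iteration would bump the same counter and that loop would be reported as the critical region.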

Warp Processing Challenges
• Can we decompile binaries sufficiently for synthesis?
• Can we just-in-time (JIT) compile to FPGAs?
[Figure: on-chip CAD flow on a platform with profiler, µP, I$, D$, and FPGA. Stages labeled profiling & partitioning, decompilation (CDFG), JIT FPGA compilation (FPGA binary), and binary updater (microprocessor binary).]

Decompilation
• Recover high-level information from binary: branches, loops, arrays, subroutines, …
• Adapted previous methods for processor-processor translation (UQBT)
• Developed new synthesis-oriented methods (e.g., “reroll” loops, strength “promotion”)

Original C Code

    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }

Corresponding Assembly

    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Control/Data Flow Graph Creation

    reg3 := 0
    reg4 := 0
  loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Data Flow Analysis

    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Function Recovery

    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
    loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }

Control Structure Recovery

    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }

Array Recovery

    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }

The recovered function and the original C code are almost identical representations.
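The claim that the register-level form and the recovered array form compute the same thing can be checked directly. This is a minimal sketch, assuming a flat byte-addressed memory image and 2-byte `short` values (which is what the shift by 1 implies); the names `f_binary` and `f_recovered` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Register-level form: reg2 is a byte offset into a flat memory image. */
long f_binary(const unsigned char *mem, long reg2)
{
    long reg3 = 0, reg4 = 0;
    assert(mem != NULL);
    assert(sizeof(short) == 2);          /* assumption behind (reg3 << 1) */
loop:
    {
        short word;                      /* 16-bit load, like Ld on a short */
        memcpy(&word, mem + reg2 + (reg3 << 1), sizeof word);
        reg4 = reg4 + word;
    }
    reg3 = reg3 + 1;
    if (reg3 < 10) goto loop;
    return reg4;
}

/* Form after control structure recovery and array recovery. */
long f_recovered(const short array[10])
{
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
    }
    return reg4;
}
```

Calling both functions on the same ten values should return the same sum, which is the sense in which the recovered code is "almost identical" to the original C.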

Decompilation Results vs. C
• Synthesis from decompiled binary is competitive with synthesis from C

Decompilation Results on Optimized H.264: In-depth Study with Freescale
• Again, competitive with synthesis from C

Decompilation Effective Even with Compiler Optimizations
• Do compiler optimizations hurt decompilation?
• (Surprisingly) found optimized code synthesizes to even better circuits
[Chart: speedup when decompiled binary is partitioned and synthesized to FPGA; average speedup of 10 examples]

Decompilation
• Summary: Decompilation is surprisingly effective at recovering high-level program structures for synthesis
• Stitt et al., ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07
• Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville)

Warp Processing Challenges
• Can we decompile binaries sufficiently for synthesis?
• Can we just-in-time (JIT) compile to FPGAs?
[Figure: on-chip CAD flow on a platform with profiler, µP, I$, D$, and FPGA. Stages labeled profiling & partitioning, decompilation (CDFG), JIT FPGA compilation (FPGA binary), and binary updater (microprocessor binary).]

Challenge: JIT Compile to FPGA
• Commercial tool (logic synthesis, tech. map., placement, routing): 60 MB, 9.1 s
• Ultra-lean Riverside JIT FPGA tools: 3.6 MB, 0.2 s
• Ultra-lean Riverside JIT FPGA tools on a 75 MHz ARM7: 3.6 MB, 1.4 s
• Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping, e.g.,
  • Logic synthesis: run single expand phase
  • Technology mapping: bottom-up graph clustering heuristic
  • Placement: place critical path first, then adjacent items
  • Routing: use resource graph that matches switch matrix / channel structure
• Penalty: 1.3-2x in performance & size (even more might be acceptable)
[Figure: tool memory footprints drawn to scale; logic synthesis loop labeled expand, reduce, irredundant over the on-set, dc-set, and off-set]

JIT Compile to FPGA
• Summary: Ultra-lean JIT FPGA compiler: 40x speedup, 20x less memory, 1.3x-2x circuit penalty
• Lysecky et al., DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06
• Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)

Warp Processing Results: Performance Speedup (Most Frequent Kernel Only)
• vs. 200 MHz ARM; 1 = ARM-only execution
• Average kernel speedup of 41
• Overall application speedup average is 7.4
[Chart: per-benchmark speedups; platform with profiler, µP, I$, D$, FPGA, on-chip CAD]

Warping Thread-Based Applications
• Multi-core platforms → multi-threaded apps

    for (i = 0; i < 10; i++) {
      thread_create( f, i );
    }

• OS schedules threads onto available µPs
• Remaining threads added to queue
• OS invokes on-chip CAD tools to create accelerators for f()
• Thread warping: use one core to create accelerator for waiting threads
• OS schedules threads onto accelerators (possibly dozens), in addition to µPs
• Performance: very large speedups possible, with parallelism at bit, arithmetic, and now thread level too
[Figure: compiler produces the binary; platform with several µPs, OS, on-chip CAD, Acc. Lib, and an FPGA holding f() accelerators]

Memory Access Synchronization (MAS)
• Must deal with widely known memory bottleneck problem
• FPGAs great, but often can’t get data to them fast enough

    for (i = 0; i < 10; i++) {
      thread_create( thread_function, a, i );
    }

    void f( int a[], int val ) {
      int result = 0;
      for (int i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      . . . .
    }

• Data for dozens of threads can create bottleneck (RAM, DMA, FPGA)
• Threaded programs exhibit unique feature: Multiple threads often access same or overlapping data (here, the same array)
• Solution: Fetch data once, broadcast to multiple threads (MAS)

Memory Access Synchronization (MAS)
• Detect overlapping memory regions (“windows”)
• Synthesis creates active “smart buffer” [Guo/Najjar FPGA04]
  • Actively fetches data, stores the reused data, delivers windows to threads
  • Active rather than passive component; designed for specific threads

    for (i = 0; i < 100; i++) {
      thread_create( thread_function, a, i );
    }

    void f( int a[], int i ) {
      int result = 0;
      result += a[i]+a[i+1]+a[i+2]+a[i+3];
      . . . .
    }

• Each thread accesses different addresses, but addresses may overlap
• Data streamed to “smart buffer” (RAM, DMA, A[0-103])
• Buffer delivers window to each thread (A[0-3], A[1-4], …, A[6-9], …)
• W/O smart buffer: 400 memory accesses
• With smart buffer: 104 memory accesses
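The access counts can be reproduced with a small model. This is a minimal sketch, assuming thread i reads the window a[i] .. a[i+window-1] and the buffer fetches each distinct word exactly once; the function names and the fixed buffer capacity are hypothetical, and the synthesized smart buffer is a hardware component rather than a C array. Under this model the slide's loop (100 threads, window of 4) touches a[0] through a[102], so the sketch counts 103 distinct words; the slide's figure of 104 matches the streamed range A[0-103].

```c
#include <assert.h>
#include <stddef.h>

/* Every thread fetches its own window from RAM. */
size_t accesses_without_buffer(size_t threads, size_t window)
{
    return threads * window;
}

/* The buffer fetches each distinct word once and serves all windows from it. */
size_t accesses_with_buffer(size_t threads, size_t window)
{
    assert(threads > 0 && window > 0);
    return threads + window - 1;      /* words a[0] .. a[threads + window - 2] */
}

/* Fills the buffer in one streaming pass, then sums each thread's window.
   Returns the number of RAM words fetched. */
size_t run_threads(const int *a, size_t threads, size_t window, long *results)
{
    int buffer[256];                  /* hypothetical buffer capacity */
    size_t fetched = 0;
    size_t needed = accesses_with_buffer(threads, window);
    assert(needed <= 256);
    for (size_t k = 0; k < needed; k++) {      /* single pass over RAM */
        buffer[k] = a[k];
        fetched++;
    }
    for (size_t i = 0; i < threads; i++) {     /* thread i gets a[i] .. a[i+window-1] */
        long result = 0;
        for (size_t j = 0; j < window; j++) result += buffer[i + j];
        results[i] = result;
    }
    return fetched;
}
```

The saving grows with the window overlap: without the buffer the traffic scales with threads times window, with it the traffic scales with threads plus window.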

Speedups from Thread Warping
• Chose benchmarks with extensive parallelism
• Four-core (ARM11, 400 MHz) base system
• Virtex IV FPGA at circuit-specific clock frequency (~100-300 MHz)
• Average 130x speedup
• But, FPGA uses additional area. Our FPGA size = ~36 ARM11s
  • Still 20x faster than 32-core system (and 11x faster than 64-core)
  • Simulation pessimistic, actual results likely better
  • FPGA more flexible

Warp Scenarios
Warping takes time (seconds, minutes, or more). When useful?
• Long-running applications
  • Scientific computing, etc.
• Recurring applications (save and reuse FPGA configurations)
  • Common in embedded systems
  • Might view as (long) boot phase
• For networked/docked devices, CAD can occur on server (ongoing work)
[Figure: two timelines. Long-running applications: µP execution while on-chip CAD runs, then FPGA, giving a single-execution speedup. Recurring applications: µP (1st execution) with on-chip CAD, then FPGA on later executions, giving a speedup.]
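The two scenarios can be compared with a back-of-the-envelope model. This is a minimal sketch under assumptions that are not in the slides: on-chip CAD runs concurrently without slowing the software, and once the circuit is ready the remaining work runs `speedup` times faster. The function names are hypothetical.

```c
#include <assert.h>

/* Long-running case: software until the circuit is ready, then warped.
   sw_time and cad_time use the same time unit. */
double single_run_speedup(double sw_time, double cad_time, double speedup)
{
    assert(sw_time > 0.0 && cad_time >= 0.0 && speedup > 0.0);
    if (cad_time >= sw_time) return 1.0;     /* run ended before warping finished */
    double warped = cad_time + (sw_time - cad_time) / speedup;
    return sw_time / warped;
}

/* Recurring case: first run is software-only, later runs reuse the
   saved FPGA configuration. Returns average speedup over all runs. */
double recurring_speedup(unsigned runs, double speedup)
{
    assert(runs > 0 && speedup > 0.0);
    double total = 1.0 + (runs - 1) / speedup;   /* in units of one software run */
    return runs / total;
}
```

In this model a run shorter than the CAD time gains nothing, which is why the slides single out long-running and recurring applications.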

Why Dynamic?
• Static good, but hiding FPGA opens technique to all sw platforms
• Standard languages/tools/binaries
[Figure: “ecosystem” of applications, tools, and architectures. Static compiling to FPGAs: specialized language, specialized compiler, binary plus netlist for µP and FPGA. Dynamic compiling to FPGAs: any language, any compiler, binary; on-chip CAD targets the FPGA beside the µP.]

Synthesis-Friendly Applications
• Coding style impacts synthesis results

Synthesis-Friendly Application Coding Guidelines
• Conversion to Constants (CC)
• Conversion to Explicit Data Flow (CEDF)
• Conversion to Fixed Point (CF)
• Conversion to Explicit Memory Accesses (CEMA)
• Constant Input Enumeration (CIE)
• Loop Rerolling (LR)
• Conversion to Explicit Control Flow (CECF)
• Function Specialization (FS)
• Algorithmic Specialization (AS)
• Pass-By-Value Return (PVR)

Conversion to Explicit Control Flow (CECF)
• Problem: Function pointers may prevent static control flow analysis
  • Synthesis unlikely to determine possible targets of function pointer
• Guideline: Don’t use function pointers. Replace with if-else, static calls
  • Makes possible targets explicit

Before (synthesized hardware: ?):

    void f( int (*fp) (int) ) {
      . . . . .
      for (i=0; i < 10; i++) {
        a[i] = fp(i);
      }
    }

After (synthesized circuit: f1(i), f2(i), f3(i) feed a 3x1 mux selected by fp, producing a[i]):

    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
      . . . . .
      for (i=0; i < 10; i++) {
        if (fp == FUNC1)
          a[i] = f1(i);
        else if (fp == FUNC2)
          a[i] = f2(i);
        else
          a[i] = f3(i);
      }
    }
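A complete, compilable version of the CECF rewrite makes the equivalence easy to test. This is a minimal sketch: the bodies of `f1`, `f2`, and `f3` are hypothetical stand-ins (the slides do not define them), and the array is passed as a parameter instead of being declared in the elided part of the function.

```c
#include <assert.h>

enum Target { FUNC1, FUNC2, FUNC3 };

/* Hypothetical stand-ins for f1, f2, f3. */
static int f1(int i) { return i + 1; }
static int f2(int i) { return i * 2; }
static int f3(int i) { return i * i; }

/* Before: the call target is only known at run time. */
void f_pointer(int (*fp)(int), int a[10])
{
    assert(fp != 0);
    for (int i = 0; i < 10; i++) {
        a[i] = fp(i);
    }
}

/* After: all three targets are visible to static analysis. */
void f_explicit(enum Target fp, int a[10])
{
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1)
            a[i] = f1(i);
        else if (fp == FUNC2)
            a[i] = f2(i);
        else
            a[i] = f3(i);
    }
}
```

Both versions fill the array identically for a given target; the explicit form just names every callee, so a synthesis tool could instantiate all three and select among them.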

Speedups from Synthesis-Friendly Coding Guidelines
• 10 guidelines
• For ~1,000 line benchmark: 5-6 changes typical, tens of minutes each
• Simple guidelines increased speedup to 6.5x

Speedups from Synthesis-Friendly Coding Guidelines
• Original C code (Powerstone, Mediabench)
  • Original average speedups with FPGA: 2.6x (excludes brev)
• Refined C code with guidelines
  • Average speedup: 8.4x (excludes brev)
• Guidelines led to 3.5x improvement of speedup

“Spatial” Algorithms for FPGAs
• As FPGAs become more common, app writers may expect FPGA presence
• Example: count patterns
  • Sequential algorithm
    • Hash table
    • 10s of cycles per pattern
  • Spatial algorithm (for FPGA)
    • Pipelined stages
• Spatial algorithm: Essence is the connectivity of components, not the sequencing of instructions
[Figure: current pattern enters a pipeline of stages, Level 1 through Level m, each holding a count, a pattern, and logic]

Spatial Algorithms for FPGAs
• Spatial algorithm 2
  • Pipelined binary tree
[Figure: current pattern enters a pipeline of levels, each with count, memory, and logic. Level 1: 1 count, 1 pattern. Level 2: 2 counts, 2 patterns. Level 3: 4 counts, 4 patterns. ... Level n: 2^n counts, 2^n patterns.]

Example: pipelined binary tree
• Possible patterns pre-stored in binary search tree circuit
• Same pipeline as on the previous slide: one level (memory plus logic) per stage
• Patterns 73, 48, 23, 75, 11 enter one per frame and advance one stage per frame:

    Frame   Current pattern   Stage 1   Stage 2   Stage 3   Stage 4
    1       48                73
    2       23                48        73
    3       75                23        48        73
    4       11                75        23        48        73
    5                         11        75        23        48

• Counts of 1 appear in frames 4 and 5 as patterns are matched
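The pipelined binary tree can be modeled in software to see why it accepts one pattern per cycle. This is a minimal sketch, assuming a complete binary search tree stored in heap order with one tree level per pipeline stage; the three-level depth, the struct names, and the clocking functions are hypothetical, and a real circuit would evaluate all stages in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

#define LEVELS 3                      /* pipeline stages = tree levels */
#define NODES  ((1 << LEVELS) - 1)    /* 7 patterns, heap order */

typedef struct { int pattern; size_t node; int valid; } StageReg;

typedef struct {
    int      key[NODES];     /* search tree: children of n are 2n+1 and 2n+2 */
    unsigned count[NODES];   /* matches seen per stored pattern */
    StageReg stage[LEVELS];  /* stage[k] compares against a node on level k */
} Pipeline;

void pipeline_init(Pipeline *p, const int keys[NODES])
{
    for (size_t n = 0; n < NODES; n++) { p->key[n] = keys[n]; p->count[n] = 0; }
    for (size_t k = 0; k < LEVELS; k++) p->stage[k].valid = 0;
}

/* One clock cycle: each stage handles a different pattern. */
void pipeline_clock(Pipeline *p, int has_input, int pattern)
{
    StageReg next[LEVELS] = {{0, 0, 0}};
    next[0].valid = has_input;                 /* new pattern enters at the root */
    next[0].pattern = pattern;
    next[0].node = 0;
    for (size_t k = 0; k < LEVELS; k++) {
        StageReg r = p->stage[k];
        if (!r.valid) continue;
        if (r.pattern == p->key[r.node]) {
            p->count[r.node]++;                /* match: count and retire */
        } else if (k + 1 < LEVELS) {           /* pass down to left or right child */
            next[k + 1].valid = 1;
            next[k + 1].pattern = r.pattern;
            next[k + 1].node = 2 * r.node + (r.pattern < p->key[r.node] ? 1 : 2);
        }                                      /* else: unknown pattern, dropped */
    }
    for (size_t k = 0; k < LEVELS; k++) p->stage[k] = next[k];
}

/* Streams n patterns, then drains the pipeline. Returns clocks used. */
size_t pipeline_run(Pipeline *p, const int *patterns, size_t n)
{
    size_t clocks = 0;
    assert(p != NULL);
    for (size_t i = 0; i < n; i++, clocks++) pipeline_clock(p, 1, patterns[i]);
    for (size_t k = 0; k < LEVELS; k++, clocks++) pipeline_clock(p, 0, 0);
    return clocks;
}
```

In this model n patterns finish in n + LEVELS clocks, so the cost per pattern approaches one cycle, compared with the tens of cycles per pattern quoted for the sequential hash-table version.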

Study of Spatial Algorithms in FCCM
• FCCM 2001-2006
• 70 papers describing fast application on FPGA
• Examined 35 in depth (every other one)
  • 6 used device-specific features
  • 9 represented expected synthesized circuit from the obvious sequential algorithm
  • 20 were spatially-oriented applications
    • e.g., earlier pipelined binary tree

    Year   Application               Type
    2001   3D Vec. Normalization     Spatial
    2001   Efficient CAM             --
    2001   Automated Sensor          Temporal
    2001   Regular Expression        Spatial
    2002   Hyperspectral Image       Spatial
    2002   Machine Vision            Spatial
    2002   RC4                       Temporal
    2002   Set Covering              Spatial
    2002   Template Matching         Spatial
    2002   Triangle Mesh             Spatial
    2003   Congruential Sieves       Temporal
    2003   Content Scanning          Temporal
    2003   F.P and Square Root       Spatial
    2003   Gaussian Noise            Spatial
    2003   TRNG                      --
    2004   3D FDTD Method            Spatial
    2004   Deep Packet Filter        --
    2004   Online Floating Point     --
    2004   Molecular Dynamics        Spatial
    2004   Pattern Matching          Spatial
    2004   Seismic Migration         Spatial
    2004   Software Deceleration     --
    2004   V.M Window                --
    2005   Data Mining               Spatial
    2005   Cell Automata             Temporal
    2005   Particle Graphics         Spatial
    2005   Radiosity                 Temporal
    2005   Transient Waves           Spatial
    2005   Road Traffic              Temporal
    2006   All Pairs Shortest Path   Spatial
    2006   Apriori Data Mining       Spatial
    2006   Molecular Dynamics        Spatial
    2006   Gaussian Elimination      Spatial
    2006   Radiation Dose            Temporal
    2006   Random Variates           Spatial

Portable Spatial Applications?
• Current portable microprocessor binaries: sequential
  • Extensions for threads, processes, ...
• How support spatial constructs?
  • Ports, connections, timing model
  • .....
• SystemC (www.systemc.org)
  • Adds libraries and macros, still standard C++
  • Sequential and spatial constructs
  • Compiling links in the simulation kernel
    • Self-executing simulation
  • Intended for SoC simulation

Bytecode
• Modern portability approach
  • Java, C#
• Virtual Machine (VM): Program that executes bytecode
  • May JIT compile to native architecture
[Figure: compiler produces bytecode; a VM runs it on each of Pentium, Opteron, Atom]
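The idea of a VM as "a program that executes bytecode" can be shown in a few lines. This is a minimal sketch of a stack-based interpreter with a made-up instruction set; the opcodes and `vm_run` are hypothetical and unrelated to Java, C#, or SystemC bytecode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, for illustration only. */
enum Op { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Executes bytecode on a small operand stack; returns the top of stack. */
int vm_run(const int *code, size_t length)
{
    int stack[32];
    size_t sp = 0, pc = 0;
    while (pc < length) {
        switch (code[pc++]) {
        case OP_PUSH:                       /* operand follows the opcode */
            assert(pc < length && sp < 32);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
        default:                            /* stop on halt or unknown opcode */
            pc = length;
            break;
        }
    }
    assert(sp >= 1);
    return stack[sp - 1];
}
```

The same bytecode array runs unchanged wherever this interpreter compiles, which is the portability property the slide describes; a JIT would translate hot sequences to native code instead of interpreting them.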

SystemC Bytecode?
[Figure: SystemC compiler produces SystemC bytecode; a VM runs it on each platform: Pentium, Opteron + FPGA, FPGA]

SystemC Bytecode Compiler

SystemC input:

    class EDGE_DETECTOR : public sc_module {
      //signal declarations
      …
      EDGE_DETECTOR() {
        SC_METHOD(mainComp);
        sensitive << dataReady;
        SC_METHOD(getPixel);
        sensitive << clock.pos();
      }
      void getPixel() {
        …
        dataReady.write(1);
      }
      void mainComp() {
        int i, j;
        for (i = 0; i < 3; i++) {
          for (j = 0; j < 3; j++) {
            sumX = sumX + mem.read()*GX[i][j];
          }
        }
        …
        edge.write(sumX + sumY);
      }
    };

Compiler flow: SystemC → Pinapa front end (AST, link, ELAB) → bytecode back end (code generation, register allocation) → SystemC bytecode

SystemC Bytecode Emulator
• SystemC bytecode uploadable via USB drive
• “Warping” also possible: JIT compile bytecode portions to circuits on FPGA
• FPGA accelerators speed up emulation
[Figure: emulator on an FPGA platform. Main processor (µP, instruction memory, I$, D$), profiler, on-chip CAD, accelerators 1-3, input memory, output memory, read signal memory, write signal memory, UART, USB interface, buttons, LEDs.]
