
Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators

Frank Vahid, Dept. of CS&E, University of California, Riverside (also with the Center for Embedded Computer Systems, UC Irvine). Greg Stitt, Dept. of ECE, University of Florida.


Presentation Transcript


1. Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators. Frank Vahid, Dept. of CS&E, University of California, Riverside (also with the Center for Embedded Computer Systems, UC Irvine); Greg Stitt, Dept. of ECE, University of Florida. This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation.

2. Background. Motivated by the commercial dynamic binary translation of the early 2000s, e.g., the Transmeta Crusoe's "code morphing," which translated x86 binaries into the processor's native VLIW binaries for performance. Warp processing (Lysecky/Stitt/Vahid, 2003-2007) extends the idea: dynamically translate a microprocessor binary into circuits on an FPGA. [Slide diagram: x86 binary to VLIW binary via binary translation; µP binary to FPGA via binary "translation".]

3. Warp Processing Background, step 1: Initially, the software binary is loaded into instruction memory. Software binary:
Mov reg3, 0
Mov reg4, 0
loop: Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
[Slide diagram: µP with I Mem, D$, profiler, FPGA, and on-chip CAD.]

4. Warp Processing Background, step 2: The microprocessor executes the instructions in the software binary. [Slide diagram: same software binary and µP/FPGA/on-chip CAD platform as slide 3.]

5. Warp Processing Background, step 3: The profiler monitors instructions and detects critical regions in the binary; here the critical loop is detected. [Slide diagram: profiler observing the repeated beq/add instructions; same platform as slide 3.]

6. Warp Processing Background, step 4: The on-chip CAD reads in the critical region. [Slide diagram: same platform as slide 3.]

7. Warp Processing Background, step 5: The on-chip CAD decompiles the critical region into a control/data flow graph (CDFG), recovering loops, arrays, subroutines, etc., which are needed to synthesize good circuits. Decompiled region:
reg3 := 0
reg4 := 0
loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
Decompilation is surprisingly effective at recovering high-level program structures (Stitt et al., ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07). [Slide diagram: dynamic partitioning module (DPM) / on-chip CAD on the same platform as slide 3.]
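
For reference, a minimal C-level sketch of what the decompiled region above corresponds to; the function name sum10 and the 16-bit element type are assumptions inferred from the (reg3 << 1) word addressing, not from the slides:

/* Hypothetical C-level view of the decompiled region: a 10-element
 * accumulation over a 16-bit array (reg2 holds the array base,
 * reg3 the loop counter, reg4 the running sum). */
short sum10(const short a[10]) {
    short sum = 0;                  /* reg4 := 0 */
    for (int i = 0; i < 10; i++) {  /* reg3 counts 0..9 */
        sum += a[i];                /* reg4 := reg4 + mem[reg2 + (i << 1)] */
    }
    return sum;                     /* ret reg4 */
}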

8. Warp Processing Background, step 6: The on-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit. [Slide diagram: the loop's additions reorganized into a parallel adder tree; same platform as slide 3.]
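
The adder tree in the slide can be pictured as the loop's ten additions reorganized into log-depth pairwise sums. A small software analogue of that reorganization, illustrative only; the synthesized circuit performs each level as a row of parallel adders in hardware:

/* Illustrative software analogue of the adder tree: sum pairs of
 * elements level by level, so the depth is O(log n) rather than O(n). */
short sum10_tree(const short a[10]) {
    short level[10];
    int n = 10;
    for (int i = 0; i < n; i++) level[i] = a[i];
    while (n > 1) {
        int half = (n + 1) / 2;
        for (int i = 0; i < n / 2; i++)
            level[i] = level[2 * i] + level[2 * i + 1];
        if (n % 2) level[n / 2] = level[n - 1];  /* carry odd element up */
        n = half;
    }
    return level[0];
}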

9. Warp Processing Background, step 7: The on-chip CAD maps the circuit onto the FPGA. A lean place&route flow and FPGA fabric yield roughly 10x faster CAD (Lysecky et al., DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06); on multi-core chips, one powerful core can be dedicated to CAD. [Slide diagram: circuit placed onto FPGA CLBs and switch matrices (SM); same platform as slide 3.]

10. Warp Processing Background, step 8: The on-chip CAD replaces instructions in the binary so that the critical region uses the hardware, causing performance and energy to "warp" by an order of magnitude or more; speedups exceed 10x for some applications. Updated binary:
Mov reg3, 0
Mov reg4, 0
loop: // instructions that interact with FPGA
Ret reg4
[Slide diagram: bar chart comparing software-only versus "warped" FPGA execution; same platform as slide 3.]

11. Warp Scenarios. Warping takes time, so when is it useful? Long-running applications (scientific computing, etc.), where the single-execution speedup outweighs the CAD time; and recurring applications (save the FPGA configurations for reuse), common in embedded systems, where warping can be viewed as a (long) boot phase. Possible platforms: Xilinx Virtex II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist, ... [Slide diagram: execution timelines for long-running and recurring applications, with the first execution's µP run overlapped with on-chip CAD.]

12. Thread Warping: Overview. Multi-core platforms mean multi-threaded apps, e.g.: for (i = 0; i < 10; i++) { thread_create( f, i ); } The OS schedules threads onto the available µPs; the remaining threads are added to a queue. Thread warping uses one core's on-chip CAD tools to create accelerators for the waiting threads' function f(). The OS then schedules threads onto the accelerators (possibly dozens), in addition to the µPs. Very large speedups are possible: parallelism at the bit, arithmetic, and now thread level too. A sketch of this thread-creation pattern appears below. [Slide diagram: compiler producing the binary; multi-core µPs plus FPGA with accelerator library and on-chip CAD.]
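
A minimal sketch of the thread-creation pattern assumed above, written with the POSIX pthread API the framework targets; the worker body, the thread count, and the printed output are illustrative assumptions, not from the slides:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 10

/* Illustrative worker: each thread processes its own index. */
static void *f(void *arg) {
    long i = (long)arg;
    printf("worker %ld running\n", i);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];

    /* Boss creates one worker per iteration; under thread warping the
     * OS would queue these threads and schedule them onto µPs or onto
     * FPGA accelerators synthesized for f(). */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, f, (void *)i);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}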

13. Thread Warping Tools. Invoked by the OS; use the POSIX pthread library, with mutexes/semaphores for synchronization; the framework defines the methods/algorithms of each step. [Slide flowchart: the thread queue (thread functions and thread counts) feeds queue analysis; if a function is not in the accelerator library, accelerator synthesis runs (decompilation, memory access synchronization, high-level synthesis to a netlist, then place&route to a bitfile); once accelerators are synthesized, accelerator instantiation runs (hw/sw partitioning, memory access synchronization, binary updater), producing the thread group table, updated binary, and schedulable resource list for the FPGA.] The control flow is sketched in pseudocode below.
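
A hedged sketch of that tool flow's control loop. The types and helpers below are hypothetical stand-ins for the flowchart's boxes, not the authors' actual API:

#include <stdbool.h>
#include <stdio.h>

/* Sketch only: hypothetical data structures mirroring the flowchart. */
typedef struct { const char *name; int queued_count; } thread_func_t;
typedef struct { thread_func_t *funcs[16]; int n; } thread_queue_t;
typedef struct { const char *synthesized[16]; int n; } accel_library_t;

/* Queue analysis: report which thread functions are waiting. */
static int queue_analysis(thread_queue_t *q, thread_func_t **out, int max) {
    int n = q->n < max ? q->n : max;
    for (int i = 0; i < n; i++) out[i] = q->funcs[i];
    return n;
}

static bool in_library(accel_library_t *lib, thread_func_t *f) {
    for (int i = 0; i < lib->n; i++)
        if (lib->synthesized[i] == f->name) return true;
    return false;
}

/* Accelerator synthesis: decompilation, memory access synchronization,
 * high-level synthesis to a netlist, place & route to a bitfile. */
static void synthesize_accelerator(accel_library_t *lib, thread_func_t *f) {
    printf("synthesizing accelerator for %s\n", f->name);
    lib->synthesized[lib->n++] = f->name;
}

/* Accelerator instantiation: hw/sw partitioning, memory access
 * synchronization, binary updater, yielding the thread group table,
 * updated binary, and schedulable resource list. */
static void instantiate_accelerators(thread_func_t **funcs, int n) {
    for (int i = 0; i < n; i++)
        printf("instantiating accelerators for %s\n", funcs[i]->name);
}

/* Control loop corresponding to the slide's flowchart. */
void thread_warp(thread_queue_t *q, accel_library_t *lib) {
    thread_func_t *funcs[16];
    int n = queue_analysis(q, funcs, 16);
    for (int i = 0; i < n; i++)
        if (!in_library(lib, funcs[i]))
            synthesize_accelerator(lib, funcs[i]);
    instantiate_accelerators(funcs, n);
}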

14. Memory Access Synchronization (MAS). MAS must deal with the widely known memory bottleneck problem: FPGAs are great, but often data can't be delivered to them fast enough, and data for dozens of threads can create a bottleneck. Threaded programs exhibit a unique feature: multiple threads often access the same data. For example, every thread created by for (i = 0; i < 10; i++) { thread_create( thread_function, a, i ); } running void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; } . . . . } reads the same array a. Solution: fetch the data once and broadcast it to multiple threads (MAS). [Slide diagram: RAM and DMA feeding many thread accelerators on the FPGA.]

15. Memory Access Synchronization (MAS), continued. 1) Identify thread groups, i.e., loops that create threads. 2) Identify constant memory addresses in the thread function via def-use analysis of its parameters; for the group created by for (i = 0; i < 100; i++) { thread_create( f, a, i ); }, def-use analysis shows a is constant for all threads, so the addresses of a[0-9] are constant for the thread group. 3) Synthesis creates a "combined" memory access whose execution is synchronized by the OS: the data is fetched once by DMA from RAM and delivered to the entire group. Before MAS: 1000 memory accesses; after MAS: 100 memory accesses. A small access-count sketch follows. [Slide diagram: DMA fetching a[0-9] from RAM and broadcasting it, under an OS enable, to each f() accelerator in the thread group.]
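
A toy software model of the combined memory access, not the authors' code. One assumption is labeled below: the 100-thread group is scheduled onto 10 accelerators at a time, so the combined fetch of a[0-9] is issued once per batch of 10 threads; under that assumption the counts match the slide's 1000 versus 100:

#include <stdio.h>

#define THREADS 100
#define BATCH   10   /* assumed number of concurrently scheduled accelerators */
#define ELEMS   10

static long ram_accesses;
static int ram_read(const int a[], int i) { ram_accesses++; return a[i]; }

int main(void) {
    int a[ELEMS] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int buf[ELEMS];

    /* Without MAS: every thread loads a[0..9] from RAM itself. */
    ram_accesses = 0;
    for (int t = 0; t < THREADS; t++)
        for (int i = 0; i < ELEMS; i++)
            (void)ram_read(a, i);
    printf("without MAS: %ld RAM accesses\n", ram_accesses);  /* 1000 */

    /* With MAS: one combined fetch per batch, broadcast to the batch. */
    ram_accesses = 0;
    for (int t = 0; t < THREADS; t += BATCH) {
        for (int i = 0; i < ELEMS; i++)
            buf[i] = ram_read(a, i);     /* fetched once for the batch */
        /* ...the BATCH accelerators now read buf[] locally... */
    }
    printf("with MAS: %ld RAM accesses\n", ram_accesses);     /* 100 */
    return 0;
}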

16. Memory Access Synchronization (MAS), continued. MAS also detects overlapping memory regions, or "windows": each thread accesses different addresses, but the addresses may overlap. For the group created by for (i = 0; i < 100; i++) { thread_create( thread_function, a, i ); } with thread function void f( int a[], int i ) { int result; result += a[i]+a[i+1]+a[i+2]+a[i+3]; . . . . }, thread i reads the 4-element window a[i..i+3]. Synthesis creates an extended "smart buffer" [Guo/Najjar FPGA'04] that caches the reused data: a[0-103] is streamed from RAM into the buffer, and the buffer delivers a window (a[0-3], a[1-4], ...) to each thread. Without the smart buffer: 400 memory accesses; with it: 104. A sliding-window sketch follows. [Slide diagram: DMA streaming a[] into the smart buffer, which delivers overlapping windows to the f() accelerators under an OS enable.]
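
A minimal software analogue of the smart-buffer idea (a sketch, not the Guo/Najjar design): stream the a[0-103] region in once, keep it in a buffer, and hand each thread its 4-element window from the buffer, so RAM sees 104 reads instead of 400:

#include <stdio.h>

#define THREADS 100
#define WIN     4
#define N       104   /* the slide streams a[0-103] */

static long ram_accesses;
static int ram_read(const int a[], int i) { ram_accesses++; return a[i]; }

int main(void) {
    int a[N];
    for (int i = 0; i < N; i++) a[i] = i;   /* assumed test data */

    /* Without a smart buffer: each thread loads its own window. */
    ram_accesses = 0;
    for (int t = 0; t < THREADS; t++) {
        int sum = 0;
        for (int k = 0; k < WIN; k++) sum += ram_read(a, t + k);
        (void)sum;
    }
    printf("without smart buffer: %ld RAM accesses\n", ram_accesses); /* 400 */

    /* With a smart-buffer analogue: stream each element in once,
     * then deliver the overlapping windows out of the local buffer. */
    ram_accesses = 0;
    int buf[N];
    for (int i = 0; i < N; i++) buf[i] = ram_read(a, i);   /* 104 reads */
    for (int t = 0; t < THREADS; t++) {
        int sum = 0;
        for (int k = 0; k < WIN; k++) sum += buf[t + k];    /* data reused */
        (void)sum;
    }
    printf("with smart buffer: %ld RAM accesses\n", ram_accesses);    /* 104 */
    return 0;
}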

17. Framework. Initial algorithms were also developed for queue analysis, accelerator instantiation, and OS scheduling of threads to accelerators and cores. [Slide diagram: the same tool-flow flowchart as slide 13.]

18. Thread Warping Example. int main( ) { . . . for (i=0; i < 50; i++) { thread_create( filter, a, b, i ); } . . . } void filter( int a[53], int b[50], int i ) { b[i] = avg( a[i], a[i+1], a[i+2], a[i+3] ); } The filter() threads execute on the available cores, and the remaining threads are added to the thread queue. The OS invokes the CAD (due to queue size, or periodically), and the CAD tools identify the thread function filter() for synthesis. A standalone software version of this example appears after this slide. [Slide diagram: main() and filter() threads on the µPs, thread queue, on-chip CAD, FPGA.]
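
For concreteness, a self-contained, software-only version of the slide's example; the avg helper (a 4-tap average using the shift-right-by-2 seen on slide 20), the global arrays, and the test data are assumptions filled in around the fragments the slides show:

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 50

static int a[53];   /* input: 50 outputs, each reading a 4-element window */
static int b[50];   /* output */

static int avg(int w, int x, int y, int z) {   /* assumed 4-tap average */
    return (w + x + y + z) >> 2;
}

/* pthread-style thread function: the slides pass (filter, a, b, i) to
 * thread_create; here a and b are globals and only i is passed. */
static void *filter(void *arg) {
    long i = (long)arg;
    b[i] = avg(a[i], a[i + 1], a[i + 2], a[i + 3]);
    return NULL;
}

int main(void) {
    pthread_t tid[N_THREADS];
    for (int i = 0; i < 53; i++) a[i] = i;      /* assumed test data */

    for (long i = 0; i < N_THREADS; i++)
        pthread_create(&tid[i], NULL, filter, (void *)i);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);

    printf("b[0]=%d b[49]=%d\n", b[0], b[49]);
    return 0;
}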

19. Example, continued. The CAD reads the filter() binary and decompiles it into a CDFG. Memory access synchronization detects the thread group and the overlapping windows of a[]. [Slide diagram: same main()/filter() code and platform as slide 18.]

20. Example, continued. High-level synthesis creates a pipelined accelerator for the filter() group (adders plus a shift-right-by-2 to compute the 4-element average); eight accelerators are loaded into the FPGA, fed by a smart buffer from RAM, and the result is stored in the accelerator library for future use. [Slide diagram: same code as slide 18; decompilation to CDFG, memory access synchronization, high-level synthesis, then eight filter accelerators behind the smart buffer.]

21. Example, continued. The OS schedules the threads to the accelerators and enables the smart buffer, which streams the a[0-52] data from RAM. After the buffer fills, it delivers a window (e.g., a[2-5], ..., a[9-12]) to each of the eight accelerators. [Slide diagram: same code and platform as slide 18.]

22. Example, continued. Each cycle, the smart buffer delivers eight more windows (e.g., a[10-13], ..., a[17-20]), so the accelerator pipelines remain full. [Slide diagram: same code and platform as slide 18.]

23. Example, continued. After the pipeline latency passes, the accelerators produce eight outputs per cycle (e.g., b[2-9]). [Slide diagram: same code and platform as slide 18.]

24. Example, continued. An additional eight outputs (e.g., b[10-17]) are produced each cycle. Thread warping delivers 8 pixel outputs per cycle, versus roughly 1 pixel output every ~9 cycles in software: a 72x cycle-count improvement, as worked out below. [Slide diagram: same code and platform as slide 18.]
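
The 72x figure follows directly from the two throughput numbers on the slide:

speedup ≈ (8 outputs per cycle) / (1 output per ~9 cycles) = 8 × 9 = 72x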

25. Experiments to Determine Thread Warping Performance: Simulator Setup. A parallel execution graph (PEG) represents the thread-level parallelism; its nodes are sequential execution blocks (SEBs) and its edges are pthread calls. Simulation summary: 1) Generate the PEG using pthread wrappers (a sketch of such a wrapper follows). 2) Determine SEB performances: software with SimpleScalar, hardware via synthesis/simulation (Xilinx); the model is optimistic for software execution (no memory contention) and pessimistic for warped execution (accelerators and microprocessors execute exclusively). 3) Run an event-driven simulation, using the defined algorithms to change the architecture dynamically. 4) Complete when all SEBs have been simulated; observe the total cycles. [Slide diagram: main() forking filter threads, forming the PEG.]
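
A hedged illustration of what a PEG-building pthread wrapper could look like; the slides only say wrappers are used, so the event log, peg_record helper, and wrapper signature below are assumptions, not the authors' instrumentation:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical instrumentation: log every thread creation so a parallel
 * execution graph (PEG) can be built offline. Nodes are the sequential
 * execution blocks (SEBs) between pthread calls; edges are the calls. */
void peg_record(const char *event, const char *fn_name) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    fprintf(stderr, "PEG %s %s t=%ld.%09ld\n",
            event, fn_name, (long)ts.tv_sec, ts.tv_nsec);
}

/* Drop-in wrapper the instrumented program calls instead of
 * pthread_create; the fn_name string labels the PEG edge. */
int peg_thread_create(pthread_t *tid, void *(*fn)(void *),
                      void *arg, const char *fn_name) {
    peg_record("create", fn_name);   /* edge: current SEB -> new thread */
    return pthread_create(tid, NULL, fn, arg);
}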

26. Experiments. Benchmarks: image processing, DSP, and scientific computing; highly parallel examples chosen to illustrate thread warping's potential, for which we created multithreaded versions. The base architecture is 4 ARM cores, and we focus on recurring (embedded) applications. Compared configurations: a multi-core baseline of 4 ARM11 cores at 400 MHz, versus thread warping with 4 ARM11 cores at 400 MHz plus an FPGA running at whatever frequency synthesis determines.

27. Speedup from Thread Warping. Average 130x speedup (versus the 4-ARM11 baseline). But the FPGA uses additional area, so we also compare against systems with 8 to 64 ARM11 µPs (the FPGA's area is roughly equal to 36 ARM11s): thread warping is 11x faster than the 64-core system. The simulation is pessimistic, so actual results are likely better.

28. Limitations. Dependent on coding practices: the approach assumes a boss/worker thread model. Not all applications are amenable to FPGA speedup. Commercial CAD is slow, so warping takes time; but in the worst case, the FPGA simply goes unused by the application.

29. Why Not Partition Statically? Static partitioning is good, but hiding the FPGA opens the technique to all software platforms, with standard languages, tools, and binaries: static thread synthesis needs a specialized language and compiler and produces a binary plus a netlist, whereas dynamic thread warping accepts any language and any compiler and works from the binary alone. An FPGA can be added without changing binaries, much like expanding memory or adding processors to a multiprocessor, and the system can adapt to changing workloads (smaller and more accelerators, fewer and larger accelerators, ...). Memory-access synchronization is also applicable to the static approach. [Slide diagram: static flow from specialized language/compiler to binary plus netlist, versus dynamic flow from any language/compiler to a binary running on the µP/FPGA with on-chip CAD.]

30. Conclusions. The thread warping framework dynamically synthesizes accelerators for thread functions. Memory access synchronization helps reduce the memory bottleneck problem. 130x speedups were obtained for the chosen examples. Future work: handle a wider variety of coding constructs, improve support for different thread models, and address numerous open problems, e.g., dynamic reallocation of FPGA resources.
