
Self-Improving Computer Chips – Warp Processing



Presentation Transcript


  1. Self-Improving Computer Chips – Warp Processing Frank Vahid Dept. of CS&E University of California, Riverside Associate Director, Center for Embedded Computer Systems, UC Irvine Contributing Ph.D. Students: Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona), Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville), Scotty Sirowy (current), David Sheldon (current). This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx

  2. FPGA Coprocessing Entering Mainstream • Xilinx, Altera, … • Cray, SGI • Mitrionics • AMD Opteron socket plug-ins • Intel QuickAssist • IBM Cell (research) • SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs) [Images: Xilinx Virtex II Pro and Virtex V; source: Xilinx]

  3. Circuits on FPGAs Can Execute Fast • Large speedups on many important applications • Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, …

  4. Background • SpecSyn – 1989-1994 (Gajski et al., UC Irvine): system synthesis, hardware/software partitioning • Synthesize executable specifications like VHDL or SpecCharts (now SpecC) to microprocessors and custom ASIC circuits • FPGAs were just invented and had very little capacity • ~2000: Dynamic software optimization/translation, e.g., HP's Dynamo; Java JIT compilers; Transmeta Crusoe "code morphing" [Figure: binary translation of an x86 binary to a VLIW binary on a VLIW µP, improving performance]

  5. Circuits on FPGAs are Software • Microprocessor binaries (instructions): bits loaded into program memory, executed by a processor • FPGA "binaries" (circuits), more commonly known as "bitstreams": bits loaded into LUTs and SMs • Both "software" and "hardware" are just bits (01110100...) processed by a processor

  6. Circuits on FPGAs are Software • "Circuits" often called "hardware" • From a 1958 article – "Today the "software" comprising the carefully planned interpretive routines, compilers, and other aspects of automative programming are at least as important to the modern electronic calculator as its "hardware" of tubes, transistors, wires, tapes, and the like." • "Software" does not equal "instructions" • Software is simply the "bits" • Bits may represent instructions, circuits, …

  7. Circuits on FPGAs are Software Sep 2007 IEEE Computer

  8. The New Software – Circuits on FPGAs – May Be Worth Paying Attention To • History repeats itself? "…1876; there was a lot of love in the air, but it was for the telephone, not for Bell or his patent. There were many more applications for telephone-like devices, and most claimed Bell's original application was for an object that wouldn't work as described. Bell and his partners weathered these, but at such a great cost that they tried to sell the patent rights to Western Union, the giant telegraph company, in late 1876 for $100,000. But Western Union refused, because at the time they thought the telephone would never amount to anything. After all, why would anyone want a telephone? They could already communicate long-distance through the telegraph, and early phones had poor transmission quality and were limited in range. …" (http://www.telcomhistory.org/) • Multi-billion dollar growing industry • Increasingly found in embedded system products – medical devices, base stations, set-top boxes, etc. • Recent announcements (e.g., Intel) → FPGAs about to "take off"?

  9. Binary JIT Compilers / Dynamic Translation • Extensive binary translation in modern microprocessors, e.g., Java JIT compilers; Transmeta Crusoe "code morphing" • Inspired by the binary translators of the early 2000s, began the "Warp processing" project in 2002 – dynamically translate binary to circuits on FPGAs [Figure: µP binary JIT-compiled/translated to an FPGA, improving performance]

  10. Warp Processing – Step 1: Initially, software binary loaded into instruction memory. Example software binary: Mov reg3, 0 / Mov reg4, 0 / loop: Shl reg1, reg3, 1 / Add reg5, reg2, reg1 / Ld reg6, 0(reg5) / Add reg4, reg4, reg6 / Add reg3, reg3, 1 / Beq reg3, 10, -5 / Ret reg4 [Architecture: µP, I-Mem, D$, Profiler, FPGA, on-chip CAD]

  11. Warp Processing – Step 2: Microprocessor executes instructions in software binary

  12. Warp Processing – Step 3: Profiler monitors instructions and detects critical regions in binary ("Critical loop detected" – e.g., the frequently executed add/beq loop)

  13. Warp Processing – Step 4: On-chip CAD reads in the critical region

  14. Warp Processing – Step 5: On-chip CAD decompiles the critical region into a control/data flow graph (CDFG): reg3 := 0; reg4 := 0; loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4 • Recover loops, arrays, subroutines, etc. – needed to synthesize good circuits • Decompilation surprisingly effective at recovering high-level program structures • Stitt et al. ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07

  15. Warp Processing – Step 6: On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit, e.g., an adder tree

  16. Warp Processing – Step 7: On-chip CAD maps the circuit onto the FPGA (CLBs and SMs) • Lean place & route: 10x faster CAD (Lysecky et al. DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06) • Multi-core chips – use one powerful core for CAD

  17. Warp Processing – Step 8: On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more – >10x speedups for some apps ("Warp speed, Scotty") • Updated binary: Mov reg3, 0 / Mov reg4, 0 / loop: // instructions that interact with FPGA / Ret reg4 [Chart: software-only vs. "warped" FPGA execution]

  18. Warp Processing Challenges • Two key challenges • Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? • Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? [Tool flow: binary → profiling & partitioning → decompilation → synthesis → JIT FPGA compilation → FPGA binary; binary updater → updated microprocessor binary]

  19. Decompilation • Solution – recover high-level information from binary (branches, loops, arrays, subroutines, …): decompilation • Steps: control/data flow graph creation, data flow analysis, function recovery, control structure recovery, array recovery • Adapted extensive previous work (done for different purposes); developed new methods (e.g., "reroll" loops) • Original C code: long f( short a[10] ) { long accum; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; } • Corresponding assembly: Mov reg3, 0 / Mov reg4, 0 / loop: Shl reg1, reg3, 1 / Add reg5, reg2, reg1 / Ld reg6, 0(reg5) / Add reg4, reg4, reg6 / Add reg3, reg3, 1 / Beq reg3, 10, -5 / Ret reg4 • Recovered code: long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; } – almost identical representations • Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville) • Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
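The recovered function on this slide can be made directly runnable. One detail worth noting: the slide's "Original C Code" leaves accum uninitialized, while the binary's "Mov reg4, 0" makes clear it starts at zero, so the sketch below initializes it explicitly.

```c
#include <assert.h>

/* The function recovered by decompilation on this slide, made
 * runnable. accum starts at 0, matching the binary's "Mov reg4, 0"
 * (the slide's original C listing omits the initialization). */
long f(const short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];  /* reg4 := reg4 + mem[reg2 + (reg3 << 1)] */
    return accum;
}
```

The `<< 1` in the assembly is simply the address scaling for 16-bit (`short`) array elements, which is one of the cues array recovery exploits.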

  20. Decompilation Results vs. C • Competitive with synthesis from C

  21. Decompilation Results on Optimized H.264 – In-depth Study with Freescale • Again, competitive with synthesis from C

  22. Decompilation is Effective Even with High Compiler-Optimization Levels • Do compiler optimizations generate binaries that are harder to decompile effectively? • (Surprisingly) found the opposite – optimized code decompiles even better [Chart: average speedup of 10 examples]

  23. Warp Processing Challenges • Two key challenges • Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? • Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?

  24. Challenge: JIT Compile to FPGA • Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed a CAD-oriented FPGA • e.g., our router (ROCR, DAC'04) is 10x faster and uses 20x less memory, at the cost of a 30% longer critical path; similar results for synthesis & placement • Tool comparison: Xilinx ISE – 60 MB, 9.1 s; Riverside JIT FPGA tools – 3.6 MB, 0.2 s; Riverside JIT FPGA tools on a 75 MHz ARM7 – 1.4 s • Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona) – EDAA Outstanding Dissertation Award • Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

  25. Warp Processing Results – Performance Speedup (Most Frequent Kernel Only) vs. 200 MHz ARM, ARM-only execution • Average kernel speedup of 41x • Overall application speedup average is 7.4x
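The gap between the 41x kernel speedup and the 7.4x overall speedup is Amdahl's law: only the accelerated kernel's share of runtime shrinks. A minimal sketch (the 90% kernel fraction below is an illustrative assumption, not a number from the slides):

```c
/* Amdahl's-law sketch: overall speedup when only a fraction of
 * the runtime (the kernel) is accelerated. */
double overall_speedup(double kernel_frac, double kernel_speedup) {
    return 1.0 / ((1.0 - kernel_frac) + kernel_frac / kernel_speedup);
}
```

For example, a kernel taking 90% of runtime and sped up 41x yields an overall speedup of 1 / (0.1 + 0.9/41) = 8.2x, the same order as the 7.4x average the slide reports.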

  26. Recent Work: Thread Warping (CODES/ISSS Oct 07, Austria, Best Paper Nom.) • Multi-core platforms → multi-threaded apps, e.g.: for (i = 0; i < 10; i++) { thread_create( f, i ); } • OS schedules threads onto available µPs; remaining threads added to a queue • Thread warping: OS invokes on-chip CAD tools to create accelerators for f() – use one core to create accelerators for waiting threads • OS then schedules threads onto accelerators (possibly dozens), in addition to µPs • Very large speedups possible – parallelism at bit, arithmetic, and now thread level too

  27. Thread Warping Tools • Developed framework • Uses pthread library (POSIX); mutex/semaphore for synchronization • Flow: thread queue → queue analysis (thread functions, thread counts) → accelerator library lookup; if not in library, accelerator synthesis (decompilation, hw/sw partitioning, high-level synthesis, memory access synchronization → netlist → place & route → bitfile) → accelerator instantiation → thread group table, schedulable resource list, updated binary
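The slides' thread_create(f, i) pattern maps onto the POSIX pthread API the framework builds on. A minimal sketch (the squaring "work" and the run_threads wrapper are illustrative placeholders; a real warp system would queue these threads and map some onto FPGA accelerators rather than just join them):

```c
#include <pthread.h>
#include <stddef.h>

#define N 10

static long results[N];

/* Thread function: each thread works on its own index.
 * The squaring is a stand-in for the thread's real computation. */
static void *f(void *arg) {
    long i = (long)(size_t)arg;
    results[i] = i * i;
    return NULL;
}

/* Create N threads as in the slide's loop, wait for all of them,
 * and return one result so the pattern can be checked. */
long run_threads(void) {
    pthread_t t[N];
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, f, (void *)(size_t)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);  /* OS/warp tools would schedule instead */
    return results[3];
}
```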

  28. Memory Access Synchronization (MAS) • Must deal with the widely known memory bottleneck problem • FPGAs great, but often can't get data to them fast enough – data for dozens of threads can create a bottleneck between RAM/DMA and the FPGA • Threaded programs exhibit a unique feature: multiple threads often access the same data, e.g., every thread created by for (i = 0; i < 10; i++) { thread_create( thread_function, a, i ); } reads the same array in void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; } . . . } • Solution: fetch data once, broadcast to multiple threads (MAS)

  29. Memory Access Synchronization (MAS) • 1) Identify thread groups – loops that create threads, e.g., for (i = 0; i < 100; i++) { thread_create( f, a, i ); } • 2) Identify constant memory addresses in the thread function – def-use analysis of parameters shows a is constant for all threads, so the addresses of a[0-9] are constant for the thread group • 3) Synthesis creates a "combined" memory access: data fetched once via DMA from RAM and delivered to the entire group; execution synchronized by the OS • Before MAS: 1000 memory accesses; after MAS: 100 memory accesses
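The 1000-versus-100 access counts can be reproduced with simple arithmetic. One assumption here that the slide leaves implicit: the 100 threads run as groups of 10 concurrent accelerator instances, so the shared a[0-9] is fetched once per group rather than once per thread.

```c
/* Access counting behind the slide's MAS example. Without MAS,
 * every thread fetches a[0..elems-1] itself; with MAS, each thread
 * group fetches the constant addresses once and broadcasts them.
 * The group size of 10 is an assumption, not stated on the slide. */
int mas_accesses(int threads, int group_size, int elems, int use_mas) {
    int groups = threads / group_size;
    return use_mas ? groups * elems   /* one fetch per group, broadcast */
                   : threads * elems; /* every thread fetches on its own */
}
```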

  30. Memory Access Synchronization (MAS) • Also detects overlapping memory regions – "windows" • Synthesis creates an extended "smart buffer" [Guo/Najjar FPGA04] – caches reused data, delivers windows to threads • Example: for (i = 0; i < 100; i++) { thread_create( thread_function, a, i ); } with void f( int a[], int i ) { int result; result += a[i]+a[i+1]+a[i+2]+a[i+3]; . . . } – each thread accesses different addresses, but the addresses overlap (a[0-3], a[1-4], …) • Data (A[0-103]) is streamed via DMA to the smart buffer, which delivers a window to each thread • Without smart buffer: 400 memory accesses; with smart buffer: 104 memory accesses
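The windowed counts follow the same logic: neighboring windows a[i..i+3] overlap by three elements, so streaming each distinct address once is far cheaper than per-thread fetches. (Counting only the addresses the 100 threads actually read, a[0]..a[102], gives 103; the slide's 104 counts the full A[0-103] DMA burst.)

```c
/* Access counting for overlapping windows: thread i reads
 * a[i..i+window-1]. Without the smart buffer every thread fetches
 * its whole window; with it, each distinct address is streamed
 * from RAM exactly once. */
int window_accesses(int threads, int window, int use_buffer) {
    return use_buffer ? threads + window - 1  /* distinct addresses read */
                      : threads * window;     /* per-thread fetches */
}
```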

  31. Speedups from Thread Warping • Chose benchmarks with extensive parallelism • Compared to a 4-ARM device – average 130x speedup • But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs (FPGA size ≈ 36 ARM11s) • 11x faster than a 64-core system • Simulation pessimistic; actual results likely better

  32. Warp Scenarios – warping takes time; when is it useful? • Long-running applications – scientific computing, etc. (single-execution speedup) • Recurring applications (save FPGA configurations) – common in embedded systems; might view warping as a (long) boot phase

  33. Why Dynamic? • Static compilation to FPGAs is good, but hiding the FPGA opens the technique to all software platforms • Standard languages/tools/binaries: dynamic compiling takes any language and any compiler to a binary, vs. static compiling's specialized language, specialized compiler, and netlist • Can adapt to changing workloads – smaller & more accelerators, fewer & larger accelerators, … • Can add FPGA without changing binaries – like expanding memory, or adding processors to a multiprocessor • Custom interconnections, tuned processors, …

  34. Dynamic Enables Expandable Logic Concept • Expandable Logic – warp tools detect the amount of FPGA present and invisibly adapt the application to use less/more hardware • Expandable RAM – system detects RAM at startup and improves performance invisibly

  35. Dynamic Enables Expandable Logic • Large speedups – 14x to 400x (on scientific apps) • Different apps require different amounts of FPGA • Expandable logic allows customization of single platform • User selects required amount of FPGA • No need to recompile/synthesize • Recent (Spring 2008) results vs. 3.2 GHz Intel Xeon – 2x-8x speedups • Nallatech H101-PCIXM FPGA accelerator board w/ Virtex IV LX100 FPGA. FPGA I/O mems are 8 MB SRAMs. Board connects to host processor over PCI-X bus

  36. Ongoing Work: Dynamic Coprocessor Management (CASES'08) • Multiple possible applications a1, a2, ... • Each with a pre-designed FPGA coprocessor c1, c2, ...; optional use provides speedup • The size of the FPGA is limited – how to manage the coprocessors? • Loading c2 would require removing c1 or c3. Is it worthwhile? • Depends on the pattern of future instances of a1, a2, a3 • Must make an "online" decision [Figure: app runtime on CPU alone vs. runtime with coprocessor, plus reconfiguration time]

  37. The Ski-Rental Problem • Greedy: always load – doesn't consider past apps, which may predict the future • Solution idea from the "ski-rental problem" (a popular online technique): you decide to take up skiing – should you rent skis each trip, or buy? • Popular online algorithm solution: rent until the cumulative rental cost equals the cost of buying, then buy • Guarantees you never pay more than 2x the cost of buying
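The break-even strategy from the slide can be sketched directly; whatever the number of trips, the total never exceeds twice the purchase price.

```c
/* Break-even ski-rental strategy: rent until cumulative rental
 * cost reaches the purchase price, then buy. Returns total spent. */
int ski_cost(int trips, int rent, int buy) {
    int paid = 0;
    for (int t = 0; t < trips; t++) {
        if (paid + rent >= buy)
            return paid + buy;  /* break-even reached: buy now */
        paid += rent;           /* otherwise rent this trip */
    }
    return paid;                /* quit skiing before buying */
}
```

With rent 10 and buy 100: quitting after 5 trips costs 50 (buying would have been worse), while skiing 50 trips costs 90 + 100 = 190, within the 2x bound.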

  38. Cumulative Benefit Heuristic • Maintain a cumulative time benefit for each coprocessor • Benefit of coprocessor i: tpi – tci, so cbenefit(i) = cbenefit(i) + (tpi – tci) – the time coprocessor i would have saved up to this point had it always been used for app i • Only consider loading coprocessor i if cbenefit(i) > loading_time(i) – resists loading coprocessors that are infrequent or give little speedup • Example (loading time 200 for all coprocessors): tpi = 200/100/50 and tci = 10/20/25 for a1/a2/a3, giving per-instance benefits of 190/80/25 • For Q = <a1, a1, a3, a2, a2, a1, a3>, the cumulative benefit table reads c1: 190, 380, 570; c2: 80, 160; c3: 25, 50 • Loads = <--, c1, --, --, --, --, -->: 190 !> 200, but 380 > 200 (c1 then stays loaded); c3's 25 !> 200
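The slide's example can be replayed directly: accumulate each coprocessor's per-instance benefit over Q and check when the loading condition first triggers.

```c
/* Replay of the slide's cumulative-benefit example: per-instance
 * benefits (tp_i - tc_i) of 190/80/25 for a1/a2/a3, loading time
 * 200 for every coprocessor, Q = <a1,a1,a3,a2,a2,a1,a3>.
 * Returns the step index at which coprocessor `which` (0..2) first
 * qualifies for loading (cbenefit > loading_time), or -1 if never. */
int first_load_step(int which) {
    const double benefit[3] = {190, 80, 25};
    const double load_time = 200;
    const int q[7] = {0, 0, 2, 1, 1, 0, 2};
    double cb[3] = {0, 0, 0};
    for (int k = 0; k < 7; k++) {
        cb[q[k]] += benefit[q[k]];       /* credit the app that ran */
        if (q[k] == which && cb[which] > load_time)
            return k;                     /* first time load is considered */
    }
    return -1;
}
```

This matches the slide's Loads sequence: c1 qualifies at the second a1 (380 > 200), while c2 (160) and c3 (50) never reach the 200 loading time.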

  39. Cumulative Benefit Heuristic – Replacing Coprocessors • Replacement policy: replace a subset CP of resident coprocessors only if cbenefit(i) – loading_time(i) > cbenefit(CP) • Intuition – don't replace higher-benefit coprocessors with lower ones • Greedy heuristic, maintains a sorted cumulative benefit list; time complexity is O(n) • Example (loading time 200): on the next a3 in Q, cumulative benefits are c1: 950, c2: 320, c3: 225 – c3's 225 > 200, so a load can be considered, but 225 – 200 !> 320, so DON'T replace c1 or c2

  40. Adjustment for Temporal Locality • Real application sequences exhibit temporal locality • Extend the heuristic to "fade" cumulative benefit values: multiply by f at each step, 0 <= f <= 1 (e.g., f = 0.8) • Define f proportional to reconfiguration time – a small reconfig time means we can reconfigure more freely and pay less attention to the past, so use a small f • Example: with f = 0.8, for Q = <a1,a1,...,a1,a1,a2,a3,a3,a2,a2,a3,...>, c1's cumulative benefit fades 950 → 760 → … → 249 once a1 stops recurring, while c2 and c3 accumulate
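The fading numbers on the slide follow directly from repeated multiplication by f: with f = 0.8, 950 fades to 760 after one step an app does not recur, and to about 249 after six such steps, matching the slide's 950 → 760 → … → 249.

```c
/* Temporal-locality fade: each step, a coprocessor's cumulative
 * benefit is scaled by the fade factor f (0 <= f <= 1) before the
 * just-run app's benefit is credited. This models a coprocessor
 * whose app sits idle for `idle_steps` steps. */
double faded(double cbenefit, double f, int idle_steps) {
    for (int s = 0; s < idle_steps; s++)
        cbenefit *= f;
    return cbenefit;
}
```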

  41. Experiments • Our online ACBenefit algorithm gets better results than previous online algorithms • Avg FPGA speedup: 10x; avg coprocessor gate count: 48,000; FPGA size set to 60,000 • Evaluated by total app-sequence runtime vs. FPGA reconfig time on RAW, Random, Biased, and Periodic app sequences

  42. More Dynamic Configuration: Configurable Cache Example [Zhang/Vahid/Najjar, ISCA 2003, ISVLSI 2003, TECS 2005] • One physical cache can be dynamically reconfigured into 18 different caches • Way concatenation: four-way set-associative base cache → two-way set associative → direct-mapped • Way shutdown: shut down two ways (gated-Vdd control on the bitlines) • Line concatenation: 4 physical 16-byte lines filled when line size is 32 bytes • 40% avg savings

  43. Highly-Configurable Platforms • Dynamic tuning of configurable components as well • Microprocessor: voltage/freq, RF size, branch prediction • L1 cache: total size, associativity, line size • L2 cache: total size, associativity, line size, encoding schemes • Memory: encoding schemes • Dynamically tuning the configurable components to match the currently executing application can significantly reduce power (and even improve performance)

  44. Summary • Software is no longer just "instructions" • The software elephant has a (new) tail – FPGA circuits, alongside microprocessor instructions • Warp processing potentially brings massive FPGA speedups to all of computing (desktop, embedded, scientific, …) • Patent granted Oct 2007, licensed by Intel, IBM, Freescale (via SRC) • Extensive future work: online CAD algorithms, online architectures and algorithms, ...

  45. Warp Processors – CAD-Oriented FPGA • Solution: develop a custom CAD-oriented FPGA (WCLA) • Careful simultaneous design of FPGA and CAD – FPGA features evaluated for their impact on CAD • Add architecture features for SW kernels • Enables development of fast, lean JIT FPGA compilation tools [Tool-flow figure annotates per-stage times of 1 s, <1 s, <1 s, and 10 s, with memory footprints of 0.5–3.6 MB]

  46. Warp Processors – Warp Configurable Logic Architecture (WCLA) • Need a fast, efficient coprocessor interface • Analyzed digital signal processors (DSPs) and existing coprocessors • Data address generators (DADG) and loop control hardware (LCH) – provide fast loop execution; support memory accesses with regular access patterns • Integrated 32-bit multiplier-accumulator (MAC) – frequently found within critical SW kernels; fast, single-cycle multipliers are large and require many interconnections • A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04

  47. Warp Processors – WCLA Configurable Logic Fabric • Configurable Logic Fabric (CLF) • Hundreds of existing commercial and research FPGA fabrics – most designed to balance circuit density and speed • Analyzed FPGA features to determine their impact on CAD • Designed our CLF in conjunction with the JIT FPGA compilation tools • Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs); each CLB is directly connected to a SM • Along with the SM design, this allows for lean JIT routing • A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04

  48. Warp Processors – WCLA Combinational Logic Block • Incorporates two 3-input, 2-output LUTs – equivalent to four 3-input LUTs with fixed internal routing • Allows for good-quality circuits while reducing JIT technology-mapping complexity • Provides routing resources between adjacent CLBs to support carry chains – reduces the number of nets to route • A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04

  49. Warp Processors – WCLA Switch Matrix • All nets are routed using only a single pair of channels throughout the configurable logic fabric • Each short channel is associated with a single long channel • Designed for fast, lean JIT FPGA routing • A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04

  50. Warp Processors – JIT FPGA Compilation • Flow: binary → decompilation → partitioning → RT synthesis → JIT FPGA compilation (logic synthesis → technology mapping/packing → placement → routing) → HW bitstream; binary updater produces the updated microprocessor binary
