
Warp Processors: Towards Separating Function and Architecture


Presentation Transcript


1. Warp Processors: Towards Separating Function and Architecture
Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; faculty member, Center for Embedded Computer Systems, UC Irvine.
Warp Processor Ph.D. students: Roman Lysecky (Ph.D. 2004), Greg Stitt (Ph.D. 2005).
This research is supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Motorola.

2. Main Idea: Warp Processors – Dynamic HW/SW Partitioning
A warp processor couples a microprocessor (µP with I$ and D$) with an on-chip profiler, a warp configurable logic architecture, and a dynamic partitioning module (DPM). Execution proceeds in five steps:
1. Initially execute the application in software only.
2. Profile the application to determine its critical regions.
3. Partition the critical regions to hardware.
4. Program the FPGA and update the software binary.
5. The partitioned application executes faster with lower energy consumption (speed has been "warped").

3. FPGAs are Programmable
Just as a processor is programmed by loading a software binary into program memory, an FPGA is programmed by loading an FPGA binary: its bits configure the lookup tables (LUTs) inside configurable logic blocks (CLBs) and the switch matrices (SMs) that route signals between them. For example, a LUT's configuration bits determine whether its output computes "a or b" or any other function of its inputs.

4. FPGAs Do Bit Manipulation Fast
C code for bit reversal:
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Compiled to a processor binary, this becomes roughly 64 instructions (sll, srl, or, and, …) taking 32 to 128 cycles. In an FPGA, bit reversal synthesizes to wires and takes 1 cycle – a 32x-128x speedup.
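As a quick sanity check of the slide's claim, here is a minimal runnable wrapper around the same bit-reversal code; the function name and test value are ours, not part of the original slide:

#include <stdio.h>
#include <stdint.h>

/* Bit-reverses a 32-bit word using the five swap steps from the slide. */
static uint32_t bit_reverse32(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

int main(void) {
    /* Prints 0x00000001 -> 0x80000000: bit 0 moves to bit 31. */
    printf("0x%08x -> 0x%08x\n", 0x00000001u, (unsigned)bit_reverse32(0x00000001u));
    return 0;
}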

5. FPGAs Support Much Parallelism
C code for an FIR filter:
for (i = 0; i < 128; i++)
    y[i] += c[i] * x[i];
In software this takes thousands of instructions and several thousand cycles. In hardware, the FPGA performs all 128 multiplications in parallel followed by a tree of adders, finishing in roughly 7 cycles – a speedup of more than 100x.
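For reference, a self-contained version of the slide's loop is sketched below; the array sizes, test data, and surrounding main() are illustrative assumptions, not part of the original:

#include <stdio.h>

#define N 128   /* loop bound from the slide */

int main(void) {
    int x[N], c[N], y[N];
    int i;
    for (i = 0; i < N; i++) {   /* made-up test data */
        x[i] = i;
        c[i] = 1;
        y[i] = 0;
    }
    /* The kernel from the slide: each iteration is an independent
       multiply-accumulate, which is why an FPGA can do them all in parallel. */
    for (i = 0; i < N; i++)
        y[i] += c[i] * x[i];
    printf("y[0]=%d y[127]=%d\n", y[0], y[N - 1]);
    return 0;
}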

6. Why aren't FPGAs Part of Mainstream Computing?
• Benefits known for over a decade
• Hardware/software partitioning research since the early 90s: PRISM [Athanas, Silverman]; TOSCA [Balboni, Fornaciari, Sciuto]; COSYMA [Henkel, Ernst]; Vulcan [Gupta, De Micheli]; SpecSyn [Gajski, Vahid, Narayan, Gong]; etc.
• Microprocessor/FPGA architecture research since the early 90s: DISC [Wirthlin, Hutchings]; GARP [Hauser, Wawrzynek]; Chimaera [Hauck, Fry, Hosler, Kao]; MorphoSys [Lee, Singh, Lu, et al.]
• Commercial ventures for several years: Chameleon; Morphotec; Stretch [www.stretchinc.com] using Tensilica [www.tensilica.com] (April 2004)
• New Atmel, Triscend, Altera, and Xilinx devices in the past few years

7. Single-Chip Microprocessor/FPGA Platforms Appearing Commercially
(Slide shows example devices courtesy of Atmel, Altera, Triscend, and Xilinx, including Xilinx parts with embedded PowerPCs.)

8. Why Aren't FPGAs Mainstream?
• FPGAs don't fit well with the software world and its well-established languages, tools, and flows
• The concept of a standard binary is missing in the FPGA world: a special compiler produces both a modified binary for the processor and an FPGA-specific netlist, which must then go through CAD (synthesis, technology mapping, place & route)
• Thus FPGAs are limited to the CAD domain – but for every 1 CAD user there are >100 software writers (only about 15,000 CAD seats worldwide versus millions of compiler seats)

9. Standard Binary is Important
• It separates function from architecture: a standard compiler produces one binary that runs on Processor1, Processor2, or Processor3
• Tools and architectures can be developed independently
• The binary can even be dynamically translated/optimized: UQBT [Cifuentes]; Dynamo [Bala, Duesterwald, Banerjia]; Transmeta Crusoe and Efficeon [www.transmeta.com] and modern Pentiums; Java bytecode

10. Partial Solution to Bring FPGAs into Mainstream SW: Binary-Level Partitioning
• Binary-level partitioning (Stitt/Vahid, ICCAD'02); recent commercial product: Critical Blue [www.criticalblue.com]
• Partition and synthesize starting from the SW binary rather than from source, where traditional partitioning is done: a standard compiler produces the binary, and a binary partitioner – a less disruptive, back-end tool – produces a modified binary plus a netlist for CAD (synthesis, technology mapping, place & route)
• Advantages: works with any compiler, any language, multiple sources; assembly/object support; legacy code support
• Disadvantage: loses high-level information – quality loss?

11. Key to Good-Quality Binary-Level Partitioning – Decompilation
• Goal: recover the high-level information lost during compilation; otherwise synthesis results are poor
• We utilized sophisticated decompilation methods developed over past decades for binary translation, and developed additional methods specific to our purpose; some limits remain (e.g., indirect jumps)
• Decompilation flow: binary parsing → CDFG creation → control structure recovery (discover loops, if-else, etc.) → removing instruction-set overhead (reduce operation sizes, etc.) → undoing back-end compiler optimizations (reroll loops, etc.) → alias analysis (allows parallel memory access) → annotated CDFG
• How does binary-level partitioning with decompilation compare with source-level partitioning?
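To make the "reroll loops" step concrete, here is a hand-written illustration (our own construction, not tool output) of what undoing a back-end unrolling optimization looks like; the function names and data are assumptions:

#include <stdio.h>

/* What the compiler's unrolled binary effectively computes. */
static int sum_unrolled(const int *a) {
    int s = 0;
    s += a[0]; s += a[1]; s += a[2]; s += a[3];
    s += a[4]; s += a[5]; s += a[6]; s += a[7];
    return s;
}

/* The rerolled form a decompiler would recover: a compact loop whose
   body and trip count are far easier to synthesize into hardware. */
static int sum_rerolled(const int *a) {
    int s = 0;
    for (int i = 0; i < 8; i++)
        s += a[i];
    return s;
}

int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%d %d\n", sum_unrolled(a), sum_rerolled(a));  /* both print 36 */
    return 0;
}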

12. Decompilation Recovery Rate
• In most situations, we can recover all high-level information
• Recovery success for dozens of benchmarks, using several different compilers and optimization levels, is shown in the slide's chart

13. Binary-Level Partitioning vs. Source-Level (Stitt/Vahid '04, submitted)

14. Idea: Binary Partitioning Enables Dynamic Partitioning
• Embed the CAD (synthesis, technology mapping, place & route) on-chip, next to the processor and FPGA – the input remains a standard binary!
• Feasible in the era of billion-transistor chips
• Advantages: no special desktop tools; completely transparent; avoids the complexities of supporting different FPGA types
• Complements other approaches: desktop CAD is best from a purely technical perspective, but dynamic partitioning opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD
• Back to the "standard binary" – opens processor architects to the world of speedup using FPGAs

15. Warp Processors: Tools & Requirements
• Warp processor architecture: on-chip profiling architecture, configurable logic architecture, dynamic partitioning module (DPM)
• The DPM runs the on-chip tool chain – decompilation → partitioning → RT synthesis → logic synthesis → technology mapping → placement & routing → binary updater – producing a HW configuration and an updated binary
• Is a DPM with its own uP overkill? Consider that the FPGA is much bigger than a uP. Also consider that there may be dozens of uPs, but all can share one DPM.

16. Warp Processors: All that CAD on-chip?
• CAD people may first think dynamic HW/SW partitioning is "absurd"
• Those CAD tools are complex: they require long execution times on powerful desktop workstations and very large memory resources, usually including GBytes of hard drive space
• Desktop requirements per stage (decompilation, partitioning, RT synthesis, logic synthesis, technology mapping, placement, routing) range from roughly 10 MB to 60 MB of memory and from 30 seconds to 30 minutes of run time
• The cost of a complete CAD tool package can exceed $1 million
• All that on-chip?

17. Warp Processors: Tools & Requirements
• But, in fact, on-chip CAD may be practical because it is specialized
• CAD: traditional CAD handles huge, arbitrary input; warp processor CAD handles only critical SW kernels
• FPGA: traditional FPGAs target huge, arbitrary netlists, ASIC prototyping, and varied I/O; the warp processor FPGA targets only kernel speedup
• Careful simultaneous design of the FPGA and the CAD: FPGA features are evaluated for their impact on CAD, the CAD influences FPGA features, and architecture features are added for kernels

18. Warp Processors: Configurable Logic Architecture (Lysecky/Vahid, DATE'04)
• The warp configurable logic architecture combines loop support hardware, a 32-bit MAC unit, and a configurable logic fabric, fed by registers Reg0-Reg2
• Loop support hardware: data address generators (DADG) and loop control hardware (LCH), as found in digital signal processors, give fast loop execution and support memory accesses with regular access patterns, so synthesis of an FSM is not required for many critical loops
• 32-bit fast multiply-accumulate (MAC) unit

19. Warp Processors: Configurable Logic Fabric (Lysecky/Vahid, DATE'04)
• Simple fabric: an array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
• Simple CLB: two 3-input, 2-output LUTs with carry-chain support and connections to adjacent CLBs
• Simple switch matrices: 4 short and 4 long channels
• Designed for simple, fast CAD
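As an informal aid, below is a tiny behavioral model of the CLB described above (two 3-input, 2-output LUTs); the bit encoding of the configuration bytes is our own assumption, not the actual Warp fabric format:

#include <stdio.h>
#include <stdint.h>

/* One 3-input, 2-output LUT: each output has an 8-entry truth table,
   stored as one byte and indexed by the 3 input bits. */
typedef struct {
    uint8_t table_o1;
    uint8_t table_o2;
} Lut3x2;

typedef struct {
    Lut3x2 lut[2];   /* a CLB holds two such LUTs */
} Clb;

static void lut_eval(const Lut3x2 *l, int a, int b, int c, int *o1, int *o2) {
    int idx = (a << 2) | (b << 1) | c;   /* assumed input ordering */
    *o1 = (l->table_o1 >> idx) & 1;
    *o2 = (l->table_o2 >> idx) & 1;
}

int main(void) {
    /* Configure LUT 0 so o1 = a XOR b XOR c (sum) and o2 = majority(a,b,c)
       (carry) -- one bit of an adder, which the carry chain would link. */
    Clb clb = { .lut = { { .table_o1 = 0x96, .table_o2 = 0xE8 }, {0, 0} } };
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++) {
                int s, cy;
                lut_eval(&clb.lut[0], a, b, c, &s, &cy);
                printf("a=%d b=%d c=%d  sum=%d carry=%d\n", a, b, c, s, cy);
            }
    return 0;
}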

20. Warp Processors: Dynamic Partitioning Module (DPM)
• The DPM executes the on-chip partitioning tools (decompilation, partitioning, RT synthesis, logic synthesis, technology mapping, placement & routing, binary updater) alongside the main processor, profiler, and configurable logic
• It consists of a small low-power processor (an ARM7) – current SoCs can have dozens of these
• On-chip instruction and data caches
• Memory: a few megabytes

21. Warp Processors: Execution Time and Memory Requirements
The first on-chip stage, decompilation, runs in under 1 second using about 1 MB of memory, compared with desktop tool stages that take from 30 seconds to 30 minutes and 10-60 MB each.

22. Warp Processors: Dynamic Partitioning Module (DPM) – next tool in the on-chip flow: RT synthesis

23. Warp Processors: RT Synthesis (Stitt/Lysecky/Vahid, DAC'03)
• Converts the decompiled CDFG to Boolean expressions
• Maps memory accesses to our data address generator (DADG) architecture: detects read/write, the memory access pattern, and memory read/write ordering (e.g., a memory read followed by an address increment); a small behavioral sketch of the DADG follows
• Optimizes the dataflow graph: removes address calculations and loop counter/exit conditions, since loop control is handled by the loop control hardware
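The sketch below models a strided data address generator replacing the per-iteration address arithmetic that RT synthesis strips out of the dataflow graph; the struct and field names are illustrative assumptions, not the Warp hardware's interface:

#include <stdio.h>
#include <stdint.h>

/* Behavioral model of a strided data address generator: once configured with
   a base address and stride, it emits the next address on every access, so
   the synthesized datapath never computes addresses itself. */
typedef struct {
    uint32_t next;    /* address to issue on the next access */
    uint32_t stride;  /* bytes between consecutive accesses */
} Dadg;

static uint32_t dadg_next(Dadg *d) {
    uint32_t addr = d->next;
    d->next += d->stride;
    return addr;
}

int main(void) {
    Dadg d = { .next = 0x1000, .stride = 4 };   /* e.g., walking a word array */
    for (int i = 0; i < 5; i++)
        printf("access %d -> address 0x%x\n", i, (unsigned)dadg_next(&d));
    return 0;
}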

24. Warp Processors: RT Synthesis (Stitt/Lysecky/Vahid, DAC'03)
• Maps dataflow operations to hardware components; we currently support adders, comparators, shifters, Boolean logic, and multipliers (e.g., a 32-bit adder for r4 = r1 + r2 and a 32-bit comparator involving r3 and the constant 8)
• Creates a Boolean expression for each output bit of the dataflow graph, for example:
r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = …
…
(A small generator sketch follows.)
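This sketch mechanically prints the per-bit expressions of an N-bit ripple-carry adder in the style shown above; it is an illustration only, not the warp tools' internal representation, and the width is reduced for readability:

#include <stdio.h>

#define N 4   /* illustrative width; the warp fabric handles 32-bit operands */

int main(void) {
    /* Bit 0 has no carry-in. */
    printf("r4[0] = r1[0] xor r2[0]\n");
    printf("carry[0] = r1[0] and r2[0]\n");
    /* Each higher bit folds in the carry from the bit below. */
    for (int i = 1; i < N; i++) {
        printf("r4[%d] = (r1[%d] xor r2[%d]) xor carry[%d]\n", i, i, i, i - 1);
        printf("carry[%d] = (r1[%d] and r2[%d]) or (carry[%d] and (r1[%d] xor r2[%d]))\n",
               i, i, i, i - 1, i, i);
    }
    return 0;
}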

25. Warp Processors: Execution Time and Memory Requirements
With decompilation, partitioning, and RT synthesis on-chip, each of these stages runs in under 1 second using roughly 0.5-1 MB of memory, versus minutes and 10-60 MB per stage for the desktop tools.

26. Warp Processors: Dynamic Partitioning Module (DPM) – next tool in the on-chip flow: logic synthesis

27. Warp Processors: Logic Synthesis (Stitt/Lysecky/Vahid, DAC'03)
• Optimizes the hardware circuit created during RT synthesis
• There is a large opportunity for logic minimization because of the immediate values in the binary code. For example, adding the constant 4 to r1 initially yields
r2[0] = r1[0] xor 0 xor 0
r2[1] = r1[1] xor 0 xor carry[0]
r2[2] = r1[2] xor 1 xor carry[1]
r2[3] = r1[3] xor 0 xor carry[2]
…
after which the constant terms fold away (an xor with 0 disappears, an xor with 1 becomes an inversion), e.g. r2[0] = r1[0] and r2[1] = r1[1] xor carry[0]
• We utilize a simple two-level logic minimization approach

28. Warp Processors – ROCM (Lysecky/Vahid, DAC'03; Lysecky/Vahid, CODES+ISSS'03)
• ROCM – Riverside On-Chip Minimizer, a two-level minimization tool
• Utilizes a combination of approaches from Espresso-II [Brayton, et al. 1984] and Presto [Svoboda & White, 1979]
• Eliminates the need to compute the off-set, reducing memory usage: expansions are validated against the on-set and don't-care set directly (a toy sketch follows)
• Utilizes a single expand phase instead of the multiple expand/reduce/irredundant iterations
• On average only 2% larger than the optimal solution for our benchmarks
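To show the flavor of an expand pass that avoids the off-set, here is a brute-force toy of our own construction (not ROCM or Espresso-II): a cube expansion is kept only if every minterm of the enlarged cube still lies in the on-set or don't-care set.

#include <stdio.h>
#include <stdbool.h>

#define NVARS 4

/* A cube is a pair (value, mask): a set mask bit means the variable is a
   literal and must equal the corresponding value bit; a clear bit means
   the variable is absent from the cube. */
typedef struct { unsigned value, mask; } Cube;

static bool cube_contains(Cube c, unsigned minterm) {
    return (minterm & c.mask) == (c.value & c.mask);
}

static bool covered(unsigned minterm, const Cube *cover, int n) {
    for (int i = 0; i < n; i++)
        if (cube_contains(cover[i], minterm)) return true;
    return false;
}

/* True if every minterm of cube c lies in on-set or don't-care set. */
static bool valid_expansion(Cube c, const Cube *onset, int n_on,
                            const Cube *dcset, int n_dc) {
    for (unsigned m = 0; m < (1u << NVARS); m++) {
        if (!cube_contains(c, m)) continue;
        if (!covered(m, onset, n_on) && !covered(m, dcset, n_dc)) return false;
    }
    return true;
}

int main(void) {
    /* Hypothetical function: on-set cubes a b' c' d' and a b; don't care a b' c' d. */
    Cube onset[] = { {0x8, 0xF}, {0xC, 0xC} };
    Cube dcset[] = { {0x9, 0xF} };
    Cube c = onset[0];
    /* Single expand pass: try to drop each literal of the cube once. */
    for (int v = 0; v < NVARS; v++) {
        Cube trial = c;
        trial.mask &= ~(1u << v);
        if (valid_expansion(trial, onset, 2, dcset, 1)) c = trial;
    }
    printf("expanded cube: value=0x%X mask=0x%X\n", c.value, c.mask);  /* a c' */
    return 0;
}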

29. Warp Processors – ROCM Results (Lysecky/Vahid, DAC'03; Lysecky/Vahid, CODES+ISSS'03)
• ROCM executing on a 40 MHz ARM7 (Triscend A7) requires less than 1 second; the desktop comparison point is a 500 MHz Sun Ultra60
• Small code size of only 22 kilobytes
• Average data memory usage of only 1 megabyte

30. Warp Processors: Execution Time and Memory Requirements
Adding logic synthesis, the on-chip stages so far (decompilation, partitioning, RT synthesis, logic synthesis) each run in about 1 second or less using roughly 0.5-1 MB of memory, versus minutes and 10-60 MB per stage for the desktop tools.

31. Warp Processors: Dynamic Partitioning Module (DPM) – next tools in the on-chip flow: technology mapping and placement & routing

32. Warp Processors: Technology Mapping/Packing (Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03)
• ROCPAR – technology mapping/packing
• Decompose the hardware circuit into basic logic gates (AND, OR, XOR, etc.)
• Traverse the logic network, combining nodes to form single-output LUTs
• Combine LUTs with common inputs to form the final 2-output LUTs (see the sketch below)
• Pack LUTs in which the output of one LUT is an input to a second LUT
• Pack the remaining LUTs into CLBs
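Below is a hypothetical sketch of just one of these rules, pairing single-output LUTs into 2-output LUTs when their combined input sets fit within 3 inputs; the data structures and example netlist are ours, not ROCPAR's:

#include <stdio.h>
#include <stdint.h>

/* Each single-output LUT is described only by the set of nets it reads,
   encoded as a bitmask over (hypothetical) net ids 0..31. */
typedef struct { uint32_t in_mask; int partner; } Lut;

static int popcount32(uint32_t x) { int c = 0; while (x) { x &= x - 1; c++; } return c; }

int main(void) {
    Lut luts[] = { {0x07, -1}, {0x0B, -1}, {0x03, -1}, {0x70, -1} };  /* example input sets */
    int n = 4;
    /* Greedy pairing: two LUTs may share one physical 3-input, 2-output LUT
       if the union of their input nets has at most 3 members. */
    for (int i = 0; i < n; i++) {
        if (luts[i].partner >= 0) continue;
        for (int j = i + 1; j < n; j++) {
            if (luts[j].partner >= 0) continue;
            if (popcount32(luts[i].in_mask | luts[j].in_mask) <= 3) {
                luts[i].partner = j;
                luts[j].partner = i;
                break;
            }
        }
    }
    for (int i = 0; i < n; i++)
        printf("LUT %d -> partner %d\n", i, luts[i].partner);  /* -1 means unpaired */
    return 0;
}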

33. Warp Processors: Placement (Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03)
• ROCPAR – placement
• Identify the critical path and place its CLBs in the center of the configurable logic fabric
• Use the dependencies between the remaining CLBs to determine their placement, attempting to use adjacent-cell routing whenever possible (a rough sketch follows)
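The following is a hypothetical sketch of the heuristic's first steps, placing critical-path CLBs down the middle of the fabric and dropping each remaining CLB next to a CLB it depends on; grid size, CLB ids, and dependencies are invented for illustration:

#include <stdio.h>

#define ROWS 5
#define COLS 5

int grid[ROWS][COLS];            /* 0 = empty, otherwise CLB id */
int placed_r[32], placed_c[32];

static int try_adjacent(int id, int dep) {
    int dr[] = {0, 0, 1, -1}, dc[] = {1, -1, 0, 0};
    for (int k = 0; k < 4; k++) {
        int r = placed_r[dep] + dr[k], c = placed_c[dep] + dc[k];
        if (r >= 0 && r < ROWS && c >= 0 && c < COLS && grid[r][c] == 0) {
            grid[r][c] = id; placed_r[id] = r; placed_c[id] = c;
            return 1;
        }
    }
    return 0;
}

int main(void) {
    int critical[] = {1, 2, 3};              /* CLB ids on the critical path */
    int others[][2] = {{4, 2}, {5, 2}};      /* {CLB id, CLB it depends on} */
    /* Step 1: critical-path CLBs go down the center column. */
    for (int i = 0; i < 3; i++) {
        int r = ROWS / 2 - 1 + i, c = COLS / 2;
        grid[r][c] = critical[i]; placed_r[critical[i]] = r; placed_c[critical[i]] = c;
    }
    /* Step 2: remaining CLBs are placed adjacent to the CLB they depend on. */
    for (int i = 0; i < 2; i++)
        if (!try_adjacent(others[i][0], others[i][1]))
            printf("CLB %d: no adjacent cell free\n", others[i][0]);
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++) printf("%2d ", grid[r][c]);
        printf("\n");
    }
    return 0;
}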

34. Warp Processors: Execution Time and Memory Requirements
With technology mapping and placement added, every on-chip stage so far runs in about 1 second or less using roughly 0.5-1 MB of memory, versus minutes and 10-60 MB per stage for the desktop tools.

35. Warp Processors: Routing
• Routing finds a path within the FPGA connecting the source and sinks of each net
• VPR – Versatile Place and Route [Betz, et al., 1997]: a modified Pathfinder algorithm that allows overuse each iteration, updates costs, and rips up and reroutes illegal or congested nets until done – but it may require a big routing resource graph
• Riverside On-Chip Router (ROCR): represents routing of nets between CLBs as routing between switch matrices; the resource graph's nodes are SMs and its edges are the short/long channels between SMs, so it is much smaller and of fixed size (based on the number of SMs)
• A greedy, depth-first algorithm routes nets between SMs (a simplified sketch follows)
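Here is a minimal sketch of greedy depth-first routing between switch matrices over a simple adjacency graph; it is not ROCR itself, and the grid, edge list, and net endpoints are assumptions for illustration:

#include <stdio.h>
#include <stdbool.h>

#define MAX_SM 16

int n_sm = 6;
int adj[MAX_SM][MAX_SM];   /* 1 if a channel connects two switch matrices */
bool visited[MAX_SM];
int path[MAX_SM], path_len;

/* Depth-first search from the current SM toward the sink SM. */
static bool dfs_route(int cur, int sink) {
    visited[cur] = true;
    path[path_len++] = cur;
    if (cur == sink) return true;
    for (int next = 0; next < n_sm; next++)
        if (adj[cur][next] && !visited[next] && dfs_route(next, sink))
            return true;
    path_len--;            /* dead end: back up */
    return false;
}

int main(void) {
    /* A 2x3 grid of SMs: 0-1-2 on top, 3-4-5 below. */
    int edges[][2] = {{0,1},{1,2},{3,4},{4,5},{0,3},{1,4},{2,5}};
    for (int i = 0; i < 7; i++)
        adj[edges[i][0]][edges[i][1]] = adj[edges[i][1]][edges[i][0]] = 1;
    if (dfs_route(0, 5)) {
        printf("net routed through SMs:");
        for (int i = 0; i < path_len; i++) printf(" %d", path[i]);
        printf("\n");
    }
    return 0;
}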

36. Warp Processors: Routing Results (Lysecky/Vahid/Tan, DAC'04)
• On average 10x faster than VPR (timing-driven); up to 21x faster for ex5p
• Memory usage of only 3.6 MB – 13x less than VPR

37. Warp Processors: Routing – Critical Path Results (Lysecky/Vahid/Tan, DAC'04)
• 32% longer critical path than VPR (timing-driven)
• 10% shorter critical path than VPR (routability-driven)

38. Warp Processors: Execution Time and Memory Requirements
With routing added (about 10 seconds and 3.6 MB of memory), the complete on-chip tool chain runs in seconds using a few MB of memory, versus minutes and 10-60 MB per stage for the desktop tools.

39. Warp Processors: Experimental Setup
• Warp processor: embedded microprocessor; configurable logic fabric clocked at 80% of the microprocessor's frequency; based on a commercial platform (Triscend A7); the dynamic partitioning module maps the critical region to hardware; executed on a 75 MHz ARM7 processor; the DPM is active for ~10 seconds; key tools automated, some tasks assisted by hand
• Versus traditional HW/SW partitioning: embedded microprocessor plus a Xilinx Virtex-E FPGA (maximum possible speed); software manually partitioned using VHDL; VHDL synthesized using Xilinx ISE 4.1 on a desktop

40. Warp Processors: Initial Results – Speedup (Critical Region/Loop)
Average loop speedup of 29x.

41. Warp Processors: Initial Results – Speedup (overall application, with only one loop sped up)
Average speedup of 2.1x, versus 2.2x for the Virtex-E approach (ISE 4.1).

42. Warp Processors: Initial Results – Energy Reduction (overall application, only one loop sped up)
Average energy reduction of 33%, versus 36% for the Xilinx Virtex-E.

43. Warp Processors: Execution Time and Memory Requirements (on a PC)
Xilinx ISE: 9.1 s (with manually performed steps) and 60 MB of memory. ROCPAR: 0.2 s and 3.6 MB – a 46x improvement in execution time. On a 75 MHz ARM7, ROCPAR takes only 1.4 s.

44. Multi-processor Platforms
• Multiple processors can share a single DPM, time-multiplexed among them
• The DPM is just another processor whose task is to help the other processors
• The processors can even be soft cores in the FPGA
• The DPM can even revisit the same application in case its usage or data has changed

45. Warp Processing Summary
The flow again: (1) initially execute the application in software only; (2) profile it to determine critical regions; (3) partition the critical regions to hardware; (4) program the configurable logic and update the software binary; (5) the partitioned application executes faster with lower energy consumption (speed has been "warped").
• Kernels sped up 29x on average, over 100x in some cases, with corresponding energy savings
• Standard binary: no tool impact, which makes the FPGA usable with any existing software environment
• Currently investigating applications: embedded (with Motorola and Xilinx); desktop and mainframe (with Philips and IBM)

46. Publications & Acknowledgements
All these publications are available at http://www.cs.ucr.edu/~vahid/pubs
• Dynamic FPGA Routing for Just-in-Time FPGA Compilation, R. Lysecky, F. Vahid, S. Tan, Design Automation Conference (DAC), 2004.
• A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning, R. Lysecky and F. Vahid, Design, Automation and Test in Europe Conference (DATE), February 2004.
• Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware, A. Gordon-Ross and F. Vahid, ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003; to appear in the special issue "Best of CASES/MICRO" of IEEE Trans. on Comp.
• A Codesigned On-Chip Logic Minimizer, R. Lysecky and F. Vahid, ACM/IEEE CODES+ISSS Conference, 2003.
• Dynamic Hardware/Software Partitioning: A First Approach, G. Stitt, R. Lysecky and F. Vahid, Design Automation Conference (DAC), 2003.
• On-Chip Logic Minimization, R. Lysecky and F. Vahid, Design Automation Conference (DAC), 2003.
• The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic, G. Stitt and F. Vahid, IEEE Design & Test of Computers, November/December 2002.
• Hardware/Software Partitioning of Software Binaries, G. Stitt and F. Vahid, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2002.
We gratefully acknowledge financial support from the National Science Foundation and the Semiconductor Research Corporation for this work. We also appreciate the collaborations and support from Motorola, Triscend, Philips and Xilinx.
