A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction: Dynamic Software Optimization
• Dynamic optimizations are increasingly common
  • Dynamo - Dynamic software optimizations
  • Transmeta Crusoe, Efficeon - Dynamic code morphing
  • Just In Time (JIT) Compilation - Interpreted languages
• Advantages
  • Transparent optimizations
  • No designer effort
  • No tool restrictions
  • Adapts to actual usage
• Drawbacks
  • Currently limited to software optimizations
  • Limited speedup (1.1x to 1.3x common)
Lysecky, R., Vahid, F.

Introduction: Hardware/Software Partitioning
[Figure: a profiler identifies critical regions; software (SW) runs on the processor and the critical regions run as hardware (HW) on an ASIC/FPGA]
• Benefits
  • Speedups of 2x to 10x typical
  • Speedups of 800x possible
  • Far more potential than dynamic SW optimizations (1.2x)
  • Energy reductions of 25% to 95% typical

Introduction: Traditional Hardware/Software Partitioning
[Figure: SW, profiling, and CAD tools produce a binary for the processor and a netlist for the ASIC/FPGA]
• Requires specialized CAD tools
• Non-standard partitioning compilers

Introduction: Binary Hardware/Software Partitioning
[Figure: SW, standard compiler, binary, profiling, and CAD tools produce a modified binary for the processor and a netlist for the ASIC/FPGA]
• Binary Partitioning [Stitt/Vahid ICCAD’02] [Banerjee DATE’03]
  • Partition application starting from SW binary
  • Can be desktop based
• Advantages
  • Use any standard compiler
  • Supports any language
  • Supports multiple sources from multiple languages
  • Supports assembly/object code
  • Supports legacy code
• Disadvantage
  • Loses some high-level information, so some loss of quality is possible

Introduction: Dynamic Hardware/Software Partitioning
[Figure: SW, standard compiler, and binary; profiling and CAD run on-chip alongside the processor and FPGA]
• Dynamic HW/SW Partitioning
  • Embed HW/SW partitioning CAD tools on-chip
  • Feasible in era of billion-transistor chips
• Advantages
  • Does not require any special compilers
  • Completely transparent
  • Brings benefits of HW/SW partitioning to all SW designers
• Complements other approaches
  • Desktop CAD is best from a purely technical perspective
  • Dynamic opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD

Introduction: Warp Processors
[Figure: warp processor block diagram with profiler, MIPS/ARM, I$, D$, configurable logic, and Dynamic Part. Module (DPM)]
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption
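Step 2 can be illustrated with a small sketch. The slides do not describe how the profiler works internally, so the approach below (counting taken backward branches in a trace and treating the most frequent targets as loop heads) is an assumption chosen for illustration. The function name `find_critical_loops` and the trace format are hypothetical.

```python
from collections import Counter

def find_critical_loops(trace, top_n=1):
    """Return the top_n most frequent backward-branch targets.

    trace: iterable of (branch_pc, target_pc) pairs for taken branches.
    A taken branch whose target is at or before the branch address is
    treated as a loop back-edge (an assumption, not from the slides).
    """
    counts = Counter()
    for branch_pc, target_pc in trace:
        if target_pc <= branch_pc:
            counts[target_pc] += 1
    return counts.most_common(top_n)

# Hypothetical trace: one hot loop, one cold loop, some forward branches.
trace = [(0x40, 0x20)] * 100 + [(0x80, 0x60)] * 5 + [(0x10, 0x30)] * 50
hot = find_critical_loops(trace)
```

Here `hot` holds the loop head at address 0x20 with its 100 back-edge hits; the forward branches are ignored.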

Warp Processors: Requirements & Tools
[Figure: on-chip CAD flow from binary to HW (decompilation, RT & logic synthesis, technology mapping, placement & routing); warp processor block diagram with profiler, ARM, I$, D$, config. logic (WCLA), and DPM]
• Warp Processor Architecture and Tools
  • Basic configurable logic architecture
  • Efficient profiling architecture
  • On-chip CAD tools for HW/SW partitioning
    • Decompilation
    • Synthesis
    • Technology Mapping
    • Placement and Routing
• Develop new Warp Configurable Logic Architecture (WCLA)

Warp Configurable Logic Architecture: Requirements
• Robustness
  • Capable of supporting large set of applications
• Simplicity
  • Existing FPGAs are too complex for warp processors
  • Design goals of FPGAs are much different
  • Design configurable fabric by analyzing how architectural features impact on-chip CAD tools
    • Fast execution
    • Very low data memory
    • Produce reasonable hardware circuits
• Efficient interface to memory

Warp Configurable Logic Architecture
[Figure: warp processor block diagram (profiler, ARM, I$, D$, DPM) with WCLA detail: DADG & LCH, Reg0, Reg1, Reg2, 32-bit MAC, configurable logic fabric]
• Data address generators (DADG) and loop control hardware (LCH)
  • Found in most digital signal processors
  • Provide fast loop execution
  • Support memory accesses with regular access patterns
  • Synthesis of FSM not required for many critical loops
  • Inputs from the configurable logic fabric provide alternative control of loop execution
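A minimal software model of what a data address generator does for a regular access pattern is sketched below. The parameters `base`, `stride`, and `count` are hypothetical; the slides do not give the DADG's actual configuration registers or interface.

```python
def dadg(base, stride, count):
    """Yield `count` addresses starting at `base`, stepping by `stride`.

    Models a data address generator paired with loop control hardware:
    the iteration count plays the role of the loop counter, so no
    separate FSM is needed to walk an array with a fixed stride.
    Parameter names are illustrative, not taken from the WCLA design.
    """
    addr = base
    for _ in range(count):
        yield addr
        addr += stride

# Walk four 32-bit words starting at a hypothetical base address.
addresses = list(dadg(0x1000, 4, 4))
```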

Warp Configurable Logic Architecture
[Figure: warp processor block diagram (profiler, ARM, I$, D$, DPM) with WCLA detail: DADG & LCH, Reg0, Reg1, Reg2, 32-bit MAC, configurable logic fabric]
• Integrated 32-bit multiplier-accumulator (MAC)
  • Multiplications are frequently found within critical loops
  • Frequently in the form of a multiply-accumulate operation
  • Fast, single-cycle multipliers are large and require many interconnections
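The multiply-accumulate pattern the MAC targets can be sketched in a few lines. The wraparound to 32 bits is an assumption made for this sketch; the slides state only that the MAC is 32-bit and do not specify overflow behavior.

```python
MASK32 = 0xFFFFFFFF

def mac(acc, a, b):
    """One multiply-accumulate step, truncated to 32 bits (assumed)."""
    return (acc + a * b) & MASK32

def dot_product(xs, ys):
    """A typical critical loop: one MAC operation per iteration."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc

result = dot_product([1, 2, 3], [4, 5, 6])
```

Here `result` is 32 (1*4 + 2*5 + 3*6). A loop of this shape is the kind of kernel where a dedicated MAC avoids building a large multiplier out of LUTs.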

Warp Configurable Logic Architecture: Configurable Logic Fabric
[Figure: DADG, LCH, and 32-bit MAC next to the configurable logic fabric, shown as an array of CLBs surrounded by SMs]
• Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
• Each CLB is directly connected to a SM
• Switch matrix connections
  • Four short wires connect adjacent SMs
  • Four long wires connect every other SM together

Warp Configurable Logic Architecture: Combinational Logic Block Design
[Figure: CLB with inputs a to f, two LUTs, outputs o1 to o4, and connections to adjacent CLBs]
• Several studies have analyzed the impact of LUT and CLB size on overall design area and delay
  • LUTs with 5 to 6 inputs result in best performance
  • LUTs with fewer than 3 inputs have much worse performance [Chow, et al. 1999, Singh, et al. 1992]
  • CLB cluster sizes of 3 to 20 LUTs are feasible [Marquardt, Betz, Rose 2000]

Warp Configurable Logic Architecture: Combinational Logic Block Design
[Figure: CLB with inputs a to f, two LUTs, outputs o1 to o4, and connections to adjacent CLBs]
• Incorporate two 3-input 2-output LUTs
  • Corresponds to four 3-input LUTs
  • Allows for good-quality circuits while reducing on-chip CAD tool complexity
• Provide routing resources between adjacent CLBs to support carry chains
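To illustrate why a 3-input 2-output LUT pairs naturally with carry routing between neighbors, the sketch below configures such a LUT as a full adder (sum and carry-out) and chains copies into a ripple-carry adder. This is an illustration only: the slides do not say how the LUTs are configured, and all names here are hypothetical.

```python
def make_lut(fn):
    """Build the 8-entry truth table of a 3-input, 2-output function."""
    return [fn((i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(8)]

def full_adder(a, b, cin):
    """Return (sum, carry_out) for three input bits."""
    total = a + b + cin
    return (total & 1, total >> 1)

FULL_ADDER_LUT = make_lut(full_adder)

def lut_eval(lut, a, b, c):
    """Look up both outputs for input bits a, b, c."""
    return lut[(a << 2) | (b << 1) | c]

def ripple_add(x, y, width):
    """Add two width-bit values using one LUT per bit position.

    The carry passed from one iteration to the next models the
    carry-chain routing between adjacent blocks.
    """
    carry = 0
    result = 0
    for bit in range(width):
        s, carry = lut_eval(FULL_ADDER_LUT, (x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry

total, carry_out = ripple_add(5, 3, 4)
```

With 4-bit operands 5 and 3, `total` is 8 and `carry_out` is 0. Each bit position consumes one 2-output LUT, so a single CLB as described would cover two bits of the adder.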

Warp Configurable Logic Architecture: Switch Matrix
[Figure: switch matrix with short channels 0 to 3 and long channels 0L to 3L on each side]
• Switch Matrix
  • SM connected using eight channels per side
    • Four short channels
    • Four long channels
  • Routes connect wires from different sides using the same channel
  • Each short channel is associated with a single long channel
  • Wires are routed using a single pair of channels through configurable logic fabric
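The routing restriction can be modeled as a simple legality check. This sketch rests on one interpretation of the bullets: a connection inside a switch matrix joins wires on two different sides that share a channel index, and short channel i is paired with long channel iL. Whether a short wire may connect directly to its paired long wire is an assumption here, and the side names and tuple layout are hypothetical.

```python
def channel_pair(index):
    """Return the (short, long) channel names that share an index."""
    return (str(index), f"{index}L")

def can_connect(src, dst):
    """Check whether two wires may be joined inside one switch matrix.

    src and dst are (side, index, kind) tuples, with kind either
    "short" or "long". Under the assumed rule, the sides must differ
    and the channel indices must match; the kinds may differ because
    short i and long iL form one pair.
    """
    src_side, src_index, _ = src
    dst_side, dst_index, _ = dst
    return src_side != dst_side and src_index == dst_index

ok = can_connect(("north", 2, "short"), ("east", 2, "short"))
bad = can_connect(("north", 2, "short"), ("south", 3, "short"))
```

Restricting a route to a single channel pair shrinks the search space for the on-chip router, which fits the stated goal of lean CAD tools.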

Results: Benchmarks
• Considered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone
• On average, 53% of total software execution time was spent executing a single critical loop (more speedup possible if more loops considered)
• On average, critical loops comprised only 1% of total program size
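Amdahl's law gives a rough feel for what moving a single loop to hardware can buy. The sketch below applies it to the 53% average; this is a back-of-the-envelope aid, not the authors' analysis, and because the fraction varies per benchmark, individual benchmarks can exceed the figure computed from the average.

```python
def overall_speedup(fraction, region_speedup):
    """Amdahl's law: speedup when `fraction` of runtime is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / region_speedup)

def speedup_bound(fraction):
    """Limit of overall_speedup as the region speedup grows without bound."""
    return 1.0 / (1.0 - fraction)

# With 53% of time in the critical loop, the limit is 1 / 0.47, about 2.13.
bound = speedup_bound(0.53)
```

A benchmark that spends a larger share of its time in the loop has a higher ceiling, which is why partitioning more loops is expected to raise the attainable speedup.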

Results: Experimental Setup
[Figure: warp processor (profiler, ARM7, I$, D$, WCLA, DPM) compared with ARM7 (I$, D$) plus Xilinx Virtex-E FPGA]
• Warp Processor
  • 75 MHz ARM7 processor
  • Configurable logic fabric with fixed frequency of 60 MHz
  • Used dynamic partitioning CAD tools to map critical region to hardware
    • Executed on an ARM7 processor
    • Active for roughly 10 seconds to perform partitioning
• Traditional HW/SW Partitioning
  • 75 MHz ARM7 processor
  • Xilinx Virtex-E FPGA (executing at maximum possible speed)
  • Manually partitioned software using VHDL
    • VHDL synthesized using Xilinx ISE 4.1 on desktop

Results: Performance Speedup
[Figure: per-benchmark speedup chart; one data label reads 4.1]
Average speedup of 2.1 vs. 2.2 for Virtex-E

Results: Energy Reduction
[Figure: per-benchmark energy reduction chart; one data label reads 74%]
Average energy reduction of 33% vs. 36% for Xilinx Virtex-E

Context: UCR’s Research on Configurable SoCs
Self Tuning, Self Configuring Mass Produced ICs
[Figure: SoC block diagram with profiler, MIPS/ARM, I$, D$, cache tuner, WCLA, and Dynamic Part. Module (DPM)]
• Efficient on-chip profiling [Gordon-Ross, Vahid]
• Configurable cache [Zhang, Vahid, Najjar]
• Self-tuning cache [Zhang, Vahid, Lysecky]
• Binary decompilation, loop unrolling, alias analysis [Stitt, Vahid]
• Lean on-chip CAD tools [Lysecky, Vahid, Tan]

Conclusions & Future Work
• Warp Configurable Logic Fabric
  • Supports wide range of embedded systems applications
  • Designed specifically to allow development of lean on-chip CAD tools
  • Provides excellent results
    • Average speedups of 2.1
    • Average energy reduction of 33%
    • Much better than dynamic software optimizations
    • One loop only; more speedup possible
  • More recent examples since DATE publication: 10x speedups
  • Working towards examples with 100x speedups
• Future Work
  • Partitioning multiple software loops to hardware
  • Synthesizing Finite State Machines (FSMs)
  • Improved synthesis, technology mapping, and place and route
