A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Roman Lysecky, Frank Vahid*. Department of Computer Science and Engineering, University of California, Riverside. *Also with the Center for Embedded Computer Systems at UC Irvine.

Presentation Transcript


  1. A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning
     Roman Lysecky, Frank Vahid*
     Department of Computer Science and Engineering, University of California, Riverside
     *Also with the Center for Embedded Computer Systems at UC Irvine
     This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

  2. Introduction: Dynamic Software Optimization
     • Dynamic optimizations are increasingly common
       • Dynamo - Dynamic software optimizations
       • Transmeta Crusoe, Efficeon - Dynamic code morphing
       • Just In Time (JIT) Compilation - Interpreted languages
     • Advantages
       • Transparent optimizations
       • No designer effort
       • No tool restrictions
       • Adapts to actual usage
     • Drawbacks
       • Currently limited to software optimizations
       • Limited speedup (1.1x to 1.3x common)
     Lysecky, R., Vahid, F.

  3. Introduction: Hardware/Software Partitioning
     [Figure: profiler, processor running SW, critical regions moved to HW on an ASIC/FPGA]
     • Benefits
       • Speedups of 2X to 10X typical
       • Speedups of 800X possible
       • Far more potential than dynamic SW optimizations (1.2x)
       • Energy reductions of 25% to 95% typical

  4. Introduction: Traditional Hardware/Software Partitioning
     [Figure: SW, profiling, CAD tools, binary, netlist, processor, ASIC/FPGA]
     • Requires specialized CAD tools
     • Non-standard partitioning compilers

  5. Introduction: Binary Hardware/Software Partitioning
     [Figure: SW, standard compiler, binary, profiling, CAD tools, modified binary, netlist, processor, ASIC/FPGA]
     • Binary Partitioning [Stitt/Vahid ICCAD’02] [Banerjee DATE’03]
       • Partition application starting from SW binary
       • Can be desktop based
     • Advantages
       • Use any standard compiler
       • Supports any language
       • Supports multiple sources from multiple languages
       • Supports assembly/object code
       • Supports legacy code
     • Disadvantage
       • Loses some high-level information, so there may be some loss of quality

  6. Introduction: Dynamic Hardware/Software Partitioning
     [Figure: SW, standard compiler, binary, profiling, CAD, Proc., FPGA]
     • Dynamic HW/SW Partitioning
       • Embed HW/SW partitioning CAD tools on-chip
       • Feasible in era of billion-transistor chips
     • Advantages
       • Does not require any special compilers
       • Completely transparent
       • Bring benefits of HW/SW partitioning to all SW designers
     • Complements other approaches
       • Desktop CAD best from purely technical perspective
       • Dynamic opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD

  7. Introduction: Warp Processors
     [Figure: MIPS/ARM processor with I$ and D$, Profiler, Configurable Logic, Dynamic Part. Module (DPM)]
     (1) Initially execute application in software only
     (2) Profile application to determine critical regions
     (3) Partition critical regions to hardware
     (4) Program configurable logic & update software binary
     (5) Partitioned application executes faster with lower energy consumption
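Step 2 of the warp flow (profiling to find critical regions) can be illustrated with a small sketch. The slides do not say how the profiler works, so this assumes a hypothetical trace of taken branches and treats the most frequently taken backward branch as the critical loop. The function name and the trace format are illustrative only.

```python
from collections import Counter


def find_critical_region(branch_trace):
    """Return (loop_start, loop_end) for the most frequent backward branch.

    branch_trace is a hypothetical list of (source_addr, target_addr) pairs
    for taken branches. A branch whose target is at or below its source is
    assumed to close a loop.
    """
    counts = Counter(
        (target, source) for source, target in branch_trace if target <= source
    )
    if not counts:
        return None  # no loops observed, nothing to partition
    (loop_start, loop_end), _ = counts.most_common(1)[0]
    return loop_start, loop_end


# Hypothetical trace: one hot loop, one cold loop, and forward branches.
trace = [(0x120, 0x100)] * 50 + [(0x200, 0x1F0)] * 5 + [(0x40, 0x80)] * 100
critical = find_critical_region(trace)
```

In this trace the forward branch is taken most often but is ignored, since only backward branches are counted as loop candidates.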

  8. Warp Processors: Requirements & Tools
     [Figure: on-chip tool flow with Binary, Decompilation, RT & Logic Synthesis, Technology Mapping, Placement & Routing, HW; block diagram with Profiler, ARM, I$, D$, Config. Logic (WCLA), DPM]
     • Warp Processor Architecture and Tools
       • Basic configurable logic architecture
       • Efficient profiling architecture
       • On-chip CAD tools for HW/SW partitioning
         • Decompilation
         • Synthesis
         • Technology Mapping
         • Placement and Routing
     • Develop new Warp Configurable Logic Architecture (WCLA)

  9. Warp Configurable Logic Architecture: Requirements
     • Robustness
       • Capable of supporting large set of applications
     • Simplicity
       • Existing FPGAs are too complex for warp processors
       • Design goals of FPGAs much different
       • Design configurable fabric by analyzing architectural features as to their impacts on on-chip CAD tools
         • Fast execution
         • Very low data memory
         • Produce reasonable hardware circuits
     • Efficient interface to memory

  10. Warp Configurable Logic Architecture
      [Figure: Profiler, ARM, I$, D$, WCLA, DPM; the WCLA contains Reg0, Reg1, Reg2, DADG & LCH, 32-bit MAC, Configurable Logic Fabric]
      • Data address generators (DADG) and loop control hardware (LCH)
        • Found in most digital signal processors
        • Provide fast loop execution
        • Support memory accesses with regular access patterns
        • Synthesis of FSM not required for many critical loops
        • Configurable logic fabric input provides alternative control of loop execution
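A rough software model of what a data address generator and loop counter do for a regular access pattern is sketched below. The parameters `base`, `stride`, and `count` are assumptions for illustration; the slides do not describe the DADG or LCH register interface.

```python
def dadg_addresses(base, stride, count):
    """Yield `count` addresses starting at `base`, `stride` bytes apart.

    The `count` bound plays the role of the loop control hardware: the loop
    ends after a fixed number of iterations without any synthesized FSM.
    """
    for i in range(count):
        yield base + i * stride


# Walk four consecutive 32-bit words starting at a hypothetical address.
addresses = list(dadg_addresses(0x1000, 4, 4))
```

A loop that reads `a[i]` for consecutive `i` fits this pattern directly, which is why such loops need no custom control logic in the fabric.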

  11. Warp Configurable Logic Architecture
      [Figure: Profiler, ARM, I$, D$, WCLA, DPM; the WCLA contains Reg0, Reg1, Reg2, DADG & LCH, 32-bit MAC, Configurable Logic Fabric]
      • Integrated 32-bit multiplier-accumulator (MAC)
        • Multiplications are frequently found within critical loops
        • Frequently in the form of a multiply-accumulate operation
        • Fast, single-cycle multipliers are large and require many interconnections
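The multiply-accumulate pattern that the MAC targets looks like the inner loop of a dot product. The sketch below assumes a 32-bit accumulator that wraps on overflow; the slides say only that the MAC is 32-bit, so the wrap-around behavior is an assumption.

```python
MASK32 = 0xFFFFFFFF  # assumption: results are truncated to 32 bits


def mac32(acc, a, b):
    """One multiply-accumulate step: acc + a * b, truncated to 32 bits."""
    return (acc + a * b) & MASK32


def dot_product(xs, ys):
    """Typical critical-loop body: one MAC operation per element pair."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac32(acc, x, y)
    return acc


result = dot_product([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6
```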

  12. Warp Configurable Logic Architecture: Configurable Logic Fabric
      [Figure: grid of SMs and CLBs next to DADG, LCH, and 32-bit MAC]
      • Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
      • Each CLB is directly connected to a SM
      • Switch matrix connections
        • Four short wires connect adjacent SMs
        • Four long wires connect every other SM together
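The short and long wire connectivity can be modeled on a grid of switch matrices. This sketch assumes that "every other SM" means a long wire skips exactly one SM in the same row or column; the coordinate scheme and function name are illustrative only.

```python
def sm_neighbors(row, col, rows, cols):
    """Return (short, long) lists of SM coordinates reachable from (row, col).

    Short wires reach the adjacent SM in each direction. Long wires are
    assumed to reach the SM two positions away.
    """
    short, long_ = [], []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        for dist, bucket in ((1, short), (2, long_)):
            r, c = row + dist * dr, col + dist * dc
            if 0 <= r < rows and 0 <= c < cols:
                bucket.append((r, c))
    return short, long_


# Corner SM of a hypothetical 3x3 switch-matrix grid.
corner_short, corner_long = sm_neighbors(0, 0, 3, 3)
```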

  13. Warp Configurable Logic Architecture: Combinational Logic Block Design
      [Figure: CLB with inputs a to f, two LUTs, outputs o1 to o4, and connections to adjacent CLBs]
      • Several studies have analyzed the impact of LUT and CLB size on overall design area and delay
        • LUTs with 5 to 6 inputs result in best performance
        • LUTs with fewer than 3 inputs have much worse performance [Chow, et al. 1999, Singh, et al. 1992]
        • CLB cluster sizes of 3 to 20 LUTs are feasible [Marquardt, Betz, Rose 2000]

  14. Warp Configurable Logic Architecture: Combinational Logic Block Design
      [Figure: CLB with inputs a to f, two LUTs, outputs o1 to o4, and connections to adjacent CLBs]
      • Incorporate two 3-input 2-output LUTs
        • Corresponds to four 3-input LUTs
        • Allows for good quality circuit while reducing on-chip CAD tools complexity
      • Provide routing resources between adjacent CLBs to support carry chains
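A 3-input 2-output LUT is enough to hold a full adder (sum and carry-out), which hints at why carry-chain routing between adjacent CLBs is useful. The sketch below models a LUT as an 8-entry table and chains one LUT per bit. How the real CLB maps adders is not stated in the slides, so this mapping is an assumption.

```python
def make_lut3x2(fn):
    """Build an 8-entry table of 2-bit outputs from a 3-input function."""
    return [fn((i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(8)]


def lut_eval(table, a, b, c):
    """Look up the two outputs for inputs a, b, c (each 0 or 1)."""
    return table[(a << 2) | (b << 1) | c]


# Full adder: outputs (sum, carry_out) for inputs (a, b, carry_in).
FULL_ADDER = make_lut3x2(lambda a, b, cin: (a ^ b ^ cin, (a & b) | (cin & (a ^ b))))


def ripple_add(x, y, width):
    """Add two integers bit by bit, passing the carry from LUT to LUT."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = lut_eval(FULL_ADDER, (x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


sum_bits, carry_out = ripple_add(5, 3, 4)
```

With two such LUTs per CLB, each CLB could handle two bits of an adder under this mapping, with the carry passed to the neighboring CLB.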

  15. Warp Configurable Logic Architecture: Switch Matrix
      [Figure: switch matrix with short channels 0 to 3 and long channels 0L to 3L on each side]
      • Switch Matrix
        • SM connected using eight channels per side
          • Four short channels
          • Four long channels
        • Routes connect wires from different sides using the same channel
        • Each short channel is associated with a single long channel
        • Wires are routed using a single pair of channels through configurable logic fabric
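Restricting each route to one short/long channel pair keeps the router's search small: it only has to pick one of four pairs per net. A minimal sketch of that idea for a single row of SMs follows. The greedy hop planning, the two-SM long-wire span, and the `used` set are all assumptions for illustration, not the authors' routing algorithm.

```python
NUM_CHANNEL_PAIRS = 4  # four short channels, each paired with one long channel


def plan_hops(src, dst):
    """Split a run from SM index src to dst (src < dst) into wire hops."""
    hops, pos = [], src
    while pos < dst:
        span = 2 if dst - pos >= 2 else 1  # assumption: long wires span two SMs
        hops.append(("long" if span == 2 else "short", pos))
        pos += span
    return hops


def route(src, dst, used):
    """Pick the first channel pair whose wires are free along the whole path."""
    hops = plan_hops(src, dst)
    for ch in range(NUM_CHANNEL_PAIRS):
        segments = [(kind, ch, start) for kind, start in hops]
        if all(seg not in used for seg in segments):
            used.update(segments)  # reserve the wires for this net
            return ch, segments
    return None  # no single channel pair is free end to end


used = set()
first = route(0, 3, used)
second = route(0, 3, used)
```

The second net between the same SMs cannot reuse channel pair 0, so it falls through to pair 1.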

  16. Results: Benchmarks
      • Considered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone
      • Average of 53% of total software execution time was spent executing single critical loop (more speedup possible if more loops considered)
      • On average, critical loops comprised only 1% of total program size
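The note that more speedup is possible with more loops follows from Amdahl's law: when only a fraction of execution time is accelerated, the rest bounds the overall speedup. The sketch below applies the standard formula to the 53% average. Per-benchmark fractions differ, so this is illustrative arithmetic rather than a result from the talk, and the 10x loop speedup is a hypothetical input.

```python
def overall_speedup(fraction, region_speedup):
    """Amdahl's law: overall speedup when `fraction` of time runs faster."""
    return 1.0 / ((1.0 - fraction) + fraction / region_speedup)


def speedup_bound(fraction):
    """Upper bound as the accelerated region's speedup grows without limit."""
    return 1.0 / (1.0 - fraction)


bound = speedup_bound(0.53)             # one loop taking 53% of the time
example = overall_speedup(0.53, 10.0)   # hypothetical 10x faster loop
```

With 53% of the time in one loop, the bound is about 2.13x no matter how fast the hardware loop runs, so moving additional loops to hardware is the way to raise it.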

  17. Results: Experimental Setup
      [Figure: warp processor with Profiler, ARM7, I$, D$, WCLA, DPM; traditional setup with ARM7, I$, D$, Xilinx Virtex-E FPGA]
      • Warp Processor
        • 75 MHz ARM7 processor
        • Configurable logic fabric with fixed frequency of 60 MHz
        • Used dynamic partitioning CAD tools to map critical region to hardware
          • Executed on an ARM7 processor
          • Active for roughly 10 seconds to perform partitioning
      • Traditional HW/SW Partitioning
        • 75 MHz ARM7 processor
        • Xilinx Virtex-E FPGA (executing at maximum possible speed)
        • Manually partitioned software using VHDL
        • VHDL synthesized using Xilinx ISE 4.1 on desktop

  18. Results: Performance Speedup
      [Figure: speedup chart, including a data label of 4.1]
      • Average speedup of 2.1 vs. 2.2 for Virtex-E

  19. Results: Energy Reduction
      [Figure: energy reduction chart, including a data label of 74%]
      • Average energy reduction of 33% vs. 36% for Xilinx Virtex-E

  20. Context: UCR’s Research on Configurable SoCs
      Self Tuning, Self Configuring Mass Produced ICs
      [Figure: MIPS/ARM processor with I$, D$, Profiler, Cache Tuner, WCLA, Dynamic Part. Module (DPM)]
      • Efficient on-chip profiling [Gordon-Ross, Vahid]
      • Configurable cache [Zhang, Vahid, Najjar]
      • Self-tuning cache [Zhang, Vahid, Lysecky]
      • Binary decompilation, loop unrolling, alias analysis [Stitt, Vahid]
      • Lean on-chip CAD tools [Lysecky, Vahid, Tan]

  21. Conclusions & Future Work
      • Warp Configurable Logic Fabric
        • Supports wide range of embedded systems applications
        • Designed specifically to allow development of lean on-chip CAD tools
        • Provides excellent results
          • Average speedups of 2.1
          • Average energy reduction of 33%
          • Much better than dynamic software optimizations
          • One loop only; more speedup possible
          • More recent examples since DATE publication: 10x speedups
          • Working towards examples with 100x speedups
      • Future Work
        • Partitioning multiple software loops to hardware
        • Synthesizing Finite State Machines (FSMs)
        • Improved synthesis, technology mapping, and place and route
