
Synthesis of Custom Processors based on Extensible Platforms

Fei Sun+, Srivaths Ravi++, Anand Raghunathan++ and Niraj K. Jha+ (+: Dept. of Electrical Engineering, Princeton University; ++: NEC Laboratories America, Inc.)



  1. Synthesis of Custom Processors based on Extensible Platforms Fei Sun+, Srivaths Ravi++, Anand Raghunathan++ and Niraj K. Jha+ +: Dept. of Electrical Engineering Princeton University ++: NEC Laboratories America, Inc.

  2. Outline • SoC design constraints • Background • Previous work in ASIP design • Xtensa platform • Manual custom instruction generation procedure • Automatic custom instruction generation flow • Experimental results • Conclusions

  3. SoC Design Constraints • Time to market • Cost • Performance • Power • Cost-performance trade-off • Flexibility • ……

  4. Comparison of Different Approaches
     Criterion          ASIC   ASIP   GPP
     Time to market     --     +      ++
     Cost               ++     +      --
     Performance        ++     +      --
     Power              ++     +      --
     Cost-performance   ++     +      --
     Flexibility        --     +      ++
     Legend: ++ Very good, + Good, -- Very bad

  5. Flexibility vs. Energy Efficiency [Figure: flexibility plotted against energy efficiency. ASIC: 500-1000 MOPS/mW; ASIP (Xtensa): 50-100 MIPS/mW; domain-specific processor (DSP): 1-10 MIPS/mW; general embedded processor (AMD-K6E): 0.1-1 MIPS/mW]

  6. Previous Work in ASIP Design • ASIP architectures and overall design methodologies: [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999] • Application-specific instruction set selection: [Choi, 1999], [Gschwind, 1999], [Arnold, 1999] • Low power ASIP design: [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001] • Commercial offerings: Xtensa, ARCtangent, Jazz, SP-5flex, Carmel

  7. Xtensa Architecture [Block diagram. Blocks shown: instruction fetch and branch logic, align and decode, instruction memory or cache and tags, data memory or cache and tags, window register file, ALU and address generation, MAC 16, coprocessor register file and execution units, designer-defined instruction execution unit, write buffer, processor interface, processor controls, memory protection unit, interrupt control, exception support, timers 1 to n, special function register access, on-chip debug with JTAG tap control and TRACE port, instruction address watch 0 to n, data address watch 0 to n. Legend: base ISA feature, configurable function, optional function, configurable & optional function, extensible (designer defined). Source: www.tensilica.com]

  8. Xtensa Processor Design Flow [Flow diagram. Inputs: processor configuration and designer-defined instruction descriptions (configuration file). Generator outputs: configured processor HDL, configured GNU C/C++ compiler, configured GNU assembler/disassembler, configured instruction set simulator/emulator, and area, power and timing estimation. Hardware steps: logic synthesis (Synopsys or Ambit), block place/route (Avant! or Cadence), timing verification, hardware profile, optimized hardware. Software steps: application source code and sample application data; application-specific compile, assemble, link; application simulation with ISS and/or emulator; software debugging/profiling; optimized software. Legend: generator output, internal database, design data, use of generated data. Source: www.tensilica.com]

  9. Manual Custom Instruction Generation Procedure • Identify potential new instructions (profile, read source code) • Describe custom instructions (understand source code) • Insert custom instructions (rewrite source code) • Verify functional correctness • Overall: slow and error-prone

  10. Contributions of Our Work • Automatic custom instruction selection • From an application program to an extensible processor with custom instructions • Features • Efficient design space search • Uses accurate information from the instruction set simulator and synthesis • Bridges the gap between automatically synthesized and manually designed architectures

  11. Automatic Custom Instruction Generation Flow

  12. Automatic Custom Instruction Generation Flow

  13. Example Illustration of Template Generation

  14. Example Illustration of Template Generation

  15. Example Illustration of Template Generation

  16. Example Illustration of Template Generation

  17. Example Illustration of Template Generation

  18. Key Observations for Pruning • The higher the weight of a template, the higher its potential for improvement (Amdahl’s law) • Scope for optimization is determined by computation: the number of cycles needed to execute the template • Scope for optimization is also limited by the read/write ports: additional cycles are needed for extra reads/writes of input/output variables

  19. Pruning Algorithm • Ranking criterion: a priority computed for each template from the following quantities • OriginalTime: fraction of the total execution time of the original program spent in the template (its weight) • In, Out: number of inputs and outputs of the template, respectively • α, β: number of inputs and outputs, respectively, that can be encoded in the instruction • γ: number of cycles needed to execute the template • Higher priority means greater potential for speedup
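The ranking formula itself is not present in this transcript. As a minimal sketch only, the following Python function shows one score that is consistent with the three pruning observations. Its form is an assumption, not the authors' criterion: the function name `template_priority`, the single-cycle custom instruction, and the one-extra-cycle cost per operand beyond α or β are all hypothetical.

```python
def template_priority(original_time, n_in, n_out, alpha, beta, gamma):
    """Assumed ranking score for a candidate template (not the slide's formula).

    original_time: fraction of total execution time spent in the template
    n_in, n_out:   number of template inputs and outputs
    alpha, beta:   inputs and outputs that fit in the instruction encoding
    gamma:         cycles the template needs without a custom instruction (> 0)
    """
    # Assumption: each operand beyond what the instruction encodes costs one
    # extra cycle to move through a user-defined register.
    extra_io_cycles = max(0, n_in - alpha) + max(0, n_out - beta)
    # Assumption: the custom instruction itself executes in a single cycle.
    custom_cycles = 1 + extra_io_cycles
    # Fraction of the template's gamma cycles that would be saved, floored at 0.
    saved_fraction = max(0.0, 1.0 - custom_cycles / gamma)
    # Amdahl-style scaling by the template's share of execution time.
    return original_time * saved_fraction
```

Under these assumptions, a template that takes a larger share of run time or more cycles scores higher, while one that needs many extra register transfers scores lower.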

  20-23. Template Generation with Pruning [Figure, animated over four slides: a ranked pool of seed templates, the highest-priority template, and the resulting template set, with threshold 0.1. Template priorities shown: 16.35, 12.73, 10.51, 7.92, 5.36, 4.05, 2.13, 1.18]
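The pruning slides suggest a best-first search over a ranked pool. The Python sketch below is one possible reading, not the authors' algorithm. It assumes that the highest-priority template is moved to the template set, that its extensions are generated by a caller-supplied `extend` function, and that an extension is discarded when its priority falls below `threshold` times the priority of the template it was grown from. The template names (`A`, `A1`, ...) and the `extend` callback are hypothetical; only the priority values come from the slides.

```python
import heapq

def generate_templates(seeds, extend, threshold=0.1, max_templates=100):
    """Best-first template growth with threshold pruning (assumed policy).

    seeds:  dict mapping template name -> priority
    extend: function(name) -> list of (child_name, child_priority)
    """
    # heapq is a min-heap, so priorities are negated to pop the highest first.
    pool = [(-prio, name) for name, prio in seeds.items()]
    heapq.heapify(pool)
    template_set = []
    while pool and len(template_set) < max_templates:
        neg_prio, name = heapq.heappop(pool)
        best = -neg_prio
        template_set.append((name, best))
        for child, child_prio in extend(name):
            # Keep only extensions that clear the relative threshold.
            if child_prio >= threshold * best:
                heapq.heappush(pool, (-child_prio, child))
    return template_set

# Hypothetical pool using the priority values that appear on the slides.
seeds = {"A": 12.73, "B": 10.51, "C": 7.92, "D": 4.05, "E": 2.13}
children = {"A": [("A1", 5.36), ("A2", 1.18), ("A3", 16.35)]}
selected = generate_templates(seeds, lambda name: children.get(name, []))
```

With a threshold of 0.1, the extension with priority 1.18 is dropped because it is below 0.1 × 12.73, while the extensions with priorities 5.36 and 16.35 re-enter the ranked pool.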

  24. No. of Templates vs. Threshold Ratio

  25. Automatic Custom Instruction Generation Flow

  26. Automatic Custom Instruction Generation Flow (Contd.)

  27. Automatic Custom Instruction Generation Flow (Contd.)

  28. Custom Instruction Insertion • Custom instructions must be inserted at appropriate places without affecting the program’s functional correctness • If a custom instruction needs extra inputs (outputs), appropriate variables must be selected to write to (read from) user-defined registers

  29. Example Illustration of Custom Instruction Insertion

  30. Example Illustration of Custom Instruction Insertion (Contd.)
     (a) Original code:
         ....
         offset = t + 1;
         for (i=0; i<100; i++) {
             j = ....
             result = offset + i * j;
         }
         ....
     (b) After custom instruction insertion:
         ....
         offset = t + 1;
         WUR(offset,0);
         for (i=0; i<100; i++) {
             j = ....
             result = CustomInstr(i,j);
         }
         ....
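To make the correctness requirement concrete, the following Python model mimics the transformation in this example. It is a sketch under stated assumptions: `WUR` and `CustomInstr` here are plain Python stand-ins for the user-register write and the custom instruction, the list `js` replaces the elided computation of `j`, and the register write is assumed to happen once before the loop because `offset` does not change inside it.

```python
# Hypothetical model of the slide's example; not Xtensa or TIE code.
user_regs = [0]  # stand-in for the user-defined register file

def WUR(value, index):
    """Model of writing `value` into user-defined register `index`."""
    user_regs[index] = value

def CustomInstr(i, j):
    """Model of the custom instruction: offset is read from user register 0."""
    return user_regs[0] + i * j

def original(t, js):
    """Version (a): the loop body computes offset + i * j directly."""
    offset = t + 1
    results = []
    for i in range(100):
        j = js[i]
        results.append(offset + i * j)
    return results

def transformed(t, js):
    """Version (b): offset is passed through a user register."""
    offset = t + 1
    WUR(offset, 0)  # extra input written once, before the loop
    results = []
    for i in range(100):
        j = js[i]
        results.append(CustomInstr(i, j))
    return results
```

Both functions return the same list for the same inputs, which is the property that the insertion step has to preserve. If `offset` were modified inside the loop, the register write would have to move inside the loop as well.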

  31. Automatic Custom Instruction Generation Flow

  32. Custom Instruction Combination Selection --- Problem Statement • Given a set of non-overlapping custom instructions, with each instruction having several versions, find a version for each instruction such that performance is maximized while area is under a certain threshold
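The problem statement resembles a multiple-choice knapsack problem. As an illustration only, the sketch below solves a tiny instance by exhaustive enumeration. The flow chart the authors actually use is not recoverable from this transcript, and the function name `select_versions`, the `(cycles_saved, area)` tuple layout, and the numbers are hypothetical.

```python
from itertools import product

def select_versions(instructions, area_budget):
    """Pick one version per instruction to maximize total cycles saved.

    instructions: list of version lists; each version is (cycles_saved, area)
    Returns (choice, gain), where choice[k] is the version index chosen for
    instruction k, or (None, None) if no combination fits the area budget.
    """
    best_choice, best_gain = None, None
    # Enumerate every combination of one version index per instruction.
    for choice in product(*(range(len(v)) for v in instructions)):
        gain = sum(instructions[k][c][0] for k, c in enumerate(choice))
        area = sum(instructions[k][c][1] for k, c in enumerate(choice))
        if area <= area_budget and (best_gain is None or gain > best_gain):
            best_choice, best_gain = choice, gain
    return best_choice, best_gain

# Two hypothetical custom instructions with two versions each.
versions = [
    [(10, 5), (14, 9)],  # instruction 0: small version, fast version
    [(6, 3), (9, 8)],    # instruction 1: small version, fast version
]
choice, gain = select_versions(versions, area_budget=13)
```

Exhaustive search grows exponentially with the number of instructions, so it is only practical for small sets. Adding a `(0, 0)` version to an instruction would model the option of leaving it out.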

  33. Custom Instruction Combination Selection --- Flow Chart

  34. Automatic Custom Instruction Generation Flow

  35. Experimental Methodology [Flow diagram. Input: C program. Tools: Aristotle, Xtensa GNU Profiler, Automatic Custom Instruction Generation, Xtensa TIE Compiler, Cross Compiler, Tensilica Processor Generator, Synopsys Design Compiler, Sente Wattwatcher, ISS. Intermediate artifacts: TIE, modified C program, custom processor (HDL description), NECCB11. Outputs: execution cycles, power, area, clock period]

  36. Experimental Results (Contd.) • Average performance improvement: 3.4X • Average energy reduction: 3.2X • Average energy*delay reduction: 12.6X • Average area increase: 1.8%

  37. Conclusions • Automatic custom instruction synthesis for ASIPs • Template generation/selection • Custom instruction insertion • Custom instruction combination selection • Experimental results • 3.4X average performance improvement • 12.6X average energy*delay reduction
