1 / 22

Towards Optimal Custom Instruction Processors

Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT CHIPS 18. Towards Optimal Custom Instruction Processors. Overview. 1. background: extensible processors 2. design flow: C to custom processor silicon

ada
Download Presentation

Towards Optimal Custom Instruction Processors

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT CHIPS 18 Towards OptimalCustom Instruction Processors

  2. Overview 1. background: extensible processors 2. design flow: C to custom processor silicon 3. instruction selection: bandwidth/area constraints 4. application-specific processor synthesis 5. results: 3x area delay product reduction 6. current and future work + summary

  3. ALU Register File data in data out 1. Instruction-set extensible processors • base processor + custom logic • partition data-flow graphs into custom instructions

  4. Previous work • many techniques, e.g. • Atasu et al. (DAC 03) • Goodwin and Petkov (CASES 03) • Clark et al. (MICRO 03, HOT CHIPS 04) • current challenges • optimality and robustness of heuristics • complete tool chain: application to silicon • research infrastructure for custom processor design

  5. 2. Custom processor research at Imperial • focus on effective optimization techniques • e.g. Integer Linear Programming (ILP) • complete tool-chain • high-level descriptions to custom processor silicon • open infrastructure for research in • custom processor synthesis • automatic customization techniques • current tools • optimizing compiler (Trimaran) for custom CPUs • custom processor synthesis tool

  6. Application to custom processor flow Application Source (C) AreaConstraint Processor Description Generate Custom Unit ASIC Tools Template Generation Generate Base CPU Template Selection Area, Timing

  7. Custom instruction model Input Register input ports Register File Pipeline Register output ports Output Register

  8. 3. Optimal instruction identification • minimize schedule length of program data flow graphs (DFGs) • subject to constraints • convexity: ensure feasible schedules • fixed processor critical path: pipeline for multi-cycle instructions • fixed data bandwidth: limited by register file ports • steps: based on Integer Linear Pogramming (ILP) a. template generation b. template selection

  9. a. Template generation 2. Collapse template to a single DFG node 1. Solve ILP for DFG to generate a template X 3. Repeat while (objective > 0)

  10. b. Template selection • determine isomorphism classes • find templates that can be implemented using the same instruction • calculate speed-up potential of each class • solve Knapsack problem using ILP • maximize speedup within area constraint

  11. Data Bandwidth Constraints Area Constraints Optimizing compilation flow Application in C/C++ VHDL Area Synopsys Synthesis Impact Front-end Data Bandwidth Constraints Gain CDFG Formation a) Template Generation b) Template Selection MDES Generation Instruction Replacement Elcor Backend Scheduling, Reg. Allocation Assembly Code and Statistics

  12. 4. Application-specific processor synthesis • design space exploration framework • Processor Component Library • specialized structural description • prototype: MIPS integer instruction set • custom instructions • flexible micro-architecture • evaluate using actual implementation • timing and area

  13. Custom Data paths Processor synthesis flow Custom Processor from compiler • merging • add state registers • processor interface • interface • data in/out • stall control Processor Component Library • pipeline description • parameters FE FE M W EX

  14. Implementation • based on Python scripts • structural meta-language for processors • combine RTL (Verilog/VHDL) IP blocks • module generators for custom units • generate 100s of designs automatically • ASIC processor cores • complete system on FPGA: CPU + memory + I/O

  15. 5. Results • cryptography benchmarks: C source • AES decrypt, AES encrypt, DES, MD5, SHA • 4/5 stage pipelined MIPS base processor • 0.225mm2 area, 200 MHz clock speed • single issue processor • register file with 2 input ports, 1 output port • processors synthesized to 130nm library • Synopsys DC and Cadence SoC Encounter • also synthesize to Xilinx FPGA for testing

  16. AES Decryption Processor 130nm CMOS 200MHz 0.307mm2 35% area cost (mostly one instruction) 76% cycle reduction

  17. AES Decryption Processor 130nm CMOS 200MHz 0.307mm2 35% area cost (mostly one instruction) 76% cycle reduction

  18. Execution time Register file in all cases: 2 input ports, 1 output port 4 inputs, 1 output 4 inputs, 2 outputs 43% reduction 4 inputs, 4 outputs 63% reduction 4 inputs, 1 output 4 inputs, 1 output 76% reduction

  19. Timing • 48% of designs meet timing at 200MHz without manual optimization

  20. Area (for maximum speedup) 93% 42% 35% 28% 23%

  21. 6. Current and future work • support memory access in custom instructions • automate data partitioning for memory access • automate SIMD load/store instructions for state registers • use architectural techniques e.g. shadow registers • improve bandwidth without additional register file ports • study trade-offs for VLIW style • multiple register file ports • multiple issue and custom instructions • extend compiler: e.g. ILP model for cyclic graphs • adapt software pipelining for hardware

  22. Summary • complete flow from C to custom processor • automatic instruction set extension • based on integer linear programming • optimize schedule length under constraints • application-specific processor synthesis • complete flow: permits real hardware evaluation • up to 76% reduction in execution cycles • 3x area delay product reduction • max speedup: 23% to 93% area overhead

More Related