Towards Optimal Custom Instruction Processors


Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT CHIPS 18. Towards Optimal Custom Instruction Processors. Overview. 1. background: extensible processors 2. design flow: C to custom processor silicon

Presentation Transcript
Wayne Luk

Kubilay Atasu, Rob Dimond and Oskar Mencer

Department of Computing

Imperial College London

HOT CHIPS 18

Towards Optimal Custom Instruction Processors
Overview

1. background: extensible processors

2. design flow: C to custom processor silicon

3. instruction selection: bandwidth/area constraints

4. application-specific processor synthesis

5. results: 3x area delay product reduction

6. current and future work + summary

1. Instruction-set extensible processors
  • base processor + custom logic
    • partition data-flow graphs into custom instructions

[Figure: block diagram. Labels: Register File; ALU; data in; data out]
Previous work
  • many techniques, e.g.
    • Atasu et al. (DAC 03)
    • Goodwin and Petkov (CASES 03)
    • Clark et al. (MICRO 03, HOT CHIPS 04)
  • current challenges
    • optimality and robustness of heuristics
    • complete tool chain: application to silicon
    • research infrastructure for custom processor design
2. Custom processor research at Imperial
  • focus on effective optimization techniques
    • e.g. Integer Linear Programming (ILP)
  • complete tool-chain
    • high-level descriptions to custom processor silicon
  • open infrastructure for research in
    • custom processor synthesis
    • automatic customization techniques
  • current tools
    • optimizing compiler (Trimaran) for custom CPUs
    • custom processor synthesis tool
Application to custom processor flow

[Figure: flow diagram. Labels: Application Source (C); Area Constraint; Processor Description; Template Generation; Template Selection; Generate Custom Unit; Generate Base CPU; ASIC Tools; Area, Timing]

Custom instruction model

[Figure: custom instruction datapath. Labels: Register File; input ports; Input Register; Pipeline Register; Output Register; output ports]

3. Optimal instruction identification
  • minimize schedule length of program data flow graphs (DFGs)
  • subject to constraints
    • convexity: ensure feasible schedules
    • fixed processor critical path: pipeline for multi-cycle instructions
    • fixed data bandwidth: limited by register file ports
  • steps: based on Integer Linear Programming (ILP)
    • a. template generation
    • b. template selection
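
The convexity and data bandwidth constraints above can be illustrated with a small check on a candidate node set. This is a minimal sketch, not the authors' ILP formulation: the DFG, the node names, and the helper functions `reachable`, `is_convex`, and `io_counts` are all hypothetical. A set is treated as convex when no path leaves it and re-enters it, and its inputs and outputs are counted against the register file port limits.

```python
def reachable(graph, start):
    """Return every node reachable from `start` by following edges."""
    seen, stack = set(), list(graph[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


def is_convex(graph, subset):
    """True if no path leaves `subset` and later re-enters it."""
    subset = set(subset)
    for u in subset:
        for outside in reachable(graph, u) - subset:
            if reachable(graph, outside) & subset:
                return False
    return True


def io_counts(graph, subset):
    """Count values entering the subset and values leaving it."""
    subset = set(subset)
    inputs = [u for u in graph if u not in subset and subset & set(graph[u])]
    outputs = [u for u in subset if set(graph[u]) - subset]
    return len(inputs), len(outputs)


# Hypothetical DFG: node -> list of consumers.
# a, b, c, d are live-in values; "store" consumes the final result.
dfg = {
    "a": ["add"], "b": ["add"], "c": ["mul"], "d": ["sub"],
    "add": ["mul", "sub"], "mul": ["xor"], "sub": ["xor"],
    "xor": ["store"], "store": [],
}

print(is_convex(dfg, {"add", "mul", "sub", "xor"}))  # True
print(is_convex(dfg, {"add", "mul", "xor"}))         # False: add -> sub -> xor re-enters
print(io_counts(dfg, {"add", "mul", "sub", "xor"}))  # (4, 1)
```

In this toy graph the set {add, mul, xor} is rejected because the result of `add` flows through `sub`, which is outside the set, before returning to `xor`; such an instruction could not be scheduled as a single unit.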

a. Template generation

1. Solve ILP for DFG to generate a template

2. Collapse template to a single DFG node

3. Repeat while (objective > 0)
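
The iteration above can be sketched as a loop that repeatedly extracts the best remaining template. This sketch makes several assumptions that are not in the slides: exhaustive subset search stands in for the ILP solver, used nodes are simply removed from the candidate pool instead of being collapsed into one node, each operation costs one cycle in software, and the hardware delays, clock period, and function names are hypothetical.

```python
from itertools import combinations


def reachable(graph, start):
    seen, stack = set(), list(graph[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


def feasible(graph, subset, max_in, max_out):
    """Convexity plus input/output port limits."""
    for u in subset:
        for outside in reachable(graph, u) - subset:
            if reachable(graph, outside) & subset:
                return False
    n_in = sum(1 for u in graph if u not in subset and subset & set(graph[u]))
    n_out = sum(1 for u in subset if set(graph[u]) - subset)
    return n_in <= max_in and n_out <= max_out


def cycles_saved(graph, subset, delay_ps, period_ps):
    """Software cycles (one per op) minus pipelined hardware cycles."""
    longest = {}

    def path(n):
        if n not in longest:
            inner = [path(s) for s in graph[n] if s in subset]
            longest[n] = delay_ps[n] + max(inner, default=0)
        return longest[n]

    critical = max(path(n) for n in subset)
    hw_cycles = -(-critical // period_ps)  # integer ceiling
    return len(subset) - hw_cycles


def generate_templates(graph, delay_ps, period_ps, max_in, max_out):
    free, templates = set(delay_ps), []
    while True:
        best, best_gain = None, 0
        for k in range(2, len(free) + 1):
            for combo in combinations(sorted(free), k):
                s = set(combo)
                if feasible(graph, s, max_in, max_out):
                    g = cycles_saved(graph, s, delay_ps, period_ps)
                    if g > best_gain:
                        best, best_gain = s, g
        if best is None:          # objective is no longer > 0
            return templates
        templates.append((best, best_gain))
        free -= best              # stand-in for collapsing the template


dfg = {
    "a": ["add"], "b": ["add"], "c": ["mul"], "d": ["sub"],
    "add": ["mul", "sub"], "mul": ["xor"], "sub": ["xor"],
    "xor": ["store"], "store": [],
}
delays = {"add": 1500, "mul": 3000, "sub": 1500, "xor": 500}  # hypothetical, in ps

# 5000 ps period corresponds to the 200 MHz clock quoted in the results.
print(generate_templates(dfg, delays, 5000, max_in=4, max_out=1))
print(generate_templates(dfg, delays, 5000, max_in=2, max_out=1))  # []
```

With four inputs allowed, all four operations fit in one single-cycle template and save three cycles; with only two inputs, no multi-node template is feasible in this toy graph, which illustrates why the data bandwidth constraint matters.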

b. Template selection
  • determine isomorphism classes
    • find templates that can be implemented using the same instruction
    • calculate speed-up potential of each class
  • solve Knapsack problem using ILP
    • maximize speedup within area constraint
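
The selection step is a 0/1 knapsack: each isomorphism class has an area cost and a speed-up potential, and the total area must stay within the constraint. The sketch below solves it with dynamic programming instead of the ILP named on the slide, purely to avoid a solver dependency. The class names, areas (in arbitrary integer units), and cycle savings are invented for illustration, as is the function `select_templates`.

```python
def select_templates(classes, area_budget):
    """0/1 knapsack over (name, area, cycles_saved) tuples.

    Areas must be non-negative integers. Returns (total_saved, chosen_names).
    """
    # best[a] holds the best (saved, names) achievable with area <= a.
    best = [(0, [])] * (area_budget + 1)
    for name, area, saved in classes:
        updated = list(best)
        for a in range(area, area_budget + 1):
            candidate = best[a - area][0] + saved
            if candidate > updated[a][0]:
                updated[a] = (candidate, best[a - area][1] + [name])
        best = updated
    return best[area_budget]


# Hypothetical isomorphism classes: (name, area units, cycles saved).
classes = [
    ("sbox_mix", 60, 900),
    ("rotate_xor", 30, 400),
    ("add_chain", 25, 350),
    ("shift_mask", 50, 500),
]

print(select_templates(classes, area_budget=90))
# (1300, ['sbox_mix', 'rotate_xor'])
```

An ILP formulation would use one binary variable per class with the same objective and a single area inequality; the dynamic program gives the same optimum for integer areas.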
Optimizing compilation flow

[Figure: flow diagram. Labels: Application in C/C++; Impact Front-end; CDFG Formation; a) Template Generation; b) Template Selection; Data Bandwidth Constraints; Area Constraints; VHDL; Synopsys Synthesis; Area; Gain; MDES Generation; Instruction Replacement; Elcor Backend; Scheduling, Reg. Allocation; Assembly Code and Statistics]


4. Application-specific processor synthesis
  • design space exploration framework
    • Processor Component Library
    • specialized structural description
  • prototype: MIPS integer instruction set
    • custom instructions
    • flexible micro-architecture
  • evaluate using actual implementation
    • timing and area
Processor synthesis flow

[Figure: flow diagram. Labels: Custom Data paths (from compiler); Processor Component Library; Custom Processor; pipeline stages FE, FE, M, W, EX]

Custom Data paths (from compiler)

  • merging
  • add state registers
  • processor interface
  • interface
  • data in/out
  • stall control

Processor Component Library

  • pipeline description
  • parameters


Implementation
  • based on Python scripts
    • structural meta-language for processors
    • combine RTL (Verilog/VHDL) IP blocks
    • module generators for custom units
  • generate 100s of designs automatically
    • ASIC processor cores
    • complete system on FPGA: CPU + memory + I/O
5. Results
  • cryptography benchmarks: C source
    • AES decrypt, AES encrypt, DES, MD5, SHA
  • 4/5 stage pipelined MIPS base processor
    • 0.225 mm² area, 200 MHz clock speed
    • single issue processor
    • register file with 2 input ports, 1 output port
  • processors synthesized to 130nm library
    • Synopsys DC and Cadence SoC Encounter
    • also synthesize to Xilinx FPGA for testing
AES Decryption Processor
  • 130nm CMOS, 200MHz, 0.307 mm²
  • 35% area cost (mostly one instruction)
  • 76% cycle reduction
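
The 3x area delay product figure quoted in the overview can be cross-checked from these numbers. The sketch assumes that the base and custom processors run at the same 200 MHz clock, so delay scales with cycle count, and that the 0.225 mm² base area from the results slide is the right reference; the variable names are illustrative.

```python
base_area_mm2 = 0.225      # base MIPS processor (results slide)
custom_area_mm2 = 0.307    # AES decryption processor
cycle_reduction = 0.76     # fraction of execution cycles removed

# Area grows, delay shrinks; same clock assumed, so delay ~ cycles.
area_ratio = custom_area_mm2 / base_area_mm2   # about 1.36
delay_ratio = 1.0 - cycle_reduction            # 0.24
adp_ratio = area_ratio * delay_ratio           # about 0.33
adp_improvement = 1.0 / adp_ratio              # about 3.05

print(f"area ratio: {area_ratio:.2f}")
print(f"area-delay product ratio: {adp_ratio:.2f}")
print(f"improvement: {adp_improvement:.2f}x")
```

The quoted areas give an overhead of roughly 36%, close to the 35% stated on the slide, and an area delay product improvement of roughly 3x.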


Execution time

Register file in all cases: 2 input ports, 1 output port

[Figure: execution time chart. Labels: 4 inputs, 1 output; 4 inputs, 2 outputs; 4 inputs, 4 outputs. Annotations: 43% reduction; 63% reduction; 76% reduction]


Timing
  • 48% of designs meet timing at 200MHz without manual optimization
6. Current and future work
  • support memory access in custom instructions
    • automate data partitioning for memory access
    • automate SIMD load/store instructions for state registers
  • use architectural techniques e.g. shadow registers
    • improve bandwidth without additional register file ports
  • study trade-offs for VLIW style
    • multiple register file ports
    • multiple issue and custom instructions
  • extend compiler: e.g. ILP model for cyclic graphs
    • adapt software pipelining for hardware
Summary
  • complete flow from C to custom processor
  • automatic instruction set extension
    • based on integer linear programming
    • optimize schedule length under constraints
  • application-specific processor synthesis
    • complete flow: permits real hardware evaluation
  • up to 76% reduction in execution cycles
    • 3x area delay product reduction
  • max speedup: 23% to 93% area overhead