
A High-Level Synthesis Flow for Custom Instruction Set Extensions for Application-Specific Processors

Nagaraju Pothineni (Google, India)
Philip Brisk (UC Riverside)
Paolo Ienne (EPFL)
Anshul Kumar (IIT Delhi)
Kolin Paul (IIT Delhi)

Asia and South Pacific Design Automation Conference
Taipei, Taiwan R.O.C., January 21, 2010


Extensible Processors

Applications are fed to a compiler, which emits assembly code with ISE (Instruction Set Extension) calls.

[Figure: a five-stage pipeline (Fetch, Decode, Execute, Memory, Write-back) with instruction cache (I$), data cache (D$), and register file (RF); a custom ISE unit sits alongside the Execute stage and reads and writes the RF.]


Compilation flow

G1

Gn

Optimized source code rewritten with ISE calls

Behavioral HDL description of n ISEs

Architecture description

Application source code

Target-specific compiler

Source-to-source Compiler

ISE Synthesizer

Architecture description

High-level program optimizations

ISE identification

Rewrite source code with ISEs

Linker

Processor Generator

Assembler

G1

Gn

Optimized source code rewritten with ISE calls

Behavioral description of n ISEs

Machine code with ISE calls

Structural HDL description of the processor and n ISEs

Compilation Flow

2


ISE Synthesis Flow

Inputs: the ISE dataflow graphs G1, ..., Gn, the RF I/O ports (R read ports, W write ports), and a clock period constraint.

1. Set the working clock-period constraint to the target value.
2. I/O-constrained scheduling.
3. Reschedule to reduce area.
4. Decompose each ISE into 1-cycle ops.
5. Resource allocation and binding.
6. If the achieved clock period meets the constraint, done; otherwise tighten the working constraint and repeat from the scheduling step.


I/O-constrained Scheduling

  • I/O ports are a resource constraint

    • Resource-constrained scheduling is NP-complete

    • Optimal algorithm [Pozzi and Ienne, CASES '05]

[Figure: a six-operation dataflow graph (A-F) is scheduled across cycles so that each cycle's register-file reads (RD1, RD2) and writes (Wr1) respect the available RF ports.]
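The shape of the port-constrained scheduling problem can be sketched as a greedy list scheduler. The paper itself uses the optimal algorithm of [Pozzi and Ienne, CASES '05]; everything below, including the function name and the toy graph, is an illustrative assumption, not the paper's algorithm.

```python
def io_constrained_schedule(nodes, preds, inputs, outputs, R=2, W=1):
    """Greedy list scheduling under RF I/O port constraints.
    nodes: op names; preds[n]: set of predecessors of n;
    inputs: ops that read the RF; outputs: ops that write the RF;
    R / W: read / write ports available per cycle."""
    scheduled = {}          # op -> cycle
    cycle = 0
    remaining = set(nodes)
    while remaining:
        reads = writes = 0
        for n in sorted(remaining):
            # An op is ready once all predecessors finished in an earlier cycle.
            ready = all(p in scheduled and scheduled[p] < cycle
                        for p in preds.get(n, ()))
            if not ready:
                continue
            r = 1 if n in inputs else 0
            w = 1 if n in outputs else 0
            if reads + r <= R and writes + w <= W:
                scheduled[n] = cycle
                reads += r
                writes += w
        remaining -= scheduled.keys()
        cycle += 1
    return scheduled
```

On a toy graph with three RF reads (A, B, C), the 2-read-port limit forces C into a later cycle, exactly the effect the slide's figure shows.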


ISE Synthesis Flow (recap of the flow diagram)


Reschedule to Reduce Area

Goal: minimize area.
Constraints: do not increase latency or clock period; respect the I/O constraints.
Implementation: simulated annealing (details in the paper).

[Figure: rescheduling reduces the example from 2 adders and 1 multiplier to 1 adder and 1 multiplier.]
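The slide only names simulated annealing and defers details to the paper; a generic SA skeleton conveys the idea. The cost model (area = peak per-cycle usage of each op class times a unit cost), the move model, and all names here are assumptions for illustration, not the paper's implementation.

```python
import math
import random

def area_by_peak_usage(s, op_type, unit_cost={'add': 1, 'mul': 8}):
    """Area estimate: for each op class, the peak number of ops of that
    class in any single cycle determines how many units must be built."""
    by_cycle = {}
    for op, c in s.items():
        key = (op_type[op], c)
        by_cycle[key] = by_cycle.get(key, 0) + 1
    peak = {}
    for (t, _), n in by_cycle.items():
        peak[t] = max(peak.get(t, 0), n)
    return sum(peak[t] * unit_cost[t] for t in peak)

def anneal_schedule(sched, op_type, legal_moves, area,
                    T0=10.0, cooling=0.95, iters=2000):
    """Simulated-annealing rescheduler. legal_moves(s) yields (op, cycle)
    moves that keep latency, clock period and I/O constraints satisfied
    (problem-specific and assumed given here)."""
    random.seed(0)
    cur, cur_cost = dict(sched), area(sched, op_type)
    best, best_cost, T = dict(sched), cur_cost, T0
    for _ in range(iters):
        moves = list(legal_moves(cur))
        if not moves:
            break
        op, c = random.choice(moves)
        cand = dict(cur)
        cand[op] = c
        cost = area(cand, op_type)
        # Accept improvements always; accept uphill moves with Boltzmann probability.
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / T):
            cur, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
        T *= cooling
    return best, best_cost
```

Moving the second adder to a different cycle drops the estimate from two adders to one, mirroring the slide's 2-adder-to-1-adder example.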


ISE Synthesis Flow (recap of the flow diagram)


Decompose each ISE into Single-Cycle Operations

After scheduling, each multi-cycle ISE is cut at its cycle boundaries, yielding a set of single-cycle operations.

[Figure: an ISE over operations A-E, its schedule, and the resulting 1-cycle ops.]
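The decomposition step can be pictured as slicing the scheduled graph by cycle; edges that cross a cycle boundary are the values that need pipeline registers. A minimal sketch with made-up names:

```python
def decompose_by_cycle(sched, edges):
    """Split a scheduled ISE dataflow graph into per-cycle subgraphs.
    sched: op -> cycle; edges: list of (producer, consumer) pairs.
    Returns (ops_by_cycle, cross_edges), where cross_edges are the
    producer/consumer pairs that span a cycle boundary and therefore
    require a pipeline register."""
    ops_by_cycle = {}
    for op, c in sched.items():
        ops_by_cycle.setdefault(c, set()).add(op)
    cross_edges = [(u, v) for (u, v) in edges if sched[u] != sched[v]]
    return ops_by_cycle, cross_edges
```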


Decomposition Facilitates Resource Sharing within an ISE

[Figure: identical single-cycle operations that execute in different cycles of the same ISE can be bound to the same hardware resource.]


ISE Synthesis Flow (recap of the flow diagram)


Resource Allocation and Binding

Given two 1-cycle ISEs, which merged datapath is better?

  • The minimum-cost common supergraph shares as much hardware as possible, but a shared 2-input operation requires a multiplexer (the weighted minimum-cost common supergraph, WMCS, problem; NP-complete, graph theory).

  • A higher-cost common supergraph shares less, but may need no multiplexers (related to the maximum-cost weighted common isomorphic subgraph problem; also NP-complete, graph theory).

Which solution is better? It depends on the cost of the multiplexers compared to the merged operations, a VLSI question rather than a purely graph-theoretic one!


Datapath Merging (DPM)

Old problem formulation:

  • Based on the WMCS problem (graph theory); NP-complete [Bunke et al., ICIP '02]

  • Share as many operations/interconnects as possible [Moreano et al., TCAD '05; de Souza et al., JEA '05]

  • Optimize port assignment as a post-processing step: NP-complete [Pangrle, TCAD '91]; heuristic [Chen and Cong, ASPDAC '04]

Contribution: a new problem formulation that accounts for multiplexer cost and port assignment.


New DPM Algorithms

  • ILP formulation (see the paper for details)

  • Reduction to the max-clique problem, extending [Moreano et al., TCAD '05]:

    • Identify isomorphic subgraphs up-front

    • Merge isomorphic subgraphs rather than individual vertices/edges

    • Solve the max-clique problem optimally using "Cliquer" (details in the paper)
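The reduction's objective, a maximum clique of minimum total weight, can be brute-forced on toy instances. The paper uses the "Cliquer" solver for an optimal search at scale; this exhaustive sketch only illustrates the objective, and every name in it is an assumption.

```python
from itertools import combinations

def min_weight_max_clique(vertices, weight, compatible):
    """Find the minimum-weight clique of maximum size by exhaustive search.
    vertices: candidate mappings; weight[v]: mapping cost;
    compatible(u, v): True if mappings u and v can coexist."""
    for k in range(len(vertices), 0, -1):
        found = None
        for cand in combinations(vertices, k):
            # A clique requires every pair in the candidate to be compatible.
            if all(compatible(u, v) for u, v in combinations(cand, 2)):
                w = sum(weight[v] for v in cand)
                if found is None or w < found[1]:
                    found = (k, w, cand)
        if found:                 # largest feasible size reached; return cheapest
            return found
    return (0, 0, ())
```

On a 4-vertex example whose only triangle is {a, b, c}, the search returns that triangle with its total weight, ignoring smaller but cheaper cliques: size comes first, weight breaks ties.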


Example

[Figure: a new ISE fragment (vertices v1, v2, v3; edges e1, e2) is merged into a partial merged datapath (vertices v4, v5, v6; new resources r1, r2, r3; edges e3, e4).]

Vertex mappings for v1:

1. Map v1 onto v4.
2. Allocate a new resource r1 and map v1 onto r1.

Edge mappings: e1 could map onto e3, (v4, r3), (r1, v5), or (r1, r3). Edge mappings must be compatible with the vertex mappings!


Compatibility

  • Vertex/vertex compatibility [Figure: a compatible pair (yes) and an incompatible pair (no)]

  • Vertex/edge compatibility [Figure: a compatible pair (yes) and an incompatible pair (no)]
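The two compatibility checks can be sketched as predicates over mapping tuples. The tuple encodings below are assumptions for illustration, not the paper's data structures.

```python
def vv_compatible(m1, m2):
    """Vertex/vertex compatibility. m = (orig_vertex, target).
    Incompatible if the same original vertex maps to two different targets,
    or two different originals collapse onto the same target."""
    (v1, t1), (v2, t2) = m1, m2
    if v1 == v2:
        return t1 == t2
    return t1 != t2

def ve_compatible(vmap, emap):
    """Vertex/edge compatibility. vmap = (orig_vertex, target);
    emap = ((orig_src, orig_dst), (tgt_src, tgt_dst)).
    Wherever the edge touches the mapped vertex, the edge's target
    endpoint must agree with the vertex mapping."""
    v, t = vmap
    (osrc, odst), (tsrc, tdst) = emap
    if v == osrc and tsrc != t:
        return False
    if v == odst and tdst != t:
        return False
    return True
```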


Why Edge Mappings?

Allocating an edge in the merged datapath may require a multiplexer, so edges must be mapped explicitly rather than inferred from the vertex mappings.


Port Assignment

  • Deterministic for non-commutative operators

  • NP-complete for commutative operators [Pangrle, TCAD '91]

[Figure: for a commutative operator with left (L) and right (R) input ports, one operand-to-port assignment needs no multiplexer while the other forces one; we want the mux-free assignment!]


Edge Mappings = Port Assignment

For a commutative operator, an edge mapping names both the target connection and the input port, so choosing the mapping fixes the port assignment:

  • e1: (v4, v6, L) routes e1 to the left port of v6

  • e1: (v4, v6, R) routes e1 to the right port of v6

[Figure: the example fragment (v1, v2, v3; e1, e2) and merged datapath (v4, v5, v6; e3, e4) with the two alternative port assignments for e1.]


Compatibility Graph

Vertices correspond to mappings:

  • Vertex mappings

    • Weight is 0 for a vertex mapped onto an existing vertex

    • Weight is the resource cost for a vertex mapped onto a new resource

  • Edge mappings, including port assignment

    • Weight is 0 if the edge already exists in the merged datapath

    • Weight is the estimated cost of increasing the mux size by one input otherwise

Place edges between compatible mappings. Each max-clique corresponds to a complete binding solution; the goal is to find the max-clique of minimum weight.
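The weighting rules above can be written down directly. The tuple encodings, Amux(2) as a concrete mux cost, and the area table are stand-ins for illustration; only the weight rules come from the slide.

```python
def mapping_weight(mapping, merged_edges, resource_area, mux_cost):
    """Weight of a compatibility-graph node, following the rules above:
    vertex onto existing vertex -> 0; vertex onto new resource -> its area;
    edge mapping -> 0 if the connection already exists in the merged
    datapath, else the estimated cost of growing the target mux by one input."""
    if mapping[0] == 'vertex':
        _, _orig, target, is_new = mapping
        return resource_area[target] if is_new else 0
    _, _edge, src, dst, _port = mapping
    return 0 if (src, dst) in merged_edges else mux_cost
```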


Compatibility Graph (example)

[Figure: the compatibility graph for the running example. Each node is a candidate mapping annotated with its weight: vertex mappings such as fV(v1) = v4 (w = 0), fV(v1) = r1 (w = A(r1), the new resource's area), fV(v2) = r2, fV(v3) = v6 (w = 0), and fV(v3) = r3; and edge mappings with port assignments such as fE(e1) = (v4, v6, L), fE(e1) = (v4, v6, R), fE(e1) = (r1, r3, L), fE(e2) = (r2, v6, L), and fE(e2) = (r2, r3, R), each weighted either 0 or Amux(2). Edges connect compatible mappings.]


Experimental Setup

  • Internally developed research compiler

  • 1-cycle ISEs [Atasu et al., DAC '03]

  • RF has 2 read ports, 1 write port

  • Standard cell design flow, 0.18 µm technology node

  • Five DPM algorithms:

    • Baseline: no resource sharing

    • ILP (optimal) [this paper]

    • Our heuristic* [this paper]

    • Moreano's heuristic* [Moreano et al., TCAD '05]

    • Brisk's heuristic [Brisk et al., DAC '04]

* Max-cliques found by "Cliquer"


1-cycle ISE Area Savings

[Chart: area savings per benchmark, with annotations:]

  • Moreano's heuristic is sometimes competitive, but not always!

  • Brisk's heuristic performed as well as ours for one benchmark, but was NOT competitive for three benchmarks!

  • Brisk's heuristic outperformed Moreano's for three benchmarks!


Critical Path Delay Increase

[Chart: critical path delay increase (%) for our heuristic, per benchmark.]


Runtimes

  • Baseline: ~0

  • Optimal (ILP): 3-8 hours

  • Our heuristic: 2-10 minutes

  • Moreano's heuristic: ~1 minute

  • Brisk's heuristic: 5-10 seconds


Experimental Setup

  • Internally developed research compiler

  • Multi-cycle ISEs [Pozzi and Ienne, CASES '05]

  • RF has 5 read ports, 2 write ports

  • Standard cell design flow, 0.18 µm technology node

  • Four versions of our flow:

    • Single-cycle ISE (no clock period constraint)

    • Full flow (200 MHz, multi-cycle)

    • No rescheduling (200 MHz, multi-cycle)

    • Baseline flow (200 MHz, multi-cycle; resource sharing and binding step disabled)


Single vs. Multi-cycle ISEs

[Chart: area savings (%) for the single-cycle ISE flow (no clock period constraint) and the full flow (200 MHz), relative to the baseline flow (200 MHz) at 0%. Resource sharing across multiple cycles of the same ISE drives the savings; the cost of the extra registers works against them.]


Impact of Rescheduling

[Chart: normalized area for the single-cycle ISE flow (no clock period constraint), the full flow (200 MHz), and the flow without rescheduling (200 MHz).]


Conclusion

  • An HLS flow for ISEs

  • RF I/O constraints:

    • Min-latency scheduling is NP-complete

    • Requires two scheduling steps; rescheduling is important for area reduction

  • Resource allocation and binding:

    • Modeled as a datapath merging problem

    • New problem formulation accounting for multiplexer cost and port assignment