Data-centric Subgraph Mapping for Narrow Computation Accelerators

Amir Hormati, Nathan Clark, and Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan

Introduction
  • Migration of applications from ASICs to embedded processors
  • Programmability and cost issues in ASICs
  • More functionality demanded of the embedded processor


What Are the Challenges?

  • Accelerator hardware
  • Compiler algorithm

Configurable Compute Array (CCA)

[Figure: the CCA block with four inputs (Input1-Input4) and two outputs (Output1, Output2)]

  • Array of FUs
    • Arithmetic/logic
    • 32-bit functional units
  • Full interconnect between rows
  • Supports 95 percent of all computation patterns

(Nathan Clark, ISCA 2005)

Report Card on the Original CCA
  • Easy to integrate into current embedded systems
  • High performance gain

However...

  • 32-bit general-purpose CCA:
    • 130nm standard cell library
    • Area requirement: 0.3 mm²
    • Latency: 3.3 ns

[Figure: die photo of a processor with the CCA]

Objectives of this Work
  • Redesign of the CCA hardware
    • Reduce area
    • Reduce latency
  • Compilation strategy
    • Improve code quality
    • Control compilation runtime

Width Utilization
  • The full width of the FUs is not always needed.
  • Simply replacing FUs with narrower FUs is not the solution by itself.

width aware narrow cca

[8-31]

[8-31]

[8-31]

[8-31]

Width Checker

Carry bits

Iterate

Width-Aware Narrow CCA

Input Registers

[

0

-

7

]

[

0

-

7

]

[

0

-

7

]

[

0

-

7

]

-

[8-31]

[8-31]

[8-31]

[8-31]

Iteration

Controller

Iterate

CCA

Output Registers

Carry Bits

Output 2

Output1

8
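To make the iteration scheme concrete, here is a minimal Python sketch of the idea, assuming an 8-bit datapath slice. The function name, the early-exit test, and the slicing loop are illustrative stand-ins for the width checker and iteration controller, not the actual hardware control logic.

```python
def narrow_add(a, b, slice_width=8, total_width=32):
    """Sketch: a 32-bit add performed on a narrow datapath by iterating
    over 8-bit slices and saving the carry between iterations."""
    mask = (1 << slice_width) - 1
    result, carry = 0, 0
    for i in range(total_width // slice_width):
        shift = i * slice_width
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift
        carry = s >> slice_width
        # Width checker: stop early once neither operand has significant
        # bits left above this slice and no carry remains.
        if a >> (shift + slice_width) == 0 and b >> (shift + slice_width) == 0 and carry == 0:
            break
    return result & ((1 << total_width) - 1)
```

For example, narrow_add(0x1D, 0x0C) finishes in a single iteration because both operands fit in 8 bits, while narrow_add(0x1FF, 0x1) needs a second iteration to absorb the carry.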

Sparse Interconnect

[Figure: a four-input, two-output CCA shown with its full interconnect and again after wire pruning]
  • Rank wires based on utilization.
  • More than 50% of the wires removed.
  • 91% of all patterns are still supported.
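The ranking idea can be pictured with a small Python sketch; wire_use is hypothetical profiling data (wire id mapped to a utilization count), and the keep fraction is an illustrative knob rather than the paper's actual pruning criterion.

```python
def prune_wires(wire_use, keep_fraction=0.5):
    """Sketch: rank interconnect wires by profiled utilization and keep
    only the most-used fraction; the rest are removed from the design.
    wire_use: hypothetical profile, e.g. {"r1c2_to_r2c1": 1520, ...}."""
    ranked = sorted(wire_use, key=wire_use.get, reverse=True)
    return set(ranked[: int(len(ranked) * keep_fraction)])
```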

Synthesis Results
  • Synthesized using Synopsys and Encounter in a 130nm standard cell library.

Compilation Challenges
  • Finding the best portions of the code to accelerate
  • Dealing with the non-uniform latency of the narrow CCA
  • Current solutions:
    • Hand coding
    • Function intrinsics
    • Greedy selection

Step 1: Enumeration

[Figure: a dataflow graph of ADD, AND, OR, XOR, and CMP operations numbered 1-8, with live-in and live-out values; all candidate subgraphs are enumerated from this graph]
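A minimal Python sketch of the enumeration idea, assuming the dataflow graph is given as an adjacency map; growing candidates outward from each seed node is an illustrative strategy, not necessarily the paper's exact enumeration algorithm.

```python
def enumerate_subgraphs(adj, max_size=4):
    """Enumerate connected subgraphs of up to max_size nodes.
    adj: dict mapping each node to the set of its neighbors
    (dataflow edges treated as undirected for connectivity)."""
    found = set()

    def grow(sub, frontier):
        found.add(frozenset(sub))
        if len(sub) == max_size:
            return
        for n in frontier:
            grow(sub | {n}, (frontier | adj[n]) - sub - {n})

    for node in adj:
        grow({node}, set(adj[node]))
    return found

# For the chain 1 - 2 - 3, enumerate_subgraphs({1: {2}, 2: {1, 3}, 3: {2}},
# max_size=2) yields {1}, {2}, {3}, {1, 2}, and {2, 3}.
```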

Step 2: Subgraph Isomorphism Pruning
  • Ensure subgraphs can run on the accelerator

[Figure: candidate subgraph patterns built from shifts (<<, >>, SHRA), multiplies (*), add/subtract (+/-), and logic operations are matched against the CCA template; subgraphs that do not match are pruned]
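A minimal sketch of the feasibility test behind the pruning step. The template below (row count, slots per row, opcodes each row supports) is an assumed CCA configuration for illustration; the real check is a subgraph isomorphism test against the actual hardware template.

```python
# Assumed CCA template: each row has a number of FU slots and a set of
# opcodes its FUs implement (illustrative values, not the real CCA).
CCA_ROWS = [
    {"slots": 4, "ops": {"ADD", "SUB", "AND", "OR", "XOR", "SHL", "SHR"}},
    {"slots": 2, "ops": {"AND", "OR", "XOR"}},
    {"slots": 1, "ops": {"ADD", "SUB"}},
]

def fits_cca(levels):
    """levels: one list of opcodes per dependence level of the candidate
    subgraph (level 0 = operations whose inputs are all live-ins)."""
    if len(levels) > len(CCA_ROWS):
        return False  # candidate is deeper than the CCA
    return all(
        len(ops) <= row["slots"] and set(ops) <= row["ops"]
        for ops, row in zip(levels, CCA_ROWS)
    )
```

Under these assumptions, fits_cca([["ADD", "AND"], ["OR"]]) passes, while any candidate containing a multiply is rejected because no row supports it.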

Step 3: Grouping

[Figure: the dataflow graph from the enumeration slide, shown before and after grouping; the disconnected subgraphs A and C are merged into a single group AC]

  • Assuming A and C are the only possibilities for grouping.
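The iterative merging can be sketched as follows; union_levels is a hypothetical helper that merges two candidates' per-level opcode lists, and fits_cca is the feasibility check from the pruning sketch.

```python
def group(subgraphs, union_levels, fits_cca):
    """Sketch: repeatedly merge pairs of disconnected subgraphs whose
    union still fits on the CCA, so one invocation covers more ops."""
    groups = list(subgraphs)
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                candidate = union_levels(groups[i], groups[j])
                if fits_cca(candidate):
                    groups[i] = candidate  # keep the merged group
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups
```

In the slide's example only A and C can merge, producing the single group AC.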

Dealing with Non-uniform Latency

[Figure: operations A, B, and C each alternate between 8-bit and 24-bit operand widths over time, yet each has an average latency of 2]

  • More than 94% of operands do not change their width range.
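Because latency depends on operand width, the compiler can estimate a candidate's expected latency from profiled width distributions, and the slide's observation that widths rarely change makes that estimate stable. A minimal sketch, assuming an 8-bit slice per iteration and a hypothetical width_profile mapping:

```python
def expected_latency(width_profile, slice_width=8):
    """Sketch: expected iteration count of one narrow-CCA operation.
    width_profile: hypothetical profile mapping operand bit-width to the
    fraction of dynamic executions seen at that width."""
    return sum(
        frac * -(-width // slice_width)  # ceil(width / slice_width)
        for width, frac in width_profile.items()
    )

# An operand that is 8-bit 75% of the time and 24-bit 25% of the time:
# expected_latency({8: 0.75, 24: 0.25}) == 0.75 * 1 + 0.25 * 3 == 1.5
```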

Step 4: Unate Covering

[Table: covering matrix. Rows are operations 1-8 annotated with their bit widths (ops 1 and 3: 24-bit; ops 2, 4, 7, and 8: 8-bit; ops 5 and 6: 32-bit); columns are the candidate groups AC, D, G, H, and N, with a 1 marking each operation a candidate covers. The bottom rows give each candidate's cost (AC: 3, all others: 1) and benefit (AC: 1, D: 1, G, H, and N: 0).]
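Selection is a unate covering problem: choose candidate groups so that every operation is covered at minimum cost. The paper's exact solver is not reproduced here; below is a greedy Python sketch over assumed (name, covered_ops, cost, benefit) tuples matching the matrix above.

```python
def select(candidates, all_ops):
    """Greedy sketch of unate covering: repeatedly pick the candidate
    covering the most still-uncovered ops per unit cost, breaking ties
    by benefit. candidates: list of (name, covered_ops, cost, benefit)."""
    chosen, uncovered = [], set(all_ops)
    while uncovered:
        name, ops, cost, benefit = max(
            candidates,
            key=lambda c: (len(c[1] & uncovered) / c[2], c[3]),
        )
        if not ops & uncovered:
            break  # remaining ops stay on the base processor
        chosen.append(name)
        uncovered -= ops
    return chosen
```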

Experimental Evaluation
  • ARM port of the Trimaran compiler system
  • Processor model
    • ARM-926EJS
    • Single-issue, in-order execution, 5-stage pipeline
    • I/D caches: 16K, 64-way
  • Hardware simulation: SimpleScalar 4.0

Comparison of Different CCAs

The 16-bit and 8-bit CCAs perform 7% and 9% better, respectively, than the 32-bit CCA.

  • Assuming a clock speed of 1/(3.3 ns) ≈ 300 MHz

Comparison of Different Algorithms
  • Previous work: the greedy approach is 10% worse than the data-unaware algorithm.

Conclusion
  • Programmable hardware accelerator
  • Width-aware CCA: optimizes for the common case.
    • 64% faster clock
    • 4.2x smaller
  • Data-centric compilation: deals with the non-uniform latency of the CCA.
    • On average 6.5%, and at most 12%, better than the data-unaware algorithm.

Questions?

For more information: http://cccp.eecs.umich.edu/

Operation of Narrow CCA

[Figure: step-by-step execution on the narrow CCA. The operands are split into 8-bit slices held in input registers A-D; the FUs perform the ADD and OR on the low slices, saving carry bits between iterations, and the result is assembled in the output registers.]

[(0x1D + 0x0C) + (0x20 OR 0x08)]
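Tracing the slide's example with the narrow_add sketch from the width-aware CCA discussion: every intermediate value fits in one 8-bit slice, so the width checker stops each operation after a single iteration.

```python
# All values fit in 8 bits, so each narrow add takes one iteration.
assert narrow_add(0x1D, 0x0C) == 0x29   # first ADD
assert (0x20 | 0x08) == 0x28            # the OR half of the expression
assert narrow_add(0x29, 0x28) == 0x51   # final ADD
```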

Data-Centric Subgraph Mapping

[Figure: compilation flow: Enumeration, then Pruning, then Grouping, then Selection]

  • Enumerate
    • All subgraphs
  • Pruning
    • Subgraph isomorphism
  • Grouping
    • Iteratively group disconnected subgraphs
  • Selection
    • Unate covering
  • Shrink search space to control runtime

How Good is the Cost Function?

Almost all of the operands keep the same width range throughout the execution.
