CGRA Express: Accelerating Execution using Dynamic Operation Fusion

CCCP Research Group, University of Michigan

Yongjun Park, Hyunchul Park, Scott Mahlke

Coarse-Grained Reconfigurable Architecture (CGRA)
  • Array of PEs connected in a mesh-like interconnect
  • High throughput with a large number of resources
  • Distributed hardware offers low cost/power consumption
  • High flexibility with dynamic reconfiguration

CGRA: Attractive Alternative to ASICs
  • Suitable for running multimedia applications for future embedded systems
    • High throughput, low power consumption, high flexibility

[Figure: example CGRAs MorphoSys, SiliconHive and ADRES, with the labels "Viterbi at 80 Mbps", "H.264 at 30 fps" and "50-60 MOps/mW".]

  • MorphoSys: 8x8 array with RISC processor
  • SiliconHive: hierarchical systolic array
  • ADRES: 4x4 array with tightly coupled VLIW

Performance Bottleneck: Acyclic Code

[Figure: execution time of an application's basic blocks (Block 0 to Block 5), original versus software pipelined, for a loop-region-dominant case and an acyclic-region-dominant case. Labels: Original, Software Pipeline, Normal schedule, Application, Execution Time.]

Acyclic region is substantial!

It’s time to optimize acyclic code.

Key Idea: Chaining Instructions

1. Clock period: set by the longest operation including its register file access.

2. A CGRA is not a VLIW: register file access is infrequent.

3. This leaves an opportunity for instruction chaining.

4. Register access time is considerable: roughly equal to an arithmetic operation delay (3.5 ns clock period at IBM 90 nm).

[Figure: critical path (slow) versus non-critical path (fast).]
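
The slack argument can be made concrete with a small sketch. The delay values below are hypothetical, chosen only so that they sum to the 3.5 ns period quoted on the slide; they are not measurements from the presentation, and the function names are illustrative.

```python
# Hypothetical delays in nanoseconds (illustrative assumptions, not measured values).
RF_READ_NS = 1.0
RF_WRITE_NS = 1.0
OP_DELAY_NS = {"MUL": 1.5, "ADD": 1.0, "LSR": 0.5}


def clock_period(op_delays, rf_read, rf_write):
    """Worst case: read the register file, run the slowest op, write back."""
    return rf_read + max(op_delays.values()) + rf_write


def slack_without_rf(op, op_delays, period):
    """Idle time left in a cycle when an op takes its operands from a
    neighbouring PE and forwards its result, so it never touches the RF."""
    return period - op_delays[op]


period = clock_period(OP_DELAY_NS, RF_READ_NS, RF_WRITE_NS)
for op in OP_DELAY_NS:
    print(op, slack_without_rf(op, OP_DELAY_NS, period))
```

Under these assumed numbers every operation that bypasses the register file leaves more than half of the cycle unused, which is the room that chaining tries to fill.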

Dynamic Operation Fusion
  • Execute multiple dependent operations in one cycle
  • Key benefits
    • Minimal hardware overhead
    • Multiple subgraphs can be executed simultaneously
    • Dynamic merging of FUs

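[Figure: a dependent chain of ADD, ADD (constant 512) and LSR (constant 10) with inputs A and B and output Out, mapped onto a 4x4 CGRA. Other labels: Add512r10, LD, MUL.]

Assumption: instruction time = RF read time = RF write time

Current: 3 cycles

Operation fusion: 1 cycle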
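
The 3-cycle versus 1-cycle comparison can be reproduced with a minimal sketch. It assumes, as one reading of the slide, that a cycle is three equal time units (RF read, instruction, RF write) and that the chained operands travel over the interconnect instead of the register file. The function names and unit delays are hypothetical.

```python
def conventional_cycles(chain):
    """Without fusion every dependent operation takes its own clock cycle."""
    return len(chain)


def fused_cycles(chain, delay, units_per_cycle):
    """Greedily pack a dependent chain into cycles.

    An operation may not straddle a cycle boundary, so it moves to the
    next cycle when it no longer fits in the current one.
    """
    if not chain:
        return 0
    cycles, used = 1, 0
    for op in chain:
        d = delay[op]
        if d > units_per_cycle:
            raise ValueError(f"{op} does not fit in one cycle")
        if used + d > units_per_cycle:  # would cross the boundary
            cycles, used = cycles + 1, 0
        used += d
    return cycles


# Slide assumption: instruction time = RF read time = RF write time = 1 unit.
chain = ["ADD", "ADD", "LSR"]
delay = {"ADD": 1, "LSR": 1}
print(conventional_cycles(chain), fused_cycles(chain, delay, units_per_cycle=3))
```

With these inputs the conventional count is 3 and the fused count is 1, matching the figures on the slide.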

Hardware Support
  • Simple bypass network
  • Small overhead: 3.8% (SRAM), 2.3% (MUX)

Compiler Support
  • Tick-based scheduling
    • Tick: small time unit based on hardware delay information
    • Clock cycle: expressed as a number of ticks
    • Clock boundary constraint checking
      • Resource conflict
      • Time conflict
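
A minimal sketch of tick-based scheduling with clock boundary checking follows. It is not the authors' scheduler: the `Op` record, the four-tick cycle, the rule that each functional unit runs one operation per cycle, and the example graph are all assumptions made for illustration.

```python
from dataclasses import dataclass

TICKS_PER_CYCLE = 4  # hypothetical cycle length in ticks


@dataclass
class Op:
    name: str
    fu: int            # functional unit the op is mapped to
    delay: int         # latency in ticks
    preds: tuple = ()  # names of producer ops


def schedule(ops, ticks_per_cycle=TICKS_PER_CYCLE):
    """Place ops (given in topological order); returns {name: (start, end)}."""
    times = {}
    busy = set()  # (fu, cycle) pairs already occupied
    for op in ops:
        if op.delay > ticks_per_cycle:
            raise ValueError(f"{op.name} is longer than one cycle")
        # Earliest start: when the last producer finishes.
        start = max((times[p][1] for p in op.preds), default=0)
        while True:
            cycle = start // ticks_per_cycle
            end = start + op.delay
            next_boundary = (cycle + 1) * ticks_per_cycle
            if end > next_boundary:        # time conflict: crosses the boundary
                start = next_boundary
            elif (op.fu, cycle) in busy:   # resource conflict: FU already used
                start = next_boundary
            else:
                break
        busy.add((op.fu, cycle))
        times[op.name] = (start, end)
    return times


def cycles_used(times, ticks_per_cycle=TICKS_PER_CYCLE):
    last_end = max(end for _, end in times.values())
    return -(-last_end // ticks_per_cycle)  # ceiling division


# Hypothetical graph: a five-op chain plus one independent op sharing FU 1.
ops = [
    Op("sub", fu=0, delay=1),
    Op("add1", fu=1, delay=1, preds=("sub",)),
    Op("x", fu=1, delay=1),
    Op("lsr", fu=2, delay=1, preds=("add1",)),
    Op("mul", fu=3, delay=3, preds=("lsr",)),
    Op("add2", fu=0, delay=1, preds=("mul",)),
]
times = schedule(ops)
print(times, cycles_used(times))
```

In this example `mul` would cross the first clock boundary, so it is pushed to tick 4 (time conflict), and `x` is pushed to the second cycle because FU 1 is already taken (resource conflict). The whole graph finishes in 2 cycles, where one operation per cycle along the five-op chain would need 5.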

Dynamic Operation Fusion Example (1)

1. Conventional Scheduling: 5 cycles

[Figure: DataFlow Graph, Schedule Table and CGRA Mapping for six operations, SUB(0), ADD(1), ADD(2), LSR(3), LSL(4) and ADD(5) (OP 0 to OP 5), with constant inputs and register file entries RF[0], RF[1] and RF[2].]

Dynamic Operation Fusion Example (2)

2. Dynamic Operation Fusion: 3 cycles

[Figure: DataFlow Graph, Schedule Table and CGRA Mapping for the same six operations, SUB(0), ADD(1), ADD(2), LSR(3), LSL(4) and ADD(5) (OP 0 to OP 5), with constant inputs and register file entries RF[0], RF[1] and RF[2].]

Experimental Setup
  • Benchmarks
    • Multimedia applications for embedded systems
    • Audio decoding (AAC)
    • Video decoding (H.264)
    • 3D graphics (3D)
  • Two designs
    • Baseline: 4x4 heterogeneous CGRA
    • Express: 4x4 heterogeneous CGRA with bypass network

Performance Enhancement
  • Express achieves a 7-17% reduction in execution time
    • Most of the reduction comes from the acyclic code region.
  • Express also improves the performance of resource-constrained loops.
    • The bypass network gives the compiler more freedom.

Detailed Result for 3D Graphics
  • Target application
    • 3D graphics
  • Power consumption
    • 3% higher than the baseline
  • Performance enhancement
    • 17% faster than the baseline
  • Energy consumption
    • 15% more efficient
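
The three figures above are roughly consistent if energy is taken as power multiplied by execution time. The sketch below checks the arithmetic, reading "17% faster" as a 17% reduction in execution time (an interpretation based on the "7-17% reduction in execution time" result, not something the slide states outright).

```python
def relative_energy(power_ratio, time_ratio):
    """Energy relative to the baseline, with energy = power * time."""
    return power_ratio * time_ratio


power_ratio = 1.03      # 3% higher power than the baseline
time_ratio = 1 - 0.17   # assumed reading: 17% less execution time
energy_ratio = relative_energy(power_ratio, time_ratio)
saving = 1 - energy_ratio
print(f"relative energy {energy_ratio:.4f}, saving {saving:.1%}")
```

Under that reading the saving comes out at about 14.5%, which rounds to the 15% quoted on the slide.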

Conclusion
  • The acyclic region becomes the performance bottleneck.
    • Loop run time decreases by large factors.
  • Dynamic operation fusion executes back-to-back dependent operations in a single cycle
    • Bypass network
    • Tick-based scheduler
  • Up to 17% faster and 15% more energy efficient with 3% hardware overhead
