
### Reducing NoC Energy Consumption Through Compiler-Directed Channel Voltage Scaling


Guangyu Chen, Feihui Li, Mahmut Kandemir, Mary Jane Irwin

Microsystems Design Lab, Department of CSE

The Pennsylvania State University

mdl@cse.psu.edu

Why NoCs?

- Scalability
  - Support for a large number of processing units
- Flexibility
  - Topology and routing policy can be configured according to the needs of a particular application
  - Point-to-point, broadcasting (one-to-multiple), and gathering (multiple-to-one) communication
- Performance
  - Low latency, high bandwidth
- Reliability
  - Multiple routes between a source/target pair
  - Signal strengthening in routers

PLDI’06

Mesh-Based NoC Abstraction

[Figure: a mesh-based NoC, a grid of routers, each attaching a CPU and a memory module to the network through communication channels.]

Related Work

- Communication channels can account for a significant portion of the chip energy consumption (between 20% and 45%)
- Prior efforts
  - Simunic and Boyd: NoC power modeling (DATE'02)
  - Benini and De Micheli: Design methodology for energy-efficient reliable SoC networks (ISSS'01)
  - Shang et al.: Hardware-directed DVS for communication links (HPCA'03)
  - Kim et al.: Communication link shutdown (ISLPED'03)
  - Soteriou and Peh: Design space exploration for link turn on/off (ICCD'04)
  - Soteriou et al.: Software-directed power-aware interconnection networks (CASES'05)
  - Li et al.: Software-directed DVS for communication links (CASES'05)
  - Li et al.: Compiler-directed link turnoff and routing (ICCAD'05, EMSOFT'05, POPL'06)
- Our goal is to save network energy through voltage/frequency scaling

Motivational Example (1)

Node 1:

```
for i = 0 to N {
  send(2, A[i][0..1023])
  receive(2, buffer)
}
```

Node 2:

```
for i = 0 to N {
  send(1, A[i][0..255])
  receive(1, buffer)
}
```

[Figure: timeline of iterations i = 0 through i = 4. The 1024-element transfers keep one channel busy, while the 256-element transfers in the opposite direction leave the other channel idle most of the time.]
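The slack in this example can be quantified with a small back-of-the-envelope sketch. The per-element transfer cost below is an assumed placeholder, not a figure from the paper; the point is only the ratio between the two directions:

```python
# Each iteration, Node 1 sends 1024 elements while Node 2 replies with 256.
# Both sends are issued before the receives, so the two transfers overlap.
CYCLES_PER_ELEMENT = 1.0          # assumed full-speed channel cost

t_fwd = 1024 * CYCLES_PER_ELEMENT  # Node 1 -> Node 2 transfer time
t_rev = 256 * CYCLES_PER_ELEMENT   # Node 2 -> Node 1 transfer time

# The reverse channel can be slowed by this factor before it becomes
# as long as the forward transfer it overlaps with:
slack_factor = t_fwd / t_rev
print(slack_factor)  # 4.0
```

A hardware scheme has to infer this slack from observed utilization; the compiler can see the message sizes statically and scale the reverse channel ahead of time.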

Motivational Example (2)

Node 1:

```
for i = 0 to N {
  send(2, A[i][0..255])
  short computation
  receive(2, buffer)
}
```

Node 2:

```
for i = 0 to N {
  send(1, A[i][0..255])
  long computation
  receive(1, buffer)
}
```

[Figure: timeline of iterations i = 0 through i = 4. Node 2's long computation dominates each iteration, so messages sent to it can travel more slowly without delaying it.]

Process and Connection Mapping

Input Parallel Code → Building IPCG → IPCG → Critical Path Analysis (using NoC Parameters) → Scaling Factor for Each Connection → Code Modification → Output Parallel Code

Assumptions

- Array-based embedded applications
- Message-passing-based parallel programs
- For each send(p, m) instruction, the destination node p and the size of message m can be statically determined at compilation time
- For each receive(p, m) instruction, the source node p can be determined at compilation time
- A send instruction is blocked if the previous message sent by the same node has not been delivered to the destination node
- A receive instruction is blocked if the message is not ready in the buffer of the receiver node
- Code is parallelized and process-to-node mapping is performed
- Network is exposed to the compiler

Inter-Process Communication Graph (IPCG)

- IPCG G(P) captures the communication behavior of application P
- G(P) = (V(P), E(P), α, β)
- V(P): the set of vertices
- E(P): the set of edges
- α, β: edge weights capturing minimum/maximum execution latencies

Vertices of IPCG

- V(P) = X(P) ∪ B(P) ∪ S(P) ∪ D(P) ∪ R(P)
- x ∈ X(P): the entry point of a loop in program P
- b ∈ B(P): the back jump of a loop in program P
- s ∈ S(P): the point in P at which a message is sent
- d ∈ D(P): the point in P at which a message is delivered
- r ∈ R(P): the point in P at which a message is used

[Figure: Node 1 executes send(2, ...) at point s; the message is delivered to Node 2 at point d; Node 2 uses it at point r, where its receive(1, ...) completes.]

Edges of IPCG

- Task edges
  - Communication edge (s, d): a message is sent at point s ∈ S(P) and delivered at point d ∈ D(P)
  - Computation edge (u, v): a computation task starts at point u and ends at point v, where u, v ∈ X(P) ∪ S(P) ∪ R(P)
- Control edges
  - Enforce the order in which the points of the given program can be reached
  - Back-jump edges
  - Other control edges

α and β Functions

- α(u, v) and β(u, v): the minimum and maximum times required to execute task (u, v)
- For a communication edge (s, d)
  - α(s, d) = (min. message size) / (max. data rate)
  - β(s, d) = (max. message size) / (max. data rate)
- For a computation edge (u, v)
  - α(u, v) = the minimum time for executing the instructions between u and v
  - β(u, v) = the maximum time for executing the instructions between u and v
- For a control edge (u, v)
  - α(u, v) = β(u, v) = 0
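The three edge kinds and their α/β weights can be captured in a small data structure. This is a minimal sketch under the definitions above; the names (`Edge`, `comm_edge`, etc.) are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    kind: str       # "comm", "comp", or "ctrl"
    alpha: float    # minimum task latency (cycles)
    beta: float     # maximum task latency (cycles)

def comm_edge(min_size, max_size, max_rate):
    # Communication edge (s, d): latency = message size / data rate.
    return Edge("comm", min_size / max_rate, max_size / max_rate)

def comp_edge(min_cycles, max_cycles):
    # Computation edge (u, v): latency bounds come from the code between u and v.
    return Edge("comp", min_cycles, max_cycles)

def ctrl_edge():
    # Control edges only enforce ordering; they take no time.
    return Edge("ctrl", 0.0, 0.0)

e = comm_edge(min_size=256, max_size=1024, max_rate=4.0)
print(e.alpha, e.beta)  # 64.0 256.0
```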

IPCG Example (1)

```
// Process 1
x3: for (...) {
  r1: receive(2, ...)
      20–25 cycles
  s2: send(2, ...)
}

// Process 2
x1: for (...) {
  s1: send(1, ...);
  x2: for (...) {
        10 cycles
    s3: send(3, ...);
        10–15 cycles
    s4: send(3, ...);
        80–90 cycles
    r5: receive(3, ...)
        20 cycles
  }
  r2: receive(1, ...);
}

// Process 3
x4: for (...) {
      10 cycles
  r3: receive(2, ...)
      15 cycles
  r4: receive(2, ...)
      40–50 cycles
  s5: send(2, ...)
}
```

IPCG Example (2)

[Figure: the IPCG built from Example (1). Vertices x1–x4 (loop entries), s1–s5 (sends), d1–d5 (deliveries), r1–r5 (receives), and b1–b4 (back jumps) are partitioned across processes p1, p2, and p3. Task edges carry α/β weight pairs such as 10/10, 10/15, 15/15, 20/25, 20/20, 40/50, and 80/90; control edges carry 0/0.]


Parallel Loop Group

- A set of loops that communicate with each other
- The unit of granularity for optimization

[Figure: the IPCG from the example, with the communicating loops x1–x4 grouped as one parallel loop group.]

Representative Iterations

- A set of loop iterations that represents the timing behavior of the entire parallel loop group

[Figure: a time axis showing iterations j = 0 through j = 8 of loops x1–x4, with t(i, j) marking when loop xi reaches iteration j. After a warm-up of q = 1 iteration, the schedule repeats with period T; with Q = 4, iterations q through Q − 1 are representative.]

Critical Path Analysis

- Determine q and Q such that [q, Q − 1] is the set of representative loop iterations
- Determine t_min[i, j]: the earliest time at which node vi in the jth iteration (j ∈ [q, Q − 1]) can be reached, assuming each task completes in its shortest time
- Determine t_max[i, j]: the earliest time at which node vi in the jth iteration (j ∈ [q, Q − 1]) can be reached, assuming each task takes its longest time
- Determine the scaling factor for each communication channel such that the overall performance degradation due to voltage scaling is within Δ (a preset bound)

Determining t_min[i, j] - Constraints

t_min[v, j] ≥ t_min[u, j] + α(u, v) for each (u, v) ∈ E_intra
t_min[v, j] ≥ t_min[u, j − 1] + α(u, v) for each (u, v) ∈ E_inter

where

- E_intra: the set of intra-iteration edges; at each iteration j, u must be reached before v
- E_inter: the set of inter-iteration edges; u at the (j − 1)th iteration must be reached before v at the jth iteration
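These constraints can be solved by longest-path relaxation over the representative iterations. The sketch below is a straightforward fixed-point solver under the assumption that each iteration's subgraph is acyclic (cycles only arise through inter-iteration back edges); the tiny edge lists at the bottom are illustrative, not the paper's example:

```python
def earliest_times(nodes, intra, inter, q, Q):
    # Solve  t[v, j] >= t[u, j]     + alpha(u, v)  for (u, v) in intra
    #        t[v, j] >= t[u, j - 1] + alpha(u, v)  for (u, v) in inter
    # intra, inter: lists of (u, v, alpha) triples.
    t = {(v, j): 0.0 for v in nodes for j in range(q, Q)}
    changed = True
    while changed:              # each iteration's subgraph is acyclic,
        changed = False         # so relaxation reaches a fixed point
        for j in range(q, Q):
            for (u, v, a) in intra:
                if t[(v, j)] < t[(u, j)] + a:
                    t[(v, j)] = t[(u, j)] + a
                    changed = True
            for (u, v, a) in inter:
                if j > q and t[(v, j)] < t[(u, j - 1)] + a:
                    t[(v, j)] = t[(u, j - 1)] + a
                    changed = True
    return t

# Tiny example: a send s delivered at d, used at r, with a back edge r -> s.
t = earliest_times(["s", "d", "r"],
                   intra=[("s", "d", 10), ("d", "r", 0)],
                   inter=[("r", "s", 5)], q=1, Q=3)
print(t[("r", 2)])  # 25.0: s in iteration 2 waits for r in iteration 1
```

The same solver computes t_max[i, j] when called with the β weights instead of α.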

Examples of Intra- and Inter-Iteration Edges

[Figure: the IPCG with an edge inside one iteration marked as an intra-iteration edge, and a back-jump edge crossing into the next iteration marked as an inter-iteration edge.]

Determining t_min[i, j] - Example

[Figure: a three-process IPCG fragment with vertices x1–x3, s1–s3, d1–d3, r1–r3, and b1–b3 across processes p1, p2, and p3, carrying α/β edge weights such as 20/25, 25/30, 20/20, 15/15, and 10/10, annotated with the earliest times computed from the constraints.]

Determining t_max[i, j] - Constraints

t_max[v, j] ≥ t_max[u, j] + β(u, v) for each (u, v) ∈ E_intra
t_max[v, j] ≥ t_max[u, j − 1] + β(u, v) for each (u, v) ∈ E_inter

where

- E_intra: the set of intra-iteration edges
- E_inter: the set of inter-iteration edges

Determining Scaling Factor - Constraints

where

- E: the set of intra-iteration and inter-iteration edges
- n(v): the node that executes operation v
- Δ: the maximum performance degradation allowed
- k(n1, n2): the scaling factor for the network connection from node n1 to node n2

The latency of each communication edge (s, d) grows as its connection's data rate is scaled by k(n(s), n(d)), and the t_max values computed under the scaled latencies must keep the slowdown within Δ. We scale k(n1, n2) down as far as this bound allows for each connection.

Determining Scaling Factor - Algorithm

```
repeat
  select a connection C
  scale down the data rate of C by one grade
  determine t_max[i, j] under the new data rate
  if the performance degradation is within Δ
    make the new data rate of C permanent
  else
    restore the previous data rate of C
until no connection can be scaled down further
```
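The greedy loop above can be sketched as follows. The grade list matches the example on the next slide, but the degradation model passed in is a toy placeholder; in the compiler it would be derived from the t_max constraints:

```python
GRADES = [1.0, 0.8, 0.6, 0.4, 0.2]   # available data-rate scaling factors

def greedy_scale(connections, degradation_at, delta):
    # degradation_at(k): performance degradation given the scaling
    # factors k (a dict: connection -> grade); delta: the bound.
    k = {c: 1.0 for c in connections}
    progress = True
    while progress:
        progress = False
        for c in connections:
            i = GRADES.index(k[c])
            if i + 1 == len(GRADES):
                continue                 # already at the lowest grade
            k[c] = GRADES[i + 1]         # scale down by one grade
            if degradation_at(k) <= delta:
                progress = True          # keep the lower rate
            else:
                k[c] = GRADES[i]         # restore the previous rate
    return k

# Toy model: each connection adds (1/k - 1) * weight degradation.
weights = {"c1": 0.04, "c2": 0.5}
k = greedy_scale(["c1", "c2"],
                 lambda k: sum((1 / k[c] - 1) * weights[c] for c in k),
                 delta=0.10)
print(k)  # {'c1': 0.4, 'c2': 1.0}
```

Note how the lightly loaded connection ends up at 0.4 while the critical one stays at full rate, mirroring the RESULT line of the slide example.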

Determining Scaling Factor - Example

q = 2, Q = 4, T = 100, Δ = 10%, k ∈ {1, 0.8, 0.6, 0.4, 0.2}

Trial steps:

- k[1, 2] = 0.8, k[2, 3] = 1, k[3, 1] = 1
- k[1, 2] = 0.8, k[2, 3] = 0.8, k[3, 1] = 1
- k[1, 2] = 0.8, k[2, 3] = 1, k[3, 1] = 0.8
- k[1, 2] = 0.6, k[2, 3] = 1, k[3, 1] = 1
- k[1, 2] = 0.4, k[2, 3] = 1, k[3, 1] = 1
- k[1, 2] = 0.2, k[2, 3] = 1, k[3, 1] = 1

RESULT: k[1, 2] = 0.4, k[2, 3] = 1, k[3, 1] = 1

Shared Communication Channels

The voltage level of a channel shared by multiple connections is determined by the connection that requires the highest voltage level.

[Figure: connections a, b, and c, requiring voltage levels v1, v2, and v3, share channels along their routes; each shared channel runs at the highest level among the connections crossing it.]
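This max-over-connections rule is a one-liner per channel. A minimal sketch, with illustrative route and level data (higher number = higher voltage grade):

```python
def channel_levels(routes, level):
    # routes: connection -> list of channels it traverses
    # level:  connection -> voltage level that connection requires
    out = {}
    for conn, channels in routes.items():
        for ch in channels:
            # A shared channel takes the highest level any user needs.
            out[ch] = max(out.get(ch, 0), level[conn])
    return out

routes = {"a": ["ch1", "ch2"], "b": ["ch2", "ch3"]}
level = {"a": 2, "b": 3}
print(channel_levels(routes, level))  # {'ch1': 2, 'ch2': 3, 'ch3': 3}
```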

Experimental Setup

[Table omitted in transcript.]

Impact on Energy Consumption

[Figure omitted in transcript.]

Energy Consumption Breakdown

[Figure omitted in transcript.]

Conclusions and Research Directions

- NoCs present unique opportunities for compilers
  - Expose the network layout to the compiler for energy reduction through voltage scaling and channel shutdown
- We implemented a compiler-directed voltage scaling algorithm and compared its performance to a hardware scheme
  - Promising results
- Research directions
  - Evaluating the impact of process-to-node mapping
  - Combined voltage/frequency scaling for NoCs and CPUs
  - Metrics other than energy (e.g., temperature, reliability, ...)
