Prediction router

Presentation Transcript


Prediction Router:

Yet another low-latency on-chip router architecture

Hiroki Matsutani (Keio Univ., Japan)

Michihiro Koibuchi (NII, Japan)

Hideharu Amano (Keio Univ., Japan)

Tsutomu Yoshinaga (UEC, Japan)


Tile architecture

  • Many cores (e.g., processors & caches)
  • On-chip interconnection network

Why is a low-latency router needed? [Dally, DAC'01]

[Figure: 16-core tile architecture; each core attaches to a router in a packet-switched network]

The on-chip router affects the performance and cost of the chip.


  • The number of cores keeps increasing (e.g., to 64 cores or more)
  • The number of hops increases with it, so communication latency becomes a crucial problem

Low-latency router architectures have therefore been studied extensively.


Outline: Prediction router for low-latency NoC

  • Existing low-latency routers

    • Speculative router

    • Look-ahead router

    • Bypassing router

  • Prediction router

    • Architecture and the prediction algorithms

  • Hit rate analysis

  • Evaluations

    • Hit rate, gate count, and energy consumption

    • Case study 1: 2-D mesh (small core size)

    • Case study 2: 2-D mesh (large core size)

    • Case study 3: Fat tree network


Wormhole router: Hardware structure

1) Selecting an output channel
2) Arbitration for the selected output channel (GRANT)
3) Sending the packet to the output channel

[Figure: input ports X+, X-, Y+, Y-, CORE, each with a FIFO, feeding an arbiter and a 5x5 crossbar to the corresponding output ports]

Routing, arbitration, & switch traversal are performed in a pipeline manner


Speculative router: VA/SA in parallel [Peh, HPCA'01]

Pipeline structure: 3-cycle router

  • At least 3 cycles to traverse a router
    • RC (Routing computation)
    • VSA (Virtual channel & switch allocations)
    • ST (Switch traversal)
  • A packet transfer from router A to router C

VA & SA are speculatively performed in parallel.

[Pipeline diagram, elapsed time in cycles: at routers A, B, and C the HEAD flit takes RC, VSA, ST in turn; DATA 1-3 follow with SA/ST each cycle; 12 cycles elapse in total]
To perform RC and VSA in parallel, look-ahead routing is used

At least 12 cycles are needed to transfer a packet from router A to router C.

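The 12-cycle total can be reproduced with a back-of-the-envelope zero-load model (an illustrative sketch, not taken from the paper): the head flit spends the full pipeline depth in every router, and the remaining flits stream behind it one per cycle.

```python
def packet_latency(stages, hops, flits):
    """Zero-load packet latency in cycles for a wormhole pipeline:
    the head flit takes `stages` cycles in each of `hops` routers,
    and the remaining `flits - 1` flits follow one cycle apart."""
    return stages * hops + (flits - 1)

# 4-flit packet (HEAD + DATA 1-3) crossing routers A, B, C:
print(packet_latency(3, 3, 4))  # 3-stage router
print(packet_latency(2, 3, 4))  # 2-stage look-ahead router
```

With 3 stages this gives the 12 cycles shown in the diagram; the same model gives 9 cycles for the 2-stage look-ahead router discussed later.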


Look-ahead router: RC/VA in parallel

  • At least 3 cycles to traverse a router
    • NRC (Next routing computation)
    • VSA (Virtual channel & switch allocations)
    • ST (Switch traversal)

NRC is the routing computation for the next hop: the output port of router i+1 is selected by router i, so VSA can be performed without waiting for NRC.

[Pipeline diagram, elapsed time in cycles: NRC for the next hop runs in parallel with VSA at each router; the HEAD flit still takes 3 stages per router, so 12 cycles elapse from router A to router C]



Look-ahead router: RC/VA in parallel [Dally's book, 2004]

A typical example of a 2-cycle router:

  • At least 2 cycles to traverse a router
    • NRC + VSA (Next routing computation / arbitrations)
    • ST (Switch traversal)

There is no dependency between NRC & VSA → NRC & VSA can be performed in parallel.

[Pipeline diagram, elapsed time in cycles: the HEAD flit takes NRC+VSA then ST at each router; DATA 1-3 follow; 9 cycles elapse from router A to router C]

Packing NRC, VSA, and ST into a single stage would harm the operating frequency.

At least 9 cycles are needed to transfer a packet from router A to router C.




Bypassing router: skip some stages

  • Bypassing between intermediate nodes, e.g., Express VCs [Kumar, ISCA'07]
  • Pipeline bypassing utilizing the regularity of DOR, e.g., Mad Postman [Izu, PDP'94]
  • Pipeline stages on frequently used paths are skipped, e.g., Dynamic Fast Path [Park, HOTI'07]
  • Pipeline stages on user-specified paths are skipped, e.g., Preferred Path [Michelogiannakis, NOCS'07] and DBP [Koibuchi, NOCS'08]

[Figure: virtual bypassing path from SRC to DST; bypassed intermediate routers take 1 cycle instead of 3]

We propose a low-latency router based on multiple predictors.




Prediction router for 1-cycle transfer [Yoshinaga, IWIA'06] [Yoshinaga, IWIA'07]

  • Each input channel has predictors
  • While an input channel is idle:
    • Predict an output port to be used (RC pre-execution)
    • Arbitrate to use the predicted port (SA pre-execution)

RC & VSA are skipped if the prediction hits → 1-cycle transfer.

[Pipeline diagram, elapsed time in cycles: when the prediction hits, the HEAD flit takes only ST (1 cycle) at a router; on a miss it falls back to RC, VSA, ST (3 cycles); DATA 1-3 follow with ST each cycle]

E.g., we can expect a 1.6-cycle average transfer per router if 70% of predictions hit.
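The 1.6-cycle figure is just the expected per-router latency; a one-line sketch (illustrative, using the hit/miss costs from the slide):

```python
def avg_hop_cycles(hit_rate, hit_cycles=1, miss_cycles=3):
    """Expected cycles to traverse one prediction router:
    1 cycle on a prediction hit (ST only), 3 cycles on a miss
    (fall back to RC, VSA, ST)."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

print(avg_hop_cycles(0.7))  # 0.7*1 + 0.3*3, i.e. about 1.6 cycles
```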








Prediction router: Prediction algorithms [Yoshinaga, IWIA'06] [Yoshinaga, IWIA'07]

A single predictor isn't enough for applications with different traffic patterns, so an efficient predictor is key. The prediction router has multiple predictors for each input channel and selects one of them in response to the given network environment:

  • Random
  • Static Straight (SS): an output channel on the same dimension is selected (exploiting the regularity of DOR)
  • Custom: the user can specify which output channel is accelerated
  • Latest Port (LP): the previously used output channel is selected
  • Finite Context Method (FCM) [Burtscher, TC'02]: the most frequently appearing pattern of an n-context sequence (n = 0, 1, 2, ...)
  • Sampled Pattern Match (SPM) [Jacquet, TIT'02]: pattern matching using a record table

[Figure: predictors A, B, C attached to each input channel]
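To illustrate how LP and FCM behave, here is a minimal Python sketch (class names and interfaces are my own, not from the paper; a hardware FCM would use small saturating counters rather than dictionaries):

```python
from collections import defaultdict, deque

class LatestPortPredictor:
    """LP: predict the output port used by the previous packet."""
    def __init__(self):
        self.last = None
    def predict(self):
        return self.last
    def update(self, port):
        self.last = port

class FCMPredictor:
    """n-th order Finite Context Method: predict the output port that
    most frequently followed the last n ports seen on this channel."""
    def __init__(self, n=2):
        self.history = deque(maxlen=n)
        self.counts = defaultdict(lambda: defaultdict(int))
    def predict(self):
        followers = self.counts.get(tuple(self.history))
        return max(followers, key=followers.get) if followers else None
    def update(self, port):
        self.counts[tuple(self.history)][port] += 1
        self.history.append(port)

# A strictly alternating stream defeats LP but is learned by FCM:
fcm = FCMPredictor(n=1)
for port in ['X+', 'Y+', 'X+', 'Y+', 'X+']:
    fcm.update(port)
print(fcm.predict())  # last port seen was 'X+', so FCM predicts 'Y+'
```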


Basic operation @ correct prediction

  • Idle state: output port X+ is selected and reserved (the crossbar port is reserved)
  • 1st cycle: the incoming flit is transferred to X+ without RC and VSA; in parallel, RC is performed → the prediction is correct!
  • 2nd cycle: the next flit is transferred to X+ without RC and VSA

[Figure: 5x5 crossbar with predictors A, B, C on the X+ input channel]

1-cycle transfer using the reserved crossbar port when the prediction hits.


Basic operation @ miss prediction

  • Idle state: output port X+ is selected and reserved
  • 1st cycle: the incoming flit is transferred to X+ without RC and VSA; in parallel, RC is performed → the prediction is wrong! (X- is correct)
  • A kill signal to X+ is asserted to remove the dead flit
  • 2nd/3rd cycle: the dead flit is removed and the flit is retransmitted to the correct port (more energy for the retransmission)

[Figure: 5x5 crossbar; the dead flit on X+ is killed and the flit is resent to X-]

Even with a miss prediction, a flit is transferred in 3 cycles, as in the original router.




Prediction hit rate analysis

  • Formulas to calculate the prediction hit rates on:
    • 2-D torus (Random, LP, SS, FCM, and SPM)
    • 2-D mesh (Random, LP, SS, FCM, and SPM)
    • Fat tree (Random and LRU)
  • These forecast which prediction algorithm suits a given network environment without simulations
  • The accuracy of the analytical model is confirmed through simulations

The derivation of the formulas is omitted in this talk (see Section 4 of our paper for details).
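The analytical formulas are not reproduced here, but a hit rate such as SS's on a k-ary 2-mesh can also be sanity-checked by Monte-Carlo simulation. The sketch below is my own illustration (not the paper's model): it measures how often "go straight" matches XY dimension-order routing at intermediate routers under uniform random traffic, ignoring injection and ejection hops.

```python
import random

def dor_dirs(src, dst):
    """Per-hop directions taken by XY dimension-order routing on a mesh."""
    (sx, sy), (dx, dy) = src, dst
    dirs = [('X', 1 if dx > sx else -1)] * abs(dx - sx)
    dirs += [('Y', 1 if dy > sy else -1)] * abs(dy - sy)
    return dirs

def ss_hit_rate(k, trials=50000, seed=1):
    """Fraction of intermediate-router hops where 'go straight' (SS)
    matches the actual DOR output port, under uniform random traffic."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(trials):
        src = (rng.randrange(k), rng.randrange(k))
        dst = (rng.randrange(k), rng.randrange(k))
        dirs = dor_dirs(src, dst)
        for prev, cur in zip(dirs, dirs[1:]):
            total += 1
            hits += (prev == cur)
    return hits / total
```

Because XY routing turns at most once per packet, the straight fraction (roughly 0.8 for an 8x8 mesh in this toy model) grows with network size, which is consistent with the larger latency reductions reported for bigger meshes.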




Evaluation items

[Evaluation flow: Verilog-HDL router designs are synthesized with Design Compiler (Fujitsu 65nm library), placed & routed with Astro, and simulated with NC-Verilog (SDF) and Power Compiler (SAIF); a flit-level network simulator counts prediction hits and misses]

  • Hit rate / communication latency ("how many cycles?")
  • Area (gate count)
  • Energy consumption [pJ/bit]

Table 1: router & network parameters; Table 2: process library; Table 3: CAD tools used.

*Topology and traffic are mentioned later


3 case studies of prediction router

  • Case studies 1 & 2: 2-D mesh network
    • The most popular network topology, e.g., MIT's RAW [Taylor, ISCA'04] and Intel's 80-core [Vangal, ISSCC'07]
    • Dimension-order routing (XY routing)
    • Here, we show the results of case studies 1 and 2 together
  • Case study 3: fat tree network


Case study 1: Zero-load communication latency

  • Compared: original router, pred router (SS), pred router (100% hit)
  • Uniform random traffic on 4x4 to 16x16 meshes
  • (*) 1-cycle transfer for a correct prediction, 3 cycles for a wrong one
  • Simulation results (the analytical model shows the same result)

[Chart: communication latency in cycles vs. network size (k-ary 2-mesh)]

More latency is reduced as the network size increases: 35.8% for 8x8 cores and 48.2% for 16x16 cores (k = 16).






Case study 2: Hit rate @ 8x8 mesh

  • SS (go straight): efficient for long straight communication
  • LP (the last one): efficient for short repeated communication
  • FCM (frequently used pattern): an all-arounder!

However, the effective bypassing policy depends on the traffic pattern:

  • Existing bypassing routers use only a static or a single bypassing policy
  • The prediction router supports multiple predictors, which can be switched in a cycle, to accelerate a wider range of applications

[Chart: prediction hit rate (%) for 7 NAS Parallel Benchmark programs and 4 synthetic traffic patterns]


Case study 2: Area & Energy

Area (gate count): original router vs. pred router (SS+LP) vs. pred router (SS+LP+FCM). The Verilog-HDL designs were synthesized with the 65nm library. The prediction router is lightweight: router area increases by 6.4 - 15.9%, depending on the type and number of predictors. FCM is an all-arounder, but it requires counters.

Energy consumption: original router vs. pred router (70% hit) vs. pred router (100% hit). A miss prediction consumes extra power: flit switching energy [pJ/bit] increases by 9.5% if the hit rate is 70%. This estimation is pessimistic:

  • More energy is consumed in the links → the effect of the router energy overhead is reduced
  • Applications finish earlier → more energy is saved

Latency is reduced by 35.8 - 48.2% with reasonable area/energy overheads.






Case study 3: Fat tree network

  • 1. LRU algorithm: the LRU output port is selected for upward transfers
  • 2. LRU + LP algorithm: in addition, LP is used for downward transfers

Compared under uniform traffic: original router, pred router (LRU), pred router (LRU + LP).

[Chart: communication latency in cycles vs. network size (# of cores)]

Latency is reduced by 30.7% at 256 cores, with a small area overhead (7.8%).
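The LRU policy for upward transfers can be sketched in a few lines (an illustration with assumed names, not the paper's design). The intuition: in a fat tree, upward routing may use any up-port, so predicting the least recently used one spreads the load and makes the tentative reservation likely to succeed:

```python
class LRUUpPort:
    """LRU predictor sketch for upward transfers in a fat tree:
    predict (and tentatively reserve) the up-port used least recently."""
    def __init__(self, up_ports):
        self._order = list(up_ports)       # front = least recently used
    def predict(self):
        return self._order[0]
    def use(self, port):
        # Move a port to the back once a packet actually leaves through it.
        self._order.remove(port)
        self._order.append(port)

lru = LRUUpPort(['up0', 'up1'])
lru.use('up0')
print(lru.predict())  # 'up1' is now the least recently used port
```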


Summary of the prediction router

  • Prediction router for low-latency NoCs
    • Multiple predictors, which can be switched in a cycle
    • Architecture and six prediction algorithms
    • Analytical model of the prediction hit rates
  • Evaluations of the prediction router
    • Case study 1: 2-D mesh (small core size)
    • Case study 2: 2-D mesh (large core size)
    • Case study 3: fat tree network
  • Results from the three case studies
    1. The prediction router can be applied to various NoCs
    2. Communication latency is reduced with small overheads: up to 48% latency reduction (case studies 1 & 2) for a 6.4% area overhead (SS+LP) and a 9.5% energy overhead (worst case)
    3. The prediction router with multiple predictors can accelerate a wider range of applications


Thank you

for your attention

It would be very helpful if you would speak slowly. Thank you in advance.


Prediction router: New modifications

  • Predictors for each input channel
  • Kill mechanism (KILL signals) to remove dead flits
  • Two-level arbiter
    • A "reservation" has higher priority than a "tentative reservation", which is made by the pre-execution of VSA

Currently, the critical path is related to the arbiter.

[Figure: 5x5 crossbar router with predictors A, B, C and KILL signals on each input channel]


Prediction router: Predictor selection

  • Static scheme: a predictor is selected by the user per application
    • Simple, but pre-analysis is needed
  • Dynamic scheme: a predictor is adaptively selected
    • Each predictor counts up in a configuration table when it hits; a predictor is selected every n cycles (e.g., n = 10,000)
    • Flexible, but consumes more energy

[Figure: predictors A, B, C per input channel, with hit counters and a configuration table]
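The dynamic scheme above can be sketched as follows (an illustration; the names and window mechanics are assumptions, and `ConstPredictor` is a stand-in for real predictors such as SS/LP/FCM):

```python
class DynamicSelector:
    """Dynamic scheme sketch: run all predictors in parallel, count each
    one's hits in a table, and switch the active predictor to the
    current best every `window` cycles."""
    def __init__(self, predictors, window=10000):
        self.predictors = predictors                  # name -> predictor
        self.hits = dict.fromkeys(predictors, 0)
        self.window = window
        self.cycle = 0
        self.active = next(iter(predictors))
    def predict(self):
        return self.predictors[self.active].predict()
    def observe(self, actual_port):
        # Every predictor guesses each cycle; credit the ones that hit.
        for name, p in self.predictors.items():
            if p.predict() == actual_port:
                self.hits[name] += 1
            p.update(actual_port)
        self.cycle += 1
        if self.cycle % self.window == 0:
            self.active = max(self.hits, key=self.hits.get)
            self.hits = dict.fromkeys(self.hits, 0)

class ConstPredictor:
    """Trivial stand-in predictor for the demo below."""
    def __init__(self, port): self.port = port
    def predict(self): return self.port
    def update(self, port): pass

sel = DynamicSelector({'a': ConstPredictor('X+'),
                       'b': ConstPredictor('Y-')}, window=5)
for _ in range(5):
    sel.observe('Y-')
print(sel.active)  # predictor 'b' won the window, so it becomes active
```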


Case study 1: Router critical path

  • RC: routing computation
  • VSA: arbitration
  • ST: switch traversal

ST can occur in any of these stages of the prediction router. The critical path delay increases by 6.2% compared with the original router.

[Chart: stage delay in FO4s, original router vs. pred router (SS)]


Case study 2: Hit rate @ 8x8 mesh

  • SS (go straight): efficient for long straight communication
  • LP (the last one): efficient for short repeated communication
  • FCM (frequently used pattern): an all-arounder!
  • Custom (user-specified path): efficient for simple communication

[Chart: prediction hit rate (%) for 7 NAS Parallel Benchmark programs and 4 synthetic traffic patterns]




Case study 4: Spidergon network [Coppola, ISSOC'04]

  • Spidergon topology: a ring plus across links; each router has 3 ports; mesh-like 2-D layout; across-first routing
  • Hit rate @ uniform traffic: SS (go straight), LP (last used one), FCM (frequently used one)

[Chart: prediction hit rate (%) vs. network size (# of cores)]

The hit rates of SS & FCM are almost the same. A high hit rate is achieved: 80% for 64 cores and 94% for 256 cores.


4 case studies of prediction router: 2-D mesh networks (case studies 1 & 2), a fat tree network (case study 3), and a Spidergon network (case study 4), all evaluated with the same flow (synthesis, place & route, simulation, and flit-level network simulation).

