Prediction router

Presentation Transcript



Prediction Router:

Yet another low-latency on-chip router architecture

Hiroki Matsutani (Keio Univ., Japan)

Michihiro Koibuchi (NII, Japan)

Hideharu Amano (Keio Univ., Japan)

Tsutomu Yoshinaga (UEC, Japan)


Why is a low-latency router needed? [Dally, DAC'01]

Tile architecture
  • Many cores (e.g., processors & caches)
  • On-chip interconnection network (packet-switched)

[Figure: 16-core tile architecture, a 4x4 grid of cores each attached to an on-chip router]

On-chip router affects the performance and cost of the chip


Why is a low-latency router needed? (cont.)

  • The number of cores increases (e.g., 64 cores or more), so the number of hops increases
  • Their communication latency becomes a crucial problem
  • Low-latency router architectures have therefore been studied extensively


Outline: Prediction router for low-latency NoC

  • Existing low-latency routers

    • Speculative router

    • Look-ahead router

    • Bypassing router

  • Prediction router

    • Architecture and the prediction algorithms

  • Hit rate analysis

  • Evaluations

    • Hit rate, gate count, and energy consumption

    • Case study 1: 2-D mesh (small core size)

    • Case study 2: 2-D mesh (large core size)

    • Case study 3: Fat tree network


Wormhole router: Hardware structure

  1) Selecting an output channel
  2) Arbitration for the selected output channel (GRANT)
  3) Sending the packet to the output channel

[Figure: five input ports (X+, X-, Y+, Y-, CORE), each with a FIFO, feed an arbiter and a 5x5 crossbar connected to the five output ports]

Routing, arbitration, & switch traversal are performed in a pipeline manner


Pipeline structure: 3-cycle router (speculative router: VA/SA in parallel) [Peh, HPCA'01]

  • At least 3 cycles for traversing a router
    • RC (Routing computation)
    • VSA (Virtual channel & switch allocations): VA & SA are speculatively performed in parallel
    • ST (Switch traversal)
  • A packet transfer from router A to router C

[Pipeline diagram: at each of routers A, B, and C, the HEAD flit goes through RC, VSA, and ST; DATA flits 1-3 follow with SA and ST per hop; the x-axis shows elapsed time in cycles 1-12]

At least 12 cycles to transfer a packet from router A to router C

To perform RC and VSA in parallel, look-ahead routing is used
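The cycle counts in these pipeline slides follow a simple zero-load model: the head flit pays the full router pipeline at every hop, and the remaining flits stream out one per cycle behind it. A minimal sketch of that model (the function name and the no-contention assumption are mine, not from the slides):

```python
def zero_load_latency(routers: int, pipeline_cycles: int, flits: int) -> int:
    """Cycles until the last flit of a packet arrives, assuming no contention."""
    head = routers * pipeline_cycles  # head flit traverses the full pipeline per router
    body = flits - 1                  # body flits follow, one per cycle
    return head + body

# A 4-flit packet (HEAD + DATA 1..3) crossing routers A, B, C:
print(zero_load_latency(3, 3, 4))  # 3-cycle router: 12 cycles
print(zero_load_latency(3, 2, 4))  # 2-cycle look-ahead router: 9 cycles
```

The same formula reproduces the 12- and 9-cycle totals shown in the timelines, and a 1-cycle router would give 6 cycles.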


Look-ahead router: RC/VA in parallel

  • At least 3 cycles for traversing a router
    • NRC (Next routing computation): routing computation for the next hop, i.e., the output port of router i+1 is selected by router i
    • VSA (Virtual channel & switch allocations): can be performed w/o waiting for NRC
    • ST (Switch traversal)

[Pipeline diagram: the same 12-cycle schedule as before, with NRC in place of RC at each of routers A, B, and C]


Look-ahead router: NRC/VSA in parallel [Dally's book, 2004]

  • At least 2 cycles for traversing a router (a typical example of a 2-cycle router)
    • NRC + VSA (Next routing computation / arbitrations): no dependency between NRC & VSA, so NRC & VSA run in parallel
    • ST (Switch traversal)

[Pipeline diagram: the HEAD flit takes NRC+VSA then ST at routers A, B, and C; DATA flits 1-3 follow; 9 cycles elapse in total]

Packing NRC, VSA, and ST into a single stage would harm the operating frequency

At least 9 cycles to transfer a packet from router A to router C


Bypassing router: skip some stages

  • Bypassing between intermediate nodes
    • E.g., Express VCs [Kumar, ISCA'07]

[Figure: a path from SRC to DST over five 3-cycle hops; virtual bypassing paths cut the bypassed intermediate routers to 1 cycle each]


Bypassing router: skip some stages (cont.)

  • Bypassing between intermediate nodes
    • E.g., Express VCs [Kumar, ISCA'07]
  • Pipeline bypassing utilizing the regularity of DOR
    • E.g., Mad postman [Izu, PDP'94]
  • Pipeline stages on frequently used paths are skipped
    • E.g., Dynamic fast path [Park, HOTI'07]
  • Pipeline stages on user-specified paths are skipped
    • E.g., Preferred path [Michelogiannakis, NOCS'07]
    • E.g., DBP [Koibuchi, NOCS'08]

We propose a low-latency router based on multiple predictors


Outline: Prediction router for low-latency NoC


Prediction router for 1-cycle transfer [Yoshinaga, IWIA'06][Yoshinaga, IWIA'07]

  • Each input channel has predictors
  • When an input channel is idle,
    • Predict an output port to be used (RC pre-execution)
    • Arbitrate to use the predicted port (SA pre-execution)

RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer

[Pipeline diagram: when the prediction hits, the HEAD flit takes only ST (1 cycle) at each router instead of RC, VSA, ST; DATA flits follow; elapsed time shown in cycles 1-12]

E.g., we can expect an average of 1.6 cycles per hop if 70% of predictions hit (0.7 × 1 + 0.3 × 3 = 1.6)
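The 1.6-cycle figure is just the expectation over hit and miss latencies; a one-line check (the function name is mine):

```python
def expected_hop_cycles(hit_rate: float, hit: int = 1, miss: int = 3) -> float:
    """Average cycles per router: hits take 1 cycle, misses fall back to 3."""
    return hit_rate * hit + (1 - hit_rate) * miss

print(round(expected_hop_cycles(0.7), 2))  # 1.6
```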


(The slide above is animated over several frames, marking each router's prediction as HIT or MISS: a hit skips RC and VSA for a 1-cycle transfer, while a miss falls back to the full RC, VSA, ST path.)


Prediction router: Prediction algorithms [Yoshinaga, IWIA'06][Yoshinaga, IWIA'07]

Multiple predictors for each input channel; one of them is selected in response to a given network environment. An efficient predictor is key: a single predictor isn't enough for applications with different traffic patterns.

  • Random
  • Static Straight (SS)
    • An output channel on the same dimension is selected (exploiting the regularity of DOR)
  • Custom
    • The user can specify which output channel is accelerated
  • Latest Port (LP)
    • The previously used output channel is selected
  • Finite Context Method (FCM) [Burtscher, TC'02]
    • The most frequently appearing pattern of an n-context sequence (n = 0, 1, 2, …)
  • Sampled Pattern Match (SPM) [Jacquet, TIT'02]
    • Pattern matching using a record table
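As a rough software analogy (not the hardware implementation), the LP predictor and a 0th-order FCM can be sketched as follows; the port labels, class names, and trace are my own illustration:

```python
from collections import Counter

class LatestPort:
    """LP: predict the output port used by the previous packet."""
    def __init__(self):
        self.last = None
    def predict(self):
        return self.last
    def update(self, actual_port):
        self.last = actual_port

class FCM0:
    """0th-order Finite Context Method: predict the most frequent port so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def update(self, actual_port):
        self.counts[actual_port] += 1

def hit_rate(predictor, trace):
    """Fraction of packets whose output port was predicted correctly."""
    hits = 0
    for port in trace:
        hits += predictor.predict() == port
        predictor.update(port)
    return hits / len(trace)

trace = ['X+'] * 8 + ['Y+'] * 2   # a long straight run, then a turn
print(hit_rate(LatestPort(), trace))  # 0.8
print(hit_rate(FCM0(), trace))        # 0.7
```

On this toy trace LP scores slightly better because it adapts to the turn after one miss, illustrating why no single predictor wins on every traffic pattern.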


Basic operation @ correct prediction

  • Idle state: output port X+ is selected and reserved (the crossbar port is reserved)
  • 1st cycle: the incoming flit is transferred to X+ without RC and VSA; RC is performed in parallel and finds the prediction correct
  • 2nd cycle: the next flit is transferred to X+ without RC and VSA

[Figure: 5x5 router with predictors (A, B, C) per input channel; the X+ input FIFO feeds the reserved crossbar port]

1-cycle transfer using the reserved crossbar port when the prediction hits


Basic operation @ misprediction

  • Idle state: output port X+ is selected and reserved
  • 1st cycle: the incoming flit is transferred to X+ without RC and VSA; RC is performed in parallel and finds the prediction wrong (X- is correct), so a kill signal to X+ is asserted
  • 2nd/3rd cycle: the dead flit is removed, and the flit is retransmitted to the correct port

[Figure: 5x5 router; the KILL signal removes the dead flit from the X+ output, and the flit is resent to X-]

Retransmission consumes more energy, but even on a misprediction a flit is transferred in 3 cycles, as in the original router
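The per-hop behavior above can be summarized in a tiny behavioral model: a hit spends 1 cycle, a miss kills the speculative dead flit and retransmits in the original 3 cycles. A sketch (function name and trace are mine, purely illustrative):

```python
def forward(predicted_port: str, correct_port: str) -> tuple[int, int]:
    """Return (cycles spent at this router, dead flits produced)."""
    if predicted_port == correct_port:
        return 1, 0   # hit: RC and VSA are skipped
    return 3, 1       # miss: KILL the speculative flit, retransmit to the correct port

# Tally a hypothetical trace of (predicted, actual) port pairs at one router:
trace = [('X+', 'X+'), ('X+', 'X-'), ('X-', 'X-')]
cycles = sum(forward(p, a)[0] for p, a in trace)
dead = sum(forward(p, a)[1] for p, a in trace)
print(cycles, dead)  # 5 1
```

The dead-flit count is a proxy for the extra switching energy that mispredictions cost.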


Outline: Prediction router for low-latency NoC


Prediction hit rate analysis

  • Formulas to calculate the prediction hit rates on

    • 2-D torus (Random, LP, SS, FCM, and SPM)

    • 2-D mesh (Random, LP, SS, FCM, and SPM)

    • Fat tree (Random and LRU)

    • To forecast which prediction algorithm is suited for a given network environment w/o simulations

  • Accuracy of the analytical model is confirmed through simulations

Derivation of the formulas is omitted in this talk

(See “Section 4” of our paper for more detail)


Outline: Prediction router for low-latency NoC


Evaluation items

  • Hit rate / communication latency (how many cycles?): flit-level network simulation
  • Area (gate count): synthesis with Design Compiler and place & route with Astro, using a Fujitsu 65nm library
  • Energy consumption [pJ/bit]: NC-Verilog simulation (SDF, SAIF) with Power Compiler

[Figure: evaluation flow from the router design (FIFOs, crossbar) through the CAD tools to the three metrics]

Table 1: Router & network parameters
Table 2: Process library
Table 3: CAD tools used

*Topology and traffic are mentioned later


3 case studies of the prediction router

Case studies 1 & 2: 2-D mesh network
  • The most popular network topology
    • MIT's RAW [Taylor, ISCA'04]
    • Intel's 80-core [Vangal, ISSCC'07]
  • Dimension-order routing (XY routing)
  • Here, the results of case studies 1 and 2 are shown together

Case study 3: Fat tree network


Case study 1: Zero-load communication latency

Uniform random traffic on 4x4 to 16x16 meshes; 1-cycle transfer for a correct prediction, 3 cycles for a wrong one.

[Graph: communication latency (cycles) vs. network size (k-ary 2-mesh) for the original router, the prediction router (SS), and the prediction router (100% hit)]

  • 35.8% latency reduction for 8x8 cores
  • 48.2% latency reduction for 16x16 cores
  • More latency is saved (48% for k=16) as the network size increases

Simulation results (the analytical model also shows the same result)


Case study 2: Hit rate @ 8x8 mesh

[Graph: prediction hit rate (%) of SS, LP, and FCM on 7 NAS Parallel Benchmark programs and 4 synthetic traffic patterns]

  • SS (go straight): efficient for long straight communication
  • LP (the last used port): efficient for short repeated communication
  • FCM (frequently used pattern): an all-rounder

However, the effective bypassing policy depends on the traffic pattern:
  • Existing bypassing routers use only a static or a single bypassing policy
  • The prediction router supports multiple predictors, switchable in a cycle, to accelerate a wider range of applications


Case study 2: Area & Energy

Area (gate count): Verilog-HDL designs synthesized with a 65nm library
[Graph: router area (kilo-gates) for the original router, pred router (SS+LP), and pred router (SS+LP+FCM)]
  • Lightweight (small overhead): 6.4-15.9% area increase, depending on the type and number of predictors
  • FCM is an all-rounder, but requires counters

Energy consumption
[Graph: flit switching energy (pJ/bit) for the original router, pred router (70% hit), and pred router (100% hit)]
  • Misprediction consumes extra power: a 9.5% increase if the hit rate is 70%
  • This estimation is pessimistic: more energy is consumed in links, which reduces the effect of the router energy overhead, and applications finish earlier, saving more energy

Latency is reduced by 35.8%-48.2% with reasonable area/energy overheads


3 case studies of the prediction router (recap): case studies 1 & 2 used the 2-D mesh network; case study 3 uses the fat tree network


Case study 3: Fat tree network

[Figure: fat tree with upward and downward links]

  1. LRU algorithm: the least recently used output port is selected for upward transfers
  2. LRU + LP algorithm: in addition, LP is used for downward transfers


Case study 3: Fat tree network (results)

[Graph: communication latency (cycles) @ uniform traffic vs. network size (# of cores) for the original router, pred router (LRU), and pred router (LRU + LP)]

Latency is reduced by 30.7% at 256 cores, with a small area overhead (7.8%)


Summary of the prediction router

From case studies 1 & 2: latency reduction up to 48%, area overhead 6.4% (SS+LP), energy overhead 9.5% (worst case)

  • Prediction router for low-latency NoCs

    • Multiple predictors, which can be switched in a cycle

    • Architecture and six prediction algorithms

    • Analytical model of prediction hit rates

  • Evaluations of prediction router

    • Case study 1 : 2-D mesh (small core size)

    • Case study 2 : 2-D mesh (large core size)

    • Case study 3 : Fat tree network

  • Results

    • Prediction router can be applied to various NoCs

    • Communication latency reduced with small overheads

    • Prediction router with multiple predictors can accelerate a wider range of applications

(from the three case studies)



Thank you

for your attention

It would be very helpful if you would speak slowly. Thank you in advance.


Prediction router: New modifications

  • Predictors for each input channel
  • Kill mechanism (KILL signals) to remove dead flits
  • Two-level arbiter
    • "Reservation" has higher priority than the "tentative reservation" made by the pre-execution of VSA

[Figure: 5x5 router with per-channel predictors (A, B, C), KILL signals, the arbiter, input FIFOs, and the 5x5 crossbar]

Currently, the critical path runs through the arbiter


Prediction router: Predictor selection

  • Static scheme: a predictor is selected by the user per application
    • Simple, but pre-analysis is needed
  • Dynamic scheme: a predictor is adaptively selected
    • Each predictor counts up when it hits; a configuration table selects a predictor every n cycles (e.g., n = 10,000)
    • Flexible, but consumes more energy
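The dynamic scheme can be sketched as running every predictor in parallel, counting hits, and letting a configuration table switch to the best one every n cycles. A self-contained toy version (the predictor interface, class names, and the stand-in predictors are my own assumptions):

```python
class LatestPort:
    """Minimal LP predictor, used here only as an example predictor."""
    def __init__(self):
        self.last = None
    def predict(self):
        return self.last
    def update(self, port):
        self.last = port

class Const:
    """Stand-in predictor that always guesses one fixed port."""
    def __init__(self, port):
        self.port = port
    def predict(self):
        return self.port
    def update(self, port):
        pass

class DynamicSelector:
    """Switch the active predictor to the one with the most hits every `window` cycles."""
    def __init__(self, predictors, window=10_000):
        self.predictors = predictors              # e.g. {'SS': ..., 'LP': ..., 'FCM': ...}
        self.hits = dict.fromkeys(predictors, 0)
        self.window = window
        self.cycle = 0
        self.active = next(iter(predictors))      # start with any predictor

    def predict(self):
        return self.predictors[self.active].predict()

    def observe(self, actual_port):
        for name, p in self.predictors.items():
            if p.predict() == actual_port:        # count up if each predictor hits
                self.hits[name] += 1
            p.update(actual_port)
        self.cycle += 1
        if self.cycle % self.window == 0:         # reselection point (the config table)
            self.active = max(self.hits, key=self.hits.get)
            self.hits = dict.fromkeys(self.hits, 0)

sel = DynamicSelector({'Const': Const('X+'), 'LP': LatestPort()}, window=4)
for port in ['X+', 'X+', 'X+', 'X+']:
    sel.observe(port)
print(sel.active)  # Const (4 hits vs. LP's 3 on this stream)
```

Counters are reset after each window so that the selection tracks the current traffic phase rather than the whole history.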


Case study 1: Router critical path

  • RC: routing computation; VSA: arbitration; ST: switch traversal
  • ST can occur in any of these stages of the prediction router

[Graph: stage delay (FO4s) for the original router vs. the pred router (SS)]

The critical path delay increases by 6.2% compared with the original router


Case study 2: Hit rate @ 8x8 mesh (with the Custom predictor)

[Graph: prediction hit rate (%) on 7 NAS Parallel Benchmark programs and 4 synthetic traffic patterns]

  • SS (go straight): efficient for long straight communication
  • LP (the last used port): efficient for short repeated communication
  • FCM (frequently used pattern): an all-rounder
  • Custom (user-specified path): efficient for simple communication


Case study 4: Spidergon network [Coppola, ISSOC'04]

Spidergon topology:
  • Ring + across links; each router has 3 ports
  • Mesh-like 2-D layout
  • Across-first routing

Hit rate @ uniform traffic (SS: go straight; LP: last used port; FCM: frequently used pattern):

[Graph: prediction hit rate (%) vs. network size (# of cores); the hit rates of SS and FCM are almost the same]

A high hit rate is achieved (80% for 64 cores; 94% for 256 cores)


4 case studies of the prediction router (recap): case studies 1 & 2 on the 2-D mesh network, case study 3 on the fat tree network, and case study 4 on the Spidergon network

