Low-Cost Adaptive Data Prefetching

Luis M. Ramos, José Luis Briz,

Pablo E. Ibáñez and Víctor Viñals.

University of Zaragoza (Spain)

Euro-Par 2008 - Las Palmas de Gran Canaria - August 26-29th, 2008


Introduction

Hardware Data Prefetching

Effective to hide memory latency

Recent successful proposals: GHB, SMS

Simple mechanisms in commercial processors:

UltraSPARC-IIIcu & SPARC64 VI (sequential tagged)

Power4 & Power5 (sequential stream buffers)

Intel Core (sequential & stride)

Sequential Tagged prefetching (SEQT)

Prefetches on a cache miss or on the 1st use of a prefetched block

Highest speed-ups

High pressure on memory and performance losses in hostile applications
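
The SEQT trigger rule above can be illustrated with a minimal sketch. It assumes an unbounded cache with no replacement, a prefetch degree of 1, and block numbers instead of byte addresses; the class and method names are hypothetical and are not taken from the presentation.

```python
class TaggedSequentialPrefetcher:
    """Toy model of sequential tagged prefetching (degree 1, no evictions)."""

    def __init__(self):
        # block number -> tag bit (True = prefetched and not yet used)
        self.cache = {}
        # prefetched block numbers, in the order they were issued
        self.issued = []

    def _prefetch(self, block):
        # Only blocks that are not already cached are prefetched.
        if block not in self.cache:
            self.cache[block] = True
            self.issued.append(block)

    def access(self, block):
        """Demand access; returns 'miss', 'first_use' or 'hit'."""
        if block not in self.cache:
            self.cache[block] = False   # demand fill, tag cleared
            self._prefetch(block + 1)   # trigger 1: cache miss
            return "miss"
        if self.cache[block]:
            self.cache[block] = False   # the first use clears the tag
            self._prefetch(block + 1)   # trigger 2: 1st use of a prefetch
            return "first_use"
        return "hit"                    # plain hits trigger nothing


p = TaggedSequentialPrefetcher()
events = [p.access(b) for b in (10, 11, 12, 11)]
```

In this run only the first access misses; the accesses to blocks 11 and 12 are first uses that keep the sequential stream going, and the repeated access to block 11 is a plain hit.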


Introduction

Our aim:

Use the simplest prefetcher (SEQT)

Evaluate degree-distance policies and adaptive mechanisms

Compare them with:

Stride

GHB

P-DFCM

SMS


Outline

Prefetching mechanisms

Experimental framework and benchmarks

Preliminary results

Performance

Pressure to memory

Degree-distance policies

Results

Conclusions and future work


Prefetching mechanisms

Stride prefetching

Addresses separated by a constant distance

Table indexed by PC

on-miss insertion [Ibáñez et al. 98]

SMS (Spatial Memory Streaming)

Spatial access patterns

Prefetches blocks inside a memory region

Avoids useless blocks
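
As a rough illustration of the stride prefetcher outlined on this slide, the sketch below keeps a table indexed by PC and allocates entries only when the load misses. The confirmation rule (prefetch once the same non-zero stride is seen twice in a row) and all names are assumptions made for the example, not details given in the presentation.

```python
class StridePrefetcher:
    """Toy PC-indexed stride table with on-miss insertion."""

    def __init__(self):
        # pc -> (last address, last observed stride)
        self.table = {}

    def observe(self, pc, addr, miss):
        """Record one load; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            # On-miss insertion: loads that hit never allocate an entry.
            if miss:
                self.table[pc] = (addr, 0)
            return None
        last_addr, stride = entry
        new_stride = addr - last_addr
        self.table[pc] = (addr, new_stride)
        # Assumed policy: prefetch when the stride repeats and is non-zero.
        if new_stride != 0 and new_stride == stride:
            return addr + new_stride
        return None


s = StridePrefetcher()
outs = [s.observe(0x400, a, miss=True) for a in (100, 164, 228, 292)]
ignored = s.observe(0x500, 100, miss=False)
```

The load at PC 0x400 needs three accesses before its stride of 64 is confirmed; the load at PC 0x500 hits in the cache and is therefore never tracked.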


Prefetching mechanisms

Correlating prefetchers

Tables store memory program behaviour (addresses or deltas)

Indexed by address or PC

GHB (Global History Buffer): PC/DC

Focused on reducing table sizes

2 tables, several accesses to calculate deltas

P-DFCM

Based on DFCM value predictor

2 tables, delta stream used to predict next delta
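
The two-table idea behind a DFCM-style prefetcher can be sketched as follows. This is a simplified illustration, not the P-DFCM design itself: it assumes an unbounded first table indexed by PC, a second table keyed by the raw tuple of the last two deltas (a real design would hash the history into a fixed-size index), and hypothetical names throughout.

```python
class DeltaContextPrefetcher:
    """Toy two-level predictor: recent delta history -> next delta."""

    def __init__(self, order=2):
        self.order = order
        self.first = {}    # pc -> (last address, tuple of recent deltas)
        self.second = {}   # delta history -> delta that followed it last time

    def observe(self, pc, addr):
        """Record one access; return a predicted next address, or None."""
        last_addr, history = self.first.get(pc, (None, ()))
        if last_addr is not None:
            delta = addr - last_addr
            if len(history) == self.order:
                # Train: remember which delta followed this history.
                self.second[history] = delta
            history = (history + (delta,))[-self.order:]
        self.first[pc] = (addr, history)
        if len(history) < self.order:
            return None
        predicted = self.second.get(history)
        return None if predicted is None else addr + predicted


d = DeltaContextPrefetcher()
addrs = (0, 8, 16, 40, 48, 56, 80, 88)   # deltas repeat as 8, 8, 24
preds = [d.observe(0x400, a) for a in addrs]
```

Once the delta pattern 8, 8, 24 has been seen in full, each new access yields a correct prediction of the next address, which a fixed-stride table could not provide for this stream.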


Experimental framework and benchmarks

SimpleScalar 3.0

Alpha binaries

Aggressive superscalar processor

3-level memory hierarchy (Itanium2)

Spec2k

Simple Simpoints

200 M instruction warming

Selection rule: ideal L2 speed-up > 2%

[Figure: memory hierarchy with cache sizes of 16 KB, 256 KB and 4 MB]


Preliminary results: performance

[Figure: performance for a) CINT and b) CFP]


Preliminary results: pressure


Preliminary results: breakdown per application


Degree-distance policies

Deg(x)

On a miss and on the 1st use of a prefetch, prefetches x blocks

[Figure: Deg(4) timeline over blocks i to i+8, marking prefetches, demand hits, demand misses and 1st uses of a prefetch]


Degree-distance policies

Dist(x)

On a miss and on the 1st use of a prefetch, prefetches the x-th block

[Figure: Dist(4) timeline over blocks i to i+8, marking prefetches, demand hits, demand misses and 1st uses of a prefetch]


Degree-distance policies

Deg-dist(x)

On a miss → x blocks

On the 1st use of a prefetch → the x-th block

[Figure: Deg-dist(4) timeline over blocks i to i+8, marking prefetches, demand hits, demand misses and 1st uses of a prefetch]


Degree-distance policies

Deg(1-x)

Degree on a miss = 1

Degree on the 1st use of a prefetch = x

[Figure: Deg(1-4) timeline over blocks i to i+8, marking prefetches, demand hits, demand misses and 1st uses of a prefetch]
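
The four static policies (Deg, Dist, Deg-dist and Deg(1-x)) differ only in which blocks they request on each trigger, which the following sketch tries to make explicit. It assumes that "x blocks" means the x sequential blocks after the triggering one and that "the x-th block" means the block x positions ahead; the function name and policy strings are hypothetical.

```python
def prefetch_targets(policy, x, block, event):
    """Blocks requested by a static policy for one trigger event.

    event is 'miss' or 'first_use' (1st use of a prefetched block).
    """
    if event not in ("miss", "first_use"):
        raise ValueError("unknown event: " + event)
    run = list(range(block + 1, block + x + 1))   # the next x blocks
    single = [block + x]                          # only the x-th block ahead
    if policy == "deg":
        return run
    if policy == "dist":
        return single
    if policy == "deg-dist":
        return run if event == "miss" else single
    if policy == "deg(1-x)":
        return [block + 1] if event == "miss" else run
    raise ValueError("unknown policy: " + policy)


on_miss = {p: prefetch_targets(p, 4, 10, "miss")
           for p in ("deg", "dist", "deg-dist", "deg(1-x)")}
on_first_use = {p: prefetch_targets(p, 4, 10, "first_use")
                for p in ("deg", "dist", "deg-dist", "deg(1-x)")}
```

With x = 4, Deg-dist issues a burst of four blocks on a miss and then a single block per first use, while Deg(1-x) does the opposite and stays conservative until a prefetch has proven useful.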


Degree-distance policies

Ad1(x)

Degree on a miss = 1

Degree on the 1st use of a prefetch = f(usefulness), in [0..x]

100x → deg--

50x → deg++

[Figure: Ad1(4) timeline over blocks i to i+7, with the degree (deg) rising from 0 to 1 over time; marks prefetches, demand hits, demand misses and 1st uses of a prefetch]
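
One way to read "degree as a function of usefulness" is a small controller that raises or lowers the first-use degree at the end of each measurement interval. The sketch below is only an illustration under stated assumptions: usefulness is taken to be the fraction of issued prefetches that were later used, the thresholds (0.25 and 0.75) and the initial degree are invented for the example, and the slide's "100x" and "50x" labels are not interpreted here.

```python
class AdaptiveDegree:
    """Toy usefulness-driven degree controller in the spirit of Ad1(x)."""

    def __init__(self, max_degree, low=0.25, high=0.75, initial=1):
        self.max_degree = max_degree   # x: upper bound of the degree
        self.low = low                 # assumed threshold for deg--
        self.high = high               # assumed threshold for deg++
        self.degree = initial          # assumed starting degree

    def update(self, issued, useful):
        """Adjust the degree from one interval's prefetch counts."""
        if issued == 0:
            return self.degree         # nothing measured, keep the degree
        ratio = useful / issued
        if ratio < self.low:
            self.degree = max(0, self.degree - 1)
        elif ratio > self.high:
            self.degree = min(self.max_degree, self.degree + 1)
        return self.degree

    def degree_for(self, event):
        # The degree on a miss stays fixed at 1; only 1st uses adapt.
        return 1 if event == "miss" else self.degree


a = AdaptiveDegree(max_degree=4)
history = [a.update(10, 9), a.update(10, 9), a.update(10, 9),
           a.update(10, 9), a.update(10, 1), a.update(10, 5)]
```

Keeping the miss degree at 1 means the prefetcher keeps sampling sequential behaviour even after the first-use degree has dropped to 0.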


Degree-distance policies

Ad2(x)

Degree on a miss = 1 (both directions)

Degree on the 1st use of a prefetch = f(usefulness), in [0..x]

100x → deg--

50x → deg++

[Figure: Ad2(4) timeline with deg = 2, over blocks i-1 to i+4 and k-4 to k+1; marks prefetches, demand hits, demand misses and 1st uses of a prefetch]


Degree-distance policies

Ad3(x)

Degree on a miss = 1

Degree on the 1st use of a prefetch = f(usefulness, timeliness, pollution), in [0..x]

100x → deg--

100x pollution → deg--

50x → deg++

50x late → deg++

[Figure: Ad3(4) timeline over blocks i to i+7, with the degree (deg) rising from 1 to 2 over time; marks prefetches, demand hits, demand misses and 1st uses of a prefetch]


Degree-distance policies

Ad4(x,y)

Region in [0..y-1]

Degree on the 1st use of a prefetch = f(usefulness, region), in [0..x]

100x → deg--

50x → deg++

[Figure: Ad4(4,32) timeline with deg values of 2 and 1, over blocks i-1 to i+4 and k-4 to k+1; marks prefetches, demand hits, demand misses and 1st uses of a prefetch]
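
A per-region degree, as in Ad4(x,y), can be pictured as a small table with one degree per memory region. The sketch below is hypothetical in several respects: the mapping from block number to region (a fixed number of consecutive blocks per region, wrapping modulo y), the usefulness thresholds and the initial degree are all assumptions made for the example.

```python
class RegionDegreeTable:
    """Toy per-region degree table in the spirit of Ad4(x, y)."""

    def __init__(self, max_degree, regions, blocks_per_region=1024):
        self.max_degree = max_degree                # x
        self.regions = regions                      # y
        self.blocks_per_region = blocks_per_region  # assumed mapping
        self.degree = [1] * regions                 # assumed initial degree

    def region_of(self, block):
        # Assumed mapping: consecutive blocks share a region, wrapping at y.
        return (block // self.blocks_per_region) % self.regions

    def update(self, block, issued, useful, low=0.25, high=0.75):
        """Adjust the degree of the region that contains `block`."""
        r = self.region_of(block)
        if issued == 0:
            return self.degree[r]
        ratio = useful / issued
        if ratio < low:
            self.degree[r] = max(0, self.degree[r] - 1)
        elif ratio > high:
            self.degree[r] = min(self.max_degree, self.degree[r] + 1)
        return self.degree[r]

    def degree_for(self, block):
        return self.degree[self.region_of(block)]


t = RegionDegreeTable(max_degree=8, regions=32)
t.update(0, issued=10, useful=10)      # region 0: prefetches were useful
t.update(1024, issued=10, useful=0)    # region 1: prefetches were useless
```

The point of the table is that a hostile access pattern in one region lowers the degree only there, while regions with sequential behaviour keep prefetching aggressively.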


Degree-distance policies

Ad5(x) [Dahlgren-93]

deg = f(usefulness), in [0..x]

  • Same degree on a miss and on the 1st use of a prefetch

  • A mechanism is needed when deg == 0

[Figure: Ad5(4) timeline over blocks i to i+7, with the degree (deg) rising from 0 to 1 over time; marks prefetches, demand hits, demand misses and 1st uses of a prefetch]


Results: performance

  • SMS as reference

  • Ad policies have no losses

  • INT → deg 4 or 8

  • FP → deg 8 or 16

  • Dist & Ad5 are the worst

  • The rest are similar to Deg

  • Among Ad: INT → Ad4(8,32) (diff. 1%)

    • FP → Ad3(8) (diff. 1% - 5%)

  • Ad4(8,32) & Ad2(8) are the best on average

[Figure: performance for a) CINT and b) CFP]


Results: pressure


PAB

[Figure: Deg(4) prefetch engine, PAB (4 entries) and L2; prefetch addresses i+1 to i+5 generated from block i, some of them repeated]


PAB as a filter

[Figure: Deg(4) prefetch engine, PAB (4 entries) and L2; repeated prefetch addresses among i+1 to i+5 are filtered]

  • L2 lookups reduction:

    • 2% for Deg-dist

    • 49% for SMS (which nevertheless remains the most demanding)

    • 25%-40% for the rest

  • Performance unaffected
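
The filtering effect of the PAB can be sketched as a small queue that refuses addresses it already holds, so that overlapping Deg(4) bursts do not each cost an L2 lookup. The queue discipline (FIFO, dropping the oldest entry when full) and all names are assumptions made for this example; the presentation does not specify them.

```python
from collections import deque


class PabFilter:
    """Toy 4-entry buffer that filters duplicate prefetch addresses."""

    def __init__(self, entries=4):
        self.entries = entries
        self.pending = deque()   # prefetch addresses waiting for an L2 lookup
        self.filtered = 0        # duplicates that never reached L2

    def push(self, block):
        """Queue a prefetch address; return False if it was filtered."""
        if block in self.pending:
            self.filtered += 1
            return False
        if len(self.pending) == self.entries:
            self.pending.popleft()   # assumed policy: drop the oldest entry
        self.pending.append(block)
        return True

    def pop(self):
        """Next address to look up in L2, or None if the buffer is empty."""
        return self.pending.popleft() if self.pending else None


pab = PabFilter(entries=4)
i = 10
first_burst = [pab.push(b) for b in range(i + 1, i + 5)]    # miss on i
looked_up = pab.pop()                                       # i+1 goes to L2
second_burst = [pab.push(b) for b in range(i + 2, i + 6)]   # 1st use of i+1
```

Of the four addresses in the second burst, three are still pending from the first one and are dropped, so only i+5 adds a new L2 lookup.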


Conclusions and future work

Ways of tuning the aggressiveness of SEQT prefetchers

Ad2(8) and Ad4(8,32) perform the best

Adaptive: vary the degree according to prefetch usefulness

Ad2 prefetches forward and backward

Ad4 adjusts the degree for each of the 32 memory regions

Both equal SMS in CINT and outperform it in CFP (60% fewer lookups in L2)

Hardware cost: Ad2: 2 bits/line; Ad4: 2 bits + 64 B table; SMS: 33 KB

PAB used to reduce the pressure on L2 (25%-40%)

No losses & really low hardware cost

Future work: use a realistic on-chip memory controller


Thank you
