
A 256 Kbits L-TAGE branch predictor



A 256 Kbits L-TAGE branch predictor. André Seznec, IRISA/INRIA/HIPEAC. Directly derived from: A case for (partially) tagged branch predictors, A. Seznec and P. Michaud, JILP, Feb. 2006, plus two tricks: a loop predictor and kernel/user histories. TAGE: TAgged GEometric history length predictors.


A 256 Kbits L-TAGE branch predictor

André Seznec

IRISA/INRIA/HIPEAC



Directly derived from:

A case for (partially) tagged branch predictors,

A. Seznec and P. Michaud JILP Feb. 2006

+

Tricks:

Loop predictor

Kernel/user histories



TAGE:

TAgged GEometric history length predictors

The genesis



Back around 2003

  • 2bcgskew was state-of-the-art, but:

    • it was lagging behind neural-inspired predictors on a few benchmarks

  • The goal: get the best of both behaviors while maintaining:

    • Reasonable implementation cost:

      • Use only global history

      • Medium number of tables

    • In-time response


The basis: a multiple-length global history predictor

[Figure: tables T0 to T4, indexed with global history lengths L(0) to L(4); the function that combines their outputs is marked "?".]


GEometric History Length predictor

  • The set of history lengths forms a geometric series: {0, 2, 4, 8, 16, 32, 64, 128}

  • Captures correlation on very long histories

  • Most of the storage goes to short histories!

  • What is important: L(i) - L(i-1) increases drastically with i
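
A minimal sketch of how such a series can be generated, assuming the lengths are obtained by rounding a geometric progression between a chosen minimum and maximum. The function name and its parameters are illustrative and do not come from the slides; the length 0 in the set above belongs to the history-free table T0 and is left out here.

```python
def geometric_history_lengths(l_min: int, l_max: int, n_tables: int) -> list[int]:
    """Return n_tables history lengths growing geometrically from l_min to l_max.

    Assumption: L(i) = round(l_min * ratio**i), with the ratio chosen so that
    the last table uses exactly l_max bits of history.
    """
    if n_tables < 2:
        raise ValueError("need at least two tables")
    ratio = (l_max / l_min) ** (1.0 / (n_tables - 1))
    return [int(l_min * ratio ** i + 0.5) for i in range(n_tables)]


# The series shown on the slide (without the 0 of the base table).
lengths = geometric_history_lengths(2, 128, 7)
# Differences between consecutive lengths: this is the quantity the slide
# calls "drastically increasing".
gaps = [b - a for a, b in zip(lengths, lengths[1:])]
```

With a ratio of 2 the gaps double from one table to the next, which is why a handful of tables is enough to reach histories of several hundred bits.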



Combining multiple predictions ?

  • Classical solution:

    • Use of a meta predictor

      “wasting” storage !?!

      choosing among 5 or 10 predictions ??

  • Neural inspired predictors, Jimenez and Lin 2001

    • Use an adder tree instead of a meta-predictor

  • Partial matching

    • Use tagged tables and the longest matching history

      Chen et al 96, Michaud 2005


CBP-1 (2004): OGEHL

[Figure: tables T0 to T4, indexed with history lengths L(0) to L(4), feed an adder; Prediction = Sign of the sum.]

  • Final computation through a sum

  • 12 components: 3.670 misp/KI
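
The final computation named on this slide, a sum followed by a sign test, can be sketched as below. This is an illustration only: it assumes each table delivers a signed saturating counter and that a non-negative sum means "taken"; the function name is hypothetical.

```python
def ogehl_predict(counters: list[int]) -> bool:
    """Predict taken when the sum of the signed counters is non-negative.

    Assumption: one signed counter is read from each table, and the sign
    convention (>= 0 means taken) is chosen for this sketch.
    """
    return sum(counters) >= 0
```

No meta-predictor is involved: every table contributes to the sum, so adding a table adds one operand to the adder tree rather than one more choice to arbitrate.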


TAGE: Geometric history length + PPM-like + optimized update policy

[Figure: a tagless base predictor indexed with pc, plus tagged tables indexed with a hash of pc and h[0:L1], h[0:L2], h[0:L3]. Each tagged entry holds ctr, tag and u. A second hash of pc and history is compared (=?) with the stored tag, and the match signals drive a chain of multiplexors that selects the final prediction.]


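[Figure: lookup example in which one tagged table misses and two tagged tables hit. The hitting table with the longest history provides Pred; the other hitting table provides Altpred.]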



Prediction computation

  • General case:

    • Longest matching component provides the prediction

  • Special case:

    • Many mispredictions on newly allocated entries: weak Ctr

      On many applications, Altpred is more accurate than Pred

    • Property dynamically monitored through a single 4-bit counter
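
The selection rule above can be sketched as follows. This is a simplified illustration, not the submitted predictor: it assumes a signed 3-bit counter whose values -1 and 0 count as "weak", uses the weak state as a proxy for "newly allocated", and picks 8 as the threshold of the 4-bit monitoring counter. The names `Entry`, `predict` and `update_use_alt` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    tag: int
    ctr: int      # assumed signed 3-bit counter in [-4, 3]; >= 0 means taken
    u: int = 0    # 2-bit useful counter


def predict(base_taken: bool, hits: list[Optional[Entry]],
            use_alt_on_weak: bool) -> tuple[bool, bool]:
    """Return (final prediction, altpred).

    hits[i] is the matching entry of tagged table i (longer history for
    larger i), or None when the tag did not match.
    """
    matching = [e for e in hits if e is not None]
    if not matching:
        return base_taken, base_taken
    provider = matching[-1]                      # longest matching history
    pred = provider.ctr >= 0
    altpred = matching[-2].ctr >= 0 if len(matching) > 1 else base_taken
    if provider.ctr in (-1, 0) and use_alt_on_weak:
        return altpred, altpred                  # special case: trust Altpred
    return pred, altpred


def update_use_alt(counter: int, provider_weak: bool,
                   pred: bool, altpred: bool, taken: bool) -> int:
    """Update the single 4-bit counter that monitors the special case."""
    if provider_weak and pred != altpred:
        if altpred == taken:
            return min(15, counter + 1)
        return max(0, counter - 1)
    return counter
```

In this sketch the caller would derive `use_alt_on_weak` as `counter >= 8`, so one global counter decides for the whole predictor whether weak providers are trusted.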



TAGE update policy

  • General principle:

    Minimize the footprint of the prediction.

    • Just update the longest history matching component and allocate at most one entry on mispredictions


A tagged table entry

[Figure: entry layout with the fields U, Tag and Ctr.]

  • Ctr: 3-bit prediction counter

  • U: 2-bit useful counter

    • Was the entry recently useful ?

  • Tag: partial tag



Updating the U counter

  • If (Altpred ≠ Pred) then

    • Pred = taken (the provider was right): U = U + 1

    • Pred ≠ taken (the provider was wrong): U = U - 1

  • Graceful aging:

    • Periodic shift of all U counters

    • implemented through the reset of a single bit
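
A minimal sketch of these two rules, assuming a saturating 2-bit counter. Which bit is cleared at each aging period is not stated on the slide, so it is exposed here as a parameter; the function names are illustrative.

```python
U_MAX = 3  # 2-bit useful counter


def update_u(u: int, pred: bool, altpred: bool, taken: bool) -> int:
    """Update U only when the provider and the alternate prediction differ."""
    if pred != altpred:
        if pred == taken:
            return min(U_MAX, u + 1)   # the entry was useful
        return max(0, u - 1)           # the entry was harmful
    return u


def age_u(u: int, clear_high_bit: bool) -> int:
    """Graceful aging: reset a single bit of the counter.

    Assumption: the caller chooses which of the two bits to clear at each
    aging period (for example by alternating between them).
    """
    mask = 0b01 if clear_high_bit else 0b10
    return u & mask
```

Clearing one bit column across a whole table is much cheaper in hardware than decrementing every counter, which is presumably why the slide stresses it.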



Allocating a new entry on a misprediction

  • Find a single “useless” entry with a longer history:

    • Privilege the smallest possible history

      • To minimize footprint

    • But not too much

      • To avoid ping-pong phenomena

  • Initialize Ctr as weak and U as zero
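
One way to realize "the smallest possible history, but not too much" is a randomized choice biased toward the shortest candidate. The sketch below assumes a probability of 1/2 at each step and a signed counter whose weak states are 0 (taken) and -1 (not taken); both are choices made for this illustration, and the function names are hypothetical.

```python
import random
from typing import Optional


def choose_allocation(u_values: list[int], provider: int,
                      rng: random.Random) -> Optional[int]:
    """Pick the tagged table in which to allocate after a misprediction.

    u_values[i] is the U counter of the entry selected in tagged table i;
    provider is the index of the providing table (-1 for the base predictor).
    Only tables with a longer history and a "useless" entry (U == 0) qualify.
    """
    candidates = [i for i in range(provider + 1, len(u_values))
                  if u_values[i] == 0]
    if not candidates:
        return None
    for i in candidates[:-1]:
        if rng.random() < 0.5:     # favor short histories, but not always
            return i
    return candidates[-1]


def weak_counter(taken: bool) -> int:
    """Initial Ctr value of a newly allocated entry (U starts at zero)."""
    return 0 if taken else -1
```

The random skip keeps two branches that compete for the same entry from evicting each other on every misprediction, which is the ping-pong phenomenon mentioned above.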



Improve the global history

  • Address + conditional branch history:

    • path confusion on short histories 

  • Address + path:

    • Direct hashing leads to path confusion 

  • Represent all branches in branch history

  • Also use path history (1 bit per branch, limited to 16 bits)
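
A sketch of the two histories as shift registers. The 16-bit limit on the path history comes from the slide; taking the lowest bit of the branch address as the path bit, and the 640-bit cap on the global history, are assumptions of this illustration.

```python
PATH_BITS = 16  # path history is limited to 16 bits (from the slide)


def update_histories(ghist: int, phist: int, pc: int, taken: bool,
                     max_len: int = 640) -> tuple[int, int]:
    """Shift one branch into the global history and the path history.

    Called for every branch, not only conditional ones, so that all
    branches are represented in the branch history.
    Assumption: the path bit is the lowest bit of the branch address.
    """
    ghist = ((ghist << 1) | int(taken)) & ((1 << max_len) - 1)
    phist = ((phist << 1) | (pc & 1)) & ((1 << PATH_BITS) - 1)
    return ghist, phist
```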



Design tradeoff for CBP2 (1)

  • 13 components:

    • Give the best accuracy on the distributed traces

      • 8 components are not far behind!

  • History length:

    • Min = 4, Max = 640

      Could use any Min in [2, 6] and any Max in [300, 2000]



Design tradeoff for CBP2 (2)

  • Tag width tradeoff:

    • (destructive) false match is better tolerated on shorter history

    • 7 bits on T1 to 15 bits on T12

  • Tuning the number of table entries:

    • Smaller number for very long histories

    • Smaller number for very short histories



Adding a loop predictor

  • The loop predictor captures the number of iterations of a loop

    • When a loop has executed the same number of iterations 4 times in a row, the loop predictor provides the prediction.

  • Advantages:

    • Very reliable

    • Small storage budget: 256 52-bit entries

  • Complexity ?

    • Might be difficult to manage speculative iteration numbers on deep pipelines
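
The confidence mechanism can be sketched with a single entry. The layout of the actual 52-bit entry is not given on the slides, so the fields below are illustrative; the sketch also assumes the loop branch is taken on every iteration and not taken on exit, and it ignores speculative updates.

```python
from dataclasses import dataclass
from typing import Optional

CONF_THRESHOLD = 4  # same iteration count seen 4 times in a row (slide)


@dataclass
class LoopEntry:
    trip_count: int = 0   # taken iterations in the last complete execution
    current: int = 0      # taken iterations so far in this execution
    confidence: int = 0   # consecutive executions with the same count

    def predict(self) -> Optional[bool]:
        """Return a prediction, or None when the entry is not confident."""
        if self.confidence < CONF_THRESHOLD:
            return None
        return self.current < self.trip_count

    def update(self, taken: bool) -> None:
        if taken:
            self.current += 1
            return
        # Loop exit: compare this execution with the recorded trip count.
        if self.current == self.trip_count:
            self.confidence = min(CONF_THRESHOLD, self.confidence + 1)
        else:
            self.trip_count = self.current
            self.confidence = 1
        self.current = 0
```

When `predict()` returns None, the prediction would fall back to the TAGE predictor, which matches the role of the loop predictor as an add-on.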



Using a kernel history and a user history

  • Traces mix user and kernel activities:

    • Kernel activity after exception

      • Global history pollution

  • Solution: use two separate global histories

    • User history is updated only in user mode

    • Kernel history is updated in both modes
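
A sketch of the two histories, following the update rules above. Selecting the history by the current mode at prediction time is an assumption of this illustration, as are the class name and the default history length.

```python
class DualHistory:
    """Separate global histories for user and kernel mode."""

    def __init__(self, max_len: int = 640) -> None:
        self.mask = (1 << max_len) - 1
        self.user = 0
        self.kernel = 0

    def update(self, taken: bool, kernel_mode: bool) -> None:
        bit = int(taken)
        # The kernel history is updated in both modes.
        self.kernel = ((self.kernel << 1) | bit) & self.mask
        # The user history is updated only in user mode, so kernel
        # activity after an exception does not pollute it.
        if not kernel_mode:
            self.user = ((self.user << 1) | bit) & self.mask

    def current(self, kernel_mode: bool) -> int:
        """History used for prediction (assumed to follow the mode)."""
        return self.kernel if kernel_mode else self.user
```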



L-TAGE submission accuracy (distributed traces)

3.314 misp/KI



Reducing L-TAGE complexity

  • The included 241.5 Kbits TAGE predictor alone:

    • 3.368 misp/KI

    • Loop predictor beneficial only on gzip:

      Might not be worth the extra complexity


Using fewer tables

  • 8 components 256 Kbits TAGE predictor:

    • 3.446 misp/KI



TAGE prediction computation time ?

  • 3 successive steps:

    • Index computation

    • Table read

    • Partial match + multiplexor

  • Does not fit in a single cycle:

    • But can be ahead pipelined !
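
The three steps can be sketched for one tagged table as below. The hash functions of the actual predictor are not given on the slides; XOR-folding the history and XOR-ing it with the address is an assumption of this sketch, and `fold` and `lookup` are hypothetical names.

```python
from typing import Optional


def fold(history: int, length: int, width: int) -> int:
    """XOR-fold the `length` most recent history bits into `width` bits."""
    h = history & ((1 << length) - 1)
    folded = 0
    while h:
        folded ^= h & ((1 << width) - 1)
        h >>= width
    return folded


def lookup(table: list[tuple[int, int]], pc: int, history: int, length: int,
           index_bits: int, tag_bits: int) -> Optional[bool]:
    """Return the table's prediction, or None when the tag does not match."""
    # Step 1: index (and tag) computation from pc and folded history.
    index = (pc ^ fold(history, length, index_bits)) & ((1 << index_bits) - 1)
    tag = (pc ^ fold(history, length, tag_bits)) & ((1 << tag_bits) - 1)
    # Step 2: table read; each entry is (stored_tag, signed ctr) here.
    stored_tag, ctr = table[index]
    # Step 3: partial match; the result feeds the multiplexor chain.
    return ctr >= 0 if stored_tag == tag else None
```

Each step depends on the previous one, which illustrates why the whole lookup is hard to fit in one cycle.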



Ahead pipelining a global history branch predictor (principle)

  • Initiate branch prediction X+1 cycles in advance to provide the prediction in time

    • Use information available:

      • X-block ahead instruction address

      • X-block ahead history

  • To ensure accuracy:

    • Use intermediate path information


Practice

[Figure: ahead-pipelining example on successive blocks A, B and C, with the labels Ha, A and bc.]

  • Ahead pipelined TAGE: 4 parallel prediction computations



3-branch ahead pipelined 8 component 256 Kbits TAGE

3.552 misp/KI



A final case for the Geometric History Length predictors

  • delivers state-of-the-art accuracy

  • uses only global information:

    • Very long history: 300+ bits !!

  • can be ahead pipelined

  • many effective design points

    • OGEHL or TAGE 

    • Number of tables, history lengths



The End 
