Presentation Transcript


A 256 Kbits L-TAGE branch predictor

André Seznec

IRISA/INRIA/HIPEAC


Directly derived from:

A case for (partially) tagged branch predictors,

A. Seznec and P. Michaud JILP Feb. 2006

+

Tricks:

Loop predictor

Kernel/user histories


TAGE:

TAgged GEometric history length predictors

The genesis


Back around 2003

  • 2bcgskew was state-of-the-art, but:

    • it was lagging behind neural-inspired predictors on a few benchmarks

  • Just wanted to get the best of both behaviors while maintaining:

    • Reasonable implementation cost:

      • Use only global history

      • Medium number of tables

    • In-time response


The basis: a multiple-length global history predictor

[Figure: tables T0..T4, each indexed with the global history truncated to a different length L(0)..L(4); how to combine their outputs (?) is the open question.]

GEometric History Length predictor

The set of history lengths forms a geometric series, e.g. {0, 2, 4, 8, 16, 32, 64, 128}: correlation is captured on very long histories, while most of the storage is devoted to short histories.

What is important: L(i) - L(i-1) is drastically increasing.
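
As a small illustration of how such a series can be generated, here is a sketch in C that derives N history lengths between a chosen minimum and maximum; the constants (12 tagged tables, lengths 4 and 640) echo the CBP2 numbers given later but are otherwise illustrative.

    /* Sketch: deriving a geometric series of history lengths.
       Assumes N tagged tables with L(1) = Lmin and L(N) = Lmax;
       the common ratio alpha follows from L(i) = Lmin * alpha^(i-1). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const int N = 12;          /* number of tagged components (illustrative) */
        const double Lmin = 4.0;   /* shortest non-zero history length */
        const double Lmax = 640.0; /* longest history length */

        double alpha = pow(Lmax / Lmin, 1.0 / (N - 1));   /* common ratio */

        printf("L(0) = 0\n");                             /* tagless base: no history */
        for (int i = 1; i <= N; i++)
            printf("L(%d) = %d\n", i, (int)(Lmin * pow(alpha, i - 1) + 0.5));
        return 0;
    }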


Combining multiple predictions ?

  • Classical solution:

    • Use of a meta predictor

      “wasting” storage !?!

      choosing among 5 or 10 predictions ??

  • Neural inspired predictors, Jimenez and Lin 2001

    • Use an adder tree instead of a meta-predictor

  • Partial matching

    • Use tagged tables and the longest matching history

      Chen et al 96, Michaud 2005


CBP-1 (2004): OGEHL

[Figure: tables T0..T4 indexed with the global history lengths L(0)..L(4); the final computation is a sum of the counters read from the tables, and the prediction is the sign of that sum.]

12 components: 3.670 misp/KI
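
For concreteness, a minimal sketch of the "sum then sign" computation; the centering term and the counter values used below are illustrative simplifications, not the exact O-GEHL computation.

    /* Sketch: the prediction is the sign of the sum of the signed counters
       read from the M components (indexing and counter widths omitted). */
    #include <stdio.h>

    static int gehl_predict(const int *ctr, int m)
    {
        int sum = m / 2;            /* centering term for two's-complement counters (assumption) */
        for (int i = 0; i < m; i++)
            sum += ctr[i];          /* ctr[i]: counter selected by pc and h[0:L(i)] */
        return sum >= 0;            /* 1 = predict taken, 0 = predict not taken */
    }

    int main(void)
    {
        int counters[5] = { -2, 1, 3, -1, 0 };   /* counters read from 5 components */
        printf("%s\n", gehl_predict(counters, 5) ? "taken" : "not taken");
        return 0;
    }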


TAGE: geometric history length + PPM-like + optimized update policy

[Figure: a tagless base predictor plus tagged components indexed by hashes of the pc and of the histories h[0:L1], h[0:L2], h[0:L3]; each tagged entry holds a partial tag, a prediction counter (ctr) and a useful counter (u); tag comparisons (=?) drive the prediction selection.]
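
The hash boxes in the figure must compress a history of up to several hundred bits into an index of a few bits. Below is a minimal sketch of one possible folding scheme, assuming the global history is kept in an array of 64-bit words; the folding and the pc mixing are illustrative choices, not the reference hash functions. The partial tag would be computed with a second, differently folded hash of the same inputs.

    #include <stdint.h>

    /* Fold the first 'length' history bits down to 'bits' bits by XOR. */
    static uint32_t fold_history(const uint64_t *ghist, int length, int bits)
    {
        uint32_t folded = 0;
        for (int b = 0; b < length; b++) {
            uint32_t bit = (ghist[b >> 6] >> (b & 63)) & 1;   /* b-th history bit */
            folded ^= bit << (b % bits);                      /* fold modulo 'bits' */
        }
        return folded;
    }

    /* Index of a component: pc XOR folded h[0:L(i)], truncated to the table size. */
    static uint32_t table_index(uint64_t pc, const uint64_t *ghist, int length, int logsize)
    {
        return ((uint32_t)(pc >> 2) ^ fold_history(ghist, length, logsize))
               & ((1u << logsize) - 1);
    }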


[Figure: prediction selection. The longest-history component whose tag matches (Hit) provides Pred; the next matching component, or the base predictor, provides Altpred; components that miss are ignored.]

Prediction computation

  • General case:

    • Longest matching component provides the prediction

  • Special case:

    • Many mispredictions on newly allocated entries: weak Ctr

      On many applications, Altpred is more accurate than Pred

    • Property dynamically monitored through a single 4-bit counter
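
A minimal sketch of this selection rule, with simplified inputs: hit[i] and ctr[i] stand for the tag-match result and the signed 3-bit counter of tagged component i, base_pred for the tagless base prediction, and use_alt_on_na for the 4-bit monitoring counter; all names are illustrative.

    /* Longest matching component gives Pred, next match (or the base
       predictor) gives Altpred; Altpred overrides a weak provider entry
       when the 4-bit monitoring counter says it is usually better. */
    static int tage_select(const int *hit, const int *ctr, int ntables,
                           int base_pred, int use_alt_on_na)
    {
        int provider = -1, alt = -1;
        for (int i = ntables - 1; i >= 0; i--) {     /* longest history first */
            if (hit[i]) {
                if (provider < 0) provider = i;
                else { alt = i; break; }
            }
        }
        int pred    = (provider >= 0) ? (ctr[provider] >= 0) : base_pred;
        int altpred = (alt      >= 0) ? (ctr[alt]      >= 0) : base_pred;

        /* weak counter (0 or -1 for a 3-bit signed counter) on the provider,
           typical of a newly allocated entry */
        if (provider >= 0 && (ctr[provider] == 0 || ctr[provider] == -1)
            && use_alt_on_na >= 8)                   /* 4-bit counter, mid-point threshold */
            return altpred;
        return pred;
    }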


TAGE update policy

  • General principle:

    Minimize the footprint of the prediction.

    • Just update the longest history matching component and allocate at most one entry on mispredictions


A tagged table entry

[Entry fields: Tag | Ctr | U]

  • Ctr: 3-bit prediction counter

  • U: 2-bit useful counter

    • Was the entry recently useful ?

  • Tag: partial tag
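
As a data-structure sketch, assuming the field widths above (the hardware packs these fields far more densely than C types, and the tag width varies per table):

    #include <stdint.h>

    /* One entry of a tagged TAGE component, following the fields above. */
    typedef struct {
        int8_t   ctr;   /* 3-bit signed prediction counter, -4..3 (sign = direction) */
        uint8_t  u;     /* 2-bit useful counter, 0..3 */
        uint16_t tag;   /* partial tag (7 to 15 bits depending on the table) */
    } tage_entry_t;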


Updating the U counter

  • If (Altpred ≠ Pred) then

    • Pred = outcome : U = U + 1

    • Pred ≠ outcome : U = U - 1

  • Graceful aging:

    • Periodic shift of all U counters

    • implemented through the reset of a single bit
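
A minimal sketch of this update, assuming Pred is judged against the actual branch outcome; the alternating-bit aging is an assumed reading of "reset of a single bit".

    #include <stdint.h>

    /* U changes only when Altpred and Pred disagree: the provider is
       rewarded if Pred was right, punished otherwise (2-bit saturating). */
    static void update_u(uint8_t *u, int pred, int altpred, int outcome)
    {
        if (pred != altpred) {
            if (pred == outcome) { if (*u < 3) (*u)++; }
            else                 { if (*u > 0) (*u)--; }
        }
    }

    /* Graceful aging: periodically clear one bit of every U counter,
       alternating between the high and the low bit (assumed scheme). */
    static void age_u(uint8_t *u_array, int n, int clear_high_bit)
    {
        for (int i = 0; i < n; i++)
            u_array[i] &= clear_high_bit ? 0x1 : 0x2;
    }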


Allocating a new entry on a misprediction

  • Find a single “useless” entry with a longer history:

    • Privilege the smallest possible history

      • To minimize footprint

    • But not too much

      • To avoid ping-pong phenomena

  • Initialize Ctr as weak and U as zero
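
A minimal sketch of the victim choice, working directly on the U counters of the components that use a longer history than the provider; the coin-flip preference for shorter histories is a simplification of the actual allocation rule, and rand() stands in for whatever pseudo-random source the hardware would use.

    #include <stdlib.h>

    /* u[i]: useful counter of the candidate entry in component i (larger i
       = longer history). Returns the component to allocate into, or -1. */
    static int choose_victim(const unsigned char *u, int ntables, int provider)
    {
        /* favour the shortest free candidate, but only probabilistically,
           to avoid ping-pong between two components */
        for (int i = provider + 1; i < ntables; i++)
            if (u[i] == 0 && (rand() & 1))
                return i;
        for (int i = provider + 1; i < ntables; i++)
            if (u[i] == 0)
                return i;
        return -1;   /* no useless entry found */
    }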


Improve the global history

  • Address + conditional branch history:

    • path confusion on short histories 

  • Address + path:

    • Direct hashing leads to path confusion 

  • Represent all branches in branch history

  • Also use path history (1 bit per branch, limited to 16 bits)
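
A minimal sketch of these history updates, assuming the long global history is stored as 64-bit words; hist_bit stands for the direction bit (or a target-derived bit for non-conditional branches), and the exact address bit pushed into the path history is an assumption.

    #include <stdint.h>

    #define GHIST_WORDS 10                 /* 640 bits of global history */

    static uint64_t ghist[GHIST_WORDS];    /* all branches push one bit here */
    static uint16_t phist;                 /* path history, 1 bit per branch, 16 bits */

    static void update_histories(uint64_t branch_pc, int hist_bit)
    {
        /* shift the whole global history left by one, insert the new bit */
        for (int i = GHIST_WORDS - 1; i > 0; i--)
            ghist[i] = (ghist[i] << 1) | (ghist[i - 1] >> 63);
        ghist[0] = (ghist[0] << 1) | (uint64_t)(hist_bit & 1);

        /* path history: one low-order address bit per branch */
        phist = (uint16_t)((phist << 1) | ((branch_pc >> 2) & 1));
    }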


Design tradeoff for CBP2 (1)

  • 13 components:

    • Bring the best accuracy on distributed traces

      • 8 components not very far !

  • History length:

    • Min=4 , Max = 640

      Could use any Min in [2,6] and any Max in [300, 2000]


Design tradeoff for CBP2 (2)

  • Tag width tradeoff:

    • (destructive) false match is better tolerated on shorter history

    • 7 bits on T1 to 15 bits on T12

  • Tuning the number of table entries:

    • Smaller number for very long histories

    • Smaller number for very short histories


Adding a loop predictor

  • The loop predictor captures the number of iterations of a loop

    • When the same number of iterations has been observed 4 consecutive times, the loop predictor provides the prediction.

  • Advantages:

    • Very reliable

    • Small storage budget: 256 52-bit entries

  • Complexity ?

    • Might be difficult to manage speculative iteration numbers on deep pipelines
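
A minimal sketch of the loop-predictor idea described above; the field widths and the exact 52-bit entry layout of the submission are not reproduced, and the names are illustrative.

    #include <stdint.h>

    typedef struct {
        uint16_t past_iter;     /* trip count observed on the last completed execution */
        uint16_t current_iter;  /* iterations seen so far in the current execution */
        uint16_t tag;           /* partial tag identifying the loop branch */
        uint8_t  conf;          /* consecutive executions with the same trip count */
    } loop_entry_t;

    /* Returns 1 (taken) / 0 (not taken) when confident, -1 to abstain. */
    static int loop_predict(const loop_entry_t *e)
    {
        if (e->conf < 4)                           /* need 4 identical trip counts */
            return -1;
        return e->current_iter < e->past_iter;     /* predict the exit on the last iteration */
    }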


Using a kernel history and a user history

  • Traces mix user and kernel activities:

    • Kernel activity after exception

      • Global history pollution

  • Solution: use two separate global histories

    • User history is updated only in user mode

    • Kernel history is updated in both modes
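
A minimal sketch of the two-history scheme, shown on a single 64-bit word for brevity (the real histories are much longer); names are illustrative.

    #include <stdint.h>

    static uint64_t user_hist, kernel_hist;

    static void push_history_bit(int in_kernel_mode, int bit)
    {
        kernel_hist = (kernel_hist << 1) | (uint64_t)(bit & 1);  /* updated in both modes */
        if (!in_kernel_mode)
            user_hist = (user_hist << 1) | (uint64_t)(bit & 1);  /* only in user mode */
    }

    static uint64_t active_history(int in_kernel_mode)
    {
        return in_kernel_mode ? kernel_hist : user_hist;
    }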


L-TAGE submission accuracy (distributed traces)

3.314 misp/KI


Reducing L-TAGE complexity

  • The included 241.5 Kbits TAGE predictor:

    • 3.368 misp/KI

    • Loop predictor beneficial only on gzip:

      Might not be worth the extra complexity


Using fewer tables

  • 8 components 256 Kbits TAGE predictor:

    • 3.446 misp/KI


TAGE prediction computation time ?

  • 3 successive steps:

    • Index computation

    • Table read

    • Partial match + multiplexor

  • Does not fit on a single cycle:

    • But can be ahead pipelined !


Ahead pipelining a global history branch predictor (principle)

  • Initiate branch prediction X+1 cycles in advance to provide the prediction in time

    • Use information available:

      • X-block ahead instruction address

      • X-block ahead history

  • To ensure accuracy:

    • Use intermediate path information
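
A minimal sketch of the final selection step, assuming the 4 parallel computations mentioned on the next slide are distinguished by 2 bits of the intermediate path; this is an illustrative simplification of the actual ahead-pipelined datapath.

    /* The 4 candidates were computed ahead of time from the X-block-ahead
       pc and history; the 2 intermediate path bits pick the one matching
       the path actually followed. */
    static int ahead_select(const int preds[4], unsigned intermediate_path_bits)
    {
        return preds[intermediate_path_bits & 3];
    }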


Practice

Ahead pipelined TAGE: 4 parallel prediction computations

[Figure: the tables are read ahead of time with the address of block A and the ahead history Ha; the intermediate path through blocks B and C selects the final prediction bc.]

3-branch ahead pipelined 8 component 256 Kbits TAGE

3.552 misp/KI


A final case for the Geometric History Length predictors

  • delivers state-of-the-art accuracy

  • uses only global information:

    • Very long history: 300+ bits !!

  • can be ahead pipelined

  • many effective design points

    • OGEHL or TAGE 

    • Number of tables, history lengths


The End 

