
A 256 Kbits L-TAGE branch predictor

André Seznec

IRISA/INRIA/HIPEAC

Directly derived from:

A case for (partially) tagged branch predictors,

A. Seznec and P. Michaud, JILP, Feb. 2006

+

Tricks:

Loop predictor

Kernel/user histories

TAGE:

TAgged GEometric history length predictors

The genesis

Back around 2003
  • 2bcgskew was state-of-the-art, but:
    • it was lagging behind neural inspired predictors on a few benchmarks
  • The goal was to get the best of both behaviors while maintaining:
    • Reasonable implementation cost:
      • Use only global history
      • Medium number of tables
    • In-time response
GEometric History Length predictor
  • The set of history lengths forms a geometric series, e.g. {0, 2, 4, 8, 16, 32, 64, 128}
  • Correlation can be captured on very long histories, while most of the storage is devoted to short histories
  • What is important: L(i) - L(i-1) is drastically increasing
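As an illustration of this series, here is a minimal sketch (not the predictor's own code) that generates geometric history lengths; the function name and the formula L(i) = round(minLen * alpha^(i-1)) are assumptions chosen to reproduce the example set above and to show how the gaps L(i) - L(i-1) keep growing.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative only: build a geometric series of history lengths,
// with L(0) = 0 for the tagless base component.
std::vector<int> geometricLengths(int numTagged, int minLen, double alpha) {
    std::vector<int> L(numTagged + 1, 0);
    for (int i = 1; i <= numTagged; i++)
        L[i] = (int)std::lround(minLen * std::pow(alpha, i - 1));
    return L;
}

int main() {
    // minLen = 2 and alpha = 2 reproduce {0, 2, 4, 8, 16, 32, 64, 128}.
    std::vector<int> L = geometricLengths(7, 2, 2.0);
    for (size_t i = 1; i < L.size(); i++)
        std::printf("L(%zu) = %d   L(%zu)-L(%zu) = %d\n",
                    i, L[i], i, i - 1, L[i] - L[i - 1]);
    return 0;
}
```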

Combining multiple predictions ?
  • Classical solution:
    • Use of a meta predictor

“wasting” storage !?!

choosing among 5 or 10 predictions ??

  • Neural inspired predictors, Jimenez and Lin 2001
    • Use an adder tree instead of a meta-predictor
  • Partial matching
    • Use tagged tables and the longest matching history

Chen et al 96, Michaud 2005

CBP-1 (2004): OGEHL

[Figure: the OGEHL predictor, with tables T0 to T4 indexed using history lengths L(0) to L(4). The final computation is a sum of the selected counters; the prediction is the sign of the sum.]

12 components: 3.670 misp/KI
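As a reminder of what "final computation through a sum" means, a minimal sketch of the adder-tree decision: one signed counter is read from each component and the prediction is the sign of the sum. The function and its flat vector interface are illustrative assumptions, not the CBP-1 submission's code.

```cpp
#include <vector>

// Illustrative OGEHL-style decision: sum the signed counters selected in
// tables T0..Tn and predict taken when the sum is non-negative.
bool ogehlPredict(const std::vector<int>& selectedCounters) {
    int sum = 0;
    for (int c : selectedCounters)
        sum += c;
    return sum >= 0;   // prediction = sign of the sum
}
```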

TAGE: Geometric history length + PPM-like + optimized update policy

[Figure: the TAGE predictor, a tagless base predictor plus tagged components indexed by hashing the PC with increasing slices of the global history h[0:L1], h[0:L2], h[0:L3], ... Each tagged entry holds a partial tag, a prediction counter (ctr) and a useful counter (u); tag comparisons (=?) select the matching component that provides the prediction.]
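A hedged sketch of the lookup implied by the figure: a tagless base table plus tagged components indexed by hashing the PC with ever longer history slices, the longest tag match winning. Entry fields follow the slides (ctr, u, partial tag); the hash functions, table sizes and the 64-bit truncated history are placeholders, not the submitted predictor's code.

```cpp
#include <cstdint>
#include <vector>

// Fields as described on the slides: partial tag, 3-bit signed ctr, 2-bit u.
struct TaggedEntry {
    uint16_t tag = 0;
    int8_t   ctr = 0;   // prediction counter, taken when >= 0
    uint8_t  u   = 0;   // useful counter
};

struct TageSketch {
    std::vector<int8_t> base;                      // tagless base predictor
    std::vector<std::vector<TaggedEntry>> tables;  // T1..Tn, increasing history length
    std::vector<int> histLen;                      // histLen[i] = L(i+1)
    uint64_t ghist = 0;                            // global history, truncated to 64 bits here

    // Placeholder hashes: a real implementation folds h[0:L(i)] into index and tag.
    size_t index(uint64_t pc, int i) const {
        uint64_t mask = (histLen[i] >= 64) ? ~0ULL : ((1ULL << histLen[i]) - 1);
        uint64_t h = ghist & mask;
        return (size_t)((pc ^ h ^ (h >> 13)) % tables[i].size());
    }
    uint16_t tagHash(uint64_t pc, int i) const {
        return (uint16_t)((pc >> 2) ^ (ghist * (uint64_t)(i + 1)));
    }

    // The longest matching tagged component provides the prediction,
    // otherwise the tagless base predictor does.
    bool predict(uint64_t pc) const {
        for (int i = (int)tables.size() - 1; i >= 0; i--) {
            const TaggedEntry& e = tables[i][index(pc, i)];
            if (e.tag == tagHash(pc, i))
                return e.ctr >= 0;
        }
        return base[pc % base.size()] >= 0;
    }
};
```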

[Figure: selecting the prediction. Tag comparisons (=?) are performed on all tagged components; the longest matching component provides the prediction (Pred) and the next longest matching component provides the alternate prediction (Altpred); components that miss are ignored.]

Prediction computation
  • General case:
    • Longest matching component provides the prediction
  • Special case:
    • Many mispredictions on newly allocated entries: weak Ctr

On many applications, Altpred is more accurate than Pred

    • Property dynamically monitored through a single 4-bit counter
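A hedged sketch of this rule: the provider gives Pred, the next matching component (or the base predictor) gives Altpred, and a single signed counter decides whether to trust Altpred when the provider entry looks newly allocated. The "weak ctr and u == 0" test, the counter name and its 4-bit range are illustrative choices in the spirit of the slide, not the actual code.

```cpp
#include <cstdlib>

// Illustrative final choice between Pred (longest match) and Altpred.
struct AltSelector {
    int useAltOnNA = 0;   // single signed 4-bit counter, range [-8, 7]

    // Weak 3-bit counter (0 or -1) with u == 0: the entry looks newly allocated.
    static bool newlyAllocated(int providerCtr, int providerU) {
        return providerU == 0 && std::abs(2 * providerCtr + 1) == 1;
    }

    bool choose(bool pred, bool altpred, int providerCtr, int providerU) const {
        if (newlyAllocated(providerCtr, providerU) && useAltOnNA >= 0)
            return altpred;   // Altpred has been the better bet on this workload
        return pred;          // general case: longest matching component wins
    }

    // Train the counter only when the two predictions actually differ
    // and the provider entry was newly allocated.
    void update(bool pred, bool altpred, bool taken, bool wasNewlyAllocated) {
        if (!wasNewlyAllocated || pred == altpred) return;
        if (altpred == taken) { if (useAltOnNA < 7)  useAltOnNA++; }
        else                  { if (useAltOnNA > -8) useAltOnNA--; }
    }
};
```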
TAGE update policy
  • General principle:

Minimize the footprint of the prediction.

    • Just update the longest history matching component and allocate at most one entry on mispredictions
A tagged table entry
  • Ctr: 3-bit prediction counter
  • U: 2-bit useful counter
    • Was the entry recently useful ?
  • Tag: partial tag
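For a sense of scale, a tiny worked example of the storage cost these three fields imply per entry; the 12-bit tag width and the 1024-entry table are assumptions for illustration only (the actual tag widths vary per table, as discussed later).

```cpp
#include <cstdio>

// Per-entry cost of a tagged table: ctr (3 bits) + u (2 bits) + partial tag.
int main() {
    int tagBits = 12;    // assumed tag width
    int entries = 1024;  // assumed number of entries
    int bitsPerEntry = 3 + 2 + tagBits;
    std::printf("bits/entry = %d, table size = %d Kbits\n",
                bitsPerEntry, bitsPerEntry * entries / 1024);  // 17 bits, 17 Kbits
    return 0;
}
```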
Updating the U counter
  • If (Altpred ≠ Pred) then
    • Pred correct (Pred = actual branch outcome): U = U + 1
    • Pred incorrect: U = U - 1
  • Graceful aging:
    • Periodic shift of all U counters
    • implemented through the reset of a single bit
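A minimal sketch of both mechanisms, assuming the U update applies to the provider entry and that aging periodically clears one of the two bits of every U counter (alternating which bit); the period and the exact schedule are assumptions.

```cpp
#include <cstdint>
#include <vector>

// U update for the provider entry: only touched when Altpred differs from Pred.
void updateUseful(uint8_t& u, bool pred, bool altpred, bool taken) {
    if (pred == altpred) return;
    if (pred == taken) { if (u < 3) u++; }   // the entry was recently useful
    else               { if (u > 0) u--; }
}

// Graceful aging, run at an assumed periodic interval: reset a single bit
// of every U counter (alternately the high bit, then the low bit).
void ageUsefulCounters(std::vector<uint8_t>& uCounters, bool resetHighBit) {
    for (uint8_t& u : uCounters)
        u &= (resetHighBit ? 0x1 : 0x2);
}
```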
Allocating a new entry on a misprediction
  • Find a single “useless” entry with a longer history:
    • Privilege the smallest possible history
      • To minimize footprint
    • But not too much
      • To avoid ping-pong phenomena
  • Initialize Ctr as weak and U as zero
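A hedged sketch of this step; the candidate entries are those read in the components using a longer history than the provider, ordered shortest history first, and the occasional random skip is one plausible way to avoid ping-pong. All names and the interface are assumptions.

```cpp
#include <cstdlib>
#include <vector>

struct Entry { unsigned tag; int ctr; unsigned u; };  // 3-bit ctr, 2-bit u, partial tag

// Allocate at most one entry among the components with a longer history than
// the provider (candidates ordered shortest history first).
void allocateOnMispredict(std::vector<Entry*>& longerHistCandidates,
                          unsigned newTag, bool taken) {
    // Occasionally skip the very first candidate to avoid ping-pong between
    // two branches fighting for the same entry.
    size_t start = ((std::rand() & 3) == 0) ? 1 : 0;
    for (size_t i = start; i < longerHistCandidates.size(); i++) {
        Entry* e = longerHistCandidates[i];
        if (e->u == 0) {                 // "useless" entry found
            e->tag = newTag;
            e->ctr = taken ? 0 : -1;     // initialize Ctr as weak
            e->u   = 0;                  // and U as zero
            return;                      // at most one allocation
        }
    }
    // If every candidate is useful, a real predictor ages some u counters instead.
}
```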
Improve the global history
  • Address + conditional branch history:
    • path confusion on short histories 
  • Address + path:
    • Direct hashing leads to path confusion 
  • Represent all branches in branch history
  • Also use path history (1 bit per branch, limited to 16 bits)
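A minimal sketch of such a history update, with a 64-bit word standing in for the several-hundred-bit global history and an arbitrary choice of which address bit feeds the 16-bit path history; both choices are assumptions for illustration.

```cpp
#include <cstdint>

struct Histories {
    uint64_t ghist = 0;   // global history: every branch is represented
    uint16_t phist = 0;   // path history: 1 bit per branch, limited to 16 bits

    void update(uint64_t pc, bool takenOrTarget) {
        ghist = (ghist << 1) | (takenOrTarget ? 1 : 0);
        // The cast keeps only 16 bits of path history; using bit 2 of the PC is arbitrary.
        phist = (uint16_t)((phist << 1) | ((pc >> 2) & 1));
    }
};
```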
Design tradeoff for CBP2 (1)
  • 13 components:
    • Bring the best accuracy on distributed traces
      • 8 components not very far !
  • History length:
    • Min = 4, Max = 640

Could use any Min in [2,6] and any Max in [300, 2000]

Design tradeoff for CBP2 (2)
  • Tag width tradeoff:
    • (destructive) false match is better tolerated on shorter history
    • 7 bits on T1 to 15 bits on T12
  • Tuning the number of table entries:
    • Smaller number for very long histories
    • Smaller number for very short histories
Adding a loop predictor
  • The loop predictor captures the number of iterations of a loop
    • When the same number of iterations has been observed 4 times in succession, the loop predictor provides the prediction.
  • Advantages:
    • Very reliable
    • Small storage budget: 256 52-bit entries
  • Complexity ?
    • Might be difficult to manage speculative iteration numbers on deep pipelines
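A minimal sketch of a loop-predictor entry in this spirit; the confidence threshold of 4 identical trip counts comes from the slide, while field widths, tagging, replacement and speculative-update handling are omitted or assumed.

```cpp
#include <cstdint>

// One loop predictor entry: predicts the loop branch taken until the
// recorded trip count is reached, once that count has been seen 4 times in a row.
struct LoopEntry {
    uint16_t tripCount  = 0;   // last observed number of taken iterations
    uint16_t iterCount  = 0;   // taken iterations seen in the current execution
    uint8_t  confidence = 0;   // consecutive executions with the same trip count

    bool confident() const { return confidence >= 4; }
    bool predict()   const { return iterCount < tripCount; }  // not taken on loop exit

    void update(bool taken) {
        if (taken) { iterCount++; return; }
        // Loop exit: check whether the trip count repeated.
        if (iterCount == tripCount) { if (confidence < 4) confidence++; }
        else                        { tripCount = iterCount; confidence = 0; }
        iterCount = 0;
    }
};
```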
Using a kernel history and a user history
  • Traces mix user and kernel activities:
    • Kernel activity after exception
      • Global history pollution
  • Solution: use two separate global histories
    • User history is updated only in user mode
    • Kernel history is updated in both modes
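A minimal sketch of the two-history scheme, assuming a simple mode flag; the history width and update details are placeholders.

```cpp
#include <cstdint>

struct DualHistory {
    uint64_t userHist   = 0;
    uint64_t kernelHist = 0;

    void update(bool taken, bool inKernelMode) {
        kernelHist = (kernelHist << 1) | (taken ? 1 : 0);  // updated in both modes
        if (!inKernelMode)
            userHist = (userHist << 1) | (taken ? 1 : 0);  // user mode only
    }
    // The predictor indexes its tables with the history matching the current mode.
    uint64_t active(bool inKernelMode) const {
        return inKernelMode ? kernelHist : userHist;
    }
};
```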
Reducing L-TAGE complexity
  • Included 241.5 Kbits TAGE predictor:
    • 3.368 misp/KI
    • Loop predictor beneficial only on gzip:

Might not be worth the extra complexity

Using fewer tables
  • 8 components 256 Kbits TAGE predictor:
    • 3.446 misp/KI
TAGE prediction computation time ?
  • 3 successive steps:
    • Index computation
    • Table read
    • Partial match + multiplexor
  • Does not fit on a single cycle:
    • But can be ahead pipelined !
Ahead pipelining a global history branch predictor (principle)
  • Initiate branch prediction X+1 cycles in advance to provide the prediction in time
    • Use information available:
      • X-block ahead instruction address
      • X-block ahead history
  • To ensure accuracy:
    • Use intermediate path information
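A hedged sketch of the principle: candidate predictions are computed from the X-block-ahead address and history, and a few intermediate path bits select among them when the actual block is reached. Two path bits and four candidates are assumed here to match the "4 // prediction computations" of the next slide; the interface is illustrative.

```cpp
#include <array>
#include <cstdint>

// Ahead-pipelined selection: the 4 candidates were prepared X blocks early
// from the block-ahead PC and history; 2 intermediate path bits pick one.
struct AheadPrediction {
    std::array<bool, 4> candidates{};

    bool select(unsigned intermediatePathBits) const {
        return candidates[intermediatePathBits & 3];
    }
};
```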
Practice

[Figure: ahead pipelined TAGE in practice. The prediction for a block C is initiated from the address A and the history Ha of an earlier block; 4 parallel (//) prediction computations are performed and the intervening branches select among them.]

A final case for the Geometric History Length predictors
  • delivers state-of-the-art accuracy
  • uses only global information:
    • Very long history: 300+ bits !!
  • can be ahead pipelined
  • many effective design points
    • OGEHL or TAGE 
    • Nb of tables, history lengths