
Presentation Transcript


  1. Just a new case for geometric history length predictors André Seznec and Pierre Michaud IRISA/INRIA/HIPEAC

  2. Why branch prediction ? • Processors are deeply pipelined: • More than 30 cycles on a Pentium 4 • Processors execute a few instructions per cycle: • 3-4 on current superscalar processors • More than 100 instructions are in flight in a processor • ≈ 10 % of instructions are branches (mostly conditional) • Effective direction/target known very late: • To avoid delays, predict direction/target • Backtrack as soon as a misprediction is detected

  3. Why (essentially) predict the direction of conditional branches • Most branches are conditional: • PC-relative addresses are known/computed at decode time: • Correction early in the pipeline (≈ 3 cycles lost) • Effective direction is computed at execution time: • Correction very late in the pipeline (≈ 30 cycles lost) • Other branches: • Returns: easy with a return stack • Indirect jumps: same penalty as conditional branches on mispredictions

  4. Dynamic branch prediction J. Smith 1981 • Predict the same direction as the last time: • Use of a table of 1-bit counters indexed by the instruction address • Table of 2-bit counters: hysteresis to enhance predictor behavior [Figure: 2-bit saturating counter state diagram]
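As a rough illustration of the scheme above, here is a minimal C sketch of a table of 2-bit saturating counters. The table size and the PC-to-index mapping are illustrative assumptions, not from the slides.

```c
/* Minimal sketch of Smith's 2-bit counter scheme; table size and
   indexing are illustrative choices. */
#include <stdint.h>

#define PHT_SIZE 4096                 /* 2^12 entries, assumed */
static uint8_t pht[PHT_SIZE];         /* each entry: saturating counter 0..3 */

int predict(uint32_t pc) {
    return pht[pc % PHT_SIZE] >= 2;   /* states 2,3 => predict taken */
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &pht[pc % PHT_SIZE];
    if (taken) { if (*c < 3) (*c)++; }    /* saturate at 3 (strongly taken) */
    else       { if (*c > 0) (*c)--; }    /* saturate at 0 (strongly not-taken) */
}
```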

  5. Exploiting branch history Yeh and Patt 91, Pan et al 91 • Autocorrelation: • Local history • the past behavior of this particular branch • Intercorrelation: • Global history • the (recent) past behavior of all branches

  6. Global history predictor gshare, McFarling 93 [Figure: n bits of the PC are XORed with an L-bit global history register (a representation of the recent path) to index a PHT of 2^n 2-bit counters; the prediction is inserted in the history to predict the next branch]
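A minimal C sketch of gshare indexing. For simplicity the resolved outcome is shifted into the history at update time; as the figure notes, a real front-end inserts the prediction speculatively. Sizes are assumptions.

```c
/* Sketch of gshare (McFarling 93): XOR the PC with the global history
   to index a table of 2-bit counters. */
#include <stdint.h>

#define N 14                           /* log2(table size), assumed */
#define PHT_SIZE (1u << N)
static uint8_t  pht[PHT_SIZE];
static uint32_t ghist;                 /* low bits hold the global history */

int gshare_predict(uint32_t pc) {
    uint32_t idx = (pc ^ ghist) & (PHT_SIZE - 1);
    return pht[idx] >= 2;
}

void gshare_update(uint32_t pc, int taken) {
    uint32_t idx = (pc ^ ghist) & (PHT_SIZE - 1);
    if (taken) { if (pht[idx] < 3) pht[idx]++; }
    else       { if (pht[idx] > 0) pht[idx]--; }
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1); /* append outcome */
}
```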

  7. Local history predictor [Figure: m bits of the PC select one of 2^m L-bit local histories; the selected history indexes a PHT of 2^n 2-bit counters that delivers the prediction]
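A corresponding two-level local-history sketch in C: a per-branch history table indexed by the PC, whose L-bit history then indexes the PHT. All sizes are illustrative.

```c
/* Sketch of a two-level local-history predictor. */
#include <stdint.h>

#define M 10                            /* log2(#local histories), assumed */
#define L 10                            /* local history length, assumed */
static uint16_t lhist[1 << M];          /* per-branch L-bit histories */
static uint8_t  pht[1 << L];            /* 2-bit counters */

int local_predict(uint32_t pc) {
    uint16_t h = lhist[pc & ((1 << M) - 1)] & ((1 << L) - 1);
    return pht[h] >= 2;
}

void local_update(uint32_t pc, int taken) {
    uint16_t *h  = &lhist[pc & ((1 << M) - 1)];
    uint16_t idx = *h & ((1 << L) - 1);
    if (taken) { if (pht[idx] < 3) pht[idx]++; }
    else       { if (pht[idx] > 0) pht[idx]--; }
    *h = (*h << 1) | (taken ? 1 : 0);   /* non-speculative history update */
}
```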

  8. And of course: hybrid predictors !! • Take a few distinct branch predictors • Select or compute the final prediction

  9. EV8 predictor: (derived from) 2Bc-gskew Seznec et al, ISCA 2002 [Figure: 2Bc-gskew structure, built on e-gskew, Michaud et al 97]

  10. Perceptron predictor Jimenez and Lin 2001 [Figure: dot product of a weight vector with the history bits; Sign = prediction]
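A minimal sketch of the perceptron prediction computation: the sign of a dot product between signed weights and the history bits encoded as +1/-1. Parameters are illustrative; the training rule is omitted here.

```c
/* Sketch of perceptron prediction (Jimenez and Lin 2001). */
#include <stdint.h>

#define HLEN  32                       /* history length, assumed */
#define NPERC 1024                     /* number of perceptrons, assumed */
static int8_t w[NPERC][HLEN + 1];      /* w[][0] is the bias weight */
static int    h[HLEN];                 /* global history as +1 / -1 */

int perceptron_output(uint32_t pc) {
    int8_t *wp = w[pc % NPERC];
    int y = wp[0];                     /* bias term */
    for (int i = 0; i < HLEN; i++)
        y += wp[i + 1] * h[i];         /* one (conceptual) adder per history bit */
    return y;                          /* sign(y) is the prediction */
}
```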

  11. Qualities required of a branch predictor • State-of-the-art accuracy: • 2bcgskew was state-of-the-art on most applications, but was lagging behind neural-inspired predictors on a few • Reasonable implementation cost: • Neural-inspired predictors are not really realistic (huge number of adders, many accesses to the tables) • In-time response

  12. Speculative history must be managed !? • Global history: • Append a bit to a single history register • Use of a circular buffer and just a pointer to speculatively manage the history • Local history: • table of histories (non-speculatively updated) • must maintain a speculative history per in-flight branch: • Associative search, etc ?!?

  13. From my experience with the Alpha EV8 team Designers hate maintaining speculative local history

  14. The basis: a multiple length global history predictor [Figure: tables T0–T4 indexed with increasing history lengths L(0)–L(4); a selection mechanism (?) produces the final prediction]

  15. Underlying idea • H and H' are two history vectors equal on their L leading bits • e.g. L < L(2) • If branches (A,H) and (A,H') are biased in opposite directions, table T2 can discriminate between (A,H) and (A,H')

  16. Selecting between multiple predictions • Classical solution: • Use of a meta-predictor: "wasting" storage !?! choosing among 5 or 10 predictions ?? • Neural-inspired predictors: • Use an adder tree instead of a meta-predictor Jimenez and Lin 2001 • Partial matching: • Use tagged tables and the longest matching history Chen et al 96, Eden and Mudge 1998, Michaud 2005

  17. Final computation through a sum [Figure: tables T0–T4, indexed with history lengths L(0)–L(4), feed an adder ∑; Prediction = Sign(∑)]
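A minimal C sketch of the sum-based computation above: each table is indexed with a hash of the PC and the first L(i) global history bits, and the sign of the sum of the selected signed counters gives the prediction. The hash and the sizes are illustrative, not from the slides.

```c
/* Sketch of GEHL-style prediction by sum. */
#include <stdint.h>

#define NTAB  5                         /* tables T0..T4, as in the figure */
#define LOGSZ 11                        /* log2 entries per table, assumed */
static int8_t   t[NTAB][1 << LOGSZ];    /* signed prediction counters */
static int      L[NTAB] = {0, 2, 4, 8, 16};  /* history lengths, illustrative */
static uint64_t ghist;                  /* global history register */

static uint32_t gehl_index(int i, uint32_t pc) {
    uint64_t h = (L[i] == 0) ? 0 : (ghist & ((1ull << L[i]) - 1));
    return (uint32_t)((pc ^ h ^ (h >> LOGSZ)) & ((1 << LOGSZ) - 1));
}

int gehl_predict(uint32_t pc) {
    int sum = 0;
    for (int i = 0; i < NTAB; i++)      /* one read per component */
        sum += t[i][gehl_index(i, pc)];
    return sum >= 0;                    /* sign of the sum = prediction */
}
```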

  18. Final computation through partial match [Figure: a tagless base predictor backed by tagged tables indexed with hashes of pc and h[0:L1], h[0:L2], h[0:L3]; each tagged entry holds tag, ctr and u fields; tag comparisons (=?) select the prediction of the longest matching history]

  19. From “old experience” on 2bcgskew and EV8 Some applications benefit from 100+ bit histories Some don’t !!

  20. GEometric History Length predictor The set of history lengths forms a geometric series {0, 2, 4, 8, 16, 32, 64, 128} What is important: L(i) - L(i-1) is drastically increasing Spends most of the storage on short histories !!
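A small C sketch of how such a series can be generated: L(i) = round(L(1) · α^(i-1)) with L(0) = 0 and α chosen to hit a target maximum length. With L(1) = 3 and L(10) = 200 this produces values close to the reference configuration on slide 27, which deviates slightly from the pure formula.

```c
/* Sketch: generating a geometric series of history lengths. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const int    n  = 10;               /* highest table index */
    const double L1 = 3.0, Ln = 200.0;  /* L(1) and L(n) from slide 27 */
    const double alpha = pow(Ln / L1, 1.0 / (n - 1)); /* common ratio */
    printf("L(0) = 0\n");
    for (int i = 1; i <= n; i++)
        printf("L(%d) = %d\n", i, (int)(L1 * pow(alpha, i - 1) + 0.5));
    return 0;
}
```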

  21. • Adder tree: • O-GEHL: Optimized GEometric History Length predictor • Partial match: • TAGE: TAgged GEometric history length predictor • Inspired by PPM-like, Michaud 2004

  22. O-GEHL update policy • Perceptron-inspired, threshold-based • The perceptron-like threshold did not work • A reasonable fixed threshold = number of tables

  23. Dynamic update threshold fitting On an O-GEHL predictor, the best threshold depends on • the application • the predictor size • the counter width By chance, on most applications, with the best fixed threshold, updates on mispredictions ≈ updates on correct predictions Monitor the difference and adapt the update threshold
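Continuing the gehl_* sketch after slide 17, a C sketch of the threshold-based update with dynamic threshold fitting: counters are updated on a misprediction or when |sum| is at or below the threshold, and a counter TC keeps the two update causes balanced by nudging the threshold. TC's saturation bound (64 here) is an assumption.

```c
/* Sketch of the O-GEHL update policy with dynamic threshold fitting;
   reuses t[], gehl_index(), NTAB and ghist from the earlier sketch. */
static int theta = NTAB;                /* initial threshold = #tables */
static int tc    = 0;                   /* threshold-fitting counter */

void gehl_update(uint32_t pc, int taken, int sum) {
    int pred = (sum >= 0);
    int mag  = sum < 0 ? -sum : sum;
    if (pred != taken || mag <= theta) {
        for (int i = 0; i < NTAB; i++) {            /* update every component */
            int8_t *c = &t[i][gehl_index(i, pc)];
            if (taken) { if (*c <  7) (*c)++; }     /* 4-bit signed: [-8, 7] */
            else       { if (*c > -8) (*c)--; }
        }
        /* keep misprediction updates ~ correct-prediction updates */
        if (pred != taken) { if (++tc >=  64) { theta++;               tc = 0; } }
        else               { if (--tc <= -64) { if (theta > 0) theta--; tc = 0; } }
    }
    ghist = (ghist << 1) | (taken ? 1u : 0u);       /* append outcome */
}
```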

  24. Adaptive history length fitting (inspired by Juan et al 98) (½ of applications: L(7) < 50) ≠ (½ of applications: L(7) > 150) Let us adapt some history lengths to the behavior of each application 8 tables: T2: L(2) and L(8) T4: L(4) and L(9) T6: L(6) and L(10)

  25. Adaptive history length fitting (2) Intuition: • if there is a high degree of conflicts on T7, stick with short histories Implementation: • monitoring of aliasing on T7 updates through a tag bit and a counter Simple is sufficient: • Flipping from short to long histories and vice-versa (see the sketch below)
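A heavily simplified C sketch of this monitoring, under stated assumptions: a one-bit tag per T7 entry reveals whether an update aliases with another branch, and a saturating counter flips T2/T4/T6 between their short and long lengths. The saturation bound and the flip policy are guesses, not from the slides.

```c
/* Sketch of adaptive history length fitting via T7 aliasing monitoring. */
static int use_long  = 0;   /* 0: T2/T4/T6 use L(2)/L(4)/L(6); 1: L(8)/L(9)/L(10) */
static int alias_ctr = 0;   /* tracks aliasing on T7 updates */

void monitor_t7_update(int tag_bit_matches) {
    if (!tag_bit_matches) {             /* another branch owned this entry */
        if (++alias_ctr >= 256) { use_long = 0; alias_ctr = 0; } /* conflicts: go short */
    } else {
        if (--alias_ctr <= -256) { use_long = 1; alias_ctr = 0; } /* calm: go long */
    }
}
```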

  26. Evaluation framework 1st Championship Branch Prediction traces: 20 traces including system activity • Floating point apps: loop dominated • Integer apps: usual SPECINT • Multimedia apps • Server workload apps: very large footprint

  27. Reference configuration presented for the 2004 Championship Branch Prediction • 8 tables: • 2K entries, except T1: 1K entries • 5-bit counters for T0 and T1, 4-bit counters otherwise • 1 Kbit of one-bit tags associated with T7 10K + 5K + 6x8K + 1K = 64 Kbits • L(1) = 3 and L(10) = 200 • {0,3,5,8,12,19,31,49,75,125,200}

  28. The OGEHL predictor • 2nd at CBP: 2.82 misp/KI • Best practice award: • The predictor closest to a possible hardware implementation • Does not use exotic features: • Various prime numbers, etc • Strange initial state • Very short warming intervals • Chaining all simulations: 2.84 misp/KI

  29. A case for the OGEHL predictor (2) • High accuracy • 32 Kbits (3,150): 3.41 misp/KI, better than any implementable 128 Kbits predictor before CBP • 128 Kbits 2bcgskew (6,6,24,48): 3.55 misp/KI • 176 Kbits PBNP (43): 3.67 misp/KI • 1 Mbit (5,300): 2.27 misp/KI • 1 Mbit 2bcgskew (9,9,36,72): 3.19 misp/KI • 1888 Kbits PBNP (58): 3.23 misp/KI

  30. Robustness of the OGEHL predictor (3) • Robustness to variations of history length choices: • L(1) in [2,6], L(10) in [125,300] • misp. rate < 2.96 misp/KI • Geometric series: not a bad formula !! • best geometric: L(1) = 3, L(10) = 223, 2.80 misp/KI • best overall: {0, 2, 4, 9, 12, 18, 31, 54, 114, 145, 266}, 2.78 misp/KI

  31. OGEHL scalability • 4 components vs 8 components • 64 Kbits: 3.02 vs 2.84 misp/KI • 256 Kbits: 2.59 vs 2.44 misp/KI • 1 Mbit: 2.40 vs 2.27 misp/KI • 6 components vs 12 components • 48 Kbits: 3.02 vs 3.03 misp/KI • 768 Kbits: 2.35 vs 2.25 misp/KI 4 to 12 components bring high accuracy

  32. Impact of counter width • Robustness to counter width variations: • 3-bit counters, 49 Kbits: 3.09 misp/KI (dynamic update threshold fitting helps a lot) • 5-bit counters, 79 Kbits: 2.79 misp/KI • 4-bit is the best tradeoff

  33. PPM-like predictor: 3.24 misp/KI, but... the update policy was poor

  34. TAGE update policy • General principle: Minimize the footprint of the prediction. • Just update the longest history matching component and allocate at most one entry on mispredictions

  35. A tagged table entry [Figure: entry layout U | Tag | Ctr] • Ctr: 3-bit prediction counter • U: 2-bit useful counter • Was the entry recently useful ? • Tag: partial tag

  36. [Figure: TAGE prediction flow: tag comparisons (=?) on all tagged tables; the hit with the longest history delivers Pred, the next hit delivers Altpred; a miss on all tagged tables falls back to the base prediction]
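A C sketch of this prediction flow: probe every tagged table, and let the hit with the longest history provide Pred while the next hit (or the base predictor) provides Altpred. Table layout and the hashing helpers are illustrative assumptions.

```c
/* Sketch of TAGE prediction (Pred / Altpred selection). */
#include <stdint.h>

typedef struct { uint16_t tag; int8_t ctr; uint8_t u; } tage_entry;

#define NTABLES 5                       /* tagged tables, assumed */
extern tage_entry *table[NTABLES];      /* index NTABLES-1 = longest history */
extern uint32_t tage_index(int t, uint32_t pc); /* hash of pc and h[0:L(t)] */
extern uint16_t tage_tag(int t, uint32_t pc);   /* partial tag hash */
extern int base_predict(uint32_t pc);   /* tagless base predictor */

int tage_predict(uint32_t pc, int *provider, int *altpred) {
    int pred  = base_predict(pc);
    *altpred  = pred;
    *provider = -1;                     /* -1 = base predictor provides */
    for (int t = 0; t < NTABLES; t++) { /* shortest to longest history */
        tage_entry *e = &table[t][tage_index(t, pc)];
        if (e->tag == tage_tag(t, pc)) {
            *altpred = pred;            /* previous longest-matching prediction */
            pred = (e->ctr >= 0);       /* signed 3-bit counter: sign = direction */
            *provider = t;
        }
    }
    return pred;
}
```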

  37. Updating the U counter • If (Altpred ≠ Pred) then • Pred = outcome : U = U + 1 • Pred ≠ outcome : U = U - 1 • Graceful aging: • Periodic reset of 1 bit in all U counters

  38. Allocating a new entry on a misprediction • Find a single unuseful entry (U = 0) with a longer history: • Privilege the smallest possible history • To minimize footprint • But not too much • To avoid ping-pong phenomena • Initialize Ctr as weak and U as zero (see the sketch below)
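A C sketch of the two update steps from slides 37 and 38, continuing the tage_entry sketch above: U is adjusted only when Pred and Altpred differed, and on a misprediction at most one longer-history entry is allocated, preferring (but not always taking) the shortest candidate. The 1-in-2 coin flip used to avoid ping-pong is an assumption about "but not too much".

```c
/* Sketch of TAGE U-counter update and entry allocation; reuses
   tage_entry, table[], tage_index() and tage_tag() from the sketch above. */
#include <stdlib.h>

void tage_update_u(tage_entry *e, int pred, int altpred, int taken) {
    if (pred != altpred) {              /* the provider was decisive */
        if (pred == taken) { if (e->u < 3) e->u++; }  /* 2-bit useful counter */
        else               { if (e->u > 0) e->u--; }
    }
}

void tage_allocate(uint32_t pc, int provider, int taken) {
    for (int t = provider + 1; t < NTABLES; t++) {  /* longer histories only */
        tage_entry *e = &table[t][tage_index(t, pc)];
        if (e->u == 0) {                /* unuseful entry: candidate */
            /* occasionally skip the shortest candidate to avoid ping-pong
               (assumption: 1-in-2 coin flip) */
            if (t + 1 < NTABLES && (rand() & 1)) continue;
            e->tag = tage_tag(t, pc);
            e->ctr = taken ? 0 : -1;    /* weak taken / weak not-taken */
            e->u   = 0;
            return;                     /* allocate at most one entry */
        }
    }
}
```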

  39. TAGE update policy • + a few other tricks The update is not on the critical path

  40. Tradeoffs on tag width • Too narrow: • Lots of (destructive) false matches • Too wide: • Losing storage space • Tradeoff: • 11 bits for 8 tables • 9 bits for 5 tables • Can be tuned on a per-table basis: • (destructive) false matches are better tolerated on shorter histories

  41. TAGE vs OGEHL • 8-comp. TAGE: 64K: 2.58, 128K: 2.38, 256K: 2.23, 512K: 2.12, 1M: 2.05 misp/KI • 5-comp. TAGE: 64K: 2.70, 128K: 2.45, 256K: 2.28, 512K: 2.19, 1M: 2.12 misp/KI • 8-comp. OGEHL: 64K: 2.84, 128K: 2.54, 256K: 2.43, 512K: 2.33, 1M: 2.27 misp/KI

  42. Partial tag matching is more effective than an adder tree • At equal table numbers and equal history lengths: • TAGE is better than OGEHL on each of the 20 benchmarks • The accuracy limit (with a very high storage budget) is higher for TAGE • Addition leads to averaging and information loss • With tags, no information is lost

  43. Prediction computation time ? • 3 successive steps: • Index computation: a few XOR gates • Table read • Adder tree • May not fit in a single cycle: • But can be ahead pipelined !

  44. Ahead pipelining a global history branch predictor (principle) • Initiate branch prediction X+1 cycles in advance to provide the prediction in time • Use information available: • X-block ahead instruction address • X-block ahead history • To ensure accuracy: • Use intermediate path information

  45. In practice: ahead OGEHL or TAGE [Figure: instruction blocks A, B, C, D; the prediction for D is initiated with A's address and history Ha; 8 parallel prediction computations, with path bits bcd selecting among them]
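A C sketch of the principle, under stated assumptions: the predictor is read X blocks ahead with the old PC and history, computing 2^X candidate predictions in parallel; the intermediate path bits gathered since then perform a late select. Here X = 3, matching the 8 parallel computations in the figure; predict_with is a hypothetical helper.

```c
/* Sketch of ahead-pipelined prediction with late path-based selection. */
#include <stdint.h>

#define X 3                                       /* blocks of look-ahead */
extern int predict_with(uint32_t old_pc, uint64_t old_hist, unsigned path_bits);

void start_ahead(uint32_t old_pc, uint64_t old_hist, int cand[1 << X]) {
    for (unsigned p = 0; p < (1u << X); p++)      /* 8 parallel computations */
        cand[p] = predict_with(old_pc, old_hist, p);
}

int finish_ahead(const int cand[1 << X], unsigned actual_path_bits) {
    return cand[actual_path_bits & ((1u << X) - 1)]; /* late select */
}
```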

  46. Ahead Pipelined 64 Kbits OGEHL or TAGE • 3-block 64 Kbits TAGE: • 2.70 misp/KI vs 2.58 misp/KI • 3-block 64 Kbits OGEHL: • 2.94 misp/KI vs 2.84 misp/KI Not such a huge accuracy loss

  47. And indirect jumps ? • TAGE principles can be adapted to predict indirect jumps: • “A case for (partially) tagged branch predictors”, JILP Feb. 2006

  48. A final case for the Geometric History Length predictors • delivers state-of-the-art accuracy • uses only global information: • Very long history: 200+ bits !! • can be ahead pipelined • many effective design points • OGEHL or TAGE • Number of tables, history lengths • prediction computation logic complexity is low (compared with concurrent predictors)

  49. The End

  50. BACK UP
