
Presentation Transcript


  1. Just a new case for geometric history length predictors André Seznec and Pierre Michaud IRISA/INRIA/HIPEAC

  2. Why branch prediction ? • Processors are deeply pipelined: • More than 30 cycles on a Pentium 4 • Processors execute a few instructions per cycle: • 3-4 on current superscalar processors • More than 100 instructions are in flight in a processor • ≈ 10 % of instructions are branches (mostly conditional) • Effective direction/target known very late: • To avoid delays, predict direction/target • Backtrack as soon as a misprediction is detected

  3. Why (essentially) predict the direction of conditional branches • Most branches are conditional: • PC-relative addresses are known/computed at decode time: • Correction early in the pipeline (≈ 3 cycles lost) • Effective direction is computed at execution time: • Correction very late in the pipeline (≈ 30 cycles lost) • Other branches: • Returns: easy with a return stack • Indirect jumps: same penalty as conditional branches on mispredictions

  4. Dynamic branch prediction J. Smith 1981 • Predict the same direction as the last time: • Use of a table of 1-bit counters indexed by the instruction address • Table of 2-bit counters: hysteresis to enhance predictor behavior [Figure: 2-bit saturating counter state diagram]
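As a rough illustration of the scheme above, here is a minimal C sketch of a table of 2-bit saturating counters. The table size and the PC-to-index mapping are illustrative assumptions, not from the slides.

```c
/* Minimal sketch of Smith's 2-bit counter scheme; table size and
   indexing are illustrative choices. */
#include <stdint.h>

#define PHT_SIZE 4096                 /* 2^12 entries, assumed */
static uint8_t pht[PHT_SIZE];         /* each entry: saturating counter 0..3 */

int predict(uint32_t pc) {
    return pht[pc % PHT_SIZE] >= 2;   /* states 2,3 => predict taken */
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &pht[pc % PHT_SIZE];
    if (taken) { if (*c < 3) (*c)++; }    /* saturate at 3 (strongly taken) */
    else       { if (*c > 0) (*c)--; }    /* saturate at 0 (strongly not-taken) */
}
```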

  5. Exploiting branch history Yeh and Patt 91, Pan et al 91 • Autocorrelation: • Local history • the past behavior of this particular branch • Intercorrelation: • Global history • the (recent) past behavior of all branches

  6. Global history predictor gshare, McFarling 93 [Figure: n bits of the PC are XORed with an L-bit global history register (a representation of the recent path) to index a PHT of 2^n 2-bit counters; the prediction is inserted in the history to predict the next branch]
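A minimal C sketch of gshare indexing. For simplicity the resolved outcome is shifted into the history at update time; as the figure notes, a real front-end inserts the prediction speculatively. Sizes are assumptions.

```c
/* Sketch of gshare (McFarling 93): XOR the PC with the global history
   to index a table of 2-bit counters. */
#include <stdint.h>

#define N 14                           /* log2(table size), assumed */
#define PHT_SIZE (1u << N)
static uint8_t  pht[PHT_SIZE];
static uint32_t ghist;                 /* low bits hold the global history */

int gshare_predict(uint32_t pc) {
    uint32_t idx = (pc ^ ghist) & (PHT_SIZE - 1);
    return pht[idx] >= 2;
}

void gshare_update(uint32_t pc, int taken) {
    uint32_t idx = (pc ^ ghist) & (PHT_SIZE - 1);
    if (taken) { if (pht[idx] < 3) pht[idx]++; }
    else       { if (pht[idx] > 0) pht[idx]--; }
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1); /* append outcome */
}
```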

  7. Local history predictor [Figure: m bits of the PC select one of 2^m L-bit local histories; the selected history indexes a PHT of 2^n 2-bit counters that delivers the prediction]
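A corresponding two-level local-history sketch in C: a per-branch history table indexed by the PC, whose L-bit history then indexes the PHT. All sizes are illustrative.

```c
/* Sketch of a two-level local-history predictor. */
#include <stdint.h>

#define M 10                            /* log2(#local histories), assumed */
#define L 10                            /* local history length, assumed */
static uint16_t lhist[1 << M];          /* per-branch L-bit histories */
static uint8_t  pht[1 << L];            /* 2-bit counters */

int local_predict(uint32_t pc) {
    uint16_t h = lhist[pc & ((1 << M) - 1)] & ((1 << L) - 1);
    return pht[h] >= 2;
}

void local_update(uint32_t pc, int taken) {
    uint16_t *h  = &lhist[pc & ((1 << M) - 1)];
    uint16_t idx = *h & ((1 << L) - 1);
    if (taken) { if (pht[idx] < 3) pht[idx]++; }
    else       { if (pht[idx] > 0) pht[idx]--; }
    *h = (*h << 1) | (taken ? 1 : 0);   /* non-speculative history update */
}
```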

  8. And of course: hybrid predictors !! • Take a few distinct branch predictors • Select or compute the final prediction

  9. EV8 predictor: (derived from) 2Bc-gskew Seznec et al, ISCA 2002 [Figure: 2Bc-gskew structure, built on e-gskew, Michaud et al 97]

  10. Perceptron predictor Jimenez and Lin 2001 [Figure: dot product of a weight vector with the history bits; Sign = prediction]
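A minimal sketch of the perceptron prediction computation: the sign of a dot product between signed weights and the history bits encoded as +1/-1. Parameters are illustrative; the training rule is omitted here.

```c
/* Sketch of perceptron prediction (Jimenez and Lin 2001). */
#include <stdint.h>

#define HLEN  32                       /* history length, assumed */
#define NPERC 1024                     /* number of perceptrons, assumed */
static int8_t w[NPERC][HLEN + 1];      /* w[][0] is the bias weight */
static int    h[HLEN];                 /* global history as +1 / -1 */

int perceptron_output(uint32_t pc) {
    int8_t *wp = w[pc % NPERC];
    int y = wp[0];                     /* bias term */
    for (int i = 0; i < HLEN; i++)
        y += wp[i + 1] * h[i];         /* one (conceptual) adder per history bit */
    return y;                          /* sign(y) is the prediction */
}
```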

  11. Qualities required of a branch predictor • State-of-the-art accuracy: • 2bcgskew was state-of-the-art on most applications, but was lagging behind neural-inspired predictors on a few • Reasonable implementation cost: • Neural-inspired predictors are not really realistic (huge number of adders, many accesses to the tables) • In-time response

  12. Speculative history must be managed !? • Global history: • Append a bit to a single history register • Use of a circular buffer and just a pointer to speculatively manage the history • Local history: • table of histories (non-speculatively updated) • must maintain a speculative history per in-flight branch: • Associative search, etc ?!?

  13. From my experience with the Alpha EV8 team Designers hate maintaining speculative local history

  14. The basis: a multiple length global history predictor [Figure: tables T0–T4 indexed with increasing history lengths L(0)–L(4); a selection mechanism (?) produces the final prediction]

  15. Underlying idea • H and H' are two history vectors equal on their L leading bits • e.g. L < L(2) • If branches (A,H) and (A,H') are biased in opposite directions, table T2 can discriminate between (A,H) and (A,H')

  16. Selecting between multiple predictions • Classical solution: • Use of a meta-predictor: "wasting" storage !?! choosing among 5 or 10 predictions ?? • Neural-inspired predictors: • Use an adder tree instead of a meta-predictor Jimenez and Lin 2001 • Partial matching: • Use tagged tables and the longest matching history Chen et al 96, Eden and Mudge 1998, Michaud 2005

  17. Final computation through a sum [Figure: tables T0–T4, indexed with history lengths L(0)–L(4), feed an adder ∑; Prediction = Sign(∑)]
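A minimal C sketch of the sum-based computation above: each table is indexed with a hash of the PC and the first L(i) global history bits, and the sign of the sum of the selected signed counters gives the prediction. The hash and the sizes are illustrative, not from the slides.

```c
/* Sketch of GEHL-style prediction by sum. */
#include <stdint.h>

#define NTAB  5                         /* tables T0..T4, as in the figure */
#define LOGSZ 11                        /* log2 entries per table, assumed */
static int8_t   t[NTAB][1 << LOGSZ];    /* signed prediction counters */
static int      L[NTAB] = {0, 2, 4, 8, 16};  /* history lengths, illustrative */
static uint64_t ghist;                  /* global history register */

static uint32_t gehl_index(int i, uint32_t pc) {
    uint64_t h = (L[i] == 0) ? 0 : (ghist & ((1ull << L[i]) - 1));
    return (uint32_t)((pc ^ h ^ (h >> LOGSZ)) & ((1 << LOGSZ) - 1));
}

int gehl_predict(uint32_t pc) {
    int sum = 0;
    for (int i = 0; i < NTAB; i++)      /* one read per component */
        sum += t[i][gehl_index(i, pc)];
    return sum >= 0;                    /* sign of the sum = prediction */
}
```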

  18. Final computation through partial match [Figure: a tagless base predictor backed by tagged tables indexed with hashes of pc and h[0:L1], h[0:L2], h[0:L3]; each tagged entry holds tag, ctr and u fields; tag comparisons (=?) select the prediction of the longest matching history]

  19. From “old experience” on 2bcgskew and EV8 Some applications benefit from 100+ bit histories Some don’t !!

  20. GEometric History Length predictor The set of history lengths forms a geometric series {0, 2, 4, 8, 16, 32, 64, 128} What is important: L(i) - L(i-1) is drastically increasing Spends most of the storage on short histories !!
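A small C sketch of how such a series can be generated: L(i) = round(L(1) · α^(i-1)) with L(0) = 0 and α chosen to hit a target maximum length. With L(1) = 3 and L(10) = 200 this produces values close to the reference configuration on slide 27, which deviates slightly from the pure formula.

```c
/* Sketch: generating a geometric series of history lengths. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const int    n  = 10;               /* highest table index */
    const double L1 = 3.0, Ln = 200.0;  /* L(1) and L(n) from slide 27 */
    const double alpha = pow(Ln / L1, 1.0 / (n - 1)); /* common ratio */
    printf("L(0) = 0\n");
    for (int i = 1; i <= n; i++)
        printf("L(%d) = %d\n", i, (int)(L1 * pow(alpha, i - 1) + 0.5));
    return 0;
}
```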

  21. • Adder tree: • O-GEHL: Optimized GEometric History Length predictor • Partial match: • TAGE: TAgged GEometric history length predictor • Inspired by PPM-like, Michaud 2004

  22. O-GEHL update policy • Perceptron-inspired, threshold-based • The perceptron-like threshold did not work • A reasonable fixed threshold = number of tables

  23. Dynamic update threshold fitting On an O-GEHL predictor, the best threshold depends on • the application • the predictor size • the counter width By chance, on most applications, with the best fixed threshold, updates on mispredictions ≈ updates on correct predictions Monitor the difference and adapt the update threshold
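Continuing the gehl_* sketch after slide 17, a C sketch of the threshold-based update with dynamic threshold fitting: counters are updated on a misprediction or when |sum| is at or below the threshold, and a counter TC keeps the two update causes balanced by nudging the threshold. TC's saturation bound (64 here) is an assumption.

```c
/* Sketch of the O-GEHL update policy with dynamic threshold fitting;
   reuses t[], gehl_index(), NTAB and ghist from the earlier sketch. */
static int theta = NTAB;                /* initial threshold = #tables */
static int tc    = 0;                   /* threshold-fitting counter */

void gehl_update(uint32_t pc, int taken, int sum) {
    int pred = (sum >= 0);
    int mag  = sum < 0 ? -sum : sum;
    if (pred != taken || mag <= theta) {
        for (int i = 0; i < NTAB; i++) {            /* update every component */
            int8_t *c = &t[i][gehl_index(i, pc)];
            if (taken) { if (*c <  7) (*c)++; }     /* 4-bit signed: [-8, 7] */
            else       { if (*c > -8) (*c)--; }
        }
        /* keep misprediction updates ~ correct-prediction updates */
        if (pred != taken) { if (++tc >=  64) { theta++;               tc = 0; } }
        else               { if (--tc <= -64) { if (theta > 0) theta--; tc = 0; } }
    }
    ghist = (ghist << 1) | (taken ? 1u : 0u);       /* append outcome */
}
```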

  24. Adaptive history length fitting (inspired by Juan et al 98) (½ of applications: L(7) < 50) ≠ (½ of applications: L(7) > 150) Let us adapt some history lengths to the behavior of each application 8 tables: T2: L(2) and L(8) T4: L(4) and L(9) T6: L(6) and L(10)

  25. Adaptive history length fitting (2) Intuition: • if there is a high degree of conflicts on T7, stick with short histories Implementation: • monitoring of aliasing on T7 updates through a tag bit and a counter Simple is sufficient: • Flipping from short to long histories and vice-versa (see the sketch below)
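A heavily simplified C sketch of this monitoring, under stated assumptions: a one-bit tag per T7 entry reveals whether an update aliases with another branch, and a saturating counter flips T2/T4/T6 between their short and long lengths. The saturation bound and the flip policy are guesses, not from the slides.

```c
/* Sketch of adaptive history length fitting via T7 aliasing monitoring. */
static int use_long  = 0;   /* 0: T2/T4/T6 use L(2)/L(4)/L(6); 1: L(8)/L(9)/L(10) */
static int alias_ctr = 0;   /* tracks aliasing on T7 updates */

void monitor_t7_update(int tag_bit_matches) {
    if (!tag_bit_matches) {             /* another branch owned this entry */
        if (++alias_ctr >= 256) { use_long = 0; alias_ctr = 0; } /* conflicts: go short */
    } else {
        if (--alias_ctr <= -256) { use_long = 1; alias_ctr = 0; } /* calm: go long */
    }
}
```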

  26. Evaluation framework 1st Championship Branch Prediction traces: 20 traces including system activity • Floating point apps: loop dominated • Integer apps: usual SPECINT • Multimedia apps • Server workload apps: very large footprint

  27. Reference configuration presented for the 2004 Championship Branch Prediction • 8 tables: • 2K entries, except T1: 1K entries • 5-bit counters for T0 and T1, 4-bit counters otherwise • 1 Kbit of one-bit tags associated with T7 10K + 5K + 6x8K + 1K = 64 Kbits • L(1) = 3 and L(10) = 200 • {0,3,5,8,12,19,31,49,75,125,200}

  28. The OGEHL predictor • 2nd at CBP: 2.82 misp/KI • Best practice award: • The predictor closest to a possible hardware implementation • Does not use exotic features: • Various prime numbers, etc • Strange initial state • Very short warming intervals • Chaining all simulations: 2.84 misp/KI

  29. A case for the OGEHL predictor (2) • High accuracy • 32 Kbits (3,150): 3.41 misp/KI, better than any implementable 128 Kbits predictor before CBP • 128 Kbits 2bcgskew (6,6,24,48): 3.55 misp/KI • 176 Kbits PBNP (43): 3.67 misp/KI • 1 Mbit (5,300): 2.27 misp/KI • 1 Mbit 2bcgskew (9,9,36,72): 3.19 misp/KI • 1888 Kbits PBNP (58): 3.23 misp/KI

  30. Robustness of the OGEHL predictor (3) • Robustness to variations of history length choices: • L(1) in [2,6], L(10) in [125,300] • misp. rate < 2.96 misp/KI • Geometric series: not a bad formula !! • best geometric: L(1) = 3, L(10) = 223, 2.80 misp/KI • best overall: {0, 2, 4, 9, 12, 18, 31, 54, 114, 145, 266}, 2.78 misp/KI

  31. OGEHL scalability • 4 components vs 8 components • 64 Kbits: 3.02 vs 2.84 misp/KI • 256 Kbits: 2.59 vs 2.44 misp/KI • 1 Mbit: 2.40 vs 2.27 misp/KI • 6 components vs 12 components • 48 Kbits: 3.02 vs 3.03 misp/KI • 768 Kbits: 2.35 vs 2.25 misp/KI 4 to 12 components bring high accuracy

  32. Impact of counter width • Robustness to counter width variations: • 3-bit counters, 49 Kbits: 3.09 misp/KI (dynamic update threshold fitting helps a lot) • 5-bit counters, 79 Kbits: 2.79 misp/KI • 4-bit is the best tradeoff

  33. PPM-like predictor: 3.24 misp/KI, but... the update policy was poor

  34. TAGE update policy • General principle: Minimize the footprint of the prediction. • Just update the longest history matching component and allocate at most one entry on mispredictions

  35. A tagged table entry [Figure: entry layout U | Tag | Ctr] • Ctr: 3-bit prediction counter • U: 2-bit useful counter • Was the entry recently useful ? • Tag: partial tag

  36. [Figure: TAGE prediction flow: tag comparisons (=?) on all tagged tables; the hit with the longest history delivers Pred, the next hit delivers Altpred; a miss on all tagged tables falls back to the base prediction]
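A C sketch of this prediction flow: probe every tagged table, and let the hit with the longest history provide Pred while the next hit (or the base predictor) provides Altpred. Table layout and the hashing helpers are illustrative assumptions.

```c
/* Sketch of TAGE prediction (Pred / Altpred selection). */
#include <stdint.h>

typedef struct { uint16_t tag; int8_t ctr; uint8_t u; } tage_entry;

#define NTABLES 5                       /* tagged tables, assumed */
extern tage_entry *table[NTABLES];      /* index NTABLES-1 = longest history */
extern uint32_t tage_index(int t, uint32_t pc); /* hash of pc and h[0:L(t)] */
extern uint16_t tage_tag(int t, uint32_t pc);   /* partial tag hash */
extern int base_predict(uint32_t pc);   /* tagless base predictor */

int tage_predict(uint32_t pc, int *provider, int *altpred) {
    int pred  = base_predict(pc);
    *altpred  = pred;
    *provider = -1;                     /* -1 = base predictor provides */
    for (int t = 0; t < NTABLES; t++) { /* shortest to longest history */
        tage_entry *e = &table[t][tage_index(t, pc)];
        if (e->tag == tage_tag(t, pc)) {
            *altpred = pred;            /* previous longest-matching prediction */
            pred = (e->ctr >= 0);       /* signed 3-bit counter: sign = direction */
            *provider = t;
        }
    }
    return pred;
}
```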

  37. Updating the U counter • If (Altpred ≠ Pred) then • Pred = outcome : U = U + 1 • Pred ≠ outcome : U = U - 1 • Graceful aging: • Periodic reset of 1 bit in all U counters

  38. Allocating a new entry on a misprediction • Find a single unuseful entry (U = 0) with a longer history: • Privilege the smallest possible history • To minimize footprint • But not too much • To avoid ping-pong phenomena • Initialize Ctr as weak and U as zero (see the sketch below)
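A C sketch of the two update steps from slides 37 and 38, continuing the tage_entry sketch above: U is adjusted only when Pred and Altpred differed, and on a misprediction at most one longer-history entry is allocated, preferring (but not always taking) the shortest candidate. The 1-in-2 coin flip used to avoid ping-pong is an assumption about "but not too much".

```c
/* Sketch of TAGE U-counter update and entry allocation; reuses
   tage_entry, table[], tage_index() and tage_tag() from the sketch above. */
#include <stdlib.h>

void tage_update_u(tage_entry *e, int pred, int altpred, int taken) {
    if (pred != altpred) {              /* the provider was decisive */
        if (pred == taken) { if (e->u < 3) e->u++; }  /* 2-bit useful counter */
        else               { if (e->u > 0) e->u--; }
    }
}

void tage_allocate(uint32_t pc, int provider, int taken) {
    for (int t = provider + 1; t < NTABLES; t++) {  /* longer histories only */
        tage_entry *e = &table[t][tage_index(t, pc)];
        if (e->u == 0) {                /* unuseful entry: candidate */
            /* occasionally skip the shortest candidate to avoid ping-pong
               (assumption: 1-in-2 coin flip) */
            if (t + 1 < NTABLES && (rand() & 1)) continue;
            e->tag = tage_tag(t, pc);
            e->ctr = taken ? 0 : -1;    /* weak taken / weak not-taken */
            e->u   = 0;
            return;                     /* allocate at most one entry */
        }
    }
}
```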

  39. TAGE update policy • + a few other tricks The update is not on the critical path

  40. Tradeoffs on tag width • Too narrow: • Lots of (destructive) false matches • Too wide: • Losing storage space • Tradeoff: • 11 bits for 8 tables • 9 bits for 5 tables • Can be tuned on a per-table basis: • (destructive) false matches are better tolerated on shorter histories

  41. TAGE vs OGEHL • 8-comp. TAGE: 64K: 2.58, 128K: 2.38, 256K: 2.23, 512K: 2.12, 1M: 2.05 misp/KI • 5-comp. TAGE: 64K: 2.70, 128K: 2.45, 256K: 2.28, 512K: 2.19, 1M: 2.12 misp/KI • 8-comp. OGEHL: 64K: 2.84, 128K: 2.54, 256K: 2.43, 512K: 2.33, 1M: 2.27 misp/KI

  42. Partial tag matching is more effective than an adder tree • At equal table numbers and equal history lengths: • TAGE is better than OGEHL on each of the 20 benchmarks • The accuracy limit (with a very high storage budget) is higher for TAGE • Addition leads to averaging and information loss • With tags, no information is lost

  43. Prediction computation time ? • 3 successive steps: • Index computation: a few XOR gates • Table read • Adder tree • May not fit in a single cycle: • But can be ahead pipelined !

  44. Ahead pipelining a global history branch predictor (principle) • Initiate branch prediction X+1 cycles in advance to provide the prediction in time • Use information available: • X-block ahead instruction address • X-block ahead history • To ensure accuracy: • Use intermediate path information

  45. In practice: ahead OGEHL or TAGE [Figure: instruction blocks A, B, C, D; the prediction for D is initiated with A's address and history Ha; 8 parallel prediction computations, with path bits bcd selecting among them]
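A C sketch of the principle, under stated assumptions: the predictor is read X blocks ahead with the old PC and history, computing 2^X candidate predictions in parallel; the intermediate path bits gathered since then perform a late select. Here X = 3, matching the 8 parallel computations in the figure; predict_with is a hypothetical helper.

```c
/* Sketch of ahead-pipelined prediction with late path-based selection. */
#include <stdint.h>

#define X 3                                       /* blocks of look-ahead */
extern int predict_with(uint32_t old_pc, uint64_t old_hist, unsigned path_bits);

void start_ahead(uint32_t old_pc, uint64_t old_hist, int cand[1 << X]) {
    for (unsigned p = 0; p < (1u << X); p++)      /* 8 parallel computations */
        cand[p] = predict_with(old_pc, old_hist, p);
}

int finish_ahead(const int cand[1 << X], unsigned actual_path_bits) {
    return cand[actual_path_bits & ((1u << X) - 1)]; /* late select */
}
```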

  46. Ahead Pipelined 64 Kbits OGEHL or TAGE • 3-block 64 Kbits TAGE: • 2.70 misp/KI vs 2.58 misp/KI • 3-block 64 Kbits OGEHL: • 2.94 misp/KI vs 2.84 misp/KI Not such a huge accuracy loss

  47. And indirect jumps ? • TAGE principles can be adapted to predict indirect jumps: • “A case for (partially) tagged branch predictors”, JILP Feb. 2006

  48. A final case for the Geometric History Length predictors • delivers state-of-the-art accuracy • uses only global information: • Very long history: 200+ bits !! • can be ahead pipelined • many effective design points • OGEHL or TAGE • Number of tables, history lengths • prediction computation logic complexity is low (compared with concurrent predictors)

  49. The End

  50. BACK UP
