
Revisiting the perceptron predictor




Presentation Transcript


  1. Revisiting the perceptron predictor André Seznec IRISA/ INRIA

  2. Perceptron-based branch prediction Jiménez and Lin, HPCA 2001 • Radically new approach to branch prediction • Associate a set of 8-bit counters or weights with a branch address • Use the global history vector as an input vector (+1, -1) • Multiply/accumulate weights by inputs and use the sign as the prediction • Selective update: • Increment/decrement on a misprediction • Or if |Sum| is lower than a threshold
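The scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's exact design: `HIST_LEN`, `THRESHOLD` and the class name are assumed values chosen for the example.

```python
HIST_LEN = 8               # illustrative global history length
THRESHOLD = 16             # illustrative selective-update threshold
W_MAX, W_MIN = 127, -128   # 8-bit saturating weights

class Perceptron:
    def __init__(self):
        # w[0] is the bias weight; w[1..HIST_LEN] pair with history bits
        self.w = [0] * (HIST_LEN + 1)

    def output(self, history):
        # history: list of +1 (taken) / -1 (not taken) outcomes
        return self.w[0] + sum(wi * hi for wi, hi in zip(self.w[1:], history))

    def predict(self, history):
        return self.output(history) >= 0   # sign of the sum = prediction

    def update(self, history, taken):
        t = 1 if taken else -1
        y = self.output(history)
        # selective update: train on a misprediction, or when |Sum| is
        # at or below the threshold
        if (y >= 0) != taken or abs(y) <= THRESHOLD:
            self.w[0] = max(W_MIN, min(W_MAX, self.w[0] + t))
            for i, hi in enumerate(history, start=1):
                self.w[i] = max(W_MIN, min(W_MAX, self.w[i] + t * hi))
```

After a few updates on a biased branch, the weights align with the history bits that correlate with the outcome.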

  3. Perceptron predictor [figure: weights multiplied by history inputs and summed; Sign = prediction]

  4. Perceptron prediction works • + Complexity linear with the history length: • Can capture correlation over a very long history length • - But: • Long latency: the multiply-accumulate tree! • Inherently unable to discriminate between two histories if they are not « linearly separable » • (2 weights, 2 history bits): h0 ⊕ h1 is not recognized! Can we do better?
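The XOR limit can be checked by brute force: in the +1/-1 encoding, no bias plus two weights classifies h0 ⊕ h1 correctly. The search range below is an arbitrary illustration; the impossibility holds for any weights.

```python
# Brute-force check that h0 XOR h1 is not linearly separable: no
# (w0, w1, w2) over a small integer range predicts all four cases.
def separates_xor(w0, w1, w2):
    for h0 in (-1, 1):
        for h1 in (-1, 1):
            target = -1 if h0 == h1 else 1    # XOR in +1/-1 encoding
            y = w0 + w1 * h0 + w2 * h1
            if (y >= 0) != (target == 1):     # sign must match outcome
                return False
    return True

found = any(separates_xor(w0, w1, w2)
            for w0 in range(-8, 9)
            for w1 in range(-8, 9)
            for w2 in range(-8, 9))
assert not found   # no weight triple separates XOR
```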

  5. Use a redundant history • Insert several bits per branch in history to enhance linear separability: h0, h0 ⊕ h1, h0 ⊕ h2, h0 ⊕ add

  6. Redundant history perceptron • + Significant misprediction reduction: • > 30% for 12 out of 20 benchmarks • - 256 weights: • A 256-input multiply-add tree: 2048 bits wide!! • 256 counter updates!! • Latency? • Power consumption? • Logic complexity?

  7. 4 weights for 2 history bits = a single counter read • Inputs (0, h0, h1, h0 ⊕ h1), weights W0, W1, W2, W3 • Possible contributions to the branch prediction: • h=0 → (0,0,0,0): C0 = -W0 -W1 -W2 -W3 • h=1 → (0,1,0,1): C1 = -W0 +W1 -W2 +W3 • h=2 → (0,0,1,1): C2 = -W0 -W1 +W2 +W3 • h=3 → (0,1,1,0): C3 = -W0 +W1 +W2 -W3 • Update for h=2 and Out=1: • C2 += 4 • C0, C1 and C3 unchanged → Let us store the multiply-accumulate contributions instead of the weights!!
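The update rule on this slide follows from the four ±1 input vectors being mutually orthogonal: adding t·x(h) to the weights shifts C(h) by 4t and leaves the other three contributions unchanged. A quick numeric check, with arbitrary example weights:

```python
X = {  # input vector per 2-bit history h, in +1/-1 encoding (0 -> -1)
    0: (-1, -1, -1, -1),
    1: (-1, +1, -1, +1),
    2: (-1, -1, +1, +1),
    3: (-1, +1, +1, -1),
}

def contributions(w):
    # C(h) = dot product of the weights with the input vector for h
    return {h: sum(wi * xi for wi, xi in zip(w, X[h])) for h in X}

w = [3, -2, 5, 1]            # arbitrary example weights
before = contributions(w)
t, h = 1, 2                  # train toward "taken" under history h = 2
w = [wi + t * xi for wi, xi in zip(w, X[h])]
after = contributions(w)
assert after[2] == before[2] + 4 * t         # C2 += 4
assert all(after[k] == before[k] for k in (0, 1, 3))   # others unchanged
```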

  8. MAC contribution: 4-way redundant history • Let us really represent blocks of 4 history bits per 16 weights • There are only 16 possible multiply-accumulate contributions associated with these 16 weights → Store the multiply-accumulate contributions instead of the weights!!

  9. Redundant History Perceptron Predictor with MAC contribution [figure: 4N history bits drive N 16×1 MUXes; Sign = prediction]

  10. Redundant history and MAC representation • Replace a 16 multiply-add tree by a 16-to-1 MUX • Use of saturated arithmetic: • One can reduce the width of the counters to 6 bits → A 256 8-bit multiply-accumulate tree replaced by a 16 6-bit adder tree
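In MAC form, prediction becomes one 16-to-1 lookup per 4-bit history block plus a narrow adder tree. A minimal sketch, with illustrative names and an illustrative training increment (only the widths come from the slide):

```python
C_MAX, C_MIN = 31, -32   # 6-bit saturated contribution counters

def mac_predict(tables, history_blocks):
    # tables[b][h]: stored contribution for block b under 4-bit history h.
    # One lookup per block (the 16-to-1 MUX) replaces 16 multiply-adds;
    # the selected values feed a 6-bit-wide adder tree.
    s = sum(tables[b][h] for b, h in enumerate(history_blocks))
    return s >= 0

def mac_update(tables, history_blocks, taken, step=1):
    # in MAC form only the selected contribution of each block changes;
    # step is an illustrative increment, saturated to the 6-bit range
    d = step if taken else -step
    for b, h in enumerate(history_blocks):
        tables[b][h] = max(C_MIN, min(C_MAX, tables[b][h] + d))
```

Saturation is what allows the counters to be narrower than the original 8-bit weights without overflow.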

  11. Redundant history and MAC representation

  12. Back to finite storage predictors

  13. Redundant History Perceptron vs optimized 2bcgskew • Optimized 2bcgskew: 1 Mbit, 72-36-9-9 history + lots of tricks • 768 Kbit redundant history perceptron • 20 benchmarks: SPEC 2000 + SPEC 95, a fifty/fifty split!! → Perceptron and 2bcgskew do not capture exactly the same kind of correlation!!

  14. Towards the best of both worlds ! Redundant history skewed perceptron predictor

  15. Self-aliasing on a perceptron predictor • 1. Consider H and H' for a branch B differing on recent bits; if both behaviors are dictated by the same coinciding « old » history segment (e.g. bits 20-23), then there is an aliasing effect on a counter!! • 2. Most of the correlation is captured by recent history: most counters associated with « old » history are « wasted » • 3. Let us enable the use of the whole spectrum of counters through the use of multiple tables with different indices: « SKEWING »
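Skewing means each table hashes (address, history) with a different function, so two pairs that collide in one table are unlikely to collide in the others. The hash below is a simple illustrative mix, not the predictor's actual skewing function.

```python
def skew_index(addr, hist, table_id, bits=10):
    # a different mix per table_id -> a different index into each table
    # (illustrative hash; real skewed predictors use carefully chosen
    # functions with good inter-bank dispersion properties)
    x = addr ^ (hist >> table_id) ^ (hist << (table_id + 1))
    x ^= x >> bits                     # fold high bits down
    return x & ((1 << bits) - 1)       # keep a 'bits'-wide index

# the four tables are read at (usually) different indices
idx = [skew_index(0x4F2C, 0b1011011, t) for t in range(4)]
```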

  16. Redundant History Skewed Perceptron Predictor 4 tables accessed with different indices

  17. Redundant History Skewed Perceptron Predictor

  18. Further leveraging long history • Some applications may benefit from history lengths up to 128 bits; many do not!! • Don't want to use a wider adder tree • For a fixed history length, the number of paths that lead to a single branch varies considerably • Less information in some history sections than in others: • Repeating patterns « waste » space in history → Use a compressed form of history!

  19. Further leveraging long history (2) • Replace repeating patterns (up to 5 bits) by narrower chains • 1.5-3 compression ratio on our benchmark set • Use half uncompressed history and half compressed history • Significant benefit ( > 25 %) on several benchmarks; harmless for the others • Essentially captures all correlation associated with local history
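The idea of collapsing repeating patterns can be illustrated with a run-length style sketch. The real scheme compresses repeating patterns of up to 5 bits; for simplicity this sketch only collapses runs of a single repeated bit, and the cap of three repeats is an assumed parameter.

```python
def compress_history(bits, max_run=3):
    # collapse long runs of an identical bit down to at most max_run
    # copies, so loop-dominated histories occupy far fewer bits
    out = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1                      # scan to the end of the run
        out.extend([bits[i]] * min(j - i, max_run))
        i = j
    return out

# loop-like history: long runs of taken (1) around one not-taken (0)
h = [1] * 20 + [0] + [1] * 20
c = compress_history(h)                 # 41 bits shrink to 7
```

Loop branches, whose behavior dominates local history, compress extremely well, which is why this essentially captures the correlation a local-history predictor would.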

  20. RHSP and compressed history

  21. Addressing the predictor latency Ahead pipelined redundant history perceptron predictor

  22. The latency issue! • Single-cycle prediction would be needed, but: • 2-4 cycles for table read • 2-4 cycles for adder tree • Ahead-pipelined 2bcgskew, Seznec and Fraboulet, ISCA 2003: • On-the-fly information insertion in table indices • Resolve misprediction at execution time • Path-based perceptron, Jiménez, MICRO 2003: • « Systolic-like » ahead-pipelined perceptron prediction • Does not address table read delay • Resolves misprediction at commit time, not at execution time

  23. Ahead pipelining the RHSP: the challenges • Use X-block-ahead information to initiate the branch prediction: • X-block-ahead address and global history • Use intermediate path information to ensure prediction accuracy • But in-flight insertion into table indices is not sufficient!?! • Need to checkpoint all the information required to recompute on the fly any possible prediction for the X-1 intermediate blocks • But avoid checkpoint volume explosion

  24. Ahead pipelined Redundant History Skewed Perceptron Predictor [figure: X-block-ahead RHSP table read; partial sum on 14 counters; 32 counters read for the intermediate paths, selected by 5 one-block-ahead history bits]

  25. Ahead pipelined Redundant History Skewed Perceptron Predictor • Partial sum using only X-block-ahead information • Discriminate only 32 possible paths: • The 32 associated counters are read • Compute the 32 possible sums • Select the prediction on the last cycle • Checkpoint the 32 possible predictions
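The final selection step can be sketched as follows: the slow table reads start X blocks ahead and produce one candidate sum per intermediate path, and the late-arriving 5 path bits pick the right one in the last cycle. Function name and argument shapes are assumptions for the sketch.

```python
def ahead_predict(partial_sum, path_counters, path_bits):
    # partial_sum: sum computed from X-block-ahead information only
    # path_counters[p]: counter read ahead of time for intermediate path p
    # path_bits: the 5 bits (32 paths) known only at prediction time
    sums = [partial_sum + path_counters[p] for p in range(32)]
    return sums[path_bits] >= 0        # select one of the 32 predictions
```

In hardware the 32 sums are computed in parallel and the path bits drive a final 32-to-1 selection, so only a MUX sits on the critical last cycle.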

  26. Ahead pipelined RHSP (768 Kbits)

  27. Ahead pipelined RHSP • Very limited loss of accuracy for 6-block ahead: • 5 one-block-ahead history bits are sufficient to discriminate among all the intermediate paths • Loss of accuracy increases with the length of the prediction window: • Cannot discriminate between all the paths • Explosion of the number of paths originating from the same X-block-ahead block: • Fewer and fewer predictions performed by low-order counters

  28. Summary • Perceptron-based prediction improved: • Prediction accuracy • Use of redundant history • Introduction of skewing • Introduction of history compression • MAC representation: • A 16 6-bit adder tree against a 256 8-bit multiply-accumulate tree • X-block-ahead RHSP: • On-time prediction without sacrificing accuracy • Misprediction resolution at the execution stage

  29. Wide possible design space • To deal with their implementation constraints, the designer can play with: • Number of tables • Width of histories • Compressed/uncompressed ratio • Threshold/width of counters: • Half threshold / 5-bit counters is not so bad • Use of other MAC representations: • 8 counters for 3 bits, 16 counters for 5 bits • …

  30. Bonus An « objective » comparison of RHSP and 2bcgskew by their (common) inventor

  31. e-gskew and 2bc-gskew: logical view

  32. Optimized 2bcgskew • All optimizations from the EV8 predictor: • Different history lengths for all tables • Different hysteresis and prediction table sizes • + a few other tricks: • Sharing prediction and hysteresis tables through banking • Randomly enforcing the flipping of counters on mispredictions to avoid ping-pong phenomena • No « guru »-designed hash functions: just good functions • 2**(N+11)-bit predictor; (N, N, 4N, 8N) history • (4, 4, 16, 32) for 32 Kbit • (9, 9, 36, 72) for 1 Mbit
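The sizing rule on this slide is easy to reproduce: a 2**(N+11)-bit predictor uses history lengths (N, N, 4N, 8N), which matches both quoted configurations.

```python
def history_lengths(size_bits):
    # recover N from size = 2**(N+11); assumes size_bits is an exact
    # power of two, as in the slide's two examples
    n = size_bits.bit_length() - 1 - 11
    return (n, n, 4 * n, 8 * n)

assert history_lengths(32 * 1024) == (4, 4, 16, 32)      # 32 Kbit
assert history_lengths(1024 * 1024) == (9, 9, 36, 72)    # 1 Mbit
```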

  33. 2bcgskew vs RHSP (1) • Efficiency of the prediction scheme: • Both can use very long history: • Extra local-history prediction brings very poor benefit • Not aware of any other predictor handling such a long history • RHSP better tolerates/accommodates compressed history • RHSP captures some extra correlation • Efficiency of the storage usage (small predictors, e.g. 32 Kbit): • 2bcgskew more efficient on a few demanding benchmarks: go, gcc95 • RHSP surprisingly efficient on most benchmarks

  34. 2bcgskew vs RHSP (2) • Accesses to the predictor: • Up to three accesses per prediction on RHSP • But not so many accesses on correct predictions • Single access to the prediction table, single access to the hysteresis table on correct predictions for 2bcgskew

  35. 2bcgskew vs RHSP (3) • Hardware logic cost: • Adder tree + counter update for RHSP • Hashing functions + small logic for 2bcgskew • Latency: • Table read + adder tree for RHSP • Table read + a few gates for 2bcgskew

  36. That’s the end folks !

  37. RHSP and compressed history

  38. RHSP and compressed history (2)

  39. RHSP and compressed history (3)

  40. RHSP vs 2bcgskew: storage effectiveness (1)

  41. RHSP vs 2bcgskew: storage effectiveness (2)

  42. RHSP vs 2bcgskew: storage effectiveness (3)
