
Neural Methods for Dynamic Branch Prediction



  1. Neural Methods for Dynamic Branch Prediction Daniel A. Jiménez Department of Computer Science Rutgers University

  2. The Context • I'll be discussing the implementation of microprocessors • Microarchitecture • I study deeply pipelined, high clock frequency CPUs • The goal is to improve performance • Make the program go faster • How can we exploit program behavior to make it go faster? • Remove control dependences • Increase instruction-level parallelism

  3. An Example • This C++ code computes something useful. The inner loop executes two statements each time through the loop.

      int foo (int w[], bool v[], int n) {
          int sum = 0;
          for (int i=0; i<n; i++) {
              if (v[i]) sum += w[i];
              else sum += ~w[i];
          }
          return sum;
      }

  4. An Example continued • This C++ code computes the same thing with three statements in the loop. • This version is 55% faster on a Pentium 4. • The previous version had many mispredicted branch instructions. • Why it works: when v[i] is true, b = -1 is all ones, so ~(a ^ b) = a = w[i]; when v[i] is false, b = 0, so ~(a ^ b) = ~w[i]. The if/else is replaced by branch-free arithmetic.

      int foo2 (int w[], bool v[], int n) {
          int sum = 0;
          for (int i=0; i<n; i++) {
              int a = w[i];
              int b = - (int) v[i];
              sum += ~(a ^ b);
          }
          return sum;
      }

  5. How an Instruction is Processed Processing can be divided into five stages: instruction fetch, instruction decode, execute, memory access, and write back.

  6. Instruction-Level Parallelism To speed up the process, pipelining overlaps the execution of multiple instructions, exploiting parallelism between instructions. [Pipeline diagram: the five stages proceed in overlapped fashion]

  7. Control Hazards: Branches Conditional branches create a problem for pipelining: the next instruction can't be fetched until the branch has executed, several stages later.

  8. Pipelining and Branches Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle. [Pipeline diagram: an unresolved branch instruction stalls the stages behind it]

  9. Branch Prediction A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path. [Pipeline diagram: speculative execution past the branch] Branch predictors must be highly accurate to avoid mispredictions!

  10. Branch Predictors Must Improve • The cost of a misprediction is proportional to pipeline depth • As pipelines deepen, we need more accurate branch predictors • The Pentium 4 pipeline has 20 stages • Future pipelines will have more than 32 stages • Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage • Decreasing the misprediction rate from 9% to 4% yields a 31% speedup for a 32-stage pipeline (simulations with SimpleScalar/Alpha)
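
  To see why a few points of accuracy matter so much, here is a back-of-the-envelope model (a sketch only; the branch frequency and penalty below are assumed values, not the talk's simulation methodology):

      #include <cstdio>

      // Simple pipeline model. Assumed values: ideal CPI of 1, one
      // conditional branch per five instructions, and a misprediction
      // penalty equal to the 32-stage pipeline depth.
      int main() {
          const double base_cpi    = 1.0;
          const double branch_freq = 0.2;
          const double penalty     = 32.0;
          double cpi_before = base_cpi + branch_freq * 0.09 * penalty; // 9% mispredicted
          double cpi_after  = base_cpi + branch_freq * 0.04 * penalty; // 4% mispredicted
          std::printf("speedup = %.1f%%\n", (cpi_before / cpi_after - 1.0) * 100.0);
          return 0;
      }

  Even this toy model yields roughly a 25% speedup; the full SimpleScalar/Alpha simulations capture effects the model ignores.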

  11. Overview • Branch prediction background • Applying machine learning to branch prediction • Results and analysis • Circuit-level implementation • Future work and conclusions

  12. Branch Prediction Background

  13. Branch Prediction Background • The basic mechanism: 2-level adaptive prediction [Yeh & Patt '91] • Uses correlations between branch history and outcome • Examples: • gshare [McFarling '93] • agree [Sprangle et al. '97] • hybrid predictors [Evers et al. '96] • This scheme is highly accurate in practice
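
  For reference, a minimal gshare sketch (an illustration of McFarling's scheme; the history length and counter initialization are assumptions, not the talk's configuration):

      #include <cstdint>
      #include <vector>

      // Global history XORed with the branch address indexes a table of
      // 2-bit saturating counters; the counter's value gives the prediction.
      struct Gshare {
          static const int HIST_BITS = 14;          // assumed history length
          std::vector<uint8_t> counters;            // 2-bit counters, 0..3
          uint32_t history;

          Gshare() : counters(1u << HIST_BITS, 1), history(0) {}

          uint32_t index(uint32_t pc) const {
              return (pc ^ history) & ((1u << HIST_BITS) - 1);
          }
          bool predict(uint32_t pc) const {
              return counters[index(pc)] >= 2;      // 2 or 3 means predict taken
          }
          void update(uint32_t pc, bool taken) {
              uint8_t &c = counters[index(pc)];
              if (taken  && c < 3) c++;             // saturate upward
              if (!taken && c > 0) c--;             // saturate downward
              history = ((history << 1) | (taken ? 1 : 0)) & ((1u << HIST_BITS) - 1);
          }
      };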

  14. Branch Predictor Accuracy • Larger tables and smarter organizations yield better accuracy • Longer histories provide more context for finding correlations • Table size is exponential in history length: h bits of history index 2^h entries • The cost is increased access delay and chip area

  15. Applying Machine Learning to Branch Prediction

  16. Branch Prediction is a Machine Learning Problem • So why not apply a machine learning algorithm? • Replace 2-bit counters with a more accurate predictor • Tight constraints on the prediction mechanism • Must be fast and small enough to work as a component of a microprocessor • Artificial neural networks • A simple model of the networks of neurons in the brain • Learn to recognize and classify patterns • Most neural nets are slow and complex relative to tables • For branch prediction, we need a small and fast neural method

  17. A Neural Method for Branch Prediction • We investigated several neural methods • Most were too slow, too big, or not accurate enough • Our choice: the perceptron [Rosenblatt '62, Block '62] • Very high accuracy for branch prediction • Prediction and update are quick, relative to other neural methods • Sound theoretical foundation: the perceptron convergence theorem • Proven to work well for many classification problems

  18. Branch-Predicting Perceptron • Inputs (x's) are from the branch history register • Weights (w's) are small integers learned by on-line training • Output (y) gives the prediction: the dot product of the x's and w's • Training finds correlations between history and outcome
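
  A minimal sketch of the output computation in C++ (the history length and weight type here are illustrative assumptions):

      const int H = 23;              // history length (illustrative)
      int weights[H + 1];            // small signed integers; weights[0] is the bias

      int perceptron_output(const bool history[]) {
          int y = weights[0];        // bias weight: the branch's overall tendency
          for (int i = 0; i < H; i++)
              y += history[i] ? weights[i + 1] : -weights[i + 1];  // x_i is +1 or -1
          return y;                  // predict taken if y >= 0
      }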

  19. Training Algorithm
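
  The transcript does not reproduce the slide's figure. As a sketch, this is the standard perceptron-predictor update rule: train on a misprediction, or whenever the output's magnitude is at most a threshold theta (the formula theta = 1.93h + 14 is the value reported for this predictor; treat the surrounding types as assumptions):

      #include <cstdlib>   // std::abs

      // w has h+1 entries (w[0] is the bias), x holds the +/-1 history inputs,
      // y is the perceptron output, taken is the actual branch outcome.
      void train(int w[], const int x[], int h, int y, bool taken) {
          int t = taken ? 1 : -1;                    // +/-1 encoding of the outcome
          int theta = (int)(1.93 * h + 14);          // training threshold
          if ((y >= 0) != taken || std::abs(y) <= theta) {
              w[0] += t;                             // bias input is always +1
              for (int i = 0; i < h; i++)
                  w[i + 1] += t * x[i];              // increment on agreement,
                                                     // decrement on disagreement
              // real hardware also saturates each weight to a small signed range
          }
      }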

  20. Organization of the Perceptron Predictor • Keeps a table of perceptrons, indexed by branch address • Inputs are from the branch history register • Predict taken if output ≥ 0, otherwise predict not taken • Key intuition: table size isn't exponential in history length, so we can consider much longer histories
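
  Putting the pieces together, a sketch of the table organization (the table size and address hash are assumptions; the point is that storage grows as N*(H+1) weights, linear in the history length H):

      #include <cstdint>

      const int N = 128;             // number of perceptrons (assumed)
      const int H = 23;              // history length
      int table[N][H + 1];           // N*(H+1) small weights: linear in H,
                                     // versus the 2^H entries a PHT would need
      uint32_t ghr = 0;              // global history register

      bool predict_taken(uint32_t pc) {
          const int *w = table[(pc >> 2) % N];       // index by branch address
          int y = w[0];
          for (int i = 0; i < H; i++)
              y += ((ghr >> i) & 1) ? w[i + 1] : -w[i + 1];
          return y >= 0;                             // taken iff output >= 0
      }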

  21. Results and Analysis for the Perceptron Predictor

  22. Experimental Evaluation • Execution- and trace-driven simulations: • Measure instruction throughput (IPC) and misprediction rates • SimpleScalar/Alpha [Burger & Austin '97] • Alpha 21264-like configuration: • 4-wide issue, 64KB I-cache, 64KB D-cache, 512-entry BTB • SPECint 2000 benchmarks • Technological estimates: • HSPICE for circuit delay estimates • Modified CACTI 2.0 [Agarwal 2000] for PHT delay estimates

  23. Results: Predictor Accuracy • Perceptron outperforms a competitive hybrid predictor at a ~4KB hardware budget, with a 36% lower misprediction rate: 1.71% vs. 2.66%

  24. Results: Large Hardware Budgets • Multi-component hybrid was the most accurate fully dynamic predictor known in the literature [Evers 2000] • Perceptron predictor is even more accurate

  25. Delay Sensitive Implementation • Even the relatively simple perceptron has high access delay • Our solution: An overriding perceptron predictor • First level is a single-cycle gshare • Second level is a 4KB, 23-bit history perceptron predictor • HSPICE total prediction delay estimates: • 2 cycles at 833 MHz (like Alpha 21264) • 4 cycles at 1.76 GHz (like Pentium 4) • Compare with 4KB hybrid predictor
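
  A sketch of the overriding control flow (all function names below are illustrative placeholders, not a real simulator API): the fast prediction steers fetch at once, and the slower perceptron prediction re-steers only on disagreement.

      #include <cstdint>
      #include <cstdio>

      bool gshare_predict(uint32_t pc)     { return (pc & 1) != 0; }  // stub: ready in 1 cycle
      bool perceptron_predict(uint32_t pc) { return (pc & 1) != 0; }  // stub: ready in 2-4 cycles
      void steer_fetch(uint32_t pc, bool taken) {
          std::printf("steer pc=%x taken=%d\n", (unsigned)pc, taken ? 1 : 0);
      }

      void predict_branch(uint32_t pc) {
          bool fast = gshare_predict(pc);      // fetch proceeds immediately
          steer_fetch(pc, fast);
          bool slow = perceptron_predict(pc);  // arrives a few cycles later
          if (slow != fast)
              steer_fetch(pc, slow);           // override: squash the few wrong-path
      }                                        // fetches, far cheaper than a full flush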

  26. Results: IPC with high clock rate • Pentium 4-like: 20 cycle misprediction penalty, 1.76 GHz • 15.8% higher IPC than gshare, 5.7% higher than hybrid

  27. Analysis: History Length • The fixed-length path branch predictor can also use long histories [Stark, Evers & Patt `98]

  28. Analysis: Training Times • Perceptron "warms up" faster

  29. Circuit-Level Implementation of a Neural Branch Predictor

  30. Circuit-Level Implementation • Example output computation: 12 weights, a Wallace tree of depth 6 followed by a 14-bit carry-lookahead adder • Carry-save adders have O(1) depth; the carry-lookahead adder has O(log n) depth • Delay is 2-4 cycles for longer histories
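
  To make the depth argument concrete, a small software model of a 3:2 carry-save compressor (an illustration of the bit-level behavior, not the actual circuit):

      #include <cstdint>
      #include <cassert>

      // 3:2 compressor: O(1) depth because every output bit depends on
      // only three input bits; no carry propagation occurs.
      void csa(uint32_t a, uint32_t b, uint32_t c, uint32_t &sum, uint32_t &carry) {
          sum   = a ^ b ^ c;                           // per-bit sum
          carry = ((a & b) | (a & c) | (b & c)) << 1;  // per-bit carries, shifted
      }

      int main() {
          uint32_t s, k;
          csa(12, 7, 30, s, k);
          assert(s + k == 12u + 7u + 30u);  // two addends remain for the final
          return 0;                         // O(log n) carry-lookahead addition
      }

  A Wallace tree applies such compressors in parallel levels until only two addends remain, so only the final carry-lookahead addition pays carry-propagation delay.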

  31. HSPICE Perceptron Simulations • 2 cycles at 833 MHz, 4 cycles at 1.76 GHz, 180 nm technology

  32. Future Work and Conclusions

  33. Future Work with Perceptron Predictor • Let's make the best predictor even better • Better representation • Better training algorithm • Latency is a problem • Crazy people are saying that overriding organizations don't work as well as simple but large predictors [Me, HPCA 2003] • How can we eliminate the latency of the perceptron predictor?

  34. Future Work with Perceptron Predictor • Value prediction • Predict value of a load to mitigate memory latency • Indirect branch prediction • Virtual dispatch • Switch statements in C • Exit prediction • Predict the taken exit from predicated hyperblocks

  35. Future Work Characterizing Predictability • Branch predictability, value predictability • How can we characterize algorithms in terms of their predictability? • Given an algorithm, how can we transform it so that its branches and values are easier to predict? • How much predictability is inherent in the algorithm, and how much is an artifact of the program structure? • How can we compare different algorithms' predictability?

  36. Conclusions • Neural predictors can improve performance for deeply pipelined microprocessors • Perceptron learning is well-suited for microarchitectural implementation • There is still a lot of work left to be done on the perceptron predictor in particular and microarchitectural prediction in general

  37. The End
