CENG 450 Computer Systems and Architecture Lecture 11

Presentation Transcript


  1. CENG 450 Computer Systems and Architecture Lecture 11 • Amirali Baniasadi • amirali@ece.uvic.ca

  2. This Lecture • Branch Prediction • Multiple Issue

  3. Branch Prediction • Predicting the outcome of a branch • Direction: • Taken / Not Taken • Direction predictors • Target Address • PC+offset (Taken)/ PC+4 (Not Taken) • Target address predictors • Branch Target Buffer (BTB)
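A minimal sketch of a target address predictor, assuming 4-byte instructions and a small direct-mapped table. The class name `BranchTargetBuffer`, its size, and its methods are illustrative assumptions, not taken from the lecture. On a miss, or when the direction predictor says not taken, fetch falls through to PC+4.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: remembers the last taken target of a branch PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.tags = [None] * entries   # full PC stored as the tag (assumption)
        self.targets = [0] * entries

    def _index(self, pc):
        # Drop the two byte-offset bits of a 4-byte instruction, then wrap.
        return (pc >> 2) % self.entries

    def predict_next_pc(self, pc, predict_taken):
        i = self._index(pc)
        if predict_taken and self.tags[i] == pc:
            return self.targets[i]     # BTB hit: predicted taken target
        return pc + 4                  # not taken, or no target known yet

    def update(self, pc, taken, target):
        if taken:                      # only taken branches allocate an entry
            i = self._index(pc)
            self.tags[i] = pc
            self.targets[i] = target


btb = BranchTargetBuffer()
first = btb.predict_next_pc(0x100, predict_taken=True)   # cold miss
btb.update(0x100, taken=True, target=0x80)
second = btb.predict_next_pc(0x100, predict_taken=True)  # hit after training
```

Here `first` is the fall-through address and `second` is the learned target, which shows why the first taken execution of a branch cannot benefit from the BTB.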

  4. Why do we need branch prediction? • Branch prediction • Increases the number of instructions available for the scheduler to issue. Increases instruction level parallelism (ILP) • Allows useful work to be completed while waiting for the branch to resolve

  5. Branch Prediction Strategies • Static • Decided before runtime • Examples: • Always-Not Taken • Always-Taken • Backwards Taken, Forward Not Taken (BTFNT) • Profile-driven prediction • Dynamic • Prediction decisions may change during the execution of the program
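The three static rules listed above can be sketched as functions of the branch PC and its target. The trace below is hypothetical (one loop-closing backward branch and one mostly-not-taken forward branch), chosen only to show how the rules differ.

```python
def always_not_taken(pc, target):
    return False

def always_taken(pc, target):
    return True

def btfnt(pc, target):
    # Backwards Taken, Forward Not Taken: a target below the PC is a backward branch.
    return target < pc

# Hypothetical trace entries: (branch PC, target address, actual outcome).
loop_branch = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
forward_branch = [(0x140, 0x160, False)] * 8 + [(0x140, 0x160, True)] * 2
trace = loop_branch + forward_branch

def correct_predictions(predictor, trace):
    # Count how many outcomes the static rule guesses correctly.
    return sum(predictor(pc, target) == taken for pc, target, taken in trace)

results = {
    rule.__name__: correct_predictions(rule, trace)
    for rule in (always_not_taken, always_taken, btfnt)
}
```

On this made-up trace of 20 branches, BTFNT gets 17 right, against 11 for always-taken and 9 for always-not-taken. Real programs will give different numbers.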

  6. What happens when a branch is predicted? • On misprediction: • No speculative state may commit • Squash instructions in the pipeline • Must not allow stores in the pipeline to occur • Cannot allow stores which would not have happened to commit • Even for good branch predictors more than half of the fetched instructions are squashed

  7. A Generic Branch Predictor • [Figure: Fetch and Resolve stages around the predictor f(PC, x); labels: Predicted Stream (PC, T or NT), Actual Stream, Execution Order] • f(PC, x) = T or NT • What is f(PC, x)? • x can be any relevant information • Thus far x was empty

  8. Bimodal Branch Predictors • Dynamically store information about the branch behaviour • Branches tend to behave in a fixed way • Branches tend to behave in the same way across program execution • Index a Pattern History Table using the branch address • 1 bit: branch behaves as it did last time • Saturating 2 bit counter: branch behaves as it usually does
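A minimal sketch of a Pattern History Table of saturating 2-bit counters indexed by the branch address. The table size, the initial counter value (weakly taken), and the index function are assumptions made for illustration.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries  # 0,1 = not taken; 2,3 = taken (start weakly taken)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)  # saturate at 3
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)  # saturate at 0


predictor = BimodalPredictor(entries=16)
before = predictor.predict(0x400)
predictor.update(0x400, taken=False)
predictor.update(0x400, taken=False)
after = predictor.predict(0x400)
other_branch = predictor.predict(0x404)  # a different entry, left untouched
```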

  9. Saturating-Counter Predictors • Consider a strongly biased branch with an infrequent outcome • TTTTTTTTNTTTTTTTTNTTTT • Last-outcome will mispredict twice per infrequent outcome encounter (once on the N, once on the T that follows it) • Idea: Remember most frequent case • Saturating-Counter: Hysteresis • often called bi-modal predictor • Captures Temporal Bias

  10. Bimodal Prediction • Table of 2-bit saturating counters • Predict the most common direction • Advantages: simple, cheap, “good” accuracy • Bimodal will mispredict once per infrequent outcome encounter: TTTTTTTTNTTTTTTTTNTTTT
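The difference between last-outcome and 2-bit saturating-counter prediction on the stream from the slides can be checked with a short simulation. This is a sketch that assumes both predictors start out predicting taken (the counter starts at 3, strongly taken).

```python
def last_outcome_mispredicts(stream, initial_taken=True):
    # 1-bit scheme: predict whatever the branch did last time.
    prediction, misses = initial_taken, 0
    for outcome in stream:
        taken = outcome == "T"
        if prediction != taken:
            misses += 1
        prediction = taken
    return misses

def two_bit_mispredicts(stream, counter=3):
    # 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
    misses = 0
    for outcome in stream:
        taken = outcome == "T"
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

STREAM = "TTTTTTTTNTTTTTTTTNTTTT"
one_bit = last_outcome_mispredicts(STREAM)
two_bit = two_bit_mispredicts(STREAM)
```

The stream contains two N outcomes, so the 1-bit scheme misses four times and the 2-bit counter misses twice. The hysteresis keeps a single N from flipping the prediction.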

  11. Correlating Predictors • From program perspective: • Different Branches may be correlated • if (aa == 2) aa = 0; • if (bb == 2) bb = 0; • if (aa != bb) then … • Can be viewed as a pattern detector • Instead of keeping aggregate history information • I.e., most frequent outcome • Keep exact history information • Pattern of n most recent outcomes • Example: • BHR: n most recent branch outcomes • Use PC and BHR (xor?) to access prediction table
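The correlation in the aa/bb example can be verified by brute force. In this sketch the "outcome" of each branch is simply whether its condition is true. The actual taken/not-taken direction depends on how a compiler lays out the code, which the slide does not specify.

```python
def branch_outcomes(aa, bb):
    b1 = aa == 2          # first branch condition
    if b1:
        aa = 0
    b2 = bb == 2          # second branch condition
    if b2:
        bb = 0
    b3 = aa != bb         # third branch condition
    return b1, b2, b3

# Enumerate a small range of inputs (the range itself is an arbitrary choice).
results = [branch_outcomes(aa, bb) for aa in range(4) for bb in range(4)]

# Whenever the first two conditions both hold, aa and bb are both 0,
# so the third condition can never hold.
third_when_both = [b3 for b1, b2, b3 in results if b1 and b2]
```

A predictor that only looks at the third branch's own address cannot see this, while one that also sees the two preceding outcomes can.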

  12. Pattern-based Prediction • Nested loops: for i = 0 to N for j = 0 to 3 … • Branch Outcome Stream for j-for branch • 11101110111011101110 • Patterns: • 111 -> 0 • 110 -> 1 • 101 -> 1 • 011 -> 1 • 100% accuracy • Learning time 4 instances • Table Index (PC, 3-bit history)
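The pattern table for the inner-loop branch can be reproduced with a small simulation. This sketch assumes a 3-bit history that starts as all zeros, a default prediction of taken for patterns not yet seen, and a table that stores the last outcome observed after each pattern. All three are illustrative choices.

```python
def run_pattern_predictor(stream, history_bits=3):
    table = {}                        # history pattern -> last outcome seen after it
    history = "0" * history_bits
    hits = []
    for outcome in stream:
        prediction = table.get(history, "1")   # unseen pattern: guess taken
        hits.append(prediction == outcome)
        table[history] = outcome               # learn what followed this pattern
        history = (history + outcome)[-history_bits:]
    return hits, table

STREAM = "1110" * 5                   # outcome stream of the j-loop branch
hits, table = run_pattern_predictor(STREAM)
```

Under these assumptions the only miss is the first 0 (the first loop exit). Once the four patterns 111, 110, 101, 011 are in the table, every later outcome is predicted correctly.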

  13. Two-level Branch Predictors • A branch outcome depends on the outcomes of previous branches • First level: Branch History Registers (BHR) • Global history / Branch correlation: past executions of all branches • Self history / Private history: past executions of the same branch • Second level: Pattern History Table (PHT) • Use first level information to index a table • Possibly XOR with the branch address • PHT: Usually saturating 2 bit counters • Also private, shared or global

  14. Gshare Predictor (McFarling) • [Figure: PC and Global BHR are combined by a function f to index the Branch History Table, which supplies the Prediction] • PC and BHR can be • concatenated • completely overlapped • partially overlapped • XORed, etc. • How deep should the BHR be? • Really depends on the program • But deeper increases learning time • May increase quality of information
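A minimal gshare sketch, assuming the PC and a global BHR of equal width are XORed to index a table of 2-bit counters. The history length, initial counter value, and class name are assumptions for illustration, not McFarling's exact configuration.

```python
class GsharePredictor:
    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0                              # global branch history register
        self.pht = [2] * (1 << history_bits)      # 2-bit counters, weakly taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.bhr) & self.mask  # PC XOR global history

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask  # shift in the outcome


predictor = GsharePredictor(history_bits=4)
misses = []
for outcome in "1110" * 10:           # the loop-branch stream from the previous slide
    taken = outcome == "1"
    misses.append(predictor.predict(0x400) != taken)
    predictor.update(0x400, taken)
```

On this single-branch stream the predictor misses only the first loop exit. A plain 2-bit counter would miss every exit, because it has no history to distinguish the fourth iteration from the others.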

  15. Hybrid Prediction • [Figure: the PC indexes GSHARE, Bimodal, ... in parallel; each produces T/NT; a Selector picks the final T/NT] • Combining branch predictors • Use two different branch predictors • Access both in parallel • A third table determines which prediction to use • Two or more predictor components combined • Different branches benefit from different types of history
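A sketch of the selector idea, assuming a bimodal table, a gshare table, and a third table of 2-bit counters that moves toward whichever component was right when the two disagree. Table sizes, initial values, and the update rule are assumptions; real designs vary.

```python
def bump(counter, up):
    # Move a 2-bit saturating counter up or down.
    return min(counter + 1, 3) if up else max(counter - 1, 0)

class HybridPredictor:
    def __init__(self, bits=4):
        size = 1 << bits
        self.mask = size - 1
        self.bhr = 0
        self.bimodal = [2] * size
        self.gshare = [2] * size
        self.selector = [2] * size    # >= 2 means "trust gshare" (assumption)

    def _indices(self, pc):
        base = (pc >> 2) & self.mask
        return base, (base ^ self.bhr) & self.mask

    def predict(self, pc):
        b, g = self._indices(pc)
        use_gshare = self.selector[b] >= 2
        return (self.gshare[g] if use_gshare else self.bimodal[b]) >= 2

    def update(self, pc, taken):
        b, g = self._indices(pc)
        bimodal_ok = (self.bimodal[b] >= 2) == taken
        gshare_ok = (self.gshare[g] >= 2) == taken
        if bimodal_ok != gshare_ok:   # train the selector only on disagreement
            self.selector[b] = bump(self.selector[b], gshare_ok)
        self.bimodal[b] = bump(self.bimodal[b], taken)
        self.gshare[g] = bump(self.gshare[g], taken)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


hybrid = HybridPredictor()
misses = 0
for outcome in "1110" * 10:
    taken = outcome == "1"
    misses += hybrid.predict(0x400) != taken
    hybrid.update(0x400, taken)
```

For this loop branch the selector entry saturates toward gshare, since the bimodal component keeps missing the loop exit.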

  16. Issues Affecting Accurate Branch Prediction • Aliasing • More than one branch may use the same BHT/PHT entry • Constructive • Prediction that would have been incorrect, predicted correctly • Destructive • Prediction that would have been correct, predicted incorrectly • Neutral • No change in the accuracy
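Destructive aliasing can be illustrated with two hypothetical branches of opposite bias whose addresses map to the same counter in a very small table. The addresses and table sizes below are made up for the example.

```python
def count_misses(entries, trace):
    counters = [2] * entries                   # 2-bit counters, weakly taken
    misses = 0
    for pc, taken in trace:
        i = (pc >> 2) % entries                # low PC bits select the entry
        if (counters[i] >= 2) != taken:
            misses += 1
        counters[i] = min(counters[i] + 1, 3) if taken else max(counters[i] - 1, 0)
    return misses

# Branch at 0x100 is always taken, branch at 0x110 is never taken, interleaved.
trace = [(0x100, True), (0x110, False)] * 10

small_table = count_misses(4, trace)    # both branches share entry 0
large_table = count_misses(64, trace)   # each branch gets its own entry
```

With 4 entries the never-taken branch is mispredicted every time because the always-taken branch keeps pulling the shared counter back up. With 64 entries only the initial warm-up miss remains.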

  17. More Issues • Training time • Need to see enough branches to uncover pattern • Need enough time to reach steady state • “Wrong” history • Incorrect type of history for the branch • Stale state • Predictor is updated after information is needed • Operating system context switches • More aliasing caused by branches in different programs

  18. Performance Metrics • Misprediction rate • Mispredicted branches per executed branch • Unfortunately, the most commonly reported metric • Instructions per mispredicted branch • Gives a better idea of the program behaviour • Branches are not evenly spaced
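The two metrics can be compared on made-up counts. The programs and numbers below are hypothetical, chosen so that both have the same misprediction rate but a different branch density.

```python
def misprediction_rate(mispredicted, branches):
    return mispredicted / branches

def instructions_per_mispredict(instructions, mispredicted):
    return instructions / mispredicted

# Hypothetical programs: (instructions, executed branches, mispredicted branches).
branchy = (1_000_000, 200_000, 10_000)
sparse = (1_000_000, 50_000, 2_500)

rate_branchy = misprediction_rate(branchy[2], branchy[1])
rate_sparse = misprediction_rate(sparse[2], sparse[1])
ipm_branchy = instructions_per_mispredict(branchy[0], branchy[2])
ipm_sparse = instructions_per_mispredict(sparse[0], sparse[2])
```

Both programs mispredict 5% of their branches, yet the first is interrupted by a misprediction every 100 instructions and the second only every 400. The rate alone hides that difference.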

  19. Upper Limit to ILP: Ideal Machine • Amount of parallelism when there are no branch mispredictions and we are limited only by data dependencies • IPC: instructions that could theoretically be issued per cycle • FP: 75 - 150 • Integer: 18 - 60

  20. Impact of Realistic Branch Prediction • Limiting the type of branch prediction • IPC • FP: 15 - 45 • Integer: 6 - 12

  21. Pentium III • Dynamic branch prediction • 512-entry BTB predicts direction and target, 4-bit history used with PC to derive direction • Mispredicted: at least 9 cycles, as many as 26, average 10-15 cycles

  22. AMD Athlon K7 • 10-stage integer, 15-stage fp pipeline, predictor accessed in fetch • 2K-entry bimodal, 2K-entry BTB • Branch Penalties: • Mispredict penalty: at least 10 cycles
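To get a feel for what such penalties cost, a common first-order model adds the expected misprediction stall to a base CPI. The model and the branch fraction and misprediction rate below are assumptions for illustration. Only the 10-cycle and 26-cycle penalties come from the slides.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    # Expected stall cycles per instruction = P(branch) * P(mispredict) * penalty.
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload: 20% branches, 5% of them mispredicted, base CPI of 1.
minimum_penalty = effective_cpi(1.0, 0.20, 0.05, 10)   # "at least 10 cycles"
worst_case = effective_cpi(1.0, 0.20, 0.05, 26)        # Pentium III upper bound
```

Under these assumptions, even a 95%-accurate predictor adds about 0.1 to 0.26 cycles per instruction.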

  23. Multiple Issue • Multiple Issue is the ability of the processor to start more than one instruction in a given cycle. • Superscalar processors • Very Long Instruction Word (VLIW) processors

  24. 1990’s: Superscalar Processors • Bottleneck: CPI >= 1 • Limit on scalar performance (single instruction issue) • Hazards • Superpipelining? Diminishing returns (hazards + overhead) • How can we make the CPI = 0.5? • Multiple instructions in every pipeline stage (super-scalar)

        Cycle   1    2    3    4    5    6    7
        Inst0   IF   ID   EX   MEM  WB
        Inst1   IF   ID   EX   MEM  WB
        Inst2        IF   ID   EX   MEM  WB
        Inst3        IF   ID   EX   MEM  WB
        Inst4             IF   ID   EX   MEM  WB
        Inst5             IF   ID   EX   MEM  WB
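The cycle count in the diagram follows from a simple formula for an ideal pipeline with no stalls: the first group of instructions needs one cycle per stage, and each further group finishes one cycle later. A sketch, assuming a 5-stage pipeline and perfectly paired instructions:

```python
import math

def pipeline_cycles(instructions, stages=5, width=1):
    # Instructions enter the pipeline in groups of `width` per cycle.
    issue_groups = math.ceil(instructions / width)
    return stages + issue_groups - 1

scalar = pipeline_cycles(6, width=1)        # one instruction per cycle
superscalar = pipeline_cycles(6, width=2)   # two instructions per cycle

def steady_state_cpi(width):
    # Ignoring fill time and hazards, `width` instructions complete every cycle.
    return 1 / width
```

Six instructions take 10 cycles on the scalar pipeline and 7 on the 2-wide one. Once the pipeline is full, the 2-wide machine reaches the CPI of 0.5 the slide asks for.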

  25. Superscalar Vs. VLIW • Religious debate, similar to RISC vs. CISC • Wisconsin + Michigan (Super scalar) Vs. Illinois (VLIW) • Q. Who can schedule code better, hardware or software?

  26. Hardware Scheduling • High branch prediction accuracy • Dynamic information on latencies (cache misses) • Dynamic information on memory dependences • Easy to speculate (& recover from mis-speculation) • Works for generic, non-loop, irregular code • Ex: databases, desktop applications, compilers • Limited reorder buffer size limits “lookahead” • High cost/complexity • Slow clock

  27. Software Scheduling • Large scheduling scope (full program), large “lookahead” • Can handle very long latencies • Simple hardware with fast clock • Only works well for “regular” codes (scientific, FORTRAN) • Low branch prediction accuracy • Can improve by profiling • No information on latencies like cache misses • Can improve by profiling • Pain to speculate and recover from mis-speculation • Can improve with hardware support

  28. Superscalar Processors • Pioneer: IBM (America => RIOS, RS/6000, Power-1) • Superscalar instruction combinations • 1 ALU or memory or branch + 1 FP (RS/6000) • Any 1 + 1 ALU (Pentium) • Any 1 ALU or FP+ 1 ALU + 1 load + 1 store + 1 branch (Pentium II) • Impact of superscalar • More opportunity for hazards (why?) • More performance loss due to hazards (why?)
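One possible answer to the first "why?" is that instructions issued in the same cycle must be checked against each other, and the number of such checks grows with issue width. The sketch below uses a hypothetical `Instr` record and checks only register RAW and WAW conflicts inside one issue group. It ignores structural limits such as the unit combinations listed above.

```python
from collections import namedtuple

# Hypothetical instruction record: operation, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")

def can_issue_together(first, second):
    # `second` follows `first` in program order within the same issue group.
    raw = first.dest is not None and first.dest in second.srcs
    waw = first.dest is not None and first.dest == second.dest
    return not (raw or waw)

def pair_checks(width):
    # Every pair of instructions in an issue group must be compared.
    return width * (width - 1) // 2

add = Instr("add", "r1", ("r2", "r3"))
sub = Instr("sub", "r4", ("r1", "r5"))   # reads r1 produced by add: RAW
mul = Instr("mul", "r6", ("r7", "r8"))   # independent of add

dependent_pair = can_issue_together(add, sub)
independent_pair = can_issue_together(add, mul)
```

A scalar pipeline needs no intra-group checks, a 2-wide machine needs one per cycle, and an 8-wide machine needs 28.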

  29. Superscalar Processors • Issues varying number of instructions per clock • Scheduling: Static (by the compiler) or dynamic(by the hardware) • Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo). • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

  30. Elements of Advanced Superscalars • High performance instruction fetching • Good dynamic branch and jump prediction • Multiple instructions per cycle, multiple branches per cycle? • Scheduling and hazard elimination • Dynamic scheduling • Not necessarily: Alpha 21064 & Pentium were statically scheduled • Register renaming to eliminate WAR and WAW • Parallel functional units, paths/buses/multiple register ports • High performance memory systems • Speculative execution
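Register renaming can be sketched as a map from architectural to physical registers in which every write gets a fresh physical register. The register names, the unlimited physical register pool, and the tuple encoding of instructions are simplifying assumptions.

```python
def rename(instructions, num_arch_regs=8):
    # Architectural register rN starts out mapped to physical register pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, srcs in instructions:
        new_srcs = tuple(mapping[s] for s in srcs)   # read the current mappings first
        mapping[dest] = f"p{next_phys}"              # then give the write a fresh register
        next_phys += 1
        renamed.append((mapping[dest], new_srcs))
    return renamed

program = [
    ("r1", ("r2", "r3")),   # r1 = r2 op r3
    ("r2", ("r4", "r5")),   # WAR: overwrites r2, which the first instruction reads
    ("r1", ("r6", "r7")),   # WAW: overwrites r1, which the first instruction writes
    ("r3", ("r1", "r2")),   # RAW: reads the newest r1 and r2
]
renamed = rename(program)
```

After renaming, every destination is unique, so the WAR and WAW conflicts disappear. The last instruction still reads the registers written by the second and third, so the true RAW dependences are preserved.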

  31. SS + DS + Speculation • Superscalar + Dynamic scheduling + Speculation Three great tastes that taste great together • CPI >= 1? • Overcome with superscalar • Superscalar increases hazards • Overcome with dynamic scheduling • RAW dependences still a problem? • Overcome with a large window • Branches a problem for filling large window? • Overcome with speculation

  32. The Big Picture • [Figure labels: Static program; Fetch & branch predict; issue; execution; Reorder & commit]

  33. Readings • New paper on branch prediction online. READ. • Material will be used in the THIRD quiz
