
Ginger: Control Independence Using Tag Rewriting

Andrew Hilton, Amir Roth, University of Pennsylvania, {adhilton, amir}@cis.upenn.edu. ISCA-34, June 2007.


Presentation Transcript


  1. Ginger: Control Independence Using Tag Rewriting
     Andrew Hilton, Amir Roth, University of Pennsylvania, {adhilton, amir}@cis.upenn.edu
     ISCA-34, June 2007

  2. Control Independence (CI)
     Running example:
       A: bez r1, D
       B: r2=1          control dependent (CD)
       C: jmp E         control dependent (CD)
       D: r2=2          control dependent (CD)
       E: r3=r1+1       control independent (CI)
       F: r4=r2+1       control independent (CI)
       G: r5=ld(r4)     control independent (CI)
     Branch mispredictions limit single-thread performance
     • Improve prediction accuracy? Hard
     • Predicate? Cost on correct predictions
     • Exploit control independence (CI) to reduce squash penalty
     This paper: Ginger, a new (better) CI microarchitecture
     Acronyms to remember: CI, CD

  3. Exploiting Control Independence
     Example: branch A: bez r1, D is mis-predicted; the wrong path is D, E, F, G and the correct path is B, C, E, F, G
     Conventional recovery
     • Squash all post-mis-prediction insns
     • Fetch/execute all correct-path insns
     • Re-fetch/re-execute CI insns (waste)
     CI recovery
     • Squash only wrong-path CD insns
     • Fetch/execute only correct-path CD insns
     • Preserve CI insns: E, F, G
     • Preserve un-dispatched CI insns: H, I…
     How to “insert” CD insns?
     What to do about CI insns that depend on CD insns?

  4. Out-of-Order Renaming
     Diagram (running example after renaming, three stages):
       Start (wrong path):   A: bez p1, D | D: p2=2 | E: p3=p1+1 | F: p4=p2+1 | G: p5=ld(p4)
       CI halfway (step 1):  A: bez p1, D | B: p6=1 | C: jmp E | E: p3=p1+1 | F: p4=p2+1 | G: p5=ld(p4)
       Goal (correct path):  A: bez p1, D | B: p6=1 | C: jmp E | E: p3=p1+1 | F: p4=p6+1 | G: p5=ld(p4)
     CI step 1: replace CD insns
     CI step 2: out-of-order renaming
     • Step 1 changes inputs for some CI insns
     • CI data dependent (CIDD) insns: F and G (transitively, via F)
     • Must identify CIDD insns and repair their inputs
     • Must re-issue CIDD insns that have already issued
     • Key feature of CI; its implementation distinguishes CI schemes
     Acronym to remember: CIDD
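
The CIDD definition on this slide can be illustrated with a small sketch. It assumes instructions are described by architectural registers only and ignores dependences through memory; the `Insn` type and `find_cidd` helper are hypothetical names introduced here, not part of the paper.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Insn(NamedTuple):
    name: str
    dest: Optional[str]        # architectural destination register, if any
    srcs: Tuple[str, ...]      # architectural source registers


def find_cidd(ci_insns: List[Insn], cd_written: Set[str]) -> List[str]:
    """Return names of CI insns that depend, directly or transitively,
    on a register written inside the CD region (cd_written)."""
    tainted = set(cd_written)
    cidd = []
    for insn in ci_insns:                      # program order
        if any(src in tainted for src in insn.srcs):
            cidd.append(insn.name)
            if insn.dest is not None:
                tainted.add(insn.dest)         # dependence propagates
        elif insn.dest is not None:
            tainted.discard(insn.dest)         # clean overwrite ends the chain
    return cidd


# CI region of the running example; the CD region writes r2.
ci_region = [
    Insn("E", "r3", ("r1",)),
    Insn("F", "r4", ("r2",)),
    Insn("G", "r5", ("r4",)),
]
print(find_cidd(ci_region, {"r2"}))  # ['F', 'G']
```

E reads only r1 and is therefore CI data independent, while F reads r2 directly and G reads F's output.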

  5. Outline
     • Control Independence (CI) and out-of-order renaming
     • Prior CI microarchitectures (ooo renaming schemes)
       • “Walker”
       • Skipper
     • Ginger
     • Comparative performance evaluation
     • Conclusion

  6. “Walker” [Rotenberg+, HPCA’99]
     Diagram (after recovery):
       A: bez p1, D
       B: p6=1
       C: jmp E
       E: p3=p1+1
       F: p4=p2+1 becomes F: p4=p6+1     input changed → re-dispatch
       G: p5=ld(p4)                      input transitively changed → re-dispatch
     Ooo renaming: walk all CI insns
     • Re-rename, re-dispatch if inputs (transitively) changed
     • Reactive: no penalty on correct prediction (no worse than base)
     • High overhead on mis-prediction
       • Walks and re-renames CI data independent (CIDI) insns: E
       • Typically many more of those than CIDD
       • Still better than baseline
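
The walk described on this slide can be sketched as a loop over every CI instruction. This is a simplified model under stated assumptions: each instruction keeps its original destination tag (as in the slide's diagram), and `walker_recover` and the dictionary layout are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple


def walker_recover(ci_insns: List[dict], map_table: Dict[str, str]) -> Tuple[int, List[str]]:
    """Re-rename every CI insn against the repaired map table.
    Returns (number of insns walked, names of insns to re-dispatch)."""
    table = dict(map_table)        # correct-path mappings at the CI start
    stale = set()                  # physical regs whose values will be recomputed
    redispatched = []
    walked = 0
    for insn in ci_insns:          # program order
        walked += 1
        new_srcs = [(arch, table[arch]) for arch, _ in insn["srcs"]]
        changed = new_srcs != insn["srcs"] or any(p in stale for _, p in new_srcs)
        if changed:
            insn["srcs"] = new_srcs
            stale.add(insn["dest"][1])
            redispatched.append(insn["name"])
        table[insn["dest"][0]] = insn["dest"][1]
    return walked, redispatched


# Each insn: dest = (arch reg, physical reg), srcs = [(arch reg, physical reg)].
ci = [
    {"name": "E", "dest": ("r3", "p3"), "srcs": [("r1", "p1")]},
    {"name": "F", "dest": ("r4", "p4"), "srcs": [("r2", "p2")]},
    {"name": "G", "dest": ("r5", "p5"), "srcs": [("r4", "p4")]},
]
walked, redispatched = walker_recover(ci, {"r1": "p1", "r2": "p6"})
print(walked, redispatched)  # 3 ['F', 'G']
```

The walked count equals the number of CI instructions even though only two are re-dispatched, which is the overhead the slide attributes to this scheme.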

  7. Skipper [Cher+, MICRO’01]
     Diagram:
       A: bez p1, D
       B: p6=1
       C: jmp E
       P: p9=?? becomes P: p9=p6     pre-synchronize “pmove”
       E: p3=p1+1
       F: p4=p9+1
       G: p5=ld(p4)
     Ooo renaming: proactive CI + pre-synchronization
     • Defer CD fetch until branch resolves (reserve space)
     • Pre-synchronize: predict CD output registers (r2) and pre-allocate
     • After correct-path CD, dispatch/execute “pmoves”
     • Low ooo renaming overhead on mis-prediction
       • Proportional to CD region register output set
     • Same overhead even on correct prediction

  8. OOO Renaming: “Walker” + Skipper → Ginger
     “Walker”: walk CI insns
     • Reactive: no overhead on correct predictions
     • High overhead on mis-predictions: proportional to CI insns
     Skipper: pre-synchronize
     • Low overhead on mis-predictions: proportional to CD registers
     • Proactive: same overhead on correct predictions
     Ginger: tag rewriting
     • Low overhead on mis-predictions: proportional to CD registers
     • Reactive: no overhead on correct predictions
       • Proactive also possible, but not really worth it
     • Uses (mostly) existing hardware
     • Supports ooo renaming of loads

  9. Outline
     • Control Independence (CI) and out-of-order renaming
     • Prior CI microarchitectures (ooo renaming schemes)
     • Ginger
       • Tag rewriting
       • Selective re-dispatch
       • Out-of-order renaming for loads
       • Inserting CD insns
     • Comparative performance evaluation
     • Conclusion

  10. Tag Rewriting at 32K Feet
      Diagram (you are “here”: CI halfway; goal: correct path):
        CI halfway:  A: bez p1, D | B: p6=1 | C: jmp E | E: p3=p1+1 | F: p4=p2+1 | G: p5=ld(p4)
        Goal:        A: bez p1, D | B: p6=1 | C: jmp E | E: p3=p1+1 | F: p4=p6+1 | G: p5=ld(p4)
      Recall: ooo renaming
      • Correctness: repair F’s r2 input p2→p6
      • Performance: without also walking E and G
      Tag rewriting: ooo renaming by register, not by insn
      • Identify which registers have changed (r2: p2→p6)
      • Do a fast “search-replace” on CI insns
      • 1 step (“search-replace” p2→p6), not 3 (re-rename E, F, G)
      • How to actually do both of these things

  11. Tag Rewriting 1: Tracking Register Changes
      Diagram (start: wrong path; you are “here”: CI halfway), map tables over r1, r2, r3:
        Active map table (correct path):  p1, p6, p3
        CI checkpoint (wrong path):       p1, p2, p3
        Written-register bitvectors:      0 1 0 (wrong path) or 0 1 0 (correct path) = 0 1 0
      Active map table: correct-path mappings at E (CI start)
      Need: checkpoint for wrong-path mappings at E
      • Bitvectors identify which registers must be rewritten
      • From→to = wrong-path→correct-path
      • How to get wrong-path checkpoint (“CI checkpoint”)
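
One way to picture the register-change tracking on this slide is a comparison of the CI checkpoint against the active map table, gated by the written-register bitvectors. This is a sketch, assuming one bit per architectural register and a bitwise OR of the two paths' bitvectors; `rewrite_pairs` and `ARCH_REGS` are hypothetical names.

```python
from typing import Dict, List

ARCH_REGS = ["r1", "r2", "r3"]


def rewrite_pairs(ci_checkpoint: Dict[str, str],
                  active_map: Dict[str, str],
                  wrong_written: List[int],
                  correct_written: List[int]) -> Dict[str, str]:
    """Return {old physical tag: new physical tag} for every architectural
    register written on either path whose mapping actually differs."""
    pairs = {}
    for i, reg in enumerate(ARCH_REGS):
        if wrong_written[i] or correct_written[i]:
            old, new = ci_checkpoint[reg], active_map[reg]
            if old != new:
                pairs[old] = new       # from = wrong path, to = correct path
    return pairs


ci_checkpoint = {"r1": "p1", "r2": "p2", "r3": "p3"}   # wrong-path mappings at E
active_map = {"r1": "p1", "r2": "p6", "r3": "p3"}      # correct-path mappings at E
pairs = rewrite_pairs(ci_checkpoint, active_map, [0, 1, 0], [0, 1, 0])
print(pairs)  # {'p2': 'p6'}
```

The result has one entry per changed CD output register, which matches the slide's claim that the work is proportional to registers rather than to CI instructions.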

  12. Tag Rewriting 0: Setup
      Diagram (start: wrong path): A: bez p1, D | D: p2=2 | E: p3=p1+1 | F: p4=p2+1 | G: p5=ld(p4); CI checkpoint over r1, r2, r3 = p1, p2, p3; written bitvector 0 1 0
      How do we know to create the CI checkpoint?
      • Predict that branch A is low-confidence [Jacobson+ MICRO’06]
      • Start tracking written registers
      How do we know where to create it?
      • Predict A’s convergence PC: E [Cher+ MICRO’01, Collins+ MICRO’04]
      • Take CI checkpoint before convergence PC is renamed

  13. Tag Rewriting 2: Actual Tag Rewriting
      Diagram (you are “here”: CI halfway): A: bez p1, D | B: p6=1 | C: jmp E | E: p3=p1+1 | F: p4=p2+1 | G: p5=ld(p4); the issue queue entry for F: p4=p2+1 and the younger map table checkpoints (r1, r2, r3) still hold p2 for r2
      Tags must be rewritten in two places
      • In younger issue queue entries
      • In younger map table checkpoints: to rename future insns correctly
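
The two rewrite targets named on this slide can be sketched as a search-replace over younger issue queue entries and younger map table checkpoints. This is a software model only, assuming each entry carries an age and a list of source tags; `apply_tag_rewrites` and the data layout are hypothetical.

```python
from typing import Dict, List


def apply_tag_rewrites(issue_queue: List[dict],
                       checkpoints: List[dict],
                       rewrites: Dict[str, str],
                       ci_start_age: int) -> List[str]:
    """Replace old tags with new tags in entries at or after the CI start.
    Returns the names of issue queue entries whose source tags changed."""
    touched = []
    for entry in issue_queue:
        if entry["age"] < ci_start_age:
            continue                                  # older than the CI region
        new_tags = [rewrites.get(t, t) for t in entry["src_tags"]]
        if new_tags != entry["src_tags"]:
            entry["src_tags"] = new_tags
            touched.append(entry["name"])
    for cp in checkpoints:
        if cp["age"] < ci_start_age:
            continue
        for reg, tag in list(cp["map"].items()):
            if tag in rewrites:                       # still the stale mapping
                cp["map"][reg] = rewrites[tag]
    return touched


issue_queue = [
    {"name": "E", "age": 3, "src_tags": ["p1"]},
    {"name": "F", "age": 4, "src_tags": ["p2"]},
    {"name": "G", "age": 5, "src_tags": ["p4"]},
]
checkpoints = [
    {"age": 0, "map": {"r1": "p1", "r2": "p0"}},                           # before A
    {"age": 5, "map": {"r1": "p1", "r2": "p2", "r3": "p3", "r4": "p4"}},   # after F
]
touched = apply_tag_rewrites(issue_queue, checkpoints, {"p2": "p6"}, ci_start_age=3)
print(touched)  # ['F']
```

Matching on the old tag, rather than on the architectural register, leaves a checkpoint untouched when a later CI instruction has already overwritten that register.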

  14. Basic Tag Rewriting Approach
      Observe: tag rewriting hardware (mostly) exists
      • But used for different purposes: rename, dispatch, wakeup
      Exploit: borrow existing hardware
      • Stop the pipeline for a few cycles
      • Walk changed registers and rewrite tags
      • Restart the pipeline with correct dependences linked

  15. Tag Rewriting Hardware
      Diagram: issue queue entries (ready bit, ptag, ptag, ready bit, age) with equality comparators on the wakeup tags, an age comparator, and dispatch tags/ready bits as the write port
      Issue queue
      • Existing: wakeup match = “search”, dispatch write = “replace”
      • Some additional logic may be necessary (age tags)
      Map table checkpoints
      • Some additional hardware here (but not associative search)
      • See paper

  16. CIDD Re-Dispatch
      Diagram: ROB, map table, issue queue, regfile, exec, ready bits; a second issue queue is marked with a question mark
      So far: tag rewriting for insns in issue queue
      • ROB-size issue queue? Segmented/pipelined? [Hrishikesh+, ISCA’02]
      • No, slows down common-case wakeup/select
      Now: conventional issue queue, issued insns leave as usual
      • CIDD insns re-dispatch from someplace
      • That place itself must support tag rewriting

  17. CIDD Re-Dispatch
      Diagram: ROB, map table, issue queue, regfile, exec, ready bits, plus a re-dispatch queue
      Ginger: a ROB-sized re-dispatch queue
      • Internal wakeup/select re-dispatch loop
        • Separate from issue wakeup/select
      • Supports tag rewriting to identify initial re-dispatch wave
      • Transitively identifies minimal dependent slice for re-dispatch
      • Segmented/pipelined and “half-bandwidth” → slow
      • Only 2% of insns re-dispatch → slow is fine
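
The transitive selection performed by the re-dispatch loop can be approximated in software as a worklist over destination tags. This sketch ignores timing, segmentation, and bandwidth; `redispatch_slice` is a hypothetical helper, and instruction H is an invented consumer of E added only to show an instruction that stays out of the slice.

```python
from collections import deque
from typing import Dict, List, Tuple


def redispatch_slice(queue: List[Tuple[str, str, List[str]]],
                     initial_wave: List[str]) -> List[str]:
    """queue holds (name, dest tag, source tags) in program order.
    Starting from the insns whose tags were rewritten, repeatedly wake up
    consumers of each selected insn's destination tag."""
    consumers: Dict[str, List[str]] = {}
    dest_of: Dict[str, str] = {}
    for name, dest, srcs in queue:
        dest_of[name] = dest
        for tag in srcs:
            consumers.setdefault(tag, []).append(name)
    selected = set(initial_wave)
    work = deque(initial_wave)
    while work:
        producer = work.popleft()
        for consumer in consumers.get(dest_of[producer], []):
            if consumer not in selected:
                selected.add(consumer)
                work.append(consumer)
    return [name for name, _, _ in queue if name in selected]


rdq = [
    ("E", "p3", ["p1"]),
    ("F", "p4", ["p6"]),     # source tag already rewritten p2 -> p6
    ("G", "p5", ["p4"]),
    ("H", "p7", ["p3"]),     # hypothetical insn that depends only on E
]
slice_names = redispatch_slice(rdq, ["F"])
print(slice_names)  # ['F', 'G']
```

Only F and its transitive consumer G are selected; E and H never re-dispatch.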

  18. CIDD Loads
      Example:
        A: bez r1, D
        B: r2=1
        C: jmp E
        D: st(r1)=2
        E: r3=r1+1
        F: r4=r2+1
        G: r5=ld(r1)
      CIDD loads: depend (via memory) on CD stores
      • How are these identified when CD stores are inserted/removed?
      SQIP (store queue index prediction) [Sha+ MICRO’05]
      • Solution for large LSQ
      • Makes store-load forwarding act like register communication
      • Supports “store tag rewriting”

  19. SQIP and Store Tag Rewriting
      Diagram: A: bez p1, D | D: st(p1)=1, @6 | E: p3=p1+1 | F: p4=p2+1 | G: p5=ld(p1), 6; forwarding predictor entry G → D; store map table entry D → 6
      15 second introduction to SQIP
      • Store map table: store-PC → SQ index
      • Forwarding predictor: load-PC → store-PC
      • Load G → store D → SQ index 6
        • Load G’s second register tag is 6
        • Load G indexes SQ at position 6
      Store tag rewriting
      • Checkpoint and walk store map table
      • Search-and-replace old-SQ-index → new-SQ-index
      • Re-dispatch load if SQ-index tag has changed
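
The store tag rewriting steps above can be sketched with two dictionaries standing in for the forwarding predictor and the store map table. This is a simplified model, assuming one SQ index per store PC and using `None` for a load with no predicted forwarding store; the function names are hypothetical.

```python
from typing import Dict, List, Optional


def sq_tag_for_load(load_pc: str,
                    fwd_pred: Dict[str, str],
                    store_map: Dict[str, int]) -> Optional[int]:
    """Chain load-PC -> store-PC -> SQ index, as in the slide's G -> D -> 6."""
    store_pc = fwd_pred.get(load_pc)
    if store_pc is None:
        return None
    return store_map.get(store_pc)


def rewrite_store_tags(loads: List[dict],
                       old_store_map: Dict[str, int],
                       new_store_map: Dict[str, int]) -> List[str]:
    """Diff the checkpointed store map table against the repaired one,
    search-and-replace SQ-index tags in loads, and report loads to re-dispatch."""
    rewrites: Dict[int, Optional[int]] = {}
    for store_pc, old_idx in old_store_map.items():
        new_idx = new_store_map.get(store_pc)     # None if the store was removed
        if new_idx != old_idx:
            rewrites[old_idx] = new_idx
    redispatch = []
    for load in loads:
        if load["sq_tag"] in rewrites:
            load["sq_tag"] = rewrites[load["sq_tag"]]
            redispatch.append(load["name"])
    return redispatch


fwd_pred = {"G": "D"}
wrong_path_map = {"D": 6}                  # wrong path executed store D at SQ index 6
loads = [{"name": "G", "sq_tag": sq_tag_for_load("G", fwd_pred, wrong_path_map)}]
to_redispatch = rewrite_store_tags(loads, wrong_path_map, {})   # correct path has no D
print(to_redispatch, loads[0]["sq_tag"])  # ['G'] None
```

In the running example the correct path removes store D, so load G loses its SQ-index tag and must re-dispatch.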

  20. Inserting CD Instructions
      Diagram: A: bez p1, D | D: p2=2 | E: p3=p1+1 | F: p4=p2+1 | G: p5=ld(p4); convergence distance: here 2 insns
      Ginger uses proactive resource management (a la Skipper)
      • Not the same as proactive ooo renaming
      • Predict convergence distance
      • Reserve ROB, LSQ, and physical registers for them
      • Simplifies CD insn insertion
      • Simplifies commit and recovery, avoids resource deadlocks
      • Keeps CI stores in SQ positions: minimizes store tag rewriting
      • Reduces window utilization, but still better than non-CI

  21. Outline
      • Control Independence (CI) and out-of-order renaming
      • Prior CI microarchitectures (ooo renaming schemes)
      • Ginger
      • Acronym pop quiz
      • Comparative performance evaluation
      • Conclusion

  22. Experimental Methodology
      Goal: compare ooo renaming schemes
      • Re-implemented “Walker”, Skipper
      • All things equal other than ooo renaming
      • Paper also has selective branch recovery (SBR) [Gandhi+ HPCA’04]
      Simulated configuration
      • 4-way fetch/issue/commit, 21-stage pipe, 512-entry ROB, 64-entry issue queue
      • 32KB hybrid gShare, 8KB confidence predictor
      • 2-way, 8-stage re-dispatch, 16 checkpoints
      • Statically computed convergence PCs and distances
      • CI for branches with confidence <95%, convergence distance <256
      Benchmarks: SPECint2000, MediaBench, CommBench
      • Gmeans over entire suite

  23. Before We Start: Ideal CI
      Ideal CI: instantaneous, zero bandwidth ooo renaming
      • Not a CI limit study in any other sense
      • 95% confidence, 256 convergence distance limits apply
      Mis-predictions CI’ed: 55%
      Speedups: 8% SPECint, 14% Comm, 16% Media
      • Perfect branch prediction provides higher speedups

  24. Comparative Performance: Ginger
      Mis-predictions CI’ed: 53%
      Speedups: 5% SPECint, 11% Comm, 12% Media
      • Ooo renaming overhead of tag rewriting is low: ~3%

  25. Comparative Performance: Walker
      Mis-predictions CI’ed: 56%
      • Exploits more CI opportunities: 1 checkpoint per CI, not 2
      Speedups: 1% SPECint, 7% Comm, 5% Media
      • High rename/dispatch bandwidth overhead

  26. Comparative Performance: Skipper
      Mis-predictions CI’ed: 29%
      • Penalty on correct prediction → possible slowdowns
      • Limits benefit to very low confidence branches (<80%)
      • In turn, limits CI opportunities
      Speedups: -1% SPECint, 8% Comm, 9% Media

  27. More Insight: Dispatch Bandwidth
      Chart: dispatch bandwidth breakdown for vpr (SPECint)
      Dispatch bandwidth: limits commit bandwidth
      • Overhead: slot spent on anything other than a committing insn
      Non-CI processor overheads
      • Squashed insns/fetch refill stalls: big components
      • Full window stalls: smaller, partially due to mis-predictions

  28. More Insight: Dispatch Bandwidth
      Chart: dispatch bandwidth breakdown for vpr (SPECint)
      Effect of ideal CI
      • Reduces squashed insns: CI insns
      • Reduces fetch refill stalls: don’t squash front-end insns, dispatch
      • Increases full window stalls: space reservation, higher utilization
      • Some low overhead for CIDD re-dispatch: ~2%

  29. More Insight: Dispatch Bandwidth
      Chart: dispatch bandwidth breakdown for vpr (SPECint)
      Effect of realistic CI
      • Some additional ooo renaming overhead: tag rewrites, pmoves
      • Additional inefficiencies and limitations

  30. More Insight: Dispatch Bandwidth
      Chart: dispatch bandwidth breakdown for vpr (SPECint), with the tag rewriting component labeled
      Ginger
      • Low ooo renaming overhead: few other inefficiencies

  31. More Insight: Dispatch Bandwidth
      Chart: dispatch bandwidth breakdown for vpr (SPECint)
      • Walker: high ooo renaming bandwidth overhead
      • Skipper: very high ooo renaming bandwidth overhead
        • Restricted to very low confidence branches

  32. Conclusions
      Control independence (CI)
      • Complements improvements in predictor accuracy
      • Ooo renaming: most important feature, should be:
        • Low-overhead on mis-prediction
        • No overhead on correct prediction (“reactive”)
      Ginger: new reactive CI microarchitecture
      • Out-performs previous schemes: “Walker”, Skipper
      • Tag rewriting: new ooo renaming scheme
        • Uses (largely) existing hardware
        • Supports ooo memory renaming too
      • New re-dispatch mechanism: potentially useful beyond CI

  33. Selective Branch Recovery [Gandhi+, HPCA’04]
      Diagram:
        A: beqz p1, D
        D: p2 = 2 becomes D: p2 = p9     transform to “pmove”, re-dispatch
        E: p3 = p1+1
        F: p4 = p2+1                     re-dispatch
        G: p5 = ld(p4)                   re-dispatch
      Ooo renaming: annul wrong-path CD instructions
      • Transform wrong-path CD insns to pmoves (in place)
      • Re-dispatch them and CIDD insns (from recovery buffer)
      • Limited applicability: can remove CD instructions, but not insert
      • Exact convergence: works for “if-then”, not “if-then-else”

  34. Comparative Performance: SBR
      Mis-predictions CI’ed: 26%
      • Inability to insert CD insns limits CI opportunities
      Speedups: 0% SPECint, 5% Comm, 3% Media
      • CD to pmove transform adds latency → possible slowdowns
