Ginger: Control Independence Using Tag Rewriting

Ginger: Control Independence Using Tag Rewriting. Andrew Hilton, Amir Roth, University of Pennsylvania, {adhilton, amir}@cis.upenn.edu. ISCA-34, June 2007.


Ginger: Control Independence Using Tag Rewriting

Andrew Hilton, Amir Roth

University of Pennsylvania

{adhilton, amir}@cis.upenn.edu

ISCA-34 :: June, 2007

Control Independence (CI)

Example:

  A: bez r1, D
  B: r2=1         } Control dependent (CD) insns
  C: jmp E        }
  D: r2=2         }
  E: r3=r1+1      } Control independent (CI) insns
  F: r4=r2+1      }
  G: r5=ld(r4)    }

Branch mispredictions limit single-thread performance

  • Improve prediction accuracy? Hard
  • Predicate? Cost on correct predictions
  • Exploit control independence (CI) to reduce squash penalty

This paper: Ginger, a new (better) CI microarchitecture

remember acronyms CI, CD

Exploiting Control Independence

[Figure: the example insns A to G shown under conventional recovery and under CI recovery for the mis-predicted branch A: bez r1, D.]

Conventional recovery

  • Squash all post mis-prediction insns
  • Fetch/execute all correct-path insns
  • Re-fetch/re-execute CI insns (waste)


CI recovery

  • Squash only wrong-path CD insns
  • Fetch/execute only correct-path CD insns
  • Preserve CI insns: E, F, G
  • Preserve un-dispatched CI insns: H, I…

How to “Insert” CD insns?

What to do about CI insns that depend on CD insns?

Out-of-Order Renaming

  Start: wrong path    CI halfway (after step 1)    Goal: correct path (after step 2)
  A: bez p1, D         A: bez p1, D                 A: bez p1, D
  D: p2=2              B: p6=1                      B: p6=1
                       C: jmp E                     C: jmp E
  E: p3=p1+1           E: p3=p1+1                   E: p3=p1+1
  F: p4=p2+1           F: p4=p2+1                   F: p4=p6+1
  G: p5=ld(p4)         G: p5=ld(p4)                 G: p5=ld(p4)

CI step 1: replace CD insns

CI step 2: out-of-order renaming

  • Step 1 changes inputs for some CI insns
  • CI data dependent (CIDD) insns: F (directly) and G (transitively, via F)
  • Must identify CIDD insns and repair their inputs
  • Must re-issue CIDD insns that have already issued
  • Key feature of CI; its implementation distinguishes CI schemes


remember CIDD acronym too

Outline

Control Independence (CI) and out-of-order renaming

Prior CI microarchitectures (ooo renaming schemes)

  • “Walker”
  • Skipper

Ginger

Comparative performance evaluation

Conclusion

“Walker” [Rotenberg+, HPCA’99]

  A: bez p1, D
  B: p6=1
  C: jmp E
  E: p3=p1+1
  F: p4=p2+1 → p4=p6+1    input changed → re-dispatch
  G: p5=ld(p4)            input transitively changed → re-dispatch

Ooo renaming: walk all CI insns

  • Re-rename, re-dispatch if inputs (transitively) changed
  • Reactive: no penalty on correct prediction (no worse than base)
  • High overhead on mis-prediction
    • Walks and re-renames CI data independent (CIDI) insns: E
    • Typically many more of those than CIDD
    • Still better than baseline
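
The walk described above can be sketched in a few lines. This is only an illustration under assumed names: the tuple format (name, destination architectural register, destination physical register, source architectural registers) and the function `walk_ci` are hypothetical and not from the talk. It shows why the overhead is proportional to the number of CI insns: every CI insn is visited, even though only the CIDD ones are re-dispatched.

```python
def walk_ci(ci_insns, wrong_map, correct_map):
    """Walk every CI insn in program order, re-renaming as we go.

    ci_insns: list of (name, dest_areg, dest_preg, src_aregs) tuples
              (a hypothetical format chosen for this sketch).
    wrong_map / correct_map: architectural -> physical mappings at the
              start of the CI region on the wrong and correct paths.
    Returns (walked, re_dispatched) lists of insn names.
    """
    wrong, correct = dict(wrong_map), dict(correct_map)
    stale = set()  # physical registers computed from wrong-path inputs
    walked, re_dispatched = [], []
    for name, dest_areg, dest_preg, src_aregs in ci_insns:
        walked.append(name)  # every CI insn costs a rename slot
        old = [wrong[r] for r in src_aregs]
        new = [correct[r] for r in src_aregs]
        if old != new or any(p in stale for p in new):
            re_dispatched.append(name)  # input changed, directly or transitively
            stale.add(dest_preg)
        # CI insns keep their destination registers on both paths.
        wrong[dest_areg] = dest_preg
        correct[dest_areg] = dest_preg
    return walked, re_dispatched


CI = [("E", "r3", "p3", ["r1"]),
      ("F", "r4", "p4", ["r2"]),
      ("G", "r5", "p5", ["r4"])]
walked, redo = walk_ci(CI, {"r1": "p1", "r2": "p2"}, {"r1": "p1", "r2": "p6"})
print(walked, redo)
```

On the running example all three CI insns are walked, but only F and G are re-dispatched; E is the CIDI insn that is walked for nothing.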

Skipper [Cher+, MICRO’01]

  A: bez p1, D
  B: p6=1
  C: jmp E
  P: p9=p6        “pmove” (p9=?? until the CD region is renamed)
  E: p3=p1+1
  F: p4=p9+1      pre-synchronized: reads p9
  G: p5=ld(p4)

Ooo renaming: proactive CI + pre-synchronization

  • Defer CD fetch until branch resolves (reserve space)
  • Pre-synchronize: predict CD output registers (r2) and pre-allocate
  • After correct-path CD, dispatch/execute “pmoves”
  • Low ooo renaming overhead on mis-prediction
    • Proportional to CD region register output set
  • Same overhead even on correct prediction
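
A rough sketch of pre-synchronization and pmoves, under stated assumptions: the function names, the dictionary-based map tables, and the initial mapping r2 → p0 are all hypothetical and chosen only for illustration. The point it shows is that the cost is one pmove per predicted CD output register, whether or not the branch was mis-predicted.

```python
def pre_synchronize(map_table, predicted_cd_outputs, free_pregs):
    """At the branch: give each predicted CD output a fresh physical register.

    CI insns are then renamed against the returned ci_map, so they never
    name a register produced inside the CD region directly.
    """
    pre_branch = dict(map_table)
    ci_map = dict(map_table)
    for areg in predicted_cd_outputs:
        ci_map[areg] = free_pregs.pop(0)
    return pre_branch, ci_map


def make_pmoves(ci_map, cd_exit_map, predicted_cd_outputs):
    """After the CD region is renamed: one (dest, src) pmove per predicted output."""
    return [(ci_map[a], cd_exit_map[a]) for a in predicted_cd_outputs]


# Assumed starting state: r2 maps to p0 before the branch (hypothetical).
pre_branch, ci_map = pre_synchronize({"r1": "p1", "r2": "p0"}, ["r2"], ["p9"])

# CI insn F: r4=r2+1 is renamed against ci_map, so it reads p9.
f_src = ci_map["r2"]

# Correct-path CD insn B: r2=1 was renamed to p6.
cd_exit_map = dict(pre_branch, r2="p6")
pmoves = make_pmoves(ci_map, cd_exit_map, ["r2"])
print(f_src, pmoves)
```

Here the single pmove is p9=p6, matching insn P in the example. If the CD region had not written r2 at all, the same code would emit p9=p0, which is the overhead paid even on a correct prediction.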


OOO Renaming: “Walker” + Skipper → Ginger

“Walker”: walk CI insns

  • Reactive: no overhead on correct predictions
  • High overhead on mis-predictions: proportional to CI insns

Skipper: pre-synchronize

  • Low overhead on mis-predictions: proportional to CD registers
  • Proactive: same overhead on correct predictions

Ginger: tag rewriting

  • Low overhead on mis-predictions: proportional to CD registers
  • Reactive: no overhead on correct predictions
    • Proactive also possible, but not really worth it
  • Uses (mostly) existing hardware
  • Supports ooo renaming of loads

Outline

Control Independence (CI) and out-of-order renaming

Prior CI microarchitectures (ooo renaming schemes)

Ginger

  • Tag rewriting
  • Selective re-dispatch
  • Out-of-order renaming for loads
  • Inserting CD insns

Comparative performance evaluation

Conclusion

Tag Rewriting at 32K Feet

  CI halfway       Goal: correct path
  A: bez p1, D     A: bez p1, D
  B: p6=1          B: p6=1
  C: jmp E         C: jmp E
  E: p3=p1+1       E: p3=p1+1
  F: p4=p2+1       F: p4=p6+1
  G: p5=ld(p4)     G: p5=ld(p4)

Recall: ooo renaming

  • Correctness: repair F’s r2 input p2→p6
  • Performance: without also walking E and G

Tag rewriting: ooo renaming by register, not by insn

  • Identify which registers have changed (r2: p2→p6)
  • Do a fast “search-replace” on CI insns
  • 1 step (“search-replace” p2→p6), not 3 (re-rename E, F, G)
  • How to actually do both of these things?


Tag Rewriting 1: Tracking Register Changes

  Start: wrong path    CI halfway
  A: bez p1, D         A: bez p1, D
  D: p2=2              B: p6=1
                       C: jmp E
  E: p3=p1+1           E: p3=p1+1
  F: p4=p2+1           F: p4=p2+1
  G: p5=ld(p4)         G: p5=ld(p4)

                                                          r1  r2  r3
  Active map table (correct path):                        p1  p6  p3
  CI checkpoint (wrong path):                             p1  p2  p3
  Written bits (wrong-path OR correct-path writes):       0   1   0

Active map table: correct-path mappings at E (CI start)

Need: checkpoint for wrong-path mappings at E

  • Bitvectors identify which registers must be rewritten
  • From→to = wrong-path→correct-path
  • How to get wrong-path checkpoint (“CI checkpoint”)?
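
As a sketch of how the two map tables and the written-register bitvectors combine, the snippet below derives the from→to rewrite pairs. The function name and the use of Python sets to stand in for bitvectors are assumptions made for illustration, not details from the talk.

```python
def changed_registers(ci_checkpoint, active_map, written_wrong, written_correct):
    """Return {areg: (from_ptag, to_ptag)} for registers that must be rewritten.

    ci_checkpoint:  wrong-path mappings at the start of the CI region.
    active_map:     correct-path mappings at the same point.
    written_wrong / written_correct: sets of architectural registers written
        inside the CD region on each path (standing in for bitvectors).
    """
    rewrites = {}
    for areg in written_wrong | written_correct:  # OR of the two bitvectors
        src, dst = ci_checkpoint[areg], active_map[areg]
        if src != dst:
            rewrites[areg] = (src, dst)
    return rewrites


ci_checkpoint = {"r1": "p1", "r2": "p2", "r3": "p3"}
active_map = {"r1": "p1", "r2": "p6", "r3": "p3"}
rewrites = changed_registers(ci_checkpoint, active_map, {"r2"}, {"r2"})
print(rewrites)
```

Only r2 is flagged, giving the single rewrite p2→p6. The work done is proportional to the number of registers written in the CD region, not to the number of CI insns.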


Tag Rewriting 0: Setup

  Start: wrong path
  A: bez p1, D
  D: p2=2
  E: p3=p1+1
  F: p4=p2+1
  G: p5=ld(p4)

                             r1  r2  r3
  Map table:                 p1  p2  p3
  Written-register bits:     0   1   0

How do we know to create the CI checkpoint?

  • Predict that branch A is low-confidence [Jacobsen+ MICRO’96]
  • Start tracking written registers

How do we know where to create it?

  • Predict A’s convergence PC: E [Cher+ MICRO’01, Collins+ MICRO’04]
  • Take CI checkpoint before convergence PC is renamed
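
The setup steps can be sketched as a small rename loop. Everything named here is hypothetical: the (pc, destination register) instruction format, the function name, and the idea that the caller has already decided branch A is low-confidence and knows its predicted convergence PC.

```python
def rename_with_ci_setup(insns, map_table, free_pregs, branch_pc, convergence_pc):
    """In-order rename that prepares for a possible CI recovery.

    insns: list of (pc, dest_areg or None) in fetch order.
    After the low-confidence branch is renamed, track written registers;
    just before the convergence PC is renamed, snapshot the map table.
    Returns (final_map, written, ci_checkpoint).
    """
    table = dict(map_table)
    free = list(free_pregs)
    written = set()
    tracking = False
    ci_checkpoint = None
    for pc, dest in insns:
        if tracking and pc == convergence_pc:
            ci_checkpoint = dict(table)  # mappings at the start of the CI region
            tracking = False
        if dest is not None:
            table[dest] = free.pop(0)
            if tracking:
                written.add(dest)
        if pc == branch_pc:
            tracking = True  # start tracking after the branch itself
    return table, written, ci_checkpoint


wrong_path = [("A", None), ("D", "r2"), ("E", "r3"), ("F", "r4"), ("G", "r5")]
final_map, written, ckpt = rename_with_ci_setup(
    wrong_path, {"r1": "p1"}, ["p2", "p3", "p4", "p5"], "A", "E")
print(written, ckpt)
```

On the wrong path of the running example, only r2 is marked as written and the checkpoint records r2 → p2, which is the "from" side of the later rewrite.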


Tag Rewriting 2: Actual Tag Rewriting

[Figure: CI halfway state (A: bez p1, D; B: p6=1; C: jmp E; E: p3=p1+1; F: p4=p2+1; G: p5=ld(p4)) with map tables over r1, r2, r3 holding p1 p6 p3 and p1 p2 p3.]

Tags must be re-written in two places

  • In younger issue queue entries
  • In younger map table checkpoints: to rename future insns correctly
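
A minimal sketch of the two rewrite targets, assuming dictionary-based issue queue entries and checkpoints with an integer age (larger means younger). These data layouts and the function name are illustrative assumptions; the real mechanism reuses wakeup and dispatch hardware rather than software loops.

```python
def rewrite_tags(issue_queue, checkpoints, rewrites, ci_start_age):
    """Search-and-replace stale source tags in younger entries and checkpoints.

    issue_queue: list of {"name", "age", "srcs"} dicts.
    checkpoints: list of {"age", "map"} dicts (map: areg -> ptag).
    rewrites:    {old_ptag: new_ptag}.
    Only state at least as young as ci_start_age is touched.
    Returns the names of issue queue entries whose sources changed.
    """
    touched = []
    for entry in issue_queue:
        if entry["age"] < ci_start_age:
            continue  # older than the CI region: leave alone
        new_srcs = [rewrites.get(t, t) for t in entry["srcs"]]
        if new_srcs != entry["srcs"]:
            entry["srcs"] = new_srcs
            touched.append(entry["name"])
    for ckpt in checkpoints:
        if ckpt["age"] >= ci_start_age:
            ckpt["map"] = {a: rewrites.get(p, p) for a, p in ckpt["map"].items()}
    return touched


iq = [{"name": "E", "age": 3, "srcs": ["p1"]},
      {"name": "F", "age": 4, "srcs": ["p2"]},
      {"name": "G", "age": 5, "srcs": ["p4"]}]
ckpts = [{"age": 5, "map": {"r1": "p1", "r2": "p2", "r3": "p3", "r4": "p4"}}]
touched = rewrite_tags(iq, ckpts, {"p2": "p6"}, ci_start_age=3)
print(touched, iq[1]["srcs"], ckpts[0]["map"]["r2"])
```

Only F's source tag changes (p2→p6); the younger checkpoint now maps r2 to p6, so insns renamed from it later pick up the correct-path producer.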


Basic Tag Rewriting Approach

Observe: tag rewriting hardware (mostly) exists

  • But used for different purposes: rename, dispatch, wakeup

Exploit: borrow existing hardware

  • Stop the pipeline for a few cycles
  • Walk changed registers & tag rewrite
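  • Restart the pipeline with correct dependences linked

Tag Rewriting Hardware

[Figure: issue queue entries, each holding two source ptags with ready bits (r) and an age field; "=" comparators on the ptags and a ">" comparator on age; inputs are the wakeup tags and the dispatch tags/ready bits.]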

Issue queue

  • Existing: wakeup match = “search”, dispatch write = “replace”
  • Some additional logic may be necessary (age tags)

Map table checkpoints

  • Some additional hardware here (but not associative search)
  • See paper

CIDD Re-Dispatch

[Figure: pipeline blocks (map table, issue queue, regfile, exec, ROB, ready bits), with a second "issue queue?" marked as an open question.]

So far: tag rewriting for insns in issue queue

  • ROB-size issue queue? Segmented/pipelined? [Hrishikesh+, ISCA’02]
  • No, slows down common-case wakeup/select

Now: conventional issue queue, issued insns leave as usual

  • CIDD insns re-dispatch from someplace
  • That place itself must support tag rewriting

CIDD Re-Dispatch

[Figure: the same pipeline blocks (map table, issue queue, regfile, exec, ROB, ready bits) with a re-dispatch queue added.]

Ginger: a ROB-sized re-dispatch queue

  • Internal wakeup/select re-dispatch loop
    • Separate from issue wakeup/select
    • Supports tag rewriting to identify initial re-dispatch wave
    • Transitively identifies minimal dependent slice for re-dispatch
  • Segmented/pipelined and “half-bandwidth” → slow
  • Only 2% of insns re-dispatch → slow is fine
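
The transitive identification of the dependent slice can be approximated in software as a single program-order pass. This is a simplification under assumptions: the tuple layout and function name are hypothetical, and the real queue performs wakeup/select over several cycles rather than one loop.

```python
def redispatch_slice(queue, seed_tags):
    """Find the insns that must re-dispatch after tag rewriting.

    queue:     program-ordered list of (name, dest_ptag, src_ptags) for
               CI insns, with source tags already rewritten.
    seed_tags: ptags that tag rewriting just wrote into source fields
               (the initial re-dispatch wave).
    A selected insn "wakes up" its own destination tag, so consumers of
    its result are selected as well.
    """
    woken = set(seed_tags)
    selected = []
    for name, dest, srcs in queue:
        if any(s in woken for s in srcs):
            selected.append(name)
            woken.add(dest)
    return selected


queue = [("E", "p3", ["p1"]),
         ("F", "p4", ["p6"]),   # source rewritten from p2 to p6
         ("G", "p5", ["p4"])]
slice_ = redispatch_slice(queue, {"p6"})
print(slice_)
```

The result is the minimal slice F, G; E never re-dispatches because none of its inputs is reachable from a rewritten tag.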
CIDD Loads

CIDD loads: depend (via memory) on CD stores

  • How are these identified when CD stores inserted/removed?

SQIP (store queue index prediction) [Sha+ MICRO’05]

  • Solution for large LSQ
  • Makes store-load forwarding act like register communication
    • Supports “store tag rewriting”

Example:

  A: bez r1, D
  B: r2=1
  C: jmp E
  D: st(r1)=2
  E: r3=r1+1
  F: r4=r2+1
  G: r5=ld(r1)    CIDD load: reads the address that CD store D writes

SQIP and Store Tag Rewriting

  A: bez p1, D
  D: st(p1)=1, @6
  E: p3=p1+1
  F: p4=p2+1
  G: p5=ld(p1), 6

  Store map table: D → 6

15-second introduction to SQIP

  • Store map table: store-PC → SQ index
  • Forwarding predictor: load-PC → store-PC
    • Load G → store D → SQ index 6
    • Load G’s second register tag is 6
    • Load G indexes SQ at position 6


Store tag rewriting

  • Checkpoint & walk store map table
  • Search-and-replace old-SQ-index → new-SQ-index
  • Re-dispatch load if SQ-index tag has changed
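
Store tag rewriting mirrors the register case, which the following sketch tries to make concrete. The function names and dictionary layouts are assumptions for illustration, and the scenario (store D turns out to be on the wrong path and is removed, so load G loses its forwarding source) is a hypothetical reading of the example, not something the slides state.

```python
def store_tag_rewrites(checkpoint_map, active_map):
    """Compare store map tables (store-PC -> SQ index) and build
    {old_sq_index: new_sq_index}. None means "no in-flight store"."""
    rewrites = {}
    for store_pc in set(checkpoint_map) | set(active_map):
        old = checkpoint_map.get(store_pc)
        new = active_map.get(store_pc)
        if old is not None and old != new:
            rewrites[old] = new
    return rewrites


def rewrite_loads(load_tags, rewrites):
    """Search-and-replace SQ-index tags on loads; return loads to re-dispatch."""
    redo = []
    for load, tag in list(load_tags.items()):
        if tag in rewrites:
            load_tags[load] = rewrites[tag]
            redo.append(load)
    return redo


# Wrong path: store D sits at SQ index 6 and load G predicts forwarding from it.
checkpoint_map = {"D": 6}
active_map = {}                  # assumed: correct path has no in-flight D
load_tags = {"G": 6}

rewrites = store_tag_rewrites(checkpoint_map, active_map)
redo = rewrite_loads(load_tags, rewrites)
print(rewrites, redo, load_tags)
```

Load G's SQ-index tag changes from 6 to "none", so G is re-dispatched. This sketch does not handle a load that had no tag on the wrong path but gains a forwarding store on the correct path.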

Inserting CD Instructions

[Figure: wrong-path D: p2=2 followed by CI insns E, F, G (r3=r1+1 renamed to p3=p1+1, r4=r2+1 to p4=p2+1, r5=ld(r4) to p5=ld(p4)); a brace marks the convergence distance, here 2 insns.]


Ginger uses proactive resource management (à la Skipper)

  • Not the same as proactive ooo renaming
  • Predict convergence distance
  • Reserve ROB, LSQ, and physical registers for the CD insns
  • Simplifies CD insn insertion
  • Simplifies commit and recovery, avoids resource deadlocks
  • Keeps CI stores in SQ positions: minimizes store tag rewriting
  • Reduces window utilization, but still better than non-CI

Outline

Control Independence (CI) and out-of-order renaming

Prior CI microarchitectures (ooo renaming schemes)

Ginger

Acronym pop quiz

Comparative performance evaluation

Conclusion

Experimental Methodology

Goal: compare ooo renaming schemes

  • Re-implemented “Walker”, Skipper
  • All things equal other than ooo renaming
  • Paper also has selective branch recovery (SBR) [Gandhi+ HPCA’04]

Simulated configuration

  • 4-way fetch/issue/commit, 21-stage pipe, 512 ROB, 64 issue queue
  • 32KB hybrid gShare, 8KB confidence predictor
  • 2-way, 8-stage re-dispatch, 16 checkpoints
  • Statically computed convergence PCs & distances
  • CI applied to branches with confidence <95% and convergence distance <256

Benchmarks: SPECint2000, MediaBench, CommBench

  • Gmeans over entire suite

Before We Start: Ideal CI

Ideal CI: instantaneous, zero bandwidth ooo renaming

  • Not a CI limit study in any other sense
  • 95% confidence, 256 convergence distance limits apply

Mis-predictions CI’ed: 55%

Speedups: 8% SPECint, 14% Comm, 16% Media

  • Perfect branch prediction provides higher speedups

Comparative Performance: Ginger

Mis-predictions CI’ed: 53%

Speedups: 5% SPECint, 11% Comm, 12% Media

  • Ooo renaming overhead of tag rewriting is low: ~3%

Comparative Performance: Walker

Mis-predictions CI’ed: 56%

  • Exploits more CI opportunities: 1 checkpoint per CI, not 2

Speedups: 1% SPECint, 7% Comm, 5% Media

  • High rename/dispatch bandwidth overhead

Comparative Performance: Skipper

Mis-predictions CI’ed: 29%

  • Penalty on correct prediction → possible slowdowns
  • Limits benefit to very low confidence branches (<80%)
  • In turn, limits CI opportunities

Speedups: -1% SPECint, 8% Comm, 9% Media

More Insight: Dispatch Bandwidth

Dispatch bandwidth: limits commit bandwidth

  • Overhead: slot spent on anything other than committing insn

Non-CI processor overheads

  • Squashed insns/fetch refill stalls: big components
  • Full window stalls: smaller, partially due to mis-predictions

[Chart: vpr (SPECint)]

More Insight: Dispatch Bandwidth

Effect of ideal CI

  • Reduces squashed insns: CI insns
  • Reduces fetch refill stalls: don’t squash front-end insns, dispatch
  • Increases full window stalls: space reservation, higher utilization
  • Some low overhead for CIDD re-dispatch: ~2%

[Chart: vpr (SPECint)]

More Insight: Dispatch Bandwidth

Effect of realistic CI

  • Some additional ooo renaming overhead: tag rewrites, pmoves
  • Additional inefficiencies and limitations

[Chart: vpr (SPECint)]

More Insight: Dispatch Bandwidth

Ginger

  • Low ooo renaming overhead: few other inefficiencies

[Chart: vpr (SPECint)]

More Insight: Dispatch Bandwidth
  • Walker: high ooo renaming bandwidth overhead
  • Skipper: very high ooo renaming bandwidth overhead
    • Restricted to very low confidence branches

[Chart: vpr (SPECint)]

Conclusions

Control independence (CI)

  • Complements improvements in predictor accuracy
  • Ooo renaming: most important feature, should be:
    • Low-overhead on mis-prediction
    • No overhead on correct prediction (“reactive”)

Ginger: new reactive CI microarchitecture

  • Out-performs previous schemes: “Walker”, Skipper
  • Tag rewriting: new ooo renaming scheme
    • Uses (largely) existing hardware
    • Supports ooo memory renaming too
  • New re-dispatch mechanism: potentially useful beyond CI

Selective Branch Recovery [Gandhi+, HPCA’04]

  A: beqz p1, D
  D: p2 = 2 → p2 = p9    transform to “pmove”, re-dispatch
  E: p3 = p1+1
  F: p4 = p2+1           re-dispatch
  G: p5 = ld(p4)         re-dispatch

Ooo renaming: annul wrong-path CD instructions

  • Transform wrong-path CD insns to pmoves (in place)
  • Re-dispatch them and CIDD insns (from recovery buffer)
  • Limited applicability: can remove CD instructions, but not insert
    • Exact convergence: works for “if-then”, not “if-then-else”

Comparative Performance: SBR

Mis-predictions CI’ed: 26%

  • Inability to insert CD insns limits CI opportunities

Speedups: 0% SPECint, 5% Comm, 3% Media

  • CD to pmove transform adds latency → possible slowdowns