Ginger: Control Independence Using Tag Rewriting

Andrew Hilton, Amir Roth

University of Pennsylvania

{adhilton, [email protected]

ISCA-34 :: June, 2007


Control Independence (CI)

Running example:

    A: bez r1, D
    B: r2=1
    C: jmp E
    D: r2=2
    E: r3=r1+1
    F: r4=r2+1
    G: r5=ld(r4)

Control dependent (CD) insns: B, C, D

Control independent (CI) insns: E, F, G

Branch mispredictions limit single-thread performance

  • Improve prediction accuracy? Hard

  • Predicate? Cost on correct predictions

  • Exploit control independence (CI) to reduce squash penalty

This paper: Ginger, a new (better) CI microarchitecture

Acronyms to remember: CI, CD


Exploiting Control Independence

[Figure: the running example (A through G) after a mis-prediction of branch A, shown for conventional recovery and for CI recovery.]

Conventional recovery

  • Squash all post mis-prediction insns

  • Fetch/execute all correct-path insns

  • Re-fetch/re-execute CI insns (waste)

CI recovery

  • Squash only wrong-path CD insns

  • Fetch/execute only correct-path CD insns

  • Preserve CI insns: E, F, G

  • Preserve un-dispatched CI insns: H, I…

How to “insert” CD insns?

What to do about CI insns that depend on CD insns?
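The two recovery policies can be compared on the running example with a small sketch. The list-of-labels model and the function names are assumptions made for illustration; they are not taken from the talk.

```python
# Hypothetical model: each path is a list of instruction labels in fetch order.
WRONG_PATH = ["A", "D", "E", "F", "G"]
CORRECT_PATH = ["A", "B", "C", "E", "F", "G"]
CI_INSNS = {"E", "F", "G"}  # control independent insns, per the slide


def conventional_recovery(wrong, correct, branch):
    """Squash everything after the branch, then fetch the whole correct path."""
    squashed = wrong[wrong.index(branch) + 1:]
    fetched = correct[correct.index(branch) + 1:]
    return squashed, fetched


def ci_recovery(wrong, correct, branch, ci):
    """Squash only wrong-path CD insns and fetch only correct-path CD insns."""
    squashed = [i for i in wrong[wrong.index(branch) + 1:] if i not in ci]
    fetched = [i for i in correct[correct.index(branch) + 1:] if i not in ci]
    return squashed, fetched


conv = conventional_recovery(WRONG_PATH, CORRECT_PATH, "A")
ci = ci_recovery(WRONG_PATH, CORRECT_PATH, "A", CI_INSNS)
print(conv)  # (['D', 'E', 'F', 'G'], ['B', 'C', 'E', 'F', 'G'])
print(ci)    # (['D'], ['B', 'C'])
```

In this toy model, CI recovery touches three insns instead of nine; the preserved insns E, F, G are exactly the ones whose inputs may still need repair.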


Out-of-Order Renaming

    Start: wrong path    CI halfway          Goal: correct path
    A: bez p1, D         A: bez p1, D        A: bez p1, D
    D: p2=2              B: p6=1             B: p6=1
                         C: jmp E            C: jmp E
    E: p3=p1+1           E: p3=p1+1          E: p3=p1+1
    F: p4=p2+1           F: p4=p2+1          F: p4=p6+1
    G: p5=ld(p4)         G: p5=ld(p4)        G: p5=ld(p4)

CI Step 1: replace CD insns

CI Step 2: out-of-order renaming

  • Step 1 changes inputs for some CI insns

  • CI data dependent (CIDD) insns: F and G (transitively, via F)

  • Must identify CIDD insns and repair their inputs

  • Must re-issue CIDD insns that have already issued

  • Key feature of CI; its implementation is what distinguishes CI schemes

remember CIDD acronym too
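As a rough illustration of what "CIDD" means, the sketch below marks every CI insn that reads a changed physical register, directly or through another marked insn. The tuple layout and the name `find_cidd` are assumptions for this example only; it defines the set, not how any hardware scheme finds it.

```python
def find_cidd(ci_insns, changed_tags):
    """Return labels of CI insns whose inputs changed, directly or transitively.

    ci_insns: (label, dest_tag, source_tags) tuples in program order.
    changed_tags: physical registers whose producer was replaced in step 1.
    """
    tainted = set(changed_tags)
    cidd = []
    for label, dest, srcs in ci_insns:
        if any(s in tainted for s in srcs):
            cidd.append(label)
            tainted.add(dest)  # consumers of this result must re-execute too
    return cidd


CI = [("E", "p3", ["p1"]), ("F", "p4", ["p2"]), ("G", "p5", ["p4"])]
cidd = find_cidd(CI, {"p2"})
print(cidd)  # ['F', 'G']
```

E reads only p1 and is left alone, which is the CI data independent case.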


Outline

Control Independence (CI) and out-of-order renaming

Prior CI microarchitectures (ooo renaming schemes)

  • “Walker”

  • Skipper

Ginger

Comparative performance evaluation

Conclusion


“Walker” [Rotenberg+, HPCA’99]

    A: bez p1, D
    B: p6=1
    C: jmp E
    E: p3=p1+1
    F: p4=p2+1 → p4=p6+1    (input changed → re-dispatch)
    G: p5=ld(p4)            (input transitively changed → re-dispatch)

Ooo renaming: walk all CI insns

  • Re-rename, re-dispatch if inputs (transitively) changed

  • Reactive: no penalty on correct prediction (no worse than base)

  • High overhead on mis-prediction

    • Walks and re-renames CI data independent (CIDI) insns too: E

    • Typically many more of those than CIDD

    • Still better than baseline
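A minimal sketch of the walking approach, assuming a dictionary-based map table and instruction records invented for this example (this is not Rotenberg et al.'s implementation). Every CI insn is visited, including E, which is where the overhead comes from.

```python
def walker_recover(ci_insns, map_table):
    """Walk every CI insn in program order and re-rename it.

    map_table: logical -> physical mappings at the convergence point,
    after the correct-path CD insns have been renamed.
    Returns (number of insns walked, labels that must re-dispatch).
    """
    table = dict(map_table)
    stale = set()  # physical registers whose values will be recomputed
    walked, redispatched = 0, []
    for insn in ci_insns:
        walked += 1
        new_tags = [table[r] for r in insn["src_regs"]]
        if new_tags != insn["src_tags"] or any(t in stale for t in new_tags):
            insn["src_tags"] = new_tags
            stale.add(insn["dst_tag"])
            redispatched.append(insn["label"])
        table[insn["dst_reg"]] = insn["dst_tag"]
    return walked, redispatched


ci = [
    {"label": "E", "dst_reg": "r3", "dst_tag": "p3", "src_regs": ["r1"], "src_tags": ["p1"]},
    {"label": "F", "dst_reg": "r4", "dst_tag": "p4", "src_regs": ["r2"], "src_tags": ["p2"]},
    {"label": "G", "dst_reg": "r5", "dst_tag": "p5", "src_regs": ["r4"], "src_tags": ["p4"]},
]
result = walker_recover(ci, {"r1": "p1", "r2": "p6"})
print(result)  # (3, ['F', 'G'])
```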


Skipper [Cher+, MICRO’01]

    A: bez p1, D
    B: p6=1
    C: jmp E
    P: p9=?? → p9=p6    (“pmove”; p9 is pre-synchronized)
    E: p3=p1+1
    F: p4=p9+1
    G: p5=ld(p4)

Ooo renaming: proactive CI + pre-synchronization

  • Defer CD fetch until branch resolves (reserve space)

  • Pre-synchronize: predict CD output registers (r2) and pre-allocate

  • After correct-path CD, dispatch/execute “pmoves”

  • Low ooo renaming overhead on mis-prediction

    • Proportional to CD region register output set

  • Same overhead even on correct prediction
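Pre-synchronization can be sketched as two steps: pre-allocate a physical register for each predicted CD output before renaming CI insns, then emit one pmove per pre-allocated register once the CD region has been renamed. The function names, the free-list representation, and the starting mapping r2→p0 are assumptions for illustration.

```python
def presynchronize(map_table, predicted_cd_outputs, free_tags):
    """Pre-allocate a physical register for every predicted CD output."""
    sync = {}
    for reg in predicted_cd_outputs:
        tag = free_tags.pop(0)
        sync[reg] = tag
        map_table[reg] = tag  # CI insns rename against the pre-allocated tag
    return sync


def make_pmoves(sync, map_after_cd):
    """After the CD region is renamed, copy its outputs into the pre-allocated tags."""
    return [(dst_tag, map_after_cd[reg]) for reg, dst_tag in sync.items()]


map_at_branch = {"r1": "p1", "r2": "p0"}  # p0 is a hypothetical older mapping
sync = presynchronize(map_at_branch, ["r2"], ["p9"])
f_source = map_at_branch["r2"]  # F renames its r2 input to p9
pmoves = make_pmoves(sync, {"r1": "p1", "r2": "p6"})
print(f_source, pmoves)  # p9 [('p9', 'p6')]
```

The pmove count depends only on the predicted output set, and the pmoves are issued whether or not the branch was mis-predicted.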



OOO Renaming: “Walker” + Skipper → Ginger

“Walker”: walk CI insns

  • Reactive: no overhead on correct predictions

  • High overhead on mis-predictions: proportional to CI insns

Skipper: pre-synchronize

  • Low overhead on mis-predictions: proportional to CD registers

  • Proactive: same overhead on correct predictions

Ginger: tag rewriting

  • Low overhead on mis-predictions: proportional to CD registers

  • Reactive: no overhead on correct predictions

    • Proactive also possible, but not really worth it

  • Uses (mostly) existing hardware

  • Supports ooo renaming of loads


Outline

Control Independence (CI) and out-of-order renaming

Prior CI microarchitectures (ooo renaming schemes)

Ginger

  • Tag rewriting

  • Selective re-dispatch

  • Out-of-order renaming for loads

  • Inserting CD insns

Comparative performance evaluation

Conclusion


Tag Rewriting at 32K Feet

    CI halfway          Goal: correct path
    A: bez p1, D        A: bez p1, D
    B: p6=1             B: p6=1
    C: jmp E            C: jmp E
    E: p3=p1+1          E: p3=p1+1
    F: p4=p2+1          F: p4=p6+1
    G: p5=ld(p4)        G: p5=ld(p4)

Recall: ooo renaming

  • Correctness: repair F’s r2 input p2→p6

  • Performance: without walking E and G also

Tag rewriting: ooo renaming by register, not by insn

  • Identify which registers have changed (r2: p2→p6)

  • Do a fast “search-replace” on CI insns

  • 1 step (“search-replace” p2→p6), not 3 (re-rename E, F, G)

  • How to actually do both of these things
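The “search-replace” idea can be sketched in software as one pass per changed register. In hardware the match would be associative, so the loop below only models the effect; the entry layout and function name are assumptions.

```python
def tag_rewrite(issue_queue, old_tag, new_tag):
    """Replace old_tag with new_tag in every source field that matches."""
    rewritten = []
    for entry in issue_queue:
        if old_tag in entry["srcs"]:
            entry["srcs"] = [new_tag if t == old_tag else t for t in entry["srcs"]]
            rewritten.append(entry["label"])
    return rewritten


iq = [
    {"label": "E", "srcs": ["p1"]},
    {"label": "F", "srcs": ["p2"]},
    {"label": "G", "srcs": ["p4"]},
]
hit = tag_rewrite(iq, "p2", "p6")
print(hit, iq[1]["srcs"])  # ['F'] ['p6']
```

One rewrite operation (p2→p6) repairs F without renaming E or G again.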


Tag Rewriting 1: Tracking Register Changes

    Start: wrong path    CI halfway
    A: bez p1, D         A: bez p1, D
    D: p2=2              B: p6=1
                         C: jmp E
    E: p3=p1+1           E: p3=p1+1
    F: p4=p2+1           F: p4=p2+1
    G: p5=ld(p4)         G: p5=ld(p4)

                                   r1   r2   r3
    Active map table (at E):       p1   p6   p3
    CI checkpoint (at E):          p1   p2   p3
    Written-register bitvector:    0    1    0

Active map table: correct-path mappings at E (CI start)

Need: checkpoint for wrong-path mappings at E

  • Bitvectors identify which registers must be rewritten

  • From→to = wrong-path→correct-path

  • How to get wrong-path checkpoint (“CI checkpoint”)
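A sketch of how the from→to pairs could be derived from the two map tables. Representing the bitvectors as Python sets and OR-ing the wrong-path and correct-path written sets is an assumption drawn from the figure; the function name is hypothetical.

```python
def rewrite_pairs(ci_checkpoint, active_map, written_wrong, written_correct):
    """Return {from_tag: to_tag} for registers written on either path."""
    changed = written_wrong | written_correct  # OR of the two bitvectors
    return {
        ci_checkpoint[reg]: active_map[reg]
        for reg in sorted(changed)
        if ci_checkpoint[reg] != active_map[reg]
    }


ci_checkpoint = {"r1": "p1", "r2": "p2", "r3": "p3"}  # wrong-path mappings at E
active_map = {"r1": "p1", "r2": "p6", "r3": "p3"}     # correct-path mappings at E
pairs = rewrite_pairs(ci_checkpoint, active_map, {"r2"}, {"r2"})
print(pairs)  # {'p2': 'p6'}
```

Only registers flagged in a bitvector are compared, so the work is proportional to the CD register output set rather than to the number of CI insns.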


Tag Rewriting 0: Setup

    Start: wrong path
    A: bez p1, D
    D: p2=2
    E: p3=p1+1
    F: p4=p2+1
    G: p5=ld(p4)

                                   r1   r2   r3
    CI checkpoint:                 p1   p2   p3
    Written-register bitvector:    0    1    0

How do we know to create the CI checkpoint?

  • Predict that branch A is low-confidence [Jacobsen+ MICRO’96]

  • Start tracking written registers

How do we know where to create it?

  • Predict A’s convergence PC: E [Cher+ MICRO’01, Collins+ MICRO’04]

  • Take CI checkpoint before convergence PC is renamed
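The setup steps might look like the following at the rename stage. The `Renamer` class, its method names, and the starting mapping r2→p0 are hypothetical; the confidence and convergence predictors are assumed to exist elsewhere and are not modeled.

```python
class Renamer:
    """Toy rename stage that tracks written registers after a low-confidence branch."""

    def __init__(self, map_table):
        self.map = dict(map_table)
        self.tracking = False
        self.written = set()
        self.convergence_pc = None
        self.ci_checkpoint = None

    def start_tracking(self, convergence_pc):
        """Called when a branch is predicted low-confidence."""
        self.tracking = True
        self.written = set()
        self.convergence_pc = convergence_pc

    def rename(self, pc, dst_reg, new_tag):
        if self.tracking and pc == self.convergence_pc:
            # Take the CI checkpoint before the convergence PC is renamed.
            self.ci_checkpoint = dict(self.map)
            self.tracking = False
        if self.tracking:
            self.written.add(dst_reg)
        self.map[dst_reg] = new_tag


rn = Renamer({"r1": "p1", "r2": "p0"})  # p0: hypothetical pre-branch mapping
rn.start_tracking(convergence_pc="E")   # branch A predicted low-confidence
rn.rename("D", "r2", "p2")              # wrong-path CD insn
rn.rename("E", "r3", "p3")              # first CI insn
print(rn.ci_checkpoint, rn.written)     # {'r1': 'p1', 'r2': 'p2'} {'r2'}
```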



Tag Rewriting 2: Actual Tag Rewriting

[Figure: CI-halfway code (A, B, C, E, F, G) with map tables over r1, r2, r3: one maps r2 to p6, the others still map r2 to p2.]

Tags must be rewritten in two places

  • In younger issue queue entries

  • In younger map table checkpoints: to rename future insns correctly
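A sketch of rewriting in both places, restricted to state younger than the CI point. The numeric `age` field (larger means younger), the record layouts, and the example pair p0→p6 (a case where only the correct path wrote r2, so the stale tag predates the branch) are all assumptions for illustration.

```python
def rewrite_younger(issue_queue, checkpoints, pairs, ci_age):
    """Apply from->to tag pairs to issue queue entries and checkpoints younger than ci_age."""
    for entry in issue_queue:
        if entry["age"] >= ci_age:
            entry["srcs"] = [pairs.get(t, t) for t in entry["srcs"]]
    for cp in checkpoints:
        if cp["age"] >= ci_age:
            cp["map"] = {reg: pairs.get(tag, tag) for reg, tag in cp["map"].items()}


iq = [
    {"label": "X", "age": 1, "srcs": ["p0"]},  # older than the branch: must keep p0
    {"label": "F", "age": 5, "srcs": ["p0"]},  # CI insn: must read p6 instead
]
cps = [{"age": 6, "map": {"r1": "p1", "r2": "p0", "r4": "p4"}}]
rewrite_younger(iq, cps, {"p0": "p6"}, ci_age=4)
print(iq[0]["srcs"], iq[1]["srcs"], cps[0]["map"]["r2"])  # ['p0'] ['p6'] p6
```

The age check is what keeps the rewrite from disturbing insns that legitimately read the old tag.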


Basic Tag Rewriting Approach

Observe: tag rewriting hardware (mostly) exists

  • But used for different purposes: rename, dispatch, wakeup

Exploit: borrow existing hardware

  • Stop the pipeline for a few cycles

  • Walk changed registers & tag rewrite

  • Restart the pipeline with correct dependences linked


Tag Rewriting Hardware

[Figure: issue queue entries, each with two source ptags, their ready bits (r), and an age field. Tag comparators (=) match against the wakeup tags, an age comparator (>) is attached to the age field, and dispatch tags/ready bits are written into the entries.]

Issue queue

  • Existing: wakeup match = “search”, dispatch write = “replace”

  • Some additional logic may be necessary (age tags)

Map table checkpoints

  • Some additional hardware here (but not associative search)

  • See paper


CIDD Re-Dispatch

[Figure: pipeline block diagram with ROB, map table, issue queue, regfile, exec, and ready bits; a “?” marks the open question of where re-dispatch comes from (issue queue?).]

So far: tag rewriting for insns in issue queue

  • ROB-size issue queue? Segmented/pipelined? [Hrishikesh+, ISCA’02]

  • No, slows down common-case wakeup/select

Now: conventional issue queue, issued insns leave as usual

  • CIDD insns re-dispatch from someplace

  • That place itself must support tag rewriting


CIDD Re-Dispatch

[Figure: the same pipeline block diagram with a re-dispatch queue added.]

Ginger: a ROB-sized re-dispatch queue

  • Internal wakeup/select re-dispatch loop

    • Separate from issue wakeup/select

    • Supports tag rewriting to identify initial re-dispatch wave

    • Transitively identifies minimal dependent slice for re-dispatch

  • Segmented/pipelined and “half-bandwidth” → slow

  • Only 2% of insns re-dispatch → slow is fine
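The internal wakeup/select loop can be approximated with a worklist: the initial wave is the set of insns hit by tag rewriting, and each selected insn broadcasts its destination tag to wake its consumers. This is a functional sketch only; the names are assumptions and the segmented, half-bandwidth timing is not modeled.

```python
from collections import deque


def redispatch_slice(queue, initial_wave):
    """Return the insns to re-dispatch, starting from the tag-rewritten wave."""
    by_label = {e["label"]: e for e in queue}
    pending = deque(initial_wave)
    seen = set(initial_wave)
    selected = []
    while pending:
        label = pending.popleft()
        selected.append(label)
        dst = by_label[label]["dst"]  # broadcast this insn's destination tag
        for e in queue:
            if dst in e["srcs"] and e["label"] not in seen:
                seen.add(e["label"])
                pending.append(e["label"])
    return selected


rdq = [
    {"label": "E", "dst": "p3", "srcs": ["p1"]},
    {"label": "F", "dst": "p4", "srcs": ["p6"]},  # source already rewritten p2 -> p6
    {"label": "G", "dst": "p5", "srcs": ["p4"]},
]
order = redispatch_slice(rdq, ["F"])
print(order)  # ['F', 'G']
```

E never matches a broadcast tag, so it stays out of the slice.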


CIDD Loads

    A: bez r1, D
    B: r2=1
    C: jmp E
    D: st(r1)=2
    E: r3=r1+1
    F: r4=r2+1
    G: r5=ld(r1)

CIDD loads: depend (via memory) on CD stores

  • How are these identified when CD stores inserted/removed?

SQIP (store queue index prediction) [Sha+ MICRO’05]

  • Solution for large LSQ

  • Makes store-load forwarding act like register communication

    • Supports “store tag rewriting”


SQIP and Store Tag Rewriting

    A: bez p1, D
    D: st(p1)=1, @6
    E: p3=p1+1
    F: p4=p2+1
    G: p5=ld(p1), 6

[Figure: forwarding predictor and store map table; load G maps to store D, and store D maps to SQ index 6.]

15 second introduction to SQIP

  • Store map table: store-PC → SQ index

  • Forwarding predictor: load-PC → store-PC

    • Load G → store D → SQ index 6

    • Load G’s second register tag is 6

    • Load G indexes SQ at position 6

Store tag rewriting

  • Checkpoint & walk store map table

  • Search-and-replace old-SQ-index → new-SQ-index

  • Re-dispatch load if SQ-index tag has changed
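Store tag rewriting follows the same pattern as register tag rewriting, with SQ indices in place of physical register tags. In the sketch below the data layout, the function name, and the new index 2 are hypothetical; the source does not say which index load G ends up with.

```python
def store_tag_rewrite(loads, old_store_map, new_store_map):
    """Rewrite load SQ-index tags whose predicted store moved; return loads to re-dispatch."""
    pairs = {
        old_store_map[pc]: new_store_map.get(pc)
        for pc in old_store_map
        if old_store_map[pc] != new_store_map.get(pc)
    }
    redispatch = []
    for ld in loads:
        if ld["sq_tag"] in pairs:
            ld["sq_tag"] = pairs[ld["sq_tag"]]
            redispatch.append(ld["label"])
    return redispatch


loads = [{"label": "G", "sq_tag": 6}]
old_store_map = {"D": 6}  # checkpointed wrong-path store map table
new_store_map = {"D": 2}  # hypothetical correct-path entry for store-PC D
moved = store_tag_rewrite(loads, old_store_map, new_store_map)
print(moved, loads[0]["sq_tag"])  # ['G'] 2
```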


Inserting CD Instructions

[Figure: wrong-path code A: bez p1, D; D: p2=2; then E, F, G shown both un-renamed (r3=r1+1, r4=r2+1, r5=ld(r4)) and renamed (p3=p1+1, p4=p2+1, p5=ld(p4)). A brace marks the convergence distance: here 2 insns.]

Ginger uses proactive resource management (a la Skipper)

  • Not the same as proactive ooo renaming

  • Predict convergence distance

  • Reserve ROB, LSQ, and physical registers for the CD insns

  • Simplifies CD insn insertion

  • Simplifies commit and recovery, avoids resource deadlocks

  • Keeps CI stores in SQ positions: minimizes store tag rewriting

  • Reduces window utilization, but still better than non-CI
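Space reservation can be pictured on a list-based ROB: slots equal to the predicted convergence distance are set aside after the branch, so CD insns can be swapped without moving CI insns. The list model and function names are assumptions; LSQ and physical register reservation would follow the same pattern and are omitted.

```python
def reserve(rob, distance):
    """Reserve `distance` empty ROB slots and return the index of the first one."""
    start = len(rob)
    rob.extend([None] * distance)
    return start


def place_cd(rob, start, distance, cd_insns):
    """Fill the reserved region with CD insns; unused slots stay empty."""
    if len(cd_insns) > distance:
        raise ValueError("CD region exceeds predicted convergence distance")
    rob[start:start + distance] = cd_insns + [None] * (distance - len(cd_insns))


rob = ["A"]
start = reserve(rob, 2)               # predicted convergence distance: 2 insns
place_cd(rob, start, 2, ["D"])        # wrong-path CD insn
rob += ["E", "F", "G"]                # CI insns dispatch behind the reservation
before = list(rob)
place_cd(rob, start, 2, ["B", "C"])   # recovery: swap in correct-path CD insns
print(before)  # ['A', 'D', None, 'E', 'F', 'G']
print(rob)     # ['A', 'B', 'C', 'E', 'F', 'G']
```

The empty slot in `before` is the window utilization cost the last bullet refers to.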


Outline

Control Independence (CI) and out-of-order renaming

Prior CI microarchitectures (ooo renaming schemes)

Ginger

Acronym pop quiz

Comparative performance evaluation

Conclusion


Experimental Methodology

Goal: compare ooo renaming schemes

  • Re-implemented “Walker”, Skipper

  • All things equal other than ooo renaming

  • Paper also has selective branch recovery (SBR) [Gandhi+ HPCA’04]

Simulated configuration

  • 4-way fetch/issue/commit, 21-stage pipe, 512-entry ROB, 64-entry issue queue

  • 32KB hybrid gShare, 8KB confidence predictor

  • 2-way, 8-stage re-dispatch, 16 checkpoints

  • Statically computed convergence PCs & distances

  • CI applied to branches with confidence <95% and convergence distance <256

Benchmarks: SPECint2000, MediaBench, CommBench

  • Geometric means over entire suite


Before We Start: Ideal CI

Ideal CI: instantaneous, zero bandwidth ooo renaming

  • Not a CI limit study in any other sense

  • 95% confidence, 256 convergence distance limits apply

Mis-predictions CI’ed: 55%

Speedups: 8% SPECint, 14% Comm, 16% Media

  • Perfect branch prediction provides higher speedups


Comparative Performance: Ginger

Mis-predictions CI’ed: 53%

Speedups: 5% SPECint, 11% Comm, 12% Media

  • Ooo renaming overhead of tag rewriting is low: ~3%


Comparative Performance: Walker

Mis-predictions CI’ed: 56%

  • Exploits more CI opportunities: 1 checkpoint per CI, not 2

Speedups: 1% SPECint, 7% Comm, 5% Media

  • High rename/dispatch bandwidth overhead


Comparative Performance: Skipper

Mis-predictions CI’ed: 29%

  • Penalty on correct prediction → possible slowdowns

  • Limits benefit to very low confidence branches (<80%)

  • In turn, limits CI opportunities

Speedups: -1% SPECint, 8% Comm, 9% Media
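For a side-by-side view, the speedups quoted on these slides can be compared against ideal CI. The numbers below are copied from the slides; treating the difference in percentage points as the "gap to ideal" is a simplification chosen for this sketch.

```python
# Speedups (%) as quoted on the slides.
IDEAL = {"SPECint": 8, "Comm": 14, "Media": 16}
SCHEMES = {
    "Ginger": {"SPECint": 5, "Comm": 11, "Media": 12},
    "Walker": {"SPECint": 1, "Comm": 7, "Media": 5},
    "Skipper": {"SPECint": -1, "Comm": 8, "Media": 9},
}


def gap_to_ideal(speedups):
    """Percentage points of speedup lost relative to ideal CI, per suite."""
    return {suite: IDEAL[suite] - speedups[suite] for suite in IDEAL}


gaps = {name: gap_to_ideal(s) for name, s in SCHEMES.items()}
for name, gap in gaps.items():
    print(name, gap)
# Ginger {'SPECint': 3, 'Comm': 3, 'Media': 4}
# Walker {'SPECint': 7, 'Comm': 7, 'Media': 11}
# Skipper {'SPECint': 9, 'Comm': 6, 'Media': 7}
```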


More Insight: Dispatch Bandwidth

Dispatch bandwidth: limits commit bandwidth

  • Overhead: slot spent on anything other than committing insn

Non-CI processor overheads

  • Squashed insns/fetch refill stalls: big components

  • Full window stalls: smaller, partially due to mis-predictions

vpr (SPECint)


More Insight: Dispatch Bandwidth

Effect of ideal CI

  • Reduces squashed insns: CI insns

  • Reduces fetch refill stalls: don’t squash front-end insns, dispatch

  • Increases full window stalls: space reservation, higher utilization

  • Some low overhead for CIDD re-dispatch: ~2%

vpr (SPECint)


More Insight: Dispatch Bandwidth

Effect of realistic CI

  • Some additional ooo renaming overhead: tag rewrites, pmoves

  • Additional inefficiencies and limitations

vpr (SPECint)


More Insight: Dispatch Bandwidth

Ginger

  • Low ooo renaming overhead: few other inefficiencies

vpr (SPECint)


More Insight: Dispatch Bandwidth

  • Walker: high ooo renaming bandwidth overhead

  • Skipper: very high ooo renaming bandwidth overhead

    • Restricted to very low confidence branches

vpr (SPECint)


Conclusions

Control independence (CI)

  • Complements improvements in predictor accuracy

  • Ooo renaming: most important feature, should be:

    • Low-overhead on mis-prediction

    • No overhead on correct prediction (“reactive”)

Ginger: new reactive CI microarchitecture

  • Out-performs previous schemes: “Walker”, Skipper

  • Tag rewriting: new ooo renaming scheme

    • Uses (largely) existing hardware

    • Supports ooo memory renaming too

  • New re-dispatch mechanism: potentially useful beyond CI


Selective Branch Recovery [Gandhi+, HPCA’04]

    A: beqz p1, D
    D: p2 = 2 → p2 = p9    (transform to “pmove”, re-dispatch)
    E: p3 = p1+1
    F: p4 = p2+1           (re-dispatch)
    G: p5 = ld(p4)         (re-dispatch)

Ooo renaming: annul wrong-path CD instructions

  • Transform wrong-path CD insns to pmoves (in place)

  • Re-dispatch them and CIDD insns (from recovery buffer)

  • Limited applicability: can remove CD instructions, but not insert

    • Exact convergence: works for “if-then”, not “if-then-else”


Comparative Performance: SBR

Mis-predictions CI’ed: 26%

  • Inability to insert CD insns limits CI opportunities

Speedups: 0% SPECint, 5% Comm, 3% Media

  • CD to pmove transform adds latency → possible slowdowns

