
Decoupled Store Completion / Silent Deterministic Replay: Enabling Scalable Data Memory for CPR/CFP Processors

Andrew Hilton, Amir Roth

University of Pennsylvania

{adhilton, amir}@cis.upenn.edu

ISCA-36 :: June 23, 2009


Brief Overview

Latency-tolerant processors

Scalable load & store queues

CPR/CFP [Akkary03, Srinivasan04]

SVW/SQIP [Roth05, Sha05]

Scalable load & store queues for

latency-tolerant processors

SA-LQ/HSQ [Akkary03]

SRL [Gandhi05]

ELSQ [Pericas08]

Dynamically scheduled superscalar processors

DKIP, FMC [Pericas06, Pericas07]

Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)

Decoupled Store Completion & Silent Deterministic Replay


Outline

Background

CPR/CFP

SVW/SQIP

The granularity mismatch problem

DSC/SDR

Evaluation


CPR/CFP

Latency-tolerant: scale key window structures under LL$ miss

Issue queue, regfile, load & store queues

CFP (Continual Flow Pipeline) [Srinivasan04]

Scale issue queue & regfile by “slicing out” miss-dependent insns

CPR (Checkpoint Processing & Recovery) [Akkary03]

Scale regfile by limiting recovery to pre-created checkpoints

Aggressive reclamation of non-checkpoint registers

Unintended consequence? checkpoint-granularity “bulk commit”

SA-LQ (Set-Associative Load Queue) [Akkary03]

HSQ (Hierarchical Store Queue) [Akkary03]


Baseline Performance (& Area)

  • ASSOC (baseline): 64/48 entry fully-associative load/store queues

  • 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue

    • Load queue: area is fine, poor performance (set conflicts)

    • Store queue: performance is fine, area inefficient (large CAM)


SQIP

SQIP (Store Queue Index Prediction) [Sha05]

Scales store queue/buffer by eliminating associative search

@dispatch: load predicts store queue position of forwarding store

@execute: load indexes store queue at this position
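The two steps above might be sketched as a predicted, non-associative lookup (a minimal illustration; the class names, predictor table, and D$-fallback path are assumptions, not the talk's implementation):

```python
# SQIP-style sketch: a PC-indexed table predicts the store queue position
# of the forwarding store; the load indexes the queue at that one slot
# instead of searching it associatively. All names are illustrative.

class StoreQueueEntry:
    def __init__(self, addr, value):
        self.addr = addr
        self.value = value

class SQIP:
    def __init__(self):
        self.predictor = {}    # load PC -> predicted store queue index
        self.store_queue = {}  # queue index -> StoreQueueEntry

    def execute_load(self, load_pc, load_addr, dcache):
        # @dispatch: predict the forwarding store's queue position
        idx = self.predictor.get(load_pc)
        # @execute: index the store queue at that single position
        entry = self.store_queue.get(idx)
        if entry is not None and entry.addr == load_addr:
            return entry.value        # forward from the predicted store
        return dcache[load_addr]      # no matching prediction: read the D$
```

A mis-indexed or stale prediction is caught later by in-order verification; this sketch simply falls back to the D$.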

[Slide diagram: instruction stream (older → younger) A:St <4>, B:Ld, P:St <8>, Q:St <9>, R:Ld, S:+, T:Br with addresses [x10], [x20], [x18]; commit <ssn=4>, dispatch <ssn=9>; load R indexes the store queue at the predicted position of P:St <8>]

Preliminaries: SSNs (Store Sequence Numbers) [Roth05]

  • Stores named by monotonically increasing sequence numbers

  • Low-order bits are store queue/buffer positions

  • Global SSNs track dispatch, commit, (store) completion
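The bullets above reduce to simple counter bookkeeping, sketched below (the queue size and names are assumptions for illustration):

```python
# SSN bookkeeping sketch: stores are named by a monotonically increasing
# counter; low-order bits give the store queue/buffer slot. Illustrative.
SQ_SIZE = 64  # assumed store queue size (a power of two)

class SSNState:
    def __init__(self):
        # Global SSNs tracking dispatch, commit, and (store) completion
        self.ssn_dispatch = 0
        self.ssn_commit = 0
        self.ssn_complete = 0

    def dispatch_store(self):
        """Assign the next SSN to a newly dispatched store."""
        self.ssn_dispatch += 1
        return self.ssn_dispatch

    @staticmethod
    def sq_slot(ssn):
        """Low-order SSN bits are the store queue/buffer position."""
        return ssn % SQ_SIZE
```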


SVW

Store Vulnerability Window (SVW) [Roth05]

Scales load queue by eliminating associative search

Load verification by in-order re-execution prior to commit

Highly filtered: <1% of loads actually re-execute

[Slide diagram: same instruction stream; complete <3>, verify/commit <9>; the address-indexed SSBF holds [x20, <9>] and [x18, <8>], which committing loads check]

  • Address-indexed SSBF tracks [addr, SSN] of committed stores

  • @commit: loads check SSBF, re-execute if possibly incorrect
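A direct-mapped sketch of that commit-time check (the set count and method names are assumptions):

```python
# SVW verification sketch: an address-indexed SSBF records the SSN of the
# youngest committed store to each set; a committing load re-executes only
# if such a store falls in its vulnerability window. Names illustrative.
SSBF_SETS = 64

class SSBF:
    def __init__(self):
        self.youngest_ssn = [0] * SSBF_SETS

    def store_commit(self, addr, ssn):
        self.youngest_ssn[addr % SSBF_SETS] = ssn

    def load_must_replay(self, addr, forwarding_ssn):
        # The load is vulnerable to stores younger than the one it read
        # from; a younger committed store that may alias this address
        # means the load could be incorrect and must re-execute.
        return self.youngest_ssn[addr % SSBF_SETS] > forwarding_ssn
```

Because the filter is lossy (set conflicts), the check errs toward replay, which is safe; per the talk, under 1% of loads actually re-execute.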


SVW–NAIVE

  • SVW: 512-entry indexed load queue, 256-entry store queue

  • Slowdowns over 8SA-LQ (mesa, wupwise)

  • Some slowdowns even over ASSOC too (bzip2, vortex)

  • Why? Not forwarding mis-predictions, but store-load serialization

    • Load Y can’t verify until older store X completes to D$


Store-Load Serialization: ROB

SVW/SQIP example: SSBF verification “hole”

Load R forwards from store <4> → vulnerable to stores <5>–<9>

No SSBF entry for address [x10] → must replay

Can't search store buffer → wait until stores <5>–<8> reach the D$

In a ROB processor … <8> (P) will complete (and usually quickly)

In a CPR processor …

[Slide diagram: ROB case — stores <5>–<8> complete to the D$ (complete <3> advances to complete <8>), the SSBF gains an [x20, <9>] entry, and load R can verify/commit <9>]


Store-Load Serialization: CPR

P will complete … unless it’s in same checkpoint as R

Deadlock: load R can't verify until store P completes; P can't complete until R verifies

Resolve: squash (ouch), on re-execute, create checkpoint before R

P and R will be in separate checkpoints

Better: learn and create checkpoints before future instances of R

This is SVW–TRAIN
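The "learn" step might be as simple as a PC-indexed table consulted at dispatch; the sketch below is entirely an illustrative assumption, not the talk's mechanism:

```python
# SVW-TRAIN idea, sketched: after a squash caused by store-load
# serialization, remember the offending load's PC and force a checkpoint
# before future instances of it. Table and names are assumptions.
class CheckpointTrainer:
    def __init__(self):
        self.force_ckpt_pcs = set()

    def on_serialization_squash(self, load_pc):
        # Learn: this load deadlocked against a store in its own checkpoint
        self.force_ckpt_pcs.add(load_pc)

    def should_checkpoint_before(self, pc):
        # Consulted at dispatch when deciding where to create a checkpoint
        return pc in self.force_ckpt_pcs
```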

[Slide diagram: CPR case — the checkpoint containing both P and R cannot commit; completion stalls at <3> while verification waits at <9>; SSBF holds [x20, <9>] and [x18, <8>]]


SVW–TRAIN

  • Better than SVW–NAÏVE

  • But worse in some cases (art, mcf, vpr)

    • Over-checkpointing holds too many registers

    • Checkpoint may not be available for branches


What About Set-Associative SSBFs?

  • Higher associativity helps (reduces hole frequency) but …

  • We’re replacing store queue associativity with SSBF associativity

    • Trying to avoid things like this

  • Want a better solution…


DSC (Decoupled Store Completion)

No fundamental reason we cannot complete stores <4> – <9>

All older instructions have completed

What’s stopping us? definition of commit & architected state

CPR: commit = oldest register checkpoint (checkpoint granularity)

ROB: commit = SVW-verify (instruction granularity)

Restore ROB definition

Allow stores to complete past oldest checkpoint

This is DSC (Decoupled Store Completion)
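The two commit definitions, and the DSC rule, condense into one predicate (a sketch; the SSN arguments are illustrative):

```python
# DSC completion rule, sketched: with DSC, a store may write the D$ once
# every older instruction has SVW-verified (instruction-granularity
# commit), even past the oldest register checkpoint. Names illustrative.
def store_can_complete(store_ssn, verified_ssn, oldest_ckpt_ssn, dsc):
    """verified_ssn: youngest store SSN behind which every older
    instruction has verified."""
    if dsc:
        # ROB-style definition restored: verification alone gates completion
        return store_ssn <= verified_ssn
    # Baseline CPR: stores also wait for the oldest checkpoint to commit
    return store_ssn <= min(verified_ssn, oldest_ckpt_ssn)
```

On the slide example (instructions verified through <9>, oldest checkpoint at <4>), store <8> completes under DSC but stalls under baseline CPR.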

[Slide diagram: DSC — commit advances past the oldest checkpoint; stores complete to <8> while verification proceeds from <6> to verify/commit <9>]


DSC: What About Mis-Speculations?

DSC: Architected state younger than oldest checkpoint

What about mis-speculation (e.g., branch T mis-predicted)?

Can only recover to checkpoint

Squash committed instructions?

Squash stores visible to other processors? etc.

How do we recover architected state?

[Slide diagram: stores have completed to <8>, past the oldest checkpoint, when branch T mis-predicts — how is architected state recovered?]


Silent Deterministic Replay (SDR)

Reconstruct architected state on demand

Squash to oldest checkpoint and replay …

Deterministically: re-produce committed values

Silently: without generating coherence events

How? discard committed stores at rename (already in SB or D$)

How? read load values from load queue

Avoid WAR hazards with younger stores

Same thread (e.g., BQ) or different thread (coherence)
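Those replay rules might look like this sketch, which assumes committed load values are retained in the load queue; the record formats and names are invented for illustration:

```python
# SDR sketch: replay from the oldest checkpoint re-produces committed
# register values deterministically and silently. Committed stores are
# dropped at rename (their data is already in the SB or D$) and loads
# read their committed values from the load queue, never the D$.
def sdr_replay(insns, load_queue_values, ssn_completed):
    regs = {}
    for insn in insns:
        if insn["op"] == "store" and insn["ssn"] <= ssn_completed:
            continue  # already visible in SB/D$: no second memory write
        if insn["op"] == "load":
            # Load queue value, not D$: avoids WAR hazards with younger
            # stores from this thread or others (no coherence events)
            regs[insn["dst"]] = load_queue_values[insn["id"]]
        # Other ops would re-execute on their (now deterministic) inputs
    return regs
```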

[Slide diagram: SDR replays from the oldest checkpoint; committed stores (complete <8>) are discarded at rename, load values come from the load queue, and verify/commit reaches <9>]


Outline

Background

DSC/SDR (yes, that was it)

Evaluation

Performance

Performance-area trade-offs


Performance Methodology

Workloads

SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling

Cycle-level simulator configuration

4-way superscalar out-of-order CPR/CFP processor

8 checkpoints, 32/32 INT/FP issue queue entries

32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers

400 cycle memory, 4Byte/cycle memory bus


SVW+DSC/SDR

  • Outperforms SVW–Naïve and SVW–Train

  • Outperforms 8SA-LQ on average (by a lot)

  • Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ

    • These are due to forwarding mis-speculation


Smaller, Less-Associative SSBFs

Does DSC/SDR make set-associative SSBFs unnecessary?

You can bet your associativity on it


Fewer Checkpoints

DSC/SDR reduce need for large numbers of checkpoints

  • Don’t need checkpoints to serialize store/load pairs

  • Efficient use of D$ bandwidth even with widely spaced checkpoints

  • Good: checkpoints are expensive


… And Less Area

Area methodology

CACTI-4 [Tarjan04], 45nm

Sum areas for load/store queues (SSBF & predictor too if needed)

E.g., 512-entry 8SA-LQ / 256-entry HSQ

High-performance/low-area

6.6% speedup, 0.91 mm²


How Performance/Area Was Won

SVW load queue: big performance gain (no conflicts) & small area loss

SQIP store queue: small performance loss & big area gain (no CAM)

Big SVW performance gain offsets small SQIP performance loss

Big SQIP area gain offsets small SVW area loss

DSC/SDR: big performance gain & small area gain


DSC/SDR Performance/Area

DSC/SDR improve SVW/SQIP IPC and reduce its area

No new structures, just new ways of using existing structures

No SSBF checkpoints

No checkpoint-creation predictor

More tolerant to reduction in checkpoints, SSBF size


Pareto Analysis

SVW/SQIP+DSC/SDR dominates all other designs

SVW/SQIP are low area (no CAMs)

DSC/SDR needed to match IPC of fully-associative load queue (FA-LQ)


Related Work

SRL (Store Redo Log) [Gandhi05]

Large associative store queue → FIFO buffer + forwarding cache

Expands store queue only under LL$ misses → under-performs HSQ

Unordered late-binding load/store queues [Sethumadhavan08]

Entries only for executed loads and stores

Poor match for centralized latency tolerant processors

Cherry [Martinez02]

“Post retirement” checkpoints

No large load/store queues, but may benefit from DSC/SDR

Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]


Conclusions

Checkpoint granularity …

… register management: good

… store commit: somewhat painful

DSC/SDR: the good parts of the checkpoint world

Checkpoint granularity registers + instruction granularity stores

Key 1: disassociate commit from oldest register checkpoint

Key 2: reconstruct architected state silently on demand

Committed load values available in load queue

Allow checkpoint processor to use SVW/SQIP load/store queues

Performance and area advantages

Simplify multi-processor operation for checkpoint processors

