Decoupled Store Completion



Decoupled Store Completion
Silent Deterministic Replay
Enabling Scalable Data Memory for CPR/CFP Processors

Andrew Hilton, Amir Roth

University of Pennsylvania

{adhilton, amir}@cis.upenn.edu

ISCA-36 :: June 23, 2009



Brief Overview

  • Latency-tolerant processors: CPR/CFP [Akkary03, Srinivasan04]; also DKIP, FMC [Pericas06, Pericas07]

  • Scalable load & store queues for dynamically scheduled superscalar processors: SVW/SQIP [Roth05, Sha05]

  • Scalable load & store queues for latency-tolerant processors: SA-LQ/HSQ [Akkary03], SRL [Gandhi05], ELSQ [Pericas08]

  • Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)

  • This talk: Decoupled Store Completion & Silent Deterministic Replay



Outline

Background

CPR/CFP

SVW/SQIP

The granularity mismatch problem

DSC/SDR

Evaluation



CPR/CFP

Latency-tolerant: scale key window structures under LL$ miss
  • Issue queue, regfile, load & store queues

CFP (Continual Flow Pipeline) [Srinivasan04]
  • Scales issue queue & regfile by “slicing out” miss-dependent insns

CPR (Checkpoint Processing & Recovery) [Akkary03]
  • Scales regfile by limiting recovery to pre-created checkpoints
  • Aggressive reclamation of non-checkpoint registers
  • Unintended consequence: checkpoint-granularity “bulk commit”

Scalable load & store queues for CPR:
  • SA-LQ (Set-Associative Load Queue) [Akkary03]
  • HSQ (Hierarchical Store Queue) [Akkary03]



Baseline Performance (& Area)

  • ASSOC (baseline): 64/48 entry fully-associative load/store queues

  • 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue

    • Load queue: area is fine, poor performance (set conflicts)

    • Store queue: performance is fine, area inefficient (large CAM)



SQIP

SQIP (Store Queue Index Prediction) [Sha05]

Scales store queue/buffer by eliminating associative search

@dispatch: load predicts store queue position of forwarding store

@execute: load indexes store queue at this position

[Diagram: instruction stream, older → younger: A:St [x10] <4>, B:Ld [x20], P:St <8>, Q:St <9>, R:Ld, S:+, T:Br; younger addresses still unresolved; commit <ssn=4>, dispatch <ssn=9>; load R predicts P:St <8> as its forwarding store]
Preliminaries: SSNs (Store Sequence Numbers) [Roth05]

  • Stores named by monotonically increasing sequence numbers

  • Low-order bits are store queue/buffer positions

  • Global SSNs track dispatch, commit, (store) completion
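The SSN naming and SQIP indexing schemes can be sketched in a few lines of Python. This is a minimal illustration under assumed structures; the class and method names are mine, not the paper's hardware.

```python
SQ_SIZE = 256  # store queue entries (assumed power of two)

class StoreQueue:
    """Illustrative SSN-indexed store queue (SQIP-style, no CAM search)."""

    def __init__(self):
        self.entries = [None] * SQ_SIZE   # each entry: (ssn, addr, data)
        self.ssn_dispatch = 0             # global SSN of next store to dispatch

    def dispatch_store(self, addr, data):
        """Assign the next SSN; its low-order bits give the queue slot."""
        ssn = self.ssn_dispatch
        self.ssn_dispatch += 1
        self.entries[ssn % SQ_SIZE] = (ssn, addr, data)
        return ssn

    def load_forward(self, predicted_ssn, load_addr):
        """SQIP: a load indexes (rather than searches) the predicted slot."""
        entry = self.entries[predicted_ssn % SQ_SIZE]
        if entry and entry[0] == predicted_ssn and entry[1] == load_addr:
            return entry[2]               # forwarded data (SVW verifies later)
        return None                       # mis-prediction: read the D$ instead
```

A mis-prediction here is not a correctness problem by itself: the load simply reads the D$ and relies on SVW verification before commit.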



SVW

Store Vulnerability Window (SVW) [Roth05]

Scales load queue by eliminating associative search

Load verification by in-order re-execution prior to commit

Highly filtered: <1% of loads actually re-execute

[Diagram: A:St [x10] <4>, B:Ld [x20], P:St [x18] <8>, Q:St [x20] <9>, R:Ld [x18], S:+, T:Br; commit <9>, complete <3>; SSBF entries x?0 → [x20, <9>] and x?8 → [x18, <8>]; loads verify against the SSBF at commit]

  • Address-indexed SSBF tracks [addr, SSN] of committed stores

  • @commit: loads check SSBF, re-execute if possibly incorrect
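The commit-time check can be sketched as follows. This untagged, direct-mapped filter is an illustrative assumption (a real SSBF is finite and tagged, so a miss must conservatively force replay); aliasing here only raises the stored SSN, so it causes extra replays but never misses a violation.

```python
class SSBF:
    """Illustrative SSN Bloom filter: SSN of the youngest committed store
    to each address-hashed index."""

    def __init__(self, size=1024):
        self.size = size
        self.table = [0] * size

    def store_commit(self, addr, ssn):
        # Record the committing store's SSN under its address hash.
        self.table[addr % self.size] = ssn

    def load_verify(self, addr, forwarding_ssn):
        # Safe if no store younger than the load's forwarding store has
        # committed to (a possible alias of) this address; otherwise the
        # load must re-execute before it can commit.
        return self.table[addr % self.size] <= forwarding_ssn
```

In the slide's example, a load that forwarded from store <8> at [x18] verifies cleanly, while a load that forwarded from store <4> at [x20] fails the check once store <9> commits.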



SVW–NAIVE

  • SVW: 512-entry indexed load queue, 256-entry store queue

  • Slowdowns over 8SA-LQ (mesa, wupwise)

  • Some slowdowns even over ASSOC too (bzip2, vortex)

  • Why? Not forwarding mis-predictions … store-load serialization

    • Load Y can’t verify until older store X completes to D$



Store-Load Serialization: ROB

SVW/SQIP example: SSBF verification “hole”

Load R forwards from store <4> → vulnerable to stores <5>–<9>

No SSBF entry for address [x10] → must replay

Can’t search store buffer → wait until stores <5>–<8> in D$

In a ROB processor … <8> (P) will complete (and usually quickly)

In a CPR processor …

[Diagram: A:St [x10] <4>, B:Ld [x20], P:St [x18] <8>, Q:St [x20] <9>, R:Ld [x10], S:+, T:Br; as stores drain (complete <3> → complete <8>), load R can replay, then verify/commit <9>; SSBF holds x?0 → [x20, <9>] and x?8 → [x18, <8>]]



Store-Load Serialization: CPR

P will complete … unless it’s in the same checkpoint as R

Deadlock: load R can’t verify → store P can’t complete

Resolve: squash (ouch); on re-execute, create a checkpoint before R

P and R will be in separate checkpoints

Better: learn and create checkpoints before future instances of R

This is SVW–TRAIN
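The training step can be sketched as a tiny PC-indexed structure. The set-based predictor below is an illustrative assumption, not the paper's exact mechanism.

```python
class CheckpointTrainer:
    """Illustrative SVW-TRAIN predictor: remember loads that caused
    serialization squashes and pre-create checkpoints before them."""

    def __init__(self):
        self.trained_pcs = set()   # PCs of loads that deadlocked

    def on_serialization_squash(self, load_pc):
        # Learn: this load deadlocked with an older store in its checkpoint.
        self.trained_pcs.add(load_pc)

    def checkpoint_before(self, pc):
        # At dispatch: start a new checkpoint before trained loads so the
        # load and its forwarding store fall into separate checkpoints.
        return pc in self.trained_pcs
```

The cost, as the next slide shows, is over-checkpointing: every trained instance consumes a checkpoint whether or not it would have deadlocked again.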

[Diagram: same instruction stream with a single checkpoint covering both P:St <8> and R:Ld; complete <3> cannot advance while R waits to verify <9>; SSBF holds x?0 → [x20, <9>] and x?8 → [x18, <8>]]


SVW–TRAIN

  • Better than SVW–NAÏVE

  • But worse in some cases (art, mcf, vpr)

    • Over-checkpointing holds too many registers

    • Checkpoint may not be available for branches



What About Set-Associative SSBFs?

  • Higher associativity helps (reduces hole frequency) but …

  • We’re replacing store queue associativity with SSBF associativity

    • Trying to avoid things like this

  • Want a better solution…



DSC (Decoupled Store Completion)

No fundamental reason we cannot complete stores <4>–<9>

All older instructions have completed

What’s stopping us? definition of commit & architected state

CPR: commit = oldest register checkpoint (checkpoint granularity)

ROB: commit = SVW-verify (instruction granularity)

Restore ROB definition

Allow stores to complete past oldest checkpoint

This is DSC (Decoupled Store Completion)
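The rule change can be illustrated with two completion bounds; the pointer names below are mine, not the paper's.

```python
def completable_stores(ssn_verified, ssn_completed, ssn_checkpoint):
    """How many stores may drain to the D$.

    Without DSC, store completion stops at the oldest register checkpoint
    (checkpoint-granularity commit); with DSC it follows the
    instruction-granularity SVW-verify pointer instead.
    """
    without_dsc = max(0, min(ssn_verified, ssn_checkpoint) - ssn_completed)
    with_dsc = max(0, ssn_verified - ssn_completed)
    return without_dsc, with_dsc
```

In the running example (verify <9>, complete <3>, oldest checkpoint boundary at <4>), the bound rises from one store to six: stores <4>–<9> can all complete.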

[Diagram: with DSC, completion advances at instruction granularity past the oldest checkpoint: complete <3> → complete <8>, verify <6> → verify/commit <9>]



DSC: What About Mis-Speculations?

DSC: Architected state younger than oldest checkpoint

What about mis-speculation (e.g., branch T mis-predicted)?

Can only recover to checkpoint

Squash committed instructions?

Squash stores visible to other processors? etc.

How do we recover architected state?

[Diagram: stores through <8> have completed and instructions through <9> have committed, but branch T mis-predicts; recovery can only target a checkpoint older than the committed instructions]



Silent Deterministic Recovery (SDR)

Reconstruct architected state on demand

Squash to oldest checkpoint and replay …

Deterministically: re-produce committed values

Silently: without generating coherence events

How? Discard committed stores at rename (already in SB or D$)

How? Read load values from load queue

Avoid WAR hazards with younger stores

Same thread (e.g., BQ) or different thread (coherence)

[Diagram: replay from the oldest checkpoint: committed stores (through <8>) are dropped at rename, committed loads reread their values from the load queue, and execution re-reaches verify/commit <9>]



Outline

Background

DSC/SDR (yes, that was it)

Evaluation

Performance

Performance-area trade-offs



Performance Methodology

Workloads

SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling

Cycle-level simulator configuration

4-way superscalar out-of-order CPR/CFP processor

8 checkpoints, 32/32 INT/FP issue queue entries

32KB D$, 15-cycle 2MB L2, 8 8-entry stream prefetchers

400-cycle memory, 4B/cycle memory bus



SVW+DSC/SDR

  • Outperforms SVW–Naïve and SVW–Train

  • Outperforms 8SA-LQ on average (by a lot)

  • Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ

    • These are due to forwarding mis-speculation



Smaller, Less-Associative SSBFs

Does DSC/SDR make set-associative SSBFs unnecessary?

You can bet your associativity on it



Fewer Checkpoints

DSC/SDR reduce need for large numbers of checkpoints

  • Don’t need checkpoints to serialize store/load pairs

  • Efficient use of D$ bandwidth even with widely spaced checkpoints

  • Good: checkpoints are expensive



… And Less Area

Area methodology

CACTI-4 [Tarjan04], 45nm

Sum areas for load/store queues (SSBF & predictor too if needed)

E.g., 512-entry 8SA-LQ / 256-entry HSQ

High-performance/low-area

6.6% speedup, 0.91 mm²



How Performance/Area Was Won

SVW load queue: big performance gain (no conflicts) & small area loss

SQIP store queue: small performance loss & big area gain (no CAM)

Big SVW performance gain offsets small SQIP performance loss

Big SQIP area gain offsets small SVW area loss

DSC/SDR: big performance gain & small area gain



DSC/SDR Performance/Area

DSC/SDR improve SVW/SQIP IPC and reduce its area

No new structures, just new ways of using existing structures

No SSBF checkpoints

No checkpoint-creation predictor

More tolerant to reduction in checkpoints, SSBF size



Pareto Analysis

SVW/SQIP+DSC/SDR dominates all other designs

SVW/SQIP are low area (no CAMs)

DSC/SDR needed to match IPC of fully-associative load queue (FA-LQ)



Related Work

SRL (Store Redo Log) [Gandhi05]

Large associative store queue → FIFO buffer + forwarding cache

Expands store queue only under LL$ misses → under-performs HSQ

Unordered late-binding load/store queues [Sethumadhavan08]

Entries only for executed loads and stores

Poor match for centralized latency tolerant processors

Cherry [Martinez02]

“Post retirement” checkpoints

No large load/store queues, but may benefit from DSC/SDR

Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]



Conclusions

Checkpoint granularity …

… register management: good

… store commit: somewhat painful

DSC/SDR: the good parts of the checkpoint world

Checkpoint granularity registers + instruction granularity stores

Key 1: disassociate commit from oldest register checkpoint

Key 2: reconstruct architected state silently on demand

Committed load values available in load queue

Allow checkpoint processor to use SVW/SQIP load/store queues

Performance and area advantages

Simplify multi-processor operation for checkpoint processors

