Decoupled Store Completion



Decoupled Store Completion
Silent Deterministic Replay
Enabling Scalable Data Memory for CPR/CFP Processors

Andrew Hilton, Amir Roth

University of Pennsylvania

{adhilton, amir}@cis.upenn.edu

ISCA-36 :: June 23, 2009



Brief Overview

  • Latency-tolerant processors: CPR/CFP [Akkary03, Srinivasan04]; also DKIP, FMC [Pericas06, Pericas07]

  • Scalable load & store queues for dynamically scheduled superscalar processors: SVW/SQIP [Roth05, Sha05]

  • Scalable load & store queues for latency-tolerant processors: SA-LQ/HSQ [Akkary03], SRL [Gandhi05], ELSQ [Pericas08]

  • Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)

  • This talk: Decoupled Store Completion & Silent Deterministic Replay



Outline

Background

CPR/CFP

SVW/SQIP

The granularity mismatch problem

DSC/SDR

Evaluation



CPR/CFP

Latency-tolerant: scale key window structures under LL$ miss
  • Issue queue, regfile, load & store queues

CFP (Continual Flow Pipeline) [Srinivasan04]
  • Scales issue queue & regfile by “slicing out” miss-dependent insns

CPR (Checkpoint Processing & Recovery) [Akkary03]
  • Scales regfile by limiting recovery to pre-created checkpoints
  • Aggressive reclamation of non-checkpoint registers
  • Unintended consequence: checkpoint-granularity “bulk commit”

Scalable load & store queues for CPR:
  • SA-LQ (Set-Associative Load Queue) [Akkary03]
  • HSQ (Hierarchical Store Queue) [Akkary03]



Baseline Performance (& Area)

  • ASSOC (baseline): 64/48 entry fully-associative load/store queues

  • 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue

    • Load queue: area is fine, poor performance (set conflicts)

    • Store queue: performance is fine, area inefficient (large CAM)



SQIP

SQIP (Store Queue Index Prediction) [Sha05]

Scales store queue/buffer by eliminating associative search

@dispatch: load predicts store queue position of forwarding store

@execute: load indexes store queue at this position

[Diagram: instruction stream, older → younger: A:St [x10] <4>, B:Ld [x20], P:St <8>, Q:St <9>, R:Ld, S:+, T:Br; younger addresses still unresolved; commit <ssn=4>, dispatch <ssn=9>; load R predicts P:St <8> as its forwarding store]
Preliminaries: SSNs (Store Sequence Numbers) [Roth05]

  • Stores named by monotonically increasing sequence numbers

  • Low-order bits are store queue/buffer positions

  • Global SSNs track dispatch, commit, (store) completion
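The SSN naming and SQIP indexing schemes can be sketched in a few lines of Python. This is a minimal illustration under assumed structures; the class and method names are mine, not the paper's hardware.

```python
SQ_SIZE = 256  # store queue entries (assumed power of two)

class StoreQueue:
    """Illustrative SSN-indexed store queue (SQIP-style, no CAM search)."""

    def __init__(self):
        self.entries = [None] * SQ_SIZE   # each entry: (ssn, addr, data)
        self.ssn_dispatch = 0             # global SSN of next store to dispatch

    def dispatch_store(self, addr, data):
        """Assign the next SSN; its low-order bits give the queue slot."""
        ssn = self.ssn_dispatch
        self.ssn_dispatch += 1
        self.entries[ssn % SQ_SIZE] = (ssn, addr, data)
        return ssn

    def load_forward(self, predicted_ssn, load_addr):
        """SQIP: a load indexes (rather than searches) the predicted slot."""
        entry = self.entries[predicted_ssn % SQ_SIZE]
        if entry and entry[0] == predicted_ssn and entry[1] == load_addr:
            return entry[2]               # forwarded data (SVW verifies later)
        return None                       # mis-prediction: read the D$ instead
```

A mis-prediction here is not a correctness problem by itself: the load simply reads the D$ and relies on SVW verification before commit.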



SVW

Store Vulnerability Window (SVW) [Roth05]

Scales load queue by eliminating associative search

Load verification by in-order re-execution prior to commit

Highly filtered: <1% of loads actually re-execute

[Diagram: A:St [x10] <4>, B:Ld [x20], P:St [x18] <8>, Q:St [x20] <9>, R:Ld [x18], S:+, T:Br; commit <9>, complete <3>; SSBF entries x?0 → [x20, <9>] and x?8 → [x18, <8>]; loads verify against the SSBF at commit]

  • Address-indexed SSBF tracks [addr, SSN] of committed stores

  • @commit: loads check SSBF, re-execute if possibly incorrect
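The commit-time check can be sketched as follows. This untagged, direct-mapped filter is an illustrative assumption (a real SSBF is finite and tagged, so a miss must conservatively force replay); aliasing here only raises the stored SSN, so it causes extra replays but never misses a violation.

```python
class SSBF:
    """Illustrative SSN Bloom filter: SSN of the youngest committed store
    to each address-hashed index."""

    def __init__(self, size=1024):
        self.size = size
        self.table = [0] * size

    def store_commit(self, addr, ssn):
        # Record the committing store's SSN under its address hash.
        self.table[addr % self.size] = ssn

    def load_verify(self, addr, forwarding_ssn):
        # Safe if no store younger than the load's forwarding store has
        # committed to (a possible alias of) this address; otherwise the
        # load must re-execute before it can commit.
        return self.table[addr % self.size] <= forwarding_ssn
```

In the slide's example, a load that forwarded from store <8> at [x18] verifies cleanly, while a load that forwarded from store <4> at [x20] fails the check once store <9> commits.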



SVW–NAIVE

  • SVW: 512-entry indexed load queue, 256-entry store queue

  • Slowdowns over 8SA-LQ (mesa, wupwise)

  • Some slowdowns even over ASSOC too (bzip2, vortex)

  • Why? Not forwarding mis-predictions … store-load serialization

    • Load Y can’t verify until older store X completes to D$



Store-Load Serialization: ROB

SVW/SQIP example: SSBF verification “hole”

Load R forwards from store <4> → vulnerable to stores <5>–<9>

No SSBF entry for address [x10] → must replay

Can’t search store buffer → wait until stores <5>–<8> in D$

In a ROB processor … <8> (P) will complete (and usually quickly)

In a CPR processor …

[Diagram: A:St [x10] <4>, B:Ld [x20], P:St [x18] <8>, Q:St [x20] <9>, R:Ld [x10], S:+, T:Br; as stores drain (complete <3> → complete <8>), load R can replay, then verify/commit <9>; SSBF holds x?0 → [x20, <9>] and x?8 → [x18, <8>]]



Store-Load Serialization: CPR

P will complete … unless it’s in the same checkpoint as R

Deadlock: load R can’t verify → store P can’t complete

Resolve: squash (ouch); on re-execute, create a checkpoint before R

P and R will be in separate checkpoints

Better: learn and create checkpoints before future instances of R

This is SVW–TRAIN
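The training step can be sketched as a tiny PC-indexed structure. The set-based predictor below is an illustrative assumption, not the paper's exact mechanism.

```python
class CheckpointTrainer:
    """Illustrative SVW-TRAIN predictor: remember loads that caused
    serialization squashes and pre-create checkpoints before them."""

    def __init__(self):
        self.trained_pcs = set()   # PCs of loads that deadlocked

    def on_serialization_squash(self, load_pc):
        # Learn: this load deadlocked with an older store in its checkpoint.
        self.trained_pcs.add(load_pc)

    def checkpoint_before(self, pc):
        # At dispatch: start a new checkpoint before trained loads so the
        # load and its forwarding store fall into separate checkpoints.
        return pc in self.trained_pcs
```

The cost, as the next slide shows, is over-checkpointing: every trained instance consumes a checkpoint whether or not it would have deadlocked again.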

[Diagram: same instruction stream with a single checkpoint covering both P:St <8> and R:Ld; complete <3> cannot advance while R waits to verify <9>; SSBF holds x?0 → [x20, <9>] and x?8 → [x18, <8>]]


SVW–TRAIN

  • Better than SVW–NAÏVE

  • But worse in some cases (art, mcf, vpr)

    • Over-checkpointing holds too many registers

    • Checkpoint may not be available for branches



What About Set-Associative SSBFs?

  • Higher associativity helps (reduces hole frequency) but …

  • We’re replacing store queue associativity with SSBF associativity

    • Trying to avoid things like this

  • Want a better solution…



DSC (Decoupled Store Completion)

No fundamental reason we cannot complete stores <4>–<9>

All older instructions have completed

What’s stopping us? definition of commit & architected state

CPR: commit = oldest register checkpoint (checkpoint granularity)

ROB: commit = SVW-verify (instruction granularity)

Restore ROB definition

Allow stores to complete past oldest checkpoint

This is DSC (Decoupled Store Completion)
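The rule change can be illustrated with two completion bounds; the pointer names below are mine, not the paper's.

```python
def completable_stores(ssn_verified, ssn_completed, ssn_checkpoint):
    """How many stores may drain to the D$.

    Without DSC, store completion stops at the oldest register checkpoint
    (checkpoint-granularity commit); with DSC it follows the
    instruction-granularity SVW-verify pointer instead.
    """
    without_dsc = max(0, min(ssn_verified, ssn_checkpoint) - ssn_completed)
    with_dsc = max(0, ssn_verified - ssn_completed)
    return without_dsc, with_dsc
```

In the running example (verify <9>, complete <3>, oldest checkpoint boundary at <4>), the bound rises from one store to six: stores <4>–<9> can all complete.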

[Diagram: with DSC, completion advances at instruction granularity past the oldest checkpoint: complete <3> → complete <8>, verify <6> → verify/commit <9>]



DSC: What About Mis-Speculations?

DSC: Architected state younger than oldest checkpoint

What about mis-speculation (e.g., branch T mis-predicted)?

Can only recover to checkpoint

Squash committed instructions?

Squash stores visible to other processors? etc.

How do we recover architected state?

[Diagram: stores through <8> have completed and instructions through <9> have committed, but branch T mis-predicts; recovery can only target a checkpoint older than the committed instructions]



Silent Deterministic Recovery (SDR)

Reconstruct architected state on demand

Squash to oldest checkpoint and replay …

Deterministically: re-produce committed values

Silently: without generating coherence events

How? Discard committed stores at rename (already in SB or D$)

How? Read load values from load queue

Avoid WAR hazards with younger stores

Same thread (e.g., BQ) or different thread (coherence)

[Diagram: replay from the oldest checkpoint: committed stores (through <8>) are dropped at rename, committed loads reread their values from the load queue, and execution re-reaches verify/commit <9>]



Outline

Background

DSC/SDR (yes, that was it)

Evaluation

Performance

Performance-area trade-offs



Performance Methodology

Workloads

SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling

Cycle-level simulator configuration

4-way superscalar out-of-order CPR/CFP processor

8 checkpoints, 32/32 INT/FP issue queue entries

32KB D$, 15-cycle 2MB L2, 8 8-entry stream prefetchers

400-cycle memory, 4B/cycle memory bus



SVW+DSC/SDR

  • Outperforms SVW–Naïve and SVW–Train

  • Outperforms 8SA-LQ on average (by a lot)

  • Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ

    • These are due to forwarding mis-speculation



Smaller, Less-Associative SSBFs

Does DSC/SDR make set-associative SSBFs unnecessary?

You can bet your associativity on it



Fewer Checkpoints

DSC/SDR reduce need for large numbers of checkpoints

  • Don’t need checkpoints to serialize store/load pairs

  • Efficient use of D$ bandwidth even with widely spaced checkpoints

  • Good: checkpoints are expensive



… And Less Area

Area methodology

CACTI-4 [Tarjan04], 45nm

Sum areas for load/store queues (SSBF & predictor too if needed)

E.g., 512-entry 8SA-LQ / 256-entry HSQ

High-performance/low-area

6.6% speedup, 0.91 mm²



How Performance/Area Was Won

SVW load queue: big performance gain (no conflicts) & small area loss

SQIP store queue: small performance loss & big area gain (no CAM)

Big SVW performance gain offsets small SQIP performance loss

Big SQIP area gain offsets small SVW area loss

DSC/SDR: big performance gain & small area gain



DSC/SDR Performance/Area

DSC/SDR improve SVW/SQIP IPC and reduce its area

No new structures, just new ways of using existing structures

No SSBF checkpoints

No checkpoint-creation predictor

More tolerant to reduction in checkpoints, SSBF size



Pareto Analysis

SVW/SQIP+DSC/SDR dominates all other designs

SVW/SQIP are low area (no CAMs)

DSC/SDR needed to match IPC of fully-associative load queue (FA-LQ)



Related Work

SRL (Store Redo Log) [Gandhi05]

Large associative store queue → FIFO buffer + forwarding cache

Expands store queue only under LL$ misses → under-performs HSQ

Unordered late-binding load/store queues [Sethumadhavan08]

Entries only for executed loads and stores

Poor match for centralized latency tolerant processors

Cherry [Martinez02]

“Post retirement” checkpoints

No large load/store queues, but may benefit from DSC/SDR

Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]



Conclusions

Checkpoint granularity …

… register management: good

… store commit: somewhat painful

DSC/SDR: the good parts of the checkpoint world

Checkpoint granularity registers + instruction granularity stores

Key 1: disassociate commit from oldest register checkpoint

Key 2: reconstruct architected state silently on demand

Committed load values available in load queue

Allow checkpoint processor to use SVW/SQIP load/store queues

Performance and area advantages

Simplify multi-processor operation for checkpoint processors

