
Memory Ordering: A Value-based Approach



1. Memory Ordering: A Value-based Approach
Trey Cain and Mikko Lipasti, University of Wisconsin-Madison

2. Value-based replay
• High ILP => large instruction windows
• Larger physical register file
• Larger scheduler
• Larger load/store queues
• These result in increased access latency
• Value-based replay
• If load queue scalability is a problem… who needs one!
• Instead, re-execute load instructions a 2nd time, in program order
• Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average
Cain and Lipasti, ISCA 2004

3. Outline
• Conventional load queue functionality/microarchitecture
• Value-based memory ordering
• Replay-reduction heuristics
• Performance evaluation

4. Enforcing RAW dependences
• Load queue contains load addresses
• One search per store address calculation
• If a match is found, the load is squashed
• Diagram — program order: (1) store A, (3) store ?, (2) load A, with execution order in parentheses: the load issues before the middle store's address resolves
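The search above can be sketched in a few lines of Python. This is an illustrative model only; the class and method names (`LoadQueue`, `store_resolves`) are ours, not the paper's, and ages stand in for program-order positions.

```python
class LoadQueue:
    """Toy model of the load-queue search performed on each
    store address calculation (illustrative, not the paper's RTL)."""

    def __init__(self):
        self.entries = []  # (address, age) of loads that have already issued

    def record_load(self, address, age):
        self.entries.append((address, age))

    def store_resolves(self, address, store_age):
        # Search the queue: any already-issued load to the same address that
        # is younger in program order than the store consumed a stale value
        # and must be squashed.
        return [age for (addr, age) in self.entries
                if addr == address and age > store_age]

lq = LoadQueue()
lq.record_load("A", age=2)                       # load A issues out of order
squashed = lq.store_resolves("A", store_age=1)   # earlier store to A resolves
clean = lq.store_resolves("B", store_age=1)      # store to another address
```

Here `squashed` contains the age of the offending load, while `clean` is empty, mirroring the match/no-match outcomes on the slide.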

5. Enforcing memory consistency
• Diagram: processor p1 executes (1) load A and (3) load A; processor p2 executes (2) store A in between, creating RAW and WAR orderings
• Two approaches:
• Snooping: one search per incoming invalidate
• Insulated: one search per load address calculation

6. Load queue implementation
• Structure: an address CAM plus a load meta-data RAM, with queue-management and squash-determination logic; inputs include store address/age, load address/age, and external (snoop) request addresses
• # of write ports = load address calculation width
• # of read ports = load + store address calculation width (+1 for external requests)
• Current-generation designs: 32-48 entries, 2 write ports, 2 (or 3) read ports

7. Load queue scaling
• Larger instruction window => larger load queue
• Increases access latency
• Increases energy consumption
• Wider issue width => more read/write ports
• Also increases latency and energy

8. Related work: MICRO 2003
• Park et al., Purdue
• Extra structure dedicated to enforcing memory consistency
• Increase capacity through segmentation
• Sethumadhavan et al., UT-Austin
• Add a set of filters summarizing the contents of the load queue

9. Keep it simple…
• Throw more hardware at the problem?
• Need to design/implement/verify it
• The execution core is already complicated
• The load queue checks for rare errors
• Why not move error checking away from the execution core?

10. Value-based ordering
• Replay: access the cache a second time - cheaply!
• Almost always a cache hit
• Reuse address calculation and translation
• Share the cache port used by stores in the commit stage
• Compare: compare the new value to the original value
• Squash if the values differ
• DIVA à la carte [Austin, MICRO 99]
• Pipeline diagram: IF1, IF2, D, R, Q, S, EX, REP, CMP, C, WB
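The replay-and-compare step can be sketched as follows. The function name and the dict-as-cache are illustrative assumptions, not the paper's interface; the point is only the value comparison at commit.

```python
def replay_and_compare(cache, address, original_value):
    """Commit-time replay check (illustrative): re-read the cache in
    program order and compare against the value the load originally
    obtained; squash and re-fetch on a mismatch."""
    return "commit" if cache[address] == original_value else "squash"

cache = {"A": 7}
ok = replay_and_compare(cache, "A", 7)    # value unchanged since first access
cache["A"] = 9                            # e.g. another processor wrote A
bad = replay_and_compare(cache, "A", 7)   # stale original value detected
```

A mismatch means the load's first, possibly mis-ordered access returned a value that is no longer correct, so the machine squashes rather than trying to identify which store conflicted.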

11. Rules of replay
• All prior stores must have written their data to the cache
• No store-to-load forwarding
• Loads must replay in program order
• If a cache miss occurs, all subsequent loads must be replayed
• If a load is squashed, it is not replayed a second time
• Ensures forward progress
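One of the rules above, that a cache miss during replay forces every subsequent load to replay even if a heuristic filter had marked it safe to skip, can be sketched like this. All names and the tuple layout are our assumptions.

```python
def apply_miss_rule(loads):
    """Illustrative model of the cache-miss replay rule. Each load is
    (name, filtered_out, replay_hits): filtered_out means a heuristic
    marked the load safe to skip; replay_hits means its replay access
    would hit the cache."""
    forced = False
    decisions = []
    for name, filtered_out, replay_hits in loads:
        replay = forced or not filtered_out
        decisions.append((name, "replay" if replay else "skip"))
        if replay and not replay_hits:
            forced = True  # a replay miss forces all later loads to replay
    return decisions

# (name, filtered-out-by-heuristic, replay-would-hit)
plan = apply_miss_rule([("ld1", False, True),
                        ("ld2", False, False),   # this replay misses
                        ("ld3", True,  True)])   # filter says skip, but…
```

In `plan`, ld3 ends up replayed despite its filter verdict, because ld2's replay missed before it.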

12. Replay reduction
• Replay costs:
• Consumes cache bandwidth (and power)
• Increases reorder buffer occupancy
• Can we avoid these penalties?
• Infer the correctness of certain operations
• Four replay filters

13. No-Reorder filter
• Avoid replay if the load wasn't reordered with respect to other memory operations
• Can we do better?

14. Enforcing single-thread RAW dependences
• No-Unresolved-Store-Address filter
• Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
• Works for intra-processor RAW dependences
• Doesn't enforce memory consistency
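A hedged sketch of this filter's decision, assuming simple per-instruction timestamps (our simplification; the hardware would track resolution with queue state, not times):

```python
def must_replay(load_issue_time, prior_store_resolve_times):
    """No-Unresolved-Store-Address filter (illustrative): replay load i
    only if some earlier store's address was still unresolved at the
    moment i issued."""
    return any(t > load_issue_time for t in prior_store_resolve_times)

safe = must_replay(5, [2, 3])    # all prior stores resolved first: skip replay
risky = must_replay(5, [2, 8])   # a prior store resolved after issue: replay
```

Loads that issued with all earlier store addresses known cannot have violated a same-thread RAW dependence, so they skip replay; as the slide notes, this says nothing about other processors' stores.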

15. Enforcing MP consistency
• No-Recent-Miss filter
• Avoid replay if there have been no cache-line fills (to any address) while the load was in the instruction window
• No-Recent-Snoop filter
• Avoid replay if there have been no external invalidates (to any address) while the load was in the instruction window
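The no-recent-snoop filter can be sketched as a single global counter; the no-recent-miss filter is analogous with a fill counter. Class and method names are ours, not the paper's.

```python
class NoRecentSnoopFilter:
    """Illustrative no-recent-snoop filter: one counter of external
    invalidates, deliberately coarse (any address counts)."""

    def __init__(self):
        self.invalidates = 0

    def external_invalidate(self):
        self.invalidates += 1

    def snapshot(self):
        # Taken when a load enters the instruction window.
        return self.invalidates

    def must_replay(self, snap):
        # Unchanged counter: no snoop could have raced this load.
        return self.invalidates != snap

f = NoRecentSnoopFilter()
snap = f.snapshot()          # load enters the window
quiet = f.must_replay(snap)  # no snoops arrived: replay can be skipped
f.external_invalidate()      # an invalidate arrives from another processor
noisy = f.must_replay(snap)  # now the load must replay
```

The coarseness is the design point: a single counter costs almost nothing, and snoops are rare enough that false positives (unnecessary replays) stay cheap.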

16. Constraint graph
• Defined for sequential consistency by Landin et al., ISCA-18
• A directed graph representing a multithreaded execution
• Nodes represent dynamic instruction instances
• Edges represent their transitive orders (program order, RAW, WAW, WAR)
• If the constraint graph is acyclic, then the execution is correct
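The acyclicity test is an ordinary cycle check over a directed graph. The sketch below uses standard DFS coloring; the node names echo the two-processor ST/LD example on the following slides, but the particular edge set is an illustrative assumption.

```python
def has_cycle(graph):
    """Detect a cycle in a constraint graph given as
    {node: [successor, ...]}. Standard white/gray/black DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def dfs(n):
        color[n] = GRAY  # on the current DFS path
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True  # back edge: a cycle exists
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

# Program-order edges plus dependence edges that close a loop (illustrative):
cyclic = has_cycle({"ST_A": ["LD_B"], "LD_B": ["ST_B"],
                    "ST_B": ["LD_A"], "LD_A": ["ST_A"]})
# Same nodes with the dependence edges removed:
acyclic = has_cycle({"ST_A": ["LD_B"], "LD_B": [],
                     "ST_B": ["LD_A"], "LD_A": []})
```

A cycle in the first graph signals an incorrect execution; the second graph is acyclic, so that ordering is permitted.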

17. Constraint graph example - SC
• Diagram: Proc 1 executes ST A then LD B (program order); Proc 2 executes ST B then LD A (program order); RAW and WAR edges between the processors close a cycle
• A cycle indicates that the execution is incorrect

18. Anatomy of a cycle
• Same diagram, annotated: the WAR edge coincides with an incoming invalidate at LD B, and the RAW edge with a cache miss at ST B

19. Enforcing MP consistency
• No-Recent-Miss filter
• Avoid replay if there have been no cache-line fills (to any address) while the load was in the instruction window
• No-Recent-Snoop filter
• Avoid replay if there have been no external invalidates (to any address) while the load was in the instruction window

20. Filter summary (conservative to aggressive)
• Conservative: replay all committed loads
• No-Reorder filter
• No-Unresolved-Store / No-Recent-Miss filter
• No-Unresolved-Store / No-Recent-Snoop filter (most aggressive)

21. Outline
• Conventional load queue functionality/microarchitecture
• Value-based memory ordering
• Replay-reduction heuristics
• Performance evaluation

22. Base machine model

23. % L1 D-cache bandwidth increase
• Chart over SPECint2000, SPECfp2000, commercial, and multiprocessor workloads, comparing (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter
• On average, 3.4% bandwidth overhead using the no-recent-snoop filter

24. Value-based replay performance (relative to a constrained load queue)
• Chart over SPECint2000, SPECfp2000, commercial, and multiprocessor workloads
• Value-based replay is 8% faster on average than a baseline using a 16-entry load queue

25. Value-based replay pros/cons
• Eliminates associative lookup hardware
• The load queue becomes a simple FIFO
• Negligible IPC or L1-D bandwidth impact
• Can be used to fix value prediction
• Enforces the dependence-order consistency constraint [Martin et al., MICRO 2001]
• Requires additional pipeline stages
• Requires an additional cache datapath for loads

26. The End
• Questions?

27. Backups

28. Does value locality help?
• Not much…
• Value locality does avoid memory ordering violations:
• 59% of single-thread violations avoided
• 95% of consistency violations avoided
• But these violations rarely occur:
• ~1 single-thread violation per 100 million instructions
• 4 consistency violations per 10,000 instructions

29. What about power?
• Simple power model:
ΔEnergy = #replays × (E_per cache access + E_per word comparison) + replay overhead - (E_per ldq search × #ldq searches)
• Empirically: 0.02 replayed loads per committed instruction
• If load queue CAM energy per instruction > 0.02 × the energy of a cache access plus comparison, the value-based implementation saves power!
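The break-even condition can be made concrete with a back-of-envelope calculation. Only the 0.02 replay rate comes from the slide; every energy number below is invented purely for illustration, and the replay-overhead term is ignored.

```python
# Per-committed-instruction energy comparison (all values in pJ, invented
# for illustration; only the 0.02 replay rate is from the slide).
replays_per_insn  = 0.02   # replayed loads per committed instruction (slide)
e_cache_access    = 50.0   # assumed energy of one extra cache access
e_word_compare    = 5.0    # assumed energy of one value comparison
searches_per_insn = 0.40   # assumed baseline load-queue searches per insn
e_ldq_search      = 10.0   # assumed energy of one associative CAM search

delta_energy = (replays_per_insn * (e_cache_access + e_word_compare)
                - searches_per_insn * e_ldq_search)  # overhead term ignored
saves_power = delta_energy < 0  # negative delta: value-based replay wins
```

Under these made-up numbers the replays cost 1.1 pJ/insn while the eliminated CAM searches cost 4.0 pJ/insn, so `delta_energy` is negative and the value-based scheme comes out ahead, matching the slide's qualitative argument.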

30. Caveat: memory dependence prediction
• Some predictors train using the conflicting store (e.g., the store-set predictor)
• The replay mechanism is unable to pinpoint the conflicting store
• Fair comparison:
• Baseline machine: store-set predictor with 4K-entry SSIT and 128-entry LFST
• Experimental machine: simple 21264-style dependence predictor with 4K-entry history table

31. Load queue search energy
• Chart based on a 0.09-micron process technology, using Cacti v3.2

32. Load queue search latency
• Chart based on a 0.09-micron process technology, using Cacti v3.2

33. Benchmarks
• MP (16-way)
• Commercial workloads (SPECweb, TPC-H)
• SPLASH2 scientific application (ocean)
• Error bars signify 95% statistical confidence
• UP
• 3 from SPECfp2000, selected due to high reorder buffer utilization: apsi, art, wupwise
• 3 commercial: SPECjbb2000, TPC-B, TPC-H
• A few from SPECint2000

34. Life cycle of a load
• Diagram: an out-of-order execution window of loads and stores with unresolved addresses (ST ?, LD ?); when a store to A resolves, the load-queue search finds an already-issued LD A ("Blam!") and squashes it

35. Performance relative to an unconstrained load queue
• Good news: replay with the no-recent-snoop filter is only 1% slower on average

36. Reorder-Buffer Utilization

37. Why focus on the load queue?
• The load queue has different constraints than the store queue
• More loads than stores (30% vs. 14% of dynamic instructions)
• The load queue is searched more frequently (consuming more power)
• Store-forwarding logic is performance-critical
• Many non-scalable structures in an OoO processor:
• Scheduler
• Physical register file
• Register map

38. Prior work: formal memory model representations
• Local, WRT, global "performance" of memory ops (Dubois et al., ISCA-13)
• Acyclic graph representation (Landin et al., ISCA-18)
• Modeling a memory operation as a series of sub-operations (Collier, RAPA)
• Acyclic graph + sub-operations (Adve, thesis)
• Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)
