
Memory Ordering: A Value-based Approach



1. Memory Ordering: A Value-based Approach
Trey Cain and Mikko Lipasti, University of Wisconsin-Madison

2. Value-based replay
• High ILP => large instruction windows
• Larger physical register file
• Larger scheduler
• Larger load/store queues
• These result in increased access latency
• Value-based replay
• If load queue scalability is a problem… who needs one!
• Instead, re-execute load instructions a 2nd time, in program order
• Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average
Cain and Lipasti, ISCA 2004

3. Outline
• Conventional load queue functionality/microarchitecture
• Value-based memory ordering
• Replay-reduction heuristics
• Performance evaluation

4. Enforcing RAW dependences
• Load queue contains load addresses
• One search per store address calculation
• If a match is found, the load is squashed
• Diagram — program order: (1) store A, (3) store ?, (2) load A, with execution order in parentheses: the load issues before the middle store's address resolves
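The search above can be sketched in a few lines of Python. This is an illustrative model only; the class and method names (`LoadQueue`, `store_resolves`) are ours, not the paper's, and ages stand in for program-order positions.

```python
class LoadQueue:
    """Toy model of the load-queue search performed on each
    store address calculation (illustrative, not the paper's RTL)."""

    def __init__(self):
        self.entries = []  # (address, age) of loads that have already issued

    def record_load(self, address, age):
        self.entries.append((address, age))

    def store_resolves(self, address, store_age):
        # Search the queue: any already-issued load to the same address that
        # is younger in program order than the store consumed a stale value
        # and must be squashed.
        return [age for (addr, age) in self.entries
                if addr == address and age > store_age]

lq = LoadQueue()
lq.record_load("A", age=2)                       # load A issues out of order
squashed = lq.store_resolves("A", store_age=1)   # earlier store to A resolves
clean = lq.store_resolves("B", store_age=1)      # store to another address
```

Here `squashed` contains the age of the offending load, while `clean` is empty, mirroring the match/no-match outcomes on the slide.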

5. Enforcing memory consistency
• Diagram: processor p1 executes (1) load A and (3) load A; processor p2 executes (2) store A in between, creating RAW and WAR orderings
• Two approaches:
• Snooping: one search per incoming invalidate
• Insulated: one search per load address calculation

6. Load queue implementation
• Structure: an address CAM plus a load meta-data RAM, with queue-management and squash-determination logic; inputs include store address/age, load address/age, and external (snoop) request addresses
• # of write ports = load address calculation width
• # of read ports = load + store address calculation width (+1 for external requests)
• Current-generation designs: 32-48 entries, 2 write ports, 2 (or 3) read ports

7. Load queue scaling
• Larger instruction window => larger load queue
• Increases access latency
• Increases energy consumption
• Wider issue width => more read/write ports
• Also increases latency and energy

8. Related work: MICRO 2003
• Park et al., Purdue
• Extra structure dedicated to enforcing memory consistency
• Increase capacity through segmentation
• Sethumadhavan et al., UT-Austin
• Add a set of filters summarizing the contents of the load queue

9. Keep it simple…
• Throw more hardware at the problem?
• Need to design/implement/verify it
• The execution core is already complicated
• The load queue checks for rare errors
• Why not move error checking away from the execution core?

10. Value-based ordering
• Replay: access the cache a second time - cheaply!
• Almost always a cache hit
• Reuse address calculation and translation
• Share the cache port used by stores in the commit stage
• Compare: compare the new value to the original value
• Squash if the values differ
• DIVA à la carte [Austin, MICRO 99]
• Pipeline diagram: IF1, IF2, D, R, Q, S, EX, REP, CMP, C, WB
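The replay-and-compare step can be sketched as follows. The function name and the dict-as-cache are illustrative assumptions, not the paper's interface; the point is only the value comparison at commit.

```python
def replay_and_compare(cache, address, original_value):
    """Commit-time replay check (illustrative): re-read the cache in
    program order and compare against the value the load originally
    obtained; squash and re-fetch on a mismatch."""
    return "commit" if cache[address] == original_value else "squash"

cache = {"A": 7}
ok = replay_and_compare(cache, "A", 7)    # value unchanged since first access
cache["A"] = 9                            # e.g. another processor wrote A
bad = replay_and_compare(cache, "A", 7)   # stale original value detected
```

A mismatch means the load's first, possibly mis-ordered access returned a value that is no longer correct, so the machine squashes rather than trying to identify which store conflicted.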

11. Rules of replay
• All prior stores must have written their data to the cache
• No store-to-load forwarding
• Loads must replay in program order
• If a cache miss occurs, all subsequent loads must be replayed
• If a load is squashed, it is not replayed a second time
• Ensures forward progress
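One of the rules above, that a cache miss during replay forces every subsequent load to replay even if a heuristic filter had marked it safe to skip, can be sketched like this. All names and the tuple layout are our assumptions.

```python
def apply_miss_rule(loads):
    """Illustrative model of the cache-miss replay rule. Each load is
    (name, filtered_out, replay_hits): filtered_out means a heuristic
    marked the load safe to skip; replay_hits means its replay access
    would hit the cache."""
    forced = False
    decisions = []
    for name, filtered_out, replay_hits in loads:
        replay = forced or not filtered_out
        decisions.append((name, "replay" if replay else "skip"))
        if replay and not replay_hits:
            forced = True  # a replay miss forces all later loads to replay
    return decisions

# (name, filtered-out-by-heuristic, replay-would-hit)
plan = apply_miss_rule([("ld1", False, True),
                        ("ld2", False, False),   # this replay misses
                        ("ld3", True,  True)])   # filter says skip, but…
```

In `plan`, ld3 ends up replayed despite its filter verdict, because ld2's replay missed before it.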

12. Replay reduction
• Replay costs:
• Consumes cache bandwidth (and power)
• Increases reorder buffer occupancy
• Can we avoid these penalties?
• Infer the correctness of certain operations
• Four replay filters

13. No-Reorder filter
• Avoid replay if the load wasn't reordered with respect to other memory operations
• Can we do better?

14. Enforcing single-thread RAW dependences
• No-Unresolved-Store-Address filter
• Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
• Works for intra-processor RAW dependences
• Doesn't enforce memory consistency
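A hedged sketch of this filter's decision, assuming simple per-instruction timestamps (our simplification; the hardware would track resolution with queue state, not times):

```python
def must_replay(load_issue_time, prior_store_resolve_times):
    """No-Unresolved-Store-Address filter (illustrative): replay load i
    only if some earlier store's address was still unresolved at the
    moment i issued."""
    return any(t > load_issue_time for t in prior_store_resolve_times)

safe = must_replay(5, [2, 3])    # all prior stores resolved first: skip replay
risky = must_replay(5, [2, 8])   # a prior store resolved after issue: replay
```

Loads that issued with all earlier store addresses known cannot have violated a same-thread RAW dependence, so they skip replay; as the slide notes, this says nothing about other processors' stores.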

15. Enforcing MP consistency
• No-Recent-Miss filter
• Avoid replay if there have been no cache-line fills (to any address) while the load was in the instruction window
• No-Recent-Snoop filter
• Avoid replay if there have been no external invalidates (to any address) while the load was in the instruction window
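The no-recent-snoop filter can be sketched as a single global counter; the no-recent-miss filter is analogous with a fill counter. Class and method names are ours, not the paper's.

```python
class NoRecentSnoopFilter:
    """Illustrative no-recent-snoop filter: one counter of external
    invalidates, deliberately coarse (any address counts)."""

    def __init__(self):
        self.invalidates = 0

    def external_invalidate(self):
        self.invalidates += 1

    def snapshot(self):
        # Taken when a load enters the instruction window.
        return self.invalidates

    def must_replay(self, snap):
        # Unchanged counter: no snoop could have raced this load.
        return self.invalidates != snap

f = NoRecentSnoopFilter()
snap = f.snapshot()          # load enters the window
quiet = f.must_replay(snap)  # no snoops arrived: replay can be skipped
f.external_invalidate()      # an invalidate arrives from another processor
noisy = f.must_replay(snap)  # now the load must replay
```

The coarseness is the design point: a single counter costs almost nothing, and snoops are rare enough that false positives (unnecessary replays) stay cheap.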

16. Constraint graph
• Defined for sequential consistency by Landin et al., ISCA-18
• A directed graph representing a multithreaded execution
• Nodes represent dynamic instruction instances
• Edges represent their transitive orders (program order, RAW, WAW, WAR)
• If the constraint graph is acyclic, then the execution is correct
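The acyclicity test is an ordinary cycle check over a directed graph. The sketch below uses standard DFS coloring; the node names echo the two-processor ST/LD example on the following slides, but the particular edge set is an illustrative assumption.

```python
def has_cycle(graph):
    """Detect a cycle in a constraint graph given as
    {node: [successor, ...]}. Standard white/gray/black DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def dfs(n):
        color[n] = GRAY  # on the current DFS path
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True  # back edge: a cycle exists
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

# Program-order edges plus dependence edges that close a loop (illustrative):
cyclic = has_cycle({"ST_A": ["LD_B"], "LD_B": ["ST_B"],
                    "ST_B": ["LD_A"], "LD_A": ["ST_A"]})
# Same nodes with the dependence edges removed:
acyclic = has_cycle({"ST_A": ["LD_B"], "LD_B": [],
                     "ST_B": ["LD_A"], "LD_A": []})
```

A cycle in the first graph signals an incorrect execution; the second graph is acyclic, so that ordering is permitted.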

17. Constraint graph example - SC
• Diagram: Proc 1 executes ST A then LD B (program order); Proc 2 executes ST B then LD A (program order); RAW and WAR edges between the processors close a cycle
• A cycle indicates that the execution is incorrect

18. Anatomy of a cycle
• Same diagram, annotated: the WAR edge coincides with an incoming invalidate at LD B, and the RAW edge with a cache miss at ST B

19. Enforcing MP consistency
• No-Recent-Miss filter
• Avoid replay if there have been no cache-line fills (to any address) while the load was in the instruction window
• No-Recent-Snoop filter
• Avoid replay if there have been no external invalidates (to any address) while the load was in the instruction window

20. Filter summary (conservative to aggressive)
• Conservative: replay all committed loads
• No-Reorder filter
• No-Unresolved-Store / No-Recent-Miss filter
• No-Unresolved-Store / No-Recent-Snoop filter (most aggressive)

21. Outline
• Conventional load queue functionality/microarchitecture
• Value-based memory ordering
• Replay-reduction heuristics
• Performance evaluation

22. Base machine model

23. % L1 D-cache bandwidth increase
• Chart over SPECint2000, SPECfp2000, commercial, and multiprocessor workloads, comparing (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter
• On average, 3.4% bandwidth overhead using the no-recent-snoop filter

24. Value-based replay performance (relative to a constrained load queue)
• Chart over SPECint2000, SPECfp2000, commercial, and multiprocessor workloads
• Value-based replay is 8% faster on average than a baseline using a 16-entry load queue

25. Value-based replay pros/cons
• Eliminates associative lookup hardware
• The load queue becomes a simple FIFO
• Negligible IPC or L1-D bandwidth impact
• Can be used to fix value prediction
• Enforces the dependence-order consistency constraint [Martin et al., MICRO 2001]
• Requires additional pipeline stages
• Requires an additional cache datapath for loads

26. The End
• Questions?

27. Backups

28. Does value locality help?
• Not much…
• Value locality does avoid memory ordering violations:
• 59% of single-thread violations avoided
• 95% of consistency violations avoided
• But these violations rarely occur:
• ~1 single-thread violation per 100 million instructions
• 4 consistency violations per 10,000 instructions

29. What about power?
• Simple power model:
ΔEnergy = #replays × (E_per cache access + E_per word comparison) + replay overhead - (E_per ldq search × #ldq searches)
• Empirically: 0.02 replayed loads per committed instruction
• If load queue CAM energy per instruction > 0.02 × the energy of a cache access plus comparison, the value-based implementation saves power!
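The break-even condition can be made concrete with a back-of-envelope calculation. Only the 0.02 replay rate comes from the slide; every energy number below is invented purely for illustration, and the replay-overhead term is ignored.

```python
# Per-committed-instruction energy comparison (all values in pJ, invented
# for illustration; only the 0.02 replay rate is from the slide).
replays_per_insn  = 0.02   # replayed loads per committed instruction (slide)
e_cache_access    = 50.0   # assumed energy of one extra cache access
e_word_compare    = 5.0    # assumed energy of one value comparison
searches_per_insn = 0.40   # assumed baseline load-queue searches per insn
e_ldq_search      = 10.0   # assumed energy of one associative CAM search

delta_energy = (replays_per_insn * (e_cache_access + e_word_compare)
                - searches_per_insn * e_ldq_search)  # overhead term ignored
saves_power = delta_energy < 0  # negative delta: value-based replay wins
```

Under these made-up numbers the replays cost 1.1 pJ/insn while the eliminated CAM searches cost 4.0 pJ/insn, so `delta_energy` is negative and the value-based scheme comes out ahead, matching the slide's qualitative argument.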

30. Caveat: memory dependence prediction
• Some predictors train using the conflicting store (e.g., the store-set predictor)
• The replay mechanism is unable to pinpoint the conflicting store
• Fair comparison:
• Baseline machine: store-set predictor with 4K-entry SSIT and 128-entry LFST
• Experimental machine: simple 21264-style dependence predictor with 4K-entry history table

31. Load queue search energy
• Chart based on a 0.09-micron process technology, using Cacti v3.2

32. Load queue search latency
• Chart based on a 0.09-micron process technology, using Cacti v3.2

33. Benchmarks
• MP (16-way)
• Commercial workloads (SPECweb, TPC-H)
• SPLASH2 scientific application (ocean)
• Error bars signify 95% statistical confidence
• UP
• 3 from SPECfp2000, selected due to high reorder buffer utilization: apsi, art, wupwise
• 3 commercial: SPECjbb2000, TPC-B, TPC-H
• A few from SPECint2000

34. Life cycle of a load
• Diagram: an out-of-order execution window of loads and stores with unresolved addresses (ST ?, LD ?); when a store to A resolves, the load-queue search finds an already-issued LD A ("Blam!") and squashes it

35. Performance relative to an unconstrained load queue
• Good news: replay with the no-recent-snoop filter is only 1% slower on average

36. Reorder-Buffer Utilization

37. Why focus on the load queue?
• The load queue has different constraints than the store queue
• More loads than stores (30% vs. 14% of dynamic instructions)
• The load queue is searched more frequently (consuming more power)
• Store-forwarding logic is performance-critical
• Many non-scalable structures in an OoO processor:
• Scheduler
• Physical register file
• Register map

38. Prior work: formal memory model representations
• Local, WRT, global "performance" of memory ops (Dubois et al., ISCA-13)
• Acyclic graph representation (Landin et al., ISCA-18)
• Modeling a memory operation as a series of sub-operations (Collier, RAPA)
• Acyclic graph + sub-operations (Adve, thesis)
• Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)
