Advanced Microarchitecture

Advanced Microarchitecture Lecture 11: Memory Scheduling

Executing Memory Instructions
Program order (every instruction issues, but not necessarily in this order):
• Load R3 = 0[R6] (cache miss!)
• Add R7 = R3 + R9
• Store R4 → 0[R7]
• Sub R1 = R1 - R2
• Load R8 = 0[R1] (cache hit!)
Miss serviced… But there was a later load…
• If R1 != R7, then Load R8 gets correct value from cache
• If R1 == R7, then Load R8 should have gotten value from the Store, but it didn’t!
Lecture 13: Memory Scheduling

Memory Disambiguation Problem
• Ordering problem is a data-dependence violation
• Why can’t this happen with non-memory insts?
  • Operand specifiers in non-memory insts are absolute
    • “R1” refers to one specific location
  • Operand specifiers in memory insts are ambiguous
    • “R1” refers to a memory location specified by the value of R1. As pointers change, so does this location.
• Determining whether it is safe to issue a load OOO requires disambiguating the operand specifiers
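
The hazard can be reproduced with a toy interpreter. The sketch below is only an illustration: the register and memory values are made up so that R1 == R7 after the Sub and Add, and each instruction is modeled as executing atomically in the listed order.

```python
def execute(order, regs, mem):
    """Run the five-instruction example in the given execution order; return R8."""
    regs, mem = dict(regs), dict(mem)          # work on copies
    for op in order:
        if op == "LOAD_R3":
            regs["R3"] = mem[regs["R6"]]
        elif op == "ADD_R7":
            regs["R7"] = regs["R3"] + regs["R9"]
        elif op == "STORE_R4":
            mem[regs["R7"]] = regs["R4"]
        elif op == "SUB_R1":
            regs["R1"] = regs["R1"] - regs["R2"]
        elif op == "LOAD_R8":
            regs["R8"] = mem[regs["R1"]]
    return regs["R8"]

PROGRAM_ORDER = ["LOAD_R3", "ADD_R7", "STORE_R4", "SUB_R1", "LOAD_R8"]
# Load R3 misses, so the independent Sub and Load R8 complete first.
OOO_ORDER = ["SUB_R1", "LOAD_R8", "LOAD_R3", "ADD_R7", "STORE_R4"]

# Hypothetical values chosen so that the store and the later load alias.
MEM = {0x100: 0x100, 0x110: 7}
ALIAS = {"R1": 0x210, "R2": 0x100, "R4": 99, "R6": 0x100, "R9": 0x10}
NO_ALIAS = dict(ALIAS, R9=0x20)                # store now goes to 0x120

in_order = execute(PROGRAM_ORDER, ALIAS, MEM)  # load sees the store's 99
reordered = execute(OOO_ORDER, ALIAS, MEM)     # load sees the stale 7
```

With NO_ALIAS the two orders agree, which is why the reordering is only sometimes a problem.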

Two Problems
• Memory disambiguation
  • Are there any earlier unexecuted stores to the same address as myself? (I’m a load)
  • Binary question: answer is yes or no
• Store-to-load forwarding problem
  • Which earlier store do I get my value from? (I’m a load)
  • Which later load(s) do I forward my value to? (I’m a store)
  • Non-binary question: answer is one or more instruction identifiers

Load Store Queue (LSQ)

          L/S  PC      Seq    Addr    Value
Oldest    L    0xF048  41773  0x3290  42
          S    0xF04C  41774  0x3410  25
          S    0xF054  41775  0x3290  -17
          L    0xF060  41776  0x3418  1234
          L    0xF840  41777  0x3290  -17
          L    0xF858  41778  0x3300  1
          S    0xF85C  41779  0x3290  0
          L    0xF870  41780  0x3410  25
          L    0xF628  41781  0x3290  0
Youngest  L    0xF63C  41782  0x3300  1

Data Cache

Addr    Value
0x3290  42, then -17
0x3300  1
0x3410  38, then 25
0x3418  1234
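
The values in the LSQ above follow from one rule: a load takes its value from the youngest older store to the same address, and from the cache otherwise. The sketch below replays that rule on the slide's entries. It assumes every older store has already executed and that the cache holds the pre-store values (42 and 38); the `Entry` type and `load_value` function are names invented for this illustration.

```python
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    kind: str             # "L" for load, "S" for store
    seq: int              # sequence number; lower means older
    addr: int
    data: Optional[int]   # store data; None for loads

LSQ = [
    Entry("L", 41773, 0x3290, None),
    Entry("S", 41774, 0x3410, 25),
    Entry("S", 41775, 0x3290, -17),
    Entry("L", 41776, 0x3418, None),
    Entry("L", 41777, 0x3290, None),
    Entry("L", 41778, 0x3300, None),
    Entry("S", 41779, 0x3290, 0),
    Entry("L", 41780, 0x3410, None),
    Entry("L", 41781, 0x3290, None),
    Entry("L", 41782, 0x3300, None),
]
CACHE = {0x3290: 42, 0x3300: 1, 0x3410: 38, 0x3418: 1234}

def load_value(lsq, cache, load_seq):
    """Value a load should see: youngest older matching store, else the cache."""
    load = next(e for e in lsq if e.kind == "L" and e.seq == load_seq)
    older_matches = [e for e in lsq
                     if e.kind == "S" and e.seq < load.seq and e.addr == load.addr]
    if older_matches:
        return max(older_matches, key=lambda e: e.seq).data
    return cache[load.addr]

values = {e.seq: load_value(LSQ, CACHE, e.seq) for e in LSQ if e.kind == "L"}
```

Load 41781 gets 0 from store 41779 rather than -17 from store 41775, because only the youngest older match counts.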

Most Conservative Policy
• No Memory Reordering
• LSQ still needed for forwarded data (last slide)
• Easy to schedule
• Least IPC, all memory executed sequentially
[Diagram: chain of LSQ entries, each raising “Ready!” with bid and grant signals]

Loads OOO Between Stores
• Let loads exec OOO w.r.t. each other, but no reordering past earlier unexecuted stores
[Diagram: LSQ entries S L L S L with rd/ex bits (S=0, L=1) and an “all earlier stores executed” signal]

Loads Wait for Only STA’s
• Stores normally don’t “Execute” until both inputs are ready: address and data
• Only address is needed to disambiguate
[Diagram: store entry S with separate “Address ready” and “Data ready” bits, followed by load L]

Loads Execute When Ready
• Most aggressive approach
• Relies on fact that store→load forwarding is not the common case
• Greatest potential IPC: loads never stall
• Potential for incorrect execution
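
The four scheduling policies differ only in what a ready load must wait for. A minimal sketch follows, assuming an oldest-first list of LSQ entries with hypothetical `kind`, `addr_known` and `executed` fields; it checks ordering constraints only and leaves address comparison and forwarding out.

```python
from enum import Enum

class Policy(Enum):
    NO_REORDER = 1            # all memory ops strictly in order
    AFTER_STORES_EXECUTE = 2  # loads OOO between stores
    AFTER_STAS = 3            # loads wait only for earlier store addresses
    WHEN_READY = 4            # fully speculative

def load_may_issue(lsq, idx, policy):
    """lsq: oldest-first list of dicts; idx: position of a ready load."""
    older = lsq[:idx]
    if policy is Policy.NO_REORDER:
        return all(e["executed"] for e in older)
    if policy is Policy.AFTER_STORES_EXECUTE:
        return all(e["executed"] for e in older if e["kind"] == "S")
    if policy is Policy.AFTER_STAS:
        return all(e["addr_known"] for e in older if e["kind"] == "S")
    return True               # WHEN_READY: may be wrong, needs violation detection

lsq = [
    {"kind": "S", "addr_known": True, "executed": False},  # STA done, STD pending
    {"kind": "L", "addr_known": True, "executed": False},
    {"kind": "L", "addr_known": True, "executed": False},  # the load being asked about
]
decisions = {p.name: load_may_issue(lsq, 2, p) for p in Policy}
```

Under the STA-only policy the load still has to compare its address with the older stores once their addresses are known; the predicate above only says when that comparison becomes possible.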

Detecting Ordering Violations
• Case 1: Older store execs before younger load
  • No problem; if same address, st→ld forwarding happens
• Case 2: Older store execs after younger load
  • Store scans all younger loads
  • Address match → ordering violation

Detecting Ordering Violations (2)
• Store broadcasts value, address and sequence #
• Loads CAM-match on address, only care if store seq-# is lower than own seq
  • (Load 41773 ignores because it has a lower seq #)
• IF younger load hadn’t executed, and address matches, grab b’casted value
• IF younger load has executed, and address matches, then ordering violation!
  • Grab value, flush pipeline after load
• An instruction may be involved in more than one ordering violation

L/S  PC      Seq    Addr    Value
L    0xF048  41773  0x3290  42
S    0xF04C  41774  0x3410  25
S    0xF054  41775  0x3290  -17       broadcasts (-17, 0x3290, 41775)
L    0xF060  41776  0x3418  1234
L    0xF840  41777  0x3290  -17
L    0xF858  41778  0x3300  1
S    0xF85C  41779  0x3290  0         broadcasts (0, 0x3290, 41779)
L    0xF870  41780  0x3410  25
L    0xF628  41781  0x3290  42, then -17
L    0xF63C  41782  0x3300  1
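
The broadcast check can be sketched in a few lines. This is an illustration under simplifying assumptions: loads are plain dicts with hypothetical field names, stores are assumed to broadcast in program order, and the recovery itself (the flush) is not modeled.

```python
def store_broadcast(loads, store_seq, store_addr, store_value):
    """Broadcast an executing store to all load entries.

    Returns the sequence numbers of loads that had already executed
    and therefore consumed a stale value (ordering violations).
    """
    violations = []
    for ld in loads:
        if ld["seq"] < store_seq:      # load is older than the store: ignore
            continue
        if ld["addr"] != store_addr:   # CAM match on address failed
            continue
        if ld["executed"]:
            violations.append(ld["seq"])
        ld["value"] = store_value      # grab the broadcast value either way
    return violations

# Three of the loads to 0x3290 from the slide.
loads = [
    {"seq": 41773, "addr": 0x3290, "executed": True,  "value": 42},
    {"seq": 41777, "addr": 0x3290, "executed": False, "value": None},
    {"seq": 41781, "addr": 0x3290, "executed": True,  "value": 42},
]
first = store_broadcast(loads, 41775, 0x3290, -17)
after_first = [ld["value"] for ld in loads]
second = store_broadcast(loads, 41779, 0x3290, 0)
after_second = [ld["value"] for ld in loads]
```

Load 41781 shows up in both returned lists, matching the slide's remark that one instruction can be involved in more than one violation.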

Dealing with Misspeculations
• Instructions using the load’s stale/wrong value will propagate more wrong values
• These must somehow be re-executed
• Easiest: flush all instructions after (and including?) the misspeculated load, and just refetch
  • Load uses forwarded value
  • Correct value propagated when instructions re-execute

Recovery Complications
• When flushing only part of the pipeline (everything after the load), RAT must be repaired to the state just after the load was renamed
• Solutions?
  • Checkpoint at every load
    • Not so good: between loads and branches, a very large number of checkpoints is needed
  • Rollback to previous branch (which has its own checkpoint)
    • Make sure load doesn’t misspeculate on 2nd time around
    • Have to redo the work between the branch and the load, which was all correct the first time around
  • Works with undo-list style of recovery

Flushing is Expensive
• Not all later instructions are dependent on the bogus load value
• Pipeline latency due to refetch is exposed
• Hunting down RS entries to squash is tricky

Selective Re-Execution
• Ideal case w.r.t. maintaining high IPC
• Very complicated
  • need to hunt down only data-dependent insts
  • messier because some instructions may have already executed (now in ROB) while others may not have executed yet (still in RS)
  • iteratively walk dependence graph?
  • use some sort of load/store coloring scheme?
• P4 uses replay for load-latency misspeculation
  • But replay wouldn’t work in this case (why?)

Load/Store Execution
• “SimpleScalar” style
• Store: crack at dispatch into an ea-comp part and a st-data part
  • Independently schedule
  • Independently execute
  • Store “complete”: forward value to later loads
• Load is similar, but LD-data portion is data-dependent on the LD ea-comp
[Diagram: at alloc/schedule, each Store and Load places an “Add” ea-comp op in the RS and a st-data or ld-data entry in the LSQ; a timeline shows the store’s ea and D parts executing separately]

Complications
• LSQ needs data-capture support
  • Store Data needs to capture value
  • EA-comps can write to LSQ entries directly using LSQ index (no associative search)
• Store normally doesn’t have a dest; overload field for LSQ index
• Load ea-comp done the same; Load’s LSQ entry handles “real” destination tag broadcast

op     dest   srcL  srcR
ADD    T17    T12   T43
St-ea  Lsq-5  T18   #0

[Diagram: RS holding add, L-ea, xor, S-ea; LSQ holding Ld-d and St-d]

Complications (2)
• Load must bid/select twice
  • once for ea-comp portion
  • once for cache access (includes LSQ check)
• Data cache and LSQ search in parallel
[Diagram: Ld-ea in RS → Select → Ea-comp Exec; Ld-data in LSQ → Select → Data Cache]

Load/Store Execution
• “Pentium” Style
• STA and STD still execute independently
• LSQ does not need data-capture
  • uses RS’s data-capture (for data-capture scheduler)
  • or RS → PRF → LSQ
• Potentially adds a little delay from STD-ready to ST→LD forwarding
[Diagram: at dispatch/alloc, a Store places STA and STD in the RS and a “store” entry in the LSQ; a Load places LD in the RS and a “load” entry in the LSQ]

Load Execution
• Only one select/bid
• Load queue part doesn’t “execute”, but just holds address for detecting ordering violations
• LSQ search in parallel
[Diagram: Load in RS → Select → Ea-comp Exec → Data Cache, with the Load’s LSQ entry searched alongside]

Store Execution
• STA and STD independently issue from RS
  • STA does ea comp
  • STD just reads operand and moves it to the LSQ
• When both have executed and reached the LSQ, then perform LSQ search for younger loads that have already executed (i.e., ordering violations)

LSQ Hardware in More Detail
• CAM logic: harder than regular scheduler because we need address + age information
• Age information not needed for physical registers since register renaming guarantees one writer per address
• No easy way to prevent more than one store to the same address

Loads checking for earlier matching stores
• Use this store = valid store AND addr match AND no earlier matches
• If |LSQ| is large, logic can be adapted to have log delay
• Need to adjust this so that load need not be at bottom, and that LSQ can wrap-around
[Diagram: Address Bank and Data Bank with entries ST 0x4000, ST 0x4000, ST 0x4120, LD 0x4000 (load at the bottom); each store entry has an “=” comparator against the load address]
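
In software terms, the adjustment for wrap-around amounts to walking the circular buffer backward from the load toward the oldest entry and stopping at the first matching store. The sketch below is a sequential model of that priority search, not the parallel hardware; `find_forwarding_store` and its dict fields are names invented for this example.

```python
def find_forwarding_store(lsq, head, load_idx):
    """Return the index of the closest older store with the load's address.

    lsq:      fixed-size circular buffer of dicts (or None for empty slots)
    head:     index of the oldest valid entry
    load_idx: index of the load doing the search
    """
    n = len(lsq)
    target = lsq[load_idx]["addr"]
    i = load_idx
    while i != head:                   # stop once the oldest entry has been checked
        i = (i - 1) % n                # step to the next-older entry, wrapping
        e = lsq[i]
        if e is not None and e["kind"] == "S" and e["addr"] == target:
            return i
    return None

# The slide's four entries, placed so that the queue wraps around:
# oldest entry at index 2, load at index 1.
lsq = [
    {"kind": "S", "addr": 0x4120},     # index 0
    {"kind": "L", "addr": 0x4000},     # index 1 (youngest)
    {"kind": "S", "addr": 0x4000},     # index 2 (oldest, head)
    {"kind": "S", "addr": 0x4000},     # index 3
]
hit = find_forwarding_store(lsq, head=2, load_idx=1)
```

The search returns index 3, the younger of the two stores to 0x4000, even though the load sits physically above it in the array.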

Data Forwarding
• Similar logic to previous slide
• This logic is ugly, complicated, slow and power hungry!
[Diagram: Data Bank with entries ST 0x4000, ST 0x4120, LD 0x4000, ST 0x4000; per-entry signals “Is Load”, “Addr Match”, “Overwritten” and “Capture Value”]

Alternative: Store Colors
• Each store is assigned a unique, increasing number (its color)
• Loads inherit the color of the most recently alloc’d store
• Ignore store broadcasts if store’s color > your own
• Special care is needed to deal with the eventual overflow/wrap-around of the color/age counter
[Diagram: a stream of St and Ld entries labeled Color=1 through Color=4. Three consecutive loads have the same color: they only care about ordering w.r.t. stores, not other loads]
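
A minimal sketch of the coloring rule, assuming an unbounded counter (so the overflow/wrap-around problem is deliberately ignored) and an instruction stream given as a list of "S" and "L" markers. Both function names are made up for this illustration.

```python
def assign_colors(kinds):
    """Give each store a fresh color; each load inherits the latest store's color."""
    color = 0                     # loads before any store get color 0
    colors = []
    for kind in kinds:
        if kind == "S":
            color += 1            # unique, increasing color per store
        colors.append(color)
    return colors

def load_cares_about(load_color, store_color):
    """A load ignores broadcasts from stores with a higher color than its own."""
    return store_color <= load_color

stream = ["S", "L", "S", "L", "L", "L", "S", "L"]
colors = assign_colors(stream)
```

The three consecutive loads all receive color 2, so they are ordered against stores 1 and 2 but not against each other or against store 3.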

Don’t Make Stores Forward
• When load receives data, it still needs to wakeup its dependents… value not needed until dependents make it to execute stage
• Alternative timing/implementation:
  • Broadcast address only
  • When load wakes up, search LSQ again (should hit now)

Store → Load → Op Timing
• Load predicted dependent on store: waits for STA
• Ideal case: std/sta in cycle i, LD in cycle i+1, add in cycle i+2
• With decoupled scheduling: even if load value is ready, dependent op hasn’t been scheduled
• Re-search: when the LD wakes up, it searches the LSQ
• No performance benefit for direct ST→LD forwarding at time of address broadcast
[Timing diagrams over cycles i to i+4 for the three cases; in the decoupled and re-search cases the add passes through S, X and E stages]

LSQ is Full Of Associative Searches
• We should all know by now that associative searches do not scale well
• So how do we manage this?

Split Load Queue/Store Queue
• Stores don’t need to b’cast address to stores
• Loads don’t need to check for collisions against earlier loads
• Load Queue (LDQ): associative search for later loads for ST→LD forwarding only needs to check entries that actually contain loads
• Store Queue (STQ): associative search for earlier stores only needs to check entries that actually contain stores

Load Execution
• Load issue → EA computation → DL1 access and LSQ search in parallel
• Typical Latencies
  • DL1: 3 cycles
  • LSQ search: 1 cycle (more?)
• Remember: instructions are speculatively scheduled!

Load Execution (2)
• Pipeline timing assuming DL1 hit
  • LOAD: S X X X E E E
  • ADD: S X X X E
• Pipeline timing assuming LSQ hit
  • LOAD: S X X X E
  • ADD: S X X X E
• But at time of scheduling, how do we know LSQ hit vs. DL1 hit?

Load Execution (3)
• Can predict latency
  • similar to predicting L1 hit vs. L2 hit vs. going to DRAM
  • If predict LSQ hit but wrong → scheduling replay
  • If predict L1 hit but wrong → waste a few cycles
• Normalize latencies
  • Make LSQ hit and L1 hit have same latency
  • Greatly simplifies scheduler
  • Loses some performance since in theory you could do ST→LD forwarding in less time than the L1 latency
  • Loss is not too great since most loads do not hit in LSQ

Reducing Ordering Violations
• Dependence violations can be predicted
• Ordering violation detected: make note of it!
• Next time around don’t let B issue before previous STA’s known
• All previous STA’s known; now it’s safe to issue
• Table has finite number of entries; eventually all will be set to “do not speculate” → equivalent to machine with no ordering speculation
[Diagram: instructions A, Z, X and B beside a table of 0/1 bits, shown before and after the violation is recorded]

Dealing with “Full” Table
• Do similar to branch predictors: use counters
• asymmetric costs
  • mispredicting T-branch as NT, or NT-branch as T makes no difference; need to flush and re-fetch either way
  • predicting a no-conflict load as conflict causes load to stall unnecessarily, but other insts may still execute
  • predicting a conflict as no-conflict causes pipeline flush
• asymmetric frequencies
  • no conflict loads much more common than conflicting loads

Dealing with “Full” Tables (2)
• Asymmetric updates
  • when no ordering violation, decrement counter by 1
  • on ordering violation, increment by X > 1
  • choose X based on frequency of misspeculations and penalty/performance cost of misspeculation
• Periodic reset
  • Every K cycles, reset the entire table
  • Works reasonably well, lower hardware cost than using saturating counters
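
Both remedies fit in a small class. The sketch below is one possible reading of the slide, not a description of any real predictor: the table size, the index hash, the penalty X = 4, the saturation limit and the "counter > 0 means wait" rule are all assumptions made for the example.

```python
class ConflictPredictor:
    """PC-indexed table of saturating counters with asymmetric updates."""

    def __init__(self, entries=1024, penalty=4, max_count=15):
        self.entries = entries
        self.penalty = penalty        # the slide's X > 1
        self.max_count = max_count
        self.table = [0] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries   # hypothetical hash of the load PC

    def predict_conflict(self, pc):
        """True means: do not let this load issue ahead of unknown STAs."""
        return self.table[self._index(pc)] > 0

    def update(self, pc, violated):
        i = self._index(pc)
        if violated:
            self.table[i] = min(self.max_count, self.table[i] + self.penalty)
        else:
            self.table[i] = max(0, self.table[i] - 1)

    def periodic_reset(self):
        """The cheaper alternative: clear everything every K cycles."""
        self.table = [0] * self.entries

predictor = ConflictPredictor(penalty=4)
pc = 0xF840
predictor.update(pc, violated=True)      # one violation: counter jumps to 4
history = []
for _ in range(4):                       # four clean executions decay it to 0
    history.append(predictor.predict_conflict(pc))
    predictor.update(pc, violated=False)
history.append(predictor.predict_conflict(pc))
```

A single violation keeps the load conservative for X subsequent executions, which reflects the higher cost of a flush compared with an unnecessary stall.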

Store-Load Pair Prediction
• Explicitly remember which load conflicted with which store
• Ordering violation detected: make note of it!
• Next time around don’t let B issue before A’s STA is known (don’t have to wait for X and Z)
• A’s STA is known, but X and Z still unknown; it’s hopefully safe to issue
[Diagram: instructions A, Z, X and B, with B linked to A]

Store Sets Prediction
• A load may have conflicts with more than one previous store
  • basic block #1, A: Store R1 → 0x4000
  • basic block #2, B: Store R4 → 0x4000
  • basic block #3, C: Load R2 ← 0x4000

Store Sets Prediction (2)
• Another ordering violation detected: make note of it!
• Next time around don’t let C issue before A&B’s STA’s are known (don’t have to wait for Z)
• A&B’s STA’s are known, but Z still unknown; it’s hopefully safe to issue
[Diagram: instructions A, Z, B and C, with C linked to both A and B]

Store Sets Implementation
• Store Sets Identification Table (SSIT)
  • PC hash into SSIT; entry indicates store set
  • A, B, C belong to same store set
• Last Fetched Store Table (LFST)
  • A fetched: SSIT lookup → SSID = 4; update LFST w/ LSQ index (LFST entry 4 = A:L12)
  • C fetched: SSIT lookup → SSID = 4; LFST says load should wait on LSQ entry 12 before issuing
  • If B fetched before C, then B waits on A, updates LFST, then C will wait on B
[Diagram: SSIT with entries for PCs A through E; LFST with entries 0 through 5]
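
The two tables can be modeled with a pair of dictionaries. This is a simplified sketch rather than a faithful implementation: the PC hash, the table size and the rule for assigning or merging SSIDs on a violation are assumptions, and clearing LFST entries when a store issues is left out.

```python
class StoreSets:
    def __init__(self, ssit_size=1024):
        self.ssit_size = ssit_size
        self.ssit = {}        # SSIT index -> store set ID (SSID)
        self.lfst = {}        # SSID -> LSQ index of the last fetched store
        self.next_ssid = 0

    def _index(self, pc):
        return (pc >> 2) % self.ssit_size     # hypothetical PC hash

    def on_violation(self, load_pc, store_pc):
        """Put the conflicting load and store in the same store set."""
        li, si = self._index(load_pc), self._index(store_pc)
        ssid = self.ssit.get(li)
        if ssid is None:
            ssid = self.ssit.get(si)
        if ssid is None:
            ssid = self.next_ssid
            self.next_ssid += 1
        self.ssit[li] = self.ssit[si] = ssid

    def fetch_store(self, pc, lsq_index):
        """Return the LSQ index this store should wait on (or None)."""
        ssid = self.ssit.get(self._index(pc))
        if ssid is None:
            return None
        wait_on = self.lfst.get(ssid)
        self.lfst[ssid] = lsq_index           # this store is now the last fetched
        return wait_on

    def fetch_load(self, pc):
        """Return the LSQ index this load should wait on (or None)."""
        ssid = self.ssit.get(self._index(pc))
        return None if ssid is None else self.lfst.get(ssid)

# Made-up PCs for the slide's A, B (stores), C (load) and an unrelated D.
A, B, C, D = 0x1000, 0x1010, 0x1020, 0x1030
sets = StoreSets()
sets.on_violation(load_pc=C, store_pc=A)
sets.on_violation(load_pc=C, store_pc=B)
a_waits = sets.fetch_store(A, lsq_index=12)   # first store of the set
c_waits_on_a = sets.fetch_load(C)             # C fetched right after A
b_waits = sets.fetch_store(B, lsq_index=13)   # B fetched before C
c_waits_on_b = sets.fetch_load(C)
```

Chaining B behind A and C behind B serializes the members of a set, so each instruction only needs to track one predecessor.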

Note on Dependence Prediction
• Few processors actually support this
  • 21264 did; used the “load wait table”
  • Core 2 supports this now… so this is becoming much more important
• Many machines only use wait-for-earlier-STAs approach
  • becomes bottleneck as instruction window size increases
