
PEEP: Exploiting Predictability of Memory Dependences in SMT Processors


Presentation Transcript


  1. PEEP: Exploiting Predictability of Memory Dependences in SMT Processors Samantika Subramaniam, Milos Prvulovic, Gabriel H. Loh

  2. Simplified view of SMT execution: instructions from multiple threads flow from the Icache through the shared front-end into the reservation stations and on to the execution units. The core stores per-thread state; with enough work from all threads put together, it achieves high throughput.

  3. Something bad happens… A producer instruction stalls, and the low-ILP thread gradually uses up the shared CPU resources; other independent, high-ILP threads are forced to stall. This defeats the purpose of SMT. Tackle the problem at the source: the FETCH UNIT.

  4. Previously proposed solution: ICOUNT (Instruction Count) [Tullsen et al., ISCA 1996]. Count the number of instructions in the pipeline per thread, and give lower fetch priority to threads with more instructions in flight. By the time a thread's count grows, however, it has already clogged the shared resources: ICOUNT is REACTIVE EXCLUSION!
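  As a rough illustration, here is a minimal software sketch of the ICOUNT priority rule (real hardware uses per-thread counters and priority logic; the thread IDs and counts below are hypothetical):

```python
def icount_pick(inflight):
    """ICOUNT fetch policy: fetch from the thread with the fewest
    instructions currently in the pipeline."""
    # inflight: dict mapping thread id -> in-pipeline instruction count
    return min(inflight, key=inflight.get)

# Hypothetical snapshot: thread 2 has clogged the pipeline,
# so fetch priority goes to the least-represented thread.
print(icount_pick({0: 12, 1: 7, 2: 30, 3: 9}))  # -> 1
```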

  5. So can we do better? An "oracle" that knew a thread was about to stall could keep it out of the pipeline before it clogs the shared resources: PROACTIVE EXCLUSION!

  6. Proactive Exclusion (PE) Strategies
  • Load misses [El-Moursy et al., HPCA 2003]: GATE a thread on a predicted load miss
  • MLP [Eyerman et al., HPCA 2007]: GATE a thread once all of its available MLP is exposed
  • Memory dependences (this work)

  7. A Brief Overview of Memory Dependences. The LSQ holds in-flight memory operations; the memory dependence predictor (MDP), indexed by load PC, predicts whether a load will depend on an earlier store. Example LSQ contents:

  INST    ADDR
  ST 1    0xF023
  LD 1    0xF380
  ST 2    ?       (address not yet resolved)
  LD 2    0xF060

  Does LD 2 depend on ST 2? Memory dependences are highly predictable, so the predictor can indicate future stalls.
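  As an illustration, a minimal sketch of such a PC-indexed dependence predictor (the table size and last-outcome update rule are illustrative assumptions, loosely in the spirit of a load wait table, not the paper's exact design):

```python
class MemDepPredictor:
    """PC-indexed table predicting whether a load will depend on
    an earlier, unresolved store (illustrative sketch)."""
    def __init__(self, size=1024):
        self.size = size
        self.table = [False] * size  # True = predict a dependence

    def predict(self, load_pc):
        return self.table[load_pc % self.size]

    def train(self, load_pc, had_dependence):
        # Simple last-outcome update; real designs may use counters,
        # store sets, or periodic clearing instead.
        self.table[load_pc % self.size] = had_dependence
```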

  8. Proactive Exclusion using Memory Dependences: across threads T0-T3, the predictor learns ST-LD relationships. When a fetched load (LD A) is predicted to depend on a store whose address is still unresolved (ST ?), the load's thread is excluded from fetch.

  9. Starvation: the Problem with Proactive Exclusion. When the stall resolves, the gated thread's instructions still have to refill the front-end before they can enter the reservation stations, so exclusion (under any strategy) can cause temporary STARVATION. Especially bad for short-duration stalls!

  10. Short-Duration Stalls. [Timeline figure: "Original" vs. "Original + PE" with the memory dependence predictor. The store address (ST ?) resolves quickly, so in the original machine the dependent load (LD A) and the following ADD/SUB barely wait; with PE, gating the thread costs more than the short stall itself.]

  11. Predictability of Memory Disambiguation Latency. Can we avoid starvation? With PE based on memory dependences we can: extend the memory dependence predictor with a delay field (e.g., the entry for PC 1 predicts a dependence with a 20-cycle delay), so the predictor indicates not only that a stall will occur but also how long it will last.

  12. Delay Predictor Details. The memory dependence predictor stores a predicted delay alongside each prediction (e.g., PC 1 -> 20 cycles). Three policies, as sketched below:
  • Conservative: maximum observed delay
  • Aggressive: last observed delay
  • Adaptive: average of the last 'n' observed delays
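  A minimal sketch of the three policies, extending the predictor with per-PC delay history (the history length n = 4 is an assumed parameter):

```python
from collections import deque

class DelayPredictor:
    """Per-load-PC stall-delay prediction under the three policies
    above (sketch; n is an assumed history length)."""
    def __init__(self, policy="adaptive", n=4):
        self.policy = policy
        self.n = n
        self.history = {}  # load PC -> recent observed delays

    def train(self, pc, observed_delay):
        self.history.setdefault(pc, deque(maxlen=self.n)).append(observed_delay)

    def predict(self, pc):
        h = self.history.get(pc)
        if not h:
            return 0  # no information yet: predict no delay
        if self.policy == "conservative":
            return max(h)           # maximum observed delay
        if self.policy == "aggressive":
            return h[-1]            # last observed delay
        return sum(h) // len(h)     # adaptive: average of last n delays
```

  Note that with a bounded history the "conservative" policy tracks the maximum over the last n observations rather than over all time; a real table would keep a single running maximum.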

  13. How does this help us? [Timeline figure: "Original" vs. "Original + PE" with the memory dependence predictor.] With delay information we can choose an appropriate delay threshold: stalls predicted to be shorter than the threshold are ignored, and only long stalls trigger exclusion.

  14. Performance Impact of Delay Information. Phase 1: store A:ST 1 enters the LSQ with an unresolved address, followed by dependent load B:LD 1. After 20 cycles the store's address (0xF060) resolves and the load issues to the execution units; the MDP entry for B is trained with prediction P = 1 and delay D = 20. Later instances (A:ST 21, B:LD 21) follow the same pattern.

  15. Phase 2 (delay threshold = front-end depth = 5): when the next instance, B:LD 21, is fetched, the MDP predicts a dependence (P = 1) with a 20-cycle delay (D = 20). Since 20 exceeds the threshold, the load's thread is gated at the front end.

  16. PE without delay information. Phase 3 (front-end depth = 5): fetch restarts only when the stall resolves at cycle 20, so the thread's instructions enter the reservation stations at cycle 25, five cycles after the stall has already resolved.

  17. PE with delay information. Phase 3 (delay threshold = front-end depth = 5): fetch restarts at cycle 15, five cycles early, so the instructions enter the reservation stations at cycle 20, right as the stall resolves.

  18. PEEP: What does this give us?
  • Proactive Exclusion: gate a thread when a memory dependence stall is predicted
  • Starvation avoidance: ignore short stalls, and give the gated thread a head start by restarting its fetch a few cycles before the stall resolves. Early Parole!
  PROACTIVE EXCLUSION AND EARLY PAROLE

  19. PEEP In Our Context: the memory dependence and delay predictor drives the fetch unit. For a predicted 20-cycle stall and a front-end pipeline depth of 5, the gated thread is paroled after (predicted delay - FE pipeline depth) = 15 cycles, as sketched below.
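  Putting the pieces together, a minimal sketch of the PEEP gating decision per fetched load, reusing the predictor sketches above (the threshold and front-end depth are the slides' example values, not prescribed constants):

```python
FRONT_END_DEPTH = 5               # example value from the slides
DELAY_THRESHOLD = FRONT_END_DEPTH

def peep_gate(mdp, delay_pred, pc, now):
    """Return the cycle at which the load's thread may fetch again,
    or None if the thread should not be gated (sketch)."""
    if not mdp.predict(pc):
        return None                      # no dependence predicted
    delay = delay_pred.predict(pc)
    if delay <= DELAY_THRESHOLD:
        return None                      # short stall: ignore it
    # Early parole: restart fetch FE-depth cycles before the predicted
    # stall resolution, so refilled instructions reach the reservation
    # stations just as the stall resolves (e.g., 20 - 5 = 15).
    return now + delay - FRONT_END_DEPTH
```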

  20. Simulation Parameters
  • Aggressive four-way SMT processor
  • MDP modeled on a Load Wait Table
  • SPEC2000, MediaBench, and other benchmarks
  • 32 four-thread application mixes evaluated
  • Application classification: S (sensitive to memory dependences), N (non-sensitive), L (low-ILP), M (medium-ILP), H (high-ILP)

  21. Proactive Exclusion Strategies (results)
  • PE using memory dependences shows a 13% speedup
  • The maximum benefit comes from mixes with both sensitive (S) and non-sensitive (N) threads
  • With all-sensitive mixes, all PE strategies perform comparably

  22. PEEP (results)
  • PEEP with delay prediction achieves a 17% speedup, outperforming the MLP and PE-mdep policies
  • With all-sensitive mixes, PEEP does better because it predicts stall durations accurately
  • PEEP with an oracle-based MDP shows a 19% speedup

  23. 2-Threaded Workloads
  • Fewer threads mean fewer opportunities to fetch from non-stalled threads
  • A 12% speedup over 25 application mixes shows there is benefit even in a 2-way SMT
  • An Intel simulator shows an 8% speedup over 150 application mixes

  24. Relationship with OOO Load Scheduling
  Hypothesis: the performance benefit is purely due to a more efficient fetch policy based on a highly predictable attribute.
  Experiment: run PEEP on a processor without OOO memory scheduling, using the prediction only to control the fetch policy.
  Result: average speedup over ICOUNT is 17%, the same as full PEEP.
  Conclusion: memory dependences are a very good indicator of future stalls; even a machine without load reordering benefits from predicting them.

  25. Why does it work so well? [Figure: reservation-station snapshots comparing the MLP-aware policy and PEEP for a sequence of loads (LD 1-LD 4) around a store (ST 1) with an unresolved address, showing which instructions each policy lets into the reservation stations.]

  26. [Figure: reservation-station snapshots comparing the MLP-aware policy and PEEP for a mix containing independent ALU work (ADD, SUB); by gating the thread with the predicted-dependent load (LD 2 on ST 1), PEEP leaves room for independent instructions.] PEEP can expose more ILP.

  27. Key Points
  • SMT needs a mechanism for efficient resource management: improve the fetch unit
  • Memory dependences and their associated latencies are predictable
  • Proactively Exclude "bad" threads, but give them Early Parole to avoid temporary starvation
  • Performance improves on both 4-way and 2-way SMT machines

  28. Thank You. www.cc.gatech.edu/~samantik [Cartoon: a line of LDs asking, "When will I get paroled?"]

  29. B1: Sensitivity Analysis

  30. [Charts: sensitivity to predictor size and to delay threshold.]

  31. B2: PEEP* (17.3%)
  • Memory dependences are a very good indicator of future stalls
  • The results show that PEEP works because it leverages knowledge of future stalls to improve instruction fetch

  32. B3: Fairness (19%)
  • Speedup is computed as the harmonic mean of weighted IPCs
  • Since all PE strategies run on top of ICOUNT, they inherit its fairness
  • The standard deviation of speedup (SDS) is ~0.17 for PEEP and ~0.11 for ICOUNT

  33. B4: OOO memory scheduling on SMT machine

  34. B5: Accuracy of MDP

  35. B6: Delays associated with PEEP

  36. B7: Delay Predictors
  • Conservative: maximum observed delay
  • Aggressive: last observed delay
  • Adaptive: average of the last 'n' observed delays

  37. B8: Simulator Configuration

  38. 4-threaded mixes

  39. 2-threaded mixes
