
Coherence Decoupling: Making Use of Incoherence

Coherence Decoupling: Making Use of Incoherence. J. Huh, J. Chang, D. Burger, G. Sohi. ASPLOS 2004. Motivation: Multi-threading and multi-processing have become common. When a cache line is marked invalid, very often not all of the data in the line is incorrect.

Presentation Transcript


  1. Coherence Decoupling: Making Use of Incoherence J. Huh, J. Chang, D. Burger, G. Sohi ASPLOS 2004

  2. Motivation • Multi-threading and multi-processing have become common • When a cache line is marked invalid, very often not all of the data in the line is incorrect • If the data in invalid lines can be used speculatively, there is great potential for performance improvement

  3. Background Cache Coherence Protocol • Used in shared-memory multiprocessors for managing correct data sharing • Vital to the design of multiprocessors since it contributes the most to inter-processor communication latency

  4. Proposed Idea • Separate the traditional cache coherence protocol into two parts • Speculative cache lookup (SCL) – uses a speculative value from an invalid cache line, allowing the processor to keep working • Safe coherence protocol – obtains the correct value, which is then compared with the value provided by SCL
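The split described on this slide can be sketched in a few lines of Python. This is a minimal sequential model, not the authors' hardware design: `CacheLine`, `decoupled_load`, `fetch_coherent`, and `compute` are hypothetical names introduced here, and in a real processor the speculative work and the coherence request would overlap in time rather than run one after the other.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    valid: bool
    data: int  # last locally cached value; may be stale when valid is False


def decoupled_load(line, fetch_coherent, compute):
    """Return (result, mispredicted) for one load under coherence decoupling.

    fetch_coherent: callable standing in for the safe coherence protocol.
    compute: callable standing in for the instructions that consume the load.
    """
    if line.valid:
        return compute(line.data), False

    # Speculative cache lookup: run ahead with the stale value.
    speculative_value = line.data
    speculative_result = compute(speculative_value)

    # Safe coherence protocol: obtain the correct value and revalidate the line.
    correct_value = fetch_coherent()
    line.data, line.valid = correct_value, True

    if correct_value == speculative_value:
        return speculative_result, False  # speculation confirmed

    # Misprediction: discard the speculative work and redo it.
    return compute(correct_value), True
```

When the stale value matches the coherent one, the work done with it is kept; otherwise it is thrown away and repeated with the correct value, which is the comparison step the slide describes.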

  5. Coherence Decoupling

  6. Related Work • Customized coherence protocols • Speculative coherence operations: dynamic self-invalidation, coherence message predictors, token coherence, etc. • Speculation on the outcome of events in multiprocessor execution

  7. Coherence Decoupling Architecture Must support the following: • Split – a means to split a memory operation into a speculative load and a coherence operation • Compute – mechanisms to support execution with speculative values • Recover – a means to recover and roll back upon misprediction

  8. SCL Protocols for Coherence Decoupling • Use a simple safe coherence protocol and rely on an aggressive SCL protocol to increase performance • Two components of an SCL protocol • Read component – obtains the speculative value • Update component – updates an invalid cache line so subsequent speculative reads can use it (can be left out in some SCL protocols)

  9. Read vs Update components • An SCL protocol with only a read component can be used if the word in an invalid block has: • Not changed remotely (false sharing) • Changed remotely to the same value (silent stores) • Changed remotely to a different value and then back to the original value (temporally silent stores) • For truly shared data, an update component needs to be added • It speculatively sends data around the system by writing it into invalid cache lines
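The three cases in which a read-only SCL protocol succeeds can be checked with a small model. This is an illustrative sketch under simple assumptions: a cache line is a list of words, remote stores are `(word_index, value)` pairs applied in order, and the function name `stale_word_still_correct` is made up for this example.

```python
def stale_word_still_correct(stale_line, remote_writes, word_index):
    """Return True if the locally cached (invalid) word equals the current value.

    stale_line: list of words as last seen locally.
    remote_writes: list of (index, value) stores applied remotely, in order.
    word_index: the word the local processor wants to read speculatively.
    """
    current = list(stale_line)
    for index, value in remote_writes:
        current[index] = value
    return current[word_index] == stale_line[word_index]


stale = [1, 2, 3, 4]

# False sharing: a different word in the same line was written.
false_sharing = stale_word_still_correct(stale, [(0, 9)], word_index=2)

# Silent store: the same value was written again.
silent_store = stale_word_still_correct(stale, [(2, 3)], word_index=2)

# Temporally silent store: changed, then restored to the original value.
temporally_silent = stale_word_still_correct(stale, [(2, 5), (2, 3)], word_index=2)

# True sharing: the word really changed, so the stale value is wrong.
true_sharing = stale_word_still_correct(stale, [(2, 5)], word_index=2)
```

Only the last case leaves the stale word incorrect, which is why truly shared data needs an update component to refresh the invalid line.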

  10. SCL protocol Read component • CD – use the locally cached incoherent value for every L2 miss. Simple, but since it is triggered on every load operation it could produce many mis-speculations • CD-F – add a PC-indexed confidence predictor to filter speculations. Reduces the number of (mis)speculative reads, thus improving the average accuracy
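One plausible shape for the CD-F filter is a table of saturating counters indexed by the load's program counter. The slides do not give the predictor's organization, so the table size, counter width, threshold, index function, and reset-on-misprediction policy below are all assumptions chosen for illustration, as is the class name `ConfidencePredictor`.

```python
class ConfidencePredictor:
    """PC-indexed saturating-counter filter (hypothetical parameters)."""

    def __init__(self, entries=1024, max_count=3, threshold=2):
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        self.table = [0] * entries  # all loads start with no confidence

    def _index(self, pc):
        # Assumed index function: drop the low two bits, then wrap.
        return (pc >> 2) % self.entries

    def should_speculate(self, pc):
        """Speculate only when this load's counter has reached the threshold."""
        return self.table[self._index(pc)] >= self.threshold

    def train(self, pc, stale_value_was_correct):
        """Update the counter once the coherent value has arrived."""
        i = self._index(pc)
        if stale_value_was_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0  # assumed policy: reset on a wrong stale value
```

In this sketch the predictor can be trained even for loads it chose not to speculate on, because the stale word is still available for comparison when the correct value returns.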

  11. SCL protocol Update component • CD-IA – use invalidation piggyback to update all invalid blocks • CD-C – use invalidation piggyback if the value is compressed

  12. SCL protocol Update component (cont.) • CD-N – update all sharers after N writes to a block. Increases the number of messages (bandwidth) • CD-W – update on every write if any sharers exist. CD is assumed wherever write update is used
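The CD-N rule amounts to a per-block write counter at the writer. The sketch below is a simplified reading of the slide, not the authors' implementation: `WriteCounterUpdater` and its bookkeeping are hypothetical, and it assumes writes to blocks without sharers are not counted.

```python
class WriteCounterUpdater:
    """Decide when a writer should push an update to sharers (CD-N style)."""

    def __init__(self, n):
        self.n = n
        self.writes = {}  # block address -> writes since the last update

    def on_write(self, block_addr, has_sharers):
        """Return True if an update message should be sent for this write."""
        if not has_sharers:
            return False  # assumption: nothing to update, nothing to count
        count = self.writes.get(block_addr, 0) + 1
        if count >= self.n:
            self.writes[block_addr] = 0
            return True
        self.writes[block_addr] = count
        return False


cd_n = WriteCounterUpdater(n=3)
cd_n_decisions = [cd_n.on_write(0x40, has_sharers=True) for _ in range(6)]

# In this sketch, n=1 reproduces the CD-W rule: update on every write
# whenever sharers exist.
cd_w = WriteCounterUpdater(n=1)
cd_w_decisions = [cd_w.on_write(0x40, has_sharers=True) for _ in range(3)]
```

A smaller N keeps invalid copies fresher at the cost of more update messages, which is the bandwidth trade-off noted on the slide.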

  13. Methodology • Simulator: MP-Sauce & SimpleScalar • 16-node SMP systems simulated • Coherence protocol used: simple invalidation snooping-bus protocol • 3 commercial applications and 5 scientific shared-memory benchmarks from the SPLASH2 suite simulated

  14. Results – Microbenchmarks • Simple-fs – loads falsely shared data and then executes (in)dependent instructions • Critical-fs – forces data dependence between two loads by placing consecutive false-sharing misses on the critical path

  15. L2 Miss Profiling Results

  16. Coherence Decoupling Accuracy Results CD, CD-F, CD-IA, CD-C, CD-N, CD-W

  17. Timing Results

  18. Bandwidth Requirements

  19. Latency Tolerance Profiles • Executed instructions during coherence decoupling • The number of control dependent instructions will grow in future processors

  20. Conclusions • Coherence misses are a significant fraction of L2 misses, ranging from 10% to 80% • Coherence decoupling has the potential to hide the miss latency for 40% to 90% of coherence misses • Mis-speculation occurs 20% of the time
