
The Locality-Aware Adaptive Cache Coherence Protocol

George Kurian¹, Omer Khan², Srini Devadas¹ (¹ Massachusetts Institute of Technology, ² University of Connecticut, Storrs)


Presentation Transcript


  1. The Locality-Aware Adaptive Cache Coherence Protocol • George Kurian¹, Omer Khan², Srini Devadas¹ • ¹ Massachusetts Institute of Technology • ² University of Connecticut, Storrs

  2. Cache Hierarchy Organization: Directory-Based Coherence • Private caches: 1 or 2 levels • Shared cache: last-level • Concurrent reads lead to replication in private caches • Directory maintains coherence for replicated lines • [Figure: a write miss in a private cache (write word), the shared cache + directory, and two sharers; steps numbered 1 to 4]

  3. Private Caching: Advantages & Drawbacks • Advantages: exploits spatio-temporal locality • Efficient low-latency local access to private + shared data (cache line replication) • Drawbacks: inefficiently handles data with LOW spatio-temporal locality • Working set > private cache size • Inefficient cache utilization (cache thrashing) • Unnecessary fetch of entire cache line • Shared data replication increases working set

  4. Private Caching: Advantages & Drawbacks • Advantages: exploits spatio-temporal locality • Efficient low-latency local access to private + shared data (cache line replication) • Drawbacks: inefficiently handles data with LOW spatio-temporal locality • Working set > private cache size • Shared data with frequent writes • Wasteful invalidations, synchronous writebacks, cache line ping-ponging • Result: increased on-chip communication and time spent waiting for expensive events

  5. On-Chip Communication Problem (Bill Dally, Stanford; Shekhar Borkar, Intel) • Wires relative to gates are getting worse every generation • Bit movement is much more expensive than computation • Must architect efficient coherence protocols

  6. Locality of Benchmarks: Evaluating Reuse before Evictions • Utilization: number of private L1 cache accesses before a cache line is evicted • 40% of evicted lines have a utilization < 4 • [Figure annotations: 80%, 20%]

  7. Locality of Benchmarks: Evaluating Reuse before Invalidations • Utilization: number of private L1 cache accesses before a cache line is invalidated (by an intervening write) • [Figure annotations: 80%, 10%]
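
The utilization metric defined on the two slides above can be sketched as a counter per cache line that is read out when the line leaves the L1 cache. The sketch below is illustrative only: the event names, the trace format, and the sample trace are assumptions, not data from the presentation. It counts the access that brings the line in, which matches the walk-through later in the deck, where a line holds utilization 1 after its first read.

```python
from collections import defaultdict

def utilizations_at_removal(trace):
    """Return the utilization of each line at the moment it is evicted or invalidated.

    trace is a list of (event, line) pairs, where event is "access",
    "evict", or "invalidate" (hypothetical names chosen for this sketch).
    """
    counts = defaultdict(int)
    result = []
    for event, line in trace:
        if event == "access":
            counts[line] += 1
        elif event in ("evict", "invalidate"):
            # Read out and reset the counter for this line.
            result.append(counts.pop(line, 0))
    return result

def fraction_below(utilizations, threshold):
    """Fraction of removed lines whose utilization is below threshold."""
    if not utilizations:
        return 0.0
    return sum(1 for u in utilizations if u < threshold) / len(utilizations)

# Made-up trace: X is accessed twice then evicted, Y once then invalidated,
# and X is brought back, accessed five times, and evicted again.
trace = (
    [("access", "X")] * 2 + [("evict", "X")]
    + [("access", "Y"), ("invalidate", "Y")]
    + [("access", "X")] * 5 + [("evict", "X")]
)
utils = utilizations_at_removal(trace)
```

For this trace, `utils` is `[2, 1, 5]`, so two of the three removals have a utilization below 4.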

  8. Remote-Word Access (RA) • Assign each memory address to a unique “home” core • Cache line present only in the shared cache at the “home” core (single location) • For access to a non-locally cached word, request the “remote” shared cache on the “home” core to perform the read/write access • NUCA-based protocol [Fensch et al HPCA’08] [Hoffmann et al HiPEAC’10] • [Figure: a write-word request sent to the home core; steps numbered 1 and 2]
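
One simple way to give every address a unique home core is to interleave cache lines across cores. The slide does not say which mapping is used, so the function below is only a sketch under assumptions: a 64-byte line size, and a modulo mapping named `home_core` chosen here for illustration.

```python
CACHE_LINE_BYTES = 64  # assumed line size; not stated on the slide

def home_core(address, num_cores, line_bytes=CACHE_LINE_BYTES):
    """Map a byte address to its home core by interleaving cache lines.

    Every address within one cache line maps to the same core, so the
    line lives in exactly one shared-cache slice.
    """
    return (address // line_bytes) % num_cores

# All bytes of line 0 share a home; the next line goes to the next core.
homes = [home_core(a, num_cores=4) for a in (0, 63, 64, 128, 256)]
```

Here `homes` is `[0, 0, 1, 2, 0]`. The placement function matters because, as the next slide notes, data placement dictates the distance and frequency of remote accesses.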

  9. Remote-Word Access: Advantages & Drawbacks • Advantages: energy efficient for low-locality data: word access (~200 bits) is cheaper than a cache line fetch (~640 bits) • NO data replication → efficient private cache utilization • NO invalidations / synchronous writebacks • Drawbacks: round-trip network request for every remote-WORD access • Expensive for high-locality data • Data placement dictates distance & frequency of remote accesses
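
The two traffic figures on the slide above (~200 bits per word access, ~640 bits per cache line fetch) allow a rough break-even estimate. The sketch below counts only the bits moved over the network. It ignores hop distance, invalidations, and writebacks, and it assumes that hits in the private cache cost no network traffic. The function names are introduced here for illustration.

```python
WORD_ACCESS_BITS = 200  # approximate figure from the slide
LINE_FETCH_BITS = 640   # approximate figure from the slide

def remote_bits(num_accesses):
    """Network bits if every access is a remote word access."""
    return num_accesses * WORD_ACCESS_BITS

def private_bits(num_accesses):
    """Network bits if the line is fetched once and then hit locally."""
    return LINE_FETCH_BITS if num_accesses > 0 else 0

def cheaper_mode(num_accesses):
    """Mode that moves fewer bits for the given number of accesses."""
    if remote_bits(num_accesses) < private_bits(num_accesses):
        return "remote"
    return "private"

# Largest access count for which remote access still moves fewer bits.
max_remote_accesses = (LINE_FETCH_BITS - 1) // WORD_ACCESS_BITS
```

Under these simplified assumptions, remote access moves fewer bits for up to 3 accesses (600 bits against 640), and private caching wins from the fourth access on.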

  10. Locality-Aware Cache Coherence • Combine advantages of private caching and remote access • Privately cache high-locality lines • Optimize hit latency and energy • Remotely cache low-locality lines • Prevent data replication & costly data movement • Private Caching Threshold (PCT) • Utilization >= PCT → mark as private • Utilization < PCT → mark as remote

  11. Locality-Aware Cache Coherence • Private Caching Threshold (PCT) = 4 • [Figure: invalidations vs. utilization, split into private and remote regions]

  12. Outline • Motivation for Locality-Aware Coherence • Detailed Implementation • Optimizations • Evaluation • Conclusion

  13. Baseline System • Compute pipeline • Private L1-I and L1-D caches • Logically shared, physically distributed L2 cache with integrated directory • L2 cache managed by Reactive-NUCA [Hardavellas – ISCA09] • ACKwise limited-directory protocol [Kurian – PACT10] • [Figure: a core containing a compute pipeline, L1 D-cache, L1 I-cache, L2 shared cache, directory, and router; memory controllers labeled M]

  14. Locality-Aware Coherence: Important Features • Intelligent allocation of cache lines • In the private L1 cache • Allocation decision made per-core at cache line level • Efficient locality tracking hardware • Decoupled from traditional coherence tracking structures • Protocol complexity low • NO additional networks for deadlock avoidance

  15. Implementation Details: Private Cache Line Tag • Private Utilization bits track cache line usage in the L1 cache • Communicated back to the directory on eviction or invalidation • Storage overhead is only 0.4% • [Figure: tag fields State, LRU, Tag, Private Utilization]

  16. Implementation Details: Directory Entry • P/R i: Private/Remote mode of core i • Remote-Utilization i: line usage by core i at the shared L2 cache • Complete Locality Classifier: track mode/remote-utilization for all cores • Storage overhead reduced later • [Figure: entry fields State, ACKwise Pointers 1 … p, Tag, P/R 1 … P/R n, Remote Utilization 1 … Remote Utilization n]
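
The directory entry described above can be sketched as a small record with one mode bit and one remote-utilization counter per core. This is only an illustration: the field names follow the slide, but the Python types, the `new_entry` helper, and the counter width used in `classifier_bits` (just wide enough to count from 0 to PCT) are assumptions rather than details given in the deck.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CoreLocality:
    private_mode: bool = True    # P/R bit; cores start out in private mode
    remote_utilization: int = 0  # accesses by this core at the shared L2

@dataclass
class DirectoryEntry:
    tag: int
    state: str = "Uncached"
    ackwise_pointers: List[int] = field(default_factory=list)
    locality: List[CoreLocality] = field(default_factory=list)

def new_entry(tag, num_cores):
    """Build an entry that tracks locality for every core (complete classifier)."""
    return DirectoryEntry(tag=tag,
                          locality=[CoreLocality() for _ in range(num_cores)])

def classifier_bits(num_cores, pct):
    """Locality bits per entry, assuming a counter that can count 0..pct."""
    util_bits = pct.bit_length()
    return num_cores * (1 + util_bits)

entry = new_entry(tag=0x40, num_cores=4)
bits_64_cores = classifier_bits(num_cores=64, pct=4)
```

With these assumed widths, a 64-core complete classifier at PCT = 4 needs 256 locality bits per directory entry, which suggests why the slide says the storage overhead is reduced later.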

  17. Mode Transitions Summary • Classification based on previous behavior • Initial → Private • Private → Private when Private Utilization >= PCT • Private → Remote when Private Utilization < PCT • Remote → Remote when Remote Utilization < PCT • Remote → Private when Remote Utilization >= PCT
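
The transition summary above reduces to a single comparison against PCT. What differs between the two modes is which counter is compared: the private utilization reported by the L1 cache on eviction or invalidation, or the remote utilization kept at the directory. A minimal sketch, with constant and function names chosen here for illustration:

```python
PRIVATE = "private"
REMOTE = "remote"
INITIAL_MODE = PRIVATE  # the initial state leads to private mode

def next_mode(current_mode, utilization, pct):
    """Return the next mode for one core and one cache line.

    In private mode, utilization is the private utilization reported on
    eviction or invalidation. In remote mode, it is the remote
    utilization observed at the shared L2 cache.
    """
    if current_mode not in (PRIVATE, REMOTE):
        raise ValueError(f"unknown mode: {current_mode}")
    return PRIVATE if utilization >= pct else REMOTE

# The four edges of the transition diagram with PCT = 4.
edges = [
    next_mode(PRIVATE, 5, pct=4),  # stays private
    next_mode(PRIVATE, 1, pct=4),  # demoted to remote
    next_mode(REMOTE, 2, pct=4),   # stays remote
    next_mode(REMOTE, 4, pct=4),   # promoted to private
]
```

Here `edges` is `["private", "remote", "remote", "private"]`.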

  18. Walk Through Example • Private Caching Threshold PCT = 2 • All cores start out in private mode • [Figure: Cores A, B, and C (pipeline + L1 cache each) connected through the network to Core D (L2 cache + directory)] • Directory: Core-A Private U, Core-B Private U, Core-C Private U; state Uncached

  19. Walk Through Example Core A PCT = 2 Read[X] Core B Core C Directory Core-A Private U Core-B Private U Core-C Private U Core D Uncached

  20. Walk Through Example Core A PCT = 2 Core B Core C Directory Cache Line [X] Core-A Private C Core-B Private U Core-C Private U Core D Shared Clean -

  21. Walk Through Example Core A PCT = 2 Shared1 Cache Line [X] Core B Core C Directory Core-A Private C Core-B Private U Core-C Private U Core D Shared Clean -

  22. Walk Through Example Core A PCT = 2 Shared1 Core B Core C Read[X] Directory Core-A Private C Core-B Private U Core-C Private U Core D Shared Clean -

  23. Walk Through Example Core A PCT = 2 Shared1 Core B Core C Directory Cache Line [X] Core-A Private C Core-B Private U Core-C Private C Core D Shared Clean -

  24. Walk Through Example Core A PCT = 2 Shared1 Core B Core C Shared1 Cache Line [X] Directory Core-A Private C Core-B Private U Core-C Private C Core D Shared Clean -

  25. Walk Through Example Core A PCT = 2 Shared1 Core B Core C Shared1 Read[X] Directory Core-A Private C Core-B Private U Core-C Private C Core D Shared Clean -

  26. Walk Through Example Core A PCT = 2 Shared1 Core B Core C Shared2 Directory Core-A Private C Core-B Private U Core-C Private C Core D Shared Clean -

  27. Walk Through Example Core A PCT = 2 Shared1 Core B Core C Shared2 Write[X] Directory Core-A Private C Core-B Private U Core-C Private C Core D Shared Clean -

  28. Walk Through Example Core A PCT = 2 Shared1 Core B Core C Shared2 Directory Inv [X] Core-A Private C Core-B Private U Core-C Private C Core D Shared Clean -

  29. Walk Through Example Core A PCT = 2 Invalid 0 Inv-Reply [X] (1) Core B Core C Shared2 Directory Core-A Private C Core-B Private U Core-C Private C Core D Shared Clean -

  30. Walk Through Example Core A PCT = 2 Core B Core C Shared2 Inv-Reply [X] (1) Directory Core-A Remote 0 Core-B Private U Core-C Private C Core D Shared Clean -

  31. Walk Through Example Core A PCT = 2 Core B Core C Invalid 0 Inv-Reply [X] (2) Directory Core-A Remote 0 Core-B Private U Core-C Private C Core D Shared Clean -

  32. Walk Through Example Core A PCT = 2 Core B Core C Inv-Reply [X] (2) Directory Core-A Remote 0 Core-B Private U Core-C Private U Core D Uncached Clean -

  33. Walk Through Example Core A PCT = 2 Core B Core C Directory Cache Line [X] Core-A Remote 0 Core-B Private C Core-C Private U Core D Modified Clean -

  34. Walk Through Example Core A PCT = 2 Core B Core C Modified 1 Cache Line [X] Directory Core-A Remote 0 Core-B Private C Core-C Private U Core D Modified Clean -

  35. Walk Through Example Core A PCT = 2 Read[X] Core B Core C Modified 1 Directory Core-A Remote 0 Core-B Private C Core-C Private U Core D Modified Clean -

  36. Walk Through Example Core A PCT = 2 Core B Core C Modified 1 Directory WB [X] Core-A Remote 0 Core-B Private C Core-C Private U Core D Modified Clean -

  37. Walk Through Example Core A PCT = 2 Core B Core C Shared 1 WB-Reply [X] Directory Core-A Remote 0 Core-B Private C Core-C Private U Core D Modified Clean -

  38. Walk Through Example Core A PCT = 2 Core B Core C Shared 1 Directory WB-Reply [X] Core-A Remote 0 Core-B Private C Core-C Private U Core D Shared Dirty -

  39. Walk Through Example Core A PCT = 2 Core B Core C Shared 1 Directory Word [X] Core-A Remote 1 Core-B Private C Core-C Private U Core D Shared Dirty -

  40. Walk Through Example Core A PCT = 2 Core B Core C Shared 1 Write [X] Directory Core-A Remote 1 Core-B Private C Core-C Private U Core D Shared Dirty -

  41. Walk Through Example Core A PCT = 2 Core B Core C Shared 1 Upgrade-Reply [X] Directory Core-A Remote 0 Core-B Private C Core-C Private U Core D Modified Dirty -

  42. Walk Through Example Core A PCT = 2 Core B Core C Modified 2 Directory Core-A Remote 0 Core-B Private C Core-C Private U Core D Modified Dirty -

  43. Walk Through Example Core A PCT = 2 Read [X] Core B Core C Modified 2 Directory Core-A Remote 0 Core-B Private C Core-C Private U Core D Shared Dirty -

  44. Walk Through Example Core A PCT = 2 Core B Core C Shared 2 Directory Read [X] Core-A Remote 1 Core-B Private C Core-C Private U Core D Shared Dirty -

  45. Walk Through Example Core A PCT = 2 Core B Core C Shared 2 Directory Word [X] Core-A Remote 1 Core-B Private C Core-C Private U Core D Shared Dirty -

  46. Walk Through Example Core A PCT = 2 Read [X] Core B Core C Shared 2 Directory Core-A Remote 1 Core-B Private C Core-C Private U Core D Shared Dirty -

  47. Walk Through Example Core A PCT = 2 Core B Core C Shared 2 Directory Read [X] Core-A Remote 2 Core-B Private C Core-C Private U Core D Shared Dirty -

  48. Walk Through Example Core A PCT = 2 Core B Core C Shared 2 Cache Line [X] (2) Directory Core-A Private C Core-B Private C Core-C Private U Core D Shared Dirty -

  49. Walk Through Example Core A PCT = 2 Shared 2 Cache Line [X] (2) Core B Core C Shared 2 Directory Core-A Private C Core-B Private C Core-C Private U Core D Shared Dirty -
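
Core A's directory entry in the walk-through above can be replayed with a few lines of code. This sketch is an interpretation of the animation frames, not the authors' implementation. The class and method names are invented here, and the rule that a write by another core resets the remote utilization to 0 is inferred from slide 41, where Core-A goes from "Remote 1" back to "Remote 0".

```python
PCT = 2  # Private Caching Threshold used in the walk-through

class CoreEntry:
    """Directory-side locality state for one core and one cache line."""

    def __init__(self):
        self.mode = "private"        # all cores start out in private mode
        self.remote_utilization = 0

    def invalidated(self, private_utilization):
        """The L1 copy was invalidated and reported its utilization."""
        if private_utilization < PCT:
            self.mode = "remote"
            self.remote_utilization = 0

    def remote_read(self):
        """Serve a read at the shared L2; return what is sent back."""
        self.remote_utilization += 1
        if self.remote_utilization >= PCT:
            self.mode = "private"    # promoted: reply with the whole line
            return "line"
        return "word"

    def other_core_wrote(self):
        """Inferred rule: an intervening write resets the remote counter."""
        if self.mode == "remote":
            self.remote_utilization = 0

core_a = CoreEntry()
log = []
core_a.invalidated(private_utilization=1)   # slides 28-30: demoted to Remote 0
log.append((core_a.mode, core_a.remote_utilization))
log.append(core_a.remote_read())            # slide 39: word reply, Remote 1
core_a.other_core_wrote()                   # slide 41: back to Remote 0
log.append(core_a.remote_read())            # slides 43-45: word reply, Remote 1
log.append(core_a.remote_read())            # slides 46-48: Remote 2, line reply
```

After the replay, `log` is `[("remote", 0), "word", "word", "line"]` and `core_a.mode` is `"private"`, which matches the final frames, where Core A receives the full cache line.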

  50. Outline • Motivation for Locality-Aware Coherence • Detailed Implementation • Optimizations • Evaluation • Conclusion
