
Locality-Aware Data Replication in the Last-Level Cache

Presentation Transcript


  1. Locality-Aware Data Replication in the Last-Level Cache George Kurian¹, Srinivas Devadas¹, Omer Khan² ¹Massachusetts Institute of Technology, ²University of Connecticut, Storrs

  2. The Problem • Future multicore processors will have 100s of cores • LLC management is key to optimizing performance and energy • Last-level cache (LLC) data locality and off-chip miss rates often show opposing trends • On an on-chip mesh, the average distance to a remote LLC slice grows with core count: # Network Hops ≈ ⅔ × √N • Goal: intelligent replication at the LLC
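
As a rough worked check of that scaling (a back-of-the-envelope sketch, assuming a √N × √N mesh with uniformly distributed requesters and home slices):

```latex
% Average hop count between two uniformly random tiles on a
% \sqrt{N} \times \sqrt{N} mesh (Manhattan distance):
\[
  \overline{\text{hops}} \approx \tfrac{2}{3}\sqrt{N}, \qquad
  N = 64 \Rightarrow \tfrac{2}{3}\cdot 8 \approx 5.3, \qquad
  N = 1024 \Rightarrow \tfrac{2}{3}\cdot 32 \approx 21.
\]
```

So at 1000-core scale a request serviced at a remote LLC slice pays on the order of 20 router traversals each way, which is what makes serving hits locally attractive.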

  3. LLC Replication Strategy • Black block shows benefit with replication • E.g., frequently-read shared data • Core-1 and Core-2 allowed to create replicas in their local LLC slices • Red block shows NO benefit with replication • E.g., frequently-written shared data [Fig: tiled multicore; each tile has a compute pipeline, private L1-I/L1-D caches, an L2 cache (LLC slice) with directory, and a router; the black block is replicated at the local slices of Core-1 and Core-2, while the red block is served only at its home slice]

  4. Outline • Motivation • Comparison to Previous Schemes • Design & Implementation • Evaluation • Conclusion

  5. Motivation: Reuse at the LLC • Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core • Note: private L1 cache hits are filtered out [Fig: Core-3 makes 5 accesses to a cache line at its home LLC slice before a write by Core-4, so Reuse = 5]
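
As an illustration only, a minimal C++ sketch of this bookkeeping (my own rendering, not the authors' hardware; the saturating width is an assumption, and any access by a different core is treated as conflicting, which is a simplification):

```cpp
// Sketch: per-line reuse accounting as defined on this slide. The counter
// advances on LLC accesses by the same core (private L1 hits never reach
// the LLC, so they are filtered out automatically) and the interval ends on
// an eviction or a conflicting access from another core.
#include <cstdint>

struct ReuseTracker {
    uint8_t core  = 0;   // core currently accumulating reuse
    uint8_t reuse = 0;   // saturating access count

    void on_llc_access(uint8_t c) {
        if (c == core && reuse > 0) {
            if (reuse < 255) ++reuse;   // same core: keep counting
        } else {
            core  = c;                  // conflicting access by another core:
            reuse = 1;                  // restart the count for the new core
        }
    }
    uint8_t on_eviction() {             // eviction ends the reuse interval
        uint8_t r = reuse;
        reuse = 0;
        return r;                       // e.g., Reuse = 5 in the figure
    }
};
```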

  6. Motivation: Reuse Determines Replication Benefit • Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core • The higher the reuse, the higher the efficacy of replication [Fig: LLC access count vs. reuse]

  7. Motivation (cont’d): Reuse Determines Replication Benefit • Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core • The higher the reuse, the higher the efficacy of replication [Fig: LLC access count vs. reuse, split into a low-reuse “Don’t Replicate” region and a high-reuse “Replicate” region]

  8. Motivation (cont’d): Reuse Independent of Cache Line Type • Private data exhibits varying degrees of reuse [Fig: LLC access count vs. reuse (bins 1-2, 3-9, ≥10) for private data]

  9. Motivation (cont’d): Reuse Independent of Cache Line Type • Instructions mostly exhibit high reuse [Fig: LLC access count vs. reuse (bins 1-2, 3-9, ≥10) for private data and instructions]

  10. Motivation (cont’d): Reuse Independent of Cache Line Type • Shared read-only data exhibits varying degrees of reuse [Fig: LLC access count vs. reuse (bins 1-2, 3-9, ≥10) for private, instruction, and shared read-only data]

  11. Motivation (cont’d): Reuse Independent of Cache Line Type • Shared read-write data exhibits varying degrees of reuse [Fig: LLC access count vs. reuse (bins 1-2, 3-9, ≥10) for private, instruction, shared read-only, and shared read-write data]

  12. Motivation (cont’d): Reuse Independent of Cache Line Type • Replication must be based on reuse and not cache line classification • Replicate based on reuse: instructions, shared read-only data, shared read-write data, and (even) private data [Fig: LLC access count vs. reuse (bins 1-2, 3-9, ≥10) for all four line types]

  13. Locality-Aware Replication: Salient Features • Locality-based: based on reuse and not memory classification information • Replicate data with high reuse; bypass replication mechanisms for low-reuse data • Cache-line level: reuse measured and replication decision made at cache-line granularity • Dynamic: reuse profiled at runtime using highly accurate hardware counters • Minimal coherence protocol changes: replication is done at the local LLC slice • Fully hardware: LLC replication techniques require no modification to the operating system

  14. Comparison to Previous Schemes

  15. Outline • Motivation • Comparison to Previous Schemes • Design & Implementation • Evaluation • Conclusion

  16. Baseline System [Fig: tile architecture with compute pipeline, L1 D-Cache, L1 I-Cache, L2 Cache (LLC slice) with integrated directory, and router] • Compute pipeline with private L1-I and L1-D caches • Logically shared, physically distributed L2 cache (LLC) with integrated directory • LLC managed using Reactive-NUCA [Hardavellas – ISCA’09]: local placement of private pages, shared pages striped across all slices • ACKwise limited-directory protocol [Kurian – PACT’10]
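
To make the placement rule concrete, here is a minimal sketch (my own illustration, not Graphite or R-NUCA source code; the tile count and line size are assumptions) of how a cache line’s home LLC slice would be chosen:

```cpp
// Sketch: Reactive-NUCA-style data placement. Private pages are placed at
// the requesting core's local LLC slice; shared pages are striped across
// all slices at cache-line granularity.
#include <cstdint>

enum class PageClass { Private, Shared };    // per-page classification

constexpr uint32_t kNumTiles = 64;           // assumed 64-tile mesh
constexpr uint32_t kLineBits = 6;            // assumed 64-byte cache lines

uint32_t home_slice(uint64_t addr, PageClass cls, uint32_t requester_tile) {
    if (cls == PageClass::Private)
        return requester_tile;               // local placement of private pages
    return (addr >> kLineBits) % kNumTiles;  // shared pages striped by line
}
```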

  17. Locality Tracking Intelligence: Replica Reuse Counter [LLC tag entry: Tag | State | LRU | ACKwise Pointers (1…p) | Replica Reuse | Mode_1 … Mode_n | Home Reuse_1 … Home Reuse_n (Complete Locality List, 1…n)] • Replica Reuse: tracks cache line usage by a core at the LLC replica • Replica reuse counter is communicated back to the directory on eviction or invalidation for classification, so NO additional network messages are needed • Storage overhead: 1KB (0.4%)

  18. Locality Tracking Intelligence: Mode & Home Reuse Counters [Same tag entry layout as above] • Mode_i: can the cache line be replicated at Core_i? • Home Reuse_i: tracks cache line usage by Core_i at the home LLC slice • Complete Locality Classifier: tracks locality information for all cores and for all LLC cache lines • Storage overhead: 96KB (30%); we’ll fix this later
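
For concreteness, a C++ rendering of the tag/directory entry sketched above (a sketch only; the field widths are my assumptions, not the paper’s exact bit counts):

```cpp
// Sketch: per-line directory state for the complete locality classifier.
// One Mode / Home-Reuse pair is kept per core (n entries), which is what
// drives the 30% storage overhead the limited classifier later removes.
#include <array>
#include <cstdint>

constexpr int kNumCores    = 64;  // n: one entry per core
constexpr int kAckwisePtrs = 4;   // p: ACKwise_p sharer pointers

enum class Mode : uint8_t { NoReplica, Replica };

struct LocalityEntry {
    std::array<uint8_t, kAckwisePtrs> ackwise_ptr;  // ACKwise sharer pointers
    uint8_t replica_reuse;                          // usage at the replica slice
    std::array<Mode, kNumCores>    mode;            // may Core_i replicate?
    std::array<uint8_t, kNumCores> home_reuse;      // usage by Core_i at home
};
```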

  19. Mode Transitions: Replication Intelligence • Replication decision made based on previous cache line reuse behavior • Initially, no replica is created; all requests are serviced at the LLC home [State diagram: Initial → No-Replica]

  20. Mode Transitions • Replication decision made based on previous cache line reuse behavior • Home-Reuse counter: tracks the # accesses by a core at the LLC home location [State diagram: Initial → No-Replica]

  21. Mode Transitions • A replica is created if enough reuse is detected at the LLC home • If (Home-Reuse >= Replication-Threshold): promote to “Replica” mode and create a replica • Higher Replication-Threshold → fewer replicas; lower Replication-Threshold → more replicas [State diagram: No-Replica → Replica on Home Reuse >= RT; RT = Replication Threshold]

  22. Mode Transitions • Replica-Reuse counter: tracks the # accesses to the LLC at the replica location [State diagram: No-Replica → Replica on Home Reuse >= RT; RT = Replication Threshold]

  23. Mode Transitions • Eviction from LLC replica location: triggered by capacity limitations • If (Replica-Reuse >= Replication-Threshold): stay in “Replica” mode; else demote to “No-Replica” mode [State diagram adds: Replica → Replica on Replica Reuse >= RT; Replica → No-Replica on Replica Reuse < RT]

  24. Mode Transitions • Invalidation at LLC replica location: triggered by a conflicting write • If ((Replica + Home) Reuse >= Replication-Threshold): stay in “Replica” mode; else demote to “No-Replica” mode [State diagram adds: Replica → Replica on (Replica + Home) Reuse >= RT; Replica → No-Replica on (Replica + Home) Reuse < RT]

  25. Mode Transitions • Conflicting write from another core: reset Home-Reuse counter to ‘0’ • XReuse abbreviates the demotion test: Replica Reuse on an eviction, (Replica + Home) Reuse on an invalidation [State diagram: XReuse >= RT keeps “Replica” mode; XReuse < RT demotes to “No-Replica”; Home Reuse < RT stays in “No-Replica”; RT = Replication Threshold]

  26. Mode Transitions Summary • Replication decision made based on previous cache line reuse behavior [State diagram: Initial → No-Replica; No-Replica → Replica on Home Reuse >= RT; No-Replica self-loop on Home Reuse < RT; Replica self-loop on XReuse >= RT; Replica → No-Replica on XReuse < RT; RT = Replication Threshold]
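
Putting slides 19-26 together, a minimal sketch of the per-core mode state machine (my own C++ rendering under the stated rules, not the authors’ implementation):

```cpp
// Sketch: replication mode transitions. RT is the replication threshold;
// "xreuse" is Replica Reuse on an eviction and (Replica + Home) Reuse on an
// invalidation, as defined on slides 23-24.
enum class Mode { NoReplica, Replica };

constexpr unsigned RT = 3;  // RT-3 is the threshold the evaluation settles on

// A core's request is serviced at the home LLC slice.
Mode on_home_access(Mode m, unsigned& home_reuse) {
    if (m == Mode::NoReplica && ++home_reuse >= RT)
        return Mode::Replica;              // enough home reuse: create a replica
    return m;                              // otherwise keep the current mode
}

// The replica is evicted (capacity) or invalidated (conflicting write).
Mode on_replica_end(Mode m, unsigned xreuse) {
    return (xreuse >= RT) ? Mode::Replica     // reuse held up: keep replicating
                          : Mode::NoReplica;  // reuse too low: demote
}

// A conflicting write from another core resets that core's home-reuse count.
void on_conflicting_write(unsigned& home_reuse) { home_reuse = 0; }
```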

  27. Locality Tracking Intelligence: Limited-k Locality Classifier [LLC tag entry: Tag | State | LRU | ACKwise Pointers (1…p) | Replica Reuse | Core ID_1 … Core ID_k | Mode_1 … Mode_k | Home Reuse_1 … Home Reuse_k (Limited Locality List, 1…k)] • Complete Locality Classifier: prohibitive storage overhead (30%) • Limited Locality Classifier (k): Mode and Home Reuse information tracked for only k cores • Modes of other cores obtained by majority voting • Smaller k → lower overhead • Inactive cores replaced in the locality list based on access pattern to accommodate new sharers
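
A minimal sketch of the majority-voting lookup for cores that fall outside the limited locality list (hypothetical helper and field names; k = 3 as in the next slide):

```cpp
// Sketch: limited_k classifier lookup. Only k (core-id, mode, home-reuse)
// triples are stored; a core that is not tracked inherits the majority mode
// of the tracked cores.
#include <array>
#include <cstdint>

constexpr int K = 3;  // limited-3 classifier

struct LimitedEntry {
    std::array<uint8_t, K> core_id;    // which cores are tracked
    std::array<bool, K>    replicate;  // Mode_i: true = "Replica" mode
    std::array<bool, K>    valid;      // slot in use?
};

bool mode_for_core(const LimitedEntry& e, uint8_t core) {
    int tracked = 0, votes = 0;
    for (int i = 0; i < K; ++i) {
        if (!e.valid[i]) continue;
        if (e.core_id[i] == core) return e.replicate[i];  // tracked: exact answer
        ++tracked;
        if (e.replicate[i]) ++votes;
    }
    return tracked > 0 && 2 * votes > tracked;  // untracked: majority vote
}
```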

  28. Limited-3 Locality Classifier • Mode and Home Reuse tracked for 3 sharers • Limited-3 classifier approximates the performance & energy of the Complete classifier

  29. Outline • Motivation • Comparison to Previous Schemes • Design & Implementation • Evaluation • Conclusion

  30. Evaluation Methodology • Evaluations done using • Graphite simulator for 64 cores • McPAT/CACTI cache energy models and DSENT network energy models at 11 nm • Evaluated 21 benchmarks from the SPLASH-2 (11), PARSEC (8), Parallel MI-bench (1) and UHPC (1) suites • LLC management schemes compared: • Static-NUCA (S-NUCA) • Reactive-NUCA (R-NUCA) • Victim Replication (VR) • Adaptive Selective Replication (ASR) [modified] • Locality-Aware Replication (RT-1, RT-3, RT-8)

  31. Replicate Shared Read-Write Data: LLC Accesses (BARNES) • Most LLC accesses are reads to widely-shared, high-reuse shared read-write data • Important to replicate shared read-write data [Fig: LLC access count vs. reuse (bins 1-2, 3-9, ≥10) for private, instruction, shared read-only, and shared read-write data in BARNES]

  32. Replicate Shared Read-Write Data: Energy Results (BARNES) • Locality-aware protocol reduces network router & link energy by replicating shared read-write data locally • Victim Replication (VR) obtains limited energy benefits: (almost) blind replica creation and a simplistic LLC replacement policy (removing and re-inserting replicas on L1 misses & evictions) • Adaptive Selective Replication (ASR) and Reactive-NUCA do not replicate shared read-write data

  33. Replicate Shared Read-Write Data: Completion Time Results (BARNES) • Locality-aware protocol reduces communication time with the LLC home (L1-To-LLC-Home)

  34. Replicate Private Cache Lines: Page vs. Cache Line Classification (BLACKSCHOLES) • Page-level classification incurs false positives: multiple cores work privately on different cache lines in the same page, so the page is classified shared read-only instead of private • Page-level data placement is therefore not optimal: Reactive-NUCA cannot localize most LLC accesses • Replicate private data to localize all LLC accesses

  35. Replicate Private Cache Lines: Energy Results (BLACKSCHOLES) • Locality-aware protocol reduces network energy through replication of private cache lines • ASR replicates just shared read-only cache lines • VR obtains limited improvements in energy, still restricted by its replication mechanisms

  36. Replicate All Classes of Cache Lines: LLC Accesses (BODYTRACK) • Most LLC accesses are reads to widely-shared, high-reuse instructions, shared read-only and shared read-write data • The best replication policy should optimize handling of all 3 classes of cache lines [Fig: LLC access count vs. reuse (bins 1-2, 3-9, ≥10) for private, instruction, shared read-only, and shared read-write data in BODYTRACK]

  37. Replicate All Classes of Cache Lines: Energy Results (BODYTRACK) • R-NUCA replicates instructions, hence obtains small network energy improvements • ASR replicates instructions and shared read-only data and obtains larger energy improvements • The locality-aware protocol replicates shared read-write data as well

  38. Use Optimal Replication Threshold: Energy Results (STREAMCLUSTER) • Perform intelligent replication • RT-1 performs badly due to LLC pollution • RT-8 identifies fewer replicas and is slow to identify useful ones • RT-3 identifies useful replicas faster while not creating LLC pollution • Use the optimal replication threshold of 3

  39. Results Summary • We choose a static replication threshold (RT) of 3 • Energy improved by 13-21% • Completion time improved by 4-13% [Figs: Energy and Completion Time across all schemes]

  40. Conclusion • Locality-aware instruction and data replication in the last-level cache (LLC) • Spatio-temporal locality profiled dynamically at the cache-line level using low-overhead yet highly accurate hardware counters • Enables replication only for lines with high reuse • Requires minimal changes to the baseline cache coherence protocol since replicas are placed locally • Significant energy and performance improvements over state-of-the-art replication mechanisms

  41. See The Paper For … • Exhaustive benchmark case studies • Apps with migratory shared data • Apps with NO benefit from replication • Limited locality classifier study • Sensitivity to number of tracked cores (k) • Cluster-level locality-aware LLC replication study • Sensitivity to cluster size (C)

  42. Thank You! Questions?

  43. Locality-Aware Data Replication in the Last-Level Cache George Kurian¹, Srinivas Devadas¹, Omer Khan² ¹Massachusetts Institute of Technology, ²University of Connecticut, Storrs
