
CACM July 2012

CACM July 2012. Talk: Mark D. Hill (Wisconsin) at Cornell University, 10/2012. Executive summary: today's chips provide shared memory with hardware (HW) coherence as low-level support for OS and application software. As the number of cores per chip scales, some argue HW coherence must go because of growing overheads; this talk argues it stays, by managing those overheads.


Presentation Transcript


  1. CACM July 2012 Talk: Mark D. Hill, Wisconsin @ Cornell University, 10/2012

  2. Executive Summary • Today chips provide shared memory w/ HW coherence as low-level support for OS & application SW • As #cores per chip scales? • Some argue HW coherence is gone due to growing overheads • We argue it stays by managing overheads • Develop scalable on-chip coherence proof-of-concept • Inclusive caches first • Exact tracking of sharers & replacements (key to analysis) • Larger systems need to use hierarchy (clusters) • Overheads similar to today's → Compatibility of on-chip HW coherence is here to stay • Let's spend programmer sanity on parallelism, not lost compatibility! 2

  3. Outline Motivation & Coherence Background Scalability Challenges • Communication • Storage • Enforcing Inclusion • Latency • Energy Extension to Non-Inclusive Shared Caches Criticisms & Summary 3

  4. Academics Criticize HW Coherence • Choi et al. [DeNovo]: "Directory … coherence … extremely complex & inefficient … Directory … incurring significant storage and invalidation traffic overhead." • Kelm et al. [Cohesion]: "A software-managed coherence protocol … avoids … directories and duplicate tags, & implementing & verifying … less traffic …" 4

  5. Industry Eschews HW Coherence • Intel 48-Core IA-32 Message-Passing Processor: "… SW protocols … to eliminate the communication & HW overhead" • IBM Cell processor: "… the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory" BUT… 5

  6. Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," Micro 2011.

  7. Define "Coherence as Scalable" • Define a coherent system as scalable when the cost of providing coherence grows (at most) slowly as core count increases • Our Focus • YES: coherence • NO: Any scalable system also requires scalable HW (interconnects, memories) and SW (OS, middleware, apps) • Method • Identify each overhead & show it can grow slowly • Expect more cores • Moore's Law provides more transistors • Power-efficiency improvements (w/o Dennard Scaling) • Experts disagree on how many cores are possible

  8. Caches & Coherence • Cache: fast, hidden memory that reduces • Latency: average memory access time • Bandwidth: interconnect traffic • Energy: cache misses cost more energy • Caches hidden (from software) • Naturally for a single-core system • Via coherence protocol for multicore • Maintain coherence invariant • For a given (memory) block at a given time, either • Modified (M): a single core can read & write • Shared (S): zero or more cores can read, but not write
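To make the invariant concrete, here is a minimal Python sketch (hypothetical, not from the talk) that checks the single-writer-or-multiple-readers rule for one block:

```python
# Hypothetical sketch of the coherence invariant from this slide: for a given
# memory block at a given time, either one core holds it Modified (M) or zero
# or more cores hold it Shared (S) -- never a writer alongside other copies.

from enum import Enum

class State(Enum):
    I = "Invalid"   # block not cached by this core
    S = "Shared"    # read-only copy
    M = "Modified"  # single read-write copy

def invariant_holds(per_core_state):
    """per_core_state: one State per core, for a single block."""
    writers = sum(1 for s in per_core_state if s is State.M)
    readers = sum(1 for s in per_core_state if s is State.S)
    return writers == 0 or (writers == 1 and readers == 0)

# Core 0 holds the block Modified, everyone else Invalid: legal.
assert invariant_holds([State.M, State.I, State.I, State.I])
# A writer coexisting with a reader would violate coherence.
assert not invariant_holds([State.M, State.S, State.I, State.I])
```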

  9. Baseline Multicore Chip [figure: C cores, each with a private cache, connected by an interconnection network to a shared cache; a private-cache block holds state (~2 bits), tag (~64 bits), and data (~512 bits); a shared-cache block adds tracking bits (~C bits)] • Intel Core i7-like • C = 16 cores (not 8) • Private L1/L2 caches • Shared last-level cache (LLC) • 64B blocks w/ ~8B tag • HW coherence pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle) 9

  10. Baseline Chip Coherence [figure: same chip as before; the shared-cache block format adds ~C tracking bits next to state, tag, and data] • 2B per 64+8B L2 block to track L1 copies • Inclusive L2 (w/ recall messages on LLC evictions) 10

  11. Coherence Example Setup [figure: Cores 0-3 with private caches, an interconnection network, and a 4-bank shared cache; directory entries A: {0000} I, B: {0000} I] • Block A in no private caches: state Invalid (I) • Block B in no private caches: state Invalid (I) 11

  12. Coherence Example 1/4 (Core 0: Write A) • Block A at Core 0 exclusive read-write: Modified (M) [figure: Core 0's private cache now holds A in M; the directory entry for A goes from {0000} I to {1000} M; B stays {0000} I] 12

  13. Coherence Example 2/4 (Cores 1 & 2: Read B) • Block B at Cores 1+2 shared read-only: Shared (S) [figure: Cores 1 and 2 hold B in S; the directory entry for B goes from {0000} I to {0100} S and then {0110} S as each reader is added; A stays {1000} M] 13

  14. Coherence Example 3/4 (Core 3: Write A) • Block A moved from Core 0 to Core 3 (still M) [figure: the directory entry for A goes from {1000} M to {0001} M; B stays {0110} S] 14

  15. Coherence Example 4/4 (Write B) • Block B moved from Cores 1+2 (S) to Core 1 (M) [figure: the directory entry for B goes from {0110} S to a single-sharer M entry; A stays {0001} M] 15
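To replay the four steps above end to end, here is a toy directory model in Python (an illustrative sketch with made-up class and method names, not the protocol machinery described in the paper):

```python
# Toy directory sketch (illustrative only): one entry per block, holding a
# sharer bit per core plus a state (I, S, or M), as drawn on the example slides.

class Directory:
    def __init__(self, num_cores, blocks):
        self.n = num_cores
        self.entries = {b: {"sharers": [0] * num_cores, "state": "I"} for b in blocks}

    def read(self, core, block):
        e = self.entries[block]
        if e["state"] == "M":
            # Simplification: drop the old owner entirely; a real protocol would
            # downgrade it to S and forward the data.
            e["sharers"] = [0] * self.n
        e["sharers"][core] = 1
        e["state"] = "S"

    def write(self, core, block):
        e = self.entries[block]
        # Invalidate all other copies, then record the single new owner.
        e["sharers"] = [0] * self.n
        e["sharers"][core] = 1
        e["state"] = "M"

    def show(self, block):
        e = self.entries[block]
        print(f"{block}: {{{''.join(map(str, e['sharers']))}}} {e['state']}")

d = Directory(num_cores=4, blocks=["A", "B"])
d.write(0, "A"); d.show("A")                   # A: {1000} M  (example 1/4)
d.read(1, "B"); d.read(2, "B"); d.show("B")    # B: {0110} S  (example 2/4)
d.write(3, "A"); d.show("A")                   # A: {0001} M  (example 3/4)
d.write(1, "B"); d.show("B")                   # B: {0100} M  (example 4/4)
```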

  16. Caches & Coherence

  17. Outline Motivation & Coherence Background Scalability Challenges • Communication: Extra bookkeeping messages (longer section) • Storage: Extra bookkeeping storage • Enforcing Inclusion: Extra recall messages (subtle) • Latency: Indirection on some requests • Energy: Dynamic & static overhead Extension to Non-Inclusive Shared Caches (subtle) Criticisms & Summary 17

  18. 1. Communication: (a) No Sharing, Dirty [figure: cores with private caches on an interconnection network; key: green = required, red = overhead, thin = 8-byte control message, thick = 72-byte data message] • W/o coherence: Request → Data → Data (writeback) • W/ coherence: Request → Data → Data (writeback) → Ack • Overhead = 8/(8+72+72) = 5% (independent of #cores!) 18

  19. 1. Communication: (b) No Sharing, Clean [figure: same key as before] • W/o coherence: Request → Data (no writeback) • W/ coherence: Request → Data → Evict → Ack • Overhead = 8-16/(8+72) = 10-20% (independent of #cores!) 19
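A quick back-of-envelope check of the percentages on the last two slides, assuming the 8-byte control and 72-byte data message sizes from the figure key:

```python
# Back-of-envelope traffic arithmetic for the no-sharing cases, assuming 8-byte
# control messages and 72-byte data messages (64B block plus ~8B of header).

CTRL, DATA = 8, 72

# (a) No sharing, dirty: coherence adds one Ack to Request + Data + Data(writeback).
base_dirty = CTRL + DATA + DATA
print(f"(a) dirty:  {CTRL / base_dirty:.0%} overhead")        # ~5%

# (b) No sharing, clean: coherence adds an Evict notification and possibly an Ack
# (one or two control messages) to Request + Data.
base_clean = CTRL + DATA
for extra in (CTRL, 2 * CTRL):
    print(f"(b) clean: {extra / base_clean:.0%} overhead")     # 10% .. 20%

# Neither ratio depends on the number of cores.
```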

  20. 1. Communication: (c) Sharing, Read [figure: same key as before] • To memory: Request → Data • To one other core: Request → Forward → Data → (Cleanup) • Charge 1-2 control messages (independent of #cores!) 20

  21. 1. Communication: (d) Sharing, Write [figure: same key as before] • If Shared at C other cores: Request → {Data, C Invalidations + C Acks} → (Cleanup) • Needed since most directory protocols send invalidations to caches that have, & sometimes do not have, copies • Not Scalable 21

  22. 1. Communication: Extra Invalidations [figure: a coarse tracking vector with one bit per pair of cores {1|2, 3|4, …, C-1|C}; a read by Core 1 sets the first bit, so a later write by Core C invalidates both cores of that pair] • Core 1 Read: Request → Data • Core C Write: Request → {Data, 2 Inv + 2 Acks} → (Cleanup) • Charge Write for all necessary & unnecessary invalidations • What if all invalidations necessary? Charge reads that get data! 22

  23. 1. Communication: No Extra Invalidations [figure: an exact tracking vector with one bit per core {1, 2, 3, 4, …, C-1, C}; a read by Core 1 sets only its own bit, so a later write by Core C invalidates only Core 1] • Core 1 Read: Request → Data + {Inv + Ack} (charged for the future) • Core C Write: Request → Data → (Cleanup) • If all invalidations necessary, coherence adds bounded overhead to each miss, independent of #cores! 23

  24. 1. Communication Overhead (1) Communication overhead bounded & scalable (a) Without Sharing & Dirty (b) Without Sharing & Clean (c) Shared Read Miss (charge future inv + ack) (d) Shared Write Miss (not charged for inv+ acks) • But depends on tracking exact sharers (next) 24
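The accounting above bounds the coherence traffic added to any single miss by a constant: each read miss is pre-charged for the at-most-one future invalidation + ack it can cause, so a write miss pays only for its own messages. A hedged sketch, reusing the assumed 8/72-byte message sizes:

```python
# Per-miss traffic bound with exact sharer tracking (sketch). Assumes 8-byte
# control and 72-byte data messages; each read miss is charged for the
# at-most-one future Inv + Ack it can trigger, so write misses are not charged
# per sharer.

CTRL, DATA = 8, 72

def read_miss_bytes():
    # Request + Data now, plus Inv + Ack charged against the future invalidation.
    return CTRL + DATA + CTRL + CTRL

def write_miss_bytes():
    # Request + Data + optional Cleanup; invalidations were pre-charged to readers.
    return CTRL + DATA + CTRL

# Both are constants, independent of the core count C.
print(read_miss_bytes(), write_miss_bytes())   # 96 88
```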

  25. Total Communication vs. Read Misses per Write Miss [plot: total traffic as read misses per write miss grow, comparing exact tracking (unbounded storage) against inexact tracking (32-bit coarse vector)] • How to get the performance of "exact" w/ reasonable storage? 25
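One way to read the plot: with exact tracking, a write invalidates only the caches that actually read the block, while a fixed-size coarse vector must invalidate every core in each group that contains at least one reader, so its traffic grows with the core count. A rough illustrative model (message counts only, readers assumed spread across groups):

```python
# Rough model behind the exact-vs-coarse comparison (illustrative assumptions):
# a block is read by R distinct cores out of C and then written once. Exact
# tracking invalidates exactly R caches; a 32-bit coarse vector invalidates
# every core in each group of C/32 cores that contains at least one reader.

def invalidations(C, R, exact=True, bits=32):
    if exact:
        return R
    cores_per_bit = max(1, C // bits)
    groups_touched = min(R, bits)          # pessimistic: readers spread over groups
    return groups_touched * cores_per_bit

for C in (64, 256, 1024):
    R = 4
    print(f"C={C:4d}: exact={invalidations(C, R)}, "
          f"coarse={invalidations(C, R, exact=False)}")
# Exact stays at 4 invalidations; the coarse vector sends 8, 32, then 128.
```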

  26. Outline Motivation & Coherence Background Scalability Challenges • Communication: Extra bookkeeping messages (longer section) • Storage: Extra bookkeeping storage • Enforcing Inclusion: Extra recall messages • Latency: Indirection on some requests • Energy: Dynamic & static overhead Extension to Non-Inclusive Shared Caches Criticisms & Summary 26

  27. 2. Storage Overhead (Small Chip) [figure: shared-cache block format with ~C tracking bits, ~2 state bits, ~64 tag bits, and ~512 data bits] • Track up to C = #readers (cores) per LLC block • Small #cores: C-bit vector acceptable • e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3% 27
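A quick check of the 3% figure, and of why a flat per-core bit vector eventually stops scaling (block format assumed as on the slide: ~2 state bits, ~64 tag bits, 512 data bits):

```python
# Storage overhead of a flat C-bit sharer vector per shared-cache block,
# assuming ~2 state bits, ~64 tag bits, and 512 data bits per block.

def flat_overhead(cores, state_bits=2, tag_bits=64, data_bits=512):
    return cores / (state_bits + tag_bits + data_bits)

for c in (16, 64, 256, 1024):
    print(f"{c:5d} cores: {flat_overhead(c):.0%}")
# 16 cores: ~3%.  1024 cores: ~177% -- a flat vector clearly does not scale.
```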

  28. 2. Storage Overhead (Larger Chip) • Use hierarchy! [figure: K clusters of K cores each; within a cluster, cores with private caches share an intra-cluster interconnection network and a cluster cache whose tracking bits record sharers inside the cluster; an inter-cluster interconnection network connects the cluster caches to a shared last-level cache whose tracking bits record which clusters hold each block] 28

  29. 2. Storage Overhead (Larger Chip) • Medium-Large #Cores: Use Hierarchy! • Cluster: K1 cores with L2 cluster cache • Chip: K2 clusters with L3 global cache • Enables K1*K2 Cores • E.g., 16 16-core clusters • 256 cores (16*16) • 3% storage overhead!! • More generally? 29

  30. Storage Overhead for Scaling [plot: storage overhead vs. core count, staying low with hierarchy; e.g., 16 clusters of 16 cores each] • (2) Hierarchy enables scalable storage 30
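The calculation behind the flat curve: with two levels, each cluster cache tracks only its K1 cores and the global cache tracks only K2 clusters, so the per-block vector stays small at every level while the core count is K1*K2. A sketch under the same block-format assumptions as before:

```python
# Hierarchical tracking sketch: cluster caches hold a K1-bit vector (one bit per
# core in the cluster); the global cache holds a K2-bit vector (one bit per
# cluster). Block format assumed: ~2 state bits, ~64 tag bits, 512 data bits.

def overhead(tracking_bits, state_bits=2, tag_bits=64, data_bits=512):
    return tracking_bits / (state_bits + tag_bits + data_bits)

for k1, k2 in [(16, 16), (32, 32), (64, 64)]:
    print(f"{k1 * k2:5d} cores: cluster-level {overhead(k1):.0%}, "
          f"global-level {overhead(k2):.0%}")
# 256, 1024, and 4096 cores all keep every level at roughly 3-11% overhead.
```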

  31. Outline Motivation & Coherence Background Scalability Challenges • Communication: Extra bookkeeping messages (longer section) • Storage: Extra bookkeeping storage • Enforcing Inclusion: Extra recall messages (subtle) • Latency: Indirection on some requests • Energy: dynamic & static overhead Extension to Non-Inclusive Shared Caches (subtle) Criticisms & Summary 31

  32. 3. Enforcing Inclusion (Subtle) • Inclusion: block in a private cache ⇒ in shared cache + augment shared cache to track private-cache sharers (as assumed) • Replace in shared cache ⇒ replace in private caches • Make such replacements impossible? • Requires too much shared-cache associativity • E.g., 16 cores w/ 4-way private caches ⇒ 64-way associativity • Use recall messages instead • Make recall messages necessary & rare 32

  33. Inclusion Recall Example (Write C) • Shared cache miss to new block C • Needs to replace (victimize) block B in shared cache • Inclusion forces replacement of B in private caches [figure: directory holds A: {1000} M and B: {0110} S; victimizing B's shared-cache entry recalls B from Cores 1 and 2] 33

  34. Make All Recalls Necessary • Exact state tracking (covered earlier) + L1/L2 replacement messages (even for clean blocks) = every recall message finds a cached block ⇒ every recall message is necessary & occurs after a cache miss (bounded overhead) 34

  35. Make Necessary Recalls Rare • Assume misses to random sets [Hill & Smith 1989] • Recalls naturally rare when (shared cache size) / (Σ private cache sizes) > 2 [plot: recall frequency vs. this size ratio, with the Core i7 design point marked] • (3) Recalls made rare 35
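A crude stand-in for the intuition (this uniform-random model is not the Hill & Smith set-level analysis): by inclusion, every privately cached block also occupies a shared-cache frame, so if a shared-cache miss victimizes a roughly random frame, the chance it forces a recall is about (aggregate private capacity) / (shared capacity), and that only applies to shared-cache misses, which are already rare.

```python
# Crude uniform-random model of recall frequency (a stand-in for intuition, not
# the cited analysis): privately cached blocks occupy P of the S shared-cache
# frames, so a random shared-cache victim forces a recall with probability ~P/S.

def recall_fraction(private_kib_total, shared_kib):
    return min(1.0, private_kib_total / shared_kib)

# Hypothetical chip: 16 cores x (32 KiB L1 + 256 KiB L2) of private cache.
private_total = 16 * (32 + 256)                    # KiB
for llc_kib in (8 * 1024, 16 * 1024, 32 * 1024):
    ratio = llc_kib / private_total
    frac = recall_fraction(private_total, llc_kib)
    print(f"shared/private ratio {ratio:.1f}: ~{frac:.0%} of LLC misses recall a block")
# The fraction drops as the ratio grows past ~2, and recalls happen only on LLC
# misses, which are themselves rare.
```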

  36. Outline Motivation & Coherence Background Scalability Challenges • Communication: Extra bookkeeping messages (longer section) • Storage: Extra bookkeeping storage • Enforcing Inclusion: Extra recall messages • Latency: Indirection on some requests • Energy: Dynamic & static overhead Extension to Non-Inclusive Shared Caches Criticisms & Summary 36

  37. 4. Latency Overhead – Often None [figure: same key as before] • None: private hit • "None": private miss + "direct" shared cache hit • "None": private miss + shared cache miss • BUT … 37

  38. 4. Latency Overhead -- Some [figure: same key as before] • 1.5-2X: private miss + shared cache hit with indirection(s) • How bad? 38

  39. 4. Latency Overhead -- Indirection • 1.5-2X: private miss + shared cache hit with indirection(s) • Ratio = (interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect) • Acceptable today • Relative latency similar w/ more cores/hierarchy • Vs. magically having data at shared cache • (4) Latency overhead bounded & scalable
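Plugging illustrative (made-up) per-hop and cache-access latencies into the ratio above shows where the 1.5-2X figure lands and why it does not grow with core count:

```python
# Illustrative arithmetic for the indirection ratio; the latency values are made
# up and chosen only to show the shape of the calculation.

def indirection_ratio(hop, cache):
    direct   = hop + cache + hop                      # miss served at the shared cache
    indirect = hop + cache + hop + cache + hop        # forwarded on to the owning core
    return indirect / direct

for hop, cache in [(10, 10), (15, 10), (10, 20)]:     # hypothetical cycle counts
    print(f"hop={hop}, cache={cache}: {indirection_ratio(hop, cache):.2f}x")
# All three land in roughly the 1.5-2x range; both paths share the same
# interconnect terms, so the ratio stays similar as the chip grows.
```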

  40. 5. Energy Overhead • Dynamic -- Small • Extra message energy – traffic increase small/bounded • Extra state lookup – small relative to cache block lookup • … • Static – Also Small • Extra state – state increase small/bounded • … • Little effect on energy-intensive cores, cache data arrays, off-chip DRAM, secondary storage, … • (5) Energy overhead bounded & scalable

  41. Outline Motivation & Coherence Background Scalability Challenges • Communication: Extra bookkeeping messages (longer section) • Storage: Extra bookkeeping storage • Enforcing Inclusion: Extra recall messages (subtle) • Latency: Indirection on some requests • Energy: Dynamic & static overhead Extension to Non-Inclusive Shared Caches (subtle) Apply analysis to caches used by AMD Criticisms & Summary 41

  42. Review: Inclusive Shared Cache [figure: shared-cache block format with tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits), and data (~512 bits)] • Inclusive shared cache: block in a private cache ⇒ in shared cache • Blocks must be cached redundantly 42

  43. Non-Inclusive Shared Cache [figure: (1) non-inclusive shared cache holding state (~2 bits), tag (~64 bits), and data (~512 bits), any size or associativity; (2) inclusive dataless directory (probe filter) holding tracking bits (~1 bit per core), state (~2 bits), and tag (~64 bits)] • The directory ensures coherence but duplicates tags • The shared cache avoids redundant caching & allows victim caching 43

  44. Non-Inclusive Shared Cache • Non-inclusive shared cache: data block + tag (any configuration) • Inclusive directory: tag (again) + state • Inclusive directory == coherence state overhead • WITH TWO LEVELS • Directory size proportional to sum of private cache sizes • 64b/(48b+512b) * 2 (for rare recalls) = 22% * Σ L1 size • Coherence overhead higher than w/ inclusion 44
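The two-level 22% figure is just the dataless directory entry (duplicated tag + state + tracking, ~64 bits) per private-cache block it shadows (48-bit tag + 512-bit data), doubled so that recalls stay rare. A quick check of that arithmetic:

```python
# Directory-sizing arithmetic from the slide: each private-cache block (48-bit
# tag + 512-bit data) is shadowed by a ~64-bit dataless directory entry, and the
# directory is over-provisioned ~2x so recalls stay rare.

def directory_fraction(entry_bits=64, tag_bits=48, data_bits=512, provision=2):
    return provision * entry_bits / (tag_bits + data_bits)

print(f"two-level directory ~ {directory_fraction():.1%} of total L1 capacity")
# ~22%: noticeably more than the ~3% tracking overhead of the inclusive design,
# because tags are duplicated. The three-level case (next slide) repeats the
# same factor against the L1s and the cluster L2s.
```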

  45. Non-Inclusive Shared Caches WITH THREE LEVELS • Cluster has L2 cache & cluster directory • Cluster directory points to cores w/ L1 block (as before) • (1) Size = 22% * Σ L1 sizes • Chip has L3 cache & global directory • Global directory points to clusters w/ the block in • (2) a cluster directory, for size 22% * Σ L1s, + • (3) a cluster L2 cache, for size 22% * Σ L2s • Hierarchical overhead higher than w/ inclusion 45

  46. Outline Motivation & Coherence Background Scalability Challenges • Communication: Extra bookkeeping messages (longer section) • Storage: Extra bookkeeping storage • Enforcing Inclusion: Extra recall messages (subtle) • Latency: Indirection on some requests • Energy: Dynamic & static overhead Extension to Non-Inclusive Shared Caches (subtle) Criticisms & Summary 46

  47. Some Criticisms (1) Where are workload-driven evaluations? • Focused on robust analysis of first-order effects (2) What about non-coherent approaches? • Showed that compatible coherence scales (3) What about protocol complexity? • We have such protocols today (& ideas for better ones) (4) What about multi-socket systems? • Apply non-inclusive approaches (5) What about software scalability? • Hard SW work need not re-implement coherence

  48. Executive Summary • Today chips provide shared memory w/ HW coherence as low-level support for OS & application SW • As #cores per chip scales? • Some argue HW coherence is gone due to growing overheads • We argue it stays by managing overheads • Develop scalable on-chip coherence proof-of-concept • Inclusive caches first • Exact tracking of sharers & replacements (key to analysis) • Larger systems need to use hierarchy (clusters) • Overheads similar to today's → Compatibility of on-chip HW coherence is here to stay • Let's spend programmer sanity on parallelism, not lost compatibility! 48

  49. Coherence NOT this Awkward
