
RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence

Presentation Transcript


  1. RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence • Andreas Moshovos, moshovos@eecg.toronto.edu • www.eecg.toronto.edu/aenao

  2. Improving Snoop Coherence • Conventional considerations: complexity and correctness, NOT power/bandwidth • Can we (1) reduce power/bandwidth and (2) still leverage snoop coherence? • Snoop coherence remains attractive: simple, design re-use • Yes: exploit program behavior to dynamically identify requests that do not need snooping • [Figure: three CPUs, each with I$ and D$, connected by an interconnect to main memory]

  3. RegionScout: Avoid Some Snoops • Frequent case: non-sharing even at a coarse level (a region) • RegionScout: dynamically identify non-shared regions • The first request to a region identifies it as not shared • Subsequent requests to that region do not need to be broadcast • Uses imprecise information • Small structures • Layered on top of conventional coherence • No additional constraints • [Figure: three CPUs, each with I$ and D$, connected by an interconnect to main memory]

  4. Roadmap • Conventional Coherence: • The need for power-aware designs • Potential: Program Behavior • RegionScout: What and How • Implementation • Evaluation • Summary

  5. Coherence Basics • Given a request for memory block X (an address) • Detect where its current value resides • [Figure: three CPUs and main memory; labels: X, snoop, snoop, hit]

  6. Conventional Coherence not Power-Aware/Bandwidth-Effective • All L2 tags see all accesses • Performance and complexity: the L2 tags are already there, so why not use them? • Power: all L2 tags consume power on all accesses • Bandwidth: all coherent requests are broadcast • [Figure: three CPUs and main memory; labels: L2, miss, miss]

  7. RegionScout Motivation: Sharing is Coarse • Region: large contiguous memory area, power-of-2 size • CPU X asks for data block in region R • No one else has X • No one else has any block in R • RegionScout exploits this behavior • Layered extension over snoop coherence • [Figure: typical memory space snapshot, addresses colored by owner(s)]

  8. Optimization Opportunities • Power and bandwidth • Originating node: avoid asking others • Remote node: avoid tag lookup • [Figure: three CPUs, each with I$ and D$, connected through a switch to memory]

  9. Potential: Region Miss Frequency • Even with a 16K region, ~45% of requests miss in all remote nodes • [Plot: global region misses as % of all requests vs. region size]

  10. RegionScout at Work: Non-Shared Region Discovery • First request detects a non-shared region • [Figure: three CPUs and main memory, steps numbered 1 to 3; labels: Region Miss, Region Miss, Global Region Miss; legend: record of non-shared regions, record of locally cached regions]

  11. RegionScout at Work: Avoiding Snoops • Subsequent request avoids snoops • [Figure: three CPUs and main memory, steps numbered 1 to 2; label: Global Region Miss; legend: record of non-shared regions, record of locally cached regions]

  12. RegionScout is Self-Correcting • Request from another node invalidates non-shared record • [Figure: three CPUs and main memory, steps numbered 1 to 2; legend: record of non-shared regions, record of locally cached regions]
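The behavior walked through on slides 10 to 12 can be sketched in a few lines of Python. This is a minimal illustration, not the design from the talk: the `Node` class, the `request` function, and the use of exact Python sets for both records are assumptions made for readability, and the 16K region size is just one of the sizes the slides mention.

```python
REGION_SHIFT = 14  # lg(16K); region size is an assumption for this sketch


class Node:
    """Hypothetical node holding the two records named on the slides."""

    def __init__(self, name):
        self.name = name
        self.cached_regions = set()      # record: locally cached regions
        self.non_shared_regions = set()  # record: non-shared regions


def request(requester, nodes, addr):
    """Issue a request; return True if it was broadcast, False if the snoop was avoided."""
    region = addr >> REGION_SHIFT
    if region in requester.non_shared_regions:
        # Slide 11: region already known to be non-shared, so no broadcast.
        requester.cached_regions.add(region)
        return False

    # Slide 10: broadcast and collect region hit/miss from every remote node.
    region_hit = False
    for node in nodes:
        if node is requester:
            continue
        # Slide 12: a request from another node invalidates a non-shared record.
        node.non_shared_regions.discard(region)
        if region in node.cached_regions:
            region_hit = True

    if not region_hit:
        # Global region miss: remember the region as non-shared.
        requester.non_shared_regions.add(region)
    requester.cached_regions.add(region)
    return True
```

In this sketch the first request from a node to an untouched region is broadcast and records the region as non-shared, later requests from the same node skip the broadcast, and a request from a second node removes the first node's record so that it broadcasts again.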

  13. Implementation: Requirements • Requesting node provides the address: • At the originating node, from the CPU: have I discovered that this region is not shared? • At remote nodes, from the interconnect: do I have a block in the region? • [Figure: the address is split into a Region Tag and an offset of lg(Region Size) bits]
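The address split on slide 13 is a shift and a mask. The sketch below assumes a power-of-2 region size, as stated on slide 7; the function name `split_address` and the example address are made up for illustration.

```python
def split_address(addr, region_size):
    """Split an address into (region tag, offset within the region)."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region_size must be a power of two")
    shift = region_size.bit_length() - 1  # lg(region_size)
    return addr >> shift, addr & (region_size - 1)


# Example with 16K regions: the low 14 bits are the offset.
tag, offset = split_address(0x12345678, 16 * 1024)
print(hex(tag), hex(offset))  # 0x48d1 0x1678
```

Both the originating node and the remote nodes only need the region tag to answer their respective questions; the offset is ignored by the filter structures.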

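  14. Remembering Non-Shared Regions • Non-Shared Region Table: records non-shared regions • Lookup by the region portion of the address prior to issuing a request • Snoop requests and invalidate • Few entries: 16x4 in most experiments • [Figure: address split into Region Tag and offset; the Region Tag is looked up in the Non-Shared Region Table, whose entries carry a valid bit]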
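A Non-Shared Region Table along the lines of slide 14 might look like the sketch below. Several details are assumptions not stated on the slide: "16x4" is read here as 16 sets of 4 ways, sets are indexed by the low bits of the region tag, and the oldest entry in a full set is replaced. The class and method names are hypothetical.

```python
class NonSharedRegionTable:
    """Small set-associative table of region tags believed to be non-shared."""

    def __init__(self, num_sets=16, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # Each set is a list of region tags, oldest first (assumed policy).
        self.sets = [[] for _ in range(num_sets)]

    def _set_for(self, region_tag):
        return self.sets[region_tag % self.num_sets]

    def contains(self, region_tag):
        """Lookup done before issuing a request."""
        return region_tag in self._set_for(region_tag)

    def record(self, region_tag):
        """Remember a region after a global region miss."""
        entries = self._set_for(region_tag)
        if region_tag in entries:
            entries.remove(region_tag)
        elif len(entries) == self.ways:
            entries.pop(0)  # set is full: drop the oldest entry
        entries.append(region_tag)

    def invalidate(self, region_tag):
        """Called when a snooped request from another node touches the region."""
        entries = self._set_for(region_tag)
        if region_tag in entries:
            entries.remove(region_tag)
```

Losing an entry to replacement only costs an extra broadcast later, which is why a table this small can be acceptable.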

  15. What Regions are Locally Cached? • If we had as many counters as regions: • Block allocation: counter[region]++ • Block eviction: counter[region]-- • Region cached only if counter[region] is non-zero • Not practical: • E.g., 16K regions and 4G memory → 256K counters • [Figure: address split into Region Tag and offset; the Region Tag selects a counter]
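The exact scheme on slide 15 and its cost can be written out directly. The sketch below uses a dictionary in place of a hardware counter array; the names `counters_needed` and `ExactRegionCounters` are illustrative only.

```python
def counters_needed(memory_bytes, region_bytes):
    """One counter per region of memory."""
    return memory_bytes // region_bytes


# 4G memory / 16K regions = 2**32 / 2**14 = 2**18 = 256K counters.
print(counters_needed(4 * 2**30, 16 * 2**10))  # 262144


class ExactRegionCounters:
    """One counter per region: precise, but needs as many counters as regions."""

    def __init__(self, region_bytes=16 * 1024):
        self.shift = region_bytes.bit_length() - 1
        self.counter = {}

    def allocate(self, block_addr):
        region = block_addr >> self.shift
        self.counter[region] = self.counter.get(region, 0) + 1

    def evict(self, block_addr):
        region = block_addr >> self.shift
        if self.counter.get(region, 0) == 0:
            raise ValueError("eviction from a region with no cached blocks")
        self.counter[region] -= 1

    def is_region_cached(self, addr):
        return self.counter.get(addr >> self.shift, 0) != 0
```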

  16. What Regions are Locally Cached? • Use few counters: Cached Region Hash • Imprecise: records a superset of locally cached regions • False positives: lost opportunity, correctness preserved • “Counter”: + on block allocation, - on block eviction • P-bit: 1 if counter is non-zero, used for lookups • Few entries, e.g., 256 • [Figure: address split into Region Tag and offset; the Region Tag is hashed to select a counter and its P-bit]
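The Cached Region Hash on slide 16 can be sketched as a small array of counters shared by all regions that hash to the same entry. The hash function is not given on the slide; taking the region tag modulo the table size is an assumption here, as are the class and method names. The P-bit is modeled as "counter is non-zero" rather than as a separately stored bit.

```python
class CachedRegionHash:
    """Few counters shared by many regions; answers 'might I cache this region?'."""

    def __init__(self, num_entries=256, region_bytes=16 * 1024):
        self.num_entries = num_entries
        self.shift = region_bytes.bit_length() - 1
        self.counters = [0] * num_entries

    def _index(self, addr):
        # Assumed hash: low bits of the region tag.
        return (addr >> self.shift) % self.num_entries

    def allocate(self, block_addr):
        self.counters[self._index(block_addr)] += 1

    def evict(self, block_addr):
        i = self._index(block_addr)
        if self.counters[i] == 0:
            raise ValueError("eviction with no matching allocation")
        self.counters[i] -= 1

    def may_cache_region(self, addr):
        """P-bit: True if any region mapping to this entry has a cached block."""
        return self.counters[self._index(addr)] != 0
```

A `False` answer is always safe because every cached block incremented its entry, so the structure never hides a region that is actually cached. A `True` answer may be a false positive caused by aliasing, which only means a snoop that could have been avoided is performed anyway.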

  17. Roadmap • Conventional Coherence • Program Behavior: Region Miss Frequency • RegionScout • Evaluation • Summary

  18. Evaluation Overview • Methodology • Filter rates • Practical Filters can capture many Region Misses • Interconnect bandwidth reduction

  19. Methodology • In-House simulator based on Simplescalar • Execution driven • All instructions simulated – MIPS like ISA • System calls faked by passing them to host OS • Synchronization using load-linked/store-conditional • Simple in-order processors • Memory requests complete instantaneously • MESI snoop coherence • 1 or 2 level memory hierarchy • WATTCH power models • SPLASH II benchmarks • Scientific workloads • Feasibility study

  20. Filter Rates • For a small CRH it is better to use large regions • Practical RegionScout filters capture a lot of the potential • [Plot: identified global region misses vs. CRH size]

  21. Bandwidth Reduction • Moderate bandwidth savings for SMP (15%-22%) • More so for CMP (>25%) • [Plot: messages vs. region size; panel label: CMP]

  22. Related Work • RegionScout: Technical Report, Dec. 2003 • Jetty: Moshovos, Memik, Falsafi, Choudhary, HPCA 2001 • PST: Ekman, Dahlgren, and Stenström, ISLPED 2002 • Coarse-Grain Coherence: Cantin, Lipasti, and Smith, ISCA 2005

  23. Summary • Exploit program behavior/optimize a frequent case • Many requests result in a global region miss • RegionScout • Practical filter mechanism • Dynamically detect would-be region misses • Avoid broadcasts • Save tag lookup power and interconnect bandwidth • Small structures • Layered extension over existing mechanisms • Invisible to programmer and the OS

  24. RegionScout and Directories • Different information • Directory: block-level sharing • RegionScout: region-level sharing • Could build a region-level directory • This work serves as motivation • Directories use precise information • RegionScout does not have to • Directories/Implementation • RegionScout can approximate a directory • If remote nodes sent sharing info as opposed to a single bit
