
Snoop Filtering and Coarse-Grain Memory Tracking

Snoop Filtering and Coarse-Grain Memory Tracking. Andreas Moshovos Univ. of Toronto/ECE Short Course at the University of Zaragoza, July 2009 Some slides by J. Zebchuk or the original paper authors. JETTY Snoop-Filtering for Reduced Power in SMP Servers.


Presentation Transcript


  1. Snoop Filtering and Coarse-Grain Memory Tracking Andreas Moshovos Univ. of Toronto/ECE Short Course at the University of Zaragoza, July 2009 Some slides by J. Zebchuk or the original paper authors

  2. JETTY Snoop-Filtering for Reduced Power in SMP Servers Babak Falsafi, ECE, Carnegie Mellon Gokhan Memik, ECE, Northwestern Alok Choudhary, ECE, Northwestern Int’l Symposium on High-Performance Computer Architecture (HPCA), 2001 Andreas Moshovos

  3. Power is Becoming Important • Architecture is a science of tradeoffs • Thus far: Performance vs. Cost vs. Complexity • Today: also vs. Power • Where? • Mobile Devices • Desktops/Servers (our focus)

  4. Power-Aware Servers • Revisit the design of SMP servers • 2 or more CPUs per machine • Snoop coherence-based • Why? • File, web, databases, your typical desktop • Cost effective too • This work - a first step: Power-Aware Snoopy-Coherence

  5. Power-Aware Snoop-Coherence • Conventional • All L2 caches snoop all memory traffic • Power expended by all on any memory access • Jetty-Enhanced • Tiny structure on L2-backside • Filters most “would-be-misses” • Less power expended on most snoop misses • No changes to protocol necessary • No performance loss

  6. Roadmap • Why Power is a Concern for Servers? • Snoopy-Coherence Basics • An Opportunity for Reducing Power • JETTY • Results • Summary

  7. Why is Power Important? • Power could ultimately limit performance • Power demands have been increasing • Must deliver energy to and on chip, and dissipate heat • Limits: amount of resources & frequency; feasibility • Is cooling a solution? Cost & integration • Reducing power demands is much more convenient

  8. What can be done? • Redesign Circuits • Clock Gating and Frequency Scaling • A lot has been done thus far • Still active • Rethink Architectural Decisions • Orthogonal to others Reduce Power Under Performance Constraints

  9. The “Silver Bullet” Solution • Good if there was one • However, till one is found... • Look at all structures • Rethink Design • Propose Power-Optimized versions • This is what we’re doing for performance

  10. Snoopy Cache Coherence (diagram: CPU cores with L1/L2 caches on a shared bus to main memory) • All L2 tags see all bus accesses • Intervene when necessary

  11. How About Power? (diagram: one CPU's bus request is snooped by every other L2 and misses) • All L2 tags see all bus accesses • Perf. & complexity: we have L2 tags, why not use them? • Power: all L2 tags consume power on all accesses

  12. JETTY: A Would-be Snoop-Miss Filter (diagram: a would-be snoop-miss answers "Not here!", a would-be snoop-hit answers "Don't know") • Detect most misses using fewer resources • Imprecise: may fail to filter a would-be miss • Never filters snoop-hits

  13. Potential for Savings Exist • Most Snoops miss • 91% AVG • Many L2 accesses are due to Snoop Misses • 55% AVG • Sizeable Potential Power Savings: • 20% - 50% of total L2 power

  14. Exclude-Jetty • Subset of what is not cached • How? Cache recent snoop-misses locally

  15. Exclude-Jetty • Subset of what you don’t have • Works well for producer-consumer
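The idea "cache recent snoop-misses locally" can be sketched as a small direct-mapped table. The entry count, indexing scheme, and method names below are illustrative assumptions, not the paper's exact design:

```python
# Sketch of an Exclude-Jetty: a small direct-mapped table of block
# addresses recently seen to MISS in the local L2. A hit in this table
# means the snoop is a guaranteed miss and can skip the L2 tag lookup.
class ExcludeJetty:
    def __init__(self, entries=64):
        self.entries = entries
        self.tags = [None] * entries  # block addresses known NOT to be cached

    def _index(self, block_addr):
        return block_addr % self.entries

    def filter_snoop(self, block_addr):
        """True if the snoop can be filtered (guaranteed L2 miss)."""
        return self.tags[self._index(block_addr)] == block_addr

    def record_miss(self, block_addr):
        """Called after an L2 snoop lookup that actually missed."""
        self.tags[self._index(block_addr)] = block_addr

    def on_local_allocate(self, block_addr):
        """Local cache fetched this block: it is no longer 'not here'."""
        i = self._index(block_addr)
        if self.tags[i] == block_addr:
            self.tags[i] = None
```

In a producer-consumer pattern the consumer repeatedly snoops the producer's blocks; after the first miss is recorded, subsequent snoops to the same block are filtered without touching the L2 tags.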

  16. Include-Jetty • Superset of what is cached • How? Well...

  17. Include-Jetty • Address hashed by f( ), g( ), h( ) into bit vectors 0, 1, 2 • Not-Cached: any zero bit • May be Cached: all bits set • Later I was told this is a Bloom filter…
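The bit vectors above behave exactly like a Bloom filter. A minimal sketch, assuming simple multiplicative hashes in place of the paper's actual f, g, h and with illustrative vector sizes:

```python
# Include-Jetty as a Bloom filter: each cached block sets one bit per
# vector. A snoop that finds ANY zero bit is a guaranteed miss; if all
# bits are set the block MAY be cached and a real lookup is needed.
class IncludeJetty:
    def __init__(self, bits_per_vector=64, num_vectors=3):
        self.size = bits_per_vector
        self.vectors = [[0] * bits_per_vector for _ in range(num_vectors)]

    def _indices(self, block_addr):
        # One simple hash per bit vector (stand-ins for f, g, h)
        for k in range(len(self.vectors)):
            yield k, (block_addr * (2 * k + 3) + k) % self.size

    def insert(self, block_addr):          # block allocated in L2
        for k, i in self._indices(block_addr):
            self.vectors[k][i] = 1

    def may_be_cached(self, block_addr):   # snoop lookup
        return all(self.vectors[k][i] for k, i in self._indices(block_addr))
```

Note that plain bit vectors cannot handle evictions: a bit may be shared by several blocks, so it cannot safely be cleared when one of them leaves the cache.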

  18. Include-Jetty • Superset of what you have • Partial overlapping indexes worked better • This is a counting Bloom filter: L-CBF: A Low Power, Fast Counting Bloom Filter Implementation, Elham Safi, Andreas Moshovos and Andreas Veneris, in Proc. Annual International Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006.
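A counting Bloom filter replaces each bit with a small counter, so evictions can decrement what allocations incremented. A sketch with illustrative sizes and hashes (the L-CBF paper is about the circuit-level implementation, not this functional model):

```python
# Counting Bloom filter: each hash bucket holds a counter instead of a
# bit, so cache evictions can undo allocations. A zero counter in any
# bucket still guarantees the block is not cached.
class CountingBloomFilter:
    def __init__(self, size=256, num_hashes=3):
        self.size = size
        self.counts = [[0] * size for _ in range(num_hashes)]

    def _indices(self, block_addr):
        for k in range(len(self.counts)):
            yield k, (block_addr * (2 * k + 3) + k) % self.size

    def allocate(self, block_addr):   # block enters the cache
        for k, i in self._indices(block_addr):
            self.counts[k][i] += 1

    def evict(self, block_addr):      # block leaves the cache
        for k, i in self._indices(block_addr):
            self.counts[k][i] -= 1

    def may_be_cached(self, block_addr):
        return all(self.counts[k][i] > 0 for k, i in self._indices(block_addr))
```

Aliasing between blocks can keep counters non-zero after an eviction (a false positive, costing only a wasted lookup), but a counter never reaches zero while a hashed-in block is still cached, so hits are never filtered.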

  19. Hybrid-Jetty • Some cases Exclude-J works well • Some other Include-J is better • Combine • Access in parallel on snoop • Allocation • IJ always • If IJ fails to filter then to EJ • EJ coverage increases

  20. Latency? • Jetty may increase snoop-response time • Can only be determined on a design by design basis • Largest Jetty: • Five 32x32 bit register files

  21. Results • Used SPLASH-II • Scientific applications • “Large” datasets, e.g., 4–80 MB of main memory allocated • Access counts: 60M–1.7B • 4-way SMP, MOESI • 1M direct-mapped L2, 64B blocks with 32B subblocks • 32K direct-mapped L1, 32B blocks • Coverage & power (analytical model)

  22. Coverage: Hybrid-Jetty • Can capture 74% of all snoop-misses (chart: coverage per benchmark, higher is better)

  23. Power-Savings • 28% of overall L2 power saved (chart: savings per benchmark, higher is better)

  24. Summary • Power is becoming important • Performance, Reliability and Feasibility • Unique Opportunities Exist for Servers • JETTY: Filter Snoops that would miss • 74% of all snoops • 28% of L2 power saved • No protocol changes • No performance loss

  25. Power efficient cache coherence C. Saldanha, M. Lipasti Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001.

  26. Serial Snooping • Avoids speculative transmission of snoop packets • Check the nearest neighbor first • Data supplied with minimum latency and power

  27. TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors Magnus Ekman, *Fredrik Dahlgren, and Per Stenström Chalmers University of Technology, *Ericsson Mobile Platforms Int’l Symposium on Low Power Electronics and Design (ISLPED), Aug. 2002

  28. Page Sharing Tables • On a snoop, the requesting node gets a page-level sharing vector • If a PST entry is evicted, the whole page must be evicted • A paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs

  29. RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence Andreas Moshovos moshovos@eecg.toronto.edu Int’l Symposium on Computer Architecture (ISCA), 2005

  30. Improving Snoop Coherence • Conventional considerations: complexity and correctness, NOT power/bandwidth • Can we (1) reduce power/bandwidth and (2) still leverage snoop coherence? • Snooping remains attractive: simple, design re-use • Yes: exploit program behavior to dynamically identify requests that do not need snooping

  31. RegionScout: Avoid Some Snoops • Frequent case: non-sharing even at a coarse level/Region • RegionScout: dynamically identify non-shared Regions • First request to a Region identifies it as not shared • Subsequent requests do not need to be broadcast • Uses imprecise information • Small structures • Layered on top of conventional coherence • No additional constraints

  32. Roadmap • Conventional Coherence: • The need for power-aware designs • Potential: Program Behavior • RegionScout: What and How • Implementation • Evaluation • Summary

  33. Coherence Basics • Given a request for memory block X (address) • Detect where its current value resides (diagram: all nodes snoop; one reports a snoop hit)

  34. Conventional Coherence is not Power-Aware/Bandwidth-Effective • All L2 tags see all accesses • Perf. & complexity: we have L2 tags, why not use them? • Power: all L2 tags consume power on all accesses • Bandwidth: all coherent requests are broadcast

  35. RegionScout Motivation: Sharing is Coarse • Region: a large contiguous memory area, power-of-2 size • CPU X asks for a data block in Region R • Often no one else has X, and no one else has any block in R • RegionScout exploits this behavior as a layered extension over snoop coherence (diagram: typical memory-space snapshot, addresses colored by owner(s))

  36. Optimization Opportunities • Power and bandwidth • Originating node: avoid asking others • Remote node: avoid tag lookup

  37. Potential: Region Miss Frequency (chart: global region misses as % of all requests vs. region size) • Even with a 16K Region, ~45% of requests miss in all remote nodes

  38. RegionScout at Work: Non-Shared Region Discovery • First request detects a non-shared region: each remote node reports a Region Miss, yielding a Global Region Miss • Each node records Non-Shared Regions and Locally Cached Regions

  39. RegionScout at Work: Avoiding Snoops • A subsequent request to a recorded non-shared region avoids the snoop broadcast

  40. RegionScout is Self-Correcting • A request from another node invalidates the non-shared record

  41. Implementation: Requirements • Requesting node provides an address, split as Region Tag | offset, with lg(Region Size) offset bits • At the originating node (from CPU): Have I discovered that this region is not shared? • At remote nodes (from the interconnect): Do I have a block in the region?

  42. Remembering Non-Shared Regions • Non-Shared Region Table: records non-shared regions (Region Tag + valid bit per entry) • Lookup by the Region portion prior to issuing a request • Snoop incoming requests and invalidate on a match • Few entries: 16×4 in most experiments
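The Non-Shared Region Table can be sketched as follows. The region size, table geometry, and method names are illustrative assumptions:

```python
# Non-Shared Region Table (NSRT) sketch: a small table of regions this
# node has discovered to be non-shared, consulted before broadcasting.
REGION_BITS = 14   # 16 KB regions: the low 14 bits are the offset

def region_of(addr):
    return addr >> REGION_BITS   # Region Tag = address minus the offset bits

class NSRT:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries   # region tags; None = invalid

    def _index(self, region):
        return region % self.entries

    def is_non_shared(self, region):
        """Before issuing a request: can we skip the snoop broadcast?"""
        return self.table[self._index(region)] == region

    def record_global_region_miss(self, region):
        """All remote nodes reported a region miss for this region."""
        self.table[self._index(region)] = region

    def snoop_invalidate(self, region):
        """Another node requested a block in this region (self-correction)."""
        i = self._index(region)
        if self.table[i] == region:
            self.table[i] = None
```

The `snoop_invalidate` path is what makes the scheme self-correcting: a stale non-shared record is dropped as soon as any other node touches the region, so correctness never depends on the table's contents.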

  43. What Regions are Locally Cached? • If we had as many counters as regions: • Block allocation: counter[region]++ • Block eviction: counter[region]-- • Region cached only if counter[region] is non-zero • Not practical: e.g., 16KB regions and 4GB memory → 256K counters

  44. What Regions are Locally Cached? • Use few counters, which is imprecise: records a superset of locally cached Regions • False positives: lost opportunity, correctness preserved • Cached Region Hash: the Region Tag is hashed to index a counter (+ on block allocation, - on block eviction) with a P-bit (1 if counter non-zero, used for lookups) • Few entries, e.g., 256
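The Cached Region Hash can be sketched as a hashed array of counters. The region size, entry count, and names below are illustrative assumptions:

```python
# Cached Region Hash (CRH) sketch: a few counters, indexed by hashing
# the region number, track a SUPERSET of locally cached regions.
REGION_BITS = 14   # 16 KB regions (illustrative)

class CachedRegionHash:
    def __init__(self, entries=256):
        self.entries = entries
        self.counts = [0] * entries

    def _index(self, block_addr):
        # Hash the region number; distinct regions may alias
        return (block_addr >> REGION_BITS) % self.entries

    def on_block_allocate(self, block_addr):
        self.counts[self._index(block_addr)] += 1

    def on_block_evict(self, block_addr):
        self.counts[self._index(block_addr)] -= 1

    def region_may_be_cached(self, block_addr):
        """On a remote snoop: False means it is safe to skip the tag
        lookup. Aliasing can yield True for an uncached region (a false
        positive), costing a lookup but never breaking correctness."""
        return self.counts[self._index(block_addr)] > 0
```

The P-bit in the slide corresponds to `counter > 0` here: hardware keeps it precomputed so the lookup reads a single bit instead of comparing a counter against zero.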

  45. Roadmap • Conventional Coherence • Program Behavior: Region Miss Frequency • RegionScout • Evaluation • Summary

  46. Evaluation Overview • Methodology • Filter rates • Practical Filters can capture many Region Misses • Interconnect bandwidth reduction

  47. Methodology • In-house simulator based on SimpleScalar • Execution driven • All instructions simulated, MIPS-like ISA • System calls faked by passing them to the host OS • Synchronization using load-linked/store-conditional • Simple in-order processors • Memory requests complete instantaneously • MESI snoop coherence • 1 or 2 level memory hierarchy • WATTCH power models • SPLASH II benchmarks • Scientific workloads • Feasibility study

  48. Filter Rates (chart: identified global region misses vs. CRH size, higher is better) • For a small CRH it is better to use large regions • Practical RegionScout filters capture much of the potential

  49. Bandwidth Reduction (chart: messages vs. region size, SMP and CMP, lower is better) • Moderate bandwidth savings for SMP (15%–22%) • More for CMP (>25%)

  50. Related Work • RegionScout: Technical Report, Dec. 2003 • Jetty: Moshovos, Memik, Falsafi, Choudhary, HPCA 2001 • PST: Ekman, Dahlgren, and Stenström, ISLPED 2002 • Coarse-Grain Coherence: Cantin, Lipasti and Smith, ISCA 2005
