
Region-Centric Memory Design

Region-Centric Memory Design. AENAO Research Group: Patrick Akl, M.A.Sc. Ioana Burcea, Ph.D. C. Myrto Papadopoulou, M.A.Sc. C. Elham Safi, Ph.D. C. Jason Zebchuk, M.A.Sc. C. Andreas Moshovos. {pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu


Presentation Transcript


  1. Region-Centric Memory Design AENAO Research Group Patrick Akl, M.A.Sc. Ioana Burcea, Ph.D. C. Myrto Papadopoulou, M.A.Sc. C. Elham Safi, Ph.D. C. Jason Zebchuk, M.A.Sc. C. Andreas Moshovos {pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

  2. Future On-Chip Caches: Just Larger? • Observe and Exploit Memory Access Behavior at a Coarse Grain [Diagram labels: CPUs with I$ and D$, interconnect, 10s – 100s of MB, Main Memory] Aenao Group/Toronto

  3. Conventional Block-Centric Memory Hierarchy • “Small” Blocks • Performance and Bandwidth • Several optimizations exist • Big picture is lost [Diagram label: Conventional Fine-Grain Tracking]

  4. “Big Picture” View: Supplemental Coarse-Grain Tracking • Region: 2^n-sized, aligned memory area • Concept already in use: TLBs • Patterns Emerge in Space / Time • Exploit for performance & power • Expose to software

  5. This Presentation • Examples of Coarse-Grain Optimizations • Snoop Coherence • Thread-level speculation disambiguation • Region-Centric Memory Design • RegionTracker Cache • Snoop Coherence Revisited • Current Activities • Coherence Delegation • Predictor Virtualization

  6. An Example: Snoop Coherence • Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth • Can we: (1) Reduce Power/Bandwidth (2) Leverage snoop coherence? • Remains Attractive: Simple / Design Re-use • Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping [Diagram labels: three CPUs with I$ and D$, interconnect, Main Memory]

  7. Coherence Basics • Given request for memory block X (address) • Detect where current value resides [Diagram labels: three CPUs, X, snoop, snoop, hit, Main Memory]

  8. Conventional Coherence: not Power-Aware/Bandwidth-Effective • All L2 tags see all accesses • Perf. & Complexity: Have L2 tags, why not use them? • Power: All L2 tags consume power on all accesses • Bandwidth: broadcast all coherent requests [Diagram labels: three CPUs, L2, miss, miss, Main Memory]

  9. RegionScout Motivation: Sharing is Coarse • Region: large contiguous memory area, power-of-2 size • CPU X asks for data block in region R • No one else has X • No one else has any block in R • RegionScout Exploits this Behavior • Layered Extension over Snoop Coherence [Figure: Typical Memory Space Snapshot, addresses colored by owner(s)]

  10. Optimization Opportunities • Power and Bandwidth • Originating node: avoid asking others • Remote node: avoid tag lookup [Diagram labels: three CPUs with I$ and D$, SWITCH, Memory]

  11. Potential: Region Miss Frequency • Even with a 16K Region, ~45% of requests miss in all remote nodes [Plot: Global Region Misses as % of all requests vs. Region Size]

  12. RegionScout at Work: Non-Shared Region Discovery • First request detects a non-shared region [Diagram labels: three CPUs, steps 1 to 3, Region Miss, Region Miss, Global Region Miss, Main Memory; Record: Non-Shared Regions; Record: Locally Cached Regions]

  13. RegionScout at Work: Avoiding Snoops • Subsequent request avoids snoops [Diagram labels: three CPUs, steps 1 and 2, Global Region Miss, Main Memory; Record: Non-Shared Regions; Record: Locally Cached Regions]

  14. RegionScout is Self-Correcting • Request from another node invalidates non-shared record [Diagram labels: three CPUs, steps 1 and 2, Main Memory; Record: Non-Shared Regions; Record: Locally Cached Regions]

  15. Implementation: Requirements • Requesting Node provides address • At Originating Node, from CPU: Have I discovered that this region is not shared? • At Remote Nodes, from Interconnect: Do I have a block in the region? [Diagram: address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
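The address split on this slide can be sketched in a few lines of Python. This is only an illustration: the 16K region size is borrowed from the other slides, and the function name `split_address` is hypothetical.

```python
REGION_SIZE = 16 * 1024  # bytes; assumed value, must be a power of two


def split_address(addr, region_size=REGION_SIZE):
    """Return (region_tag, offset) for a byte address."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region size must be a power of two")
    shift = region_size.bit_length() - 1  # lg(region size)
    return addr >> shift, addr & (region_size - 1)
```

Because regions are aligned and power-of-two sized, the region tag is just the upper address bits, so no division or table lookup is needed to find it.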

  16. Remembering Non-Shared Regions • Non-Shared Region Table: records non-shared regions • Lookup by Region portion prior to issuing a request • Snoop requests and invalidate • Few entries: 16x4 in most experiments [Diagram: address split into Region Tag and offset; Region Tag looks up the Non-Shared Region Table; each entry has a valid bit]
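A rough software model of such a table is sketched below. It reads "16x4" as 16 sets of 4 ways, which is an assumption, and the modulo index and oldest-first replacement are illustrative choices that the slide does not specify. The class and method names are hypothetical.

```python
class NonSharedRegionTable:
    """Small set-associative table of region tags known to be non-shared."""

    def __init__(self, num_sets=16, ways=4):  # assumed reading of "16x4"
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # oldest entry first

    def _set(self, region_tag):
        return self.sets[region_tag % self.num_sets]  # assumed index function

    def record(self, region_tag):
        """Remember a region after a global region miss."""
        entries = self._set(region_tag)
        if region_tag in entries:
            entries.remove(region_tag)
        elif len(entries) == self.ways:
            entries.pop(0)  # set is full: drop the oldest entry
        entries.append(region_tag)

    def is_non_shared(self, region_tag):
        """Checked before issuing a request; True means the snoop can be skipped."""
        return region_tag in self._set(region_tag)

    def snoop(self, region_tag):
        """Another node requested a block in this region: invalidate the record."""
        entries = self._set(region_tag)
        if region_tag in entries:
            entries.remove(region_tag)
```

Losing an entry to replacement only costs a missed filtering opportunity, since the next request to that region is simply broadcast as usual.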

  17. What Regions are Locally Cached? • If we had as many counters as regions: • Block Allocation: counter[region]++ • Block Eviction: counter[region]-- • Region cached only if counter[region] non-zero • Not Practical: • E.g., 16K Regions and 4G Memory → 256K counters [Diagram: address split into Region Tag and offset; Region Tag selects a counter]
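The idealized scheme, and the arithmetic behind the 256K figure, can be written out as follows. This is a sketch of the impractical one-counter-per-region version only; `RegionCounters` is a hypothetical name.

```python
from collections import Counter

REGION_SIZE = 16 * 1024   # 16K regions, as on the slide
MEMORY_SIZE = 4 * 2**30   # 4G of memory, as on the slide
NUM_REGIONS = MEMORY_SIZE // REGION_SIZE  # 2**32 / 2**14 = 2**18 = 256K


class RegionCounters:
    """One counter per region: exact, but needs NUM_REGIONS counters."""

    def __init__(self, region_size=REGION_SIZE):
        self.shift = region_size.bit_length() - 1
        self.count = Counter()

    def allocate(self, block_addr):
        self.count[block_addr >> self.shift] += 1

    def evict(self, block_addr):
        region = block_addr >> self.shift
        if self.count[region] == 0:
            raise ValueError("evicting a block that was never allocated")
        self.count[region] -= 1

    def is_cached(self, region):
        return self.count[region] > 0
```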

  18. What Regions are Locally Cached? • Imprecise: • Records a superset of locally cached Regions • False positives: lost opportunity, correctness preserved • Small: e.g., 256 entries for 1M cache • Power-Optimized structures described in the paper [Diagram: Region Tag passes through hash() to select a counter]
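The hashed variant can be sketched the same way. The modulo hash below is an assumption made for illustration; the slide only says hash(). The point of the sketch is that aliasing can produce false positives but never false negatives.

```python
class CachedRegionHash:
    """Hashed region counters: tracks a superset of locally cached regions."""

    def __init__(self, entries=256, region_size=16 * 1024):
        self.entries = entries
        self.shift = region_size.bit_length() - 1
        self.counters = [0] * entries

    def _index(self, addr):
        return (addr >> self.shift) % self.entries  # assumed hash function

    def allocate(self, block_addr):
        self.counters[self._index(block_addr)] += 1

    def evict(self, block_addr):
        self.counters[self._index(block_addr)] -= 1

    def may_be_cached(self, addr):
        """False is definite; True may be caused by an aliasing region."""
        return self.counters[self._index(addr)] > 0
```

For example, with 256 entries, regions 0 and 256 share a counter, so caching a block of region 0 makes region 256 look cached too. The node then answers a snoop it could have skipped, which wastes energy but stays correct.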

  19. LFSR-Based Implementation • Linear-Feedback Shift Register Array • Increment/Decrement/Is Zero? • 130nm commercial technology • ISLPED ’06 • Faster: 1.6x to 3.7x • More Energy Efficient: 1.4x to 2.3x • But Area: 3.2x [Diagram: Region Tag passes through hash() to select an LFSR, followed by a Zero Detector]
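To show how a shift register can stand in for an up/down counter, here is a minimal 4-bit sketch. The width, the feedback taps (polynomial x^4 + x^3 + 1) and the seed are assumptions chosen for illustration, not the parameters of the ISLPED ’06 design. Incrementing steps forward through the LFSR sequence, decrementing steps backward, and "zero" is the seed state.

```python
class LfsrCounter:
    """Up/down counter whose states follow a 4-bit LFSR sequence."""

    SEED = 0b0001  # assumed state that represents a count of zero

    def __init__(self):
        self.state = self.SEED

    def increment(self):
        # Feedback is the XOR of the two most significant bits.
        feedback = ((self.state >> 3) ^ (self.state >> 2)) & 1
        self.state = ((self.state << 1) | feedback) & 0xF

    def decrement(self):
        # Recover the bit that increment() shifted out, then shift back.
        lost = ((self.state >> 3) ^ self.state) & 1
        self.state = (self.state >> 1) | (lost << 3)

    def is_zero(self):
        return self.state == self.SEED
```

This particular register cycles through 15 states, so it can represent counts 0 to 14 and must not be incremented past that. The counter value is never decoded; only the zero test is needed, which is a single comparison against the seed.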

  20. Filter Rates: SPLASH-II • Jason Cantin@Wisconsin studied commercial workloads: 40% filter rate [Plot: Identified Global Region Misses vs. CRH Size]

  21. Region-Centric Disambiguation • Joint work w/ Greg Steffan and Mihai Burcea • Patrick Akl • Andreas Moshovos

  22. Speculative Parallelization Models • Thread level speculation • Transactional Memory • Need to Compare Addresses Across Code Pieces [Diagram: Original vs. Speculative Parallelization over time, with a Good Scenario and a Bad Scenario; labels: read a, read b, write a, write a]

  23. Ex #2: Region-Centric Disambiguation • Region-Centric vs. Conventional • Send digest at region level • Region-conflict • Send block-level info • Reduced bandwidth, potential for performance and power [Diagram: Task 1 and Task 2 over the Memory Space, shown for both schemes]
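One way to read the slide is as a two-step check: tasks first exchange region-level digests, and block-level addresses are compared only for regions that conflict. The sketch below follows that reading, which is an assumption, as are the 1KB region size and the function names.

```python
REGION_SIZE = 1024  # bytes; assumed


def region_digest(block_addrs, region_size=REGION_SIZE):
    """Coarse summary of an access set: the set of regions it touches."""
    shift = region_size.bit_length() - 1
    return {addr >> shift for addr in block_addrs}


def find_conflicts(writes_a, reads_b, region_size=REGION_SIZE):
    """Return block addresses written by task A and read by task B."""
    shift = region_size.bit_length() - 1
    suspect = region_digest(writes_a, region_size) & region_digest(reads_b, region_size)
    if not suspect:
        return set()  # digests alone prove independence; no block info sent
    # Region conflict: fall back to block-level comparison for those regions.
    blocks_a = {a for a in writes_a if a >> shift in suspect}
    blocks_b = {b for b in reads_b if b >> shift in suspect}
    return blocks_a & blocks_b
```

When tasks mostly touch disjoint regions, only the small digests travel between them, which is where the traffic reduction would come from.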

  24. How Much Traffic Can We Save? • TLS benchmarks from STAMPEDE group (G. Steffan) • Approximate timing model • Potential for traffic reduction by 38% [Plot: traffic reduction per benchmark]

  25. Exploiting Region-Level Information • Region Coherence Arrays • Cantin, Lipasti and Smith • RegionScout • Our work • Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols • Spatial Memory Prefetching • Leverages spatial memory patterns for prefetching with commercial workloads • Impetus Group at CMU • Stealth Prefetching • Cantin, Lipasti and Smith

  26. Coarse-Grain Techniques Today • Overhead • Storage: e.g., 60% of tags • Functionality: Restrict placement, Region Evictions • Loss of Information • Hard to justify for a commercial design [Diagram labels: CPU, I$, D$; Conventional Cache with TAGS and DATA; Auxiliary Tracking]

  27. Rethinking Cache Design • Can we provide a common substrate for all these optimizations? • Redesign caches: • Regions a first class citizen • RegionTracker Cache [Diagram labels: CPU, I$, D$; DATA, Dual-Grain TAGS, Embedded Tracking]

  28. RegionTracker Cache • Goals • Expose region behavior • Is region X cached? • Which blocks are? • Facilitate management at the region level • Evict/migrate region X • Do something with all blocks in X • Constraints: • Data movement only at the block level • No increase in area • No decrease in performance • Complexity • Associativity
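The queries listed under Goals can be illustrated with a small functional model that keeps one presence bit vector per region. This models only the interface (is region X cached, which blocks, evict region X); it says nothing about the hardware organization, and the class name, sizes and method names are hypothetical.

```python
class RegionView:
    """Functional model of region-level queries over block-level contents."""

    def __init__(self, region_size=1024, block_size=64):
        self.block_size = block_size
        self.blocks_per_region = region_size // block_size
        self.present = {}  # region tag -> bit vector of cached blocks

    def _locate(self, addr):
        block = addr // self.block_size
        return divmod(block, self.blocks_per_region)  # (region, block index)

    def fill(self, addr):
        region, index = self._locate(addr)
        self.present[region] = self.present.get(region, 0) | (1 << index)

    def evict_block(self, addr):
        region, index = self._locate(addr)
        mask = self.present.get(region, 0) & ~(1 << index)
        if mask:
            self.present[region] = mask
        else:
            self.present.pop(region, None)  # last block gone: region leaves

    def is_region_cached(self, region):
        return region in self.present

    def cached_blocks(self, region):
        mask = self.present.get(region, 0)
        return [i for i in range(self.blocks_per_region) if mask >> i & 1]

    def evict_region(self, region):
        """Drop every cached block of the region; returns their indices."""
        blocks = self.cached_blocks(region)
        self.present.pop(region, None)
        return blocks
```

Data still moves one block at a time through `fill` and `evict_block`, matching the constraint that data movement happens only at the block level.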

  29. Region-Based Caches • Start with conventional 16-way cache and replace tag array • Sector Caches • Hit rate suffers: 20% loss • Sector Pool Caches • High Associativity: 48-way for matching a 16-way cache • Decoupled-Sector Caches • No coarse-grain info • Replacements require searching • No previous design is adequate • RegionTracker: • Meets all requirements • But does not save as much tag resources

  30. Sector Cache • Reduced Area and Power • Increased miss-rates (2.5% - 96% for 1kB sectors) • Replacement? [Diagram labels: D-way Region Tags, RVA, D-way Data, Data Array]
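A toy model helps show why sector caches lose hits. The sketch below is direct-mapped for brevity, whereas the slide shows a D-way organization, and all sizes and names are assumptions. Each frame holds one sector tag plus a valid bit per block, so a new sector that maps to the same frame discards every block of the old one.

```python
class SectorCache:
    """Direct-mapped sector cache: one tag per sector, one valid bit per block."""

    def __init__(self, num_frames=4, sector_size=1024, block_size=64):
        self.num_frames = num_frames
        self.sector_size = sector_size
        self.block_size = block_size
        self.tags = [None] * num_frames
        self.valid = [0] * num_frames  # bit vector of valid blocks per frame

    def access(self, addr):
        """Return True on a hit; on a miss, bring the block in."""
        sector = addr // self.sector_size
        block = (addr % self.sector_size) // self.block_size
        frame = sector % self.num_frames
        if self.tags[frame] != sector:
            self.tags[frame] = sector
            self.valid[frame] = 0  # all blocks of the old sector are lost
        hit = bool(self.valid[frame] >> block & 1)
        self.valid[frame] |= 1 << block
        return hit
```

A short usage example: touching address 0, then address 4096 (sector 4, same frame), then address 0 again misses all three times, even though only two blocks were ever in use.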

  31. Sector Pool Cache • M > D • Requires highly associative cache to achieve same performance as RegionTracker (~48-way) [Diagram labels: M-way Region Tags, RVA, 1, DSR, D-way Data, Data Array]

  32. Decoupled-Sectored Cache • Has multiple block evictions • Requires scanning “status” array • No simple mechanism to avoid this • Does NOT expose region-level information

  33. RegionTracker • In practice L <= D • Decouple Data and Lookup organizations • Lower Associativity lookups with no hit-rate penalty • RegionTracker provides complete solution [Diagram labels: L-way Region Tags, RVA, 1, DSR, D-way Data, Data Array]

  34. RegionTracker Cache • Block and Region Lookups • Region Tag + Way Per Block • Evict Region Blocks Lazily • Simplify replacement and reduce area • Status per block + RVA set backpointer • Can be banked and partitioned [Diagram labels: L1 Data Array, L1 RVA, L1 ERB, L1 BST]

  35. Region-Aware Cache: Performance vs. Area • Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus • SimICS + SimFlex, Sampling, 2K Regions [Plot: performance vs. area]

  36. RegionTracker-RegionScout • One bit per Region tag: Known to be not shared • 1KB Regions, Commercial workloads • 512KB L2 private caches • Filter 41% of snoops at “Zero Cost” compared to conventional cache [Plot: Reduction in Broadcasts; label: BlockScout]

  37. Directory Optimizations • Base Architecture [Diagram labels: Core, L2 Tags, L2 Data, L3 Tags, L3 Data, Directory, DRAM]

  38. Coherence Delegation • Eliminate 3-hop overhead • Attract directory tracking to nodes [Diagram labels: Ideal Path, Requesting Node, Directory Lookup, Remote L2 containing data]

  39. Optimization Engines: Predictors • Predictor Virtualization [Diagram: many CPUs, each with L1-I and L1-D, connected through an Interconnect to L2 and Main Memory]

  40. Motivating Trends • Chip multiprocessors • Space dedicated to predictors × #processors • Larger predictor table • Increased performance • Memory hierarchies • Increased capacities • Use conventional memory hierarchies to store predictor information

  41. PV Architecture [Diagram labels: Optimization Engine, index, prediction, entry, Predictor Table]

  42. PV Architecture [Diagram labels: Optimization Engine, index, prediction, entry, Predictor Virtualization]

  43. PV Architecture [Diagram labels: Optimization Engine, index, prediction, entry, PVProxy, PVCache, MSHR, PVStart + index, L2, PVTable, Main Memory]
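The diagram suggests a small on-chip PVCache in front of a full predictor table (PVTable) that lives in the ordinary memory hierarchy at address PVStart + index. The sketch below models that idea only; the LRU policy, the write-back on eviction, the dictionary standing in for L2 and main memory, and all method names are assumptions.

```python
from collections import OrderedDict


class PVProxy:
    """Small predictor cache backed by a table stored in the memory hierarchy."""

    def __init__(self, memory, pv_start, cache_entries=8):
        self.memory = memory        # dict standing in for L2 / main memory
        self.pv_start = pv_start    # base address of the PVTable
        self.cache_entries = cache_entries
        self.cache = OrderedDict()  # PVCache, least recently used first
        self.misses = 0

    def _addr(self, index):
        return self.pv_start + index

    def lookup(self, index):
        """Return the prediction entry for index, fetching it on a miss."""
        if index in self.cache:
            self.cache.move_to_end(index)
            return self.cache[index]
        self.misses += 1
        entry = self.memory.get(self._addr(index))
        self._install(index, entry)
        return entry

    def update(self, index, entry):
        self._install(index, entry)

    def _install(self, index, entry):
        self.cache[index] = entry
        self.cache.move_to_end(index)
        if len(self.cache) > self.cache_entries:
            victim, value = self.cache.popitem(last=False)
            if value is not None:
                self.memory[self._addr(victim)] = value  # write back to PVTable
```

The optimization engine still sees an index-in, prediction-out interface; only the storage behind it changes, which is how the dedicated on-chip cost can shrink while the full table is kept.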

  44. Virtualized Spatial Memory Streaming • Original Prefetcher: Cost: 80KB • Virtualized Prefetcher: Cost: <1Kbyte • Nearly Identical Performance

  45. Region-Centric Memory Design AENAO Research Group Patrick Akl, M.A.Sc. C. Ioana Burcea, Ph.D. C. Myrto Papadopoulou, M.A.Sc. C. Elham Safi, Ph.D. C. Jason Zebchuk, M.A.Sc. C. Andreas Moshovos {pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

  46. Summary • Caches are getting larger • Time to look at the “big picture” • Region-Centric Memory Design • Expose region-level info • Allow management at the region-level • RegionScout • eliminate broadcasts for snoop coherence • Region-Centric Disambiguation • Reduce bandwidth for TLS or TM • Region-Aware Memory • “Same” area and performance as conventional + region info • Predictor Virtualization
