1 / 29

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy. Jason Zebchuk , Elham Safi, and Andreas Moshovos { zebchuk ,elham,moshovos}@eecg.toronto.edu AENAO Research Group Department of Electrical and Computer Engineering University of Toronto.

ossie
Download Presentation

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy Jason Zebchuk, Elham Safi, and Andreas Moshovos {zebchuk,elham,moshovos}@eecg.toronto.edu AENAO Research Group Department of Electrical and Computer Engineering University of Toronto

  2. Conventional Block Centric Cache Fine-Grain View of Memory • “Small” Blocks • Optimizes Bandwidth and Performance • Large L2/L3 caches especially L2 Cache Big Picture Lost A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  3. “Big Picture” View Coarse-Grain View of Memory • Region: 2n sized, aligned area of memory • Patterns and behavior exposed • Spatial locality • Exploit for performance/area/power L2 Cache A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  4. Coarse-Grain Framework Exploiting Coarse-Grain Patterns Circuit-Switched Coherence • Many existing coarse-grain optimizations • Add new structures to track coarse-grain information CPU Stealth Prefetching RegionScout Flexible Snooping L2 Cache Destination-Set Prediction Coarse-Grain Coherence Tracking Spatial Memory Streaming Hard to justify for a commercial design A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  5. Coarse-Grain Framework Exploiting Coarse-Grain Patterns CPU • Embed coarse-grain information in tag array • Support many different optimizations with less area overhead L2 Cache Adaptable optimization FRAMEWORK A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  6. RegionTracker Solution L2 Cache Manage blocks, but also track and manage regions L1 Data Array Data Blocks Tag Array Region Tracker L1 Block Requests L1 Region Probes L1 Region Responses Block Requests A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  7. RegionTracker Summary • Replace conventional tag array: • 4-core CMP with 8MB shared L2 cache • Within 1% of original performance • Up to 20% less tag area • Average 33% less energy consumption • Optimization Framework: • Stealth Prefetching: same performance, 36% less area • RegionScout: 2x more snoops avoided, no area overhead A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  8. Road Map • Introduction • Goals • Coarse-Grain Cache Designs • RegionTracker: A Tag Array Replacement • RegionTracker: An Optimization Framework • Conclusion A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  9. Goals • Conventional Tag Array Functionality • Identify data block location and state • Leave data array un-changed • Optimization Framework Functionality • Is Region X cached? • Which blocks of Region X are cached? Where? • Evict or migrate Region X A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  10. Large Block Size Coarse-Grain Cache Designs Tag Array Data Array • Increased BW, Decreased hit-rates Region X A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  11. Sector Cache Data Array • Decreased hit-rates Tag Array Region X A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  12. Sector Pool Cache Tag Array Data Array • High Associativity (2 - 4 times) Region X A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  13. Decoupled Sector Cache Tag Array Status Table Data Array • Region information not exposed • Region replacement requires scanning multiple entries Region X A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  14. Design Requirements • Small block size (64B) • Miss-rate does not increase • Lookup associativity does not increase • No additional access latency • (i.e., No scanning, no multiple block evictions) • Does not increase latency, area, or energy • Allows banking and interleaving • Fit in conventional tag array “envelope” A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  15. RegionTracker: A Tag Array Replacement L1 Data Array • 3 SRAM arrays, combined smaller than tag array Region Vector Array L1 L1 Evicted Region Buffer L1 Block Status Table A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  16. Common Case: Hit Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region 49 21 10 6 0 Address: Region Tag RVA Index Region Offset Block Offset Data Array + BST Index Block Offset 19 6 0 Region Vector Array (RVA) Block Status Table (BST) Region Tag …… status block15 block0 3 2 V way To Data Array 1 4 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  17. Worst Case (Rare): Region Miss 49 21 10 6 0 Address: Region Tag RVA Index Region Offset Block Offset Ptr Data Array + BST Index Block Offset 19 6 0 Region Vector Array (RVA) Block Status Table (BST) Evicted Region Buffer (ERB) No Match! Region Tag …… status Ptr block15 block0 3 2 V way A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  18. Methodology P P P P • Flexus simulator from CMU SimFlex group • Based on Simics full-system simulator • 4-core CMP modeled after Piranha • Private 32KB, 4-way set-associative L1 caches • Shared 8MB, 16-way set-associative L2 cache • 64-byte blocks • Miss-rates: Functional simulation of 2 billion instructions per core • Performance and Energy: Timing simulation using SMARTS sampling methodology • Area and Power: Full custom implementation on 130nm commercial technology • 9 commercial workloads: • WEB: SpecWEB on Apache and Zeus • OLTP: TPC-C on DB2 and Oracle • DSS: 5 TPC-H queries on DB2 D$ I$ D$ I$ D$ I$ D$ I$ Interconnect L2 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  19. Miss-Rates vs. Area Sector Cache (0.25, 1.26) • Sector Cache: 512KB sectors, SPC and RT: 1KB regions • Trade-offs comparable to conventional cache 48-way Relative Miss-Rate 52-way 14-way 15-way better Relative Tag Array Area A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  20. Performance & Energy Performance Energy better better Reduction in Tag Energy Normalized Execution Time • 12-way set-associative RegionTracker: 20% less area • Error bars: 95% confidence interval • Performance within 1%, with 33% tag energy reduction A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  21. Road Map • Introduction • Goals • Coarse-Grain Cache Designs • RegionTracker: A Tag Array Replacement • RegionTracker: An Optimization Framework • Conclusion A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  22. L1 Data Array L1 RVA L1 ERB L1 BST RegionTracker: An Optimization Framework Stealth Prefetching: Average 20% performance improvement Drop-in RegionTracker for 36% less area overhead RegionScout: In-depth analysis A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  23. Snoop Coherence: Common Case Read x Read x+1 CPU CPU CPU Read x+n miss miss Main Memory Many snoops are to non-shared regions A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  24. RegionScout CPU CPU CPU Read x Miss Miss Region Miss Region Miss Global Region Miss Main Memory Non-Shared Regions Locally Cached Regions Eliminate broadcasts for non-shared regions A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  25. Locally Cached Regions Non-Shared Regions Already provided by RVA Add 1 bit to each RVA entry RegionTracker Implementation • Minimal overhead to support RegionScout optimization • Still uses less area than conventional tag array A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  26. RegionTracker + RegionScout • 4 processors, 512KB L2 Caches • 1KB regions New optimization possible with RegionTracker BlockScout (4KB) better Reduction in Snoop Broadcasts Avoid 41% of Snoop Broadcasts, no area overhead compared to conventional tag array A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  27. Result Summary • Replace Conventional Tag Array: • 20% Less tag area • 33% Less tag energy • Within 1% of original performance • Coarse-Grain Optimization Framework: • 36% reduction in area overhead for Stealth Prefetching • Filter 41% of snoop broadcasts with no area overhead compared to conventional cache A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  28. Circuit-Switched Coherence CPU CPU Stealth Prefetching RegionScout L2 Cache Run-time Adaptive Cache Hierarchy Management via Reference Analysis L2 Cache Destination-Set Prediction Coarse-Grain Coherence Tracking Spatial Memory Streaming Conclusion Exploiting Coarse-Grain Patterns RegionTracker framework makes coarse-grain optimizations more attractive A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

  29. A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy Jason Zebchuk, Elham Safi, and Andreas Moshovos {zebchuk,elham,moshovos}@eecg.toronto.edu AENAO Research Group Department of Electrical and Computer Engineering University of Toronto

More Related