
Two Ways to Exploit Multi-Megabyte Caches


Presentation Transcript


  1. Two Ways to Exploit Multi-Megabyte Caches AENAO Research Group @ Toronto Kaveh Aasaraai Ioana Burcea Myrto Papadopoulou Elham Safi Jason Zebchuk Andreas Moshovos {aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

  2. Future Caches: Just Larger?
• "Big Picture" Management
• Store Metadata
[Figure: CPUs with I$ and D$ connected over an interconnect to a 10s–100s of MB cache and main memory]
Aenao Group/Toronto

  3. Conventional Block-Centric Cache
Fine-Grain View of Memory
• "Small" blocks
• Optimizes bandwidth and performance
• Large L2/L3 caches especially
[Figure: L2 cache tracking individual blocks; the "big picture" is lost]

  4. "Big Picture" View
Coarse-Grain View of Memory
• Region: a 2^n-sized, aligned area of memory
• Patterns and behavior exposed
• Spatial locality
• Exploit for performance/area/power
[Figure: L2 cache viewed at region granularity]
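The slide's region definition (a 2^n-sized, aligned area of memory) can be sketched as follows; a 1 KB region size is assumed, and the function names are illustrative, not from the talk:

```python
# Minimal sketch of region addressing, assuming 1 KB (2**10 byte) regions.
REGION_BITS = 10
REGION_SIZE = 1 << REGION_BITS  # 1024 bytes

def region_of(addr):
    """Region number the address falls in."""
    return addr >> REGION_BITS

def region_base(addr):
    """Aligned start address of the enclosing region."""
    return addr & ~(REGION_SIZE - 1)
```

Because regions are power-of-two sized and aligned, both operations reduce to a shift or a mask, which is why region tracking is cheap in hardware.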

  5. Coarse-Grain Framework
Exploiting Coarse-Grain Patterns
• Many existing coarse-grain optimizations: Circuit-Switched Coherence, Stealth Prefetching, RegionScout, Destination-Set Prediction, Coarse-Grain Coherence Tracking, Spatial Memory Streaming, Run-time Adaptive Cache Hierarchy Management via Reference Analysis
• Each adds new structures to track coarse-grain information; hard to justify for a commercial design
• Instead: embed coarse-grain information in the tag array
• Support many different optimizations with less area overhead: an adaptable optimization FRAMEWORK

  6. RegionTracker Solution
Manage blocks, but also track and manage regions
[Figure: L2 cache split into a data array (data blocks) and a RegionTracker tag structure; L1 block requests, region probes, and region responses flow between L1 and L2]

  7. RegionTracker Summary
• Replace conventional tag array (4-core CMP with 8MB shared L2 cache):
  • Within 1% of original performance
  • Up to 20% less tag area
  • Average 33% less energy consumption
• Optimization Framework:
  • Stealth Prefetching: same performance, 36% less area
  • RegionScout: 2x more snoops avoided, no area overhead

  8. Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion

  9. Goals
• Conventional Tag Array Functionality
  • Identify data block location and state
  • Leave data array unchanged
• Optimization Framework Functionality
  • Is Region X cached?
  • Which blocks of Region X are cached? Where?
  • Evict or migrate Region X
  • Easy to assign properties to each Region

  10. Coarse-Grain Cache Designs: Large Block Size
• Increased bandwidth, decreased hit-rates
[Figure: tag and data arrays with one large block spanning Region X]

  11. Sector Cache
• Decreased hit-rates
[Figure: sector-cache tag and data arrays covering Region X]

  12. Sector Pool Cache
• High associativity (2–4 times)
[Figure: sector-pool tag and data arrays covering Region X]

  13. Decoupled Sector Cache
• Region information not exposed
• Region replacement requires scanning multiple entries
[Figure: tag array, status table, and data array covering Region X]

  14. Design Requirements
• Small block size (64B)
• Miss-rate does not increase
• Lookup associativity does not increase
• No additional access latency (i.e., no scanning, no multiple block evictions)
• Does not increase latency, area, or energy
• Allows banking and interleaving
• Fits in the conventional tag array "envelope"

  15. RegionTracker: A Tag Array Replacement
• 3 SRAM arrays, combined smaller than the tag array: Region Vector Array, Evicted Region Buffer, Block Status Table
[Figure: L1s connected to the L2 data array through the three RegionTracker structures]

  16. Basic Structures
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
• Region Vector Array (RVA) and Block Status Table (BST)
• Address: maps to a specific RVA set and BST set
• RVA entry: covers multiple, consecutive BST sets
• BST entry: maps to one of four RVA sets
[Figure: an RVA entry holds a region tag, valid bit, and per-block status and way pointers for block0 … block15]

  17. Common Case: Hit
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
• Address (bits 49–0): Region Tag | RVA Index | Region Offset | Block Offset, split at bits 21, 10, and 6
• Data array address (bits 19–0): BST Index | Block Offset, split at bit 6
[Figure: the RVA lookup selects the region entry; the BST entry's status and way pointer drive the data array]
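The address split shown on the slide can be written out explicitly. This is a sketch assuming the slide's parameters (a 50-bit address, 64-byte blocks, 1 KB regions, with the RVA index in bits 20–10 and the BST index in bits 19–6); the function name is illustrative:

```python
# Split an address into the fields RegionTracker uses for a lookup.
BLOCK_BITS   = 6   # 64-byte blocks
REGION_BITS  = 10  # 1 KB regions -> 16 blocks per region
RVA_IDX_BITS = 11  # address bits 20..10
BST_IDX_BITS = 14  # address bits 19..6

def split_address(addr):
    block_offset  = addr & ((1 << BLOCK_BITS) - 1)
    region_offset = (addr >> BLOCK_BITS) & ((1 << (REGION_BITS - BLOCK_BITS)) - 1)
    rva_index     = (addr >> REGION_BITS) & ((1 << RVA_IDX_BITS) - 1)
    region_tag    = addr >> (REGION_BITS + RVA_IDX_BITS)  # bits 49..21
    bst_index     = (addr >> BLOCK_BITS) & ((1 << BST_IDX_BITS) - 1)
    return region_tag, rva_index, region_offset, bst_index, block_offset
```

Note that the BST index is simply the RVA index concatenated with the region offset, which is how one RVA entry covers multiple consecutive BST sets.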

  18. Worst Case (Rare): Region Miss
• Address (bits 49–0): Region Tag | RVA Index | Region Offset | Block Offset, split at bits 21, 10, and 6
• No match in the RVA: the Evicted Region Buffer (ERB) handles the outgoing region while the new one is installed
[Figure: the RVA miss allocates through the ERB; BST entries carry a pointer back to their region entry]

  19. Methodology
• Flexus simulator from the CMU SimFlex group, based on the Simics full-system simulator
• 4-core CMP modeled after Piranha:
  • Private 32KB, 4-way set-associative L1 caches
  • Shared 8MB, 16-way set-associative L2 cache
  • 64-byte blocks
• Miss-rates: functional simulation of 2 billion instructions per core
• Performance and energy: timing simulation using the SMARTS sampling methodology
• Area and power: full-custom implementation in a 130nm commercial technology
• 9 commercial workloads:
  • WEB: SpecWEB on Apache and Zeus
  • OLTP: TPC-C on DB2 and Oracle
  • DSS: 5 TPC-H queries on DB2
[Figure: 4 cores with private I$/D$ over an interconnect to the shared L2]

  20. Miss-Rates vs. Area
• Sector Cache: 512KB sectors; SPC and RegionTracker: 1KB regions
• Trade-offs comparable to a conventional cache
[Chart: relative miss-rate vs. relative tag array area (lower is better); Sector Cache at (0.25, 1.26); 48-way and 52-way vs. 14-way and 15-way configurations]

  21. Performance & Energy
• 12-way set-associative RegionTracker: 20% less area
• Performance within 1%, with 33% tag energy reduction
• Error bars: 95% confidence interval
[Charts: normalized execution time (lower is better) and reduction in tag energy (higher is better)]

  22. Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion

  23. RegionTracker: An Optimization Framework
• Stealth Prefetching: average 20% performance improvement; drop-in RegionTracker for 36% less area overhead
• RegionScout: in-depth analysis
[Figure: L1s, RVA, ERB, and BST in front of the L2 data array]

  24. Snoop Coherence: Common Case
• Many snoops are to non-shared regions
[Figure: one CPU's read of x misses and is served by main memory; the other CPUs, reading x+1 … x+n, are snooped even though they do not share the region]

  25. RegionScout
• Track locally cached regions and non-shared regions
• A region miss at every other node makes the request a global region miss
• Eliminate broadcasts for non-shared regions
[Figure: a read of x misses in all peers, so the region is recorded as non-shared and later requests skip the broadcast to main memory]

  26. RegionTracker Implementation
• Locally cached regions: already provided by the RVA
• Non-shared regions: add 1 bit to each RVA entry
• Minimal overhead to support the RegionScout optimization
• Still uses less area than a conventional tag array
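The filtering behavior slides 24–26 describe can be modeled in a few lines. This is an illustrative sketch, not the hardware design: the class and function names are assumptions, `cached_regions` stands in for what the RVA already tracks, and `non_shared` for the extra bit per RVA entry:

```python
# Toy model of RegionScout snoop filtering.
class Node:
    def __init__(self):
        self.cached_regions = set()  # regions with at least one cached block
        self.non_shared = set()      # regions last observed as globally missed

def snoop_needed(requester, peers, region):
    """Return False when the snoop broadcast can be filtered."""
    if region in requester.non_shared:
        return False  # known non-shared: skip the broadcast entirely
    # Broadcast; if every peer reports a region miss, remember that.
    if all(region not in p.cached_regions for p in peers):
        requester.non_shared.add(region)
    return True
```

The first miss to a non-shared region still broadcasts (that is how the global region miss is discovered), but every subsequent miss to the same region is filtered.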

  27. RegionTracker + RegionScout
• 4 processors, 512KB L2 caches, 1KB regions (BlockScout: 4KB)
• Avoids 41% of snoop broadcasts with no area overhead compared to a conventional tag array
[Chart: reduction in snoop broadcasts; higher is better]

  28. Result Summary
• Replace conventional tag array:
  • 20% less tag area
  • 33% less tag energy
  • Within 1% of original performance
• Coarse-grain optimization framework:
  • 36% reduction in area overhead for Stealth Prefetching
  • Filters 41% of snoop broadcasts with no area overhead compared to a conventional cache

  29. Predictor Virtualization
Ioana Burcea, joint work with Stephen Somogyi and Babak Falsafi

  30. Optimization Engines: Predictors
Predictor Virtualization
[Figure: many CPUs with L1-I/L1-D caches over an interconnect to a shared L2 and main memory; predictor state moves into the memory hierarchy]

  31. Motivating Trends
• Dedicating resources to predictors is hard to justify:
  • Chip multiprocessors: space dedicated to predictors x #processors
  • Larger predictor tables for increased performance
• Memory hierarchies offer the opportunity:
  • Increased capacity
  • How many apps really use the space?
• Use conventional memory hierarchies to store predictor information

  32. PV Architecture contd.
[Figure: the optimization engine sends a request to the predictor table and receives a prediction]

  33. PV Architecture contd.
[Figure: Predictor Virtualization interposed between the optimization engine's request and the prediction]

  34. PV Architecture contd.
• PVProxy: a PVCache with MSHRs, indexed from PVStart, sitting on the backside of the L1
• PVTable: backing predictor state stored in the L2 and main memory
[Figure: requests and predictions flow through the PVProxy; PVCache misses fetch PVTable entries from L2/main memory]

  35. To Virtualize Or Not to Virtualize?
Common Case:
1. Re-Use
2. Predictor Info Prefetching
[Figure: predictor traffic through the interconnect to L2/L3 and main memory is infrequent]

  36. To Virtualize or Not?
• Challenge: hit in the PVCache most of the time
• Will not work for all predictors out of the box; reuse is necessary
  • Intrinsic: easy to virtualize
  • Non-intrinsic: must be engineered, more so if the predictor needs to be fast to start with

  37. Will There Be Reuse?
• Intrinsic: multiple predictions per entry (we'll see an example)
• Can be engineered: group temporally correlated entries together in a cache block
[Figure: CPU with I$/D$ over an interconnect to L2/L3 and main memory]

  38. Spatial Memory Streaming
• Footprint: blocks accessed per memory region
• Predict that the next time, the footprint will be the same
• Handle: PC + offset within region
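The footprint idea can be sketched as a small lookup table. This is a toy model under assumed parameters (32-block spatial regions); `FootprintTable` and its methods are illustrative names, not from the SMS design:

```python
# Toy model of SMS footprints: learn which blocks of a region were touched,
# keyed by the handle (PC, offset of the trigger access within the region).
REGION_BLOCKS = 32

class FootprintTable:
    def __init__(self):
        self.patterns = {}  # (pc, trigger offset) -> 32-bit footprint

    def record(self, pc, offsets):
        """Learn the footprint of one spatial generation (first offset = trigger)."""
        bits = 0
        for off in offsets:
            bits |= 1 << off
        self.patterns[(pc, offsets[0])] = bits

    def predict(self, pc, trigger_offset):
        """Footprint predicted for a trigger access, or 0 if unknown."""
        return self.patterns.get((pc, trigger_offset), 0)
```

Each entry is one bit per block in the region, which is what makes the patterns compact enough to virtualize into the memory hierarchy.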

  39. Spatial Generations

  40. Virtualizing SMS
• Virtualize the pattern tables
[Figure: the detector records footprints (patterns); a trigger access consults the predictor, which issues prefetches]

  41. Virtualizing SMS
[Figure: a 1K-set virtual table cached in an 8-set PVCache; each 64-byte entry holds 11 (11-bit tag, 32-bit pattern) pairs plus unused bits]

  42. Packing Entries in One Cache Block
• Index: PC + offset within spatial group
  • PC → 16 bits
  • 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
  • 21-bit index total
• Pattern table: 1K sets → 10 bits to index the table → 11-bit tag
• Cache block: 64 bytes → 11 entries per cache block
• Pattern table: 1K sets, 11-way set-associative
[Figure: one cache block holding 11 (tag, pattern) entries, with the remaining bits unused]
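The packing arithmetic from the slide, written out as a check (only the variable names are mine; the numbers are the slide's):

```python
# How many (tag, pattern) entries fit in one 64-byte cache block.
INDEX_BITS   = 16 + 5    # PC (16 bits) + offset within spatial group (5 bits)
SET_BITS     = 10        # 1K-set pattern table
TAG_BITS     = INDEX_BITS - SET_BITS  # 11-bit tag
PATTERN_BITS = 32        # one bit per block in a 32-block spatial group

entry_bits        = TAG_BITS + PATTERN_BITS     # 43 bits per entry
block_bits        = 64 * 8                      # 64-byte cache block = 512 bits
entries_per_block = block_bits // entry_bits    # 11 entries fit
unused_bits       = block_bits - entries_per_block * entry_bits
```

Eleven 43-bit entries consume 473 of the 512 bits, leaving 39 bits unused, which matches the "unused" field in the slide's block layout.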

  43. Memory Address Calculation
• Index: PC (16 bits) + block offset (5 bits)
• 10 bits of the index select the set; appended with 000000 (the 64-byte block offset) and added to the PV Start Address to form the memory address
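A hedged sketch of this calculation. The slide does not say which 10 of the 21 index bits select the set; the low 10 bits are assumed here, and the function name is illustrative:

```python
# Map a (PC, block offset) handle to the virtualized table's memory address.
def pv_entry_address(pv_start, pc, block_offset):
    index = ((pc & 0xFFFF) << 5) | (block_offset & 0x1F)  # 21-bit index
    set_index = index & ((1 << 10) - 1)                   # 10 bits pick the set
    return pv_start + (set_index << 6)                    # append 000000 (64B sets)
```

Shifting the set index by 6 bits means each set occupies exactly one 64-byte cache block, so a single block fill brings in the whole 11-way set.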

  44. Simulation Infrastructure
• SimFlex (CMU Impetus): full-system simulator based on Simics
• Base processor configuration:
  • 8-wide OoO, 256-entry ROB / 64-entry LSQ
  • L1D/L1I 64KB 4-way set-associative
  • UL2 8MB 16-way set-associative
• Commercial workloads:
  • TPC-C: DB2 and Oracle
  • TPC-H: Query 1, Query 2, Query 16, Query 17
  • Web: Apache and Zeus

  45. SMS – Performance Potential
[Chart: performance potential of SMS; higher is better]

  46. Virtualized Spatial Memory Streaming
• Original prefetcher cost: 60KB; virtualized prefetcher cost: <1KB
• Nearly identical performance
[Chart: higher is better]

  47. Impact of Virtualization on L2 Misses

  48. Impact of Virtualization on L2 Requests

  49. Coarse-Grain Tracking Jason Zebchuk
