
Two Ways to Exploit Multi-Megabyte Caches


Presentation Transcript


  1. Two Ways to Exploit Multi-Megabyte Caches AENAO Research Group @ Toronto Kaveh Aasaraai Ioana Burcea Myrto Papadopoulou Elham Safi Jason Zebchuk Andreas Moshovos {aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

  2. Future Caches: Just Larger?
• "Big Picture" Management
• Store Metadata
[Figure: CPUs with I$ and D$ connected over an interconnect to a 10s–100s of MB cache and main memory]
Aenao Group/Toronto

  3. Conventional Block-Centric Cache
Fine-Grain View of Memory
• "Small" blocks
• Optimizes bandwidth and performance
• Large L2/L3 caches especially
[Figure: L2 cache tracking individual blocks; the "big picture" is lost]

  4. "Big Picture" View
Coarse-Grain View of Memory
• Region: a 2^n-sized, aligned area of memory
• Patterns and behavior exposed
• Spatial locality
• Exploit for performance/area/power
[Figure: L2 cache viewed at region granularity]
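The slide's region definition (a 2^n-sized, aligned area of memory) can be sketched as follows; a 1 KB region size is assumed, and the function names are illustrative, not from the talk:

```python
# Minimal sketch of region addressing, assuming 1 KB (2**10 byte) regions.
REGION_BITS = 10
REGION_SIZE = 1 << REGION_BITS  # 1024 bytes

def region_of(addr):
    """Region number the address falls in."""
    return addr >> REGION_BITS

def region_base(addr):
    """Aligned start address of the enclosing region."""
    return addr & ~(REGION_SIZE - 1)
```

Because regions are power-of-two sized and aligned, both operations reduce to a shift or a mask, which is why region tracking is cheap in hardware.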

  5. Coarse-Grain Framework
Exploiting Coarse-Grain Patterns
• Many existing coarse-grain optimizations: Circuit-Switched Coherence, Stealth Prefetching, RegionScout, Destination-Set Prediction, Coarse-Grain Coherence Tracking, Spatial Memory Streaming, Run-time Adaptive Cache Hierarchy Management via Reference Analysis
• Each adds new structures to track coarse-grain information; hard to justify for a commercial design
• Instead: embed coarse-grain information in the tag array
• Support many different optimizations with less area overhead: an adaptable optimization FRAMEWORK

  6. RegionTracker Solution
Manage blocks, but also track and manage regions
[Figure: L2 cache split into a data array (data blocks) and a RegionTracker tag structure; L1 block requests, region probes, and region responses flow between L1 and L2]

  7. RegionTracker Summary
• Replace conventional tag array (4-core CMP with 8MB shared L2 cache):
  • Within 1% of original performance
  • Up to 20% less tag area
  • Average 33% less energy consumption
• Optimization Framework:
  • Stealth Prefetching: same performance, 36% less area
  • RegionScout: 2x more snoops avoided, no area overhead

  8. Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion

  9. Goals
• Conventional Tag Array Functionality
  • Identify data block location and state
  • Leave data array unchanged
• Optimization Framework Functionality
  • Is Region X cached?
  • Which blocks of Region X are cached? Where?
  • Evict or migrate Region X
  • Easy to assign properties to each Region

  10. Coarse-Grain Cache Designs: Large Block Size
• Increased bandwidth, decreased hit-rates
[Figure: tag and data arrays with one large block spanning Region X]

  11. Sector Cache
• Decreased hit-rates
[Figure: sector-cache tag and data arrays covering Region X]

  12. Sector Pool Cache
• High associativity (2–4 times)
[Figure: sector-pool tag and data arrays covering Region X]

  13. Decoupled Sector Cache
• Region information not exposed
• Region replacement requires scanning multiple entries
[Figure: tag array, status table, and data array covering Region X]

  14. Design Requirements
• Small block size (64B)
• Miss-rate does not increase
• Lookup associativity does not increase
• No additional access latency (i.e., no scanning, no multiple block evictions)
• Does not increase latency, area, or energy
• Allows banking and interleaving
• Fits in the conventional tag array "envelope"

  15. RegionTracker: A Tag Array Replacement
• 3 SRAM arrays, combined smaller than the tag array: Region Vector Array, Evicted Region Buffer, Block Status Table
[Figure: L1s connected to the L2 data array through the three RegionTracker structures]

  16. Basic Structures
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
• Region Vector Array (RVA) and Block Status Table (BST)
• Address: maps to a specific RVA set and BST set
• RVA entry: covers multiple, consecutive BST sets
• BST entry: maps to one of four RVA sets
[Figure: an RVA entry holds a region tag, valid bit, and per-block status and way pointers for block0 … block15]

  17. Common Case: Hit
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
• Address (bits 49–0): Region Tag | RVA Index | Region Offset | Block Offset, split at bits 21, 10, and 6
• Data array address (bits 19–0): BST Index | Block Offset, split at bit 6
[Figure: the RVA lookup selects the region entry; the BST entry's status and way pointer drive the data array]
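The address split shown on the slide can be written out explicitly. This is a sketch assuming the slide's parameters (a 50-bit address, 64-byte blocks, 1 KB regions, with the RVA index in bits 20–10 and the BST index in bits 19–6); the function name is illustrative:

```python
# Split an address into the fields RegionTracker uses for a lookup.
BLOCK_BITS   = 6   # 64-byte blocks
REGION_BITS  = 10  # 1 KB regions -> 16 blocks per region
RVA_IDX_BITS = 11  # address bits 20..10
BST_IDX_BITS = 14  # address bits 19..6

def split_address(addr):
    block_offset  = addr & ((1 << BLOCK_BITS) - 1)
    region_offset = (addr >> BLOCK_BITS) & ((1 << (REGION_BITS - BLOCK_BITS)) - 1)
    rva_index     = (addr >> REGION_BITS) & ((1 << RVA_IDX_BITS) - 1)
    region_tag    = addr >> (REGION_BITS + RVA_IDX_BITS)  # bits 49..21
    bst_index     = (addr >> BLOCK_BITS) & ((1 << BST_IDX_BITS) - 1)
    return region_tag, rva_index, region_offset, bst_index, block_offset
```

Note that the BST index is simply the RVA index concatenated with the region offset, which is how one RVA entry covers multiple consecutive BST sets.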

  18. Worst Case (Rare): Region Miss
• Address (bits 49–0): Region Tag | RVA Index | Region Offset | Block Offset, split at bits 21, 10, and 6
• No match in the RVA: the Evicted Region Buffer (ERB) handles the outgoing region while the new one is installed
[Figure: the RVA miss allocates through the ERB; BST entries carry a pointer back to their region entry]

  19. Methodology
• Flexus simulator from the CMU SimFlex group, based on the Simics full-system simulator
• 4-core CMP modeled after Piranha:
  • Private 32KB, 4-way set-associative L1 caches
  • Shared 8MB, 16-way set-associative L2 cache
  • 64-byte blocks
• Miss-rates: functional simulation of 2 billion instructions per core
• Performance and energy: timing simulation using the SMARTS sampling methodology
• Area and power: full-custom implementation in a 130nm commercial technology
• 9 commercial workloads:
  • WEB: SpecWEB on Apache and Zeus
  • OLTP: TPC-C on DB2 and Oracle
  • DSS: 5 TPC-H queries on DB2
[Figure: 4 cores with private I$/D$ over an interconnect to the shared L2]

  20. Miss-Rates vs. Area
• Sector Cache: 512KB sectors; SPC and RegionTracker: 1KB regions
• Trade-offs comparable to a conventional cache
[Chart: relative miss-rate vs. relative tag array area (lower is better); Sector Cache at (0.25, 1.26); 48-way and 52-way vs. 14-way and 15-way configurations]

  21. Performance & Energy
• 12-way set-associative RegionTracker: 20% less area
• Performance within 1%, with 33% tag energy reduction
• Error bars: 95% confidence interval
[Charts: normalized execution time (lower is better) and reduction in tag energy (higher is better)]

  22. Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion

  23. RegionTracker: An Optimization Framework
• Stealth Prefetching: average 20% performance improvement; drop-in RegionTracker for 36% less area overhead
• RegionScout: in-depth analysis
[Figure: L1s, RVA, ERB, and BST in front of the L2 data array]

  24. Snoop Coherence: Common Case
• Many snoops are to non-shared regions
[Figure: one CPU's read of x misses and is served by main memory; the other CPUs, reading x+1 … x+n, are snooped even though they do not share the region]

  25. RegionScout
• Track locally cached regions and non-shared regions
• A region miss at every other node makes the request a global region miss
• Eliminate broadcasts for non-shared regions
[Figure: a read of x misses in all peers, so the region is recorded as non-shared and later requests skip the broadcast to main memory]

  26. RegionTracker Implementation
• Locally cached regions: already provided by the RVA
• Non-shared regions: add 1 bit to each RVA entry
• Minimal overhead to support the RegionScout optimization
• Still uses less area than a conventional tag array
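The filtering behavior slides 24–26 describe can be modeled in a few lines. This is an illustrative sketch, not the hardware design: the class and function names are assumptions, `cached_regions` stands in for what the RVA already tracks, and `non_shared` for the extra bit per RVA entry:

```python
# Toy model of RegionScout snoop filtering.
class Node:
    def __init__(self):
        self.cached_regions = set()  # regions with at least one cached block
        self.non_shared = set()      # regions last observed as globally missed

def snoop_needed(requester, peers, region):
    """Return False when the snoop broadcast can be filtered."""
    if region in requester.non_shared:
        return False  # known non-shared: skip the broadcast entirely
    # Broadcast; if every peer reports a region miss, remember that.
    if all(region not in p.cached_regions for p in peers):
        requester.non_shared.add(region)
    return True
```

The first miss to a non-shared region still broadcasts (that is how the global region miss is discovered), but every subsequent miss to the same region is filtered.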

  27. RegionTracker + RegionScout
• 4 processors, 512KB L2 caches, 1KB regions (BlockScout: 4KB)
• Avoids 41% of snoop broadcasts with no area overhead compared to a conventional tag array
[Chart: reduction in snoop broadcasts; higher is better]

  28. Result Summary
• Replace conventional tag array:
  • 20% less tag area
  • 33% less tag energy
  • Within 1% of original performance
• Coarse-grain optimization framework:
  • 36% reduction in area overhead for Stealth Prefetching
  • Filters 41% of snoop broadcasts with no area overhead compared to a conventional cache

  29. Predictor Virtualization
Ioana Burcea, joint work with Stephen Somogyi and Babak Falsafi

  30. Optimization Engines: Predictors
Predictor Virtualization
[Figure: many CPUs with L1-I/L1-D caches over an interconnect to a shared L2 and main memory; predictor state moves into the memory hierarchy]

  31. Motivating Trends
• Dedicating resources to predictors is hard to justify:
  • Chip multiprocessors: space dedicated to predictors x #processors
  • Larger predictor tables for increased performance
• Memory hierarchies offer the opportunity:
  • Increased capacity
  • How many apps really use the space?
• Use conventional memory hierarchies to store predictor information

  32. PV Architecture contd.
[Figure: the optimization engine sends a request to the predictor table and receives a prediction]

  33. PV Architecture contd.
[Figure: Predictor Virtualization interposed between the optimization engine's request and the prediction]

  34. PV Architecture contd.
• PVProxy: a PVCache with MSHRs, indexed from PVStart, sitting on the backside of the L1
• PVTable: backing predictor state stored in the L2 and main memory
[Figure: requests and predictions flow through the PVProxy; PVCache misses fetch PVTable entries from L2/main memory]

  35. To Virtualize Or Not to Virtualize?
Common Case:
1. Re-Use
2. Predictor Info Prefetching
[Figure: predictor traffic through the interconnect to L2/L3 and main memory is infrequent]

  36. To Virtualize or Not?
• Challenge: hit in the PVCache most of the time
• Will not work for all predictors out of the box; reuse is necessary
  • Intrinsic: easy to virtualize
  • Non-intrinsic: must be engineered, more so if the predictor needs to be fast to start with

  37. Will There Be Reuse?
• Intrinsic: multiple predictions per entry (we'll see an example)
• Can be engineered: group temporally correlated entries together in a cache block
[Figure: CPU with I$/D$ over an interconnect to L2/L3 and main memory]

  38. Spatial Memory Streaming
• Footprint: blocks accessed per memory region
• Predict that the next time, the footprint will be the same
• Handle: PC + offset within region
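The footprint idea can be sketched as a small lookup table. This is a toy model under assumed parameters (32-block spatial regions); `FootprintTable` and its methods are illustrative names, not from the SMS design:

```python
# Toy model of SMS footprints: learn which blocks of a region were touched,
# keyed by the handle (PC, offset of the trigger access within the region).
REGION_BLOCKS = 32

class FootprintTable:
    def __init__(self):
        self.patterns = {}  # (pc, trigger offset) -> 32-bit footprint

    def record(self, pc, offsets):
        """Learn the footprint of one spatial generation (first offset = trigger)."""
        bits = 0
        for off in offsets:
            bits |= 1 << off
        self.patterns[(pc, offsets[0])] = bits

    def predict(self, pc, trigger_offset):
        """Footprint predicted for a trigger access, or 0 if unknown."""
        return self.patterns.get((pc, trigger_offset), 0)
```

Each entry is one bit per block in the region, which is what makes the patterns compact enough to virtualize into the memory hierarchy.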

  39. Spatial Generations

  40. Virtualizing SMS
• Virtualize the pattern tables
[Figure: the detector records footprints (patterns); a trigger access consults the predictor, which issues prefetches]

  41. Virtualizing SMS
[Figure: a 1K-set virtual table cached in an 8-set PVCache; each 64-byte entry holds 11 (11-bit tag, 32-bit pattern) pairs plus unused bits]

  42. Packing Entries in One Cache Block
• Index: PC + offset within spatial group
  • PC → 16 bits
  • 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
  • 21-bit index total
• Pattern table: 1K sets → 10 bits to index the table → 11-bit tag
• Cache block: 64 bytes → 11 entries per cache block
• Pattern table: 1K sets, 11-way set-associative
[Figure: one cache block holding 11 (tag, pattern) entries, with the remaining bits unused]
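The packing arithmetic from the slide, written out as a check (only the variable names are mine; the numbers are the slide's):

```python
# How many (tag, pattern) entries fit in one 64-byte cache block.
INDEX_BITS   = 16 + 5    # PC (16 bits) + offset within spatial group (5 bits)
SET_BITS     = 10        # 1K-set pattern table
TAG_BITS     = INDEX_BITS - SET_BITS  # 11-bit tag
PATTERN_BITS = 32        # one bit per block in a 32-block spatial group

entry_bits        = TAG_BITS + PATTERN_BITS     # 43 bits per entry
block_bits        = 64 * 8                      # 64-byte cache block = 512 bits
entries_per_block = block_bits // entry_bits    # 11 entries fit
unused_bits       = block_bits - entries_per_block * entry_bits
```

Eleven 43-bit entries consume 473 of the 512 bits, leaving 39 bits unused, which matches the "unused" field in the slide's block layout.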

  43. Memory Address Calculation
• Index: PC (16 bits) + block offset (5 bits)
• 10 bits of the index select the set; appended with 000000 (the 64-byte block offset) and added to the PV Start Address to form the memory address
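A hedged sketch of this calculation. The slide does not say which 10 of the 21 index bits select the set; the low 10 bits are assumed here, and the function name is illustrative:

```python
# Map a (PC, block offset) handle to the virtualized table's memory address.
def pv_entry_address(pv_start, pc, block_offset):
    index = ((pc & 0xFFFF) << 5) | (block_offset & 0x1F)  # 21-bit index
    set_index = index & ((1 << 10) - 1)                   # 10 bits pick the set
    return pv_start + (set_index << 6)                    # append 000000 (64B sets)
```

Shifting the set index by 6 bits means each set occupies exactly one 64-byte cache block, so a single block fill brings in the whole 11-way set.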

  44. Simulation Infrastructure
• SimFlex (CMU Impetus): full-system simulator based on Simics
• Base processor configuration:
  • 8-wide OoO, 256-entry ROB / 64-entry LSQ
  • L1D/L1I 64KB 4-way set-associative
  • UL2 8MB 16-way set-associative
• Commercial workloads:
  • TPC-C: DB2 and Oracle
  • TPC-H: Query 1, Query 2, Query 16, Query 17
  • Web: Apache and Zeus

  45. SMS – Performance Potential
[Chart: performance potential of SMS; higher is better]

  46. Virtualized Spatial Memory Streaming
• Original prefetcher cost: 60KB; virtualized prefetcher cost: <1KB
• Nearly identical performance
[Chart: higher is better]

  47. Impact of Virtualization on L2 Misses

  48. Impact of Virtualization on L2 Requests

  49. Coarse-Grain Tracking Jason Zebchuk
