This presentation introduces Predictor Virtualization (PV), a technique for coping with growing application footprints. By emulating large predictor tables inside the conventional cache hierarchy, PV improves prediction accuracy while drastically reducing the resources dedicated to predictors. A virtualized data prefetcher demonstrates the approach: on average it performs within 1% of a conventional prefetcher while shrinking dedicated storage from roughly 60KB to under 1KB. The authors argue that multicore processors with large on-chip caches make this the right time to virtualize predictors, and sketch a roadmap for future applications.
Predictor Virtualization
Ioana Burcea*, Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#
*University of Toronto, Canada  §Carnegie Mellon University  #École Polytechnique Fédérale de Lausanne
ASPLOS XIII, March 4, 2008
Why Predictors? History Repeats Itself
[Diagram: a CPU surrounded by predictor-based mechanisms: branch prediction, prefetching, value prediction, pointer caching, cache replacement]
• Application footprints grow
• Predictors need to scale to remain effective
Extra Resources: CMPs With Large On-Chip Caches
[Diagram: a 4-core CMP; each core has private I$ and D$, all cores share an L2 cache of 10’s – 100’s of MB, backed by main memory]
Predictor Virtualization
[Diagram: the same 4-core CMP; predictor state is mapped into the shared L2 cache and the physical memory address space instead of dedicated tables]
Predictor Virtualization (PV)
• Emulate large predictor tables
• Reduce the resources dedicated to predictor tables
Research Contributions
• PV: predictor metadata stored in the conventional cache hierarchy
• Benefits
  • Emulate larger tables → increased accuracy
  • Fewer dedicated resources
• Why now?
  • Large caches / CMPs / need for larger predictors
• Will this work?
  • Metadata locality → intrinsically exploited by caches
• First step: a virtualized data prefetcher
  • Performance: within 1% on average
  • Space: 60KB down to < 1KB
  • Advantages of virtualization
Talk Road Map • PV architecture • PV in action • Virtualized “Spatial Memory Streaming” [ISCA 06]* • Conclusions *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
PV Architecture
[Diagram: the CPU (with I$ and D$) sends requests to an optimization engine, which looks up a dedicated predictor table and returns predictions; virtualization targets this table, backing it with the L2 cache and main memory]
PV Architecture
[Diagram: the dedicated table is replaced by a small PVCache managed through a PVProxy; the full PVTable lives in the physical memory address space starting at PVStart, and predictor indices are translated into memory addresses]
PV: Variable Prediction Latency
[Diagram: the same structure annotated with access frequencies: PVCache hits are the common case, L2 lookups are infrequent, and PVTable accesses in main memory are rare]
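To make the variable-latency path concrete, below is a minimal C sketch of a PV lookup falling through from the PVCache to the L2 to the in-memory PVTable. All names here (pv_lookup, pvcache_lookup, l2_lookup, memory_read, pv_metadata_addr) and the stub behaviors are illustrative assumptions, not the paper's interface.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t pattern; } pv_entry_t;

/* Stand-ins for the three levels the lookup can reach. */
static bool pvcache_lookup(uint64_t addr, pv_entry_t *out) {
    (void)addr; (void)out;
    return false;                        /* pretend a miss to show the fall-through */
}
static bool l2_lookup(uint64_t addr, pv_entry_t *out) {
    (void)addr;
    out->pattern = 0x00C0FFEE;           /* pretend the metadata block is in the L2 */
    return true;
}
static void memory_read(uint64_t addr, pv_entry_t *out) {
    (void)addr;
    out->pattern = 0;                    /* cold metadata: no useful prediction yet */
}

/* Map a table-set index to its address in the reserved PVTable region. */
static uint64_t pv_metadata_addr(uint64_t pv_start, uint64_t set_index) {
    return pv_start + (set_index << 6);  /* one 64-byte block per table set */
}

static pv_entry_t pv_lookup(uint64_t pv_start, uint64_t set_index) {
    pv_entry_t e;
    uint64_t addr = pv_metadata_addr(pv_start, set_index);
    if (pvcache_lookup(addr, &e)) return e;   /* common case */
    if (l2_lookup(addr, &e))      return e;   /* infrequent */
    memory_read(addr, &e);                    /* rare */
    return e;
}

int main(void) {
    pv_entry_t e = pv_lookup(0x80000000ull, 42);
    printf("prediction pattern = 0x%08x\n", e.pattern);
    return 0;
}
```

In the real design a miss would also fill the PVCache so that subsequent lookups hit in the common case; the sketch omits that bookkeeping.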
Metadata Locality
• Entry reuse
  • Temporal: one entry is used for multiple predictions
  • Spatial: can be engineered, so one miss is amortized over several subsequent hits
• Metadata access patterns are predictable
  • Enables predictor metadata prefetching
Talk Road Map • PV architecture • PV in action • Virtualized “Spatial Memory Streaming” [ISCA 06]* • Conclusions *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
Spatial Memory Streaming [ISCA 06]
[Diagram: accesses to memory regions are summarized as bit-vector spatial patterns, e.g. 1100001010001… and 1100000001101…, which are stored in a pattern history table (PHT)]
*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
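As a rough illustration of what a spatial pattern is, the sketch below records which 64-byte blocks of a region have been touched as a bit vector, assuming 32-block regions to match the 32-bit patterns used later in the talk; the function and constants are illustrative, not SMS's actual hardware.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6   /* 64-byte cache blocks */

/* Set the pattern bit for the block touched within a spatial region. */
static uint32_t record_access(uint32_t pattern, uint64_t addr, uint64_t region_base) {
    unsigned off = (unsigned)((addr - region_base) >> BLOCK_BITS);
    return pattern | (1u << off);
}

int main(void) {
    uint64_t base = 0x10000;                       /* hypothetical region base */
    uint32_t pat = 0;
    pat = record_access(pat, base + 0x000, base);  /* block 0: the trigger access */
    pat = record_access(pat, base + 0x080, base);  /* block 2 */
    pat = record_access(pat, base + 0x7C0, base);  /* block 31 */
    printf("spatial pattern = 0x%08x\n", pat);     /* bits 0, 2, and 31 set */
    return 0;
}
```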
Virtualizing “Spatial Memory Streaming” (SMS)
[Diagram: SMS pairs a detector (~1KB) that watches the data access stream and records spatial patterns with a predictor (~60KB) that stores the patterns and, on a trigger access, issues prefetches; virtualization targets the pattern storage]
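On the prediction side, a stored pattern is replayed over a newly triggered region. The sketch below shows the idea under the same assumptions; prefetch and replay_pattern are hypothetical stand-ins for the prefetcher's action.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for handing a prefetch request to the memory system. */
static void prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Turn each set bit of a 32-bit spatial pattern back into a block address. */
static void replay_pattern(uint32_t pattern, uint64_t region_base) {
    for (int i = 0; i < 32; i++)
        if (pattern & (1u << i))
            prefetch(region_base + ((uint64_t)i << 6));  /* 64-byte blocks */
}

int main(void) {
    replay_pattern(0x80000005u, 0x20000);  /* prefetch blocks 0, 2, and 31 */
    return 0;
}
```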
Virtualizing SMS
• Each entry: 11-bit tag + 32-bit pattern (43 bits)
• One table set maps onto one 64-byte cache block: 11 entries (473 bits), 39 bits unused
• Virtual table: 1K sets, 11 ways; PVCache: 8 sets, 11 ways
Current Implementation
• Non-intrusive
  • Virtual table stored in reserved physical address space
  • One table per core
  • Caches oblivious to metadata
• Options
  • Predictor tables stored in virtual memory
  • Single, shared table per application
  • Caches aware of metadata
Simulation Infrastructure
• SimFlex: full-system simulator based on Simics
• Base processor configuration
  • 4-core CMP, 8-wide OoO, 256-entry ROB
  • L1D/L1I: 64KB, 4-way set-associative
  • UL2: 8MB, 16-way set-associative
• Commercial workloads
  • TPC-C: DB2 and Oracle
  • TPC-H: Query 1, Query 2, Query 16, Query 17
  • SpecWeb: Apache and Zeus
Original Prefetcher – Accuracy vs. Predictor Size
[Figure, built up over several slides: L1 read misses covered by the original prefetcher as the predictor table size shrinks; arrows mark the “better” direction]
Small Tables Diminish Prefetching Accuracy
Virtualized Prefetcher – Performance
[Figure: speedup vs. hardware cost; the original prefetcher uses ~60KB, the virtualized prefetcher < 1KB; higher is better]
Impact on L2 Memory Requests
[Figure: increase in L2 memory requests under virtualization; lower is better]
Dark Side: Increased L2 Memory Requests
Impact of Virtualization on Off-Chip Bandwidth
[Figure: increase in off-chip bandwidth; lower is better. Extra L2 requests affect performance only indirectly, whereas off-chip bandwidth affects it directly]
Minimal Impact on Off-Chip Bandwidth
Conclusions
• Predictor Virtualization
  • Metadata stored in the conventional cache hierarchy
• Benefits
  • Emulate larger tables → increased accuracy
  • Fewer dedicated resources
• First step: a virtualized data prefetcher
  • Performance: within 1% on average
  • Space: 60KB down to < 1KB
• Opportunities
  • Metadata sharing and persistence
  • Application-directed prediction
  • Predictor adaptation
Predictor Virtualization
Ioana Burcea* (ioana@eecg.toronto.edu), Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#
*University of Toronto, Canada  §Carnegie Mellon University  #École Polytechnique Fédérale de Lausanne
ASPLOS XIII, March 4, 2008
PV – Motivating Trends
• Dedicating resources to predictors is hard to justify
  • Larger predictor tables → increased performance
  • Chip multiprocessors: space dedicated to predictors scales with the number of processors
• Memory hierarchies offer the opportunity
  • Increased capacity
  • Diminishing returns
→ Use conventional memory hierarchies to store predictor metadata
Virtualizing the Predictor Table
[Diagram: a trigger access (address + PC) indexes the pattern history table; on a tag match, the stored spatial pattern (e.g. …00111010) drives prefetches. The PHT is the structure being virtualized]
• PHT stored in the physical address space
• Multiple PHT entries packed into one memory block
  • One memory request brings in an entire table set
Packing Entries in One Cache Block
• Index: PC + offset within the spatial group
  • PC → 16 bits
  • 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
  • → 21-bit index
• Pattern table: 1K sets
  • 10 bits to index the table → 11-bit tag
• Cache block: 64 bytes
  • 11 entries per cache block
→ Pattern table: 1K sets, 11-way set-associative
[Diagram: block layout; each 43-bit entry is an 11-bit tag followed by a 32-bit pattern (field boundaries at bit offsets 0, 11, 43, 54, 86, …), with the final 39 bits of the 512-bit block unused]
Memory Address Calculation
[Diagram: the 21-bit index is formed by concatenating the 16-bit PC with the 5-bit block offset; its low 10 bits select a table set and the remaining 11 bits form the tag. The metadata memory address is PVStart + (set index ++ 000000), i.e., the set index followed by six zero bits to address a 64-byte block]
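Putting the last two slides together, here is a minimal C sketch of the index and address calculation with the stated parameters (16-bit PC, 5-bit block offset, 1K sets, 64-byte blocks). PV_START and the names are placeholders, not the paper's code.

```c
#include <stdint.h>
#include <stdio.h>

#define PV_START    0x80000000ull  /* hypothetical reserved base address (PVStart) */
#define SET_BITS    10             /* 1K table sets */
#define BLOCK_SHIFT 6              /* one 64-byte cache block per set */

/* Build the 21-bit index: 16 PC bits concatenated with a 5-bit block offset. */
static uint32_t pv_index(uint32_t pc, uint32_t offset) {
    return ((pc & 0xFFFFu) << 5) | (offset & 0x1Fu);
}

int main(void) {
    uint32_t idx = pv_index(0x1234, 7);
    uint32_t set = idx & ((1u << SET_BITS) - 1);   /* low 10 bits select the set */
    uint32_t tag = idx >> SET_BITS;                /* remaining 11 bits form the tag */
    uint64_t addr = PV_START + ((uint64_t)set << BLOCK_SHIFT);
    printf("index=0x%06x set=%u tag=0x%03x metadata address=0x%llx\n",
           idx, set, tag, (unsigned long long)addr);
    return 0;
}
```

The tag is not part of the address; it is stored inside each packed entry and compared after the 64-byte set block arrives.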
Increase in Off-Chip Bandwidth – Different L2 Sizes
[Figure: off-chip bandwidth increase under virtualization for several L2 cache sizes]
Increased L2 Latency
[Figure: speedup sensitivity of the virtualized prefetcher to increased L2 latency]
Conclusions
• PV: predictor metadata stored in the conventional cache hierarchy
• Benefits
  • Fewer dedicated resources
  • Emulate larger tables → increased accuracy
• Example: a virtualized data prefetcher
  • Performance: within 1% on average
  • Space: 60KB down to < 1KB
• Why now?
  • Large caches / CMPs / need for larger predictors
• Will this work?
  • Metadata locality → intrinsically exploited by caches
  • Metadata access patterns are predictable
• Opportunities
  • Metadata sharing and persistence
  • Application-directed prediction
  • Predictor adaptation