Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory

ISCA 2019. Debashis Ganguly, Ziyu Zhang, Jun Yang, Rami Melhem.

Presentation Transcript


  1. Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory. ISCA 2019. Debashis Ganguly, Ziyu Zhang, Jun Yang, Rami Melhem

  2. Why do we need Hardware Prefetchers? [Figure labels: Kernel Execution, Far fault, Data Migration]

  3. Why do we need Hardware Prefetchers? [Figure labels: Kernel Execution, Far fault, Data Migration, Stream 0, Stream 1] User Directed Prefetch

  4. Why do we need Hardware Prefetchers? [Figure labels: Kernel Execution, Far fault, Data Migration, Stream 0, Stream 1] User Directed Prefetch: • What and when to prefetch? • How do I synchronize between streams?

  5. Why do we need Hardware Prefetchers? [Figure labels: Kernel Execution, Far fault, Data Migration, Stream 0, Stream 1] User Directed Prefetch: • What and when to prefetch? • How do I synchronize between streams? Hardware Prefetch

  6. Why do we need Hardware Prefetchers? [Figure labels: Kernel Execution, Far fault, Data Migration, Stream 0, Stream 1] User Directed Prefetch: • What and when to prefetch? • How do I synchronize between streams? Hardware Prefetch: • Takes away the programming effort • Follows spatio-temporal locality of past accesses • Overlaps kernel execution and data migration

  7. Different Hardware Prefetchers • Random Prefetcher (Rp): randomly prefetch a 4KB page local to the 2MB large page to which the current faulting page belongs [Figure: three 2MB large pages]

  8. Different Hardware Prefetchers • Random Prefetcher (Rp): randomly prefetch a 4KB page local to the 2MB large page to which the current faulting page belongs • Sequential-local 64KB Prefetcher (SLp) [variation of Sequential and Locality-aware]: prefetch the 64KB basic block to which the current faulting page belongs [Figure: 2MB large pages divided into 64KB basic blocks]
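
The two policies on slides 7 and 8 can be sketched as address arithmetic. This is a minimal Python sketch, not the authors' implementation: the function names, the constants, and the choice to exclude the faulting page itself from the returned prefetch candidates are assumptions made for illustration, and the sketch does not track which pages are already resident.

```python
import random

PAGE = 4 * 1024                 # 4KB page
BLOCK = 64 * 1024               # 64KB basic block
LARGE_PAGE = 2 * 1024 * 1024    # 2MB large page


def random_prefetch(fault_addr, rng=random):
    """Rp sketch: pick one random 4KB page inside the faulting page's 2MB large page."""
    base = fault_addr - fault_addr % LARGE_PAGE
    fault_page = fault_addr - fault_addr % PAGE
    # Assumption: the faulting page is migrated anyway, so it is not a candidate.
    candidates = [p for p in range(base, base + LARGE_PAGE, PAGE) if p != fault_page]
    return rng.choice(candidates)


def sequential_local_prefetch(fault_addr):
    """SLp sketch: return the other 4KB pages of the faulting page's 64KB basic block."""
    block_base = fault_addr - fault_addr % BLOCK
    fault_page = fault_addr - fault_addr % PAGE
    return [p for p in range(block_base, block_base + BLOCK, PAGE) if p != fault_page]


if __name__ == "__main__":
    addr = 0x12345
    print(hex(random_prefetch(addr)))
    print([hex(p) for p in sequential_local_prefetch(addr)])
```

With 4KB pages and 64KB blocks, the SLp sketch always returns 15 pages (60KB), which together with the 4KB faulting page make up one 64KB transfer.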

  9. Tree-based Neighborhood Prefetcher (TBNp) [Figure: binary tree over eight 64K basic blocks; legend: Invalid Page, Access, Far fault, Prefetch; every node at 0%]

  10. Tree-based Neighborhood Prefetcher (TBNp) [Figure: a far fault on a 4K page brings in its 64K block (4K faulted, 60K prefetched); root at 12.5%]

  11. Tree-based Neighborhood Prefetcher (TBNp) [Figure: a second far fault brings in another 64K block (4K faulted, 60K prefetched); root at 25%]

  12. Tree-based Neighborhood Prefetcher (TBNp) [Figure: a third far fault brings in another 64K block (4K faulted, 60K prefetched); root at 37.5%]

  13. Tree-based Neighborhood Prefetcher (TBNp) [Figure: one further 64K block is prefetched without a far fault; root at 50%]

  14. Tree-based Neighborhood Prefetcher (TBNp) [Figure: another far fault brings in a 64K block (4K faulted, 60K prefetched); root at 62.5%]

  15. Tree-based Neighborhood Prefetcher (TBNp) [Figure: the remaining 64K blocks are prefetched; root at 100%]
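
The walkthrough on slides 9 to 15 can be mimicked with a short Python sketch. It assumes, as the percentages on the slides suggest, that every tree node tracks the fraction of valid memory beneath it and that a node whose occupancy rises strictly above 50% has its remaining blocks prefetched. The class name `TreePrefetcher`, the method `fault`, and the simplification of tracking validity per 64KB block rather than per 4KB page are illustrative choices, not the authors' implementation.

```python
class TreePrefetcher:
    """Sketch of a tree-based neighborhood prefetcher over 64KB basic blocks."""

    def __init__(self, num_blocks=8):
        # The implicit full binary tree needs a power-of-two number of leaves.
        assert num_blocks > 0 and num_blocks & (num_blocks - 1) == 0
        self.valid = [False] * num_blocks

    def fault(self, block):
        """Handle a far fault in `block`; return the extra blocks prefetched."""
        # The faulting 4KB page plus the other 60KB of its block become valid.
        self.valid[block] = True
        prefetched = []
        size = 2
        # Walk from the parent of the leaf up to the root.
        while size <= len(self.valid):
            start = (block // size) * size
            span = range(start, start + size)
            occupied = sum(self.valid[i] for i in span)
            if 2 * occupied > size:  # occupancy strictly above 50%
                for i in span:
                    if not self.valid[i]:
                        self.valid[i] = True
                        prefetched.append(i)
            size *= 2
        return prefetched


if __name__ == "__main__":
    tree = TreePrefetcher(8)
    for blk in (1, 3, 0, 5):
        print(blk, tree.fault(blk))
```

In this run the first two faults prefetch nothing beyond their own block, the third pushes the left half of the tree to 75% and pulls in block 2, and the fourth pushes the root to 62.5% and pulls in blocks 4, 6, and 7, ending with the root at 100% as on slide 15.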

  16. When the working set fits in device memory • The larger the transfer size, the higher the bandwidth • Reduced number of far faults • TBNp delivers a 1-2 orders of magnitude performance improvement over no prefetching

  17. What happens under device memory oversubscription? • Disable hardware prefetchers: to avoid displacement of heavily referenced pages • Pre-eviction to maintain a free-page buffer: to avoid write-back latency • Early disabling of the prefetcher by pre-eviction • ~100x performance degradation with just 110% oversubscription

  18. Interplay between Prefetcher and Naïve Eviction Policies [Figure: LRU 4KB eviction over 2MB large pages divided into 64KB basic blocks]

  19. Interplay between Prefetcher and Naïve Eviction Policies [Figure: LRU 4KB eviction, animation continues]

  20. Interplay between Prefetcher and Naïve Eviction Policies [Figure: LRU 4KB eviction] • No contiguous free space to prefetch • Renders prefetcher ineffective

  21. Interplay between Prefetcher and Naïve Eviction Policies [Figure: LRU 4KB and LRU 2MB eviction side by side] • No contiguous free space to prefetch • Renders prefetcher ineffective

  22. Interplay between Prefetcher and Naïve Eviction Policies [Figure: LRU 4KB and LRU 2MB eviction, animation continues] • No contiguous free space to prefetch • Renders prefetcher ineffective

  23. Interplay between Prefetcher and Naïve Eviction Policies [Figure: LRU 4KB and LRU 2MB eviction, animation continues] • No contiguous free space to prefetch • Renders prefetcher ineffective

  24. Interplay between Prefetcher and Naïve Eviction Policies [Figure: LRU 4KB and LRU 2MB eviction] LRU 4KB: • No contiguous free space to prefetch • Renders prefetcher ineffective. LRU 2MB: • Displaces heavily referenced pages • Causes large thrashing

  25. Prefetcher Inspired Eviction Policies • Random Eviction (Re): randomly evict a 4KB page from the entire virtual address space [Figure: three 2MB large pages]

  26. Prefetcher Inspired Eviction Policies • Random Eviction (Re): randomly evict a 4KB page from the entire virtual address space • Sequential-local 64KB Pre-eviction (SLe): pre-evict the 64KB basic block corresponding to the 4KB LRU candidate [Figure: 2MB large pages divided into 64KB basic blocks]
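
The two eviction policies on slides 25 and 26 can be sketched in Python in the same spirit. The sketch assumes the set of resident 4KB pages is kept in an `OrderedDict` ordered from least to most recently used; that data structure, the function names, and the constants are illustrative assumptions rather than anything stated on the slides.

```python
import random
from collections import OrderedDict

PAGE = 4 * 1024     # 4KB page
BLOCK = 64 * 1024   # 64KB basic block


def random_evict(resident, rng=random):
    """Re sketch: evict one randomly chosen resident 4KB page."""
    victim = rng.choice(list(resident))
    del resident[victim]
    return [victim]


def sequential_local_evict(resident):
    """SLe sketch: evict every resident page in the LRU candidate's 64KB block."""
    lru_page = next(iter(resident))  # oldest entry is the LRU candidate
    block_base = lru_page - lru_page % BLOCK
    victims = [p for p in resident if block_base <= p < block_base + BLOCK]
    for p in victims:
        del resident[p]
    return victims


if __name__ == "__main__":
    # Keys are page addresses, listed from least to most recently used.
    resident = OrderedDict.fromkeys([0x1000, 0x10000, 0x0])
    print([hex(p) for p in sequential_local_evict(resident)])
    print([hex(p) for p in random_evict(resident)])
```

Evicting the whole 64KB block at once frees a contiguous region that a 64KB prefetch can later fill, which is the pairing the slide title points at.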

  27. Tree-based Neighborhood Pre-eviction (TBNe) [Figure: binary tree over eight 64K basic blocks; legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction; every node at 100%]

  28. Tree-based Neighborhood Pre-eviction (TBNe) [Figure: a 4K LRU candidate is evicted and the remaining 60K of its 64K block is pre-evicted; root at 87.5%]

  29. Tree-based Neighborhood Pre-eviction (TBNe) [Figure: a second 64K block is emptied (4K evicted, 60K pre-evicted); root at 75%]

  30. Tree-based Neighborhood Pre-eviction (TBNe) [Figure: a third 64K block is emptied (4K evicted, 60K pre-evicted); root at 62.5%]

  31. Tree-based Neighborhood Pre-eviction (TBNe) [Figure: a fourth 64K block is emptied (4K evicted, 60K pre-evicted); root at 50%]

  32. Tree-based Neighborhood Pre-eviction (TBNe) [Figure: one further 64K block is pre-evicted without an LRU candidate; root at 37.5%]

  33. Tree-based Neighborhood Pre-eviction (TBNe) [Figure: the remaining 64K blocks are pre-evicted; root at 0%]
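
The pre-eviction walkthrough on slides 27 to 33 mirrors the prefetcher and can be sketched the same way. The Python sketch below assumes, from the percentages on the slides, that a tree node whose valid occupancy falls strictly below 50% has its remaining blocks pre-evicted. The class name `TreePreEvictor`, the method `evict`, and the per-64KB-block bookkeeping are illustrative simplifications, not the authors' implementation.

```python
class TreePreEvictor:
    """Sketch of tree-based neighborhood pre-eviction over 64KB basic blocks."""

    def __init__(self, num_blocks=8):
        # The implicit full binary tree needs a power-of-two number of leaves.
        assert num_blocks > 0 and num_blocks & (num_blocks - 1) == 0
        self.valid = [True] * num_blocks  # start with everything resident

    def evict(self, block):
        """Evict the block holding the LRU candidate; return extra pre-evicted blocks."""
        # The 4KB LRU candidate plus the other 60KB of its block are freed.
        self.valid[block] = False
        pre_evicted = []
        size = 2
        # Walk from the parent of the leaf up to the root.
        while size <= len(self.valid):
            start = (block // size) * size
            span = range(start, start + size)
            occupied = sum(self.valid[i] for i in span)
            if 2 * occupied < size:  # occupancy strictly below 50%
                for i in span:
                    if self.valid[i]:
                        self.valid[i] = False
                        pre_evicted.append(i)
            size *= 2
        return pre_evicted


if __name__ == "__main__":
    tree = TreePreEvictor(8)
    for blk in (1, 5, 7, 4):
        print(blk, tree.evict(blk))
```

In this run the first three evictions free only their own block. The fourth drops the right half of the tree to 25%, so block 6 is pre-evicted; that leaves the root at 37.5%, so blocks 0, 2, and 3 follow and the whole region is freed, matching the cascade on slides 31 to 33.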

  34. Combining Pre-evictions (4KB Granularity) and Prefetchers • No additional coordination required • Respecting each other pays off • Order of magnitude performance improvement by the TBNp and TBNe combination

  35. Combining Pre-evictions (2MB Granularity) and Prefetchers • Dynamic eviction granularity • Reduced thrashing • Average 18.5% performance improvement by TBNe

  36. Conclusion • Leverages the framework for hardware prefetcher: no additional implementation and performance overhead • Builds on generic concepts: vendor agnostic • Opportunistically decides on dynamic eviction granularity: navigates between two extremes, 4KB and 2MB; overcomes limitations of static granularity • Micro-benchmarks, UVM benchmarks, and simulator: public for future collaboration at https://github.com/DebashisGanguly/gpgpu-sim_UVMSmart

  37. Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory. Debashis Ganguly, Ph.D. Student • debashis@cs.pitt.edu • https://people.cs.pitt.edu/~debashis/