
PageNUCA: Selected Policies for Page-grain Locality Management in Large Shared CMP Caches



  1. PageNUCA: Selected Policies for Page-grain Locality Management in Large Shared CMP Caches Mainak Chaudhuri, IIT Kanpur mainakc@iitk.ac.in

  2. Talk in one slide
  • Large shared caches in CMPs are designed as a collection of smaller banks
  • The banks are distributed across the floor of the chip and connected to the cores by some point-to-point interconnect, giving rise to a NUCA
  • We explore page-grain dynamic data migration in a NUCA and compare it with block-grain migration and OS-assisted static page-to-bank mapping techniques (first touch and application-directed)

  3. Sketch
  • Preliminaries
  • Why page-grain
  • Hypothesis and observations
  • Dynamic page migration
  • Dynamic cache block migration
  • OS-assisted static page mapping
  • Simulation environment
  • Simulation results
  • An analytical model
  • Summary

  4. Preliminaries: Example floorplan
  [Figure: sixteen L2 cache banks, B0–B7 along the top of the die and B8–B15 along the bottom, with eight cores with private L1 caches (C0–C7) in the middle, all connected by a ring; memory controllers and per-bank L2 controllers sit beside the banks.]

  5. Preliminaries: Baseline mapping
  • Virtual address to physical address mapping is demand-based, L2 cache-aware bin-hopping
  • Good for reducing L2 cache conflicts
  • An L2 cache block is found in a unique bank at any point in time
  • The home bank maintains the directory entry of each block in the bank as an extended state
  • The home bank is a function of the physical address coming out of the L1 cache controller
  • The home bank may change as a block migrates
  • Replication is not explored in this work
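The slide names the OS policy but does not spell it out. As a rough illustration, the toy allocator below hands out physical frames by cycling through cache bins (page colours) on successive page faults, which is the essence of bin hopping; the class, its parameters, and the single global round-robin pointer are assumptions for illustration, not the paper's mechanism.

```python
from collections import deque

class BinHoppingAllocator:
    """Toy model of L2-cache-aware bin-hopping page placement."""

    def __init__(self, num_bins, frames_per_bin):
        # free_frames[b] holds the free physical frame numbers whose
        # page colour (cache bin) is b
        self.free_frames = [deque(range(b, num_bins * frames_per_bin, num_bins))
                            for b in range(num_bins)]
        self.next_bin = 0
        self.num_bins = num_bins

    def allocate(self):
        # hand out a frame from the next bin in round-robin order,
        # skipping bins that have run out of free frames
        for i in range(self.num_bins):
            b = (self.next_bin + i) % self.num_bins
            if self.free_frames[b]:
                self.next_bin = (b + 1) % self.num_bins
                return self.free_frames[b].popleft()
        raise MemoryError("no free frames")
```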

  6. Preliminaries: Baseline mapping
  • Physical address to bank mapping is page-interleaved
  • The bank number bits are located right next to the page offset bits
  • Delivers performance and energy efficiency similar to the more popular block-interleaved scheme
  • Private L1 caches are kept coherent via a home-based MESI directory protocol
  • Every L1 cache request is forwarded to the home bank first for consulting the directory entry
  • The cache hierarchy maintains inclusion
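A minimal sketch of the resulting home-bank computation: with the bank number bits sitting right next to the page offset, the home bank is just the page frame number modulo the bank count. The 4 KB page size and the sixteen banks are assumptions carried over from the example floorplan.

```python
PAGE_SIZE = 4096   # assumed 4 KB pages
NUM_BANKS = 16     # sixteen L2 banks, as in the example floorplan

def home_bank(physical_addr: int) -> int:
    # Page-interleaved mapping: the bank number bits sit immediately
    # above the page offset bits of the physical address.
    return (physical_addr // PAGE_SIZE) % NUM_BANKS
```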

  7. Preliminaries: Why page-grain
  • Past research has explored block-grain data migration and replication in NUCAs (see the paper for a detailed account)
  • Learning dynamic reference patterns at coarse grain requires less storage
  • The transfer of multiple cache blocks can be pipelined, amortizing the overhead
  • Page-grain is particularly attractive:
  • Contiguous physical data exceeding a page may include completely unrelated virtual pages (we compare the two ends of the spectrum)
  • Success in NUMAs (Origin 2000 and Wildfire)

  8. Preliminaries: Observations
  [Figure: fraction of all pages and of L2 cache accesses for Barnes, Matrix, Equake, FFTW, Ocean, and Radix, binned by per-page access count (≥ 32, [16, 31], [8, 15], [1, 7]); the bars contrast the solo-page fraction with its L2 access coverage.]

  9. Preliminaries: Observations
  • For five out of six applications, more than 75% of the pages accessed in a 0.1M-cycle sample period are solo
  • For five out of six applications, more than 50% of L2 cache accesses are covered by these solo pages
  • A major portion of L2 cache accesses is covered by solo pages with 32 or more accesses
  • Potential for compensating the migration overhead by enjoying subsequent reuses

  10. Sketch
  • Preliminaries
  • Why page-grain
  • Hypothesis and observations
  • Dynamic page migration
  • Dynamic cache block migration
  • OS-assisted static page mapping
  • Simulation environment
  • Simulation results
  • An analytical model
  • Summary

  11. Dynamic page migration
  • Fully hardwired solution composed of four central algorithms:
  • When to migrate a page
  • Where to migrate a candidate page
  • How to locate a cache block belonging to a migrated page
  • How the physical data transfer takes place
  • Definition: an L2 cache bank B is local to a core C if B ∈ {x | RTWD(x, C) ≤ RTWD(y, C) for all y ≠ x} = LOCAL(C)
  • A core can have multiple local banks

  12. When to migrate a page
  • When an L1$ request from core R for address A belonging to physical page P arrives at the L2 cache, provided HOME(A) is not in LOCAL(R):
  • Sharer mode migration decision: SHARER(P) > 1 and MaxAccess(P) − SecondMaxAccess(P) < T1 and AccessesSinceLastSharerAdded(P) > T2
  • Solo mode migration decision: (SHARER(P) == 1 or MaxAccess(P) − SecondMaxAccess(P) ≥ T1) and R is in MaxAccessCluster(P)
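Restated as code, the trigger looks roughly like the sketch below. The threshold values are illustrative stand-ins (the tuned T1 and T2 are reported in the paper), and the PACT entry fields it reads are sketched after the next slide.

```python
def should_migrate(pact_entry, requester, home, local_banks):
    """Sketch of the migration trigger; returns 'sharer', 'solo', or None."""
    if home in local_banks[requester]:
        return None  # the page already lives in a bank local to the requester
    T1, T2 = 8, 16   # illustrative thresholds only; see the paper for tuning
    gap = pact_entry.max_access - pact_entry.second_max_access
    # Sharer mode: several sharers with comparable access counts, and the
    # sharer set has been stable for a while
    if (pact_entry.sharers > 1 and gap < T1
            and pact_entry.accesses_since_last_sharer > T2):
        return "sharer"
    # Solo mode: one dominant accessor cluster, and the requester is in it
    if ((pact_entry.sharers == 1 or gap >= T1)
            and requester in pact_entry.max_access_cluster):
        return "solo"
    return None
```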

  13. When to migrate a page
  • Hardware support: a page access counter table (PACT) per L2 cache bank and associated logic
  • The PACT is a set-associative cache that maintains several pieces of information about a page:
  • Valid, tag, LRU states
  • Saturating counters tracking the access count from each topologically close cluster of cores (a pair of adjacent cores)
  • Maximum and second maximum counts, and the maximum cluster
  • Sharer bit-vector and its population count
  • Count of accesses since the last sharer was added
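The dataclass below sketches one PACT entry with the fields just listed; the counter widths are left abstract, and the four clusters (eight cores grouped into adjacent pairs, per the example floorplan) are an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class PACTEntry:
    """One PACT entry (sketch); saturation limits and widths are elided."""
    valid: bool = False
    tag: int = 0
    lru_state: int = 0                      # higher = older in this sketch
    cluster_counts: list = field(default_factory=lambda: [0] * 4)
    max_access: int = 0
    second_max_access: int = 0
    max_access_cluster: set = field(default_factory=set)  # cores of the max cluster
    sharer_bitvec: int = 0                  # one bit per core
    sharers: int = 0                        # population count of sharer_bitvec
    accesses_since_last_sharer: int = 0
```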

  14. When to migrate a page
  • PACT organization
  [Figure: the PACT sits beside the L2 cache bank as a k-way set-associative structure with page sets 0 through N−1; each page covers Psz/Bsz cache blocks in the bank.]

  15. Where to migrate a page
  • Consists of two sub-algorithms:
  • Find a destination bank for the migration
  • Find an appropriate “region” in the destination bank for holding the migrated page
  • Finding a destination bank D for a candidate page P in solo mode migration:
  • Definition: the load on a bank is the number of pages mapped onto that bank, either by the OS or dynamically by migration
  • Set D to the least loaded bank among LOCAL(R), where R is the requesting core for the current transaction
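In code, solo-mode destination selection reduces to a one-liner over the requester's local banks, with bank_load mirroring the load definition above:

```python
def solo_destination(requester, local_banks, bank_load):
    # Pick the least loaded bank among the banks local to the requester;
    # bank_load[b] counts pages mapped to b by the OS or by migration.
    return min(local_banks[requester], key=lambda b: bank_load[b])
```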

  16. Where to migrate a page
  • Finding a destination bank D for a candidate page P in sharer mode migration:
  • Ideally we want D to minimize Σ_i a_i(P) · RTWD(D, S_i(P)), where i ranges over the sharers of P (read out from the PACT), a_i(P) is the number of accesses from the i-th sharer to page P, and S_i(P) is the i-th sharer
  • Simplification: assume a_i(P) == a_j(P)
  • Maintain a “Proximity ROM” of size 2^#cores per L2 cache bank, indexed by the sharer vector of P and returning the top four solutions of the minimization problem; cancel the migration if HOME(P) is one of these four
  • Set D to the one of the four with the least load
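The sketch below brute-forces the minimization instead of reading a precomputed Proximity ROM, but otherwise follows the slide: equal per-sharer weights, keep the top four banks, cancel if the home bank is among them, and break the tie by load. The rtwd table and the argument shapes are assumptions.

```python
def sharer_destination(sharer_vector, home, banks, rtwd, bank_load, top_k=4):
    """Sharer-mode destination choice; returns a bank, or None to cancel."""
    sharers = [c for c, bit in enumerate(sharer_vector) if bit]
    # Equal-weight simplification: rank banks by total round-trip wire
    # delay to the sharers (the hardware reads this out of the Proximity ROM)
    best = sorted(banks, key=lambda x: sum(rtwd[x][s] for s in sharers))[:top_k]
    if home in best:
        return None  # the home bank is already near-optimal; cancel migration
    return min(best, key=lambda b: bank_load[b])
```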

  17. Where to migrate a page
  • Finding a region in destination bank D for the migrated page P:
  • A design decision: migration is done by swapping the contents of a page frame P’ mapping to D with those of P in HOME(P); there is no gradual migration, which saves power
  • Look for an invalid entry in PACT(D), i.e., an unused index range covering a page in D; generate a frame id P’ outside the physical address range mapping to that index range
  • If none is found, let P’ be the LRU page in a randomly picked non-MRU set in PACT(D)
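A hedged sketch of the region choice, reusing the PACTEntry sketch from earlier; the restriction of the random pick to non-MRU sets is elided, and the minting of a frame id P' outside the OS-visible physical range is left as a comment.

```python
import random

def pick_destination_region(pact_sets):
    """Return (set_index, entry) in the destination bank's PACT for P'."""
    # First choice: an invalid entry marks an index range of the bank no
    # page currently occupies; the hardware then mints a frame id P'
    # outside the OS-visible physical address range mapping to that range.
    for s, entries in enumerate(pact_sets):
        for e in entries:
            if not e.valid:
                return s, e
    # Fallback: evict the LRU page of a randomly picked set (the slide
    # additionally restricts the pick to non-MRU sets).
    s = random.randrange(len(pact_sets))
    return s, max(pact_sets[s], key=lambda e: e.lru_state)
```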

  18. How to locate a cache block in L2$
  • The migration process is confined within the boundaries of the L2 cache only
  • Not visible to the OS, TLBs, L1 caches, or the external memory system (which may contain other CMP nodes)
  • Definition: the OS-generated physical address (OS PA) is the address assigned to a page at the time of a page fault
  • Definition: the L2 cache address (L2 CA) of a cache block is the address of the block within the L2 cache
  • Appropriate translation must be carried out between OS PA and L2 CA at the L2 cache boundaries

  19. How to locate a cache block in L2$
  • On-core translation of OS PA to L2 CA (showing the L1 data cache misses only)
  [Figure: the LSQ presents a VPN to the dTLB, which returns the OS PPN; on an L1 data cache miss, the per-core dL1Map (one-to-one, filled on dTLB misses) translates the OS PPN to the L2 PPN, and the resulting L2 CA leaves the core outbound on the ring. All L1-to-L2 transactions exercise this path.]

  20. How to locate a cache block in L2$
  • Uncore translation between OS PA and L2 CA
  [Figure: at an L2 cache bank, the forward L2Map translates an OS PPN arriving from the ring or memory controller into an L2 PPN (a miss forwards the OS PA unchanged), while the inverse L2Map translates the L2 PPN of a migrated block back to its OS PPN for refills and external requests; on a hit, the PACT decides whether to migrate.]

  21. How to locate a cache block in L2$
  • Storage overhead:
  • L1Maps: instruction and data maps per core; organization same as the iTLB and dTLB; filled at the time of a TLB miss from the forward L2Map (if no entry is found, filled with the identity mapping)
  • Forward and inverse L2Maps per L2 cache bank: organized as set-associative caches; sized to keep the replacement volume small
  • Invariant: Map(P, Q) ∈ fL2Map(HOME(P)) iff Map(Q, P) ∈ iL2Map(HOME(Q))

  22. How to locate a cache block in L2$
  • Implications on the miss paths:
  • The L1Map lookup can be hidden under the write to the outbound queue in the local switch
  • The L2 cache miss path gets lengthened: on a miss, the request must be routed to the original home bank over the ring for allocating the MSHR and going through the proper memory controller
  • On an L2 cache refill or external intervention, the transaction arrives at the original home bank and must be routed to its migrated bank (if any)

  23. How data is transferred
  • Page P from bank B is being swapped with page P’ from bank B’ (note that these are L2 CAs)
  • Step 1: iL2Map(B) produces the OS PA of P (call it Q) and iL2Map(B’) produces the OS PA of P’ (call it Q’); swap these two entries
  • Step 2: fL2Map(HOME(Q)) must hold Map(Q, P) and fL2Map(HOME(Q’)) must hold Map(Q’, P’); swap these two entries
  • Step 3: Send the new forward maps, i.e., Map(Q, P’) and Map(Q’, P), to the sharing cores of P and P’ (obtained from PACT(B) and PACT(B’)) so that they can update their L1Maps

  24. How data is transferred
  • Page P from bank B is being swapped with page P’ from bank B’
  • Step 4: The sharing cores acknowledge the L1Map update
  • Step 5: Start the pipelined transfer of data blocks, coherence states, and directory entries
  • Banks B and B’ stop accepting any request until the migration is complete
  • The migration protocol may evict cache blocks from B or B’ to make room for the migrated blocks (a perfect swap may not be possible)
  • A cycle-free virtual lane dependence graph guarantees freedom from deadlock
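The map manipulation of steps 1–3 can be stated compactly. The sketch below treats the maps as per-bank dictionaries and takes the step-3 notification as a callback, leaving the data transfer, acknowledgements, and deadlock avoidance of steps 4–5 to the protocol description above; the argument shapes are assumptions.

```python
def swap_map_entries(P, Pp, B, Bp, iL2Map, fL2Map, home, sharers, notify_l1map):
    """Steps 1-3 of swapping page P (an L2 CA in bank B) with P' (bank B').

    home(q) gives the home bank of OS page q; sharers(bank, page) returns
    the sharing cores recorded in that bank's PACT; notify_l1map(core,
    updates) stands in for the ring messages that refresh the L1Maps.
    """
    # Step 1: the inverse maps yield the OS PAs Q and Q'; swap the entries
    Q, Qp = iL2Map[B][P], iL2Map[Bp][Pp]
    iL2Map[B][P], iL2Map[Bp][Pp] = Qp, Q
    # Step 2: swap the matching forward map entries at the two home banks
    fL2Map[home(Q)][Q], fL2Map[home(Qp)][Qp] = Pp, P
    # Step 3: push the new maps Map(Q, P') and Map(Q', P) to every sharer
    # of P and P'; the sharers acknowledge in step 4
    for core in sharers(B, Q) | sharers(Bp, Qp):
        notify_l1map(core, {Q: Pp, Qp: P})
```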

  25. Sketch
  • Preliminaries
  • Why page-grain
  • Hypothesis and observations
  • Dynamic page migration
  • Dynamic cache block migration
  • OS-assisted static page mapping
  • Simulation environment
  • Simulation results
  • An analytical model
  • Summary

  26. Dynamic cache block migration
  • Modeled as a special case of page-grain migration where the grain is a single L2 cache block
  • The PACT is replaced by a BACT, now tightly coupled with the L2 cache tag array (it doesn’t require separate tags and LRU states)
  • T1 and T2 are retuned for best performance
  • The destination bank selection algorithm is similar, except that the load on a bank is the number of cache block fills to the bank
  • The destination set is selected by first looking for the next round-robin set with an invalid way, resorting to a random selection if none is found

  27. Dynamic cache block migration
  • The algorithm for locating a cache block in the L2 cache is similar
  • The per-core L1Map is now a replica of the forward L2Map so that an L1 cache miss request can be routed to the correct bank
  • As an optimization, we also store the target set and way in the L1Map so that the L2 cache tag access latency can be eliminated (races with migration are resolved by NACKing the racing L1 cache request)
  • The forward and inverse L2Maps get bigger (same organization as the L2 cache)
  • The inverse L2Map shares its tag array with the L2 cache

  28. Sketch
  • Preliminaries
  • Why page-grain
  • Hypothesis and observations
  • Dynamic page migration
  • Dynamic cache block migration
  • OS-assisted static page mapping
  • Simulation environment
  • Simulation results
  • An analytical model
  • Summary

  29. OS-assisted first touch mapping
  • The OS-assisted techniques (static or dynamic) change the default VA-to-PA mapping to indirectly achieve a “good” PA-to-L2-bank mapping
  • Contrast with the hardware techniques, which keep the VA-to-PA mapping unchanged and introduce a new PA-to-shadow-PA indirection
  • First touch mapping is a static technique in which the OS assigns a PA to a virtual page such that the PA maps to a bank local to the core touching the page for the first time
  • Resort to a spill mechanism if all local page frames are exhausted (e.g., pick the globally least loaded bank)
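A minimal sketch of first-touch placement with the spill mechanism, assuming per-bank free lists (free_frames maps each bank to a deque of free frame numbers) and the load definition from slide 15:

```python
def first_touch_frame(core, local_banks, free_frames, bank_load):
    """On a page fault from `core`, return a physical frame whose page
    maps to a local bank, spilling to the globally least loaded bank."""
    # Prefer the least loaded local bank that still has free frames
    for bank in sorted(local_banks[core], key=lambda b: bank_load[b]):
        if free_frames[bank]:
            bank_load[bank] += 1
            return free_frames[bank].popleft()
    # Spill: all local frames are exhausted, so pick the globally least
    # loaded bank that still has free frames
    spill = min((b for b in free_frames if free_frames[b]),
                key=lambda b: bank_load[b])
    bank_load[spill] += 1
    return free_frames[spill].popleft()
```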

  30. OS-assisted application-directed
  • The application can provide a one-time (manually coded) hint to the OS about the affinity of its data structures
  • The hint is sent through special system calls just before the first parallel section begins
  • Completely private data structures can provide accurate hints
  • Shared pages provide hints such that they are placed round-robin within the local banks of the sharing cores
  • Flushing the re-mapped pages from the cache hierarchy or copying them in memory is avoided by leveraging the hardware page-grain map tables

  31. Sketch
  • Preliminaries
  • Why page-grain
  • Hypothesis and observations
  • Dynamic page migration
  • Dynamic cache block migration
  • OS-assisted static page mapping
  • Simulation environment
  • Simulation results
  • An analytical model
  • Summary

  32. Simulation environment
  • Single-node CMP with eight OOO cores
  • Private L1 caches: 32 KB, 4-way, LRU
  • Shared L2 cache: 1 MB 16-way LRU banks, 16 banks distributed over a bidirectional ring
  • Round-trip L2 cache hit latency from the L1 cache: maximum 20 ns, minimum 7.5 ns (local access), mean 13.75 ns (assuming a uniform access distribution) [65 nm process, M5 for the ring with optimally placed repeaters]
  • Ring widths evaluated: 1024 bits, 512 bits, 256 bits (area based on wiring pitch: 30 mm², 15 mm², 7.5 mm²)
  • Off-die DRAM latency: 70 ns row miss, 30 ns row hit

  33. Simulation environment
  • Shared memory applications: Barnes, Ocean, Radix from SPLASH-2; Matrix (a sparse solver using iterative CG) from DIS; Equake from SPEC; FFTW
  • All optimized with array-based queue locks and tree barriers
  • Multi-programmed workloads: mixes of SPEC 2000 and BioBench
  • We report the average turn-around time (i.e., average CPI) for each application to commit a representative set of one billion dynamic instructions (identified using SimPoint)

  34. Storage overhead
  • Comparison of storage overhead between page-grain and block-grain migration:
  • Page-grain: Proximity ROM (8 KB) + PACT (49 KB) + L1Maps (7.1 KB) + forward L2Map (392 KB) + inverse L2Map (392 KB) = 848.1 KB (4.8% of total L2 cache storage)
  • Block-grain: Proximity ROM (8 KB) + BACT (1088 KB) + L1Map (4864 KB) + forward L2Map (608 KB) + inverse L2Map (208 KB) = 6776 KB (28.5%)
  • Idealized block-grain: only one L1Map (608 KB) shared by all cores; total = 2520 KB (12.9%) [hard to floorplan]

  35. Sketch
  • Preliminaries
  • Why page-grain
  • Hypothesis and observations
  • Dynamic page migration
  • Dynamic cache block migration
  • OS-assisted static page mapping
  • Simulation environment
  • Simulation results
  • An analytical model
  • Summary

  36. Performance comparison
  [Figure: normalized execution cycles (lower is better, baseline = 1.0) for Barnes, Matrix, Equake, FFTW, Ocean, Radix, and their geometric mean under perfect placement, application-directed mapping (with lock placement called out), first touch, block-grain migration, and page-grain migration; outlying bars reach 1.46 and 1.69, and the gmean annotations read 18.7% (page migration, per the summary slide) and 22.5%.]

  37. Performance comparison
  [Figure: normalized average cycles (lower is better, baseline = 1.0) for MIX1–MIX8 and their geometric mean under perfect placement, first touch (a spill effect is called out), block-grain migration, and page-grain migration; the gmean annotations read 12.6% (page migration, per the summary slide) and 15.2%.]

  38. Performance analysis
  • Why page-grain sometimes outperforms block-grain (counter-intuitive):
  • Pipelined block transfer during page migration helps amortize the cost and allows page migration to be tuned more aggressively for T1 and T2
  • The degree of aggression is reflected in the local L2 cache access percentage:

            Base    Page    Block   FT      AP
    ShMem   21.0%   81.7%   72.6%   43.1%   54.1%
    MProg   21.6%   85.3%   84.0%   69.6%   —

  39. Performance analysis
  • Impact of ring bandwidth:
  • The results presented so far assume a bidirectional data ring 1024 bits wide in each direction
  • A 256-bit data ring increases the execution time under page migration by 3.6% for the shared memory applications and by 1.3% for the multiprogrammed workloads
  • Block migration is more tolerant of ring bandwidth variation

  40. L1 cache prefetching
  • Impact of a 16 read/write stream stride prefetcher per core (execution time reduction):

            L1 Pref.   Page Mig.   Both
    ShMem   14.5%      18.7%       25.1%
    MProg   4.8%       12.6%       13.0%

  41. Energy Savings
  • Energy savings originate from reduced execution time
  • Potential showstoppers:
  • Extra dynamic interconnect energy due to migration
  • Extra leakage in the added SRAMs
  • Extra dynamic energy in consulting the additional tables and logic

  42. Energy Savings
  • Good news: dynamic page migration is the most energy-efficient among all the options
  • It saves 14% energy for the shared memory applications and 11% for the multiprogrammed workloads compared to the baseline static NUCA
  • Extra leakage in the large tables kills block migration: it saves only 4% and 2% energy for the shared memory and multiprogrammed workloads, respectively

  43. Sketch
  • Preliminaries
  • Why page-grain
  • Hypothesis and observations
  • Dynamic page migration
  • Dynamic cache block migration
  • OS-assisted static page mapping
  • Simulation environment
  • Simulation results
  • An analytical model
  • Summary

  44. An analytical model
  • The normalized execution time with data migration is given by

    N = [rA + (1 − r)(s + t(1 − s))] / (rA + 1 − r)

    where
    r = L2$ miss rate
    A = L2$ miss latency / average L2$ hit latency
    s = ratio of the average hit latency after migration to that before migration
    t = fraction of busy cycles
  • Observations: lim_{r→1} N = 1, lim_{s→1} N = 1, lim_{t→1} N = 1, lim_{A→∞} N = 1
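The model is easy to evaluate numerically; the sketch below plugs in illustrative numbers (assumptions, not measurements from the paper) to show how a reduced average hit latency translates into overall speedup.

```python
def normalized_time(r, A, s, t):
    """Normalized execution time with migration, per the slide-44 model."""
    return (r * A + (1 - r) * (s + t * (1 - s))) / (r * A + 1 - r)

# Illustrative inputs: 5% L2 miss rate, misses 6x the average hit latency,
# migration cutting the average hit latency to 60% of baseline, and 40%
# busy cycles.
print(normalized_time(r=0.05, A=6.0, s=0.6, t=0.4))  # ~0.82, i.e. ~18% faster
```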

  45. Sketch
  • Preliminaries
  • Why page-grain
  • Hypothesis and observations
  • Dynamic page migration
  • Dynamic cache block migration
  • OS-assisted static page mapping
  • Simulation environment
  • Simulation results
  • An analytical model
  • Summary

  46. Summary
  • Explored hardwired and OS-assisted page migration in CMPs
  • Page migration reduces execution time by 18.7% for shared memory applications and 12.6% for multiprogrammed workloads
  • The storage overhead of page migration is less than 5%
  • Performance-optimized block migration algorithms come close to page migration, but require at least 13% extra storage

  47. Acknowledgments
  • Intel Research Council: financial support
  • Gautam Doshi: moral support and useful “tele-brain-storming”
  • Vijay Degalahal, Jugash Chandarlapati: HSPICE simulations for leakage modeling
  • Sreenivas Subramoney: detailed feedback on an early manuscript
  • Kiran Panesar, Shubhra Roy, Manav Subodh: initial connections

  48. PageNUCA: Selected Policies for Page-grain Locality Management in Large Shared CMP Caches THANK YOU! Mainak Chaudhuri, IIT Kanpur mainakc@iitk.ac.in [Presented at HPCA’09]
