
Scavenger: A New Last Level Cache Architecture with Global Block Priority




Presentation Transcript


  1. Scavenger: A New Last Level Cache Architecture with Global Block Priority Arkaprava Basu, IIT Kanpur Nevin Kirman, Cornell Mainak Chaudhuri, IIT Kanpur Meyrem Kirman, Cornell Jose F. Martinez, Cornell

  2. Talk in one slide • Observation#1: large number of blocks miss repeatedly in the last-level cache • Observation#2: number of evictions between an eviction-reuse pair is too large to be captured by a conventional fully associative victim file • How to exploit this temporal behavior with “large period”? • Our solution prioritizes blocks evicted from the last-level cache by their miss frequencies • Top k frequently missing blocks are scavenged and retained in a fast k-entry victim file Scavenger (IITK-Cornell)

  3. Contributions • Three major contributions • Novel application of counting Bloom filters to estimate the frequency of occurrence of a block address seen so far in the last-level cache miss stream • Design of a k-entry pipelined priority queue organized as a min-heap to maintain the miss frequencies of the top k frequently missing blocks • Design of a k-entry fast victim file offering fully associative functionality to retain the top k frequently missing blocks Scavenger (IITK-Cornell)

  4. Result highlights • Compared to a 1 MB 8-way baseline L2 cache, a 512 KB 8-way L2 cache operating with a 512 KB Scavenger victim file offers IPC improvement of • Up to 63% and average 14.2% for nine memory-bound SPEC 2000 applications • Average 8% for a larger set of sixteen SPEC 2000 applications • Aggressive multi-stream stride prefetcher enabled in both cases Scavenger (IITK-Cornell)

  5. Sketch • Observations and hypothesis • Scavenger overview • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  6. Observations and hypothesis [Figure: breakdown of ROB stall cycles (%) by how many times the stalling block address repeats in the L2 miss stream (1, 2-9, 10-99, 100-999, >= 1000 times), for 512 KB 8-way and 1 MB 8-way caches, across the SPEC 2000 applications (gzip, wupwise, swim, applu, vpr, gcc, mesa, art, mcf, equake, crafty, ammp, perlbmk, bzip2, twolf, apsi).] Scavenger (IITK-Cornell)

  7. Observations and hypothesis [Figure: distribution of the number of evictions between an eviction-reuse pair; a conventional victim file is too small to capture it, while a fully associative file large enough to do so ("wish") would be too large.] Scavenger (IITK-Cornell)

  8. Observations and hypothesis • Block addresses repeat in the miss address stream of the last-level cache • Repeating block addresses in the miss stream cause significant ROB stall • Hypothesis: identifying and retaining the most frequently missing blocks in a victim file should be beneficial, but … • Number of evictions between an eviction-reuse pair is very large • Temporal behavior happens at too large a scale to be captured by any reasonably sized fully associative victim file Scavenger (IITK-Cornell)

  9. Sketch • Observations and hypothesis • Scavenger overview • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  10. Scavenger overview • Requirements • Given the address of an evicted block, determine the frequency of occurrence of this block address in the miss stream so far • Quickly (preferably in O(1) time) determine the minimum frequency among the top k frequently missing blocks; if the frequency of the current block is greater than or equal to this minimum, replace the minimum entry, insert the new frequency, and recompute the minimum quickly • Allocate a new block in the victim file by replacing the minimum-frequency block, irrespective of the addresses of these blocks Scavenger (IITK-Cornell)

  11. Scavenger overview (L2 eviction) [Figure: the evicted block's address probes the Bloom filter to obtain its miss frequency; the min-heap supplies the current minimum priority; if the frequency is >= this minimum, the minimum entry is replaced, the new frequency is inserted, and the evicted block is allocated in the victim file, with the displaced block sent to the memory controller (MC).] Scavenger (IITK-Cornell)
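The eviction-side admission decision can be sketched as a functional model (not the hardware design; `freq_estimator`, the `(priority, addr)` heap layout, and the set-based `victim_file` are hypothetical stand-ins for the Bloom filters, min-heap, and VF):

```python
import heapq

def on_l2_eviction(addr, freq_estimator, heap, victim_file, capacity):
    """Model of the Scavenger L2-eviction path: admit the evicted block
    into the victim file only if its miss frequency is at least the
    minimum priority currently held in the heap."""
    # Frequency of this block address in the L2 miss stream seen so far.
    freq = freq_estimator.estimate(addr)
    if len(heap) < capacity:
        # Victim file not yet full: always admit.
        heapq.heappush(heap, (freq, addr))
        victim_file.add(addr)
        return None                        # nothing displaced
    min_freq, min_addr = heap[0]
    if freq >= min_freq:
        # Scavenge: replace the minimum-priority block, irrespective
        # of the addresses of the two blocks involved.
        heapq.heapreplace(heap, (freq, addr))
        victim_file.discard(min_addr)
        victim_file.add(addr)
        return min_addr                    # displaced block goes to memory
    return addr                            # not retained; goes to memory
```

Note that `heapq` gives the O(log k) root replacement but not the arbitrary-node de-allocation the talk's custom heap supports; it models only the admission test on this slide.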

  12. Scavenger overview (L1 miss) [Figure: an L1 miss address looks up the L2 tag array and the victim file; on a victim file hit, the block is de-allocated from the victim file (its min-heap entry removed) and moved into the main L2 cache; on a miss everywhere, the request goes to the memory controller (MC).] Scavenger (IITK-Cornell)

  13. Sketch • Observations and hypothesis • Scavenger overview • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  14. Miss frequency estimator [Figure: bit fields of the block address (e.g., [14:0], [22:15], [25:23], [18:9], [24:19]) index five counting Bloom filter banks BF0-BF4; the miss-frequency estimate is the minimum of the counters read from the banks.] Scavenger (IITK-Cornell)
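A minimal software sketch of such a counting-Bloom-filter estimator, assuming illustrative bank sizes and consecutive bit-field slicing (the real design's field widths and bank count come from the figure, not from this code):

```python
class MissFrequencyEstimator:
    """Counting Bloom filter banks indexed by bit slices of the block
    address. The estimate is the minimum counter over all banks, which
    never underestimates the true occurrence count (absent saturation)."""

    def __init__(self, bank_bits=(15, 8, 3, 10, 6), counter_max=(1 << 16) - 1):
        # One counter array per bank; sizes here are illustrative.
        self.bank_bits = bank_bits
        self.counter_max = counter_max
        self.banks = [[0] * (1 << b) for b in bank_bits]

    def _indices(self, addr):
        # Slice consecutive bit fields of the address, one per bank
        # (the talk uses fields such as [14:0], [22:15], [25:23]).
        shift = 0
        for b in self.bank_bits:
            yield (addr >> shift) & ((1 << b) - 1)
            shift += b

    def record_miss(self, addr):
        # Saturating increment of the indexed counter in every bank.
        for bank, i in zip(self.banks, self._indices(addr)):
            if bank[i] < self.counter_max:
                bank[i] += 1

    def estimate(self, addr):
        # Min over banks: aliasing can only inflate individual counters,
        # so the minimum is the tightest available estimate.
        return min(bank[i] for bank, i in zip(self.banks, self._indices(addr)))
```

Taking the minimum across banks is what keeps the estimate tight: two addresses must collide in every bank simultaneously before the estimate is inflated.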

  15. Priority queue (min-heap) [Figure: a 15-node min-heap example with tags T1-T15; each node holds a priority and a VPTR into the VF tag array (node format: Priority | VPTR, pointing at a VF tag). For node index i, the left child is at i << 1 and the right child at (i << 1) | 1.] Scavenger (IITK-Cornell)

  16. De-allocation from min-heap • Invoked on a VF hit • Locating the heap node for the hit entry needs a back-pointer (HPTR) in the VF tag [Figure: the hit entry's priority is overwritten with 0.] Scavenger (IITK-Cornell)

  17. De-allocation from min-heap [Figure: the zeroed priority swaps with its larger parent and moves one level up; the de-allocated VF tag is marked invalid.] Scavenger (IITK-Cornell)

  18. De-allocation from min-heap [Figure: the zero continues percolating up, swapping with each larger parent, until it reaches the root.] Scavenger (IITK-Cornell)

  19. De-allocation from min-heap • First step locates the priority of the invalidated tag by reading out HPTR, and overwrites that priority with zero • Each subsequent step involves • Computing the parent index (right shift) • Reading out the parent priority (read port) • Comparing against the parent priority (byte comparator) • Swapping priorities if the parent's is larger (two write ports); the VPTR values are swapped as well • Updating the HPTR contents (shift/OR) • O(d) steps for a heap of depth d Scavenger (IITK-Cornell)

  20. Insertion into min-heap • Invoked when a new block is allocated in the VF • In the usual case, the insertion happens at the root because this holds the priority of the replaced block • In some special cases, an insertion is needed at an arbitrary node of the heap • Insertion at root is very well understood and requires O(d) time for depth d Scavenger (IITK-Cornell)

  21. Insertion into min-heap [Figure: a new priority of 20 with tag T7' is written over the replaced node's priority.] Scavenger (IITK-Cornell)

  22. Insertion into min-heap [Figure: the new priority 20 sifts downward, swapping with the smaller child at each level.] Scavenger (IITK-Cornell)

  23. Insertion into min-heap [Figure: sifting continues until neither child holds a smaller priority; heap order is restored.] Scavenger (IITK-Cornell)

  24. Insertion into min-heap • First step locates the priority of the replaced tag by reading out HPTR, and overwrites that priority with the new one • If the new priority is greater than the replaced priority, the flow moves downward; otherwise it moves upward • O(d) steps for a heap of depth d Scavenger (IITK-Cornell)
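The de-allocation (slides 16-19) and insertion (slides 21-24) flows can be sketched together in software. This is a functional model, not the pipelined RTL; the `hptr` dictionary stands in for the HPTR back-pointers stored in the VF tags, and `MinHeap`, `replace_at`, and the tag names are hypothetical:

```python
class MinHeap:
    """1-based array min-heap: for index i, the left child is at i << 1
    and the right child at (i << 1) | 1. Each node is (priority, tag);
    hptr maps a tag back to its heap index (the HPTR role), making
    arbitrary-node operations O(d) for a heap of depth d."""

    def __init__(self, capacity):
        self.a = [None] * (capacity + 1)
        self.n = 0
        self.hptr = {}

    def _swap(self, i, j):
        self.a[i], self.a[j] = self.a[j], self.a[i]
        self.hptr[self.a[i][1]] = i
        self.hptr[self.a[j][1]] = j

    def _sift_up(self, i):
        # Swap with the parent while the parent's priority is larger.
        while i > 1 and self.a[i >> 1][0] > self.a[i][0]:
            self._swap(i, i >> 1)
            i >>= 1

    def _sift_down(self, i):
        while True:
            l, r, m = i << 1, (i << 1) | 1, i
            if l <= self.n and self.a[l][0] < self.a[m][0]:
                m = l
            if r <= self.n and self.a[r][0] < self.a[m][0]:
                m = r
            if m == i:
                return
            self._swap(i, m)
            i = m

    def push(self, priority, tag):
        self.n += 1
        self.a[self.n] = (priority, tag)
        self.hptr[tag] = self.n
        self._sift_up(self.n)

    def deallocate(self, tag):
        # VF hit: overwrite the node's priority with 0 and percolate up,
        # so the freed slot surfaces at the root as the heap minimum.
        i = self.hptr[tag]
        self.a[i] = (0, tag)
        self._sift_up(i)

    def replace_at(self, tag, new_priority, new_tag):
        # Overwrite an arbitrary node (usually the root, which holds the
        # replaced block's priority); flow moves downward if the new
        # priority is greater than the old one, upward otherwise.
        i = self.hptr.pop(tag)
        old_priority = self.a[i][0]
        self.a[i] = (new_priority, new_tag)
        self.hptr[new_tag] = i
        if new_priority > old_priority:
            self._sift_down(i)
        else:
            self._sift_up(i)

    def min(self):
        return self.a[1]
```

The direction test in `replace_at` mirrors the slide: a larger new priority sinks toward the leaves, a smaller one rises toward the root, in at most d steps either way.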

  25. Pipelined min-heap • Both insertion and de-allocation require O(log k) steps for a k-entry heap • Each step involves read, comparison, and write operations; step latency: r+c+w cycles • We are targeting very large k (close to ten thousand) • A latency of (r+c+w)log(k) cycles is too high to cope with bursty cache misses • Both insertion and de-allocation must be pipelined • We unify insertion and de-allocation into a single pipelined operation called replacement Scavenger (IITK-Cornell)

  26. Pipelined min-heap • Each level of the heap is stored in a separate RAM bank and becomes a macro pipe stage • Each macro stage is divided into three micro stages (read, compare, write), giving a pipe depth of 3·log(k) and a throughput of one operation every max(r, c, w) cycles • Data hazards between consecutive levels and within a level are resolved by short bypass buses Scavenger (IITK-Cornell)

  27. Pipelined replacement#1 [Figure: a de-allocation (priority 0) flows level by level down the pipelined heap; at each level the operation packet passes through the read (R), compare (C), and write (W) micro-stages.] Scavenger (IITK-Cornell)

  28. Pipelined replacement#2 [Figure: an insertion of priority 20 flows through the same R/C/W micro-stages level by level, with short bypasses resolving hazards between adjacent levels.] Scavenger (IITK-Cornell)

  29. Victim file • Functional requirements • Should be able to replace a block with minimum priority by a block of higher or equal priority irrespective of addresses (fully associative functionality) • Should offer fast lookup (conventional fully associative won’t do) • On a hit, should de-allocate the block and move it to main L2 cache (different from conventional victim buffers) Scavenger (IITK-Cornell)

  30. Victim file organization • Tag array • Direct-mapped hash table with collisions (i.e., conflicts) resolved by chaining • Each tag entry contains an upstream (toward head) and a downstream (toward tail) pointer, and a head (H) and a tail (T) bit • Victim file lookup at address A walks the tag list sequentially starting at direct-mapped index of A • Each tag lookup has latency equal to the latency of a direct-mapped cache of same size • A replacement delinks the replaced tag from its list and links it up with the list of the new tag Scavenger (IITK-Cornell)
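A rough software sketch of this tag organization (a functional model, not the hardware: a list of Python lists stands in for the upstream/downstream pointers and head/tail bits, and `ChainedTagStore` and `block_offset_bits` are names invented here):

```python
class ChainedTagStore:
    """Direct-mapped hash table of k tag entries with collisions resolved
    by chaining, as in the Scavenger victim file tag array. Only the
    lookup walk and its cost are modeled."""

    def __init__(self, k, block_offset_bits=6):
        assert k & (k - 1) == 0, "k must be a power of two"
        self.k = k
        self.bo = block_offset_bits          # BO: log2(block size)
        self.lists = [[] for _ in range(k)]  # one chained tag list per index

    def index(self, addr):
        # Direct-mapped index: (A >> BO) & (k - 1)
        return (addr >> self.bo) & (self.k - 1)

    def lookup(self, addr):
        # Walk the tag list from head toward tail; cost is O(L) tag
        # accesses for a list of length L (L == 1 in the common case,
        # so lookups usually pay direct-mapped latency).
        probes = 0
        for tag in self.lists[self.index(addr)]:
            probes += 1
            if tag == addr:
                return True, probes
        return False, probes

    def insert(self, addr):
        # Link the new tag into the list rooted at its index.
        self.lists[self.index(addr)].append(addr)
```

Compared with an equally sized N-way set-associative array, the direct-mapped index here uses log2(N) more address bits, which is what keeps the chained lists short in practice.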

  31. Victim file lookup [Figure: lookup at address A starts at direct-mapped index (A >> BO) & (k-1) of the k-entry VF tag array and walks the chained tag list from head toward tail until a hit or an invalid entry terminates the walk; a hit reads the corresponding VF data entry.] Scavenger (IITK-Cornell)

  32. Why is it attractive? • Address to hit/miss signal latency is O(L), L being the length of the accessed tag list • Compared to an equally sized conventional N-way cache, our victim file gains log N extra index bits: helps tremendously in reducing conflicts and hence the list length • Okay to have slightly higher lookup latency than usual for the last-level cache • To offer a guarantee on lookup latency (may be needed for designing various interfaces, e.g., snoop response, load/store replay), one can limit the list length: L=8 loses 0.1% performance compared to unlimited Scavenger (IITK-Cornell)

  33. Handling head de-allocation [Figure: de-allocating a block that sits at the head of the tag list rooted at index (A' >> BO) & (k-1) requires block migration, so that the list stays anchored at its direct-mapped head slot.] Scavenger (IITK-Cornell)

  34. Insertion into VF [Figure: a new tag cannot be inserted at the head position of its list; the replaced tag (RTag, located through VPTR[r]) is delinked from its own list and the new tag is linked in, updating the head and tail pointers.] Scavenger (IITK-Cornell)

  35. Sketch • Observations and hypothesis • Scavenger overview • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  36. Simulation environment • Configs differ only in L2 cache arch. • Common attributes (more in paper) • 4 GHz, 4-4-6 pipe, 128-entry ROB, 160 i/fpRF • L1 caches: 32 KB/4-way/64B/LRU/0.75 ns • L2 cache miss latency (load-to-use): 121 ns • 16-stream stride prefetcher between L2 cache and memory with max. stride 256B • Applications: 1 billion representative dynamic instructions from sixteen SPEC 2000 applications (will discuss results for ten memory-bound applications; rest in paper) Scavenger (IITK-Cornell)

  37. Simulation environment • L2 cache configurations • Baseline: 1 MB/8-way/64B/LRU/2.25 ns/15.54 mm2 • Scavenger: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB VF (8192 entries x 64 B/entry)/0.5 ns, 0.75 ns + auxiliary data structures (8192-entry priority queue, BFs, pointer RAMs)/0.5 ns; total 16.75 mm2 • 16-way: 1 MB/16-way/64B/LRU/2.75 ns/26.4 mm2 • 512KB-FA-VC: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB/FA/64B/Random/3.5 ns conventional VC Scavenger (IITK-Cornell)

  38. Sketch • Observations and hypothesis • Scavenger overview • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  39. VF characteristics • Number of tag accesses per L1 cache miss request • Mean below 1.5 for 14 applications • Mode (common case) is one for 15 applications (enjoy direct-mapped latency) • More than 90% requests require at most three for 15 applications • Migration in allocation/de-allocation • At most 25% allocations/de-allocations require block migration in 15 applications Scavenger (IITK-Cornell)

  40. Performance (Speedup) Higher is better [Figure: speedup over the baseline for swim, applu, vpr, art, mcf, equake, ammp, wupwise, twolf; the best bar reaches 1.63. Mean speedups (memory-bound set, full set): 16-way (1.01, 1.00), 512 KB-FA-VC (1.01, 1.01), Scavenger-StBF (1.03, 1.02), Scavenger-SkBF (1.14, 1.08).] Scavenger (IITK-Cornell)

  41. Performance (L2 cache misses) Lower is better [Figure: L2 cache misses normalized to the baseline for swim, applu, vpr, art, mcf, equake, ammp, wupwise, twolf; one bar carries a 6% slowdown annotation. Mean normalized misses (memory-bound set, full set): 16-way (0.98, 0.98), 512 KB-FA-VC (0.94, 0.96), Scavenger-StBF (1.02, 1.02), Scavenger-SkBF (0.85, 0.90).] Scavenger (IITK-Cornell)

  42. Local L2 hit rate contributions

      Application   512 KB VF   512 KB Conv. L2
      wupwise          0.01          0.15
      swim             0.40          0.30
      applu            0.00          0.03
      vpr              0.18          0.58
      art              0.28          0.13
      mcf              0.14          0.06
      equake           0.01          0.03
      ammp             0.15          0.60
      twolf            0.21          0.44

  Scavenger (IITK-Cornell)

  43. Sketch • Observations and hypothesis • Scavenger overview • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  44. Comparison with recent proposals • Dynamic insertion policy (DIP) [ISCA’07] • Dynamically decides whether to insert a new block in the MRU position (traditional) or in the LRU position within a set • Block inserted at LRU position is promoted to MRU only after another access • Attacks a capacity issue and retains some parts of large working sets if access pattern is cyclic • Scavenger does not differentiate between capacity and conflict misses, but tries to retain the most frequently missed blocks Scavenger (IITK-Cornell)

  45. Comparison with recent proposals • V-way cache [ISCA’05] • Doubles tag store and gains one index bit • Each tag maintains pointers to/from decoupled data store • Since the number of tags is double the number of blocks, replacement of an invalid tag can invoke a global reuse-based data replacement policy • Sequentially locates the block with least reuse (search limit of five cycles maximum) • Scavenger’s pointer-linked tag store offers more flexibility in having variable assoc. Scavenger (IITK-Cornell)

  46. Comparison of L2 cache misses Lower is better [Figure: L2 cache misses normalized to the baseline for swim, applu, vpr, art, mcf, equake, ammp, wupwise, twolf. Mean normalized misses: DIP (0.84), V-way (0.87), Scavenger-SkBF (0.84). DIP beats Scavenger in art and mcf only; V-way beats Scavenger only in ammp; Scavenger improves across the board, with the BFs as the bottleneck.] Scavenger (IITK-Cornell)

  47. Other closely related work • Indirect index cache (IIC) [ISCA2k] • Our VF organization shares similarities with this proposal • IIC organizes the tags in a primary four-way hash table backed by a secondary direct-mapped hash table with chained tags • Scavenger enjoys the advantage of direct-mapped access in case of singleton lists (common case) • Scavenger’s miss frequency-based replacement policy is different from the generational replacement policy of IIC Scavenger (IITK-Cornell)

  48. Summary of Scavenger • A new cache organization best suited for the last level of cache hierarchy • Divides the storage into a conventional set-associative cache and a fast VF offering the functionality of a FA VF • Insertion into VF is controlled by a priority queue backed by a cache block miss frequency estimator • Offers IPC improvement of up to 63% and on average 8% for a set of sixteen SPEC 2000 applications Scavenger (IITK-Cornell)

  49. Scavenger: A New Last Level Cache Architecture with Global Block Priority THANK YOU! Arkaprava Basu, IIT Kanpur Nevin Kirman, Cornell Mainak Chaudhuri, IIT Kanpur Meyrem Kirman, Cornell Jose F. Martinez, Cornell
