
Scavenger: A New Last Level Cache Architecture with Global Block Priority



Presentation Transcript


  1. Scavenger: A New Last Level Cache Architecture with Global Block Priority Arkaprava Basu, IIT Kanpur Nevin Kirman, Cornell Mainak Chaudhuri, IIT Kanpur Meyrem Kirman, Cornell Jose F. Martinez, Cornell

  2. Talk in one slide • Observation#1: large number of blocks miss repeatedly in the last-level cache • Observation#2: number of evictions between an eviction-reuse pair is too large to be captured by a conventional fully associative victim file • How to exploit this temporal behavior with “large period”? • Our solution prioritizes blocks evicted from the last-level cache by their miss frequencies • Top k frequently missing blocks are scavenged and retained in a fast k-entry victim file Scavenger (IITK-Cornell)

  3. Sketch • Observations and hypothesis • Scavenger overview (Contributions) • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  4. Observations and hypothesis [Bar chart: percentage of ROB stall cycles attributable to blocks that miss 1, 2-9, 10-99, 100-999, and >= 1000 times in the last-level cache, shown for 512 KB 8-way and 1 MB 8-way configurations across sixteen SPEC 2000 applications (gz, wu, sw, ap, vp, gc, me, ar, mc, eq, cr, am, pe, bz, tw, aps); y-axis 0-100%.] Scavenger (IITK-Cornell)

  5. Observations and hypothesis [Figure: distribution of the number of evictions between an eviction-reuse pair. The reuse distance one would wish to capture is too large for a conventional fully associative (FA) victim file, while what a reasonably sized victim file captures is too small.] Scavenger (IITK-Cornell)

  6. Observations and hypothesis • Block addresses repeat in the miss address stream of the last-level cache • Repeating block addresses in the miss stream cause significant ROB stall • Hypothesis: identifying and retaining the most frequently missing blocks in a victim file should be beneficial, but … • Number of evictions between an eviction-reuse pair is very large • Temporal behavior happens at too large a scale to be captured by any reasonably sized fully associative victim file Scavenger (IITK-Cornell)

  7. Sketch • Observations and hypothesis • Scavenger overview (Contributions) • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  8. Scavenger overview (Contributions) • Functional requirements • Determine the frequency of occurrence of an evicted block address in the miss stream seen so far • Determine (preferably in O(1) time) the minimum frequency among the top k frequently missing blocks; if the frequency of the current block is greater than or equal to this minimum, replace the minimum, insert the new frequency, and compute the new minimum quickly • Allocate a new block in the victim file by replacing the minimum-frequency block, irrespective of the addresses of these blocks Scavenger (IITK-Cornell)
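The requirements above suggest a simple control flow on every L2 eviction. A minimal software sketch of that flow; `estimator`, `heap` (a Python list used with `heapq`), and `victim_file` (a dict) are illustrative stand-ins, not the paper's hardware interfaces:

```python
# Hypothetical sketch of the per-eviction decision flow described above.
import heapq

def on_l2_eviction(block_addr, block_data, estimator, heap, victim_file):
    """Decide whether a block evicted from L2 is worth scavenging."""
    freq = estimator.estimate(block_addr)  # miss frequency seen so far
    current_min = heap[0][0]               # O(1) peek at the minimum priority
    if freq >= current_min:
        # Replace the minimum-priority entry, irrespective of addresses.
        _, old_addr = heapq.heapreplace(heap, (freq, block_addr))
        victim_file.pop(old_addr, None)
        victim_file[block_addr] = block_data
    # Otherwise the block simply goes to memory (write-back if dirty).
```

Because the root of a min-heap always holds the current minimum, the admission test is a single O(1) comparison; only an admitted block pays the O(log k) heap update.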

  9. Scavenger overview (L2 eviction) [Flow diagram: the evicted block address from the L2 tag & data array indexes the Bloom filter, which produces a frequency estimate (Freq.); if Freq. >= Min. (the minimum priority in the min-heap), the minimum entry is replaced, the new frequency is inserted, and the block is allocated in the victim file; otherwise the block goes to the memory controller (MC).] Scavenger (IITK-Cornell)

  10. Scavenger overview (L1 miss) [Flow diagram: the L1 miss address looks up the L2 tag & data array, the Bloom filter, and the victim file; a victim file hit de-allocates the block from the victim file (via the min-heap) and supplies it to L1; a miss in both structures goes to the memory controller (MC).] Scavenger (IITK-Cornell)

  11. Sketch • Observations and hypothesis • Scavenger overview (Contributions) • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  12. Miss frequency estimator [Diagram: the block address is split into bit slices — [14:0], [22:15], [25:23], [18:9], [24:19] — that index five Bloom filter counter tables BF0-BF4; the minimum of the five counters read out is the frequency estimate.] Scavenger (IITK-Cornell)
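In spirit this estimator resembles a count-min sketch: each slice of the block address indexes its own table of saturating counters, and the minimum across tables is the estimate, which can overestimate (due to aliasing) but never underestimate. A minimal software sketch, with illustrative table sizes and hash mixing rather than the exact bit slices in the slide:

```python
class MissFrequencyEstimator:
    """Count-min-style estimator: several counter tables, each indexed by a
    different mix of the block address; the minimum counter is the estimate."""

    def __init__(self, num_tables=5, table_bits=10, counter_max=255):
        self.size = 1 << table_bits
        self.tables = [[0] * self.size for _ in range(num_tables)]
        self.counter_max = counter_max  # saturating counters

    def _indices(self, addr):
        # Mix the address differently for each table (illustrative hashing,
        # standing in for the per-table address bit slices).
        for i in range(len(self.tables)):
            yield ((addr >> (i * 3)) ^ (addr * 2654435761)) % self.size

    def record_miss(self, addr):
        for table, idx in zip(self.tables, self._indices(addr)):
            table[idx] = min(table[idx] + 1, self.counter_max)

    def estimate(self, addr):
        return min(table[idx]
                   for table, idx in zip(self.tables, self._indices(addr)))
```

Aliasing in one table only inflates one of the counters; taking the minimum across all tables keeps the estimate close to the true miss count.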

  13. Priority queue (min-heap) [Diagram: a 15-node binary min-heap of priorities (root 5, with children 6 and 10, and so on down the tree); each heap node Ti stores a priority and a pointer (VPTR) to its victim file (VF) tag. With 1-based array indexing: left child: i << 1; right child: (i << 1) | 1.] Scavenger (IITK-Cornell)

  14. Pipelined min-heap • Both insertion and de-allocation require O(log k) steps for a k-entry heap • Each step involves read, comparison, and write operations; step latency: r+c+w cycles • A latency of (r+c+w)log(k) cycles is too high to cope with bursty cache misses • Both insertion and de-allocation must be pipelined • We unify insertion and de-allocation into a single pipelined operation called replacement • De-allocation is the same as a zero insertion Scavenger (IITK-Cornell)
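A sequential software model of the unified replacement operation may clarify the idea (a real implementation pipelines the read/compare/write step at each heap level; the 1-based array layout gives the child-index formulas from the previous slide):

```python
# Sequential model of the unified heap "replacement": overwrite the root
# (the current minimum) and percolate the new priority down.
# 1-based array layout: left child = i << 1, right child = (i << 1) | 1.

def heap_replace(heap, new_priority):
    """heap[0] is unused; entries live in heap[1..k].
    De-allocation is the special case new_priority == 0."""
    k = len(heap) - 1
    heap[1] = new_priority
    i = 1
    while True:
        left, right = i << 1, (i << 1) | 1
        smallest = i
        if left <= k and heap[left] < heap[smallest]:
            smallest = left
        if right <= k and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return
        # One read/compare/write (R/C/W) step at this level.
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest
```

Each loop iteration touches only one heap level, which is what makes pipelining possible: with dedicated R/C/W hardware per level, a new replacement can enter every step instead of every (r+c+w)log(k) cycles.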

  15. Pipelined heap replacement [Animation frame: a replacement (new priority 20) enters at the root, evicting the minimum (5); read (R), compare (C), and write (W) stages are shown per heap level. Left child: i << 1; right child: (i << 1) | 1.] Scavenger (IITK-Cornell)

  16. Pipelined heap replacement [Animation frame: the replacement's R/C/W stages advance one level down the heap; 20 moves toward the smaller child.] Scavenger (IITK-Cornell)

  17. Pipelined heap replacement [Animation frame: the replacement continues percolating down, freeing the upper levels to accept the next operation.] Scavenger (IITK-Cornell)

  18. Pipelined heap replacement [Animation frame: the replacement settles, restoring the heap property with 6 at the root, while a following replacement can already be in flight in the upper levels.] Scavenger (IITK-Cornell)

  19. Victim file • Functional requirements • Should be able to replace a block with minimum priority by a block of higher or equal priority irrespective of addresses (fully associative functionality) • Should offer fast lookup (conventional fully associative won’t do) • On a hit, should de-allocate the block and move it to main L2 cache (different from conventional victim caches) Scavenger (IITK-Cornell)

  20. Victim file organization • Tag array • Direct-mapped hash table with collisions (i.e., conflicts) resolved by chaining • Each tag entry contains an upstream (toward head) and a downstream (toward tail) pointer, and a head (H) and a tail (T) bit • A victim file lookup at address A walks the tag list sequentially, starting at the direct-mapped index of A • Each tag lookup has latency equal to that of a direct-mapped cache of the same size • A replacement delinks the replaced tag from its list and links it into the list of the new tag Scavenger (IITK-Cornell)
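A software analogue of this tag organization is a direct-mapped table with separate chaining. The sketch below is illustrative only: Python lists stand in for the hardware's upstream/downstream pointers and H/T bits, and the class and method names are invented for the example.

```python
class VictimFileTags:
    """Direct-mapped tag array with collisions resolved by chaining:
    a lookup walks the chain starting at the direct-mapped index."""

    def __init__(self, num_entries, block_offset_bits=6):
        self.mask = num_entries - 1    # num_entries must be a power of two
        self.bo = block_offset_bits    # BO: 64 B blocks -> 6 offset bits
        self.chains = {}               # index -> list of tags (the chain)

    def _index(self, addr):
        return (addr >> self.bo) & self.mask   # (A >> BO) & (k - 1)

    def lookup(self, addr):
        """Walk the chain; return (hit?, number of tag accesses)."""
        chain = self.chains.get(self._index(addr), [])
        for steps, tag in enumerate(chain, start=1):
            if tag == addr:
                return True, steps
        return False, max(1, len(chain))

    def insert(self, addr):
        self.chains.setdefault(self._index(addr), []).append(addr)
```

In the common case the chain has length one, so a lookup costs a single tag access at direct-mapped latency; only conflicting addresses pay extra chain-walking steps.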

  21. Victim file lookup [Diagram: the lookup index into the k-entry VF tag array is computed as (A >> BO) & (k-1); the tag chain is walked between its head and tail entries; on a hit, the data is read from the VF data array, the entry is invalidated, and a zero priority is inserted in the corresponding heap node, which requires a back pointer from the tag to the heap.] Scavenger (IITK-Cornell)

  22. Sketch • Observations and hypothesis • Scavenger overview (Contributions) • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  23. Simulation environment • Single-stream evaluation in this paper • Configurations differ only in the L2 cache architecture • Common attributes (more in paper) • 4 GHz, 4-4-6 pipe, 128-entry ROB, 160 integer/FP registers • L1 caches: 32 KB/4-way/64B/LRU/0.75 ns • L2 cache miss latency (load-to-use): 121 ns • 16-stream stride prefetcher between the L2 cache and memory with max. stride 256B • Applications: 1 billion representative dynamic instructions from sixteen SPEC 2000 applications (results discussed here for nine memory-bound applications; rest in paper) Scavenger (IITK-Cornell)

  24. Simulation environment • L2 cache configurations • Baseline: 1 MB/8-way/64B/LRU/2.25 ns/15.54 mm2 • Scavenger: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB VF (8192 entries x 64 B/entry)/0.5 ns, 0.75 ns + auxiliary data structures (8192-entry priority queue, BFs, pointer RAMs)/0.5 ns; total 16.75 mm2 • 16-way: 1 MB/16-way/64B/LRU/2.75 ns/26.4 mm2 • 512KB-FA-VC: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB/FA/64B/Random/3.5 ns conventional VC Scavenger (IITK-Cornell)

  25. Sketch • Observations and hypothesis • Scavenger overview (Contributions) • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  26. Victim file characteristics • Number of tag accesses per L1 cache miss request • Mean below 1.5 for 14 applications • Mode (common case) is one for 15 applications (enjoy direct-mapped latency) • More than 90% requests require at most three for 15 applications Scavenger (IITK-Cornell)

  27. Performance (Speedup) [Bar chart, higher is better, y-axis 0.9-1.4: speedup over the baseline for sw, ap, vp, ar, mc, eq, am, wu, tw; peak Scavenger bar 1.63. Averages over (nine memory-bound applications, all sixteen applications): 16-way (1.01, 1.00), 512KB-FA-VC (1.01, 1.01), Scavenger (1.14, 1.08).] Scavenger (IITK-Cornell)

  28. Performance (L2 cache misses) [Bar chart, lower is better, y-axis 0.6-1.1: L2 cache misses normalized to the baseline for sw, ap, vp, ar, mc, eq, am, wu, tw. Averages over (nine memory-bound applications, all sixteen applications): 16-way (0.98, 0.98), 512KB-FA-VC (0.94, 0.96), Scavenger (0.85, 0.90).] Scavenger (IITK-Cornell)

  29. Sketch • Observations and hypothesis • Scavenger overview (Contributions) • Scavenger architecture • Frequency estimator • Priority queue • Victim file • Simulation environment • Simulation results • Related work • Summary Scavenger (IITK-Cornell)

  30. L2 cache misses in recent proposals [Bar chart, lower is better, y-axis 0.4-1.00: normalized L2 cache misses for sw, ap, vp, ar, mc, eq, am, wu, tw. DIP [ISCA'07] (0.84) beats Scavenger in art and mcf only; V-way [ISCA'05] (0.87) beats Scavenger only in ammp; Scavenger (0.84) improves across the board. Annotation: the Bloom filters (BFs) are the bottleneck.] Scavenger (IITK-Cornell)

  31. Summary of Scavenger • Last-level cache arch. with algorithms to discover global block priority • Divides the storage into a conventional set-associative cache and a large fast VF offering the functionality of a FA VF without using any CAM • Insertion into VF is controlled by a priority queue backed by a cache block miss frequency estimator • Offers IPC improvement of up to 63% and on average 8% for a set of sixteen SPEC 2000 applications Scavenger (IITK-Cornell)

  32. Scavenger: A New Last Level Cache Architecture with Global Block Priority THANK YOU! Arkaprava Basu, IIT Kanpur Nevin Kirman, Cornell Mainak Chaudhuri, IIT Kanpur Meyrem Kirman, Cornell Jose F. Martinez, Cornell
