Cooling the Hot Sets: Improved Space Utilization in Large Caches via Dynamic Set Balancing

Cooling the Hot Sets: Improved Space Utilization in Large Caches via Dynamic Set Balancing Mainak Chaudhuri, IIT Kanpur mainakc@iitk.ac.in

Talk in one slide • Traditional cache designs use closed-address hashing with a fixed collision chain length (known as the associativity) • Clustering of physical addresses onto a few hot sets is a well-known phenomenon • Non-uniform set utilization leads to a high volume of conflict misses • First proposal of a fully dynamic scheme to re-balance sets by migrating blocks from “hot regions” to “cooler regions” Balanced $ (IIT, Kanpur)

Sketch • Observations • Design detail • Destination of migration • Locating migrated blocks • Hit/Miss critical path • Selective migration • Throttling migration • Retaining migrated blocks • Scaling to CMPs • Simulation results • Summary

Observation #1

Observation #2, #3

Design detail • Overview • The basic idea is to migrate evicted blocks to sets with a smaller fill count • Involves the following sub-problems • Identify a good receiver set quickly • Locate migrated blocks efficiently • Offer dynamic control of the hit/miss critical path • Optimizations worth exploring • Selective migration (not all blocks are important) • Bound migrations from a particular set • Retain migrated blocks (the difficult part)

Destination of migration • Associate a saturating counter C(s) with each set s and a global counter G • Increment C(s) on a refill into s • When C(s) reaches a value equal to the associativity, increment G • When G reaches a value equal to the number of sets, reset G and C(s) for all s • Size C(s) so that it can count up to k times the associativity (we set k to 4)

Destination of migration • Divide the sets into clusters and associate a saturating counter D(u) with each cluster u • Increment D(u) whenever C(s) is incremented for some s in u • Reset D(u) when all C(s) are reset • Have a comparator tree to compute the minimum among all D(u) whenever an increment takes place (scalable?) • Have a second comparator tree to compute the minimum among all C(s) within the minimum cluster u found by the first tree; the set t with this minimum is the target of migration, provided C(s) > C(t) for the source set s
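The counter scheme on the two slides above can be sketched in software as follows. This is only an illustrative model, not the hardware design: the class name `FillTracker`, the parameter `cluster_size`, and the use of `min` in place of the two comparator trees are assumptions made for the example. D(u) needs no explicit saturation here because it only grows when some C(s) grows.

```python
class FillTracker:
    """Software model of C(s), D(u) and G (names are illustrative)."""

    def __init__(self, num_sets, assoc, cluster_size, k=4):
        # cluster_size is an assumed parameter; it must divide num_sets.
        self.num_sets = num_sets
        self.assoc = assoc
        self.cluster_size = cluster_size
        self.c_max = k * assoc          # C(s) saturates at k * associativity
        self.reset()

    def reset(self):
        self.c = [0] * self.num_sets                          # C(s)
        self.d = [0] * (self.num_sets // self.cluster_size)   # D(u)
        self.g = 0                                            # G

    def refill(self, s):
        if self.c[s] == self.c_max:     # saturated: nothing changes
            return
        self.c[s] += 1
        self.d[s // self.cluster_size] += 1
        if self.c[s] == self.assoc:     # C(s) just reached the associativity
            self.g += 1
            if self.g == self.num_sets:
                self.reset()            # reset G, all C(s) and all D(u)

    def pick_target(self, source):
        # First "comparator tree": cluster with the minimum D(u).
        u = min(range(len(self.d)), key=self.d.__getitem__)
        # Second "comparator tree": set with the minimum C(s) inside u.
        base = u * self.cluster_size
        t = min(range(base, base + self.cluster_size), key=self.c.__getitem__)
        # Migrate only if the source is strictly fuller than the target.
        return t if self.c[source] > self.c[t] else None
```

For example, with 8 sets in clusters of 4, three refills into set 0 and one into set 5 make cluster 1 the coolest, and set 4 becomes the migration target for set 0.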

Locating migrated blocks • The migrated tags are duplicated in a migration tag cache (MTC) • MTC is organized as a direct-mapped table • Each entry has a tag, a target set index, a forward pointer to an MTC entry, a backward pointer to an MTC entry, a head bit, and a tail bit • Starting at an index of the MTC, one can follow the forward pointers in a linked list until the tail bit is encountered • One tag list in the MTC corresponds to the migrated tags from a particular parent set in the main cache

Locating migrated blocks • Tag lookup protocol • With each set s in the main cache, a head pointer H(s) into the MTC is maintained; H(s) holds the MTC index where the list of migrated tags belonging to set s begins • The main cache is looked up first as usual • On a miss, H(s) is read out and an MTC walk is initiated at index H(s) • Note that on reset, the MTC is organized as a free list; a new migration from set s allocates an MTC entry, links it at the head of the list starting at H(s), and updates H(s)

Locating migrated blocks • Tag lookup protocol • On an MTC hit, the block is swapped with the LRU block in the parent set to improve future hit latency (behaves like a folded victim cache) • It is necessary to avoid false hits • With migration, the same set may now contain the same tag multiple times • Each tag is extended by log2(A) bits, where A is the associativity; the target way of a migrated tag is stored along with the tag

Locating migrated blocks • Replacement of migrated blocks • A migrated block may get replaced due to primary or secondary replacements • On a primary replacement, the migrated block is migrated again to a different target set; this case is easy to handle because it requires only an MTC entry modification • But to get to the MTC entry, one needs to maintain a direct MTC entry pointer MEP(t) with each migrated tag t in the main cache

Locating migrated blocks • Replacement of migrated blocks • A secondary migrated block replacement evicts the block from the cache • This requires delinking the tag from its list • Efficient delinking is possible only in doubly-linked lists and this is why we need a backward pointer with each MTC entry • Also, this may need updating the H(s) field in the parent set s • To be able to get to the parent set, each MTC entry needs to store the parent set index
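The MTC list handling described on the slides above (allocate from the free list and link at the head, walk from H(s), re-target on a primary replacement, delink on a secondary replacement) might look like the following minimal sketch. The class `MTC`, its method names, and the dictionary field names are hypothetical; `None` pointers stand in for the head/tail bits and for a cleared VALID(H(s)).

```python
class MTC:
    """Toy migration tag cache with one doubly-linked list per parent set."""

    def __init__(self, num_entries, num_sets):
        self.entries = [None] * num_entries    # live entries are dicts
        self.free = list(range(num_entries))   # on reset, all entries are free
        self.head = [None] * num_sets          # H(s); None means invalid

    def migrate(self, parent, tag, target_set, target_way):
        """Allocate an entry and link it at the head of the parent's list."""
        if not self.free:
            return None                        # no MTC space: migration rejected
        m = self.free.pop()
        old_head = self.head[parent]
        self.entries[m] = {"tag": tag, "way": target_way, "ts": target_set,
                           "ps": parent, "fptr": old_head, "bptr": None}
        if old_head is not None:
            self.entries[old_head]["bptr"] = m
        self.head[parent] = m
        return m                               # kept as MEP(t) in the main cache

    def lookup(self, parent, tag):
        """Walk the list from H(parent); return the entry index or None."""
        m = self.head[parent]
        while m is not None:
            if self.entries[m]["tag"] == tag:
                return m
            m = self.entries[m]["fptr"]
        return None

    def retarget(self, m, target_set, target_way):
        """Primary replacement: only the target fields of the entry change."""
        self.entries[m]["ts"] = target_set
        self.entries[m]["way"] = target_way

    def delink(self, m):
        """Secondary replacement: unlink entry m and return it to the free list."""
        e = self.entries[m]
        if e["bptr"] is None:                  # m was the head: update H(parent)
            self.head[e["ps"]] = e["fptr"]
        else:
            self.entries[e["bptr"]]["fptr"] = e["fptr"]
        if e["fptr"] is not None:
            self.entries[e["fptr"]]["bptr"] = e["bptr"]
        self.entries[m] = None
        self.free.append(m)
```

The backward pointer is what lets `delink` run without a walk, and the stored parent set index `ps` is what lets it find H(s) when the head entry is removed.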

Locating migrated blocks • Summary of structures added so far • Per set s: one saturating counter C(s), one head pointer H(s) and VALID(H(s)) • Per tag t: MTC entry pointer MEP(t) and VALID(MEP(t)), extra way bits W(t) • Per MTC entry m: migrated tag MT(m) including the extra way bits, target set index TS(m), parent set index PS(m), forward pointer FPTR(m), backward pointer BPTR(m), head/tail bits HT(m) • Per set cluster u: saturating counter D(u) • A global saturating counter G • Two comparator trees

Hit/Miss critical path • Reducing the MTC walk latency • Proposal #1: Make the MTC dual-ported so that a list can be walked from both ends (a win-win situation); halves both the hit and the miss paths • Add a tail pointer T(s) to each set (along with H(s)) so that the tail of a list can be accessed directly • Proposal #2: Maintain a summary of the migrated tags from a set s in a small filter F(s) attached to s • Query F(s) first before walking the MTC; a negative response from F(s) means the tag is definitely not in the MTC; optimizes the miss path only

Hit/Miss critical path • Reducing the MTC walk latency • We experimented with a simple design of a 60-bit F(s) with great success • Divide the 60 bits into nine segments: each of the lower eight segments is seven bits wide and the upper segment is four bits wide • When a tag t is queried, the lower three bits of t identify one of the lower eight segments of F(s) • Let the contents of the identified segment be f[6:0] and the contents of the upper segment be g[3:0]

Hit/Miss critical path • Reducing the MTC walk latency • The filter says “yes” if and only if (f[6:0] AND t[9:3]) == t[9:3] and (g[3:0] AND t[13:10]) == t[13:10] • A newly migrated tag t is hashed into F(s) by ORing t[9:3] into the identified segment and ORing t[13:10] into the upper segment • F(s) is not updated when a migrated tag is removed (not possible to update) • On a false positive from F(s), all the migrated tags for the set s will have to be visited anyway; at this time F(s) is cleared and rebuilt
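The 60-bit filter F(s) described above can be modeled in a few lines. The sketch below follows the bit fields given on the slides (t[2:0] selects a segment, t[9:3] and t[13:10] are ORed in); the class and method names are hypothetical, and the rebuild step simply re-inserts whatever tags the caller collected during the MTC walk.

```python
class MigrationFilter:
    """Model of the 60-bit F(s): eight 7-bit segments plus one 4-bit segment."""

    def __init__(self):
        self.seg = [0] * 8    # lower eight segments, 7 bits each
        self.upper = 0        # upper segment, 4 bits

    @staticmethod
    def _fields(tag):
        # (segment index t[2:0], t[9:3], t[13:10])
        return tag & 0x7, (tag >> 3) & 0x7F, (tag >> 10) & 0xF

    def insert(self, tag):
        i, lo, hi = self._fields(tag)
        self.seg[i] |= lo
        self.upper |= hi

    def may_contain(self, tag):
        # "yes" iff every set bit of the tag fields is also set in the filter
        i, lo, hi = self._fields(tag)
        return (self.seg[i] & lo) == lo and (self.upper & hi) == hi

    def rebuild(self, live_tags):
        # Clear and re-hash the tags found while walking the list.
        self.seg = [0] * 8
        self.upper = 0
        for tag in live_tags:
            self.insert(tag)
```

Two properties follow directly from the formula: a "no" answer is always correct, and a tag whose bits t[13:3] are all zero passes even an empty filter, so a real design would presumably still gate the walk on VALID(H(s)).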

Selective migration • Not all blocks are important • Unnecessary migrations waste energy and may hurt performance by using up MTC space • Ideally, we want to migrate the most frequently missing blocks • Usually, these blocks are associated with the hot sets • The idea, therefore, should be to identify the hot sets and migrate only the blocks evicted from the hot sets

Selective migration • Identifying hot sets • Associate a saturating counter R(s) with each set s to count the number of external refills to the set • Whenever some R(s) reaches its maximum value, all R(s) are reset (leader-decides rule) • Maintain the total refill count across all sets in a register TRC and the maximum refill count across all sets in another register MaxRC; let the average refill count be ARC = TRC >> log2(|S|) • Definition: A set s is hot if and only if R(s) > ARC + ((MaxRC - ARC) >> delta) • Delta is dynamically incremented
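As a small worked sketch of the hot-set definition above: the function name and argument names are illustrative, |S| is assumed to be a power of two, and the shift is assumed to apply only to (MaxRC - ARC), so that the threshold lies between the average and the maximum refill count.

```python
def is_hot(r_s, trc, max_rc, num_sets, delta):
    """Return True if a set with refill count r_s is classified as hot."""
    log_sets = num_sets.bit_length() - 1     # log2(|S|), power of two assumed
    arc = trc >> log_sets                    # ARC = TRC >> log2(|S|)
    # Explicit parentheses: in Python (as in C), '+' binds tighter than '>>'.
    threshold = arc + ((max_rc - arc) >> delta)
    return r_s > threshold
```

With 1024 sets, TRC = 8192 and MaxRC = 40, ARC is 8; delta = 1 gives a threshold of 24, while delta = 2 gives 16. Incrementing delta therefore moves the threshold toward the average and classifies more sets as hot.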

Throttling migration • If a set becomes very hot, it may start migrating a large number of blocks • While this may appear desirable, the monotonically increasing expected MTC walk cost soon outweighs the benefits • We impose a limit on the length of the migrated tag list belonging to a particular set • However, a static limit may not work; so the limit is dynamically increased by monitoring the volume of migrations rejected because the length limit is too short • Each set s now maintains a list length register LLR(s)

Retaining migrated blocks • The number of misses between two misses to the same block is often very high • This points to the danger of losing the migrated blocks before they get reused • We need to design a replacement policy that gives lower replacement priority to the migrated blocks because these are the blocks we really want to retain • Classify the sets into high-hit and low-hit sets • For high-hit sets, continue with the baseline policy (LRU in our case) • For low-hit sets, consider the non-migrated blocks before the migrated ones

Retaining migrated blocks • Associate a hit counter HC(s) with each set s • Reset HC(s) when the refill counter is reset • Count a hit on a migrated block as a hit in the parent set • Classify a set as low-hit if and only if HC(s) ≤ h·R(s) and R(s) > r for some constants h > 1 and r < associativity • We fix h to 4 and r to one-eighth of the associativity • More research is needed on better retention schemes • Retention is going to play a big role
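The low-hit classification and the retention policy from the two slides above can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, each set is represented as a list of (way, is_migrated) pairs in LRU-to-MRU order, and non-migrated candidates in a low-hit set are taken in LRU order, which the slides do not spell out.

```python
def is_low_hit(hc, r, assoc, h=4):
    """Low-hit iff HC(s) <= h * R(s) and R(s) > r, with r = assoc / 8."""
    r_min = assoc // 8
    return hc <= h * r and r > r_min


def pick_victim(ways, low_hit):
    """Choose the way to replace.

    ways: list of (way_index, is_migrated) ordered from LRU to MRU.
    """
    if low_hit:
        # Low-hit set: consider non-migrated blocks before migrated ones.
        for way, migrated in ways:
            if not migrated:
                return way
    # High-hit set, or every block is migrated: plain LRU.
    return ways[0][0]
```

For a 16-way cache, a set with 10 hits and 3 refills is low-hit (10 ≤ 12 and 3 > 2), so its LRU non-migrated block is chosen as the victim even if a migrated block is older.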

Scaling to CMPs • Assume that the CMP caches will be banked • All the policies can be applied to each bank or a subset of close-by banks independently • No cross-bank (or cross-switch) migration • Use cross-bank migration only for proximity enhancement (more detail in second talk) • The entire design scales seamlessly to larger caches • In our simulations, we assume that a pair of banks share a switch on a ring and cross-bank migration is allowed only within a pair

Simulation results • Single-threaded and multi-threaded applications • Single-threaded runs are done on 2 MB 16-way L2 caches • Multi-threaded runs are done on 8 cores sharing a 4 MB 16-way L2 cache • Each core has private L1 caches • The MTC is sized to hold half as many tags as the main cache • Space overhead of about 56 KB per 1 MB bank


Summary • Huge potential for improving performance and saving energy with slightly over 5% extra storage (about 56 KB per 1 MB bank) • Logic simplifications need to be explored further

Cooling the Hot Sets: Improved Space Utilization in Large Caches via Dynamic Set Balancing THANK YOU! Mainak Chaudhuri, IIT Kanpur mainakc@iitk.ac.in
