
ECE8833 Polymorphous and Many-Core Computer Architecture



Presentation Transcript


  1. ECE8833 Polymorphous and Many-Core Computer Architecture Lecture 5 Non-Uniform Cache Architecture for CMP Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering

  2. CMP Memory Hierarchy • Continuing device scaling leads to • Deeper memory hierarchy (L2, L3, etc.) • Growing cache capacity • 6MB in AMD’s Phenom quad-core • 8MB in Intel Core i7 • 24MB L3 in Itanium 2 • Global wire delay • Routing dominates access time • Design for worst case • Compromise for the slowest access • Penalize overall memory accesses • Undesirable

  3. Evolution of Cache Access Time • Facts • Large shared on-die L2 • Wire delay dominating on-die cache access • 1 MB @ 180 nm (1999): 3 cycles • 4 MB @ 90 nm (2004): 11 cycles • 16 MB @ 50 nm (2010): 24 cycles

  4. Multi-Banked L2 Cache (2 MB @ 130 nm, bank = 128 KB) • Total access time = 11 cycles • Bank access time = 3 cycles • Interconnect delay = 8 cycles

  5. Multi-Banked L2 Cache (16 MB @ 50 nm, bank = 64 KB) • Total access time = 47 cycles • Bank access time = 3 cycles • Interconnect delay = 44 cycles

  6. NUCA: Non-Uniform Cache Architecture [Kim et al., ASPLOS-X 2002] • Partition a large cache into banks • Non-uniform latencies for different banks • Design space exploration • Mapping • How many banks? (i.e., what is the granularity?) • How are lines mapped to each bank? • Search • Strategy for searching the set of possible locations of a line • Movement • Should a line always stay in the same bank? • How does a line migrate between banks over its lifetime?

  7. Cache Hierarchy Taxonomy (16 MB @ 50 nm) • Average access times, from simulation modeling bank and channel conflicts (contentionless latencies from CACTI) [Kim et al., ASPLOS-X 2002] • UCA: 1 bank, 255 cycles • ML-UCA: 1 bank per level, 11/41 cycles (L2/L3) • S-NUCA-1: 32 banks, 34 cycles • S-NUCA-2: 32 banks, 24 cycles • D-NUCA: 256 banks, 18 cycles

  8. Static NUCA-1 Using Private Channels • Upside • Increases the number of banks to avoid one bulky, slow access • Parallelizes accesses to different banks • Overhead • Decoders • Wire-dominated: a separate set of private wires is required for every bank • Each bank has its own distinct access latency • Data location is statically pre-determined by its address; the low-order bits form the bank index (see the sketch below) • Average access latency = 34.2 cycles • Wire overhead = 20.9% → an issue (Figure: bank organized into sub-banks with predecoder, tag array, wordline drivers and decoders, sense amplifiers, and private address/data buses.)
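To make the static mapping concrete, here is a minimal Python sketch of low-order-bit bank selection; the 64-byte line size and 32-bank configuration are assumptions for illustration, not values taken from the slide.

```python
# Minimal sketch of S-NUCA-1 static bank selection (assumed parameters).
LINE_BYTES  = 64
NUM_BANKS   = 32
OFFSET_BITS = (LINE_BYTES - 1).bit_length()   # 6 offset bits for 64-byte lines

def snuca1_bank(addr: int) -> int:
    """Bank that statically holds this address: the low-order bits
    just above the line offset form the bank index."""
    return (addr >> OFFSET_BITS) & (NUM_BANKS - 1)

# Consecutive cache lines land in consecutive banks, spreading a
# sequential stream across all of the private channels.
for line in range(4):
    addr = line * LINE_BYTES
    print(hex(addr), "-> bank", snuca1_bank(addr))
```

Because the mapping is fixed, a line's access latency is determined once and for all by its address, which is exactly the property D-NUCA relaxes.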

  9. Static NUCA-2 Using Switched Channels • Reduces the wire congestion of Static NUCA-1 by using a 2D switched network • Wormhole-routed flow control • Each switch buffers 128-bit packets • Average access latency = 24.2 cycles • On average, 0.8 cycles of "bank" contention + 0.7 cycles of "link" contention in the network • Wire overhead = 5.9% (Figure: banks with tag arrays, predecoders, and wordline drivers/decoders connected through switches to the data bus.)

  10. Dynamic NUCA (D-NUCA: 256 banks, 18 cycles) • Data can dynamically migrate • Promote frequently used cache lines closer to the CPU • Data management • Mapping • How many banks? (i.e., what is the granularity?) • How are lines mapped to each bank? • Search • Strategy for searching the set of possible locations of a line • Movement • Should a line always stay in the same bank? • How does a line migrate between banks over its lifetime?

  11. Dynamic NUCA • Simple Mapping • All 4 ways of each bank set need to be searched (see the sketch below) • Non-uniform access times for different bank sets • Farther bank sets → longer access (Figure: 8 bank sets, each spanning ways 0–3, connected to the memory controller.)
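As a rough illustration of the simple mapping, the following sketch lists the candidate banks that must be searched for a given set index; the 8-bank-set, 4-way geometry mirrors the slide's figure, and everything else is assumed.

```python
# Sketch of D-NUCA simple mapping: a line's set index selects one of the
# 8 bank sets (columns); the line may reside in any of that column's 4 banks
# (ways), ordered here from nearest to farthest from the controller.
NUM_BANK_SETS     = 8
WAYS_PER_BANK_SET = 4

def candidate_banks(set_index: int):
    """(bank_set, way) pairs that must be searched, nearest first."""
    bank_set = set_index % NUM_BANK_SETS
    return [(bank_set, way) for way in range(WAYS_PER_BANK_SET)]

print(candidate_banks(13))   # all 4 ways of bank set 5 are candidates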

  12. Dynamic NUCA • Fair Mapping (proposed, not studied in the paper) • Average access time is equal across all bank sets • Complex routing, likely more contention (Figure: 8 bank sets routed to the memory controller.)

  13. Dynamic NUCA • Shared Mapping • The closest banks are shared among multiple bank sets • Some banks have slightly higher associativity, which offsets the increased average access latency due to distance (Figure: 8 bank sets sharing the banks nearest the memory controller.)

  14. Locating a NUCA Line • Incremental search • From the closest bank to the farthest • (Limited, partitioned) Multicast search • Search all (or a partition of) the banks in parallel • Return time depends on the routing distance • Smart search • Use partial tag comparison [Kessler '89] (used in the P6); see the sketch below • Keep the partial tag array in the cache controller • Similar modern technique: Bloom filters
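The sketch below illustrates the smart-search idea with partial tags; the 6-bit partial-tag width and the dictionary layout are assumptions for illustration, not details from the paper.

```python
# Sketch of partial-tag filtering for smart search. A match may be a false
# positive (the full tag must still be checked at the bank), but a mismatch
# safely rules the bank out, so only matching banks are probed.
PARTIAL_TAG_BITS = 6   # assumed width

def partial(tag: int) -> int:
    return tag & ((1 << PARTIAL_TAG_BITS) - 1)

def banks_to_probe(tag: int, partial_tags: dict) -> list:
    """partial_tags maps bank id -> stored partial tag for the accessed set."""
    return [bank for bank, ptag in partial_tags.items() if ptag == partial(tag)]

stored = {0: 0x15, 1: 0x2A, 2: 0x15, 3: 0x01}   # hypothetical controller state
print(banks_to_probe(0xABC15, stored))          # probe only banks 0 and 2
```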

  15. D-NUCA: Dynamic Movement of Cache Lines — Placement Upon a Hit • LRU ordering • A conventional implementation only adjusts LRU bits • NUCA requires physical movement to realize the latency benefit (n copy operations) • Generational Promotion • On a hit, the line swaps only with the line in the neighboring bank closer to the controller (sketched below) • A line receives more "latency reward" when it is hit repeatedly (Figure: old vs. new state of a bank set, with the hit line moving one bank toward the controller.)
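A minimal sketch of generational promotion, using a Python list to stand in for one bank set; index 0 is the bank closest to the controller, and the model is illustrative only.

```python
# Generational promotion: on a hit, the line swaps only with the line in the
# neighboring bank one step closer to the controller, so lines that keep
# getting hit migrate toward the controller one bank per hit.
def promote_on_hit(bank_set: list, hit_index: int) -> None:
    if hit_index > 0:
        bank_set[hit_index - 1], bank_set[hit_index] = (
            bank_set[hit_index], bank_set[hit_index - 1])

banks = ["A", "B", "C", "D"]   # "A" sits closest to the controller
promote_on_hit(banks, 3)       # hit on line "D"
print(banks)                   # ['A', 'B', 'D', 'C']
```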

  16. D-NUCA: Dynamic Movement of Cache Lines — Upon a Miss • Incoming line insertion • Into a distant bank • Into the MRU (closest) position • Victim eviction • Zero copy: the victim leaves the cache entirely • One copy: the victim is demoted to a more distant bank (Figure: insertion/eviction combinations — some distant bank with zero copy, some distant bank with one copy, MRU bank with one copy, most distant bank (assist-cache concept); see the sketch below.)
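A sketch of the insertion and eviction choices on this slide, again using a list as a stand-in for one bank set (index 0 closest to the controller); the handling of the line displaced by a demoted victim is deliberately simplified.

```python
# Upon a miss: insert the incoming line either into the MRU (closest) bank
# or into the most distant bank; the displaced victim is either dropped
# ("zero copy") or demoted one bank farther from the controller ("one copy").
def insert_on_miss(bank_set: list, new_line: str,
                   insert_at_mru: bool, one_copy: bool) -> None:
    slot = 0 if insert_at_mru else len(bank_set) - 1
    victim = bank_set[slot]
    bank_set[slot] = new_line
    if one_copy and slot + 1 < len(bank_set):
        bank_set[slot + 1] = victim   # one copy: victim moves one bank farther
    # zero copy: the victim simply leaves this cache level

banks = ["A", "B", "C", "D"]
insert_on_miss(banks, "E", insert_at_mru=True, one_copy=True)
print(banks)   # ['E', 'A', 'C', 'D']
```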

  17. Sharing the NUCA Cache in a CMP • Sharing Degree (SD) of N: the number of processor cores that share one cache • Low SD • Smaller private partitions • Good hit latency, poor hit rate • More discrete L2 caches • Expensive L2 coherence • E.g., needs a centralized L2 tag directory for L2 coherence • High SD • Good hit rate, poor hit latency • More efficient inter-core communication • More expensive L1 coherence

  18. 16-Core CMP Substrate and SD [Huh et al., ICS '05] • Low SD (e.g., 1) needs either snooping or a central L2 tag directory for coherence • High SD (e.g., 16) also needs a directory to indicate which L1s hold a copy (used in the Piranha CMP)

  19. Trade-offs of Cache Sharing Among Cores • Downside • Larger structure, slower access • Longer wire delay • More congestion on the shared interconnect • Upside • Keeps a single copy of data • Uses area more efficiently • Faster inter-core communication • No L2 coherence fabric needed

  20. Flexible Cache Mapping [Huh et al., ICS '05] • Static mapping • L2 access latency is fixed at line-placement time • Dynamic mapping • D-NUCA idea: a line can migrate across multiple banks • The line moves closer to the core that accesses it frequently • Lookup could be expensive → search all partial tags first

  21. Flexible Cache Sharing • Multiple sharing degrees for different classes of blocks (per-line sharing degree) • Classify lines as • Private (assign a smaller SD, e.g., SD = 1 or 2) • Shared (assign a larger SD, e.g., SD = 16) • The study found a 6 to 7% improvement vs. the best uniform SD

  22. Enhancing Cache/Memory Performance • Cache Partitioning • Explicitly manage cache allocation among processes • Each process benefits differently from additional cache space • Similar to main-memory partitioning [Stone '92] in the good old days • Memory-Aware Scheduling • Choose a set of simultaneous processes to minimize cache contention • Symbiotic scheduling for SMT by the OS • Sample and collect information (performance counters) about possible schedules • Predict the best schedule (e.g., based on resource contention) • Complexity is high for many processes • Admission control for gang scheduling • Based on the footprint of a job (total memory usage) Slide adapted from Ed Suh's HPCA '02 presentation

  23. Victim Replication

  24. Today's Chip Multiprocessors (Shared L2) • Layout: "dance hall" • Per processing node: core + L1 cache • Shared L2 cache • Small L1 cache • Fast access • Large L2 cache • Good hit rate • Slower access latency (Figure: cores with private L1s on one side of an intra-chip switch, the shared L2 cache on the other.) Slide adapted from presentation by Zhang and Asanovic, ISCA '05

  25. Today's Chip Multiprocessors (Shared L2) • Layout: "dance hall" • Per processing node: core + L1 cache • Shared L2 cache • Alternative: a large L2 cache divided into slices to minimize latency and power (i.e., NUCA) • Challenge • Minimize the average access latency • Make the average memory latency approach the best-case latency (Figure: cores with private L1s connected through an intra-chip switch to many L2 slices.) Slide adapted from presentation by Zhang and Asanovic, ISCA '05

  26. Dynamic NUCA Issues • Does not work well with CMPs • The "unique" copy of a line cannot be close to all of its sharers • Behavior • Over time, shared data migrates to a location "equidistant" from all sharers [Beckmann & Wood, MICRO-36] Slide adapted from presentation by Zhang and Asanovic, ISCA '05

  27. Tiled CMP with a Directory-Based Protocol • Tiled CMPs for scalability • Minimal redesign effort • Use a directory-based protocol for scalability • Manage the L2s to minimize the effective access latency • Keep data close to the requestors • Keep data on-chip • Two baseline L2 cache designs • Each tile has its own private L2 • All tiles share a single distributed L2 (Figure: 4x4 grid of tiles; each tile contains a core, an L1, a switch, and an L2 slice with data and tag arrays.) Slide adapted from presentation by Zhang and Asanovic, ISCA '05

  28. "Private L2" Design Keeps Hit Latency Low • The local L2 slice is used as a private L2 cache for the tile • Shared data is "duplicated" in the L2 of each sharer • Coherence must be kept among all sharers at the L2 level • Similar to DSM • On an L2 miss: • Data not on-chip • Data available in the private L2 cache of another tile (Figure: two sharer tiles, each with a core, L1, private L2 data, L2 tag, and directory.) Slide adapted from presentation by Zhang and Asanovic, ISCA '05

  29. "Private L2" Design Keeps Hit Latency Low • The local L2 slice is used as a private L2 cache for the tile • Shared data is "duplicated" in the L2 of each sharer • Coherence must be kept among all sharers at the L2 level • Similar to DSM • On an L2 miss: • Data not on-chip → off-chip access • Data available in the private L2 cache of another tile → cache-to-cache reply forwarding • The home node (directory) is statically determined by the address (Figure: requestor, owner/sharer, and home node tiles; misses go either off-chip or to the owning tile.)

  30. "Shared L2" Design Gives Maximum Capacity • All on-chip L2 slices form a distributed shared L2, backing up all L1s • No duplication: data is kept in a unique L2 location • Coherence must be kept among all sharers at the L1 level • On an L2 miss: • Data not in the L2 → off-chip access • Coherence miss → cache-to-cache reply forwarding • The home node is statically determined by the address (see the sketch below) (Figure: requestor, owner/sharer, and home node tiles, each with a core, L1, shared L2 data, L2 tag, and directory.)
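For the shared design, the home tile is a static function of the address; a minimal sketch follows, where the 16-tile count matches the substrate above but the bit positions and 64-byte line size are assumptions.

```python
# Home-tile selection in the shared-L2 design: a few address bits just above
# the cache-line offset pick the tile whose L2 slice and directory own the line.
NUM_TILES   = 16
OFFSET_BITS = 6          # assumed 64-byte cache lines

def home_tile(addr: int) -> int:
    return (addr >> OFFSET_BITS) % NUM_TILES

print(home_tile(0x4DC0))   # tile holding this line's home L2 slice and directory
```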

  31. Private vs. Shared L2 CMP • Private L2 • Uniform lower latency if found in local L2 • Duplication reduces L2 capacity • Shared L2 • Long/non-uniform L2 hit latency • No duplication maximizes L2 capacity

  32. Private vs. Shared L2 CMP • Private L2 • Uniform lower latency if found in local L2 • Duplication reduces L2 capacity • Shared L2 • Long/non-uniform L2 hit latency • No duplication maximizes L2 capacity Victim Replication: Provides low hit latency while keeping the working set on-chip

  33. Normal L1 Eviction in a Shared-L2 CMP • When an L1 cache line is evicted • Write back to the home L2 if dirty • Update the home directory (Figure: sharer tiles i and j and the home node, each with a core, L1, shared L2 data, L2 tag, and directory.)

  34. Victim Replication • Replicas • L1 victims are stored in the local L2 slice • Reused later for lower access latency (Figure: sharer tiles i and j and the home node; an evicted L1 line is kept as a replica in the evicting tile's L2 slice.)

  35. Hitting the Victim Replica • Look up the local L2 slice first • A miss follows the normal transaction to fetch the line from the home node • A replica hit invalidates the replica (see the sketch below) (Figure: sharer tiles i and j and the home node; the requesting tile hits its local replica.)
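A compact sketch of the L1-miss path with victim replication; the data structures are placeholders, and home_l2_request stands in for the normal directory-based transaction to the home node.

```python
# On an L1 miss, the local L2 slice is checked first. A replica hit returns
# the line and invalidates the replica; otherwise the request follows the
# normal shared-L2 transaction to the line's home node.
def l1_miss(addr, local_replicas: dict, home_l2_request):
    line = local_replicas.pop(addr, None)   # hit: replica is invalidated
    if line is not None:
        return line
    return home_l2_request(addr)            # normal transaction to the home node

replicas = {0x4DC0: b"cached line"}
print(l1_miss(0x4DC0, replicas, lambda a: b"fetched from home"))  # replica hit
print(l1_miss(0x4DC0, replicas, lambda a: b"fetched from home"))  # now a miss
```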

  36. Replication Policy • A replica is inserted only when one of the following is found (in priority order; built up on the next slides) (Figure: sharer tiles i and j and the home node.)

  37. Replication Policy: Where to Insert? • A replica is inserted only when one of the following is found (in priority order) • An invalid line (Figure: sharer tiles i and j and the home node.)

  38. Replication Policy: Where to Insert? • A replica is inserted only when one of the following is found (in priority order) • An invalid line • A global line with no sharers (i.e., the line in its home has no sharer) (Figure: sharer tiles i and j and the home node.)

  39. Replication Policy: Where to Insert? • A replica is inserted only when one of the following is found (in priority order) • An invalid line • A global line with no sharers • An existing replica (Figure: sharer tiles i and j and the home node.)

  40. Replication Policy: Where to Insert? • A replica is inserted only when one of the following is found (in priority order) • An invalid line • A global line with no sharers • An existing replica • A line is never replicated when • A global line has remote sharers (Figure: sharer tiles i and j and the home node; replacing a global line that still has remote sharers is disallowed.)

  41. Replication Policy: Where to Insert? • A replica is inserted only when one of the following is found (in priority order) • An invalid line • A global line with no sharers • An existing replica • A line is never replicated when • A global line has remote sharers • The victim's home tile is local (see the sketch of the full policy below) (Figure: the requesting tile is itself the home node, so no replica is made.)
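The full insertion policy built up over slides 36–41 can be sketched as a small victim-selection routine; the Line metadata layout is an assumption for illustration, not the paper's actual structures.

```python
from dataclasses import dataclass

@dataclass
class Line:
    state: str         # 'invalid', 'global', or 'replica' (assumed encoding)
    sharers: int = 0   # remote L1 sharers, meaningful only for 'global' lines

def choose_replica_slot(l2_set, victim_home_is_local: bool):
    """Pick the line to overwrite with an L1 victim's replica, or None if the
    victim must not be replicated (its home tile is local, or nothing is
    replaceable without evicting a global line that still has remote sharers)."""
    if victim_home_is_local:
        return None
    for want in ("invalid", "global_no_sharers", "replica"):   # priority order
        for line in l2_set:
            if want == "invalid" and line.state == "invalid":
                return line
            if want == "global_no_sharers" and line.state == "global" and line.sharers == 0:
                return line
            if want == "replica" and line.state == "replica":
                return line
    return None

# Example: the invalid line wins over the sharer-free global line and the replica.
ways = [Line("global", sharers=2), Line("replica"), Line("invalid")]
print(choose_replica_slot(ways, victim_home_is_local=False).state)   # 'invalid'
```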

  42. VR Combines Global Lines and Replicas • Victim replication dynamically creates a large, local private victim cache for the local L1 cache (Figure: tiles under the Private L2 design, the Shared L2 design, and Victim Replication; under VR the shared L2 slice is partly filled with L1 victims.) Slide adapted from presentation by Zhang and Asanovic, ISCA '05

  43. When the Working Set Does Not Fit in the Local L2 • The capacity advantage of the shared design yields many fewer off-chip misses • The latency advantage of the private design is offset by costly off-chip accesses • Victim replication beats even the shared design by creating replicas that reduce access latency (Figure: average data access latency and access breakdown for L2P, L2S, and L2VR; hits in L1 are best, hits in the local L2 very good, hits in a non-local L2 OK, and off-chip misses not good.)

  44. Average Latencies of Different CMPs • Single-threaded applications: L2VR excels in 11 out of 12 cases • Multi-programmed workload: L2P is always the best
