Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture

Dhruba Chandra, Fei Guo, Seongbeom Kim, Yan Solihin
Electrical and Computer Engineering, North Carolina State University
HPCA-2005

Cache Sharing in CMP
[Figure: Processor Core 1 and Processor Core 2, each with an L1 $, sharing one L2 $ ……]
Chandra, Guo, Kim, Solihin - Contention Model

Impact of Cache Space Contention
• Application-specific (what)
• Coschedule-specific (when)
• Significant: up to 4X cache misses, 65% IPC reduction
• Need a model to understand cache sharing impact

Related Work
• Uniprocessor miss estimation:
  Cascaval et al., LCPC 1999; Chatterjee et al., PLDI 2001; Fraguela et al., PACT 1999; Ghosh et al., TOPLAS 1999; J. Lee et al., HPCA 2001; Vera and Xue, HPCA 2002; Wassermann et al., SC 1997
• Context switch impact on time-shared processor:
  Agarwal, ACM Trans. on Computer Systems, 1989; Suh et al., ICS 2001
• No model for cache sharing impact:
  • Relatively new phenomenon: SMT, CMP
  • Many possible access interleaving scenarios

Contributions
• Inter-thread cache contention models
  • 2 heuristic models (refer to the paper)
  • 1 analytical model
  • Input: circular sequence profiling for each thread
  • Output: predicted number of cache misses per thread in a co-schedule
• Validation
  • Against a detailed CMP simulator
  • 3.9% average error for the analytical model
• Insight
  • Temporal reuse patterns determine the impact of cache sharing

Outline
• Model Assumptions
• Definitions
• Inductive Probability Model
• Validation
• Case Study
• Conclusions

Assumptions
• One circular sequence profile per thread
  • Average profile yields high prediction accuracy
  • Phase-specific profile may improve accuracy
• LRU replacement algorithm
  • Others are usually LRU approximations
• Threads do not share data
  • Mostly true for serial apps
  • Parallel apps: threads likely to be impacted uniformly

Outline
• Model Assumptions
• Definitions
• Inductive Probability (Prob) Model
• Validation
• Case Study
• Conclusions

Definitions
• seqX(dX,nX) = sequence of nX accesses to dX distinct addresses by a thread X to the same cache set
• cseqX(dX,nX) (circular sequence) = a sequence in which the first and the last accesses are to the same address
• Example access stream: A B C D A E E B
  • The whole stream is seq(5,8)
  • A B C D A is cseq(4,5)
  • E E is cseq(1,2)
  • B C D A E E B is cseq(5,7)
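The definitions can be made concrete with a small profiling sketch. The function name `circular_sequence_profile` is hypothetical, and the code assumes that each circular sequence runs from one access to an address up to the next access to that same address (so the address is not touched in between). It is an illustration of the definition, not the authors' profiling tool.

```python
from collections import Counter

def circular_sequence_profile(trace):
    """Count circular sequences cseq(d, n) in the access trace of one cache set.

    Assumption: every access that re-references an address closes a circular
    sequence that began at the previous access to the same address.
    """
    last_seen = {}        # address -> index of its most recent access
    profile = Counter()   # (d, n) -> number of occurrences
    for i, addr in enumerate(trace):
        if addr in last_seen:
            window = trace[last_seen[addr]:i + 1]
            d = len(set(window))   # distinct addresses in the window
            n = len(window)        # total accesses in the window
            profile[(d, n)] += 1
        last_seen[addr] = i
    return profile

# The access stream used on the Definitions slide.
profile = circular_sequence_profile("ABCDAEEB")
print(dict(profile))
```

For the stream A B C D A E E B this finds one occurrence each of cseq(4,5), cseq(1,2), and cseq(5,7). Slicing the trace for every reuse is quadratic in the worst case, which is acceptable for a sketch but not for long traces.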

Circular Sequence Properties
• Thread X runs alone in the system:
  • Given a circular sequence cseqX(dX,nX), the last access is a cache miss iff dX > Assoc
• Thread X shares the cache with thread Y:
  • If a sequence of intervening accesses seqY(dY,nY) occurs during cseqX(dX,nX)’s lifetime, the last access of thread X is a miss iff dX + dY > Assoc

Example
• Assume a 4-way associative cache
• X’s circular sequence: A B A, i.e. cseqX(2,3)
• Y’s intervening access sequence: U V V W
• No cache sharing: the last access to A is a cache hit
• Cache sharing: is the last access to A a cache hit or a miss?

Example
• Assume a 4-way associative cache; X’s circular sequence is cseqX(2,3): A B A; Y’s intervening access sequence is U V V W
• Interleaving A U B V V A W: seqY(2,3) intervenes in cseqX’s lifetime, so dX + dY = 4 does not exceed the associativity and the last access to A is a cache hit
• Interleaving A U B V V W A: seqY(3,4) intervenes in cseqX’s lifetime, so dX + dY = 5 exceeds the associativity and the last access to A is a cache miss
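The two interleavings in this example can be checked with a minimal simulation of a single LRU cache set. The helper name `lru_set_outcomes` is hypothetical; the sketch assumes strict LRU replacement, as stated in the model assumptions.

```python
def lru_set_outcomes(trace, assoc):
    """Simulate one LRU cache set and return 'hit' or 'miss' for each access."""
    stack = []      # resident addresses, least recently used first
    outcomes = []
    for addr in trace:
        if addr in stack:
            outcomes.append("hit")
            stack.remove(addr)     # will be re-appended as most recently used
        else:
            outcomes.append("miss")
            if len(stack) == assoc:
                stack.pop(0)       # evict the least recently used address
        stack.append(addr)
    return outcomes

alone = lru_set_outcomes("ABA", assoc=4)
shared_hit = lru_set_outcomes("AUBVVAW", assoc=4)
shared_miss = lru_set_outcomes("AUBVVWA", assoc=4)
print(alone[2], shared_hit[5], shared_miss[6])
```

Index 2, 5, and 6 are the positions of X's final access to A in the three traces, so the printed values correspond to the no-sharing case and the two sharing cases on the slide.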

Inductive Probability Model
• For each cseqX(dX,nX) of thread X:
  • Compute Pmiss(cseqX): the probability that the last access is a miss
• Steps:
  • Compute E(nY): expected number of intervening accesses from thread Y during cseqX’s lifetime
  • For each possible dY, compute P(seq(dY, E(nY))): probability of occurrence of seq(dY, E(nY))
  • If dY + dX > Assoc, add it to Pmiss(cseqX)
• Misses = old_misses + ∑ Pmiss(cseqX) × F(cseqX)
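The steps above can be sketched as a short loop. Several pieces here are assumptions that the slide does not specify: the function and parameter names are hypothetical, E(nY) is approximated by scaling nX with a relative access rate and rounding, and P(seq(dY, nY)) is passed in as a callable. Circular sequences with dX > Assoc are skipped because they already miss when X runs alone and so belong to old_misses.

```python
def predict_misses(old_misses, cseq_profile_x, assoc, y_accesses_per_x_access, p_seq_y):
    """Sketch of: Misses = old_misses + sum(Pmiss(cseqX) * F(cseqX)).

    cseq_profile_x: dict mapping (d_x, n_x) -> F(cseqX), the observed count.
    y_accesses_per_x_access: hypothetical input used to estimate E(nY).
    p_seq_y: callable (d_y, n_y) -> P(seq(d_y, n_y)) for thread Y.
    """
    extra = 0.0
    for (d_x, n_x), freq in cseq_profile_x.items():
        if d_x > assoc:
            continue  # already a miss without sharing
        # Assumption: E(nY) is nX scaled by the relative access rate, rounded.
        e_n_y = round(n_x * y_accesses_per_x_access)
        # Add P(seq(d_y, E(nY))) for every d_y with d_x + d_y > assoc.
        p_miss = sum(p_seq_y(d_y, e_n_y)
                     for d_y in range(assoc - d_x + 1, e_n_y + 1))
        extra += p_miss * freq
    return old_misses + extra

# Toy distribution for illustration only: every d_y in 1..n is equally likely.
def uniform_p_seq(d, n):
    return 1.0 / n if 1 <= d <= n else 0.0

toy_profile = {(2, 3): 100, (5, 7): 10}
predicted = predict_misses(10, toy_profile, assoc=4,
                           y_accesses_per_x_access=2.0, p_seq_y=uniform_p_seq)
print(predicted)
```

In the toy run, cseq(2,3) sees E(nY) = 6 intervening accesses, and four of the six equally likely dY values push dX + dY past the associativity, so two thirds of its 100 occurrences are predicted to become misses.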

Computing P(seq(dY, E(nY)))
• Basic idea:
  • seq(d,n) is reached from seq(d,n-1) plus 1 access to a non-distinct address, or from seq(d-1,n-1) plus 1 access to a distinct address
  • P(seq(d,n)) = A * P(seq(d,n-1)) + B * P(seq(d-1,n-1))
  • Where A and B are the transition probabilities of the non-distinct and distinct access, respectively
• Detailed steps in paper
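The induction in the diagram (seq(d,n) is reached from seq(d-1,n-1) by an access to a distinct address, or from seq(d,n-1) by an access to a non-distinct address) lends itself to a dynamic-programming table. The sketch below is not the paper's derivation: `reuse_prob` is a hypothetical callable standing in for the transition probabilities, which the slide defers to the paper, and it is assumed that `reuse_prob(0)` is 0 because the first access is always to a new address.

```python
def seq_probabilities(n_max, reuse_prob):
    """Return table P with P[n][d] = P(seq(d, n)) for 0 <= d <= n <= n_max.

    reuse_prob(d): hypothetical probability that the next access goes to one
    of the d addresses already in the sequence (a non-distinct access).
    """
    P = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
    P[0][0] = 1.0  # the empty sequence has zero accesses and zero addresses
    for n in range(1, n_max + 1):
        for d in range(1, n + 1):
            # From seq(d, n-1) via an access to a non-distinct address.
            stay = reuse_prob(d) * P[n - 1][d]
            # From seq(d-1, n-1) via an access to a distinct address.
            grow = (1.0 - reuse_prob(d - 1)) * P[n - 1][d - 1]
            P[n][d] = stay + grow
    return P

# Toy transition probabilities for illustration: reuse with probability 0.5.
table = seq_probabilities(3, lambda d: 0.0 if d == 0 else 0.5)
print(table[3][1:])
```

Each row of the table sums to 1, since every sequence of n accesses touches some number of distinct addresses between 1 and n. With the toy probabilities, three accesses touch 1, 2, or 3 distinct addresses with probabilities 0.25, 0.5, and 0.25.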

Validation
• SESC simulator
  • Detailed CMP + memory hierarchy
• 14 co-schedules of benchmarks (SPEC2K and Olden)
• Co-schedule terminated when an app completes

Validation
• Error = (PM - AM)/AM, where PM is the predicted and AM the actual number of misses
• Larger error happens when the miss increase is very large
• Overall, the model is accurate

Other Observations
• Applications grouped by vulnerability to cache sharing impact:
  • Highly vulnerable (mcf, gzip)
  • Not vulnerable (art, apsi, swim)
  • Somewhat / sometimes vulnerable (applu, equake, perlbmk, mst)
• Prediction error:
  • Very small, except for highly vulnerable apps
  • 3.9% (average), 25% (maximum)
  • Also small for different cache associativities and sizes

Case Study
• Profile approximated by a geometric progression:
  F(cseq(1,*)), F(cseq(2,*)), F(cseq(3,*)), …, F(cseq(A,*)), … = Z, Zr, Zr^2, …, Zr^A, …
  • Z = amplitude
  • r = common ratio, 0 < r < 1
  • Larger r implies a larger working set
• Impact of interfering thread on the base thread?
  • Fix the base thread
  • Interfering thread: vary
    • Miss frequency = # misses / time
    • Reuse frequency = # hits / time
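A short sketch shows why a larger common ratio corresponds to a larger working set. It assumes that the d-th term of the profile is Z·r^(d-1), following the first three terms shown on the slide, and that the profile is truncated at a hypothetical `max_d`. The function names and the choice of an 8-way cache are illustrative only.

```python
def geometric_profile(z, r, max_d):
    """F(cseq(d, *)) for d = 1..max_d, assuming the d-th term is z * r**(d - 1)."""
    return [z * r ** (d - 1) for d in range(1, max_d + 1)]

def solo_miss_fraction(profile, assoc):
    """Fraction of circular sequences with d > assoc, i.e. misses when running alone."""
    return sum(profile[assoc:]) / sum(profile)

# Hypothetical 8-way cache; the amplitude cancels out of the fraction.
small_ws = solo_miss_fraction(geometric_profile(1000.0, 0.5, 200), assoc=8)
large_ws = solo_miss_fraction(geometric_profile(1000.0, 0.9, 2000), assoc=8)
print(small_ws, large_ws)
```

Under these assumptions the fraction of reuses that exceed the associativity is approximately r raised to the associativity, so r = 0.5 leaves almost every reuse within the cache while r = 0.9 pushes a large share of them beyond it.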

Base Thread: r = 0.5 (Small WS)
• Base thread:
  • Not vulnerable to interfering thread’s miss frequency
  • Vulnerable to interfering thread’s reuse frequency

Base Thread: r = 0.9 (Large WS)
• Base thread:
  • Vulnerable to interfering thread’s miss and reuse frequencies

Conclusions
• New inter-thread cache contention models
• Simple to use:
  • Input: circular sequence profiling per thread
  • Output: number of misses per thread in co-schedules
• Accurate
  • 3.9% average error
• Useful
  • Temporal reuse patterns determine cache sharing impact
• Future work:
  • Predict and avoid problematic co-schedules
  • Release the tool at http://www.cesr.ncsu.edu/solihin
