
DATA ADDRESS PREDICTION



  1. DATA ADDRESS PREDICTION Zohair Hyder Armando Solar-Lezama CS252 – Fall 2003

  2. Motivation
  • Large and increasing gap between CPU and memory speeds
  • Miss penalty on today's processors is over 600 cycles
  • Load latency is a bottleneck on performance
  Solution: Prefetch
  • Static: compilers may insert prefetch instructions, but are limited by the lack of run-time information
  • Dynamic: high adaptability

  3. Metrics
  • Coverage: fraction of DL1 misses that hit in the prefetch buffer (see the counter sketch below)
    • Higher coverage implies lower load latency
  • Accuracy: fraction of prefetches that are actually used by the CPU
    • Higher accuracy implies less memory bandwidth needed
  • There are tradeoffs between coverage and accuracy
  • For a given memory bandwidth, coverage is probably more important
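To make the two metrics concrete, here is a minimal C++ sketch of the counters a simulator might keep; the struct and field names are illustrative assumptions, not details from the slides.

```cpp
#include <cstdint>

// Illustrative counters for the two prefetching metrics defined above.
struct PrefetchStats {
    uint64_t dl1_misses        = 0;  // demand accesses that miss in DL1
    uint64_t pbuf_hits         = 0;  // of those misses, how many hit the prefetch buffer
    uint64_t prefetches_issued = 0;  // prefetches sent toward memory
    uint64_t prefetches_used   = 0;  // prefetched blocks later referenced by the CPU

    // Coverage: fraction of DL1 misses serviced by the prefetch buffer.
    double coverage() const {
        return dl1_misses ? double(pbuf_hits) / double(dl1_misses) : 0.0;
    }
    // Accuracy: fraction of issued prefetches the CPU actually consumed.
    double accuracy() const {
        return prefetches_issued ? double(prefetches_used) / double(prefetches_issued) : 0.0;
    }
};
```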

  4. Architecture
  • The prefetch buffer acts as a "level 1.5" cache
  • Hit time of the prefetch buffer is the same as DL1 because of its small size and identical associativity
  • Demand fetches always get priority over prefetches
  • The predictor uses DL1 miss information to determine prefetches

  5. Previous Approaches
  • Stream buffers
    • Introduced by Jouppi in 1990
    • Kessler and Palacharla augmented them in 1994 to allow filtering and prefetching for non-unit strides
  • Reference Prediction Table (RPT)
    • Introduced by Baer and Chen in 1992 to detect arbitrary strides
  • Markov predictor
    • Introduced by Joseph and Grunwald in 1999

  6. Reference Prediction Table
  • The RPT is indexed by the PC of the load instruction
  • Each entry holds the last effective address and the offset (stride) from the second-to-last effective address
  • If the current effective address produces the same offset, prefetch the next address (see the sketch below)
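A minimal C++ sketch of the RPT behavior just described. The hash map, the unbounded table, and the exact update policy are illustrative assumptions; a real RPT is a small, fixed-size hardware table.

```cpp
#include <cstdint>
#include <unordered_map>

struct RPTEntry {
    uint64_t last_addr = 0;  // last effective address seen for this load PC
    int64_t  stride    = 0;  // last_addr minus the address seen before it
    bool     valid     = false;
};

class RPT {
    std::unordered_map<uint64_t, RPTEntry> table;  // indexed by load PC
public:
    // Observe a load; returns true and sets prefetch_addr when the stride repeats.
    bool observe(uint64_t pc, uint64_t addr, uint64_t &prefetch_addr) {
        RPTEntry &e = table[pc];
        bool predict = false;
        if (e.valid) {
            int64_t new_stride = int64_t(addr) - int64_t(e.last_addr);
            if (new_stride == e.stride && new_stride != 0) {
                prefetch_addr = addr + e.stride;  // same offset twice in a row: prefetch next
                predict = true;
            }
            e.stride = new_stride;
        }
        e.last_addr = addr;
        e.valid = true;
        return predict;
    }
};
```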

  7. Markov Predictor
  • Indexed by the current address: each table entry holds 4 possible next addresses
  • All 4 are issued into the prefetch request queue
  • If the queue is full, an element with lower priority is replaced
  • LRU prioritization: more recently used addresses have higher priority (see the sketch below)
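A rough C++ sketch of such a Markov table: each miss address maps to up to four successor addresses kept in most-recently-used order. The container choices are illustrative simplifications, and the prefetch request queue itself is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

class MarkovPredictor {
    static constexpr std::size_t WAYS = 4;
    std::unordered_map<uint64_t, std::deque<uint64_t>> table;  // addr -> successors, MRU first
    uint64_t prev_addr = 0;
    bool have_prev = false;
public:
    // Record the transition prev_addr -> addr, then return the (up to 4)
    // candidate prefetch addresses stored for the current miss address.
    std::vector<uint64_t> observe(uint64_t addr) {
        if (have_prev) {
            auto &succ = table[prev_addr];
            // move addr to the MRU position, inserting it if necessary
            for (auto it = succ.begin(); it != succ.end(); ++it)
                if (*it == addr) { succ.erase(it); break; }
            succ.push_front(addr);
            if (succ.size() > WAYS) succ.pop_back();  // drop the least recently used successor
        }
        prev_addr = addr;
        have_prev = true;
        auto it = table.find(addr);
        return it == table.end() ? std::vector<uint64_t>{}
                                 : std::vector<uint64_t>(it->second.begin(), it->second.end());
    }
};
```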

  8. Strides
  • Consider the following statements in a loop, where k is larger than the block size:
      n += k; u += x[n]; v += y[n];
    The miss address stream will be: A, B, A+k, B+k, A+2k, B+2k, ... (illustrated below)
  • Stream buffers perform poorly on interleaved access streams
  • RPT works great
  • The Markov predictor is incapable of detecting ANY strides
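A self-contained illustration of that interleaved miss stream. The base addresses and stride are hypothetical; the point is that each array's stride is constant even though the two miss streams interleave.

```cpp
#include <cstdio>

int main() {
    const unsigned long A = 0x100000;  // hypothetical address of x[n] on the first miss
    const unsigned long B = 0x200000;  // hypothetical address of y[n] on the first miss
    const unsigned long k = 0x400;     // stride in bytes, larger than one cache block
    for (unsigned long i = 0; i < 3; ++i)
        printf("miss %#lx   miss %#lx\n", A + i * k, B + i * k);
    // Prints A, B, A+k, B+k, A+2k, B+2k: an RPT indexed by load PC sees two
    // constant-stride streams, while a single stream buffer sees neither.
}
```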

  9. Our Contributions
  • Difference Markov predictor
    • Uses a similar Markov implementation
    • Predicts differences rather than addresses
    • Input to the predictor is the current difference; output is the predicted difference
  • Bayes predictor
    • Uses 3 inputs: current difference, current PC, and current address
    • Output is the predicted difference

  10. Difference Markov Predictor
  • Uses difference coding
  • Indexed by the current difference = current address – last address
  • Predicts the next difference (see the sketch below)
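A minimal C++ sketch of the difference Markov idea. Storing a single predicted successor per difference is an illustrative simplification of the Markov table; the map and the warm-up handling are assumptions, not details from the slides.

```cpp
#include <cstdint>
#include <unordered_map>

class DiffMarkovPredictor {
    std::unordered_map<int64_t, int64_t> table;  // current delta -> delta that followed it last time
    uint64_t last_addr  = 0;
    int64_t  last_delta = 0;
    int      seen       = 0;  // number of addresses observed so far (saturates at 2)
public:
    // Observe a miss address; returns true and sets prefetch_addr when a prediction exists.
    bool observe(uint64_t addr, uint64_t &prefetch_addr) {
        bool predict = false;
        int64_t delta = int64_t(addr) - int64_t(last_addr);
        if (seen >= 2)
            table[last_delta] = delta;              // learn: previous delta was followed by this one
        if (seen >= 1) {
            auto it = table.find(delta);            // index by the current difference
            if (it != table.end()) {
                prefetch_addr = addr + it->second;  // prefetch using the predicted next difference
                predict = true;
            }
        }
        last_addr = addr;
        last_delta = delta;
        if (seen < 2) ++seen;
        return predict;
    }
};
```

With a constant stride k, a single table entry (k -> k) is enough to keep predicting, which matches the compactness claim on the next slide.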

  11. Difference Markov - Advantages
  • Works well with small table sizes
  • Detects strides, even in interleaved access streams
  • More compact than RPT, e.g. a stride of 1 needs only a single entry
  • Performs especially well on floating-point applications that are stride-intensive
  • The Joseph-Grunwald Markov predictor is incapable of predicting any address it has not yet seen
  • Performs only slightly worse than the Joseph-Grunwald Markov predictor on integer applications: difference correlation information can capture address correlation information too

  12. Bayes Predictor
  • Predicts based on the current PC, current address, and current difference
  • Uses the Naïve Bayes method to combine information from all 3
  • Predicts the next difference

  13. Bayes Predictor - Details
  • Idea:
    • For every possible Δn+1, calculate P(Δn+1 | Δn, PC, Addr)
    • Predict the Δn+1 with the highest probability
    • If data is missing, use the conditional probabilities given the data we do have
  • Implementation:
    • Assume independence!
      P(Δn+1 | Δn, PC, Addr) = P(Δn | Δn+1) · P(PC | Δn+1) · P(Addr | Δn+1) · P(Δn+1) / P(Δn, PC, Addr)
    • Keep a limited number of the probabilities in a table
    • Use an integer representation (a software sketch follows below)
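For concreteness, a C++ sketch of a Naïve Bayes difference predictor along these lines, estimating each conditional probability from raw event counts. The unbounded maps, floating-point scoring, and training interface are illustrative assumptions; the slides describe a bounded table and an integer representation.

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <unordered_set>
#include <utility>

class BayesPredictor {
    // Joint counts of (feature value, next delta) and marginal counts of next delta.
    std::map<std::pair<int64_t,  int64_t>, uint64_t> delta_joint;  // (Δn, Δn+1)
    std::map<std::pair<uint64_t, int64_t>, uint64_t> pc_joint;     // (PC, Δn+1)
    std::map<std::pair<uint64_t, int64_t>, uint64_t> addr_joint;   // (Addr, Δn+1)
    std::unordered_map<int64_t, uint64_t> delta_count;             // N(Δn+1)
    std::unordered_set<int64_t> candidates;                        // Δn+1 values seen so far

public:
    // Record one observed outcome dnext for the feature triple (dn, pc, addr).
    void train(int64_t dn, uint64_t pc, uint64_t addr, int64_t dnext) {
        ++delta_joint[{dn, dnext}];
        ++pc_joint[{pc, dnext}];
        ++addr_joint[{addr, dnext}];
        ++delta_count[dnext];
        candidates.insert(dnext);
    }

    // Argmax over candidates c of P(Δn|c)·P(PC|c)·P(Addr|c)·P(c); the shared
    // denominator P(Δn, PC, Addr) and the total-observation count are the same
    // for every candidate, so they are dropped from the score.
    bool predict(int64_t dn, uint64_t pc, uint64_t addr, int64_t &best) const {
        double best_score = 0.0;
        bool found = false;
        for (int64_t c : candidates) {
            double n = double(delta_count.at(c));
            double score = count(delta_joint, std::make_pair(dn, c))
                         * count(pc_joint,    std::make_pair(pc, c))
                         * count(addr_joint,  std::make_pair(addr, c)) / (n * n);
            if (score > best_score) { best_score = score; best = c; found = true; }
        }
        return found;
    }

private:
    template <class M, class K>
    static double count(const M &m, const K &k) {
        auto it = m.find(k);
        return it == m.end() ? 0.0 : double(it->second);
    }
};
```

In use, each DL1 miss would first train the predictor with the previously observed (Δn, PC, Addr, Δn+1) tuple and then query it with the current features, prefetching the current address plus the predicted difference.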

  14. Bayes - Advantages
  • Works well for small table sizes
  • Performs well on both floating-point and integer applications
  • Detects most forms of regularity that we have observed in applications
  • Has good accuracy across applications

  15. Performance For SPEC2000

  16. Performance With Table Size

  17. Conclusion
  • Both of our predictors have high coverage: for most applications, higher than any other predictor
  • The Bayes predictor generally has the best accuracy across applications
  • The difference Markov predictor has fairly good accuracy too
  • The difference Markov predictor performs very well even with small tables, and requires very simple hardware
  • The Bayes predictor needs more complex hardware
