Tile Size Selection Using Cache Organization and Data Layout
Stephanie Coleman, Intermetrics, Inc.
Kathryn S. McKinley, Computer Science, LGRC, University of Massachusetts Amherst
10/27/01

Where to Use Tiling/Blocking?
• Register
• TLB
• L1 cache
• L2 cache
• Any other level of the memory hierarchy

Cache Misses
• Compulsory misses
• Capacity misses
• Interference misses
  • Self-interference
  • Cross-interference

Data Reuse and Locality
• Data reuse
  • Temporal reuse
  • Spatial reuse
• Locality: reused data remain in cache
• Reuse does not necessarily result in locality

Without Tiling
• Matrix multiply
for I=1 to N do
  for K=1 to N do
    R=X(K,I)
    for J=1 to N do
      Z(J,I)=Z(J,I)+R*Y(J,K)

After Tiling (tile size = TK*TJ)
for KK=1 to N by TK do
  for JJ=1 to N by TJ do
    for I=1 to N do
      for K=KK to MIN(KK+TK-1,N) do
        R=X(K,I)
        for J=JJ to MIN(JJ+TJ-1,N) do
          Z(J,I)=Z(J,I)+R*Y(J,K)
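
The two loop nests can be checked against each other with a small runnable sketch. It is a 0-based Python transcription, assuming square N x N matrices stored as lists of lists and a Z that starts at zero. The function names are made up for illustration. Python lists will not show the cache effect; the sketch only confirms that tiling reorders the iterations without changing the result.

```python
def matmul_untiled(X, Y, N):
    """Z(J,I) = Z(J,I) + X(K,I) * Y(J,K), in the original loop order."""
    Z = [[0] * N for _ in range(N)]
    for i in range(N):
        for k in range(N):
            r = X[k][i]                      # R = X(K,I) held in a scalar
            for j in range(N):
                Z[j][i] += r * Y[j][k]
    return Z


def matmul_tiled(X, Y, N, TK, TJ):
    """Same computation with the K and J loops tiled by TK and TJ."""
    Z = [[0] * N for _ in range(N)]
    for kk in range(0, N, TK):               # tile loop over K
        for jj in range(0, N, TJ):           # tile loop over J
            for i in range(N):
                for k in range(kk, min(kk + TK, N)):
                    r = X[k][i]
                    for j in range(jj, min(jj + TJ, N)):
                        Z[j][i] += r * Y[j][k]
    return Z


if __name__ == "__main__":
    N = 5
    X = [[(3 * r + c) % 7 for c in range(N)] for r in range(N)]
    Y = [[(r + 2 * c) % 5 for c in range(N)] for r in range(N)]
    print(matmul_tiled(X, Y, N, TK=2, TJ=3) == matmul_untiled(X, Y, N))
```

Within one (KK, JJ) tile, every I iteration touches the same TK*TJ block of Y, which is the reuse that tiling tries to keep in cache.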

General Formula for Tiling
• Before tiling:
for I=lo to hi do
• Tiled into:
for It=floor((lo-off)/ts)*ts+off to floor((hi-off)/ts)*ts+off by ts do
  for I=max(lo,It) to min(hi,It+ts-1) do
• off: offset; ts: tile size
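
As a sanity check on the formula, the sketch below enumerates the (It, I) pairs it generates. The helper name `tiled_iterations` is hypothetical. Python's `//` is floor division, so it matches floor() for negative values too.

```python
def tiled_iterations(lo, hi, ts, off=0):
    """Return (It, I) pairs produced by tiling `for I = lo to hi` with tile size ts."""
    pairs = []
    first = (lo - off) // ts * ts + off      # floor((lo-off)/ts)*ts + off
    last = (hi - off) // ts * ts + off       # floor((hi-off)/ts)*ts + off
    for it in range(first, last + 1, ts):    # tile loop
        for i in range(max(lo, it), min(hi, it + ts - 1) + 1):  # element loop
            pairs.append((it, i))
    return pairs


if __name__ == "__main__":
    pairs = tiled_iterations(lo=3, hi=17, ts=5, off=1)
    print(sorted({it for it, _ in pairs}))   # tile origins: [1, 6, 11, 16]
    print([i for _, i in pairs] == list(range(3, 18)))
```

The tile origins are aligned to off modulo ts, and the max/min clamps trim the first and last tiles so that every I in lo..hi is visited exactly once and in order.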

Loop Interchange
• Interchange an inner tile loop with an outer element loop:
for I=max(l1,l2,…) to min(u1,u2,…) do
  for Jt=floor((k1*I+m1)/ts)*ts+off to floor((ku*I+mu)/ts)*ts+off by ts do
• The limits of the I loop do not change.
• The new lower limit of the Jt loop is the max of a set of expressions, where each expression is the old lower limit with I replaced by one of l1,l2,… (if k1>0) or u1,u2,… (if k1<0).
• The new upper limit is, symmetrically, the min of the old upper limit with I replaced by one of u1,u2,… (if ku>0) or l1,l2,… (if ku<0).
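
The bound computation can be sketched as follows. The function names are hypothetical. The lower limit follows the slide's max rule. The upper limit uses the mirrored min rule (u's if ku>0, l's if ku<0), which is an assumption derived here from the same monotonicity argument rather than taken from the slide. The brute-force check at the end compares both against the extreme values over the whole I range.

```python
def tile_limit(i, k, m, ts, off=0):
    """floor((k*I + m)/ts)*ts + off for one value of I."""
    return (k * i + m) // ts * ts + off


def interchanged_jt_limits(ls, us, k1, m1, ku, mu, ts, off=0):
    """Limits of the Jt loop once it is moved outside the I loop.

    ls, us: the l1,l2,... and u1,u2,... from I = max(ls) to min(us).
    """
    lo_points = ls if k1 > 0 else us         # slide's rule for the lower limit
    hi_points = us if ku > 0 else ls         # assumed mirror rule for the upper limit
    new_lo = max(tile_limit(p, k1, m1, ts, off) for p in lo_points)
    new_hi = min(tile_limit(p, ku, mu, ts, off) for p in hi_points)
    return new_lo, new_hi


if __name__ == "__main__":
    ls, us = [2, 5], [20, 14]
    lo, hi = interchanged_jt_limits(ls, us, k1=3, m1=1, ku=3, mu=8, ts=4)
    i_range = range(max(ls), min(us) + 1)
    print(lo == min(tile_limit(i, 3, 1, 4) for i in i_range))
    print(hi == max(tile_limit(i, 3, 8, 4) for i in i_range))
```

Because each limit is monotonic in I, its extreme value over the I range is reached at max(l1,l2,…) or min(u1,u2,…), which is why substituting the individual l's or u's and taking a max or min is enough.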

Tile Size Selection
• [Figure: cache layout with a tile size of 24]
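
One way to picture such a layout is the sketch below, which is not a reconstruction of the authors' figure. It assumes a direct-mapped cache whose size is measured in array elements, a column-major array, and a tile that starts at element 0. The function name is made up, and the values 1024 and 200 are borrowed from the deck's Euclidean example. The code places tile columns of 24 elements into the cache one after another and stops at the first column that would overlap an earlier one (self-interference).

```python
def max_nonconflicting_columns(cs, n, col_size):
    """Count consecutive tile columns that map to disjoint cache locations.

    cs: cache size in elements (direct-mapped, an assumption)
    n: column length of the array
    col_size: number of rows in the tile
    """
    if col_size <= 0 or cs <= 0:
        raise ValueError("cs and col_size must be positive")
    used = set()
    count = 0
    while True:
        start = (count * n) % cs             # cache location of the column's first element
        locations = {(start + r) % cs for r in range(col_size)}
        if used & locations:                 # overlaps an earlier column of the tile
            return count
        used |= locations
        count += 1


if __name__ == "__main__":
    print(max_nonconflicting_columns(cs=1024, n=200, col_size=24))   # 41
```

Under these assumptions a column size of 24 packs 41 columns (984 of the 1024 elements) before the next column would collide, whereas full columns of 200 elements allow only 5.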

Potential Column Dimensions
• Euclidean algorithm
• G.C.D(a,b) = G.C.D(a-b,b)
  CS = q1*N + r1
  N = q2*r1 + r2
  r1 = q3*r2 + r3
  …
• Example (CS = 1024, N = 200):
  1024 = 5*200 + 24
  200 = 8*24 + 8
• Potential column dimensions: 24, 8
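
The remainder sequence is easy to reproduce. The sketch below uses a hypothetical helper name and collects the nonzero remainders of the Euclidean algorithm applied to CS and N.

```python
def candidate_column_sizes(cs, n):
    """Nonzero remainders r1, r2, ... of the Euclidean algorithm on (cs, n)."""
    sizes = []
    a, b = cs, n
    while b:
        a, b = b, a % b                      # a = q*b + r, then continue with (b, r)
        if b:
            sizes.append(b)
    return sizes


if __name__ == "__main__":
    print(candidate_column_sizes(1024, 200))   # [24, 8]
```

The last entry is always gcd(CS, N), since that is the final nonzero remainder.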

Improve Spatial Locality with Cache Line Size
• colSize = colSize, if colSize mod CLS = 0 or colSize = column length
• colSize = floor(colSize/CLS)*CLS, otherwise
• CLS: cache line size
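
A minimal sketch of this adjustment, assuming colSize, CLS and the column length are all counted in array elements (the function name is hypothetical):

```python
def align_col_size(col_size, cls, column_length):
    """Round col_size down to a multiple of the cache line size.

    A size that is already a multiple of cls, or that equals the full
    column length, is returned unchanged.
    """
    if col_size % cls == 0 or col_size == column_length:
        return col_size
    return col_size // cls * cls             # floor(colSize/CLS)*CLS


if __name__ == "__main__":
    print(align_col_size(24, 4, 200))        # 24, already a multiple of 4
    print(align_col_size(30, 4, 200))        # 28, rounded down
    print(align_col_size(30, 4, 30))         # 30, the whole column
```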

Minimize Cross-Interference
• Working set size constraint: TJ*TK + TJ + 1*CLS < CS
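
The constraint can be turned into a quick check. One reading of its terms, which is an assumption and not stated on the slide: TJ*TK elements for the tile of Y, TJ elements for a column segment of Z, and one cache line for the element of X. The function names are hypothetical, and CLS = 4 in the usage is an illustrative value, not one from the deck.

```python
def satisfies_working_set(tj, tk, cls, cs):
    """True if TJ*TK + TJ + 1*CLS < CS (all sizes in array elements)."""
    return tj * tk + tj + 1 * cls < cs


def largest_tk(tj, cls, cs):
    """Largest TK that still satisfies the constraint for a fixed TJ (0 if none)."""
    # tj*tk <= cs - 1 - tj - cls, solved for tk with floor division
    return max((cs - 1 - tj - cls) // tj, 0)


if __name__ == "__main__":
    print(largest_tk(tj=24, cls=4, cs=1024))            # 41
    print(satisfies_working_set(24, 41, 4, 1024))       # True
    print(satisfies_working_set(24, 42, 4, 1024))       # False
```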

Other Algorithms for Computing Tile Size
• LRW
  • Improves the average cache performance
  • Sensitive to the array size
  • Ineffective cache utilization
• ESS
  • Effective only for one-dimensional tiling
  • Does not consider cross-interference

Conclusion
• TSS incorporates the effects of cache line size and cross-interference between arrays
• Performs better than ESS and LRW on direct-mapped caches and on caches of higher associativity
• Sensitive to the array dimension
• Does not fully exploit temporal reuse for some matrix sizes