
The Study of Cache Oblivious Algorithms

The Study of Cache Oblivious Algorithms. Prepared by Jia Guo. Based on Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.



  1. The Study of Cache Oblivious Algorithms Prepared by Jia Guo

  2. Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

  3. Outline • Cache complexity • Cache aware algorithms • Cache oblivious algorithms • Matrix multiplication • Matrix transposition • FFT • Conclusion

  4. Assumption • Only two levels of memory hierarchy: • An ideal cache • Fully associative • Optimal replacement strategy • “Tall cache” (Z = Ω(L²)) • A very large main memory

  5. An Ideal Cache Model • An ideal cache is described by two parameters (Z, L) • Z: total number of words in the cache • L: number of words in one cache line

  6. Cache Complexity • An algorithm with input size n is measured by: • Work complexity W(n) • Cache complexity Q(n; Z, L): the number of cache misses it incurs

  7. Outline • Cache complexity • Cache aware algorithms • Cache oblivious algorithms • Matrix multiplication • Matrix transposition • FFT • Conclusion

  8. Cache Aware Algorithms • Contain parameters to minimize the cache complexity for a particular cache size (Z) and line length (L). • Need to adjust parameters when running on different platforms.

  9. Example: • A blocked matrix multiplication algorithm: the n × n matrices are partitioned into s × s blocks (such as A11) • s is a tuning parameter chosen to make the algorithm run fast; a sketch follows below
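To make the tuning parameter concrete, here is a minimal C sketch of a blocked (cache-aware) matrix multiplication, assuming square n × n row-major matrices; the block size S plays the role of the slide's s and is a hypothetical value that would have to be re-tuned for each platform.

```c
#include <stddef.h>

/* Cache-aware blocked multiplication (sketch).  Assumes square n x n
 * row-major matrices and that C has been zero-initialized by the caller.
 * S is the tuning parameter s from the slide: roughly S = Theta(sqrt(Z)),
 * so that three S x S blocks fit in the cache at the same time. */
enum { S = 64 };   /* hypothetical block size, platform dependent */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void block_mult(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += S)
        for (size_t kk = 0; kk < n; kk += S)
            for (size_t jj = 0; jj < n; jj += S)
                /* multiply one pair of S x S blocks while they are cached */
                for (size_t i = ii; i < min_sz(ii + S, n); i++)
                    for (size_t k = kk; k < min_sz(kk + S, n); k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + S, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Changing platforms means changing S, which is exactly the tuning burden that cache-oblivious algorithms avoid.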

  10. Example (2) • Cache complexity • The three s × s submatrices should fit into the cache, so together they occupy Θ(s²/L) cache lines • Optimal performance is obtained when s = Θ(√Z) • Θ(Z/L) cache misses are needed to bring the 3 submatrices into cache • Θ(n²/L) cache misses are needed to read the n² elements • Altogether, Q(n; Z, L) = Θ(1 + n²/L + n³/(L√Z))

  11. Outline • Cache complexity • Cache aware algorithms • Cache oblivious algorithms • Matrix multiplication • Matrix transposition and FFT • Conclusion

  12. Cache Oblivious Algorithms • Have no parameters that depend on the hardware, such as the cache size (Z) or cache-line length (L). • No tuning is needed; they are platform independent. • The algorithms introduced below are proven to achieve optimal cache complexity.

  13. Matrix Multiplication • To multiply A (n × m) by B (m × p), partition the matrices in half along the largest dimension and recurse until the base case of one element: • If n ≥ max(m, p): split A into two row blocks, so (A1; A2)·B = (A1B; A2B) • If m ≥ max(n, p): split A by columns and B by rows, so (A1 A2)·(B1; B2) = A1B1 + A2B2 • If p ≥ max(n, m): split B into two column blocks, so A·(B1 B2) = (AB1 AB2)

  14. Matrix Multiplication (2) • Example: assume the sizes of A and B are n × 4n and 4n × n • Since m is the largest dimension, A·B = A1·B1 + A2·B2 • Each product is split again recursively: A1·B1 = A11·B11 + A12·B12 and A2·B2 = A21·B21 + A22·B22

  15. Matrix Multiplication (3) • Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache without further misses.
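A minimal C sketch of this recursion, assuming all three matrices are stored row-major with a shared leading dimension ld (so ld ≥ max(m, p)); the function and parameter names are illustrative, not from the paper.

```c
#include <stddef.h>

/* Cache-oblivious recursive multiplication (sketch): C += A * B, where
 * A is n x m, B is m x p, C is n x p, all row-major with leading
 * dimension ld.  C must be zero-initialized by the caller.  At each
 * level the largest of the three dimensions is split in half. */
void rec_mult(size_t n, size_t m, size_t p,
              const double *A, const double *B, double *C, size_t ld)
{
    if (n == 1 && m == 1 && p == 1) {          /* base case: one element */
        C[0] += A[0] * B[0];
    } else if (n >= m && n >= p) {             /* split the rows of A (and C) */
        rec_mult(n / 2, m, p, A, B, C, ld);
        rec_mult(n - n / 2, m, p, A + (n / 2) * ld, B, C + (n / 2) * ld, ld);
    } else if (m >= n && m >= p) {             /* split A's columns / B's rows */
        rec_mult(n, m / 2, p, A, B, C, ld);
        rec_mult(n, m - m / 2, p, A + m / 2, B + (m / 2) * ld, C, ld);
    } else {                                   /* split the columns of B (and C) */
        rec_mult(n, m, p / 2, A, B, C, ld);
        rec_mult(n, m, p - p / 2, A, B + p / 2, C + p / 2, ld);
    }
}
```

No cache parameters appear anywhere: the recursion itself guarantees that some level of subproblem fits the cache, whatever Z and L happen to be.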

  16. Matrix Multiplication (4) • Cache complexity: Q(m, n, p) = Θ(m + n + p + (mn + np + mp)/L + mnp/(L√Z)) • This matches the cache complexity of the BLOCK-MULT algorithm (cache aware) • For a square n × n matrix this is Θ(n + n²/L + n³/(L√Z)), which is optimal

  17. Outline • Cache complexity • Cache aware algorithms • Cache oblivious algorithms • Matrix multiplication • Matrix transposition • FFT • Conclusion

  18. Matrix Transposition • Naive transpose of an m × n matrix A into an n × m matrix B: for i = 1 to m, for j = 1 to n: B(j, i) = A(i, j) • If n is very large, the column-wise accesses to B cause a cache miss on every write (no spatial locality in B)

  19. Matrix Transposition (2) • Partition matrix A along the longer dimension and recursively transpose each half • In the square case, if A = (A11 A12; A21 A22), then Aᵀ = (A11ᵀ A21ᵀ; A12ᵀ A22ᵀ)
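A minimal C sketch of this recursive transpose, assuming row-major storage with leading dimensions lda and ldb; the small base case is a practical shortcut to limit call overhead (the paper's recursion goes all the way down to single elements).

```c
#include <stddef.h>

/* Cache-oblivious transpose (sketch): B = A^T, where A is m x n with
 * leading dimension lda and B is n x m with leading dimension ldb,
 * both row-major.  The longer dimension of A is split at each level. */
void rec_transpose(size_t m, size_t n,
                   const double *A, size_t lda,
                   double *B, size_t ldb)
{
    const size_t BASE = 16;                  /* any small constant works */
    if (m <= BASE && n <= BASE) {
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (m >= n) {                     /* split the rows of A */
        rec_transpose(m / 2, n, A, lda, B, ldb);
        rec_transpose(m - m / 2, n, A + (m / 2) * lda, lda, B + m / 2, ldb);
    } else {                                 /* split the columns of A */
        rec_transpose(m, n / 2, A, lda, B, ldb);
        rec_transpose(m, n - n / 2, A + n / 2, lda, B + (n / 2) * ldb, ldb);
    }
}
```

Each half of A lands in the corresponding block of rows or columns of B, which is what the block picture on this slide shows.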

  20. Matrix Transposition (3) • Cache complexity: Q(m, n) = Θ(1 + mn/L) • This is the optimal cache complexity

  21. Fast Fourier Transform • Uses the Cooley-Tukey algorithm • Cooley-Tukey recursively re-expresses a DFT of composite size n = n1·n2 as: • Perform n2 DFTs of size n1 • Multiply by complex roots of unity called twiddle factors • Perform n1 DFTs of size n2

  22. (Figure: the length-n input viewed as an n1 × n2 matrix.)

  23. Assume X is a row-major n1 × n2 matrix • Steps: • Transpose X in place • Compute the n2 DFTs of size n1 • Multiply by the twiddle factors • Transpose X in place • Compute the n1 DFTs of size n2 • Transpose X in place

  24. Fast Fourier Transform • Example with n1 = 4, n2 = 2: • Transpose to bring together the n2 DFTs of size n1 • Call the FFT recursively with n1 = 2, n2 = 2 • Reach the base case and return • Multiply by the twiddle factors • Transpose to bring together the n1 DFTs of size n2 • Transpose back and return
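For comparison, here is a hedged C sketch of the Cooley-Tukey recursion from slide 21, specialized to the simplest split n = 2 · (n/2) (radix-2) rather than the n1 ≈ √n split the cache-oblivious algorithm uses; n is assumed to be a power of two and the names are illustrative.

```c
#include <complex.h>
#include <stddef.h>

/* Radix-2 Cooley-Tukey FFT (sketch).  Computes the length-n DFT of the
 * strided input x[0], x[s], ..., x[(n-1)*s] into out[0..n-1]. */
static void fft_rec(size_t n, const double complex *x, size_t s,
                    double complex *out)
{
    if (n == 1) {                     /* base case: DFT of a single point */
        out[0] = x[0];
        return;
    }
    /* two half-size DFTs: even-indexed and odd-indexed subsequences */
    fft_rec(n / 2, x, 2 * s, out);
    fft_rec(n / 2, x + s, 2 * s, out + n / 2);
    for (size_t k = 0; k < n / 2; k++) {
        /* twiddle factor: the complex root of unity exp(-2*pi*i*k/n) */
        const double PI = 3.14159265358979323846;
        double complex w = cexp(-2.0 * I * PI * (double)k / (double)n);
        double complex e = out[k];
        double complex o = w * out[k + n / 2];
        out[k]         = e + o;
        out[k + n / 2] = e - o;
    }
}
```

Calling fft_rec(n, x, 1, out) computes the DFT of x. The cache-oblivious version instead picks n1 ≈ √n and interleaves the transposes listed on slide 23, which is what yields the Q(n) = O(1 + (n/L)(1 + log_Z n)) bound.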

  25. Fast Fourier Transform • Cache complexity: Q(n) = O(1 + (n/L)(1 + log_Z n)) • This is optimal for a Cooley-Tukey algorithm when n is an exact power of 2

  26. Other Cache Oblivious Algorithms • Funnelsort • Distribution sort • LU decomposition without pivoting

  27. Outline • Cache complexity • Cache aware algorithms • Cache oblivious algorithms • Matrix multiplication • Matrix transposition • FFT • Conclusion

  28. Questions • How large is the range of practicality of cache-oblivious algorithms? • What are the relative strengths of cache-oblivious and cache-aware algorithms?

  29. Practicality of Cache-oblivious Algorithms (Plot: average time to transpose an N × N matrix, divided by N².)

  30. Practicality of Cache-oblivious Algorithms (2) (Plot: average time taken to multiply two N × N matrices, divided by N³.)

  31. Question 2 • Do cache-oblivious algorithms perform as well as cache-aware algorithms? • FFTW library • No answer yet.

  32. References • Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA. • Cache-Oblivious Algorithms by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999. • Optimizing Matrix Multiplication with a Classifier Learning System by Xiaoming Li and María Jesús Garzarán. LCPC 2005.
