Cache Oblivious Algorithms: Theory & Practice (Static)
Research Proficiency Examination, Piyush Kumar, Department of Computer Science. Advisor: Joseph S.B. Mitchell

Presentation Transcript


  1. Cache Oblivious Algorithms: Theory & Practice (Static). Research Proficiency Examination, Piyush Kumar, Department of Computer Science. Advisor: Joseph S.B. Mitchell

  2. CO Algorithms: Brief History • Frigo, Leiserson, Prokop, Ramachandran (FOCS 99): Cache Oblivious Algorithms • Harold Prokop's thesis • Bender, Demaine, Farach-Colton (FOCS 00): Cache Oblivious B-Trees • … • Arge, Bender, Demaine et al. (STOC 02): CO Priority Queue

  3. Talk Outline… • Motivation: Matrix Multiplication/Transposition, Static Searches in Balanced Binary Trees • The Model • CO-Sorting • Some Analysis • CO-Sorting Experiments • Do's and Don'ts of the Model • Future Work

  4. Workstations
  • SUN UltraSparc 2: UltraSparc, 16kB L1, 512kB L2
  • SGI Visual Workstation 540: Quad Pentium III, 32kB L1, 1024kB L2
  • Dell Precision: Dual Pentium III, 32kB L1, 512kB L2
  • IBM ThinkPad 600: Pentium II, 32kB L1, 256kB L2
  • Compaq Presario: AMD K6-III, 64kB L1, 256kB L2, 1024kB L3
  How can we write portable code that runs efficiently on different multilevel caching architectures?

  5. Intel Itaniums

  6. Matrix Multiplication (MM): $C = A \times B$, where $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$.
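  For concreteness, a minimal C++ sketch of the ordinary triply nested loop computing C += A * B (added here for illustration; the row-major layout and function name are my own assumptions, not code from the talk):

  #include <cstddef>
  #include <vector>

  // Naive matrix multiplication: C += A * B for n x n matrices stored
  // row-major in flat vectors. O(n^3) work, but poor cache reuse once the
  // matrices exceed the cache, which motivates the blocked and recursive
  // versions on the next slides.
  void naive_mult(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n)
  {
      for (std::size_t i = 0; i < n; ++i)
          for (std::size_t j = 0; j < n; ++j) {
              double sum = 0.0;
              for (std::size_t k = 0; k < n; ++k)
                  sum += A[i * n + k] * B[k * n + j];
              C[i * n + j] += sum;
          }
  }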

  7. Cache-Aware MM [HK81]
  BLOCK-MULT(A, B, C, n)
  1  for i ← 1 to n/s
  2    do for j ← 1 to n/s
  3      do for k ← 1 to n/s
  4        do ORD-MULT(A_ik, B_kj, C_ij, s)
  (ORD-MULT multiplies two s × s blocks with the ordinary algorithm; s is the blocking parameter.)
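  A minimal C++ sketch of this blocked, cache-aware multiplication; the tuning parameter s must be chosen per machine (names and matrix layout are illustrative assumptions, not the slide's code):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Blocked multiplication: C += A * B for n x n row-major matrices.
  // Each (i, j, k) iteration multiplies two s x s blocks that are meant to
  // fit in cache together; s has to be tuned for the target machine.
  void block_mult(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n, std::size_t s)
  {
      for (std::size_t i = 0; i < n; i += s)
          for (std::size_t j = 0; j < n; j += s)
              for (std::size_t k = 0; k < n; k += s)
                  // Ordinary multiplication of the s x s blocks
                  // (edge blocks may be smaller when s does not divide n).
                  for (std::size_t ii = i; ii < std::min(i + s, n); ++ii)
                      for (std::size_t jj = j; jj < std::min(j + s, n); ++jj) {
                          double sum = 0.0;
                          for (std::size_t kk = k; kk < std::min(k + s, n); ++kk)
                              sum += A[ii * n + kk] * B[kk * n + jj];
                          C[ii * n + jj] += sum;
                      }
  }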

  8. Oracle?! • Tune s so that the blocks $A_{ik}$, $B_{kj}$, and $C_{ij}$ just fit into cache ⇒ $s = \Theta(\sqrt{Z})$ • If $n > s$, then $Q(n) = \Theta\!\left((n/s)^3 \cdot s^2/L\right) = \Theta\!\left(n^3 / (L\sqrt{Z})\right)$ • Optimal [HK81]
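  A short counting argument behind that bound (my own added note, following the standard analysis): BLOCK-MULT performs $(n/s)^3$ calls to ORD-MULT, and each call touches three $s \times s$ blocks, i.e. $\Theta(s^2/L)$ cache lines once the blocks fit in cache, so

  \[
    Q(n) = \Theta\!\left(\left(\frac{n}{s}\right)^{3} \cdot \frac{s^{2}}{L}\right)
         = \Theta\!\left(\frac{n^{3}}{L\sqrt{Z}}\right)
    \quad \text{when } s = \Theta(\sqrt{Z}) \text{ and } n > s.
  \]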

  9. Two-Level / Three-Level Cache: one voodoo parameter per caching level! For a two-level cache the blocked code needs two tuning parameters (s, t); for a three-level cache it needs three (s, t, u):
  BLOCK-MULT(A, B, C, n)
  1   for i ← 1 to n/s
  2     do for j ← 1 to n/s
  3       do for k ← 1 to n/s
  4         do for i' ← 1 to s/t
  5           do for j' ← 1 to s/t
  6             do for k' ← 1 to s/t
  7               do for i'' ← 1 to t/u
  8                 do for j'' ← 1 to t/u
  9                   do for k'' ← 1 to t/u
  10                    do ORD-MULT(A_i''k'', B_k''j'', C_i''j'', u)

  10. Recursive Matrix Multiplication: divide and conquer on n × n matrices.
  $\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix} + \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix}$
  8 multiplications of (n/2) × (n/2) matrices, 1 addition of n × n matrices.
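  A minimal C++ sketch of this divide-and-conquer scheme, recursing on quadrant views and assuming n is a power of two (an illustrative sketch, not the code used in the experiments):

  #include <cstddef>
  #include <vector>

  // View of a square submatrix inside a row-major matrix with leading
  // dimension ld, starting at pointer p.
  struct View { double* p; std::size_t ld; };

  // Recursive C += A * B on n x n views: each call spawns 8 multiplications
  // of (n/2) x (n/2) quadrants. Once a subproblem fits in some cache level,
  // all further work on it hits that cache.
  void rec_mult(View A, View B, View C, std::size_t n)
  {
      if (n <= 32) {                       // small base case: ordinary loops
          for (std::size_t i = 0; i < n; ++i)
              for (std::size_t j = 0; j < n; ++j) {
                  double sum = 0.0;
                  for (std::size_t k = 0; k < n; ++k)
                      sum += A.p[i * A.ld + k] * B.p[k * B.ld + j];
                  C.p[i * C.ld + j] += sum;
              }
          return;
      }
      std::size_t h = n / 2;
      auto q = [h](View M, int r, int c) {   // quadrant (r, c) of a view
          return View{ M.p + r * h * M.ld + c * h, M.ld };
      };
      for (int r = 0; r < 2; ++r)            // C_rc += A_r0*B_0c + A_r1*B_1c
          for (int c = 0; c < 2; ++c) {
              rec_mult(q(A, r, 0), q(B, 0, c), q(C, r, c), h);
              rec_mult(q(A, r, 1), q(B, 1, c), q(C, r, c), h);
          }
  }

  // Usage (n a power of two): rec_mult({A.data(), n}, {B.data(), n}, {C.data(), n}, n);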

  11. Experiments: MM • Linux, Athlon 1 GHz, 1 GB, g++ -O3

  12. Experiments: MM • Linux, Itanium, 2 GB, g++ -O3

  13. Code: The In-Place Loop
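  The slide's code itself is not in the transcript; as a stand-in, a minimal C++ sketch of an in-place transpose loop for a square row-major matrix (illustrative only, not necessarily the exact loop benchmarked):

  #include <cstddef>
  #include <utility>
  #include <vector>

  // In-place transpose of an n x n row-major matrix: swap each element
  // above the diagonal with its mirror below the diagonal. No auxiliary
  // matrix is used, but the A[j*n + i] accesses stride by n through memory.
  void transpose_inplace(std::vector<double>& A, std::size_t n)
  {
      for (std::size_t i = 0; i < n; ++i)
          for (std::size_t j = i + 1; j < n; ++j)
              std::swap(A[i * n + j], A[j * n + i]);
  }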

  14. Code: CO Transpose
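  Again, the slide's code is missing from the transcript; a minimal sketch of a cache-oblivious transpose in the usual divide-and-conquer style (out-of-place, B = A^T, row-major storage; the names and base-case size are my own assumptions):

  #include <cstddef>

  // Cache-oblivious transpose: write B = A^T, where A is rows x cols and
  // both matrices are row-major with the given leading dimensions.
  // Recursively split the larger dimension; once a tile fits in cache,
  // the whole tile is transposed without further misses.
  void co_transpose(const double* A, std::size_t lda,
                    double* B, std::size_t ldb,
                    std::size_t rows, std::size_t cols)
  {
      if (rows <= 16 && cols <= 16) {            // small base case
          for (std::size_t i = 0; i < rows; ++i)
              for (std::size_t j = 0; j < cols; ++j)
                  B[j * ldb + i] = A[i * lda + j];
          return;
      }
      if (rows >= cols) {                        // split the rows of A
          std::size_t h = rows / 2;
          co_transpose(A, lda, B, ldb, h, cols);
          co_transpose(A + h * lda, lda, B + h, ldb, rows - h, cols);
      } else {                                   // split the columns of A
          std::size_t h = cols / 2;
          co_transpose(A, lda, B, ldb, rows, h);
          co_transpose(A + h, lda, B + h * ldb, ldb, rows, cols - h);
      }
  }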

  15. Experiments: MT • Notebook, Windows 2000, 512 MB, PIII 1 GHz, g++ -O3

  16. Experiments: MT • Notebook, Windows 2000, 512 MB, PIII 1 GHz, g++ -O3

  17. Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3

  18. Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3; size = N × P (P = 100), tall matrices

  19. Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3; size = N × P (P = 1000)

  20. What went wrong? Blocking! And the loop was in-place!

  21. Experiments: MT, loop not in-place • Linux, Athlon 1 GHz, 1 GB, g++ -O3; size = N × N

  22. Experiments: MT, loop not in-place • Notebook, Windows 2000, 512 MB, PIII 1 GHz, g++ -O3

  23. Did we miss something? • Alg 1: Naïve algorithm • Alg 2: Simple blocking using fixed B • Alg 3: Half copy • Alg 4: Full copy • Alg 5: CO • Alg 6: Morton ordering [Chatterjee & Sen, HPCA 00]

  24. Did we miss something?

  25. Static Searches • Only for balanced binary trees • Assume there are no insertions or deletions, only searches • Can we do better than O(log n)? Can we speed it up?

  26. What is a layout? • A mapping of the nodes of a tree to memory • Different kinds of layouts: in-order, post-order, pre-order, van Emde Boas • Main idea: store recursive subtrees in contiguous memory (see the sketch below)
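  A minimal C++ sketch of building a van Emde Boas layout for a complete binary tree stored implicitly in BFS order (my own illustrative construction; the split point and data layout are assumptions, not the talk's code):

  #include <cstddef>
  #include <vector>

  // Append the van Emde Boas layout of the height-h subtree rooted at
  // 1-based BFS index `root` (children of node i are 2i and 2i+1) to `out`.
  // The top ~h/2 levels are laid out first, then each bottom subtree is
  // laid out contiguously, all recursively.
  void veb_layout(const std::vector<int>& bfs, std::size_t root, int h,
                  std::vector<int>& out)
  {
      if (h == 1) { out.push_back(bfs[root - 1]); return; }
      int top_h = h / 2;           // height of the top recursive piece
      int bot_h = h - top_h;       // height of each bottom subtree

      veb_layout(bfs, root, top_h, out);              // top piece
      std::size_t first = root << top_h;              // leftmost bottom root
      std::size_t count = std::size_t(1) << top_h;    // number of bottom roots
      for (std::size_t i = 0; i < count; ++i)
          veb_layout(bfs, first + i, bot_h, out);     // bottom subtrees
  }

  // Usage: for a complete tree of height h (2^h - 1 keys in BFS order),
  //   std::vector<int> out; veb_layout(bfs, 1, h, out);
  // yields the keys in van Emde Boas order.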

  27. Example of Van Emde Boas

  28. Another View

  29. Theoretical Guarantees? (from Prokop's thesis) • Cache complexity $Q(n) = O(1 + \log_L n)$ • Work complexity $W(n) = O(\log n)$
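  A brief note on where these bounds come from (my summary of the standard argument, not text from the slide): consider the level of the van Emde Boas recursion at which every subtree first has between roughly $\sqrt{L}$ and $L$ nodes. Each such subtree is stored contiguously, so it occupies $O(1)$ cache lines, and a root-to-leaf search path passes through only $O(\log_L n)$ of them, giving

  \[
    Q(n) = O\!\left(1 + \frac{\log n}{\log L}\right) = O(1 + \log_L n),
    \qquad W(n) = O(\log n).
  \]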

  30. In Practice?? • Windows notebook, 512 MB, PIII 1 GHz, 256-byte nodes

  31. In Practice II • Windows notebook, 512 MB, PIII 1 GHz, 32-byte nodes

  32. In Practice III • Linux, Itanium, 2 GB, g++ -O3, 48-byte nodes

  33. In Practice! • Matrix operations via Morton ordering, David S. Wise (practical cache-oblivious matrix operation results) • Bender, Duan, Wu (cache-oblivious dictionaries) • Rahman, Cole, Raman (CO B-Trees)

  34. Talk outline… • Motivation (Searching BBT) • The Model • CO-Sorting • …

  35. (M,B) Ideal Cache Model

  36. (Z,L) Ideal Cache Model [figure: processor P with a cache of Z/L cache lines, each of length L, backed by main memory] • Features: two-level hierarchy; cache of size Z; cache-line length L; fully associative; optimal, omniscient replacement • Measures: work W; cache misses Q

  37. Assumptions? • Two Levels of Memory • Tall Cache Assumption • Optimal Cache Replacement “No Asymptotic loss” • Fully-associative LRU can be used instead of optimal replacement with no asymptotic loss of performance [ST85]. • Fully-associative LRU caches can be maintained in ordinary memory with constant slowdown in expected performance.
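  One note added here for completeness (not on the slide): the tall-cache assumption says the cache has more lines than each line has words, commonly stated as

  \[
    Z = \Omega(L^{2}),
  \]

  and the sorting bounds later in the talk rely on it.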

  38. Cache Obliviousness • Cache-oblivious algorithms naturally tune for varying cache sizes and multiple levels of cache: when a subproblem fits into a given level of cache, no further cache misses are incurred beyond those required to bring the subproblem itself into the cache • An optimal cache-oblivious algorithm can be made to run optimally in the HMM [AACS87] and SUMH [VN93] models

  39. CO-Sorting! • Only two methods known • Funnel sort (modified merge sort) • Distribution sort (modified sample sort; we implement a randomized version) • Column sort

  40. Funnel Sort • Partition the input into $n^{1/3}$ pieces of size $n^{2/3}$ each • Sort each piece recursively • Merge the sorted pieces using an $n^{1/3}$-merger [figure: input array → recursively sorted pieces → merger → sorted output]
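  A minimal C++ sketch of this recursive structure only, assuming the standard $n^{1/3}$ split; a plain std::inplace_merge over the pieces stands in for the k-merger of the next slides, which is the component that actually delivers the cache bounds:

  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Funnelsort-style recursion on a[lo, hi): split into roughly n^(1/3)
  // contiguous pieces of size roughly n^(2/3), sort each recursively, then
  // merge the sorted pieces. Pairwise std::inplace_merge replaces the real
  // k-merger, so this sketch mirrors the recursion shape but does not
  // reproduce funnelsort's cache-miss guarantee.
  void funnelsort_sketch(std::vector<int>& a, std::size_t lo, std::size_t hi)
  {
      std::size_t n = hi - lo;
      if (n <= 64) { std::sort(a.begin() + lo, a.begin() + hi); return; }

      std::size_t k = static_cast<std::size_t>(std::cbrt(static_cast<double>(n)));
      if (k < 2) k = 2;
      std::size_t piece = (n + k - 1) / k;            // ~n^(2/3) elements per piece

      std::vector<std::size_t> ends;                  // end index of each sorted piece
      for (std::size_t b = lo; b < hi; b += piece) {
          std::size_t e = std::min(b + piece, hi);
          funnelsort_sketch(a, b, e);                 // sort each piece recursively
          ends.push_back(e);
      }
      for (std::size_t i = 1; i < ends.size(); ++i)   // merge pieces (k-merger stand-in)
          std::inplace_merge(a.begin() + lo, a.begin() + ends[i - 1], a.begin() + ends[i]);
  }

  // Usage: funnelsort_sketch(v, 0, v.size());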

  41. Funnel Sort: k-Mergers • Takes k sorted sequences as input • One invocation outputs the next k³ elements of the merged output • It is a clever recursive scheduling of smaller mergers • Keeps the work complexity O(n log n)

  42. Funnel Sort: k-Merger • A k-merger is built from √k input √k-mergers feeding buffers, plus an output √k-merger R [figure] • Buffers are maintained as circular queues • R is invoked $k^{3/2}$ times; one invocation of R outputs $k^{3/2}$ elements • Before each invocation, make sure the buffers have enough elements (refill a buffer by invoking its input merger)

  43. Funnel Sort: Optimality • Work complexity $O(n \log n)$ • Cache complexity $Q(n) = O\!\left(1 + \frac{n}{L}\left(1 + \log_Z n\right)\right)$ • Aggarwal and Vitter show an $\Omega\!\left(\frac{n}{L} \log_{Z/L} \frac{n}{L}\right)$ lower bound on the number of cache misses, which funnel sort matches under the tall-cache assumption.

  44. Distribution Sort • Partition A into $\sqrt{n}$ sub-arrays, each of size $\sqrt{n}$; sort each recursively • Distribute the sorted sub-arrays into buckets • Sort the buckets recursively • Copy the buckets to the output

  45. The Distribution Step • Has to distribute the $\sqrt{n}$ sorted sub-arrays into buckets • Not in-place • Similar to recursive sample sort, but without doing binary search on the pivots

  46. The Recursive Bucketing Used [figure: elements streamed from SubArray 1 and SubArray 2 into per-bucket buffers Buffer 1 and Buffer 2]
