Cache Oblivious Algorithms: Theory & Practice (Static)
Research Proficiency Examination, Piyush Kumar, Department of Computer Science. Advisor: Joseph S.B. Mitchell

Presentation Transcript


  1. Cache Oblivious Algorithms: Theory & Practice (Static). Research Proficiency Examination, Piyush Kumar, Department of Computer Science. Advisor: Joseph S.B. Mitchell

  2. CO Algorithms: Brief History • Frigo, Leiserson, Prokop, Ramachandran (FOCS 99): Cache Oblivious Algorithms • Harold Prokop's thesis • Bender, Demaine, Farach-Colton (FOCS 00): Cache Oblivious B-Trees • … • Arge, Bender, Demaine et al. (STOC 02): CO Priority Queue

  3. Talk Outline… • Motivation: Matrix Multiplication/Transposition, Static Searches in Balanced Binary Trees • The Model • CO-Sorting • Some Analysis • CO-Sorting Experiments • Do's and Don'ts of the Model • Future Work

  4. Workstations
  • SUN UltraSparc 2: UltraSparc, 16kB L1, 512kB L2
  • SGI Visual Workstation 540: Quad Pentium III, 32kB L1, 1024kB L2
  • Dell Precision: Dual Pentium III, 32kB L1, 512kB L2
  • IBM ThinkPad 600: Pentium II, 32kB L1, 256kB L2
  • Compaq Presario: AMD K6-III, 64kB L1, 256kB L2, 1024kB L3
  How can we write portable code that runs efficiently on different multilevel caching architectures?

  5. Intel Itaniums

  6. Matrix Multiplication (MM): $C = A \times B$, where $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$.
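  For concreteness, a minimal C++ sketch of the ordinary triply nested loop computing C += A * B (added here for illustration; the row-major layout and function name are my own assumptions, not code from the talk):

  #include <cstddef>
  #include <vector>

  // Naive matrix multiplication: C += A * B for n x n matrices stored
  // row-major in flat vectors. O(n^3) work, but poor cache reuse once the
  // matrices exceed the cache, which motivates the blocked and recursive
  // versions on the next slides.
  void naive_mult(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n)
  {
      for (std::size_t i = 0; i < n; ++i)
          for (std::size_t j = 0; j < n; ++j) {
              double sum = 0.0;
              for (std::size_t k = 0; k < n; ++k)
                  sum += A[i * n + k] * B[k * n + j];
              C[i * n + j] += sum;
          }
  }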

  7. Cache-Aware MM [HK81]
  BLOCK-MULT(A, B, C, n)
  1  for i ← 1 to n/s
  2    do for j ← 1 to n/s
  3      do for k ← 1 to n/s
  4        do ORD-MULT(A_ik, B_kj, C_ij, s)
  (ORD-MULT multiplies two s × s blocks with the ordinary algorithm; s is the blocking parameter.)
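  A minimal C++ sketch of this blocked, cache-aware multiplication; the tuning parameter s must be chosen per machine (names and matrix layout are illustrative assumptions, not the slide's code):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Blocked multiplication: C += A * B for n x n row-major matrices.
  // Each (i, j, k) iteration multiplies two s x s blocks that are meant to
  // fit in cache together; s has to be tuned for the target machine.
  void block_mult(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n, std::size_t s)
  {
      for (std::size_t i = 0; i < n; i += s)
          for (std::size_t j = 0; j < n; j += s)
              for (std::size_t k = 0; k < n; k += s)
                  // Ordinary multiplication of the s x s blocks
                  // (edge blocks may be smaller when s does not divide n).
                  for (std::size_t ii = i; ii < std::min(i + s, n); ++ii)
                      for (std::size_t jj = j; jj < std::min(j + s, n); ++jj) {
                          double sum = 0.0;
                          for (std::size_t kk = k; kk < std::min(k + s, n); ++kk)
                              sum += A[ii * n + kk] * B[kk * n + jj];
                          C[ii * n + jj] += sum;
                      }
  }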

  8. Oracle?! • Tune s so that the blocks $A_{ik}$, $B_{kj}$, and $C_{ij}$ just fit into cache ⇒ $s = \Theta(\sqrt{Z})$ • If $n > s$, then $Q(n) = \Theta\!\left((n/s)^3 \cdot s^2/L\right) = \Theta\!\left(n^3 / (L\sqrt{Z})\right)$ • Optimal [HK81]
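  A short counting argument behind that bound (my own added note, following the standard analysis): BLOCK-MULT performs $(n/s)^3$ calls to ORD-MULT, and each call touches three $s \times s$ blocks, i.e. $\Theta(s^2/L)$ cache lines once the blocks fit in cache, so

  \[
    Q(n) = \Theta\!\left(\left(\frac{n}{s}\right)^{3} \cdot \frac{s^{2}}{L}\right)
         = \Theta\!\left(\frac{n^{3}}{L\sqrt{Z}}\right)
    \quad \text{when } s = \Theta(\sqrt{Z}) \text{ and } n > s.
  \]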

  9. Two-Level / Three-Level Cache: one voodoo parameter per caching level! For a two-level cache the blocked code needs two tuning parameters (s, t); for a three-level cache it needs three (s, t, u):
  BLOCK-MULT(A, B, C, n)
  1   for i ← 1 to n/s
  2     do for j ← 1 to n/s
  3       do for k ← 1 to n/s
  4         do for i' ← 1 to s/t
  5           do for j' ← 1 to s/t
  6             do for k' ← 1 to s/t
  7               do for i'' ← 1 to t/u
  8                 do for j'' ← 1 to t/u
  9                   do for k'' ← 1 to t/u
  10                    do ORD-MULT(A_i''k'', B_k''j'', C_i''j'', u)

  10. Recursive Matrix Multiplication: divide and conquer on n × n matrices.
  $\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix} + \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix}$
  8 multiplications of (n/2) × (n/2) matrices, 1 addition of n × n matrices.
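  A minimal C++ sketch of this divide-and-conquer scheme, recursing on quadrant views and assuming n is a power of two (an illustrative sketch, not the code used in the experiments):

  #include <cstddef>
  #include <vector>

  // View of a square submatrix inside a row-major matrix with leading
  // dimension ld, starting at pointer p.
  struct View { double* p; std::size_t ld; };

  // Recursive C += A * B on n x n views: each call spawns 8 multiplications
  // of (n/2) x (n/2) quadrants. Once a subproblem fits in some cache level,
  // all further work on it hits that cache.
  void rec_mult(View A, View B, View C, std::size_t n)
  {
      if (n <= 32) {                       // small base case: ordinary loops
          for (std::size_t i = 0; i < n; ++i)
              for (std::size_t j = 0; j < n; ++j) {
                  double sum = 0.0;
                  for (std::size_t k = 0; k < n; ++k)
                      sum += A.p[i * A.ld + k] * B.p[k * B.ld + j];
                  C.p[i * C.ld + j] += sum;
              }
          return;
      }
      std::size_t h = n / 2;
      auto q = [h](View M, int r, int c) {   // quadrant (r, c) of a view
          return View{ M.p + r * h * M.ld + c * h, M.ld };
      };
      for (int r = 0; r < 2; ++r)            // C_rc += A_r0*B_0c + A_r1*B_1c
          for (int c = 0; c < 2; ++c) {
              rec_mult(q(A, r, 0), q(B, 0, c), q(C, r, c), h);
              rec_mult(q(A, r, 1), q(B, 1, c), q(C, r, c), h);
          }
  }

  // Usage (n a power of two): rec_mult({A.data(), n}, {B.data(), n}, {C.data(), n}, n);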

  11. Experiments: MM • Linux, Athlon 1 GHz, 1 GB, g++ -O3

  12. Experiments: MM • Linux, Itanium, 2 GB, g++ -O3

  13. Code: The In-Place Loop
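  The slide's code itself is not in the transcript; as a stand-in, a minimal C++ sketch of an in-place transpose loop for a square row-major matrix (illustrative only, not necessarily the exact loop benchmarked):

  #include <cstddef>
  #include <utility>
  #include <vector>

  // In-place transpose of an n x n row-major matrix: swap each element
  // above the diagonal with its mirror below the diagonal. No auxiliary
  // matrix is used, but the A[j*n + i] accesses stride by n through memory.
  void transpose_inplace(std::vector<double>& A, std::size_t n)
  {
      for (std::size_t i = 0; i < n; ++i)
          for (std::size_t j = i + 1; j < n; ++j)
              std::swap(A[i * n + j], A[j * n + i]);
  }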

  14. Code: CO Transpose
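  Again, the slide's code is missing from the transcript; a minimal sketch of a cache-oblivious transpose in the usual divide-and-conquer style (out-of-place, B = A^T, row-major storage; the names and base-case size are my own assumptions):

  #include <cstddef>

  // Cache-oblivious transpose: write B = A^T, where A is rows x cols and
  // both matrices are row-major with the given leading dimensions.
  // Recursively split the larger dimension; once a tile fits in cache,
  // the whole tile is transposed without further misses.
  void co_transpose(const double* A, std::size_t lda,
                    double* B, std::size_t ldb,
                    std::size_t rows, std::size_t cols)
  {
      if (rows <= 16 && cols <= 16) {            // small base case
          for (std::size_t i = 0; i < rows; ++i)
              for (std::size_t j = 0; j < cols; ++j)
                  B[j * ldb + i] = A[i * lda + j];
          return;
      }
      if (rows >= cols) {                        // split the rows of A
          std::size_t h = rows / 2;
          co_transpose(A, lda, B, ldb, h, cols);
          co_transpose(A + h * lda, lda, B + h, ldb, rows - h, cols);
      } else {                                   // split the columns of A
          std::size_t h = cols / 2;
          co_transpose(A, lda, B, ldb, rows, h);
          co_transpose(A + h, lda, B + h * ldb, ldb, rows, cols - h);
      }
  }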

  15. Experiments: MT • Notebook, Windows 2000, 512 MB, PIII 1 GHz, g++ -O3

  16. Experiments: MT • Notebook, Windows 2000, 512 MB, PIII 1 GHz, g++ -O3

  17. Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3

  18. Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3; size = N × P (P = 100), tall matrices

  19. Experiments: MT • Linux, Athlon 1 GHz, 1 GB, g++ -O3; size = N × P (P = 1000)

  20. What went wrong? Blocking! And the loop was in-place!

  21. Experiments: MT, loop not in-place • Linux, Athlon 1 GHz, 1 GB, g++ -O3; size = N × N

  22. Experiments: MT, loop not in-place • Notebook, Windows 2000, 512 MB, PIII 1 GHz, g++ -O3

  23. Did we miss something? • Alg 1: Naïve algorithm • Alg 2: Simple blocking using fixed B • Alg 3: Half copy • Alg 4: Full copy • Alg 5: CO • Alg 6: Morton ordering [Chatterjee & Sen, HPCA 00]

  24. Did we miss something?

  25. Static Searches • Only for balanced binary trees • Assume there are no insertions or deletions, only searches • Can we do better than O(log n)? Can we speed it up?

  26. What is a layout? • A mapping of the nodes of a tree to memory • Different kinds of layouts: in-order, post-order, pre-order, van Emde Boas • Main idea: store recursive subtrees in contiguous memory (see the sketch below)
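  A minimal C++ sketch of building a van Emde Boas layout for a complete binary tree stored implicitly in BFS order (my own illustrative construction; the split point and data layout are assumptions, not the talk's code):

  #include <cstddef>
  #include <vector>

  // Append the van Emde Boas layout of the height-h subtree rooted at
  // 1-based BFS index `root` (children of node i are 2i and 2i+1) to `out`.
  // The top ~h/2 levels are laid out first, then each bottom subtree is
  // laid out contiguously, all recursively.
  void veb_layout(const std::vector<int>& bfs, std::size_t root, int h,
                  std::vector<int>& out)
  {
      if (h == 1) { out.push_back(bfs[root - 1]); return; }
      int top_h = h / 2;           // height of the top recursive piece
      int bot_h = h - top_h;       // height of each bottom subtree

      veb_layout(bfs, root, top_h, out);              // top piece
      std::size_t first = root << top_h;              // leftmost bottom root
      std::size_t count = std::size_t(1) << top_h;    // number of bottom roots
      for (std::size_t i = 0; i < count; ++i)
          veb_layout(bfs, first + i, bot_h, out);     // bottom subtrees
  }

  // Usage: for a complete tree of height h (2^h - 1 keys in BFS order),
  //   std::vector<int> out; veb_layout(bfs, 1, h, out);
  // yields the keys in van Emde Boas order.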

  27. Example of Van Emde Boas

  28. Another View

  29. Theoretical Guarantees? (from Prokop's thesis) • Cache complexity $Q(n) = O(1 + \log_L n)$ • Work complexity $W(n) = O(\log n)$
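  A brief note on where these bounds come from (my summary of the standard argument, not text from the slide): consider the level of the van Emde Boas recursion at which every subtree first has between roughly $\sqrt{L}$ and $L$ nodes. Each such subtree is stored contiguously, so it occupies $O(1)$ cache lines, and a root-to-leaf search path passes through only $O(\log_L n)$ of them, giving

  \[
    Q(n) = O\!\left(1 + \frac{\log n}{\log L}\right) = O(1 + \log_L n),
    \qquad W(n) = O(\log n).
  \]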

  30. In Practice?? • Windows notebook, 512 MB, PIII 1 GHz, 256-byte nodes

  31. In Practice II • Windows notebook, 512 MB, PIII 1 GHz, 32-byte nodes

  32. In Practice III • Linux, Itanium, 2 GB, g++ -O3, 48-byte nodes

  33. In Practice! • Matrix operations via Morton ordering, David S. Wise (practical cache-oblivious matrix operation results) • Bender, Duan, Wu (cache-oblivious dictionaries) • Rahman, Cole, Raman (CO B-Trees)

  34. Talk outline… • Motivation (Searching BBT) • The Model • CO-Sorting • …

  35. (M,B) Ideal Cache Model

  36. (Z,L) Ideal Cache Model [figure: processor P with a cache of Z/L cache lines, each of length L, backed by main memory] • Features: two-level hierarchy; cache of size Z; cache-line length L; fully associative; optimal, omniscient replacement • Measures: work W; cache misses Q

  37. Assumptions? • Two Levels of Memory • Tall Cache Assumption • Optimal Cache Replacement “No Asymptotic loss” • Fully-associative LRU can be used instead of optimal replacement with no asymptotic loss of performance [ST85]. • Fully-associative LRU caches can be maintained in ordinary memory with constant slowdown in expected performance.
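  One note added here for completeness (not on the slide): the tall-cache assumption says the cache has more lines than each line has words, commonly stated as

  \[
    Z = \Omega(L^{2}),
  \]

  and the sorting bounds later in the talk rely on it.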

  38. Cache Obliviousness • Cache-oblivious algorithms naturally tune for varying cache sizes and multiple levels of cache: when a subproblem fits into a given level of cache, no further cache misses are incurred beyond those required to bring the subproblem itself into the cache • An optimal cache-oblivious algorithm can be made to run optimally in the HMM [AACS87] and SUMH [VN93] models

  39. CO-Sorting! • Only two methods known • Funnel sort (modified merge sort) • Distribution sort (modified sample sort; we implement a randomized version) • Column sort

  40. Funnel Sort • Partition the input into $n^{1/3}$ pieces of size $n^{2/3}$ each • Sort each piece recursively • Merge the sorted pieces using an $n^{1/3}$-merger [figure: input array → recursively sorted pieces → merger → sorted output]
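  A minimal C++ sketch of this recursive structure only, assuming the standard $n^{1/3}$ split; a plain std::inplace_merge over the pieces stands in for the k-merger of the next slides, which is the component that actually delivers the cache bounds:

  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Funnelsort-style recursion on a[lo, hi): split into roughly n^(1/3)
  // contiguous pieces of size roughly n^(2/3), sort each recursively, then
  // merge the sorted pieces. Pairwise std::inplace_merge replaces the real
  // k-merger, so this sketch mirrors the recursion shape but does not
  // reproduce funnelsort's cache-miss guarantee.
  void funnelsort_sketch(std::vector<int>& a, std::size_t lo, std::size_t hi)
  {
      std::size_t n = hi - lo;
      if (n <= 64) { std::sort(a.begin() + lo, a.begin() + hi); return; }

      std::size_t k = static_cast<std::size_t>(std::cbrt(static_cast<double>(n)));
      if (k < 2) k = 2;
      std::size_t piece = (n + k - 1) / k;            // ~n^(2/3) elements per piece

      std::vector<std::size_t> ends;                  // end index of each sorted piece
      for (std::size_t b = lo; b < hi; b += piece) {
          std::size_t e = std::min(b + piece, hi);
          funnelsort_sketch(a, b, e);                 // sort each piece recursively
          ends.push_back(e);
      }
      for (std::size_t i = 1; i < ends.size(); ++i)   // merge pieces (k-merger stand-in)
          std::inplace_merge(a.begin() + lo, a.begin() + ends[i - 1], a.begin() + ends[i]);
  }

  // Usage: funnelsort_sketch(v, 0, v.size());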

  41. Funnel Sort: k-Mergers • Takes k sorted sequences as input • One invocation outputs the next k³ elements of the merged output • It is a clever recursive scheduling of smaller mergers • Keeps the work complexity O(n log n)

  42. Funnel Sort: k-Merger • A k-merger is built from √k input √k-mergers feeding buffers, plus an output √k-merger R [figure] • Buffers are maintained as circular queues • R is invoked $k^{3/2}$ times; one invocation of R outputs $k^{3/2}$ elements • Before each invocation, make sure the buffers have enough elements (refill a buffer by invoking its input merger)

  43. Funnel Sort: Optimality • Work complexity $O(n \log n)$ • Cache complexity $Q(n) = O\!\left(1 + \frac{n}{L}\left(1 + \log_Z n\right)\right)$ • Aggarwal and Vitter show an $\Omega\!\left(\frac{n}{L} \log_{Z/L} \frac{n}{L}\right)$ lower bound on the number of cache misses, which funnel sort matches under the tall-cache assumption.

  44. Distribution Sort • Partition A into $\sqrt{n}$ sub-arrays, each of size $\sqrt{n}$; sort each recursively • Distribute the sorted sub-arrays into buckets • Sort the buckets recursively • Copy the buckets to the output

  45. The Distribution Step • Has to distribute the $\sqrt{n}$ sorted sub-arrays into buckets • Not in-place • Similar to recursive sample sort, but without doing binary search on the pivots

  46. The Recursive Bucketing Used [figure: elements streamed from SubArray 1 and SubArray 2 into per-bucket buffers Buffer 1 and Buffer 2]
