
Performance Analysis of Divide and Conquer Algorithms for the WHT


Presentation Transcript


  1. Performance Analysis of Divide and Conquer Algorithms for the WHT
     Jeremy Johnson
     Mihai Furis, Pawel Hitczenko, Hung-Jen Huang
     Dept. of Computer Science, Drexel University
     www.spiral.net

  2. Motivation
  • On modern machines operation count is not always the most important performance metric.
  • Effective utilization of the memory hierarchy, pipelining, and instruction-level parallelism is important, and it is not easy to determine such utilization from source code.
  • Automatic performance tuning and architecture adaptation
    • Generate and test
    • FFT, matrix multiplication, …
  • Explain the performance distribution

  3. Outline
  • Space of WHT algorithms
  • WHT package and performance distribution
  • Performance model
    • Instruction count
    • Cache

  4. Walsh-Hadamard Transform
  • y = WHT_N · x, N = 2^n
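
  The transform matrix itself is not shown in the transcript; the standard definition, which the identities on the next slide assume, is:

    \[
    \mathrm{WHT}_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
    \qquad
    \mathrm{WHT}_{2^n} = \underbrace{\mathrm{WHT}_2 \otimes \cdots \otimes \mathrm{WHT}_2}_{n\ \text{factors}},
    \]
    so, for example,
    \[
    \mathrm{WHT}_4 = \mathrm{WHT}_2 \otimes \mathrm{WHT}_2 =
    \begin{pmatrix}
    1 &  1 &  1 &  1 \\
    1 & -1 &  1 & -1 \\
    1 &  1 & -1 & -1 \\
    1 & -1 & -1 &  1
    \end{pmatrix}.
    \]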

  5. Factoring the WHT Matrix
  • AC ⊗ BD = (A ⊗ B)(C ⊗ D)
  • A ⊗ B = (A ⊗ I)(I ⊗ B)
  • A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C
  • I_m ⊗ I_n = I_mn
  • WHT_2 ⊗ WHT_2 = (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2)

  6. Recursive Algorithm
  • WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2))
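
  As a concrete rendering of this binary factorization, here is a minimal C sketch of the fully recursive radix-2 algorithm; the function name and in-place interface are illustrative, not the WHT package's API.

      #include <stddef.h>

      /* Minimal sketch of the fully recursive radix-2 WHT, in place on
         x[0..N-1], N a power of two.  Implements
         WHT_N = (WHT_2 ⊗ I_{N/2})(I_2 ⊗ WHT_{N/2}):
         transform both halves, then combine with stride-N/2 butterflies.
         Illustrative only; not the WHT package's interface. */
      void wht_recursive(double *x, size_t N)
      {
          if (N == 1) return;
          size_t h = N / 2;
          wht_recursive(x, h);          /* I_2 ⊗ WHT_{N/2}, first half  */
          wht_recursive(x + h, h);      /* I_2 ⊗ WHT_{N/2}, second half */
          for (size_t k = 0; k < h; k++) {   /* WHT_2 ⊗ I_{N/2} */
              double a = x[k], b = x[k + h];
              x[k]     = a + b;
              x[k + h] = a - b;
          }
      }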

  7. Iterative Algorithm
  • WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)(I_4 ⊗ WHT_2)

  8. WHT Algorithms
  • Recursive
  • Iterative
  • General

  9. WHT Implementation
  • WHT_(2^n) = ∏_(i=1..t) ( I_(2^(n_1+···+n_(i-1))) ⊗ WHT_(2^(n_i)) ⊗ I_(2^(n_(i+1)+···+n_t)) )
  • N = N_1 · N_2 ··· N_t, where N_i = 2^(n_i)
  • x^M_(b,s) = (x(b), x(b+s), …, x(b+(M-1)s)), the M-point subvector of x at base b with stride s
  • Implementation (nested loop):
      R = N; S = 1;
      for i = t, …, 1
        R = R / N_i
        for j = 0, …, R-1
          for k = 0, …, S-1
            x^(N_i)_(j·N_i·S+k, S) = WHT_(N_i) · x^(N_i)_(j·N_i·S+k, S)
        S = S · N_i
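
  A C sketch of this nested loop, under the assumption that a base-case routine applies WHT_M to the strided subvector; names are illustrative, not the package's API.

      #include <stddef.h>

      /* Applies WHT_M to the subvector (x[b], x[b+s], ..., x[b+(M-1)s])
         with the standard in-place butterfly passes; shown here as a
         generic loop, where the package would call an unrolled codelet. */
      static void wht_base(double *x, size_t b, size_t s, size_t M)
      {
          for (size_t h = 1; h < M; h *= 2)        /* log2(M) passes */
              for (size_t j = 0; j < M; j += 2 * h)
                  for (size_t k = j; k < j + h; k++) {
                      double a = x[b + k * s], c = x[b + (k + h) * s];
                      x[b + k * s]       = a + c;
                      x[b + (k + h) * s] = a - c;
                  }
      }

      /* The nested loop from the slide: evaluate WHT_N in place given a
         factorization N = N_1 * ... * N_t (Ni[0..t-1], each a power of 2). */
      void wht_nested(double *x, size_t N, const size_t *Ni, int t)
      {
          size_t R = N, S = 1;
          for (int i = t - 1; i >= 0; i--) {       /* i = t, ..., 1 */
              R = R / Ni[i];
              for (size_t j = 0; j < R; j++)
                  for (size_t k = 0; k < S; k++)
                      wht_base(x, j * Ni[i] * S + k, S, Ni[i]);
              S = S * Ni[i];
          }
      }

  With t = 3 and Ni = {2, 2, 2}, this reproduces the iterative algorithm of slide 7; the package instead dispatches to unrolled leaf codelets such as apply_small1 (slide 20).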

  10. Partition Trees
  [figure: example partition trees: left recursive, right recursive, balanced, and iterative]

  11. Number of Algorithms
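
  As a sketch of where such counts come from: assuming an algorithm is exactly a partition tree (a node of size n is either a leaf small[n] or an ordered split n = n_1 + ··· + n_t with t ≥ 2, per slide 13), the number of trees T(n) obeys a convolution recurrence. A hedged C sketch follows; the package caps leaf sizes, which would shrink these counts.

      #include <stdio.h>

      /* T(n) = 1 + sum over compositions n = n1+...+nt, t >= 2,
         of T(n1)*...*T(nt), with T(1) = 1; h(n) is the same sum over
         all compositions (t >= 1), computed by convolution. */
      #define NMAX 20
      int main(void)
      {
          unsigned long long T[NMAX + 1], h[NMAX + 1];
          h[0] = 1; T[1] = 1; h[1] = 1;
          for (int n = 2; n <= NMAX; n++) {
              unsigned long long g = 0;     /* splits: t >= 2 */
              for (int j = 1; j < n; j++)
                  g += T[j] * h[n - j];
              T[n] = 1 + g;                 /* leaf or split  */
              h[n] = 0;
              for (int j = 1; j <= n; j++)
                  h[n] += T[j] * h[n - j];
              printf("T(%d) = %llu\n", n, T[n]);
          }
          return 0;
      }

  This gives T(1) = 1, T(2) = 2, T(3) = 6, T(4) = 24, growing rapidly with n.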

  12. Outline
  • WHT algorithms
  • WHT package and performance distribution
  • Performance model
    • Instruction count
    • Cache

  13. WHT Package (Püschel & Johnson, ICASSP ’00)
  • Allows easy implementation of any of the possible WHT algorithms
  • Partition tree representation: W(n) = small[n] | split[W(n_1), …, W(n_t)]
  • Tools
    • Measure runtime of any algorithm
    • Measure hardware events (coupled with PCL/PAPI)
    • Search for a good implementation
      • Dynamic programming
      • Evolutionary algorithm
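
  A sketch of this grammar as a C type, with names invented for illustration (the package's real structs differ):

      /* W(n) = small[n] | split[W(n1), ..., W(nt)] as a C type.
         Names are illustrative; not the WHT package's actual structs. */
      typedef struct wht_node {
          int size;                    /* n: the node transforms 2^n points */
          int nchildren;               /* 0 for small[n], t >= 2 for split  */
          struct wht_node **children;  /* W(n1), ..., W(nt), left to right  */
      } wht_node;

  For example, the recursive algorithm of slide 6 is the tree split[small[1], split[small[1], small[1]]].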

  14. Algorithm Comparison

  15. Cache Miss Data

  16. Histogram (n = 16, 10,000 samples)
  • Wide range in performance despite an equal number of arithmetic operations (n·2^n flops)
  • Pentium III vs. UltraSPARC II

  17. Outline
  • WHT algorithms
  • WHT package and performance distribution
  • Performance model
    • Instruction count
    • Cache

  18. WHT Implementation
  • WHT_(2^n) = ∏_(i=1..t) ( I_(2^(n_1+···+n_(i-1))) ⊗ WHT_(2^(n_i)) ⊗ I_(2^(n_(i+1)+···+n_t)) )
  • N = N_1 · N_2 ··· N_t, where N_i = 2^(n_i)
  • x^M_(b,s) = (x(b), x(b+s), …, x(b+(M-1)s)), the M-point subvector of x at base b with stride s
  • Implementation (nested loop):
      R = N; S = 1;
      for i = t, …, 1
        R = R / N_i
        for j = 0, …, R-1
          for k = 0, …, S-1
            x^(N_i)_(j·N_i·S+k, S) = WHT_(N_i) · x^(N_i)_(j·N_i·S+k, S)
        S = S · N_i

  19. Instruction Count Model
  • A(n) = number of calls to the WHT procedure
  • α = number of instructions outside loops
  • A_l(n) = number of calls to the base case of size l
  • α_l = number of instructions in the base case of size l
  • L_i = number of iterations of the outer (i = 1), middle (i = 2), and inner (i = 3) loop
  • β_i = number of instructions in the outer (i = 1), middle (i = 2), and inner (i = 3) loop body

  20. Small[1]
      .file "s_1.c"
      .version "01.01"
  gcc2_compiled.:
      .text
      .align 4
      .globl apply_small1
      .type apply_small1,@function
  apply_small1:
      movl 8(%esp),%edx       // load stride S into EDX
      movl 12(%esp),%eax      // load x array's base address into EAX
      fldl (%eax)             // st(0) = x[0]
      fldl (%eax,%edx,8)      // st(0) = x[S], st(1) = x[0]
      fld %st(1)              // st(0) = x[0], st(1) = x[S], st(2) = x[0]
      fadd %st(1),%st         // st(0) = x[0] + x[S]
      fxch %st(2)             // st(0) = x[0], st(2) = x[0] + x[S]
      fsubp %st,%st(1)        // st(0) = x[0] - x[S] (note gas's reversed fsubp operand convention)
      fxch %st(1)             // st(0) = x[0] + x[S], st(1) = x[0] - x[S]
      fstpl (%eax)            // store x[0] = x[0] + x[S]
      fstpl (%eax,%edx,8)     // store x[S] = x[0] - x[S]
      ret
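
  For reference, the C behind this codelet plausibly looks like the following (the real s_1.c ships with the WHT package; judging from the stack offsets, the stride is the second argument and x the third, so the first argument, presumably the transform object, is unused here):

      /* Reconstructed sketch of the size-2 base case; not verbatim s_1.c. */
      void apply_small1(void *W, long S, double *x)
      {
          double t0 = x[0], t1 = x[S];   /* the two strided loads   */
          x[0] = t0 + t1;                /* first butterfly output  */
          x[S] = t0 - t1;                /* second butterfly output */
      }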

  21. Recurrences
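
  The formulas themselves are not in the transcript. One plausible reading, reconstructed from the nested loop of slide 18 and the definitions of slide 19 (not the slide's own derivation, and assuming the per-call overhead α is charged to every call, leaves included), is the following tree walk; the constants are those reported on slide 22 for the Pentium III.

      #include <stdio.h>

      typedef struct wht_node { int size, nchildren;
                                struct wht_node **children; } wht_node;

      /* Accumulators for slide 19's quantities: A = total calls,
         A_l[l] = calls to the base case of size l (l <= 8 assumed),
         L[i] = total iterations of the outer/middle/inner loop bodies. */
      static unsigned long long A, A_l[9], L[3];

      /* calls = multiplicity with which this node is invoked.  A split
         node of size n invokes child i exactly 2^(n - n_i) times; per
         invocation the outer loop runs t times, at step i the middle
         loop runs 2^(n_1+...+n_(i-1)) times and the inner loop
         2^(n - n_i) times (slide 18). */
      static void count(const wht_node *v, unsigned long long calls)
      {
          A += calls;
          if (v->nchildren == 0) { A_l[v->size] += calls; return; }
          L[0] += calls * v->nchildren;
          int prefix = 0;                      /* n_1 + ... + n_(i-1) */
          for (int i = 0; i < v->nchildren; i++) {
              int ni = v->children[i]->size;
              unsigned long long mult = 1ULL << (v->size - ni);
              L[1] += calls * (1ULL << prefix);
              L[2] += calls * mult;
              count(v->children[i], calls * mult);
              prefix += ni;
          }
      }

      int main(void)
      {
          /* example: W(2) = split[small[1], small[1]] */
          wht_node a = {1, 0, 0}, b = {1, 0, 0};
          wht_node *kids[] = {&a, &b};
          wht_node root = {2, 2, kids};
          count(&root, 1);

          /* model parameters from slide 22 (Pentium III) */
          unsigned long long alpha = 27, beta[3] = {18, 18, 20},
                             alpha_l[9] = {0, 12, 34, 106};
          unsigned long long total = alpha * A;
          for (int l = 1; l <= 8; l++) total += alpha_l[l] * A_l[l];
          for (int i = 0; i < 3; i++) total += beta[i] * L[i];
          printf("modeled instructions: %llu\n", total);
          return 0;
      }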

  22. Histogram using Instruction Model (P3)
  • α_1 = 12, α_2 = 34, and α_3 = 106
  • α = 27
  • β_1 = 18, β_2 = 18, and β_3 = 20

  23. Cache Model
  • Different WHT algorithms access data in different patterns
  • All algorithms with the same set of leaf nodes have the same number of memory accesses
  • Count misses for accesses to the data array
    • Parameterized by cache size, associativity, and block size
    • Simulate using program traces (restricted to data-vector accesses)
    • Analytic formula?
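
  A minimal trace-driven simulator matching this parameterization (cache size C, associativity A, block size B, all in data elements and powers of two with C >= A·B, LRU replacement); a sketch, not the actual tool used in the talk.

      #include <stdlib.h>

      /* Count misses of an access trace against a C-element cache with
         associativity A and block size B, LRU replacement.  Addresses
         are data-vector indices.  Sketch only. */
      typedef struct {
          long A, B, nsets;
          long *tag;          /* nsets * A tags, -1 = empty      */
          long *age;          /* LRU timestamps, parallel to tag */
          long clock, misses;
      } cache;

      cache *cache_new(long C, long A, long B)
      {
          cache *c = malloc(sizeof *c);
          c->A = A; c->B = B; c->nsets = C / (A * B);
          c->tag = malloc(c->nsets * A * sizeof *c->tag);
          c->age = calloc(c->nsets * A, sizeof *c->age);
          for (long i = 0; i < c->nsets * A; i++) c->tag[i] = -1;
          c->clock = 0; c->misses = 0;
          return c;
      }

      void cache_access(cache *c, long addr)
      {
          long block = addr / c->B;
          long set = block % c->nsets, tag = block / c->nsets;
          long *t = c->tag + set * c->A, *a = c->age + set * c->A;
          long victim = 0;
          c->clock++;
          for (long w = 0; w < c->A; w++) {
              if (t[w] == tag) { a[w] = c->clock; return; }  /* hit */
              if (a[w] < a[victim]) victim = w;              /* LRU */
          }
          c->misses++;                                       /* miss: fill */
          t[victim] = tag; a[victim] = c->clock;
      }

  Replaying the loads and stores issued by the nested-loop implementation (slide 18) through cache_access yields miss counts of the kind reported on slide 26.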

  24. Blocked Access
  [figure: blocked access pattern on the 16-element data vector x(0), …, x(15) for a size-4 partition tree]

  25. Interleaved Access
  [figure: interleaved (strided) access pattern on the 16-element data vector x(0), …, x(15) for a size-4 partition tree]

  26. Cache Simulator
  [figure: the two size-4 partition trees being compared]
  • 144 memory accesses
    • C = 4, A = 1, B = 1: (80, 112) misses
    • C = 4, A = 4, B = 1: (48, 48) misses
    • C = 4, A = 1, B = 2: (72, 88) misses
  • Iterative vs. recursive (192 memory accesses)
    • C = 4, A = 1, B = 1: (128, 112) misses

  27. Cache Misses as a Function of Cache Size
  [plots for cache sizes C = 2^2, 2^3, 2^4, 2^5]

  28. Formula for Cache Misses
  • M(L, W_N, R) = number of misses for (I_L ⊗ WHT_N ⊗ I_R)

  29. Closed Form
  • M(L, W_N, R) = number of misses for (I_L ⊗ WHT_N ⊗ I_R)
  • M(0, W_n, 0) = 3(n - c)·2^n + k·2^n, where C = 2^c and k = number of parts in the rightmost c positions
  • Example (c = 3, n = 4): k = 1 (iterative), k = 3 (balanced), k = 2 (right recursive)
  [figure: the three partition trees]

  30. Summary of Results and Future Work
  • Instruction count model
    • Min, max, expected value, variance, limiting distribution
  • Cache model
    • Direct mapped (closed-form solution, distribution, expected value, and variance)
  • Combine models
  • Extend the cache formula to include A and B
  • Use as a heuristic to limit search and predict performance

  31. Sponsors
  Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting; DESA: Intelligent HW-SW Compilers for Signal Processing Applications; and NSF ITR/NGS #0325687: Intelligent HW/SW Compilers for DSP.
