

Presentation Transcript


  1. 18.337 / 6.338: Parallel Computing Project Final Report. Parallelization of Matrix Multiply: A Look at How Differing Algorithmic Approaches and CPU Hardware Impact Scaling Calculation Performance in Java. Elliotte Kim, Massachusetts Institute of Technology, Class of 2012

  2. Matrix Multiplication: A (n × m) * B (m × p) = C (n × p)

  3. Hypothesis: Computing (n × kn) * (kn × n) will take at least k times as long as computing (n × n) * (n × n), regardless of parallelization, as long as the same parallelization method is applied to both matmuls.
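Stated as an inequality over wall-clock times (the symbol T is introduced here for clarity; it does not appear in the original slides):

    T( (n × kn) * (kn × n) ) ≥ k · T( (n × n) * (n × n) )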

  4. In both cases, the resulting matrix C will be (n × n).

  5. Ordinary Matrix Multiply

  6. Under ordinary matrix multiplication, the (n × kn) * (kn × n) matmul requires k times as many multiplication operations as the (n × n) * (n × n) matmul.
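The transcript does not include source code. As a point of reference, a minimal single-threaded sketch of ordinary matrix multiply in Java (assuming row-major double[][] arrays; class and method names are illustrative, not from the report):

    class OrdinaryMatMul {
        // C = A * B, where A is n x m, B is m x p, and C is n x p.
        // For A (n x kn) and B (kn x n) the inner loop runs kn times,
        // so the total multiplication count is n * n * kn = k * n^3,
        // i.e. k times the count of the (n x n) * (n x n) case.
        static double[][] multiply(double[][] a, double[][] b) {
            int n = a.length, m = b.length, p = b[0].length;
            double[][] c = new double[n][p];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < p; j++) {
                    double sum = 0.0;
                    for (int x = 0; x < m; x++) {
                        sum += a[i][x] * b[x][j];
                    }
                    c[i][j] = sum;
                }
            }
            return c;
        }
    }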

  7. Test Case 1: Intel Atom N270, 1.6 GHz, 1 core, 2 threads/core (2 threads total), 56 KB L1 cache, 512 KB L2 cache

  8. [Chart: computation time (ms), Ordinary Matrix Multiply, 1 thread, n = 1024]

  9. [Chart: computation time (ms), Ordinary Matrix Multiply, 2 threads, n = 1024]

  10. Test Case 2: AMD Turion 64 X2, 2.0 GHz, 2 cores, 1 thread/core (2 threads total), 128 KB L1 cache per core, 512 KB L2 cache per core

  11. [Chart: computation time (ms), Ordinary Matrix Multiply, 1 thread, n = 1024]

  12. [Chart: computation time (ms), Ordinary Matrix Multiply, 2 threads, n = 1024]

  13. Observation: Near doubling in performance going from 1 to 2 threads. Calculation rate slows going from k = 3 to k = 4. Why? L2 cache access begins at k = 4.
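The report does not show how work was divided among threads. One common approach for the "Ordinary Matrix Multiply, 2 threads" runs, and a plausible reading of these slides, is to split the rows of C across a fixed number of Java threads; the sketch below is an assumption, not the author's code:

    class ParallelOrdinaryMatMul {
        // Splits the rows of C = A * B evenly across numThreads Java threads.
        static double[][] multiply(double[][] a, double[][] b, int numThreads)
                throws InterruptedException {
            int n = a.length, m = b.length, p = b[0].length;
            double[][] c = new double[n][p];
            Thread[] workers = new Thread[numThreads];
            for (int t = 0; t < numThreads; t++) {
                final int lo = t * n / numThreads;       // first row for this worker
                final int hi = (t + 1) * n / numThreads; // one past the last row
                workers[t] = new Thread(() -> {
                    for (int i = lo; i < hi; i++) {
                        for (int j = 0; j < p; j++) {
                            double sum = 0.0;
                            for (int x = 0; x < m; x++) {
                                sum += a[i][x] * b[x][j];
                            }
                            c[i][j] = sum;
                        }
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join(); // wait for every row block to finish
            }
            return c;
        }
    }

Because the row blocks write disjoint parts of c, no locking is needed, which is consistent with the near doubling from 1 to 2 threads reported on the slide.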

  14. Test Case 3: Intel Core2 Quad Q6700, 2.66 GHz, 4 cores, 1 thread/core (4 threads total), 128 KB L1 cache per core, 2 × 4 MB shared L2 cache

  15. [Chart: computation time (ms), Ordinary Matrix Multiply, 1 thread, n = 1024]

  16. [Chart: computation time (ms), Ordinary Matrix Multiply, 2 threads, n = 1024]

  17. [Chart: computation time (ms), Ordinary Matrix Multiply, 4 threads, n = 1024]

  18. Observation: Near doubling in performance going from 1 to 2 threads. At 4 threads, computation slows at k = 4 and k = 7, recovering at k = 6 and k = 8. Effects of the shared cache?

  19. Ordinary Matrix Multiply: All observed performance times were consistent with the hypothesis.

  20. The Question: Is there an algorithm that can give better-than-k scaling?

  21. Recursive Matrix Multiply: Breaks a matrix up into 4 smaller matrices, spawns a new thread for each, and applies this recursively until a threshold size is reached.
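The exact decomposition is not shown in the transcript. One plausible reading (an assumption, not the author's code) is a divide-and-conquer over the output matrix C: each quadrant of C becomes its own task, recursing until the block falls below a threshold, at which point the ordinary triple loop runs on that block. A sketch using Java's fork/join framework, with an assumed threshold:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    class RecursiveMatMul {
        static final int THRESHOLD = 64; // assumed cutoff block size

        // Computes the block C[rowLo..rowHi) x [colLo..colHi) of C = A * B.
        static class MultiplyTask extends RecursiveAction {
            final double[][] a, b, c;
            final int rowLo, rowHi, colLo, colHi;

            MultiplyTask(double[][] a, double[][] b, double[][] c,
                         int rowLo, int rowHi, int colLo, int colHi) {
                this.a = a; this.b = b; this.c = c;
                this.rowLo = rowLo; this.rowHi = rowHi;
                this.colLo = colLo; this.colHi = colHi;
            }

            @Override
            protected void compute() {
                if (rowHi - rowLo <= THRESHOLD && colHi - colLo <= THRESHOLD) {
                    int m = b.length; // inner dimension
                    for (int i = rowLo; i < rowHi; i++) {
                        for (int j = colLo; j < colHi; j++) {
                            double sum = 0.0;
                            for (int x = 0; x < m; x++) {
                                sum += a[i][x] * b[x][j];
                            }
                            c[i][j] = sum;
                        }
                    }
                } else {
                    int rowMid = (rowLo + rowHi) / 2;
                    int colMid = (colLo + colHi) / 2;
                    // Split the output block into four quadrants and compute them in parallel.
                    invokeAll(new MultiplyTask(a, b, c, rowLo, rowMid, colLo, colMid),
                              new MultiplyTask(a, b, c, rowLo, rowMid, colMid, colHi),
                              new MultiplyTask(a, b, c, rowMid, rowHi, colLo, colMid),
                              new MultiplyTask(a, b, c, rowMid, rowHi, colMid, colHi));
                }
            }
        }

        static double[][] multiply(double[][] a, double[][] b) {
            double[][] c = new double[a.length][b[0].length];
            new ForkJoinPool().invoke(
                    new MultiplyTask(a, b, c, 0, a.length, 0, b[0].length));
            return c;
        }
    }

Fork/join tasks are scheduled onto a fixed pool of worker threads, so "spawning a thread per matrix" does not create an OS thread per sub-block, and the smaller leaf blocks tend to fit in cache, which may contribute to the scaling behavior discussed in the later slides.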

  22. [Chart: computation time (ms), Recursive Matrix Multiply, n = 1024]

  23. Observation: Recursive MatMul is 1 to 3 times faster than parallel Ordinary MatMul on the Atom processor. No drastic slowdown in computation rate after k = 1. Near-linear relationship between calculation times and values of k.

  24. [Chart: computation time (ms), Recursive Matrix Multiply, n = 1024]

  25. Observation: Recursive MatMul is 1.5 to 3.5 times faster than parallel Ordinary MatMul on the Turion processor. No drastic slowdown in computation rate between k = 3 and k = 4. Near-linear relationship between calculation times and values of k.

  26. [Chart: computation time (ms), Recursive Matrix Multiply, n = 1024]

  27. Observation: Recursive MatMul is 0.5 to 4 times faster than parallel Ordinary MatMul on the Q6700 processor. Better-than-k scaling when k = 3, 5, 6, 7, and 8. Why?

  28. Conclusions: Better-than-k scaling can be achieved, though it is uncertain why. Hardware? Algorithm? A combination of the two? Further research is required.

  29. Conclusions: The algorithmic approach can affect the time required. Hardware can affect the time required. Faster processors help. More cache helps. But the best performance is achieved when algorithms account for the hardware and determine the best approach.
