Parallelization of the Classic Gram-Schmidt QR-Factorization






  1. Parallelization of the Classic Gram-Schmidt QR-Factorization M543 Final Project Ron Caplan May 6th, 2008

  2. Overview • Introduction • MATLAB vs. FORTRAN • Parallelization with MPI • Parallel Computation Results • Conclusion

  3. Introduction “It is well established that the classical Gram-Schmidt process is one of the unstable ones. Consequently it is rarely used, except sometimes on parallel computers in situations where advantages in communication may outweigh the disadvantage of instability.” - Trefethen & Bau, pg. 66
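The serial code itself does not appear in the transcript; the following is a minimal Fortran sketch of the classic Gram-Schmidt (CGS) QR-factorization being discussed. The routine name and layout are illustrative, not the author's actual code.

! Minimal serial sketch of classic Gram-Schmidt (CGS) QR-factorization
! of an m-by-n matrix A into Q (m-by-n, orthonormal columns) and
! R (n-by-n, upper triangular). Illustrative only, not the code
! benchmarked in the presentation.
subroutine cgs_qr(m, n, A, Q, R)
  implicit none
  integer, intent(in) :: m, n
  double precision, intent(in)  :: A(m, n)
  double precision, intent(out) :: Q(m, n), R(n, n)
  integer :: i, j
  double precision :: v(m)

  R = 0.0d0
  do j = 1, n
     v = A(:, j)
     ! Project the j-th column of A against all previously computed
     ! columns of Q. This inner loop has j-1 iterations, which is the
     ! source of the load imbalance noted on slide 15.
     do i = 1, j - 1
        R(i, j) = dot_product(Q(:, i), A(:, j))
        v = v - R(i, j) * Q(:, i)
     end do
     R(j, j) = sqrt(dot_product(v, v))
     Q(:, j) = v / R(j, j)
  end do
end subroutine cgs_qr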

  4. MATLAB vs. FORTRAN Pentium 4 2.0 GHz with 512 KB cache and 512 MB RDRAM, running Windows XP. MATLAB Student Version 7 and g95 under Cygwin. Each run is done 5 times and the times averaged. In between runs, memory is cleared.
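As a companion to the benchmark setup above, here is a hypothetical Fortran timing harness in the same spirit: five runs per problem size, memory freed between runs, and the times averaged. It links against the cgs_qr sketch from the previous slide; the matrix size and routine name are assumptions, not the author's test code.

! Hypothetical timing harness matching the methodology on the slide:
! each problem size is run 5 times and the CPU times averaged, with
! memory released between runs. Assumes the cgs_qr sketch above.
program time_cgs
  implicit none
  integer, parameter :: m = 1000, n = 1000, nruns = 5
  double precision, allocatable :: A(:,:), Q(:,:), R(:,:)
  double precision :: t0, t1, total
  integer :: k

  total = 0.0d0
  do k = 1, nruns
     allocate(A(m, n), Q(m, n), R(n, n))   ! fresh memory each run
     call random_number(A)
     call cpu_time(t0)
     call cgs_qr(m, n, A, Q, R)
     call cpu_time(t1)
     total = total + (t1 - t0)
     deallocate(A, Q, R)                   ! "clear memory" between runs
  end do
  print '(a,i5,a,f10.4,a)', 'n = ', n, '  mean CGS time = ', total / nruns, ' s'
end program time_cgs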

  5. MATLAB vs. FORTRAN

  6. MATLAB vs. FORTRAN

  7. Parallelization with MPI (4-node example) • Send out A to all nodes, one vector at a time • Each node computes its part of Q & R • Collect Q and R, one vector at a time. Hardware: four dual quad-core Xeon nodes with 4 GB RAM per node, configured as 4 processors per node for MPI/Batch (16 total processors).
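The slide's diagram does not survive in the transcript, so the Fortran/MPI sketch below is one plausible reading of the scheme described: the root broadcasts the columns of A one vector at a time, each rank handles its contiguous block of the inner-loop projections, and the pieces of Q and R are combined back at the root. It illustrates the broadcast/reduce pattern only; it is not the author's code, and the matrix size and block split are assumptions.

! A minimal sketch (not the author's exact code) of one way to
! parallelize CGS along the lines of the slide: broadcast the columns
! of A one vector at a time, let every rank compute its share of the
! projections onto the previous columns, and combine the results on
! the root. Compile with e.g.  mpif90 cgs_mpi.f90
program cgs_mpi_sketch
  use mpi
  implicit none
  integer, parameter :: m = 200, n = 200
  double precision :: A(m, n), Q(m, n), R(n, n)
  double precision :: aj(m), partial(m), proj(m)
  integer :: rank, nprocs, ierr, i, j, lo, hi, chunk

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (rank == 0) call random_number(A)   ! root owns the input matrix
  R = 0.0d0

  ! Contiguous block of "previous column" indices handled by this rank;
  ! this uniform split is what produces the load imbalance noted on
  ! slide 15 (low-numbered blocks are needed for almost every column j).
  chunk = (n + nprocs - 1) / nprocs
  lo = rank * chunk + 1
  hi = min((rank + 1) * chunk, n)

  do j = 1, n
     ! Send out the j-th column of A to all ranks, one vector at a time.
     if (rank == 0) aj = A(:, j)
     call MPI_Bcast(aj, m, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

     ! Each rank accumulates the projections onto the previous q_i
     ! that fall inside its block of the inner loop i = 1 : j-1.
     partial = 0.0d0
     do i = lo, min(hi, j - 1)
        R(i, j) = dot_product(Q(:, i), aj)
        partial = partial + R(i, j) * Q(:, i)
     end do

     ! Combine the partial projections on the root, which normalizes
     ! the residual to obtain q_j. In a fuller version the locally
     ! computed rows of R would also be collected back on the root,
     ! matching "collect Q and R one vector at a time".
     call MPI_Reduce(partial, proj, m, MPI_DOUBLE_PRECISION, MPI_SUM, &
                     0, MPI_COMM_WORLD, ierr)
     if (rank == 0) then
        aj = aj - proj
        R(j, j) = sqrt(dot_product(aj, aj))
        aj = aj / R(j, j)
     end if
     ! Every rank needs q_j for later inner products, so broadcast it.
     call MPI_Bcast(aj, m, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
     Q(:, j) = aj
  end do

  if (rank == 0) print *, 'finished CGS on', nprocs, 'ranks'
  call MPI_Finalize(ierr)
end program cgs_mpi_sketch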

  8. Parallelization with MPI

  9. Parallel Computation Results Using Frobenius Norm
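Slide 9's plot is not reproduced in the transcript. As a hedged illustration of an accuracy check based on the Frobenius norm, the function below computes ||A - QR||_F for a computed factorization; whether this is the exact quantity plotted is an assumption, and the function name is invented for illustration.

! Sketch of a Frobenius-norm accuracy check for a computed QR
! factorization: the norm of the residual A - Q*R.
function qr_residual_fro(m, n, A, Q, R) result(err)
  implicit none
  integer, intent(in) :: m, n
  double precision, intent(in) :: A(m, n), Q(m, n), R(n, n)
  double precision :: err
  double precision :: E(m, n)

  ! ||A - Q R||_F = sqrt( sum of squares of all entries )
  E = A - matmul(Q, R)
  err = sqrt(sum(E * E))
end function qr_residual_fro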

  10. Parallel Computation Results

  11. Parallel Computation Results

  12. Parallel Computation Results

  13. Parallel Computation Results

  14. Parallel Computation Results

  15. Parallel Computation Results • The send packet size should change dynamically, and point-to-point sends and receives should probably be used instead of broadcast and gather. • Also, since the CGS inner loop runs from i = 1 to j-1, the work is not distributed equally among the nodes; this should be fixed as well (see the sketch below).
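To make the imbalance concrete, the small self-contained program below counts how many inner-loop iterations each rank would execute under a contiguous block split of the index range versus a cyclic (round-robin) split. The cyclic alternative is offered only as an illustration of one way to even out the work; it is distinct from the non-uniform block sizes suggested in the conclusion, and the 4-rank, n = 16 figures are illustrative.

! Quantify the load imbalance of slide 15: total inner-loop iterations
! (i = 1 : j-1, summed over all columns j) per rank for a contiguous
! block split versus a cyclic split of the index range.
program cgs_work_balance
  implicit none
  integer, parameter :: n = 16, nprocs = 4
  integer :: work_block(0:nprocs-1), work_cyclic(0:nprocs-1)
  integer :: i, j, chunk, owner

  work_block = 0
  work_cyclic = 0
  chunk = (n + nprocs - 1) / nprocs
  do j = 1, n
     do i = 1, j - 1
        owner = (i - 1) / chunk               ! contiguous blocks of i
        work_block(owner) = work_block(owner) + 1
        owner = mod(i - 1, nprocs)            ! cyclic assignment of i
        work_cyclic(owner) = work_cyclic(owner) + 1
     end do
  end do

  print *, 'inner-loop iterations per rank, block split :', work_block
  print *, 'inner-loop iterations per rank, cyclic split:', work_cyclic
end program cgs_work_balance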

  16. Conclusion • Parallelization of CGS does give nice speed-up for large matrices. • Adding improvements to the code, such as non-uniform block sizes and an optimized communication method, would lead to even better speed-up.
