Parallel Out-of-Core LU and QR Factorization

Presentation Transcript


  1. Parallel Out-of-Core LU and QR Factorization. Brian Gunter, Center for Space Research, The University of Texas at Austin, Austin, TX (gunter@csr.utexas.edu); Enrique Quintana-Ortí, Depto. de Ingenieria y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain (quintana@icc.uji.es); Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin, Austin, TX (rvdg@cs.utexas.edu); Thierry Joffrain, Department of Computer Sciences, The University of Texas at Austin, Austin, TX (joffrain@cs.utexas.edu)

  2. Motivation • Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory. (Figure: m×n matrix with an in-core slab of columns.)

  5. Motivation • While this is effective for many applications, it is inherently unscalable • As m grows much larger than n (m >> n), each column takes more memory, so fewer columns can fit into memory
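The scaling argument above can be checked with a little arithmetic. The sketch below is illustrative only: the 1 GiB memory budget, the choice of keeping three tiles in core at once, and the function names are assumptions, not figures from the slides.

```python
import math

BYTES_PER_DOUBLE = 8

def slab_columns(memory_bytes, m):
    """Number of full length-m columns that fit in memory (slab approach)."""
    return memory_bytes // (BYTES_PER_DOUBLE * m)

def tile_side(memory_bytes, tiles_in_core=3):
    """Largest t such that `tiles_in_core` t-by-t tiles fit in memory.

    The count of three resident tiles is an assumption for illustration.
    """
    return math.isqrt(memory_bytes // (BYTES_PER_DOUBLE * tiles_in_core))

memory = 1024 ** 3  # hypothetical 1 GiB of usable memory

for m in (10 ** 5, 10 ** 6, 10 ** 7, 10 ** 8):
    # The slab width shrinks as m grows; the tile side does not depend on m.
    print(m, slab_columns(memory, m), tile_side(memory))
```

With a fixed memory budget the slab narrows toward a single column as m grows, while the tile side stays constant, which is the scalability point the slides go on to make.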

  6. Out-of-Core QR Factorization
     • Given the m×n matrix A, we wish to compute the factorization A = QR, where $Q = I + Y T Y^{T}$ (compact WY representation)
     • Q is an orthogonal matrix
     • R is upper triangular
     • Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
     • T is r×r upper triangular
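For reference, here is one standard way to accumulate T one Householder vector at a time, written in the slide's sign convention $Q = I + Y T Y^{T}$. It assumes reflectors of the form $H_j = I - \tau_j v_j v_j^{T}$, which the slides do not specify; the superscript $(j)$ counts accumulated vectors and is unrelated to the tile index i used on later slides. Many references write $Q = I - Y T Y^{T}$ instead, in which case T has the opposite sign.

```latex
\begin{aligned}
Q^{(j)} &= H_1 H_2 \cdots H_j = I + Y^{(j)} T^{(j)} \bigl(Y^{(j)}\bigr)^{T},
\qquad H_j = I - \tau_j v_j v_j^{T}, \\
Y^{(j)} &= \begin{pmatrix} Y^{(j-1)} & v_j \end{pmatrix}, \qquad
T^{(j)} = \begin{pmatrix}
  T^{(j-1)} & -\tau_j\, T^{(j-1)} \bigl(Y^{(j-1)}\bigr)^{T} v_j \\
  0 & -\tau_j
\end{pmatrix}, \qquad
T^{(1)} = -\tau_1 .
\end{aligned}
```

Expanding $Q^{(j-1)} H_j$ and matching the terms in $v_j^{T}$ gives the new last column of $T^{(j)}$, which is why T stays upper triangular.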

  7. QR Factorization: Out-of-Core Implementation. Step 1: Begin with an unfactored matrix that resides on disk. (Figure legend: stored on disk / in memory.)

  8. QR Factorization: Out-of-Core Implementation. Step 2: Divide the matrix into a mesh of t×t tiles, where each tile is stored as a separate file.
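A minimal sketch of the one-file-per-tile layout described in Step 2, using only the Python standard library. The file naming scheme, the raw double-precision format, and the helper names `write_tiles` and `read_tile` are assumptions for illustration; the slides do not describe the actual POOCLAPACK file format.

```python
import os
import tempfile
from array import array

def write_tiles(matrix, t, directory):
    """Split a dense matrix (list of rows) into t-by-t tiles, one file per tile."""
    m, n = len(matrix), len(matrix[0])
    index = {}
    for bi, r0 in enumerate(range(0, m, t)):
        for bj, c0 in enumerate(range(0, n, t)):
            rows = [row[c0:c0 + t] for row in matrix[r0:r0 + t]]
            path = os.path.join(directory, f"tile_{bi}_{bj}.bin")  # hypothetical naming
            with open(path, "wb") as f:
                array("d", [x for row in rows for x in row]).tofile(f)
            # Edge tiles may be smaller than t-by-t, so record the actual shape.
            index[(bi, bj)] = (path, len(rows), len(rows[0]))
    return index

def read_tile(index, bi, bj):
    """Read a single tile back into memory as a list of rows."""
    path, rows, cols = index[(bi, bj)]
    data = array("d")
    with open(path, "rb") as f:
        data.fromfile(f, rows * cols)
    return [list(data[r * cols:(r + 1) * cols]) for r in range(rows)]

with tempfile.TemporaryDirectory() as tmp:
    A = [[float(10 * i + j) for j in range(5)] for i in range(6)]
    index = write_tiles(A, 2, tmp)
    tile = read_tile(index, 2, 1)  # rows 4-5, columns 2-3 of A
```

Only the tiles an operation needs are read, so the in-memory footprint is set by t rather than by the matrix dimensions.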

  9. QR Factorization: Out-of-Core Implementation. Step 3: Read in the first tiles and factor them, saving the T matrices and overwriting the lower tile with Y. (Figure labels: T_i, Y_i.)

  13. QR Factorization: Out-of-Core Implementation. Step 4: Read in the remaining tiles in the row and apply $Q = I + Y_i T_i Y_i^{T}$, reading Y_i in one panel at a time. (Figure labels: T_i, Y_i.)

  21. QR Factorization: Out-of-Core Implementation. Step 5: Factor the next tile in the first column using the QR update algorithm. (Figure labels: T_i, Y_i.)
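The idea behind the QR update in Step 5 can be illustrated on a small example: factoring the already computed R stacked on top of the next tile yields the same R (up to row signs) as factoring both tiles at once. The sketch below uses a generic pure-Python Householder routine and made-up 2×2 tiles; it does not exploit the triangular structure of R the way a tuned update algorithm would, and it is not the POOCLAPACK implementation.

```python
import math

def householder_r(A):
    """Return the n-by-n upper-triangular factor R of an m-by-n matrix (m >= n)."""
    R = [row[:] for row in A]
    m, n = len(R), len(R[0])
    for k in range(n):
        norm = math.sqrt(sum(R[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue  # column already zero below the diagonal
        alpha = -math.copysign(norm, R[k][k])
        v = [R[i][k] for i in range(k, m)]
        v[0] -= alpha  # Householder vector v = x - alpha * e1
        vtv = sum(x * x for x in v)
        for j in range(k, n):
            s = 2.0 * sum(v[i] * R[k + i][j] for i in range(len(v))) / vtv
            for i in range(len(v)):
                R[k + i][j] -= s * v[i]
    return [[R[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]

def gram(M):
    """Return M^T M, which equals R^T R for any QR factorization of M."""
    n = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(n)] for i in range(n)]

A1 = [[4.0, 1.0], [2.0, 3.0]]  # top tile (hypothetical data)
A2 = [[1.0, 5.0], [3.0, 2.0]]  # next tile in the same column

R1 = householder_r(A1)                # Step 3: factor the first tile
R_updated = householder_r(R1 + A2)    # Step 5: factor R1 stacked on A2
R_direct = householder_r(A1 + A2)     # reference: factor the whole column
```

Because only R1 and one new tile are needed at a time, the column can be processed tile by tile without ever holding all of it in memory.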

  25. QR Factorization: Out-of-Core Implementation. Step 6: Apply the transformations to the remaining tiles in the row. (Figure labels: T_i, Y_i.)

  30. QR Factorization: Out-of-Core Implementation. Step 7: Repeat Steps 5 and 6 for any remaining rows of tiles. (Figure labels: T_i, Y_i.)

  33. QR Factorization: Out-of-Core Implementation. Step 8: Repeat Steps 1-7 on the lower quadrant. Continue until the entire matrix has been factored.
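The order in which tiles are visited in Steps 3 through 8 can be sketched as a simple schedule generator. The operation names (`factor`, `apply`, `update_factor`, `update_apply`) are hypothetical labels chosen here, not routine names from PLAPACK or POOCLAPACK, and the sketch lists the operations without performing any arithmetic or I/O.

```python
def tile_qr_schedule(row_tiles, col_tiles):
    """List the tile operations of the out-of-core QR walk-through, in order."""
    ops = []
    for k in range(min(row_tiles, col_tiles)):
        # Step 3: factor the diagonal tile of the current quadrant.
        ops.append(("factor", k, k))
        # Step 4: apply that transformation to the rest of the tile row.
        for j in range(k + 1, col_tiles):
            ops.append(("apply", k, j))
        # Steps 5-7: sweep down the tile column, one row of tiles at a time.
        for i in range(k + 1, row_tiles):
            ops.append(("update_factor", i, k))
            for j in range(k + 1, col_tiles):
                ops.append(("update_apply", i, j))
        # Step 8: the next pass of the loop handles the lower quadrant.
    return ops

for op in tile_qr_schedule(2, 2):
    print(op)
```

Each operation touches only a handful of tiles, so the working set is bounded by the tile size regardless of how many rows of tiles the matrix has.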

  34. Out-of-Core LU Factorization
      • Given the m×n matrix A, we wish to compute the factorization PA = LU
      • P is a permutation matrix
      • U is n×n upper triangular
      • L is lower trapezoidal
      • The implementation is analogous to the out-of-core QR factorization

  35. LU Factorization: Out-of-Core Implementation. Step 1: Factor the first tile, saving the permutation matrix. (Figure labels: U_i, P_i, L_i.)

  36. LU Factorization: Out-of-Core Implementation. Step 2: Update the remaining tiles in the row using panels of L and the saved permutation matrices. (Figure labels: U_i, P_i, L_i.)

  37. LU Factorization: Out-of-Core Implementation. Step 3: Factor the next tile in the first column using the LU update algorithm. (Figure labels: U_i, P_i, L_i.)

  38. LU Factorization: Out-of-Core Implementation. Step 4: Update the remaining tiles in the row using panels of L and the stored permutation matrices. (Figure labels: U_i, P_i, L_i.)
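LU Steps 1 and 2 can be illustrated on a single pair of tiles: factor one tile with partial pivoting, then update the tile to its right by applying the saved permutation and solving with L. The sketch below is plain Python on made-up data and assumes a nonsingular tile; it does not cover the LU update algorithm of Step 3, where pivoting spans more than one tile, and the function names are hypothetical.

```python
def matmul(X, Y):
    """Dense matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lu_partial_pivot(A):
    """Return (perm, L, U) such that the rows A[perm] equal L times U."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Pick the largest entry in column k as the pivot and swap rows.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        U[k], U[p] = U[p], U[k]
        L[k], L[p] = L[p], L[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]  # assumes a nonzero pivot
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    for i in range(n):
        L[i][i] = 1.0
    return perm, L, U

def update_row_tile(perm, L, B):
    """Apply the saved permutation to B, then solve L X = P B by forward substitution."""
    X = [B[p][:] for p in perm]
    for i in range(len(L)):
        for k in range(i):
            for j in range(len(X[i])):
                X[i][j] -= L[i][k] * X[k][j]
    return X

A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]  # first tile (hypothetical)
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                 # tile to its right

perm, L, U = lu_partial_pivot(A)     # Step 1
U12 = update_row_tile(perm, L, B)    # Step 2
```

After both steps, the permuted block row [A B] equals L times [U U12], which is the invariant each row update has to maintain.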

  39. Development Environment
      • Parallel Linear Algebra Package (PLAPACK)
        • Optimized parallel routines (FORTRAN and C interfaces)
        • 'View-based' infrastructure
        • Uses standard MPI and BLAS libraries
      • Parallel Out-of-Core Linear Algebra Package (POOCLAPACK)
        • Out-of-core extension to PLAPACK
        • Handles the complexity of the I/O operations (i.e., hidden from the user)
        • Uses standard read/write functions for portability

  40. Performance of Parallel OOC QR. IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 Gflops, DGEMM at 3.723 Gflops.

  41. Performance for Sequential OOC LU

  42. Earth Science Application • Gravity Recovery And Climate Experiment (GRACE) • A collaborative effort between • The University of Texas Center for Space Research (CSR) • The Jet Propulsion Laboratory (JPL) • GeoForschungsZentrum (GFZ) • Deutschen Zentrum für Luft- und Raumfahrt (DLR) • National Aeronautics and Space Administration (NASA)

  43. Earth Science Application
      • Goal was to compute a rigorous 360x360 gravity model
        • No approximation techniques
        • Translates to roughly 100 km² resolution
        • Involves the least squares estimation of ~130,000 parameters
      • Requires the combination of hundreds of millions of observations
        • Surface gravity data (land) – ½ TB
        • Altimetry-based mean sea surface data (ocean)
        • GRACE data (satellite)
      • Using the new parallel OOC QR algorithm
        • A 360x360 field was generated, complete with full covariance
        • Largest rigorous gravity field model ever created
        • Used a single IBM P690 node
        • OOC QR required only 32 GB
        • To do in-core would require 165 GB of memory
        • Required ~6 days of wall clock time to compute (2326 CPU hours)
        • A single-processor machine with sufficient memory would require 3.2 months
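Several of the figures on this slide are consistent with one another, as a quick calculation shows. The sketch assumes that "360x360" means spherical harmonic degree and order 360 with all (N+1)² coefficients estimated, and it uses a 30-day month; both are assumptions made here, not statements from the slides.

```python
max_degree = 360

# A full spherical harmonic expansion to degree and order N has (N + 1)^2 coefficients.
parameters = (max_degree + 1) ** 2

cpu_hours = 2326   # from the slide
wall_days = 6      # "~6 days" on the slide, so results below are approximate

single_cpu_days = cpu_hours / 24
single_cpu_months = single_cpu_days / 30.0          # assumes 30-day months
effective_processors = cpu_hours / (wall_days * 24)

print(parameters)
print(round(single_cpu_months, 1))
print(round(effective_processors, 1))
```

The parameter count lands close to the quoted ~130,000, the single-processor estimate rounds to the quoted 3.2 months, and the ratio of CPU hours to wall-clock hours suggests roughly 16 processors were kept busy on average.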

  44. Conclusion
      • Tile-based out-of-core algorithms provide scalability
        • The tile size is based on the memory of the machine (i.e., fixed) and is independent of the problem size
      • Algorithms achieve excellent performance
        • The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
        • This helps to offset the I/O cost associated with moving the tiles to and from disk
      • Use of PLAPACK and POOCLAPACK greatly simplified the implementation
        • Reduces complexity of code
        • Makes code portable
      • Has already proven valuable to Earth science applications

  45. Conclusion
      • Broad spectrum of applications
        • Large-scale problems
        • Small clusters
        • Embedded systems
        • Other small-memory machines
      • Tile-based OOC approach can be extended to other dense linear algebra operations
        • Cholesky, matrix inverse, BLAS-3, etc.
      • Goal is to provide a full suite of OOC utilities
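The conclusion's point that large tiles offset the I/O cost can be made concrete with a rough operation count. The sketch assumes one t×t matrix-matrix update C += A·B per three tiles read and one tile written, in double precision; this accounting is a simplification chosen here, not a measurement or a model taken from the slides.

```python
BYTES_PER_DOUBLE = 8

def gemm_flops_per_byte(t):
    """Flops performed per byte of disk traffic for one t-by-t update C += A * B."""
    flops = 2 * t ** 3                              # one multiply and one add per term
    bytes_moved = 4 * t * t * BYTES_PER_DOUBLE      # read A, B, C and write C back
    return flops / bytes_moved

for t in (100, 1600, 6400):
    print(t, gemm_flops_per_byte(t))
```

Under these assumptions the ratio is t/16, so it grows linearly with the tile side: the larger the tile that memory allows, the more computation is done per byte moved to or from disk.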

  46. For More Information • Visit the PLAPACK website: www.cs.utexas.edu/users/plapack • Visit the GRACE website: www.csr.utexas.edu/grace
