Parallel Out-of-Core LU and QR Factorization
Brian Gunter, Center for Space Research, The University of Texas at Austin, Austin, TX, gunter@csr.utexas.edu
Enrique Quintana-Ortí, Depto. de Ingenieria y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain, quintana@icc.uji.es
Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin, Austin, TX, rvdg@cs.utexas.edu
Thierry Joffrain, Department of Computer Sciences, The University of Texas at Austin, Austin, TX, joffrain@cs.utexas.edu
Motivation
• Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory.
[Figure: m×n out-of-core matrix with a slab of full-height columns held in-core]
Motivation
• While the slab approach is effective for many applications, it is inherently unscalable.
• As m grows relative to n (m >> n), fewer full-height columns fit into memory.
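The scaling argument can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names, the 32 GB memory budget expressed as 8-byte elements, and the assumption that three square tiles must be resident at once are hypothetical choices, not figures from the slides.

```python
import math

def slab_width(mem_elems, m):
    """Number of full-height columns of an m-row matrix that fit in memory."""
    return mem_elems // m

def tile_size(mem_elems, tiles_in_core=3):
    """Edge length t of a square tile when `tiles_in_core` tiles must fit.

    The value 3 is an assumption made for illustration.
    """
    return math.isqrt(mem_elems // tiles_in_core)

# Assumed budget: 32 GB of 8-byte doubles.
mem_elems = 32 * 10**9 // 8

for m in (10**5, 10**6, 10**7, 10**8):
    # The slab narrows as m grows; the tile size never depends on m.
    print(m, slab_width(mem_elems, m), tile_size(mem_elems))
```

Under these assumptions the slab shrinks from 40,000 columns to 40 as m grows from 10^5 to 10^8, while the tile edge stays fixed.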
Out-of-Core QR Factorization
• Given the m×n matrix A, we wish to compute the factorization A = QR
  • Q is an orthogonal matrix
  • R is upper triangular
• Compact WY representation: Q = I + Y T Y^T
  • Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
  • T is r×r upper triangular
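To make the compact WY form concrete, the following dependency-free sketch factors a small in-core matrix with Householder reflections and accumulates Y and T so that Q = I + Y T Y^T. It is not the PLAPACK implementation; the function names are hypothetical, the input is assumed to have full column rank with m >= n, and the recurrence for T is the standard forward accumulation with the sign chosen to match the "+" convention used on the slide.

```python
import math

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def qr_compact_wy(A):
    """Householder QR of a full-rank m x n matrix (m >= n).

    Returns (Y, T, R) with A = Q R and Q = I + Y T Y^T, where Y is unit
    lower trapezoidal and T is upper triangular.
    """
    m, n = len(A), len(A[0])
    R = [list(map(float, row)) for row in A]
    Y = [[0.0] * n for _ in range(m)]
    T = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Householder vector v (v[0] = 1) with H = I - tau v v^T, H x = beta e_1.
        x = [R[i][j] for i in range(j, m)]
        beta = -math.copysign(math.sqrt(sum(xi * xi for xi in x)), x[0])
        tau = (beta - x[0]) / beta
        v = [1.0] + [xi / (x[0] - beta) for xi in x[1:]]
        # Apply H to columns j..n-1 of the trailing rows.
        for c in range(j, n):
            w = tau * sum(v[i] * R[j + i][c] for i in range(m - j))
            for i in range(m - j):
                R[j + i][c] -= w * v[i]
        for i in range(m - j):
            Y[j + i][j] = v[i]
        # Extend T: new column is -tau * T[:j,:j] (Y[:, :j]^T v); diagonal is -tau.
        z = [sum(Y[j + i][k] * v[i] for i in range(m - j)) for k in range(j)]
        for k in range(j):
            T[k][j] = -tau * sum(T[k][l] * z[l] for l in range(k, j))
        T[j][j] = -tau
    return Y, T, R

def form_q(Y, T):
    """Form Q = I + Y T Y^T explicitly (for checking only)."""
    m = len(Y)
    YTYt = matmul(matmul(Y, T), transpose(Y))
    return [[YTYt[i][k] + (1.0 if i == k else 0.0) for k in range(m)] for i in range(m)]
```

Calling `Y, T, R = qr_compact_wy(A)` and then `matmul(form_q(Y, T), R)` should reproduce A up to round-off. In practice Q is never formed explicitly; only Y and T are stored and applied.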
QR Factorization: Out-of-Core Implementation
Step 1: Begin with an unfactored matrix that resides on disk.
[Figure legend: shading distinguishes tiles stored on disk from tiles in memory]
QR Factorization: Out-of-Core Implementation
Step 2: Divide the matrix into a mesh of t×t tiles, where each tile is stored as a separate file.
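One way to picture the tile-per-file layout is the sketch below. The file naming scheme, the small binary header, and the function names are all hypothetical; the slides do not describe POOCLAPACK's actual on-disk format.

```python
import os
import struct

def tile_path(root, i, j):
    """Hypothetical naming scheme: one file per tile, indexed by tile row/column."""
    return os.path.join(root, f"tile_{i:04d}_{j:04d}.bin")

def write_tiles(A, t, root):
    """Split matrix A (list of rows) into t x t tiles and write each to its own file.

    Edge tiles may be smaller than t x t. Returns {(tile_row, tile_col): path}.
    """
    m, n = len(A), len(A[0])
    paths = {}
    for i in range(0, m, t):
        for j in range(0, n, t):
            rows = [A[r][j:j + t] for r in range(i, min(i + t, m))]
            path = tile_path(root, i // t, j // t)
            with open(path, "wb") as f:
                # Header: tile dimensions as two little-endian 64-bit integers.
                f.write(struct.pack("<2q", len(rows), len(rows[0])))
                for row in rows:
                    f.write(struct.pack(f"<{len(row)}d", *row))
            paths[(i // t, j // t)] = path
    return paths

def read_tile(path):
    """Read one tile back into memory as a list of rows."""
    with open(path, "rb") as f:
        nr, nc = struct.unpack("<2q", f.read(16))
        return [list(struct.unpack(f"<{nc}d", f.read(8 * nc))) for _ in range(nr)]
```

With this layout, an algorithm only ever opens the handful of tile files it is currently working on, so memory use is bounded by the tile size rather than by the matrix dimensions.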
QR Factorization: Out-of-Core Implementation
Step 3: Read in the first tiles and factor them, saving the T matrices (T_i) and overwriting the lower tile with Y_i.
QR Factorization: Out-of-Core Implementation
Step 4: Read in the remaining tiles in the row and apply Q_i = I + Y_i T_i Y_i^T, reading Y_i in one panel at a time.
QR Factorization: Out-of-Core Implementation
Step 5: Factor the next tile in the first column using the QR update algorithm.
QR Factorization: Out-of-Core Implementation
Step 6: Apply the transformations to the remaining tiles in the row.
QR Factorization: Out-of-Core Implementation
Step 7: Repeat Steps 5 and 6 for any remaining rows of tiles.
QR Factorization: Out-of-Core Implementation
Step 8: Repeat Steps 1-7 on the lower quadrant. Continue until the entire matrix has been factored.
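The order in which tiles are visited in Steps 3 through 8 can be summarized as a simple schedule. The generator below is a sketch of that ordering as read from the slides; the operation names are hypothetical labels, and no numerical work or I/O is performed.

```python
def qr_tile_schedule(row_tiles, col_tiles):
    """Yield (operation, tile_row, tile_col) in the order of Steps 3-8.

    The operation names are illustrative labels only:
      factor         QR-factor the diagonal tile (Step 3)
      apply          apply Q_k to a tile to its right (Step 4)
      update_factor  fold tile (i, k) into the factorization (Step 5)
      update_apply   apply that update to tile (i, j) and tile (k, j) (Step 6)
    """
    for k in range(min(row_tiles, col_tiles)):      # Step 8: move to lower quadrant
        yield ("factor", k, k)
        for j in range(k + 1, col_tiles):
            yield ("apply", k, j)
        for i in range(k + 1, row_tiles):           # Step 7: remaining rows of tiles
            yield ("update_factor", i, k)
            for j in range(k + 1, col_tiles):
                yield ("update_apply", i, j)

for op in qr_tile_schedule(3, 2):
    print(op)
```

Each operation touches only a few tiles at a time, which is what keeps the in-core footprint independent of the overall matrix size.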
Out-of-Core LU Factorization
• Given the m×n matrix A, we wish to compute the factorization PA = LU
  • P is a permutation matrix
  • U is n×n upper triangular
  • L is lower trapezoidal
• The implementation is analogous to the out-of-core QR factorization
LU Factorization: Out-of-Core Implementation
Step 1: Factor the first tile into L_i and U_i, saving the permutation matrix P_i.
LU Factorization: Out-of-Core Implementation
Step 2: Update the remaining tiles in the row using panels of L and the saved permutation matrices.
LU Factorization: Out-of-Core Implementation
Step 3: Factor the next tile in the first column using the LU update algorithm.
LU Factorization: Out-of-Core Implementation
Step 4: Update the remaining tiles in the row using panels of L and the saved permutation matrices.
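As an illustration of Step 1 only, the sketch below factors a single square in-core tile with partial pivoting and returns the row permutation alongside L and U. The function name is hypothetical, the tile is assumed to be square and nonsingular, and the LU update algorithm of Step 3 is not shown.

```python
def lu_partial_pivot(A):
    """LU factorization with partial pivoting of a square tile.

    Returns (perm, L, U) such that row i of L U equals row perm[i] of A,
    i.e. P A = L U with P encoded as the index list `perm`.
    """
    n = len(A)
    U = [list(map(float, row)) for row in A]
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Choose the row with the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda r: abs(U[r][k]))
        if U[p][k] == 0.0:
            raise ValueError("tile is singular")
        # Swap rows of U, the already-computed multipliers in L, and the record.
        U[k], U[p] = U[p], U[k]
        L[k], L[p] = L[p], L[k]
        perm[k], perm[p] = perm[p], perm[k]
        L[k][k] = 1.0
        for r in range(k + 1, n):
            L[r][k] = U[r][k] / U[k][k]
            for c in range(k, n):
                U[r][c] -= L[r][k] * U[k][c]
    return perm, L, U
```

Saving `perm` is what allows the same row interchanges to be replayed later on the other tiles in the row, as Steps 2 and 4 require.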
Development Environment
• Parallel Linear Algebra Package (PLAPACK)
  • Optimized parallel routines (Fortran and C interfaces)
  • "View-based" infrastructure
  • Uses standard MPI and BLAS libraries
• Parallel Out-of-Core Linear Algebra Package (POOCLAPACK)
  • Out-of-core extension to PLAPACK
  • Handles the complexity of the I/O operations (i.e., hidden from the user)
  • Uses standard read/write functions for portability
Performance of Parallel OOC QR
IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 Gflops, DGEMM rate of 3.723 Gflops
Earth Science Application
• Gravity Recovery And Climate Experiment (GRACE)
• A collaborative effort between
  • The University of Texas Center for Space Research (CSR)
  • The Jet Propulsion Laboratory (JPL)
  • GeoForschungsZentrum (GFZ)
  • Deutschen Zentrum für Luft- und Raumfahrt (DLR)
  • National Aeronautics and Space Administration (NASA)
Earth Science Application
• Goal was to compute a rigorous 360x360 gravity model
  • No approximation techniques
  • Translates to roughly 100 km² resolution
  • Involves the least squares estimation of ~130,000 parameters
  • Requires the combination of hundreds of millions of observations
    • Surface gravity data (land): ½ TB
    • Altimetry-based mean sea surface data (ocean)
    • GRACE data (satellite)
• Using the new parallel OOC QR algorithm
  • A 360x360 field was generated, complete with full covariance
  • Largest rigorous gravity field model ever created
• Used a single IBM P690 node
  • OOC QR required only 32 GB
  • To do in-core would require 165 GB of memory
  • Required ~6 days of wall clock time to compute (2326 CPU hours)
  • A single processor machine with sufficient memory would require 3.2 months
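Several of the figures above can be cross-checked with simple arithmetic. The sketch below assumes that a degree-and-order-N spherical harmonic model has at most (N + 1)² coefficients and that a month is 30 days; the implied average processor count is derived here and is not a number stated on the slides.

```python
def max_coefficients(degree):
    """Upper bound on spherical harmonic coefficients up to a given degree/order."""
    return (degree + 1) ** 2

def cpu_hours_to_months(cpu_hours, days_per_month=30):
    """Serial run time in months, assuming 30-day months."""
    return cpu_hours / 24 / days_per_month

def implied_processors(cpu_hours, wall_days):
    """Average number of busy processors implied by CPU hours and wall time."""
    return cpu_hours / (wall_days * 24)

print(max_coefficients(360))                    # compare with ~130,000 parameters
print(round(cpu_hours_to_months(2326), 1))      # compare with 3.2 months
print(round(implied_processors(2326, 6), 1))    # derived, not from the slides
```

The first two values line up with the "~130,000 parameters" and "3.2 months" bullets, which suggests the slide's numbers are mutually consistent under these assumptions.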
Conclusion
• Tile-based out-of-core algorithms provide scalability
  • Size of the tile is based on the memory of the machine (i.e., fixed) and is independent of the problem size
• Algorithms achieve excellent performance
  • The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
  • This helps to offset the I/O cost associated with moving the tiles to and from disk
• Use of PLAPACK and POOCLAPACK greatly simplified the implementation
  • Reduces complexity of code
  • Makes code portable
• Has already proven valuable to Earth science applications
Conclusion
• Broad spectrum of applications
  • Large scale problems
  • Small clusters
  • Embedded systems
  • Other small memory machines
• Tile-based OOC approach can be extended to other dense linear algebra operations
  • Cholesky, matrix inverse, BLAS-3, etc.
• Goal is to provide a full suite of OOC utilities
For More Information
• Visit the PLAPACK website: www.cs.utexas.edu/users/plapack
• Visit the GRACE website: www.csr.utexas.edu/grace