Parallel Out-of-Core LU and QR Factorization

Brian Gunter, Center for Space Research, The University of Texas at Austin, Austin, TX (gunter@csr.utexas.edu)
Enrique Quintana-Ortí, Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain (quintana@icc.uji.es)
Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin, Austin, TX (rvdg@cs.utexas.edu)
Thierry Joffrain, Department of Computer Sciences, The University of Texas at Austin, Austin, TX (joffrain@cs.utexas.edu)

Motivation
• Traditional methods use a slab approach, in which entire columns of the out-of-core matrix are brought into memory.
[Figure: an m×n matrix on disk with a slab of full-height columns held in-core]

Motivation
• While this is effective for many applications, it is inherently unscalable.
• As m grows much larger than n (m >> n), fewer columns fit into memory.
[Figure: a tall m×n matrix with m >> n]
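The shrinking slab can be illustrated with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: the 32 GB budget and the 8-byte (double precision) entries are assumptions chosen for the example, and `slab_columns` is a hypothetical helper name.

```python
def slab_columns(memory_bytes, m, bytes_per_entry=8):
    """Number of full-height columns of an m-row matrix that fit in memory."""
    return memory_bytes // (m * bytes_per_entry)

memory = 32 * 10**9  # assumed memory budget in bytes
for m in (10**5, 10**6, 10**7, 10**8):
    width = slab_columns(memory, m)
    print(f"m = {m:>11,d}: slab width = {width:>6,d} columns")
```

Each tenfold increase in m cuts the slab width by a factor of ten, so for very tall matrices the slab degenerates to a handful of columns.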

Out-of-Core QR Factorization
• Given the m×n matrix A, we wish to compute the factorization A = QR
  • Q is an orthogonal matrix
  • R is upper triangular
• Compact WY representation: Q = I + Y T Y^T
  • Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
  • T is r×r upper triangular
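A minimal in-core sketch of the compact WY form may help make the notation concrete. It assumes m ≥ n, uses NumPy, and accumulates T with the sign convention Q = I + Y T Y^T from the slide (so the diagonal of T holds −τ_j, where each reflector is I − τ_j v_j v_j^T). The function names `householder` and `qr_compact_wy` are hypothetical and are not PLAPACK routines.

```python
import numpy as np

def householder(x):
    """Return (v, tau) with v[0] == 1 such that (I - tau v v^T) x = beta e_1."""
    alpha = x[0]
    norm_x = np.linalg.norm(x)
    if norm_x == 0.0:
        v = np.zeros_like(x)
        v[0] = 1.0
        return v, 0.0
    beta = -np.copysign(norm_x, alpha)  # sign chosen to avoid cancellation
    v = x / (alpha - beta)
    v[0] = 1.0
    tau = (beta - alpha) / beta
    return v, tau

def qr_compact_wy(A):
    """Householder QR returning Y, T, R with Q = I + Y T Y^T and A = Q[:, :n] R."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Y = np.zeros((m, n))
    T = np.zeros((n, n))
    for j in range(n):
        v, tau = householder(A[j:, j])
        # Apply the reflector to the trailing columns (including column j).
        A[j:, j:] -= tau * np.outer(v, v @ A[j:, j:])
        Y[j:, j] = v
        # Extend T by one column so that I + Y T Y^T equals H_1 H_2 ... H_j.
        T[:j, j] = -tau * (T[:j, :j] @ (Y[:, :j].T @ Y[:, j]))
        T[j, j] = -tau
    return Y, T, np.triu(A[:n, :])

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
Y, T, R = qr_compact_wy(A)
Q = np.eye(8) + Y @ T @ Y.T
print(np.allclose(Q.T @ Q, np.eye(8)), np.allclose(Q[:, :3] @ R, A))
```

Storing only Y and T, rather than Q itself, is what allows the transformations to be kept in the zeroed-out part of the matrix plus a small r×r block.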

QR Factorization: Out-of-Core Implementation
Step 1: Begin with an unfactored matrix that resides on disk.
[Figure legend: tiles stored on disk vs. tiles in memory]

QR Factorization: Out-of-Core Implementation
Step 2: Divide the matrix into a mesh of t×t tiles, where each tile is stored as a separate file.
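The tile-per-file layout can be sketched as follows. This is an illustration only, not the POOCLAPACK I/O layer: the file naming scheme, the `.npy` format, and the helper names `tile_path`, `write_tiles`, and `read_tile` are all assumptions, and the matrix is held in memory here purely so the example is self-contained.

```python
import os
import tempfile
import numpy as np

def tile_path(directory, i, j):
    """Hypothetical naming scheme: one file per tile (i, j)."""
    return os.path.join(directory, f"tile_{i}_{j}.npy")

def write_tiles(A, t, directory):
    """Split A into t-by-t tiles (edge tiles may be smaller) and save each one."""
    m, n = A.shape
    rows, cols = -(-m // t), -(-n // t)  # ceiling division
    for i in range(rows):
        for j in range(cols):
            np.save(tile_path(directory, i, j),
                    A[i * t:(i + 1) * t, j * t:(j + 1) * t])
    return rows, cols

def read_tile(directory, i, j):
    """Load a single tile into memory."""
    return np.load(tile_path(directory, i, j))

with tempfile.TemporaryDirectory() as d:
    A = np.arange(30.0).reshape(6, 5)
    rows, cols = write_tiles(A, 2, d)
    print(rows, cols, read_tile(d, 0, 0).shape)
```

Because each tile is its own file, an algorithm only ever needs to hold the few tiles it is currently working on.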

QR Factorization: Out-of-Core Implementation
Step 3: Read in the first tiles and factor them, saving the T matrices (T_i) and overwriting the lower tile with Y (Y_i).

QR Factorization: Out-of-Core Implementation
Step 4: Read in the remaining tiles in the row and apply Q = I + Y_i T_i Y_i^T, reading Y_i in one panel at a time.

QR Factorization: Out-of-Core Implementation
Step 5: Factor the next tile in the first column using the QR update algorithm.

QR Factorization: Out-of-Core Implementation
Step 6: Apply the transformations to the remaining tiles in the row.

QR Factorization: Out-of-Core Implementation
Step 7: Repeat Steps 5 and 6 for any remaining rows of tiles.

QR Factorization: Out-of-Core Implementation
Step 8: Repeat Steps 1-7 on the lower-right quadrant (the trailing submatrix). Continue until the entire matrix has been factored.
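The control flow of Steps 3 through 8 can be sketched in-core with NumPy. This is a simplified stand-in under several assumptions: array slices take the place of tile reads and writes, m and n are multiples of t with m ≥ n, `np.linalg.qr` on the stacked pair [R; tile] replaces a structure-exploiting QR update, and Q is formed explicitly instead of being kept as Y and T. The function name `tiled_qr_r` is hypothetical.

```python
import numpy as np

def tiled_qr_r(A, t):
    """Return the R factor of A computed tile by tile (in-core sketch)."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    for k in range(0, n, t):
        d = slice(k, k + t)
        # Step 3: factor the diagonal tile.
        Q, R = np.linalg.qr(A[d, d], mode="complete")
        A[d, d] = R
        # Step 4: apply the transformation to the remaining tiles in the row.
        A[d, k + t:] = Q.T @ A[d, k + t:]
        for i in range(k + t, m, t):
            r = slice(i, i + t)
            # Step 5: QR update, factoring the current R stacked on the next tile.
            Q, R = np.linalg.qr(np.vstack([A[d, d], A[r, d]]), mode="complete")
            A[d, d] = R[:t]
            A[r, d] = 0.0
            # Step 6: apply the same transformation to both rows of tiles.
            C = Q.T @ np.vstack([A[d, k + t:], A[r, k + t:]])
            A[d, k + t:] = C[:t]
            A[r, k + t:] = C[t:]
        # Steps 7 and 8: the loops continue over remaining rows, then the
        # trailing submatrix.
    return np.triu(A[:n, :n])

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
R = tiled_qr_r(A, 2)
print(np.allclose(R.T @ R, A.T @ A))
```

At any point the sketch touches at most two rows of tiles, which is what keeps the working set independent of m.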

Out-of-Core LU Factorization
• Given the m×n matrix A, we wish to compute the factorization PA = LU
  • P is a permutation matrix
  • U is n×n upper triangular
  • L is lower trapezoidal
• Implementation is analogous to the out-of-core QR factorization

LU Factorization: Out-of-Core Implementation
Step 1: Factor the first tile into L_i and U_i, saving the permutation matrix P_i.
[Figure legend: tiles stored on disk vs. tiles in memory]

LU Factorization: Out-of-Core Implementation
Step 2: Update the remaining tiles in the row using panels of L and the saved permutation matrices.

LU Factorization: Out-of-Core Implementation
Step 3: Factor the next tile in the first column using the LU update algorithm.

LU Factorization: Out-of-Core Implementation
Step 4: Update the remaining tiles in the row using panels of L and the stored permutation matrices.
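The LU steps above can be sketched in the same in-core style. The slides do not spell out the LU update algorithm, so this sketch assumes one plausible reading: partial pivoting within each stacked pair [U; tile]. It also assumes a square matrix whose order is a multiple of t, carries a right-hand side b along as an extra column instead of storing L and P, and uses hypothetical function names throughout.

```python
import numpy as np

def lu_partial_pivot(S):
    """LU with partial pivoting of an m-by-n block (m >= n): S[perm] = L @ U."""
    S = np.array(S, dtype=float)
    m, n = S.shape
    perm = np.arange(m)
    for j in range(n):
        p = j + np.argmax(np.abs(S[j:, j]))
        S[[j, p]] = S[[p, j]]
        perm[[j, p]] = perm[[p, j]]
        S[j + 1:, j] /= S[j, j]
        S[j + 1:, j + 1:] -= np.outer(S[j + 1:, j], S[j, j + 1:])
    L = np.tril(S, -1) + np.eye(m, n)  # unit lower trapezoidal
    return perm, L, np.triu(S[:n])

def apply_lu(perm, L, C):
    """Apply the row swaps and eliminations recorded in (perm, L) to C."""
    n = L.shape[1]
    C = C[perm]
    top = np.linalg.solve(L[:n], C[:n])
    return np.vstack([top, C[n:] - L[n:] @ top])

def tiled_lu_solve(A, b, t):
    """Solve A x = b by tile-wise elimination followed by a triangular solve."""
    n = A.shape[0]
    M = np.hstack([np.array(A, dtype=float), np.reshape(b, (n, 1))])
    for k in range(0, n, t):
        d = slice(k, k + t)
        perm, L, U = lu_partial_pivot(M[d, d])          # Step 1
        M[d, k + t:] = apply_lu(perm, L, M[d, k + t:])  # Step 2
        M[d, d] = U
        for i in range(k + t, n, t):
            r = slice(i, i + t)
            perm, L, U = lu_partial_pivot(np.vstack([M[d, d], M[r, d]]))  # Step 3
            C = apply_lu(perm, L, np.vstack([M[d, k + t:], M[r, k + t:]]))  # Step 4
            M[d, d] = U
            M[r, d] = 0.0
            M[d, k + t:], M[r, k + t:] = C[:t], C[t:]
    return np.linalg.solve(np.triu(M[:, :n]), M[:, n])

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)
x = tiled_lu_solve(A, b, 2)
print(np.allclose(A @ x, b))
```

Note that pivoting is restricted to the two tiles currently in memory, so the pivot sequence generally differs from that of an ordinary LU factorization with partial pivoting over full columns.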

Development Environment
• Parallel Linear Algebra Package (PLAPACK)
  • Optimized parallel routines (FORTRAN and C interfaces)
  • "View-based" infrastructure
  • Uses standard MPI and BLAS libraries
• Parallel Out-of-Core Linear Algebra Package (POOCLAPACK)
  • Out-of-core extension to PLAPACK
  • Handles the complexity of the I/O operations (i.e., hidden from the user)
  • Uses standard read/write functions for portability

Performance of Parallel OOC QR
IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 GFLOPS, DGEMM at 3.723 GFLOPS

Performance for Sequential OOC LU

Earth Science Application
• Gravity Recovery And Climate Experiment (GRACE)
• A collaborative effort between
  • The University of Texas Center for Space Research (CSR)
  • The Jet Propulsion Laboratory (JPL)
  • GeoForschungsZentrum (GFZ)
  • Deutsches Zentrum für Luft- und Raumfahrt (DLR)
  • National Aeronautics and Space Administration (NASA)

Earth Science Application
• Goal was to compute a rigorous 360x360 gravity model
  • No approximation techniques
  • Translates to roughly 100 km² resolution
  • Involves the least squares estimation of ~130,000 parameters
  • Requires the combination of hundreds of millions of observations
    • Surface gravity data (land), ½ TB
    • Altimetry-based mean sea surface data (ocean)
    • GRACE data (satellite)
• Using the new parallel OOC QR algorithm
  • A 360x360 field was generated, complete with full covariance
  • Largest rigorous gravity field model ever created
  • Used a single IBM P690 node
    • OOC QR required only 32 GB
    • An in-core solution would require 165 GB of memory
  • Required ~6 days of wall clock time to compute (2326 CPU hours)
  • A single-processor machine with sufficient memory would require 3.2 months
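Two of the figures above can be cross-checked with simple arithmetic. The sketch assumes a 30-day month for the single-processor estimate, and assumes that the parameter count corresponds to a spherical harmonic expansion complete to degree and order 360, which has (N + 1)² coefficients; the slides do not state how the ~130,000 figure was obtained.

```python
cpu_hours = 2326                    # from the slide
days = cpu_hours / 24               # one processor running continuously
months = days / 30                  # assumed 30-day month
print(round(days, 1), round(months, 1))  # 96.9 3.2

max_degree = 360
coefficients = (max_degree + 1) ** 2     # complete expansion to degree/order N
print(coefficients)                      # 130321
```

Both results are consistent with the quoted 3.2 months and ~130,000 parameters.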

Conclusion
• Tile-based out-of-core algorithms provide scalability
  • The tile size is determined by the memory of the machine (i.e., fixed) and is independent of the problem size
• Algorithms achieve excellent performance
  • The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
  • This helps to offset the I/O cost associated with moving the tiles to and from disk
• Use of PLAPACK and POOCLAPACK greatly simplified the implementation
  • Reduces complexity of code
  • Makes code portable
• Has already proven valuable to Earth science applications

Conclusion
• Broad spectrum of applications
  • Large-scale problems
  • Small clusters
  • Embedded systems
  • Other small-memory machines
• Tile-based OOC approach can be extended to other dense linear algebra operations
  • Cholesky, matrix inverse, BLAS-3, etc.
• Goal is to provide a full suite of OOC utilities

For More Information
• Visit the PLAPACK website: www.cs.utexas.edu/users/plapack
• Visit the GRACE website: www.csr.utexas.edu/grace
