
Low latency algorithms for QR and LU factorizations




  1. Low latency algorithms for QR and LU factorizations
  Laura Grigori, INRIA Futurs, Université Paris-Sud, Orsay, France
  Joint work with James Demmel, Julien Langou, Mark Hoemmen, Hua Xiang

  2. Plan
  • Motivation
  • Low latency approach for the QR factorization of tall skinny matrices
  • Extension to dense QR and LU factorizations of general matrices
  • Preliminary experimental results
  • Conclusions and future work

  3. Motivation and challenges
  • Increasing parallelism
  • Exponentially growing gaps between:
    - Floating point time << 1/Memory BW << Memory Latency (improving 59%/year vs 23%/year vs 5.5%/year)
    - Floating point time << 1/Network BW << Network Latency (improving 59%/year vs 26%/year vs 15%/year)
  • Heterogeneity (performance and semantics)
  • Unreliability

  4. QR factorization of a tall skinny matrix
  The QR factorization of an m-by-n matrix W, with m >> n, arises in many applications:
  • In s-step Krylov subspace methods such as GMRES.
  • In linear least squares problems where the number of equations (m) is much larger than the number of unknowns (n).
  • In the QR factorization of large dense matrices, where it is used as the panel factorization step.

  5. Example of application: panel factorization in dense blocked factorizations
  LAPACK block LU (right-looking): dgetrf. LAPACK block QR (right-looking): dgeqrf.
  • Panel factorization: lu(panel) via dgetf2 | qr(panel) via dgeqf2 + dlarft
  • Update of the remaining submatrix: dtrsm (+ dswp) and dgemm | dlarfb
  [Figure: block LU and block QR of A(1) into L, U and H, R, followed by the updated trailing matrix A(2). Courtesy of J. Langou]
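To make the panel/update structure concrete, here is a minimal numpy sketch of a right-looking blocked LU. It is illustrative only: the function name and blocking are not LAPACK's interface, and pivoting (the dswp step) is omitted, so it assumes a matrix that is safe to factor unpivoted.

```python
import numpy as np

def blocked_lu(A, b):
    """Right-looking blocked LU without pivoting (illustrative sketch).

    Each iteration mirrors dgetrf's structure: factor the current
    panel (dgetf2), solve for the block row of U (dtrsm), then
    update the trailing submatrix (dgemm).
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Panel factorization (dgetf2): unblocked LU of A[k:, k:e].
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        # Block row of U (dtrsm): solve L11 * U12 = A12.
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        # Trailing update (dgemm): A22 -= L21 @ U12.
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A  # L (unit lower, below diagonal) and U packed together

# Quick check that L @ U reconstructs A:
A = np.random.rand(8, 8) + 8 * np.eye(8)   # diagonally dominant, safe unpivoted
LU = blocked_lu(A, b=3)
L, U = np.tril(LU, -1) + np.eye(8), np.triu(LU)
assert np.allclose(L @ U, A)
```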

  6. TSQR: an approach for the QR factorization of a tall skinny matrix using Householder transformations
  Consider that W (m-by-n) has its rows partitioned in P=4 blocks (4 processors).
  • Compute the QR factorization of each Wi, where Qi0 is m/p-by-m/p and Ri0 is m/p-by-n.
  • Perform log2(P) times the QR factorization of groups of successive R factors, where Qi1 is 2n-by-2n and Ri1 is 2n-by-n.

  7. Parallel TSQR
  [Figure: binary reduction tree over processors P0..P3, time flowing left to right.
  Leaves: QR(Wi) = Hi0 Ri0 on each Pi.
  Level 1: QR([R00; R10]) = H01 R01 and QR([R20; R30]) = H11 R11.
  Level 2: QR([R01; R11]) = H02 R02.
  P0 stores R02, P1 stores H01, P2 stores H02, P3 stores H11.]
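The tree above can be emulated serially with numpy. This is a minimal sketch, assuming P is a power of two and that only the final R factor is wanted; in the real algorithm each leaf and each combine step runs on its own processor, and only n-by-n R factors travel over the network. The function name `tsqr` is ours, not from the source.

```python
import numpy as np

def tsqr(W, P):
    """Serial emulation of the TSQR reduction tree (sketch).

    Each of the P row blocks is factored locally; the R factors are
    then combined pairwise, log2(P) times, by refactoring stacked
    pairs. The implicit Q factors, which stay distributed on their
    processors in the parallel algorithm, are discarded here.
    """
    blocks = np.array_split(W, P, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]   # local QR on each Pi
    while len(Rs) > 1:                                  # log2(P) combine steps
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

# The final R agrees with a direct QR of W up to row signs:
W = np.random.randn(1000, 8)
R_tree = tsqr(W, P=4)
R_ref = np.linalg.qr(W, mode='r')
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```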

  8. Characteristics of parallel TSQR
  1) For each local QR factorization, the best sequential algorithm can be used, e.g. recursive algorithms.
  2) For each step i = 1 to log2(P):
     - Communication: pairs of processors exchange Rx,i-1 and Ry,i-1, the upper triangular factors from the previous QR factorization.
     - Computation: pairs of processors redundantly perform the QR factorization of the two stacked upper triangular factors.
  3) The Householder vectors are stored implicitly using a tree-like structure, which allows in-place storage of the factors.
  The total time is modeled as γ · (#FLOPs) + α · (#messages) + β · (volume of data exchanged). In this model, TSQR sends O(log P) messages, compared with O(n · log P) for ScaLAPACK's PDGEQR2.
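As a back-of-the-envelope illustration of this model: the parameter values below are invented for illustration, and the 2n·log P message count attributed to the ScaLAPACK panel is an assumption of this sketch, not a figure quoted from the slides.

```python
from math import log2

def model_time(flops, msgs, words, gamma, alpha, beta):
    """Total time = gamma*#FLOPs + alpha*#messages + beta*#words."""
    return gamma * flops + alpha * msgs + beta * words

m, n, P = 1_000_000, 100, 64
gamma, alpha, beta = 1e-9, 1e-5, 1e-8      # illustrative, not measured

flops = 2 * m * n**2 // P                   # dominant local flop term
words = (n**2 // 2) * int(log2(P))          # one R factor per tree level

t_tsqr = model_time(flops, int(log2(P)), words, gamma, alpha, beta)
t_panel = model_time(flops, 2 * n * int(log2(P)), words, gamma, alpha, beta)
print(f"TSQR: {t_tsqr:.4f}s   panel QR with O(n log P) messages: {t_panel:.4f}s")
```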

  9. CAQR: communication avoiding QR factorization on a 2D grid of processors
  • A is an m-by-n matrix distributed in a 2D block cyclic layout on a Pr-by-Pc grid of processors, with block size b.
  • At each step, TSQR factors the current b-column panel; the resulting tree of Householder factors (H00, H10, H01, ...) is then applied to update the trailing blocks C(k), D(k) from C(k-1), D(k-1).
  • CAQR sends O((n/b) · log P) messages, versus O(n · log P) for ScaLAPACK's QR.
  [Figure: one CAQR step on a 2x2 grid of processors (0..3), showing the TSQR tree on the panel and the trailing matrix update.]
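For reference, here is the standard owner computation for such a layout. This is a sketch assuming a ScaLAPACK-style convention with zero process offsets and 0-based indices; the function name is ours.

```python
def block_cyclic_owner(i, j, b, Pr, Pc):
    """Process coordinates owning entry (i, j) of a matrix stored in
    a 2D block cyclic layout with square b-by-b blocks on a Pr-by-Pc
    process grid: block (i//b, j//b) is dealt out cyclically."""
    return (i // b) % Pr, (j // b) % Pc

# Entry (130, 257) with b=32 on a 4-by-8 grid lives on process (0, 0):
print(block_cyclic_owner(130, 257, 32, 4, 8))
```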

  10. The LU factorization of a tall skinny matrix
  Can we apply the ideas of TSQR to the LU factorization? We call this approach TSLU.
  Consider that W (m-by-n) has its rows partitioned in P=4 blocks (4 processors).
  • Compute the LU factorization with partial pivoting of each Wi, Πi0 Wi = Li0 Ui0, where Πi0 are m/p-by-m/p permutation matrices, Li0 are m/p-by-n lower unit trapezoidal, and Ui0 are n-by-n upper triangular.
  • Perform log2(P) times the LU factorization of groups of successive U factors, where Πi1 and Πi2 are 2n-by-2n permutation matrices, Li1 and Li2 are 2n-by-n, and Ui1 and Ui2 are n-by-n upper triangular.
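A serial numpy/scipy emulation of this reduction tree, as a sketch: it assumes P is a power of two and m/P ≥ n, and the hypothetical function returns only the winning rows, whereas the real algorithm also tracks their global indices to form the overall permutation.

```python
import numpy as np
from scipy.linalg import lu

def tslu_pivot_rows(W, P):
    """Serial emulation of the TSLU reduction tree (sketch).

    Each block is factored with partial pivoting (GEPP); the n pivot
    rows of each block are its local "tournament winners". Pairs of
    winner sets are stacked and refactored, log2(P) times, until n
    final pivot rows remain.
    """
    n = W.shape[1]
    cands = []
    for B in np.array_split(W, P, axis=0):
        p, l, u = lu(B)                 # B = p @ l @ u
        cands.append((p.T @ B)[:n])     # pivot rows come out on top
    while len(cands) > 1:               # log2(P) combine steps
        nxt = []
        for a, c in zip(cands[0::2], cands[1::2]):
            S = np.vstack([a, c])       # stacked 2n-by-n candidates
            p, l, u = lu(S)
            nxt.append((p.T @ S)[:n])
        cands = nxt
    return cands[0]
```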

  11. Stability of the LU factorization
  • Consider the growth factor g = max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij|, where a_ij^(k) are the values at the k-th step of the elimination.
  • Experiments performed for various distributions of matrices with n < 1024 [Trefethen and Schreiber '90] showed that the average growth factor, normalized by the standard deviation of the initial matrix elements, is close to n^(2/3) for partial pivoting and n^(1/2) for complete pivoting.
  • Two reasons considered to be important for the average case stability:
    - the multipliers in L are small,
    - the correction introduced at each elimination step is of rank 1.
  • Other strategies:
    - pairwise pivoting is reasonably stable (low rank correction),
    - TSLU involves a rank-P update at each step.
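This average-case behavior is easy to probe numerically. A small sketch using the common proxy g ≈ max|U| / max|A| for Gaussian elimination with partial pivoting on Gaussian random matrices; the sizes and the seed are arbitrary choices, not from the source.

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)
for n in (256, 512, 1024):
    A = rng.standard_normal((n, n))
    U = lu(A)[2]                             # GEPP: A = P @ L @ U
    g = np.abs(U).max() / np.abs(A).max()    # growth-factor proxy
    print(f"n={n:5d}  g={g:7.1f}  n^(2/3)={n ** (2/3):7.1f}")
```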

  12. Growth factor for TSLU-based factorization
  • Unstable for large P and large matrices.
  • When P equals the number of rows, TSLU is equivalent to parallel pivoting.
  [Figure: growth factor measurements. Courtesy of H. Xiang]

  13. Iterative refinement for TSLU-based factorization
  [Figure: iterative refinement results. Courtesy of H. Xiang]

  14. Evaluation of the performance
  • Experiments performed on two platforms at NERSC:
  • IBM p575 POWER 5 system, 111 compute nodes, 8 processors per node:
    - each processor is clocked at 1.9 GHz, theoretical peak of 7.6 GFlops/s,
    - each node has 32 GB of memory,
    - MPI point-to-point internode latency is 4.5 μs,
    - peak bandwidth is 3100 MB/s.
  • Opteron cluster with 356 dual-processor nodes:
    - each node has 6 GB of memory,
    - each processor is clocked at 2.2 GHz, theoretical peak of 4.4 GFlops/s,
    - switch MPI unidirectional latency is 4.5 μs,
    - peak switch MPI unidirectional bandwidth is 620 MB/s.
  • Analyze the improvement obtained:
    - in part from using a better local LU factorization,
    - in part from reducing the latency cost.

  15. Performance results
  • Performance for tall-skinny matrices with fixed m = 1,000,000 and varying n = 10, ..., 200.
  • PDGETF2 – ScaLAPACK's parallel LU factorization of a panel.
  • RGETF2 – sequential recursive LU factorization of a panel.
  • DGETF2 – sequential classic LU factorization of a panel.
  • RTSLU – TSLU using RGETF2 for the local LU factorization.
  • CTSLU – TSLU using DGETF2 for the local LU factorization.
  • TT – ratio of PDGETF2 time to RTSLU time.
  • LT – ratio of local LU factorization time, DGETF2 to RGETF2.
  [Figure: TT and LT plotted against n.]

  16. Performance results (continued)
  [Figure: TT and LT plotted against n on the other platform; legend as on the previous slide.]

  17. Performance prediction for a Petascale machine Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 GB/s.

  18. Performance prediction for a Petascale machine (continued)
  [Figure: same machine parameters as the previous slide.]

  19. Conclusions
  • The new algorithms minimize the number of messages exchanged, at the cost of some redundant computation.
  • The number of messages is O(n/b · log P) instead of O(n · log P) for classic algorithms, where n represents the number of columns and b the block size (see the example below).
  • The same ideas can be applied to sparse and incomplete factorizations.
  • The work on CALU is still under development; the codes will be available soon.
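For a concrete feel of this bound, with illustrative values of n, b and P (not taken from the experiments above):

```python
from math import log2

n, b, P = 10_000, 100, 1024                 # illustrative values
msgs_ca = (n / b) * log2(P)                 # communication avoiding: O(n/b log P)
msgs_classic = n * log2(P)                  # classic: O(n log P)
print(f"{msgs_ca:.0f} vs {msgs_classic:.0f} messages "
      f"({msgs_classic / msgs_ca:.0f}x fewer)")
```

With these numbers the reduction factor is exactly b, the block size.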

  20. CALU – a communication avoiding LU factorization
  • Partition A as A = [A11 A12; A21 A22], where the current panel is the first b columns, [A11; A21].
  • Approach: at each step of the factorization, on the current panel of columns:
    - let Π be the permutation matrix obtained from applying TSLU on the panel,
    - permute A according to Π, then perform the LU factorization of the panel without pivoting.
  • Characteristics:
    - performs twice as many flops for each panel factorization,
    - TSLU provides "good" pivots,
    - CALU can be seen as a threshold pivoting strategy.
  A sketch of one such panel step follows.
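Putting the pieces together, here is an illustrative serial sketch of one CALU panel step. It assumes P is a power of two, m/P ≥ b, and pivots that stay nonzero; a real implementation works in place, distributed, and carries the permutation through the rest of the matrix. The function names are ours.

```python
import numpy as np
from scipy.linalg import lu

def calu_panel_step(A, b, P):
    """One CALU panel step (serial sketch).

    A TSLU tournament on the first b columns selects b pivot rows;
    those rows are permuted to the top, and the panel is factored
    WITHOUT further pivoting, followed by the usual block update.
    """
    m = A.shape[0]
    panel = A[:, :b]

    def winners(idx):
        # GEPP on the candidate rows; keep the b pivot rows' indices.
        p = lu(panel[idx])[0]                   # panel[idx] = p @ l @ u
        return idx[np.argmax(p.T, axis=1)[:b]]

    groups = [winners(g) for g in np.array_split(np.arange(m), P)]
    while len(groups) > 1:                      # log2(P) tournament rounds
        groups = [winners(np.concatenate(pr))
                  for pr in zip(groups[0::2], groups[1::2])]
    piv = groups[0]

    # Permute winners to the top, then factor the panel unpivoted.
    order = np.concatenate([piv, np.setdiff1d(np.arange(m), piv)])
    A = A[order].astype(float)
    for j in range(b):
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:b] -= np.outer(A[j+1:, j], A[j, j+1:b])
    L11 = np.tril(A[:b, :b], -1) + np.eye(b)
    A[:b, b:] = np.linalg.solve(L11, A[:b, b:])  # block row of U
    A[b:, b:] -= A[b:, :b] @ A[:b, b:]           # trailing update
    return A, order
```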

  21. Growth factor for CALU approach

  22. Average threshold used in CALU

  23. The LU factorization of a general matrix
  • The LU factorization of an m-by-n matrix A can be implemented in parallel on a 2D grid of P processors, Pr-by-Pc, using a 2D block cyclic layout with square blocks of size b.
  • It has several advantages:
    - it allows the use of the best sequential algorithm in the factorization of a block of columns, i.e. a recursive algorithm,
    - it leads to a low latency algorithm.
  The time of the new LU factorization and the time of ScaLAPACK's LU are compared in terms of the following model parameters:
  - γ – cost per add and multiply, γd – cost per divide,
  - α, β – latency and inverse bandwidth for communication in the same column of the grid,
  - α', β' – latency and inverse bandwidth for communication in the same row of the grid.
