
Minimizing Communication in Linear Algebra


Presentation Transcript


  1. Minimizing Communication in Linear Algebra, James Demmel, 27 Oct 2009, www.cs.berkeley.edu/~demmel

  2. Outline • What is “communication” and why is it important to avoid? • Goal: Algorithms that communicate as little as possible for: • BLAS, Ax=b, QR, eig, SVD, whole programs, … • Dense or sparse matrices • Sequential or parallel computers • Direct methods (BLAS, LU, QR, SVD, other decompositions) • Communication lower bounds for all these problems • Algorithms that attain them (all dense linear algebra, some sparse) • Mostly not in LAPACK or ScaLAPACK (yet) • Iterative methods – Krylov subspace methods for Ax=b, Ax=λx • Communication lower bounds, and algorithms that attain them (depending on sparsity structure) • Not in any libraries (yet) • Just highlights • Preliminary speed-up results • Up to 8x measured (or ∞x), 81x modeled

  3. Collaborators: Grey Ballard, UCB EECS; Ioana Dumitriu, U. Washington; Laura Grigori, INRIA; Ming Gu, UCB Math; Mark Hoemmen, UCB EECS; Olga Holtz, UCB Math & TU Berlin; Julien Langou, U. Colorado Denver; Marghoob Mohiyuddin, UCB EECS; Oded Schwartz, TU Berlin; Hua Xiang, INRIA; Kathy Yelick, UCB EECS & NERSC; BeBOP group, bebop.cs.berkeley.edu. Thanks to Microsoft, Intel, UC Discovery, NSF, DOE …

  4. Motivation (1/2) • Algorithms have two costs: • Arithmetic (flops) • Communication: moving data between levels of a memory hierarchy (sequential case) or between processors over a network (parallel case) • [Figure: CPU with cache and DRAM for the sequential case; several CPU+DRAM nodes connected by a network for the parallel case]

  5. Motivation (2/2) • Running time of an algorithm is the sum of 3 terms: • #flops * time_per_flop • #words_moved / bandwidth • #messages * latency • Time_per_flop << 1/bandwidth << latency, and the gaps are growing exponentially with time (flop rates improve ~59%/year, bandwidth and latency far more slowly) • Goal: reorganize linear algebra to avoid communication • Between all memory hierarchy levels • L1, L2, DRAM, network, etc. • Not just hiding communication (speedup ≤ 2x) • Arbitrary speedups possible
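
The three-term model above can be made concrete in a few lines. The numbers below are hypothetical machine parameters chosen only to illustrate that the bandwidth and latency terms can dominate the flop term:

```python
# Minimal sketch of the three-term cost model from the slide.
# gamma, beta_inv, alpha are *hypothetical* machine parameters, not measurements.
gamma = 1e-11      # time per flop (s)        -- fast, improving quickly
beta_inv = 1e-9    # time per word moved (s)  -- 1/bandwidth, improving slowly
alpha = 1e-6       # time per message (s)     -- latency, improving slowest

def run_time(flops, words_moved, messages):
    """Running time = #flops*time_per_flop + #words/bandwidth + #messages*latency."""
    return flops * gamma + words_moved * beta_inv + messages * alpha

# Same flop count, 10x less communication: most of the running time disappears.
print(run_time(flops=1e9, words_moved=1e8, messages=1e4))   # communication-bound
print(run_time(flops=1e9, words_moved=1e7, messages=1e3))   # ~10x less communication
```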

  6. Direct linear algebra: Prior Work on Matmul • Parallel case on P processors: • Let NNZ be the total memory needed; assume load balanced • Lower bound on #words communicated = Ω(n^3 / (P·NNZ)^{1/2}) [Irony, Tiskin, Toledo, 04] • Assumes an O(n^3) algorithm (i.e. not Strassen-like) • Sequential case, with fast memory of size M: • Lower bound on #words moved to/from slow memory = Ω(n^3 / M^{1/2}) [Hong, Kung, 81] • Attained using “blocked” algorithms (sketched below)
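
The classic way to attain the sequential Ω(n^3 / M^{1/2}) bound is tiling: keep three b x b blocks (about 3b^2 ≤ M words) in fast memory while they are reused. A minimal numpy sketch of the idea, not the tuned BLAS implementation:

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Tiled C = A @ B. Each pair of b x b tiles of A and B is reused against a
    C tile, so only O(n^3 / b) = O(n^3 / sqrt(M)) words move between slow and
    fast memory when ~3*b*b words fit in fast memory."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 256, 64                      # sizes chosen only for the demo
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, b), A @ B)
```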

  7. Lower bound for all “direct” linear algebra • Let M = “fast” memory size per processor • #words_moved by at least one processor = Ω(#flops / M^{1/2}) • #messages_sent by at least one processor = Ω(#flops / M^{3/2}) • Holds for • BLAS, LU, QR, eig, SVD, … • Some whole programs (sequences of these operations, no matter how they are interleaved, e.g. computing A^k) • Dense and sparse matrices (where #flops may be << n^3) • Sequential and parallel algorithms • Some graph-theoretic algorithms (e.g. Floyd-Warshall) • See [BDHS09]
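
For concreteness, a tiny helper that evaluates these two bounds for a given flop count and fast-memory size (constants hidden by the Ω are ignored; the problem sizes are arbitrary examples):

```python
def words_lower_bound(flops, M):
    # Omega(#flops / M^(1/2)); constant factor omitted
    return flops / M ** 0.5

def messages_lower_bound(flops, M):
    # Omega(#flops / M^(3/2)): the largest possible message holds M words
    return flops / M ** 1.5

n, M = 10_000, 10**6               # e.g. dense n x n matmul, 10^6 words of fast memory
flops = 2 * n**3
print(f"#words_moved  >= ~{words_lower_bound(flops, M):.3e}")
print(f"#messages     >= ~{messages_lower_bound(flops, M):.3e}")
```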

  8. Can we attain these lower bounds? • Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? • Mostly not • If not, are there other algorithms that do? • Yes • Goals for algorithms: • Minimize #words_moved = Ω(#flops / M^{1/2}) • Minimize #messages = Ω(#flops / M^{3/2}) • Minimize for multiple memory hierarchy levels • Recursive, “cache-oblivious” algorithms would be simplest • Fewest flops when matrix fits in fastest memory • Cache-oblivious algorithms don’t always attain this • Only a few sparse algorithms so far

  9. Summary of dense sequential algorithms attaining communication lower bounds • Algorithms shown minimizing #messages use a (recursive) block layout • Not possible with columnwise or rowwise layouts • Many references (see reports), only some shown, plus ours • In the original table (not reproduced in this transcript): cache-oblivious algorithms are underlined, green entries are ours, ? is unknown/future work

  10. Summary of dense 2D parallel algorithms attaining communication lower bounds • Assume n x n matrices on P processors, memory per processor = O(n^2 / P) • ScaLAPACK assumes the best block size b is chosen • Many references (see reports); green entries in the original table are ours • Recall lower bounds: • #words_moved = Ω(n^2 / P^{1/2}) and #messages = Ω(P^{1/2})
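
A back-of-the-envelope count shows why a SUMMA-style 2D matmul matches the Ω(n^2 / P^{1/2}) word bound. This is a counting model only (no actual MPI code), with the block size and schedule assumed as described in the comments:

```python
import math

def summa_words_per_proc(n, P):
    """Words received per processor in a SUMMA-like 2D matmul on a sqrt(P) x sqrt(P)
    grid with block size b = n/sqrt(P): sqrt(P) steps, each bringing in one b x b
    panel of A (row broadcast) and one b x b panel of B (column broadcast)."""
    p = math.isqrt(P)
    assert p * p == P, "assume a square processor grid"
    b = n // p
    return p * 2 * b * b            # = 2*n^2 / sqrt(P), matching the lower bound

print(summa_words_per_proc(4096, 64))        # compare with n^2 / sqrt(P) = 4096^2 / 8
```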

  11. QR of a tall, skinny matrix is the bottleneck; use TSQR instead: Q is represented implicitly as a product (tree of factors) [DGHL08]

  12. Minimizing Communication in TSQR • [Figures: TSQR reduction trees. Parallel binary tree: W = [W0; W1; W2; W3], local factors R00, R10, R20, R30 are combined pairwise into R01, R11 and then into R02. Sequential flat tree: R00 → R01 → R02 → R03, folding in one block at a time. Dual core: a hybrid of the two.] • Multicore / Multisocket / Multirack / Multisite / Out-of-core: ? • Can choose the reduction tree dynamically
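
A minimal serial sketch of the TSQR idea with a binary reduction tree: factor each block row independently, then repeatedly stack pairs of R factors and refactor, keeping the local Q factors as the implicit representation of Q. This illustrates the structure only; it is not the parallel library code, and the function name `tsqr_R` is ours:

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    """R factor of a tall-skinny W via a binary reduction tree.
    The local Q factors (the implicit representation of Q) are collected in `qs`."""
    blocks = np.array_split(W, nblocks)        # W = [W0; W1; ...; W_{nblocks-1}]
    qs, Rs = [], []
    for Wi in blocks:                          # leaf QRs: independent, i.e. parallel
        Q, R = np.linalg.qr(Wi)
        qs.append(Q)
        Rs.append(R)
    while len(Rs) > 1:                         # combine pairs of R factors up the tree
        nxt = []
        for i in range(0, len(Rs), 2):
            Q, R = np.linalg.qr(np.vstack(Rs[i:i + 2]))
            qs.append(Q)
            nxt.append(R)
        Rs = nxt
    return Rs[0], qs

W = np.random.rand(10_000, 10)
R, _ = tsqr_R(W)
assert np.allclose(R.T @ R, W.T @ W)           # R is a valid R factor of W
```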

  13. Performance of TSQR vs Sca/LAPACK • Parallel • Intel Clovertown • Up to 8x speedup (8 core, dual socket, 10M x 10) • Pentium III cluster, Dolphin Interconnect, MPICH • Up to 6.7x speedup (16 procs, 100K x 200) • BlueGene/L • Up to 4x speedup (32 procs, 1M x 50) • Both use Elmroth-Gustavson locally – enabled by TSQR • Sequential • OOC on PowerPC laptop • As little as 2x slowdown vs (predicted) infinite DRAM • LAPACK with virtual memory never finished • See [DGHL08] • General CAQR = Communication-Avoiding QR in progress

  14. Modeled Speedups of CAQR vs ScaLAPACK • Petascale: up to 22.9x • IBM Power 5: up to 9.7x • “Grid”: up to 11x • Petascale model: 8192 processors, each at 500 GFlop/s, with 4 GB/s bandwidth

  15. Direct Linear Algebra: summary and future work • Communication lower bounds on #words_moved and #messages • BLAS, LU, Cholesky, QR, eig, SVD, … • Some whole programs (compositions of these operations, no matter how they are interleaved, e.g. computing A^k) • Dense and sparse matrices (where #flops may be << n^3) • Sequential and parallel algorithms • Some graph-theoretic algorithms (e.g. Floyd-Warshall) • Algorithms that attain these lower bounds • Nearly all dense problems (some open problems) • A few sparse problems • Speedups in theory and practice • Future work • Lots to implement and autotune • Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA) • Few algorithms in the sparse case • Strassen-like algorithms • 3D algorithms (only for matmul so far), may be important for scaling

  16. Avoiding Communication in Iterative Linear Algebra • k steps of a typical iterative solver for sparse Ax=b or Ax=λx • Does k SpMVs with the starting vector • Finds the “best” solution among all linear combinations of these k+1 vectors • Many such “Krylov subspace methods” • Conjugate Gradients, GMRES, Lanczos, Arnoldi, … • Goal: minimize communication in Krylov subspace methods • Assume the matrix is “well-partitioned,” with modest surface-to-volume ratio • Parallel implementation • Conventional: O(k log p) messages, because of k calls to SpMV • New: O(log p) messages (optimal) • Serial implementation • Conventional: O(k) moves of data from slow to fast memory • New: O(1) moves of data (optimal) • Lots of speedup possible (modeled and measured) • Price: some redundant computation • Much prior work • See bebop.cs.berkeley.edu • CG: [van Rosendale, 83], [Chronopoulos and Gear, 89] • GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94]

  17. Minimizing Communication of GMRES to solve Ax=b • GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax - b||_2 • Cost of k steps of standard GMRES vs new GMRES • Standard GMRES: for i = 1 to k: w = A·v(i-1); MGS(w, v(0), …, v(i-1)); update v(i), H; end for; solve LSQ problem with H • Sequential: #words_moved = O(k·nnz) from SpMV + O(k^2·n) from MGS • Parallel: #messages = O(k) from SpMV + O(k^2·log p) from MGS • Communication-avoiding GMRES: W = [v, Av, A^2 v, …, A^k v]; [Q, R] = TSQR(W) … “Tall Skinny QR”; build H from R, solve LSQ problem • Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR • Parallel: #messages = O(1) from computing W + O(log p) from TSQR • Oops – W comes from the power method, so precision is lost!
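
A schematic of the communication-avoiding pipeline in numpy/scipy, using the monomial basis for clarity. The test matrix and sizes are made up for illustration; the matrix powers kernel and the reconstruction of H from R are only indicated in comments, not implemented:

```python
import numpy as np
import scipy.sparse as sp

# Toy problem (assumed for illustration): a sparse A and a starting vector v.
n, k = 2000, 8
A = sp.random(n, n, density=1e-3, format='csr') + 2.0 * sp.eye(n, format='csr')
v = np.random.rand(n)

# Step 1: build the Krylov basis W = [v, Av, A^2 v, ..., A^k v].
# In CA-GMRES this is the "matrix powers kernel": for a well-partitioned A it needs
# O(1) reads of A (sequential) / O(1) messages (parallel) instead of k.
W = np.empty((n, k + 1))
W[:, 0] = v
for j in range(k):
    W[:, j + 1] = A @ W[:, j]      # written as k SpMVs here purely for clarity

# Step 2: one QR of the tall-skinny W, replacing k steps of modified Gram-Schmidt.
Q, R = np.linalg.qr(W)             # stand-in for TSQR; same Q and R

# Step 3 (omitted): recover the small Hessenberg matrix H from R and the basis
# recurrence, then solve the least-squares problem as in standard GMRES.
```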

  18. “Monomial” basis [Ax, …, A^k x] fails to converge; a different polynomial basis does converge [Figure: GMRES convergence plots for the two bases]
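
One way to build such a basis (as used in communication-avoiding Krylov work) is a Newton-type basis, where each new vector is (A - θ_j I) times the previous one and the shifts θ_j approximate eigenvalues (e.g. Ritz values). A small sketch with placeholder shifts; the helper `newton_basis` and the test matrix are ours:

```python
import numpy as np

def newton_basis(A, v, k, shifts):
    """Build W = [p_0(A)v, ..., p_k(A)v] with p_{j+1}(x) = (x - shifts[j]) * p_j(x).
    With well-spread shifts this is much better conditioned than the monomial
    basis [v, Av, ..., A^k v] as k grows."""
    W = np.empty((v.shape[0], k + 1))
    W[:, 0] = v
    for j in range(k):
        W[:, j + 1] = A @ W[:, j] - shifts[j] * W[:, j]
    return W

# Hypothetical example: diagonal A with spectrum in [1, 10]; placeholder shifts
# spread across the spectrum stand in for Ritz values.
A = np.diag(np.linspace(1.0, 10.0, 100))
v = np.ones(100)
shifts = np.linspace(1.0, 10.0, 8)
W_newton = newton_basis(A, v, 8, shifts)
W_monomial = newton_basis(A, v, 8, np.zeros(8))   # shifts = 0 gives the monomial basis
print("cond(Newton)  :", np.linalg.cond(W_newton))
print("cond(monomial):", np.linalg.cond(W_monomial))
```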

  19. Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

  20. Communication Avoiding Iterative Linear Algebra: Future Work • Lots of algorithms to implement and autotune • Make widely available via OSKI, Trilinos, PETSc, Python, … • Extensions to other Krylov subspace methods • So far just Lanczos/CG and Arnoldi/GMRES • BiCGStab, CGS, QMR, … • Add preconditioning • Block diagonal preconditioners are easiest • Hierarchical, semiseparable, … • Works if interactions between distant points can be “compressed”

  21. Summary: Don’t Communicate…
