Minimizing Communication in Numerical Linear Algebra
Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/~demmel

Optimizing Krylov Subspace Methods

Jim Demmel

EECS & Math Departments, UC Berkeley

[email protected]


Outline of Lecture 8: Optimizing Krylov Subspace Methods

  • k steps of a typical iterative solver for Ax=b or Ax=λx

    • Does k SpMVs with the starting vector (e.g., with b, if solving Ax=b); see the sketch after this outline

    • Finds “best” solution among all linear combinations of these k+1 vectors

    • Many such “Krylov Subspace Methods”

      • Conjugate Gradients, GMRES, Lanczos, Arnoldi, …

  • Goal: minimize communication in Krylov Subspace Methods

    • Assume matrix “well-partitioned,” with modest surface-to-volume ratio

    • Parallel implementation

      • Conventional: O(k log p) messages, because k calls to SpMV

      • New: O(log p) messages - optimal

    • Serial implementation

      • Conventional: O(k) moves of data from slow to fast memory

      • New: O(1) moves of data – optimal

  • Lots of speedup possible (modeled and measured)

    • Price: some redundant computation

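For reference, the conventional kernel mentioned above (k separate calls to SpMV) looks like the following minimal sketch, assuming Python/SciPy; this is illustrative code, not the lecture's. Each loop iteration re-reads A, which is exactly the k-fold data movement (and, in parallel, the k rounds of messages) that the rest of the lecture removes.

# Minimal sketch (not the lecture's code): the conventional way to get the
# Krylov basis [x, Ax, ..., A^k x] is k independent calls to SpMV, so A is
# read k times (serial) and neighbor data is exchanged k times (parallel).
import numpy as np
import scipy.sparse as sp

def monomial_basis(A, x, k):
    """Return the n x (k+1) matrix [x, Ax, ..., A^k x] via k SpMVs."""
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):          # each iteration = one full pass over A
        V[:, j + 1] = A @ V[:, j]
    return V

# Example: 1D Laplacian (tridiagonal), k = 4
n, k = 10, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
V = monomial_basis(A, np.ones(n), k)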


Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors

[Figure: rows x, Ax, …, A^8x, columns split between Proc 1 and Proc 2; the highlighted entries can be computed without communication.]


Locally Dependent Entries for [x, Ax, A^2x], A tridiagonal, 2 processors

[Figure: rows x, Ax, …, A^8x, columns split between Proc 1 and Proc 2; the highlighted entries can be computed without communication.]


Locally Dependent Entries for [x, Ax, …, A^3x], A tridiagonal, 2 processors

[Figure: rows x, Ax, …, A^8x, columns split between Proc 1 and Proc 2; the highlighted entries can be computed without communication.]


Locally Dependent Entries for [x, Ax, …, A^4x], A tridiagonal, 2 processors

[Figure: rows x, Ax, …, A^8x, columns split between Proc 1 and Proc 2; the highlighted entries can be computed without communication.]


Locally Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors

[Figure: rows x, Ax, …, A^8x, columns split between Proc 1 and Proc 2; the highlighted entries can be computed without communication.]

k = 8-fold reuse of A


Remotely Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors

[Figure: rows x, Ax, …, A^8x, columns split between Proc 1 and Proc 2; the highlighted entries near the processor boundary depend on remote data.]

One message to get the data needed to compute the remotely dependent entries, not k=8 (see the sketch after this slide)

Minimizes number of messages = latency cost

Price: redundant work ∝ “surface/volume ratio”
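Below is a minimal sketch of this idea, assuming Python and a matrix-free 1D Laplacian stencil standing in for the tridiagonal A (illustrative code, not the lecture's implementation). Each “processor” receives its k ghost values per side once, then applies the stencil k times with no further messages; the entries it recomputes inside the shrinking ghost zone are exactly the redundant work mentioned above.

import numpy as np

def apply_stencil(w):
    """One application of the 1D Laplacian stencil to a local array w.
    The two edge formulas are only meaningful where w's edge coincides with
    the edge of the global vector; elsewhere those entries are garbage that
    the caller never reads (they sit in the shrinking ghost zone)."""
    v = np.empty_like(w)
    v[1:-1] = 2 * w[1:-1] - w[:-2] - w[2:]
    v[0], v[-1] = 2 * w[0] - w[1], 2 * w[-1] - w[-2]
    return v

def k_steps_one_message(x, lo, hi, k):
    """Return [x[lo:hi], (Ax)[lo:hi], ..., (A^k x)[lo:hi]] for the 1D
    Laplacian A, using only x[lo-k : hi+k] -- the one-time "message"."""
    n = x.size
    ext_lo, ext_hi = max(lo - k, 0), min(hi + k, n)   # owned block + ghost zones
    w = x[ext_lo:ext_hi].copy()                       # all remote data, fetched once
    s, e = lo - ext_lo, hi - ext_lo                   # owned slice inside w
    out = [w[s:e].copy()]
    good_lo, good_hi = 0, w.size                      # currently valid part of w
    for _ in range(k):
        w = apply_stencil(w)                          # redundant work in ghost zone
        if ext_lo > 0: good_lo += 1                   # each step burns one ghost
        if ext_hi < n: good_hi -= 1                   #   value on each side
        assert good_lo <= s and e <= good_hi          # owned entries stay valid
        out.append(w[s:e].copy())
    return out

# Two "processors" for a length-16 vector, k = 4, no messages after the
# initial ghost exchange.
x = np.arange(16.0)
halves = [k_steps_one_message(x, 0, 8, 4), k_steps_one_message(x, 8, 16, 4)]
ref = x.copy()
for _ in range(4):
    ref = apply_stencil(ref)
print(np.allclose(np.concatenate([halves[0][4], halves[1][4]]), ref))  # True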



Spyplot of A with sparsity pattern of a 5 point stencil, natural order



Spyplot of A with sparsity pattern of a 5 point stencil, natural order, assigned to p=9 processors

For an n x n mesh assigned to p processors, each processor needs 2n data from neighbors



Spyplot of A with sparsity pattern of a 5 point stencil, nested dissection order

For an n x n mesh assigned to p processors, each processor needs 4n/p^(1/2) data from neighbors


Remotely Dependent Entries for [x, Ax, …, A^3x], A with sparsity pattern of a 5 point stencil


Remotely Dependent Entries for [x, Ax, …, A^3x], A with sparsity pattern of a 5 point stencil


Remotely Dependent Entries for [x, Ax, A^2x, A^3x], A irregular, multiple processors


Remotely Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors

[Figure: same picture as the earlier tridiagonal slide.]

One message to get the data needed to compute the remotely dependent entries, not k=8

Minimizes number of messages = latency cost

Price: redundant work ∝ “surface/volume ratio”


Reducing redundant work for a tridiagonal matrix


Summary of Parallel Optimizations so far, for the “Akx kernel”: (A, x, k) → [Ax, A^2x, …, A^kx]

  • Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally

    • #flops grows by factor 1 + O(k/(n/p)) = 1 + O(k/data_per_proc) ≈ 1

  • Pretty good: 2D mesh (n x n): need O(n/p^(1/2)) data from neighbors, versus O(n^2/p) locally; #flops grows by 1 + O(k/(data_per_proc)^(1/2))

  • Less good: 3D mesh (n x n x n): need O(n^2/p^(2/3)) data from neighbors, versus O(n^3/p) locally; #flops grows by 1 + O(k/(data_per_proc)^(1/3)) (see the numeric check below)

  • Possible to reduce #messages from O(k) to O(1)

  • Depends on matrix being “well-partitioned”

    • Each processor only needs data from a few neighbors

  • Price: redundant computation of some entries of A^j x

    • Amount of redundant work depends on “surface to volume ratio” of partition: ideally each processor only needs a little data from neighbors, compared to what it owns itself

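As a quick sanity check on the growth factors above, here are the numbers for one hypothetical setting (k and data_per_proc are made-up example values, not from the slides):

# Quick numeric check of the flop-growth factors 1 + O(k / data_per_proc^(1/d)).
k, data_per_proc = 10, 10**6   # hypothetical: 10 matrix powers, 1e6 rows per processor
for d, mesh in [(1, "1D"), (2, "2D"), (3, "3D")]:
    growth = 1 + k / data_per_proc**(1.0 / d)
    print(f"{mesh} mesh: redundant-flop factor ~ {growth:.3f}")
# 1D: ~1.00001, 2D: ~1.01, 3D: ~1.1 -- redundancy grows with mesh dimension.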


Predicted speedup for model Petascale machine of [Ax, A^2x, …, A^kx] for 2D (n x n) Mesh

[Plot: axes k and log2 n.]


Predicted fraction of time spent doing arithmetic for model Petascale machine of [Ax, A^2x, …, A^kx] for 2D (n x n) Mesh

[Plot: axes k and log2 n.]


Predicted ratio of extra arithmetic for model Petascale machine of [Ax, A^2x, …, A^kx] for 2D (n x n) Mesh

[Plot: axes k and log2 n.]


Predicted speedup for model Petascale machine of [Ax, A^2x, …, A^kx] for 3D (n x n x n) Mesh

[Plot: axes k and log2 n.]


Predicted fraction of time spent doing arithmetic for model Petascale machine of [Ax, A^2x, …, A^kx] for 3D (n x n x n) Mesh

[Plot: axes k and log2 n.]


Predicted ratio of extra arithmetic for model Petascale machine of [Ax, A^2x, …, A^kx] for 3D (n x n x n) Mesh

[Plot: axes k and log2 n.]


Minimizing #messages versus #words_moved

  • Parallel case

    • Can’t reduce #words needed from other processors

    • Required to get right answer

  • Sequential case

    • Can reduce #words needed from slow memory

    • Assume A and x are too large to fit in fast memory

    • Naïve implementation of [Ax, A^2x, …, A^kx] by k calls to SpMV needs to move A and x between fast and slow memory k times

    • We will move them only one time – clearly optimal


Sequential [x, Ax, …, A^4x], with memory hierarchy

One read of the matrix from slow memory, not k=4

Minimizes #words_moved = bandwidth cost

No redundant work (see the sketch below)
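A sketch of this one-pass, left-to-right idea, assuming Python and a matrix-free 1D Laplacian standing in for the banded A in the figure (illustrative code, not the lecture's): the computation is skewed into a wavefront so that each block of the vector (and of the matrix, if it were stored explicitly) passes through fast memory once, and no entry is recomputed.

import numpy as np

def blocked_akx_tridiag(x, k, block):
    """Levels L[j] = A^j x for the 1D Laplacian A, computed by streaming over
    x left to right in blocks of the given size.  The wavefront order means
    each block is brought into fast memory once; roughly the trailing ~k
    entries of every level (plus the current block) form the working set."""
    n = x.size
    L = np.zeros((k + 1, n))
    L[0] = x
    frontier = [n] + [0] * k                # entries of each level computed so far
    for lo in range(0, n, block):
        hi = min(lo + block, n)             # pretend x[lo:hi] was just loaded
        for j in range(1, k + 1):
            new = hi - j if hi < n else n   # how far level j can advance
            for i in range(frontier[j], new):
                left = L[j - 1, i - 1] if i > 0 else 0.0
                right = L[j - 1, i + 1] if i < n - 1 else 0.0
                L[j, i] = 2 * L[j - 1, i] - left - right
            frontier[j] = max(frontier[j], new)
    return L

# Check against 4 plain stencil applications
x = np.arange(32.0)
L = blocked_akx_tridiag(x, k=4, block=8)
ref = x.copy()
for _ in range(4):
    nxt = np.empty_like(ref)
    nxt[1:-1] = 2 * ref[1:-1] - ref[:-2] - ref[2:]
    nxt[0], nxt[-1] = 2 * ref[0] - ref[1], 2 * ref[-1] - ref[-2]
    ref = nxt
print(np.allclose(L[4], ref))   # True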


Remotely Dependent Entries for [x, Ax, …, A^3x], A 100x100 with bandwidth 2

Only ~25% of A and the vectors fit in fast memory


In what order should the sequential algorithm process a general sparse matrix?

  • For a band matrix, we obviously want to process the matrix “from left to right”, since the data we need for the next partition is already in fast memory

  • Not obvious what the best order is for a general matrix

  • We can formulate the question of finding the best order as a Traveling Salesman Problem (TSP)

    • One vertex per matrix partition

    • Weight of edge (j, k) is the memory cost of processing partition k right after partition j

    • TSP: find the lowest-cost “tour” visiting all vertices (a greedy-tour sketch follows)

What about multicore?

  • Two kinds of communication to minimize

    • Between processors on the chip

    • Between on-chip cache and off-chip DRAM

  • Use hybrid of both techniques described so far

    • Use parallel optimization so each core can work independently

    • Use sequential optimization to minimize off-chip DRAM traffic of each core



Speedups on Intel Clovertown (8 core)

Test matrices include stencils and practical matrices

See SC09 paper on bebop.cs.berkeley.edu for details



Minimizing Communication of GMRES to solve Ax=b

Standard GMRES:

   for i = 1 to k
      w = A · v(i-1)
      MGS(w, v(0), …, v(i-1))   … Modified Gram–Schmidt orthogonalization
      update v(i), H
   endfor
   solve LSQ problem with H

Sequential: #words_moved = O(k·nnz) from SpMV + O(k^2·n) from MGS

Parallel: #messages = O(k) from SpMV + O(k^2·log p) from MGS

Communication-Avoiding GMRES … CA-GMRES:

   W = [ v, Av, A^2v, …, A^kv ]
   [Q, R] = TSQR(W)   … “Tall Skinny QR”
   build H from R, solve LSQ problem

Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR

Parallel: #messages = O(1) from computing W + O(log p) from TSQR

  • GMRES: find x in span{b, Ab, …, A^kb} minimizing ||Ax − b||_2

  • Cost of k steps of standard GMRES vs. new GMRES (a sketch of the new organization follows)

  • Oops – W from the power method, precision lost!
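A minimal NumPy sketch of the CA-GMRES building blocks above, under stated assumptions (illustrative code, not the lecture's or the referenced thesis's implementation): the monomial-basis Akx kernel builds W, an ordinary QR stands in for TSQR, and H is recovered from R using the identity A·W[:, :k] = W[:, 1:].

import numpy as np

def ca_gmres_setup(A, b, k):
    n = b.size
    v = b / np.linalg.norm(b)
    W = np.empty((n, k + 1))
    W[:, 0] = v
    for j in range(k):                     # the Akx kernel (monomial basis)
        W[:, j + 1] = A @ W[:, j]
    Q, R = np.linalg.qr(W)                 # TSQR would be used in parallel
    # A @ W[:, :k] = W[:, 1:]  =>  A @ Q[:, :k] = Q @ H with
    # H = R[:, 1:] @ inv(R[:k, :k]), a (k+1) x k upper Hessenberg matrix.
    H = R[:, 1:] @ np.linalg.inv(R[:k, :k])
    return Q, H

# Tiny check of the Arnoldi-like relation on a well-conditioned A and small k.
rng = np.random.default_rng(0)
A = np.diag(np.linspace(1.0, 2.0, 50)) + 0.01 * rng.standard_normal((50, 50))
b = rng.standard_normal(50)
k = 5
Q, H = ca_gmres_setup(A, b, k)
print(np.linalg.norm(A @ Q[:, :k] - Q @ H))  # small while W is well conditioned

The printed residual of the Arnoldi-like relation stays small only while the monomial basis is well conditioned; as k grows, the basis degenerates toward the dominant eigenvector, which is exactly the precision loss flagged above and addressed below.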




How to make CA-GMRES Stable?

  • Use a different polynomial basis for the Krylov subspace

  • Not the monomial basis W = [v, Av, A^2v, …]; instead [v, p_1(A)v, p_2(A)v, …]

  • Possible choices:

    • Newton basis W_N = [v, (A − θ_1 I)v, (A − θ_2 I)(A − θ_1 I)v, …], where the shifts θ_i are chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so “free”)

    • Chebyshev basis W_C = [v, T_1(A)v, T_2(A)v, …], where T_k(z) is chosen to be small on a region of the complex plane containing the large eigenvalues (a sketch of the Newton basis follows)

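A sketch of the Newton basis above, assuming Python; the shifts in the example are crude, equally spaced stand-ins for the Ritz values the slide calls for, so the printed numbers only illustrate the qualitative effect.

import numpy as np

def newton_basis(A, v, shifts):
    """Return [v, (A - th_1 I)v, (A - th_2 I)(A - th_1 I)v, ...]."""
    n, k = v.size, len(shifts)
    W = np.empty((n, k + 1))
    W[:, 0] = v
    for j, theta in enumerate(shifts):
        # still one SpMV per column, so the Akx communication pattern is unchanged
        W[:, j + 1] = A @ W[:, j] - theta * W[:, j]
    return W

rng = np.random.default_rng(1)
A = np.diag(np.linspace(1.0, 100.0, 200))     # wide, known spectrum
v = rng.standard_normal(200)
v /= np.linalg.norm(v)
k = 8
theta = np.linspace(1.0, 100.0, k)            # crude stand-ins for Ritz values
W_mono = newton_basis(A, v, np.zeros(k))      # zero shifts = monomial basis
W_newt = newton_basis(A, v, theta)
print(np.linalg.cond(W_mono), np.linalg.cond(W_newt))
# the shifted basis is expected to be much better conditioned than the monomial one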


“Monomial” basis [Ax, …, A^kx] fails to converge

Newton polynomial basis does converge


Speedups on 8-core Clovertown


Summary of what is known (1/3), open

  • GMRES

    • Can independently choose k to optimize speed, restart length r to optimize convergence

    • Need to “co-tune” Akx kernel and TSQR

    • Know how to use more stable polynomial bases

    • Proven speedups

  • Can similarly reorganize other Krylov methods

    • Arnoldi and Lanczos, for Ax = λx and for Ax = λMx

    • Conjugate Gradients, for Ax = b

    • Biconjugate Gradients, for Ax=b

    • BiCGStab?

    • Other Krylov methods?



Summary of what is known (2/3), open

  • Preconditioning: MAx = Mb

    • Need different communication-avoiding kernel: [x,Ax,MAx,AMAx,MAMAx,AMAMAx,…]

    • Can think of this as union of two of the earlier kernels

      [x, (MA)x, (MA)^2x, …, (MA)^kx] and [x, Ax, (AM)Ax, (AM)^2Ax, …, (AM)^kAx] (see the sketch after this slide)

    • For which preconditioners M can we minimize communication?

      • Easy: diagonal M

      • How about block diagonal M?

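A sketch of the kernel named above, assuming Python/SciPy (illustrative code; the block Jacobi M built here is just a stand-in for “block diagonal M”):

# Build [x, Ax, MAx, AMAx, MAMAx, ...], alternating applications of A and M.
import numpy as np
import scipy.sparse as sp

def preconditioned_kernel(A, M, x, k):
    """Return [x, Ax, MAx, AMAx, ...] with 2k vectors after x."""
    vecs = [x]
    for i in range(2 * k):
        op = A if i % 2 == 0 else M
        vecs.append(op @ vecs[-1])
    return np.column_stack(vecs)

# Example with the structure from the next slides: A tridiagonal, M block diagonal.
n, blk = 12, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
M = sp.block_diag([np.linalg.inv(A[i:i + blk, i:i + blk].toarray())
                   for i in range(0, n, blk)], format="csr")
V = preconditioned_kernel(A, M, np.ones(n), k=3)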


Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal

[A sequence of figure slides building up the dependency pattern of this kernel.]


Summary of what is known (3/3), open

[Figure: compute V·x_j on proc j, send it to proc i, and reconstruct y_i there.]

  • Preconditioning: MAx = Mb

    • Need different communication-avoiding kernel: [x,Ax,MAx,AMAx,MAMAx,AMAMAx,…]

    • For block diagonal M, matrix powers rapidly become dense, but the ranks of the off-diagonal blocks grow slowly

    • Can take advantage of low rank to minimize communication

      • y_i = ((AM)^k)_{ij} · x_j = (U · V) · x_j = U · (V · x_j)  (a sketch follows)

  • Works (in principle) for Hierarchically Semi-Separable M

  • How does it work in practice?

  • For details (and open problems) see M. Hoemmen’s PhD thesis

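A sketch of that low-rank trick, assuming Python/NumPy (illustrative: a truncated SVD stands in for whatever low-rank factorization of the off-diagonal block is actually used):

# If the off-diagonal block B = ((AM)^k)_{ij} has low rank, factor it once as
# B ~ U @ V; then proc j computes the short vector V @ x_j and sends only that,
# and proc i finishes with U @ (V @ x_j), instead of sending all of x_j.
import numpy as np

def low_rank_factor(B, tol=1e-10):
    """Truncated SVD factorization B ~ U @ V, rank chosen by a relative tolerance."""
    u, s, vt = np.linalg.svd(B, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))
    return u[:, :r] * s[:r], vt[:r]        # U is m x r, V is r x n

rng = np.random.default_rng(2)
m, n, true_rank = 300, 300, 4
B = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
xj = rng.standard_normal(n)                # the piece of x owned by "proc j"

U, V = low_rank_factor(B)                  # factor the off-diagonal block once
msg = V @ xj                               # length r: all that proc j must send
yi = U @ msg                               # proc i reconstructs its contribution
print(msg.size, np.allclose(yi, B @ xj))   # 4 True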


Of things not said (much about) …

    • Other Communication-Avoiding (CA) dense factorizations

      • QR with column pivoting (tournament pivoting)

      • Cholesky with diagonal pivoting

      • GE with “complete pivoting”

      • LDL^T? Maybe with complete pivoting?

    • CA-Sparse factorizations

      • Cholesky, assuming good separators

      • Lower bounds from Lipton/Rose/Tarjan

      • Matrix multiplication, anything else?

    • Strassen, and all linear algebra with

      #words_moved = O(n^ω / M^(ω/2−1))

      • Parallel?

    • Extending CA-Krylov methods to “bottom solver” in multigrid with Adaptive Mesh Refinement (AMR)



Summary

    Time to redesign all dense and sparse linear algebra

    Don’t Communic…



Extra slides
