
Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/~demmel

Optimizing Krylov Subspace Methods

Jim Demmel

EECS & Math Departments, UC Berkeley

demmel@cs.berkeley.edu

Outline of Lecture 8: Optimizing Krylov Subspace Methods
  • k steps of a typical iterative solver for Ax=b or Ax=λx
    • Does k SpMVs with a starting vector (e.g., with b, if solving Ax=b); a minimal sketch of this conventional pattern follows this outline
    • Finds the “best” solution among all linear combinations of these k+1 vectors
    • Many such “Krylov Subspace Methods”
      • Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
  • Goal: minimize communication in Krylov Subspace Methods
    • Assume the matrix is “well-partitioned,” with a modest surface-to-volume ratio
    • Parallel implementation
      • Conventional: O(k log p) messages, because of k calls to SpMV
      • New: O(log p) messages – optimal
    • Serial implementation
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
  • Lots of speedup possible (modeled and measured)
    • Price: some redundant computation
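For concreteness, here is a minimal Python/NumPy sketch (an illustration, not the lecture's code) of the conventional pattern above: k separate SpMV calls, which is exactly what costs O(k log p) messages in parallel, or k passes over A sequentially.

import numpy as np
import scipy.sparse as sp

def krylov_basis_naive(A, b, k):
    # Conventional kernel: k separate sparse matrix-vector products.
    # In parallel, each SpMV is a communication round; sequentially,
    # each SpMV streams A from slow memory again.
    V = [b / np.linalg.norm(b)]
    for _ in range(k):
        V.append(A @ V[-1])
    return np.column_stack(V)   # n x (k+1) basis [v, Av, ..., A^k v]

# Example: a 1D Poisson (tridiagonal) matrix, assumed here for illustration.
n, k = 10, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
V = krylov_basis_naive(A, np.ones(n), k)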


Locally Dependent Entries for [x, Ax], [x, Ax, A^2 x], …, [x, Ax, …, A^8 x]; A tridiagonal, 2 processors

(Figure sequence: the vectors x, Ax, A^2 x, …, A^8 x drawn over the index ranges owned by Proc 1 and Proc 2, with the locally dependent entries highlighted as k grows: k = 1, 2, 3, 4, 8.)

These entries can be computed without communication.

k = 8-fold reuse of A.

Remotely Dependent Entries for [x, Ax, …, A^8 x], A tridiagonal, 2 processors

(Figure: the same picture, now highlighting the entries near the processor boundary that depend on the other processor’s data.)

One message to get data needed to compute remotely dependent entries, not k=8

Minimizes number of messages = latency cost

Price: redundant work ∝ “surface/volume ratio”
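The following Python sketch (an illustration, not the lecture's implementation) mimics the picture above for a tridiagonal A: each "processor" copies a halo of width k around its block of x once, then computes its rows of [x, Ax, …, A^k x] with no further communication, redoing a little work near the boundary.

import numpy as np
import scipy.sparse as sp

def local_matrix_powers(A, x, k, s, e):
    # Rows s..e-1 of [x, Ax, ..., A^k x] for a tridiagonal A, using only
    # entries of x within a halo of width k around the local block [s, e).
    # In a distributed code the halo (and the matching rows of A) would
    # arrive in a single message instead of k messages.
    n = A.shape[0]
    lo, hi = max(s - k, 0), min(e + k, n)
    v = x[lo:hi].copy()                 # local block of x plus its halo
    out = [x[s:e].copy()]
    for step in range(1, k + 1):
        # After `step` multiplications, one fewer halo entry per side is valid.
        nlo, nhi = max(s - (k - step), 0), min(e + (k - step), n)
        v = A[nlo:nhi, lo:hi] @ v       # the redundant work happens near the boundary
        lo, hi = nlo, nhi
        out.append(v[s - lo:e - lo].copy())
    return out

# Two "processors", each owning half of a tridiagonal A (assumed setup):
n, k = 16, 3
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
x = np.arange(n, dtype=float)
left  = local_matrix_powers(A, x, k, 0, 8)
right = local_matrix_powers(A, x, k, 8, 16)
# Concatenating left[i] and right[i] reproduces A^i x for i = 0..k.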


Spyplot of A with the sparsity pattern of a 5-point stencil, natural order

Spyplot of A with the sparsity pattern of a 5-point stencil, natural order, assigned to p = 9 processors

For an n x n mesh assigned to p processors, each processor needs 2n data from its neighbors.

Spyplot of A with the sparsity pattern of a 5-point stencil, nested dissection order

For an n x n mesh assigned to p processors, each processor needs 4n/p^(1/2) data from its neighbors.

Remotely Dependent Entries for [x, Ax, …, A^3 x], A with the sparsity pattern of a 5-point stencil

(Shown on two figure slides.)

Remotely Dependent Entries for [x, Ax, A^2 x, A^3 x], A irregular, multiple processors

(Figure.)


Summary of Parallel Optimizations so far, for the “Akx kernel”: (A, x, k) → [Ax, A^2 x, …, A^k x]
  • Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally; #flops grows by a factor 1 + O(k/(n/p)) = 1 + O(k/data_per_proc) ≈ 1 when k ≪ data_per_proc (a small numeric example follows this list)
  • Pretty good: 2D mesh (n x n): need O(n/p^(1/2)) data from neighbors, versus O(n^2/p) locally; #flops grows by 1 + O(k/(data_per_proc)^(1/2))
  • Less good: 3D mesh (n x n x n): need O(n^2/p^(2/3)) data from neighbors, versus O(n^3/p) locally; #flops grows by 1 + O(k/(data_per_proc)^(1/3))
  • Possible to reduce #messages from O(k) to O(1)
  • Depends on the matrix being “well-partitioned”
    • Each processor only needs data from a few neighbors
  • Price: redundant computation of some entries of A^j x
    • The amount of redundant work depends on the “surface-to-volume ratio” of the partition: ideally each processor needs only a little data from its neighbors, compared to what it owns itself
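To make the scaling above concrete, here is a throwaway helper (my own arithmetic with the constants dropped, not the lecture's model) that evaluates the redundant-flop factor 1 + O(k / data_per_proc^(1/d)) for a d-dimensional mesh.

def akx_flop_overhead(k, data_per_proc, d):
    # Redundant-flop factor for a d-dimensional mesh, constants ignored.
    return 1.0 + k / data_per_proc ** (1.0 / d)

# k = 8 with 10^6 mesh points per processor:
#   d = 1: ~1.000008    d = 2: ~1.008    d = 3: ~1.08
# so the extra arithmetic stays small as long as each processor owns plenty of data.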

Predicted performance of [Ax, A^2 x, …, A^k x] on a model Petascale machine

(Six figure slides, each a plot over k and log2 n: predicted speedup, predicted fraction of time spent doing arithmetic, and predicted ratio of extra arithmetic, first for a 2D (n x n) mesh and then for a 3D (n x n x n) mesh.)

Minimizing #messages versus #words_moved
  • Parallel case
    • Can’t reduce the #words needed from other processors
    • Required to get the right answer
  • Sequential case
    • Can reduce the #words needed from slow memory
    • Assume A and x are too large to fit in fast memory
    • A naïve implementation of [Ax, A^2 x, …, A^k x] by k calls to SpMV needs to move A and x between fast and slow memory k times
    • We will move them one time – clearly optimal (a rough count of the words moved follows below)


Sequential [x, Ax, …, A^4 x], with a memory hierarchy

(Figure.)

One read of the matrix from slow memory, not k = 4.

Minimizes #words_moved = bandwidth cost.

No redundant work.

Remotely Dependent Entries for [x, Ax, …, A^3 x], A 100 x 100 with bandwidth 2

Only ~25% of A and the vectors fit in fast memory.

In what order should the sequential algorithm process a general sparse matrix?
  • For a band matrix, we obviously want to process the matrix “from left to right,” since the data we need for the next partition is already in fast memory
  • It is not obvious what the best order is for a general matrix
  • We can formulate the question of finding the best order as a Traveling Salesman Problem (TSP)
    • One vertex per matrix partition
    • The weight of edge (j, k) is the memory cost of processing partition k right after partition j
    • TSP: find the lowest-cost “tour” visiting all vertices (a greedy-heuristic sketch follows below)
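TSP is NP-hard, so in practice a heuristic would be used. The sketch below is hypothetical (the lecture does not prescribe a particular solver, and the cost matrix is assumed as given): it orders the partitions by a greedy nearest-neighbor tour.

import numpy as np

def greedy_partition_order(cost, start=0):
    # cost[j, c] models the memory cost of processing partition c right
    # after partition j; return a greedy nearest-neighbor "tour".
    p = cost.shape[0]
    order, visited = [start], {start}
    while len(order) < p:
        j = order[-1]
        nxt = min((c for c in range(p) if c not in visited),
                  key=lambda c: cost[j, c])
        order.append(nxt)
        visited.add(nxt)
    return order

# Example with a random symmetric cost matrix for 6 partitions:
rng = np.random.default_rng(0)
C = rng.random((6, 6))
print(greedy_partition_order(C + C.T))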
What about multicore?
  • Two kinds of communication to minimize
    • Between processors on the chip
    • Between on-chip cache and off-chip DRAM
  • Use hybrid of both techniques described so far
    • Use parallel optimization so each core can work independently
    • Use sequential optimization to minimize off-chip DRAM traffic of each core



Speedups on Intel Clovertown (8 core)

Test matrices include stencils and practical matrices

See SC09 paper on bebop.cs.berkeley.edu for details


Minimizing Communication of GMRES to solve Ax=b

Standard GMRES:
  for i = 1 to k
     w = A · v(i-1)               … SpMV
     MGS(w, v(0), …, v(i-1))      … Modified Gram–Schmidt orthogonalization
     update v(i), H
  endfor
  solve least-squares problem with H

  Sequential: #words_moved = O(k · nnz) from SpMV + O(k^2 · n) from MGS
  Parallel:   #messages    = O(k) from SpMV + O(k^2 · log p) from MGS

Communication-Avoiding GMRES (CA-GMRES):
  W = [v, Av, A^2 v, …, A^k v]
  [Q, R] = TSQR(W)               … “Tall Skinny QR” (a sketch of TSQR follows below)
  build H from R, solve least-squares problem

  Sequential: #words_moved = O(nnz) from SpMV + O(k · n) from TSQR
  Parallel:   #messages    = O(1) from computing W + O(log p) from TSQR

  • GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax − b||_2
  • Compare the cost of k steps of standard GMRES with the new GMRES above
  • Oops – W is built like the power method, so precision is lost!
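Below is a minimal sequential sketch of the TSQR idea (an illustration, not the lecture's implementation): factor each block of rows independently, then combine the small R factors pairwise; in parallel the combine tree takes O(log p) messages. Only the R factor is returned here.

import numpy as np

def tsqr_r(W, block_rows):
    # R factor of a tall-skinny W via a reduction tree of local QRs.
    blocks = [W[i:i + block_rows] for i in range(0, W.shape[0], block_rows)]
    Rs = [np.linalg.qr(B, mode="r") for B in blocks]       # local QRs: no communication
    while len(Rs) > 1:                                      # log(#blocks) combine levels
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode="r")
              for i in range(0, len(Rs), 2)]
    return Rs[0]

# W with many more rows than columns, e.g. the basis produced by the Akx kernel:
W = np.random.default_rng(1).random((1024, 5))
R = tsqr_r(W, 128)
# R matches np.linalg.qr(W, mode="r") up to the signs of its rows.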


How to make CA-GMRES Stable?
  • Use a different polynomial basis for the Krylov subspace
  • Not the monomial basis W = [v, Av, A^2 v, …], but instead [v, p1(A)v, p2(A)v, …]
  • Possible choices:
    • Newton basis WN = [v, (A − θ1 I)v, (A − θ2 I)(A − θ1 I)v, …], where the shifts θi are chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so “free”) – a sketch of this basis follows below
    • Chebyshev basis WC = [v, T1(A)v, T2(A)v, …], where the Ti(z) are chosen to be small on a region of the complex plane containing the large eigenvalues
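A minimal sketch of the Newton basis (illustrative only; the shifts are taken as an input here, whereas the lecture obtains them as approximate eigenvalues from Arnoldi):

import numpy as np

def newton_basis(A, v, shifts):
    # W = [v, (A - th1*I)v, (A - th2*I)(A - th1*I)v, ...]
    W = [v]
    for theta in shifts:
        W.append(A @ W[-1] - theta * W[-1])   # multiply the last vector by (A - theta*I)
    return np.column_stack(W)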

The “monomial” basis [Ax, …, A^k x] fails to converge, while the Newton polynomial basis does converge.

(Figure: convergence comparison of the two bases.)

Summary of what is known (1/3), open
  • GMRES
    • Can independently choose k to optimize speed, restart length r to optimize convergence
    • Need to “co-tune” Akx kernel and TSQR
    • Know how to use more stable polynomial bases
    • Proven speedups
  • Can similarly reorganize other Krylov methods
    • Arnoldi and Lanczos, for Ax = λx and for Ax = λMx
    • Conjugate Gradients, for Ax = b
    • Biconjugate Gradients, for Ax=b
    • BiCGStab?
    • Other Krylov methods?


Summary of what is known (2/3), open
  • Preconditioning: MAx = Mb
    • Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …] (a sketch follows below)
    • Can think of this as the union of two of the earlier kernels:
      [x, (MA)x, (MA)^2 x, …, (MA)^k x] and [x, Ax, (AM)Ax, (AM)^2 Ax, …, (AM)^k Ax]
    • For which preconditioners M can we minimize communication?
      • Easy: diagonal M
      • How about block-diagonal M?
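A sketch of the preconditioned kernel as a plain alternating product (illustrative; here A and M are just operators supporting "@", and the communication structure is not modeled):

import numpy as np

def precond_powers(A, M, x, k):
    # [x, Ax, MAx, AMAx, MAMAx, AMAMAx, ...]:
    # alternate applications of A and M, 2k vectors beyond x.
    vecs = [x]
    for i in range(2 * k):
        op = A if i % 2 == 0 else M
        vecs.append(op @ vecs[-1])
    return np.column_stack(vecs)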


Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal

(A sequence of five figure slides with this title.)

Summary of what is known (3/3), open
  • Preconditioning: MAx = Mb
    • Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …]
    • For block-diagonal M, the matrix powers rapidly become dense, but the ranks of the off-diagonal blocks grow slowly
    • Can take advantage of the low rank to minimize communication (see the sketch after this list):
        • y_i = ((AM)^k)_ij · x_j = U Σ V · x_j = U · (Σ · (V · x_j))
        • (Figure: compute Σ · (V · x_j) on proc j and send it to proc i; reconstruct y_i on proc i.)
    • Works (in principle) for Hierarchically Semi-Separable M
    • How does it work in practice?
  • For details (and open problems) see M. Hoemmen’s PhD thesis
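The low-rank trick above, written out as a tiny sketch (hypothetical names; U, S, V stand for an assumed rank-r factorization of the off-diagonal block of (AM)^k):

import numpy as np

def apply_low_rank_block(U, S, V, x_j):
    # If the off-diagonal block is B = U @ diag(S) @ V (rank r), then
    # y_i = B @ x_j = U @ (S * (V @ x_j)); only the length-r vector
    # S * (V @ x_j) needs to travel from proc j to proc i, not x_j or B.
    t = S * (V @ x_j)    # computed on "proc j": r words to send
    return U @ t         # reconstructed on "proc i"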


Of things not said (much about) …
  • Other Communication-Avoiding (CA) dense factorizations
    • QR with column pivoting (tournament pivoting)
    • Cholesky with diagonal pivoting
    • GE with “complete pivoting”
    • LDL^T? Maybe with complete pivoting?
  • CA sparse factorizations
    • Cholesky, assuming good separators
    • Lower bounds from Lipton/Rose/Tarjan
    • Matrix multiplication, anything else?
  • Strassen, and all linear algebra with #words_moved = O(n^ω / M^(ω/2 − 1))
    • Parallel?
  • Extending CA-Krylov methods to the “bottom solver” in multigrid with Adaptive Mesh Refinement (AMR)

Summary

Time to redesign all dense and sparse linear algebra

Don’t Communic…


Extra slides
