
Numerical Linear Algebra: An Introduction



Presentation Transcript


1. Numerical Linear Algebra: An Introduction

2. Levels of multiplication
  • vector-vector: a[i]*b[i]
  • matrix-vector: A[i,j]*b[j]
  • matrix-matrix: A[i,k]*B[k,j]
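For reference, minimal C sketches of the first two levels (the matrix-matrix loop nest follows on the next slide); the function names and the row-major layout are assumptions of the sketch, not from the slides:

  /* Level 1 (vector-vector): dot product s = sum_i a[i]*b[i] */
  double dot(int n, const double a[], const double b[]) {
      double s = 0.0;
      for (int i = 0; i < n; i++)
          s += a[i] * b[i];
      return s;
  }

  /* Level 2 (matrix-vector): y = A*b, with A stored row-major as A[i*n + j] */
  void matvec(int n, const double A[], const double b[], double y[]) {
      for (int i = 0; i < n; i++) {
          y[i] = 0.0;
          for (int j = 0; j < n; j++)
              y[i] += A[i*n + j] * b[j];
      }
  }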

3. Matrix-Matrix
  for (i=0; i<n; i++)
    for (j=0; j<n; j++) {
      C[i,j] = 0.0;
      for (k=0; k<n; k++)
        C[i,j] = C[i,j] + A[i,k]*B[k,j];
    }
  Note: O(n^3) work

4. Block matrix
  • Serial version - same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication
  • Let n be a power of two and generate a recursive algorithm; it terminates with an explicit formula for the elementary 2x2 multiplications. Allows for parallelism. With Strassen's formulas (7 multiplications per 2x2 block instead of 8) the work drops to O(n^2.8); see the sketch below.
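The seven Strassen products behind the O(n^2.8) bound, written out for a plain 2x2 multiplication; in the recursive algorithm the same formulas are applied to subblocks. An illustrative sketch, not code from the slides:

  /* Strassen's 7-multiplication formulas for C = A*B with 2x2 matrices,
     stored row-major as a[0]=a11, a[1]=a12, a[2]=a21, a[3]=a22. */
  void strassen2x2(const double a[4], const double b[4], double c[4]) {
      double m1 = (a[0] + a[3]) * (b[0] + b[3]);
      double m2 = (a[2] + a[3]) *  b[0];
      double m3 =  a[0]         * (b[1] - b[3]);
      double m4 =  a[3]         * (b[2] - b[0]);
      double m5 = (a[0] + a[1]) *  b[3];
      double m6 = (a[2] - a[0]) * (b[0] + b[1]);
      double m7 = (a[1] - a[3]) * (b[2] + b[3]);
      c[0] = m1 + m4 - m5 + m7;   /* c11 */
      c[1] = m3 + m5;             /* c12 */
      c[2] = m2 + m4;             /* c21 */
      c[3] = m1 - m2 + m3 + m6;   /* c22 */
  }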

5. Pipeline method
  • Pump data left to right and top to bottom
    recv(&A, P[i,j-1]);
    recv(&B, P[i-1,j]);
    C = C + A*B;
    send(&A, P[i,j+1]);
    send(&B, P[i+1,j]);
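A minimal MPI sketch of one pipeline step for an interior process, assuming a q-by-q process grid with rank = i*q + j and b-by-b row-major blocks; the rank mapping is an assumption, not from the slides, and edge processes (i = 0 or j = 0) would instead source A and B from the input streams:

  #include <mpi.h>

  /* One pipeline step on process P[i,j]: receive blocks from the west and
     north neighbors, accumulate C = C + A*B, pass the blocks on east/south. */
  void pipeline_step(double *A, double *B, double *C, int b,
                     int i, int j, int q, MPI_Comm comm)
  {
      MPI_Status st;
      int left  = i*q + (j - 1), right = i*q + (j + 1);
      int up    = (i - 1)*q + j, down  = (i + 1)*q + j;

      MPI_Recv(A, b*b, MPI_DOUBLE, left, 0, comm, &st);   /* A block from the west  */
      MPI_Recv(B, b*b, MPI_DOUBLE, up,   1, comm, &st);   /* B block from the north */

      for (int r = 0; r < b; r++)                         /* local C = C + A*B      */
          for (int c = 0; c < b; c++)
              for (int k = 0; k < b; k++)
                  C[r*b + c] += A[r*b + k] * B[k*b + c];

      MPI_Send(A, b*b, MPI_DOUBLE, right, 0, comm);       /* pass A on to the east  */
      MPI_Send(B, b*b, MPI_DOUBLE, down,  1, comm);       /* pass B on to the south */
  }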

6. Pipeline method
            B[*,0]  B[*,1]  B[*,2]  B[*,3]
  A[0,*]    C[0,0]  C[0,1]  C[0,2]  C[0,3]
  A[1,*]    C[1,0]  C[1,1]  C[1,2]  C[1,3]
  A[2,*]    C[2,0]  C[2,1]  C[2,2]  C[2,3]
  A[3,*]    C[3,0]  C[3,1]  C[3,2]  C[3,3]
  (columns of B are pumped in from the top, rows of A from the left)

7. Pipeline method
  • A similar method works for matrix-vector multiplication, but you lose some of the cache reuse.

8. A sense of speed – vector ops

9. A sense of speed – vector ops

10. Observations
  • Simple do loops are not effective
  • Cache and the memory hierarchy are the bottlenecks
  • For better performance:
    • combine loops
    • minimize memory transfer

11. LINPACK
  • a library of subroutines to solve linear algebra problems
  • example – LU decomposition and system solve (dgefa and dgesl, respectively)
  • in turn, built on the BLAS
  • see netlib.org
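As an illustration, solving A*x = b with these two routines; this sketch assumes the classic Fortran LINPACK interface linked with trailing-underscore names (a platform-dependent assumption), with A stored column-major:

  #include <stdlib.h>

  /* dgefa factors A = L*U with partial pivoting; dgesl then solves A*x = b. */
  extern void dgefa_(double *a, int *lda, int *n, int *ipvt, int *info);
  extern void dgesl_(double *a, int *lda, int *n, int *ipvt, double *b, int *job);

  void solve(int n, double *a /* n*n, column-major */, double *b /* length n */)
  {
      int lda = n, info = 0, job = 0;          /* job = 0: solve A*x = b        */
      int *ipvt = malloc(n * sizeof(int));     /* pivot indices from the factor */
      dgefa_(a, &lda, &n, ipvt, &info);        /* factor; info != 0 => singular */
      if (info == 0)
          dgesl_(a, &lda, &n, ipvt, b, &job);  /* overwrite b with the solution */
      free(ipvt);
  }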

12. BLAS - Basic Linear Algebra Subprograms
  • a library of subroutines designed to provide efficient computation of commonly used linear algebra operations, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
  • The naming convention is not unlike other libraries - the first letter indicates precision, the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.
  • The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up is usually in level 3.

13. BLAS • Level 1

14. BLAS • Level 2

15. BLAS • Level 3
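One representative call from each level, using the CBLAS C interface; the array names and sizes are made up for the example:

  #include <cblas.h>

  /* A, B, C are n-by-n row-major matrices; x, y are n-vectors. */
  void blas_levels_demo(int n, double *x, double *y, double *A, double *B, double *C)
  {
      /* Level 1: y = 2.0*x + y            (DAXPY) */
      cblas_daxpy(n, 2.0, x, 1, y, 1);

      /* Level 2: y = 1.0*A*x + 0.0*y      (DGEMV) */
      cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 0.0, y, 1);

      /* Level 3: C = 1.0*A*B + 0.0*C      (DGEMM) */
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n, 1.0, A, n, B, n, 0.0, C, n);
  }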

16. How efficient is the BLAS?
                      load/store     float ops    refs/ops
  level 1   SAXPY     3N             2N           3/2
  level 2   SGEMV     MN+N+2M        2MN          1/2
  level 3   SGEMM     2MN+MK+KN      2MNK         2/N

17. Matrix-vector
  read x(1:n) into fast memory
  read y(1:n) into fast memory
  for i = 1:n
    read row i of A into fast memory
    for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
  write y(1:n) back to slow memory

18. Matrix-vector
  • m = # slow memory refs = n^2 + 3n
  • f = # arithmetic ops = 2n^2
  • q = f/m ~ 2
  • Matrix-vector multiply is limited by slow memory

19. Matrix-matrix

20. Matrix Multiply - unblocked
  for i = 1 to n
    read row i of A into fast memory
    for j = 1 to n
      read C(i,j) into fast memory
      read column j of B into fast memory
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      write C(i,j) back to slow memory

21. Matrix Multiply - unblocked
  Number of slow memory references on unblocked matrix multiply:
    m = n^3       read each column of B n times
      + n^2       read each row of A once
      + 2*n^2     read and write each element of C once
      = n^3 + 3*n^2
  So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n - no improvement over matrix-vector multiply

22. Matrix Multiply - blocked
  Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize
  for i = 1 to N
    for j = 1 to N
      read block C(i,j) into fast memory
      for k = 1 to N
        read block A(i,k) into fast memory
        read block B(k,j) into fast memory
        C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      write block C(i,j) back to slow memory
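A C rendering of the blocked loop nest, assuming row-major storage and that n is a multiple of the blocksize b (both are assumptions of the sketch):

  /* Blocked matrix multiply: C = C + A*B, with n-by-n row-major matrices
     partitioned into b-by-b blocks. */
  void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += b)            /* block row of C    */
          for (int jj = 0; jj < n; jj += b)        /* block column of C */
              for (int kk = 0; kk < n; kk += b)    /* block index k     */
                  /* multiply b-by-b blocks A(ii,kk)*B(kk,jj) into C(ii,jj) */
                  for (int i = ii; i < ii + b; i++)
                      for (int j = jj; j < jj + b; j++) {
                          double s = C[i*n + j];
                          for (int k = kk; k < kk + b; k++)
                              s += A[i*n + k] * B[k*n + j];
                          C[i*n + j] = s;
                      }
  }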

23. Matrix Multiply - blocked
  Number of slow memory references on blocked matrix multiply:
    m = N*n^2     blocks of B read N^3 times in total (N^3 * n/N * n/N)
      + N*n^2     blocks of A read N^3 times in total
      + 2*n^2     read and write each block of C once
      = (2*N + 2)*n^2
  So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n

24. Matrix Multiply - blocked
  So we can improve performance by increasing the blocksize b; this can be much faster than matrix-vector multiply (q = 2).
  Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make the blocks arbitrarily large:
    3*b^2 <= M, so q ~= b <= sqrt(M/3)
  For example, if the cache holds M = 32768 doubles (256 KB), then b can be at most about 104.

25. More on BLAS
  • Industry standard interface (evolving)
  • Vendors and others supply optimized implementations
  • History
    • BLAS1 (1970s):
      • vector operations: dot product, saxpy
      • m = 2*n, f = 2*n, q ~ 1 or less
    • BLAS2 (mid 1980s):
      • matrix-vector operations
      • m = n^2, f = 2*n^2, q ~ 2, less overhead
      • somewhat faster than BLAS1

26. More on BLAS
  • BLAS3 (late 1980s):
    • matrix-matrix operations: matrix-matrix multiply, etc.
    • m >= 4*n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
  • Good algorithms use BLAS3 when possible (LAPACK)
  • www.netlib.org/blas, www.netlib.org/lapack

27. BLAS on an IBM RS6000/590
  Performance plot (peak speed = 266 Mflops): BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors).
