
Dense Linear Algebra (Data Distributions)


Presentation Transcript


  1. Dense Linear Algebra (Data Distributions) Sathish Vadhiyar

  2. Gaussian Elimination - Review
  Version 1
  for i = 1 to n-1   /* for each column i, zero it out below the diagonal by adding multiples of row i to later rows */
    for j = i+1 to n   /* for each row j below row i, add a multiple of row i to row j */
      for k = i to n
        A(j, k) = A(j, k) - A(j, i)/A(i, i) * A(i, k)
  [Figure: pivot A(i,i), row i, and the rows j below it being zeroed; earlier columns already have zeros below the diagonal]

  3. Gaussian Elimination - Review
  Version 2 - Remove A(j, i)/A(i, i) from the inner loop
  for i = 1 to n-1   /* for each column i, zero it out below the diagonal by adding multiples of row i to later rows */
    for j = i+1 to n   /* for each row j below row i */
      m = A(j, i) / A(i, i)
      for k = i to n
        A(j, k) = A(j, k) - m * A(i, k)
  [Figure: same matrix diagram as on the previous slide]

  4. Gaussian Elimination - Review
  Version 3 - Don't compute what we already know (the entries below the diagonal become zero)
  for i = 1 to n-1
    for j = i+1 to n
      m = A(j, i) / A(i, i)
      for k = i+1 to n
        A(j, k) = A(j, k) - m * A(i, k)
  [Figure: same matrix diagram as on the previous slide]

  5. Gaussian Elimination - Review
  Version 4 - Store the multipliers m below the diagonal
  for i = 1 to n-1
    for j = i+1 to n
      A(j, i) = A(j, i) / A(i, i)
      for k = i+1 to n
        A(j, k) = A(j, k) - A(j, i) * A(i, k)
  [Figure: same matrix diagram as on the previous slide]
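  A quick way to sanity-check the progression of versions is to run the final one. Below is a minimal NumPy sketch of Version 4 (the function name ge_version4 and the in-place convention are my own, not from the slides):

    import numpy as np

    def ge_version4(A):
        # In-place GE, Version 4: multipliers stored below the diagonal.
        # Afterwards the upper triangle of A holds U, and the strict lower
        # triangle holds L (with an implicit unit diagonal). No pivoting.
        n = A.shape[0]
        for i in range(n - 1):               # for each column i
            for j in range(i + 1, n):        # for each row j below row i
                A[j, i] = A[j, i] / A[i, i]  # store the multiplier
                for k in range(i + 1, n):
                    A[j, k] -= A[j, i] * A[i, k]
        return A

    A = np.array([[4.0, 3.0], [6.0, 3.0]])
    ge_version4(A)  # A becomes [[4, 3], [1.5, -1.5]]: L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]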

  6. GE - Runtime
  • Divisions: 1 + 2 + 3 + ... + (n-1) ≈ n²/2
  • Multiplications and subtractions: 1² + 2² + 3² + ... + (n-1)² = n³/3 - n²/2 (approx.) of each kind
  • Total: ≈ 2n³/3 flops (n³/3 multiplications plus n³/3 subtractions)
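  These counts follow directly from the loop bounds of Version 4 and can be checked numerically (a throwaway sketch, not from the slides):

    # At step i there are (n-i) rows below the pivot, each needing one
    # division and (n-i) multiply-subtract pairs on the trailing columns.
    n = 100
    divisions = sum(n - i for i in range(1, n))              # 1 + 2 + ... + (n-1)
    mult_sub_pairs = sum((n - i) ** 2 for i in range(1, n))  # 1^2 + ... + (n-1)^2
    print(divisions, n**2 / 2)               # ~ n^2/2
    print(2 * mult_sub_pairs, 2 * n**3 / 3)  # ~ 2n^3/3 flops in total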

  7. Parallel GE
  • 1st step - 1-D block partitioning: the n columns are divided among the p processors in contiguous blocks
  [Figure: same matrix diagram as on the previous slides]

  8. 1-D block partitioning - Steps
  1. Divisions: n²/2
  2. Broadcasts of multipliers: x·log(p) + y·log(p-1) + z·log(p-2) + ... + log(1) < n²·log(p)
  3. Multiplications and subtractions: (n-1)·n/p + (n-2)·n/p + ... + 1·1 ≈ n³/p
  Runtime: < n²/2 + n²·log(p) + n³/p
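  Putting the three terms into one rough cost model (my own hedged sketch, assuming unit-cost flops and log-cost broadcasts; network constants are ignored):

    import math

    def ge_1d_block_cost(n, p):
        divisions = n**2 / 2               # pivot-column divisions, not parallelized
        broadcasts = n**2 * math.log2(p)   # multiplier broadcasts, upper bound
        updates = n**3 / p                 # trailing-matrix multiply/subtracts
        return divisions + broadcasts + updates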

  9. 2-D block
  • To speed up the divisions, partition the matrix over a P x Q grid of processors
  [Figure: the same matrix diagram, with the processor grid dimensions P (rows) and Q (columns) marked]

  10. 2-D block partitioning - Steps
  1. Broadcast of A(k, k): log(Q) per step
  2. Divisions: n²/Q (approx.)
  3. Broadcast of multipliers: x·log(P) + y·log(P-1) + z·log(P-2) + ... ≈ (n²/Q)·log(P)
  4. Multiplications and subtractions: n³/(P·Q) (approx.)
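  The same kind of hedged cost model for the 2-D case (again unit costs, for a P x Q process grid):

    import math

    def ge_2d_block_cost(n, P, Q):
        bcast_pivot = n * math.log2(Q)          # log(Q) broadcast of A(k,k) per step
        divisions = n**2 / Q                    # divisions now shared by Q processes
        bcast_mult = (n**2 / Q) * math.log2(P)  # multiplier broadcasts
        updates = n**3 / (P * Q)                # trailing-matrix updates
        return bcast_pivot + divisions + bcast_mult + updates

  Comparing the two models shows where 2-D partitioning wins: the division and broadcast terms shrink by a factor of Q, and the update term by the full process count P·Q.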

  11. Problem with block partitioning for GE
  • Once a block is finished, the corresponding processor remains idle for the rest of the execution
  • Solution?

  12. Onto cyclic
  • The block partitioning algorithms waste processor cycles: there is no load balancing throughout the algorithm.
  • Onto cyclic:
  [Figure: 13 columns assigned to 4 processors cyclically - 0 1 2 3 0 1 2 3 0 1 2 3 0 - alongside 1-D block-cyclic and 2-D block-cyclic layouts]
  • Cyclic: load balance
  • 1-D block-cyclic: load balance and block operations, but column factorization is a bottleneck
  • 2-D block-cyclic: has everything
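  The three distributions are easiest to see as owner-computes maps from a global column index to a process rank (a small sketch, not from the slides; 0-based indices):

    def owner_cyclic(g, p):             # cyclic: round-robin by single column
        return g % p

    def owner_block(g, n, p):           # 1-D block: contiguous chunks of ceil(n/p)
        return g // -(-n // p)

    def owner_block_cyclic(g, p, b):    # block-cyclic: round-robin by blocks of b
        return (g // b) % p

    print([owner_cyclic(g, 4) for g in range(13)])
    # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0] - the cyclic picture above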

  13. Block cyclic
  • Having blocks in a processor enables block-based operations (block matrix multiply etc.)
  • Block-based operations lead to high performance
  • Operations can be split into 3 categories based on the number of operations per memory reference
  • These categories are referred to as BLAS levels

  14. Basic Linear Algebra Subroutines (BLAS) - 3 levels of operations
  • The memory hierarchy is exploited most efficiently by the higher-level BLAS
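  As standard examples of the three levels (these particular examples are mine, not from the slide), with their flops-to-memory-reference ratios:

    import numpy as np

    n = 1000
    x, y = np.random.rand(n), np.random.rand(n)
    A, B = np.random.rand(n, n), np.random.rand(n, n)

    y = 2.0 * x + y   # BLAS1 (axpy): ~2n   flops on ~3n   data -> O(1) flops per reference
    v = A @ x         # BLAS2 (gemv): ~2n^2 flops on ~n^2  data -> O(1) flops per reference
    C = A @ B         # BLAS3 (gemm): ~2n^3 flops on ~3n^2 data -> O(n) flops per reference

  Only BLAS3 reuses each matrix entry O(n) times, which is why algorithms that spend their time in matrix multiply can exploit caches and run near peak speed.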

  15. Gaussian Elimination - Review
  Version 4 - Store the multipliers m below the diagonal
  for i = 1 to n-1
    for j = i+1 to n
      A(j, i) = A(j, i) / A(i, i)
      for k = i+1 to n
        A(j, k) = A(j, k) - A(j, i) * A(i, k)
  What GE really computes:
  for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) / A(i, i)
    A(i+1:n, i+1:n) -= A(i+1:n, i) * A(i, i+1:n)
  These are BLAS1 and BLAS2 operations.
  [Figure: finished multipliers below the diagonal; the pivot A(i,i), the row A(i, i+1:n), the column A(i+1:n, i), and the trailing matrix A(i+1:n, i+1:n)]
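  In NumPy the vectorized form above is two lines per step (a sketch of the same BLAS1/BLAS2 formulation; the function name ge_blas2 is mine):

    import numpy as np

    def ge_blas2(A):
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                              # BLAS1: scale the pivot column
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])  # BLAS2: rank-1 update
        return A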

  16. Converting BLAS2 to BLAS3
  • Use blocking to get optimized matrix multiplies (BLAS3)
  • Matrix multiplies arise from delayed updates:
  • Save up several updates to the trailing matrix
  • Apply the saved updates at once in the form of a matrix multiply

  17. Modified GE using BLAS3 (Courtesy: Dr. Jack Dongarra)
  for ib = 1 to n-1 step b  /* process matrix b columns at a time */
    end = ib + b - 1
    /* Apply the BLAS2 version of GE to get A(ib:n, ib:end) factored.
       Let LL denote the strictly lower triangular portion of A(ib:end, ib:end) + I */
    A(ib:end, end+1:n) = LL⁻¹ * A(ib:end, end+1:n)  /* update next b rows of U */
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
      /* apply the delayed updates with a single matrix multiply */
  [Figure: completed parts of L and U, the b-column panel A(ib:n, ib:end), the row block A(ib:end, end+1:n), and the trailing matrix A(end+1:n, end+1:n)]
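  A runnable NumPy sketch of the blocked algorithm (no pivoting; SciPy's solve_triangular with unit_diagonal=True plays the role of LL⁻¹; the name lu_blocked is mine):

    import numpy as np
    from scipy.linalg import solve_triangular

    def lu_blocked(A, b):
        n = A.shape[0]
        for ib in range(0, n - 1, b):
            end = min(ib + b, n)
            for i in range(ib, end):  # BLAS2 GE on the panel A[ib:n, ib:end]
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                # Update the next b rows of U: solve LL * X = A[ib:end, end:n]
                A[ib:end, end:] = solve_triangular(A[ib:end, ib:end], A[ib:end, end:],
                                                   lower=True, unit_diagonal=True)
                # Apply the delayed updates with a single matrix multiply (BLAS3)
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return A

  The panel loop is exactly ge_blas2 from the earlier sketch restricted to b columns; everything to the right of the panel is touched only by the triangular solve and the one matrix multiply.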

  18. GE: Miscellaneous - GE with Partial Pivoting
  • 1-D block-column partitioning: which is better, column pivoting or row pivoting?
  • Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1)
  • The exchange information is passed to the other processes by piggybacking on the multiplier information
  • Row pivoting involves a distributed search and exchange: O(n/P) + O(log P)
  • 2-D block partitioning: the pivot search can be restricted to a limited number of columns
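  For reference, one partial-pivoting step in the serial setting (a sketch; under the 1-D column layout this search is local to the processor that owns column i):

    import numpy as np

    def pivot_step(A, i):
        # Search column i at and below the diagonal for the largest |entry|,
        # then swap that row with row i. Returns the pivot row so it can be
        # recorded (or piggybacked with the multipliers, as above).
        p = i + int(np.argmax(np.abs(A[i:, i])))
        if p != i:
            A[[i, p]] = A[[p, i]]
        return p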

  19. Triangular Solve - Unit Upper Triangular Matrix
  • Sequential complexity: O(n²)
  • Complexity of the parallel algorithm with 2-D block partitioning (√P x √P): O(n²)/√P
  • Thus (parallel GE / parallel TS) = (n³/P)/(n²/√P) = n/√P, which is smaller than the sequential ratio n³/n² = n: the solve parallelizes worse than the factorization
  • Overall (GE + TS): O(n³/P)
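  The O(n²) sequential count is visible in a direct back-substitution sketch for a unit upper triangular system (the name unit_upper_solve is mine):

    import numpy as np

    def unit_upper_solve(U, b):
        # Solve U x = b where U has an implicit unit diagonal:
        # ~n^2/2 multiply-subtract pairs and no divisions.
        n = len(b)
        x = np.array(b, dtype=float)
        for i in range(n - 1, -1, -1):
            x[i] -= U[i, i+1:] @ x[i+1:]
        return x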
