CS 484

Presentation Transcript


  1. CS 484

  2. Dense Matrix Algorithms • There are two types of Matrices • Dense (Full) • Sparse • We will consider matrices that are • Dense • Square

  3. Mapping Matrices • How do we partition a matrix for parallel processing? • There are two basic ways • Striped partitioning • Block partitioning

  4. Striped Partitioning • [figure: an 8-row matrix (rows 0–7) distributed across processors P0–P3, shown with block striping (contiguous rows per processor) and cyclic striping (rows assigned round-robin)]

  5. Block Partitioning • [figure: a matrix divided into sub-blocks mapped to processors P0–P7, shown with block checkerboard (contiguous blocks per processor) and cyclic checkerboard (blocks assigned round-robin)]

  6. Block vs. Striped Partitioning • Scalability? • Striping is limited to n processors • Checkerboard is limited to n x n processors • Complexity? • Striping is easy • Block could introduce more dependencies

  7. Dense Matrix Algorithms • Transposition • Matrix - Vector Multiplication • Matrix - Matrix Multiplication • Solving Systems of Linear Equations • Gaussian Elimination

  8. Matrix Transposition • The transpose of A is AT such that AT[i,j] = A[j,i] • All elements below the diagonal move above the diagonal and vice-versa • If we assume unit time to exchange and one element per processor: • Transpose takes (n² - n)/2
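
A quick way to see the (n² - n)/2 count: a minimal serial sketch (plain Python, not from the slides) that transposes in place by exchanging each below-diagonal element with its mirror and counts the exchanges.

```python
# Minimal sketch: in-place transpose by pairwise exchange.
# The number of exchanges is exactly (n*n - n) / 2.
def transpose_in_place(A):
    n = len(A)
    exchanges = 0
    for i in range(n):
        for j in range(i + 1, n):          # each above-diagonal element once
            A[i][j], A[j][i] = A[j][i], A[i][j]
            exchanges += 1
    return exchanges

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(transpose_in_place(A), (3 * 3 - 3) // 2)   # 3 3
print(A)                                          # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```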

  9. Transpose • Consider case where each processor has more than one element. • Hypothesis: • The transpose of the full matrix can be done by first sending the multiple element messages to their destination and then transposing the contents of the message.
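
A minimal sketch of that hypothesis (plain Python with numpy; the block layout and names are assumptions, and the "send" is just a dictionary reshuffle standing in for message passing): block (i, j) is first delivered to position (j, i), and each block is then transposed locally on arrival.

```python
import numpy as np

# Sketch: transpose a block-partitioned matrix by
# (1) sending block (i, j) to position (j, i), then
# (2) transposing the contents of each received block locally.
def block_transpose(blocks):
    sent = {(j, i): blk for (i, j), blk in blocks.items()}      # step 1: the "messages"
    return {pos: blk.T.copy() for pos, blk in sent.items()}     # step 2: local transpose

n, b = 4, 2
A = np.arange(n * n).reshape(n, n)
blocks = {(i, j): A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for i in range(2) for j in range(2)}
out = block_transpose(blocks)
AT = np.block([[out[(0, 0)], out[(0, 1)]], [out[(1, 0)], out[(1, 1)]]])
assert (AT == A.T).all()
```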

  10. Transpose (Striped Partitioning)

  11. Transpose (Block Partitioning)

  12. Matrix Multiplication

  13. One Dimensional Decomposition • T = (P - 1)(ts + tw N²/P) • Each processor "owns" one strip (the black portion in the original figure) • To compute the owned portion of the answer, each processor requires all of A
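
A minimal serial sketch of the one-dimensional decomposition (Python/numpy; the column-strip orientation and names are assumptions): each simulated processor owns one strip of B and of C, but has to read all of A to produce its strip of the answer.

```python
import numpy as np

# Sketch: 1-D (column-striped) decomposition of C = A * B over P "processors".
def matmul_1d(A, B, P):
    N = A.shape[0]
    w = N // P                                  # strip width (assume P divides N)
    C = np.zeros((N, N))
    for p in range(P):                          # each iteration plays one processor
        B_local = B[:, p*w:(p+1)*w]             # the strip processor p owns
        C[:, p*w:(p+1)*w] = A @ B_local         # needs ALL of A
    return C

N, P = 8, 4
A, B = np.random.rand(N, N), np.random.rand(N, N)
assert np.allclose(matmul_1d(A, B, P), A @ B)
```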

  14. Two Dimensional Decomposition • Requires less data per processor • Algorithm can be performed stepwise. • Fox’s algorithm

  15. Fox's algorithm, one step: • Broadcast an A sub-matrix to the other processors in the row • Compute • Rotate the B sub-matrix upwards

  16. Algorithm • T = (√P - 1)(1 + (log₂P)/2)(ts + tw N²/P)

      Set B' = Blocal
      for j = 0 to sqrt(P) - 2
        in each row i, the [(i+j) mod sqrt(P)]th task
          broadcasts A' = Alocal to the other tasks in the row
        accumulate A' * B'
        send B' to upward neighbor
      done
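
A serial simulation of this algorithm (a sketch, not the course's reference code; the block bookkeeping and names are assumptions): in step j, row i's task in column (i + j) mod √P "broadcasts" its A block, every position multiplies and accumulates, and the B blocks rotate one position upward.

```python
import numpy as np

# Sketch: Fox's algorithm simulated serially on a q x q grid of blocks.
def fox_multiply(A, B, q):
    n = A.shape[0]; b = n // q                   # block size (assume q divides n)
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for step in range(q):
        for i in range(q):
            k = (i + step) % q                   # the A block broadcast along row i
            for j in range(q):
                # Bb[i][j] is the B block that position (i, j) holds now; it equals B's block (k, j)
                Cb[i][j] += Ab[i][k] @ Bb[i][j]
        # rotate B blocks upward by one position
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

n, q = 6, 3
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(fox_multiply(A, B, q), A @ B)
```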

  17. Cannon's Algorithm • T = 2(√P - 1)(ts + tw N²/P) • Broadcasting a submatrix to all who need it is costly. • Suggestion: Shift both submatrices

  18. Blocks Need to Be Aligned • Each triangle represents a matrix block • Only same-color triangles should be multiplied • [figure: 4×4 grids of the A blocks (A00–A33) and B blocks (B00–B33) before alignment]

  19. Rearrange Blocks • Block Aij cycles left i positions • Block Bij cycles up j positions • [figure: the 4×4 block grids after this alignment, e.g. row 1 of A becomes A11 A12 A13 A10 and column 1 of B becomes B10 B21 B32 B03]

  20. Consider Process P1,2 • Step 1 [figure: after alignment P1,2 holds A13 and B32 and multiplies them]

  21. Consider Process P1,2 • Step 2 [figure: after one shift P1,2 holds A10 and B02 and multiplies them]

  22. Consider Process P1,2 • Step 3 [figure: P1,2 holds A11 and B12 and multiplies them]

  23. Consider Process P1,2 • Step 4 [figure: P1,2 holds A12 and B22 and multiplies them]

  24. Complexity Analysis • Algorithm has √p iterations • During each iteration a process multiplies two (n/√p) × (n/√p) matrices: Θ(n³/p^(3/2)) • Computational complexity: Θ(n³/p) • During each iteration a process sends and receives two blocks of size (n/√p) × (n/√p) • Communication complexity: Θ(n²/√p)
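
A serial simulation of Cannon's algorithm as walked through above (a sketch; the names and numpy bookkeeping are assumptions): blocks are first aligned (Aij left by i, Bij up by j), then each of the √p steps multiplies the resident blocks and shifts A left and B up by one.

```python
import numpy as np

# Sketch: Cannon's algorithm simulated serially on a q x q block grid.
def cannon_multiply(A, B, q):
    n = A.shape[0]; b = n // q                   # block size (assume q divides n)
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    # initial alignment: A block (i, j) cycles left i positions, B block (i, j) cycles up j positions
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):                           # sqrt(p) compute-and-shift steps
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]  # multiply the blocks currently resident
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]   # shift A left
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]   # shift B up
    return np.block(Cb)

n, q = 6, 3
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(cannon_multiply(A, B, q), A @ B)
```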

  25. Divide and Conquer

      | App  Apq |   | Bpp  Bpq |   | P0 + P1   P2 + P3 |
      | Aqp  Aqq | x | Bqp  Bqq | = | P4 + P5   P6 + P7 |

      P0 = App * Bpp   P1 = Apq * Bqp
      P2 = App * Bpq   P3 = Apq * Bqq
      P4 = Aqp * Bpp   P5 = Aqq * Bqp
      P6 = Aqp * Bpq   P7 = Aqq * Bqq
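
A recursive serial sketch of this scheme (in the parallel version the eight quadrant products would be assigned to tasks P0 through P7; the cutoff size below is an arbitrary assumption).

```python
import numpy as np

# Sketch: divide-and-conquer matrix multiply, forming the eight quadrant
# products P0..P7 and adding them in pairs, as on the slide.
def dc_multiply(A, B, cutoff=2):
    n = A.shape[0]
    if n <= cutoff:                              # small enough: multiply directly
        return A @ B
    h = n // 2                                   # assume n is a power of two
    App, Apq, Aqp, Aqq = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    Bpp, Bpq, Bqp, Bqq = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    P0, P1 = dc_multiply(App, Bpp, cutoff), dc_multiply(Apq, Bqp, cutoff)
    P2, P3 = dc_multiply(App, Bpq, cutoff), dc_multiply(Apq, Bqq, cutoff)
    P4, P5 = dc_multiply(Aqp, Bpp, cutoff), dc_multiply(Aqq, Bqp, cutoff)
    P6, P7 = dc_multiply(Aqp, Bpq, cutoff), dc_multiply(Aqq, Bqq, cutoff)
    return np.block([[P0 + P1, P2 + P3], [P4 + P5, P6 + P7]])

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(dc_multiply(A, B), A @ B)
```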

  26. Systems of Linear Equations • A linear equation in n variables has the form a0x0 + a1x1 + … + an-1xn-1 = b • A set of linear equations is called a system. • An assignment of values to the variables is a solution of the system iff it satisfies every equation in the system. • Many scientific and engineering problems take this form.

  27. Solving Systems of Equations • Many such systems are large. • Thousands of equations and unknowns a0,0x0 + a0,1x1 + … + a0,n-1xn-1 = b0 a1,0x0 + a1,1x1 + … + a1,n-1xn-1 = b1 an-1,0x0 + an-1,1x1 + … + an-1,n-1xn-1 = bn-1

  28. Solving Systems of Equations • A linear system of equations can be represented in matrix form:

      | a0,0    a0,1   ...  a0,n-1   |   | x0   |   | b0   |
      | a1,0    a1,1   ...  a1,n-1   | x | x1   | = | b1   |
      | ...                          |   | ...  |   | ...  |
      | an-1,0  an-1,1 ...  an-1,n-1 |   | xn-1 |   | bn-1 |

      i.e., Ax = b

  29. Solving Systems of Equations • Solving a system of linear equations is done in two steps: • Reduce the system to upper-triangular • Use back-substitution to find solution • These steps are performed on the system in matrix form. • Gaussian Elimination, etc.

  30. Solving Systems of Equations • Reduce the system to upper-triangular form • Use back-substitution

      | a0,0  a0,1  ...  a0,n-1   |   | x0   |   | b0   |
      | 0     a1,1  ...  a1,n-1   | x | x1   | = | b1   |
      | ...                       |   | ...  |   | ...  |
      | 0     0     ...  an-1,n-1 |   | xn-1 |   | bn-1 |
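
A minimal sketch of the back-substitution step on such an upper-triangular system (serial Python/numpy; assumes all diagonal entries are nonzero).

```python
import numpy as np

# Sketch: back-substitution on an upper-triangular system U x = b.
# Solve the last equation first, then substitute upward one unknown per row.
def back_substitute(U, b):
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                        # bottom row up
        x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]    # assumes U[i, i] != 0
    return x

U = np.triu(np.random.rand(4, 4) + np.eye(4))             # a nonsingular upper-triangular test case
b = np.random.rand(4)
assert np.allclose(U @ back_substitute(U, b), b)
```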

  31. Reducing the System • Gaussian elimination systematically eliminates variable x[k] from equations k+1 to n-1 • Reduces those coefficients to zero • This is done by subtracting an appropriate multiple of the kth equation from each of the equations k+1 to n-1

  32. Procedure GaussianElimination(A, b, y)
        for k = 0 to n-1
          /* Division Step */
          for j = k + 1 to n - 1
            A[k,j] = A[k,j] / A[k,k]
          endfor
          y[k] = b[k] / A[k,k]
          A[k,k] = 1
          /* Elimination Step */
          for i = k + 1 to n - 1
            for j = k + 1 to n - 1
              A[i,j] = A[i,j] - A[i,k] * A[k,j]
            endfor
            b[i] = b[i] - A[i,k] * y[k]
            A[i,k] = 0
          endfor
        endfor
      end
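
A direct Python transcription of this procedure (a sketch: no pivoting, so it assumes every pivot A[k,k] it divides by is nonzero), with back-substitution on the resulting unit-upper-triangular system appended so the result can be checked.

```python
import numpy as np

# Sketch: the slide's Gaussian elimination, transcribed literally (no pivoting).
def gaussian_elimination(A, b):
    A = A.astype(float).copy(); b = b.astype(float).copy()
    n = len(b)
    y = np.zeros(n)
    for k in range(n):
        # Division step: scale row k so the pivot becomes 1
        A[k, k+1:] = A[k, k+1:] / A[k, k]
        y[k] = b[k] / A[k, k]
        A[k, k] = 1.0
        # Elimination step: zero out column k below the pivot
        for i in range(k + 1, n):
            A[i, k+1:] = A[i, k+1:] - A[i, k] * A[k, k+1:]
            b[i] = b[i] - A[i, k] * y[k]
            A[i, k] = 0.0
    # Back-substitution on the unit-upper-triangular system A x = y
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - A[i, i+1:] @ x[i+1:]
    return x

A = np.array([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]])
b = np.array([1.0, 2.0, 4.0])
assert np.allclose(A @ gaussian_elimination(A, b), b)
```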

  33. Parallelizing Gaussian Elim. • Use domain decomposition • Rowwise striping • Division step requires no communication • Elimination step requires a one-to-all broadcast for each equation. • No agglomeration • Initially map one row to each processor

  34. Communication Analysis • Consider the algorithm step by step • Division step requires no communication • Elimination step requires one-to-all bcast • only bcast to other active processors • only bcast active elements • Final computation requires no communication.

  35. Communication Analysis • One-to-all broadcast • log₂q communications • q = n - k - 1 active processors • Message size: with q active processors, q elements are required • T = (ts + tw q) log₂ q

  36. Computation Analysis • Division step • q divisions • Elimination step • q multiplications and subtractions • Assuming equal time --> 3q operations

  37. Computation Analysis • In each step, the active processor set is reduced by one, resulting in:

      CompTime = Σ (k = 0 to n-1) 3(n - k - 1) = 3n(n - 1)/2
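
A throwaway check of that closed form (not from the slides):

```python
# Verify: the sum over k of 3*(n - k - 1) equals 3*n*(n - 1)/2.
n = 10
assert sum(3 * (n - k - 1) for k in range(n)) == 3 * n * (n - 1) // 2
```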

  38. Can we do better? • Previous version is synchronous and parallelism is reduced at each step. • Pipeline the algorithm • Run the resulting algorithm on a linear array of processors. • Communication is nearest-neighbor • Results in O(n) steps of O(n) operations

  39. Pipelined Gaussian Elim. • Basic assumption: A processor does not need to wait until all processors have received a value to proceed. • Algorithm • If processor p has data for other processors, send the data to processor p+1 • If processor p can do some computation using the data it has, do it. • Otherwise, wait to receive data from processor p-1
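
A minimal serial sketch of this rule (an illustration only: it models the "forward the pivot row, then compute" data flow with one row per simulated processor, but not the overlapped timing that gives the pipeline its speedup; all names are assumptions).

```python
import numpy as np

# Sketch: one row per simulated "processor" on a linear array.  When the pivot
# row reaches processor p it would be forwarded on to p+1 before p eliminates
# with it; the serial loop below models that data flow.
def pipelined_gaussian_elimination(A, b):
    A = A.astype(float).copy(); b = b.astype(float).copy()
    n = len(b)
    for k in range(n):
        pivot = A[k, k]                    # processor k normalizes its own row ...
        A[k, k:] /= pivot
        b[k] /= pivot
        for p in range(k + 1, n):          # ... then the row travels down the array
            factor = A[p, k]               # processor p forwards the row, then computes
            A[p, k:] -= factor * A[k, k:]
            b[p] -= factor * b[k]
    return A, b

A = np.array([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]])
b = np.array([1.0, 2.0, 4.0])
U, y = pipelined_gaussian_elimination(A, b)
assert np.allclose(U, np.triu(U)) and np.allclose(np.diag(U), 1.0)
```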

  40. Conclusion • Using a striped partitioning method, it is natural to pipeline the Gaussian elimination algorithm to achieve best performance. • Pipelined algorithms work best on a linear array of processors. • Or something that can be linearly mapped • Would it be better to block partition? • How would it affect the algorithm?

  41. Row Ordering • When dealing with a sparse matrix, elimination operations can cause an entry that was zero to become non-zero (fill-in)

  42. Nested Dissection Ordering • Complete these slides using notes in the black binder.
