
Matrix Multiplication



  1. Matrix Multiplication. Instructor: Dr. Sushil K. Prasad. Presented by: R. Jayampathi Sampath

  2. Outline • Introduction • Hypercube Interconnection Network • The Parallel Algorithm • Matrix Transposition • Communication Efficient Matrix Multiplication on Hypercubes (The paper)

  3. Introduction • Matrix multiplication is an important algorithm design problem in parallel computation. • Matrix multiplication maps well onto the hypercube: • the diameter is small, log(p) • the degree is log(p) • The straightforward RAM algorithm for matrix multiplication requires O(n^3) time. • Sequential algorithm:

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            t = 0;
            for (k = 0; k < n; k++)
                t = t + a[i][k] * b[k][j];
            c[i][j] = t;
        }
    }

  4. Hypercube Interconnection Network (Figure: a 4-dimensional hypercube whose 16 nodes are labeled 0000 through 1111; nodes whose labels differ in exactly one bit are connected.)

  5. Hypercube Interconnection Network (contd.) • The formal specification of a hypercube interconnection network: • Let N = 2^g processors be available. • Let i and i(b) be two integers, 0 <= i, i(b) <= N-1, whose binary representations differ only in position b, 0 <= b <= g-1. • Specifically, if i_{g-1} i_{g-2} ... i_{b+1} i_b i_{b-1} ... i_0 is the binary representation of i, then i_{g-1} i_{g-2} ... i_{b+1} i_b' i_{b-1} ... i_0 is the binary representation of i(b), where i_b' is the complement of bit i_b. • A g-dimensional hypercube interconnection network is formed by connecting each processor P_i to P_{i(b)} by a two-way link, for all 0 <= b <= g-1.
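
  To make the neighbor relation concrete, here is a minimal C sketch (not from the slides; the function name neighbor is illustrative). Complementing bit b of an index is a single XOR with 1 << b, so each processor can enumerate its g neighbors directly:

    #include <stdio.h>

    /* Processor i is linked to i(b), whose index differs from i
       only in bit position b: i(b) = i XOR (1 << b). */
    int neighbor(int i, int b) {
        return i ^ (1 << b);
    }

    int main(void) {
        int g = 4;      /* 4-dimensional hypercube: 16 processors */
        int i = 13;     /* binary 1101 */
        for (int b = 0; b < g; b++)
            printf("P%d <-> P%d (differ in bit %d)\n", i, neighbor(i, b), b);
        return 0;
    }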

  6. The Parallel Algorithm • Example (parallel algorithm) with n = 2^1, so #processors N = n^3 = 2^3 = 8, each occupying a position (i,j,k) given by the bits x,x,x of its index.

    A = | 1  2 |        B = | -1  -2 |
        | 3  4 |            | -3  -4 |

  (Figure: the 3-cube with nodes 000 through 111.) Initial step: A and B reside in the processors of the i = 0 plane. Step 1.1: A(0,j,k) and B(0,j,k) are sent to processors (i,j,k), where 1 <= i <= n-1. Step 1.2: A(i,j,i) is sent to processors (i,j,k), where 0 <= k <= n-1. Step 1.3: B(i,i,k) is sent to processors (i,j,k), where 0 <= j <= n-1.

  7. The Parallel Algorithm (contd.) Step 2: each processor computes the product of its A and B registers (for the example, the eight products -1, -6, -2, -8, -3, -12, -6, -16). Step 3: sums along the i direction leave the result in the i = 0 plane:

    C = | -7   -10 |
        | -15  -22 |

  8. The Parallel Algorithm (contd.) • Implementation of the straightforward RAM algorithm on a hypercube. • The multiplication of two n x n matrices A, B where n = 2^q. • Use a hypercube with N = n^3 = 2^{3q} processors. • Each processor P_r occupies position (i,j,k), where r = i*n^2 + j*n + k for 0 <= i,j,k <= n-1. • If the binary representation of r is r_{3q-1} r_{3q-2} ... r_{2q} r_{2q-1} ... r_q r_{q-1} ... r_0, then the binary representations of i, j, k are r_{3q-1} r_{3q-2} ... r_{2q}, r_{2q-1} ... r_q, and r_{q-1} ... r_0, respectively.

  9. The Parallel Algorithm (contd.) • Example (positioning) • The multiplication of two 2 x 2 matrices A, B where n = 2 = 2^1, so q = 1. • Use a hypercube with N = n^3 = 2^{3q} = 8 processors. • Each processor P_r occupies position (i,j,k), where r = i*2^2 + j*2 + k for 0 <= i,j,k <= 1. • If the binary representation of r is r_2 r_1 r_0, then the binary representations of i, j, k are r_2, r_1, r_0, respectively.
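
  The bit-field decomposition on slides 8 and 9 can be checked with a short C sketch (illustrative, not part of the original deck); since n = 2^q, the coordinates i, j, k are recovered from r with shifts and masks:

    #include <stdio.h>

    /* Recover (i, j, k) from processor index r = i*n^2 + j*n + k when n = 2^q:
       i is the top q bits of r, j the middle q bits, k the low q bits. */
    void position(int r, int q, int *i, int *j, int *k) {
        int mask = (1 << q) - 1;        /* q low-order one bits */
        *k = r & mask;                  /* bits r_{q-1} ... r_0    */
        *j = (r >> q) & mask;           /* bits r_{2q-1} ... r_q   */
        *i = (r >> (2 * q)) & mask;     /* bits r_{3q-1} ... r_{2q} */
    }

    int main(void) {
        int q = 1;                      /* n = 2, N = 8 processors */
        for (int r = 0; r < 8; r++) {
            int i, j, k;
            position(r, q, &i, &j, &k);
            printf("P%d -> (%d,%d,%d)\n", r, i, j, k);
        }
        return 0;
    }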

  10. The Parallel Algorithm (contd.) (Figure: the 3-cube with nodes 000 through 111; each node carries registers Ar, Br, Cr.) • All processors with the same index value in one of the coordinates i, j, k form a hypercube with n^2 processors. • All processors with the same index values in two of the coordinates form a hypercube with n processors. • Each processor P_r has three registers Ar, Br and Cr, also denoted A(i,j,k), B(i,j,k) and C(i,j,k).

  11. The Parallel Algorithm (contd.) • Step 1: the elements of A and B are distributed to the n^3 processors so that the processor in position (i,j,k) will contain a_{ji} and b_{ik}. • 1.1 Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where 1 <= i <= n-1, resulting in A(i,j,k) = a_{jk} and B(i,j,k) = b_{jk} for 0 <= i <= n-1. • 1.2 Copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where 0 <= k <= n-1, resulting in A(i,j,k) = a_{ji} for 0 <= k <= n-1. • 1.3 Copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where 0 <= j <= n-1, resulting in B(i,j,k) = b_{ik} for 0 <= j <= n-1. • Step 2: each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) * B(i,j,k). • Step 3: the sum C(0,j,k) = sum over i of C(i,j,k), 0 <= i <= n-1, is computed for 0 <= j,k <= n-1. (A serial simulation of these steps is sketched below.)
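
  As a sanity check of Steps 1-3, here is a serial C simulation (my sketch, not the deck's code): a 3-D array stands in for the registers of the n^3 processors, and the loops reproduce the data movement rather than real hypercube communication. With the 2 x 2 example from slide 6 it prints -7, -10, -15, -22:

    #include <stdio.h>
    #define N 2                           /* n = 2, so n^3 = 8 "processors" */

    int main(void) {
        double a[N][N] = {{1, 2}, {3, 4}};
        double b[N][N] = {{-1, -2}, {-3, -4}};
        double A[N][N][N], B[N][N][N], C[N][N][N];
        int i, j, k;

        /* Step 1.1: broadcast the i = 0 plane along i:
           A(i,j,k) = a[j][k], B(i,j,k) = b[j][k] for all i. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++) { A[i][j][k] = a[j][k]; B[i][j][k] = b[j][k]; }

        /* Step 1.2: spread A(i,j,i) along k; since A(i,j,i) = a[j][i]
           after 1.1, this yields A(i,j,k) = a[j][i]. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++) A[i][j][k] = a[j][i];

        /* Step 1.3: spread B(i,i,k) along j; since B(i,i,k) = b[i][k]
           after 1.1, this yields B(i,j,k) = b[i][k]. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++) B[i][j][k] = b[i][k];

        /* Step 2: one multiplication per processor. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++) C[i][j][k] = A[i][j][k] * B[i][j][k];

        /* Step 3: reduce along i; C(0,j,k) = sum_i a[j][i]*b[i][k] = c_{jk}. */
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++) {
                for (i = 1; i < N; i++) C[0][j][k] += C[i][j][k];
                printf("c[%d][%d] = %g\n", j, k, C[0][j][k]);
            }
        return 0;
    }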

  12. The Parallel Algorithm (contd.) • Analysis • Steps 1.1, 1.2, 1.3 and 3 each consist of q constant-time iterations. • Step 2 requires constant time. • So T(n^3) = O(q) = O(log n). • Cost = p * T(p) = n^3 * O(log n) = O(n^3 log n). • Not cost optimal, since the sequential algorithm runs in O(n^3) time.

  13. (Figure) n = 4 = 2^2, #processors N = n^3 = 4^3 = 64; each processor index has the bit layout xx,xx,xx for i, j, k. The data initially in A(0,j,k) and B(0,j,k).

  14. (Figure) n = 4 = 2^2, N = n^3 = 4^3 = 64. Copies of A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where 1 <= i <= n-1.

  15. (Figure) n = 4 = 2^2, N = n^3 = 4^3 = 64. Copies of the data initially in A(0,j,k) and B(0,j,k).

  16. (Figure) n = 4 = 2^2, N = n^3 = 4^3 = 64. Senders of A: copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where 0 <= k <= n-1.

  17. (Figure) n = 4 = 2^2, N = n^3 = 4^3 = 64. Senders of B: copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where 0 <= j <= n-1.

  18. Matrix Transposition • The number of processors used is N = n^2 = 2^{2q}, and processor P_r occupies position (i,j), where r = i*n + j for 0 <= i,j <= n-1. • Initially, processor P_r holds element a_{ij} of matrix A, where r = i*n + j. • Upon termination, processor P_s holds element a_{ij}, where s = j*n + i.

  19. Matrix Transposition (contd.) • A recursive interpretation of the algorithm (a serial sketch follows this slide): • Divide the matrix into four (n/2) x (n/2) sub-matrices. • At the first level of recursion, the elements of the bottom-left sub-matrix are swapped with the corresponding elements of the top-right sub-matrix; the elements of the other two sub-matrices are untouched. • The same step is then applied to each of the four (n/2) x (n/2) sub-matrices. • This continues until 2 x 2 matrices are transposed. • Analysis • The algorithm consists of q constant-time iterations, so T(n^2) = O(log n). • Cost = n^2 log n; not cost optimal, since the RAM transposes an n x n matrix with n(n-1)/2 swaps of a_{ij} with a_{ji} for all i < j, i.e., O(n^2) operations.
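
  A minimal serial sketch of this recursive interpretation (my code, assuming a global n x n array m; the function name transpose_block is illustrative). Each level swaps the bottom-left and top-right quadrants element-wise and then recurses into the four quadrants, giving q = log n levels, which mirrors the q iterations of the hypercube algorithm:

    #include <stdio.h>
    #define N 4                         /* n = 2^q with q = 2 */

    double m[N][N];

    /* Transpose the size x size block with top-left corner (r0, c0):
       swap its bottom-left and top-right quadrants, then recurse. */
    void transpose_block(int r0, int c0, int size) {
        if (size == 1) return;
        int h = size / 2;
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) {
                double t = m[r0 + h + i][c0 + j];               /* bottom-left */
                m[r0 + h + i][c0 + j] = m[r0 + i][c0 + h + j];  /* <- top-right */
                m[r0 + i][c0 + h + j] = t;
            }
        transpose_block(r0,     c0,     h);   /* top-left     */
        transpose_block(r0,     c0 + h, h);   /* top-right    */
        transpose_block(r0 + h, c0,     h);   /* bottom-left  */
        transpose_block(r0 + h, c0 + h, h);   /* bottom-right */
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) m[i][j] = i * N + j;
        transpose_block(0, 0, N);
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) printf("%4g", m[i][j]);
            printf("\n");
        }
        return 0;
    }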

  20. Matrix Transposition (contd.) • Example (figure: a small matrix, entries labeled 1.0 through 1.3, transposed by the recursive block swaps).

  21. Outline • The 2-D Diagonal Algorithm • The 3-D Diagonal Algorithm

  22. The 2-D Diagonal Algorithm (figure: Step 1 with 4 x 4 matrices A and B)

  23. The 2-D Diagonal Algorithm (contd.) (figure: Steps 2, 3 and 4)

  24. The 2-D Diagonal Algorithm (contd.) • The above algorithm can be extended to a 3-D mesh embedded in a hypercube, with A_{*i} and B_{i*} initially distributed along the third dimension z. • Processor p_{iik} holds the sub-blocks of A_{ki} and B_{ik}. • The one-to-all personalized broadcast of B_{i*} is then replaced by a point-to-point communication of B_{ik} from p_{iik} to p_{kik}. • This is followed by a one-to-all broadcast of B_{ik} by p_{kik} along the z direction.

  25. The 3-D Diagonal Algorithm • A hypercube consisting of p processors can be visualized as a 3-D mesh of size p^{1/3} x p^{1/3} x p^{1/3}. • Matrices A and B are partitioned into p^{2/3} blocks, with p^{1/3} blocks along each dimension. • Initially, it is assumed that A and B are mapped onto the 2-D plane x = y, processor p_{iik} containing the blocks A_{ki} and B_{ki}.

  26. The 3-D Diagonal Algorithm (contd.) • The algorithm consists of three phases: • Phase 1: point-to-point communication of B_{ki} by p_{iik} to p_{ikk}. • Phase 2: one-to-all broadcast of the blocks of A along the x direction and of the newly acquired blocks of B along the z direction; now processor p_{ijk} has the blocks A_{kj} and B_{ji}, and each processor calculates the product of its blocks of A and B. • Phase 3: reduction by adding the resulting sub-matrices along the z direction.

  27. The 3-D Diagonal Algorithm (contd.) • Analysis • Phase 1: passing messages of size n^2 / p^{2/3} requires log(p^{1/3}) * (ts + tw * n^2 / p^{2/3}) time, where ts is the start-up time for sending a message and tw is the time to send one word from a processor to its neighbor. • Phase 2: takes twice as much time as Phase 1. • Phase 3: can be completed in the same amount of time as Phase 1. • Overall, the algorithm takes about (4/3) log p * ts + (4/3) log p * (n^2 / p^{2/3}) * tw time.
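
  A small C helper that evaluates this estimate (the closed form is my reconstruction of the garbled expression on the slide, summing the phase terms (1/3 + 2/3 + 1/3) log p, not a verbatim quote from the paper; the sample numbers are illustrative only):

    #include <stdio.h>
    #include <math.h>

    /* T = (4/3) * log2(p) * (ts + tw * n^2 / p^(2/3)) */
    double diag3d_time(double p, double n, double ts, double tw) {
        double msg = n * n / pow(p, 2.0 / 3.0);   /* words per message */
        return (4.0 / 3.0) * log2(p) * (ts + tw * msg);
    }

    int main(void) {
        /* e.g. p = 64 processors, n = 1024, ts = 10 word-times, tw = 1 */
        printf("T = %g word-times\n", diag3d_time(64.0, 1024.0, 10.0, 1.0));
        return 0;
    }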

  28. Bibliography • Akl, S. G., Parallel Computation: Models and Methods, Prentice Hall, 1997. • Gupta, H. & Sadayappan, P., Communication Efficient Matrix Multiplication on Hypercubes, Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994, pp. 320-329. • Quinn, M. J., Parallel Computing: Theory and Practice, McGraw-Hill, 1997.
