
A Concurrent Matrix Transpose Algorithm

Pourya Jafari. Matrix transpose is a frequently used linear algebra operation, with scientific applications such as FFT and matrix multiplication.


Presentation Transcript


  1. A Concurrent Matrix Transpose Algorithm • Pourya Jafari

  2. Application • Frequently Used Linear Algebra Operation • Scientific Applications • FFT • Matrix Multiplication

  3. Transpose Matrix • $B_{i,j}$: item/cell at row i and column j of matrix B • For all i, j we have $(B^T)_{i,j} = B_{j,i}$ • Simply exchange rows and columns • For simplicity we only consider square matrices • N rows and N columns, labeled 0 to N-1

  4. An Example • Each cell is filled with its row|column number • 6 swaps for a 4×4 matrix: (4*4 – 4)/2 = 6 • In general, for a square matrix of size N we have $(N^2 - N)/2$ swaps
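The swap count on this slide can be checked with a short sequential sketch. It is only an illustration, not the presenter's code: the function name `transpose_in_place` is hypothetical, and the matrix is assumed to be a square list of lists whose cells are labeled row|column as in the example.

```python
def transpose_in_place(a):
    """Transpose a square list-of-lists matrix in place; return the swap count."""
    n = len(a)
    swaps = 0
    for i in range(n):
        # Only cells above the diagonal are visited, so each pair is swapped once.
        for j in range(i + 1, n):
            a[i][j], a[j][i] = a[j][i], a[i][j]
            swaps += 1
    return swaps


# 4x4 example with each cell labeled "row|column".
matrix = [[f"{i}|{j}" for j in range(4)] for i in range(4)]
swap_count = transpose_in_place(matrix)
```

For N = 4 the loop performs (4*4 – 4)/2 = 6 swaps, and the cell at row 1, column 3 ends up holding the label `3|1`.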

  5. Parallelizing • Naïve algorithm • A thread for each swap • Quadratic number of threads • Quadratic number of communication links • → impractical

  6. Parallelizing - 2 • A more efficient way • Assign a column to each thread • O(N) threads • Communication links? • Depends on the approach

  7. Measure dislocation • A single swap operation as row and column shifts • For column shift length K • i = j + K → K = i - j • Shift length is i-j (modulo N); value range is from 0 to N-1

  8. Concurrency Scheme • Minimize communication • Pre-process inside thread • Shift each row • Inter-process/thread communication • Shift each column • Post-process inside thread • Shift each row again

  9. Concurrency Scheme - 2 • We have the row shifts fixed based on row index • The shift has range 0 to N-1, consistent with our initial finding • Now arrange the rows so that the row shift gets the column index to i • Column shift of L cells up: i - L = i’ • Row shift by i’: j + i’ = i • Hence L = j • So we shift each column j cells up

  10. Steps so far • 1 → 2: Column shift j up • 2 → 3: Row shift based on row indices • 3 → 4: ? • Change of indices so far • (i, j) → (i - j, j) → (i - j, i - j + j) = (i - j, i) = (m, n) • One operation to change row index to j • n - m = i - (i - j) = j • [Figure: matrix states (1), (2-a), (2-b), (3), (4)]
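The index bookkeeping on slides 9 and 10 can be checked with a small sequential simulation. This is a sketch under stated assumptions, not the presenter's implementation: all shifts are taken to be cyclic (modulo N), and the open step 3 → 4 is assumed to be a per-column remap that sends row m of column n to row (n - m) mod N, as the relation n - m = j suggests. The function name `transpose_by_shifts` is hypothetical.

```python
def transpose_by_shifts(b):
    """Transpose a square matrix using the three steps traced on the slides."""
    n = len(b)

    # Step 1 -> 2: shift column j up by j cells: (i, j) -> (i - j, j).
    s1 = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s1[(i - j) % n][j] = b[i][j]

    # Step 2 -> 3: shift row m right by m cells: (m, c) -> (m, c + m).
    s2 = [[None] * n for _ in range(n)]
    for m in range(n):
        for c in range(n):
            s2[m][(c + m) % n] = s1[m][c]

    # Step 3 -> 4 (assumed): inside column c, move row m to row (c - m) mod n.
    s3 = [[None] * n for _ in range(n)]
    for m in range(n):
        for c in range(n):
            s3[(c - m) % n][c] = s2[m][c]
    return s3


cells = [[f"{i}|{j}" for j in range(4)] for i in range(4)]
result = transpose_by_shifts(cells)
```

In this sketch the first and last steps touch only one column at a time, so with one column per thread they would stay inside a thread; only the row shifts in the middle move data between columns.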

  11. Efficiency of algorithm so far • O(N) row and column operations • O(N^2) overall considering both rows and columns • O(N) communication links • Communication is a major bottleneck • Group row shifts • Reduce communication and overall complexity

  12. Radix Representation • Radix r • Base r numbers • For each digit place k (starting from the least significant) • For l steps from 0 to r-1 • Group all row shifts for the current step • Radix 3 • Possible digits 0, 1 and 2 • Second loop { For l=0 to 2 } • Shift all rows whose index has l in its kth digit place by l*r^k to the right

  13. Special Case: Radix-2 • Only two digit values, 0 and 1 • We only shift for 1 • Digits are the bits of the binary representation • Shift all rows whose index has its kth bit on • [Figure: shift for each row = shifts for k=0 + shifts for k=1]
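The grouping described on slides 12 and 13 can be sketched as follows. The idea is that shifting row m by l*r^k for each digit l of m adds up to a total shift of m, so the per-row shifts can be issued as a small number of grouped steps. This is a minimal sequential sketch, assuming cyclic right shifts and a list-of-lists matrix; `rotate_right` and `grouped_row_shifts` are hypothetical names, and the l = 0 pass is skipped because it moves nothing.

```python
def rotate_right(row, distance):
    """Cyclically shift a list to the right by `distance` cells."""
    d = distance % len(row)
    return row[-d:] + row[:-d] if d else list(row)


def grouped_row_shifts(matrix, r):
    """Shift row m right by m in total, one grouped step per (digit place, digit)."""
    n = len(matrix)
    # Number of base-r digit places needed to write the indices 0 .. n-1.
    places, power = 0, 1
    while power < n:
        power *= r
        places += 1

    rows = [list(row) for row in matrix]
    steps = 0
    for k in range(places):          # digit place, least significant first
        for l in range(1, r):        # digit value; l = 0 would shift by zero
            group = [m for m in range(n) if (m // r**k) % r == l]
            if not group:
                continue
            for m in group:          # every row in the group moves by l * r^k
                rows[m] = rotate_right(rows[m], l * r**k)
            steps += 1
    return rows, steps


grid = [[f"{i}|{j}" for j in range(9)] for i in range(9)]
shifted_r3, steps_r3 = grouped_row_shifts(grid, 3)
shifted_r2, steps_r2 = grouped_row_shifts(grid, 2)
```

Both radices produce the same final rows; only the number of grouped steps differs. For this 9×9 grid the sketch uses 4 steps for both r = 3 and r = 2.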

  14. Algorithm complexity • Depends on r (radix) • $C_1 = (r-1)\lceil \log_r N \rceil$ • $C_2 = b(r-1)\lceil N/r \rceil \lceil \log_r N \rceil$ • Special cases • r=2 • Important when communication cost is high • Good when message size is small • r=N • Good when message size is large • Best value based on communication costs, message size, communication link performance, number of ports, etc.
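To see how the two cost terms trade off, the formulas can be evaluated for a few radices. This is a sketch under assumptions: the square brackets on the slide are read as ceilings, b is kept as an unspecified multiplier exactly as it appears there (defaulting to 1), and the names `digit_places` and `index_costs` are hypothetical.

```python
def digit_places(n, r):
    """Smallest k with r**k >= n, i.e. ceil(log_r n), in exact integer arithmetic."""
    places, power = 0, 1
    while power < n:
        power *= r
        places += 1
    return places


def index_costs(n, r, b=1):
    """Return (C1, C2) for n processors and radix r, reading [x] as ceil(x)."""
    places = digit_places(n, r)
    c1 = (r - 1) * places
    c2 = b * (r - 1) * (-(-n // r)) * places   # -(-n // r) is ceil(n / r)
    return c1, c2


# N = 64 matches the processor count named in the final slide.
costs = {r: index_costs(64, r) for r in (2, 3, 4, 8, 64)}
```

With N = 64 and b = 1, radix 2 gives C1 = 6 and C2 = 192, while radix 64 gives C1 = 63 and C2 = 63. The smallest radix minimizes C1 and the largest minimizes C2, which matches the small-message and large-message special cases listed on the slide.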

  15. Radix vs. message size vs. index time for 64 processors
