
Algorithm-Based Fault Tolerance Matrix Multiplication



  1. Algorithm-Based Fault Tolerance: Matrix Multiplication. Greg Bronevetsky

  2. Problem at Hand • Have matrices A and B • Want to compute their product: AB • Ask a matrix-matrix-multiply (MMM) implementation to compute product • Answer: C • Question: Is C the correct answer? How could we know for sure?

  3. Algorithm-Based Fault Tolerance • Encode input matrices via an error-correcting code • Run the regular MMM algorithm on the encoded matrices • The encoding is invariant under MMM • MMM naturally outputs encoded matrices • Encoding guarantees: • If up to t errors in the output, will detect them • If up to c < t errors in the output, can decode the correct output matrix

  4. Outline • Linear Error Correcting Codes • ABFT = Linear Encoding of Matrices • Algorithm-Based Fault Tolerance

  5. Error Correcting Codes • A map f: Σ^k → Σ^n • k-long data words → n-long codewords • We use Σ = {0, 1} • A code of length n is a “sparse” subset of Σ^n • Very few possible words are valid codewords • Rate of code = k/n: amount of information communicated by each codeword

  6. Minimum Distance • Minimum distance: dmin(C) = the minimum Hamming distance between any two distinct codewords of C • Hamming distance: the number of spots where two words differ • Measures difficulty of decoding/correcting corrupted codewords
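
As a concrete illustration of these two definitions, here is a minimal Python sketch (the function names are mine, not from the talk):

```python
def hamming_distance(x, y):
    """Number of spots where two equal-length words differ."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a != b)

def minimum_distance(code):
    """d_min of a code: smallest Hamming distance between distinct codewords."""
    words = list(code)
    return min(hamming_distance(x, y)
               for i, x in enumerate(words)
               for y in words[i + 1:])

# Example: the length-3 repetition code {000, 111}.
rep_code = ["000", "111"]
```

For the repetition code, minimum_distance returns 3, so (per the next slide) it can detect up to 2 errors and correct 1.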

  7. Detection and Correction • Code can detect errors in up to dmin−1 spots • No such error can morph one codeword into another • Can correct errors in up to (dmin−1)/2 spots • Can still find the “closest” codeword • More details later… Each codeword defines a ball of radius (dmin−1)/2 around itself

  8. Linear Codes • Codewords form a linear subspace inside Σ^n • The rowspace of a generator matrix G: e.g., a (n=7, k=3) code
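
The slide's generator matrix was shown as an image; as a stand-in, here is a hypothetical (n=7, k=3) binary generator (its columns are all nonzero 3-bit vectors) and the rowspace it generates over GF(2):

```python
from itertools import product

# Hypothetical (n=7, k=3) generator matrix over GF(2); the talk's actual G
# was in a figure, so this simplex-code generator stands in for it.
G = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]

def encode_gf2(msg, G):
    """Codeword = msg * G over GF(2): XOR together the rows selected by msg."""
    word = [0] * len(G[0])
    for bit, row in zip(msg, G):
        if bit:
            word = [w ^ r for w, r in zip(word, row)]
    return tuple(word)

# The code is the rowspace of G: all 2^k = 8 combinations of its rows.
codewords = {encode_gf2(m, G) for m in product([0, 1], repeat=3)}
```

Every nonzero codeword of this particular code has weight 4, so its minimum distance is 4.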

  9. Property 1 • A linear combination of any codewords is also a codeword: for any x, y ∈ C, (x + y) ∈ C • A codeword times a constant is a codeword: for any z ∈ C, k·z ∈ C • <0,0,…,0> is always a codeword • Proof: basic properties of linear spaces

  10. Property 2 • Minimum distance of a linear code: dmin = min{ weight(x) : x ∈ C, x ≠ 0 } • Where weight(x) = the number of nonzero entries of x • Proof: dist(x, y) = weight(x − y), and by linearity x − y is itself a codeword

  11. Parity Check Matrix • H: dual matrix to G • Contains a basis of the space orthogonal to G’s row space • An (n−k)-dimensional space • H is (n−k)×n • The code is the space defined as: C = { x ∈ Σ^n : Hx = 0 } • Note: H also defines a linear code

  12. Property 3 • dmin = the minimum # of columns of H that can sum to 0 • Proof: Hx = 0 says that the columns of H selected by x’s nonzero entries sum to 0, so a minimum-weight codeword selects the fewest such columns

  13. Property 4 • Minimum distance of a linear code ≤ n − k + 1 • Proof • Total n dimensions (since codewords are n-vectors) • G’s rowspace rank = k • Thus, H’s column space rank = n − k • Thus, any n − k + 1 columns are linearly dependent • Some nontrivial combination adds up to 0 • By Property 3, that number of columns is ≥ dmin, so dmin ≤ n − k + 1

  14. Outline • Linear Error Correcting Codes • ABFT = Linear Encoding of Matrices • Algorithm-Based Fault Tolerance

  15. Encoding a Matrix • Algorithm-Based Fault Tolerance introduced by Huang and Abraham in 1984 • Encode each row of matrix via extra column • Column entries = sums of matrix rows

  16. Encoding a Matrix • Encode each column of the matrix via an extra row • Row entries = sums of matrix columns • Full encoding: both the checksum column and the checksum row (plus the corner entry, the total sum)
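
The full encoding can be sketched in a few lines of Python (plain integer arithmetic; names are mine):

```python
def full_encode(A):
    """Huang-Abraham full checksum encoding: append a checksum column
    (row sums), then a checksum row (column sums, including the corner)."""
    rows = [row + [sum(row)] for row in A]         # extra column
    check_row = [sum(col) for col in zip(*rows)]   # extra row
    return rows + [check_row]

A = [[1, 2],
     [3, 4]]
C = full_encode(A)
# C == [[1, 2, 3],
#       [3, 4, 7],
#       [4, 6, 10]]
```

The corner entry (10 here) is the total sum, so it is consistent with both the checksum row and the checksum column.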

  17. Detecting Errors • Suppose matrix A is corrupted to matrix Â • Entry âi,j is wrong • Row checksum i and column checksum j both fail • Can detect the error’s exact position: <i,j>

  18. Correcting Errors • Can correct the error using the row or column checksum: the correct value is the checksum minus the sum of the other entries in that row (or column)
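
A sketch of single-error detection and correction on a fully encoded matrix. It assumes the corrupted entry lies in the data block, not in a checksum; function and variable names are mine:

```python
def find_and_fix_single_error(M):
    """M: fully encoded (m+1) x (n+1) matrix with at most one corrupted
    data entry. Locate it via the failing checksums and repair it."""
    m, n = len(M) - 1, len(M[0]) - 1
    bad_rows = [i for i in range(m) if sum(M[i][:n]) != M[i][n]]
    bad_cols = [j for j in range(n)
                if sum(M[i][j] for i in range(m)) != M[m][j]]
    if not bad_rows or not bad_cols:
        return None                      # no single data-block error found
    i, j = bad_rows[0], bad_cols[0]      # error position <i, j>
    # Repair: checksum minus the sum of the other entries in row i.
    M[i][j] = M[i][n] - sum(M[i][k] for k in range(n) if k != j)
    return (i, j)

M = [[1, 2, 3],
     [3, 4, 7],
     [4, 6, 10]]   # full encoding of [[1, 2], [3, 4]]
M[0][1] = 9        # inject one error
pos = find_and_fix_single_error(M)
```

After the call, pos is (0, 1) and M[0][1] is restored to 2; a second call finds nothing to fix.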

  19. Big Trick: Preservation of Encoding • Column-encoded matrix * Row-encoded matrix = Fully-encoded matrix • Can check an MMM computation by checking the encoding of the output • If the product matrix has an erroneous entry • Can detect it • Can correct it
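
This preservation property is easy to demonstrate on a small example; a self-contained sketch (names are mine):

```python
def matmul(A, B):
    """Plain textbook matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def col_encode(A):
    """Column encoding: append a checksum row (column sums)."""
    return A + [[sum(col) for col in zip(*A)]]

def row_encode(B):
    """Row encoding: append a checksum column (row sums)."""
    return [row + [sum(row)] for row in B]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# (column-encoded A) * (row-encoded B) comes out fully encoded:
C_full = matmul(col_encode(A), row_encode(B))
```

The last column of C_full holds the row sums of AB and the last row its column sums, so the output of an unmodified MMM is already checkable.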

  20. Applications • Matrix Multiplication • Given encoded A and B, • Check whether the MMM result C (= AB?) has a valid encoding • Matrix Factorization • Given a factorization A = WZ • Verify correctness by verifying the encodings of the factors • Factors are row- OR column-encoded • Can only detect, not correct, errors

  21. Weighted ABFT • Oftentimes need to check row- or column-encoded matrices • Ex: factorization, data integrity check • Can only detect errors in such matrices • Can we also correct? • Yes, by generalizing to weighted checking rows/columns

  22. Weighting • Suppose we have d n-vectors w1…wd • Can column-encode matrix A: append one check row per weight vector, with entries wj · (columns of A) • Let’s try it out:

  23. Weighted Error Detection

  24. Weighted Error Correction • Weighted encoding detects AND corrects single errors • Even for a non-full encoding
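
A sketch of weighted checking rows. The weights w1 = (1,1,…,1) and w2 = (1,2,…,m) are an assumed example (the talk's actual weights were in a figure); they suffice to locate and repair a single error even in a column-encoded matrix:

```python
def weighted_encode(A):
    """Column-encode A with two check rows: w1 = (1,...,1), w2 = (1,2,...,m).
    These particular weights are an assumed example, not the talk's choice."""
    m = len(A)
    c1 = [sum(col) for col in zip(*A)]
    c2 = [sum((i + 1) * col[i] for i in range(m)) for col in zip(*A)]
    return A + [c1, c2]

def correct_single_error(M):
    """M = weighted_encode(A) with at most one corrupted data entry."""
    m = len(M) - 2
    for j in range(len(M[0])):
        col = [M[i][j] for i in range(m)]
        s1 = sum(col) - M[m][j]          # w1 syndrome = error magnitude
        s2 = sum((i + 1) * col[i] for i in range(m)) - M[m + 1][j]
        if s1 != 0:
            i = s2 // s1 - 1             # the syndrome ratio reveals the row
            M[i][j] -= s1                # subtract the error magnitude
            return (i, j)
    return None

M = weighted_encode([[1, 2], [3, 4], [5, 6]])
M[1][0] = 10                             # corrupt one data entry (was 3)
pos = correct_single_error(M)
```

The unweighted syndrome gives the error's magnitude and the weighted-to-unweighted ratio gives its row, so no checksum column is needed.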

  25. Outline Linear Error Correcting Codes ABFT = Linear Encoding of Matrices Algorithm-Based Fault Tolerance

  26. “Surprise” • But this is all just a linear code! • Generator matrix for the above scheme: the identity matrix with the checksum weights appended as extra columns

  27. Generating Encodings • Given m = <ai,1, ai,2, …, ai,k> as the message word (or matrix row/column) • Its encoding is the codeword m·G
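
For the plain checksum scheme the generator is presumably G = [ I_k | 1 ], the identity with an appended all-ones column, so that m·G appends sum(m); a sketch (names are mine):

```python
def checksum_generator(k):
    """G = [ I_k | 1 ]: the k x k identity with an appended all-ones column."""
    return [[1 if i == j else 0 for j in range(k)] + [1] for i in range(k)]

def encode(m, G):
    """Codeword = m * G, over ordinary arithmetic as in ABFT checksums."""
    return [sum(mi * G[i][j] for i, mi in enumerate(m))
            for j in range(len(G[0]))]

# encode([3, 1, 4], checksum_generator(3)) -> [3, 1, 4, 8]
```

For the weighted scheme, each weight vector wj simply contributes one more appended column to G.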

  28. Surprise?? • Not too surprising, really • Why else would MMM preserve the encoding? • Another possibility: power-of-two weights • Efficient: can be implemented via bit shifts • Room open for using any linear code!

  29. Error Detection/Correction in General • To show for linear codes: • Can detect up to dmin−1 errors • Can correct up to (dmin−1)/2 errors • Let x be the original codeword • Let x̂ = x + e be the corrupted codeword • e: error vector

  30. Error Detection in General • s = Hx̂ = H(x + e) = Hx + He = He • s is called the “syndrome vector” • Independent of the original codeword • Note: weight(e) < dmin since < dmin errors • Thus: He ≠ 0 whenever e ≠ 0 (by Property 3) • Detection: if s ≠ 0, then ERROR
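
Syndrome detection in a sketch, using the standard (7,4) Hamming parity-check matrix as a stand-in (the slide's own H was in a figure):

```python
def syndrome(H, word):
    """s = H * word over GF(2); s == 0 exactly when Hx = 0, i.e. a codeword."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]

# Parity-check matrix of the (7,4) Hamming code: column i spells i in binary
# (least significant bit in the first row).
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]

codeword = [1, 1, 0, 0, 1, 1, 0]    # satisfies H * x = 0
corrupted = [1, 1, 0, 1, 1, 1, 0]   # bit at (1-indexed) position 4 flipped
```

The syndrome of `corrupted` equals column 4 of H, which is exactly how Hamming codes locate the flipped bit.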

  31. Error Correction in General • Clearly e is the correction vector • x = x̂ − e corrects the error in x̂ • Sufficient to prove: weight(e) ≤ (dmin−1)/2 ⇒ H is an isomorphism: correction vectors ↔ syndrome vectors • i.e. for each correction vector (want to know) there is a unique syndrome vector, and vice versa • Thus, possible to correct any such error • (may not be efficient)

  32. H is Onto • weight(e) ≤ (dmin−1)/2 < dmin • rank(H) = n − k ≥ dmin − 1 ≥ weight(e) • Thus, rank(H) ≥ weight(e) and He ≠ 0 • Not enough 1’s in e to sum H’s columns to 0 (Property 3) • H maps onto its range • Thus, every correction vector e produces a nonzero syndrome s = He

  33. H is 1-1 • Let e1 and e2 be correction vectors, e1 ≠ e2 • Suppose that: • weight(e1), weight(e2) ≤ (dmin−1)/2 • He1 = He2 = s • He1 − He2 = H(e1 − e2) = s − s = 0 • And so, (e1 − e2) is a codeword • Thus, weight(e1 − e2) ≥ dmin • But weight(e1), weight(e2) ≤ (dmin−1)/2, and so weight(e1 − e2) ≤ dmin − 1 • Contradiction! So e1 = e2

  34. Other Encoding Schemes • Linear codes are preserved by matrix multiplication • Presumably, fancier codes might be preserved by fancier computations • Limit: • S. Winograd showed in 1962 that for any code s.t. f(x·y) = f(x) ⊛ f(y), either the rate (k/n) → 0 or the minimum weight → 0 as k → ∞ • How general can we get? • Do good solutions exist for small k? • k = 64 bits should be good enough

  35. Summary • For Matrix Multiplication, can encode input via linear codes • Solutions exist for more complex computations • Ex: Fourier Transforms • On parallel systems must ensure: • No processor touches >1 element per row/column • Else, if one processor fails, the encoding is overwhelmed with errors • To ensure this, must modify the algorithm • A separate “check placement” theory
