

Algorithm-Based Fault Tolerance for Matrix Operations

Proposed by:

Kuang-Hua Huang

Jacob A. Abraham

Problem Description
  • Achieve fault tolerance through the algorithm itself rather than through dedicated hardware
  • Existing techniques incur high overhead
    • Error Masking (hardware redundancy)
    • Error Detection and Recovery (hardware/time redundancy)
Existing Techniques
  • Error Masking
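    • Triple Modular Redundancy (TMR): 200% hardware overhead
    • Quadded Logic: 300% hardware overhead
  • Error Detection and Recovery
    • Totally Self-Checking (TSC) circuits: 73%-83% hardware overhead
    • Alternating Logic: 100% time + 85% hardware overhead
    • RESO (Recomputing with Shifted Operands): 100% time overhead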
    • Watchdog processors
Algorithm-Based Fault Tolerance
  • Pros
    • Detects and corrects errors
    • Extremely low overhead
  • Cons
    • Not generally applicable (mostly useful for MPP systems)
    • Some error patterns go undetected when more than one error occurs
Approach
  • Encoding of data
  • Redesign of Algorithm
    • Information must be easy to recover
    • Time overhead must be low
  • Distribution of computation steps
    • Arranged so that the errors caused by a single faulty unit can be detected and corrected
Checksum Matrices
  • Definitions
    • Column checksum matrix
    • Row checksum matrix
    • Full checksum matrix
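
The slide lists the three encodings by name only. As usually defined, a column checksum matrix A_c is A with one extra row holding the sum of each column, a row checksum matrix A_r is A with one extra column holding the sum of each row, and a full checksum matrix A_f carries both. The following is a minimal sketch of these definitions, assuming matrices stored as lists of lists of integers; the function names are illustrative and do not come from the presentation.

```python
def column_checksum(a):
    """Return A_c: A with one extra row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    """Return A_r: A with one extra column holding the sum of each row."""
    return [row + [sum(row)] for row in a]


def full_checksum(a):
    """Return A_f: the column checksum matrix of the row checksum matrix."""
    return column_checksum(row_checksum(a))


A = [[1, 2],
     [3, 4]]

print(column_checksum(A))  # [[1, 2], [3, 4], [4, 6]]
print(row_checksum(A))     # [[1, 2, 3], [3, 4, 7]]
print(full_checksum(A))    # [[1, 2, 3], [3, 4, 7], [4, 6, 10]]
```

The bottom-right element of A_f (10 here) is the sum of all information elements, so it is consistent with both the checksum row and the checksum column.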
Theorems
  • Matrix Multiplication
    • A * B = C  ⇒  A_c * B_r = C_f
  • LU Decomposition
    • C = L * U  ⇒  C_f = L_c * U_r
  • Addition
    • A + B = C  ⇒  A_f + B_f = C_f
  • Scalar Multiplication
    • c * A_f = (c * A)_f
  • Transpose
    • (A_f)^T = (A^T)_f
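
The identities above can be checked numerically on a small case. The sketch below assumes integer matrices stored as lists of lists and uses illustrative helper names that are not part of the presentation. For LU it only checks the product identity L_c * U_r = C_f on hand-picked factors and does not perform a decomposition.

```python
def column_checksum(a):
    # A_c: append a row of column sums.
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    # A_r: append a column of row sums.
    return [row + [sum(row)] for row in a]


def full_checksum(a):
    # A_f: both the checksum column and the checksum row.
    return column_checksum(row_checksum(a))


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(c, a):
    return [[c * x for x in row] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
L = [[1, 0], [2, 1]]   # hand-picked lower-triangular factor
U = [[3, 4], [0, 5]]   # hand-picked upper-triangular factor

checks = {
    "multiplication": matmul(column_checksum(A), row_checksum(B))
                      == full_checksum(matmul(A, B)),
    "lu product": matmul(column_checksum(L), row_checksum(U))
                  == full_checksum(matmul(L, U)),
    "addition": add(full_checksum(A), full_checksum(B))
                == full_checksum(add(A, B)),
    "scalar": scale(3, full_checksum(A)) == full_checksum(scale(3, A)),
    "transpose": transpose(full_checksum(A)) == full_checksum(transpose(A)),
}

for name, holds in checks.items():
    print(name, holds)  # each line ends with True
```

Integers keep the comparisons exact. With floating-point data the equality tests would need a tolerance, because round-off makes the recomputed sums differ slightly from the stored checksums.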
Error Detection and Correction
  • Detection
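    • Compute the sum (S1) of the information elements in each row/column and compare it to the corresponding checksum (S2)
  • Location
    • The erroneous element lies at the intersection of the inconsistent row and the inconsistent column (those with S1 ≠ S2)
  • Correction
    • Error in an information element: E = E’ + (S2 – S1), where E’ is the erroneous value
    • Error in a checksum element: replace the checksum with the recomputed sum (S2 ← S1)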
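
The three steps above can be sketched as follows for a full checksum matrix that contains at most one erroneous element. This is an illustration under stated assumptions, not the authors' implementation: the matrix is a list of lists of integers, the function name `locate_and_correct` is hypothetical, and any pattern other than a single error is rejected rather than repaired.

```python
def locate_and_correct(cf):
    """Detect, locate, and correct one error in a full checksum matrix.

    Modifies cf in place. Returns the (row, column) of the corrected
    element, or None if every checksum is consistent.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1  # size of the information part

    # Detection: S1 is the recomputed sum, S2 is the stored checksum.
    row_s1 = [sum(row[:m]) for row in cf]
    col_s1 = [sum(cf[i][j] for i in range(n)) for j in range(m + 1)]
    bad_rows = [i for i in range(n + 1) if row_s1[i] != cf[i][m]]
    bad_cols = [j for j in range(m + 1) if col_s1[j] != cf[n][j]]

    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single-element error pattern")

    # Location: intersection of the inconsistent row and column.
    i, j = bad_rows[0], bad_cols[0]

    # Correction.
    if j == m:
        cf[i][m] = row_s1[i]              # checksum column element: S2 <- S1
    elif i == n:
        cf[n][j] = col_s1[j]              # checksum row element: S2 <- S1
    else:
        cf[i][j] += cf[i][m] - row_s1[i]  # E = E' + (S2 - S1)
    return (i, j)


# Full checksum matrix of [[19, 22], [43, 50]] with element (0, 1) corrupted.
C_f = [[19, 25, 41],
       [43, 50, 93],
       [62, 72, 134]]

print(locate_and_correct(C_f))  # (0, 1)
print(C_f[0][1])                # 22
```

The corner element is handled by the first branch: when the total checksum itself is wrong, the last row and the last column are the inconsistent pair.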
Mesh Connected Processor Arrays

In a mesh-connected system, each processor individually computes one element of the resultant matrix. In a systolic array, an array of processors handles a row of values.

[Figure: processor array with input arrays A and B]

Overhead for MPP systems

Mesh Connected Arrays

Systolic Arrays

Undetectable Loop Patterns

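  • Certain error patterns mask one another, so every checksum still appears consistent
  • Caused by faulty processors
  • A minimum number of processors is required to detect the errors

[Figure: grids with X marks at the positions of the errors that form loop patterns]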
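
One way such a pattern can arise is sketched below. It assumes four errors of equal magnitude and alternating sign at the corners of a rectangle, which is an illustrative choice rather than a pattern taken from the slides. The matrix values and helper names are likewise hypothetical.

```python
def inconsistent_lines(cf):
    """Return (rows, columns) whose recomputed sum differs from the checksum."""
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n + 1) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m + 1)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    return bad_rows, bad_cols


def with_errors(cf, errors):
    """Return a copy of cf with each (i, j, delta) added to element (i, j)."""
    damaged = [row[:] for row in cf]
    for i, j, delta in errors:
        damaged[i][j] += delta
    return damaged


# A consistent 3 x 3 matrix with its checksum column and checksum row.
C_f = [[1, 2, 3, 6],
       [4, 5, 6, 15],
       [7, 8, 9, 24],
       [12, 15, 18, 45]]

# Four errors on the corners of a rectangle cancel in every row and column.
loop = with_errors(C_f, [(0, 0, 5), (0, 1, -5), (1, 0, -5), (1, 1, 5)])
print(inconsistent_lines(loop))  # ([], [])

# Two errors that do not close a loop are detected, but the two suspect
# rows and columns intersect in four places, so they cannot be located.
pair = with_errors(C_f, [(0, 0, 5), (1, 1, 5)])
print(inconsistent_lines(pair))  # ([0, 1], [0, 1])
```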

Uniprocessor Systems
  • In a uniprocessor system, a single faulty processor can make every element of the result incorrect
Conclusion
  • Algorithm-based fault tolerance applied to matrix operations
  • Low ratio of redundancy
  • Ability to detect and correct errors
  • Ongoing research
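
The low redundancy ratio can be illustrated by counting the extra elements in a full checksum matrix. The sketch below assumes square n x n data and measures storage only, so it is not a reconstruction of the presentation's own overhead figures. The function name is hypothetical.

```python
def full_checksum_redundancy(n):
    """Extra elements of an (n+1) x (n+1) full checksum matrix,
    as a fraction of the n x n information elements: (2n + 1) / n^2."""
    return ((n + 1) ** 2 - n ** 2) / n ** 2


for n in (4, 16, 64, 256):
    print(f"n = {n:3d}: {full_checksum_redundancy(n):.2%} extra elements")
```

The ratio shrinks roughly as 2/n, whereas the 200% listed for Triple Modular Redundancy stays fixed regardless of problem size.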