
Tests and Tolerances for High-Performance Software-Implemented Fault Detection

This paper discusses fault detection in numerical libraries, distinguishing between errors and round-offs in computed results.



Presentation Transcript


  1. Tests and Tolerances for High-Performance Software-Implemented Fault Detection • Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou

  2. Objective • Detect faults in software by checking the computed output of common numerical libraries • The faulty environment considered here consists essentially of bit flips in the application's state space • Distinguish between fault-induced errors and ordinary round-off in computed results

  3. Faults and EDMs • Single Event Upsets (SEUs) • Radiation-induced errors that cause bit flips in memory and cache • Affect both application data and code • Data errors are more difficult to detect • Error-Detecting Middleware (EDM) • Wraps existing numerical libraries • Avoids altering the internals of the library • The check should be more efficient than the original computation
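The bit-flip fault model on this slide can be made concrete with a small sketch. It assumes application data stored as IEEE 754 doubles; the helper names `flip_bit` and `inject_fault` are hypothetical and do not come from the presentation.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    # Reinterpret the double as an unsigned 64-bit integer, flip, convert back.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def inject_fault(data: list, rng: random.Random) -> tuple:
    """Flip one random bit of one random element of `data`, in place.

    Returns the (index, bit) that was hit, so a run can be replayed.
    """
    index = rng.randrange(len(data))
    bit = rng.randrange(64)
    data[index] = flip_bit(data[index], bit)
    return index, bit
```

A flip in a low mantissa bit perturbs a value by about one part in 10^16, while a flip in an exponent bit can change it by many orders of magnitude, which is one reason data errors span such a wide range of detectability.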

  4. Numerical Error Checking - Summary • Consider common numerical matrix computations • Use “post-conditions” to evaluate correctness • Post-condition: Necessary relation between inputs & computed outputs • Use well-known upper bounds on error propagation within numerical algorithms for matrix computations • Define tests and tolerances to separate errors and round-offs • Develop input-independent tolerances

  5. Definitions: Vector & Matrix norms • Vectors: $\|v\|_1 = \sum_i |v_i|$, $\|v\|_\infty = \max_i |v_i|$, $\|v\|_2 = (\sum_i |v_i|^2)^{1/2}$ • Matrices: $\|A\|_1$ = maximum absolute column sum of $A$, $\|A\|_\infty$ = maximum absolute row sum of $A$, $\|A\|_2$ = largest singular value of $A$, $\|A\|_F = (\sum_{i,j} |a_{ij}|^2)^{1/2}$
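As an illustration of these definitions, the sketch below computes the norms that need only sums and maxima, for matrices stored as lists of rows. The function names are hypothetical. The matrix 2-norm is left out because it requires a singular value computation.

```python
import math


def vec_norm_1(v):
    """Sum of absolute values."""
    return sum(abs(x) for x in v)


def vec_norm_inf(v):
    """Largest absolute value."""
    return max(abs(x) for x in v)


def vec_norm_2(v):
    """Euclidean length."""
    return math.sqrt(sum(abs(x) ** 2 for x in v))


def mat_norm_1(A):
    """Maximum absolute column sum."""
    return max(sum(abs(x) for x in col) for col in zip(*A))


def mat_norm_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)


def mat_norm_fro(A):
    """Frobenius norm: square root of the sum of squared magnitudes."""
    return math.sqrt(sum(abs(x) ** 2 for row in A for x in row))
```

For the vector `[3, -4]` these give 7, 4 and 5 respectively. For the matrix `[[1, -2], [3, 4]]` the 1-norm is 6 and the infinity norm is 7.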

  6. Matrices review • Orthogonal matrix: $A A^T = I \Rightarrow A^{-1} = A^T$ • Unitary matrix: $A^{*T} = A^{-1}$ • Permutation matrix: the rows of $I$, reordered • Sub-multiplicative property: $\|Av\| \le \|A\| \, \|v\|$ and $\|AB\| \le \|A\| \, \|B\|$

  7. Numerical Functions • Matrix multiplication • QR decomposition: $A = QR$ • $A$ = input matrix • $Q$ = orthogonal matrix • $R$ = upper triangular matrix • Singular value decomposition: $A = U D V^T$ • $A$ = input matrix • $D$ = diagonal matrix • $U$ and $V$ = orthogonal matrices

  8. Numerical Functions (contd.) • LU decomposition: $A = PLU$ • $P$ = permutation matrix • $L$ = lower triangular matrix • $U$ = upper triangular matrix • System solution • Solve for $x$ in $Ax = b$, given $A$ and $b$ • Matrix inverse • Given $A$, find $B$ such that $AB = I$

  9. Numerical Functions (contd.) • Fourier transform • Given $x$, find $y$ such that $y = Wx$, where $W$ is the $n \times n$ matrix of Fourier basis functions, $W_{mk} = e^{-j 2\pi mk/n}$ • Inverse Fourier transform • Given $y$, find $x$ such that $x = n^{-1} W^{*T} y$, where $W^{*T}$ is the conjugate transpose of $W$ (so that $n^{-1} W^{*T} = W^{-1}$)

  10. Operations & Post-conditions

  11. Probe Vector • Checking the post-condition $A = \hat{Q}\hat{R}$ directly is computationally intensive • Instead, multiply both sides by a probe vector $w$ and compare the resulting vectors: $A w$ versus $\hat{Q}\hat{R} w$ • Choice of $w$ • Elements of $w$ should not vary greatly in magnitude • $w$ should be non-zero everywhere • Can be a vector of all ones, except for the FFT
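The saving from the probe vector can be seen from a rough operation count. The figures below assume dense $n \times n$ matrices and the standard cubic-cost matrix product. They are an illustration and not numbers taken from the presentation.

```latex
\begin{align*}
\text{form } \hat{Q}\hat{R} \text{ and compare with } A
  &:\quad O(n^3) \text{ operations (one matrix-matrix product)},\\
\text{form } \hat{Q}(\hat{R}w) \text{ and compare with } Aw
  &:\quad O(n^2) \text{ operations (three matrix-vector products)}.
\end{align*}
```

Because the QR decomposition itself costs $O(n^3)$, only the second form keeps the check asymptotically cheaper than the computation it guards.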

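  12. Error Propagation – Matrix multiplication • Error matrix: $E = \hat{P} - AB$, where $\hat{P} = \mathrm{mult}(A, B)$ is the computed product • $\|E\|_\infty \le n \|A\|_\infty \|B\|_\infty u$ • $u$ = difference between unity and the next larger floating-point number; $n$ = dimension common to $A$ and $B$ • $d = \hat{P} w - A B w = E w$ • $\|d\|_\infty = \|E w\|_\infty \le \|E\|_\infty \|w\|_\infty \le n \|A\|_\infty \|B\|_\infty \|w\|_\infty u$ • Test: $\|d\|_\infty / (\|A\|_\infty \|B\|_\infty \|w\|_\infty) \gtrless \tau u$, where $\tau$ is the threshold • $n$ is ignored: in the average case, round-off errors are independent of dimension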
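A minimal sketch of this test as a wrapper around a matrix product follows. It assumes matrices stored as lists of rows, an all-ones probe vector by default, and an arbitrary placeholder threshold `tau=100.0`. The names `matmul_check_ratio` and `checked_matmul` are hypothetical, and the plain-Python `matmul` merely stands in for the library routine being wrapped.

```python
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Stand-in for the library routine under test."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def norm_inf_vec(v):
    return max(abs(x) for x in v)


def norm_inf_mat(M):
    return max(sum(abs(x) for x in row) for row in M)


def matmul_check_ratio(A, B, P_hat, w=None):
    """Return ||P_hat w - A (B w)||_inf / (||A||_inf ||B||_inf ||w||_inf u).

    Only matrix-vector products are used, so the check costs O(n^2).
    Assumes A, B and w are non-zero, otherwise the scale is zero.
    """
    if w is None:
        w = [1.0] * len(B[0])
    d = [p - q for p, q in zip(matvec(P_hat, w), matvec(A, matvec(B, w)))]
    scale = norm_inf_mat(A) * norm_inf_mat(B) * norm_inf_vec(w)
    return norm_inf_vec(d) / (scale * U)


def checked_matmul(A, B, tau=100.0):
    """Multiply, then report whether the result exceeds the threshold tau."""
    P_hat = matmul(A, B)
    return P_hat, matmul_check_ratio(A, B, P_hat) > tau
```

The returned ratio is in units of `u`, so a fault-free product should give a small number and a corrupted one a much larger number. Where to draw the line is the threshold question the later slides address.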

  13. Error Propagation • QRD: • $\|d\|_F / (\|A\|_F \|w\|_F) \gtrless \tau u$ • $d = \hat{Q}\hat{R} w - A w$ • SVD: • $\|d\| / (\|A\| \, \|w\|) \gtrless \tau u$ • $d = \hat{U}\hat{D}\hat{V}^T w - A w$ • LUD: • $\|d\| / (\|A\| \, \|w\|) \gtrless \tau u$ • $d = \hat{P}\hat{L}\hat{U} w - A w$

  14. Error Propagation (contd.) • Solve $Ax = b$: • $\|d\| / (\|A\| \, \|\hat{x}\|) \gtrless \tau u$ • $d = A \hat{x} - b$ • Matrix inverse: • $\|d\| / (\|A\| \, \|\hat{B}\| \, \|w\|) \gtrless \tau u$ • $d = \hat{B} A w - w$
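The system-solution test can be sketched as follows. The slide does not say which norm is meant, so the infinity norm is an assumption here. The small Gaussian elimination routine only stands in for a library solver, and the name `residual_ratio` is hypothetical.

```python
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double


def solve(A, b):
    """Gaussian elimination with partial pivoting (stand-in for a library solver)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[pivot] = M[pivot], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):  # back substitution
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x


def residual_ratio(A, x_hat, b):
    """Return ||A x_hat - b||_inf / (||A||_inf ||x_hat||_inf u).

    Assumes A and x_hat are non-zero, otherwise the scale is zero.
    """
    d = [sum(a * x for a, x in zip(row, x_hat)) - rhs for row, rhs in zip(A, b)]
    norm_A = max(sum(abs(a) for a in row) for row in A)
    norm_x = max(abs(x) for x in x_hat)
    return max(abs(v) for v in d) / (norm_A * norm_x * U)
```

No probe vector is needed in this case because the post-condition already compares vectors, so the check is a single matrix-vector product.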

  15. Error Propagation - FFT • Forward transform: • $d = (\hat{y} - Wx)^T w$ • $W$ is the $n \times n$ forward transform matrix containing the Fourier basis functions • $w$ cannot have a sparse transform • Error propagation: $\|e\| \le 5 n \log_2 n \, \|x\| u$ • $|d| / (n \log_2 n \, \|x\|_2 \|w\|_2) \gtrless \tau u$ • Inverse transform: • $d = (\hat{x} - n^{-1} W^{*T} y)^T w$ • $|d| / (\log_2 n \, \|y\|_2 \|w\|_2) \gtrless \tau u$
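A sketch of the forward-transform test follows. It relies on the identity (W x)^T w = x^T (W^T w) together with the symmetry of W, so the transform of the probe can be computed once per length and each check then costs O(n). The naive O(n^2) `dft` only stands in for a library FFT, the function names are hypothetical, and the probe is the cosine vector that the later slide on alternate FFT tests proposes.

```python
import cmath
import math
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double


def dft(x):
    """Naive DFT, y = W x with W[m][k] = exp(-2j*pi*m*k/n) (stand-in for an FFT)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * ((m * k) % n) / n) for k in range(n))
        for m in range(n)
    ]


def probe(n):
    """Probe with real and imaginary parts cos(4(k - n/2)/n); its transform is not sparse."""
    return [complex(c, c) for c in (math.cos(4 * (k - n / 2) / n) for k in range(n))]


def norm2(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))


def fft_check_ratio(x, y_hat, w, Ww):
    """Return |(y_hat - W x)^T w| / (n log2(n) ||x||_2 ||w||_2 u).

    Ww must be the precomputed transform of w. Assumes n > 1 and non-zero x.
    """
    n = len(x)
    d = sum(y * wk for y, wk in zip(y_hat, w)) - sum(xk * t for xk, t in zip(x, Ww))
    return abs(d) / (n * math.log2(n) * norm2(x) * norm2(w) * U)
```

An all-ones probe would be a poor choice here: its transform is zero except in one entry, so the check would depend on a single element of the input.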

  16. Comparison Tests • $\Delta$ = RHS – LHS and $\delta = \|\Delta w\|$ • ($\Delta$ is never actually computed) • T0: $\delta / \|w\| \gtrless \tau u$ • Trivial test: un-normalized comparison • T1: $\delta / (\sigma_1 \|w\|) \gtrless \tau u$ • Ideal test: may not always be computable • T2: $\delta / (\sigma_2 \|w\|) \gtrless \tau u$ • Approximate matrix test: based on computed quantities • T3: $\delta / (\|w\| + \sigma_3) \gtrless \tau u$ • Approximate vector test: higher chance of false alarms

  17. Experiments • Faults are injected in half the runs by changing a random bit of the algorithm's state space • Faults are injected at a random point of execution • The threshold value is chosen based on the error quantity computed under faulty and fault-free conditions
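One way to turn such runs into a threshold is sketched below. It assumes each run yields a single test ratio (the normalized error in units of u) and that a run is flagged when the ratio exceeds the threshold. The selection criterion, detection rate minus false-alarm rate, is an assumption chosen for illustration and not necessarily the one used in the presentation. The function names are hypothetical.

```python
def roc_points(clean, faulty, thresholds):
    """For each threshold tau return (tau, false_alarm_rate, detection_rate).

    `clean` holds ratios from fault-free runs, `faulty` from runs with an
    injected fault. A run is flagged when its ratio exceeds tau.
    """
    points = []
    for tau in thresholds:
        false_alarms = sum(r > tau for r in clean) / len(clean)
        detections = sum(r > tau for r in faulty) / len(faulty)
        points.append((tau, false_alarms, detections))
    return points


def pick_threshold(clean, faulty, thresholds):
    """Pick the tau that maximizes detection rate minus false-alarm rate."""
    best = max(roc_points(clean, faulty, thresholds), key=lambda p: p[2] - p[1])
    return best[0]
```

With made-up ratios `clean = [0.1, 0.5, 1.2, 2.0]` and `faulty = [0.8, 50.0, 1e6, 3e9]`, and candidate thresholds 1, 10 and 100, the sketch selects 10. The faulty run with ratio 0.8 is indistinguishable from round-off and is missed at every candidate.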

  18. Choosing $\tau$

  19. [Figure: curves labeled T0, T1, T2, T3]

  21. Alternate tests for FFT • Parseval's condition: $(\|x\|_2 - n^{-1/2} \|y\|_2) / \|x\|_2 \gtrless \tau u$ • Choose a probe vector $w_2$ whose real and imaginary parts both equal $\cos(4(k - n/2)/n)$, $k = 0, 1, \ldots, n-1$, and compute the difference as before • [Figure: ROC for FFT]
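The Parseval test needs only two norms, as the sketch below shows. It assumes the unnormalized forward transform, for which the norm of y equals sqrt(n) times the norm of x, and it takes the absolute value of the difference, which the slide leaves implicit. The naive `dft` again stands in for a library FFT, and `parseval_ratio` is a hypothetical name.

```python
import cmath
import math
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double


def dft(x):
    """Naive unnormalized DFT (stand-in for a library FFT)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * ((m * k) % n) / n) for k in range(n))
        for m in range(n)
    ]


def norm2(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))


def parseval_ratio(x, y_hat):
    """Return | ||x||_2 - ||y_hat||_2 / sqrt(n) | / (||x||_2 u). Assumes non-zero x."""
    n = len(x)
    norm_x = norm2(x)
    return abs(norm_x - norm2(y_hat) / math.sqrt(n)) / (norm_x * U)
```

This test cannot see a fault that leaves the norm of the output unchanged, for example one that alters only the phase of an element of y, which is one reason to pair it with a probe-vector test.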

  22. Related work • ABFT (algorithm-based fault tolerance): introduced by Huang & Abraham for matrix operations, 1984 • Error detection is tied to the algorithm employed: the matrix is encoded with a checksum matrix • Vastly extended by others for various numerical operations • Result Checking (RC): introduced by Blum & Wasserman, 1996, with a focus on computation errors • Prata & Silva, 1999, compared the two and found RC more efficient than ABFT for matrix multiplication and QRD

  23. Summary • Faults are detected using conditions that the numerical output must satisfy • Implemented as wrappers around existing libraries • Experiments are run under fault-free and faulty conditions to observe the decision criterion • $\tau_{ub} \gg \tau^*$, so $\tau$ can be set from an average-case outlook rather than a worst-case assumption • Selecting $\tau$ is a trade-off between fault detection and false alarms • Can be extended to other common computations such as sorting and integration
