Automatic Data Structure Repair for Self-Healing Systems

Brian Demsky, Martin Rinard. Laboratory for Computer Science, Massachusetts Institute of Technology.


Presentation Transcript


  1. Automatic Data Structure Repair for Self-Healing Systems Brian Demsky Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology

  2. Motivation Errors • Missing elements • Inappropriate sharing • Dangling references • Out-of-bounds array indices • Inconsistent values [Figure: Broken Data Structure with node fields F = 20, G = 10; F = 20, G = 5; I = 5; J = 2]

  3. Goal [Figure: Broken Data Structure → Repair Algorithm → Consistent Data Structure; node fields F, G, I, J shown before and after repair]

  4. Goal [Figure: Broken Data Structure plus Consistency Properties From Developer → Repair Algorithm → Consistent Data Structure; node fields F, G, I, J shown before and after repair]

  5. What Does Repair Algorithm Produce? • Data structure that • Satisfies consistency properties, and • Heuristically close to broken data structure • Not necessarily the same data structure as (hypothetical) correct program would produce • But enough to keep program operating successfully

  6. Precursors • Data structure repair has historically appeared in systems with extreme reliability goals • 5ESS switch – hand coded audit routines • IBM MVS operating system – hand coded failure recovery routines • Key component of these systems

  7. Where Is This Likely To Be Useful? • Not for systems with slack - can just reboot • Cause of error must go away after reboot • Must be OK to lose volatile state • Must be OK to wait for reboot • Persistent data structures (file systems, application files) • Autonomous and/or safety critical systems • Monitor/control unstable physical phenomena • Largely independent subcomputations • Moving time window

  8. Architecture [Figure: Broken Bits → Model Definition & Translation → Broken Abstract Model → Internal Consistency Properties → Repaired Abstract Model → External Consistency Properties → Repaired Bits]

  9. Architecture Rationale Why go through the abstract model? • Simple, uniform structure • Sets of objects • Relations between objects • Simplifies both • Expression of consistency properties • Repair algorithm • Enables system to support full range of efficient, heavily encoded data structures

  10. File System Example struct Entry { byte name[Length]; int firstBlock; } struct Block { int nextBlock; byte data[BlockSize]; } struct Disk { Entry dir[NumEntries]; Block block[NumBlocks]; } Disk D; [Figure: Directory Entries abst (firstBlock 0) and intro (firstBlock 2); Disk Blocks 0 to 3 with nextBlock values 1, -5, 1, -1]

  11. Model Definition • Sets of objects: set blocks of integer : partition used | free; • Relations between objects (values of object fields, referencing relationships between objects): relation next : used, used; [Figure: set blocks partitioned into used and free, with relation next over used]

  12. Model Translation Bits translated to sets and relations in abstract model using statements of the form: Quantifiers, Condition ⇒ Inclusion Constraint • for i in 0..NumEntries, 0 ≤ D.dir[i].firstBlock and D.dir[i].firstBlock < NumBlocks ⇒ D.dir[i].firstBlock in used • for b in used, 0 ≤ D.block[b].nextBlock and D.block[b].nextBlock < NumBlocks ⇒ ⟨b, D.block[b].nextBlock⟩ in next • for ⟨b,n⟩ in next, true ⇒ n in used • for b in 0..NumBlocks, not (b in used) ⇒ b in free
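The four translation rules can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `first_block`, `next_block`, and `build_model` are hypothetical, and the disk contents are the values from the file system example (directory entries pointing at blocks 0 and 2, nextBlock values 1, -5, 1, -1). Because the rules for `used` and `next` feed each other, the sketch iterates until nothing new is added.

```python
# Hypothetical in-memory stand-in for the Disk D of the example.
first_block = {"abst": 0, "intro": 2}   # D.dir[i].firstBlock
next_block = [1, -5, 1, -1]             # D.block[b].nextBlock


def build_model(first_block, next_block):
    """Apply the translation rules until no rule adds anything new."""
    num_blocks = len(next_block)
    used, nxt = set(), set()
    # Rule 1: every in-range firstBlock value is a used block.
    for fb in first_block.values():
        if 0 <= fb < num_blocks:
            used.add(fb)
    # Rules 2 and 3 depend on each other, so iterate to a fixed point.
    changed = True
    while changed:
        changed = False
        for b in list(used):
            n = next_block[b]
            if 0 <= n < num_blocks:
                if (b, n) not in nxt:      # Rule 2: <b, nextBlock> in next
                    nxt.add((b, n))
                    changed = True
                if n not in used:          # Rule 3: targets of next are used
                    used.add(n)
                    changed = True
    # Rule 4: every block that is not used is free.
    free = set(range(num_blocks)) - used
    return used, free, nxt


used, free, nxt = build_model(first_block, next_block)
print(sorted(used), sorted(free), sorted(nxt))
```

Note that block 1's out-of-range nextBlock value (-5) simply produces no pair in `next`; the translation ignores bits that do not satisfy a rule's condition.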

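  13. Model in Example [Figure: Directory Entries abst (firstBlock 0), intro (firstBlock 2); Disk Blocks with nextBlock values 1, -5, 1, -1; resulting abstract model: used = {0, 1, 2}, free = {3}, next = {⟨0,1⟩, ⟨2,1⟩}]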

  14. Internal Consistency Properties Quantifiers, Body • Body is first-order property of basic propositions • Inequality constraints on values of numeric fields • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E • Presence of required number of objects • size(S) = C, size(S) ≤ C, size(S) ≥ C • Topology of region surrounding each object • size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C • size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C • Inclusion constraints: V in S, V1 in V2.R, ⟨V1,V2⟩ in R • Example: for b in used, size(next.b) ≤ 1

  15. Internal Consistency Violations Evaluate consistency properties, find violations • for b in used, size(next.b) ≤ 1 is false for b = 1 [Figure: abstract model with used = {0, 1, 2}, free = {3}, next = {⟨0,1⟩, ⟨2,1⟩}]
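Evaluating the example property over the abstract model amounts to counting, for each used block, how many pairs in `next` point at it. The sketch below assumes the model is held as plain Python sets; `predecessors` and `violations` are hypothetical helper names.

```python
# Abstract model of the running example.
used = {0, 1, 2}
nxt = {(0, 1), (2, 1)}


def predecessors(relation, b):
    """Inverse image next.b: every a such that the pair (a, b) is in the relation."""
    return {a for (a, target) in relation if target == b}


def violations(used, relation, limit=1):
    """Bindings of b for which size(next.b) <= limit is false."""
    return sorted(b for b in used if len(predecessors(relation, b)) > limit)


print(violations(used, nxt))  # [1]
```

Each returned value is a binding for the quantified variable, which is what the repair step needs in order to know where to act.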

  16. Repairing Violations of Internal Consistency Properties • Violation provides binding for quantified variables • Convert Body to disjunctive normal form (p1 ∧ … ∧ pn) ∨ … ∨ (q1 ∧ … ∧ qm), where p1 … pn, q1 … qm are basic propositions • Choose a conjunction to satisfy • Repair violated basic propositions in conjunction
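A small sketch of the normal-form step, under stated assumptions: formulas are nested tuples (`("atom", name)`, `("not", f)`, `("and", f, g)`, `("or", f, g)`), a conjunction is a set of (name, wanted truth value) literals, and the conjunction chosen is the one with the fewest currently violated literals. That selection rule is an illustrative heuristic; the slides only say that some conjunction is chosen.

```python
from itertools import product


def dnf(formula, positive=True):
    """Return a list of conjunctions, each a frozenset of (name, wanted) literals."""
    kind = formula[0]
    if kind == "atom":
        return [frozenset({(formula[1], positive)})]
    if kind == "not":
        return dnf(formula[1], not positive)
    left = dnf(formula[1], positive)
    right = dnf(formula[2], positive)
    # De Morgan: under negation "and" acts as "or" and vice versa.
    acts_as_or = (kind == "or") == positive
    if acts_as_or:
        return left + right
    return [a | b for a, b in product(left, right)]


def choose_conjunction(conjunctions, truth):
    """Assumed heuristic: pick the conjunction needing the fewest repairs."""
    def cost(conj):
        return sum(1 for name, wanted in conj if truth[name] != wanted)
    return min(conjunctions, key=cost)


p, q, r = ("atom", "p"), ("atom", "q"), ("atom", "r")
body = ("and", ("or", p, q), r)          # (p or q) and r
conjunctions = dnf(body)                 # (p and r) or (q and r)
chosen = choose_conjunction(conjunctions, {"p": False, "q": True, "r": False})
print(sorted(chosen))
```

With `p` false, `q` true, and `r` false, the conjunction `q ∧ r` needs only one repair (making `r` true), so it is the one selected.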

  17. Repairing Violations of Basic Propositions • Inequality constraints on values of numeric fields • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E • Compute value of expression, assign field • Presence of required number of objects • size(S) = C, size(S) ≤ C, size(S) ≥ C • Remove or insert objects from/to set • Topology of region surrounding each object • size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C • size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C • Remove or insert pairs from/to relation • Inclusion constraints: V in S, V1 in V2.R, ⟨V1,V2⟩ in R • Remove or add the object or pair from/to set or relation

  18. Repair in Example for b in used, size(next.b) ≤ 1 is false for b = 1 • Must repair size(next.1) ≤ 1 • Can remove either ⟨0,1⟩ or ⟨2,1⟩ from next [Figure: abstract model with used = {0, 1, 2}, free = {3}, next = {⟨0,1⟩, ⟨2,1⟩}]

  19. Repair in Example for b in used, size(next.b) ≤ 1 is false for b = 1 • Must repair size(next.1) ≤ 1 • Can remove either ⟨0,1⟩ or ⟨2,1⟩ from next [Figure: abstract model after one of the two next pairs into block 1 has been removed]
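The repair of size(next.1) ≤ 1 can be sketched as removing pairs from the relation until the bound holds. Which pair to drop is a policy decision the slides leave open; the sketch below keeps the pairs with the smallest predecessor numbers, purely as an illustrative assumption, and `repair_predecessor_bound` is a hypothetical name.

```python
def repair_predecessor_bound(relation, b, limit=1):
    """Remove pairs (a, b) until at most `limit` of them remain.

    Assumed policy: keep the pairs whose predecessor a is smallest.
    """
    preds = sorted(a for (a, target) in relation if target == b)
    repaired = set(relation)
    for a in preds[limit:]:
        repaired.discard((a, b))
    return repaired


nxt = {(0, 1), (2, 1)}
repaired = repair_predecessor_bound(nxt, b=1)
print(sorted(repaired))  # [(0, 1)]
```

Under this policy the pair (2, 1) is removed, so block 2 no longer has a successor in the model.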

  20. Acyclic Repair Dependences • Questions • Isn’t it possible for the repair of one constraint to invalidate another constraint? • What about infinite repair loops? • What about unsatisfiable specifications? • Answer • We require specifications to have no cyclic repair dependences between constraints • So all repair sequences terminate • Repair can fail only because of resource limitations
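One way to picture the acyclicity requirement is as a directed graph in which an edge from constraint A to constraint B means "repairing A may invalidate B". The sketch below is a simplification under that assumption (the slides do not describe how the dependence graph is built); it only shows the standard depth-first check that rejects a specification whose graph has a cycle. The constraint names are hypothetical.

```python
def has_cycle(graph):
    """Depth-first search with three colours; a grey successor means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for succ in graph.get(node, ()):
            state = color.get(succ, WHITE)
            if state == GREY:
                return True
            if state == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Hypothetical dependences: repairing c1 may break c2, which may break c3.
acyclic_spec = {"c1": ["c2"], "c2": ["c3"], "c3": []}
cyclic_spec = {"c1": ["c2"], "c2": ["c1"]}
print(has_cycle(acyclic_spec), has_cycle(cyclic_spec))  # False True
```

When the graph is acyclic, every chain of triggered repairs has bounded length, which is why all repair sequences terminate.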

  21. External Consistency Constraints Quantifiers, Condition ⇒ Body • Body of form V = E, V.F = E, V.F[I] = E • Example • for b in free, true ⇒ D.block[b].nextBlock = -2 • for ⟨i,j⟩ in next, true ⇒ D.block[i].nextBlock = j • for b in used, size(b.next) = 0 ⇒ D.block[b].nextBlock = -1 • Repair simply performs assignments • Translates model repairs to bit repairs
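The three example constraints translate directly into assignments. The sketch below applies them to a list standing in for the nextBlock fields, using a model in which the pair (2, 1) has been removed from next (one of the two repair options in the running example). The function name and the list representation are assumptions made for illustration.

```python
FREE_MARK = -2   # nextBlock value written for free blocks
END_MARK = -1    # nextBlock value written for used blocks with no successor


def apply_external_constraints(next_block, used, free, nxt):
    """Return a patched copy of the nextBlock fields that matches the model."""
    patched = list(next_block)
    for b in free:                       # for b in free: nextBlock = -2
        patched[b] = FREE_MARK
    for (i, j) in nxt:                   # for <i, j> in next: nextBlock = j
        patched[i] = j
    with_successor = {i for (i, _) in nxt}
    for b in used:                       # size(b.next) = 0: nextBlock = -1
        if b not in with_successor:
            patched[b] = END_MARK
    return patched


broken_bits = [1, -5, 1, -1]
repaired_bits = apply_external_constraints(
    broken_bits, used={0, 1, 2}, free={3}, nxt={(0, 1)}
)
print(repaired_bits)  # [1, -1, -1, -2]
```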

  22. Repair in Example [Figure: Inconsistent File System with Directory Entries abst (firstBlock 0), intro (firstBlock 2) and Disk Blocks with nextBlock values 1, -5, 1, -1; Repaired File System with the same Directory Entries and nextBlock values 1, -1, -1, -2]

  23. When to Test for Consistency and Repair • Persistent data structures • Repair can be independent activity, or • Repair when data written out or read in • Volatile data structures in running program • Under programmer control • Transaction-based approach • Identify transaction start and end • Repair at start, end, or both • Failure-based approach • Wait until program fails • Repair and restart from latest safe point
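The transaction-based approach can be sketched as a wrapper that checks consistency, and repairs if needed, at the start and/or end of each transaction. Everything here is hypothetical: `with_repair`, the toy state with an `items` list and a redundant `count` field, and the deliberately buggy transaction are stand-ins chosen to keep the example small.

```python
import functools


def with_repair(check, repair, at_start=True, at_end=True):
    """Wrap a transaction so the state is checked and repaired around it."""
    def decorate(transaction):
        @functools.wraps(transaction)
        def wrapper(state, *args, **kwargs):
            if at_start and not check(state):
                repair(state)
            result = transaction(state, *args, **kwargs)
            if at_end and not check(state):
                repair(state)
            return result
        return wrapper
    return decorate


def is_consistent(state):
    return state["count"] == len(state["items"])


def repair_count(state):
    state["count"] = len(state["items"])


@with_repair(is_consistent, repair_count)
def buggy_append(state, item):
    state["items"].append(item)   # bug: forgets to update state["count"]


state = {"items": ["a"], "count": 1}
buggy_append(state, "b")
print(state)  # {'items': ['a', 'b'], 'count': 2}
```

Repairing at the end keeps an error from outliving the transaction that caused it, while repairing at the start protects a transaction from damage left by earlier code; the slide lists both as options.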

  24. Experience • We acquired four benchmarks (written in C/C++) • CTAS (air-traffic control tool) • Simplified Linux file system • Freeciv interactive game • Microsoft Word files • We developed specifications for all four • Very little development time (days, not weeks) • Most of time spent figuring out Freeciv and CTAS • Each benchmark has • Workload • Fault insertion methodology • Ran benchmarks with and without repair

  25. CTAS • Set of air-traffic control tools • Traffic management • Arrival planning • Flow visualization • Shortcut planning • Deployed in centers around country (Dallas/Ft. Worth, Los Angeles, Denver, Miami, Minneapolis/St. Paul, Atlanta, Oakland) • Approximately 1 million lines of C/C++ code

  26. CTAS Screen Shot

  27. Results • Workload – recorded radar feed from DFW • Fault insertion • Simulate error in flight plan processing • Bad airport index in flight plan data structure • Without repair • System crashes – segmentation fault • With repair • Aircraft has different origin or destination • System continues to execute • Anomaly eventually flushed from system

  28. Aspects of CTAS • Lots of independent subcomputations • System processes hundreds of aircraft – problem with one should not affect others • Multipurpose system (visualization, arrival planning, shortcuts, …) – problem in one purpose should not affect others • Sliding time window: anomalies eventually flushed • Rebooting ineffective – system will crash again as soon as it sees the problematic flight plan

  29. Simplified Linux File System [Figure: disk layout with super block, group block, inode bitmap block, block bitmap block, inode block (inode … inode), directory block, and disk blocks] Some Consistency Properties • inode bitmap consistent with inode usage • block bitmap consistent with block usage • directory entries refer to valid inodes • files contain valid blocks only • files do not share blocks
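Two of these properties are easy to illustrate on a toy model in which each inode is just a list of block numbers. The representation and both function names are assumptions; a real file system stores this information in on-disk bitmaps and inode tables.

```python
def rebuild_block_bitmap(inodes, num_blocks):
    """Recompute the block bitmap from the blocks that inodes reference."""
    bitmap = [False] * num_blocks
    for blocks in inodes.values():
        for b in blocks:
            if 0 <= b < num_blocks:   # ignore invalid block numbers
                bitmap[b] = True
    return bitmap


def shared_blocks(inodes):
    """Map each block referenced by more than one inode to its owners."""
    owners = {}
    for ino, blocks in inodes.items():
        for b in blocks:
            owners.setdefault(b, set()).add(ino)
    return {b: who for b, who in owners.items() if len(who) > 1}


# Hypothetical state: inodes 1 and 2 both claim block 5.
inodes = {1: [4, 5], 2: [5, 6]}
print(rebuild_block_bitmap(inodes, num_blocks=8))
print(shared_blocks(inodes))  # {5: {1, 2}}
```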

  30. Results • Workload – write and verify several files • Fault insertion – crash file system • Inode and block bitmap errors • Partially initialized directory and inode entries • Without repair • Incorrect file contents because of inode and disk block sharing • With repair • Bitmaps repaired preventing illegal sharing, correct file contents

  31. Freeciv Consistency Properties • Tiles have valid terrain values • Cities are not in the ocean • Each city has exactly one reference from city location grid • City locations are consistent in • City structures and • tile grid [Figure: Terrain Grid (O = Ocean, P = Plain, M = Mountain) with rows O P M M, O O P M, O P M M, P P P M; City Structures with loc: 3,0 and loc: 2,3]
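The first two properties can be sketched on a small grid of terrain letters. The grid contents, the city names, the (x, y) coordinate convention, and the choice of Plain as the replacement for an invalid tile are all assumptions for illustration; the slides do not say which terrain value the repair assigns.

```python
VALID_TERRAIN = {"O", "P", "M"}   # Ocean, Plain, Mountain
DEFAULT_TERRAIN = "P"             # assumed replacement for invalid tiles


def repair_terrain(grid):
    """Replace every invalid terrain value with the assumed default."""
    return [
        [tile if tile in VALID_TERRAIN else DEFAULT_TERRAIN for tile in row]
        for row in grid
    ]


def cities_in_ocean(grid, cities):
    """Names of cities whose (x, y) location is an ocean tile."""
    return [name for name, (x, y) in cities.items() if grid[y][x] == "O"]


# Hypothetical corrupted grid: "?" is not a valid terrain value.
grid = [["O", "P", "M", "M"],
        ["O", "?", "P", "M"]]
cities = {"c1": (1, 1), "c2": (2, 1), "c3": (0, 0)}

repaired = repair_terrain(grid)
print(repaired[1][1])                     # P
print(cities_in_ocean(repaired, cities))  # ['c3']
```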

  32. Results • Workload – Freeciv software plays against itself • Fault insertion – randomly corrupt terrain values • Without repair – program fails (seg fault) • With repair • Game runs just fine • But game plays out differently because of the different terrain values

  33. Microsoft Word Files • Files consist of a sequence of streams • Streams stored using FAT-based data structure • Consistency Properties • FAT blocks exist and contain valid entries • FAT streams are properly terminated • Free blocks properly marked • Streams contain valid blocks • No sharing of blocks between streams [Figure: Directory Entries, FAT, and Disk Blocks of an example file]
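The "properly terminated" and "valid blocks" properties can be illustrated by walking a stream's chain through a FAT held as a list. The encoding (-1 for end of chain, -2 for a free block) is borrowed from the earlier file system example and is an assumption here, as are the function names and the policy of terminating a damaged chain at its last valid block.

```python
END, FREE = -1, -2   # assumed FAT encodings


def walk_stream(fat, start):
    """Follow a chain from `start`; return (blocks visited, chain is valid)."""
    seen, blocks, b = set(), [], start
    while b != END:
        out_of_range = not (0 <= b < len(fat))
        if out_of_range or b in seen or fat[b] == FREE:
            return blocks, False
        seen.add(b)
        blocks.append(b)
        b = fat[b]
    return blocks, True


def terminate_stream(fat, start):
    """Assumed repair: end a damaged chain at its last valid block."""
    blocks, ok = walk_stream(fat, start)
    repaired = list(fat)
    if not ok and blocks:
        repaired[blocks[-1]] = END
    return repaired


scrambled = [1, 2, 7, -2]                # block 2 points outside the FAT
print(walk_stream(scrambled, 0))         # ([0, 1, 2], False)
fixed = terminate_stream(scrambled, 0)
print(fixed, walk_stream(fixed, 0))      # [1, 2, -1, -2] ([0, 1, 2], True)
```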

  34. Results • Workload – several Microsoft Word files • Fault insertion – scramble FAT • Without repair • If blocks containing the FAT were incorrectly marked as free, Word successfully loads file • Otherwise, “The document name or path is not valid” • With repair • Word loads all files

  35. Extensions • Elimination of external consistency constraints • Eliminates problems with translating repairs on the abstract model to the actual data structure • Repair algorithm analyzes model definition rules to generate repair actions for the actual data structure

  36. Extensions • Support for doubly linked data structures • Enables the repair algorithm to regenerate back links

  37. Extensions • Compilation and optimization of consistency checking • Achieved significant speedups (n x) by compiling the specification • Achieved further speedups () by partially optimizing away the construction of the abstract model

  38. Related Work • Hand-coded repair • Lucent 5ESS switch • IBM MVS operating system • Self-stabilizing algorithms • Log-based recovery for database systems • Recovery-oriented computing • Recursive restartability • Undo framework

  39. Conclusion • Data structure repair interesting way to (potentially) improve reliability • Specification-based approach promises to make technique more widely applicable • Moving towards more robust, probabilistic, continuous concept of system behavior
