
Learning and Repair Techniques for Self-Healing Systems

Learning and Repair Techniques for Self-Healing Systems . Martin Rinard Michael Ernst Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology. Research Overview. Consistency Constraint Learning. Data Structure Repair. Mode Selection. Error


Presentation Transcript


  1. Learning and Repair Techniques for Self-Healing Systems Martin Rinard Michael Ernst Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

  2. Research Overview • Consistency Constraint Learning • Data Structure Repair • Mode Selection • Error Localization • Upgrade Evaluation • Result: Enhanced, Self-Healing Systems

  3. Synergistic Combinations • Constraint Learning + Error Localization • Constraint Learning + Data Structure Repair • Error Localization + Mode Selection • Error Localization + Upgrade Evaluation

  4. Error Localization Problem • Attribution problem: what piece of code caused the visible symptom? • Have to trace symptom back to cause • Error may be present but not visible in test suite [Timeline: Error Introduced → Execution with Broken Data Structure → Crash or Unexpected Result]

  5. Error Localization Problem • Goal is to discover errors when they corrupt data, not when the effect becomes externally visible • Perform frequent consistency checks • Error localized between last successful check and first unsuccessful check [Timeline: Error Introduced → Execution with Broken Data Structure → Crash or Unexpected Result]

  6. Our Approach [Diagram: Specification of Data Structure Consistency Properties → Archie Compiler → Efficient Consistency Checker; Program + Efficient Consistency Checker → Instrumented Program with Early Data Structure Corruption Detection]

  7. Default Instrumentation void copynode(struct node *n) { struct node *newnode = malloc(sizeof(struct node)); newnode->data = n->data; newnode->next = n->next; n->next = newnode; } [Annotations: "Insert check here" at the start of the function (check passed) and at the end of the function (check failed)]
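The instrumentation idea on this slide can be sketched as follows. This is a minimal illustration, not Archie's generated code: the consistency property (list values strictly increase), the `check` helper, and the `last_passed` / `first_failed` globals are all assumptions introduced for the example. Because `copynode` duplicates a value, the entry check passes and the exit check fails, which localizes the corruption to the function body.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
    int data;
    struct node *next;
};

/* Hypothetical consistency property: values strictly increase along the list. */
bool strictly_increasing(const struct node *head)
{
    for (const struct node *n = head; n != NULL && n->next != NULL; n = n->next)
        if (n->data >= n->next->data)
            return false;
    return true;
}

/* Ids of the last check that passed and the first check that failed. */
int last_passed = -1;
int first_failed = -1;

void check(int id, const struct node *head)
{
    if (first_failed != -1)
        return;                      /* error window already recorded */
    if (strictly_increasing(head))
        last_passed = id;
    else
        first_failed = id;
}

/* copynode with a check inserted at entry (id 1) and at exit (id 2). */
void copynode_checked(struct node *head, struct node *n)
{
    assert(n != NULL);
    check(1, head);
    struct node *newnode = malloc(sizeof *newnode);
    if (newnode == NULL)
        return;
    newnode->data = n->data;
    newnode->next = n->next;
    n->next = newnode;
    check(2, head);
}
```

After a call on a sorted list, the error is known to lie between check `last_passed` and check `first_failed`, here the body of `copynode_checked`.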

  8. Experimental Results • Specification compiler crucial • Speeds up checks by over factor of 5000 • Final checked version 6 times slower than standard version with no checks • Effective error localization for developers • With tool, localized and fixed errors in about 10 minutes per error • Without tool, usually could not localize and fix errors after an hour

  9. Application to SRS • Enhance/enable other techniques • Fast inconsistency detection for data structure repair • Improved mode selection guidance • Improved software upgrade selection metric

  10. Data Structure Repair Broken Data Structure Errors • Missing elements • Inappropriate sharing • Dangling references • Out of bounds array indices • Inconsistent values [Figure: example data structure whose nodes carry fields F, G, I, J]

  11. Goal [Figure: Broken Data Structure → Repair Algorithm → Consistent Data Structure; nodes carry fields F, G, I, J]

  12. Goal [Figure: same diagram as the previous slide, with Consistency Properties added]

  13. What Does Repair Algorithm Produce? • Data structure that • Satisfies consistency properties, and • Heuristically close to broken data structure • Not necessarily the same data structure as (hypothetical) correct program would produce • But enough to keep program operating successfully

  14. Where Is This Likely To Be Useful? • Not for transient errors - can just reboot (but need slack) • Cause of error must go away after reboot • Must be OK to lose volatile state • Must be OK to wait for reboot • Persistent data structures (file systems, application files) • Autonomous and/or safety critical systems • Monitor/control unstable physical phenomena • Largely independent subcomputations • Moving time window

  15. Architecture [Diagram: Broken Bits → Model Definition & Translation → Broken Abstract Model → Internal Consistency Properties → Repaired Abstract Model → External Consistency Properties → Repaired Bits]

  16. Architecture Rationale Why go through abstract model? • Simple, uniform structure • Sets of objects • Relations between objects • Simplifies both • Expression of consistency properties • Repair algorithm • Enables system to support full range of efficient, heavily encoded data structures • Enables system to generalize to other structures

  17. File System Example struct Entry { byte name[Length]; int firstBlock; } struct Block { int nextBlock; byte data[BlockSize]; } struct Disk { Entry dir[NumEntries]; Block block[NumBlocks]; } Disk D; [Figure: Directory Entries (abst → 0, intro → 2); Disk Blocks with nextBlock values 1, -5, 1, -1]

  18. Model Definition • Sets of objects: set blocks of integer : partition used | free; • Relations between objects – values of object fields, referencing relationships between objects: relation next : used, used; [Figure: blocks partitioned into used and free, with relation next on used]

  19. Model Translation Bits translated to sets and relations in abstract model using statements of the form: Quantifiers, Condition ⇒ Inclusion Constraint • for i in 0..NumEntries, 0 ≤ D.dir[i].firstBlock and D.dir[i].firstBlock < NumBlocks ⇒ D.dir[i].firstBlock in used • for b in used, 0 ≤ D.block[b].nextBlock and D.block[b].nextBlock < NumBlocks ⇒ <b, D.block[b].nextBlock> in next • for <b,n> in next, true ⇒ n in used • for b in 0..NumBlocks, not (b in used) ⇒ b in free

  20. Model in Example [Figure: Directory Entries (abst → 0, intro → 2); Disk Blocks with nextBlock values 1, -5, 1, -1. Model: used = {0, 1, 2}, free = {3}, next = {<0,1>, <2,1>}]
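The translation rules can be sketched in C by iterating them to a fixpoint. This is an illustrative sketch only: the disk is simplified to two plain arrays (`first_block` for directory entries, `next_block` for blocks), and `struct model`, `valid_block`, and `build_model` are hypothetical names, not part of the authors' system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_ENTRIES = 2, NUM_BLOCKS = 4 };

/* Abstract model: sets used/free and relation next over block indices. */
struct model {
    bool used[NUM_BLOCKS];
    bool free_[NUM_BLOCKS];
    bool next[NUM_BLOCKS][NUM_BLOCKS];   /* next[b][n] means <b,n> in next */
};

bool valid_block(int b)
{
    return 0 <= b && b < NUM_BLOCKS;
}

void build_model(const int first_block[NUM_ENTRIES],
                 const int next_block[NUM_BLOCKS],
                 struct model *m)
{
    assert(m != NULL);
    *m = (struct model){0};

    /* Rule 1: a valid firstBlock of a directory entry is in used. */
    for (int i = 0; i < NUM_ENTRIES; i++)
        if (valid_block(first_block[i]))
            m->used[first_block[i]] = true;

    /* Rules 2 and 3, repeated until nothing changes:
       a used block with a valid nextBlock adds a pair to next,
       and the target of every pair in next is in used. */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = 0; b < NUM_BLOCKS; b++) {
            int n = next_block[b];
            if (!m->used[b] || !valid_block(n))
                continue;
            if (!m->next[b][n]) { m->next[b][n] = true; changed = true; }
            if (!m->used[n])    { m->used[n] = true;    changed = true; }
        }
    }

    /* Rule 4: every block not in used is in free. */
    for (int b = 0; b < NUM_BLOCKS; b++)
        m->free_[b] = !m->used[b];
}
```

With directory entries pointing at blocks 0 and 2 and nextBlock values 1, -5, 1, -1, the sketch yields used = {0, 1, 2}, free = {3}, and next = {<0,1>, <2,1>}. The invalid value -5 simply contributes no pair to next.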

  21. Internal Consistency Properties Quantifiers, Condition ⇒ Body • Body is first-order property of basic propositions • Inequality constraints on values of numeric fields • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E • Presence of required number of objects • size(S) = C, size(S) ≥ C, size(S) ≤ C • Topology of region surrounding each object • size(V.R) = C, size(V.R) ≥ C, size(V.R) ≤ C • size(R.V) = C, size(R.V) ≥ C, size(R.V) ≤ C • Inclusion constraints: V in S, V1 in V2.R, <V1,V2> in R • Example: for b in used, true ⇒ size(next.b) ≤ 1

  22. Internal Consistency Violations Evaluate consistency properties, find violations • for b in used, size(next.b) ≤ 1 is false for b = 1 [Figure: model with used = {0, 1, 2}, free = {3}, next = {<0,1>, <2,1>}]

  23. Repairing Violations of Internal Consistency Properties • Violation provides binding for quantified variables • Convert Condition ⇒ Body to disjunctive normal form (p1 ∧ … ∧ pn) ∨ … ∨ (q1 ∧ … ∧ qm), where p1 … pn, q1 … qm are basic propositions • Choose a conjunction to satisfy • Repair violated basic propositions in conjunction

  24. Repairing Violations of Basic Properties • Inequality constraints on values of numeric fields • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E • Compute value of expression, assign field • Presence of required number of objects • size(S) = C, size(S) ≥ C, size(S) ≤ C • Remove or insert objects from/to set • Topology of region surrounding each object • size(V.R) = C, size(V.R) ≥ C, size(V.R) ≤ C • size(R.V) = C, size(R.V) ≥ C, size(R.V) ≤ C • Remove or insert pairs from/to relation • Inclusion constraints: V in S, V1 in V2.R, <V1,V2> in R • Add object or pair to set or relation

  25. Repair in Example • for b in used, size(next.b) ≤ 1 is false for b = 1 • Must repair size(next.1) ≤ 1 • Can remove either <0,1> or <2,1> from next [Figure: model before repair, with used = {0, 1, 2}, free = {3}, next = {<0,1>, <2,1>}]

  26. Repair in Example • for b in used, size(next.b) ≤ 1 is false for b = 1 • Must repair size(next.1) ≤ 1 • Can remove either <0,1> or <2,1> from next [Figure: model after repair, with one of the two pairs removed from next]
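The check-and-repair step for the property "each used block has at most one predecessor in next" can be sketched as below. The names `struct model`, `in_degree`, and `repair_in_degree` are hypothetical, and the tie-break (keep the lowest-numbered predecessor, drop the others) is an assumption made for the example; the slides only say that either pair may be removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_BLOCKS = 4 };

struct model {
    bool used[NUM_BLOCKS];
    bool next[NUM_BLOCKS][NUM_BLOCKS];   /* next[p][b] means <p,b> in next */
};

/* size(next.b): the number of pairs <p,b> in next. */
int in_degree(const struct model *m, int b)
{
    assert(m != NULL && 0 <= b && b < NUM_BLOCKS);
    int count = 0;
    for (int p = 0; p < NUM_BLOCKS; p++)
        if (m->next[p][b])
            count++;
    return count;
}

/* For every used block b that violates size(next.b) <= 1, remove pairs
   <p,b> from next until the property holds. Predecessors are scanned
   from the highest index down, so the lowest one is kept (an assumed
   policy). Returns the number of pairs removed. */
int repair_in_degree(struct model *m)
{
    assert(m != NULL);
    int removed = 0;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (!m->used[b])
            continue;
        for (int p = NUM_BLOCKS - 1; p >= 0 && in_degree(m, b) > 1; p--) {
            if (m->next[p][b]) {
                m->next[p][b] = false;
                removed++;
            }
        }
    }
    return removed;
}
```

On the example model, the violation is found at b = 1 and the sketch removes <2,1>, leaving <0,1> in next.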

  27. External Consistency Constraints Quantifiers, Condition ⇒ Body • Body of form V = E, V.F = E, V.F[I] = E • Examples: • for b in free, true ⇒ D.block[b].nextBlock = -2 • for <i,j> in next, true ⇒ D.block[i].nextBlock = j • for b in used, size(b.next) = 0 ⇒ D.block[b].nextBlock = -1 • Repair simply performs assignments • Translates model repairs to bit repairs

  28. Repair in Example [Figure: Inconsistent File System: Directory Entries (abst → 0, intro → 2), Disk Blocks with nextBlock values 1, -5, 1, -1. Repaired File System: Directory Entries (abst → 0, intro → 2), Disk Blocks with nextBlock values 1, -1, -1, -2]
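The three external consistency constraints from the previous slide amount to assignments that write the repaired model back into the concrete representation. A minimal sketch, assuming the disk's nextBlock fields are held in a plain array and that the repaired model gives each block at most one successor (`struct model` and `apply_external` are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_BLOCKS = 4 };

struct model {
    bool used[NUM_BLOCKS];               /* blocks not in used are free */
    bool next[NUM_BLOCKS][NUM_BLOCKS];   /* next[b][n] means <b,n> in next */
};

/* Write the model back to the nextBlock fields:
     b in free              => nextBlock = -2
     <b,n> in next          => nextBlock = n
     b in used, no successor => nextBlock = -1 */
void apply_external(const struct model *m, int next_block[NUM_BLOCKS])
{
    assert(m != NULL && next_block != NULL);
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (!m->used[b]) {
            next_block[b] = -2;
            continue;
        }
        next_block[b] = -1;
        for (int n = 0; n < NUM_BLOCKS; n++)
            if (m->next[b][n])
                next_block[b] = n;
    }
}
```

Applied to a repaired model with used = {0, 1, 2} and next = {<0,1>}, this overwrites the nextBlock values 1, -5, 1, -1 with 1, -1, -1, -2, matching the repaired file system in the figure.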

  29. When to Test for Consistency and Repair • Persistent data structures • Repair can be independent activity, or • Repair when data written out or read in • Volatile data structures in running program • Transaction-based approach • Identify transaction start and end • Check and repair at start, end, or both • Failure-based approach • Wait until program fails • Repair, restart from closest safe point

  30. Experience • We acquired four benchmarks (written in C/C++) • CTAS (air-traffic control tool) • Simple Linux file system • Freeciv interactive game • Microsoft Word files • We developed specifications for all four • Each benchmark has • Workload • Fault insertion methodology • Ran benchmarks with and without repair

  31. CTAS • Set of air-traffic control tools • Traffic management • Arrival planning • Flow visualization • Shortcut planning • Deployed in centers around country (Dallas/Ft. Worth, Los Angeles, Denver, Miami, Minneapolis/St. Paul, Atlanta, Oakland) • Approximately 1 million lines of C/C++ code

  32. Results • Workload – recorded radar feed from DFW • Fault insertion • Simulate error in flight plan processing • Bad airport index in flight plan data structure • Without repair • System crashes – segmentation fault • With repair • Aircraft has different origin or destination • System continues to execute • Anomaly eventually flushed from system

  33. Aspects of CTAS • Lots of independent subcomputations • System processes hundreds of aircraft – problem with one should not affect others • Multipurpose system (visualization, arrival planning, shortcuts, …) – problem in one purpose should not affect others • Sliding time window: anomalies eventually flushed • Rebooting ineffective – system will crash again as soon as it sees the problematic flight plan

  34. Freeciv Consistency Properties • Tiles have valid terrain values • Cities are not in the ocean • Each city has exactly one reference from city location grid • City locations are consistent in • City structures and • City location grid [Figure: Terrain Grid (O = Ocean, P = Plain, M = Mountain) with rows O P M M / O O P M / O P M M / P P P M; City Structures with loc: 3,0 and loc: 2,3; City Location Grid]
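Two of these properties (valid terrain values, no city in the ocean) can be checked and repaired with a single pass over the grid. The sketch below is not Freeciv code: the types, the `[y][x]` indexing, and the repair policy (overwrite the offending tile with `PLAIN`) are assumptions chosen only to show the shape of such a repair.

```c
#include <assert.h>
#include <stddef.h>

enum terrain { OCEAN, PLAIN, MOUNTAIN, NUM_TERRAINS };
enum { GRID_W = 4, GRID_H = 4 };

struct city {
    int x, y;   /* tile coordinates, assumed to be in bounds */
};

/* Repairs the grid in place and returns the number of tiles changed.
   Tiles are stored as int because a corrupted tile may hold a value
   outside the enum's range. */
int repair_terrain(int grid[GRID_H][GRID_W],
                   const struct city *cities, int ncities)
{
    int repairs = 0;

    /* Property: tiles have valid terrain values. */
    for (int y = 0; y < GRID_H; y++) {
        for (int x = 0; x < GRID_W; x++) {
            if (grid[y][x] < 0 || grid[y][x] >= NUM_TERRAINS) {
                grid[y][x] = PLAIN;      /* assumed default terrain */
                repairs++;
            }
        }
    }

    /* Property: cities are not in the ocean. */
    for (int i = 0; i < ncities; i++) {
        int x = cities[i].x, y = cities[i].y;
        assert(0 <= x && x < GRID_W && 0 <= y && y < GRID_H);
        if (grid[y][x] == OCEAN) {
            grid[y][x] = PLAIN;
            repairs++;
        }
    }
    return repairs;
}
```

A repair of this kind keeps the game running, but the repaired tiles may differ from the ones a correct execution would have produced.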

  35. Results • Workload – Freeciv software plays against itself • Fault insertion – randomly corrupt terrain values • Without repair • Segmentation fault • With repair • Game runs just fine • But game plays out differently because of the different terrain values

  36. Microsoft Word Files • Files consist of a sequence of streams • Streams stored using FAT-based data structure struct Entry { byte name[Length]; byte inUse; unsigned int size; unsigned int block; } struct Block { byte data[BlockSize]; } struct Disk { Entry table[NumEntries]; int FAT[NumBlocks]; Block block[NumBlocks]; } Disk D; [Figure: Directory Entries (abst 1 7 0; intro 1 9 2); FAT values 1, -1, -1, -2; Disk Blocks]

  37. Consistency Properties • The FAT and directory blocks exist • FAT contains valid values only • -1 – terminates FAT streams • -2 – indicates free blocks • Valid disk block index – next block in stream • FAT streams properly terminated • Free blocks properly marked • Streams contain valid blocks only • Streams do not share blocks • Stream length in directory consistent with number of blocks in stream • Directory entries point to valid FAT streams
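Several of these properties can be checked by walking each stream through the FAT. The following is a rough sketch rather than the authors' checker: the FAT is a plain `int` array, stream start blocks are passed in directly instead of being read from directory entries, and the stream-length property is left out for brevity. `fat_consistent` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_BLOCKS = 4, FAT_END = -1, FAT_FREE = -2 };

bool fat_consistent(const int fat[NUM_BLOCKS], const int start[], int nstreams)
{
    assert(fat != NULL && nstreams >= 0);
    bool owned[NUM_BLOCKS] = {false};

    /* FAT contains valid values only: -1, -2, or a block index. */
    for (int b = 0; b < NUM_BLOCKS; b++) {
        int v = fat[b];
        if (v != FAT_END && v != FAT_FREE && !(0 <= v && v < NUM_BLOCKS))
            return false;
    }

    /* Streams contain valid blocks only, do not share blocks, and are
       properly terminated. A cycle revisits an owned block, so the walk
       always stops. */
    for (int s = 0; s < nstreams; s++) {
        for (int b = start[s]; b != FAT_END; b = fat[b]) {
            if (b < 0 || b >= NUM_BLOCKS)
                return false;        /* bad start, or ran into a free marker */
            if (owned[b])
                return false;        /* shared block or cycle */
            owned[b] = true;
        }
    }

    /* Free blocks properly marked: a block in no stream must hold -2. */
    for (int b = 0; b < NUM_BLOCKS; b++)
        if (!owned[b] && fat[b] != FAT_FREE)
            return false;

    return true;
}
```

For instance, a FAT holding 1, -1, -1, -2 with streams starting at blocks 0 and 2 passes, while 1, -1, 1, -2 fails because both streams would share block 1.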

  38. Results • Workload – several Microsoft Word files • Fault insertion – scramble FAT • Without repair • If FAT blocks incorrectly marked as used, Word successfully loads file • Otherwise, “The document name or path is not valid” • With repair • Word loads all files

  39. Manual Specification Issues • Right now developer manually specifies consistency constraints • Specification overhead • Coverage issues • We propose to address these issues by automatically learning consistency constraints

  40. Learning Consistency Constraints Look for patterns in values the program computes: • Instrument the program to write data trace files • Run the program on a test suite • Invariant detection engine • reads data traces • generates potential consistency constraints • checks constraints [Diagram: Original program → Instrument → Instrumented program → Run (with Test suite) → Data trace database → Detect invariants → Invariants]

  41. Sample Consistency Constraints x, y, z are variables; a, b, c are constants Invariants over numbers: • unary: x = a, a ≤ x ≤ b, x ≡ a (mod b) • n-ary: x ≤ y, x = ay + bz + c, x = max(y, z) Invariants over sequences: • unary: sorted, invariants over all elements • with sequence: subsequence, ordering • with scalar: membership

  42. Richer Consistency Constraints Object/class invariants • node.left.value < node.right.value • string.data[string.length] = '\0' Pointers (recursive data structures) • tree is sorted • for each node n, n = n.child.parent • graph g is acyclic Conditionals • if proc.priority < 0 then proc.status = active • ptr = null or *ptr > i

  43. Experimental Results • Can recover formal specifications • Loop invariants in array programs • Procedure preconditions and postconditions • Software engineering applications • Can identify inadequate test suites • Can correct developer misapprehensions • Can reveal bugs • Optimizations for constraint learning

  44. More Uses for Learned Consistency Constraints • Write better programs [Gries 81, Liskov 86] • Document code • Check assumptions: convert to assert • Maintain invariants to avoid introducing bugs • Locate unusual conditions • Validate test suite: value coverage • Provide hints for higher-level profile-directed compilation [Calder 98] • Bootstrap proofs [Wegbreit 74, Bensalem 96] • Enable/Support Error Localization • Enable/Support Data Structure Repair

  45. Issues in Applying Constraint Learning to Error Localization and Repair • Overly aggressive learned constraints • Can confuse error localization • Can cause “repair” of correct but previously unobserved consistency properties • Test suite/experimentation issue • Constraint language interoperability issues • Role of model in learned constraints • Potential expressivity issues

  46. Ideal Result • Automatically learn key constraints • Substantially improved constraint coverage • More precise constraints • More detailed properties • Different constraints at different program points • Complete data structure coverage • Substantially increase effectiveness and utility of error localization and repair

  47. Comparison with AI Learning Consistency Constraints: Can be formulated as an AI problem Cannot be solved by previous AI techniques • not classification or clustering • no noise • many positive examples • no negative examples • intelligible output

  48. Incremental Inference Online algorithm improves: • response time • space • front end computation • back end computation Process each variable value once, then discard Stop checking invariants after falsification
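The online idea (see each value once, keep only summary state, stop testing a candidate once it is falsified) can be illustrated for three of the sample invariants: x = a, a ≤ x ≤ b, and x ≤ y. This is a simplified sketch, not the actual invariant detection engine; `struct inv_state`, `inv_init`, and `inv_observe` are names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Summary state for one variable pair (x, y); no trace is stored. */
struct inv_state {
    bool seen;          /* at least one sample observed */
    bool is_constant;   /* candidate x = a still holds */
    int constant;       /* the value a */
    int min, max;       /* tightest observed bounds for a <= x <= b */
    bool x_le_y;        /* candidate x <= y still holds */
};

void inv_init(struct inv_state *s)
{
    assert(s != NULL);
    *s = (struct inv_state){ .is_constant = true, .x_le_y = true };
}

/* Process one sample, then discard it. */
void inv_observe(struct inv_state *s, int x, int y)
{
    assert(s != NULL);
    if (!s->seen) {
        s->seen = true;
        s->constant = s->min = s->max = x;
    } else {
        /* A falsified candidate is never compared again. */
        if (s->is_constant && x != s->constant)
            s->is_constant = false;
        if (x < s->min) s->min = x;
        if (x > s->max) s->max = x;
    }
    if (s->x_le_y && x > y)
        s->x_le_y = false;
}
```

The state has a fixed size regardless of how many samples the instrumented program emits, which is where the space saving of an online algorithm comes from.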

  49. Mode Selection in Multi-Mode Systems • A multi-mode system’s behavior depends on its environment and internal state • Examples of multi-mode systems: • Web server: polling / interrupt • Cell phone: AMPS / TDMA / CDMA • Router congestion control: normal / intentional drops • Graphics program: high detail / low detail

  50. Controllers • Controller chooses which mode to use • Examples of factors that determine modes: • Web server: heavy traffic vs. light traffic • Cell phone: rural area vs. urban area; interference • Router congestion control: preconfigured policy files • Graphics program: frame rate constraints
