
Fundamental Concepts Neeraj Suri



Presentation Transcript


  1. Software Fault Tolerance (SWFT) Fundamental Concepts Neeraj Suri

  2. Overview • Motivation for Fault Tolerance • Terminology • Faults, Errors and Failures • Dependability • Recovery • Backward and forward • Redundancy • Error Confinement

  3. Examples of Software Failures • Denver airport: Failure in luggage management system; opening delayed for several months • Failure of a space probe sent to Mars due to inconsistent measuring units (imperial vs. metric) • Problems when space shuttle Endeavour met with Intelsat 6 due to rounding of near-zero values • Flaw in Apollo 11 software made the moon's gravity repulsive rather than attractive • Ariane 5 guidance failure • AT&T system suffered a 9 hour US-wide blockade • Switch experienced abnormal behavior due to flaws in recovery software and propagated to all switches • Software problem caused radiation safety door of a nuclear power processing plant in the UK to open accidentally • Several patients killed through radiation overdoses due to software flaws in Therac-25 (cancer treatment system) ... (many, many, many more examples – WinXP, Vista !!!)

  4. Software Failures [Figure: chart; only the values 62% and 33% survive]

  5. Has this trend continued? [Figure: chart values 45%, 35%, 17.5%, 2.5%; (tolerance) 50%, 94%, 76%, 79%] [2] D. Oppenheimer et al., “Why do Internet services fail, and what can be done about it?”, USENIX 2003.

  6. Microsoft EULA EXCLUSION OF INCIDENTAL, CONSEQUENTIAL AND CERTAIN OTHER DAMAGES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL MICROSOFT OR ITS SUPPLIERS BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT, OR CONSEQUENTIAL DAMAGES WHATSOEVER (INCLUDING, BUT NOT LIMITED TO, DAMAGES FOR LOSS OF PROFITS OR CONFIDENTIAL OR OTHER INFORMATION, FOR BUSINESS INTERRUPTION, FOR PERSONAL INJURY, FOR LOSS OF PRIVACY, FOR FAILURE TO MEET ANY DUTY INCLUDING OF GOOD FAITH OR OF REASONABLE CARE, FOR NEGLIGENCE, AND FOR ANY OTHER PECUNIARY OR OTHER LOSS WHATSOEVER) ARISING OUT OF OR IN ANY WAY RELATED TO THE USE OF OR INABILITY TO USE THE SOFTWARE PRODUCT, THE PROVISION OF OR FAILURE TO PROVIDE SUPPORT SERVICES, OR OTHERWISE UNDER OR IN CONNECTION WITH ANY PROVISION OF THIS EULA, EVEN IN THE EVENT OF THE FAULT, TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY, BREACH OF CONTRACT OR BREACH OF WARRANTY OF MICROSOFT OR ANY SUPPLIER, AND EVEN IF MICROSOFT OR ANY SUPPLIER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

  7. “Mistakes in software development will continue to be made, no matter how carefully the software is built, and failures will continue to occur: that is the way things are in engineering.” John Knight, 2004

  8. Reasons for Bad Software • Economic / Engineering Reasons • need to sell software to pay for development • diminishing return for fixing rare bugs • Legal Reasons • no legal requirements for good software • Legacy Reasons • need to be (bug) compatible • Technical Reasons • software is complex!

  9. Economic Reasons • Budget • time, people, money: “too many bugs, too little time” • Time To Market • expenses vs income • Reliability vs Features • know thy customer! • Cost vs Benefit Tradeoff

  10. Engineering Issues • Undetected bugs • trigger conditions too complex to find the bug • No / benign effects • effect of bug is benign, or • activation of bug is unlikely • Not easy to fix • design is just flawed • Might introduce more bugs • developers do not know code sufficiently well

  11. Technical Reasons • Incomplete understanding of systems • Excessive complexity makes it impossible to ensure that software is correct • Complexity • For correctness, complexity should be kept minimal

  12. Software Bugs • Professional programmer: • about 100-150 “bugs” per 1000 lines of code • exact value depends on many factors… • Example: • Windows XP is about 45 million LOC • would imply about 4.5 to 6.75 million bugs

  13. Software Testing • Rule of thumb: • In each round of testing one finds at most half of the bugs • Multiple rounds of testing: • find new bugs with new tests • Example: (million bugs) • 4.5 → 2.25 → 1.12 → 0.56 → 0.28 → 0.14 … “Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence!” Dijkstra
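The estimates on slides 12 and 13 are plain arithmetic and can be reproduced with a short sketch. The function names below are made up for illustration, and the sketch assumes the best case of the rule of thumb, in which every round of testing finds exactly half of the remaining bugs.

```c
#include <stdio.h>

/* Estimated number of bugs for a code base, given a bug density in
   bugs per 1000 lines of code. Hypothetical helper, not from the slides. */
double estimated_bugs(double lines_of_code, double bugs_per_kloc) {
    return lines_of_code / 1000.0 * bugs_per_kloc;
}

/* Bugs left after `rounds` rounds of testing, assuming the best case of
   the rule of thumb: each round finds exactly half of the remaining bugs. */
double remaining_bugs(double initial_bugs, int rounds) {
    double bugs = initial_bugs;
    for (int i = 0; i < rounds; i++) {
        bugs /= 2.0;
    }
    return bugs;
}

/* Print the sequence from slide 13, in millions of bugs. */
void print_testing_rounds(double initial_bugs, int rounds) {
    for (int i = 0; i <= rounds; i++) {
        printf("after %d round(s): %.2f million bugs\n",
               i, remaining_bugs(initial_bugs, i) / 1e6);
    }
}
```

Calling `print_testing_rounds(estimated_bugs(45e6, 100), 5)` walks through the same halving sequence as the slide. Even under this optimistic assumption the count never reaches zero, which is the point of the Dijkstra quote.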

  14. Software Size A. Chou et al. “An Empirical Study of Operating System Errors”, SOSP 2001

  15. Challenge: Software Continues to Grow... Figure: Windows Generations

  16. Projected Software Bugs A. Chou et al. “An Empirical Study of Operating System Errors”, SOSP 2001

  17. Software Bugs are a Fact “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” Shimon Peres

  18. Thesis • Using software fault tolerance mechanisms • one could provide better software, • at a lower cost, and • shorter time-to-market • Example: • Apache is a very successful project • Uses many SWFT mechanisms to increase its robustness

  19. Defense in Depth • Strategy: • Avoid faults (e.g., reduce stress of programmers, better languages) • Remove faults (e.g., use appropriate tools) • Cope with faults (i.e., tolerate faults) • Predict faults

  20. Motivation • Progress in software engineering • Design • Testing • Formal methods • ... • Experience still shows that assuming software to be bug-free is … (& can be hazardous to your health)

  21. Dependability Threats • Failure (“Ausfall”) • Observable deviation from the specification • Error (“Fehler”) • Part of the system state that may lead to a failure • Fault (“Störung”) • “Defect” or “Flaw” of a system • An active fault produces an error. Otherwise, it is dormant.

  22. Chain of Dependability Threats: Fault → (activation) → Error → (propagation) → Failure • An activated fault produces an error • An error propagates to become a failure

  23. Example
      0 int size = 0; long* p = 0;
      1 void add_element(long value) {
      2   size += 1;
      3   p = (long*) realloc(p, sizeof(long)*size);
      4   p[size-1] = value;
      5 }
      • Fault: the result of realloc is never checked • Activation: add_element(...) is called and no memory is available • Error: p == 0 and size > 0 (before line 4) • Propagation: execution of line 4 • Failure: segmentation violation
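One way to remove the fault in the example is to check the result of `realloc` before any state is changed. The sketch below is only an illustration: the return convention (0 on success, -1 on failure) and the use of `size_t` are assumptions, not part of the slide.

```c
#include <stdlib.h>

static long *p = NULL;   /* dynamically grown array */
static size_t size = 0;  /* number of stored elements */

/* Appends value to the array.
   Returns 0 on success and -1 if no memory is available (assumed convention).
   On failure p and size are left unchanged, so the erroneous state
   (p == NULL with size > 0) is never created. */
int add_element(long value) {
    long *q = realloc(p, sizeof(long) * (size + 1));
    if (q == NULL) {
        return -1;  /* the old block is still valid and still owned by p */
    }
    p = q;
    p[size] = value;
    size += 1;
    return 0;
}
```

The fault can still be activated (memory can still run out), but the activation no longer produces an error in the program state, so nothing can propagate to a failure. The caller is told about the problem and can decide how to cope.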

  24. Error Propagation • Failure of one component can activate an error in another component • One component's fault is another component's failure [Figure: two components; the failure of one appears as a fault to the other]

  25. Goal of Fault Tolerance/Robustness • Avoid or mitigate failures in the presence of faults

  26. Fault Tolerance [Figure labels: system interface (error free), component, fault, error, failure]

  27. Basic Fault Tolerant Approach • Define set of likely faults and likely bounds on fault frequency (failure model) • Components must be able to tolerate specified faults • Faults outside failure model might or might not be handled

  28. Origin of Faults • Design • Reuse • Component Defects • Implementation • Specification • Environment • Maintenance • Requirements Engineering • Operation • Documentation • Process Management • Human Interaction

  29. Fault Classification • Persistence • Transient fault • Intermittent fault (periodic fault) • Permanent fault • Creation time • Design fault • Operational fault • Intention • Accidental fault • Intentional fault

  30. Classical Failure Model • Crash failure • Fail-silent and Fail-stop • Omission failure • Timing failure • System fails to respond within a specified time slice • Both late and early responses might be “bad” • Late timing failure = performance failure • Arbitrary failure • System behaves arbitrarily
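The difference between omission and timing failures can be made concrete with a small classifier. This is a sketch under assumptions: the type and function names are hypothetical, times are in milliseconds, and the caller is assumed to decide when it stops waiting and treats a missing response as an omission.

```c
/* Outcome of a single request, following the failure classes on the slide. */
typedef enum {
    RESPONSE_OK,
    OMISSION_FAILURE,       /* no response at all */
    EARLY_TIMING_FAILURE,   /* response before the specified time slice */
    LATE_TIMING_FAILURE     /* response after it (performance failure) */
} outcome_t;

/* Classify a response by its arrival time.
   responded == 0 means the caller gave up waiting (omission).
   [earliest_ms, latest_ms] is the specified time slice. */
outcome_t classify_response(int responded, long arrival_ms,
                            long earliest_ms, long latest_ms) {
    if (!responded) {
        return OMISSION_FAILURE;
    }
    if (arrival_ms < earliest_ms) {
        return EARLY_TIMING_FAILURE;
    }
    if (arrival_ms > latest_ms) {
        return LATE_TIMING_FAILURE;
    }
    return RESPONSE_OK;
}
```

Arbitrary failures are deliberately absent from the enum: a response that arrives on time but carries a wrong value cannot be recognized from timing alone.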

  31. Failure Hierarchy • Arbitrary ⊃ Timing ⊃ Omission ⊃ Crash ⊃ Fail-stop • The algorithms used for achieving any kind of fault tolerance depend on the system and failure model

  32. Achieving Dependability • Fault Avoidance • Fault Forecasting • Fault Removal • Fault Tolerance • Fault tolerance should not be applied in isolation

  33. Fault Avoidance • Reduce the number of faults during software construction • Rigorous Software Development Process • Requirements Specification & Analysis • Structured Design • Well-defined Mapping to Programming Languages • Clear Documentation • Formal Methods • Software Reuse

  34. Rigorous Software Development (1) • Requirements elicitation • Discover what features each stakeholder expects the system to provide • Imperfect process • Technical and non-technical people have to collaborate • Use-cases • Computer scientists can’t be experts in all application areas • Current software development processes are rigid • No support for “iterative” requirements engineering

  35. Rigorous Software Development (2) • Requirements Analysis / Specification • Specify in a clear and precise way what functionality your system must provide • Complete, but not too complex • Consistent • Determine (or even better: generate) test cases • Should not be rigid

  36. Drink Distributor Example (1) [Figure: drink distributor with inputs Coins, Cancel, Coffee, Tea, Chocolate and outputs Drink, Change] • Provides hot drinks: coffee, tea and chocolate • User interface • Operating cycle • Insert money • Choose drink • Take change • Take drink • Or press cancel and coins are given back

  37. Drink Distributor Example (2) • Incomplete specification • No deadline for cancellation specified • What if the user inserts new coins before the end of a cycle? • What if the user changes his selection? • What should be done when resources (change, cups, spoons, sugar, coffee, tea, chocolate, water) run out? • Provide partial service? (e.g. only tea and coffee / require exact change) • If manufacturer and user interpret the specification differently, failures will occur at operation time

  38. Drink Distributor Example (3) • Augment specification • Cancellation not possible once drink has been chosen • Add green / red light to indicate cycle start • Only the first selected beverage is taken into account • Add lights to show availability of drinks • Each constraint omitted from the specification can lead to a failure in the service delivered to the user • Dissatisfaction • Loss of money
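The augmented specification can be read as a small state machine in which every question from slide 37 has an explicit answer. The sketch below is one possible reading, not the author's design: all names are hypothetical, a single uniform price is assumed, and coins inserted after a drink has been chosen are assumed to be rejected.

```c
#include <stdbool.h>

typedef enum { DRINK_COFFEE, DRINK_TEA, DRINK_CHOCOLATE, DRINK_COUNT } drink_t;
typedef enum { IDLE, COLLECTING, DISPENSING } cycle_state_t;

typedef struct {
    cycle_state_t state;
    int credit;                   /* inserted money, in cents */
    int price;                    /* price per drink (assumed uniform) */
    bool available[DRINK_COUNT];  /* drives the availability lights */
    drink_t selected;
} machine_t;

void machine_init(machine_t *m, int price) {
    m->state = IDLE;
    m->credit = 0;
    m->price = price;
    m->selected = DRINK_COFFEE;
    for (int i = 0; i < DRINK_COUNT; i++) {
        m->available[i] = true;
    }
}

/* Coins are accepted only before a drink is chosen (assumption). */
bool insert_coin(machine_t *m, int cents) {
    if (m->state == DISPENSING || cents <= 0) {
        return false;  /* coin is returned to the user */
    }
    m->credit += cents;
    m->state = COLLECTING;
    return true;
}

/* Cancellation is not possible once a drink has been chosen.
   Returns the refunded amount, or 0 if nothing can be cancelled. */
int cancel(machine_t *m) {
    if (m->state != COLLECTING) {
        return 0;
    }
    int refund = m->credit;
    m->credit = 0;
    m->state = IDLE;
    return refund;
}

/* Only the first accepted selection is taken into account. */
bool choose_drink(machine_t *m, drink_t d) {
    if (m->state != COLLECTING) {
        return false;
    }
    if ((int)d < 0 || (int)d >= DRINK_COUNT || !m->available[d]) {
        return false;
    }
    if (m->credit < m->price) {
        return false;
    }
    m->selected = d;
    m->state = DISPENSING;
    return true;
}

/* Ends the cycle after the drink is delivered; returns the change. */
int finish_cycle(machine_t *m) {
    if (m->state != DISPENSING) {
        return 0;
    }
    int change = m->credit - m->price;
    m->credit = 0;
    m->state = IDLE;
    return change;
}
```

Each function handles every state explicitly, so no input is left with unspecified behavior. That is the property the augmented specification is after.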

  39. Rigorous Software Development (3) • Structured design • For instance in Object-Orientation: Apply O-O principles, e.g. abstraction, information hiding, modularity, classification, to reduce complexity of the solution • Provide easy-to-read documentation • UML • Programming Methodology • Good programming discipline • Pair-programming • Well-defined mapping of design models to programming constructs • Standards or coding conventions

  40. Formal Methods (1) • Specifications are developed using mathematically tractable languages and tools • Petri Nets, Algebraic Specifications • Allows desired properties to be proven • Generation of test cases • Generation of code!

  41. Formal Methods (2) • Mathematical specifications of software tend to be about the same size as the program itself => just as error-prone • Tools (model-checkers) still face algorithmic challenges when attempting to prove properties of huge models • However, have been successfully applied for safety-critical components

  42. Software Reuse • Well exercised software is less likely to fail • Save development cost • Undiscovered faults may appear when the component is used in a new environment

  43. Fault Removal • Detect and remove existing faults using proofs • Assertion checking in simulator • Testing • Exhaustive testing not feasible • Can’t show the absence of faults • Quality measures • Formal Inspection

  44. Fault Forecasting • Also known as software reliability measurement [Lyu96] • Estimation • Gather failure data during operation or testing • Apply statistical inference techniques • Prediction • Gather software metrics during development • Fault forecasting can indicate the need for additional testing or for applying fault tolerance

  45. Seriousness Classes (1) • DO-178B (standard for SW certification), civil aeronautics • Without effects • Minor / benign • Upset passengers, small increase in workload for the crew • Major / significant • Injuries of the passengers / crew and reducing the efficiency of the crew • Dangerous / serious • Small number of casualties / serious injuries, or preventing the crew from achieving its task in a precise and complete manner • Catastrophic / disastrous • Leading to human lives lost

  46. Seriousness Classes (2) • DO-178B, civil aeronautics • Without effects • Minor / benign • Probable: p > 10^-5 • Major / significant • Rare: 10^-7 < p < 10^-5 • Dangerous / serious • Extremely rare: 10^-9 < p < 10^-7 • Catastrophic / disastrous • Extremely improbable: p < 10^-9
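The probability bands on this slide amount to a lookup from an estimated failure probability to the most severe class that probability is compatible with. The sketch below is illustrative: the names are made up, and because the slide uses strict inequalities, the handling of the exact boundary values (assigned to the less severe class here) is an assumption.

```c
typedef enum {
    CLASS_MINOR,         /* probable:             p > 1e-5        */
    CLASS_MAJOR,         /* rare:                 1e-7 < p < 1e-5 */
    CLASS_DANGEROUS,     /* extremely rare:       1e-9 < p < 1e-7 */
    CLASS_CATASTROPHIC   /* extremely improbable: p < 1e-9        */
} seriousness_t;

/* Most severe class of failure that is tolerable at probability p,
   following the bands on the slide. Boundary values fall into the
   less severe class (assumption, the slide leaves them open). */
seriousness_t worst_tolerable_class(double p) {
    if (p < 1e-9) {
        return CLASS_CATASTROPHIC;
    }
    if (p < 1e-7) {
        return CLASS_DANGEROUS;
    }
    if (p < 1e-5) {
        return CLASS_MAJOR;
    }
    return CLASS_MINOR;
}
```

Read the other way round, a function whose failure would be catastrophic has to be shown to fail with a probability below 10^-9, which is why fault forecasting and fault tolerance are needed on top of testing.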

  47. Software Fault Tolerance • Tolerate faults that remain in the system after development, preventing system failure • Remove errors and their effects from the computational state before a failure occurs

  48. Classification • Single Version Software • Monitoring techniques, atomicity of actions, decision verification, exception handling • Multi-version Software • Functionally independent, yet equivalent software • Recovery blocks, N-version programming, … • Multiple Data Representation • Retry blocks, N-copy programming, …
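Of the multi-version techniques listed above, the recovery block is the easiest to sketch: functionally equivalent variants are tried in order, and the first result that passes an acceptance test is used. The code below is a minimal illustration under assumptions. All names are hypothetical, the integer square root is only a stand-in task, and the primary variant is deliberately faulty so that the alternate gets exercised.

```c
#include <stddef.h>

typedef int (*variant_fn)(int input, int *output);    /* returns 0 on success */
typedef int (*acceptance_fn)(int input, int output);  /* nonzero if acceptable */

/* Try each variant in order. Returns the index of the variant whose
   result passed the acceptance test, or -1 if all variants failed.
   Each variant writes to a local candidate, so discarding a rejected
   result plays the role of the rollback. */
int recovery_block(const variant_fn *variants, size_t n,
                   acceptance_fn accept, int input, int *result) {
    for (size_t i = 0; i < n; i++) {
        int candidate;
        if (variants[i](input, &candidate) == 0 && accept(input, candidate)) {
            *result = candidate;
            return (int)i;
        }
    }
    return -1;
}

/* Primary variant: fast but deliberately wrong for most inputs. */
int isqrt_primary(int x, int *out) {
    *out = x / 2;
    return 0;
}

/* Alternate variant: slow linear search, written independently. */
int isqrt_alternate(int x, int *out) {
    if (x < 0) {
        return -1;
    }
    int r = 0;
    while ((long long)(r + 1) * (r + 1) <= x) {
        r++;
    }
    *out = r;
    return 0;
}

/* Acceptance test: r is the integer square root of x
   iff r*r <= x < (r+1)*(r+1). */
int isqrt_accept(int x, int r) {
    return x >= 0 && r >= 0 &&
           (long long)r * r <= x &&
           (long long)(r + 1) * (r + 1) > x;
}
```

With `variant_fn v[] = { isqrt_primary, isqrt_alternate };`, a call such as `recovery_block(v, 2, isqrt_accept, 9, &r)` rejects the primary's result and falls back to the alternate. The technique is only as good as its acceptance test, which here happens to be exact; in practice acceptance tests are usually weaker plausibility checks.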

  49. Recovery • Error detection • Identify erroneous state • Error diagnosis • Assess the damage • Error containment • Prevent further damage / error propagation • Error recovery • Substitute the erroneous state with an error-free one

  50. Backward Error Recovery (1) • System state is saved at predetermined recovery points • Called checkpointing • Incremental checkpointing, log • State should be checkpointed on stable storage, not affected by failures • Recover error-free state by rolling back to a previously saved (error-free) state
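Backward error recovery can be sketched in a few lines: save the state at a recovery point, run the operation, and roll back if error detection reports an erroneous state. The example is a simplification under stated assumptions. The names and the account-like state are hypothetical, and an in-memory copy stands in for the stable storage that the slide calls for.

```c
/* Application state protected by checkpointing (hypothetical example). */
typedef struct {
    long balance;
    long transactions;
} app_state_t;

/* In-memory stand-in for stable storage (assumption: a real system
   would write the checkpoint somewhere that survives the failure). */
static app_state_t checkpoint_copy;
static int have_checkpoint = 0;

/* Save the current state at a recovery point. */
void checkpoint(const app_state_t *s) {
    checkpoint_copy = *s;
    have_checkpoint = 1;
}

/* Restore the last saved state. Returns 0 on success, -1 if none exists. */
int rollback(app_state_t *s) {
    if (!have_checkpoint) {
        return -1;
    }
    *s = checkpoint_copy;
    return 0;
}

/* Error detection: the state is erroneous if the invariant is violated. */
int state_is_valid(const app_state_t *s) {
    return s->balance >= 0;
}

/* An operation wrapped in backward recovery.
   Returns 0 on success, -1 if the state had to be rolled back. */
int withdraw(app_state_t *s, long amount) {
    checkpoint(s);               /* recovery point */
    s->balance -= amount;
    s->transactions += 1;
    if (!state_is_valid(s)) {    /* error detected */
        rollback(s);             /* replace erroneous state with saved one */
        return -1;
    }
    return 0;
}
```

The rollback restores the whole state, including the `transactions` counter, so no partial effect of the failed operation remains. Backward recovery needs no knowledge of what went wrong, only a trustworthy earlier state.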
