
Developing Dependable Systems

Developing Dependable Systems. CIS 376 Bruce R. Maxim UM-Dearborn. Software Dependability. Customers expect all software to be dependable. They may accept some system failures in non-critical applications


Presentation Transcript


  1. Developing Dependable Systems CIS 376 Bruce R. Maxim UM-Dearborn

  2. Software Dependability • Customers expect all software to be dependable. • They may accept some system failures in non-critical applications • Applications having high dependability requirements require special programming techniques

  3. Achieving Dependability • Fault avoidance • software developed to minimize impact of human error • development process is organized so that faults in the software are detected and repaired before customer delivery • Fault tolerance • software designed so that faults in delivered software do not cause system failure

  4. Fault Minimization • Current SE methods can produce fault-free software • Fault-free software merely conforms to its specification (it may or may not always perform correctly since the specification may be flawed) • Producing fault-free software is very expensive and may only be justified in exceptional situations • It may be cheaper to accept some software faults

  5. Developing Fault-Free Software • Needs a precise (preferably formal) specification • Requires an organizational commitment to quality • Information hiding and encapsulation in software design are essential • A programming language with strict type checking and run-time checking should be used • Needs a dependable and repeatable development process

  6. Error Prone Constructs - part 1 • Floating-point numbers • inherently imprecise, frequent comparison errors • Pointers • Dangling references and aliases possible • Dynamic Memory Allocation • memory overflow and garbage problems • Parallelism • race conditions and deadlocks are possible
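The comparison errors mentioned for floating-point numbers usually come from testing two computed values with `==`. A minimal sketch of a tolerance-based comparison follows; the helper name `nearly_equal` and the default tolerances are assumptions chosen for illustration, and suitable tolerances depend on the application.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: compares two doubles using a relative tolerance
// (scaled by the larger magnitude) with an absolute tolerance as a floor
// for values near zero. The default tolerances are illustrative only.
bool nearly_equal(double a, double b,
                  double rel_tol = 1e-9, double abs_tol = 1e-12) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);  // tolerances must be non-negative
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

With IEEE 754 doubles, `0.1 + 0.2 == 0.3` is typically false, while `nearly_equal(0.1 + 0.2, 0.3)` is true.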

  7. Error Prone Constructs - part 2 • Recursion • memory overflow when errors occur • Interrupts • errors are difficult to trace • Inheritance • code is no longer localized, unexpected results can arise when changes are made Note: You can use these constructs as needed, but you must be careful to use them correctly.

  8. Information Hiding • Information should only be available to program components on a need-to-know basis • reduces the probability of accidental corruption of information • information is encapsulated to prevent error propagation to rest of program • since information is localized, programmer is less likely to make errors and reviewers are more likely to find errors

  9. Reliable Software Processes • Having a well-defined, repeatable software process will reduce the number of software faults • A well-defined repeatable process is one that does not depend entirely on individual skills, but can be carried out by a team • Significant verification and validation process activities must be included to minimize the number of software faults.

  10. Process Validation Activities • Requirements inspections • Requirements management • Model checking • Design inspections • Code inspections • Static code analysis • Test planning and management • Configuration management

  11. Fault Tolerance • Required in critical applications (high reliability needed and high failure costs) • System can continue operation, despite software failure • A system which seems to be fault-free must also be fault tolerant (in case specification errors exist or the validation is incorrect)

  12. Fault Tolerant Actions • Fault detection • system determines an incorrect system state has occurred • Damage assessment • determine system parts affected by fault • Fault recovery • system must restore its state to a known safe state • Fault repair • for a non-transitory fault, system is modified to prevent repetition

  13. Approaches • Defensive Programming • programmers assume faults exist in system code • redundant code is written to check system state for consistency after modifications are made • Fault Tolerant Architectures • HW and SW architectures that support redundancy are used • a fault tolerance controller detects problems and supports recovery • Both approaches are important
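As one possible illustration of defensive programming, the sketch below re-checks a container's consistency after every modification, even though the insertion logic should already guarantee it. The class `SortedList` and its invariant are hypothetical and not taken from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical container whose invariant is "items are in ascending order".
class SortedList {
public:
    void insert(int v) {
        // Normal logic: insert at the position that keeps the order.
        items_.insert(std::upper_bound(items_.begin(), items_.end(), v), v);
        // Redundant (defensive) check: assume the code above may be faulty.
        check_state();
    }
    const std::vector<int>& items() const { return items_; }

private:
    void check_state() const {
        if (!std::is_sorted(items_.begin(), items_.end()))
            throw std::logic_error("SortedList: items out of order");
    }
    std::vector<int> items_;
};
```

The check costs a linear scan per insertion, which is the kind of overhead that has to be weighed against the dependability requirement.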

  14. Exception Management • An exception could be a program error or an event like a power failure • Exception handling facilities in programming languages allow exceptions to be handled without constant checking to detect them • Using normal control constructs to detect exceptions in a sequence of procedural calls adds considerable timing overhead to a program
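The point about avoiding constant checking can be sketched as follows: only the top-level caller handles the failure, and the intermediate function contains no error-checking code. The functions `read_sensor`, `average_reading`, and `report` are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical low-level routine: a negative raw value stands for a failed read.
double read_sensor(int raw) {
    if (raw < 0) throw std::runtime_error("sensor read failed");
    return raw * 0.5;
}

// Intermediate routine: no status checks here. If read_sensor throws,
// the exception simply propagates through this frame.
double average_reading(int raw_a, int raw_b) {
    return (read_sensor(raw_a) + read_sensor(raw_b)) / 2.0;
}

// Top-level routine: the single place where the failure is handled.
std::string report(int raw_a, int raw_b) {
    try {
        return "avg=" + std::to_string(average_reading(raw_a, raw_b));
    } catch (const std::runtime_error& e) {
        return std::string("error: ") + e.what();
    }
}
```

With return codes instead, `average_reading` would need to test the result of each call and pass a status back up the chain.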

  15. Fault Detection • Languages with strict type checking allow many errors to be trapped during program compilation • Some types of errors can only be caught at run-time (e.g. cin >> I; cin >> A[I];)
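In the `cin >> I; cin >> A[I];` example the compiler cannot know whether `I` is a valid index, so the check has to happen at run time. A minimal sketch of such a check is shown below; the function `read_element` is a hypothetical helper, and it takes a `std::istream&` rather than using `cin` directly so that it can also be driven from a string stream.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>

// Reads an index and then a value from `in`. The value is stored only if
// the index lies inside the array; otherwise `a` is left untouched and
// false is returned.
template <std::size_t N>
bool read_element(std::istream& in, std::array<int, N>& a) {
    long long i = 0;
    int value = 0;
    if (!(in >> i >> value)) return false;  // malformed or missing input
    if (i < 0 || static_cast<unsigned long long>(i) >= N) return false;  // out of range
    a[static_cast<std::size_t>(i)] = value;
    return true;
}
```

For example, with a four-element array the input `2 7` is accepted and the input `9 1` is rejected.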

  16. Fault Detection Approaches • Preventative Fault Detection • fault detection mechanism is activated before a state change is committed • if an erroneous state is detected, the change is cancelled • Retrospective Fault Detection • fault detection mechanism is initiated after system state change has been made • used when correct sequence of actions can lead to erroneous system state or preventative fault detection has too much overhead

  17. Type System Extension • Preventative fault detection really involves extending the current type system by including additional constraints as part of the type definition • These constraints are typically implemented by defining basic operations within a class definition
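One way to read "additional constraints as part of the type definition" is a class whose basic operations validate a value before committing it. The sketch below assumes a simple range constraint; the class template `BoundedInt` is a hypothetical name, not something defined in the slides.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical constrained type: an int that must stay within [Lo, Hi].
template <int Lo, int Hi>
class BoundedInt {
    static_assert(Lo <= Hi, "range must not be empty");

public:
    explicit BoundedInt(int v) : value_(check(v)) {}

    // Preventative detection: check() throws before the assignment runs,
    // so a rejected change never reaches the stored state.
    void set(int v) { value_ = check(v); }
    int get() const { return value_; }

private:
    static int check(int v) {
        if (v < Lo || v > Hi)
            throw std::out_of_range("value outside allowed range");
        return v;
    }
    int value_;
};
```

A `BoundedInt<0, 100>` holding 75 still holds 75 after a failed `set(101)`, because the erroneous change is cancelled rather than repaired afterwards.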

  18. Damage Assessment • System is analyzed to judge the extent of corruption caused by a system failure • Must determine what parts of the state space have been affected by the failure • Generally based on “validity functions” which can be applied to the state elements to assess if their value is within an allowed range
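Validity functions can be sketched as one small predicate per state element, with the assessment step collecting the elements that fail. The `PumpState` structure, its fields, and the allowed ranges below are all assumptions made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical system state.
struct PumpState {
    double pressure_kpa;
    double temperature_c;
    int rpm;
};

// Validity functions: each checks one state element against an assumed range.
bool valid_pressure(const PumpState& s) {
    return s.pressure_kpa >= 0.0 && s.pressure_kpa <= 800.0;
}
bool valid_temperature(const PumpState& s) {
    return s.temperature_c >= -40.0 && s.temperature_c <= 120.0;
}
bool valid_rpm(const PumpState& s) { return s.rpm >= 0 && s.rpm <= 6000; }

// Damage assessment: returns the names of the state elements whose values
// fall outside their allowed range.
std::vector<std::string> assess_damage(const PumpState& s) {
    std::vector<std::string> damaged;
    if (!valid_pressure(s)) damaged.push_back("pressure_kpa");
    if (!valid_temperature(s)) damaged.push_back("temperature_c");
    if (!valid_rpm(s)) damaged.push_back("rpm");
    return damaged;
}
```

An empty result means no element failed its range check; it does not prove the state is correct, only that it is plausible.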

  19. Damage Assessment Techniques • Checksums are used to check for data transmission errors • Redundant pointers can be used to check integrity of data structures • Watchdog timers can help check for non-terminating processes (e.g. after a long time with no response, assume the worst)
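The slides do not name a particular checksum. As an illustration only, the sketch below uses Fletcher-16, a simple and widely documented algorithm; the helper `intact` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Fletcher-16 checksum over the bytes of `data`.
std::uint16_t fletcher16(const std::string& data) {
    std::uint16_t sum1 = 0;
    std::uint16_t sum2 = 0;
    for (unsigned char byte : data) {
        sum1 = static_cast<std::uint16_t>((sum1 + byte) % 255);
        sum2 = static_cast<std::uint16_t>((sum2 + sum1) % 255);
    }
    return static_cast<std::uint16_t>((sum2 << 8) | sum1);
}

// True when the received data still matches the checksum sent with it.
bool intact(const std::string& received, std::uint16_t sent_checksum) {
    return fletcher16(received) == sent_checksum;
}
```

The sender transmits the data together with `fletcher16(data)`, and the receiver calls `intact` to detect corruption in transit. A checksum like this detects many errors but cannot say which bytes were damaged.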

  20. Fault Recovery • Forward Recovery • apply repairs to corrupted system state • usually application specific, requires domain knowledge • e.g. error coding like checksum added to data • Backward Recovery • restore system to known safe state • simpler, since archived safe state is used to replace erroneous state • e.g. use of checkpoints in WP editor
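Backward recovery with checkpoints can be sketched as saving a copy of the state at a known safe point and copying it back when a fault is detected. The `Document` class below is a hypothetical stand-in for the editor example and keeps only a single checkpoint.

```cpp
#include <cassert>
#include <string>

// Hypothetical editor document with one checkpoint.
class Document {
public:
    void append(const std::string& text) { text_ += text; }
    void checkpoint() { saved_ = text_; }  // archive a known safe state
    void rollback() { text_ = saved_; }    // backward recovery: restore it
    const std::string& text() const { return text_; }

private:
    std::string text_;
    std::string saved_;
};
```

Work done after the last checkpoint is lost on rollback, which is the usual price of this simpler form of recovery.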

  21. Fault Tolerant Architecture • Defensive programming cannot cope with faults caused by HW and SW interactions • If requirements are not understood then SW checks are not likely to be correct • Systems with high availability requirements often require fault tolerant architectures • Must tolerate both HW and SW failure

  22. Hardware Fault Tolerance • Triple-modular redundancy (TMR) • Three replicated components are included in the system • If one component produces a different output than the other two, failure is assumed • This idea is based on the notion that most failures result from component failures, not design faults • Component failures should be a low probability event
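The voting step in TMR can be sketched in software as a majority function over three outputs. The function name `vote` and the use of plain `int` outputs are assumptions for illustration; a real voter would typically be implemented in hardware.

```cpp
#include <cassert>

// Majority voter over three replicated outputs. Returns true and stores
// the agreed value in `out` when at least two outputs match. Returns
// false, leaving `out` unchanged, when all three disagree.
bool vote(int a, int b, int c, int& out) {
    if (a == b || a == c) { out = a; return true; }
    if (b == c) { out = b; return true; }
    return false;
}
```

A single failed component is outvoted by the other two. If two components fail in the same way, the voter selects the wrong value, which is why simultaneous failures must be unlikely.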

  23. Software Fault Tolerance • TMR is based on two assumptions • HW components do not include common design flaws • simultaneous component failures are not likely • Neither assumption is valid for software components • it isn’t possible to replicate SW components without replicating their design flaws • simultaneous component failure is inevitable • Software systems must be diverse

  24. Design Diversity • Different versions of the system are designed and implemented in different ways (so they should have different failure rates) • Different approaches to design • object-oriented and function-oriented • different implementation languages • different algorithms in the implementation • different tools or environments

  25. Software Analogies to TMR • N-version Programming • same specification is implemented in a number of different versions by several teams • all versions compute simultaneously, the majority output is presumed correct • Recovery blocks • a number of explicitly distinct versions of a program are written for the same specification and executed in sequence • an acceptance test is used to select the output to keep
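A recovery block can be sketched as a loop that runs each version in turn and keeps the first result that passes the acceptance test. This is a simplified illustration: the names `Version` and `recovery_block` are hypothetical, the versions are pure functions of an `int`, and the sketch omits the state restoration that a full recovery block would perform before trying an alternate.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Version = std::function<int(int)>;

// Tries each version in order. Stores the first result that passes the
// acceptance test in `result` and returns true. Returns false if no
// version produces an acceptable result.
bool recovery_block(const std::vector<Version>& versions,
                    const std::function<bool(int, int)>& acceptable,
                    int input, int& result) {
    for (const Version& version : versions) {
        const int candidate = version(input);
        if (acceptable(input, candidate)) {
            result = candidate;
            return true;
        }
    }
    return false;
}
```

For an integer square root, the acceptance test could check that `r * r <= n` and `(r + 1) * (r + 1) > n`. That test is much simpler than either implementation, which is what makes it usable as an independent check.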

  26. Problems with Design Diversity • Teams tend to tackle the same problems in the same ways, so the resulting implementations may not be diverse • Characteristic errors • different teams are likely to make the same mistakes, since some parts of the implementation are more difficult than others • specification errors may cause the same errors to appear in all implementations (argument for developing multiple specifications)

  27. Is software redundancy needed? • Unlike HW, SW faults are not an inevitable consequence of the real world • Some people believe that a higher level of reliability can be achieved by reducing software complexity instead • The existence of fault-tolerance controllers increases program complexity considerably and adds sources of errors that affect reliability
