
Software Safety in Embedded Systems: Why, What, and How?

This paper explores the importance of software safety in embedded systems and provides insights into the methodologies and practices involved. It covers topics such as hazard analysis, software requirements, design and analysis, human-machine interaction, verification, and validation. The paper also discusses the challenges and complexities of ensuring safety in hardware and software components and the need for system-level methods and viewpoints.


Presentation Transcript


  1. Software Safety in Embedded Systems & Software Safety: Why, What, and How – Leveson UC San Diego CSE 294 Spring Quarter 2006 Barry Demchak

  2. Previous Paper • System Safety in Computer-Controlled Automotive Systems – Leveson (2000) • Types of accidents • Safeware Methodology • Project Management • Software Hazard Analysis • Software Requirements Specification & Analysis • Software Design & Analysis • Design & Analysis of Human-Machine Interaction • Software Verification • Feedback from Operational Experience • Change Control and Analysis

  3. Roadmap • Safety definitions • Industrial safety and risk • Systems Issues – hardware and software • Software Safety • Analysis and Modeling • Verification and Validation • System Safety Engineering

  4. Safety Before Computers • NASA: 10⁻⁹ chance of failure over a 10 hour flight • British nuclear reactors: no single fault can cause a reactor to trip, and 10⁻⁷ chance over 5000 hours of failure to meet a demand to trip • FAA: 10⁻⁹ chance per flight hour (i.e., not within total life span of entire fleet)
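To put 10⁻⁹ per flight hour in perspective, a rough illustration (the fleet figures are assumptions for the sake of arithmetic, not from the paper): a fleet of 1,000 aircraft flying about 3,000 hours a year for 30 years accumulates roughly 10⁸ flight hours, so at 10⁻⁹ failures per hour the expected number of such failures over the whole fleet's service life is about 10⁸ × 10⁻⁹ = 0.1. In other words, the event would most likely never be observed in operation, which is part of why such rates cannot be demonstrated by testing alone.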

  5. Introduction of Computers • Nuclear Power Plants • Space Shuttle • Airbus Aircraft • Space Satellites • NORAD • Purpose: perform functions that are too dangerous, quick, or complex for humans

  6. System Safety (def.) • Subdiscipline of systems engineering • Applies scientific, management, and engineering principles • Ensures adequate safety throughout the system life cycle • Constrained by operational effectiveness, time, and cost • MilSpec: “freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property”

  7. More Definitions • Accident • Unwanted and unexpected release of energy • Mishap (or failure) • Unplanned event or series of events • Death, injury, occupational illness, damage, or loss of equipment or property, or environmental harm • Hazard • A condition that can lead to a mishap

  8. More Definitions (cont’d) • Risk • Probability of a hazardous state occurring • Probability of a hazardous state leading to a mishap • Perceived severity of the worst potential mishap that could result from a hazard • Hazard probability • Hazard criticality (severity)

  9. Early Approach • Operational or Industrial Safety • Examining system during operating life • Correcting unacceptable hazards • Ignores crushing effect of single catastrophe • Assumptions • All faults caused by human errors could be avoided completely or located and removed prior to delivery and operation • Relatively low complexity of hardware

  10. Ford Pinto (early 1970s) • Specifications: 2000 pounds, $2000 sale price • Use existing factory tooling • Safety issue with gas tank placement • Analysis • Deaths cost $200,000, burns cost $67,000 • Cost to make change $137M, benefit $49M • Ford engineer: “But you miss the point entirely. You see, safety isn't the issue, trunk space is. You have no idea how stiff the competition is over trunk space.” • Ford president: “Safety doesn’t sell” • Verdict: $100M

  11. Anecdotes • Safety devices themselves have been responsible for losses or increasing chances of mishaps • Redundancy sometimes degrades safety • Seemingly unrelated (but in fact coupled) systems cause errors

  12. Later Approach • System Safety • Design acceptable safety level before actual production or operation • Optimize safety by applying scientific and engineering principles to identify and control hazards through analysis, design, and management procedures • Hazard analysis identifies and assesses • Criticality level of hazards • Risks involved in system design

  13. Later approach (cont’d) • Assumptions • Complexity of software and hardware interaction causes non-linear increase in human-error-induced faults • Impossible to demonstrate safety ahead of usage • Complexity and coupling are covariant

  14. Hardware vs Systems • Hardware • Widgets have long history of use and fault analysis … highly responsive to redundancy techniques • Infinite number of stable states • Software • No history with software … reuse is rare • Large number of discrete states without repetitive structure • Difficult to test under realistic conditions

  15. More Systems Issues • Difficult to specify completely – what it does, and what it does not do • Cannot identify misunderstandings about requirements • Engineers assume perfect execution environments, don’t consider transient faults • Lack of system-level methods and viewpoints

  16. Even Bigger Systems Issues • Specifying and implementing individual components is not the same as specifying and implementing the interactions between components • Between-component interactions grow exponentially and are often underrepresented in analyses • Components include • Software components • Hardware • Human operators

  17. Still Bigger Systems Issues • More Components • Development Methodologies • Source code maintenance • Verification/Validation Methodologies • Stakeholder Values • Management • Individual Programmers • Customer • Human Users • Suppliers

  18. Definitions • Reliability • Probability that system will perform intended function • Safety • Probability that hazard will not lead to a mishap • Reliability = failure free • Safety = mishap free • Reliability and Safety often conflict

  19. Safety • Studied separately from security, reliability, or availability • Separation of concerns • Safety requirements are identified and separated from operational requirements • Conflicts resolved in a well-reasoned manner

  20. Definitions • System • Sum total of all component parts • Software is only a part, and its correctness exists only in relation to other system components

  21. Software Safety • Ensures software will execute within a system context without resulting in unacceptable risk • Safety-critical software functions • Directly or indirectly allow a hazardous system state to exist • Safety-critical software • Contains safety-critical functions

  22. System Characteristics • Inputs and outputs over time • Control subsystem • Description of function to be performed • Specification of operating constraints (quality, capacity, process, and safety) • Safety constraints are hazards rewritten as constraints • Safety constraints written, maintained, and audited separately
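A minimal sketch of what "hazards rewritten as constraints" can look like in software, kept separate from the operational logic. The valve/pressure names and the threshold below are hypothetical, not taken from the paper:

  /* Sketch (hypothetical example, not from the paper): a hazard such as
   * "relief valve closed while tank pressure is above its limit" rewritten
   * as a safety constraint and checked separately from operational code. */
  #include <stdbool.h>
  #include <stdio.h>

  #define MAX_SAFE_PRESSURE_KPA 800

  typedef struct {
      int  pressure_kpa;
      bool relief_valve_open;
  } plant_state_t;

  /* Safety constraint: the valve must never be closed while pressure
   * exceeds the safe limit. */
  static bool safety_constraint_holds(const plant_state_t *s)
  {
      return !(s->pressure_kpa > MAX_SAFE_PRESSURE_KPA && !s->relief_valve_open);
  }

  int main(void)
  {
      plant_state_t s = { .pressure_kpa = 850, .relief_valve_open = false };

      if (!safety_constraint_holds(&s)) {
          /* Violation detected: revert to a safe state (open the valve). */
          s.relief_valve_open = true;
          printf("constraint violated: forcing relief valve open\n");
      }
      return 0;
  }

Keeping the constraint in its own function mirrors the slide's point that safety constraints are written, maintained, and audited separately from the rest of the requirements.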

  23. Constraints, Requirements, Design

  24. Analysis and Modeling • Preliminary Hazard Analysis (PHA) • Subsystem Hazard Analysis (SSHA) • System Hazard Analysis (SHA) • Operating and Support Hazard Analysis (OSHA) • Safeware – Leveson

  25. Hazard Analysis • Start with list of identifiable hazards • Work backward to discover combination of faults that produce the hazard • Categorization • Frequent • Occasional • Reasonably remote • Remote • … physically impossible
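One common way the categorization feeds into decisions is a risk index that combines a hazard's probability category with its severity, in the style of MIL-STD-882. The category names, severity levels, and the simple additive index below are assumptions for illustration, not the paper's scheme:

  /* Illustrative sketch (not from the paper): hazard probability categories
   * combined with severity into a simple risk index. */
  #include <stdio.h>

  typedef enum { FREQUENT, OCCASIONAL, REASONABLY_REMOTE, REMOTE, IMPOSSIBLE } probability_t;
  typedef enum { CATASTROPHIC, CRITICAL, MARGINAL, NEGLIGIBLE } severity_t;

  /* Lower index = higher risk; a project would define which indices are
   * acceptable, which require mitigation, and which are intolerable. */
  static int risk_index(probability_t p, severity_t s)
  {
      return (int)p + (int)s;
  }

  int main(void)
  {
      printf("risk index for a remote, catastrophic hazard: %d\n",
             risk_index(REMOTE, CATASTROPHIC));
      return 0;
  }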

  26. Hazard Examples(Nuclear Weapons) • Inadvertent nuclear detonation • Inadvertent prearming, arming, launching, firing, or releasing • Deliberate prearming, arming, launching, firing, or releasing under inappropriate conditions

  27. Software Requirement Analysis • Hard to do • Cubby-hole mentality • Rarely includes what the system should not do • Techniques • Fault Tree Analysis (FTA) • Real Time Logic (RTL) • Petri nets

  28. Fault Tree Example
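A minimal sketch of the idea behind a fault tree: the top event (the hazard) is expressed as AND/OR combinations of lower-level faults, and analysis works backward from the hazard to the combinations of basic events that produce it. The hazard and basic events below are hypothetical, not the figure from the presentation:

  /* Minimal fault-tree sketch (hypothetical events, not from the paper):
   * top event = unintended motor start, evaluated for one assignment of
   * basic-event states. */
  #include <stdbool.h>
  #include <stdio.h>

  int main(void)
  {
      /* Basic events */
      bool cmd_bit_corrupted  = false;  /* memory fault flips the start bit  */
      bool spurious_start_cmd = true;   /* software issues start erroneously */
      bool interlock_failed   = true;   /* hardware interlock stuck closed   */

      /* Intermediate event: an erroneous start command exists (OR gate) */
      bool erroneous_command = cmd_bit_corrupted || spurious_start_cmd;

      /* Top event: unintended motor start requires both an erroneous
       * command AND a failed interlock (AND gate). */
      bool unintended_start = erroneous_command && interlock_failed;

      printf("top event (unintended start): %s\n",
             unintended_start ? "reachable" : "not reachable");
      return 0;
  }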

  29. Real Time Logic • Model the system in terms of events and actions (both data dependency and temporal ordering) • Generate predicates • Determine whether a safety assertion is a theorem derivable from the model • If the negation of the assertion is derivable instead, the system is inherently unsafe
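RTL assertions typically relate the occurrence times of events, e.g. "every response occurs within a deadline of its stimulus." The sketch below only checks such an assertion against one recorded trace rather than deriving it as a theorem from the model, and the event names, times, and deadline are hypothetical:

  /* Sketch (not the paper's method): checking an RTL-flavored timing
   * assertion -- "for every i, response i occurs within DEADLINE_MS of
   * stimulus i" -- against a recorded trace. */
  #include <stdbool.h>
  #include <stdio.h>

  #define DEADLINE_MS 20
  #define N_EVENTS    4

  int main(void)
  {
      /* Occurrence times in ms of the i-th stimulus and response (hypothetical) */
      int stimulus[N_EVENTS] = {  0, 100, 200, 300 };
      int response[N_EVENTS] = { 15, 112, 225, 318 };

      bool holds = true;
      for (int i = 0; i < N_EVENTS; i++) {
          if (response[i] - stimulus[i] > DEADLINE_MS) {
              printf("assertion violated at occurrence %d\n", i);
              holds = false;
          }
      }
      printf("safety assertion %s on this trace\n", holds ? "holds" : "fails");
      return 0;
  }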

  30. Time Petri Nets • Mathematical modeling of discrete event systems in terms of conditions and events and the relationship between them • Facilitates backward analysis • Points to failures and faults which are potentially most hazardous • Nontrivial to build and maintain
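A minimal sketch of the Petri-net building blocks: places hold tokens, and a transition fires when all of its input places are marked, consuming those tokens and marking its output places. The net below is a hypothetical three-place example, not one from the paper, and it omits the timing bounds that time Petri nets add:

  /* Minimal Petri-net sketch (hypothetical net, not from the paper). */
  #include <stdbool.h>
  #include <stdio.h>

  enum { P_VALVE_CLOSED, P_PRESSURE_HIGH, P_VALVE_OPEN, N_PLACES };

  static int marking[N_PLACES] = { 1, 1, 0 };  /* initial marking */

  /* Transition "open relief valve": inputs {valve closed, pressure high},
   * output {valve open}. */
  static bool fire_open_valve(void)
  {
      if (marking[P_VALVE_CLOSED] > 0 && marking[P_PRESSURE_HIGH] > 0) {
          marking[P_VALVE_CLOSED]--;
          marking[P_PRESSURE_HIGH]--;
          marking[P_VALVE_OPEN]++;
          return true;
      }
      return false;
  }

  int main(void)
  {
      printf("transition fired: %s\n", fire_open_valve() ? "yes" : "no");
      printf("marking: closed=%d high=%d open=%d\n",
             marking[P_VALVE_CLOSED], marking[P_PRESSURE_HIGH], marking[P_VALVE_OPEN]);
      return 0;
  }

Backward analysis asks the reverse question: starting from a hazardous marking, which earlier markings and transition firings could have led to it.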

  31. Research Question • What is the place of these analysis techniques in an agile development environment??

  32. Safety Verification and Validation • Showing that a fault cannot occur • Showing that if a fault occurs, it is not dangerous • Only as good as the specifications • Specifications are usually incomplete, and hardware specifications are rare

  33. Safety Verification and Validation • Methodologies • Proofs of adequacy • Software Fault Tree (proofs of fault tree analyses) • Determine safety requirements • Detect software logic errors • Identify multiple failure sequences involving different parts of the system • Inform critical runtime checks • Inform testing

  34. Safety Verification and Validation • Methodologies • Nuclear Safety Cross Check Analysis (NSCCA) • Demonstrate that software will not contribute to a nuclear mishap • Multiple technical analyses demonstrate adherence to specifications • Demonstrate security and control measures • A lot of qualitative judgment regarding criticality • Software Common Mode Analysis • Sneak Software Analysis

  35. Safety Analysis – Quantitative • Requires statistical histories which may not exist • Applies mostly to physical systems • Single-valued Best Estimate • Information sufficient for determinate models • Probabilistic • Science is understood, but limited parameters available • Bounding • Putting a ceiling on the answer

  36. System Safety Engineering • Identify hazards • Assess hazards (likelihood and criticality) • Design to eliminate or control hazards • Assess risks that cannot be eliminated or controlled

  37. Failure Mode Definitions • Fail-safe • Default is safe mode, no attempt to execute operational mission • Fail-operational • Default is to correct fault and continue with operational mission • Fail-soft • Default is to continue with degraded operations
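A small sketch of how the three policies differ in what a controller does after detecting a fault; the example controller and its messages are assumptions, not from the paper:

  /* Sketch (assumed example): fail-safe, fail-operational, and fail-soft
   * responses to a detected fault. */
  #include <stdio.h>

  typedef enum { FAIL_SAFE, FAIL_OPERATIONAL, FAIL_SOFT } fail_policy_t;

  static void on_fault(fail_policy_t policy)
  {
      switch (policy) {
      case FAIL_SAFE:
          /* Abandon the mission; drive outputs to the predefined safe state. */
          printf("fail-safe: outputs to safe state, mission stopped\n");
          break;
      case FAIL_OPERATIONAL:
          /* Mask or correct the fault (e.g., switch to a redundant channel)
           * and continue the mission at full capability. */
          printf("fail-operational: switch to redundant channel, continue\n");
          break;
      case FAIL_SOFT:
          /* Continue, but with reduced functionality or performance. */
          printf("fail-soft: continue in degraded mode\n");
          break;
      }
  }

  int main(void)
  {
      on_fault(FAIL_SAFE);
      on_fault(FAIL_OPERATIONAL);
      on_fault(FAIL_SOFT);
      return 0;
  }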

  38. Designing for Safety • Not possible to ensure safety by analysis or verification alone • Analysis and verification may be cost-prohibitive • Standard hierarchy of design precedence • Intrinsically safe • Prevents or minimizes occurrence of hazards • Controls the hazard • Warns of presence of hazard

  39. Safety Design Mechanisms • Lockout device • Prevents an event from occurring when a hazard is present • Lockin device • Maintains an event or condition • Interlock device • Ensures operations occur in the correct order
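A software interlock in miniature: a critical event is only accepted if its required predecessor has completed, so out-of-order requests are blocked. The arm/fire device below is hypothetical:

  /* Interlock sketch (hypothetical device, not from the paper): "fire" is
   * only accepted after "arm" has completed, enforcing the correct sequence. */
  #include <stdbool.h>
  #include <stdio.h>

  static bool armed = false;

  static void do_arm(void)  { armed = true; }

  static bool do_fire(void)
  {
      if (!armed) {
          printf("interlock: fire rejected, system not armed\n");
          return false;          /* out-of-order request blocked */
      }
      printf("fire accepted\n");
      return true;
  }

  int main(void)
  {
      do_fire();   /* rejected: arm has not occurred yet */
      do_arm();
      do_fire();   /* accepted: sequence is correct      */
      return 0;
  }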

  40. Safety Design Principles • Provide leverage for certification • Avoid complexity where possible • Reduce risk by reducing hazard likelihood, or severity, or both • Modularize to separate safety-critical functions from non-critical functions • Execute safety-critical functions under separate authority • Fail safe on a single-point failure

  41. Safety Design Principles (cont’d) • Start out in safe state, and take affirmative actions to reach higher-risk states • Check critical flags as close as possible to the actions they protect • Avoid complements: absence of “armed” is not “safe” • Use “true” values to indicate safety … “false” values can result from common hardware failures
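One way this shows up in embedded code (a common idiom, assumed here rather than taken from the paper) is to represent "safe" and "armed" with distinct multi-bit patterns that must be matched exactly, instead of inferring safety from the absence of an "armed" flag or from a 0/false value that cleared memory or a stuck-at fault could produce by accident:

  /* Sketch (assumed idiom, not the paper's code). */
  #include <stdint.h>
  #include <stdio.h>

  #define STATE_SAFE  0xA5A5u   /* deliberate patterns, unlikely to arise by corruption */
  #define STATE_ARMED 0x5A5Au

  static int outputs_enabled(uint16_t state)
  {
      return state == STATE_ARMED;     /* only an exact match arms the outputs */
  }

  static int is_safe(uint16_t state)
  {
      return state == STATE_SAFE;      /* safety is asserted, never inferred   */
  }

  int main(void)
  {
      uint16_t state = 0x0000;          /* e.g., zeroed RAM after a fault */
      printf("armed=%d safe=%d\n", outputs_enabled(state), is_safe(state));
      return 0;
  }

With this scheme the corrupted state is neither armed nor reported safe, which is the point of the "avoid complements" principle.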

  42. Safety Design Principles (cont’d) • Detection of unsafe states • Watchdog timer • Independent monitors • Asserts and exception handlers • Use backward recovery (return system to a safe state) instead of forward recovery (plow ahead)
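A sketch of backward recovery (the controller structure, invariant, and values are assumptions for illustration): the loop checks an invariant each cycle and, on violation, restores the last known safe state rather than pressing on with a possibly-corrupted one. A hardware watchdog would additionally reset the processor if the loop stopped running at all:

  /* Sketch (assumed structure, not from the paper). */
  #include <stdbool.h>
  #include <stdio.h>

  typedef struct { int setpoint; int output; } ctrl_state_t;

  static bool invariant_holds(const ctrl_state_t *s)
  {
      return s->output >= 0 && s->output <= 100;   /* hypothetical bound */
  }

  int main(void)
  {
      ctrl_state_t safe_copy = { .setpoint = 50, .output = 50 };  /* checkpoint */
      ctrl_state_t current   = safe_copy;

      current.output = 250;                 /* simulated corruption / bad update */

      if (!invariant_holds(&current)) {
          current = safe_copy;              /* backward recovery to safe state */
          printf("unsafe state detected: rolled back to checkpoint\n");
      }
      return 0;
  }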

  43. Human Factors • Define partnership between human and computer • Avoid complacency • Avoid confusion • Avoid passive monitoring

  44. Conclusion • Select suite of techniques and tools spanning entire software development process • Apply them conscientiously, consistently, and thoroughly • Consider implementation tradeoffs • Low catastrophe, high cost alternatives • Moderate catastrophe, moderate cost alternatives • High catastrophe, low cost alternatives

  45. Take Home Messages • Safety is a system issue – in the large sense • Software engineering techniques can contribute to system safety – in both a narrow and broad context • Acceptable risk is king, and determining and executing it is hard
