
System Safety


Presentation Transcript


  1. Using Probability for Risk Analysis System Safety Barbara Luckett – 20 October 2009

  2. Personal Background • Naval Surface Warfare Center Dahlgren Division (NSWCDD) • “…premier research and development center that serves as a specialty site for weapon system integration.” -- NSWCDD website • Platform System Safety Branch • Department of Defense (DoD) acquisition projects • http://www.austal.com/index.cfm?objectID=6B42CC62-65BF-EBC1-2E3E308BACC92365

  3. System Safety Terms and Concepts • System – “a composite, at any level of complexity, of personnel, procedures, materials, tools, equipment, facilities, and software… used together in the intended operational or support environment to perform a given task or achieve a specific purpose.” – MIL-STD 882C • Safety – “freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property, or damage to the environment.” – MIL-STD 882C

  4. What is System Safety? • “The application of engineering and management principles, criteria, and techniques to optimize all aspects of safety within the constraints of operational effectiveness, time, and cost throughout all phases of the system life cycle.” – MIL-STD 882C • “For almost any system, product, or service, the most effective means of limiting product liability and accident risks is to implement an organized system safety function beginning in the conceptual design phase, and continuing through to its development, fabrication, testing, production, use, and ultimate disposal.” – System Safety Society website

  5. System Safety Terms and Concepts • Hazard – “any real or potential condition that can cause death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment.” – MIL-STD 882C • Mishap – “an unplanned event or series of events resulting in death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment.” – MIL-STD 882C • Effect – “the result of a mishap (ie: death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment).” – MIL-STD 882C

  6. Mishap Severity

  7. Mishap Probability

  8. Mishap Risk Index (MRI)

  9. How do we get these values? • Severity values are obtained by brainstorming “worst credible” mishaps in each of three categories: • Personnel injury/death • Damage to system equipment • Environmental damage • Probability values are a little more technical…

  10. Probability Terms and Concepts • “The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.” - COMAP • “The sample space S of a random phenomenon is the set of all possible outcomes” - COMAP • “An event is any outcome or set of outcomes of a random phenomenon… An event is a subset of the sample space.” - COMAP -- COMAP text, 7th edition, pages 289-299

  11. Probability Rules • 0 ≤ P(A) ≤ 1 • P(S) = 1 • P(Aᶜ) = 1 – P(A) • P(A or B) = P(A) + P(B) – P(A and B) • P(A and B) = P(A) x P(B) (only when A and B are independent)
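
These rules are easy to check numerically. A minimal Python sketch (the events and values are hypothetical, and A and B are assumed independent):

    # Checking the probability rules with hypothetical values.
    # A and B are assumed independent, so P(A and B) = P(A) * P(B).
    p_a = 0.3   # P(A), hypothetical
    p_b = 0.5   # P(B), hypothetical

    p_not_a = 1 - p_a                    # complement rule
    p_a_and_b = p_a * p_b                # multiplication rule (independence)
    p_a_or_b = p_a + p_b - p_a_and_b     # general addition rule

    assert 0 <= p_a_or_b <= 1            # every probability lies in [0, 1]
    print(p_not_a, p_a_and_b, p_a_or_b)  # 0.7 0.15 0.65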

  12. Methods of Obtaining Probability Values • Fault Tree Analysis • Historical Mishap Data • Given Information • Standard Calculations

  13. Fault Tree Analysis (FTA) • Originally developed by Bell Telephone Laboratories in 1962 for the U.S. Air Force • Used to analyze probabilities of inadvertent launch of Minuteman missiles • The technique was expanded and improved upon by the Boeing Company • Fault Trees are now one of the most widely used methods in system reliability and failure probability analysis

  14. Fault Tree Analysis (FTA) • A Fault Tree is a top-down, structured graphical representation of how failure events interact within a system • Basic events (hazards and their causal factors) are at the bottom of the fault tree and are linked via logic symbols (known as gates) to a top event (mishap). • Events in a Fault Tree are continually expanded until sub-events are created for which you can assign a probability. • We can use known probability values for the basic events, along with knowledge of logic gates and boolean logic, to calculate the probability of the mishap occurring.

  15. Review of Logic Gates for FTA • AND • OR • XOR • NOT • NAND • NOR • XNOR

  16. AND Gate Logic • All on-site power fails iff Generator #1 fails and Generator #2 fails and Generator #3 fails • A = B and C and D • P(A) = P(B) x P(C) x P(D) (assuming independent failures) • [Fault tree figure: top event A = “All on-site power failed”, joined by an AND gate to basic events B, C, D = Generator #1, #2, #3 fails]
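
A quick numeric check of the AND-gate rule, with hypothetical generator failure probabilities and failures assumed independent:

    # AND gate: the top event occurs only if ALL inputs occur, so
    # (for independent inputs) the probabilities multiply.
    from math import prod

    p_generator_fails = [0.05, 0.05, 0.05]    # hypothetical, per generator
    p_all_power_lost = prod(p_generator_fails)
    print(p_all_power_lost)                   # ~0.000125 (1.25 x 10^-4)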

  17. OR Gate Logic • Elevator door ‘closed’ failed iff hardware failure or human error or software failure • A = B or C or D • P(A) = P(B) + P(C) + P(D) – P(B)P(C) – P(B)P(D) – P(C)P(D) + P(B)P(C)P(D) (inclusion-exclusion, with independence giving the product terms) • [Fault tree figure: top event A = “Elevator door ‘closed’ failed”, joined by an OR gate to basic events B = hardware failure, C = human error, D = software failure]
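
The inclusion-exclusion expansion grows quickly with the number of inputs; for independent inputs, the equivalent complement form 1 – (1 – P(B))(1 – P(C))(1 – P(D)) is easier to compute. A sketch with hypothetical input probabilities:

    # OR gate: the top event occurs if ANY input occurs. For independent
    # inputs, the complement form below equals the inclusion-exclusion
    # expansion, and it scales to any number of inputs.
    from math import prod

    p_inputs = [0.01, 0.002, 0.005]   # hypothetical: hardware, human, software
    p_door_failed = 1 - prod(1 - p for p in p_inputs)
    print(p_door_failed)              # ~0.01692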

  18. FTA Methodology • Generally involves five steps: • Define the undesired top event (mishap) • Obtain an understanding of the system • Construct the fault tree, deductively defining all potential failure paths • Evaluate the probability of the top event • Analyze output and determine what is required to mitigate the top event

  19. Define the undesired top event (mishap): Fire Protection Systems Fail • Obtain an understanding of the system: primary smoke detection system with secondary heat detection system; AFFF (aqueous film-forming foam) fire suppression system • Construct the fault tree, deductively defining all potential failure paths • [Fault tree figure: “Fire Protection Systems Fail” branches into “Fire detection system fails” (smoke detection system fails; heat detection system fails) and “Fire suppression system fails” (pump fails; blocked nozzles)]
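
A sketch of evaluating this tree bottom-up in Python. The gate types are one plausible reading of the slide (the transcript does not label them), and the leaf probabilities are hypothetical:

    # Bottom-up evaluation of the fire-protection fault tree.
    # Assumed gate types (not labeled in the transcript): detection fails
    # only if BOTH detectors fail (primary + backup), suppression fails if
    # EITHER the pump fails OR the nozzles are blocked, and the top event
    # occurs if EITHER subsystem fails. Leaf probabilities are hypothetical,
    # and all events are assumed independent.
    p_smoke_fails = 0.01       # smoke detection system fails
    p_heat_fails = 0.02        # heat detection system fails
    p_pump_fails = 0.005       # pump fails
    p_nozzles_blocked = 0.003  # blocked nozzles

    p_detection = p_smoke_fails * p_heat_fails                        # AND gate
    p_suppression = 1 - (1 - p_pump_fails) * (1 - p_nozzles_blocked)  # OR gate
    p_top = 1 - (1 - p_detection) * (1 - p_suppression)               # OR gate
    print(p_detection, p_suppression, p_top)   # 0.0002, ~0.008, ~0.0082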

  20. Historical Mishap Data • Using the probability of an event occurring in the past to predict the probability of the event occurring in the future • EX) If we have a fleet of 5 ships (each with 6 freight elevators onboard) that have been in operation for 20 years (each elevator used approx. 35 hours/year), with 3 injuries caused by elevator malfunctions: • The probability of a mishap can now be determined by dividing the number of times a mishap has occurred by the total operational hours (5 x 6 x 20 x 35 = 21,000 hours) • P(mishap) = # of mishaps / total hours = 3/21,000 ≈ 1.43 x 10⁻⁴ per operational hour • This falls into the REMOTE probability category
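
The same arithmetic as a short sketch, using the figures from the slide:

    # Historical mishap rate: mishaps divided by total operational hours.
    ships, elevators_per_ship = 5, 6
    years, hours_per_year = 20, 35
    mishaps = 3

    total_hours = ships * elevators_per_ship * years * hours_per_year
    p_mishap = mishaps / total_hours
    print(total_hours, p_mishap)   # 21000, ~1.43e-4 per operational hour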

  21. Given Information: Hardware Components • Often, probabilities of failure for a system’s hardware components may be available • EX) Consider a system with an operational function that is dependent on all four of the individual components working (ie: the system function fails if any one of the components fails): • P(system failure) = 1 – [P(component A does not fail) x P(B does not fail) x P(C does not fail) x P(D does not fail)] = 1 – [(0.9357)(0.9083)(0.925)(0.9083)] = 1 - 0.71406 = 0.28594 per 1 million operational hours
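
A sketch reproducing the series-system calculation with the slide’s reliability values:

    # Series system: the function fails if ANY of the four components fails,
    # so P(system failure) = 1 - product of the component reliabilities.
    from math import prod

    reliabilities = [0.9357, 0.9083, 0.925, 0.9083]  # P(component does not fail)
    p_system_failure = 1 - prod(reliabilities)
    print(p_system_failure)   # ~0.28594 per 1 million operational hours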

  22. Given Information: Test Scenarios • Operational tests can be conducted to provide an estimate of failure for certain system components • EX) We can run a series of tests on a fire suppression system and note when the fire is extinguished. • Define a success here as an event where the fire is extinguished in less than 60 seconds from system activation. • If we conduct 10 tests, and the system fails to extinguish the fire in under a minute once, we have P(failure) = 0.1 • With only 10 trials, this estimate is coarse; a larger sample would be needed for an accurate value
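
A sketch of the point estimate, with a rough interval (normal approximation, illustrative only) that shows how wide the uncertainty is after just ten trials:

    # Failure probability estimated from a small test series.
    import math

    tests, failures = 10, 1
    p_failure = failures / tests
    print(p_failure)          # 0.1

    # Why 10 trials is coarse: a rough 95% interval (normal approximation,
    # which is crude at this sample size) is very wide.
    se = math.sqrt(p_failure * (1 - p_failure) / tests)
    print(max(0.0, p_failure - 1.96 * se), p_failure + 1.96 * se)  # ~0.0, ~0.29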

  23. Standard Calculations: Event Types • Let qi(t) = P(unit i is in a failed state at time t) • Different types of events: • Non-repairable unit • Unit i is not repaired when a failure occurs • Failure rate λi • qi(t) = 1 − e^(−λi t) ≈ λi t (the approximation holds when λi t is small) • Repairable unit (repaired when failure occurs) • Unit i is repaired when a failure occurs and is assumed to be as good as new following a repair • Failure rate λi and Mean Time To Repair MTTRi • qi(t) ≈ λi x MTTRi

  24. Standard Calculations: Event Types • Periodically tested (hidden failures) • Unit i is tested periodically with test interval τi • Failure may occur at any time in the test interval, but the failure is only detected in a test or if a demand for the unit occurs. • Typical for safety-critical units (e.g., smoke detectors) • Failure rate λi and test interval τi • qi(t) ≈ (λi x τi) / 2 • On-demand probability • Unit i is not active during normal operation, but may be subject to one or more demands • Often used for human (operator) error • qi(t) = P(i fails on request)
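
A sketch collecting the approximations for all four event types on slides 23 and 24; every numeric input below (λ, t, MTTR, τ, and the on-demand value) is hypothetical:

    # The four event-type approximations from slides 23-24.
    # All numeric inputs below are hypothetical.
    import math

    lam = 1e-4    # failure rate per hour
    t = 1000      # mission time, hours
    mttr = 8      # mean time to repair, hours
    tau = 720     # test interval, hours (roughly monthly)

    q_nonrepairable = 1 - math.exp(-lam * t)  # exact; ~ lam*t when lam*t is small
    q_repairable = lam * mttr                 # steady-state unavailability
    q_periodic = lam * tau / 2                # mean unavailability between tests
    q_on_demand = 0.01                        # given directly, e.g. operator error

    print(q_nonrepairable, q_repairable, q_periodic, q_on_demand)
    # ~0.0952, 0.0008, 0.036, 0.01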

  25. Standard Calculations: Why is Human Error important? • Human beings are an integral part of any system, so we cannot accurately estimate the probability of failure without taking people into consideration • “Estimates of the probability that a person will, for example, have a moment’s forgetfulness or lapse of attention and forget to close a valve or close the wrong valve, press the wrong button, make a mistake in arithmetic, and so on… They are not estimates of the probability of error due to poor training or instructions, lack of physical or mental ability, lack of motivation, or poor management” • “… Because so much judgment is involved, it is tempting for those who wish to do so to try to ‘jiggle’ the figures to get the answers they want… Anyone who uses estimates of human reliability outside the usual ranges should be expected to justify them.” – An Engineer’s View of Human Error by Trevor Kletz

  26. Standard Calculations: Human Error Probability P (Human Error) ≈ K1 x K2 x K3 x K4 x K5

  27. Standard Calculations: Human Error Probability

  28. Standard Calculations: Human Error Probability • Consider one scenario: • Type of activity: requiring attention, routine → K1 = 0.01 • Stress factor: more than 20 seconds available → K2 = 0.5 • Operational qualities: average knowledge and training → K3 = 1 • Activity anxiety factor: potential emergency → K4 = 2 • Activity ergonomic factor: good microclimate, good interface with plant → K5 = 1 • P (Human Error) ≈ K1 x K2 x K3 x K4 x K5 = 0.01 x 0.5 x 1 x 2 x 1 = 0.01 • In this situation, a person will fail 1% of the time • This falls into the PROBABLE category
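
The same product as a sketch, using the K values worked above:

    # Human error probability as a product of the five shaping factors,
    # using the K values from this scenario.
    from math import prod

    k_factors = {
        "type of activity (routine, requiring attention)": 0.01,  # K1
        "stress factor (more than 20 s available)":        0.5,   # K2
        "operational qualities (average training)":        1,     # K3
        "anxiety factor (potential emergency)":            2,     # K4
        "ergonomic factor (good microclimate/interface)":  1,     # K5
    }
    p_human_error = prod(k_factors.values())
    print(p_human_error)   # 0.01 -- fails about 1% of the time (PROBABLE)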

  29. Back to a Fault Tree Example…

  31. [Fault tree figure: top event “Alarm clock does not wake you up” splits into “Alarm clock failure” and “You don’t hear it”; “Alarm clock failure” splits (an AND gate: both clocks must fail) into “Main (plug-in) clock failure” (power outage; forgot to set or set incorrectly; faulty clock: electrical fault, mechanical fault) and “Backup (wind-up) clock failure” (forget to wind; forget to set or set incorrectly; faulty clock)]

  32. [Same fault tree figure, now with probabilities assigned: “You don’t hear it” is negligible; main (plug-in) clock: power outage P = 0.012, forgot to set (or set incorrectly) P = 0.008, faulty clock (electrical fault P = 0.0003, mechanical fault P = 0.0004); backup (wind-up) clock: faulty clock P = 0.0004, forget to wind P = 0.012, forget to set (or set incorrectly) P = 0.008]

  33. Probability that the Backup (wind-up) clock fails? • P (backup clock failure) = P (faulty clock) + P (forget to wind) + P (forget to set) (an OR gate; summing is a rare-event approximation that neglects the tiny cross terms) • P (backup clock failure) = 0.0004 + 0.012 + 0.008 • P (backup clock failure) = 0.0204

  34. Probability that the Main (plug-in) clock fails? • P (main clock failure) = P (power outage) + P (faulty clock) + P (forget to set) (again the rare-event approximation for an OR gate) • P (main clock failure) = 0.012 + (0.0003 + 0.0004) + 0.008 • P (main clock failure) = 0.012 + 0.0007 + 0.008 • P (main clock failure) = 0.0207

  35. Probability that the Alarm Clock Does Not Wake You Up? • P (Alarm Clock Failure) = P (Main Clock Failure) x P (Backup Clock Failure) = 0.0207 x 0.0204 ≈ 0.000422 (an AND gate: both clocks must fail) • P (Alarm Clock Does Not Wake You Up) = P (Alarm Clock Failure) + P (You Don’t Hear It), and since P (You Don’t Hear It) is negligible: • P (Alarm Clock Does Not Wake You Up) ≈ 0.000422 = 4.22 x 10⁻⁴ • This falls into the REMOTE category
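
A sketch checking the whole alarm-clock tree, using exact OR gates (one minus the product of complements) rather than the slides’ rare-event sums; at probabilities this small the two approaches agree to within about 1%:

    # The whole alarm-clock tree with exact OR gates, assuming independence.
    def or_gate(*ps):
        """Exact OR of independent events: 1 minus the product of complements."""
        complement = 1.0
        for p in ps:
            complement *= 1 - p
        return 1 - complement

    p_faulty_main = or_gate(0.0003, 0.0004)         # electrical OR mechanical
    p_main = or_gate(0.012, p_faulty_main, 0.008)   # outage, faulty, not set
    p_backup = or_gate(0.0004, 0.012, 0.008)        # faulty, unwound, not set

    p_alarm_failure = p_main * p_backup             # AND: both clocks must fail
    print(p_main, p_backup, p_alarm_failure)        # ~0.0206, ~0.0203, ~4.2e-4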

  36. Conclusions • System Safety is a risk management strategy based on identifying, analyzing, and eliminating or mitigating hazards using a systems-based approach. • Hazards are evaluated and analyzed based on the severity and probability values for their corresponding mishap. • Probability values can be obtained by using basic probability rules and boolean logic in addition to historical data, published failure values, an understanding of potential failure paths in a system, and some simple calculations. • This allows us to quantitatively analyze risk levels and make an informed recommendation/decision.

  37. Sources • MIL-STD-882C • Introduction to System Safety: Tutorial for the 19th International System Safety Conference by Dick Church • An Engineer’s View of Human Error by Trevor Kletz • For All Practical Purposes: Mathematical Literacy in Today’s World, 7th edition (COMAP) • http://www.navsea.navy.mil/nswc/dahlgren/default.aspx • http://www.system-safety.org/about/ • http://www.weibull.com/basics/fault-tree/index.htm • http://www.fault-tree.net/papers/andrews-fta-tutor.pdf • http://www.fault-tree-analysis-software.com/fault-tree-analysis-basics.html • http://www.ntnu.no/ross/srt/slides/fta.pdf
