
An Analysis of Causation in Aerospace Accidents. Kathryn Anne Weiss weissk@mit.edu http://www.mit.edu/~weissk Complex Systems Research Laboratory (CSRL) Department of Aeronautics and Astronautics Massachusetts Institute of Technology Tuesday, September 7, 2004.


Presentation Transcript


  1. An Analysis of Causation in Aerospace Accidents • Kathryn Anne Weiss, weissk@mit.edu, http://www.mit.edu/~weissk • Complex Systems Research Laboratory (CSRL), Department of Aeronautics and Astronautics, Massachusetts Institute of Technology • Tuesday, September 7, 2004 • This paper was presented at the Digital Avionics Systems Conference in 2001. This paper and similar papers on accidents, accident modeling and accident reports can be found at http://sunnyday.mit.edu/accidents/index.html

  2. Recent Aerospace Losses • Ariane 5 • Titan/Centaur/Milstar • SOlar Heliospheric Observatory • Mars Climate Orbiter

  3. Ariane 5 • On June 4, 1996, 40 seconds after launch, the launcher veered off its nominal flight path and exploded • Reused the IRS software from Ariane 4 on the Ariane 5 • The time sequence of the Ariane 5 lift-off is significantly different from that of the Ariane 4 • A function was left in the Ariane 5 software for commonality reasons, “based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4” • An exception was raised, causing the nozzles of the solid rocket boosters to deflect and subjecting the launcher to high aerodynamic loads
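The slide says only that "an exception was raised." The Ariane 501 inquiry board traced it to a conversion of a 64-bit floating-point value to a 16-bit signed integer that overflowed on the Ariane 5 trajectory. The sketch below is a rough illustration of that class of failure, not the flight code (which was written in Ada): the function names and the sample values are hypothetical, and the saturating variant is shown only as one possible protection, not as what the report prescribes.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Raised when a value does not fit the 16-bit target type."""


def to_int16_checked(value: float) -> int:
    """Convert to a signed 16-bit integer; raise if the value is out of range."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OperandError(f"{value} does not fit in a signed 16-bit integer")
    return n


def to_int16_saturating(value: float) -> int:
    """Convert to a signed 16-bit integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


# Hypothetical values: one inside the range the reused software was
# designed for, one outside it.
in_range_value = 1200.0
out_of_range_value = 40000.0

print(to_int16_checked(in_range_value))
print(to_int16_saturating(out_of_range_value))
try:
    to_int16_checked(out_of_range_value)
except OperandError as exc:
    print("exception:", exc)
```

The point of the sketch is the reuse assumption: a conversion that can never overflow on one vehicle's trajectory can overflow on another's, so the range analysis has to be redone when the software is moved.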

  4. Mars Climate Orbiter • Relied heavily on previous designs of Mars Global Surveyor (MGS) and Pathfinder • There was an error in the spacecraft’s navigation measurements of nearly 100 km, which resulted in a much lower altitude than expected during Mars orbit insertion (MOI) and led to the vehicle’s break-up in the atmosphere • The conversion factor from English to metric units was erroneously left out of the Angular Momentum Desaturation (AMD) files • The interface specification required that the impulse-bit calculations be done using metric units • The software supplied by a vendor used English units
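To make the size of the unit error concrete, here is a minimal sketch of the missing conversion, assuming the impulse values were in pound-force seconds where newton seconds were required. The function name and the sample impulse are hypothetical; only the conversion factor (1 lbf = 4.4482216152605 N) is a fixed physical constant.

```python
LBF_S_TO_N_S = 4.4482216152605  # 1 pound-force second expressed in newton seconds


def impulse_to_si(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton seconds."""
    return impulse_lbf_s * LBF_S_TO_N_S


# Hypothetical impulse value produced in English units.
reported = 2.0                      # lbf*s
correct_si = impulse_to_si(reported)

# If the consumer treats the raw number as N*s, it underestimates the
# true impulse by the conversion factor.
underestimate_ratio = correct_si / reported
print(f"true impulse: {correct_si:.4f} N*s, read as: {reported:.4f} N*s")
print(f"underestimated by a factor of {underestimate_ratio:.2f}")
```

Each individual thruster event is small, so a factor of about 4.45 on every one of them goes unnoticed until the accumulated effect shows up in the trajectory.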

  5. Titan/Centaur/Milstar • Mission to place Milstar in a geosynchronous orbit • Roll rate filter constant should have been entered as –1.992476, but was entered as –0.1992476 • Centaur/Milstar began experiencing instability about the roll axis during the first burn • Instability greatly magnified during Centaur’s second main engine burn, resulting in vehicle tumbling • The Centaur attempted to compensate with its RCS, which ultimately depleted available propellant • The third engine burn terminated early • Milstar satellite placed in a low elliptical final orbit
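The roll rate filter error is a misplaced decimal point, which is the kind of data-entry mistake an automated comparison against the specified value can flag. The following is a hedged sketch of such a check, not a description of the process the program actually used; the function names, the constant name and the tolerance are assumptions.

```python
import math


def decimal_shift(entered: float, expected: float) -> int:
    """Return the power-of-ten offset between entered and expected magnitudes."""
    return round(math.log10(abs(entered) / abs(expected)))


def check_constant(name: str, entered: float, expected: float,
                   rel_tol: float = 1e-6) -> str:
    """Compare a manually entered constant with its specified value."""
    if math.isclose(entered, expected, rel_tol=rel_tol):
        return "ok"
    shift = decimal_shift(entered, expected)
    # Same digits at a different power of ten suggests a misplaced decimal point.
    if shift != 0 and math.isclose(entered / 10 ** shift, expected, rel_tol=rel_tol):
        return f"{name}: decimal point misplaced (off by 10^{shift})"
    return f"{name}: mismatch (entered {entered}, expected {expected})"


# Values from the slide; the constant name is hypothetical.
print(check_constant("roll_rate_filter", -0.1992476, -1.992476))
print(check_constant("roll_rate_filter", -1.992476, -1.992476))
```

A check like this only helps if the expected value comes from an independent source, such as the specification, rather than from the same hand-entered file.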

  6. SOHO Background • SOHO, or the SOlar Heliospheric Observatory, is a joint effort between NASA and ESA to perform helioseismology and monitor the solar atmosphere, corona and wind • SOHO was launched on December 2, 1995, was declared fully operational in April of 1996, and completed a successful two-year primary mission in May of 1998 • It then entered into its extended mission phase • After roughly two months of nominal activity, contact with SOHO was lost June 25, 1998

  7. SOHO Loss (1/4) • The loss was preceded by a routine calibration of the spacecraft's three roll gyroscopes (named A, B and C) and by a momentum management maneuver • In order to increase the amount of science done during the mission and to increase the gyros’ lifespans, a decision was made to compress the timeline of the operational procedures for momentum management, gyro calibration and science instrument calibration into one continuous sequence • The previous process had included a day between completing gyro calibration and beginning the momentum management procedures

  8. SOHO Loss (2/4) • Because the gyro calibration in the new compressed timeline was immediately followed by a momentum management procedure, despinning the gyros at the end of the gyro calibration and re-enabling the on-board software gyro control function was not required • However, after the gyro calibration, Gyro A was specifically despun in order to conserve its life, while Gyros B and C remained active

  9. SOHO Loss (3/4) • The modified predefined command sequence in the on-board control software had an error; it did not contain a necessary function to reactivate Gyro A, which was needed by the Emergency Sun Reacquisition • This omission resulted in the removal of the functionality of the spacecraft’s normal safe mode, ESR, and ultimately caused the sequence of events that led to the loss of telemetry • In addition, there was another error in the software that resulted in leaving Gyro B in its high gain setting following the momentum management maneuver • This error originally triggered the ESR

  10. SOHO Loss (4/4) • The first error was contained within a software function called A_CONFIG_N • ESR requires the use of Gyro A for roll control • Any procedure that spins down Gyro A must set a flag in the computer to respin Gyro A whenever the safe mode is triggered • When A_CONFIG_N was modified, the software enable command was omitted due to “a lack of system knowledge of the person who modified the procedure” • Because the change had not been properly communicated, the operator procedures did not indicate that Gyro A had been spun down
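The rule stated above (any procedure that spins down Gyro A must set the flag that lets the safe mode respin it) can be expressed as a simple check over a command sequence. This is only an illustrative sketch: the command names and the list-of-strings representation are hypothetical and do not reflect the actual SOHO command language or the contents of A_CONFIG_N.

```python
from typing import Sequence

# Hypothetical command names, for illustration only.
SPIN_DOWN_GYRO_A = "SPIN_DOWN_GYRO_A"
ENABLE_GYRO_A_RESPIN = "ENABLE_GYRO_A_RESPIN_ON_ESR"


def missing_respin_enable(procedure: Sequence[str]) -> bool:
    """Return True if Gyro A is spun down without the respin flag being set afterwards."""
    needs_enable = False
    for command in procedure:
        if command == SPIN_DOWN_GYRO_A:
            needs_enable = True
        elif command == ENABLE_GYRO_A_RESPIN:
            needs_enable = False
    return needs_enable


# A modified procedure that omits the enable command, and a corrected one.
modified = ["CALIBRATE_GYROS", SPIN_DOWN_GYRO_A, "MOMENTUM_MANAGEMENT"]
corrected = ["CALIBRATE_GYROS", SPIN_DOWN_GYRO_A, ENABLE_GYRO_A_RESPIN,
             "MOMENTUM_MANAGEMENT"]

print(missing_respin_enable(modified))
print(missing_respin_enable(corrected))
```

Encoding the rule as a check moves it out of one person's system knowledge and into the review process, which is where the slide locates the original gap.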

  11. Lessons Learned • We can learn lessons from these and other (all very different) aerospace accidents by examining the factors common among them • These factors are systemic and indicative of many accidents involving aerospace software systems • Systemic factors can be grouped into the following categories: • Flaws in the Safety Culture • Ineffective Organizational Structure • Ineffective Technical Activities

  12. Flaws in the Safety Culture • Overconfidence and Complacency • Success is ironically one of the progenitors of accidents • In SOHO, this led to inadequate testing and review of changes to ground-issued commands, a false sense of confidence in the team's ability to recover from an ESR, the use of challenging schedules, etc. • Discounting or Not Understanding Software Risks • An engineering culture that has unrealistic expectations about software and the use of computers • Changing (SOHO) software without introducing errors or undesired behavior is much more difficult than building correct software initially

  13. Flaws in the Safety Culture (Cont.) • Assuming Risk Decreases over Time • In the Titan/Centaur/Milstar loss, the Titan Program Office decided that because software was “mature, stable, and had not experienced problems in the past,” they could use the limited resources available after the initial development effort to address hardware issues • Inadequate Emphasis on Risk Management • Incorrect Prioritization of Changes • Slow Understanding of the Problems Associated with Human-Automation Mismatch

  14. Ineffective Organizational Structure • Diffusion of Responsibility and Authority • In almost all of the spacecraft accidents, there appeared to be serious organizational and communication problems among the geographically dispersed partners • Low-Level Status or Missing System Safety Program • In the SOHO report, no mention is made of any formal safety program • Limited Communication Channels and Poor Information Flow

  15. Ineffective Technical Activities • Flawed or Inadequate Review Process • For SOHO, the changes to the ground-generated commands were subjected to very limited review • Inadequate Specifications • Software-related accidents are almost always due to misunderstandings about what the software should do • Inadequate System and Software Engineering • Software Reuse Without Appropriate Analysis of its Safety • Two of the spacecraft accidents, Titan and Ariane, involved reused software originally developed for other systems

  16. Ineffective Technical Activities (Cont.) • Unnecessary Complexity and Software Functions • The Ariane 5 and Titan IVB-32 accidents clearly involved software that was not needed, but surprisingly the decision to put in or to keep these features (in the case of reuse) was not questioned in the accident reports • Inadequate System Safety Engineering • Test and Simulation Environments that do not Match the Operational Environment • A general principle in testing aerospace systems is to “fly what you test and test what you fly”

  17. Ineffective Technical Activities (Cont.) • Deficiencies in Safety-Related Information Collection and Use • Operational Personnel Not Understanding the Automation • The SOHO report says that the software enable function had not been included as part of the modification to A_CONFIG_N due to a lack of system knowledge of the person who modified the procedure • Inadequate and Ineffective Cognitive Engineering and Feedback • SOHO controllers did not have the information they needed about the state of the gyros and the spacecraft in general to make appropriate decisions

  18. Conclusions • By examining recent, software-related aerospace accidents, we notice similarities, or systemic factors, involved in the losses • These similarities and parallels should help in focusing efforts to prevent future accidents
