Martyn Thomas Founder: Praxis High Integrity Systems Ltd

Software in Practice: a series of four lectures on why software projects fail, and what you can do about it. Martyn Thomas, Founder: Praxis High Integrity Systems Ltd; Visiting Professor of Software Engineering, Oxford University Computing Laboratory. Lecture 2: Software Failures.

Presentation Transcript


  1. Software in Practice: a series of four lectures on why software projects fail, and what you can do about it Martyn Thomas Founder: Praxis High Integrity Systems Ltd Visiting Professor of Software Engineering, Oxford University Computing Laboratory

  2. Lecture 2: Software Failures • Developing software is very difficult • it is easy to make mistakes … • … and they are unlikely to be found by testing • Errors can be introduced in every phase of software development: • requirements capture, specification, design, programming, building, error correction, modification, re-use ...

  3. Finding faults by testing?

      type Alert is (Warning, Caution, Advisory);

      function RingBell(Event : Alert) return Boolean
      -- return True for Event = Warning or Event = Caution,
      -- return False for Event = Advisory
      is
         Result : Boolean;
      begin
         if Event = Warning then
            Result := True;
         elsif Event = Advisory then
            Result := False;
         end if;
         return Result;
      end RingBell;

      -- C130J code: Caution returns uninitialised (usually TRUE, as required).
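One way to have the compiler, rather than testing, catch the missing Caution branch in the RingBell function above is an exhaustive case statement: Ada rejects a case over an enumeration that leaves a value uncovered, unless an others branch is present. The following is a minimal sketch of that idea, not the fix actually applied to the C130J code; the names Ring_Bell_Demo and Ring_Bell are illustrative.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Ring_Bell_Demo is

   type Alert is (Warning, Caution, Advisory);

   --  Return True for Warning or Caution, False for Advisory.
   function Ring_Bell (Event : Alert) return Boolean is
   begin
      --  Every value of Alert must appear in some branch; removing
      --  Caution from the first branch is a compile-time error.
      case Event is
         when Warning | Caution =>
            return True;
         when Advisory =>
            return False;
      end case;
   end Ring_Bell;

begin
   --  Exercise all three values of the enumeration.
   for Event in Alert loop
      Put_Line
        (Alert'Image (Event) & " -> " & Boolean'Image (Ring_Bell (Event)));
   end loop;
end Ring_Bell_Demo;
```

There is no local Result variable here, so no path can return an uninitialised value. A test of all three inputs against the original function could still have passed, since the uninitialised result was usually True.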

  4. Taurus • Taurus was a £50m system to provide electronic share trading for the London Stock Exchange in 1991, removing paper share certificates. (This would revolutionise the job of share registrars). • It overran: a recovery strategy was put in place. • It reached 85% complete and a date for cut-over was announced later the same year. A few weeks later, the project was cancelled. • City firms had wasted £350m on new systems to interface to Taurus.

  5. Taurus: a requirements problem • The system was over-complicated and had failed to reconcile conflicting requirements, especially those from the share registrars.

  6. This lesson has not been learnt ... • No public-sector civil project has ever been put out to tender with a formal specification. • For example, eFDP took two years to agree a set of requirements. The remaining difficulties were put in the requirements as six-month “design studies”. Four weeks after the RfP, the project was abandoned.

  7. Nancy Leveson’s Torpedo: gaps in the specification • How to stop a torpedo blowing up the launch ship? • If it malfunctions or starts to come back: • sink it • blow it up • On live test, a torpedo failed whilst still in the torpedo tube …

  8. London Ambulance Service (LAS): they took the lowest bid ...

  9. LAS: The Manual System • LAS covers 600 sq miles, carries >5000 patients each day; handles 2000-2500 calls daily including 1300-1600 emergency calls. 750 ambulances. • Emergency call written on a form. Location looked up on a map. Form and map co-ordinates placed on a conveyor belt to central dispatch, who remove duplicates and route to a zone to contact an ambulance. • This took ~3 minutes and 200 staff. • Decision to implement Computer-Aided Dispatch.

  10. LAS: Computer Aided Dispatch (CAD) version 1 • 1980s. £7.5 million spent. System built but failed its load test and was abandoned. LAS sued the Supplier, who had not understood the requirement properly. • 1990: Requirements started for Version 2. • New CAD to be “fully automated”. Automatic lookup of location; automatic selection of the best ambulance. • No similar system in existence

  11. LAS: CAD Version 2 • New System much more complex than Version 1: CAD+Map Display+Automatic Vehicle Location Service (AVLS) • Andersen Consulting had estimated that a package solution without AVLS, if one existed, would cost £1.5m and take 19 months to implement. • This seems to have become the project budget for a custom system.

  12. LAS: Version 2 bids • 35 companies looked, 19 bid, most said it needed more time and money than the budget • The only bidder who promised to meet all the requirements on time and within budget was a consortium of Apricot (hardware), Systems Options (SO - a small software house) and Datatrak (AVLS). • SO bid only £35K to develop the CAD software! Total bid £937,463 • The next lowest bid was £700K more!!

  13. LAS: Version 2 development • Phase 1 system: no radio messaging • client and server lock-ups • Phase 2 system: with radio messaging • unstable, overloaded at shift change, radio blackspots, unable to cope with staff taking the “wrong” vehicle. • Managers decided to go live on 26 October 1992, ignoring independent review

  14. LAS: Result • 26 October, control room reconfigured to use CAD. No manual backup system. • System progressively lost ambulances • screens filled with exception messages, that scrolled off and were lost • system delayed incidents, waiting for ambulances, so public called again, increasing the workload. • Several or zero ambulances sent to each incident. • Staff stress caused operator errors • Network congestion, slowdown, system collapse. • Oct 27th, semi-manual operation but system crashed through memory leak. System abandoned.

  15. Radiotherapy

  16. Therac 25 • (not the system on the previous slide) • A system for treatment of tumours • Mode 1: low energy electron beam treatment • Mode 2: very high energy beam (25MeV) with a thick metal plate in front, for X-rays. • Therac-20 had a mechanical switch to change beam, and an interlock to stop change to high energy without the plate. • Therac 25 interlock was in software.

  17. Therac-25 User Interface • Set up treatment time • Electron beam, type e • X-ray beam, type x. • System puts the plate in place before switching beam to X-rays. • System: “Beam Ready”, Operator types b to start treatment. • Operator station in a different room from the patient, to protect staff from radiation

  18. Therac: Accident • Ray Cox, oil worker, on the table for his regular e-beam treatment for a tumour on his shoulder. • Operator goes to the other room • types x, realises mistake, types “edit”, e, “enter” - all within 8 seconds. System says “Malfunction” • cleared the error, got “beam ready” and hit b • same error message, so tried again. Twice. • Ray felt a painful jolt - not like previous treatments. Shouted in pain but no-one heard. Third time he got off the table and went to find the nurse.

  19. Therac 25: outcome • Ray Cox died of radiation overdose 4 months later. • Meanwhile another patient experienced the same accident, but this time a technician realised there was a problem and reported it. • The same problem had occurred in Georgia, Canada and Washington.

  20. Therac: what went wrong? • The operator’s actions exposed a race-condition in the (multi-tasking) code. • The result was a full-power beam without the plate in place. 125-fold overdose! • The particular sequence of actions had never occurred in testing. • Made worse because audio intercom and video link both out of service. System error messages not informative (and usually meant treatment had not occurred).
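The class of fault named above, a race on shared state in multi-tasking code, can be illustrated in general terms. The sketch below is not a reconstruction of the Therac-25 software; every name in it (Machine, Select_Mode, Safe_To_Fire, Editor) is hypothetical. It shows one standard Ada mechanism, a protected object, that updates beam energy and plate position as a single indivisible step, so a concurrent reader never sees high energy without the plate.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Interlock_Demo is

   type Beam_Mode is (Electron, X_Ray);
   type Energy_Level is (Low, High);

   Unsafe_Seen : Natural := 0;

   --  Hypothetical shared machine state. Select_Mode runs with
   --  exclusive access, so no reader sees a half-finished change.
   protected Machine is
      procedure Select_Mode (Mode : Beam_Mode);
      function Safe_To_Fire return Boolean;
   private
      Energy   : Energy_Level := Low;
      Plate_In : Boolean      := False;
   end Machine;

   protected body Machine is
      procedure Select_Mode (Mode : Beam_Mode) is
      begin
         case Mode is
            when Electron =>
               Energy   := Low;
               Plate_In := False;
            when X_Ray =>
               Plate_In := True;
               Energy   := High;
         end case;
      end Select_Mode;

      --  Safe means: low energy, or high energy with the plate in.
      function Safe_To_Fire return Boolean is
      begin
         return Energy = Low or else Plate_In;
      end Safe_To_Fire;
   end Machine;

   --  Stands in for an operator rapidly editing the prescription.
   task Editor;

   task body Editor is
   begin
      for I in 1 .. 10_000 loop
         Machine.Select_Mode (X_Ray);
         Machine.Select_Mode (Electron);
      end loop;
   end Editor;

begin
   --  Runs concurrently with Editor and samples the shared state.
   for I in 1 .. 10_000 loop
      if not Machine.Safe_To_Fire then
         Unsafe_Seen := Unsafe_Seen + 1;
      end if;
   end loop;
   Put_Line ("Unsafe states observed:" & Natural'Image (Unsafe_Seen));
end Interlock_Demo;
```

Because both fields change inside one protected procedure, the count of unsafe states should be zero however the two tasks interleave. If the same two assignments were made on unprotected variables, a reader could observe the intermediate state, and, as on the slide, testing might never produce the interleaving that exposes it.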

  21. Therac: Failings • Safety Case claimed 10^-11 probability for “computer selects wrong energy”. No evidence for the claim. • No low-complexity protection system (fuse and/or interlock). • Poor software engineering. • Poor investigation of reported accidents. Manufacturer did not consider possible software fault until several accidents had occurred.

  22. Ariane V: European Space Agency launch vehicle

  23. Ariane V: Explosion • Initial launch exploded • Failure traced to the inertial navigation system (INS). • Overflow on conversion from 64-bit floating point to 16-bit integer; exception not trapped • primary and back-up INS both failed for the same reason, and stopped • loss of INS led to auto-destruction.

  24. Ariane V: cause of failure • INS software re-used from Ariane IV • Ariane IV flight profile guaranteed this parameter could not overflow • Ariane V specification was different, in a way that affected the requirements for the INS. • A formal specification would have caught this fault.
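The kind of conversion involved in the Ariane V failure, from a 64-bit floating point value to a 16-bit integer, can be guarded by an explicit range check. The sketch below is illustrative only and is not the INS code: the names Int_16, To_Int_16 and Show are hypothetical, and it assumes that Long_Float is a 64-bit type, as it is on common compilers such as GNAT.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Conversion_Demo is

   type Int_16 is range -32_768 .. 32_767;

   --  Hypothetical helper: convert only when the value fits, and
   --  tell the caller when it does not.
   procedure To_Int_16
     (Value    : Long_Float;
      Result   : out Int_16;
      In_Range : out Boolean) is
   begin
      --  'Valid rejects values such as NaN before the comparisons.
      In_Range := Value'Valid
        and then Value >= Long_Float (Int_16'First)
        and then Value <= Long_Float (Int_16'Last);
      if In_Range then
         Result := Int_16 (Value);
      else
         Result := 0;
      end if;
   end To_Int_16;

   procedure Show (Value : Long_Float) is
      Converted : Int_16;
      Fits      : Boolean;
   begin
      To_Int_16 (Value, Converted, Fits);
      if Fits then
         Put_Line ("converted:" & Int_16'Image (Converted));
      else
         Put_Line ("out of range, conversion refused");
      end if;
   end Show;

begin
   Show (1_234.0);   --  fits in 16 bits
   Show (1.0E6);     --  does not fit
end Conversion_Demo;
```

The check only makes the out-of-range case visible to the caller. What the system should do in that case is a requirements question, which is the point of the slide: the assumption that the value could never overflow held for the Ariane IV flight profile but not for Ariane V.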

  25. Conclusions (1) • Software development is hard - all sorts of things go wrong. • It is an engineering task. You dare not do without discipline and rigour. • Even the best people make mistakes. That’s why we use reviews, checklists, type-checkers and other static analysis tools, testing, and proof.

  26. Conclusions (2) A safety-critical software team must have: • Good domain knowledge • Excellent systems engineering / software engineering knowledge, skills, processes • Good knowledge of safety assessment principles, standards, practice and law, • … and finally ...

  27. …a strong safety culture Developing safety-critical software is the subject of my next lecture.