Software Reliability

20 February 2008


Presentation Transcript


  1. Software Reliability 20 February 2008

  2. Simplified Model of a Computer • Processor • Control Unit: retrieves the instruction, directs data movement • Arithmetic Logic Unit: performs the operations • Memory • Instructions: define an algorithm • Data: the information that it works on

  3. Points to Remember • Computers access information by location; they do not know the value stored there • Computers store numbers in fixed-size packets, which means that the numbers cannot grow indefinitely • Computers do not distinguish between different types of data (e.g., instructions, text, or numbers)
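
The last two points can be illustrated with a small Python sketch. The byte values and the helper name `store_in_8_bits` are made up for illustration; the point is only that the same bytes can be read as text or as numbers, and that a fixed-size cell silently drops whatever does not fit.

```python
import struct

# Four raw bytes in memory; the machine attaches no type to them.
raw = bytes([0x48, 0x69, 0x21, 0x00])

as_text = raw[:3].decode("ascii")            # first three bytes read as characters
as_uint32 = struct.unpack("<I", raw)[0]      # all four read as one little-endian unsigned integer
as_uint16_pair = struct.unpack("<HH", raw)   # the same bytes read as two 16-bit unsigned integers


def store_in_8_bits(value: int) -> int:
    """Keep only the low 8 bits, as a fixed-width 8-bit cell would (hypothetical helper)."""
    return value & 0xFF


print(as_text, as_uint32, as_uint16_pair)
print(store_in_8_bits(255), store_in_8_bits(256))
```

Nothing in `raw` says which reading is the right one; that knowledge lives only in the program that uses the bytes.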

  4. Where are Computers Used? • Finance: banking; stock market; commerce • Medical: diagnostics; life support; medical devices • Communications: television; radio; news; networks • Transportation: traffic signals; air traffic control; aircraft; spacecraft; trains; cars • Military: weapons systems; intelligence gathering • Energy: power plants; toxic chemical plants; oil & gas • Water: sewer • Buildings: HVAC; security; lights • Personal & household items

  5. But Things Go Wrong • Usability • Bad Design • Reliability • Programming Mistakes • Why is it so hard? • Why can’t we get it right?

  6. Is this a Problem? “Our civilization runs on software, yet the art of creating it continues to be a dark mystery, even to the experts. And the greater our ambitions, the more spectacular we seem to fail.” Scott Rosenberg, “Dreaming in Code”

  7. Usability: Computer as a tool • A useful tool should do WHAT? • Help you achieve WHAT you want, in less time, with minimal effort… • Do ALL that you need … • Having a Bad Day

  8. Characteristics of a Useful Tool • Easy to learn (intuitive) • Easy to remember • Standardized

  9. Remote controls with a zillion same-sized buttons

  10. Try plugging in this hair dryer…

  11. What’s wrong here?

  12. Lots of “Great” Examples • Great Interface Disasters • Bad Designs • Usability in the movies… yeah, right!

  13. Reliability • What is it? • Correct output – every time • How often it fails • Reasons that it fails • Lack of top management commitment • Lack of user commitment • Misunderstood requirements • Inadequate user involvement • Mismanaged user expectations • Scope creep • Lack of knowledge or skill • Source: Keil et al., “A Framework for Identifying Software Project Risks,” CACM 41:11, November 1998

  14. 80% of software projects fail • 50% challenged • 2x budget • 2x completion time • 2/3 planned function • 30% impaired • Scrapped • Source: Standish Group, 1995

  15. Is the Problem Overstated? • More recently: Sauer et al. claim 67% “delivered close to budget, schedule, and scope expectations” • NIST estimates the cost to the US economy from inadequate software testing at > $59 billion/yr (NIST Planning Report 02-3)

  16. What is a Bug?

  17. Bug • Problems in code that cause it to behave in an unintended, unanticipated or unpredictable manner • Origin • Grace Hopper (1906-1992), 1947: moth in a relay, "First actual case of bug being found." • Thomas Edison used the term in 1878: "Bugs", as such little faults and difficulties are called

  18. First Computer Bug

  19. Why are bugs hard to find? • The error can appear in another program • Device drivers, memory management • The error may only occur occasionally • May require multiple conditions to occur

  20. Classes of Problems • Poorly designed software • Poorly understood requirements • Poorly designed user interfaces • Improper use • Data entry problems • Simple coding errors

  21. Can’t We Test Out the Problems? • To establish that the probability of software failure is less than 10^-9 in 10 hours, the testing required with one computer exceeds 1 million years • Source: Butler and Finelli, “The Infeasibility of Experimental Quantification of Life-Critical Software Reliability”
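
The order of magnitude can be checked with simple arithmetic. This is a rough sketch, not the statistical argument from the paper: it assumes a constant failure rate and takes the expected time to see a single failure as the minimum test time. The function name and the `computers` parameter are illustrative.

```python
HOURS_PER_YEAR = 24 * 365.25


def required_test_years(failure_probability: float,
                        mission_hours: float,
                        computers: int = 1) -> float:
    """Rough test time needed to observe one failure at the target rate."""
    # For very small probabilities, rate is approximately probability / duration.
    rate_per_hour = failure_probability / mission_hours
    # At a constant rate, the expected time to one failure is 1 / rate.
    test_hours = 1.0 / rate_per_hour
    # Running several computers in parallel divides the calendar time.
    return test_hours / HOURS_PER_YEAR / computers


years = required_test_years(1e-9, 10.0)
print(f"{years:,.0f} years on one computer")
```

A probability of 10^-9 over 10 hours corresponds to roughly one failure per 10^10 operating hours, which is where the figure of more than a million years comes from.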

  22. Simple Problems • Tampa couple was billed $4,062,599.57 for a month’s electricity • Correct bill was $146.76 • Input error: clearly no adequate check for reasonable values • High school freshman banned from football because of drug use in middle school • Actual offense was chewing gum and being tardy • Different codes not properly translated: systems are only as good as their weakest links
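
A check for reasonable values does not need to be elaborate. The sketch below is one possible rule, not how any real billing system works: the function name, the ten-times threshold, and the billing history are all assumptions made for illustration.

```python
from typing import Sequence


def is_plausible_bill(amount: float,
                      recent_amounts: Sequence[float],
                      max_ratio: float = 10.0) -> bool:
    """Return False when a bill is wildly out of line with the customer's history."""
    if amount < 0:
        return False
    if not recent_amounts:
        return True  # no history to compare against; a real system would need another rule
    # Middle value of the history (upper median for even-length histories).
    typical = sorted(recent_amounts)[len(recent_amounts) // 2]
    return amount <= typical * max_ratio


history = [130.00, 150.00, 160.00]             # hypothetical earlier bills
print(is_plausible_bill(146.76, history))       # the correct bill
print(is_plausible_bill(4062599.57, history))   # the bill that was sent
```

A bill that fails the check would be held for a person to review rather than mailed.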

  23. User Interface Bug • Usability Issue • Afghanistan War (December 2001) • Friendly fire killed 3 and injured 20 when a satellite-guided bomb landed on a battalion command post • GPS receiver was used to determine coordinates • Change battery • What should come up? • www.washingtonpost.com/ac2/wp-dyn/A8853-2002Mar23

  24. Denver Airport Baggage System (1995) How stuff works

  25. Denver Airport Baggage System (1995) • 4 years in development at cost of $193M • Massively complex system • 4000 cars, 21 miles of track, scanners, photocells, 300 computers • Cars misrouted and crashed, baggage lost and damaged • Delayed opening cost $1.1M/day • When airport opened a year late only one airline used it www.cis.gsu.edu/~mmoore/CIS3300/handouts/SciAmSept1994.html

  26. Denver Airport System • Examples of bugs: • Photocell could not detect bags on the belt and therefore didn’t stop the system • System lost track of the state of carts during jams • Timing between conveyor belts and carts not properly synchronized • Overall • Not just software glitches • Very complex, poorly engineered system

  27. Ariane 5 (1996) Integer overflow Software error

  28. External view Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded

  29. External view

  30. Cost • Development cost: $7 billion • Delay of more than one year • One set of four identical, uninsured scientific satellites + one rocket: $500,000,000

  31. What Happened? • Overflow: tried to put too big a number into too small a space • Even worse – the feature that caused the problem wasn’t needed! It was only needed to set up the launch! archive.eiffel.com/doc/manuals/technology/contract/ariane/page.html
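
"Too big a number into too small a space" can be sketched in Python. The slide does not give the data types, so the 16-bit signed target and the sample values here are assumptions for illustration; both function names are hypothetical. The first function fails on an out-of-range value, the second clamps it.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Convert to a 16-bit signed integer; raises struct.error when it does not fit."""
    packed = struct.pack("<h", int(value))   # "<h" is a 2-byte signed integer
    return struct.unpack("<h", packed)[0]


def to_int16_saturating(value: float) -> int:
    """Clamp to the 16-bit signed range instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


print(to_int16_unchecked(1234.5))        # fits
print(to_int16_saturating(40000.0))      # clamped to the largest 16-bit value
try:
    to_int16_unchecked(40000.0)          # does not fit
except struct.error as exc:
    print("conversion failed:", exc)
```

Either behavior may be acceptable; what matters is that the out-of-range case is decided on purpose rather than left to an unhandled error.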

  32. Bank of New York: Nov 20, 1985 • BoNY: Nation’s largest clearer of Govt securities. • Software to track Federal securities transactions wrote new information on top of old. • Feds debited the bank for each transaction but bank did not know who owed it how much. • 90 minutes => $32 Billion overdraft!

  33. Cost of Bug • Bank had to borrow $24 billion from the Federal Reserve. Interest paid ~$5 million for 1 day. (Annual earnings of bank ~$120 million) • BoNY share prices dropped by 25¢ • Federal funds rate dropped from 8.4% to 5.5% • System down for 28 hours. • Fear of financial crisis caused increase in price of platinum!

  34. Cause of bug • Message buffer counter in the BoNY system was 16 bits long. • Counters at the Fed (and other banks) were 32 bits long. • More than 32,000 transactions that morning! => Counter overflow • Securities database corrupted.
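
The mismatch between the two counter widths can be simulated. The sketch assumes signed counters, which the "more than 32,000" figure suggests but the slide does not state; the function name and the transaction count of 33,000 are illustrative.

```python
def increment_signed(counter: int, bits: int) -> int:
    """Add 1 to a counter stored in a signed field of the given width, wrapping on overflow."""
    mask = (1 << bits) - 1
    raw = (counter + 1) & mask                 # keep only the bits that fit
    sign_bit = 1 << (bits - 1)
    # Reinterpret the top bit as a sign, as two's complement storage would.
    return raw - (1 << bits) if raw & sign_bit else raw


counter_16 = 0
counter_32 = 0
for _ in range(33_000):                        # hypothetical number of transactions
    counter_16 = increment_signed(counter_16, 16)
    counter_32 = increment_signed(counter_32, 32)

print(counter_16, counter_32)
```

After transaction 32,767 the 16-bit counter wraps to a negative number while the 32-bit counter keeps counting, so the two systems no longer agree on which message is which.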

  35. The Drama continues… • Trying to correct it – they copied corrupted data over the backup. • Lost a few hours because of this. • Reference: Wiener, Digital Woes, 1993

  36. And then there is … • Therac-25

  37. Therac-25 • Landmark case of how things can go terribly wrong • Medical linear accelerator: radiation therapy for cancer patients • Used to zap tumors with high energy beams • Electron beams for shallow tissue • X-ray photons for deeper tissue • Eleven Therac-25s were installed: • Six in Canada • Five in the United States • Developed by Atomic Energy of Canada Limited (AECL).

  38. Therac-25 • Improvements over Therac-20: • Uses new “double pass” technique to accelerate electrons. • Machine itself takes up less space. • Other differences from the Therac-20: • Software now coupled to the rest of the system and responsible for safety checks. • Hardware safety interlocks removed. • “Easier to use.”

  39. Understanding Therac • Two treatment modes • Why > 1 mode in one machine?

  40. Therac-25 Turntable • Figure labels: Field Light Mirror; Counterweight; Beam Flattener (X-ray Mode); Turntable; Scan Magnet (Electron Mode)

  41. 1985-1987: Six known accidents • June 1985: Patient in Marietta, GA received overdose • July 1985: Hamilton, Ontario: patient severely burned, died that November. • December 1985: Patient in Yakima, WA received overdose

  42. Vernon Kidd (4th case) • Early March 1986, Tyler, TX: • Received dose > 100 times too high • Complained he felt burned… • Engineer: It’s not possible for Therac-25 to give an overdose. • Engineering firm: Machine does not appear capable of giving a patient an electrical shock... • Died 5 months later • Machine put back in use late March

  43. 3 Weeks Later: Ray Cox • Second accident in Tyler, TX • Same operator • Patient died 1 month later • This time they were able to reproduce the error

  44. What Went Wrong? • User Interface • Operator entered code for high energy rather than low energy • “Malfunction message” • Operator entered “Proceed” because system was known to give quirky errors • Result • Turntable was in the wrong position

  45. What would cause that to happen? • Race conditions. • Several different race condition bugs. • Overflow error. • The turntable position was not checked every 256th time the “Class3” variable is incremented. • No hardware safety interlocks. • Wrong information on the console. • Non-descriptive error messages. • “Malfunction 54” • “H-tilt” • User-override-able error modes.

  46. One of the software design errors • Software included a set-up test before each treatment… • Tested various components … • Variable incremented with each part of test: X = X + 1 • 8 bits… • Can store values from 0 through 255

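  47. 8-bit code • Place values: 128 64 32 16 8 4 2 1 • 1 1 1 1 1 1 1 1 = 255 • Adding 1 gives 256 = 1 0 0 0 0 0 0 0 0, which needs a ninth bit; the eight bits that are kept read 0 0 0 0 0 0 0 0 = 0 • IF X = 0 THEN PROCEED with treatment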
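
The wraparound can be simulated in a few lines. This is a sketch of the mechanism described on the slides, not the Therac-25 code: the function names and the count of 600 passes are made up, and the second function shows only one possible way to avoid the rollover.

```python
from typing import List


def passes_where_check_is_skipped(passes: int) -> List[int]:
    """Simulate X = X + 1 in 8 bits and report the passes on which X reads as 0."""
    x = 0
    skipped = []
    for n in range(1, passes + 1):
        x = (x + 1) & 0xFF        # an 8-bit variable keeps only the low 8 bits
        if x == 0:                # IF X = 0 THEN PROCEED
            skipped.append(n)
    return skipped


def passes_where_check_is_skipped_with_flag(passes: int) -> List[int]:
    """Same loop, but X is set to a fixed nonzero value instead of being incremented."""
    skipped = []
    for n in range(1, passes + 1):
        x = 1                     # a flag that is set, not counted, cannot roll over
        if x == 0:
            skipped.append(n)
    return skipped


print(passes_where_check_is_skipped(600))
print(passes_where_check_is_skipped_with_flag(600))
```

With the incremented variable, every 256th pass looks exactly like "nothing to check", which is why the failure appeared only occasionally.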

  48. Source of the Bug • Incompetent engineering. • Safety analysis excluded the software! • No usability testing.

  49. Sources • Leveson, N., Turner, C. S., An Investigation of the Therac-25 Accidents. IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41. http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html • The authors: Nancy Leveson, Clark S. Turner

  50. “The ethical dimensions of computer reliability are bound up with the nature of software, and the complexity of such systems.”
