
Software Defects and their Impact on System Availability

Software Defects and their Impact on System Availability: a study of field failures in operating systems (IBM T.J. Watson, 1991). Presenter: Shan Lu. Why software defects? They are more severe than hardware defects. Software causes 60% of outages [Gray’90]. They are not well understood or studied.


Presentation Transcript


  1. Software Defects and their Impact on System Availability -- A study of field failures in operating systems • IBM T.J. Watson, 1991 • Presenter: Shan Lu

  2. Why software defects? • More severe than hardware defects • Software causes 60% of outages [Gray’90] • Not well understood or studied • Different characteristics from hardware • A bug cannot be compared with a faulty hardware component

  3. Why ‘field’ failures? • Field failures • Failures that happen in production runs • Different from defects detected in development & testing • Reflect real-world ‘impact’

  4. Overview • Analyze field failures in operating systems • Get statistics on • Impact of errors • Error type breakdown • Error trigger breakdown • Failure symptom distribution • Others • Use these results to guide future research

  5. Outline • Motivation • Overview • Data source • Design • Analysis results • These results indicate … • Related work

  6. Data source • RETAIN database • Remote Technical Assistance Information Network • APAR • Symptoms • Context & environment • How to fix • Standard attributes • Severity (1 to 4) • HIPER • IPL • Manually extract • Error type • Error trigger • Symptom • Sample the APARs

  7. Overlay errors and general errors • Overlay errors • Errors that cause storage overlay (memory corruption) • Hard to find and fix • Big impact on availability • Sample set obtained by keyword search • General errors • All errors, including overlay errors • Sample set obtained by random sampling • The two sets are compared
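
The two sampling strategies on this slide can be sketched as follows. This is a minimal illustration only: the record layout, the `OVERLAY_KEYWORDS` tuple, and the function names are assumptions, since the slides do not give the actual search terms or the RETAIN record format.

```python
import random

# Hypothetical keywords; the slides do not list the actual search terms.
OVERLAY_KEYWORDS = ("overlay", "overlaid")

def keyword_sample(apars, keywords=OVERLAY_KEYWORDS):
    """Return APARs whose free text mentions any keyword (case-insensitive)."""
    return [a for a in apars if any(k in a["text"].lower() for k in keywords)]

def random_sample(apars, size, seed=0):
    """Return a uniform random sample of the whole population, without replacement."""
    rng = random.Random(seed)
    return rng.sample(apars, min(size, len(apars)))

# Made-up records for illustration; not data from the study.
apars = [
    {"id": 1, "text": "Storage overlay in control block after copy"},
    {"id": 2, "text": "Wrong message text displayed"},
    {"id": 3, "text": "Buffer OVERLAID by late I/O completion"},
    {"id": 4, "text": "Endless wait during recovery"},
]

overlay_set = keyword_sample(apars)          # overlay-error sample
general_set = random_sample(apars, size=2)   # general-error sample
```

Because the general set is drawn from all errors, it can contain overlay errors too, which matches the slide's definition of general errors.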

  8. Error Type • Orthogonal and sufficiently large classes • 13 types in total • Overlay: 8 • Allocation management • Pointer management • Copy overrun • Regular: plus 6 • Semantic errors • Synchronization error • Unclassified

  9. Error triggering events • Boundary conditions • Bug fixes • Client code • Recovery or error handling • Timing • Unknown

  10. Symptom codes • ABEND • Addressing error (may restart) • Endless wait • Incorrect output • Incorrect output without detecting the failure • Loop • OS goes into an infinite loop; needs restart • Message • Error message printed; local recovery, no ABEND
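
Once each sampled APAR carries an error type, a trigger, and a symptom, the breakdowns listed in the overview reduce to counting. The sketch below assumes a hypothetical `ClassifiedApar` record and a hypothetical `breakdown` helper; the sample records are made up and are not data from the study.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassifiedApar:
    """One manually classified APAR (hypothetical layout)."""
    error_type: str
    trigger: str
    symptom: str

def breakdown(records, attribute):
    """Map each value of `attribute` to its percentage share of the records."""
    counts = Counter(getattr(r, attribute) for r in records)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}

# Made-up records for illustration; not data from the study.
records = [
    ClassifiedApar("pointer management", "boundary conditions", "ABEND"),
    ClassifiedApar("copy overrun", "boundary conditions", "incorrect output"),
    ClassifiedApar("synchronization error", "timing", "endless wait"),
    ClassifiedApar("allocation management", "recovery or error handling", "ABEND"),
]

trigger_share = breakdown(records, "trigger")
symptom_share = breakdown(records, "symptom")
```

The same helper serves all three attributes, so the error type, trigger, and symptom distributions stay directly comparable.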

  11. Outline • Motivation • Overview • Data source • Design • Analysis results • These results indicate … • Related work

  12. Impact • Do overlay errors have more impact?

  13. Error Type of Overlay Errors • Which is most common? • Copying Overrun (20%) • Allocation Mgmt. (19%) • Which has the most impact? • Allocation Mgmt. (31% HIPERs, 17% IPLs) • Pointer Mgmt. (16% HIPERs, 27% IPLs) • More about copying overrun • Less impact (13% HIPERs, 5% IPLs) • Why?
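
A rough sketch of how per-type impact figures like these could be computed. It assumes the percentages are read as the share of each type's APARs flagged HIPER (highly pervasive) or requiring an IPL (initial program load, i.e. a reboot); the slide does not state the denominator, and the record fields and sample values below are hypothetical.

```python
from collections import defaultdict

def impact_by_type(apars):
    """For each error type, return the percentage of its APARs flagged HIPER and IPL."""
    totals = defaultdict(lambda: {"n": 0, "hiper": 0, "ipl": 0})
    for a in apars:
        t = totals[a["error_type"]]
        t["n"] += 1
        t["hiper"] += a["hiper"]  # booleans count as 0 or 1
        t["ipl"] += a["ipl"]
    return {
        etype: {
            "hiper_pct": 100.0 * t["hiper"] / t["n"],
            "ipl_pct": 100.0 * t["ipl"] / t["n"],
        }
        for etype, t in totals.items()
    }

# Made-up records for illustration; not data from the study.
apars = [
    {"error_type": "allocation management", "hiper": True, "ipl": True},
    {"error_type": "allocation management", "hiper": False, "ipl": True},
    {"error_type": "allocation management", "hiper": False, "ipl": False},
    {"error_type": "allocation management", "hiper": False, "ipl": False},
    {"error_type": "copy overrun", "hiper": False, "ipl": True},
    {"error_type": "copy overrun", "hiper": False, "ipl": False},
]

impact = impact_by_type(apars)
```

Separating frequency (how common a type is) from impact (how often it is HIPER or forces an IPL) is what lets the slide observe that copying overrun is the most common type yet has comparatively low impact.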

  14. Error Type of Regular Errors • Which type dominates? • Impact • HIPERs: Overlay 14%; Undefined State 49% • IPLs: Overlay 4%; Synchr. 70% • [Chart labels: Others, Overlay Error, Administrative Err. (Semantic Err.), Synchr. Error (?), Copying Overrun, Type mismatch, Undefined State]

  15. Error Triggering Events • What’s your guess? • Mostly timing-related problems? (Heisenbugs) • Breakdown • What does it tell us?

  16. Failure Symptom

  17. What else can we do? • Dig more information out of the RETAIN database • Do better classification • Ask more interesting questions • Similar analysis on different applications • Try similar things for open source code

  18. What does the data tell us? • Test case design • Test boundary conditions • Test recovery code • Bug detection • Memory bug detectors • Synchronization bugs • Tools that help fix bugs

  19. Something Related • National Vulnerability Database • Bugzilla (Mozilla, 1998)
