

Presentation Transcript


  1. Preventing and Eliminating Software Faults through the Life Cycle PI: Katerina Goseva-Popstojanova Student: Margaret Hamill Lane Dept. of Computer Science and Electrical Engineering West Virginia University, Morgantown, WV E-mail: Katerina.Goseva@mail.wvu.edu

  2. Problem • NASA spends time and effort to track problem reports/change data for every project. These repositories are rich, underused sources of information about the way software systems fail and the software faults that cause these failures. • Our goal: based on a systematic and thorough analysis of the available empirical data, build quantitative and qualitative knowledge that contributes toward improving software quality by • preventing the introduction of faults into the system • more efficiently eliminating them through the life cycle • compiling lessons learned & recommendations to be used by the pilot project and throughout the Agency

  3. Approach • Explore change tracking systems & automate data extraction (a sketch of this step follows below) • Quantify multiple dimensions for each fault type • Identify common patterns and unusual dependencies • Classify fault/failure data from the pilot project • Propose an appropriate classification scheme & refine it • Compile a check list to support avoidance & elimination of different fault types • Identify the most frequent fault types & most frequent failure types • Compile a Lessons Learned & Recommendations document
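
A minimal sketch of the first two steps (automated extraction and tallying along one classification dimension), assuming the change tracking system can export SCRs to a CSV file; the file name and the column name 'source' are illustrative assumptions, not the project's actual schema.

```python
import csv
from collections import Counter

def load_scrs(path):
    """Load exported SCR records from a CSV file (assumed export format)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def classify(scrs, field):
    """Tally SCRs along one classification dimension (e.g., fault type, activity, severity)."""
    return Counter((row.get(field) or "Not given").strip() or "Not given" for row in scrs)

# Hypothetical usage against an assumed export file:
# scrs = load_scrs("scr_export.csv")
# print(classify(scrs, "source").most_common(5))
```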

  4. Pilot study: Basic facts • We used a large NASA mission as a pilot study • 21 Computer Software Configuration Items (CSCIs) • millions of lines of code • over 8,000 files • developed at two different locations • We analyzed • over 2,800 Software Change Requests (SCRs) entered due to non-conformance with requirements • collected through the software life cycle (i.e., development, testing and on-orbit) • over a period of almost 10 years • To the best of our knowledge, this is the largest dataset considered so far in the published literature

  5. Data quality • We collaborate closely with both the IV&V team and the project team, which provide invaluable support to our work • Our analysis is based only on data whose quality has been confirmed by the IV&V and project teams • We report results to, and incorporate feedback from, both teams as soon as results become available (at least quarterly); the teams provide • domain knowledge • feedback on data quality • guidance & verification of database queries • insights into the significance and impact of the research results

  6. Major accomplishments • Sources of failures (i.e., types of faults) • Identified most common fault types • Showed both the internal and external validity of the results • Activities when the problem was discovered (e.g., inspection, testing, analysis, on-orbit) • Only 3% of SCRs are on-orbit • Identified dominant fault types during Development & testing and On-orbit • Severity • Only around 8% of SCRs are safety critical (less than 1% On-orbit) • Analyzed severity of different fault types • Compiled internal document on lessons learned & recommendations for product and process improvement

  7. Major accomplishments • Sources of failures (i.e., types of faults) • Identified most common fault types • Showed both the internal and external validity of the results • Activities when the problem was discovered (e.g., inspection, testing, analysis, on-orbit) • Only 3% of SCRs are on-orbit • Identified dominant fault types during Development & testing and On-orbit • Severity • Only around 8% of SCRs are safety critical (less than 1% On-orbit) • Analyzed severity of different fault types • Compiled internal document on lessons learned & recommendations for product and process improvement

  8. Source of failures • Terminology • A fault is introduced when a human error results in a mistake in some software artifact (e.g., requirements, design, source code) • A failure is a departure of the system behavior from its required or expected behavior • The fault-failure relationship is a cause-effect relationship • SCRs entered due to non-conformance with requirements throughout the life cycle (i.e., development, testing, and on-orbit) are indications of failures • The 'source' field of an SCR categorizes the fault(s) that caused the failure • 93% of non-conformance SCRs identify the source of failure

  9. Correlation of number of SCRs & size of CSCIs • Outliers discussed with the IV&V and project teams • Larger CSCIs tend to have more non-conformance SCRs • Statistically significant correlation of 0.79
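
The slide does not state which correlation coefficient was used; as an illustration of the computation, a plain Pearson correlation over made-up CSCI sizes and SCR counts (not the project's data) could look like this:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative inputs only: CSCI sizes (e.g., KLOC) and the number of
# non-conformance SCRs reported against each CSCI.
sizes = [45, 85, 120, 150, 210, 300]
scr_counts = [70, 90, 180, 230, 260, 410]
print(f"Pearson correlation: {pearson(sizes, scr_counts):.2f}")
```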

  10. Source of failures: Most common fault types • Most common sources of failure for all 21 CSCIs grouped together • Requirements faults (incorrect, changed & missing requirements): 33% • Coding faults: 33% • Data problems: 14%

  11. Source of failures: Early vs. Late life cycle activities • Distribution of sources of failures (i.e., fault types) • Requirements & Design: 38.25% • Requirements faults: 32.65% • Design faults: 5.60% • Coding, Interface & Integration: 48.57% • Coding faults: 32.58% • Data problems: 13.72% • Integration faults: 2.27% • 'Other' 5.80% and 'Not given' 7.38% • This distribution of faults across life cycle activities contradicts the common belief that the majority of faults are introduced during early life cycle activities, i.e., requirements and design, which dates back to some of the earliest empirical studies [Boehm et al. 1975, Endres 1975, Basili et al. 1984] • Important question: internal & external validity of our results
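
The early-versus-late totals above are simple regroupings of the individual fault-type shares; the few lines below reproduce that arithmetic, with the bucket assignment taken from the slide's own grouping.

```python
# Fault-type shares as reported on this slide (percent of non-conformance SCRs).
shares = {
    "Requirements": 32.65, "Design": 5.60,
    "Coding": 32.58, "Data problem": 13.72, "Integration": 2.27,
    "Other": 5.80, "Not given": 7.38,
}

# Bucket assignment mirrors the slide's grouping of life cycle activities.
early = ("Requirements", "Design")
late = ("Coding", "Data problem", "Integration")

print("Requirements & Design:", round(sum(shares[k] for k in early), 2))           # 38.25
print("Coding, Interface & Integration:", round(sum(shares[k] for k in late), 2))  # 48.57
```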

  12. Source of failures: Internal validity • CSCIs have different maturity • We compared the distribution of fault types across groups of CSCIs based on the number of current releases

  13. Source of failures: Internal validity (CSCIs with 3 releases) • Consistent results: requirements faults (34%), coding faults (25%) and data problems (17%) are the most common sources of failures

  14. Source of failures: Internal validity (CSCIs grouped by the number of releases) • The same three most common sources consistently dominate the fault types, accounting for 68% to 86% of the SCRs in each group • Requirements faults: 28% - 40% • Coding faults: 25% - 43% • Data problems: 9% - 17%
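
The comparison across release groups is a within-group normalization of fault-type counts; a sketch is shown below, with made-up (group, fault type) records standing in for the real SCR data.

```python
from collections import defaultdict

def per_group_distribution(records):
    """records: iterable of (release_group, fault_type) pairs.
    Returns {release_group: {fault_type: percent of SCRs within that group}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for group, fault_type in records:
        counts[group][fault_type] += 1
    return {
        group: {ft: 100.0 * n / sum(faults.values()) for ft, n in faults.items()}
        for group, faults in counts.items()
    }

# Tiny illustrative input only; the real analysis runs over the project's SCR records.
sample = ([("3 releases", "Requirements")] * 34 + [("3 releases", "Coding")] * 25 +
          [("3 releases", "Data problem")] * 17 + [("3 releases", "Other")] * 24)
print(per_group_distribution(sample))
```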

  15. Source of failures: External validity • We compared our results based on 2,858 non-conformance SCRs with results from several recent large empirical studies • 199 anomaly reports, 7 JPL unmanned spacecraft [Lutz et al. 2004] • 600 software faults from several releases, switching system [Yu 1998] • 427 pre- and post-release modification requests, optical network element [Leszak et al. 2002] • 668 faults, 12 open source projects [Duraes et al. 2006] • 408 faults, IBM operating system [Christmansson et al. 1996] • Consistent trend across different domains, languages, development processes & organizations: the percentage of problems reported due to coding, interface and integration faults together is approximately the same as, or even higher than, the percentage of faults due to early life cycle activities (i.e., requirements and design)

  16. Major accomplishments • Sources of failures (i.e., types of faults) • Identified most common fault types • Showed both the internal and external validity of the results • Activities when the problem was discovered (e.g., inspection, testing, analysis, on-orbit) • Only 3% of SCRs are on-orbit • Identified dominant fault types during Development & testing and On-orbit • Severity • Only around 8% of SCRs are safety critical (less than 1% On-orbit) • Analyzed severity of different fault types • Compiled internal document on lessons learned & recommendations for product and process improvement

  17. Activity when the problem was discovered • Only 3% On-orbit • 39% discovered by testing activities • The activity being performed when the problem was discovered is identified for 99% of the non-conformance SCRs

  18. Source of failures distribution: Development & testing vs. On-orbit • Only 3% of SCRs are on-orbit • Note the logarithmic scale of the Y axes

  19. Source of failures distribution: Development & testing vs. On-orbit • The contributions of coding, design & integration faults increase on orbit, while the contributions of requirements faults & data problems decrease
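
A side-by-side view like the one on these two slides can be drawn as a grouped bar chart with a logarithmic y-axis; the counts below are placeholders, not the project's values.

```python
import matplotlib.pyplot as plt

fault_types = ["Requirements", "Design", "Coding", "Data problem", "Integration"]
# Placeholder counts for illustration only.
dev_test = [900, 150, 880, 380, 60]
on_orbit = [20, 8, 35, 5, 4]

x = range(len(fault_types))
width = 0.4
plt.bar([i - width / 2 for i in x], dev_test, width, label="Development & testing")
plt.bar([i + width / 2 for i in x], on_orbit, width, label="On-orbit")
plt.yscale("log")  # on-orbit counts are far smaller, hence the log scale
plt.xticks(list(x), fault_types, rotation=30, ha="right")
plt.ylabel("Number of SCRs")
plt.legend()
plt.tight_layout()
plt.show()
```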

  20. Major accomplishments • Sources of failures (i.e., types of faults) • Identified most common fault types • Showed both the internal and external validity of the results • Activities when the problem was discovered (e.g., inspection, testing, analysis, on-orbit) • Only 3% of SCRs are on-orbit • Identified dominant fault types during Development & testing and On-orbit • Severity • Only around 8% of SCRs are safety critical (less than 1% On-orbit) • Analyzed severity of different fault types • Compiled internal document on lessons learned & recommendations for product and process improvement

  21. Severity • Severity is assigned by the review board when deciding whether the SCR needs to be addressed • Sev 1: results in loss of a safety critical function • Sev 1N: sev 1 with an established workaround • Sev 2: results in loss of a critical mission support capability • Sev 2N: sev 2 with an established workaround • Sev 3: perceivable by the operator, but neither sev 1 nor sev 2 • Sev 4: discrepancy not perceivable to the FSW user, usually an insignificant violation of FSW requirements • Sev 5: not perceivable to the FSW user, usually a case where a programming standard is violated • Around 8% of all SCRs are safety critical (Development & testing 7% and On-orbit 1%)
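
The severity scale can be captured as a simple lookup, and the safety-critical share computed from it; in the sketch below the field name 'severity' and the choice of which levels count as safety critical (here sev 1 and 1N) are assumptions for illustration, not the project's definitions.

```python
# Severity scale as described on this slide (FSW = flight software).
SEVERITY_SCALE = {
    "1":  "loss of a safety critical function",
    "1N": "sev 1 with an established workaround",
    "2":  "loss of a critical mission support capability",
    "2N": "sev 2 with an established workaround",
    "3":  "perceivable by the operator, but neither sev 1 nor sev 2",
    "4":  "not perceivable to the FSW user; usually an insignificant requirements violation",
    "5":  "not perceivable to the FSW user; usually a programming standard violation",
}

# Assumption for illustration only: sev 1 and 1N are treated as safety critical.
SAFETY_CRITICAL = {"1", "1N"}

def percent_safety_critical(scrs, field="severity"):
    """Share of SCRs whose (assumed) severity field falls in the safety-critical set."""
    if not scrs:
        return 0.0
    flagged = sum(1 for scr in scrs if str(scr.get(field, "")).strip() in SAFETY_CRITICAL)
    return 100.0 * flagged / len(scrs)

# Example with tiny made-up records:
print(percent_safety_critical([{"severity": "1"}, {"severity": "3"}, {"severity": "2N"}]))
```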

  22. Development & testing vs. On-orbit severity • A larger percentage of On-orbit SCRs are safety critical

  23. Severity distribution across types of faults • The highest percentage of safety critical SCRs comes from coding faults (3.60%), followed by requirements faults, design faults & data problems (3.35% total)
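
A breakdown like this is a cross-tabulation of fault type against severity, with each cell expressed as a percentage of all SCRs; the field names 'source' and 'severity' and the sample records below are illustrative assumptions.

```python
from collections import Counter

def crosstab_percent(scrs, row_field="source", col_field="severity"):
    """Percentage of all SCRs falling into each (fault type, severity) cell."""
    cells = Counter(
        (scr.get(row_field, "Not given"), scr.get(col_field, "Not given")) for scr in scrs
    )
    total = sum(cells.values())
    return {cell: 100.0 * n / total for cell, n in cells.items()}

# Tiny made-up records for illustration only:
sample = [
    {"source": "Coding fault", "severity": "1"},
    {"source": "Requirements fault", "severity": "3"},
    {"source": "Coding fault", "severity": "3"},
    {"source": "Data problem", "severity": "2"},
]
for (fault, sev), pct in sorted(crosstab_percent(sample).items()):
    print(f"{fault} / sev {sev}: {pct:.1f}%")
```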

  24. Major accomplishments • Sources of failures (i.e., types of faults) • Identified most common fault types • Showed both the internal and external validity of the results • Activities when the problem was discovered (e.g., inspection, testing, analysis, on-orbit) • Only 3% of SCRs are on-orbit • Identified dominant fault types during Development & testing and On-orbit • Severity • Only around 8% of SCRs are safety critical (less than 1% On-orbit) • Analyzed severity of different fault types • Compiled internal document on lessons learned & recommendations for product and process improvement

  25. Benefit to the pilot project • Based on our results and the feedback from the IV&V team and the project team, we compiled a document for internal use which summarizes the Lessons Learned & Recommendations for Product and Process Improvement • Prevent the introduction of faults and improve the effectiveness of detection • Example: Increase effort spent on design and implementation of the data repository used to share data between CSCIs • Improve the quality of the data & change tracking process • Example: Ensure that changes to the software artifacts (e.g., requirements, code) made to fix a problem are recorded and can be easily associated with a specific SCR

  26. Broader benefit to NASA • Understanding why, how, and when faults manifest as failures is essential to determining how their introduction into software systems can be prevented and when and how they can be eliminated most effectively • For the pilot project and many other NASA missions that undergo incremental development and require sustained engineering for a long period of time, these results can be used to improve system quality • The internal and external validity of our results indicate that several observed trends are not project specific; rather, they seem to be intrinsic characteristics of software faults and failures which apply across projects • Parts of our Lessons Learned document related to improvement of the problem/change tracking systems and data quality can be used by newer initiatives such as Constellation, thus proactively avoiding common pitfalls and leading to more accurate data and more cost-efficient improvement of software quality

  27. Technical challenges • Assuring data quality is an important step of any empirical research effort; inaccurate data may lead to misleading observations • Both the IV&V team and the project team have been extremely valuable in helping us understand the change tracking system, determine the meaning of different attributes, and verify the quality of the data • The research approach and analysis techniques can be used by any project that tracks problem reports/change data • However, due to the lack of a unified change tracking system, some amount of project-specific work on exploring the data format and automating data extraction may be needed

  28. Future work [FY08-10] • Classify faults and failures using several additional attributes • Conduct more complex, multivariable analysis • Continually update the Lessons Learned & Recommendations for Improvement document • Explore the best ways to prevent and eliminate most common faults and failures throughout the life cycle; compile the results in a check list • Increase awareness of our work so other projects within NASA can benefit from it

  29. Acknowledgements We thank the following NASA civil servants and contractors for their valuable support! • Jill Broadwater • Pete Cerna • Susan Creasy • Randolph Copeland • James Dalton • Bryan Fritch • Nick Guerra • John Hinkle • Lynda Kelsoe • Debbie Miele • Lisa Montgomery • Don Ohi • Chad Pokryzwa • David Pruett • Timothy Plew • Scott Radabaugh
