This study explores the use of logistic regression (LR) models incorporating design complexity metrics to predict fault-prone object-oriented classes in various software projects. By analyzing data from seven projects and employing simple log data transformations, the authors aim to refine the predictive capabilities of these models. Key metrics like CK-CBO, CK-RFC, and CK-WMC were investigated, revealing that transformation methods can enhance predictive accuracy. The findings suggest that metrics' distributions differ across projects, highlighting the need for local threshold determination.
Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Advanced Institute of Science and Technology ESEM 2009
Contents • Abstract • Background • Problem Analysis • Case study • Results • Conclusion and Future Work
Abstract Challenge: To make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects. First attempt at a solution: simple log data transformations. [Figure: P(y=1), the probability that a class is fault-prone, plotted against x, a design-complexity metric]
Background • Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models • Among these metrics are the Chidamber & Kemerer (CK) metrics • The 80th and 20th percentiles of their distributions can be used to determine high and low values • Their thresholds cannot be determined before their use and should be derived and used locally
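As a rough illustration of deriving thresholds locally, the sketch below computes the 20th and 80th percentiles of a metric separately for each project. The function names, the interpolation rule, and the CBO values are assumptions made for this example; the slides do not say how the percentiles were computed.

```python
import math


def percentile(values, p):
    """Return the p-th percentile (0-100) of values, using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def local_thresholds(values):
    """Return (low, high) thresholds: the 20th and 80th percentiles of one project."""
    return percentile(values, 20), percentile(values, 80)


# Hypothetical CBO values for two projects (not taken from the study).
project_a = [1, 2, 2, 3, 4, 5, 6, 8, 12, 20, 35]
project_b = [0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6]

for name, data in (("project_a", project_a), ("project_b", project_b)):
    low, high = local_thresholds(data)
    print(f"{name}: low <= {low}, high >= {high}")
```

With these made-up values, a CBO of 6 is unremarkable in the first project but above the "high" threshold in the second, which is why a threshold taken from one project may not carry over to another.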
Problem Analysis Can an LR model built with this kind of metric work efficiently with different software projects? [Figure: P(y=1) against X = Number of Methods, from least faulty to most faulty, for a large-size and a small-size software project, with the values 20 and 10 marked]
Case Study • Data analysis of 7 different projects and application of simple log data transformations. • Construction of 3 univariate LR models using a large open source project (1st release of the Mylyn system, with 638 Java classes). • Independent variables: CK-CBO, CK-RFC, CK-WMC • Dependent variable: fault-proneness, derived from defects (from Bugzilla & CVS) • Testing of these models with 2 other smaller projects (with 11 and 13 Java classes)
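A univariate LR model of this kind estimates P(y=1) = 1 / (1 + exp(-(b0 + b1*x))), where x is one metric and y marks a class as fault-prone. The sketch below fits b0 and b1 by plain gradient ascent on the log-likelihood. It is a minimal illustration, not the authors' procedure: the fitting method, the learning rate, and the data are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_univariate_lr(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)
            grad0 += error
            grad1 += error * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


def predict(b0, b1, x):
    """Estimated probability that a class with metric value x is fault-prone."""
    return sigmoid(b0 + b1 * x)


# Hypothetical training data: one (log-transformed) metric value per class,
# and a binary label (1 = fault-prone). Not taken from the Mylyn data set.
metric = [0.0, 0.3, 0.5, 0.7, 0.8, 1.0, 1.1, 1.3, 1.5, 1.8]
faulty = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_univariate_lr(metric, faulty)
print(f"b0={b0:.2f}, b1={b1:.2f}")
print(f"P(fault-prone | x=0.2) = {predict(b0, b1, 0.2):.2f}")
print(f"P(fault-prone | x=1.6) = {predict(b0, b1, 1.6):.2f}")
```

Testing on another project then amounts to calling `predict` with the coefficients fitted on the large project and the metric values of the smaller one.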
Challenge … produce biased regression estimates and reduce the predictive power of regression models. Projects: BNS: Banking system (2006) * • CRS: Cruise control system (2005) * • ECS: E-commerce system (2006) * • ELCS: Elevator control system (2003) * • FACS: Factory automation system (2005) * • GMF: Graphic Modeling Framework ** • MYL: Mylyn system ** (**) Eclipse project. (*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.
The RFC data of BNS is more spread out than the data of MYL.
Case Study Solution: simple data transformation using Log10. Example: LCBO = Log10(CBO+1); LTCBO = Log10(CBO+1) + dm, where dm is the difference between the CBO medias of the Mylyn system and of the system whose data is being transformed. Effects: • There are fewer outliers • The data spread is more uniform
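The two transformations can be sketched as follows. The function names are invented for this example, and dm is passed in as a plain number because the slide defines it only in words; the value 0.5 and the CBO values are hypothetical.

```python
import math


def lcbo(cbo):
    """LCBO = Log10(CBO + 1). The +1 keeps CBO = 0 defined."""
    return math.log10(cbo + 1)


def ltcbo(cbo, dm):
    """LTCBO = Log10(CBO + 1) + dm, where dm shifts the data toward the Mylyn data."""
    return math.log10(cbo + 1) + dm


# Hypothetical CBO values for a small project (not taken from the study).
cbo_values = [0, 9, 99]
dm = 0.5  # assumed shift, for illustration only

print([round(lcbo(v), 3) for v in cbo_values])
print([round(ltcbo(v, dm), 3) for v in cbo_values])
```

The log compresses large values, so a raw range of 0 to 99 becomes 0 to 2, which is the mechanism behind the smaller number of outliers. Adding dm then moves the whole transformed distribution without changing its shape.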
Results Effects of the log data transformations: • Elimination of a great number of outliers • Better overall goodness of fit for the 3 models • Discrimination (Most Faulty/Least Faulty) • All models discriminate well between the most faulty and least faulty classes of the Mylyn system • What about using different projects?
Results: Banking system [Figure; MF: Most Faulty, LF: Least Faulty]
Results: E-commerce system [Figure; MF: Most Faulty, LF: Least Faulty]
Conclusions and Future Work • CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects • Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread out as those used in the construction of the model. • Further data exploration and study of data transformations
Thank you! questions, comments … contact: erika.camargo@jaist.ac.jp