Source code metrics in the software industry School of Computer and Information Science, Edith Cowan University PhD research project by Tim Littlefair supervised by Dr Thomas O'Neill http://www.fste.ac.cowan.edu.au/~tlittlef
Source code metrics • Software metrics are generally divided into two categories: process metrics and product metrics. • The value of process metrics as an aid to software process management is now widely accepted. • Source code metrics are a subset of product metrics. Many source code metrics have been proposed, but there is no consensus as to which are useful.
Project features • Metrics of interest were selected. • Automated measurement tool was implemented. • Tool was deployed and evaluated by developers performing real-world development under commercial conditions. • Internet was used to distribute tool and solicit evaluation feedback. • Project is concluding with experiment on use of metrics data as support for software inspection process.
Metrics Implemented • Procedural metrics (measured on a per-function basis) : lines of code, lines of comment, McCabe's Cyclomatic Complexity. • Metrics of object oriented design (proposed by Chidamber and Kemerer): depth of inheritance tree, number of children, coupling between objects, weighted methods per class. • Structural metrics (based on work by Henry & Kafura): fan-in, fan-out, information flow.
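As a rough illustration of two of the metrics listed above, the sketch below uses the commonly cited textbook formulations: cyclomatic complexity of a single-entry, single-exit function as its number of binary decision points plus one, and Henry & Kafura information flow as length × (fan-in × fan-out)². The function names are hypothetical, and the exact definitions used by the measurement tool may differ.

```python
def cyclomatic_complexity(decision_points: int) -> int:
    """McCabe's measure for one function: binary decision points plus one."""
    return decision_points + 1


def information_flow(fan_in: int, fan_out: int, length: int = 1) -> int:
    """Henry & Kafura style information flow: length * (fan_in * fan_out) ** 2.

    Passing length=1 gives the unweighted variant (assumption: the tool
    described in the slides may weight or count flows differently).
    """
    return length * (fan_in * fan_out) ** 2


# Illustrative values only, not measurements from the project.
print(cyclomatic_complexity(4))      # function with 4 if/while/for branches
print(information_flow(3, 2, 10))    # 3 inbound flows, 2 outbound, 10 lines
```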
Evaluation survey • Tool was publicised via USENET, deployed (and refined) over a 6 month period. • USENET and email addresses from FTP logs used to publicise evaluation URL. • 25 respondents over 3 month period, weak consensus on value of procedural metrics, opinion neutral or marginally negative on OO design metrics and structural metrics.
Review Experiment • Survey yielded little interesting data - need to extend project to make it worthwhile. • Experiment designed to attempt to refute null hypothesis (i.e. "metrics are of no value") by detecting positive effect of metrics use. • Simulated code review - comparing performance of groups with and without metrics support.
The exercise • 5 Java classes to be reviewed. • Each respondent to give yes/no answer on the presence of each of 5 risk factors for each class (e.g. "Excessive length", "Inadequate commenting"). • Comparison of performance requires independently determined set of 'correct' answers.
Treatment groups • Group 0 perform the exercise without metric support with plenty of time (1 hour) • Group 1 perform the exercise without metric support under time constraint (15 mins) • Group 2 perform the exercise with metric support under time constraint (15 mins) • Group 0 are used to derive 'correct' responses; the value of metrics support is assessed by seeing whether group 2 reflects these better than group 1.
Experimental Outcomes • Code review experiment closed after attracting 15 volunteers (6 in group 0, 4 in group 1 and 5 in group 2) • This was less than we were hoping for, but enough to do analysis on the results • Derivation of ‘correct’ results by selecting a threshold for group 0 responses • Statistical techniques used in processing of returns: contingency tables, receiver-operating characteristic analysis, chi-squared analysis.
Cumulative Responses • The table below shows the number of respondents in each group, together with the number of positive (risk present) responses by each group to each question. • Group 0 responses are distilled to derive the ‘correct’ answer to each of the 25 questions; the performance of groups 1 and 2 is then assessed by their level of agreement with these derived responses.
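The derivation of 'correct' answers and the scoring of a group against them can be sketched as follows. This is a minimal sketch under stated assumptions: the 0.5 threshold, the function names, and the response counts are all hypothetical, since the slides do not state the threshold that was chosen and the response table is not reproduced here.

```python
def derive_correct(positive_counts, group_size, threshold=0.5):
    """Mark a question 'risk present' when the share of group 0
    respondents answering yes meets the threshold (assumed value)."""
    return [count / group_size >= threshold for count in positive_counts]


def tally(correct, positive_counts, group_size):
    """Accumulate TP/FP/TN/FN over all questions for one group."""
    tp = fp = tn = fn = 0
    for is_risk, positives in zip(correct, positive_counts):
        negatives = group_size - positives
        if is_risk:
            tp += positives   # said yes, risk judged present
            fn += negatives   # said no, risk judged present
        else:
            fp += positives   # said yes, risk judged absent
            tn += negatives   # said no, risk judged absent
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Made-up counts for 4 questions (the real exercise had 25).
group0_yes = [6, 1, 3, 0]                  # 6 respondents in group 0
correct = derive_correct(group0_yes, group_size=6)
print(correct)                             # [True, False, True, False]

group1_yes = [3, 2, 1, 0]                  # 4 respondents in group 1
print(tally(correct, group1_yes, group_size=4))
# {'TP': 4, 'FP': 2, 'TN': 6, 'FN': 4}
```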
Receiver Operating Characteristic (ROC) The ROC graph presents a visual summary of the way a predictive system responds to a sample of real cases, some of which will be on the borderline. A perfect predictive system would follow the left and upper boundaries of the graph; one that is no better than chance would follow the leading diagonal.
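To make the shape of the graph concrete, the sketch below computes ROC points by sweeping a threshold over per-question scores and measuring the true-positive and false-positive rates at each step. It is an illustration only: the slides do not say how the curve was constructed, and treating a group's share of yes answers as the score is an assumption, as are the function names and data.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs, one per
    distinct score threshold, starting from (0, 0).

    Requires at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points: 1.0 is perfect, 0.5 is chance."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up scores: a perfect separator hugs the left and upper boundaries.
perfect = roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
print(perfect)                    # [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
print(area_under_curve(perfect))  # 1.0
```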
Chi-squared test (1) • We have gathered data on TP, TN, FP, FN responses of 2 groups. • Are differences between groups due to systematic factors or random variation? • Standard chi-squared test: • start from contingency table • calculate chi-squared figure • compare to critical value for the desired level of significance and the degrees of freedom of the contingency table • testing null hypothesis (absence of systematic difference)
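The steps above can be sketched as a plain Pearson chi-squared calculation. The counts in the example are invented for illustration and are not the experiment's data; the function name is hypothetical. The critical values are the standard 5% figures for 1, 2 and 3 degrees of freedom.

```python
# Standard chi-squared critical values at the 5% significance level.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_squared(table):
    """Return (statistic, degrees of freedom) for a contingency table
    given as a list of rows. Every row and column total must be non-zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Invented counts: rows are two groups, columns are TP, TN, FP, FN.
table = [[30, 40, 15, 15],
         [35, 45, 20, 25]]
statistic, dof = chi_squared(table)
reject_null = statistic > CRITICAL_5_PERCENT[dof]
print(dof, reject_null)
```

With small samples the expected counts in some cells can be low, in which case the chi-squared approximation becomes unreliable and the result should be read with caution.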
Experimental Conclusion (1) • No significant difference in the performance of the code review exercise was found between the control group and the treatment group. We therefore conclude that there is no evidence from the current experiment to suggest that the metrics information is of benefit in the setting simulated by this experiment.
Experimental Conclusion (2) • While the experiment failed to demonstrate a significant difference between the performance of the two groups, nonetheless it is possible to say that a difference in performance was observed. Although the statistical analysis of the data shows that the difference observed was well within the range of outcomes that might arise out of the operation of random effects, it is possible that a similar experiment with a larger number of participants might demonstrate a significant effect.
Project Summary • Nature of source code metrics • Selection of existing and new metrics based on GQM paradigm • Evidence on industry attitudes • Empirical evaluation of tools: • human decision support context • problems of realistic experimentation • burden of proof • appropriate statistical techniques