
A Metric for Software Readability



Presentation Transcript


  1. A Metric for Software Readability by Raymond P.L. Buse and Westley R. Weimer Presenters: John and Suman

  2. Readability • The human judgment of how easy a text is to understand • A local, line-by-line feature • Not related to the size of a program • Not related to essential complexity of software

  3. Readability and Maintenance • Reading code is the most time-consuming of all maintenance activities [J. Lionel E. Deimel 1985, D. R. Raymond 1991, S. Rugaber 2000] • Maintenance accounts for roughly 70% of software cost [B. Boehm and V. R. Basili 2001] • Readability also correlates with software quality, code change and defect reporting

  4. Problem Statement • How do we create a software readability metric that: • Correlates strongly with human annotators • Correlates strongly with software quality • Currently no automated readability measure for software

  5. Contributions • A software readability metric that: • correlates strongly with human annotators • correlates strongly with software quality • A survey of 120 human readability annotators • A discussion of the features related to software readability

  6. Readability Metrics for Natural Language • Empirical and objective models of readability • Flesch-Kincaid Grade Level has been used for over 50 years • Based on simple features: • Word length • Sentence length • Used by government agencies, MS Word, Google Docs
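A natural-language grade level really is computed from just those surface features. A minimal Python sketch of the Flesch-Kincaid Grade Level with its standard coefficients; the syllable counter here is a crude vowel-group heuristic of my own, not the one used by any particular tool:

    import re

    def count_syllables(word):
        # Crude heuristic: one syllable per run of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z]+", text)
        syllables = sum(count_syllables(w) for w in words)
        # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
        return (0.39 * len(words) / len(sentences)
                + 11.8 * syllables / len(words)
                - 15.59)

    print(flesch_kincaid_grade("The cat sat on the mat. It was warm."))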

  7. Experimental Approach • 120 human annotators were shown 100 code snippets • The resulting 12,000 readability judgments are available online (ok, not really)

  8. Snippet Selection - Goals • Length • Short enough to aid feature discrimination • Long enough to capture important readability considerations • Logical Coherence • Shouldn't span methods • Include adjacent comments • Avoid trivial snippets (e.g., a group of import statements)

  9. Snippet Selection - Algorithm • Snippet = 3 consecutive simple statements • Based on authors' experience • Simple statements are: declarations, assignments, function calls, breaks, continues, throws and returns • Other nearby statements are included: • Comments, function headers, blank lines, if, else, try-catch, switch, etc. • Snippets cannot cross scope boundaries
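A faithful implementation works over the Java parse tree. The Python sketch below only approximates the idea at the line level, using a crude regular expression to flag "simple" statements and closing a snippet once it has collected three of them; it is a heuristic illustration, not the authors' tool:

    import re

    # Crude line-level stand-in for "simple statements" (declarations,
    # assignments, calls, break/continue/throw/return); the real tool
    # works over the Java AST, not raw lines.
    SIMPLE = re.compile(r"^\s*(return|break|continue|throw)\b|;\s*$")

    def snippets(java_lines, simple_per_snippet=3):
        """Group lines into snippets that each contain three 'simple'
        statements plus the surrounding comments, headers and blank lines."""
        current, count = [], 0
        for line in java_lines:
            current.append(line)
            if SIMPLE.search(line):
                count += 1
            if count == simple_per_snippet:
                yield "\n".join(current)
                current, count = [], 0

    example = [
        "// accumulate the order total",
        "int total = 0;",
        "total += price;",
        "return total;",
    ]
    for s in snippets(example):
        print(s)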

  10. Readability Scoring • Readability was rated from 1 to 5 • 1 - “less readable” • 3 - “neutral” • 5 - “more readable”

  11. Inter-annotator agreement • Good correlation needed for a coherent model • Pearson product-moment correlation coefficient • Correlation of 1 indicates perfect agreement • Correlation of 0 indicates no better than random agreement • Calculated pairwise for all annotators • Average correlation of 0.56 • Typically considered “moderate to strong”
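A minimal sketch of the pairwise agreement computation, using hypothetical 1-5 scores and only the Python standard library:

    from itertools import combinations
    from statistics import mean

    def pearson(xs, ys):
        # Pearson product-moment correlation of two equal-length score lists.
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # Hypothetical 1-5 readability scores: one row per annotator,
    # one column per snippet.
    scores = [
        [5, 4, 2, 1, 3],
        [4, 4, 1, 2, 3],
        [5, 3, 2, 1, 4],
    ]

    pairwise = [pearson(a, b) for a, b in combinations(scores, 2)]
    print("average inter-annotator correlation:", mean(pairwise))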

  12. Readability Model Objective: • Mechanically predict human readability judgments • Determine code features that are predictive of readability Usage: • Use this model to analyze code (an automated software readability metric)

  13. Model Generation • Classifier: a machine learning algorithm • Instance: the feature vector extracted from a snippet • Experiment procedure - training phase: a set of instances labeled with the “correct answer” - labels are derived from the (bimodal) distribution of averaged human scores
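Because the averaged human scores are roughly bimodal, each snippet's training label can be obtained by thresholding its mean score. A minimal sketch; the exact cutoff used in the paper is an assumption here:

    from statistics import mean

    # Hypothetical averaged human scores, one per snippet.
    snippet_means = [4.2, 1.8, 3.9, 2.1, 4.5]

    # Binarize around the overall mean: above it -> "more readable".
    # (The precise cutoff is an assumption for illustration.)
    cutoff = mean(snippet_means)
    labels = ["more readable" if m > cutoff else "less readable"
              for m in snippet_means]
    print(list(zip(snippet_means, labels)))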

  14. Model Generation (contd …) • Decide on a set of features that can be detected statically • These features relate to the structure, density, logical complexity, and documentation of the analyzed code • Each feature is independent of the size of the block of code being analyzed
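A minimal sketch of such a per-snippet feature vector. The names follow the general categories above rather than the paper's exact feature set, and the identifier matching is deliberately crude:

    import re

    def features(snippet):
        """Compute a few size-independent surface features of a snippet
        (averages and densities rather than raw counts)."""
        lines = snippet.splitlines() or [""]
        identifiers = re.findall(r"\b[A-Za-z_]\w*\b", snippet)
        return {
            "avg_line_length": sum(len(l) for l in lines) / len(lines),
            "avg_identifiers_per_line": len(identifiers) / len(lines),
            "avg_identifier_length": (sum(len(i) for i in identifiers) / len(identifiers)
                                      if identifiers else 0.0),
            "comment_density": sum("//" in l or "/*" in l for l in lines) / len(lines),
            "blank_line_density": sum(not l.strip() for l in lines) / len(lines),
        }

    print(features("int total = 0;\n// sum the prices\ntotal += price;"))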

  15. Model Generation (contd …) • Build a classifier on a set of features • Use 10-fold cross validation - random partitioning of data set into 10 subsets - train on 9 and test on 1 - repeat this process 10 times • Mitigate any bias from partitioning by repeating the 10-fold validation 20 times • Average the results across all of the runs
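A minimal sketch of that evaluation protocol using scikit-learn, with placeholder data and a placeholder learner; the paper compares several classifiers, and logistic regression here is only for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # X: one feature vector per snippet; y: 1 = "more readable", 0 = "less readable".
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                          # placeholder features
    y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)   # placeholder labels

    scores = []
    for repeat in range(20):                               # repeat 10-fold CV 20 times
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
        scores.extend(cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="f1"))

    print("mean f-measure over 200 folds:", np.mean(scores))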

  16. Results • Two relevant success metrics – precision & recall • Recall – the fraction of snippets judged “more readable” by annotators that the model also classifies as “more readable” • Precision – the fraction of snippets classified “more readable” by the model that annotators also judged “more readable” • Performance is measured by the f-measure statistic: the harmonic mean of precision and recall
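A minimal sketch of the two metrics and their harmonic mean, with hypothetical labels:

    def precision_recall_f1(human, model, positive="more readable"):
        """Precision: of the snippets the model calls 'more readable', the share
        the annotators agreed on.  Recall: of the snippets the annotators called
        'more readable', the share the model recovered.  F-measure: harmonic mean."""
        tp = sum(h == positive and m == positive for h, m in zip(human, model))
        fp = sum(h != positive and m == positive for h, m in zip(human, model))
        fn = sum(h == positive and m != positive for h, m in zip(human, model))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall, 2 * precision * recall / (precision + recall)

    human = ["more readable", "less readable", "more readable", "more readable"]
    model = ["more readable", "more readable", "less readable", "more readable"]
    print(precision_recall_f1(human, model))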

  17. Results (contd …) • 0.61 – f-measure of a baseline classifier trained on randomly generated score labels • 0.80 – f-measure of the classifier trained on averaged human data

  18. Results (contd …) • Repeated the experiment separately for each annotator experience group (100-, 200-, and 400-level courses, and graduate CS students)

  19. Observations from the performance measures • Average line length and average number of identifiers per line are important to readability • Average identifier length, loops, if constructs and comparison operators are not very predictive features

  20. Readability Correlations (Experiment 1) • Correlate defects detected by FindBugs* with the readability metric • Run FindBugs on benchmarks • Split the code into two sets: one containing at least one reported defect and one containing none • Run the trained classifier • Record the f-measure for “contains a bug” with respect to the classifier judgment of “less readable” *FindBugs – a popular static bug finding tool

  21. Readability Correlations (Experiment 2) • Correlates future code churn to readability • Uses readability to predict which functions will be modified between two successive releases of a program • A function is considered to have changed when its text is not exactly the same (even whitespace-only changes count)
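A minimal sketch of the change test between two releases, assuming function bodies have already been extracted into dictionaries keyed by a stable name; the extraction step is omitted, and the whitespace rule is an assumption based on the slide:

    def changed_functions(old_release, new_release):
        """Return the names of functions whose text is not exactly the same
        between two releases.  Whitespace differences count as changes here,
        matching the slide; the exact rule in the paper may differ."""
        return {name for name, body in new_release.items()
                if old_release.get(name) != body}

    v1 = {"total": "int total(int p){ return p; }"}
    v2 = {"total": "int total(int p) { return p; }"}   # whitespace-only edit
    print(changed_functions(v1, v2))                    # -> {'total'}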

  22. Readability Correlations - Results • Average f-measure: 0.61 for Experiment 1 and 0.63 for Experiment 2

  23. Relating Metric to Software Life Cycle • Readability tends to change over a long period of time

  24. Relating Metric to Software Life Cycle (contd …) • Correlate project readability against project maturity (as reported by developers) “Projects that reach maturity tend to be more readable”

  25. Discussion • Identifier Length • No influence! • Long names can improve readability, but can also reduce it • Comments might be more appropriate • Authors' suggestion: improved IDEs and code inspections • Code Comments • Only moderately correlated • Being used to “make up for” ugly code? • Characters/identifiers per line • Strongly correlated • Just as long sentences are more difficult to understand, so are long lines of code • Authors' suggestion: keep lines short, even if it means breaking them up over several lines

  26. Related Work • Natural Language Metrics [R. F. Flesch 1948, R. Gunning 1952, J. P. Kincaid and E. A. Smith 1970, G. H. McLaughlin 1969] • Coding Standards [S. Ambler 1997, B. B. Bederson et al. 2002, H. Sutter and A. Alexandrescu 2004] • Style Checkers [T. Copeland 2005] • Defect Prediction [T. J. Cheatham et al. 1995, T. L. Graves et al. 2000, T. M. Khoshgoftaar et al. 1996]

  27. Future Work • Examine personal preferences • Create personal models • Models based on application domain • Broader features • e.g., number of statements in an if block • IDE integration • Explore minimum set of predictive features

  28. Conclusion • Created a readability metric based on a specific set of human annotators • This metric: • agrees with the annotators as much as they agree with each other • has significant correlation with conventional metrics of software quality • Examining readability could improve language design and engineering practice
