
A Metric for Software Readability



Presentation Transcript


  1. A Metric for Software Readability by Raymond P.L. Buse and Westley R. Weimer Presenters: John and Suman

  2. Readability • The human judgment of how easy a text is to understand • A local, line-by-line feature • Not related to the size of a program • Not related to essential complexity of software

  3. Readability and Maintenance • Reading code is the most time-consuming of all maintenance activities [J. Lionel E. Deimel 1985, D. R. Raymond 1991, S. Rugaber 2000] • Maintenance accounts for roughly 70% of software cost [B. Boehm and V. R. Basili 2001] • Readability also correlates with software quality, code change and defect reporting

  4. Problem Statement • How do we create a software readability metric that: • Correlates strongly with human annotators • Correlates strongly with software quality • Currently no automated readability measure for software

  5. Contributions • A software readability metric that: • correlates strongly with human annotators • correlates strongly with software quality • A survey of 120 human readability annotators • A discussion of the features related to software readability

  6. Readability Metrics for Natural Language • Empirical and objective models of readability • Flesch-Kincaid Grade Level has been used for over 50 years • Based on simple features: • Word length • Sentence length • Used by government agencies, MS Word, Google Docs
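A natural-language grade level really is computed from just those surface features. A minimal Python sketch of the Flesch-Kincaid Grade Level with its standard coefficients; the syllable counter here is a crude vowel-group heuristic of my own, not the one used by any particular tool:

    import re

    def count_syllables(word):
        # Crude heuristic: one syllable per run of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z]+", text)
        syllables = sum(count_syllables(w) for w in words)
        # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
        return (0.39 * len(words) / len(sentences)
                + 11.8 * syllables / len(words)
                - 15.59)

    print(flesch_kincaid_grade("The cat sat on the mat. It was warm."))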

  7. Experimental Approach • 120 human annotators were shown 100 code snippets • The resulting 12,000 readability judgments are available online (ok, not really)

  8. Snippet Selection - Goals • Length • Short enough to aid feature discrimination • Long enough to capture important readability considerations • Logical Coherence • Shouldn't span methods • Include adjacent comments • Avoid trivial snippets (e.g., a group of import statements)

  9. Snippet Selection - Algorithm • Snippet = 3 consecutive simple statements • Based on authors' experience • Simple statements are: declarations, assignments, function calls, breaks, continues, throws and returns • Other nearby statements are included: • Comments, function headers, blank lines, if, else, try-catch, switch, etc. • Snippets cannot cross scope boundaries
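A faithful implementation works over the Java parse tree. The Python sketch below only approximates the idea at the line level, using a crude regular expression to flag "simple" statements and closing a snippet once it has collected three of them; it is a heuristic illustration, not the authors' tool:

    import re

    # Crude line-level stand-in for "simple statements" (declarations,
    # assignments, calls, break/continue/throw/return); the real tool
    # works over the Java AST, not raw lines.
    SIMPLE = re.compile(r"^\s*(return|break|continue|throw)\b|;\s*$")

    def snippets(java_lines, simple_per_snippet=3):
        """Group lines into snippets that each contain three 'simple'
        statements plus the surrounding comments, headers and blank lines."""
        current, count = [], 0
        for line in java_lines:
            current.append(line)
            if SIMPLE.search(line):
                count += 1
            if count == simple_per_snippet:
                yield "\n".join(current)
                current, count = [], 0

    example = [
        "// accumulate the order total",
        "int total = 0;",
        "total += price;",
        "return total;",
    ]
    for s in snippets(example):
        print(s)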

  10. Readability Scoring • Readability was rated from 1 to 5 • 1 - “less readable” • 3 - “neutral” • 5 - “more readable”

  11. Inter-annotator agreement • Good correlation needed for a coherent model • Pearson product-moment correlation coefficient • Correlation of 1 indicates perfect agreement • Correlation of 0 indicates no better than random agreement • Calculated pairwise for all annotators • Average correlation of 0.56 • Typically considered “moderate to strong”
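A minimal sketch of the pairwise agreement computation, using hypothetical 1-5 scores and only the Python standard library:

    from itertools import combinations
    from statistics import mean

    def pearson(xs, ys):
        # Pearson product-moment correlation of two equal-length score lists.
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # Hypothetical 1-5 readability scores: one row per annotator,
    # one column per snippet.
    scores = [
        [5, 4, 2, 1, 3],
        [4, 4, 1, 2, 3],
        [5, 3, 2, 1, 4],
    ]

    pairwise = [pearson(a, b) for a, b in combinations(scores, 2)]
    print("average inter-annotator correlation:", mean(pairwise))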

  12. Readability Model Objective: • Mechanically predict human readability judgments • Determine code features that are predictive of readability Usage: • Use this model to analyze code (an automated software readability metric)

  13. Model Generation • Classifier: a machine learning algorithm • Instance: the feature vector extracted from a snippet • Experiment procedure - training phase: a set of instances labeled with the “correct answer” - labels are derived from the (bimodal) distribution of averaged human scores
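Because the averaged human scores are roughly bimodal, each snippet's training label can be obtained by thresholding its mean score. A minimal sketch; the exact cutoff used in the paper is an assumption here:

    from statistics import mean

    # Hypothetical averaged human scores, one per snippet.
    snippet_means = [4.2, 1.8, 3.9, 2.1, 4.5]

    # Binarize around the overall mean: above it -> "more readable".
    # (The precise cutoff is an assumption for illustration.)
    cutoff = mean(snippet_means)
    labels = ["more readable" if m > cutoff else "less readable"
              for m in snippet_means]
    print(list(zip(snippet_means, labels)))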

  14. Model Generation (contd …) • Decide on a set of features that can be detected statically • These features relate to the structure, density, logical complexity, and documentation of the analyzed code • Each feature is independent of the size of the block of code being analyzed
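A minimal sketch of such a per-snippet feature vector. The names follow the general categories above rather than the paper's exact feature set, and the identifier matching is deliberately crude:

    import re

    def features(snippet):
        """Compute a few size-independent surface features of a snippet
        (averages and densities rather than raw counts)."""
        lines = snippet.splitlines() or [""]
        identifiers = re.findall(r"\b[A-Za-z_]\w*\b", snippet)
        return {
            "avg_line_length": sum(len(l) for l in lines) / len(lines),
            "avg_identifiers_per_line": len(identifiers) / len(lines),
            "avg_identifier_length": (sum(len(i) for i in identifiers) / len(identifiers)
                                      if identifiers else 0.0),
            "comment_density": sum("//" in l or "/*" in l for l in lines) / len(lines),
            "blank_line_density": sum(not l.strip() for l in lines) / len(lines),
        }

    print(features("int total = 0;\n// sum the prices\ntotal += price;"))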

  15. Model Generation (contd …) • Build a classifier on a set of features • Use 10-fold cross validation - random partitioning of data set into 10 subsets - train on 9 and test on 1 - repeat this process 10 times • Mitigate any bias from partitioning by repeating the 10-fold validation 20 times • Average the results across all of the runs
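A minimal sketch of that evaluation protocol using scikit-learn, with placeholder data and a placeholder learner; the paper compares several classifiers, and logistic regression here is only for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # X: one feature vector per snippet; y: 1 = "more readable", 0 = "less readable".
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                          # placeholder features
    y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)   # placeholder labels

    scores = []
    for repeat in range(20):                               # repeat 10-fold CV 20 times
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
        scores.extend(cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="f1"))

    print("mean f-measure over 200 folds:", np.mean(scores))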

  16. Results • Two relevant success metrics – precision & recall • Recall – the fraction of snippets judged “more readable” by annotators that the model also classifies as “more readable” • Precision – the fraction of snippets classified “more readable” by the model that annotators also judged “more readable” • Performance is measured by the f-measure statistic: the harmonic mean of precision and recall
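A minimal sketch of the two metrics and their harmonic mean, with hypothetical labels:

    def precision_recall_f1(human, model, positive="more readable"):
        """Precision: of the snippets the model calls 'more readable', the share
        the annotators agreed on.  Recall: of the snippets the annotators called
        'more readable', the share the model recovered.  F-measure: harmonic mean."""
        tp = sum(h == positive and m == positive for h, m in zip(human, model))
        fp = sum(h != positive and m == positive for h, m in zip(human, model))
        fn = sum(h == positive and m != positive for h, m in zip(human, model))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall, 2 * precision * recall / (precision + recall)

    human = ["more readable", "less readable", "more readable", "more readable"]
    model = ["more readable", "more readable", "less readable", "more readable"]
    print(precision_recall_f1(human, model))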

  17. Results (contd …) • 0.61 – f-measure of a baseline classifier trained on randomly generated score labels • 0.80 – f-measure of the classifier trained on averaged human data

  18. Results (contd …) • Repeated the experiment separately for each annotator experience group (100-, 200-, and 400-level courses, and graduate CS students)

  19. Observations from the performance measures • Average line length and average number of identifiers per line are important to readability • Average identifier length, loops, if constructs and comparison operators are not very predictive features

  20. Readability Correlations (Experiment 1) • Correlate defects detected by FindBugs* with the readability metric • Run FindBugs on benchmarks • Split the code into two sets: one containing at least one reported defect and one containing none • Run the trained classifier • Record the f-measure for “contains a bug” with respect to the classifier judgment of “less readable” *FindBugs – a popular static bug finding tool

  21. Readability Correlations (Experiment 2) • Correlates future code churn to readability • Uses readability to predict which functions will be modified between two successive releases of a program • A function is considered to have changed when its text is not exactly the same (even whitespace-only changes count)
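A minimal sketch of the change test between two releases, assuming function bodies have already been extracted into dictionaries keyed by a stable name; the extraction step is omitted, and the whitespace rule is an assumption based on the slide:

    def changed_functions(old_release, new_release):
        """Return the names of functions whose text is not exactly the same
        between two releases.  Whitespace differences count as changes here,
        matching the slide; the exact rule in the paper may differ."""
        return {name for name, body in new_release.items()
                if old_release.get(name) != body}

    v1 = {"total": "int total(int p){ return p; }"}
    v2 = {"total": "int total(int p) { return p; }"}   # whitespace-only edit
    print(changed_functions(v1, v2))                    # -> {'total'}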

  22. Readability Correlations - Results • Average f-measure: 0.61 for Experiment 1 and 0.63 for Experiment 2

  23. Relating Metric to Software Life Cycle • Readability tends to change over a long period of time

  24. Relating Metric to Software Life Cycle (contd …) • Correlate project readability against project maturity (as reported by developers) “Projects that reach maturity tend to be more readable”

  25. Discussion • Identifier Length • No influence! • Long names can improve readability, but can also reduce it • Comments might be more appropriate • Authors' suggestion: improved IDEs and code inspections • Code Comments • Only moderately correlated • Being used to “make up for” ugly code? • Characters/identifiers per line • Strongly correlated • Just as long sentences are more difficult to understand, so are long lines of code • Authors' suggestion: keep lines short, even if it means breaking them up over several lines

  26. Related Work • Natural Language Metrics [R. F. Flesch 1948, R. Gunning 1952, J. P. Kincaid and E. A. Smith 1970, G. H. McLaughlin 1969] • Coding Standards [S. Ambler 1997, B. B. Bederson et al. 2002, H. Sutter and A. Alexandrescu 2004] • Style Checkers [T. Copeland 2005] • Defect Prediction [T. J. Cheatham et al. 1995, T. L. Graves et al. 2000, T. M. Khoshgoftaar et al. 1996]

  27. Future Work • Examine personal preferences • Create personal models • Models based on application domain • Broader features • e.g., number of statements in an if block • IDE integration • Explore minimum set of predictive features

  28. Conclusion • Created a readability metric based on a specific set of human annotators • This metric: • agrees with the annotators as much as they agree with each other • has significant correlation with conventional metrics of software quality • Examining readability could improve language design and engineering practice
