Finding Similar Defects Using Synonymous Identifier Retrieval

Finding Similar Defects Using Synonymous Identifier Retrieval Norihiro Yoshida, Takeshi Hattori, Katsuro Inoue Osaka University, Japan

Similar code fragment SF1 Similar code fragment SF2 Similar Code fragment • One of factors that make software maintenance more difficult Source file B Source file A Modify it Code fragment CF It is necessary to determine whether or not modify them It is necessary to develop automatic code retrieval tool based on code similarity

Key Idea • In many cases, code fragments involving similar identifier names have the similar functionalities. • e.g., type, variable, function names • Developers often need to inspect those code fragments simultaneously. It is necessary to develop automatic code retrieval tool based on identifier similarity

SC-Retriever: Code retrieval tool based on identifier similarity • Retrieves code fragments that are similar to a query code fragment • Based on identifier similarity • e.g., type, variable, function name • Determines synonymous words in target source files Query Code Fragment Similar code fragments Identifier extraction Retrieval Target source files Identifier extraction Synonymous identifier determination

Why should we determine synonymous words in source code? • SC-Retriever needs to identify a set of code fragments that have similar functionalities • Different developer often uses different identifier names even if they implement the same functionalities It is necessary to determine synonymous words for identifying code fragments that have similar functionalities.

How to determine synonymous words? • We use an automatic synonymous words determination technique in NLP. • Dagan’s method[1], which is based on co-occurrence relation and do not use thesauruses and dictionaries. • His method detects a set of synonymous words often occurs a similar set of words in statements. • e.g., “Kids play soccer”, “Children play soccer” • Note that we should set threshold for synonymous words determination. Both “kids” and “children” co-occur with a set of words “play“ “soccer”.  They are synonymous. [1] I. Dagan, L. Lee, and F. C. N. Pereira. Similarity-based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43–69, 1999.

How to match a query code fragment with target source files? • if code fragments have the same or synonymous identifiers as the query identifiers,…. • those code fragments are extracted as similar code fragments from the target source files Identifiers in query code fragment host alloc add host alloc add node node Identifiers in target source files

Case Study • Overview • conduct a case study with SC-Retriever and CCFinder • retrieve defective functions in 2 software systems • compare the efficiency of the retrieval • Target Systems • Canna(90KLOC, 2361 functions) • client-server Japanese character input system • Ver. 3.6 involves 19 buffer overflow defects • Those defects exist in 18 functions. • SPARS-J (36KLOC, 859 functions)

Experimental Step • choose defective code fragments from Canna source code • retrieve C functions in Canna source code. • SC-Retriever • we give 3 chosen code fragments as the queries. • CCFinder • we detect code clones for those 3 code fragments. • calculate the precisions, recalls and F-scores with the retrieved results and the bug records • F-score is harmonic average between precision and recall.

Resultant precisions, recalls, and F-scores • We set threshold for synonymous words determination, to 0.1 and 0.2. • If th is set to high value, a lot of synonymous words are detected • F-Scores of SC-Retriever are higher than those of CCFinder. • Recalls of SC-Retriever are relatively high • Precisions of CCFinder are relatively high • The results of SC-Retriever depends on queries and th

Future Work • Further case studies on defects in other software systems. • Code clone detection tool based on synonymous words determination • Method to calculate code fragment ranking based on identifier similarity • Other methods to determine synonymous words • LSI, dictionary, or thesauruses based method

Finding Similar Defects Using Synonymous Identifier Retrieval

Finding Similar Defects Using Synonymous Identifier Retrieval

Presentation Transcript

Synonymous Paraphrasing Using WordNet and Internet

Finding Similar Music Artists for Recommendation

Finding Similar Music Artists for Recommendation

Retrieval of Similar Electronic Health Records using UMLS Concept Graphs

Color-Based Object Identifier Using LVQ

Finding Unknown Lengths in Similar Figures

Using Similar Triangles

Finding Similar Items

Finding Similar Sets

Proofs using Similar Triangles

Finding Similar Items: Locality Sensitive Hashing

Finding Similar Music Artists for Recommendation

Content-Based Retrieval of Similar Radiological Images

Using Similar Figures

Finding Similar Image Quickly Using Object Shapes

Parenthetical/Synonymous

Finding Similar Items

Finding Similar Items

Finding Similar Items

Finding Similar Items

Proofs using Similar Triangles

3.4: Using Similar Triangles