Presentation 5

Presentation 5. Cross Language Clone Analysis Team 2 October 27, 2010. Agenda. Current Tasks GOLD Parsing System Grammar Update Clone Analysis Team Collaboration Path Forward. Our Team. Allen Tucker Patricia Bradford Greg Rodgers Brian Bentley Ashley Chafin. Current Tasks.


  1. Presentation 5 Cross Language Clone Analysis Team 2 October 27, 2010

  2. Agenda • Current Tasks • GOLD Parsing System • Grammar Update • Clone Analysis • Team Collaboration • Path Forward

  3. Our Team • Allen Tucker • Patricia Bradford • Greg Rodgers • Brian Bentley • Ashley Chafin

  4. Current Tasks What we are tackling…

  5. Current Tasks (Review) • Current tasks created for the first user story “Source Code Load & Translate”: • Load & parse C# source code. • Load & parse JAVA source code. • Load & parse C++ source code. • Translate the parsed C# source code to CodeDOM. • Translate the parsed JAVA source code to CodeDOM. • Translate the parsed C++ source code to CodeDOM. • Associate the CodeDOM to the original source code.

  6. UML Model – Load & Parse

  7. UML Model – Translate

  8. UML Model – Associate

  9. GOLD Parsing System GOLD Parsing Populating CodeDOM

  10. Topics To Discuss • What are we doing? • Compiled Grammar Table • Bookkeeping • Testing

  11. How It Works (Block Structure) [Diagram labels: Source Code, Grammar, Builder, Compiled Grammar Table (*.cgt), Engine, Parsed Data]

  12. How It Works (Process) [Diagram labels: Source Code, Grammar, Builder, Compiled Grammar Table (*.cgt), Engine, Parsed Data] • Typical output from the engine: a long nested tree

  13. Usage within CloneDigger [Diagram labels: Source Code, Compiled Grammar Table (*.cgt), Engine, Parsed Data, CodeDOM Conversion, AST] • CodeDOM Conversion • We need to write a routine to move data from the parsed tree to CodeDOM • Parsed data trees from the parser are stored in a consistent data structure, but their shape depends on the rules defined within each grammar
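
A conversion routine of this kind could be sketched as a depth-first walk that dispatches on the production rule name. This is only an illustration, not the team's implementation: the tuple-based tree shape, the rule names (`ReturnStatement`, `AddExpression`), and the dictionary model standing in for CodeDOM are all assumptions made for the example.

```python
def convert(node, handlers, unhandled):
    """Depth-first conversion of a parse-tree node into a model node.

    A node is either a string (terminal token) or a (rule_name, children)
    tuple. Both shapes are assumptions for this sketch, not GOLD's format.
    """
    if isinstance(node, str):
        return node  # terminal: keep the token text as-is
    rule, children = node
    converted = [convert(child, handlers, unhandled) for child in children]
    handler = handlers.get(rule)
    if handler is None:
        unhandled.add(rule)  # remember rules that still need a handler
        return {"kind": "Unknown", "rule": rule, "children": converted}
    return handler(converted)


# Hypothetical handlers: one function per production rule.
HANDLERS = {
    "ReturnStatement": lambda parts: {"kind": "Return", "value": parts[1]},
    "AddExpression": lambda parts: {
        "kind": "BinaryOp", "op": "+", "left": parts[0], "right": parts[2],
    },
}

tree = ("ReturnStatement", ["return", ("AddExpression", ["a", "+", "b"]), ";"])
missing = set()
model = convert(tree, HANDLERS, missing)
```

Because the handlers are keyed by rule name, the set of unhandled rules collected during a run shows directly which production rules still lack a conversion.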

  14. Compiled Grammar Table • For Java, there are… • 359 production rules • 249 distinct symbols (terminal & non-terminal) • For C#, there are… • 415 production rules • 279 distinct symbols (terminal & non-terminal)

  15. Production Rule Dependencies

  16. Our Grammar Bookkeeping Since there are so many production rules, we came up with the following bookkeeping: • A spreadsheet of the compiled grammar table (one for each language) with each production rule indexed. • This spreadsheet covers: • various aspects of the language • what we have/have not handled from the parser • what we have/have not implemented in CodeDOM • percentage complete

  17. Our Grammar Bookkeeping

  18. Testing • White Box Testing: • Unit Testing • Black Box Testing: • Production Rule Testing • Allows us to test the robustness of our engine because we can force rule production errors. • Regression Testing • Automated

  19. Unit Testing

  20. Production Rule Test Input File Example

  21. Task Understanding • Three Step Process • Step 1: Code Translation • Step 2: Clone Detection • Step 3: Visualization [Diagram labels: Source Files, Translator, Common Model, Inspector, Detected Clones, Clone Visualization UI]

  22. Grammar Updates Java & C#

  23. Grammar Updates • The grammars we currently have for the GOLD Parser are outdated. • Current GOLD grammars • C# version 2.0 • Java version 1.4 • Currently available language versions • C# version 4.0 • Java version 6

  24. Grammar Updates • Available updated grammars • ANTLR has grammars updated to more recent versions of both C# and Java. • C# version 4.0 (latest version) • Java version 1.5 (second-to-latest version) • Currently we are attempting to transform the ANTLR grammars into GOLD Parser grammars.

  25. Grammar Update Issues • Grammars for C# and Java are very complex and require a lot of work to build. • ANTLR and GOLD Parser grammars use completely different syntax. • Positive note: other development is not halted by the use of older grammars.

  26. Clone Analysis Dr. Kraft’s Students’ Tool

  27. Software Clones • Software Clones: (Definitions from Wikipedia) • Duplicate code: a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. • Clones: sequences of duplicate code. • “Clones are segments of code that are similar according to some definition of similarity.” —Ira Baxter, 2002

  28. Software Clones (cont.) • How clones are created: • copy and paste programming • similar functionality, similar code • plagiarism

  29. Software Clones (cont.) • 3 Types of Clones: • Type 1: an exact copy without modifications (except for whitespace and comments). • Type 2: a syntactically identical copy • only variable, type, or function identifiers have been changed. • Type 3: a copy with further modifications • statements have been changed, added, or removed.
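
The first two clone types can be illustrated with a toy classifier. This is a minimal sketch under stated assumptions: the regex tokenizer, the small `KEYWORDS` set, and the function names are hypothetical, and the identifier blinding does not check that renaming is consistent. Type 3 is left undecided because it needs a similarity threshold.

```python
import re

# Assumption: a tiny keyword set; every other word (including type names)
# is treated as an identifier.
KEYWORDS = {"return", "if", "else", "for", "while"}
WORD = re.compile(r"[A-Za-z_]\w*")


def tokens_without_layout(code):
    """Drop // and /* */ comments, then split into words, numbers, symbols."""
    code = re.sub(r"//[^\n]*|/\*.*?\*/", " ", code, flags=re.DOTALL)
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def blind_identifiers(tokens):
    """Replace every non-keyword word with the placeholder ID."""
    return ["ID" if WORD.fullmatch(t) and t not in KEYWORDS else t for t in tokens]


def clone_type(a, b):
    """Return 1 or 2 for Type 1 / Type 2 clones, or None if neither applies."""
    ta, tb = tokens_without_layout(a), tokens_without_layout(b)
    if ta == tb:
        return 1
    if blind_identifiers(ta) == blind_identifiers(tb):
        return 2
    return None  # possibly Type 3; deciding that needs a distance measure


original = "int sum(int a, int b) { return a + b; }"
reformatted = "int sum(int a, int b) {\n  // add\n  return a + b;\n}"
renamed = "long total(long x, long y) { return x + y; }"
extended = "int sum(int a, int b) { log(a); return a + b; }"
```

Here `reformatted` differs from `original` only in whitespace and a comment, `renamed` only in identifiers and type names, and `extended` adds a statement.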

  30. Introduction (cont.) • Per our task, to find clones across different programming languages, we first have to convert the code from each language to a language-independent object model. • Some Language-Independent Object Models: • Dagstuhl Middle Metamodel (DMM) • Microsoft CodeDOM • Both of these models provide a language-independent object model for representing the structure of source code.

  31. Related Research • Detecting clones across multiple programming languages is on the cutting edge of research. • A preliminary version of this was done by Dr. Kraft and his students for C# and VB. • They compared the Mono C# parser (written in C#) to the Mono VB parser (written in VB). • Publication: • Nicholas A. Kraft, Brandon W. Bonds, Randy K. Smith: Cross-language Clone Detection. SEKE 2008: 54-59

  32. Dr. Kraft Approach • Token sequence of CodeDOM graphs with Levenshtein distance • The Levenshtein distance between two sequences is defined as the minimum number of edits needed to transform one sequence into the other • Performs Comparisons of code files • CodeDOM tree is tokenized • Based on Distances • Percentage of matching tokens in a sequence
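
The Levenshtein distance described above can be sketched over token sequences as follows. This is a standard dynamic-programming formulation, not the code of Dr. Kraft's tool; the `similarity` score and the example token names are assumptions added for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical score: 1.0 for identical sequences, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Made-up token sequences standing in for two tokenized CodeDOM trees.
csharp_tokens = ["method", "param", "assign", "return"]
vb_tokens = ["method", "param", "assign", "call", "return"]
distance = levenshtein(csharp_tokens, vb_tokens)
score = similarity(csharp_tokens, vb_tokens)
```

The two example sequences differ by one inserted token, so the distance is 1 and the score is 0.8. Comparing every pair of files this way is quadratic in sequence length per pair, which is one reason a brute-force comparison scales poorly.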

  33. Dr. Kraft Approach (cont)

  34. Limitations • Only does file-to-file comparisons • Does not detect clones in same source file • Can only detect Type 1 and some Type 2 clones • Not very efficient (brute force)

  35. Enhancements • Split into parameter (identifiers and literals) and non-parameter tokens • Non-parameter tokens summarized using a hash function • Parameter tokens are encoded using a position index for their occurrence in the sequence • Abstracts concrete names and values while maintaining order
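
One possible reading of this encoding is sketched below. The `(kind, text)` token shape, the kind names, the use of CRC32 as the hash, and the choice to number parameters by order of first occurrence are all assumptions for the example; the slide does not specify them.

```python
import zlib

# Assumption: tokens arrive as (kind, text) pairs with these kind names.
PARAMETER_KINDS = {"identifier", "literal"}


def encode(tokens):
    """Abstract concrete names and values while keeping token order.

    Parameter tokens become ("P", n), where n is the order in which that
    text first appeared. Other tokens become ("N", hash of the text).
    """
    first_seen = {}
    encoded = []
    for kind, text in tokens:
        if kind in PARAMETER_KINDS:
            index = first_seen.setdefault(text, len(first_seen))
            encoded.append(("P", index))
        else:
            encoded.append(("N", zlib.crc32(text.encode("utf-8"))))
    return encoded


def statement(target, left, right):
    """Build the tokens of `target = left + right` (right is a literal)."""
    return [
        ("identifier", target), ("operator", "="),
        ("identifier", left), ("operator", "+"), ("literal", right),
    ]


same_a = encode(statement("x", "x", "1"))          # x = x + 1
same_b = encode(statement("total", "total", "5"))  # total = total + 5
different = encode(statement("x", "y", "1"))       # x = y + 1
```

The first two statements encode identically because their parameters repeat in the same pattern, while the third uses a different variable on the right-hand side and therefore produces a different sequence.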

  36. Enhancements (cont) • Represent all suffixes of the sequence in a suffix tree • Suffixes that share the same set of edges have a common prefix • A prefix that occurs more than once is a clone
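
The suffix tree idea can be illustrated with a naive, uncompressed suffix trie. This is a simplification swapped in for readability: it takes quadratic time and space, whereas a real suffix tree is compressed and can be built in linear time. The class and function names are hypothetical, and overlapping occurrences are not filtered out.

```python
class Node:
    def __init__(self):
        self.children = {}  # token -> child Node
        self.starts = []    # start index of every suffix passing through here


def build_suffix_trie(sequence):
    """Insert every suffix of the sequence into a trie."""
    root = Node()
    for start in range(len(sequence)):
        node = root
        for item in sequence[start:]:
            node = node.children.setdefault(item, Node())
            node.starts.append(start)
    return root


def repeated_runs(root, min_length):
    """Return (run, start positions) for the deepest paths that are shared by
    at least two suffixes and are at least min_length tokens long."""
    found = []
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        shared = [(item, child) for item, child in node.children.items()
                  if len(child.starts) > 1]
        if not shared and len(path) >= min_length and len(node.starts) > 1:
            found.append((path, sorted(node.starts)))
        for item, child in shared:
            stack.append((child, path + (item,)))
    return found


sequence = ["a", "b", "c", "x", "a", "b", "c"]
clones = repeated_runs(build_suffix_trie(sequence), min_length=3)
```

For the example sequence, the only run of three or more tokens that occurs twice is `a b c`, found at positions 0 and 4.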

  37. Team Collaboration Team 2 & Team 3

  38. Team Collaboration Team 2 & Team 3 • Team 2 • We plan to start giving Team 3 periodic drops of our source code for Java and C# parsing. • We are researching and working to update the Java and C# grammars. • Team 3 • Team 3 is working on C++ parsing. • They are looking into another parser, ELSA.

  39. Path Forward Next Iteration & Schedule

  40. Path Forward • Finalize Iteration 1 (C++ to CodeDOM) • Iteration 2 (Code Analysis) • Iteration 3 (Begin GUI)

  41. Schedule