Presentation 5

Presentation 5 Cross Language Clone Analysis Team 2 October 27, 2010

Agenda • Current Tasks • GOLD Parsing System • Grammar Update • Clone Analysis • Team Collaboration • Path Forward

Our Team • Allen Tucker • Patricia Bradford • Greg Rodgers • Brian Bentley • Ashley Chafin

Current Tasks What we are tackling…

Current Tasks (Review) • Current tasks created for the first user story “Source Code Load & Translate”: • Load & parse C# source code. • Load & parse Java source code. • Load & parse C++ source code. • Translate the parsed C# source code to CodeDOM. • Translate the parsed Java source code to CodeDOM. • Translate the parsed C++ source code to CodeDOM. • Associate the CodeDOM to the original source code.

UML Model – Load & Parse

UML Model – Translate

UML Model – Associate

GOLD Parsing System GOLD Parsing Populating CodeDOM

Topics To Discuss • What are we doing? • Compiled Grammar Table • Bookkeeping • Testing

How It Works (Block Structure) [Diagram labels: Source Code, Grammar, Builder, Compiled Grammar Table (*.cgt), Engine, Parsed Data]

How It Works (Process) [Diagram labels: Source Code, Grammar, Builder, Compiled Grammar Table (*.cgt), Engine, Parsed Data] • Typical output from the engine: a long nested tree

Usage within CloneDigger [Diagram labels: Source Code, Compiled Grammar Table (*.cgt), Engine, Parsed Data, CodeDOM Conversion, AST] • CodeDOM Conversion • We need to write a routine that moves data from the parsed tree to CodeDOM • Parsed data trees from the parser are stored in a consistent data structure, but their shape depends on the rules defined in each grammar
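The conversion routine described on this slide can be pictured as a recursive walk over the parse tree. The following is only a minimal sketch, not the team's implementation: `ParseNode`, `ModelNode`, the rule names in `RULE_TO_KIND`, and the helper `first_identifier` are all hypothetical stand-ins, and the real GOLD engine output and the real CodeDOM classes are not used here.

```python
from dataclasses import dataclass, field


@dataclass
class ParseNode:
    """Hypothetical parse-tree node: a grammar rule name plus children or token text."""
    rule: str
    children: list = field(default_factory=list)
    text: str = ""


@dataclass
class ModelNode:
    """Hypothetical language-independent node, standing in for a CodeDOM element."""
    kind: str
    name: str = ""
    members: list = field(default_factory=list)


# Assumed mapping from grammar rule names to model node kinds (illustrative only).
RULE_TO_KIND = {
    "ClassDecl": "TypeDeclaration",
    "MethodDecl": "MemberMethod",
}


def first_identifier(node):
    """Return the text of the first direct Identifier child, or an empty string."""
    for child in node.children:
        if child.rule == "Identifier":
            return child.text
    return ""


def convert(node):
    """Convert a parse tree into a list of model nodes.

    Rules with a mapping become model nodes. Unmapped rules are transparent:
    their converted descendants are passed up to the nearest mapped ancestor.
    """
    converted_children = []
    for child in node.children:
        converted_children.extend(convert(child))
    kind = RULE_TO_KIND.get(node.rule)
    if kind is None:
        return converted_children
    return [ModelNode(kind, first_identifier(node), converted_children)]
```

Because the mapping is keyed by rule name, each grammar would need its own table, which fits the note above that tree shape depends on the grammar's rules.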

Compiled Grammar Table • For Java, there are… • 359 production rules • 249 distinct symbols (terminal & non-terminal) • For C#, there are… • 415 production rules • 279 distinct symbols (terminal & non-terminal)

Production Rule Dependencies

Our Grammar Bookkeeping Since there are so many production rules, we came up with the following bookkeeping: • A spreadsheet of the compiled grammar table (for each language) with each production rule indexed. • This spreadsheet covers: • various aspects of language • what we have/have not handled from the parser • what we have/have not implemented into CodeDOM • percentage complete

Our Grammar Bookkeeping

Testing • White Box Testing: • Unit Testing • Black Box Testing: • Production Rule Testing • Allows us to test the robustness of our engine because we can force rule production errors. • Regression Testing • Automated

Unit Testing

Production Rule Test Input File Example

Task Understanding • Three Step Process • Step 1 Code Translation • Step 2 Clone Detection • Step 3 Visualization [Diagram labels: Source Files, Translator, Common Model, Inspector, Detected Clones, UI, Clone Visualization]

Grammar Updates Java & C#

Grammar Updates • The grammars we currently have for the GOLD parser are outdated. • Current GOLD grammars • C# version 2.0 • Java version 1.4 • Currently available language versions • C# version 4.0 • Java version 6

Grammar Updates • Available updated grammars • ANTLR has grammars updated to more recent versions of both C# and Java. • C# version 4.0 (latest version) • Java version 1.5 (second to latest version) • Currently we are attempting to transform the ANTLR grammars into GOLD parser grammars.

Grammar Update Issues • Grammars for C# and Java are very complex and require a lot of work to build. • ANTLR and GOLD parser grammars use completely different syntax. • Positive note: other development is not halted by the use of older grammars.

Clone Analysis Dr. Kraft’s Students’ Tool

Software Clones • Software Clones: (Definitions from Wikipedia) • Duplicate code: a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. • Clones: sequences of duplicate code. • “Clones are segments of code that are similar according to some definition of similarity.” —Ira Baxter, 2002

Software Clones (cont.) • How clones are created: • copy and paste programming • similar functionality, similar code • plagiarism

Software Clones (cont.) • 3 Types of Clones: • Type 1: an exact copy without modifications (except for whitespace and comments). • Type 2: a syntactically identical copy • only variable, type, or function identifiers have been changed. • Type 3: a copy with further modifications • statements have been changed, added, or removed.

Introduction (cont.) • Per our task, in order to find clones across different programming languages, we first have to convert the code from each language to a language-independent object model. • Some language-independent object models: • Dagstuhl Middle Metamodel (DMM) • Microsoft CodeDOM • Both of these models provide a language-independent object model for representing the structure of source code.

Related Research • Detecting clones across multiple programming languages is on the cutting edge of research. • A preliminary version of this was done by Dr. Kraft and his students for C# and VB. • They compared the Mono C# parser (written in C#) to the Mono VB parser (written in VB). • Publication: • Nicholas A. Kraft, Brandon W. Bonds, Randy K. Smith: Cross-language Clone Detection. SEKE 2008: 54-59

Dr. Kraft Approach • Compares token sequences of CodeDOM graphs using Levenshtein distance • The Levenshtein distance between two sequences is the minimum number of edits (insertions, deletions, or substitutions) needed to transform one sequence into the other • Performs comparisons of code files • CodeDOM tree is tokenized • Based on distances • Percentage of matching tokens in a sequence
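To make the distance-based comparison concrete, here is a small sketch of Levenshtein distance over token sequences. The function names and the `similarity` normalisation are assumptions made for illustration; the slide only says the comparison is based on distances and a percentage of matching tokens, so the exact scoring used by the tool may differ.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b.

    Works on any two sequences of comparable items (strings or token lists).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed score in [0, 1]: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Two hypothetical token sequences that differ by one inserted token.
left = ["Method", "Param", "Return"]
right = ["Method", "Param", "Assign", "Return"]
print(levenshtein(left, right), similarity(left, right))
```

Comparing every file against every other file this way costs time proportional to the product of the two sequence lengths per pair, which is one way to read the brute-force limitation noted on the next slides.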

Dr. Kraft Approach (cont)

Limitations • Only does file-to-file comparisons • Does not detect clones in same source file • Can only detect Type 1 and some Type 2 clones • Not very efficient (brute force)

Enhancements • Split tokens into parameter tokens (identifiers and literals) and non-parameter tokens • Non-parameter tokens are summarized using a hash function • Parameter tokens are encoded using a position index for their occurrence in the sequence • This abstracts concrete names and values while maintaining order
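One possible reading of this encoding is sketched below. It is an assumption-laden illustration: tokens are modelled as hypothetical `(kind, value)` pairs, the set `PARAMETER_KINDS` is invented for the example, and "position index" is interpreted as the order in which each distinct parameter value first appears. Other interpretations, such as the distance to the previous occurrence, are equally plausible.

```python
import hashlib

# Assumed token kinds that count as parameters (identifiers and literals).
PARAMETER_KINDS = {"identifier", "literal"}


def encode(tokens):
    """Encode (kind, value) tokens so that consistent renaming gives the same result.

    Parameter tokens become ("P", n), where n is the order of first appearance
    of that value. Non-parameter tokens become ("N", short hash of the value).
    """
    first_seen = {}
    encoded = []
    for kind, value in tokens:
        if kind in PARAMETER_KINDS:
            index = first_seen.setdefault(value, len(first_seen))
            encoded.append(("P", index))
        else:
            digest = hashlib.sha1(value.encode("utf-8")).hexdigest()[:8]
            encoded.append(("N", digest))
    return encoded


# "int x = y + x" and "int a = b + a" differ only in identifier names.
original = [("keyword", "int"), ("identifier", "x"), ("operator", "="),
            ("identifier", "y"), ("operator", "+"), ("identifier", "x")]
renamed = [("keyword", "int"), ("identifier", "a"), ("operator", "="),
           ("identifier", "b"), ("operator", "+"), ("identifier", "a")]
print(encode(original) == encode(renamed))
```

Under this reading, two fragments that differ only by a consistent renaming encode identically, while a fragment that reuses its names in a different pattern does not.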

Enhancements (cont) • Represent all suffixes of the sequence in a suffix tree • Suffixes that share the same set of edges have a common prefix • If a prefix occurs more than once, it is a clone
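The idea that a prefix shared by two suffixes marks a repeated run can be shown without building a full suffix tree. The sketch below deliberately swaps in a simpler technique, naively sorted suffixes (a basic suffix array), which exposes the same shared prefixes but is far less efficient than a real suffix tree. The function names and the `min_length` threshold are assumptions for illustration.

```python
def common_prefix_length(a, b):
    """Number of leading items that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_runs(tokens, min_length=3):
    """Find repeated token runs by comparing neighbouring sorted suffixes.

    Returns (length, first_start, second_start) tuples, longest first.
    Overlapping occurrences are not filtered out in this sketch.
    """
    # Sort suffix start positions by the suffix they begin (naive, not linear time).
    order = sorted(range(len(tokens)), key=lambda i: tokens[i:])
    runs = []
    for a, b in zip(order, order[1:]):
        length = common_prefix_length(tokens[a:], tokens[b:])
        if length >= min_length:
            runs.append((length, min(a, b), max(a, b)))
    return sorted(runs, reverse=True)


# The run a, b, c appears at positions 0 and 4.
sequence = ["a", "b", "c", "x", "a", "b", "c", "y"]
print(repeated_runs(sequence))
```

Unlike the file-to-file comparison, this works on a single sequence, so repeated runs inside one file are reported as well.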

Team Collaboration Team 2 & Team 3

Team Collaboration Team 2 & Team 3 • Team 2 • We plan to start giving Team 3 periodic drops of our source code for Java and C# parsing. • We are researching and working to update the Java and C# grammars. • Team 3 • Team 3 is working on C++ parsing. • They are looking into another parser, ELSA.

Path Forward Next Iteration & Schedule

Path Forward • Finalize Iteration 1 (C++ to CodeDOM) • Iteration 2 (Code Analysis) • Iteration 3 (Begin GUI)

Schedule
