Mike Joy, 25 February 2010

New Approaches for Detecting Similarities in Program Code



Overview of Talk

  • What is the Problem?

  • Historical Overview

  • New Approaches

  • Where Next?


Part 1 – What is the Problem?

Document similarity

What do we mean?

Why is software an issue?

Why is this interesting?


Four stages

Collection

 Detection

 Confirmation

 Investigation

From Culwin and Lancaster (2002).


Stage 1: Collection

Get all documents together online

so they can be processed

formats?

security?

BOSS (Warwick)

Coursemaster (Nottingham)

Managed Learning Environment


Stage 2: Detection

Compare with other submissions

Compare with external documents

essay-based assignments

We’ll come back to this later

it’s the interesting bit!


Stage 3: Confirmation

Software tool says “A and B similar”

Are they?

Never rely on a computer program!

Requires expert human judgement

Evidence must be compelling

Might go to court


Stage 4: Investigation

A from B, or B from A, or joint work?

If A from B, did B know?

open networked file

printer output

Did the culprit/s understand?

University processes must be followed


Why is this Interesting?

How do you compare two programs?

This is an algorithm question

Stages 2 and 3: detection and confirmation

How do you use the results (of a comparison) to educate students?

This is a pedagogic question

Stage 4, and before stage 1!


Digression: Essays

Plagiarism in essays is easier to detect

Lots of “tricks” a lecturer can use!

Google search on phrases

Abnormal style

... etc.

Software tools

Let's have a look ...


Pedagogy

Can be used by academics to

detect plagiarism

provide evidence

Can be used by students to

check their own work


Part 2 – Historical Overview

How has similar code been detected in the past?

How well do the approaches work?


Why not use Turnitin?

It won’t work!

String matching algorithm inappropriate

Database does not contain code

Commercial involvement

E.g. Black Duck Software


/* Program 1 */
public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}

/* Program 2 */
public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}


Is This Plagiarism?

Is Program 2 derived from Program 1 in a manner which is “plagiarism”?

Probably No

It's too simple

Too many copies in books / on the web

Most of it is generic syntax


Program 3

(Source code for MS Windows 7)

Program 4

(code 98% identical to the source code for MS Windows 7)


Is This Plagiarism?

Is Program 4 derived from Program 3 in a manner which is “plagiarism”?

Definitely Yes

It's too complicated to happen by chance

Millions of lines of code

The source is “closed”

Microsoft guard it very well!


/* Program 5 */
public class Sun {
    static final double latitude=52.4;
    static final double longitude=-1.5;
    static final double tpi = 2.0*pi;
    /* ... */
    public static void main(String[] args) { calculate(); }
    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    };
    public static void calculate() { /* ... */ }
    /* ... */

/* Program 6 */
public class SunsetCalculator {
    static float latitude=52.4;
    static float longitude=-1.5;
    /* ... */
    public static void main(String[] args) { findSunsetTime(); }
    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159 * (x - (int)(x));
        if (y < 0) y = 2*3.14159 + y;
        return y;
    };
    public static void findSunsetTime() { /* ... */ }
    /* ... */


Is This Plagiarism?

Is Program 6 derived from Program 5 in a manner which is “plagiarism”?

Maybe

Structure is similar – cosmetic changes

But the algorithm is public domain

Maybe 6 derived from 5, maybe the other way round


History ...

First known plagiarism detection system was an attribute counting program developed by Ottenstein (1976)

More recent systems compare the structure of source-code programs

Structure-based systems include: YAP3, MOSS, JPlag, Plague, and Sherlock.


Detection Tools (1)

Attribute counting systems (Halstead, 1972):

Numbers of unique operators

Numbers of unique operands

Total numbers of operator occurrences

Total numbers of operand occurrences
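
As a rough illustration of attribute counting, the sketch below computes the four counts listed above for a fragment of Java. The class name AttributeCounter, the token pattern, and the choice to treat keywords as operands are assumptions made for this example; they are not taken from Halstead's definition or from any detection tool.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative attribute counter (hypothetical, not any tool's metric set). */
final class AttributeCounter {
    // Assumption: operands are identifiers or numeric literals; operators are
    // a small hand-picked set of symbols. Punctuation such as ';' is ignored.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_][A-Za-z_0-9]*|\\d+(?:\\.\\d+)?|\\+\\+|--|[-+*/]=|[<>=!]=|[-+*/=<>]");

    /** Returns {unique operators, unique operands, total operators, total operands}. */
    static int[] count(String source) {
        Set<String> uniqueOperators = new HashSet<>();
        Set<String> uniqueOperands = new HashSet<>();
        int totalOperators = 0;
        int totalOperands = 0;
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String token = m.group();
            char first = token.charAt(0);
            if (Character.isLetterOrDigit(first) || first == '_') {
                uniqueOperands.add(token);
                totalOperands++;
            } else {
                uniqueOperators.add(token);
                totalOperators++;
            }
        }
        return new int[] {
            uniqueOperators.size(), uniqueOperands.size(), totalOperators, totalOperands
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(count("ans = ans * j + 1;")));
        System.out.println(Arrays.toString(count("x = x * k + 2;")));
    }
}
```

Renaming identifiers leaves all four counts unchanged, so the two statements in main produce the same profile.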


Detection Tools (2)

Structure-based systems:

Each program is converted into token strings (or something similar)

Token streams are compared for determining similar source-code fragments

Tools: JPlag, MOSS, and Sherlock


Example (code 1)

int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}


Example (code 2)

Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}


Example (tokenised)

type name(type name) start
    type name=number
    loop (type name=number
          name compare number
          operation name) start
        name operation name
    end
    return name
end
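
The token stream above can be approximated with a small lexer. This is a minimal sketch under stated assumptions: the class name SimpleTokeniser, the list of type names, and the regular expression are invented for illustration and do not reproduce the tokeniser of JPlag, MOSS or Sherlock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Maps Java-like source text onto generic tokens (hypothetical sketch). */
final class SimpleTokeniser {
    // Assumption: a tiny fixed set of type names is enough for the example.
    private static final Set<String> TYPES =
        Set.of("int", "float", "double", "long", "String", "Integer");
    private static final Set<String> LOOPS = Set.of("for", "while");
    private static final Pattern LEXEME = Pattern.compile(
        "[A-Za-z_][A-Za-z_0-9]*|\\d+(?:\\.\\d+)?|\\+\\+|--|[-+*/]=|[<>=!]=|[-+*/<>]|[=(){}]");

    static List<String> tokenise(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            String lexeme = m.group();
            char first = lexeme.charAt(0);
            if (TYPES.contains(lexeme)) {
                tokens.add("type");
            } else if (LOOPS.contains(lexeme)) {
                tokens.add("loop");
            } else if (lexeme.equals("return")) {
                tokens.add("return");
            } else if (Character.isLetter(first) || first == '_') {
                tokens.add("name");              // identifier names are discarded
            } else if (Character.isDigit(first)) {
                tokens.add("number");            // literal values are discarded
            } else if (lexeme.equals("{")) {
                tokens.add("start");
            } else if (lexeme.equals("}")) {
                tokens.add("end");
            } else if (lexeme.matches("[<>]=?|[=!]=")) {
                tokens.add("compare");
            } else if (lexeme.equals("=") || lexeme.equals("(") || lexeme.equals(")")) {
                tokens.add(lexeme);              // kept literally, as on the slide
            } else {
                tokens.add("operation");         // ++, --, *=, +, ...
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String code = "int calculate(String arg) { int ans=0; "
            + "for (int j=1; j<=100; j++) { ans *= j; } return ans; }";
        System.out.println(String.join(" ", tokenise(code)));
    }
}
```

With this mapping, a version of the second example that has its types, names and constants changed, and braces added around the loop body, yields the same token list as the first.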


Detectors

MOSS (Berkeley/Stanford, USA)

JPlag (Karlsruhe, Germany)

Java only

Programs must compile?

Sherlock (Warwick, UK)

MOSS and JPlag are Internet resources

Data Protection?


MOSS

Developed by Alex Aiken in 1994

MOSS (for a Measure Of Software Similarity) determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs.

MOSS is free, but you must create an account

MOSS home page:

http://theory.stanford.edu/~aiken/moss/


MOSS – Algorithm

“Winnowing” (Schleimer et al., 2003)

Local document fingerprinting algorithm

Efficiency proven (33% of lower bound)

Guarantees detection of matches longer than a certain threshold
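
A minimal sketch of the winnowing idea follows: hash every k-gram, slide a window of w consecutive hashes, and keep the minimum of each window as a fingerprint. Any substring of at least w + k - 1 characters shared by two documents then contributes at least one common fingerprint. The class name WinnowingSketch and the parameters k and w are illustrative, and String.hashCode is used here as a stand-in for a rolling hash; none of this is the MOSS implementation.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical winnowing sketch; parameters and hash function are assumptions. */
final class WinnowingSketch {
    /** Returns the selected fingerprints as a map from k-gram position to hash. */
    static Map<Integer, Integer> fingerprints(String text, int k, int w) {
        Map<Integer, Integer> selected = new TreeMap<>();
        int n = text.length() - k + 1;                 // number of k-grams
        if (n <= 0) {
            return selected;
        }
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            // Assumption: String.hashCode replaces the rolling hash.
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        int windows = Math.max(1, n - w + 1);
        for (int start = 0; start < windows; start++) {
            int end = Math.min(start + w, n);
            int minPos = start;
            for (int i = start + 1; i < end; i++) {
                if (hashes[i] <= hashes[minPos]) {
                    minPos = i;                        // ties: rightmost minimum
                }
            }
            selected.put(minPos, hashes[minPos]);      // same position is recorded once
        }
        return selected;
    }

    /** Hash values selected as fingerprints in both documents. */
    static Set<Integer> sharedHashes(String a, String b, int k, int w) {
        Set<Integer> shared = new HashSet<>(fingerprints(a, k, w).values());
        shared.retainAll(new HashSet<>(fingerprints(b, k, w).values()));
        return shared;
    }

    public static void main(String[] args) {
        String a = "aaaa int ans = 0; bbbb";
        String b = "zz int ans = 0; qq";
        System.out.println(sharedHashes(a, b, 5, 4).size() + " shared fingerprint(s)");
    }
}
```

In practice the input would be the preprocessed or tokenised program text rather than raw source, so that cosmetic changes do not alter the k-grams.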


Using MOSS

MOSS is provided as an Internet service

User must download MOSS Perl script for submitting files to the MOSS server

The script uses a direct network connection

The MOSS server produces HTML pages listing pairs of programs with similar code

MOSS highlights similar code-fragments within programs that appear the same

Data Protection? – US service

Maintenance?


JPlag

Developed by Guido Malpohl in 1996

JPlag currently supports Java, C#, C, C++, Scheme, and natural language text

Use of JPlag is free, but user must create an account

JPlag can be used to compare student assignments but does not compare with code on the Internet

JPlag home page: www.ipd.uni-karlsruhe.de/jplag


JPlag – Algorithm

1) Parse (or scan) programs

2) Convert programs to tokens

3) Pairwise compare

“Greedy String Tiling”

maximises percentage of common token strings

worst case Θ(n³), average case linear

Prechelt et al. (2002)
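
The pairwise comparison step can be sketched as a naive Greedy String Tiling loop: repeatedly find the longest runs of matching, still-unmarked tokens, mark them as tiles, and stop once no run exceeds the minimum match length. This is a simplified illustration, not JPlag's code: the class name GreedyStringTiling, the minMatch parameter and the similarity formula 2 × covered / (|A| + |B|) are assumptions for the example, and the hashing optimisations needed for good average-case performance are omitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive Greedy String Tiling over token lists (hypothetical sketch). */
final class GreedyStringTiling {
    /** Number of tokens of a (equivalently, of b) covered by tiles. */
    static int coveredTokens(List<String> a, List<String> b, int minMatch) {
        if (minMatch < 1) {
            throw new IllegalArgumentException("minMatch must be at least 1");
        }
        boolean[] markedA = new boolean[a.size()];
        boolean[] markedB = new boolean[b.size()];
        int covered = 0;
        int maxMatch;
        do {
            maxMatch = minMatch;
            List<int[]> matches = new ArrayList<>();
            // Phase 1: collect all maximal matches of unmarked tokens.
            for (int i = 0; i < a.size(); i++) {
                for (int j = 0; j < b.size(); j++) {
                    int len = 0;
                    while (i + len < a.size() && j + len < b.size()
                            && !markedA[i + len] && !markedB[j + len]
                            && a.get(i + len).equals(b.get(j + len))) {
                        len++;
                    }
                    if (len > maxMatch) {
                        matches.clear();
                        maxMatch = len;
                    }
                    if (len == maxMatch) {
                        matches.add(new int[] {i, j});
                    }
                }
            }
            // Phase 2: turn non-overlapping matches into tiles.
            for (int[] m : matches) {
                boolean free = true;
                for (int t = 0; t < maxMatch && free; t++) {
                    free = !markedA[m[0] + t] && !markedB[m[1] + t];
                }
                if (free) {
                    for (int t = 0; t < maxMatch; t++) {
                        markedA[m[0] + t] = true;
                        markedB[m[1] + t] = true;
                    }
                    covered += maxMatch;
                }
            }
        } while (maxMatch > minMatch);
        return covered;
    }

    /** Assumed similarity measure: share of tokens in both lists covered by tiles. */
    static double similarity(List<String> a, List<String> b, int minMatch) {
        int total = a.size() + b.size();
        return total == 0 ? 0.0 : 2.0 * coveredTokens(a, b, minMatch) / total;
    }

    public static void main(String[] args) {
        List<String> a = List.of("a", "b", "c", "d", "e");
        List<String> b = List.of("x", "c", "d", "e", "y", "a", "b");
        System.out.println(similarity(a, b, 2));
    }
}
```

Because tiles are matched regardless of position, moving a block of code within a file does not prevent the match from being found.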


JPlag File Processing


JPlag - Results

Results in HTML Format

Histogram of similarity values found for all pairs of programs

Similar pairs and their similarity values displayed

Select file pairs to view


JPlag - Matches

Similar lines matched with the same colour

Code fragment similarity values based on similar tokens found


Sherlock

Developed at the University of Warwick Department of Computer Science

Sherlock was fully integrated with the BOSS online submission software in 2002 and Open-Sourced

Sherlock detects plagiarism on source-code and natural language assignments

BOSS home page: www.boss.org.uk


Sherlock - Preprocessing

Whitespace

Comments

Normalisation

Tokenisation
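
The first three preprocessing steps might look like the sketch below. It is not Sherlock's code: the class name Preprocessor is hypothetical, "normalisation" is assumed here to mean lower-casing, and the comment stripping deliberately ignores comment markers that appear inside string literals.

```java
import java.util.Locale;

/** Hypothetical preprocessing pipeline: comments, whitespace, normalisation. */
final class Preprocessor {
    /** Removes block and line comments (naive: does not inspect string literals). */
    static String stripComments(String source) {
        return source
            .replaceAll("(?s)/\\*.*?\\*/", " ")   // block comments, across lines
            .replaceAll("//[^\\n]*", " ");        // line comments
    }

    /** Collapses every run of whitespace to a single space. */
    static String collapseWhitespace(String source) {
        return source.replaceAll("\\s+", " ").trim();
    }

    /** Assumption: normalisation is taken to mean lower-casing the text. */
    static String normalise(String source) {
        return collapseWhitespace(stripComments(source)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalise("int  X = 1; // set x\n/* block */ X++;"));
    }
}
```

Running each step separately and comparing the resulting versions of two files is one way a tool can tolerate different kinds of disguise.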


Sherlock – Results

Results displayed

Similarity values of suspicious files

Similarity values depend on the length of similar lines found as a percentage of the whole file size

Select suspicious matches to examine

Mark suspicious files


Sherlock – Matches

Suspected sections marked with

**begin suspicious section**

and

**end suspicious section**


Sherlock – Document Set

User can view graph

Each node represents one submission

An edge means two submissions are similar

Options to select threshold

Click on lines to view or to mark suspicious matches


CodeMatch

Commercial product

Free academic use for small data sets

Exact algorithm not published

patent pending?


Example of Identical “Instruction Sequences”

/* File 1 */
for (int i=1; i<10; i++) {
    if (a==10)
        print("done");
    else
        a++;
}

/* File 2 */
for (int x=100; x > 0; x--) {
    if (z99 > -10)
        print("ans is " + z99);
    else {
        abc += 65;
    }
}


CodeMatch – Algorithm

(1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions

(2) Extract comments, and compare

(3) Extract identifiers, and count similar; x, xxx, xx12345 are “similar”

(4) Combine (1), (2) and (3) to give correlation score


Heuristics

Comments

Spelling mistakes

Unusual English (Thai, German, …)

Use of Search Engines

Unusual style

Code errors


Tool Efficiency

MOSS, JPlag and Sherlock are effective

Results returned are similar

Results returned are not identical

User interface issues may be important


Part 3 – New Approaches

Eschew the “syntax driven” approach

Lateral thinking?

Case study: Latent Semantic Analysis


Digression: Similarity

What do we actually mean by “similar”?

This is where the problems start ...


(1) Staff Survey

We carried out a survey in order to:

gather the perceptions of academics on what constitutes source-code plagiarism, and

create a structured description of what constitutes source-code plagiarism from a UK academic perspective

Cosma and Joy (2008)


Data Source

On-line questionnaire distributed to 120 academics

Questions were in the form of small scenarios

Mostly multiple-choice responses

Comments box below each question

Anonymous – option for providing details

Received 59 responses, from more than 34 different institutions

Responses were analysed and collated to create a universally acceptable source-code plagiarism description.


Results

Grey areas include:

O-O templates

Inappropriate collaboration

Translating between (programming) languages

Re-use of work already submitted


Other Issues

Various issues on source-code plagiarism including:

Source-code reuse

Source-code self-plagiarism

Copying without adaptation

Copying with adaptation: minimal, moderate, extreme

Converting source to another language

Using code-generator software

Collusion

Obtaining source-code written by other authors

False and “pretend” references


(2) Student Survey

We carried out a survey (Joy et al., 2008) in order to:

gather the perceptions of students on what (source code) plagiarism means,

identify types of plagiarism which are poorly understood, and

identify categories of students who perceive the issue differently to others


Data Source

Online questionnaire answered by 770 students from computing departments across the UK

Anonymised, but brief demographic information included

Used 15 “scenarios”, each of which may describe a plagiaristic activity


Results (1)

No significant difference in perspectives in terms of

university

degree programme

level of study (BS, MS, PhD)


Results (2)

Issues which students misunderstood:

Open Source code

Translating between languages

Re-use of code from previous assignments

Placing references within technical documentation


Latent Semantic Analysis

Documents as “bags of words”

Known technique in IR

Handles synonymy and polysemy

Maths is nasty 

Results reported in (Cosma and Joy, 2010)


Document Corpus

  • m x n “term by document” matrix A

  • Rows = unique words

  • Columns = documents

  • Entries = no. of occurrences
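
A term-by-document matrix of this shape can be built in a few lines. The sketch assumes that "words" are lower-cased runs of letters, digits and underscores; the class name TermDocumentMatrix and its methods are invented for illustration and are not from the work cited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/** Builds an m x n term-by-document count matrix (hypothetical sketch). */
final class TermDocumentMatrix {
    /** Assumption: a word is a lower-cased run of word characters. */
    static List<String> words(String document) {
        List<String> result = new ArrayList<>();
        for (String w : document.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    /** Sorted list of the unique words across all documents (the rows). */
    static List<String> vocabulary(List<String> documents) {
        TreeSet<String> terms = new TreeSet<>();
        for (String d : documents) {
            terms.addAll(words(d));
        }
        return new ArrayList<>(terms);
    }

    /** Entry [i][j] is the number of occurrences of term i in document j. */
    static int[][] build(List<String> documents, List<String> vocabulary) {
        int[][] a = new int[vocabulary.size()][documents.size()];
        for (int col = 0; col < documents.size(); col++) {
            for (String w : words(documents.get(col))) {
                int row = vocabulary.indexOf(w);
                if (row >= 0) {
                    a[row][col]++;
                }
            }
        }
        return a;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("int a int b", "float a");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab + " " + Arrays.deepToString(build(docs, vocab)));
    }
}
```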


Term Weighting

  • Algorithm to weight data in A

  • Local and global weights

  • Importance of terms in matrix A
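
The slide does not say which weighting scheme is used, so the sketch below picks one common combination purely as an assumption: a logarithmic local weight, log(1 + count), multiplied by an inverse-document-frequency global weight, log(n / df). The class name TermWeighting is hypothetical.

```java
import java.util.Arrays;

/** Applies an assumed log x idf weighting to a term-by-document count matrix. */
final class TermWeighting {
    static double[][] weight(int[][] counts) {
        int m = counts.length;
        int n = m == 0 ? 0 : counts[0].length;
        double[][] weighted = new double[m][n];
        for (int i = 0; i < m; i++) {
            int documentFrequency = 0;
            for (int j = 0; j < n; j++) {
                if (counts[i][j] > 0) {
                    documentFrequency++;
                }
            }
            // Global weight: terms that occur in every document get weight 0.
            double global = documentFrequency == 0
                ? 0.0
                : Math.log((double) n / documentFrequency);
            for (int j = 0; j < n; j++) {
                // Local weight: dampens the effect of very frequent terms.
                double local = Math.log(1 + counts[i][j]);
                weighted[i][j] = local * global;
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.deepToString(weight(new int[][] {{1, 1}, {2, 0}})));
    }
}
```

Under this scheme a term that appears in every file contributes nothing, which suits language keywords that all submissions share.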


Singular Value Decomposition (SVD)

  • Decompose m x n matrix $A = U \Sigma V^T$

  • U is an m x r “term by dimension” matrix

  • V is an n x r “file by dimension” matrix

  • $\Sigma$ is an r x r “singular values” matrix

  • Truncate matrices to k dimensions, where k ≤ r


SVD (2)

  • $A_k = U_k \Sigma_k V_k^T$

  • Reduces “noise”

  • Highlights important relations between terms and documents

  • Size of k determined experimentally


SVD (3)

  • Given a “query” q (set of weighted keywords), can map to k-space:

  • $Q_k = q^T U_k \Sigma_k^{-1}$

  • Think of $Q_k$ as a k-vector; can compare to vectors representing files using e.g. “cosine similarity” (normalised dot product)
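
The query mapping and the comparison can be sketched directly, assuming the truncated factors have already been computed by a linear-algebra library. Because the singular-values matrix is diagonal, multiplying by its inverse amounts to dividing each coordinate by the corresponding singular value. The class name LsaQuery and the array layout (uk as m rows by k columns, sigma as the k singular values) are assumptions for this example.

```java
import java.util.Arrays;

/** Projects a query into k-space and compares vectors (hypothetical sketch). */
final class LsaQuery {
    /** Computes q^T * U_k * inverse(Sigma_k) for a diagonal Sigma_k. */
    static double[] project(double[] q, double[][] uk, double[] sigma) {
        int k = sigma.length;
        double[] result = new double[k];
        for (int j = 0; j < k; j++) {
            double sum = 0.0;
            for (int i = 0; i < q.length; i++) {
                sum += q[i] * uk[i][j];
            }
            result[j] = sum / sigma[j];     // divide by the j-th singular value
        }
        return result;
    }

    /** Cosine similarity: dot product divided by the product of the norms. */
    static double cosine(double[] x, double[] y) {
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        if (normX == 0.0 || normY == 0.0) {
            return 0.0;                     // convention chosen for zero vectors
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    public static void main(String[] args) {
        double[][] uk = {{1, 0}, {0, 1}, {0, 0}};   // toy values, not a real SVD
        double[] sigma = {2, 4};
        double[] qk = project(new double[] {2, 4, 7}, uk, sigma);
        System.out.println(Arrays.toString(qk) + " " + cosine(qk, new double[] {2, 2}));
    }
}
```

Ranking the file vectors by their cosine with the projected query gives the files most similar to it in the reduced space.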


Uses of LSA

  • Essay grading

  • Essay feedback

  • Indexing

  • Language independent processing

  • Cross-language information retrieval

  • Source-code clustering

  • Plagiarism detection (natural language)


Summary

  • LSA can help detect plagiarism instances missed by other tools

    • Improved recall but poorer precision

    • Integration with structure-based tools is effective

  • Visualisation of relative file similarities

  • Predictability of LSA results is problematic


Where Next?

Algorithms to include Internet-located code

“Blended” algorithms

Cross-language detection

Further exploration of LSA


References (1)

  • F. Culwin and T. Lancaster, “Plagiarism, prevention, deterrence and detection”, [online] available from: www.heacademy.ac.uk/assets/York/documents/resources/resourcedatabase/id426_plagiarism_prevention_deterrence_detection.pdf (2002)

  • G. Cosma and M.S. Joy, “An Approach to Source-Code Plagiarism Detection and Investigation using Latent Semantic Analysis”, IEEE Transactions on Computers, to appear (2010)

  • G. Cosma and M.S. Joy, “Towards a Definition on Source-Code Plagiarism”, IEEE Transactions on Education 51(2) pp. 195-200 (2008)


References (2)

  • G. Cosma and M.S. Joy, “Source-code Plagiarism: a UK Academic Perspective”, Proceedings of the 7th Annual Conference of the HEA Network for Information and Computer Sciences (2006)

  • M. Halstead, “Natural Laws Controlling Algorithm Structure”, ACM SIGPLAN Notices 7(2) pp. 19-26 (1972)

  • M.S. Joy, G. Cosma, J.Y-K. Yau and J.E. Sinclair, “Source Code Plagiarism – a Student Perspective”, under review (2008)

  • M.S. Joy and M. Luck, “Plagiarism in Programming Assignments”, IEEE Transactions on Education 42(2), pp. 129-133 (1999)


References (3)

  • K. Ottenstein, “An Algorithmic Approach to the Detection and Prevention of Plagiarism”, ACM SIGCSE Bulletin 8(4) pp. 30-41 (1976)

  • L. Prechelt, G. Malpohl and M. Philippsen, “Finding Plagiarisms among a Set of Programs with JPlag”, Journal of Universal Computer Science 8(11) pp. 1016-1038 (2002)

  • S. Schleimer, D.S. Wilkerson and A. Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85 (2003)
