New Approaches for Detecting Similarities in Program Code

Mike Joy, 25 February 2010

Overview of Talk

  • What is the Problem?

  • Historical Overview

  • New Approaches

  • Where Next?

Part 1 – What is the Problem?

Document similarity

What do we mean?

Why is software an issue?

Why is this interesting?

Four Stages

Collection

Detection

Confirmation

Investigation

From Culwin and Lancaster (2002).

Stage 1: Collection

Get all documents together online

so they can be processed



BOSS (Warwick)

Coursemaster (Nottingham)

Managed Learning Environment

Stage 2: Detection

Compare with other submissions

Compare with external documents

(mainly an issue for essay-based assignments)

We’ll come back to this later

it’s the interesting bit!

Stage 3: Confirmation

Software tool says “A and B similar”

Are they?

Never rely on a computer program!

Requires expert human judgement

Evidence must be compelling

Might go to court

Stage 4: Investigation

A from B, or B from A, or joint work?

If A from B, did B know?

e.g. an open networked file

or printer output

Did the culprit(s) understand?

University processes must be followed

Why is this Interesting?

How do you compare two programs?

This is an algorithmic question

Stages 2 and 3: detection and confirmation

How do you use the results (of a comparison) to educate students?

This is a pedagogic question

Stage 4, and before stage 1!

Digression: Essays

Plagiarism in essays is easier to detect

Lots of “tricks” a lecturer can use!

Google search on phrases

Abnormal style

... etc.

Software Tools

Let's have a look ...


Can be used by academics to

detect plagiarism

provide evidence

Can be used by students to

check their own work

Part 2 – Historical Overview

How has similar code been detected in the past?

How well do the approaches work?

Why not use Turnitin?

It won’t work!

String matching algorithm inappropriate

Database does not contain code

Commercial involvement

E.g. Black Duck Software

/* Program 1 */

public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}

/* Program 2 */

public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}

Is This Plagiarism?

Is Program 2 derived from Program 1 in a manner which is “plagiarism”?

Probably No

It's too simple

Too many copies in books / on the web

Most of it is generic syntax

Program 3

(Source code for MS Windows 7)

Program 4

(code 98% identical to the source code for MS Windows 7)

Is This Plagiarism?

Is Program 4 derived from Program 3 in a manner which is “plagiarism”?

Definitely Yes

It's too complicated to happen by chance

Millions of lines of code

The source is “closed”

Microsoft guard it very well!

/* Program 5 */

public class Sun {
    static final double latitude = 52.4;
    static final double longitude = -1.5;
    static final double tpi = 2.0 * Math.PI;
    /* ... */
    public static void main(String[] args) { calculate(); }

    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long) b);
        if (a < 0) a = tpi + a;
        return a;
    }

    public static void calculate() { /* ... */ }
    /* ... */
}

/* Program 6 */

public class SunsetCalculator {
    static float latitude = 52.4f;
    static float longitude = -1.5f;
    static final float tpi = 2 * 3.14159f;
    /* ... */
    public static void main(String[] args) { findSunsetTime(); }

    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2 * 3.14159f * (x - (int) x);
        if (y < 0) y = 2 * 3.14159f + y;
        return y;
    }

    public static void findSunsetTime() { /* ... */ }
    /* ... */
}

Is This Plagiarism?

Is Program 6 derived from Program 5 in a manner which is “plagiarism”?


Maybe ...

Structure is similar – cosmetic changes

But the algorithm is in the public domain

Maybe 6 derived from 5, maybe the other way round

History ...

First known plagiarism detection system was an attribute counting program developed by Ottenstein (1976)

More recent systems compare the structure of source-code programs

Structure-based systems include: YAP3, MOSS, JPlag, Plague, and Sherlock.

Detection Tools (1)

Attribute counting systems (Halstead, 1972):

Numbers of unique operators

Numbers of unique operands

Total numbers of operator occurrences

Total numbers of operand occurrences
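
For illustration, a minimal sketch (my own, not Ottenstein's actual system) of how these four attributes might be computed; the crude whitespace tokenisation and the operator set are assumptions:

import java.util.*;

/* Sketch: compute the four Halstead attributes for a whitespace-separated
   token stream. Real systems use a proper lexer; this is illustrative only. */
public class HalsteadCounts {

    // Assumed operator set, just for the example.
    static final Set<String> OPERATORS = new HashSet<>(
        Arrays.asList("+", "-", "*", "/", "=", "==", "<", ">", "(", ")", "{", "}", ";"));

    public static void main(String[] args) {
        String source = "int ans = 0 ; ans = ans + 1 ;";
        Map<String, Integer> ops = new HashMap<>(), opnds = new HashMap<>();
        for (String token : source.split("\\s+"))
            (OPERATORS.contains(token) ? ops : opnds).merge(token, 1, Integer::sum);

        int n1 = ops.size(), n2 = opnds.size();                 // unique counts
        int N1 = ops.values().stream().mapToInt(i -> i).sum();  // total occurrences
        int N2 = opnds.values().stream().mapToInt(i -> i).sum();
        System.out.println("n1=" + n1 + " n2=" + n2 + " N1=" + N1 + " N2=" + N2);
        // Two submissions with near-identical (n1, n2, N1, N2) profiles
        // would be flagged for closer inspection.
    }
}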

Detection Tools (2)

Structure-based systems:

Each program is converted into token strings (or something similar)

Token streams are compared to find similar source-code fragments

Tools: JPlag, MOSS, and Sherlock

Example (code 1)

int calculate(String arg) {
    int ans = 0;
    for (int j = 1; j <= 100; j++) {
        ans *= j;
    }
    return ans;
}

Example (code 2)

float doit(String v) {
    float result = 0.0f;
    for (float f = 100.0f; f > 0.0f; f--)
        result *= f;
    return result;
}

Example (tokenised)

type name(type name) start

type name=number

loop (type name=number

name compare number

operation name) start

name operation name


return name
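
To make this concrete, here is a rough sketch of such a preprocessor, in the spirit of the example above; this is my own illustration, not the actual tokeniser of JPlag, MOSS, or Sherlock, and the type list and token classes are assumptions:

import java.util.*;
import java.util.regex.*;

/* Sketch of structure-based preprocessing: map concrete lexemes to
   generic token classes so that renamed identifiers compare equal. */
public class Tokeniser {

    public static List<String> tokenise(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(
            "[A-Za-z_][A-Za-z0-9_]*|\\d+(\\.\\d+)?|[^\\sA-Za-z0-9_]").matcher(source);
        Set<String> types = new HashSet<>(
            Arrays.asList("int", "float", "double", "Integer", "String"));
        while (m.find()) {
            String lexeme = m.group();
            if (types.contains(lexeme)) tokens.add("type");
            else if (lexeme.matches("\\d+(\\.\\d+)?")) tokens.add("number");
            else if (lexeme.matches("[A-Za-z_].*")) tokens.add("name");
            else tokens.add(lexeme); // punctuation kept verbatim
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The two fragments above yield identical token streams.
        System.out.println(tokenise("int ans=0;"));
        System.out.println(tokenise("float result=0.0;"));
    }
}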



MOSS (Berkeley/Stanford, USA)

JPlag (Karlsruhe, Germany)

Java only

Programs must compile?

Sherlock (Warwick, UK)

MOSS and JPlag are Internet resources

Data Protection?


Developed by Alex Aiken in 1994

MOSS (for a Measure Of Software Similarity) determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs.

MOSS is free, but you must create an account

MOSS home page: theory.stanford.edu/~aiken/moss/


MOSS – Algorithm

“Winnowing” (Schleimer et al., 2003)

Local document fingerprinting algorithm

Efficiency proven (within 33% of the lower bound)

Guarantees detection of matches longer than a certain threshold
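
A condensed sketch of the winnowing idea, simplified from the published description; the k-gram hash, the parameter values, and the example strings are placeholders, and MOSS's real implementation differs in detail:

import java.util.*;

/* Winnowing sketch: hash every k-gram, then keep the minimum hash of each
   window of w consecutive hashes as a fingerprint. Any match of length at
   least w + k - 1 is guaranteed to share a fingerprint. */
public class Winnow {

    static Set<Integer> fingerprints(String text, int k, int w) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= text.length(); i++)
            hashes.add(text.substring(i, i + k).hashCode()); // placeholder hash
        Set<Integer> selected = new HashSet<>();
        for (int i = 0; i + w <= hashes.size(); i++)
            selected.add(Collections.min(hashes.subList(i, i + w)));
        return selected;
    }

    public static void main(String[] args) {
        Set<Integer> shared = fingerprints("int ans=0; ans=ans+1;", 5, 4);
        shared.retainAll(fingerprints("int sum=0; sum=sum+1;", 5, 4));
        System.out.println("shared fingerprints: " + shared.size());
    }
}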

Using MOSS

MOSS is provided as an Internet service

Users download a Perl script which submits files to the MOSS server

The script uses a direct network connection

The MOSS server produces HTML pages listing pairs of programs with similar code

MOSS highlights similar code-fragments within programs that appear the same

Data Protection? – US service



Developed by Guido Malpohl in 1996

JPlag currently supports Java, C#, C, C++, Scheme, and natural language text

Use of JPlag is free, but user must create an account

JPlag can be used to compare student assignments but does not compare with code on the Internet

JPlag home page: www.ipd.uni-karlsruhe.de/jplag

JPlag – Algorithm

1) Parse (or scan) programs

2) Convert programs to tokens

3) Pairwise compare

“Greedy String Tiling”

maximises percentage of common token strings

worst case Θ(n³), average case linear

Prechelt et al. (2002)
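
A much-simplified sketch of Greedy String Tiling over token streams; illustrative only, since JPlag's published algorithm adds Karp-Rabin hashing to achieve near-linear average cost, and the minimum match length here is an arbitrary choice:

import java.util.*;

/* Greedy String Tiling sketch: repeatedly find the longest common
   unmarked substring of two token sequences and mark ("tile") it,
   until no match of at least minMatch tokens remains. */
public class GST {

    public static int tiledLength(String[] a, String[] b, int minMatch) {
        boolean[] markedA = new boolean[a.length], markedB = new boolean[b.length];
        int total = 0;
        while (true) {
            int best = 0, bestI = -1, bestJ = -1;
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < b.length; j++) {
                    int k = 0;
                    while (i + k < a.length && j + k < b.length
                            && !markedA[i + k] && !markedB[j + k]
                            && a[i + k].equals(b[j + k])) k++;
                    if (k > best) { best = k; bestI = i; bestJ = j; }
                }
            if (best < minMatch) break;
            for (int k = 0; k < best; k++) {   // mark the tile
                markedA[bestI + k] = true;
                markedB[bestJ + k] = true;
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        String[] p1 = "type name = number ; loop ( type name = number )".split(" ");
        String[] p2 = "type name = number ; name operation name".split(" ");
        int tiles = tiledLength(p1, p2, 3);
        // Similarity as a percentage of common tokens (JPlag-style measure).
        System.out.println(200.0 * tiles / (p1.length + p2.length) + "%");
    }
}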

JPlag - Results

Results in HTML Format

Histogram of similarity values found for all pairs of programs

Similar pairs and their similarity values displayed

Select file pairs to view

JPlag - Matches

Similar lines matched with the same colour

Code fragment similarity values based on similar tokens found


Developed at the University of Warwick Department of Computer Science

Sherlock was fully integrated with the BOSS online submission software in 2002 and open-sourced

Sherlock detects plagiarism in source-code and natural-language assignments

BOSS home page: www.boss.org.uk

Sherlock – Preprocessing





Sherlock – Results

Results displayed

Similarity values of suspicious files

Similarity values depend on the length of similar lines found as a percentage of the whole file size

Select suspicious matches to examine

Mark suspicious files

Sherlock – Matches

Suspected sections marked with

**begin suspicious section**


**end suspicious section**

Sherlock – Document Set

User can view a graph

Each node represents one submission

An edge links two similar submissions

Options to select threshold

Click on lines to view or to mark suspicious matches


CodeMatch

Commercial product

Free academic use for small data sets

Exact algorithm not published

patent pending?

Example of Identical “Instruction Sequences”

/* File 1 */

for (int i = 1; i < 10; i++) {
    if (a == 10)
        print("ans is " + a);
    else {
        b += 65;
    }
}

/* File 2 */

for (int x = 100; x > 0; x--) {
    if (z99 > -10)
        print("ans is " + z99);
    else {
        abc += 65;
    }
}

CodeMatch – Algorithm

1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions

2) Extract comments, and compare

3) Extract identifiers, and count similar ones; x, xxx, xx12345 are “similar”

Combine (1), (2) and (3) to give a correlation score
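
Since the exact algorithm is unpublished, the following can only be a guess at the shape of such a correlation score; the weights and the three component similarity values are invented for illustration:

/* Hypothetical combination of the three component scores; CodeMatch's
   real formula and weights are not public. */
public class Correlation {

    static double combine(double instructionSim, double commentSim, double identifierSim) {
        return 0.5 * instructionSim + 0.2 * commentSim + 0.3 * identifierSim; // invented weights
    }

    public static void main(String[] args) {
        // e.g. 80% matching instruction sequences, 10% matching comments,
        // 60% "similar" identifiers:
        System.out.println(combine(0.8, 0.1, 0.6)); // about 0.6
    }
}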



Spelling mistakes

Unusual English (Thai, German, …)

Use of Search Engines

Unusual style

Code errors

Tool Efficiency

MOSS, JPlag and Sherlock are effective

Results returned are similar

Results returned are not identical

User interface issues may be important

Part 3 – New Approaches

Eschew the “syntax driven” approach

Lateral thinking?

Case study: Latent Semantic Analysis

Digression: Similarity

What do we actually mean by “similar”?

This is where the problems start ...

(1) Staff Survey

We carried out a survey in order to:

gather the perceptions of academics on what constitutes source-code plagiarism, and

create a structured description of what constitutes source-code plagiarism from a UK academic perspective

Cosma and Joy (2008)

Data Source

On-line questionnaire distributed to 120 academics

Questions were in the form of small scenarios

Mostly multiple-choice responses

Comments box below each question

Anonymous – option for providing details

Received 59 responses, from more than 34 different institutions

Responses were analysed and collated to create a universally acceptable source-code plagiarism description.


Grey areas include:

O-O templates

Inappropriate collaboration

Translating between (programming) languages

Re-use of work already submitted

Other Issues

Various issues on source-code plagiarism including:

Source-code reuse

Source-code self-plagiarism

Copying without adaptation

Copying with adaptation: minimal, moderate, extreme

Converting source to another language

Using code-generator software


Obtaining source-code written by other authors

False and “pretend” references

(2) Student Survey

We carried out a survey (Joy et al., 2008) in order to:

gather the perceptions of students on what (source code) plagiarism means,

identify types of plagiarism which are poorly understood, and

identify categories of student who perceive the issue differently to others

Data Source

Online questionnaire answered by 770 students from computing departments across the UK

Anonymised, but brief demographic information included

Used 15 “scenarios”, each of which may describe a plagiaristic activity

Results (1)

No significant difference in perspectives in terms of


degree programme

level of study (BS, MS, PhD)

Results (2)

Issues which students misunderstood:

Open Source code

Translating between languages

Re-use of code from previous assignments

Placing references within technical documentation

Latent Semantic Analysis

Documents as “bags of words”

Known technique in IR

Handles synonymy and polysemy

Maths is nasty 

Results reported in (Cosma and Joy, 2010)

Document Corpus

  • m × n “term by document” matrix A

  • Rows = unique words

  • Columns = documents

  • Entries = no. of occurrences

Term Weighting

  • Algorithm to weight data in A

  • Local and global weights

  • Importance of terms in matrix A
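
To make these two steps concrete, here is a small sketch of my own: toy token-stream documents, raw counts, then tf-idf standing in for whichever local/global weighting scheme is actually chosen:

import java.util.*;

/* Sketch: build the m x n term-by-document count matrix A, then apply a
   simple tf-idf weighting (one common local * global scheme; studies often
   compare several). Documents here are toy token streams. */
public class TermDocMatrix {

    public static void main(String[] args) {
        String[] docs = { "loop name number loop", "name number return", "loop return" };

        SortedSet<String> vocab = new TreeSet<>();
        for (String d : docs) vocab.addAll(Arrays.asList(d.split(" ")));
        List<String> terms = new ArrayList<>(vocab);

        double[][] A = new double[terms.size()][docs.length];
        for (int j = 0; j < docs.length; j++)
            for (String t : docs[j].split(" "))
                A[terms.indexOf(t)][j] += 1;                  // raw occurrence counts

        for (int i = 0; i < terms.size(); i++) {
            int df = 0;                                       // document frequency
            for (double count : A[i]) if (count > 0) df++;
            double idf = Math.log((double) docs.length / df); // global weight
            for (int j = 0; j < docs.length; j++) A[i][j] *= idf;
        }

        for (int i = 0; i < terms.size(); i++)
            System.out.println(terms.get(i) + "\t" + Arrays.toString(A[i]));
    }
}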

Singular Value Decomposition (SVD)

  • Decompose the m × n matrix A = UΣVᵀ

  • U is an m × r “term by dimension” matrix

  • V is an n × r “file by dimension” matrix

  • Σ is an r × r “singular values” matrix

  • Truncate the matrices to k dimensions, where k ≤ r

SVD (2)

  • Aₖ = UₖΣₖVₖᵀ

  • Reduces “noise”

  • Highlights important relations between terms and documents

  • Size of k determined experimentally

SVD (3)

  • Given a “query” q (set of weighted keywords), can map to k-space:

  • Qₖ = qᵀUₖΣₖ⁻¹

  • Think of Qₖ as a k-vector; it can be compared with the vectors representing files using e.g. “cosine similarity” (normalised dot product)
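
A sketch of that projection and comparison, assuming Uₖ and the singular values have already been obtained from an SVD routine in a linear algebra library; all the numbers below are invented:

/* Sketch: map a weighted query vector q into k-space as q' = qT * Uk * inv(Sigma_k),
   then compare it with a file's k-vector by cosine similarity. Uk and sigmaK
   would come from an SVD library; the values here are illustrative only. */
public class LsaQuery {

    static double[] project(double[] q, double[][] Uk, double[] sigmaK) {
        double[] out = new double[sigmaK.length];
        for (int d = 0; d < sigmaK.length; d++) {
            for (int t = 0; t < q.length; t++) out[d] += q[t] * Uk[t][d]; // qT * Uk
            out[d] /= sigmaK[d];                                          // * inv(Sigma_k)
        }
        return out;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[][] Uk = { {0.6, 0.1}, {0.5, -0.2}, {0.3, 0.9} }; // 3 terms, k = 2
        double[] sigmaK = { 2.5, 1.1 };
        double[] query = { 1, 0, 1 };    // weighted keyword vector
        double[] fileVec = { 0.4, 0.3 }; // a file's row of Vk
        System.out.println(cosine(project(query, Uk, sigmaK), fileVec));
    }
}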

Uses of LSA

  • Essay grading

  • Essay feedback

  • Indexing

  • Language independent processing

  • Cross-language information retrieval

  • Source-code clustering

  • Plagiarism detection (natural language)


  • LSA can help detect plagiarism instances missed by other tools

    • Improved recall but poorer precision

    • Integration with structure-based tools is effective

  • Visualisation of relative file similarities

  • Predictability of LSA results is problematic

Where Next?

Algorithms to include Internet-located code

“Blended” algorithms

Cross-language detection

Further exploration of LSA

References (1)

  • F. Culwin and T. Lancaster, “Plagiarism, Prevention, Deterrence and Detection”, [online] available from: www.heacademy.ac.uk/assets/York/documents/resources/resourcedatabase/id426_plagiarism_prevention_deterrence_detection.pdf (2002)

  • G. Cosma and M.S. Joy, “An Approach to Source-Code Plagiarism Detection and Investigation using Latent Semantic Analysis”, IEEE Transactions on Computers, to appear (2010)

  • G. Cosma and M.S. Joy, “Towards a Definition of Source-Code Plagiarism”, IEEE Transactions on Education 51(2), pp. 195-200 (2008)

References (2)

  • G. Cosma and M.S. Joy, “Source-code Plagiarism: a UK Academic Perspective”, Proceedings of the 7th Annual Conference of the HEA Network for Information and Computer Sciences (2006)

  • M. Halstead, “Natural Laws Controlling Algorithm Structure”, ACM SIGPLAN Notices 7(2), pp. 19-26 (1972)

  • M.S. Joy, G. Cosma, J.Y-K. Yau and J.E. Sinclair, “Source Code Plagiarism – a Student Perspective”, under review (2008)

  • M.S. Joy and M. Luck, “Plagiarism in Programming Assignments”, IEEE Transactions on Education 42(2), pp. 129-133 (1999)

References (3)

  • K. Ottenstein, “An Algorithmic Approach to the Detection and Prevention of Plagiarism”, ACM SIGCSE Bulletin 8(4), pp. 30-41 (1976)

  • L. Prechelt, G. Malpohl and M. Philippsen, “Finding Plagiarisms among a Set of Programs with JPlag”, Journal of Universal Computer Science 8(11), pp. 1016-1038 (2002)

  • S. Schleimer, D.S. Wilkerson and A. Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85 (2003)