
Detection of Plagiarism In University Projects Using Metrics-Based Similarity

Ettore Merlo, Ecole Polytechnique de Montréal. Dagstuhl Seminar on Duplication, Redundancy, and Similarity in Software, July 2006.

Presentation Transcript


  1. Detection of Plagiarism In University Projects Using Metrics-Based Similarity • Ettore Merlo, Ecole Polytechnique de Montréal • Dagstuhl Seminar on Duplication, Redundancy, and Similarity in Software, July 2006

  2. Context • Detect plagiarism in first-year programming projects at university • Programming skills have to be developed during courses

  3. Plagiarism Detection • Comparison of sets of syntactic blocks • Spectral analysis of similarity • Increasing thresholds • Spectral shape parameters are computed • Projects are ranked by similarity spectrum • The most similar projects are considered as candidates for plagiarism

  4. Plagiarism Problem • Detect code transformations that require little programming effort yet make the source code look different • Identifiers changed by editing operations • Source code layout changed (comments, indentation, order of procedures, functions, and methods, file structure) • Constants changed (initialization, loops)

  5. Metrics-Based Similarity • Definition • Two code fragments are similar if their associated vectors of metrics satisfy some similarity criterion

  6. Similarity Identification Process • [Figure: Source code, Parsing and Analysis, Abstract Syntax Tree, Metrics Extraction, Metrics (one row of metrics m_j1 … m_jk per fragment F_j), Clones Extraction, Clones]

  7. Metrics Extraction • Metrics for similarity detection • Volume • Complexity • Module/function interface • Call graph structure • Local memory • Global memory • Dataflow

  8. Metrics Matching • $\mathrm{similar}(f_i, f_j) \iff |m_k(f_i) - m_k(f_j)| \le th_k$ • for all $k$ within the size of the metrics vector
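
The matching rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `similar`, the three-metric vectors, and their ordering (CALLS, LOCALS, STMNT) are hypothetical and not taken from the slides.

```python
from typing import Sequence


def similar(m_i: Sequence[float], m_j: Sequence[float], th: Sequence[float]) -> bool:
    """Return True when every metric differs by at most its threshold."""
    if not (len(m_i) == len(m_j) == len(th)):
        raise ValueError("metric vectors and thresholds must have the same length")
    return all(abs(a - b) <= t for a, b, t in zip(m_i, m_j, th))


# Hypothetical metric vectors in the order (CALLS, LOCALS, STMNT).
f1 = (2, 3, 10)
f2 = (3, 3, 12)
f3 = (2, 6, 10)
th = (1, 1, 3)

print(similar(f1, f2, th))  # True: differences are 1, 0, 2
print(similar(f1, f3, th))  # False: LOCALS differs by 3 > 1
```

Comparing every pair of fragments with such a predicate is the quadratic approach; the quantization slides that follow avoid it.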

  9. Metrics Matching Complexity • n = | fragments_set | • Exact solution algorithms show a worst-case O(n^2) complexity in general • Linear complexity exact solutions exist for specific sub-problems • Opportunistic strategies and heuristics may reduce the average-case complexity • Approximate solutions may reduce the worst-case complexity

  10. Threshold-Based Quantization

  11. Threshold-Based Quantization (2) • Clusters represent the following hyper-parallelepiped: • Clusters represent a partition of all fragments • Complexity is O(M·n) where: • M is the cardinality of metrics • n is the total number of fragments • often M << n
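
A minimal sketch of threshold-based quantization follows. The cell formula is not preserved in this transcript, so the sketch assumes that each metric value m is mapped to the cell index floor(m / th); the name `quantize` and the sample fragments are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def quantize(
    fragments: Dict[str, Sequence[int]], th: Sequence[int]
) -> Dict[Tuple[int, ...], List[str]]:
    """Group fragments by the grid cell of their metric vector.

    Assumption: cell index for metric k is floor(m_k / th_k).
    One pass over n fragments with M metrics each, hence O(M*n).
    """
    clusters: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, metrics in fragments.items():
        key = tuple(m // t for m, t in zip(metrics, th))
        clusters[key].append(name)
    return dict(clusters)


# Hypothetical fragments with metrics (CALLS, LOCALS, STMNT).
fragments = {"f1": (2, 3, 10), "f2": (2, 3, 11), "f3": (5, 0, 30)}
clusters = quantize(fragments, th=(1, 1, 3))
print(clusters)  # {(2, 3, 3): ['f1', 'f2'], (5, 0, 10): ['f3']}
```

Each fragment lands in exactly one cell, so the clusters form a partition, and no pair of fragments is ever compared directly.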

  12. Quantization Error • Fragments in neighboring clusters may be closer than $th_i / 2$ and still be in different clusters • Errors for threshold level $th_i$ disappear for threshold levels $k \cdot th_i$, $k > 1$
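
A small worked example of the quantization error, again assuming floor-based cells (an assumption, since the slide's formula is not preserved). The values 8 and 9 are hypothetical statement counts; which multiple of the threshold merges a given pair depends on where the values fall relative to the cell boundaries.

```python
def cell(value: int, th: int) -> int:
    """Cell index of a metric value under threshold th (assumed floor-based)."""
    return value // th


a, b, th = 8, 9, 3

# The two values are closer than th / 2 ...
print(abs(a - b) <= th / 2)            # True
# ... yet they fall on opposite sides of the boundary at 9.
print(cell(a, th), cell(b, th))        # 2 3
# At twice the threshold the boundary at 9 is gone and they share a cell.
print(cell(a, 2 * th), cell(b, 2 * th))  # 1 1
```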

  13. Project Comparison • Compute structural similarity spectrum • Compute similarity for increasing threshold levels in s steps • Quantize projects for the current threshold level • Traverse current clusters to check for commonality in compared project • Count common structurally-similar fragments under current threshold level

  14. Project Comparison (2) • Complexity: O(s·M·(n1 + n2)) • n1, n2: sizes of the two projects • M: number of metrics • s: threshold steps • Rationale: • Plagiarism is hard to hide deeply when little programming effort is invested • Surface differences are quickly absorbed by thresholds of increasing levels
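
The project comparison steps can be sketched as follows. This is a hedged illustration, not the author's implementation: it assumes floor-based cells, threshold level `step * base_th` at each step, and a spectrum value defined as the fraction of fragments lying in cells occupied by both projects. The names `quantize` and `similarity_spectrum` and the sample projects are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Fragment = Tuple[int, ...]


def quantize(project: Sequence[Fragment], th: Sequence[int]) -> Dict[Fragment, List[int]]:
    """Map each occupied grid cell to the indices of the fragments in it."""
    clusters: Dict[Fragment, List[int]] = {}
    for idx, metrics in enumerate(project):
        key = tuple(m // t for m, t in zip(metrics, th))
        clusters.setdefault(key, []).append(idx)
    return clusters


def similarity_spectrum(
    p1: Sequence[Fragment], p2: Sequence[Fragment], base_th: Sequence[int], steps: int
) -> List[float]:
    """One similarity value per threshold level; O(s*M*(n1 + n2)) overall."""
    total = len(p1) + len(p2)
    if total == 0:
        raise ValueError("both projects are empty")
    spectrum = []
    for level in range(1, steps + 1):
        th = [level * t for t in base_th]
        c1, c2 = quantize(p1, th), quantize(p2, th)
        # Count fragments of either project that sit in a shared cell.
        common = sum(len(c1[k]) + len(c2[k]) for k in c1.keys() & c2.keys())
        spectrum.append(common / total)
    return spectrum


# Hypothetical projects; metrics are (CALLS, LOCALS, STMNT).
p1 = [(2, 3, 10), (0, 1, 4), (5, 2, 30)]
p2 = [(2, 3, 11), (1, 1, 5), (9, 9, 90)]
spectrum = similarity_spectrum(p1, p2, base_th=(1, 1, 3), steps=5)
print([round(x, 2) for x in spectrum])  # [0.33, 0.67, 0.67, 0.67, 0.67]
```

In this toy case the second function of each project differs only slightly, so it is missed at the first level and matched from the second level on, which is the behavior the rationale describes.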

  15. Project Comparison (3) • Typical spectrum

  16. Parameters • Granularity: functions and methods • Steps: 5 • Metrics and thresholds: • CALLS: 1 • LOCALS: 1 • NONLCALS: 1 • PARNUM: 1 • STMNT: 3 • NBRANCHES: 1 • NLOOPS: 1

  17. Plagiarism problem • Projects are composed of a variable number of fragments • Problem similar to class comparison or to software evolution analysis • Identify projects with high spectral similarity • p = number of projects • Galaxy approach • O(p) • Pair comparison • O(p^2)
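
The pair comparison strategy can be sketched as below; the Galaxy algorithm is not preserved in this transcript, so it is not sketched. The score used here is a stand-in (Jaccard overlap of metric vectors), not the spectral similarity of the talk, and `rank_pairs`, `toy_score`, and the sample projects are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Fragment = Tuple[int, ...]
Project = Sequence[Fragment]


def toy_score(p1: Project, p2: Project) -> float:
    """Stand-in similarity: Jaccard overlap of the sets of metric vectors."""
    s1, s2 = set(p1), set(p2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def rank_pairs(
    projects: Dict[str, Project], score: Callable[[Project, Project], float]
) -> List[Tuple[float, str, str]]:
    """Score all p*(p-1)/2 unordered pairs and sort the most similar first."""
    scored = [
        (score(projects[a], projects[b]), a, b)
        for a, b in combinations(sorted(projects), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


projects = {
    "A": [(2, 3, 10), (0, 1, 4)],
    "B": [(2, 3, 10), (0, 1, 4)],
    "C": [(9, 9, 90)],
    "D": [(2, 3, 10), (7, 7, 7)],
}
ranking = rank_pairs(projects, toy_score)
print(len(ranking))   # 6 pairs for p = 4
print(ranking[0])     # (1.0, 'A', 'B')
```

The top of such a ranking gives the candidate pairs to inspect by hand.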

  18. Galaxy • Algorithm:

  19. Procedural Projects

  20. OO Projects

  21. Clone Visualization • Visual display of the differences between source code fragments • DP-matching algorithm on tokens

  22. Matching Algorithms • Compute the sets of lexical changes • Dynamic programming • Sub-optimal and heuristic ones

  23. Matching Example • int restore_stack ( object info ) { • int restore_list ( int index , object info ) {
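
Dynamic-programming matching on tokens can be illustrated with the two signatures above. This is a minimal sketch, assuming unit-cost edit distance over whitespace-separated tokens; the slides do not specify the cost model or the tokenizer, and `token_diff` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Op = Tuple[str, Optional[str], Optional[str]]


def token_diff(a: List[str], b: List[str]) -> Tuple[int, List[Op]]:
    """Return the edit distance between token lists and one minimal edit script."""
    n, m = len(a), len(b)
    # dist[i][j] = minimal number of edits turning a[:i] into b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a[i-1]
                dist[i][j - 1] + 1,         # insert b[j-1]
                dist[i - 1][j - 1] + cost,  # keep or replace
            )
    # Walk back from the bottom-right corner to recover the lexical changes.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("replace", a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", a[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, b[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


left = "int restore_stack ( object info ) {".split()
right = "int restore_list ( int index , object info ) {".split()
distance, ops = token_diff(left, right)
print(distance)  # 4
for op in ops:
    print(op)
```

On this pair the script reports one renamed identifier and three inserted tokens for the added parameter, which is the kind of lexical change a visual display can highlight.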

  24. Remarks • Similarity contrast is very good for procedural code • Distribution of similarity for OO code is less sharp • Reference classes were given as a part of the projects • Methods tend to be smaller • More methods tend to be similar • Class structure could be taken into consideration • Inter-class relationships could be taken into account

  25. Administrative Approach • Identify most similar projects • Do not make any hypothesis about the causes of similarity • Shift the burden of explanation onto the authors of a project

  26. Conclusions • A metrics-based plagiarism detection approach in an academic environment has been presented • The presented approach has been successfully used to discourage plagiarism in course projects

  27. Bibliography • Merlo E., Antoniol G., Di Penta M., Rollo F., "Linear Complexity Object-Oriented Similarity for Clone Detection and Software Evolution Analysis", Proc. International Conference of Software Maintenance (ICSM), IEEE Computer Society Press, 2004, pp. 412-416. • Merlo E., Antoniol G., Di Penta M., "Complexity and Feasibility Issues in Object Oriented Clone Detection", Proc. 2nd International Workshop on Detection of Software Clones (IWDSC-2003), Victoria (BC), Canada, 2003, pp. 5-6. • G. Antoniol, U. Villano, E. Merlo, M. Di Penta, "Analyzing Cloning Evolution in the Linux Kernel", Information and Software Technology, Vol. 44, No. 13, pp. 755-765, October 1, 2002.

  28. Bibliography (2) • E. Merlo, M. Dagenais, P. Bachand, J. S. Sormani, G. Antoniol, "Investigating Large Software System Evolution: the Linux Kernel", Computer Software and Applications Conference, COMPSAC - 2002. • Dagenais M., Patenaude J. F., Merlo E., Lague B., "Comparison of clones occurrence in Java and Modula-3 software systems", in "Advances in Software Engineering: Comprehension, Evaluation, and Evolution", H. Erdogmus and O. Tanir (Eds.), Springer-Verlag, ISBN: 0-387-95109-1, 2001. • Casazza G., Antoniol G., Villano U., Merlo E., Di Penta M., "Identifying Clones in the Linux Kernel", Proc. International Workshop on Source Code Analysis and Manipulation (IWSCAM), IEEE Computer Society Press, pp. 90-97, 2001.

  29. Bibliography (3) • Antoniol A., Casazza G., Di Penta M., Merlo E., "Modeling Clones Evolution through Time Series", Proc. International Conference of Software Maintenance (ICSM), IEEE Computer Society Press, pp. 273-280, 2001. • Antoniol G., Casazza G., Merlo E., "GAWK Software System Evolution", International Workshop on Feedback and Evolution in Software and Business Processes (FEAST), July 2000. • Balazinska M., Merlo E., Dagenais M., Lague B., Kontogiannis K., "Advanced Clone-analysis as a Basis for Object-oriented System Refactoring", Proc. Working Conference on Reverse Engineering (WCRE), IEEE Computer Society Press, pp. 98-107, 2000.

  30. Bibliography (4) • Balazinska M., Merlo E., Dagenais M., Lague B., Kontogiannis K., "Measuring Clone Based Reengineering Opportunities", Proc. International Software Metrics Symposium, pp. 292-303, IEEE Computer Society Press, 1999. • Balazinska M., Merlo E., Dagenais M., Lague B., Kontogiannis K., "Partial Redesign of Java Software Systems Based on Clone Analysis", Proc. 6th Working Conference on Reverse Engineering, WCRE99, pp. 326-336, IEEE Computer Society Press, 1999. • Dagenais M., Merlo E., Lague B., Proulx D., "Clones Occurrence on Large Object Oriented Software Packages", Proc. CASCON'98, pp. 192-200, IBM Canada, National Research Council of Canada, 1998.

  31. Bibliography (5) • Lague, B., Proulx, D., Mayrand, J., Merlo, E.M., Hudepohl, J., "Assessing the Benefits of Incorporating Function Clone Detection in a Development Process", Proc. of International Conference on Software Maintenance, IEEE Computer Society Press, 1997, pp. 314-321. • Mayrand, J., Leblanc, C., and Merlo, E., "Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics", Proc. IEEE International Conference on Software Maintenance, Monterey, California, November 1996, IEEE Computer Society Press, pp. 244-253. • Kontogiannis K., De Mori R., Merlo E., Galler M., Bernstein M., "Pattern matching techniques for clone detection", Journal of Automated Software Engineering, V.3, 1996, pp. 77-108, Kluwer Academic Publishers.

  32. Further Contacts • Ettore Merlo, Ecole Polytechnique de Montréal • tel: +1 (514) 340 4711 ext. 5758 • fax: +1 (514) 340 3240 • ettore.merlo@polymtl.ca
