
Authorship Analysis in Cybercrime Investigation

Authorship Analysis in Cybercrime Investigation. Rong Zheng, Yi Qin, Zan Huang, Hsinchun Chen Artificial Intelligence Lab University of Arizona. Outline. Introduction Literature Review Research Questions Experimental Design Results & Discussions Conclusions & Future Directions



Presentation Transcript


  1. Authorship Analysis in Cybercrime Investigation Rong Zheng, Yi Qin, Zan Huang, Hsinchun Chen Artificial Intelligence Lab University of Arizona

  2. Outline • Introduction • Literature Review • Research Questions • Experimental Design • Results & Discussions • Conclusions & Future Directions • Questions & Comments

  3. Introduction • The Internet has offered us a much more convenient way to share information across time and place. • Cyberspace has also opened a new venue for criminal activities. • Cyber attacks • Distribution of illegal materials in cyberspace • Computer-mediated illegal communications within large criminal or terrorist groups • Cybercrime has become one of the major security issues for the law enforcement community.

  4. Cybercrime • Definition: • Illegal computer-mediated activities that can be conducted through global electronic networks [Thomas, 2000] • Problems in cybercrime investigation • Data collection • Huge volume of online documents • Rule forming • Difficult to discern illegal documents • Identity tracing • Difficult to trace identities due to the anonymity of cybercrime • The anonymity of cyberspace makes identity tracing a significant problem which hinders investigations.

  5. Possible Solution -- Authorship Analysis • An author might leave his unique “wordprint” in his writings. • Authorship analysis may identify the “wordprint” of the criminals. • For forensic purposes, this method has been used in a number of courts in England (the Court of Criminal Appeal), Ireland (the Central Criminal Court), Northern Ireland, and Australia.

  6. Authorship Analysis in Cybercrime Investigation • A cyber criminal may leave a “wordprint” hidden in his online messages. • For example: “Hi, I have several pretty cheap CD to sell. They are all brand new , and only $1 for each. Please contact pepter@yahoo.com if you are interested.” (Cues annotated on the example: has a greeting; special character; uses email as contact method.) • In this study, we propose to use the authorship analysis approach to solve the problem of identity tracing in cybercrime investigation.
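The three cues annotated on the example message (a greeting, a special character, an email contact) can be checked mechanically. The following is a minimal sketch, not the authors' extractor: the function name `message_cues`, the greeting list, and the set of special characters are all assumptions chosen for illustration.

```python
import re

# Hypothetical greeting list; the slides do not enumerate one.
GREETINGS = ("hi", "hello", "hey", "dear")

def message_cues(text):
    """Return the three binary cues annotated on the example message."""
    words = re.findall(r"[a-z']+", text.lower())
    first_word = words[0] if words else ""
    return {
        # Does the message open with a greeting word?
        "has_greeting": first_word in GREETINGS,
        # Assumed set of special characters; '@' is left out so that an
        # email address alone does not trigger this cue.
        "has_special_char": bool(re.search(r"[$%#&*]", text)),
        # Loose email pattern, good enough for a sketch.
        "has_email_contact": bool(
            re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
        ),
    }

sample = (
    "Hi, I have several pretty cheap CD to sell. They are all brand new , "
    "and only $1 for each. Please contact pepter@yahoo.com if you are interested."
)
cues = message_cues(sample)
```

For the sample message all three cues come out true. In a real feature set each cue would become one binary column next to the numeric style markers.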

  7. Authorship Analysis • Categories: • Author identification • Author characterization • Similarity detection • Applications: • Literature of disputed authorship • Shakespeare’s work, Federalist Papers • Software forensics • Virus authorship, source code plagiarism

  8. Performance of Authorship Analysis • Two critical research issues influence the performance of authorship analysis: • Feature selection • Find out the effective discriminators • Analytical techniques • Approach to discriminating texts by authors based on the selected features

  9. Feature Selection • Content specific features [Elliot, 1991] • key words, special characters • Style markers • Word/Character based features [Yule, 1938] • length of words, vocabulary richness • Syntactic features [Mosteller, 1964; Baayen, 1996] • function words(‘the’ ‘if’ ‘to’), punctuation • Structural features [Vel, 2000] • has a title/signature, has separators between paragraphs
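To make the style-marker categories concrete, here is a hedged sketch that computes a few of them from raw text: average word length and a type-token ratio (one common proxy for vocabulary richness), relative frequencies of the three function words named on the slide, and punctuation density. The function name, the tokenizer, and the choice of type-token ratio are assumptions; the slides do not specify how each marker was computed.

```python
import re
from collections import Counter

# Only the three function words given as examples on the slide.
FUNCTION_WORDS = ("the", "if", "to")

def style_markers(text):
    """Compute a small, illustrative subset of style markers."""
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words) or 1          # avoid division by zero
    n_chars = len(text) or 1
    counts = Counter(words)
    features = {
        # Word/character-based features
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(counts) / n_words,
        # Syntactic feature: punctuation marks per character
        "punctuation_per_char": sum(text.count(c) for c in ".,;:!?") / n_chars,
    }
    # Syntactic features: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    return features

markers = style_markers("to be or not to be")
```

Each message becomes one such feature vector, and vectors from known authors serve as training data for the classifiers discussed later.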

  10. Summary on Feature Selection • Content specific features are only effective in specific applications. • Word based features alone cannot represent writing style. But the combination of word based and syntactic features is very effective. [Baayen, 1996] • Structural features are helpful in Vel’s email applications. [Vel, 2000] • Style markers are the most frequently used features in past studies.

  11. Analytical Techniques for Authorship Analysis • Statistical approaches • Univariate methods for authorship analysis • Thisted and Efron test [Thisted, 1987] • CUSUM [Farringdon 1996] • Multivariate methods for authorship analysis • Cluster analysis [Holmes, 1995] • Principal component analysis (PCA) [Burrow, 1987] • Linear discriminant analysis (LDA) [Baayen, 2002] • Machine learning approaches • Bayesian [Mosteller, 1984] • Decision tree [Apte, 1998] • Neural Network [Merriam, 1995; Bradley, 1996] • SVM [Diederich, 2000; Vel, 2001]

  12. Summary on Analytical Techniques • Machine learning methods generally achieved higher accuracies than statistical methods in this field. • Machine learning methods can deal with a large set of features and depend less on stringent mathematical models or assumptions than statistical methods do. • The performance of authorship analysis largely depends on the quality of the feature set.

  13. Challenges for Applying Authorship Analysis to Online Documents • Online documents are generally short in length. • The writing styles of online documents are less formal and the vocabulary is less stable. • The structure or composition style of online documents is often different from normal text documents. • Due to the internationalization of cybercrime, multilingual problems become a new challenge for authorship analysis.

  14. Research Questions • Will authorship analysis techniques be applicable in identifying authors in cyberspace? • What are the effects of using different types of features in identifying authors in cyberspace? • Which classification techniques are appropriate for authorship analysis in cyberspace? • Will the authorship analysis framework be applicable in a multilingual context?

  15. Experimental Design --Testbed • English Email Messages • 70 emails provided by 3 students • English Internet Newsgroup Messages • 153 potentially illegal messages written by 9 authors from misc.forsale.computers.pc-specific.software, misc.forsale.computers and mac-specific.software. • Chinese BBS Messages • 70 messages written by 3 authors from bbs.mit.edu

  16. Experimental Design -- Techniques • Decision tree • Implemented the C4.5 algorithm to handle the continuous-valued attributes in our datasets • Backpropagation neural network • Standard three-layer fully connected backpropagation neural network • Support vector machine • BSVM [Hsu, 2002] • Use linear kernel function • Set noise term to 1000
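Handling continuous-valued attributes in a decision tree comes down to choosing a threshold that turns a numeric feature into a binary test. The sketch below illustrates that idea with plain information gain over midpoints between sorted values. It is a simplification made for illustration and not the authors' implementation: C4.5 itself scores splits with gain ratio and adds further refinements, and the function names here are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        ent -= p * math.log2(p)
    return ent

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best test `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue                      # no boundary between equal values
        threshold = (low + high) / 2      # candidate: midpoint of neighbours
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        remainder = (
            len(left) * entropy(left) + len(right) * entropy(right)
        ) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Toy data: average word length for messages by two hypothetical authors.
threshold, gain = best_threshold([4.1, 4.3, 5.6, 5.9], ["A", "A", "B", "B"])
```

On the toy data the chosen threshold falls between 4.3 and 5.6 and separates the two authors completely, so the gain equals the full one bit of label entropy.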

  17. Experimental Design -- Feature Selection • For our English dataset, the feature selection was based on Vel’s study on email authorship analysis [Vel, 2000] (We added 36 style markers and 8 content specific features): • 206 style markers • 150 function words and 56 other language-based style features • 8 structural features • 9 content specific features • illegal content specific features • For our Chinese dataset, we preliminarily extracted 60 style markers and 7 structural features.

  18. Procedures • Three steps: • Style markers were used in the first run. • Structural features were added in the second run. • Content specific features were added in the third run (newsgroup dataset only). • This procedure was repeated for each of the three algorithms.
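The procedure above amounts to a small grid of runs: each algorithm is evaluated on cumulatively larger feature sets, with the content-specific step applied to the newsgroup dataset only. A minimal sketch of enumerating that grid follows; the identifiers (`experiment_runs`, the group and algorithm labels) are hypothetical names introduced for illustration, and the training and scoring of each run are left out.

```python
# Feature groups in the order they are added across the three steps.
FEATURE_GROUPS = ("style_markers", "structural", "content_specific")
# Labels for the three classifiers named in the slides.
ALGORITHMS = ("C4.5", "backprop_nn", "SVM")

def experiment_runs(dataset, has_content_features):
    """List (dataset, algorithm, feature_groups) for every run on a dataset."""
    # Content-specific features were only used for the newsgroup dataset.
    groups = FEATURE_GROUPS if has_content_features else FEATURE_GROUPS[:2]
    runs = []
    for algorithm in ALGORITHMS:
        for k in range(1, len(groups) + 1):
            # Step k uses the first k feature groups together.
            runs.append((dataset, algorithm, groups[:k]))
    return runs

newsgroup_runs = experiment_runs("newsgroup", has_content_features=True)
email_runs = experiment_runs("email", has_content_features=False)
```

Under these assumptions the newsgroup dataset yields nine runs (three algorithms times three steps) and the email dataset six.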

  19. Measures

  20. Experimental Results

  21. Discussions -- Techniques • SVM and neural networks achieved better performance than the C4.5 decision tree algorithm. • This confirmed the results in previous studies. [Diederich, 2000] • SVM generally had the best performance because of its capability of dealing with a large set of input features.

  22. Discussions -- Feature Selections • Using style markers alone, we achieved high accuracy. • Style markers and the techniques are effective. • Using style markers and structural features outperformed using style markers only (with p-values < 0.05). • Consistent personal patterns exist in the message structures. • Using style markers, structural features, and content specific features did not outperform using style markers and structural features (with p-value of 0.3086). • The content distinction of those messages is not significant. • Style markers and structural features are highly effective.

  23. Discussions -- Datasets • The measures of prediction performance drop significantly for the Chinese dataset compared with the English datasets. • We only used 67 features for the Chinese dataset. • A larger set of function words is needed. • Nevertheless, we achieved 70% - 80% accuracy.

  24. Conclusions • The experimental results indicated a promising future for applying the authorship analysis approaches in cybercrime investigation to address the identity-tracing problem. • Structural features are significant discriminators for online documents. • SVM and neural network techniques achieved high performance. • This approach is promising in the multilingual context.

  25. Future Directions • More illegal messages will be incorporated into our testbed. • The current approach will be extended to analyze the authorship of other cybercrime-related materials, such as bomb threats, hate speech, and child pornography. • Another, more challenging, future direction is to automatically generate an optimal feature set specifically suited to a given dataset.

  26. Questions & Comments Thank you!
