
Stylometry

Projects, mostly Fall 2009 Project

Seidenberg School of Computer Science and Information Systems

Stylometry is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship.

Description of Project
  • Part I
    • Search for an interesting and unique application of stylometry for research
  • Part II
    • Feasibility study of existing tools/applications for determining the authorship of emails (250 words or fewer)
Existing / Potential Uses of Stylometry

  • Stylometric study of social networking, electronic mail, and instant messaging is still in its early stages

Use Cases
  • Twitter
    • Used to verify existing Twitter accounts and help mitigate impersonations
  • Electronic mail
    • Deployed in a corporate setting to help identify the authors of anonymous emails meant to do harm
  • Chat
    • Assist in determining authorship of instant messages
    • Similar to Twitter but needs to be dynamic
Use Cases
  • Terrorism
    • Help identify an author of terrorist content or identify terrorist content by using contextual analysis
    • Applied to blogs, forums, wikis, email, chat and other forms of digital content
Tools discovered
  • JGAAP (Java Graphical Authorship Attribution Program)
  • Signature Tool
  • C# Tool
  • StyleTool
  • Blog stylometry tool
  • Stylometry tool
Tools discovered
  • JGAAP (Java Graphical Authorship Attribution Program)
    • Java based tool
    • Runs on Windows and Linux
    • Identification tool
      • 1-of-n decision: determine which of many known email authors wrote one unknown email
    • Each test compares one unknown email against the 99 remaining known emails
    • 100 total tests run
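
The 1-of-n procedure above can be sketched as a leave-one-out nearest-neighbor test. This is a minimal illustration, not JGAAP's implementation: it assumes each test holds out one email and attributes it to the author of the closest remaining email, and it uses a plain character-level Levenshtein distance because the results slide names that algorithm for the JGAAP run. The function names and the toy emails are hypothetical.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def identify(unknown: str, known: List[Tuple[str, str]]) -> str:
    """1-of-n decision: return the author of the known email closest to `unknown`."""
    author, _ = min(known, key=lambda pair: levenshtein(unknown, pair[1]))
    return author


def leave_one_out_accuracy(emails: List[Tuple[str, str]]) -> float:
    """Hold out each (author, text) pair in turn and test it against the rest."""
    correct = 0
    for i, (true_author, text) in enumerate(emails):
        known = emails[:i] + emails[i + 1:]
        if identify(text, known) == true_author:
            correct += 1
    return correct / len(emails)


# Toy data (made up for illustration; not the project's email corpus).
toy_emails = [
    ("alice", "hi team, see notes attached"),
    ("alice", "hi team, see slides attached"),
    ("bob", "Regards. Please advise ASAP!!"),
    ("bob", "Regards. Please confirm ASAP!!"),
]
accuracy = leave_one_out_accuracy(toy_emails)
```

With 100 emails this loop runs 100 tests, each comparing one held-out email against the other 99.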
Tools discovered
  • C# Tool
    • Written in the C# programming language
    • Developed by prior Pace CS graduate students
    • Identification tool
      • 1-of-n decision: determine which of many known email authors wrote one unknown email
    • Each test compares one unknown email against the 99 remaining known emails
    • 100 total tests run
Tools discovered
  • Signature Tool
    • Written in C programming language (not confirmed)
    • Created by Peter Millican of Hertford College, Oxford
    • Authentication tool
      • Decision is either match or no match
    • Match test: 9 known samples and 1 unknown sample, all from the same author
    • No-match test: 10 known samples from one author and 1 unknown sample from a different author
    • A total of 105 tests were run
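
A match / no-match decision of this kind can be sketched as a distance test against an author profile. The sketch below is not how the Signature tool works internally; it assumes three simple per-word features, averages them over the known samples, and accepts the unknown sample when it lies within a threshold. The feature choice, the function names, and the threshold value of 0.5 are all hypothetical.

```python
import math
from typing import Dict, List


def features(text: str) -> Dict[str, float]:
    """Three illustrative style features, normalized by word count."""
    words = text.split()
    n_words = len(words) or 1
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "commas_per_word": text.count(",") / n_words,
        "exclaims_per_word": text.count("!") / n_words,
    }


def centroid(samples: List[str]) -> Dict[str, float]:
    """Average feature vector of an author's known samples."""
    vectors = [features(s) for s in samples]
    return {k: sum(v[k] for v in vectors) / len(vectors) for k in vectors[0]}


def authenticate(known_samples: List[str], unknown: str,
                 threshold: float = 0.5) -> bool:
    """Return True (match) if the unknown sample is close to the known profile."""
    profile = centroid(known_samples)
    candidate = features(unknown)
    distance = math.sqrt(sum((profile[k] - candidate[k]) ** 2 for k in profile))
    return distance <= threshold


# Toy samples (made up for illustration).
known = ["Hello, team, thanks!", "Hello, all, thanks!"]
is_match = authenticate(known, "Hello, team, cheers!")
is_no_match = authenticate(known, "internationalization standardization")
```

In a match test the known and unknown samples share an author, so the expected answer is True; in a no-match test it is False.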
Testing methodology
  • Each team member submitted 20 or 30 actual emails, from 2 or 3 different authors respectively
    • Total of 100 emails collected from 10 different authors
    • Exported from the native email program and saved as text files
    • Average email length: 195.7 words
  • Different testing for identification and authentication tools
    • For the authentication tool
      • False Accept Rate (FAR): the rate at which a document is falsely attributed to an author
      • False Reject Rate (FRR): the rate at which a document is not attributed to its true author
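
The two error rates defined above can be computed from a list of trial outcomes. The following is a minimal sketch: the trial representation (a pair of booleans per test) and the sample trials are assumptions made for illustration, not the project's recorded results.

```python
from typing import Iterable, Tuple


def error_rates(trials: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (FAR, FRR) from (same_author, accepted) pairs.

    FAR = accepted different-author trials / all different-author trials
    FRR = rejected same-author trials / all same-author trials
    """
    false_accepts = false_rejects = 0
    different_author = same_author_total = 0
    for same_author, accepted in trials:
        if same_author:
            same_author_total += 1
            if not accepted:
                false_rejects += 1
        else:
            different_author += 1
            if accepted:
                false_accepts += 1
    far = false_accepts / different_author if different_author else 0.0
    frr = false_rejects / same_author_total if same_author_total else 0.0
    return far, frr


# Made-up outcomes: two match tests and four no-match tests.
sample_trials = [
    (True, True), (True, False),
    (False, False), (False, True), (False, False), (False, False),
]
far, frr = error_rates(sample_trials)
```

Match tests contribute only to the FRR and no-match tests only to the FAR, which is why the two test types are run separately.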
Testing Results

[Result figures not preserved in this transcript. Panel titles:]
  • JGAAP (Levenshtein Distance algorithm)
  • C# Tool Match Test
  • Categorizing the results based on the country of the author
  • Signature Tool Match Test
  • Signature Tool No-Match Test

Earlier Study’s Features – 20 of 55
1. Number of sentences beginning with upper case
2. Number of sentences beginning with lower case
3. Number of Words
4. Average Word Length
5. Number of Sentences
6. Average Number of Words per Sentence
7. Number of Paragraphs
8. Average Number of Words per Paragraph
9. Number of Exclamation Marks
10. Number of Number Signs
11. Number of Dollar Signs
12. Number of Ampersands
13. Number of Percent Signs
14. Number of Apostrophes
15. Number of Left Parentheses
16. Number of Right Parentheses
17. Number of Asterisks
18. Number of Plus Signs
19. Number of Commas
20. Number of Dashes
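
The 20 features listed on this slide are simple counts and averages, so they can be sketched in a few lines. This is an illustration only, not the earlier study's code: the segmentation rules (sentences end at ., ! or ? followed by whitespace; paragraphs are separated by blank lines; words are runs of letters, digits and straight apostrophes) and the feature names are assumptions.

```python
import re
from typing import Dict

# Features 9-20: raw counts of individual characters.
CHAR_FEATURES = {
    "exclamation_marks": "!", "number_signs": "#", "dollar_signs": "$",
    "ampersands": "&", "percent_signs": "%", "apostrophes": "'",
    "left_parentheses": "(", "right_parentheses": ")", "asterisks": "*",
    "plus_signs": "+", "commas": ",", "dashes": "-",
}


def extract_features(text: str) -> Dict[str, float]:
    """Compute the 20 listed features using naive segmentation rules."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z0-9']+", text)
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    n_words = len(words)

    feats: Dict[str, float] = {
        "sentences_upper": sum(1 for s in sentences if s[0].isupper()),
        "sentences_lower": sum(1 for s in sentences if s[0].islower()),
        "num_words": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "num_sentences": len(sentences),
        "avg_words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "num_paragraphs": len(paragraphs),
        "avg_words_per_paragraph": n_words / len(paragraphs) if paragraphs else 0.0,
    }
    for name, char in CHAR_FEATURES.items():
        feats[name] = text.count(char)
    return feats


# Made-up sample email body.
sample = "Hello, team! here is the plan.\n\nCost: $5 (approx) - that's 10% less."
sample_features = extract_features(sample)
```

Each email then becomes a fixed-length numeric vector that an identification or authentication tool can compare. Typographic (curly) apostrophes are not counted by this sketch.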
Conclusion
  • Overall, the moderate accuracy of the test results suggests that none of the tools evaluated is capable of accurate stylometric identification of email authors
  • Categorizing email samples by country of origin appears to yield better accuracy for all three tools tested
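
One way to examine the country effect is to break accuracy down by group. The sketch below only shows the bookkeeping; the record format and the group labels are hypothetical, and the outcomes are made up rather than taken from the project's tests.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each group label to the fraction of its tests that were correct."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for group, correct in results:
        totals[group] += 1
        hits[group] += int(correct)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up (country label, attribution correct?) records.
sample_results = [
    ("country_a", True), ("country_a", False),
    ("country_b", True), ("country_b", True),
]
per_country = accuracy_by_group(sample_results)
```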
Recommendations
  • Further testing and research using email from authors of different countries
  • Continue to refine and add to the stylistic feature set created by prior Pace graduate students
    • Include new features that are becoming more prevalent in digital content, e.g. emoticons, hyperlinks, and Internet slang (BRB, LOL, TTYL)
  • The case of authors who wish to disguise their identity needs to be addressed and researched further
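
The proposed new features could be counted along these lines. This is a rough sketch under stated assumptions: the emoticon pattern covers only a few Western-style emoticons, the URL pattern is deliberately simple, and the slang list contains just the three examples from the slide. The names `modern_features`, `EMOTICON_RE`, `URL_RE`, and `SLANG` are hypothetical.

```python
import re
from typing import Dict

# Eyes (: ; =), optional nose (- ^ '), then a mouth character.
EMOTICON_RE = re.compile(r"[:;=][-^']?[)(DPpOo/\\|]")
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
SLANG = {"brb", "lol", "ttyl"}


def modern_features(text: str) -> Dict[str, int]:
    """Count hyperlinks, emoticons, and Internet slang terms in a message."""
    # Strip URLs first so that "://" is not mistaken for an emoticon.
    without_urls = URL_RE.sub(" ", text)
    tokens = re.findall(r"[A-Za-z]+", without_urls.lower())
    return {
        "hyperlinks": len(URL_RE.findall(text)),
        "emoticons": len(EMOTICON_RE.findall(without_urls)),
        "slang": sum(1 for token in tokens if token in SLANG),
    }


# Made-up sample message.
sample_counts = modern_features("LOL see http://example.com :) brb ;-)")
```

These counts could be appended to the existing feature set. A pattern this simple will produce some false positives, for example a colon directly followed by a capital P.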