Stylometry Project

Stylometry Project. May 4, 2007. Pace’s Research Day . TEAM MEMBERS. Rob Goodman, Programmer Currently working at KPMG Completing MS in Computer Science in December 2008 Matt Hahn, Quality Assurance Currently working at Affiliated Computer Services, Inc.


Stylometry Project

May 4, 2007

Pace’s Research Day

TEAM MEMBERS
  • Rob Goodman, Programmer
    • Currently working at KPMG
    • Completing MS in Computer Science in December 2008
  • Matt Hahn, Quality Assurance
    • Currently working at Affiliated Computer Services, Inc.
    • Completing MS in Information Technologies in May 2007
  • Madhuri Marella, Programmer
    • Completing MS in Computer Science in May 2007
  • Chris Ojar, Team Leader
    • Currently working at Pace’s Evening Support Office in Pleasantville
    • Completing MS in Internet Technologies in May 2007
WHAT IS STYLOMETRY?
  • The study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship
  • Used to attribute authorship of anonymous or disputed documents; it has legal as well as academic and literary applications
  • Uses statistical analysis, pattern recognition, and artificial intelligence techniques. Typical features include word frequencies and patterns in common parts of speech
THE PROGRAM
  • A pattern recognition system to identify the author of arbitrary email using stylometry features
  • Phase 1 – Data Collection
    • Raw data from Keystroke Biometric Project
    • Plain text emails
  • Phase 2 – Feature Extraction
    • Measurements of punctuation, content format, and keystrokes [when applicable]
    • Normalize features to 0-1 range
  • Phase 3 – Classification
    • k-Nearest-Neighbor using Euclidean distance
      • Defaulted to 10
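
Phases 2 and 3 can be illustrated with a minimal Python sketch. It is not the project's code: the function names, the two-feature vectors, and the author labels are hypothetical, and it assumes that "Defaulted to 10" refers to the value of k. Features are scaled to the 0-1 range using the minimum and maximum seen in the training data, and a test case is assigned to the majority author among its k nearest training samples by Euclidean distance.

```python
import math
from collections import Counter


def fit_min_max(rows):
    """Return the per-feature minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(row, lows, highs):
    """Map each feature into the 0-1 range; constant features map to 0.0."""
    return [
        (value - lo) / (hi - lo) if hi > lo else 0.0
        for value, lo, hi in zip(row, lows, highs)
    ]


def knn_predict(train, labels, query, k=10):
    """Majority vote among the k training rows closest to the query.

    k=10 is an assumption about what the slide's default refers to.
    """
    ranked = sorted(
        (math.dist(row, query), label) for row, label in zip(train, labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors, one per email (two made-up features each).
train_raw = [[2.0, 12.0], [3.0, 14.0], [9.0, 25.0], [10.0, 27.0]]
labels = ["author_a", "author_a", "author_b", "author_b"]

lows, highs = fit_min_max(train_raw)
train = [scale(row, lows, highs) for row in train_raw]

# The test case is scaled with the training minimum and maximum.
query = scale([8.5, 24.0], lows, highs)
predicted = knn_predict(train, labels, query, k=3)
print(predicted)
```

With only four training samples the example passes k=3; a k larger than the number of samples simply lets every training sample vote.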

RAW DATA EXAMPLES

File Name: Sandy-biometrics.txt

File Name: Goodman-email.txt

Dear Ms. Sanderson:

I enjoyed our conversation on February 18th at the Family and Child Development seminar on teaching young children and appreciated your personal input about helping children attend school for the first time.  This letter is to follow-up about the Fourth Grade Teacher position as discussed at the seminar.  I will be completing my Bachelor of Science Degree in Family and Child Development with a concentration in Early Childhood Education at Pace in May of 2007, and will be available for employment at that time…

DIRTY DATA EXAMPLE

<Shift> I'm on my second take and <Shift> I'm still writing about the same book <Shift> : <Shift> " <Shift> A <Shift> Million <Shift> Little <Shift> Pieces. <Backspace> <Backspace> <Shift> " <Shift> I'm not sure if <Shift> I am supposed to be typing the same this <Backspace> ng <Shift> I typed on submit <Backspace> ssion <Shift> #1 as <Shift> I am on sb <Backspace> ubmission <Shift> #2, but since <Shift> my sister is skiing in <Shift> Vermont, <Shift> I'll just continued <Backspace> . <Shift> In any event, as a <Backspace> soon as <Shift> I found out the book was not true, <Shift> I couldn't pick it up for a few days. <Shift> Then, it got the best of me. <Shift> It is tu <Backspace> <Backspace> a fact that <Shift> James <Shift> Frey is a great ri <Backspace> <Backspace> writer. <Shift> He holds your interest and attention a <Backspace> so <Shift> I go <Backspace> t b <Backspace> past the fact the <Backspace> <Backspace> at he lied, and continued on. <Shift> I have to say <Shift> I endj <Backspace> <Backspace> joyed the book a lot better as a non-fiction book than <Shift> I did as a fiction novel.

CLEAN DATA EXAMPLE

I'm on my second take and I'm still writing about the same book: "A Million Little Pieces." I'm not sure if I am supposed to be typing the same thing I typed on submission #1 as I am on submission #2, but since my sister is skiing in Vermont, I'll just continue. In any event, as soon as I found out the book was not true, I couldn't pick it up for a few days. Then, it got the best of me. It is a fact that James Frey is a great writer. He holds your interest and attention so I got past the fact that he lied, and continued on. I have to say I enjoyed the book a lot better as a non-fiction book than I did as a fiction novel.
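
The relationship between the dirty and clean examples can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's reconstruction code: it assumes that key tokens appear as `<Shift>` and `<Backspace>` separated from the text by single spaces, that `<Backspace>` deletes the preceding character, and that `<Shift>` adds no character of its own. The name `reconstruct` is hypothetical. Because the whitespace around tokens in the transcript is ambiguous, the sketch will not reproduce the full clean example exactly.

```python
import re

# A key token with the optional separator space on either side (assumed format).
TOKEN = re.compile(r" ?<(Shift|Backspace)> ?")


def reconstruct(dirty):
    """Rebuild typed text from a keystroke log containing key tokens."""
    chars = []
    pos = 0
    for match in TOKEN.finditer(dirty):
        # Copy the ordinary characters typed before this token.
        chars.extend(dirty[pos:match.start()])
        if match.group(1) == "Backspace":
            if chars:
                chars.pop()  # delete the previously typed character
        elif chars and chars[-1] != " ":
            chars.append(" ")  # <Shift> types nothing; keep one word break
        pos = match.end()
    chars.extend(dirty[pos:])
    return "".join(chars)


fragment = (
    "<Shift> I'm not sure if <Shift> I am supposed to be typing the same "
    "this <Backspace> ng <Shift> I typed on submit <Backspace> ssion <Shift> #1"
)
print(reconstruct(fragment))
```

The fragment is taken from the dirty example above; consecutive `<Backspace>` tokens delete one character each.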

DESIGN MODEL

Flowchart nodes (decision branches are labeled Yes / No):
  • START
  • Single Raw Data File?
  • READ RAW DATA
  • Email Reconstructed, Dirty File and Feature Stats Generated
  • Base Data Files of Email Reconstructed, Dirty File and Feature Stats Generated with File Name Saved with the Extension of “- Clean Original.”
  • Emails Reconstructed, Dirty Files and Feature Stats Generated in One File with File Name Saved as Batch.year-month-day and military time
  • Select & Convert Base Data Files to…
  • …DATA SET FILE
  • Compare to Test Case?
  • Enter Author of Test Case
  • Run Compare
  • READ TEST CASE
  • Do You Accept the Program’s Result?
  • Save Test Case to Data Set?
  • END

ANALYSIS MODEL

Flowchart nodes:
  • START
  • READ RAW DATA
  • Feature Extraction
  • Feature Statistics
  • Normalized Feature Statistics
  • K Nearest Neighbor Classifier
  • TEST CASE
  • K Nearest Neighbor Identification
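
The Feature Extraction and Feature Statistics steps can be sketched as follows. The slides do not list the individual features, so the four measurements below and the name `punctuation_features` are hypothetical stand-ins for "measurements of punctuation, content format"; they only show the general shape of turning one email into a numeric feature vector.

```python
import string


def punctuation_features(text):
    """Compute a few simple per-email measurements (hypothetical feature set)."""
    words = text.split()
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    return {
        "commas_per_word": text.count(",") / n_words,
        "periods_per_word": text.count(".") / n_words,
        "punctuation_ratio": sum(ch in string.punctuation for ch in text) / n_chars,
        "avg_word_length": sum(len(w) for w in words) / n_words,
    }


sample = "Then, it got the best of me."
features = punctuation_features(sample)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Each value would then be normalized to the 0-1 range across all emails before being passed to the classifier.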

PROJECT HOME PAGE

http://utopia.csis.pace.edu/cs615/2006-2007/team2/

QUESTIONS

Contact

cojar@pace.edu or ctappert@pace.edu for more information, or visit http://utopia.csis.pace.edu/cs615/2006-2007/team2