


Determining Common Authorship Among Documents

Paul Bonamy

Mentor: Dr. Paul Kantor


Author Identification & Common Authorship

  • Author Identification: “Who wrote this?”

  • Mosteller/Wallace, 1964 – The Federalist

  • 12 disputed papers attributed to Madison

  • Generally utilizes statistical analysis

  • Common Authorship: “Do these share an author?”

  • Does not (necessarily) require statistics/training

  • Useful for detecting forgeries, etc.


BMR/BXR

  • Implements Bayesian Multinomial Regression

  • Used to perform 1-of-k classification

  • BMRtrain accepts feature vectors, outputs assignment model

  • BMRclassify accepts model & vectors, outputs assignments

  • Can output author probability vectors


Bayesian Analysis

  • Consider two match boxes

  • Probability of Box 1, given black marble?

    • H0 = We have Box 1, E = We see a black marble
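
The figure showing the contents of the two boxes is not part of this transcript, so the marble counts below are purely hypothetical. Under that assumption, a minimal sketch of the calculation P(H0|E) = P(E|H0) P(H0) / P(E) might look like this (the function name `posterior_box1` and the equal prior are also assumptions):

```python
from fractions import Fraction

def posterior_box1(box1, box2, prior_box1=Fraction(1, 2)):
    """Return P(Box 1 | black marble) via Bayes' Theorem.

    box1 and box2 are (black, white) marble counts; the values used
    below are hypothetical, not taken from the presentation.
    """
    p_black_given_box1 = Fraction(box1[0], sum(box1))
    p_black_given_box2 = Fraction(box2[0], sum(box2))
    prior_box2 = 1 - prior_box1
    # P(E): total probability of drawing a black marble from either box
    p_black = p_black_given_box1 * prior_box1 + p_black_given_box2 * prior_box2
    return p_black_given_box1 * prior_box1 / p_black

# Hypothetical contents: Box 1 holds 3 black and 1 white marble,
# Box 2 holds 1 black and 3 white.
print(posterior_box1((3, 1), (1, 3)))  # 3/4
```

With these made-up counts, seeing a black marble raises the probability of holding Box 1 from 1/2 to 3/4.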


Bayesian Analysis in BMR

  • Bayes’ Theorem Extendable to P(C|F1…FN)

    • C is a class

    • F1…FN are features

  • Effectively applies Bayes’ Theorem to itself
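
One way to read "applies Bayes' Theorem to itself" is as a sequential update, where the posterior after one feature becomes the prior for the next. The sketch below illustrates that idea only. It assumes the features are conditionally independent given the class (a naive Bayes simplification), uses made-up likelihoods, and is not a description of how BMR is implemented internally.

```python
def update(posterior, likelihoods):
    """One Bayes step: multiply by P(F | C) per class, then renormalize."""
    unnormalized = {c: posterior[c] * likelihoods[c] for c in posterior}
    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}

def classify(prior, feature_likelihoods):
    """Fold features in one at a time; each posterior is the next prior."""
    posterior = dict(prior)
    for likelihoods in feature_likelihoods:
        posterior = update(posterior, likelihoods)
    return posterior

# Hypothetical numbers: two classes (authors A and B), two observed features.
prior = {"A": 0.5, "B": 0.5}
observed = [
    {"A": 0.8, "B": 0.4},    # P(F1 | C)
    {"A": 0.5, "B": 0.25},   # P(F2 | C)
]
print(classify(prior, observed))  # roughly {'A': 0.8, 'B': 0.2}
```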


BMR/BXR Workflow

  • Data (Doc Corpus) → Test/Train Splitter → Training Set + Testing Set

  • Training Set, Testing Set → Feature Extractor → Feature Vectors (one set each)

  • Feature Vectors → BMRtrain → Model

  • Model + Feature Vectors → BMRclassify → Author Identification, Author Probabilities
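
The Test/Train Splitter step could be sketched as below. The slides do not give the split ratio or say how documents were assigned, so `split_corpus`, the 25% test fraction, and the fixed seed are all assumptions for illustration.

```python
import random

def split_corpus(docs, test_fraction=0.25, seed=0):
    """Shuffle docs reproducibly and return (training, testing) lists."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

docs = [f"doc{i:03d}" for i in range(130)]  # placeholder document IDs
train, test = split_corpus(docs)
print(len(train), len(test))  # 98 32
```

Fixing the seed keeps the split reproducible, so the same documents end up in the training and testing sets on every run.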


Corpus Construction

  • Articles from 2006-07 issues of The Compass Newspaper

  • 16 Authors

  • 130 Documents

    • 300 - 500 Words: 69

    • 500+ Words: 61

  • Varied Topics

  • Sample excerpt: On Friday, November 3, LSSU experienced its first closing of the semester due to inclement weather. The Soo Evening News reported a “number of minor mishaps,” and “slippery-road induced mishaps,” including two crashes near the campus of LSSU. All classes before 10 AM were canceled because of the snow and ice that had accumulated overnight, but many students arrived for classes as usual, unaware of the cancellation. …


Feature Extraction

  • Perl script using Lingua::EN::Tagger

  • Selects words, part-of-speech (POS), or both (wordPOS)

    • address/VB

    • address/NN

  • Used wordPOS in common authorship study

  • Returns vector of feature frequencies

  • 4:9.0 16:5.0 22:4.0 23:2.0 28:5.0 29:1.0 33:4.0 36:9.0 38:1.0 41:3.0 46:13.0 56:2.0 …
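
The original extractor is a Perl script that is not shown. As a rough Python illustration of the same idea, the sketch below counts wordPOS features from tokens that are already tagged in the `word/POS` form used above and writes them in the `id:frequency` layout of the example vector. The function names, the ID numbering from 1, and the sample tokens are assumptions.

```python
from collections import Counter

def word_pos_vector(tagged_tokens, vocabulary):
    """Count word/POS features and return {feature_id: frequency}.

    vocabulary maps each feature string to an integer ID and grows as
    unseen features appear (IDs start at 1, which is an assumption).
    """
    counts = Counter(tagged_tokens)
    vector = {}
    for feature, count in counts.items():
        feature_id = vocabulary.setdefault(feature, len(vocabulary) + 1)
        vector[feature_id] = float(count)
    return vector

def format_sparse(vector):
    """Render a vector as 'id:value' pairs sorted by feature ID."""
    return " ".join(f"{i}:{v}" for i, v in sorted(vector.items()))

def parse_sparse(line):
    """Parse 'id:value' pairs back into {feature_id: frequency}."""
    pairs = (pair.split(":") for pair in line.split())
    return {int(i): float(v) for i, v in pairs}

vocab = {}
tokens = ["address/VB", "the/DT", "address/NN", "the/DT"]  # hypothetical input
vec = word_pos_vector(tokens, vocab)
print(format_sparse(vec))  # 1:1.0 2:2.0 3:1.0
```

Because `address/VB` and `address/NN` receive different IDs, the verb and noun uses of the same word are counted as separate features.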


Author Probability Vectors

  • Produced by BMR/BXR upon request

  • Probability doc belongs to each author in the training set

  • Not normalized (sum not necessarily 1)

  • 0.17% 0.68% 9.13% 8.90% 2.42% 0.94% 10.55% 0.32% 0.72% 36.95% 0.31% 0.50% 0.48% 22.08% 1.34% 4.52%
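
A small sketch of how such a vector could be read and, if needed, rescaled to sum to 1. The helper names are hypothetical, and the percentage layout is taken from the example above rather than from BMR/BXR's actual output format.

```python
def parse_percentages(line):
    """Convert '0.17% 0.68% ...' into a list of probabilities in [0, 1]."""
    return [float(token.rstrip("%")) / 100.0 for token in line.split()]

def normalize(probabilities):
    """Rescale the entries so that they sum to 1."""
    total = sum(probabilities)
    return [p / total for p in probabilities]

example = ("0.17% 0.68% 9.13% 8.90% 2.42% 0.94% 10.55% 0.32% "
           "0.72% 36.95% 0.31% 0.50% 0.48% 22.08% 1.34% 4.52%")
probs = parse_percentages(example)

# Index (0-based) of the most probable author in the training set.
best = max(range(len(probs)), key=probs.__getitem__)
print(len(probs), best)  # 16 9
```

The 16 entries match the 16 authors in the corpus. For this particular example the entries add up to about 100.01%, so normalizing changes them only slightly.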


Computed With Features

  • Start with feature vectors

  • Select all distinct pairs of vectors

  • Compute dot product and Euclidean distance

  • Sort data

    • Descending by dot product

    • Ascending by Euclidean distance
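
The steps above can be sketched as follows, assuming each vector is stored as a `{feature_id: value}` dictionary. The document IDs, sample values, and function names are hypothetical.

```python
import math
from itertools import combinations

def dot(u, v):
    """Dot product of two sparse vectors stored as {feature_id: value}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(value * v.get(i, 0.0) for i, value in u.items())

def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    keys = u.keys() | v.keys()
    return math.sqrt(sum((u.get(i, 0.0) - v.get(i, 0.0)) ** 2 for i in keys))

def rank_pairs(vectors):
    """Score every distinct pair and return both sorted orderings."""
    scored = [
        (a, b, dot(vectors[a], vectors[b]), euclidean(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    ]
    by_dot = sorted(scored, key=lambda row: row[2], reverse=True)
    by_distance = sorted(scored, key=lambda row: row[3])
    return by_dot, by_distance

# Hypothetical document IDs and feature vectors.
vectors = {
    "doc1": {4: 9.0, 16: 5.0},
    "doc2": {4: 8.0, 16: 4.0},
    "doc3": {22: 4.0, 23: 2.0},
}
by_dot, by_distance = rank_pairs(vectors)
print(by_dot[0][:2], by_distance[0][:2])  # ('doc1', 'doc2') ('doc1', 'doc2')
```

The same functions work unchanged on author probability vectors if those are stored in the same dictionary form.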


Computed With Authors

  • Start with author probability vectors

  • Select all distinct pairs of vectors

  • Compute dot product and Euclidean distance

  • Sort data

    • Descending by dot product

    • Ascending by Euclidean distance


What Are We Looking For?

  • Dot product (DP) and Euclidean distance both indicate how close two vectors are

    • Higher DP or lower Euclidean distance means closer

    • Pairs are sorted from closest to furthest

  • Docs by the same author should be close together

    • Docs by different authors should be far apart


ROC Curve

  • Plots fraction of not-pairs (different-author pairs) against fraction of pairs (same-author pairs)

  • Area under curve indicates model accuracy

    • Higher is better

  • Curve shown: Euclidean distance between feature vectors

  • This curve: area under curve is 64.7%
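
A minimal sketch of how the area under the curve could be computed from the sorted pair list, assuming each pair has a distance and a known same-author label. It uses the rank interpretation of AUC (the chance that a random same-author pair is closer than a random different-author pair), which may differ from how the presentation's figure was actually produced. The sample values are made up.

```python
def roc_auc(distances, same_author):
    """Area under the ROC curve when pairs are ranked by ascending distance.

    Equals the probability that a randomly chosen same-author pair is
    closer than a randomly chosen different-author pair (ties count half).
    """
    positives = [d for d, same in zip(distances, same_author) if same]
    negatives = [d for d, same in zip(distances, same_author) if not same]
    if not positives or not negatives:
        raise ValueError("need at least one pair of each kind")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical distances for five document pairs and their true labels.
distances = [0.2, 0.5, 0.9, 0.4, 0.7]
same_author = [True, True, False, False, True]
print(roc_auc(distances, same_author))  # about 0.667 (4 of 6 comparisons)
```

To score a dot-product ranking with the same function, pass the negated dot products so that larger similarity counts as smaller distance. The double loop is quadratic in the number of pairs, so a rank-based formulation would be preferable for large corpora.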








Analyzing Other Corpora

  • Obtained second corpus

    • 9377 Documents

    • 24 Authors

  • Results similar to those on Compass dataset


Open Questions

  • Are the variations in area under the curve significant?

  • How does Author ID model accuracy affect same-author accuracy?

    • A model with low Author-ID accuracy did very well on the same-author task

  • Can we reduce memory/processing requirements?
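
One likely source of the cost is the number of distinct pairs, which grows quadratically with corpus size. The quick calculation below assumes the second corpus was also processed by comparing every distinct pair of documents, which the slides do not state explicitly.

```python
import math

def distinct_pairs(n_documents):
    """Number of unordered document pairs: n * (n - 1) / 2."""
    return math.comb(n_documents, 2)

print(distinct_pairs(130))   # 8385
print(distinct_pairs(9377))  # 43959376
```

Going from 130 to 9377 documents increases the pair count from under ten thousand to roughly 44 million, each needing a dot product and a distance.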

