
Authorship Attribution Using Probabilistic Context-Free Grammars


Presentation Transcript


  1. Authorship Attribution Using Probabilistic Context-Free Grammars Sindhu Raghavan, Adriana Kovashka, Raymond Mooney The University of Texas at Austin

  2. Authorship Attribution • Task of identifying the author of a document • Applications • Forensics (Luyckx and Daelemans, 2008) • Cyber crime investigation (Zheng et al., 2009) • Automatic plagiarism detection (Stamatatos, 2009) • The Federalist Papers study (Mosteller and Wallace, 1984) • The Federalist Papers are a set of essays written in support of ratifying the US Constitution • Authorship of these papers was unknown at the time of publication • Statistical analysis was used to identify the authors of these documents

  3. Existing Approaches • Style markers (function words) as features for classification (Mosteller and Wallace, 1984; Burrows, 1987; Holmes and Forsyth, 1995; Joachims, 1998; Binongo and Smith, 1999; Stamatatos et al., 1999; Diederich et al., 2000; Luyckx and Daelemans, 2008) • Character-level n-grams (Peng et al., 2003) • Syntactic features from parse trees (Baayen et al., 1996) • Limitations • Capture mostly lexical information • Do not necessarily capture the author’s syntactic style

  4. Our Approach • Use a probabilistic context-free grammar (PCFG) to capture the syntactic style of the author • Construct a PCFG based on the documents written by the author and use it as a language model for classification • Requires annotated parse trees of the documents. How do we obtain these annotated parse trees?

  5. Algorithm – Step 1 • Treebank each training document using a statistical parser trained on a generic corpus • Stanford parser (Klein and Manning, 2003) • WSJ or Brown corpus from the Penn Treebank (http://www.cis.upenn.edu/~treebank) [Figure: stacks of training documents, one per author (Bob, Mary, John, Alice), being converted to parse trees; see the sketch below]
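A minimal Python sketch of Step 1, assuming you have some constituency parser wrapped as a parse(sentence) -> nltk.Tree function; the wrapper name is hypothetical, and any parser trained on WSJ/Brown (such as the Stanford parser named on the slide) could fill the role:

    import nltk  # assumes the 'punkt' sentence-tokenizer models are installed

    def treebank_documents(documents, parse):
        """Treebank every training document for one author.

        `parse` is a hypothetical wrapper around whatever statistical
        parser you use, mapping a sentence string to an nltk.Tree.
        """
        trees = []
        for doc in documents:
            for sent in nltk.sent_tokenize(doc):
                trees.append(parse(sent))
        return trees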

  6. Algorithm – Step 2 • Train a PCFG for each author using the treebanked documents from Step 1 [Figure: one grammar per author (Bob, Mary, John, Alice), each with its own rule probabilities, e.g. S → NP VP 0.8, S → VP 0.2, NP → Det A N 0.4, NP → NP PP 0.35, NP → PropN 0.25, …; see the sketch below]
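The per-author grammars of Step 2 can be induced with NLTK's built-in PCFG induction. A minimal sketch (not the paper's actual training code), assuming `trees` is the output of the Step 1 sketch for one author:

    from nltk import Nonterminal, induce_pcfg

    def train_author_pcfg(trees):
        """Induce a PCFG from one author's treebanked sentences.

        Rule probabilities are relative frequencies among productions
        sharing a left-hand side -- the maximum-likelihood estimates the
        slide's numbers (e.g. S -> NP VP 0.8 vs. S -> VP 0.2) illustrate.
        """
        productions = []
        for tree in trees:
            productions.extend(tree.productions())
        return induce_pcfg(Nonterminal('S'), productions)

    # One grammar per author:
    # grammars = {author: train_author_pcfg(ts) for author, ts in treebanked.items()}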

  7–9. Algorithm – Step 3 • Parse each sentence of the test document with every author’s PCFG • Multiply the probability of the top parse for each sentence in the test document to score the document under each grammar • The author whose grammar gives the highest score becomes the label for the test document [Figure, built up over three slides: the four grammars from Step 2 scoring one test document – Alice .6, Bob .5, Mary .33, John .75 – so John is the label; see the sketch below]
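A sketch of the Step 3 scoring, continuing from the grammars in the previous sketch and assuming `test_sents` is a list of tokenized sentences. Multiplying many small parse probabilities underflows, so this works in log space (equivalent, since the argmax is unchanged); sentences a grammar cannot cover are simply skipped here, whereas the smoothed models later in the talk address coverage directly:

    from math import log
    from nltk.parse import ViterbiParser

    def doc_log_score(grammar, sentences):
        """Sum the log-probability of the top parse of each tokenized
        sentence, i.e. the log of the product the slide describes."""
        parser = ViterbiParser(grammar)
        total = 0.0
        for tokens in sentences:
            try:
                best = next(iter(parser.parse(tokens)))
            except (ValueError, StopIteration):
                continue  # word or structure not covered by this grammar
            total += log(best.prob())
        return total

    # label = max(grammars, key=lambda a: doc_log_score(grammars[a], test_sents))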

  10. Experimental Evaluation

  11. Data • Blue – news articles • Red – literary works • Data sets available at www.cs.utexas.edu/users/sindhu/acl2010 [Table of data sets not reproduced in the transcript]

  12–13. Methodology • Bag-of-words model (baseline) • Naïve Bayes, MaxEnt • N-gram models (baseline) • N = 1, 2, 3 • Basic PCFG model • PCFG-I (interpolation)

  14. Basic PCFG • Train the PCFG using only the documents written by the author • Poor performance when few documents are available for training • One remedy is to increase the number of documents in the training set • In forensics, however, we do not always have access to many documents written by the same author • Alternative techniques are needed when few training documents are available

  15. PCFG-I • Uses interpolation for smoothing • Augments the training data by adding sections of the WSJ/Brown corpus • Up-samples the data for the author (see the sketch below)
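A hedged sketch of one way to realize the data-level smoothing this slide describes: pool the author's productions with productions from a generic treebank, repeating the author's trees to up-sample them. The `upsample` factor is illustrative only; the paper's actual mixing proportions may differ:

    from nltk import Nonterminal, induce_pcfg

    def train_pcfg_i(author_trees, generic_trees, upsample=2):
        """Induce a smoothed per-author PCFG from pooled productions.

        generic_trees would come from WSJ/Brown sections; repeating the
        author's productions `upsample` times (illustrative value) keeps
        the generic corpus from swamping the author's own statistics.
        """
        productions = []
        for tree in author_trees:
            productions.extend(tree.productions() * upsample)
        for tree in generic_trees:
            productions.extend(tree.productions())
        return induce_pcfg(Nonterminal('S'), productions)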

  16. Results

  17. Performance of Baseline Models [Chart: accuracy in % per data set] • Inconsistent performance for baseline models – the same model does not perform well (or poorly) across all data sets

  18. Performance of PCFG and PCFG-I [Chart: accuracy in % per data set] • PCFG-I performs better than the basic PCFG model on most data sets

  19. PCFG Models vs. Baseline Models [Chart: accuracy in % per data set] • The best PCFG model outperforms the worst baseline on all data sets, but does not outperform the best baseline on all data sets

  20. PCFG-E • PCFG models do not always outperform N-gram models • Lexical features from N-gram models are useful for distinguishing between authors • PCFG-E (Ensemble) combines: • PCFG-I (best PCFG model) • Bigram model (best N-gram model) • MaxEnt bag-of-words (discriminative classifier) (see the sketch below)
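The slide names PCFG-E's components but not the combination rule, so the sketch below just sums per-author log-scores from the three component models (one simple, common choice; the paper's actual rule may differ). Each component is assumed to be wrapped as a score(doc, author) function:

    def pcfg_e_predict(component_scores, authors, doc):
        """Score-level ensemble over PCFG-I, a bigram model, and a
        MaxEnt bag-of-words classifier (combination rule assumed)."""
        totals = {a: 0.0 for a in authors}
        for score in component_scores:  # each: (doc, author) -> log-score
            for a in authors:
                totals[a] += score(doc, a)
        return max(totals, key=totals.get)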

  21. Performance of PCFG-E [Chart: accuracy in % per data set] • PCFG-E outperforms or matches the best baseline on all data sets

  22. Significance of PCFG (PCFG-E – PCFG-I) [Chart: accuracy in % per data set] • Removing PCFG-I from PCFG-E causes a drop in performance on most data sets

  23. Conclusions • PCFGs are useful for capturing an author’s syntactic style • Novel approach for authorship attribution using PCFGs • Both syntactic and lexical information are necessary to capture an author’s writing style

  24. Thank You
