
Authorship Verification as a One-Class Classification Problem



Presentation Transcript


  1. Authorship Verification as a One-Class Classification Problem • Moshe Koppel, Jonathan Schler

  2. Introduction • Goal • Given examples of the writing of a single author, determine whether a given text was written by that author • Authorship attribution • Given examples of the writing of several authors, determine which of them wrote a given anonymous text

  3. Challenge • Negative samples are neither exhaustive nor representative • A single author may consciously vary his or her style from text to text

  4. Authorship Verification • Naïve approach • Given examples of the writing of author A • Concoct a mishmash of works by other authors • Learn a model for A vs. not-A • Alternatively, learn a model for A vs. X (a mystery work) • If it is easy to distinguish between A and X: different author • Otherwise: same author

  5. Authorship Verification • Unmasking: basic idea • A small number of features do most of the work in distinguishing between books • Iteratively remove the most useful features • Gauge the speed with which cross-validation accuracy degrades

  6. Authorship Verification • Figure: unmasking The House of the Seven Gables against Hawthorne (actual author), Melville, and Cooper

  7. Experiment

  8. Experiment • Use a one-class SVM as the baseline • 6 of 20 same-author pairs are correctly classified • 143 of 189 different-author pairs are correctly classified

  9. Experiment • Using the unmasking approach • Choose a feature set of the 250 words with the highest average frequency in Ax and X • Build the degradation curve:
     Use 10-fold cross-validation for A against X
     For each fold, do 10 iterations {
         Build a model for A against X
         Evaluate accuracy results
         Add accuracy number to degradation curve
         Remove the 6 top contributing features from the data
     }
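The degradation-curve loop on slide 9 can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not name the learner used inside the loop, so a simple centroid-difference linear model stands in for it, and "top contributing" is assumed to mean largest absolute weight. The function names (`train`, `predicts_a`, `degradation_curve`) are hypothetical, and each chunk is assumed to be already encoded as a list of word frequencies over the selected feature set.

```python
def centroid(vectors, active):
    """Mean value of each active feature over a list of chunk vectors."""
    return {j: sum(v[j] for v in vectors) / len(vectors) for j in active}


def train(a_train, x_train, active):
    """Stand-in linear model (an assumption, not an SVM): centroid differences."""
    mu_a, mu_x = centroid(a_train, active), centroid(x_train, active)
    weights = {j: mu_a[j] - mu_x[j] for j in active}
    # Place the decision boundary halfway between the two class centroids.
    bias = -sum(weights[j] * (mu_a[j] + mu_x[j]) / 2 for j in active)
    return weights, bias


def predicts_a(model, vector):
    """True if the model assigns the chunk to A rather than X."""
    weights, bias = model
    return sum(w * vector[j] for j, w in weights.items()) + bias > 0


def degradation_curve(a_chunks, x_chunks, folds=10, iterations=10, drop=6):
    """Accuracy per elimination round, averaged over the folds."""
    curves = []
    for fold in range(folds):
        a_train = [v for i, v in enumerate(a_chunks) if i % folds != fold]
        a_test = [v for i, v in enumerate(a_chunks) if i % folds == fold]
        x_train = [v for i, v in enumerate(x_chunks) if i % folds != fold]
        x_test = [v for i, v in enumerate(x_chunks) if i % folds == fold]
        tests = [(v, True) for v in a_test] + [(v, False) for v in x_test]
        if not tests or not a_train or not x_train:
            continue  # this fold has nothing to train on or to test
        active = set(range(len(a_chunks[0])))
        curve = []
        for _ in range(iterations):
            model = train(a_train, x_train, active)          # build a model
            correct = sum(predicts_a(model, v) == label for v, label in tests)
            curve.append(correct / len(tests))               # add accuracy
            weights = model[0]
            top = sorted(active, key=lambda j: abs(weights[j]), reverse=True)
            active -= set(top[:drop])                        # remove features
        curves.append(curve)
    if not curves:
        raise ValueError("not enough chunks for the requested number of folds")
    return [sum(c[i] for c in curves) / len(curves) for i in range(iterations)]
```

With the slide's settings (250 features, 10 iterations, 6 features removed per round) at most 60 features are eliminated, so the active set never runs empty.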

  10. Experiment • Figure: unmasking An Ideal Husband against each of the ten authors

  11. Experiment • Distinguish same-author curves from different-author curves • Represent each degradation curve as a feature vector • Feature vector: a numerical vector capturing the curve's essential features • Accuracy after 6 elimination rounds < 89% • The second-highest accuracy drop over two iterations > 16% • Test the degradation curve
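The two curve features named on slide 11 could be computed along these lines. This sketch rests on two assumptions that the slide does not state outright: that `curve[i]` holds the accuracy after `i` elimination rounds (so `curve[0]` uses all features), and that a curve satisfying both thresholds is taken to indicate the same author, which fits the unmasking idea that same-author accuracy degrades faster. The function names are hypothetical.

```python
def curve_features(curve):
    """Return (accuracy after 6 rounds, second-highest two-iteration drop)."""
    accuracy_after_6 = curve[6]
    # Drop in accuracy across every window of two consecutive iterations.
    two_step_drops = sorted(
        (curve[i] - curve[i + 2] for i in range(len(curve) - 2)),
        reverse=True,
    )
    return accuracy_after_6, two_step_drops[1]


def looks_same_author(curve):
    """Assumed decision rule: both thresholds from the slide must hold."""
    accuracy_after_6, second_drop = curve_features(curve)
    return accuracy_after_6 < 0.89 and second_drop > 0.16
```

A curve needs at least seven points for `curve[6]` to exist, which the 10-iteration setup from slide 9 provides.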

  12. Experiment Results • 19 of 20 same-author pairs are correctly classified • 181 of 189 different-author pairs are correctly classified • Accuracy 95.7%
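As a check, the reported accuracy is consistent with pooling the same-author and different-author pairs:

```latex
\[
\text{accuracy} = \frac{19 + 181}{20 + 189} = \frac{200}{209} \approx 95.7\%
\]
```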

  13. Extension • Use negative examples to eliminate some false positives from the unmasking phase • In our case, using the elimination method improved accuracy • 189 of 189 different-author pairs are correctly classified • Introduced a single new misclassification

  14. Extension • Elimination:
     If alternative authors {A1,…,An} exist then {
         build model M for classifying A vs. all other alternative authors
         test each chunk of X with built model M
         for each alternative author Ai {
             build model Mi for classifying Ai vs. {A and all other alternative authors}
             test each chunk of X with built model Mi
         }
         If, for some Ai, the number of chunks assigned to Ai > the number of chunks assigned to A then
             return different-author
     }
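The elimination step on slide 14 might look like this in Python. Again this is only a sketch under stated assumptions: the one-vs-rest learner is a centroid-difference linear model standing in for whatever classifier the authors used, chunks are assumed to be numeric feature vectors, and the names `one_vs_rest` and `eliminate` are hypothetical.

```python
def centroid(vectors):
    """Per-feature mean of a list of equal-length chunk vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def one_vs_rest(positive, negative):
    """Stand-in linear model (an assumption): returns a chunk -> bool predicate."""
    mu_p, mu_n = centroid(positive), centroid(negative)
    weights = [p - n for p, n in zip(mu_p, mu_n)]
    # Decision boundary halfway between the two centroids.
    bias = -sum(w * (p + n) / 2 for w, p, n in zip(weights, mu_p, mu_n))
    return lambda chunk: sum(w * c for w, c in zip(weights, chunk)) + bias > 0


def eliminate(a_chunks, x_chunks, alternatives):
    """Return True if X should be ruled 'different-author'.

    alternatives maps each alternative author's name to that author's chunks.
    """
    if not alternatives:
        return False  # no negative examples available, nothing to eliminate
    all_alternatives = [c for chunks in alternatives.values() for c in chunks]
    is_a = one_vs_rest(a_chunks, all_alternatives)          # model M
    votes_for_a = sum(is_a(chunk) for chunk in x_chunks)
    for name, chunks in alternatives.items():
        rest = list(a_chunks) + [
            c for other, cs in alternatives.items() if other != name for c in cs
        ]
        is_ai = one_vs_rest(chunks, rest)                   # model Mi
        votes_for_ai = sum(is_ai(chunk) for chunk in x_chunks)
        if votes_for_ai > votes_for_a:
            return True
    return False
```

Because the function only ever returns "different-author" or abstains, it can be applied to pairs that unmasking labelled same-author in order to filter out false positives, as slide 13 describes.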

  15. Actual Literary Mystery • Two 19th-century collections of Hebrew-Aramaic documents • RP includes 509 documents (by Ben Ish Chai) • TL includes 524 documents (which Ben Ish Chai claims to have found in an archive)

  16. Actual Literary Mystery • Figure: unmasking TL against Ben Ish Chai and four impostors

  17. Conclusion • Unmasking: completely ignores negative examples • High accuracy • Unmasking + Elimination (a little negative data) • Even better accuracy • More experiments are needed to confirm that these methods also work for other languages
