This study explores authorship verification as a one-class classification problem: determining whether a piece of text was written by a specific author. It discusses the core challenges, such as the lack of exhaustive negative samples and the stylistic variation of a single author across texts. Using the Unmasking method, with a one-class SVM as a baseline, the experiments achieve high accuracy in separating same-author from different-author pairs, up to 95.7%. The findings suggest the method may also apply to other languages, pending further exploration.
Authorship Verification as a One-Class Classification Problem Moshe Koppel Jonathan Schler
Introduction • Goal • Authorship verification: given examples of the writing of a single author, determine whether a given text was written by this author • Authorship attribution: given examples of the writing of several authors, determine which author wrote a given anonymous text
Challenge • Negative samples are neither exhaustive nor representative • A single author may consciously vary his/her style from text to text
Authorship Verification • Naïve Approach • Given examples of the writing of author A • Concoct a mishmash of works by other authors • Learn a model for A vs. not-A • Then learn a model for A vs. X (a mystery work) • If it is easy to distinguish A from X, conclude different author • Otherwise, conclude same author
Authorship Verification • Unmasking basic idea • A small number of features do most of the work in distinguishing books • Iteratively remove the most useful features • Gauge the speed with which cross-validation accuracy degrades
Authorship Verification Unmasking The House of the Seven Gables against Hawthorne (actual author), Melville and Cooper
Experiment • Use One-class SVM as baseline • 6 of 20 same-author pairs are correctly classified • 143 of 189 different-author pairs are correctly classified
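The one-class SVM baseline above can be sketched roughly as follows, assuming texts are split into chunks and represented by relative frequencies of common words; the function name, chunking, and parameter choices here are illustrative, not the paper's exact setup:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.feature_extraction.text import CountVectorizer

def one_class_baseline(author_chunks, test_chunks, vocab_size=250):
    """Fit a one-class SVM on chunks of a known author's writing,
    then predict whether test chunks look like the same author."""
    vectorizer = CountVectorizer(max_features=vocab_size)
    X_train = vectorizer.fit_transform(author_chunks).toarray().astype(float)
    # Normalize raw counts to per-chunk relative frequencies.
    X_train /= np.maximum(X_train.sum(axis=1, keepdims=True), 1)
    clf = OneClassSVM(kernel="linear", nu=0.1).fit(X_train)

    X_test = vectorizer.transform(test_chunks).toarray().astype(float)
    X_test /= np.maximum(X_test.sum(axis=1, keepdims=True), 1)
    # +1 = inlier (looks like the same author), -1 = outlier.
    return clf.predict(X_test)
```

Because the model sees only positive examples, it must draw a boundary around one author's style with no contrast class, which is consistent with the weak same-author results reported on this slide.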
Experiment • Using the Unmasking approach • Choose a feature set of the 250 words with the highest average frequency in A and X • Build the degradation curve using 10-fold cross-validation for A against X; for each fold, do 10 iterations { build a model for A against X; evaluate its accuracy; add the accuracy to the degradation curve; remove the 6 top-contributing features from the data }
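The degradation-curve loop above can be sketched as follows, assuming the chunks of A and X are already vectorized as rows of relative word frequencies; the function name and the linear-SVM choice are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def build_degradation_curve(X_A, X_X, iterations=10, drop_per_iter=6):
    """X_A, X_X: (chunks x features) arrays for author A and mystery text X.
    Returns the list of cross-validation accuracies as features are removed."""
    X = np.vstack([X_A, X_X])
    y = np.array([0] * len(X_A) + [1] * len(X_X))
    active = np.arange(X.shape[1])  # indices of features still in play
    curve = []
    for _ in range(iterations):
        acc = cross_val_score(LinearSVC(), X[:, active], y, cv=10).mean()
        curve.append(acc)
        # Remove the most useful features: the strongest positive and
        # strongest negative weights of a model fit on all the data.
        w = LinearSVC().fit(X[:, active], y).coef_[0]
        k = drop_per_iter // 2
        drop = np.concatenate([np.argsort(w)[:k], np.argsort(w)[-k:]])
        active = np.delete(active, drop)
    return curve
```

Dropping weights from both ends of the sorted coefficient vector removes features that most strongly indicate either class, which is the sense in which a few features "do most of the work."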
Experiment Unmasking An Ideal Husband against each of the ten authors
Experiment • Distinguish same-author curves from different-author curves • Represent each degradation curve as a feature vector, i.e., a numerical vector of its essential features • Criteria for same author: accuracy after 6 elimination rounds < 89%, and second-highest accuracy drop over two iterations > 16% • Test on the degradation curves
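The two curve features and thresholds on this slide can be sketched as a simple decision rule; the thresholds come from the slide, while the function name and the exact two-iteration window are illustrative assumptions:

```python
def same_author(curve):
    """curve: list of cross-validation accuracies per unmasking iteration.
    Same-author curves drop sharply; different-author curves stay flat."""
    # Feature 1: accuracy remaining after six elimination rounds.
    acc_after_6 = curve[6]
    # Feature 2: second-largest accuracy drop over a two-iteration window.
    drops = sorted(
        (curve[i] - curve[i + 2] for i in range(len(curve) - 2)),
        reverse=True,
    )
    second_drop = drops[1]
    return acc_after_6 < 0.89 and second_drop > 0.16
```

A flat curve near 98% accuracy fails both tests (different author), while a curve that collapses toward chance within a few iterations passes both (same author).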
Experiment Result • 19 of 20 same-author pairs are correctly classified • 181 of 189 different-author pairs are correctly classified • Accuracy 95.7%
Extension • Use negative examples to eliminate some false positives from the unmasking phase • In our case, the elimination method improved accuracy • 189 of 189 different-author pairs are correctly classified • Introduced a single new misclassified same-author pair
Extension • Elimination if alternative authors {A1,…,An} exist then { build model M for classifying A vs. all alternative authors; test each chunk of X with model M; for each alternative author Ai { build model Mi for classifying Ai vs. {A and all other alternative authors}; test each chunk of X with model Mi }; if the number of chunks assigned to some Ai > the number of chunks assigned to A then return different-author }
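The elimination step above can be sketched as a chunk-voting procedure, assuming each candidate author's chunks are available as a feature matrix; the function name, the linear SVM, and the "undecided" fallback are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def eliminate(X_A, alt_authors, X_mystery):
    """X_A: chunk matrix for author A; alt_authors: list of chunk matrices,
    one per alternative author; X_mystery: chunks of the mystery text X.
    Returns 'different-author' if some alternative author claims more
    chunks of X than A does, else 'undecided'."""
    matrices = [X_A] + list(alt_authors)
    votes = np.zeros(len(matrices), dtype=int)
    for i, Xi in enumerate(matrices):
        # Model Mi: candidate author i vs. all other candidate authors.
        rest = np.vstack([m for j, m in enumerate(matrices) if j != i])
        X = np.vstack([Xi, rest])
        y = np.array([1] * len(Xi) + [0] * len(rest))
        clf = LinearSVC().fit(X, y)
        # Count how many mystery chunks this author's model claims.
        votes[i] = int(clf.predict(X_mystery).sum())
    if votes[1:].max() > votes[0]:
        return "different-author"
    return "undecided"
```

Any mystery text whose chunks are claimed mostly by an alternative author is ruled out, which is how the step removes unmasking false positives without deciding the same-author case on its own.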
Actual Literary Mystery • Two 19th-century collections of Hebrew-Aramaic documents • RP includes 509 documents (by Ben Ish Chai) • TL includes 524 documents (which Ben Ish Chai claims to have found in an archive)
Actual Literary Mystery Unmasking TL against Ben Ish Chai and four impostors
Conclusion • Unmasking – completely ignores negative examples • High accuracy • Unmasking + Elimination (using a little negative data) • Even better accuracy • More experiments are needed to confirm that these methods also work for other languages