This study explores authorship verification as a one-class classification problem: determining whether a piece of text was written by a specific author. It discusses the core challenges, such as the lack of exhaustive negative samples and the stylistic variation of a single author across texts. Using the Unmasking method, with a one-class SVM as a baseline, the experiments achieve high accuracy in separating same-author from different-author pairs, up to 95.7%. The findings suggest the method may also apply to other languages, pending further exploration.
Authorship Verification as a One-Class Classification Problem Moshe Koppel Jonathan Schler
Introduction • Goal • Authorship verification: given examples of the writing of a single author, determine whether a given text was written by this author • Authorship attribution: given examples of the writing of several authors, determine which author wrote a given anonymous text
Challenge • Negative samples are neither exhaustive nor representative • A single author may consciously vary his/her style from text to text
Authorship Verification • Naïve Approach • Given examples of the writing of author A • Concoct a mishmash of works by other authors • Learn a model for A vs. not-A • Then learn a model for A vs. X (a mystery work) • If it is easy to distinguish A from X, conclude different author • Otherwise, conclude same author
Authorship Verification • Unmasking basic idea • A small number of features do most of the work in distinguishing books • Iteratively remove the most useful features • Gauge the speed with which cross-validation accuracy degrades
Authorship Verification Unmasking The House of the Seven Gables against Hawthorne (actual author), Melville and Cooper
Experiment • Use One-class SVM as baseline • 6 of 20 same-author pairs are correctly classified • 143 of 189 different-author pairs are correctly classified
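The one-class SVM baseline above can be sketched roughly as follows, assuming texts are split into chunks and represented by relative frequencies of common words; the function name, chunking, and parameter choices here are illustrative, not the paper's exact setup:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.feature_extraction.text import CountVectorizer

def one_class_baseline(author_chunks, test_chunks, vocab_size=250):
    """Fit a one-class SVM on chunks of a known author's writing,
    then predict whether test chunks look like the same author."""
    vectorizer = CountVectorizer(max_features=vocab_size)
    X_train = vectorizer.fit_transform(author_chunks).toarray().astype(float)
    # Normalize raw counts to per-chunk relative frequencies.
    X_train /= np.maximum(X_train.sum(axis=1, keepdims=True), 1)
    clf = OneClassSVM(kernel="linear", nu=0.1).fit(X_train)

    X_test = vectorizer.transform(test_chunks).toarray().astype(float)
    X_test /= np.maximum(X_test.sum(axis=1, keepdims=True), 1)
    # +1 = inlier (looks like the same author), -1 = outlier.
    return clf.predict(X_test)
```

Because the model sees only positive examples, it must draw a boundary around one author's style with no contrast class, which is consistent with the weak same-author results reported on this slide.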
Experiment • Using the Unmasking approach • Choose a feature set of the 250 words with the highest average frequency in A and X • Build the degradation curve using 10-fold cross-validation for A against X; for each fold, do 10 iterations { build a model for A against X; evaluate its accuracy; add the accuracy to the degradation curve; remove the 6 top-contributing features from the data }
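The degradation-curve loop above can be sketched as follows, assuming the chunks of A and X are already vectorized as rows of relative word frequencies; the function name and the linear-SVM choice are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def build_degradation_curve(X_A, X_X, iterations=10, drop_per_iter=6):
    """X_A, X_X: (chunks x features) arrays for author A and mystery text X.
    Returns the list of cross-validation accuracies as features are removed."""
    X = np.vstack([X_A, X_X])
    y = np.array([0] * len(X_A) + [1] * len(X_X))
    active = np.arange(X.shape[1])  # indices of features still in play
    curve = []
    for _ in range(iterations):
        acc = cross_val_score(LinearSVC(), X[:, active], y, cv=10).mean()
        curve.append(acc)
        # Remove the most useful features: the strongest positive and
        # strongest negative weights of a model fit on all the data.
        w = LinearSVC().fit(X[:, active], y).coef_[0]
        k = drop_per_iter // 2
        drop = np.concatenate([np.argsort(w)[:k], np.argsort(w)[-k:]])
        active = np.delete(active, drop)
    return curve
```

Dropping weights from both ends of the sorted coefficient vector removes features that most strongly indicate either class, which is the sense in which a few features "do most of the work."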
Experiment Unmasking An Ideal Husband against each of the ten authors
Experiment • Distinguish same-author curves from different-author curves • Represent each degradation curve as a feature vector, i.e., a numerical vector of its essential features • Criteria for same author: accuracy after 6 elimination rounds < 89%, and second-highest accuracy drop over two iterations > 16% • Test on the degradation curves
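The two curve features and thresholds on this slide can be sketched as a simple decision rule; the thresholds come from the slide, while the function name and the exact two-iteration window are illustrative assumptions:

```python
def same_author(curve):
    """curve: list of cross-validation accuracies per unmasking iteration.
    Same-author curves drop sharply; different-author curves stay flat."""
    # Feature 1: accuracy remaining after six elimination rounds.
    acc_after_6 = curve[6]
    # Feature 2: second-largest accuracy drop over a two-iteration window.
    drops = sorted(
        (curve[i] - curve[i + 2] for i in range(len(curve) - 2)),
        reverse=True,
    )
    second_drop = drops[1]
    return acc_after_6 < 0.89 and second_drop > 0.16
```

A flat curve near 98% accuracy fails both tests (different author), while a curve that collapses toward chance within a few iterations passes both (same author).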
Experiment Result • 19 of 20 same-author pairs are correctly classified • 181 of 189 different-author pairs are correctly classified • Accuracy 95.7%
Extension • Use negative examples to eliminate some false positives from the unmasking phase • In our case, the elimination method improved accuracy • 189 of 189 different-author pairs are correctly classified • Introduced a single new misclassified same-author pair
Extension • Elimination if alternative authors {A1,…,An} exist then { build model M for classifying A vs. all alternative authors; test each chunk of X with model M; for each alternative author Ai { build model Mi for classifying Ai vs. {A and all other alternative authors}; test each chunk of X with model Mi }; if the number of chunks assigned to some Ai > the number of chunks assigned to A then return different-author }
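The elimination step above can be sketched as a chunk-voting procedure, assuming each candidate author's chunks are available as a feature matrix; the function name, the linear SVM, and the "undecided" fallback are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def eliminate(X_A, alt_authors, X_mystery):
    """X_A: chunk matrix for author A; alt_authors: list of chunk matrices,
    one per alternative author; X_mystery: chunks of the mystery text X.
    Returns 'different-author' if some alternative author claims more
    chunks of X than A does, else 'undecided'."""
    matrices = [X_A] + list(alt_authors)
    votes = np.zeros(len(matrices), dtype=int)
    for i, Xi in enumerate(matrices):
        # Model Mi: candidate author i vs. all other candidate authors.
        rest = np.vstack([m for j, m in enumerate(matrices) if j != i])
        X = np.vstack([Xi, rest])
        y = np.array([1] * len(Xi) + [0] * len(rest))
        clf = LinearSVC().fit(X, y)
        # Count how many mystery chunks this author's model claims.
        votes[i] = int(clf.predict(X_mystery).sum())
    if votes[1:].max() > votes[0]:
        return "different-author"
    return "undecided"
```

Any mystery text whose chunks are claimed mostly by an alternative author is ruled out, which is how the step removes unmasking false positives without deciding the same-author case on its own.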
Actual Literary Mystery • Two 19th-century collections of Hebrew-Aramaic documents • RP includes 509 documents (by Ben Ish Chai) • TL includes 524 documents (which Ben Ish Chai claims to have found in an archive)
Actual Literary Mystery Unmasking TL against Ben Ish Chai and four impostors
Conclusion • Unmasking – completely ignores negative examples • High accuracy • Unmasking + Elimination (using a little negative data) • Even better accuracy • More experiments are needed to confirm that these methods also work for other languages