
Data Mining Final Project

Nick Foti

Eric Kee


Topic: Author Identification

  • Author Identification

    • Given writing samples, can we determine who wrote them?

    • This is a well-studied field

      • See also: “stylometry”

    • This has been applied to works such as

      • The Bible

      • Shakespeare

      • Modern texts as well


Corpus Design

  • A corpus is:

    • A body of text used for linguistic analysis

  • Used Project Gutenberg to create corpus

  • The corpus was designed as follows

    • Four authors of varying similarity

      • Anne Brontë

      • Charlotte Brontë

      • Charles Dickens

      • Upton Sinclair

    • Multiple books per author

  • Corpus size: 90,000 lines of text


Dataset Design

  • Extracted features common in literature

    • Word Length

    • Frequency of “glue” words

      • See Appendix A and [1,2] for list of glue words

  • Note: corpus was processed using

    • C#, Matlab, Python

  • Data set parameters are

    • Number of dimensions: 309

      • Word length and 308 glue words

    • Number of observations: ≈ 3,000

      • Each observation ≈ 30 lines of text from a book
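
The feature extraction described above could be sketched in Python roughly as follows. This is a minimal illustration, not the project's code: the function names, the shortened glue-word list, the tokenizing regular expression, and the choice of relative frequencies (rather than raw counts) are all assumptions.

```python
import re

# Hypothetical, shortened glue-word list; the project used 308 words (Appendix A).
GLUE_WORDS = ["the", "and", "of", "but", "which"]


def chunk_lines(lines, size=30):
    """Group consecutive lines of a book into observations of `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def extract_features(observation_lines, glue_words=GLUE_WORDS):
    """Return [average word length, relative frequency of each glue word]."""
    text = " ".join(observation_lines).lower()
    # Assumed tokenizer: runs of letters and apostrophes count as words.
    words = re.findall(r"[a-z']+", text)
    if not words:
        return [0.0] * (1 + len(glue_words))
    average_length = sum(len(w) for w in words) / len(words)
    frequencies = [words.count(g) / len(words) for g in glue_words]
    return [average_length] + frequencies
```

With the full glue-word list, each observation would become a vector of 309 values: one word length plus 308 frequencies.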


Classifier Testing and Analysis

  • Tested classifier with test data

    • Used testing and training data sets

      • 70% for training, 30% for testing

    • Used cross-validation

  • Analyzed Classifier Performance

    • Used ROC plots

    • Used confusion matrices

  • Used common plotting scheme (right)
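
A minimal Python sketch of a 70/30 split and of the per-class rates that a confusion matrix summarizes. The function names, the fixed random seed, and the row-normalized definition of the rates are assumptions made for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of `rows` and split it into (training, testing) parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def confusion_rates(true_labels, predicted_labels, classes):
    """rates[t][p] = fraction of observations of class t predicted as class p."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    rates = {}
    for t in classes:
        total = sum(counts[t].values())
        rates[t] = {p: (counts[t][p] / total if total else 0.0) for p in classes}
    return rates
```

Under this definition, `rates[t][t]` is the true-positive rate for class `t`, and the off-diagonal entries in a row show where that class's observations were misassigned.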

[Figure: example of the plotting scheme for Anne B. vs. Charlotte B. Red dots indicate true-positive cases. Example values: 78%, 55%, 45%, 22%, spread over the TP and FP cells of the two authors.]


Binary Classification


Word Length Classification

  • Calculated average word length for each observation

  • Computed a Gaussian kernel density estimate from the word length samples

  • Used ROC curve to calculate cutoff

    • Optimized sensitivity and specificity with equal importance
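
The two steps above might look like this in Python. This is a sketch under assumptions: the bandwidth is passed in by hand, and "equal importance" is read as maximizing sensitivity + specificity (Youden's J statistic). The function names are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` with Gaussian kernels."""
    scale = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / scale

    return density


def best_cutoff(scores_pos, scores_neg):
    """Pick the cutoff that maximizes sensitivity + specificity.

    An observation is called positive when its score is >= the cutoff.
    Returns (cutoff, sensitivity, specificity).
    """
    best = None
    for c in sorted(set(scores_pos) | set(scores_neg)):
        sensitivity = sum(s >= c for s in scores_pos) / len(scores_pos)
        specificity = sum(s < c for s in scores_neg) / len(scores_neg)
        total = sensitivity + specificity
        if best is None or total > best[0]:
            best = (total, c, sensitivity, specificity)
    return best[1], best[2], best[3]
```

Here `scores_pos` and `scores_neg` would be the average word lengths of the observations from each of the two authors being compared.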


Word Length: Anne B. vs Upton S.

[Figure: word length densities and TP/FP plot. Anne B.: TP 100%, FP 0%. Upton Sinclair: TP 100%, FP 0%.]


Word Length: Brontë vs. Brontë

[Figure: word length densities for Anne Brontë and Charlotte Brontë, with TP/FP plot. Plotted values: 100%, 78.1%, 21.9%, 0%.]


Principal Component Analysis

  • Used PCA to find a better axis

  • Notice: distribution similar to word length distribution

  • Is word length the only useful dimension?
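
For reference, the first principal axis can be found by power iteration on the sample covariance matrix. The pure-Python sketch below is only practical for small inputs, and the function name and starting vector are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction of largest variance, projections of the centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Arbitrary non-uniform start; power iteration fails only if the start
    # is exactly orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

The returned `scores` are the one-dimensional projections whose density can be compared with the word length density.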

[Figure: Anne Brontë vs. Upton Sinclair. Panels: word length density, PCA density.]


Principal Component Analysis

Without word length

  • It appears that word length is the most useful axis

  • We’ll come back to this…

[Figure: Anne Brontë vs. Upton Sinclair, PCA density.]


K-Means

  • Used K-means to find dominant patterns

    • Unnormalized

    • Normalized

  • Trained K-means on training set

  • To classify observations in test set

    • Calculate distance of observation to each class mean

    • Assign observation to the closest class

  • Performed cross-validation to estimate performance
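
A small Python sketch of the procedure above: Lloyd's K-means on the training set, then assignment of each test observation to the closest mean. Two points are assumptions: "normalized" is taken to mean z-scoring each dimension, and the initial centroids are supplied by the caller. All function names are hypothetical.

```python
import math


def zscore_columns(rows):
    """Scale each column to zero mean and unit variance; constant columns are left at 0."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    scaled = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    return scaled, means, stds


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda k: squared_distance(centroids[k], point))


def kmeans(rows, initial_centroids, iterations=100):
    """Lloyd's algorithm: alternate assignment and mean update until stable."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for r in rows:
            groups[nearest(centroids, r)].append(r)
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[k]
            for k, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

For the normalized variant, test observations would need to be scaled with the `means` and `stds` computed on the training set before calling `nearest`.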


Unnormalized K-means

Anne Brontë vs. Upton Sinclair

[Figure: TP/FP plot for Anne B. and Upton Sinclair. Plotted values: 98.1%, 92.1%, 7.9%, 1.9%.]


Unnormalized K-means

Anne Brontë vs. Charlotte Brontë

[Figure: TP/FP plot for Anne B. and Charlotte B. Plotted values: 95.7%, 74.7%, 25.3%, 4.3%.]


Normalized K-means

Anne Brontë vs. Upton Sinclair

[Figure: TP/FP plot for Anne B. and Upton Sinclair. Plotted values: 53.3%, 50.6%, 49.4%, 46.7%.]


Normalized K-means

Anne Brontë vs. Charlotte Brontë

[Figure: TP/FP plot for Anne B. and Charlotte B. Plotted values: 86.7%, 84.2%, 15.8%, 13.3%.]


Discriminant Analysis

  • Performed discriminant analysis

    • Computed with equal covariance matrices

      • Used average Omega of class pairs

    • Computed with unequal covariance matrices

      • Quadratic discrimination fails because covariance matrices have 0 determinant (see equation below)

    • Computed theoretical misclassification probability

  • To perform quadratic discriminant analysis

    • Compute Equation 1 for each class

    • Choose class with minimum value

(1)
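
A commonly used form of the quadratic discriminant score, assuming equal class priors and writing Ω_k for the covariance of class k, is (x − μ_k)ᵀ Ω_k⁻¹ (x − μ_k) + ln|Ω_k|. Whether Equation 1 on the slide also carried a prior term is not known from this transcript. The Python sketch below evaluates that score for two-dimensional data only (so the inverse has a closed form) and shows why a zero determinant makes the score undefined. All names are hypothetical.

```python
import math


def qda_score_2d(x, mean, cov):
    """(x - mu)^T Omega^{-1} (x - mu) + ln|Omega| for a 2x2 covariance Omega."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        # With a zero determinant, Omega has no inverse and ln|Omega| is undefined.
        raise ValueError("covariance is singular or not positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    mahalanobis = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return mahalanobis + math.log(det)


def classify(x, class_params):
    """Choose the class with the minimum score; class_params maps name -> (mean, cov)."""
    return min(class_params, key=lambda name: qda_score_2d(x, *class_params[name]))
```

Replacing each class covariance with a shared, averaged one gives the equal-covariance (linear) case, in which the ln|Ω| term is the same for every class and drops out of the comparison.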


Discriminant Analysis

Anne Brontë vs. Upton Sinclair

Empirical P(err) = 0.116

Theoretical P(err) = 0.149

[Figure: TP/FP plot for Anne B. and Upton Sinclair. Plotted values: 96.2%, 92.2%, 3.8%, 7.8%.]


Discriminant Analysis

Anne Brontë vs. Charlotte Brontë

Empirical P(err) = 0.152

Theoretical P(err) = 0.181

[Figure: TP/FP plot for Anne B. and Charlotte B. Plotted values: 92.7%, 89.2%, 10.8%, 7.3%.]


Logistic Regression

  • Fit linear model to training data on all dimensions

  • Threw out singular dimensions

    • Left with ≈ 298 coefficients + intercept

  • Projected training data onto synthetic variable

    • Found threshold by minimizing error of misclassification

  • Projected testing data onto synthetic variable

    • Used threshold to classify points
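
The fit, projection, and threshold steps can be sketched in plain Python. Here the "synthetic variable" is taken to be the fitted linear predictor, which is an assumption. Plain gradient descent stands in for whatever fitting routine the project used, the removal of singular dimensions is not shown, and the names are hypothetical.

```python
import math


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the logistic loss."""
    n, d = len(rows), len(rows[0])
    weights, intercept = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = intercept + sum(w * v for w, v in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        intercept -= learning_rate * grad_b / n
    return weights, intercept


def project(rows, weights, intercept):
    """Map each observation to the one-dimensional linear predictor."""
    return [intercept + sum(w * v for w, v in zip(weights, x)) for x in rows]


def best_threshold(scores, labels):
    """Threshold on the projected training data with the fewest misclassifications."""
    ordered = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])] or ordered

    def errors(t):
        return sum((s > t) != bool(y) for s, y in zip(scores, labels))

    return min(candidates, key=errors)
```

Test observations are then classified by projecting them with the same weights and comparing each score against the threshold found on the training data.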


Logistic Regression

Anne Brontë vs Charlotte Brontë

[Figure: densities for Anne Brontë and Charlotte Brontë, with TP/FP plot for Anne B. and Charlotte B. Plotted values: 92%, 89.5%, 8%, 10.5%.]


Logistic Regression

Anne Brontë vs Upton Sinclair

[Figure: densities for Anne Brontë and Upton Sinclair, with TP/FP plot for Anne B. and Upton S. Plotted values: 98%, 99%, 2%, 2%.]


4-Class Classification


4-Class K-means

  • Used K-means to find patterns among all classes

    • Unnormalized

    • Normalized

  • Trained using a training set

  • Tested performance as in 2-class K-means

  • Performed cross-validation to estimate performance


Unnormalized K-Means

4-Class Confusion Matrix

[Figure: confusion matrix over Anne Brontë (AB), Charles Dickens (CD), Charlotte Brontë (CB), and Upton Sinclair (US), with TP and FP cells. Plotted values: 88%, 87%, 59%, 54%, 34%, 22%.]


Normalized K-Means

4-Class Confusion Matrix

[Figure: confusion matrix over Anne Brontë (AB), Charles Dickens (CD), Charlotte Brontë (CB), and Upton Sinclair (US), with TP and FP cells. Plotted values: 70%, 67%, 67%, 67%, 27%, 26%, 20%.]


Additional K-means testing

  • Also tested K-means without word length

    • Recall that we had perfect classification with 1D word length (see plot below)

    • Is K-means using only 1 dimension to classify?

      Note: perfect classification only occurs between Anne B. and Sinclair

Anne Brontë vs. Upton Sinclair


Unnormalized K-Means (No Word Length)

  • K-means can classify without word length

4-Class Confusion Matrix 

[Figure: confusion matrix over Anne Brontë (AB), Charles Dickens (CD), Charlotte Brontë (CB), and Upton Sinclair (US), with TP and FP cells. Plotted values: 72%, 44%, 43%, 35%, 35%, 33%, 29%.]


Multinomial Regression

  • Multinomial distribution

    • Extension of binomial distribution

      • Random variable is allowed to take on n values

  • Used multinom(…) to fit log-linear model for training

    • Used 248 dimensions (max limit on computer)

    • Returns 3 coefficients per dimension and 3 intercepts

  • Found the probability that each observation belongs to each class


Multinomial Regression

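  • Multinomial Logit Function is

    $$\Pr(y_i = j) = \frac{\exp(c_j + \beta_j^{\top} x_i)}{1 + \sum_{k} \exp(c_k + \beta_k^{\top} x_i)}$$

    where $\beta_j$ are the coefficients and $c_j$ are the intercepts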

  • To classify

    • Compute probabilities

      • Pr(yi = Dickens), Pr(yi = Anne B.), …

    • Choose class with maximum probability
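
The classification rule above can be sketched in Python. It assumes a reference-class parameterization, consistent with 3 sets of coefficients and 3 intercepts for 4 authors: the first class gets a linear score of 0 and every other class has its own intercept and coefficient vector. Which author served as the reference in the fitted model is not stated, so the ordering below is hypothetical, as are the function names.

```python
import math


def class_probabilities(x, coefficients, intercepts, classes):
    """Multinomial logit probabilities; classes[0] is the reference class."""
    scores = [0.0] + [
        c + sum(b * v for b, v in zip(beta, x))
        for beta, c in zip(coefficients, intercepts)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {name: e / total for name, e in zip(classes, exps)}


def classify(x, coefficients, intercepts, classes):
    """Choose the class with the maximum probability."""
    probabilities = class_probabilities(x, coefficients, intercepts, classes)
    return max(probabilities, key=probabilities.get)
```

For example, with one feature and coefficient vectors `[[1.0], [0.0], [-1.0]]`, an observation `[2.0]` scores highest for the second class in `classes`.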


Multinomial Regression

4-Class Confusion Matrix

[Figure: confusion matrix over Anne Brontë (AB), Charles Dickens (CD), Charlotte Brontë (CB), and Upton Sinclair (US), with TP and FP cells. Plotted values: 86%, 83%, 93%, 78%.]


Multinomial Regression

(Without Word Length)

  • Multinomial regression does not require word length

4-Class Confusion Matrix

[Figure: confusion matrix over Anne Brontë (AB), Charles Dickens (CD), Charlotte Brontë (CB), and Upton Sinclair (US), with TP and FP cells. Plotted values: 79%, 79%, 91%, 76%.]


Appendix A: Glue Words

I a aboard about above across after again against ago ahead all almost along alongside already also although always am amid amidst among amongst an and another any anybody anyone anything anywhere apart are aren't around as aside at away back backward backwards be because been before beforehand behind being below between beyond both but by can can't cannot could couldn't dare daren't despite did didn't directly do does doesn't doing don't done down during each either else elsewhere enough even ever evermore every everybody everyone everything everywhere except fairly farther few fewer for forever forward from further furthermore had hadn't half hardly has hasn't have haven't having he hence her here hers herself him himself his how however if in indeed inner inside instead into is isn't it its itself just keep kept later least less lest like likewise little low lower many may mayn't me might mightn't mine minus more moreover most much must mustn't my myself near need needn't neither never nevertheless next no no-one nobody none nor not nothing notwithstanding now nowhere of off often on once one ones only onto opposite or other others otherwise ought oughtn't our ours ourselves out outside over own past per perhaps please plus provided quite rather really round same self selves several shall shan't she should shouldn't since so some somebody someday someone something sometimes somewhat still such than that the their theirs them themselves then there therefore these they thing things this those though through throughout thus till to together too towards under underneath undoing unless unlike until up upon upwards us versus very via was wasn't way we well were weren't what whatever when whence whenever where whereas whereby wherein wherever whether which whichever while whilst whither who whoever whom with whose within why without will won't would wouldn't yet you your yours yourself yourselves


Conclusions

  • Authors can be identified by their word usage frequencies

  • Word length may be used to distinguish between the Brontë sisters

    • Word length does not, however, extend to all authors (See Appendix C)

  • The glue words describe genuine differences between all four authors

    • K-means distinguishes the same patterns that multinomial regression classifies

      • This indicates that supervised training finds legitimate patterns, rather than artifacts

  • The Brontë sisters are the most similar authors

  • Upton Sinclair is the most different author


Appendix B: Code

  • See Attached .R files


Appendix C: Single Dimension 4-Author Classification

Classification using Multinomial Regression

4-Class Confusion Matrix 

[Figure: confusion matrix over Anne Brontë (AB), Charles Dickens (CD), Charlotte Brontë (CB), and Upton Sinclair (US), with TP and FP cells. Plotted values: 96%, 94%, 54%, 46%, 22%, 11%, 6%, 3%.]


References

[1] Argamon, Saric, Stein, “Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results,” SIGKDD 2003.

[2] Mitton, “Spelling checkers, spelling correctors and the misspellings of poor spellers,” Information Processing and Management, 1987.

