
Boosting Textual Source Attribution

Foaad Khosmood, Department of Computer Science, University of California, Santa Cruz. Winter 2006.


Presentation Transcript


  1. Boosting Textual Source Attribution. Foaad Khosmood, Department of Computer Science, University of California, Santa Cruz. Winter 2006

  2. HEY??? What’s so funny? • What makes something funny? • Can we tell by just reading? Can a computer? • Shakespeare’s Comedies and Tragedies. • Actually, Comedies, Tragedies, Historical Plays and Sonnets.

  3. High Level Source Attribution Process

  4. Experimenting with Boosting • Most work done on binary classification. • Needs lots of “weak” learners. • Some variants work well with limited Data Set. • Will provide knowledge about importance of features.

  5. Data Set (Training) • Tragedies • Antony and Cleopatra • Titus Andronicus • Hamlet • Julius Caesar • Romeo and Juliet • Comedies • Measure for Measure • Much Ado about Nothing • Merchant of Venice • Midsummer Night’s Dream • Taming of the Shrew • Twelfth Night

  6. Data Set (Test) • All’s Well That Ends Well [c] • Comedy of Errors [c] • As You Like It [c] • The Tempest [c] • Merry Wives of Windsor [c] • King Lear [t] • Macbeth [t] • Coriolanus [t] • Othello [t]

  7. Feature Selection • Features: words • Selection method: picked the 2500 most common words in the training set • Preprocessing: 300 common English words and grammar operators removed • HTML and stage directions removed • 429 of the 2500 words were not common to all plays; these 429 were chosen for the weak learner functions (in this particular run)
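The selection pipeline on this slide can be sketched as follows. The mini-corpus, the stopword set, and all word lists here are invented stand-ins, not the actual play texts or the 300-word common list used in the project:

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the play texts (assumption:
# the real pipeline reads the plays with HTML and stage directions stripped).
plays = {
    "comedy_a": "love jest marry love fool jest",
    "comedy_b": "love marry wit fool wit",
    "tragedy_a": "blood death love sword blood",
}
stopwords = {"the", "a", "and"}  # stand-in for the ~300-word common list

# Count word frequencies over the whole training set, minus stopwords.
counts = Counter(w for text in plays.values()
                 for w in text.split() if w not in stopwords)
vocab = [w for w, _ in counts.most_common(2500)]

# Keep only words NOT present in every play -- the analogue of the
# 429 discriminative words chosen for the weak learners.
per_play = [set(t.split()) for t in plays.values()]
discriminative = [w for w in vocab if not all(w in s for s in per_play)]
print(discriminative)
```

Here "love" is dropped because it appears in every play, mirroring how the 429 words were selected from the 2500.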

  8. Comedy Words / Tragedy Words • 429 words: 225 comedy, 204 tragedy • Data: vector of 2500 word counts, X = [X1, X2, …, X2500] • Weak learners F1(X)…F429(X), each returning 1 for a positive hit
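A minimal sketch of such weak learners, assuming "returning 1 for a positive hit" means a presence indicator on the word's count in the vector X (the vocabulary and learner words below are invented, and the vector is 4-dimensional rather than 2500):

```python
# Each weak learner is a word-presence indicator over the count vector X.
vocab = ["love", "jest", "blood", "sword"]   # stand-in vocabulary
learner_words = ["jest", "blood"]            # stand-in for the 429 words

def make_learner(word):
    idx = vocab.index(word)
    # Return 1 if the word occurs at all in the text, else 0.
    return lambda X: 1 if X[idx] > 0 else 0

learners = [make_learner(w) for w in learner_words]

X = [3, 1, 0, 0]              # counts for a hypothetical comedy excerpt
hits = [f(X) for f in learners]
print(hits)                   # → [1, 0]: "jest" present, "blood" absent
```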

  9. Boosting • A mix of LPBoost and TotalBoost • No termination (finite set of weak learners) • No gamma (edge) function; used eta (error) instead • No zero-sum constraint on normalization of the weight updates
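The slides do not give the exact LPBoost/TotalBoost hybrid, so as a hedged illustration, here is one round of a generic error-driven boosting update in the familiar AdaBoost form: the weighted error eta sets the learner's weight and drives the example re-weighting. This is not the author's precise rule, only the standard shape such an update takes:

```python
import math

def boosting_round(weights, predictions, labels):
    # Weighted error (eta) of this weak learner; predictions/labels are ±1.
    eta = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    eta = max(min(eta, 1 - 1e-9), 1e-9)       # clamp away from 0 and 1
    alpha = 0.5 * math.log((1 - eta) / eta)   # learner weight from eta
    # Re-weight examples: up-weight mistakes, down-weight correct ones.
    new = [w * math.exp(-alpha * (1 if p == y else -1))
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(new)                              # plain normalization,
    return alpha, [w / z for w in new]        # no zero-sum constraint

w0 = [0.25] * 4
alpha, w1 = boosting_round(w0, [1, 1, -1, -1], [1, 1, 1, -1])
```

With one mistake out of four equally weighted examples, eta = 0.25 and the misclassified example ends up carrying half the total weight after normalization.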

  10. Classification • Used the accumulated weights at the very end • Every presence in the test corpus adds (1*W) to totalW, with some W’s negative • At the end, it was a simple matter of observing whether the result was positive or negative, and by how much
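The sign-of-accumulated-weights decision can be sketched like this, assuming comedy-word learners carry positive weights and tragedy-word learners negative ones (the words and weight values below are invented for illustration):

```python
# Hypothetical learned weights W: positive for comedy words,
# negative for tragedy words.
learner_weights = {"jest": 0.8, "wit": 0.5, "blood": -0.9, "sword": -0.4}

def classify(text):
    words = set(text.split())
    # Each learner that fires contributes 1*W to the running total.
    total = sum(w for word, w in learner_weights.items() if word in words)
    # The sign decides the class; the magnitude says by how much.
    return ("comedy" if total > 0 else "tragedy"), total

label, score = classify("a jest of wit and sword")
print(label, score)
```

Here two positive hits outweigh one negative hit, so the excerpt is called a comedy.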

  11.–13. [Figure slides; images/equations not preserved in the transcript.]

  14. Program Output • [root@localhost output]# ./classify.sh • 00_allswell.html-ratio.txt: 14.6807 • 01_comedyErrors.html-ratio.txt: 13.2634 • 02_measure.html-ratio.txt: 34.2748 • 03_muchAdo.html-ratio.txt: -6.43018 • 04_asyoulikeit.html-ratio.txt: 18.8413 • 05_cleopatra.html-ratio.txt: 14.1148 • 06_lear.html-ratio.txt: 32.2858 • 07_macbeth.html-ratio.txt: -21.095 • 08_coriolanus.html-ratio.txt: 43.5599 • 09_titus.html-ratio.txt: -3.31167 • 10_cleopatraFull.html-ratio.txt: -300.179 • 11_learFull.html-ratio.txt: 356.504 • 13_tempestFull.html-ratio.txt: 454.171 • 14_marryWivesFull.html-ratio.txt: 147.738 • 15_measure2.html-ratio.txt: 39.0357 • 16_measureFull.html-ratio.txt: 112.527 • 17_muchAdoFull.html-ratio.txt: 256.078 • 18_veronaFull.html-ratio.txt: -222.444 • 19_othelloFull.html-ratio.txt: -433.769 • 20_titusFull.html-ratio.txt: -564.977

  15. Results • All’s Well That Ends Well [c][1] • Comedy of Errors [c][1] • As You Like It [c][1] • The Tempest [c][1] • Merry Wives of Windsor [c][1] • King Lear [t][0] • Macbeth [t][1] • Coriolanus [t][0] • Othello [t][1] • 2/9 mistakes, 7/9 correct (≈78%); also 66% and 69% on other runs • Previous run with a neural net (different setup): 5/13, 61% - with no proportionals!

  16. Challenges • Natural language has a lot of nuances that can make a difference (preprocessing methods, “common word” sets, adaptations) • Boosting has great potential in this area • Words provide an easy method for generating (many) weak learners
