
Boosting Textual Source Attribution

Foaad Khosmood, Department of Computer Science, University of California, Santa Cruz. Winter 2006.


Presentation Transcript


  1. Boosting Textual Source Attribution. Foaad Khosmood, Department of Computer Science, University of California, Santa Cruz. Winter 2006

  2. HEY??? What’s so funny? • What makes something funny? • Can we tell by just reading? Can a computer? • Shakespeare’s Comedies and Tragedies. • Actually, Comedies, Tragedies, Historical Plays and Sonnets.

  3. High Level Source Attribution Process

  4. Experimenting with Boosting • Most work done on binary classification. • Needs lots of “weak” learners. • Some variants work well with limited Data Set. • Will provide knowledge about importance of features.

  5. Data Set (Training) • Tragedies • Antony and Cleopatra • Titus Andronicus • Hamlet • Julius Caesar • Romeo and Juliet • Comedies • Measure for Measure • Much Ado about Nothing • Merchant of Venice • Midsummer Night’s Dream • Taming of the Shrew • Twelfth Night

  6. Data Set (Test) • All’s Well That Ends Well [c] • Comedy of Errors [c] • As You Like It [c] • The Tempest [c] • Merry Wives of Windsor [c] • King Lear [t] • Macbeth [t] • Coriolanus [t] • Othello [t]

  7. Feature Selection • Features: words • Selection method: picked the 2500 most common words in the training set • Preprocessing: 300 common English words and grammar operators removed • HTML and stage directions removed • 429 of the 2500 words were not common to all plays; these 429 were chosen for the weak learner functions (in this particular run)
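The selection pipeline on this slide can be sketched as follows. The mini-corpus, the stopword set, and all word lists here are invented stand-ins, not the actual play texts or the 300-word common list used in the project:

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the play texts (assumption:
# the real pipeline reads the plays with HTML and stage directions stripped).
plays = {
    "comedy_a": "love jest marry love fool jest",
    "comedy_b": "love marry wit fool wit",
    "tragedy_a": "blood death love sword blood",
}
stopwords = {"the", "a", "and"}  # stand-in for the ~300-word common list

# Count word frequencies over the whole training set, minus stopwords.
counts = Counter(w for text in plays.values()
                 for w in text.split() if w not in stopwords)
vocab = [w for w, _ in counts.most_common(2500)]

# Keep only words NOT present in every play -- the analogue of the
# 429 discriminative words chosen for the weak learners.
per_play = [set(t.split()) for t in plays.values()]
discriminative = [w for w in vocab if not all(w in s for s in per_play)]
print(discriminative)
```

Here "love" is dropped because it appears in every play, mirroring how the 429 words were selected from the 2500.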

  8. Comedy Words / Tragedy Words • 429 words: 225 comedy, 204 tragedy • Data: vector of 2500 word counts, X = [X1, X2, …, X2500] • Weak learners F1(X)…F429(X), each returning 1 for a positive hit
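A minimal sketch of such weak learners, assuming "returning 1 for a positive hit" means a presence indicator on the word's count in the vector X (the vocabulary and learner words below are invented, and the vector is 4-dimensional rather than 2500):

```python
# Each weak learner is a word-presence indicator over the count vector X.
vocab = ["love", "jest", "blood", "sword"]   # stand-in vocabulary
learner_words = ["jest", "blood"]            # stand-in for the 429 words

def make_learner(word):
    idx = vocab.index(word)
    # Return 1 if the word occurs at all in the text, else 0.
    return lambda X: 1 if X[idx] > 0 else 0

learners = [make_learner(w) for w in learner_words]

X = [3, 1, 0, 0]              # counts for a hypothetical comedy excerpt
hits = [f(X) for f in learners]
print(hits)                   # → [1, 0]: "jest" present, "blood" absent
```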

  9. Boosting • A mix of LPBoost and TotalBoost • No termination (finite set of weak learners) • No gamma (edge) function; used eta (error) instead • No zero-sum constraint on normalization of the weight updates
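The slides do not give the exact LPBoost/TotalBoost hybrid, so as a hedged illustration, here is one round of a generic error-driven boosting update in the familiar AdaBoost form: the weighted error eta sets the learner's weight and drives the example re-weighting. This is not the author's precise rule, only the standard shape such an update takes:

```python
import math

def boosting_round(weights, predictions, labels):
    # Weighted error (eta) of this weak learner; predictions/labels are ±1.
    eta = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    eta = max(min(eta, 1 - 1e-9), 1e-9)       # clamp away from 0 and 1
    alpha = 0.5 * math.log((1 - eta) / eta)   # learner weight from eta
    # Re-weight examples: up-weight mistakes, down-weight correct ones.
    new = [w * math.exp(-alpha * (1 if p == y else -1))
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(new)                              # plain normalization,
    return alpha, [w / z for w in new]        # no zero-sum constraint

w0 = [0.25] * 4
alpha, w1 = boosting_round(w0, [1, 1, -1, -1], [1, 1, 1, -1])
```

With one mistake out of four equally weighted examples, eta = 0.25 and the misclassified example ends up carrying half the total weight after normalization.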

  10. Classification • Used the accumulated weights at the very end • Every presence in the test corpus adds (1*W) to totalW, with some W’s negative • At the end, it was a simple matter of observing whether the result was positive or negative, and by how much
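The sign-of-accumulated-weights decision can be sketched like this, assuming comedy-word learners carry positive weights and tragedy-word learners negative ones (the words and weight values below are invented for illustration):

```python
# Hypothetical learned weights W: positive for comedy words,
# negative for tragedy words.
learner_weights = {"jest": 0.8, "wit": 0.5, "blood": -0.9, "sword": -0.4}

def classify(text):
    words = set(text.split())
    # Each learner that fires contributes 1*W to the running total.
    total = sum(w for word, w in learner_weights.items() if word in words)
    # The sign decides the class; the magnitude says by how much.
    return ("comedy" if total > 0 else "tragedy"), total

label, score = classify("a jest of wit and sword")
print(label, score)
```

Here two positive hits outweigh one negative hit, so the excerpt is called a comedy.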

  11.–13. [Figure slides; images/equations not preserved in the transcript.]

  14. Program Output • [root@localhost output]# ./classify.sh • 00_allswell.html-ratio.txt: 14.6807 • 01_comedyErrors.html-ratio.txt: 13.2634 • 02_measure.html-ratio.txt: 34.2748 • 03_muchAdo.html-ratio.txt: -6.43018 • 04_asyoulikeit.html-ratio.txt: 18.8413 • 05_cleopatra.html-ratio.txt: 14.1148 • 06_lear.html-ratio.txt: 32.2858 • 07_macbeth.html-ratio.txt: -21.095 • 08_coriolanus.html-ratio.txt: 43.5599 • 09_titus.html-ratio.txt: -3.31167 • 10_cleopatraFull.html-ratio.txt: -300.179 • 11_learFull.html-ratio.txt: 356.504 • 13_tempestFull.html-ratio.txt: 454.171 • 14_marryWivesFull.html-ratio.txt: 147.738 • 15_measure2.html-ratio.txt: 39.0357 • 16_measureFull.html-ratio.txt: 112.527 • 17_muchAdoFull.html-ratio.txt: 256.078 • 18_veronaFull.html-ratio.txt: -222.444 • 19_othelloFull.html-ratio.txt: -433.769 • 20_titusFull.html-ratio.txt: -564.977

  15. Results • All’s Well That Ends Well [c][1] • Comedy of Errors [c][1] • As You Like It [c][1] • The Tempest [c][1] • Merry Wives of Windsor [c][1] • King Lear [t][0] • Macbeth [t][1] • Coriolanus [t][0] • Othello [t][1] • 2/9 mistakes, 7/9 correct (≈78%); also 66% and 69% on other runs • Previous run with a neural net (different setup): 5/13, 61% - with no proportionals!

  16. Challenges • Natural language has a lot of nuances that can make a difference (preprocessing methods, “common word” sets, adaptations) • Boosting has great potential in this area • Words provide an easy method for generating (many) weak learners
