Boosting Textual Source Attribution

Foaad Khosmood

Department of Computer Science

University of California, Santa Cruz

Winter 2006

HEY??? What’s so funny?
  • What makes something funny?
  • Can we tell by just reading? Can a computer?
  • Shakespeare’s Comedies and Tragedies.
    • Actually, Comedies, Tragedies, Historical Plays and Sonnets.
Experimenting with Boosting
  • Most boosting work has been done on binary classification.
  • Requires lots of “weak” learners.
  • Some variants work well with a limited data set.
  • Provides knowledge about the relative importance of features.
Data Set (Training)
  • Tragedies
    • Antony and Cleopatra
    • Titus Andronicus
    • Hamlet
    • Julius Caesar
    • Romeo and Juliet
  • Comedies
    • Measure for Measure
    • Much Ado about Nothing
    • Merchant of Venice
    • Midsummer Night’s Dream
    • Taming of the Shrew
    • Twelfth Night
Data Set (Test)
  • All’s Well That Ends Well [c]
  • Comedy of Errors [c]
  • As You Like It [c]
  • The Tempest [c]
  • Merry Wives of Windsor [c]
  • King Lear [t]
  • Macbeth [t]
  • Coriolanus [t]
  • Othello [t]
Feature Selection
  • Features: words
  • Selection method: picked the 2500 most common words in the training set
  • Preprocessing: 300 common English words and grammar operators removed
    • HTML markup and stage directions also removed
  • 429 of the 2500 words were not common to all plays; these 429 were chosen as the weak-learner features (for this particular run)
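The selection pipeline above can be sketched roughly as follows; the tokenizer, stopword set, and function names here are illustrative assumptions, not the author's exact preprocessing.

```python
from collections import Counter
import re

def select_features(play_texts, stopwords, top_n=2500):
    """Pick the top_n most common words across the training plays
    (after dropping the common-word list), then keep only those words
    that do NOT occur in every play -- the discriminative ones."""
    counts = Counter()
    per_play_vocab = []
    for text in play_texts:
        words = [w for w in re.findall(r"[a-z']+", text.lower())
                 if w not in stopwords]
        counts.update(words)
        per_play_vocab.append(set(words))
    top_words = [w for w, _ in counts.most_common(top_n)]
    discriminative = [w for w in top_words
                      if not all(w in v for v in per_play_vocab)]
    return top_words, discriminative
```

On the real corpus this kind of filter is what would yield the 2500-word vocabulary and the 429 discriminative words described above.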
Comedy Words vs. Tragedy Words
  • 429 words total: 225 comedy, 204 tragedy
  • Data: a vector over the 2500-word vocabulary, X = [X1, X2 … X2500]
  • Weak learners F1(X) … F429(X), each returning 1 for a positive hit
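A minimal sketch of the indicator-style weak learners described above (the variable names are mine):

```python
def make_weak_learner(word_index):
    """F_i(X): returns 1 (a "positive hit") when word i occurs in the
    document's word-count vector X, and 0 otherwise."""
    def f(x):
        return 1 if x[word_index] > 0 else 0
    return f

# One learner per discriminative word, e.g. F1 ... F429:
# learners = [make_weak_learner(i) for i in discriminative_indices]
```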

Boosting
  • A mix of LPBoost and TotalBoost
  • No termination criterion (the finite pool of weak learners is simply exhausted)
  • No gamma (edge) parameter; used eta (the weighted error) instead
  • No zero-sum constraint on the normalization of the weight updates
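The slides' exact LPBoost/TotalBoost hybrid is not fully specified, so as a hedged stand-in, here is a generic AdaBoost-style pass that matches the broad outline: one sweep over the whole finite pool of weak learners, with an error term eta and per-example reweighting.

```python
import math

def boost(weak_learners, X, y):
    """Assign a voting weight (alpha) to each weak learner in a single
    pass over the finite pool, reweighting training examples after each
    learner. This is an AdaBoost-style sketch, NOT the exact variant
    from the slides. X: feature vectors; y: labels in {+1, -1}."""
    n = len(X)
    d = [1.0 / n] * n                      # example weights
    alphas = []
    for h in weak_learners:                # no early termination
        preds = [1 if h(x) else -1 for x in X]
        # eta: this learner's weighted error (used in place of gamma)
        eta = sum(d[i] for i in range(n) if preds[i] != y[i])
        eta = min(max(eta, 1e-10), 1.0 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1.0 - eta) / eta)
        alphas.append(alpha)
        # Increase the weight of examples this learner got wrong.
        d = [d[i] * math.exp(-alpha * y[i] * preds[i]) for i in range(n)]
        z = sum(d)
        d = [w / z for w in d]
    return alphas
```

A learner that is mostly right gets a positive alpha, one that is mostly wrong a negative alpha, which matches the mix of positive and negative W's used at classification time.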
Classification
  • Used the accumulated weights at the very end
  • Every presence of a word in the test corpus adds (1 × W) to totalW; some W’s are negative
  • Classification then reduces to checking whether the total is positive or negative, and by how much
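The final decision rule above can be sketched like this; the mapping of positive totals to "comedy" is my assumption (consistent with the sign convention in the results), and the names are illustrative.

```python
def classify(doc_vector, word_indices, weights):
    """Add W_j to the total for every tracked word present in the test
    document; some W_j are negative. The sign of the accumulated total
    gives the class, and its magnitude a rough confidence."""
    total_w = sum(w for j, w in zip(word_indices, weights)
                  if doc_vector[j] > 0)
    label = "comedy" if total_w > 0 else "tragedy"
    return label, total_w
```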
Program Output
  [root@localhost output]# ./classify.sh
  00_allswell.html-ratio.txt: 14.6807
  01_comedyErrors.html-ratio.txt: 13.2634
  02_measure.html-ratio.txt: 34.2748
  03_muchAdo.html-ratio.txt: -6.43018
  04_asyoulikeit.html-ratio.txt: 18.8413
  05_cleopatra.html-ratio.txt: 14.1148
  06_lear.html-ratio.txt: 32.2858
  07_macbeth.html-ratio.txt: -21.095
  08_coriolanus.html-ratio.txt: 43.5599
  09_titus.html-ratio.txt: -3.31167
  10_cleopatraFull.html-ratio.txt: -300.179
  11_learFull.html-ratio.txt: 356.504
  13_tempestFull.html-ratio.txt: 454.171
  14_marryWivesFull.html-ratio.txt: 147.738
  15_measure2.html-ratio.txt: 39.0357
  16_measureFull.html-ratio.txt: 112.527
  17_muchAdoFull.html-ratio.txt: 256.078
  18_veronaFull.html-ratio.txt: -222.444
  19_othelloFull.html-ratio.txt: -433.769
  20_titusFull.html-ratio.txt: -564.977
Results
  • All’s Well That Ends Well [c][1]
  • Comedy of Errors [c][1]
  • As You Like It [c][1]
  • The Tempest [c][1]
  • Merry Wives of Windsor [c][1]
  • King Lear [t][0]
  • Macbeth [t][1]
  • Coriolanus [t][0]
  • Othello [t][1]

2/9 mistakes: 7/9 correct, or 77% (other runs: 66% and 69%)

A previous run with a neural net (different setup): 5/13 mistakes, 61%

With no proportionals!

Challenges
  • Natural language has a lot of nuances that could make a difference (preprocessing methods, “common word” sets, adaptations)
  • Boosting has great potential in this area
  • Words provide an easy method for generating (many) weak learners