Project 4

Presentation Transcript


  1. Brian Nisonger Shauna Eggers Joshua Johanson Project 4

  2. Our Project • Combination systems • Tried to improve Tagging accuracy by combining three different taggers • Taggers • MaxEnt • TBL • Trigram • Methods • Voting • Weighted Voting (invented by Joshua) • Bayes

  3. Observations • Our combination system improved the overall tagging results • We invented a new type of voting and a new way of splitting the data that shows promise • In applying the standard combination methods to tagging, we made a few tweaks

  4. Voting

  5. Method • Our voting algorithm was quite simple: we trained three taggers on the data • Then applied the following: • A = T(n) for W(n) of Trigram • B = T(n) for W(n) of TBL • C = T(n) for W(n) of MaxEnt • If A = C then A • else B
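A minimal sketch of this voting rule (the function name and the example tag strings are illustrative; the slide gives no implementation):

```python
def vote(trigram_tag, tbl_tag, maxent_tag):
    """Voting rule from the slide: A = Trigram, B = TBL, C = MaxEnt.
    If the Trigram and MaxEnt tags agree (A == C), take that tag;
    otherwise fall back to the TBL tag (B)."""
    if trigram_tag == maxent_tag:
        return trigram_tag
    return tbl_tag

# Example for a single word:
vote("NN", "VB", "NN")   # Trigram and MaxEnt agree -> "NN"
vote("NN", "VB", "VBP")  # they disagree -> TBL's "VB" wins
```

Note that whenever any two taggers agree this rule returns the majority tag: if Trigram and MaxEnt disagree, TBL either sides with one of them or breaks the three-way tie.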

  6. Voting Method • [Diagram: the three taggers are each trained on the training data; their outputs feed into the voter]

  7. Voting Method • [Diagram: the voter combines the three taggers' outputs into a single voted output]

  8. Splitting the Data • Used 83% of the training data to train the taggers • After being trained, they are tested on the remaining 17% • The output is used to train the combination systems
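A rough sketch of this split, assuming it is done at the sentence level and using the 83/17 ratio from the slide (the function name is hypothetical):

```python
def split_training_data(sentences, tagger_fraction=0.83):
    """Train the taggers on the first portion; their output on the
    held-out remainder becomes training data for the combination system."""
    cut = int(len(sentences) * tagger_fraction)
    return sentences[:cut], sentences[cut:]  # (tagger_train, combiner_train)
```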

  9. Weighted Voter • [Diagram: each tagger is trained on part of the training data and run on the held-out part to produce output for the combiner]

  10. Weighted Voter • [Diagram: the taggers' outputs on the held-out data are used to train the weighted voter]

  11. Weighted Voter • [Diagram: at test time, the three taggers' outputs feed into the trained weighted voter, which produces the final output]

  12. Weighted Voter

  13. pull/NN pull/VBP pull/VB • When this is the output, what is the most likely tag? • How often is the tagger right when it outputs this tag? • What about on this specific word? • If two taggers both think it's some kind of verb, isn't it more likely to be a verb? (similarity)

  14. Probabilities • i) P(t|w, t1,t2,t3) • ii) P(t|w,t1) • iii) P(t|w,t2) • iv) P(t|w,t3) • v) P(t|t1,t2,t3) • vi) P(t|t1) • vii) P(t|t2) • viii) P(t|t3)

  15. Example • pull/NN pull/VBP pull/VB • i) P(t| pull, NN, VBP, VB) • ii) P(t| pull, NN) • iii) P(t| pull, VBP) • iv) P(t| pull, VB) • v) P(t| NN, VBP, VB) • vi) P(t| NN) • vii) P(t| VBP) • viii) P(t| VB)

  16. How do you put all of these probabilities together? • trial and error • multiply them – huge smoothing issue • add them – not very mathematical • weight certain probabilities higher

  17. Which did I end up with? • Add them together with weights

  18. How did you get the weights? • Complicated Mathematical Equations • Sorry, no can do • Train the weights • I didn’t have time • Try different weights until I get good results. • Isn’t that cheating? • oh well

  19. What were the weights • i) P(t|w, t1,t2,t3) - 50 • ii) P(t|w,t1) - 1 • iii) P(t|w,t2) - 1 • iv) P(t|w,t3) - 1 • v) P(t|t1,t2,t3) - 6 • vi) P(t|t1) - 2 • vii) P(t|t2) - 2 • viii) P(t|t3) - 2
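A sketch of how these eight probabilities might be combined into a weighted vote using the weights above (the probability tables and function names are assumptions; the slides do not show the implementation):

```python
# Weights from the slide, in the same order as the probability list (i-viii).
WEIGHTS = [50, 1, 1, 1, 6, 2, 2, 2]

def contexts(w, t1, t2, t3):
    """The eight conditioning contexts from the slide, in order."""
    return [
        (w, t1, t2, t3),  # i)    P(t | w, t1, t2, t3)
        (w, t1),          # ii)   P(t | w, t1)
        (w, t2),          # iii)  P(t | w, t2)
        (w, t3),          # iv)   P(t | w, t3)
        (t1, t2, t3),     # v)    P(t | t1, t2, t3)
        (t1,),            # vi)   P(t | t1)
        (t2,),            # vii)  P(t | t2)
        (t3,),            # viii) P(t | t3)
    ]

def weighted_vote(w, t1, t2, t3, prob_tables, tagset):
    """prob_tables[i] maps a context tuple to a dict of P(t | context),
    estimated separately for each of the eight contexts from the combiner
    training data. Picks the tag with the highest weighted sum."""
    def score(t):
        return sum(
            weight * table.get(ctx, {}).get(t, 0.0)
            for weight, table, ctx
            in zip(WEIGHTS, prob_tables, contexts(w, t1, t2, t3))
        )
    return max(tagset, key=score)
```

Each of the eight tables would be estimated from the combiner training data by counting how often each tag turns out to be correct in each context.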

  20. Future Work (not really) • Get more sophisticated, mathematically sound probabilities – for example, look at the Naïve Bayes. • I could actually train the weights, instead of just fudging until I get good numbers • That would require more splitting of the data.

  21. Naïve Bayes

  22. Naive Bayes: Model • Model: for k taggers, the selected tag t for the input word w is the one that maximizes P(t|w) * P(t1|t) * ... * P(tk|t) • where P(t|w) = probability that t is the correct tag for w • P(ti|t) = probability that tagger hi produces tag ti when t is the correct tag

  23. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk)

  24. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk) • Bayes Magic: = P(t|w) * P(t1, ..., tk | t, w) / P(t1, ..., tk | w)

  25. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk) • Bayes Magic: = P(t|w) * P(t1, ..., tk | t, w) / P(t1, ..., tk | w) • Remove denominator: ∝ P(t|w) * P(t1, ..., tk | t, w)

  26. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk) • Bayes Magic: = P(t|w) * P(t1, ..., tk | t, w) / P(t1, ..., tk | w) • Remove denominator: ∝ P(t|w) * P(t1, ..., tk | t, w) • "Naivete": Independence assumption (each tagger's output depends only on the correct tag): ≈ P(t|w) * P(t1|t) * ... * P(tk|t)

  27. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk) • Bayes Magic: = P(t|w) * P(t1, ..., tk | t, w) / P(t1, ..., tk | w) • Remove denominator: ∝ P(t|w) * P(t1, ..., tk | t, w) • "Naivete": Independence assumption: ≈ P(t|w) * P(t1|t) * ... * P(tk|t) • Voila! The combiner picks the t that maximizes P(t|w) * P(t1|t) * ... * P(tk|t)

  28. Naive Bayes: Comparison to Parsing Model • (Henderson and Brill 1999) • Notice: the second probability is calculated for the correct hypothesis, not the produced hypothesis • Applying a direct analog of this approach to tagging did not produce any improvement over the baseline taggers • it consistently produced the average of the baseline results • Why different for H&B? (...or was it?)

  29. Naive Bayes: Parameter Estimation • 1. Probability that t is correct for w: P(t|w) = count(w tagged t in the combiner training data) / count(w) • 2. Probability that when t is correct, tagger hi produces tag ti: P(ti|t) = count(hi outputs ti where the correct tag is t) / count(correct tag t)
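A minimal sketch of the Naive Bayes combiner from slides 22-29, using the maximum-likelihood estimates above and a simple probability floor in place of the smoothing described on slide 31 (data structures and names are assumptions):

```python
from collections import defaultdict

def train_naive_bayes(rows):
    """rows: (word, gold_tag, [tag from each of the k taggers]) tuples
    taken from the combiner training data."""
    word_tag = defaultdict(float)  # count(w tagged t)
    word = defaultdict(float)      # count(w)
    emit = defaultdict(float)      # count(tagger i outputs ti when gold is t)
    gold = defaultdict(float)      # count(gold tag t)
    for w, t, hyps in rows:
        word_tag[(w, t)] += 1
        word[w] += 1
        gold[t] += 1
        for i, ti in enumerate(hyps):
            emit[(i, t, ti)] += 1
    return word_tag, word, emit, gold

def nb_tag(w, hyps, params, tagset, floor=1e-6):
    """argmax over t of P(t|w) * prod_i P(ti|t), with a small floor
    standing in for proper smoothing."""
    word_tag, word, emit, gold = params
    def score(t):
        p = word_tag[(w, t)] / word[w] if word[w] else floor
        p = max(p, floor)
        for i, ti in enumerate(hyps):
            cond = emit[(i, t, ti)] / gold[t] if gold[t] else floor
            p *= max(cond, floor)
        return p
    return max(tagset, key=score)
```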

  30. Naive Bayes: Unknown Words • When words in the hypotheses were unknown to the taggers, treat them all the same way: • Convert all unknown words in the tagger outputs (combiner inputs) to "OtherUnk" • Use the parameters estimated for OtherUnk • This did not make a terribly big difference in the output: an average improvement of 0.006 • (But it didn't hurt either)
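A sketch of the OtherUnk mapping described above (known_words is a hypothetical set of the words the taggers saw in training):

```python
def normalize_word(w, known_words):
    # Map every word the taggers did not see in training to a single
    # placeholder, so they all share one set of "OtherUnk" parameters.
    return w if w in known_words else "OtherUnk"
```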

  31. Naive Bayes: Smoothing • Witten-Bell-ish: assign zero-frequency items the same value as the lowest-frequency items • For each parameter, use the smallest value seen in the training data as the smoothing value • Not Witten-Bell by definition, since the lowest-frequency item is not necessarily 1, but basically the same
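A sketch of this smoothing idea: unseen outcomes for a parameter get the smallest probability observed for that parameter in training (the function name is hypothetical):

```python
def smoothed(prob_table):
    """prob_table maps outcomes to estimated probabilities for one parameter.
    Returns a lookup that gives zero-frequency outcomes the smallest
    probability seen in the training data."""
    floor = min(prob_table.values()) if prob_table else 0.0
    return lambda outcome: prob_table.get(outcome, floor)
```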

  32. Results

  33. Results

  34. Splitting the Data

  35. Is splitting the data bad? • Yes. It gives the tagger less data to train on, which means worse results for the tagger. • Then why don't you just split away less data? • Then you have less data to train the combination system on. • Can't you get the input to the combination system from tagging the training data? • You can't run the tagger on the same data it was trained on. We are learning the reliability of the tagger, and this artificially inflates the reliability. • Once you train the combination system on the results from the taggers trained on the split data, can you test it with the results from the taggers trained on the whole data set? • Technically no. The combination system learns the reliability of each tagger, and if you change the data the tagger is trained on, you change the reliability of that tagger. • Isn't the reliability of the taggers trained on the split data and the taggers trained on the whole data set close enough? • Let's find out!

  36. Weighted Voter • [Diagram, as on slide 10: the taggers' outputs on the held-out data are used to train the weighted voter]

  37. Weighted Voter • [Diagram, as on slide 11: at test time, the taggers' outputs feed into the trained weighted voter, which produces the final output]

  38. Does it work? • It slightly increases the weighted voter's accuracy • It decreases the Naïve Bayes accuracy

  39. Why? • The accuracy for the taggers goes up, so the overall accuracy goes up. • The combination systems are learning the reliability of a tagger, and the taggers were changed. This decreases the ability to predict the right reliability, so accuracy goes down. • Naïve Bayes is more sensitive to a change in taggers than the weighted voter.

  40. If we can use the output from the tagger trained on one set of data for the combination system and the output from the tagger trained on another set of data to test it, then it doesn't matter how we split the data. • [Diagram: the training data is divided into portions, and a tagger trained on part of the data produces output for each portion]

  41. We can then combine the output of these taggers trained on the different portions of the training data. • [Diagram: three taggers, each trained on a different portion of the training data, each produce output]

  42. We can then combine the output of these taggers trained on the different portions of the training data. • [Diagram: the outputs from the taggers trained on the different portions are combined]

  43. Each of these segments is the result from the taggers being tested on unseen data, but together they give you how the tagger would have tagged the entire data set if it had never seen it. • [Diagram: the per-segment outputs are concatenated into one combined output]

  44. This gives you a lot more data to test the combination system on, increasing the accuracy of the combination system. • You could then split this data, making it possible to train the weights. • [Diagram: the concatenated outputs form the data for the combination system]
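A sketch of the splitting scheme from slides 40-44: train a tagger on all but one slice, tag the held-out slice, and stitch the slices back together so the combiner sees how each tagger behaves on data it has never seen, over the whole training set. The fold count and the train_tagger / tag interface are assumptions:

```python
def jackknife_outputs(sentences, train_tagger, n_folds=3):
    """For each fold, train a tagger on the other folds and tag the held-out
    fold. Concatenating the folds gives tagger output for the entire training
    set as if the tagger had never seen it."""
    size = len(sentences) // n_folds
    folds = [sentences[i * size:(i + 1) * size] for i in range(n_folds - 1)]
    folds.append(sentences[(n_folds - 1) * size:])
    combined = []
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        tagger = train_tagger(train)          # assumed training entry point
        combined.extend(tagger.tag(sent) for sent in held_out)
    return combined
```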

  45. Increasing training data increases accuracy, but not enough for the Naïve Bayes to recover from the loss of accuracy from using different data to train the tagger.

  46. So what can we do about it? • The loss of accuracy was because the reliability of the output from taggers trained on the split data was different from the output from the taggers trained on all of the data. • Take smaller slices • [Diagram: the training data divided into smaller slices]

  47. So what can we do about it? • The difference between the split and unsplit training data is smaller, so the taggers should be more similar, helping the combination systems correctly predict the reliability of the tagger. • [Diagram: the training data divided into smaller slices]

  48. So can we do it? • Taking smaller slices is very expensive, especially for the TBL. • If the tagger were retractable, we might be able to produce the training data without having to rerun the system several times.

  49. Trigram Model • Train the Trigram model on the whole training data • For each sentence in the training data, calculate what the probabilities would have been if the tagger were not trained on that sentence. • Tag the sentence based on the new probabilities.
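A sketch of this "retract a sentence" idea: keep the counts collected over the whole training set and, for each sentence, subtract its own counts before computing that sentence's probabilities (the count structures and names are assumptions):

```python
from collections import Counter

def leave_one_out_counts(total_counts, sentence_counts):
    """Counts the trigram model would have had if this sentence had been
    left out of training: subtract the sentence's own n-gram counts from
    the counts over the full training set."""
    adjusted = Counter(total_counts)
    adjusted.subtract(sentence_counts)
    return adjusted
```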

  50. Conclusions • Combination methods • A simple method like Voting works surprisingly well • Naïve Bayes needs a lot of training data to show improvement, but when it does the difference is substantial • New methods to improve the basic voting show consistently better results than voting by itself • Data preparation • The weighted voting method was further improved by more intelligent splitting of the data • Applying the new splitting techniques to Naïve Bayes needs some investigation to see if there is any improvement
