Project 4

Presentation Transcript


  1. Brian Nisonger Shauna Eggers Joshua Johanson Project 4

  2. Our Project • Combination systems • Tried to improve Tagging accuracy by combining three different taggers • Taggers • MaxEnt • TBL • Trigram • Methods • Voting • Weighted Voting (invented by Joshua) • Bayes

  3. Observations • Our combination system improved the overall tagging results • We invented a new type of voting and a new way of splitting the data that shows promise • In applying the standard combination methods to tagging, we made a few tweaks

  4. Voting

  5. Method • Our voting algorithm was quite simple: we trained three taggers on the data • Then applied the following: • A = T(n) for W(n) of Trigram • B = T(n) for W(n) of TBL • C = T(n) for W(n) of MaxEnt • If A = C then A • else B
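A minimal sketch of this voting rule (the function name and the example tag strings are illustrative; the slide gives no implementation):

```python
def vote(trigram_tag, tbl_tag, maxent_tag):
    """Voting rule from the slide: A = Trigram, B = TBL, C = MaxEnt.
    If the Trigram and MaxEnt tags agree (A == C), take that tag;
    otherwise fall back to the TBL tag (B)."""
    if trigram_tag == maxent_tag:
        return trigram_tag
    return tbl_tag

# Example for a single word:
vote("NN", "VB", "NN")   # Trigram and MaxEnt agree -> "NN"
vote("NN", "VB", "VBP")  # they disagree -> TBL's "VB" wins
```

Note that whenever any two taggers agree this rule returns the majority tag: if Trigram and MaxEnt disagree, TBL either sides with one of them or breaks the three-way tie.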

  6. Voting Method • [Diagram: the three taggers are each trained on the training data; their outputs feed into the voter]

  7. Voting Method • [Diagram: the voter combines the three taggers' outputs into a single voted output]

  8. Splitting the Data • Used 83% of the training data to train the taggers • After being trained, they are tested on the remaining 17% • The output is used to train the combination systems
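A rough sketch of this split, assuming it is done at the sentence level and using the 83/17 ratio from the slide (the function name is hypothetical):

```python
def split_training_data(sentences, tagger_fraction=0.83):
    """Train the taggers on the first portion; their output on the
    held-out remainder becomes training data for the combination system."""
    cut = int(len(sentences) * tagger_fraction)
    return sentences[:cut], sentences[cut:]  # (tagger_train, combiner_train)
```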

  9. Weighted Voter • [Diagram: each tagger is trained on part of the training data and run on the held-out part to produce output for the combiner]

  10. Weighted Voter • [Diagram: the taggers' outputs on the held-out data are used to train the weighted voter]

  11. Weighted Voter • [Diagram: at test time, the three taggers' outputs feed into the trained weighted voter, which produces the final output]

  12. Weighted Voter

  13. pull/NN pull/VBP pull/VB • When this is the output, what is the most likely tag? • How often is the tagger right when it outputs this tag? • What about on this specific word? • If two taggers both think it's some kind of verb, isn't it more likely to be a verb? (similarity)

  14. Probabilities • i) P(t|w, t1,t2,t3) • ii) P(t|w,t1) • iii) P(t|w,t2) • iv) P(t|w,t3) • v) P(t|t1,t2,t3) • vi) P(t|t1) • vii) P(t|t2) • viii) P(t|t3)

  15. Example • pull/NN pull/VBP pull/VB • i) P(t| pull, NN, VBP, VB) • ii) P(t| pull, NN) • iii) P(t| pull, VBP) • iv) P(t| pull, VB) • v) P(t| NN, VBP, VB) • vi) P(t| NN) • vii) P(t| VBP) • viii) P(t| VB)

  16. How do you put all of these probabilities together? • trial and error • multiply them – huge smoothing issue • add them – not very mathematical • weight certain probabilities higher

  17. Which did I end up with? • Add them together with weights

  18. How did you get the weights? • Complicated Mathematical Equations • Sorry, no can do • Train the weights • I didn’t have time • Try different weights until I get good results. • Isn’t that cheating? • oh well

  19. What were the weights • i) P(t|w, t1,t2,t3) - 50 • ii) P(t|w,t1) - 1 • iii) P(t|w,t2) - 1 • iv) P(t|w,t3) - 1 • v) P(t|t1,t2,t3) - 6 • vi) P(t|t1) - 2 • vii) P(t|t2) - 2 • viii) P(t|t3) - 2
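A sketch of how these eight probabilities might be combined into a weighted vote using the weights above (the probability tables and function names are assumptions; the slides do not show the implementation):

```python
# Weights from the slide, in the same order as the probability list (i-viii).
WEIGHTS = [50, 1, 1, 1, 6, 2, 2, 2]

def contexts(w, t1, t2, t3):
    """The eight conditioning contexts from the slide, in order."""
    return [
        (w, t1, t2, t3),  # i)    P(t | w, t1, t2, t3)
        (w, t1),          # ii)   P(t | w, t1)
        (w, t2),          # iii)  P(t | w, t2)
        (w, t3),          # iv)   P(t | w, t3)
        (t1, t2, t3),     # v)    P(t | t1, t2, t3)
        (t1,),            # vi)   P(t | t1)
        (t2,),            # vii)  P(t | t2)
        (t3,),            # viii) P(t | t3)
    ]

def weighted_vote(w, t1, t2, t3, prob_tables, tagset):
    """prob_tables[i] maps a context tuple to a dict of P(t | context),
    estimated separately for each of the eight contexts from the combiner
    training data. Picks the tag with the highest weighted sum."""
    def score(t):
        return sum(
            weight * table.get(ctx, {}).get(t, 0.0)
            for weight, table, ctx
            in zip(WEIGHTS, prob_tables, contexts(w, t1, t2, t3))
        )
    return max(tagset, key=score)
```

Each of the eight tables would be estimated from the combiner training data by counting how often each tag turns out to be correct in each context.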

  20. Future Work (not really) • Get more sophisticated, mathematically sound probabilities – for example, look at the Naïve Bayes. • I could actually train the weights, instead of just fudging until I get good numbers • That would require more splitting of the data.

  21. Naïve Bayes

  22. Naive Bayes: Model • Model: for k taggers, the selected tag t for the input word w is the one that maximizes P(t|w) * P(t1|t) * ... * P(tk|t) • where P(t|w) = probability that t is the correct tag for w • P(ti|t) = probability that tagger hi produces tag ti when t is the correct tag

  23. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk)

  24. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk) • Bayes Magic: = P(t|w) * P(t1, ..., tk | t, w) / P(t1, ..., tk | w)

  25. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk) • Bayes Magic: = P(t|w) * P(t1, ..., tk | t, w) / P(t1, ..., tk | w) • Remove denominator: ∝ P(t|w) * P(t1, ..., tk | t, w)

  26. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk) • Bayes Magic: = P(t|w) * P(t1, ..., tk | t, w) / P(t1, ..., tk | w) • Remove denominator: ∝ P(t|w) * P(t1, ..., tk | t, w) • "Naivete": Independence assumption (each tagger's output depends only on the correct tag): ≈ P(t|w) * P(t1|t) * ... * P(tk|t)

  27. Naive Bayes: Model Derivation • Prob that tag t is correct for word w given the set of hypotheses 1 through k: P(t | w, t1, ..., tk) • Bayes Magic: = P(t|w) * P(t1, ..., tk | t, w) / P(t1, ..., tk | w) • Remove denominator: ∝ P(t|w) * P(t1, ..., tk | t, w) • "Naivete": Independence assumption: ≈ P(t|w) * P(t1|t) * ... * P(tk|t) • Voila! The combiner picks the t that maximizes P(t|w) * P(t1|t) * ... * P(tk|t)

  28. Naive Bayes: Comparison to Parsing Model • (Henderson and Brill 1999) • Notice: the second probability is calculated for the correct hypothesis, not the produced hypothesis • Applying a direct analog of this approach to tagging did not produce any improvement over the baseline taggers • it consistently produced the average of the baseline results • Why different for H&B? (...or was it?)

  29. Naive Bayes: Parameter Estimation • 1. Probability that t is correct for w: P(t|w) = count(w tagged t in the combiner training data) / count(w) • 2. Probability that when t is correct, tagger hi produces tag ti: P(ti|t) = count(hi outputs ti where the correct tag is t) / count(correct tag t)
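A minimal sketch of the Naive Bayes combiner from slides 22-29, using the maximum-likelihood estimates above and a simple probability floor in place of the smoothing described on slide 31 (data structures and names are assumptions):

```python
from collections import defaultdict

def train_naive_bayes(rows):
    """rows: (word, gold_tag, [tag from each of the k taggers]) tuples
    taken from the combiner training data."""
    word_tag = defaultdict(float)  # count(w tagged t)
    word = defaultdict(float)      # count(w)
    emit = defaultdict(float)      # count(tagger i outputs ti when gold is t)
    gold = defaultdict(float)      # count(gold tag t)
    for w, t, hyps in rows:
        word_tag[(w, t)] += 1
        word[w] += 1
        gold[t] += 1
        for i, ti in enumerate(hyps):
            emit[(i, t, ti)] += 1
    return word_tag, word, emit, gold

def nb_tag(w, hyps, params, tagset, floor=1e-6):
    """argmax over t of P(t|w) * prod_i P(ti|t), with a small floor
    standing in for proper smoothing."""
    word_tag, word, emit, gold = params
    def score(t):
        p = word_tag[(w, t)] / word[w] if word[w] else floor
        p = max(p, floor)
        for i, ti in enumerate(hyps):
            cond = emit[(i, t, ti)] / gold[t] if gold[t] else floor
            p *= max(cond, floor)
        return p
    return max(tagset, key=score)
```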

  30. Naive Bayes: Unknown Words • When words in the hypotheses were unknown to the taggers, treat them all the same way: • Convert all unknown words in the tagger outputs (combiner inputs) to "OtherUnk" • Use the parameters estimated for OtherUnk • This did not make a terribly big difference in the output: an average improvement of 0.006 • (But it didn't hurt either)
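A sketch of the OtherUnk mapping described above (known_words is a hypothetical set of the words the taggers saw in training):

```python
def normalize_word(w, known_words):
    # Map every word the taggers did not see in training to a single
    # placeholder, so they all share one set of "OtherUnk" parameters.
    return w if w in known_words else "OtherUnk"
```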

  31. Naive Bayes: Smoothing • Witten-Bell-ish: assign zero-frequency items the same value as the lowest-frequency items • For each parameter, use the smallest value seen in the training data as the smoothing value • Not Witten-Bell by definition, since the lowest-frequency item is not necessarily 1, but basically the same
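A sketch of this smoothing idea: unseen outcomes for a parameter get the smallest probability observed for that parameter in training (the function name is hypothetical):

```python
def smoothed(prob_table):
    """prob_table maps outcomes to estimated probabilities for one parameter.
    Returns a lookup that gives zero-frequency outcomes the smallest
    probability seen in the training data."""
    floor = min(prob_table.values()) if prob_table else 0.0
    return lambda outcome: prob_table.get(outcome, floor)
```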

  32. Results

  33. Results

  34. Splitting the Data

  35. Is splitting the data bad? • Yes. It gives the tagger less data to train on, which means worse results for the tagger. • Then why don't you just split away less data? • Then you have less data to train the combination system on. • Can't you get the input to the combination system from tagging the training data? • You can't run the tagger on the same data it was trained on. We are learning the reliability of the tagger, and this artificially inflates the reliability. • Once you train the combination system on the results from the taggers trained on the split data, can you test it with the results from the taggers trained on the whole data set? • Technically no. The combination system learns the reliability of each tagger, and if you change the data the tagger is trained on, you change the reliability of that tagger. • Isn't the reliability of the taggers trained on the split data and the taggers trained on the whole data set close enough? • Let's find out!

  36. Weighted Voter • [Diagram, as on slide 10: the taggers' outputs on the held-out data are used to train the weighted voter]

  37. Weighted Voter • [Diagram, as on slide 11: at test time, the taggers' outputs feed into the trained weighted voter, which produces the final output]

  38. Does it work? • It slightly increases the weighted voter's accuracy • It decreases the Naïve Bayes accuracy

  39. Why? • The accuracy for the taggers goes up, so the overall accuracy goes up. • The combination systems are learning the reliability of a tagger, and the taggers were changed. This decreases the ability to predict the right reliability, so accuracy goes down. • Naïve Bayes is more sensitive to a change in taggers than the weighted voter.

  40. If we can use the output from the tagger trained on one set of data for the combination system and the output from the tagger trained on another set of data to test it, then it doesn't matter how we split the data. • [Diagram: the training data is divided into portions, and a tagger trained on part of the data produces output for each portion]

  41. We can then combine the output of these taggers trained on the different portions of the training data. • [Diagram: three taggers, each trained on a different portion of the training data, each produce output]

  42. We can then combine the output of these taggers trained on the different portions of the training data. • [Diagram: the outputs from the taggers trained on the different portions are combined]

  43. Each of these segments is the result from the taggers being tested on unseen data, but together they give you how the tagger would have tagged the entire data set if it had never seen it. • [Diagram: the per-segment outputs are concatenated into one combined output]

  44. This gives you a lot more data to test the combination system on, increasing the accuracy of the combination system. • You could then split this data, making it possible to train the weights. • [Diagram: the concatenated outputs form the data for the combination system]
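A sketch of the splitting scheme from slides 40-44: train a tagger on all but one slice, tag the held-out slice, and stitch the slices back together so the combiner sees how each tagger behaves on data it has never seen, over the whole training set. The fold count and the train_tagger / tag interface are assumptions:

```python
def jackknife_outputs(sentences, train_tagger, n_folds=3):
    """For each fold, train a tagger on the other folds and tag the held-out
    fold. Concatenating the folds gives tagger output for the entire training
    set as if the tagger had never seen it."""
    size = len(sentences) // n_folds
    folds = [sentences[i * size:(i + 1) * size] for i in range(n_folds - 1)]
    folds.append(sentences[(n_folds - 1) * size:])
    combined = []
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        tagger = train_tagger(train)          # assumed training entry point
        combined.extend(tagger.tag(sent) for sent in held_out)
    return combined
```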

  45. Increasing training data increases accuracy, but not enough for the Naïve Bayes to recover from the loss of accuracy from using different data to train the tagger.

  46. So what can we do about it? • The loss of accuracy was because the reliability of the output from taggers trained on the split data was different from the output from the taggers trained on all of the data. • Take smaller slices • [Diagram: the training data divided into smaller slices]

  47. So what can we do about it? • The difference between the split and unsplit training data is smaller, so the taggers should be more similar, helping the combination systems correctly predict the reliability of the tagger. • [Diagram: the training data divided into smaller slices]

  48. So can we do it? • Taking smaller slices is very expensive, especially for the TBL. • If the tagger were retractable, we might be able to produce the training data without having to rerun the system several times.

  49. Trigram Model • Train the Trigram model on the whole training data • For each sentence in the training data, calculate what the probabilities would have been if the tagger were not trained on that sentence. • Tag the sentence based on the new probabilities.
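A sketch of this "retract a sentence" idea: keep the counts collected over the whole training set and, for each sentence, subtract its own counts before computing that sentence's probabilities (the count structures and names are assumptions):

```python
from collections import Counter

def leave_one_out_counts(total_counts, sentence_counts):
    """Counts the trigram model would have had if this sentence had been
    left out of training: subtract the sentence's own n-gram counts from
    the counts over the full training set."""
    adjusted = Counter(total_counts)
    adjusted.subtract(sentence_counts)
    return adjusted
```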

  50. Conclusions • Combination methods • A simple method like Voting works surprisingly well • Naïve Bayes needs a lot of training data to show improvement, but when it does the difference is substantial • New methods to improve the basic voting show consistently better results than voting by itself • Data preparation • The weighted voting method was further improved by more intelligent splitting of the data • Applying the new splitting techniques to Naïve Bayes needs some investigation to see if there is any improvement
