
ML for NLP: Ling 572



  1. ML for NLP: Ling 572 Achim Ruopp Zhengbo Zhou Albert Bertram

  2. Outline • Tagging Introduction • Blah, blah, blah, you did this too • And now for something possibly different • Results • Comments • Questions • Snide remarks

  3. Tagging introduction • Tagging is pretty simple • Relatively good results tagging with: • tag(w) = argmax_{t ∈ Tagset} P(t | w) • The problem is well understood • Algorithm changes can be clearly visible
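  In code, this baseline amounts to picking each word's most frequent training tag. A minimal Python sketch (the function names and the NN back-off are illustrative, not from the slides):

    from collections import Counter, defaultdict

    def train_baseline(tagged_sents):
        """Count (word, tag) pairs and keep each word's most frequent tag."""
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_sentence(model, words, default="NN"):
        """argmax_t P(t | w) from training counts; back off to a default tag."""
        return [(w, model.get(w, default)) for w in words]

    model = train_baseline([[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]])
    print(tag_sentence(model, ["the", "dog", "runs"]))
    # -> [('the', 'DT'), ('dog', 'NN'), ('runs', 'NN')]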

  4. Preaching to the Choir • You all did this too: • P1: HMM/FST trigram tagger • P2: TBL tagger • P3: Maximum Entropy tagger
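  For a flavor of what P1's decoder does, here is a minimal bigram Viterbi sketch; the actual project used trigrams, and log_trans/log_emit stand in for whatever smoothed log-probability estimates the tagger learned:

    def viterbi(words, tags, log_trans, log_emit):
        """Bigram Viterbi decoding (a simplified stand-in for P1's trigram tagger).

        log_trans(prev_tag, tag) and log_emit(tag, word) are assumed to be
        smoothed log-probability functions estimated from the tagged data.
        """
        # Scores for the first word, transitioning out of the start symbol.
        best = {t: log_trans("<s>", t) + log_emit(t, words[0]) for t in tags}
        back = []
        for w in words[1:]:
            prev_best, best, ptr = best, {}, {}
            for t in tags:
                p, score = max(((p, prev_best[p] + log_trans(p, t)) for p in prev_best),
                               key=lambda x: x[1])
                best[t] = score + log_emit(t, w)
                ptr[t] = p
            back.append(ptr)
        # Follow back-pointers from the best final tag.
        t = max(best, key=best.get)
        seq = [t]
        for ptr in reversed(back):
            t = ptr[t]
            seq.append(t)
        return list(reversed(seq))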

  5. Forking Processes • Project 4 offered some choices • Self-training FTW

  6. General Algorithm • Initialize: • Tagged data T • Untagged data U • Loop: • Train a tagger M on T • Use M to add tags to U, creating U’ • Move the best sentences from U’ into T • Replace U with the rest of U’ • End when you’re satisfied
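  The loop translates almost directly into Python. A sketch, with train, tag_all, and batch_size as placeholders for the actual tagger, the tagging-with-confidence step, and the promotion policy (slide 7 pins down the last two):

    def self_train(T, U, train, tag_all, batch_size):
        """Self-training loop from this slide.

        train(T)      -> tagger M trained on the tagged sentences T
        tag_all(M, U) -> [(sentence, tags, confidence), ...] over untagged U
        batch_size(T) -> how many auto-tagged sentences to promote each round
        """
        T, U = list(T), list(U)
        while U:                                  # slide 7's criterion: |U| == 0
            M = train(T)
            U_prime = sorted(tag_all(M, U),       # tag U, creating U'
                             key=lambda x: x[2], reverse=True)
            n = batch_size(T)
            T += [(s, tags) for s, tags, _ in U_prime[:n]]  # best sentences -> T
            U = [s for s, _, _ in U_prime[n:]]    # the rest replaces U
        return train(T)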

  7. Experimental Details • For i=1,5 • Initialize T with i*1000 sentences • For j=15,25,35 • Initialize U with j * 1000 sentences • Do self training • “Best” of U’ is defined as the best n • Where n = 20% of |T| • Satisfied when |U| == 0
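  Putting slides 6 and 7 together, the experimental grid looks roughly like this (load_tagged, load_untagged, train_tagger, and tag_with_confidence are hypothetical stand-ins for the actual data loaders and tagger):

    for i in range(1, 6):                           # i = 1..5
        T0 = load_tagged(i * 1000)                  # seed: i*1000 tagged sentences
        for j in (15, 25, 35):
            U0 = load_untagged(j * 1000)            # j*1000 untagged sentences
            model = self_train(
                T0, U0,
                train=train_tagger,                 # e.g. the trigram tagger
                tag_all=tag_with_confidence,
                batch_size=lambda T: max(1, len(T) // 5),  # best n = 20% of |T|
            )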

  8. Results (I): Graph

  9. Results (II): Zoomed in

  10. Results (III): Projects 1-3 • Probably not much new here: • You all did projects 1-3 • That outlier, though: TBL with simple unknown-word treatment • The simpler approach to unknown-word handling has much worse results

  11. Results (IV): Project 4 • Overall, more than 1% improvement • Not spectacular • Better than a poke in the eye with a sharp stick • Probably works better in a domain with fewer degrees of freedom • Then again, unlabelled data is cheap • wget -r http://www.gutenberg.org/

  12. Further Considerations • Not much to say about P1-3, but for P4 • We used a trigram tagger • Higher probabilities for shorter sentences • May want a quality measure rather than this • Unknown words are problematic • Try another tagger?
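  One standard quality measure that removes the short-sentence bias is the average per-token log probability; a sketch, assuming a sentence_logprob hook into the tagger:

    def confidence(sentence, sentence_logprob):
        """Length-normalized quality measure: mean log-probability per token,
        so short sentences don't automatically outrank long ones."""
        return sentence_logprob(sentence) / max(1, len(sentence))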

  13. Done Done Done • Questions? • Comments. • Snide remarks. • Chuck Norris jokes.

  14. For the Morbidly Curious Project 1 tagging accuracy

  15. For the Morbidly Curious Project 2 tagging accuracy

  16. For the Morbidly Curious Project 3 tagging accuracy

  17. For the Morbidly Curious Project 4 tagging accuracy/number of sentences added
