
ML for NLP: Ling 572



  1. ML for NLP: Ling 572 Achim Ruopp Zhengbo Zhou Albert Bertram

  2. Outline • Tagging Introduction • Blah, blah, blah, you did this too • And now for something possibly different • Results • Comments • Questions • Snide remarks

  3. Tagging introduction • Tagging is pretty simple • Relatively good results tagging with: • tag(w) = argmax_{t ∈ Tagset} P(t | w) • The problem is well understood • Algorithm changes can be clearly visible
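  In code, this baseline amounts to picking each word's most frequent training tag. A minimal Python sketch (the function names and the NN back-off are illustrative, not from the slides):

    from collections import Counter, defaultdict

    def train_baseline(tagged_sents):
        """Count (word, tag) pairs and keep each word's most frequent tag."""
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_sentence(model, words, default="NN"):
        """argmax_t P(t | w) from training counts; back off to a default tag."""
        return [(w, model.get(w, default)) for w in words]

    model = train_baseline([[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]])
    print(tag_sentence(model, ["the", "dog", "runs"]))
    # -> [('the', 'DT'), ('dog', 'NN'), ('runs', 'NN')]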

  4. Preaching to the Choir • You all did this too: • P1: HMM/FST trigram tagger • P2: TBL tagger • P3: Maximum Entropy tagger
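  For a flavor of what P1's decoder does, here is a minimal bigram Viterbi sketch; the actual project used trigrams, and log_trans/log_emit stand in for whatever smoothed log-probability estimates the tagger learned:

    def viterbi(words, tags, log_trans, log_emit):
        """Bigram Viterbi decoding (a simplified stand-in for P1's trigram tagger).

        log_trans(prev_tag, tag) and log_emit(tag, word) are assumed to be
        smoothed log-probability functions estimated from the tagged data.
        """
        # Scores for the first word, transitioning out of the start symbol.
        best = {t: log_trans("<s>", t) + log_emit(t, words[0]) for t in tags}
        back = []
        for w in words[1:]:
            prev_best, best, ptr = best, {}, {}
            for t in tags:
                p, score = max(((p, prev_best[p] + log_trans(p, t)) for p in prev_best),
                               key=lambda x: x[1])
                best[t] = score + log_emit(t, w)
                ptr[t] = p
            back.append(ptr)
        # Follow back-pointers from the best final tag.
        t = max(best, key=best.get)
        seq = [t]
        for ptr in reversed(back):
            t = ptr[t]
            seq.append(t)
        return list(reversed(seq))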

  5. Forking Processes • Project 4 offered some choices • Self-training FTW

  6. General Algorithm • Initialize: • Tagged data T • Untagged data U • Loop: • Train a tagger M on T • Use M to add tags to U, creating U’ • Move the best sentences from U’ into T • Replace U with the rest of U’ • End when you’re satisfied
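  The loop translates almost directly into Python. A sketch, with train, tag_all, and batch_size as placeholders for the actual tagger, the tagging-with-confidence step, and the promotion policy (slide 7 pins down the last two):

    def self_train(T, U, train, tag_all, batch_size):
        """Self-training loop from this slide.

        train(T)      -> tagger M trained on the tagged sentences T
        tag_all(M, U) -> [(sentence, tags, confidence), ...] over untagged U
        batch_size(T) -> how many auto-tagged sentences to promote each round
        """
        T, U = list(T), list(U)
        while U:                                  # slide 7's criterion: |U| == 0
            M = train(T)
            U_prime = sorted(tag_all(M, U),       # tag U, creating U'
                             key=lambda x: x[2], reverse=True)
            n = batch_size(T)
            T += [(s, tags) for s, tags, _ in U_prime[:n]]  # best sentences -> T
            U = [s for s, _, _ in U_prime[n:]]    # the rest replaces U
        return train(T)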

  7. Experimental Details • For i=1,5 • Initialize T with i*1000 sentences • For j=15,25,35 • Initialize U with j * 1000 sentences • Do self training • “Best” of U’ is defined as the best n • Where n = 20% of |T| • Satisfied when |U| == 0
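  Putting slides 6 and 7 together, the experimental grid looks roughly like this (load_tagged, load_untagged, train_tagger, and tag_with_confidence are hypothetical stand-ins for the actual data loaders and tagger):

    for i in range(1, 6):                           # i = 1..5
        T0 = load_tagged(i * 1000)                  # seed: i*1000 tagged sentences
        for j in (15, 25, 35):
            U0 = load_untagged(j * 1000)            # j*1000 untagged sentences
            model = self_train(
                T0, U0,
                train=train_tagger,                 # e.g. the trigram tagger
                tag_all=tag_with_confidence,
                batch_size=lambda T: max(1, len(T) // 5),  # best n = 20% of |T|
            )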

  8. Results (I): Graph

  9. Results (II): Zoomed in

  10. Results (III): Projects 1-3 • Probably not much new here: • You all did projects 1-3 • That outlier, though: TBL with simple unknown-word treatment • The simpler approach to unknown-word handling has much worse results

  11. Results (IV): Project 4 • Overall, more than 1% improvement • Not spectacular • Better than a poke in the eye with a sharp stick • Probably works better in a domain with fewer degrees of freedom • Then again, unlabelled data is cheap • wget -r http://www.gutenberg.org/

  12. Further Considerations • Not much to say about P1-3, but for P4 • We used a trigram tagger • Higher probabilities for shorter sentences • May want a quality measure rather than this • Unknown words are problematic • Try another tagger?
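  One standard quality measure that removes the short-sentence bias is the average per-token log probability; a sketch, assuming a sentence_logprob hook into the tagger:

    def confidence(sentence, sentence_logprob):
        """Length-normalized quality measure: mean log-probability per token,
        so short sentences don't automatically outrank long ones."""
        return sentence_logprob(sentence) / max(1, len(sentence))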

  13. Done Done Done • Questions? • Comments. • Snide remarks. • Chuck Norris jokes.

  14. For the Morbidly Curious Project 1 tagging accuracy

  15. For the Morbidly Curious Project 2 tagging accuracy

  16. For the Morbidly Curious Project 3 tagging accuracy

  17. For the Morbidly Curious Project 4 tagging accuracy/number of sentences added
