
Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages



Presentation Transcript


  1. Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013

  2. Semi-Supervised Training An HMM trained with Expectation-Maximization (EM). Needed: a large raw corpus and a tag dictionary [Kupiec, 1992; Merialdo, 1994]. (A sketch of one EM iteration follows.)
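
For concreteness, here is a minimal sketch of one EM iteration for a tag-dictionary-constrained HMM tagger. It brute-forces the dictionary-licensed tag paths instead of running forward-backward, and the toy corpus, dictionary, and uniform initialization are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch (not the paper's implementation) of one EM iteration
# for an HMM tagger constrained by a tag dictionary.
from collections import defaultdict
import itertools

raw_corpus = [["the", "dog", "walks"], ["the", "dog", "barks"]]
tag_dict = {"the": {"DT"}, "dog": {"NN", "VB"},
            "walks": {"VBZ"}, "barks": {"VBZ"}}
tags = sorted({t for ts in tag_dict.values() for t in ts})

# Uniform initial parameters: P(tag | prev_tag) and P(word | tag).
trans = defaultdict(lambda: 1.0 / len(tags))
emit = {t: defaultdict(lambda: 1.0 / len(tag_dict)) for t in tags}

def em_iteration():
    """One E-step over the raw corpus, returning expected counts."""
    tcount, ecount = defaultdict(float), defaultdict(float)
    for sent in raw_corpus:
        # Enumerate every tag path the dictionary licenses (brute force
        # stands in for the forward-backward pass a real trainer uses).
        paths = list(itertools.product(*(sorted(tag_dict[w]) for w in sent)))
        scores = []
        for path in paths:
            p, prev = 1.0, "<s>"
            for w, t in zip(sent, path):
                p *= trans[(prev, t)] * emit[t][w]
                prev = t
            scores.append(p)
        z = sum(scores)
        for path, s in zip(paths, scores):   # accumulate fractional counts
            gamma, prev = s / z, "<s>"
            for w, t in zip(sent, path):
                tcount[(prev, t)] += gamma
                ecount[(t, w)] += gamma
                prev = t
    return tcount, ecount

tcount, ecount = em_iteration()
# An M-step would renormalize these counts into new trans/emit tables.
print(max(ecount, key=ecount.get))   # most-expected (tag, word) pair
```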

  3. Previous Works: • Supervised Learning • Provides high accuracy for POS tagging (Manning, 2011). • Performs poorly when little supervision is available. • Semi-Supervised • Done by training sequence models such as HMMs with the EM algorithm. • Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).

  4. Previous Works: • Goldberg et al. (2008) • Used a manually constructed lexicon for Hebrew to train an HMM tagger. • The lexicon was developed over a long period of time by expert lexicographers. • Täckström et al. (2013) • Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages. • Requires large parallel corpora.

  5. Low-Resource Languages There are ~6,900 languages in the world; only ~30 have non-negligible quantities of data, and there is no million-word corpus for any endangered language [Maxwell and Hughes, 2006; Abney and Bird, 2010].

  6. Low-Resource Languages Kinyarwanda (KIN): Niger-Congo; morphologically rich. Malagasy (MLG): Austronesian; spoken in Madagascar. Also, English.

  7. Collecting Annotations • Supervised training is not an option. • Semi-supervised training: annotate some data by hand in 4 hours (in 30-minute intervals), for two tasks: • Type supervision. • Token supervision. (Hypothetical examples of both formats follow.)
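
To make the two supervision types concrete, here is a hypothetical example of what each kind of annotation might look like as data; the words and tags are invented for illustration, not taken from the paper's annotation sets.

```python
# Hypothetical examples of the two annotation formats (illustrative only).
type_annotations = {          # word type -> set of permitted POS tags
    "the": {"DT"},
    "dog": {"NN"},
    "walks": {"NN", "VBZ"},
}
token_annotations = [         # fully tagged sentences
    [("the", "DT"), ("dog", "NN"), ("walks", "VBZ")],
]
```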

  8. Tag Dict Generalization These annotations are too sparse! Generalize to the entire vocabulary

  9. Tag Dict Generalization Haghighi and Klein (2006) do this with a vector space, but we don't have enough raw data. Das and Petrov (2011) do this with a parallel corpus, but we don't have a parallel corpus.

  10. Tag Dict Generalization Strategy: Label Propagation • Connect annotations to raw corpus tokens • Push tag labels to the entire corpus [Talukdar and Crammer, 2009] (toy sketch below)
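
A toy sketch of the propagation idea, using plain iterative neighborhood averaging rather than the Modified Adsorption algorithm of Talukdar and Crammer (2009) that the paper uses; the node names and edges are illustrative, echoing the feature types on the later slides.

```python
# Minimal label propagation: seed nodes stay clamped, everything else
# averages its neighbors' tag distributions. Graph is invented for
# illustration; the real system uses Modified Adsorption.
edges = {
    ("TOK_the_1", "TYPE_the"), ("TOK_the_4", "TYPE_the"),
    ("TOK_dog_2", "TYPE_dog"), ("TOK_thug_5", "TYPE_thug"),
    ("TYPE_dog", "SUF1_g"), ("TYPE_thug", "SUF1_g"),
}
seeds = {"TYPE_the": {"DT": 1.0}, "TYPE_dog": {"NN": 1.0}}  # annotations

nodes = {n for e in edges for n in e}
nbrs = {n: [m for e in edges for m in e if n in e and m != n] for n in nodes}
labels = {n: dict(seeds.get(n, {})) for n in nodes}

for _ in range(10):                      # fixed number of sweeps
    new = {}
    for n in nodes:
        if n in seeds:                   # clamp annotated nodes
            new[n] = dict(seeds[n])
            continue
        agg = {}
        for m in nbrs[n]:
            for tag, w in labels[m].items():
                agg[tag] = agg.get(tag, 0.0) + w
        z = sum(agg.values())
        new[n] = {t: w / z for t, w in agg.items()} if z else {}
    labels = new

print(labels["TOK_thug_5"])  # thug picks up NN via the shared SUF1_g node
```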

  11. Morphological Transducers • Finite-state transducers are used for morphological analysis. • An FST accepts a word type and produces a set of morphological features. • Power of FSTs: analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word (sketch below).
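
A sketch of that OOV-guessing idea in ordinary Python rather than an actual FST; the suffix table, feature names, and lexicon here are invented for illustration.

```python
# Sketch of the OOV idea behind the FST: strip a known suffix to guess
# a stem and its morphological features. Tables are illustrative only;
# a real analyzer would be a hand-built finite-state transducer.
KNOWN_SUFFIXES = {"s": ["+PL"], "ed": ["+PAST"], "ing": ["+PROG"]}

def analyze(word, lexicon):
    if word in lexicon:                        # known stem: direct lookup
        return [(word, lexicon[word])]
    analyses = []
    for suf, feats in KNOWN_SUFFIXES.items():  # OOV: peel a known suffix
        if word.endswith(suf) and len(word) > len(suf):
            stem = word[: -len(suf)]
            analyses.append((stem, lexicon.get(stem, ["+Guess"]) + feats))
    return analyses

lexicon = {"walk": ["+V"]}
print(analyze("walked", lexicon))   # [('walk', ['+V', '+PAST'])]
print(analyze("zorped", lexicon))   # guessed stem 'zorp' with +PAST
```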

  12. Tag Dict Generalization [Figure: the label-propagation graph. Token nodes (TOK_the_1, TOK_the_4, TOK_the_9, TOK_dog_2, TOK_thug_5) connect to type nodes (TYPE_the, TYPE_dog, TYPE_thug), context nodes (PREV_<b>, PREV_the, NEXT_walks, NEXT_thug), and affix nodes (PRE1_t, PRE2_th, SUF1_e, PRE1_d, PRE2_do, SUF1_g).]

  13. Tag Dict Generalization [Figure: type annotations (the/DT, dog/NN) injected at the TYPE_the and TYPE_dog nodes of the graph.]

  14. Tag Dict Generalization [Figure: the type-annotation labels propagating from the annotated TYPE nodes into neighboring nodes.]

  15. Tag Dict Generalization [Figure: token annotations from the tagged sentence the/DT dog/NN walks/VBZ added alongside the type annotations.]

  16. Tag Dict Generalization [Figure: the token-annotation labels propagating into the TOK nodes of the graph.]
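
The node inventory in the figures above can be generated with a small feature extractor. This is a hypothetical reconstruction: the function name and the exact feature set (1- and 2-character affixes, previous/next word) are assumptions read off the slides.

```python
# Sketch: build graph edges linking each token to its type, context,
# and affix nodes, matching the node names shown on the slides.
def graph_edges(sentence):
    edges = []
    padded = ["<b>"] + sentence + ["<b>"]   # <b> marks sentence boundaries
    for i, word in enumerate(sentence, start=1):
        tok = f"TOK_{word}_{i}"
        edges += [(tok, f"TYPE_{word}"),
                  (tok, f"PREV_{padded[i - 1]}"),
                  (tok, f"NEXT_{padded[i + 1]}")]
        edges += [(f"TYPE_{word}", f"PRE1_{word[:1]}"),
                  (f"TYPE_{word}", f"PRE2_{word[:2]}"),
                  (f"TYPE_{word}", f"SUF1_{word[-1]}")]
    return edges

for edge in graph_edges(["the", "dog", "walks"]):
    print(edge)
```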

  17. Model Minimization • The LP graph has a node for each corpus token. • Each node is labelled with a distribution over POS tags. • The graph thus provides a corpus of sentences labelled with noisy tag distributions. • Greedily seek the minimal set of tag bigrams that describes the raw corpus (greedy sketch below). • Then train an HMM with EM. [Ravi et al., 2010; Garrette and Baldridge, 2012]
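
A toy version of the greedy selection step: repeatedly pick the tag bigram that covers the most still-uncovered adjacent token pairs. This is set-cover in spirit only; it omits the path-connectivity bookkeeping of Ravi et al. (2010), and the candidate-tag sets are invented.

```python
# Greedy sketch of model minimization: choose tag bigrams until every
# adjacent token pair in the corpus is covered by some chosen bigram.
from itertools import product

# Each token carries candidate tags (e.g., from label propagation).
corpus = [
    [("the", {"DT"}), ("dog", {"NN", "VB"}), ("walks", {"VBZ", "NNS"})],
    [("the", {"DT"}), ("thug", {"NN"}), ("walks", {"VBZ", "NNS"})],
]

# All adjacent positions, each with the bigrams that could cover it.
slots = []
for s, sent in enumerate(corpus):
    for i in range(len(sent) - 1):
        options = set(product(sent[i][1], sent[i + 1][1]))
        slots.append(((s, i), options))

chosen, uncovered = set(), {pos for pos, _ in slots}
while uncovered:
    # Greedy step: the bigram covering the most uncovered slots.
    best = max(
        {bg for _, opts in slots for bg in opts},
        key=lambda bg: sum(1 for pos, opts in slots
                           if pos in uncovered and bg in opts),
    )
    chosen.add(best)
    uncovered -= {pos for pos, opts in slots if best in opts}

print(sorted(chosen))   # e.g., [('DT', 'NN'), ('NN', 'VBZ')]
```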

  18. Overall Accuracy All of these values were achieved using both FST and affix LP features.

  19. Results

  20. Types versus Tokens

  21. Mixing Type and Token Annotations

  22. Morphological Analysis

  23. Annotator Experience

  24. Conclusion • Type annotations are the most useful input from a linguist. • We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a non-native-speaker linguist.
