
Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages



Presentation Transcript


  1. Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013

  2. Semi-Supervised Training An HMM trained with Expectation-Maximization (EM). Needed: a large raw corpus and a tag dictionary [Kupiec, 1992; Merialdo, 1994]. (A sketch of one EM iteration follows.)
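
For concreteness, here is a minimal sketch of one EM iteration for a tag-dictionary-constrained HMM tagger. It brute-forces the dictionary-licensed tag paths instead of running forward-backward, and the toy corpus, dictionary, and uniform initialization are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch (not the paper's implementation) of one EM iteration
# for an HMM tagger constrained by a tag dictionary.
from collections import defaultdict
import itertools

raw_corpus = [["the", "dog", "walks"], ["the", "dog", "barks"]]
tag_dict = {"the": {"DT"}, "dog": {"NN", "VB"},
            "walks": {"VBZ"}, "barks": {"VBZ"}}
tags = sorted({t for ts in tag_dict.values() for t in ts})

# Uniform initial parameters: P(tag | prev_tag) and P(word | tag).
trans = defaultdict(lambda: 1.0 / len(tags))
emit = {t: defaultdict(lambda: 1.0 / len(tag_dict)) for t in tags}

def em_iteration():
    """One E-step over the raw corpus, returning expected counts."""
    tcount, ecount = defaultdict(float), defaultdict(float)
    for sent in raw_corpus:
        # Enumerate every tag path the dictionary licenses (brute force
        # stands in for the forward-backward pass a real trainer uses).
        paths = list(itertools.product(*(sorted(tag_dict[w]) for w in sent)))
        scores = []
        for path in paths:
            p, prev = 1.0, "<s>"
            for w, t in zip(sent, path):
                p *= trans[(prev, t)] * emit[t][w]
                prev = t
            scores.append(p)
        z = sum(scores)
        for path, s in zip(paths, scores):   # accumulate fractional counts
            gamma, prev = s / z, "<s>"
            for w, t in zip(sent, path):
                tcount[(prev, t)] += gamma
                ecount[(t, w)] += gamma
                prev = t
    return tcount, ecount

tcount, ecount = em_iteration()
# An M-step would renormalize these counts into new trans/emit tables.
print(max(ecount, key=ecount.get))   # most-expected (tag, word) pair
```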

  3. Previous Works: • Supervised Learning • Provides high accuracy for POS tagging (Manning, 2011). • Performs poorly when little supervision is available. • Semi-Supervised • Done by training sequence models such as HMMs with the EM algorithm. • Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).

  4. Previous Works: • Goldberg et al. (2008) • Used a manually constructed lexicon for Hebrew to train an HMM tagger. • The lexicon was developed over a long period of time by expert lexicographers. • Täckström et al. (2013) • Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages. • Requires large parallel corpora.

  5. Low-Resource Languages There are ~6,900 languages in the world; only ~30 have non-negligible quantities of data, and there is no million-word corpus for any endangered language [Maxwell and Hughes, 2006; Abney and Bird, 2010].

  6. Low-Resource Languages Kinyarwanda (KIN): Niger-Congo; morphologically rich. Malagasy (MLG): Austronesian; spoken in Madagascar. Also, English.

  7. Collecting Annotations • Supervised training is not an option. • Semi-supervised training: annotate some data by hand in 4 hours (in 30-minute intervals), for two tasks: • Type supervision. • Token supervision. (Hypothetical examples of both formats follow.)
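
To make the two supervision types concrete, here is a hypothetical example of what each kind of annotation might look like as data; the words and tags are invented for illustration, not taken from the paper's annotation sets.

```python
# Hypothetical examples of the two annotation formats (illustrative only).
type_annotations = {          # word type -> set of permitted POS tags
    "the": {"DT"},
    "dog": {"NN"},
    "walks": {"NN", "VBZ"},
}
token_annotations = [         # fully tagged sentences
    [("the", "DT"), ("dog", "NN"), ("walks", "VBZ")],
]
```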

  8. Tag Dict Generalization These annotations are too sparse! Generalize to the entire vocabulary

  9. Tag Dict Generalization Haghighi and Klein (2006) do this with a vector space, but we don't have enough raw data. Das and Petrov (2011) do this with a parallel corpus, but we don't have a parallel corpus.

  10. Tag Dict Generalization Strategy: Label Propagation • Connect annotations to raw corpus tokens • Push tag labels to the entire corpus [Talukdar and Crammer, 2009] (toy sketch below)
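
A toy sketch of the propagation idea, using plain iterative neighborhood averaging rather than the Modified Adsorption algorithm of Talukdar and Crammer (2009) that the paper uses; the node names and edges are illustrative, echoing the feature types on the later slides.

```python
# Minimal label propagation: seed nodes stay clamped, everything else
# averages its neighbors' tag distributions. Graph is invented for
# illustration; the real system uses Modified Adsorption.
edges = {
    ("TOK_the_1", "TYPE_the"), ("TOK_the_4", "TYPE_the"),
    ("TOK_dog_2", "TYPE_dog"), ("TOK_thug_5", "TYPE_thug"),
    ("TYPE_dog", "SUF1_g"), ("TYPE_thug", "SUF1_g"),
}
seeds = {"TYPE_the": {"DT": 1.0}, "TYPE_dog": {"NN": 1.0}}  # annotations

nodes = {n for e in edges for n in e}
nbrs = {n: [m for e in edges for m in e if n in e and m != n] for n in nodes}
labels = {n: dict(seeds.get(n, {})) for n in nodes}

for _ in range(10):                      # fixed number of sweeps
    new = {}
    for n in nodes:
        if n in seeds:                   # clamp annotated nodes
            new[n] = dict(seeds[n])
            continue
        agg = {}
        for m in nbrs[n]:
            for tag, w in labels[m].items():
                agg[tag] = agg.get(tag, 0.0) + w
        z = sum(agg.values())
        new[n] = {t: w / z for t, w in agg.items()} if z else {}
    labels = new

print(labels["TOK_thug_5"])  # thug picks up NN via the shared SUF1_g node
```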

  11. Morphological Transducers • Finite-state transducers are used for morphological analysis. • An FST accepts a word type and produces a set of morphological features. • Power of FSTs: analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word (sketch below).
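
A sketch of that OOV-guessing idea in ordinary Python rather than an actual FST; the suffix table, feature names, and lexicon here are invented for illustration.

```python
# Sketch of the OOV idea behind the FST: strip a known suffix to guess
# a stem and its morphological features. Tables are illustrative only;
# a real analyzer would be a hand-built finite-state transducer.
KNOWN_SUFFIXES = {"s": ["+PL"], "ed": ["+PAST"], "ing": ["+PROG"]}

def analyze(word, lexicon):
    if word in lexicon:                        # known stem: direct lookup
        return [(word, lexicon[word])]
    analyses = []
    for suf, feats in KNOWN_SUFFIXES.items():  # OOV: peel a known suffix
        if word.endswith(suf) and len(word) > len(suf):
            stem = word[: -len(suf)]
            analyses.append((stem, lexicon.get(stem, ["+Guess"]) + feats))
    return analyses

lexicon = {"walk": ["+V"]}
print(analyze("walked", lexicon))   # [('walk', ['+V', '+PAST'])]
print(analyze("zorped", lexicon))   # guessed stem 'zorp' with +PAST
```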

  12. Tag Dict Generalization [Figure: the label-propagation graph. Token nodes (TOK_the_1, TOK_the_4, TOK_the_9, TOK_dog_2, TOK_thug_5) connect to type nodes (TYPE_the, TYPE_dog, TYPE_thug), context nodes (PREV_<b>, PREV_the, NEXT_walks, NEXT_thug), and affix nodes (PRE1_t, PRE2_th, SUF1_e, PRE1_d, PRE2_do, SUF1_g).]

  13. Tag Dict Generalization [Figure: type annotations (the/DT, dog/NN) injected at the TYPE_the and TYPE_dog nodes of the graph.]

  14. Tag Dict Generalization [Figure: the type-annotation labels propagating from the annotated TYPE nodes into neighboring nodes.]

  15. Tag Dict Generalization [Figure: token annotations from the tagged sentence the/DT dog/NN walks/VBZ added alongside the type annotations.]

  16. Tag Dict Generalization [Figure: the token-annotation labels propagating into the TOK nodes of the graph.]
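
The node inventory in the figures above can be generated with a small feature extractor. This is a hypothetical reconstruction: the function name and the exact feature set (1- and 2-character affixes, previous/next word) are assumptions read off the slides.

```python
# Sketch: build graph edges linking each token to its type, context,
# and affix nodes, matching the node names shown on the slides.
def graph_edges(sentence):
    edges = []
    padded = ["<b>"] + sentence + ["<b>"]   # <b> marks sentence boundaries
    for i, word in enumerate(sentence, start=1):
        tok = f"TOK_{word}_{i}"
        edges += [(tok, f"TYPE_{word}"),
                  (tok, f"PREV_{padded[i - 1]}"),
                  (tok, f"NEXT_{padded[i + 1]}")]
        edges += [(f"TYPE_{word}", f"PRE1_{word[:1]}"),
                  (f"TYPE_{word}", f"PRE2_{word[:2]}"),
                  (f"TYPE_{word}", f"SUF1_{word[-1]}")]
    return edges

for edge in graph_edges(["the", "dog", "walks"]):
    print(edge)
```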

  17. Model Minimization • The LP graph has a node for each corpus token. • Each node is labelled with a distribution over POS tags. • The graph thus provides a corpus of sentences labelled with noisy tag distributions. • Greedily seek the minimal set of tag bigrams that describes the raw corpus (greedy sketch below). • Then train an HMM with EM. [Ravi et al., 2010; Garrette and Baldridge, 2012]
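
A toy version of the greedy selection step: repeatedly pick the tag bigram that covers the most still-uncovered adjacent token pairs. This is set-cover in spirit only; it omits the path-connectivity bookkeeping of Ravi et al. (2010), and the candidate-tag sets are invented.

```python
# Greedy sketch of model minimization: choose tag bigrams until every
# adjacent token pair in the corpus is covered by some chosen bigram.
from itertools import product

# Each token carries candidate tags (e.g., from label propagation).
corpus = [
    [("the", {"DT"}), ("dog", {"NN", "VB"}), ("walks", {"VBZ", "NNS"})],
    [("the", {"DT"}), ("thug", {"NN"}), ("walks", {"VBZ", "NNS"})],
]

# All adjacent positions, each with the bigrams that could cover it.
slots = []
for s, sent in enumerate(corpus):
    for i in range(len(sent) - 1):
        options = set(product(sent[i][1], sent[i + 1][1]))
        slots.append(((s, i), options))

chosen, uncovered = set(), {pos for pos, _ in slots}
while uncovered:
    # Greedy step: the bigram covering the most uncovered slots.
    best = max(
        {bg for _, opts in slots for bg in opts},
        key=lambda bg: sum(1 for pos, opts in slots
                           if pos in uncovered and bg in opts),
    )
    chosen.add(best)
    uncovered -= {pos for pos, opts in slots if best in opts}

print(sorted(chosen))   # e.g., [('DT', 'NN'), ('NN', 'VBZ')]
```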

  18. Overall Accuracy All of these values were achieved using both FST and affix LP features.

  19. Results

  20. Types versus Tokens

  21. Mixing Type and Token Annotations

  22. Morphological Analysis

  23. Annotator Experience

  24. Conclusion • Type annotations are the most useful input from a linguist. • We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a non-native-speaker linguist.
