Part-of-Speech Tagging Foundations of Statistical NLP, Chapter 10
Contents • Markov Model Taggers • Hidden Markov Model Taggers • Transformation-Based Learning of Tags • Tagging Accuracy and Uses of Taggers
Markov Model Taggers • Markov properties • Limited horizon • Time invariant
cf. Wh-extraction (Chomsky), a long-distance dependency that a limited-horizon model cannot capture:
a. Should Peter buy a book?
b. Which book should Peter buy?
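A compact way to state these two properties, with X_t denoting the tag (state) at position t (standard notation, not from the original slide):

P(X_{t+1} = t^k | X_1, ..., X_t) = P(X_{t+1} = t^k | X_t)        (limited horizon)
P(X_{t+1} = t^k | X_t) = P(X_2 = t^k | X_1) for all t            (time invariant)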
Markov Model Taggers • The probabilistic model • Finding the best tagging t_{1,n} for a sentence w_{1,n}, e.g. P(AT NN BEZ IN AT VB | The bear is on the move)
Assumptions • words are independent of each other • a word's identity depends only on its tag
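Under these assumptions the tagging problem factors into per-word emission and tag-transition terms; a standard formulation, in the t_{1,n}, w_{1,n} notation above, is:

t̂_{1,n} = argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} ∏_{i=1..n} P(w_i | t_i) · P(t_i | t_{i-1})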
Markov Model Taggers • Training (maximum likelihood estimates from a tagged corpus)
for all tags t^j do
  for all tags t^k do
    P(t^k | t^j) = C(t^j, t^k) / C(t^j)
  end
end
for all tags t^j do
  for all words w^l do
    P(w^l | t^j) = C(w^l : t^j) / C(t^j)
  end
end
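A minimal Python sketch of this counting step, assuming the tagged corpus is a list of sentences given as (word, tag) pairs; the function and variable names are illustrative, not from the slides:

    from collections import defaultdict

    def train_mm_tagger(tagged_sentences):
        """Estimate P(t^k | t^j) and P(w^l | t^j) by maximum likelihood counts."""
        tag_bigrams = defaultdict(int)   # C(t^j, t^k)
        tag_counts = defaultdict(int)    # C(t^j)
        emissions = defaultdict(int)     # C(w^l : t^j)
        for sentence in tagged_sentences:
            prev = "<s>"                 # sentence-start pseudo-tag
            tag_counts[prev] += 1
            for word, tag in sentence:
                tag_bigrams[(prev, tag)] += 1
                emissions[(word, tag)] += 1
                tag_counts[tag] += 1
                prev = tag
        trans = {(tj, tk): c / tag_counts[tj] for (tj, tk), c in tag_bigrams.items()}
        emit = {(w, tj): c / tag_counts[tj] for (w, tj), c in emissions.items()}
        return trans, emit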
Markov Model Taggers • Tagging (the Viterbi algorithm)
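A hedged sketch of Viterbi decoding for the bigram model, reusing the trans/emit dictionaries from the training sketch above; giving unseen transitions and emissions a small floor probability is a simplification here, not the chapter's smoothing method:

    def viterbi(words, tags, trans, emit, unk=1e-6):
        """Return the most probable tag sequence for `words` under a bigram model."""
        # delta[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t
        delta = [{t: trans.get(("<s>", t), unk) * emit.get((words[0], t), unk) for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            delta.append({})
            back.append({})
            for t in tags:
                prev_best, score = max(
                    ((p, delta[i - 1][p] * trans.get((p, t), unk)) for p in tags),
                    key=lambda x: x[1])
                delta[i][t] = score * emit.get((words[i], t), unk)
                back[i][t] = prev_best
        # read off the best final tag and follow the back-pointers
        last = max(delta[-1], key=delta[-1].get)
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))

In practice log probabilities are used to avoid underflow on long sentences; the multiplicative form above mirrors the equations for readability.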
Variations • Models for unknown words • 1. assuming that they can be any part of speech • 2. using morphological information (e.g. suffixes, capitalization) to infer the possible parts of speech, as sketched below
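An illustrative sketch of option 2; the particular suffix-to-tag associations are invented examples, not the chapter's model:

    def guess_tags(word, all_tags):
        """Heuristically restrict the candidate tags of an unknown word."""
        if word[0].isupper():
            return ["NP"]                 # capitalized: likely a proper noun
        if word.endswith("ing"):
            return ["VBG", "NN"]          # gerund or nominalization
        if word.endswith("ed"):
            return ["VBD", "VBN"]         # past tense or past participle
        if word.endswith("ly"):
            return ["RB"]                 # adverb
        return list(all_tags)             # fall back to option 1: any part of speech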
Variations • Trigram taggers • Interpolation • Variable Memory Markov Model (VMMM)
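Linear interpolation combines unigram, bigram, and trigram tag probabilities; a standard form (the λ weights sum to one and are estimated separately, e.g. on held-out data) is:

P(t_i | t_{i-2}, t_{i-1}) = λ_1 P_1(t_i) + λ_2 P_2(t_i | t_{i-1}) + λ_3 P_3(t_i | t_{i-2}, t_{i-1}),   with λ_1 + λ_2 + λ_3 = 1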
Variations • Smoothing (where K_l denotes the number of possible parts of speech of w_l) • Reversibility
Variations • Sequence vs. tag by tag
Time flies like an arrow.
a. NN VBZ RB AT NN.  P(.) = 0.01
b. NN NNS VB AT NN.  P(.) = 0.01
• in practice there is little difference in accuracy between maximizing the probability of the whole tag sequence and maximizing each tag individually
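The two objectives, in the notation used above:

sequence:   t̂_{1,n} = argmax_{t_{1,n}} P(t_{1,n} | w_{1,n})
tag by tag: t̂_i = argmax_{t_i} P(t_i | w_{1,n}) for each position i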
Hidden Markov Model Taggers • When we have no tagged training data • Initializing all parameters with dictionary information • Jelinek's method • Kupiec's method
Hidden Markov Model Taggers • Jelinek's method • initializing the HMM with the MLE for P(w^k | t^i) • assuming that words occur equally likely with each of their possible tags • T(w^j): the number of tags allowed for w^j
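One way to write this initialization (a reconstruction in the book's style of notation; treat the exact indexing as a sketch), where C(w^l) is the corpus frequency of w^l:

b_{j.l} = b*_{j.l} · C(w^l) / Σ_{w^m} b*_{j.m} · C(w^m)
b*_{j.l} = 0 if t^j is not an allowed tag for w^l, and b*_{j.l} = 1 / T(w^l) otherwise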
Hidden Markov Model Taggers • Kupiec's method • grouping all words with the same set of possible parts of speech into 'metawords' u_L • so that parameters are not fine-tuned for each individual word
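A minimal sketch of the metaword grouping, assuming a lexicon that maps each word to its set of admissible tags (the names here are illustrative):

    from collections import defaultdict

    def build_metawords(lexicon):
        """Group words sharing the same set of admissible tags into one metaword u_L."""
        groups = defaultdict(list)
        for word, tags in lexicon.items():
            groups[frozenset(tags)].append(word)
        # all words in a group share emission parameters, so rare words borrow
        # statistics from frequent words with the same tag possibilities
        return dict(groups)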
Hidden Markov Model Taggers • Training • after initialization, the HMM is trained with the Forward-Backward algorithm • Tagging • the same as for the Visible Markov Model: the difference between VMM tagging and HMM tagging lies in how we train the model, not in how we tag (both use the Viterbi algorithm)
Hidden Markov Model Taggers • The effect of initialization on HMM training (the overtraining problem)
Lexical (dictionary) conditions:
D0: maximum likelihood estimates from a tagged training corpus
D1: correct ordering only of lexical probabilities
D2: lexical probabilities proportional to overall tag probabilities
D3: equal lexical probabilities for all tags admissible for a word
Transition conditions:
T0: maximum likelihood estimates from a tagged training corpus
T1: equal probabilities for all transitions
Hidden Markov Model Taggers • When to use which training method
• Use the Visible Markov Model (maximum likelihood training) when a sufficiently large tagged training text is available and it is similar to the intended text of application
• Run Forward-Backward for a few iterations when there is no tagged training text, or the training and test text are very different, but at least some lexical information is available
• Run Forward-Backward for a larger number of iterations when there is no lexical information