POS Tagging HMM Taggers

Presentation Transcript


  1. POS Tagging: HMM Taggers (continued)

  2. Today
  • Walk through the guts of an HMM Tagger
  • Address problems with HMM Taggers, specifically unknown words

  3. HMM Tagger
  • What is the goal of a Markov tagger?
  • To maximize the following expression: P(w_i | t_j) x P(t_j | t_1,...,t_{i-1})
  • Or: P(word | tag) x P(tag | previous n tags)
  • Which simplifies, by the Markov assumption, to: P(w_i | t_i) x P(t_i | t_{i-1})

  4. HMM Tagger
  P(word | tag) x P(tag | previous n tags)
  P(word | tag):
  • The probability of the word given a tag (not vice versa)
  • We model this with a word-tag matrix (often called a language model)
  • Familiar? HW 4 (3)

  5. HMM Tagger
  P(word | tag) x P(tag | previous n tags)
  P(tag | previous n tags):
  • How likely a tag is given the n tags before it
  • Simplified to just the previous tag
  • Modeled with a tag-tag matrix (both matrices can be estimated by counting; see the sketch below)
  • Familiar? HW 4 (2)
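As a concrete illustration, here is a minimal counting sketch of the two matrices in Python. It is not the slides' implementation; the toy corpus and the names tagged_sents, p_word_given_tag, and p_tag_given_prev are made up for this example, and the estimates are plain relative frequencies (MLE) with no smoothing yet.

    from collections import defaultdict

    # Toy tagged corpus: a list of sentences, each a list of (word, tag) pairs.
    tagged_sents = [
        [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN")],
        [("to", "TO"), ("race", "VB"), ("is", "VBZ"), ("fun", "JJ")],
    ]

    emission_counts = defaultdict(lambda: defaultdict(int))    # tag -> word -> count (word-tag matrix)
    transition_counts = defaultdict(lambda: defaultdict(int))  # previous tag -> tag -> count (tag-tag matrix)
    tag_counts = defaultdict(int)

    for sent in tagged_sents:
        prev = "<s>"                              # sentence-start pseudo-tag
        for word, tag in sent:
            emission_counts[tag][word] += 1
            transition_counts[prev][tag] += 1
            tag_counts[tag] += 1
            prev = tag

    def p_word_given_tag(word, tag):
        # P(word | tag) = C(tag, word) / C(tag)
        return emission_counts[tag][word] / tag_counts[tag]

    def p_tag_given_prev(tag, prev):
        # P(tag | previous tag) = C(prev, tag) / C(prev, anything)
        total = sum(transition_counts[prev].values())
        return transition_counts[prev][tag] / total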

  6. HMM Tagger
  • But why is it P(word | tag) and not P(tag | word)?
  • Take the following examples (from J&M):
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
  • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/?? for/IN outer/JJ space/NN

  7. HMM Tagger
  Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
  • Maximize: P(w_i | t_j) x P(t_j | t_{i-1})
  • We can choose between:
  • P(race | VB) P(VB | TO)
  • P(race | NN) P(NN | TO)

  8. The good HMM Tagger
  • From the Brown/Switchboard corpus:
  • P(VB | TO) = .34
  • P(NN | TO) = .021
  • P(race | VB) = .00003
  • P(race | NN) = .00041
  • P(VB | TO) x P(race | VB) = .34 x .00003 ≈ .00001
  • P(NN | TO) x P(race | NN) = .021 x .00041 ≈ .000009
  • So TO followed by VB is more probable in this context; the transition probability dominates the decision, and 'race' itself has little effect here (see the short calculation below).
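The comparison can be checked in a couple of lines. This is just arithmetic on the figures quoted on the slide, written as a Python sketch; the variable names are illustrative.

    # Figures quoted on the slide (Brown/Switchboard estimates cited from J&M).
    p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)  -> roughly 1.0e-05
    p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)  -> roughly 8.6e-06

    best = "VB" if p_vb > p_nn else "NN"
    print(best)              # VB: the transition term is what decides it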

  9. The no-go HMM Tagger
  • Invert word and tag, i.e., use P(tag | word) instead of P(word | tag):
  • P(VB | race) = .02
  • P(NN | race) = .98
  • Plugged into the same maximization, these numbers would make NN win for 'race' even right after TO, which is the wrong choice.

  10. HMM Tagger
  • But don't we really want to maximize the probability of the best sequence of tags for a given sequence of words, not just the best tag for each individual word?
  • Thus, we really want to maximize (and implement): P(t_1,...,t_n | w_1,...,w_n), or T^ = argmax_T P(T | W)

  11. HMM Tagger
  • By Bayes' rule:
      P(T | W) = P(T) P(W | T) / P(W)
  • Since P(W) is the same for every candidate tag sequence (why?), maximizing P(T | W) is equivalent to maximizing P(T) P(W | T), as written out below.
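Written out as a derivation (a LaTeX sketch of the step the slide is gesturing at):

    \hat{T} = \arg\max_{T} P(T \mid W)
            = \arg\max_{T} \frac{P(T)\, P(W \mid T)}{P(W)}
            = \arg\max_{T} P(T)\, P(W \mid T),

since P(W) does not depend on the candidate tag sequence T.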

  12. HMM Tagger
  • We want to maximize P(T) P(W | T) = P(t_1,...,t_n) P(w_1,...,w_n | t_1,...,t_n)
  • By the chain rule (which rewrites a joint probability as a product of conditional probabilities):
      P(t_1,...,t_n) = P(t_n | t_1,...,t_{n-1}) x P(t_{n-1} | t_1,...,t_{n-2}) x ... x P(t_1)
      P(w_1,...,w_n | t_1,...,t_n) = P(w_1 | t_1,...,t_n) x P(w_2 | w_1, t_1,...,t_n) x ... x P(w_n | w_1,...,w_{n-1}, t_1,...,t_n)
  • Equivalently, applying the chain rule to the joint position by position (tag, then word, at each i):
      P(T) P(W | T) = Π_{i=1}^{n} P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) P(t_i | w_1 t_1 ... w_{i-1} t_{i-1})

  13. HMM Tagger
  • P(T) P(W | T) = Π_{i=1}^{n} P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) P(t_i | w_1 t_1 ... w_{i-1} t_{i-1})
  • Simplifying assumption: the probability of a word depends only on its own tag:
      P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) = P(w_i | t_i)
  • And the Markov assumption (bigram): the probability of a tag depends only on the previous tag:
      P(t_i | w_1 t_1 ... w_{i-1} t_{i-1}) = P(t_i | t_{i-1})
  • The best tag sequence is then (scored in the sketch below):
      T^ = argmax_T Π_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)
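Under these two assumptions, scoring a candidate tag sequence is just a product over positions. A minimal Python sketch, reusing the hypothetical p_tag_given_prev and p_word_given_tag from the earlier counting example and working in log space to avoid underflow (it assumes the probabilities involved are non-zero, e.g., after smoothing):

    import math

    def score(words, tags):
        """log of P(T) P(W|T) under the bigram HMM assumptions."""
        logp = 0.0
        prev = "<s>"                                        # sentence-start pseudo-tag
        for word, tag in zip(words, tags):
            logp += math.log(p_tag_given_prev(tag, prev))   # P(t_i | t_{i-1})
            logp += math.log(p_word_given_tag(word, tag))   # P(w_i | t_i)
            prev = tag
        return logp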

  14. Implementation
  • So the best tag sequence is found by maximizing: T^ = argmax_T Π P(t_i | t_{i-1}) P(w_i | t_i)
  • Training: learn the transition and emission probabilities from a corpus (smoothing may be necessary)
  • State transition probabilities
  • Emission probabilities

  15. Training
  • An HMM tagger needs to be trained to estimate the following (sketched below):
  • The initial state probabilities
  • The state transition probabilities (the tag-tag matrix)
  • The emission probabilities (the tag-word matrix)
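A small sketch of the remaining pieces, building on the hypothetical counts from the earlier example. The "<s>" pseudo-tag there already plays the role of the initial state; here the initial probabilities are written out separately, and add-one (Laplace) smoothing illustrates the "smoothing may be necessary" note from slide 14. The names and the smoothing choice are assumptions for this sketch, not the slides' method.

    from collections import defaultdict

    # Initial state probabilities: how often each tag starts a sentence.
    initial_counts = defaultdict(int)
    for sent in tagged_sents:
        initial_counts[sent[0][1]] += 1        # tag of the first word in each sentence

    def p_initial(tag):
        return initial_counts[tag] / len(tagged_sents)

    def p_tag_given_prev_smoothed(tag, prev, tagset):
        """Add-one smoothed P(tag | prev): unseen tag bigrams keep a small non-zero mass."""
        total = sum(transition_counts[prev].values())
        return (transition_counts[prev][tag] + 1) / (total + len(tagset))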

  16. Implementation
  • Once trained, how do we implement the maximization? T^ = argmax_T Π P(t_i | t_{i-1}) P(w_i | t_i)
  • Can't we just walk through every path, calculate all the probabilities, and choose the path with the highest probability (the max)?
  • Only if we have a lot of time. (Why?) The number of possible tag sequences grows exponentially with sentence length
  • Better to use a dynamic programming (DP) algorithm, such as Viterbi (sketched below)
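A minimal Viterbi sketch under the same bigram assumptions. It reuses the hypothetical p_initial, p_tag_given_prev, and p_word_given_tag from the earlier sketches, works in log space, and assumes all probabilities are non-zero (i.e., smoothed); tagset is the list of all tags seen in training.

    import math

    def viterbi(words, tagset):
        """Most probable tag sequence under the bigram HMM (log-space dynamic programming)."""
        V = [{}]      # V[i][tag] = best log-probability of any tag path ending in `tag` at position i
        back = [{}]   # back[i][tag] = previous tag on that best path
        for tag in tagset:
            V[0][tag] = math.log(p_initial(tag)) + math.log(p_word_given_tag(words[0], tag))
            back[0][tag] = None
        for i in range(1, len(words)):
            V.append({})
            back.append({})
            for tag in tagset:
                # Best way to reach `tag` at position i from any tag at position i-1.
                prev_best = max(tagset, key=lambda p: V[i - 1][p] + math.log(p_tag_given_prev(tag, p)))
                V[i][tag] = (V[i - 1][prev_best]
                             + math.log(p_tag_given_prev(tag, prev_best))
                             + math.log(p_word_given_tag(words[i], tag)))
                back[i][tag] = prev_best
        # Trace the backpointers from the best final tag.
        best_last = max(V[-1], key=V[-1].get)
        tags = [best_last]
        for i in range(len(words) - 1, 0, -1):
            tags.append(back[i][tags[-1]])
        tags.reverse()
        return tags

Unlike walking every path, this keeps only the best path into each tag at each position, so the work is polynomial in sentence length and tagset size rather than exponential.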

  17. Unknown Words
  • The tagger just described will do poorly on unknown words. Why?
  • Because P(w_i | t_i) = 0 for a word it has not seen (more precisely, for any unseen word-tag pair)
  • How do we resolve this problem?
  • A dictionary with the most common tag (the "stupid tagger"); still doesn't solve the problem for completely novel words
  • Morphological/typographical analysis (sketched below)
  • Estimating the probability of a tag generating an unknown word; secondary training required
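One way the morphological idea can look in practice is a suffix-based guess with a tiny fallback mass, sketched below. This is an illustration only, not the method the slides have in mind: the suffix table, the 0.5 weight, and the epsilon floor are all assumptions, and it reuses the hypothetical emission_counts and p_word_given_tag from the earlier sketches.

    SUFFIX_TAG_GUESSES = [          # crude, illustrative morphological cues
        ("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("tion", "NN"), ("s", "NNS"),
    ]

    def p_word_given_tag_backoff(word, tag, epsilon=1e-6):
        """Emission probability with a guess for unknown words (sketch only)."""
        if any(word in seen for seen in emission_counts.values()):
            return p_word_given_tag(word, tag)            # known word: use the trained estimate
        for suffix, guessed_tag in SUFFIX_TAG_GUESSES:    # unknown word: let the suffix vote
            if word.lower().endswith(suffix):
                return 0.5 if tag == guessed_tag else epsilon
        return epsilon                                    # completely novel shape: tiny uniform mass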
