POS Tagging HMM Taggers

Presentation Transcript


  1. POS Tagging: HMM Taggers (continued)

  2. Today
  • Walk through the guts of an HMM Tagger
  • Address problems with HMM Taggers, specifically unknown words

  3. HMM Tagger
  • What is the goal of a Markov tagger?
  • To maximize the following expression: P(w_i | t_j) x P(t_j | t_1,...,t_{i-1})
  • Or: P(word | tag) x P(tag | previous n tags)
  • Which simplifies, by the Markov assumption, to: P(w_i | t_i) x P(t_i | t_{i-1})

  4. HMM Tagger
  P(word | tag) x P(tag | previous n tags)
  P(word | tag):
  • The probability of the word given a tag (not vice versa)
  • We model this with a word-tag matrix (often called a language model)
  • Familiar? HW 4 (3)

  5. HMM Tagger
  P(word | tag) x P(tag | previous n tags)
  P(tag | previous n tags):
  • How likely a tag is given the n tags before it
  • Simplified to just the previous tag
  • Modeled with a tag-tag matrix (both matrices can be estimated by counting; see the sketch below)
  • Familiar? HW 4 (2)
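As a concrete illustration, here is a minimal counting sketch of the two matrices in Python. It is not the slides' implementation; the toy corpus and the names tagged_sents, p_word_given_tag, and p_tag_given_prev are made up for this example, and the estimates are plain relative frequencies (MLE) with no smoothing yet.

    from collections import defaultdict

    # Toy tagged corpus: a list of sentences, each a list of (word, tag) pairs.
    tagged_sents = [
        [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN")],
        [("to", "TO"), ("race", "VB"), ("is", "VBZ"), ("fun", "JJ")],
    ]

    emission_counts = defaultdict(lambda: defaultdict(int))    # tag -> word -> count (word-tag matrix)
    transition_counts = defaultdict(lambda: defaultdict(int))  # previous tag -> tag -> count (tag-tag matrix)
    tag_counts = defaultdict(int)

    for sent in tagged_sents:
        prev = "<s>"                              # sentence-start pseudo-tag
        for word, tag in sent:
            emission_counts[tag][word] += 1
            transition_counts[prev][tag] += 1
            tag_counts[tag] += 1
            prev = tag

    def p_word_given_tag(word, tag):
        # P(word | tag) = C(tag, word) / C(tag)
        return emission_counts[tag][word] / tag_counts[tag]

    def p_tag_given_prev(tag, prev):
        # P(tag | previous tag) = C(prev, tag) / C(prev, anything)
        total = sum(transition_counts[prev].values())
        return transition_counts[prev][tag] / total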

  6. HMM Tagger
  • But why is it P(word | tag) and not P(tag | word)?
  • Take the following examples (from J&M):
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
  • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/?? for/IN outer/JJ space/NN

  7. HMM Tagger
  Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
  • Maximize: P(w_i | t_j) x P(t_j | t_{i-1})
  • We can choose between:
  • P(race | VB) P(VB | TO)
  • P(race | NN) P(NN | TO)

  8. The good HMM Tagger
  • From the Brown/Switchboard corpus:
  • P(VB | TO) = .34
  • P(NN | TO) = .021
  • P(race | VB) = .00003
  • P(race | NN) = .00041
  • P(VB | TO) x P(race | VB) = .34 x .00003 ≈ .00001
  • P(NN | TO) x P(race | NN) = .021 x .00041 ≈ .000009
  • So TO followed by VB is more probable in this context; the transition probability dominates the decision, and 'race' itself has little effect here (see the short calculation below).
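The comparison can be checked in a couple of lines. This is just arithmetic on the figures quoted on the slide, written as a Python sketch; the variable names are illustrative.

    # Figures quoted on the slide (Brown/Switchboard estimates cited from J&M).
    p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)  -> roughly 1.0e-05
    p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)  -> roughly 8.6e-06

    best = "VB" if p_vb > p_nn else "NN"
    print(best)              # VB: the transition term is what decides it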

  9. The no-go HMM Tagger
  • Invert word and tag, i.e., use P(tag | word) instead of P(word | tag):
  • P(VB | race) = .02
  • P(NN | race) = .98
  • Plugged into the same maximization, these numbers would make NN win for 'race' even right after TO, which is the wrong choice.

  10. HMM Tagger
  • But don't we really want to maximize the probability of the best sequence of tags for a given sequence of words, not just the best tag for each individual word?
  • Thus, we really want to maximize (and implement): P(t_1,...,t_n | w_1,...,w_n), or T^ = argmax_T P(T | W)

  11. HMM Tagger
  • By Bayes' rule:
      P(T | W) = P(T) P(W | T) / P(W)
  • Since P(W) is the same for every candidate tag sequence (why?), maximizing P(T | W) is equivalent to maximizing P(T) P(W | T), as written out below.
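Written out as a derivation (a LaTeX sketch of the step the slide is gesturing at):

    \hat{T} = \arg\max_{T} P(T \mid W)
            = \arg\max_{T} \frac{P(T)\, P(W \mid T)}{P(W)}
            = \arg\max_{T} P(T)\, P(W \mid T),

since P(W) does not depend on the candidate tag sequence T.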

  12. HMM Tagger
  • We want to maximize P(T) P(W | T) = P(t_1,...,t_n) P(w_1,...,w_n | t_1,...,t_n)
  • By the chain rule (which rewrites a joint probability as a product of conditional probabilities):
      P(t_1,...,t_n) = P(t_n | t_1,...,t_{n-1}) x P(t_{n-1} | t_1,...,t_{n-2}) x ... x P(t_1)
      P(w_1,...,w_n | t_1,...,t_n) = P(w_1 | t_1,...,t_n) x P(w_2 | w_1, t_1,...,t_n) x ... x P(w_n | w_1,...,w_{n-1}, t_1,...,t_n)
  • Equivalently, applying the chain rule to the joint position by position (tag, then word, at each i):
      P(T) P(W | T) = Π_{i=1}^{n} P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) P(t_i | w_1 t_1 ... w_{i-1} t_{i-1})

  13. HMM Tagger
  • P(T) P(W | T) = Π_{i=1}^{n} P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) P(t_i | w_1 t_1 ... w_{i-1} t_{i-1})
  • Simplifying assumption: the probability of a word depends only on its own tag:
      P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) = P(w_i | t_i)
  • And the Markov assumption (bigram): the probability of a tag depends only on the previous tag:
      P(t_i | w_1 t_1 ... w_{i-1} t_{i-1}) = P(t_i | t_{i-1})
  • The best tag sequence is then (scored in the sketch below):
      T^ = argmax_T Π_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)
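Under these two assumptions, scoring a candidate tag sequence is just a product over positions. A minimal Python sketch, reusing the hypothetical p_tag_given_prev and p_word_given_tag from the earlier counting example and working in log space to avoid underflow (it assumes the probabilities involved are non-zero, e.g., after smoothing):

    import math

    def score(words, tags):
        """log of P(T) P(W|T) under the bigram HMM assumptions."""
        logp = 0.0
        prev = "<s>"                                        # sentence-start pseudo-tag
        for word, tag in zip(words, tags):
            logp += math.log(p_tag_given_prev(tag, prev))   # P(t_i | t_{i-1})
            logp += math.log(p_word_given_tag(word, tag))   # P(w_i | t_i)
            prev = tag
        return logp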

  14. Implementation
  • So the best tag sequence is found by maximizing: T^ = argmax_T Π P(t_i | t_{i-1}) P(w_i | t_i)
  • Training: learn the transition and emission probabilities from a corpus (smoothing may be necessary)
  • State transition probabilities
  • Emission probabilities

  15. Training
  • An HMM tagger needs to be trained to estimate the following (sketched below):
  • The initial state probabilities
  • The state transition probabilities (the tag-tag matrix)
  • The emission probabilities (the tag-word matrix)
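A small sketch of the remaining pieces, building on the hypothetical counts from the earlier example. The "<s>" pseudo-tag there already plays the role of the initial state; here the initial probabilities are written out separately, and add-one (Laplace) smoothing illustrates the "smoothing may be necessary" note from slide 14. The names and the smoothing choice are assumptions for this sketch, not the slides' method.

    from collections import defaultdict

    # Initial state probabilities: how often each tag starts a sentence.
    initial_counts = defaultdict(int)
    for sent in tagged_sents:
        initial_counts[sent[0][1]] += 1        # tag of the first word in each sentence

    def p_initial(tag):
        return initial_counts[tag] / len(tagged_sents)

    def p_tag_given_prev_smoothed(tag, prev, tagset):
        """Add-one smoothed P(tag | prev): unseen tag bigrams keep a small non-zero mass."""
        total = sum(transition_counts[prev].values())
        return (transition_counts[prev][tag] + 1) / (total + len(tagset))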

  16. Implementation
  • Once trained, how do we implement the maximization? T^ = argmax_T Π P(t_i | t_{i-1}) P(w_i | t_i)
  • Can't we just walk through every path, calculate all the probabilities, and choose the path with the highest probability (the max)?
  • Only if we have a lot of time. (Why?) The number of possible tag sequences grows exponentially with sentence length
  • Better to use a dynamic programming (DP) algorithm, such as Viterbi (sketched below)
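A minimal Viterbi sketch under the same bigram assumptions. It reuses the hypothetical p_initial, p_tag_given_prev, and p_word_given_tag from the earlier sketches, works in log space, and assumes all probabilities are non-zero (i.e., smoothed); tagset is the list of all tags seen in training.

    import math

    def viterbi(words, tagset):
        """Most probable tag sequence under the bigram HMM (log-space dynamic programming)."""
        V = [{}]      # V[i][tag] = best log-probability of any tag path ending in `tag` at position i
        back = [{}]   # back[i][tag] = previous tag on that best path
        for tag in tagset:
            V[0][tag] = math.log(p_initial(tag)) + math.log(p_word_given_tag(words[0], tag))
            back[0][tag] = None
        for i in range(1, len(words)):
            V.append({})
            back.append({})
            for tag in tagset:
                # Best way to reach `tag` at position i from any tag at position i-1.
                prev_best = max(tagset, key=lambda p: V[i - 1][p] + math.log(p_tag_given_prev(tag, p)))
                V[i][tag] = (V[i - 1][prev_best]
                             + math.log(p_tag_given_prev(tag, prev_best))
                             + math.log(p_word_given_tag(words[i], tag)))
                back[i][tag] = prev_best
        # Trace the backpointers from the best final tag.
        best_last = max(V[-1], key=V[-1].get)
        tags = [best_last]
        for i in range(len(words) - 1, 0, -1):
            tags.append(back[i][tags[-1]])
        tags.reverse()
        return tags

Unlike walking every path, this keeps only the best path into each tag at each position, so the work is polynomial in sentence length and tagset size rather than exponential.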

  17. Unknown Words
  • The tagger just described will do poorly on unknown words. Why?
  • Because P(w_i | t_i) = 0 for a word it has not seen (more precisely, for any unseen word-tag pair)
  • How do we resolve this problem?
  • A dictionary with the most common tag (the "stupid tagger"); still doesn't solve the problem for completely novel words
  • Morphological/typographical analysis (sketched below)
  • Estimating the probability of a tag generating an unknown word; secondary training required
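One way the morphological idea can look in practice is a suffix-based guess with a tiny fallback mass, sketched below. This is an illustration only, not the method the slides have in mind: the suffix table, the 0.5 weight, and the epsilon floor are all assumptions, and it reuses the hypothetical emission_counts and p_word_given_tag from the earlier sketches.

    SUFFIX_TAG_GUESSES = [          # crude, illustrative morphological cues
        ("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("tion", "NN"), ("s", "NNS"),
    ]

    def p_word_given_tag_backoff(word, tag, epsilon=1e-6):
        """Emission probability with a guess for unknown words (sketch only)."""
        if any(word in seen for seen in emission_counts.values()):
            return p_word_given_tag(word, tag)            # known word: use the trained estimate
        for suffix, guessed_tag in SUFFIX_TAG_GUESSES:    # unknown word: let the suffix vote
            if word.lower().endswith(suffix):
                return 0.5 if tag == guessed_tag else epsilon
        return epsilon                                    # completely novel shape: tiny uniform mass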
