
Conditional Markov Models: MaxEnt Tagging and MEMMs

This presentation reviews Hidden Markov Models (HMMs) and their use in tagging and segmentation, then explores replacing the generative model inside an HMM with a Maximum Entropy (MaxEnt) model, leading to MaxEnt taggers and MEMMs.


Presentation Transcript


  1. Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen CALD

  2. Review: Hidden Markov Models [Figure: four-state HMM (S1–S4) with transition probabilities (0.5, 0.9, 0.5, 0.1, 0.8, 0.2) and per-state emission distributions over symbols A and C (0.6/0.4, 0.9/0.1, 0.3/0.7, 0.5/0.5)] • Efficient dynamic programming algorithms exist for • Finding Pr(S) • The highest-probability path P that maximizes Pr(S,P) (Viterbi) • Training the model (Baum-Welch algorithm)
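
As a concrete illustration of the dynamic programming the slide refers to, here is a minimal forward-algorithm sketch for computing Pr(S) of an observation sequence; the state names, symbols, and probability tables are hypothetical, chosen only to echo the two-symbol (A/C) figure, not taken from it.

    # Minimal forward algorithm for Pr(observations) under a toy two-state HMM.
    # All numbers are illustrative.
    start = {"S1": 1.0, "S2": 0.0}                       # initial state distribution
    trans = {"S1": {"S1": 0.5, "S2": 0.5},               # Pr(next_state | state)
             "S2": {"S1": 0.1, "S2": 0.9}}
    emit  = {"S1": {"A": 0.6, "C": 0.4},                 # Pr(symbol | state)
             "S2": {"A": 0.9, "C": 0.1}}

    def forward_prob(obs):
        """Return Pr(obs) by summing over all state paths (forward algorithm)."""
        alpha = {s: start[s] * emit[s][obs[0]] for s in start}
        for symbol in obs[1:]:
            alpha = {s: sum(alpha[prev] * trans[prev][s] for prev in alpha) * emit[s][symbol]
                     for s in start}
        return sum(alpha.values())

    print(forward_prob(["A", "C", "A"]))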

  3. HMM for Segmentation • Simplest Model: One state per entity type

  4. HMM Learning • Manually pick the HMM's graph (e.g., simple model, fully connected) • Learn transition probabilities: Pr(si|sj) • Learn emission probabilities: Pr(w|si)

  5. Learning model parameters • When training data defines a unique path through the HMM • Transition probabilities • Probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions from state i) • Emission probabilities • Probability of emitting symbol k from state i = (number of times k is generated from i) / (number of times any symbol is generated from i) • When training data defines multiple paths: • A more general EM-like algorithm (Baum-Welch)
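
A minimal sketch of these counting estimates, assuming fully labeled training sequences given as (symbol, state) pairs; the function name, data layout, and the toy example are illustrative.

    from collections import defaultdict

    def train_hmm(sequences):
        """MLE of transition and emission probabilities from labeled (symbol, state) sequences."""
        trans_counts = defaultdict(lambda: defaultdict(int))
        emit_counts = defaultdict(lambda: defaultdict(int))
        for seq in sequences:
            for (word, state), (_, next_state) in zip(seq, seq[1:]):
                trans_counts[state][next_state] += 1       # count transitions i -> j
            for word, state in seq:
                emit_counts[state][word] += 1               # count emissions of k from i
        trans = {i: {j: c / sum(js.values()) for j, c in js.items()}
                 for i, js in trans_counts.items()}
        emit = {i: {w: c / sum(ws.values()) for w, c in ws.items()}
                for i, ws in emit_counts.items()}
        return trans, emit

    # Toy segmentation-style example (labels are hypothetical entity types).
    data = [[("William", "Name"), ("Cohen", "Name"), ("Pittsburgh", "City")]]
    print(train_hmm(data))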

  6. What is a “symbol” ??? Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ? 4601 => “4601”, “9999”, “9+”, “number”, … ? Datamold: choose best abstraction level using holdout set
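
A small sketch of what such symbol abstractions might look like in code; the particular abstraction names and rules are illustrative, not DATAMOLD's actual feature set.

    import re

    def abstractions(token):
        """Return the token at several abstraction levels, from most to least specific."""
        views = [token, token.lower()]
        shape = re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"[0-9]", "9", token)))
        views.append(shape)                                  # e.g. "Cohen" -> "Xxxxx", "4601" -> "9999"
        views.append(re.sub(r"(.)\1+", r"\1+", shape))       # collapse runs: "Xxxxx" -> "Xx+", "9999" -> "9+"
        views.append("number" if token.isdigit() else "word")
        return views

    print(abstractions("Cohen"))   # ['Cohen', 'cohen', 'Xxxxx', 'Xx+', 'word']
    print(abstractions("4601"))    # ['4601', '4601', '9999', '9+', 'number']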

  7. What is a symbol? Bikel et al. mix symbols from two abstraction levels

  8. What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words. [Figure: label sequence S(t-1), S(t), S(t+1) over observations O(t-1), O(t), O(t+1), with example features of the word at position t: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; is "Wisniewski"; …] Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

  9. Stupid HMM tricks [Figure: HMM with a start state that moves to a "red" state with probability Pr(red) and to a "green" state with probability Pr(green); each state loops on itself, i.e., Pr(red|red) = 1 and Pr(green|green) = 1]

  10. Stupid HMM tricks [Figure: same HMM as above – start state, transitions Pr(red) and Pr(green), self-loop probabilities Pr(red|red) = 1 and Pr(green|green) = 1] Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x), so argmax_y Pr(y|x) = argmax_y Pr(x|y) * Pr(y) = argmax_y Pr(y) * Pr(x1|y) * Pr(x2|y) * ... * Pr(xm|y). Example: Pr("I voted for Ralph Nader" | ggggg) = Pr(g) * Pr(I|g) * Pr(voted|g) * Pr(for|g) * Pr(Ralph|g) * Pr(Nader|g)
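
A minimal sketch of this Naive Bayes reading of the one-state-per-class HMM, with made-up class priors and word probabilities:

    import math

    # Illustrative parameters: class priors Pr(y) and per-class word models Pr(w|y).
    prior = {"green": 0.5, "red": 0.5}
    word_prob = {"green": {"I": 0.1, "voted": 0.05, "for": 0.1, "Ralph": 0.02, "Nader": 0.02},
                 "red":   {"I": 0.1, "voted": 0.04, "for": 0.1, "Ralph": 0.01, "Nader": 0.005}}

    def nb_class(words):
        """argmax_y Pr(y) * prod_i Pr(x_i | y), computed in log space to avoid underflow."""
        def score(y):
            return math.log(prior[y]) + sum(math.log(word_prob[y].get(w, 1e-6)) for w in words)
        return max(prior, key=score)

    print(nb_class("I voted for Ralph Nader".split()))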

  11. HMM’s = sequential NB

  12. From NB to Maxent

  13. From NB to Maxent

  14. From NB to Maxent Learning: set the alpha parameters to maximize the conditional likelihood of the data – the ML model of the data, given that we're using the same functional form as NB. It turns out this is the same as maximizing the entropy of p(y|x) over all distributions consistent with the feature expectations.
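
For reference, the functional form and training objective the slide alludes to, written out in standard conditional maxent / logistic-regression notation (f_i are feature functions, D the training data):

    p_\alpha(y \mid x) = \frac{\exp\big(\sum_i \alpha_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \alpha_i f_i(x, y')\big)},
    \qquad
    \hat{\alpha} = \arg\max_{\alpha} \sum_{(x, y) \in D} \log p_\alpha(y \mid x)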

  15. MaxEnt Comments • Implementation: • All methods are iterative • Numerical issues (underflow, rounding) are important. • For NLP-like problems with many features, modern gradient-based or Newton-like methods work well – sometimes better(?) and faster than GIS and IIS • Smoothing: • Typically maxent will overfit the data if there are many infrequent features. • Common solutions: discard low-count features; early stopping with a holdout set; a Gaussian prior centered on zero to limit the size of the alphas (i.e., optimize log likelihood minus a penalty on the alphas)
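
A minimal sketch of the Gaussian-prior idea using scikit-learn's logistic regression (maxent in its logistic-regression guise), whose L2 penalty corresponds to a zero-centered Gaussian prior on the alphas; the tiny feature matrix is made up.

    from sklearn.linear_model import LogisticRegression

    # Made-up feature matrix: each row is a word's binary feature vector
    # (e.g. [is_capitalized, ends_in_ski, in_city_list]), each label a tag.
    X = [[1, 1, 0], [1, 0, 1], [0, 0, 0], [0, 0, 1]]
    y = ["Name", "City", "Other", "City"]

    # penalty="l2" with strength 1/C plays the role of the zero-centered Gaussian prior:
    # it maximizes log likelihood minus (1/2C) * sum of squared weights.
    clf = LogisticRegression(penalty="l2", C=1.0)
    clf.fit(X, y)
    print(clf.predict_proba([[1, 1, 0]]))   # Pr(y | x) for a new word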

  16. MaxEnt Comments • Performance: • Good MaxEnt methods are competitive with linear SVMs and other state-of-the-art classifiers in accuracy. • Can't easily extend to higher-order feature interactions (as, e.g., kernel SVMs and AdaBoost can). • Training is relatively expensive. • Embedding in a larger system: • MaxEnt optimizes Pr(y|x), not error rate.

  17. MaxEnt Comments • MaxEnt competitors: • Model Pr(y|x) with Pr(y|score(x)), using a score from SVMs, NB, … • Regularized Winnow, BPETs, … • Ranking-based methods that estimate whether Pr(y1|x) > Pr(y2|x). • Things I don't understand: • Why don't we call it logistic regression? • Why is it always used to estimate the density of (y,x) pairs rather than a separate density for each class y? • When are its confidence estimates reliable?

  18. What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words. [Figure: same diagram as slide 8 – label sequence S(t-1), S(t), S(t+1) over observations O(t-1), O(t), O(t+1), with example word features: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; is "Wisniewski"; …]

  19. What is a symbol? [Figure: same feature diagram as slide 18] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations

  20. What is a symbol? [Figure: same feature diagram as slide 18] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state

  21. What is a symbol? [Figure: same feature diagram as slide 18] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history
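
A minimal sketch of this idea: a single maxent classifier over (overlapping word features + the previous state), trained on labeled sequences and applied left to right. The feature names, tags, and the tiny training set are all illustrative.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(word, prev_state):
        """Overlapping word features plus the previous state, as the slides suggest."""
        return {"word=" + word.lower(): 1,
                "is_capitalized": int(word[0].isupper()),
                "ends_in_ski": int(word.endswith("ski")),
                "prev=" + prev_state: 1}

    # Tiny illustrative training sequence of (word, tag) pairs.
    train = [("William", "Name"), ("Wisniewski", "Name"), ("lives", "Other"),
             ("in", "Other"), ("Pittsburgh", "City")]

    X, y, prev = [], [], "START"
    for word, tag in train:
        X.append(features(word, prev))
        y.append(tag)
        prev = tag

    vec = DictVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(X), y)

    # Greedy left-to-right tagging, conditioning each prediction on the previous predicted state.
    prev, tags = "START", []
    for word in ["Krzysztof", "Wisniewski", "lives", "in", "Boston"]:
        tag = clf.predict(vec.transform([features(word, prev)]))[0]
        tags.append(tag)
        prev = tag
    print(tags)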

  22. Ratnaparkhi’s MXPOST • Sequential learning problem: predict POS tags of words. • Uses MaxEnt model described above. • Rich feature set. • To smooth, discard features occurring < 10 times.
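
A sketch of the count-cutoff smoothing the slide describes, assuming features have already been extracted as one list of feature strings per training token; the threshold of 10 is from the slide, everything else is illustrative.

    from collections import Counter

    def prune_rare_features(feature_lists, min_count=10):
        """Keep only features that occur at least min_count times in the training data."""
        counts = Counter(f for feats in feature_lists for f in feats)
        keep = {f for f, c in counts.items() if c >= min_count}
        return [[f for f in feats if f in keep] for feats in feature_lists]

    # feature_lists would look like [["word=the", "tag-1=DT"], ["suffix=-ing", "is_capitalized"], ...]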

  23. MXPOST

  24. MXPOST: learning & inference • Feature selection • GIS

  25. MXPost inference

  26. MXPost results • State-of-the-art accuracy (for 1996) • Same approach used successfully for several other sequential classification steps of a stochastic parser (also state of the art). • Same approach used for NER by Borthwick, Malouf, Manning, and others.

  27. Alternative inference

  28. Finding the most probable path: the Viterbi algorithm (for HMMs) • define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k • we want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state N • can define v_k(i) recursively • can use dynamic programming to find v_k(i) efficiently

  29. Finding the most probable path: the Viterbi algorithm for HMMs • initialization: v_0(0) = 1 (0 is the begin state), v_k(0) = 0 for all other states k

  30. The Viterbi algorithm for HMMs • recursion for emitting states (i = 1…L): v_k(i) = e_k(x_i) * max_j [ v_j(i-1) * a_jk ], where e_k(x_i) = Pr(x_i | state k) and a_jk = Pr(state k | state j)

  31. The Viterbi algorithm for HMMs and Maxent Taggers • recursion for emitting states (i = 1…L): v_k(i) = max_j [ v_j(i-1) * Pr(s_i = k | s_{i-1} = j, x_i) ] – the HMM's transition-emission product is replaced by the maxent model's conditional probability
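
A minimal Viterbi sketch for a tagger of this form. It assumes a hypothetical function cond_prob(k, j, x_i) returning Pr(s_i = k | s_{i-1} = j, x_i) (e.g., from a maxent classifier like the one sketched earlier) and a distinguished "START" previous state; the names are illustrative.

    def viterbi(observations, states, cond_prob):
        """Most probable state sequence when Pr(s_i = k | s_{i-1} = j, x_i) is given by cond_prob."""
        # v[k] = probability of the best path over the observations seen so far, ending in state k
        v = {k: cond_prob(k, "START", observations[0]) for k in states}
        back = []
        for x in observations[1:]:
            best_prev = {k: max(states, key=lambda j: v[j] * cond_prob(k, j, x)) for k in states}
            v = {k: v[best_prev[k]] * cond_prob(k, best_prev[k], x) for k in states}
            back.append(best_prev)
        # Trace the best final state back through the stored backpointers.
        path = [max(states, key=lambda k: v[k])]
        for best_prev in reversed(back):
            path.append(best_prev[path[-1]])
        return list(reversed(path))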

  32. MEMMs • Basic difference from ME tagging: • ME tagging: previous state is a feature of the MaxEnt classifier • MEMM: build a separate MaxEnt classifier for each state. • Can build any HMM architecture you want; e.g., parallel nested HMMs, etc. • Data is fragmented: examples where the previous tag is "proper noun" give no information about learning tags when the previous tag is "noun" • Mostly a difference in viewpoint
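
A sketch of the structural difference: instead of one classifier with the previous state as a feature, an MEMM keeps a dictionary of per-state classifiers, each trained only on the examples that follow that state. It assumes a hypothetical feature_fn(word) returning a dict of word features (like those listed on slide 8); all other names are illustrative.

    from collections import defaultdict
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def train_memm(sequences, feature_fn):
        """One MaxEnt classifier per previous state, trained on its own fragment of the data."""
        per_state = defaultdict(lambda: ([], []))       # prev_state -> (feature dicts, next tags)
        for seq in sequences:
            prev = "START"
            for word, tag in seq:
                feats, tags = per_state[prev]
                feats.append(feature_fn(word))          # note: prev is NOT a feature here
                tags.append(tag)
                prev = tag
        # Each classifier sees only its own fragment, which is the data-fragmentation issue above.
        models = {}
        for prev, (feats, tags) in per_state.items():
            vec = DictVectorizer()
            models[prev] = (vec, LogisticRegression().fit(vec.fit_transform(feats), tags))
        return models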

  33. MEMMs

  34. MEMM task: FAQ parsing

  35. MEMM features
