MEMMs/CMMs and CRFs


Presentation Transcript


  1. MEMMs/CMMs and CRFs William W. Cohen Sep 22, 2010

  2. Announcements…

  3. Wiki Pages - HowTo
  • http://malt.ml.cmu.edu/mw/index.php/Social_Media_Analysis_10-802_in_Spring_2010#Other_Resources
  • Example: http://malt.ml.cmu.edu/mw/index.php/Turney,_ACL_2002
  • Key points
    • Naming the pages – examples:
      • [[Cohen ICML 1995]]
      • [[Lin and Cohen ICML 2010]]
      • [[Minkov et al IJCAI 2005]]
    • Structured links:
      • [[AddressesProblem::named entity recognition]]
      • [[UsesMethod::absolute discounting]]
      • [[RelatedPaper::Pang et al ACL 2002]]
      • [[UsesDataset::Citeseer]]
      • [[Category::Paper]]
      • [[Category::Problem]]
      • [[Category::Method]]
      • [[Category::Dataset]]
    • Rule of 2: don’t create a page unless you expect 2 inlinks
      • A method from a paper that’s not used anywhere else should be described in-line
    • No inverse links – but you can emulate these with queries

  4. Wiki Pages – HowTo, cont’d
  • To turn in:
    • Add them to the wiki
    • Add links to them on your user page
    • Send me an email with links to each page you want to get graded on
    • [I may send back bug reports until people get the hang of this…]
  • When to: three pages by 9/30 at midnight
    • Actually 10/1 at dawn is fine.
  • Suggestion:
    • Think of your project and build pages for the dataset, the problem, and the (baseline) method you plan to use.

  5. Projects
  • Some sample projects:
    • Apply an existing method to a new problem
      • http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf
    • Apply a new method to an existing dataset
    • Build something that might help you in your research
      • E.g., extract names of people (pundits, politicians, …) from political blogs
      • Classify folksonomy tags as person names, place names, …
  • On Wed 9/29 – “turn in”:
    • One page, covering some subset of:
      • What you plan to do with what data
      • Why you think it’s interesting
      • Any relevant superpowers you might have
      • How you plan to evaluate
      • What techniques you plan to use
      • What question you want to answer
      • Who you might work with
    • These will be posted on the class web site
  • On Friday 10/8:
    • Similar abstract from each team
    • Team is (preferably) 2-3 people, but I’m flexible
    • Main new information: who’s on what team

  6. Conditional Markov Models

  7. What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, … [Figure: HMM diagram with states S(t−1), S(t), S(t+1) over observations O(t−1), O(t), O(t+1); the observation “Wisniewski” carries features such as “part of noun phrase” and “ends in ‘-ski’”.] Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

  8. Stupid HMM tricks [Figure: two-state HMM: start → red with Pr(red), red → red with Pr(red|red) = 1; start → green with Pr(green), green → green with Pr(green|green) = 1.]
  Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
  argmax_y Pr(y|x) = argmax_y Pr(x|y) * Pr(y) = argmax_y Pr(y) * Pr(x1|y) * Pr(x2|y) * … * Pr(xm|y)
  Pr(“I voted for Ralph Nader” | ggggg) = Pr(g) * Pr(I|g) * Pr(voted|g) * Pr(for|g) * Pr(Ralph|g) * Pr(Nader|g)
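
A minimal sketch of the argmax computation on this slide, with made-up toy priors and per-word likelihoods (none of the numbers come from the slides):

```python
import math

# Toy Naive-Bayes-style decoder: compare Pr(y) * prod_i Pr(x_i | y)
# for the "all green" vs "all red" label sequence, as on the slide.
prior = {"green": 0.5, "red": 0.5}
word_prob = {
    "green": {"I": 0.1, "voted": 0.05, "for": 0.1, "Ralph": 0.01, "Nader": 0.01},
    "red":   {"I": 0.1, "voted": 0.05, "for": 0.1, "Ralph": 0.002, "Nader": 0.001},
}

def log_score(words, label):
    # log Pr(y) + sum_i log Pr(x_i | y)
    return math.log(prior[label]) + sum(math.log(word_prob[label][w]) for w in words)

sentence = "I voted for Ralph Nader".split()
best = max(prior, key=lambda y: log_score(sentence, y))
print(best)  # label with the highest joint score
```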

  9. From NB to Maxent

  10. From NB to Maxent

  11. What is a symbol? Features of a word: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, … [Figure: same HMM diagram as before, with “Wisniewski” annotated with features such as “part of noun phrase” and “ends in ‘-ski’”.] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
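
One concrete way to write the replacement the slide describes is the standard locally normalized maxent transition model (generic notation, not copied from the slides):

```latex
P(s_t \mid s_{t-1}, o_t) \;=\;
\frac{\exp\!\big(\sum_k \lambda_k f_k(s_t, s_{t-1}, o_t)\big)}
     {\sum_{s'} \exp\!\big(\sum_k \lambda_k f_k(s', s_{t-1}, o_t)\big)}
```

The feature functions f_k may be arbitrary and overlapping, which is exactly what the feature list above calls for.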

  12. Ratnaparkhi’s MXPOST • Sequential learning problem: predict POS tags of words. • Uses MaxEnt model described above. • Rich feature set. • To smooth, discard features occurring < 10 times.
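
A rough sketch of the kind of rich, overlapping features such a tagger uses; the exact MXPOST feature templates differ, so treat these as illustrative only:

```python
def word_features(words, tags, i):
    """Overlapping, non-independent features for position i (illustrative only)."""
    w = words[i]
    feats = {
        "word=" + w.lower(): 1,
        "suffix3=" + w[-3:]: 1,
        "is_capitalized": int(w[0].isupper()),
        "has_digit": int(any(c.isdigit() for c in w)),
        "prev_tag=" + (tags[i - 1] if i > 0 else "<START>"): 1,
    }
    return feats

print(word_features(["When", "will", "prof", "Cohen", "post"], ["WRB", "MD", "NN", "NNP"], 3))
```

The smoothing step on the slide would then simply drop any feature whose training count is below 10 before fitting the MaxEnt model.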

  13. MXPOST

  14. Inference for MENE [Figure: lattice over the sentence “When will prof Cohen post the notes …”, with candidate states B, I, O at each word position.]

  15. Inference for MXPOST [Figure: same B/I/O lattice.] (Approximate view): find the best path; weights are now on arcs from state to state.

  16. Inference for MXPOST [Figure: same B/I/O lattice.] More accurately: find the total flow to each node; weights are now on arcs from state to state.
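
“Total flow to each node” is the usual forward recursion over the lattice; for a locally normalized model it can be sketched as (generic notation):

```latex
\alpha_t(y) \;=\; \sum_{y'} \alpha_{t-1}(y')\, P(y \mid y', x_t),
\qquad \text{where } \sum_{y} P(y \mid y', x_t) = 1 \text{ for every } y'
```

The second identity is the “flow out of a node is always fixed” property that later slides repeat.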

  17. Inference for MXPOST [Figure: same B/I/O lattice, now drawn with hyperedges.] Find best path? Best tree? Weights are on hyperedges.

  18. Inference for MxPOST [Figure: beam over “When will prof Cohen post the notes …”; surviving partial hypotheses such as I, O, iI, oI, iO, oO.] Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.

  19. Inference for MxPOST [Figure: the beam expanded one more step; hypotheses such as oiI, oiO, ioI, ioO, ooI, ooO.] Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.

  20. Inference for MxPOST [Figure: the beam after another step; hypotheses such as oiiI, oiiO, iooI, iooO, oooI, oooO.] Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.
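
A minimal beam-search sketch matching the description on these slides (expand all children, score, keep the top n); the scoring function here is a placeholder, not MXPOST's actual model:

```python
def beam_search(words, states, score, beam_width=3):
    """Keep only the top `beam_width` partial tag sequences at each position.

    `score(prev_tags, word, state)` is assumed to return a log-probability-like
    number for extending a partial hypothesis; it is a placeholder here.
    """
    beam = [((), 0.0)]  # (partial tag sequence, cumulative score)
    for w in words:
        children = [(tags + (s,), total + score(tags, w, s))
                    for tags, total in beam for s in states]
        children.sort(key=lambda c: c[1], reverse=True)
        beam = children[:beam_width]           # discard all but the top n
    return beam[0][0]                          # best surviving sequence

# Example with a toy scorer that slightly prefers "O":
best = beam_search("When will prof Cohen post".split(), ["B", "I", "O"],
                   lambda tags, w, s: 0.1 if s == "O" else 0.0)
print(best)
```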

  21. MXPOST results • State-of-the-art accuracy (for 1996) • Same approach used successfully for several other sequential classification steps of a stochastic parser (also state of the art). • Same (or similar) approaches used for NER by Borthwick, Malouf, Manning, and others.

  22. Freitag, McCallum, Pereira

  23. MEMMs
  • Basic difference from ME tagging:
    • ME tagging: previous state is a feature of the MaxEnt classifier
    • MEMM: build a separate MaxEnt classifier for each state.
  • Can build any HMM architecture you want; e.g., parallel nested HMMs, etc.
  • Data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun”
  • Mostly a difference in viewpoint
  • MEMM does allow the possibility of “hidden” states and Baum-Welch-like training
  • Viterbi is the most natural inference scheme
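
One way to read the “separate classifier per state” point, and the data-fragmentation warning that follows, is the sketch below; `train_memm`, `make_classifier`, and the toy majority learner are hypothetical names, not code from any MEMM library:

```python
from collections import Counter, defaultdict

def train_memm(examples, make_classifier):
    """examples: iterable of (prev_tag, feature_dict, current_tag).

    One classifier per previous tag, so the data is fragmented exactly as the
    slide warns: examples with prev_tag="NNP" never inform the "NN" model.
    """
    by_prev = defaultdict(list)
    for prev_tag, feats, tag in examples:
        by_prev[prev_tag].append((feats, tag))
    return {prev: make_classifier(data) for prev, data in by_prev.items()}

# Toy stand-in for a maxent learner: remember the most common output tag
# (ties broken by first seen).
majority = lambda data: Counter(tag for _, tag in data).most_common(1)[0][0]
models = train_memm([("<S>", {"cap": 1}, "B"), ("B", {"cap": 1}, "I"), ("B", {}, "O")],
                    majority)
print(models)  # one "model" per previous tag
```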

  24. MEMM task: FAQ parsing

  25. MEMM features

  26. MEMMs

  27. Conditional Random Fields

  28. Implications of the MEMM model
  • Does this do what we want?
  • Q: does Y[i-1] depend on X[i+1]?
    • “A node is conditionally independent of its non-descendants given its parents”
  • Q: what is Y[0] for the sentence “Qbbzzt of America Inc announced layoffs today in …”?

  29. Inference for MXPOST [Figure: B/I/O lattice over “When will prof Cohen post the notes …”.] (Approximate view): find the best path; weights are now on arcs from state to state.

  30. Inference for MXPOST [Figure: same B/I/O lattice.] More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed.

  31. Label Bias Problem
  • Consider this MEMM, and enough training data to perfectly model it: [Figure: FSA with states 0–5; path 0→1→2→3 reads “rib”, path 0→4→5→3 reads “rob”.]
  Pr(0123|rib) = 1, Pr(0453|rob) = 1
  Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
  Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’ = 0.5 * 1 * 1
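
A tiny numeric check of the rib/rob example. The states and 0.5 values come from the slide; the code framing (hard-coding the observation-independent transition tables) is mine:

```python
# Locally normalized transitions: from each state, probabilities over next
# states sum to 1 no matter what the observation is -- that is the problem.
trans = {
    0: {1: 0.5, 4: 0.5},   # both branches seen equally often in training
    1: {2: 1.0},           # a single outgoing arc: the observation is ignored
    2: {3: 1.0},
    4: {5: 1.0},
    5: {3: 1.0},
}

def path_prob(path, word):
    # `word` is deliberately unused: with one outgoing arc per state,
    # local normalization cannot let the observation change anything.
    p = 1.0
    for s, s_next in zip(path, path[1:]):
        p *= trans[s][s_next]
    return p

for word in ("rib", "rob"):
    print(word, path_prob([0, 1, 2, 3], word), path_prob([0, 4, 5, 3], word))
# Both paths score 0.5 for both words: the observations never break the tie.
```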

  32. How important is label bias? • Could be avoided in this case by changing structure: • Our models are always wrong – is this “wrongness” a problem? • See Klein & Manning’s paper for more on this….

  33. Another view of label bias [Sha & Pereira] So what’s the alternative?

  34. Inference for MXPOST [Figure: B/I/O lattice over “When will prof Cohen post the notes …”.] More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed.

  35. Another max-flow scheme [Figure: same B/I/O lattice.] More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed.

  36. Another max-flow scheme: MRFs [Figure: B/I/O lattice over “When will prof Cohen post the notes …”.]
  • Goal is to learn how to weight edges in the graph:
  • weight(yi, yi+1) = 2*[(yi = B or I) and isCap(xi)] + 1*[yi = B and isFirstName(xi)] – 5*[yi+1 ≠ B and isLower(xi) and isUpper(xi+1)]
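
The slide's hand-written edge weights, transcribed as a small scoring function; the feature predicates and the stand-in gazetteer are placeholders for illustration:

```python
def is_cap(w):   return w[:1].isupper()
def is_lower(w): return w.islower()

FIRST_NAMES = {"William", "Cohen"}          # stand-in gazetteer, purely illustrative
def is_first_name(w): return w in FIRST_NAMES

def edge_weight(y_i, y_next, x_i, x_next):
    """Weighted sum of hand-picked edge features, mirroring the slide's example."""
    w = 0.0
    if y_i in ("B", "I") and is_cap(x_i):
        w += 2.0
    if y_i == "B" and is_first_name(x_i):
        w += 1.0
    if y_next != "B" and is_lower(x_i) and is_cap(x_next):
        w -= 5.0
    return w

print(edge_weight("B", "I", "Cohen", "post"))
```

Learning then means fitting these per-feature weights rather than writing them by hand, which is what the next slide says.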

  37. Another max-flow scheme: MRFs [Figure: same B/I/O lattice.] Find the total flow to each node; weights are now on edges from state to state. Goal is to learn how to weight edges in the graph, given features from the examples.

  38. CRFs vs MEMMs
  • MEMMs: sequence classification f: x→y is reduced to many cases of ordinary classification, f: xi→yi, combined with Viterbi or beam search.
  • CRFs: sequence classification f: x→y is done by converting x, Y to an MRF, then using “flow” computations on the MRF to compute the best y|x.
  [Figure: side-by-side graphical models over x1 … x6 and y1 … y6: the MEMM with local factors Pr(Y|x2,y1), Pr(Y|x4,y3), Pr(Y|x5,y5), …; the MRF with potentials φ(Y1,Y2), φ(Y2,Y3), ….]

  39. The math: Review of maxent

  40. Review of maxent/MEMM/CMMs We know how to compute this.

  41. Details on CMMs

  42. From CMMs to CRFs • Recall why we’re unhappy: we don’t want local normalization • New model • How to compute this?
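
Written out, the contrast the slide is after (generic linear-chain notation; the standard form, not copied from the slide):

```latex
P_{\mathrm{CMM}}(y \mid x) \;=\; \prod_{t}
  \frac{\exp\!\big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\big)}{Z_t(y_{t-1}, x)}
\qquad\text{vs.}\qquad
P_{\mathrm{CRF}}(y \mid x) \;=\; \frac{1}{Z(x)} \prod_{t}
  \exp\!\big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\big)
```

Dropping the per-position normalizers Z_t in favor of a single global Z(x) is what removes local normalization; the price is that Z(x) sums over all label sequences, which is the “how to compute this?” question.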

  43. What’s the new model look like? What’s independent? If fi is HMM-like and depends only on xj,yj or yj,yj-1 … [Figure: chain-structured graph over x1, x2, x3 and y1, y2, y3.]

  44. What’s the new model look like? What’s independent now? [Figure: y1, y2, y3 form a chain, all connected to the single observation node x.]

  45. CRF learning – from Sha & Pereira

  46. CRF learning – from Sha & Pereira

  47. CRF learning – from Sha & Pereira • Something like forward-backward • Idea: • Define a matrix of y,y’ “affinities” at stage i • Mi[y,y’] = “unnormalized probability” of a transition from y to y’ at stage i • Mi * Mi+1 = “unnormalized probability” of any path through stages i and i+1
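
A small sketch of the matrix idea on this slide: build per-position matrices of unnormalized transition scores and chain them, forward-backward style, to get the partition function. The feature scoring function is a placeholder, not Sha & Pereira's code:

```python
import numpy as np

def partition_function(score, words, states):
    """Z(x) via products of per-position transition matrices M_i[y, y'].

    `score(y, y_next, word)` should return an unnormalized log score
    (a placeholder here); M_i[y, y'] = exp(score), as described on the slide.
    """
    n = len(states)
    alpha = np.ones(n)                      # uniform start "flow"
    for w in words:
        M = np.array([[np.exp(score(y, y2, w)) for y2 in states] for y in states])
        alpha = alpha @ M                   # alpha_i = alpha_{i-1} * M_i
    return alpha.sum()                      # total unnormalized flow = Z(x)

Z = partition_function(lambda y, y2, w: 0.1 if y2 == "O" else 0.0,
                       "When will prof Cohen".split(), ["B", "I", "O"])
print(Z)
```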

  48. [Figure: two graphical models side by side, each over labels y1, y2, y3 and observation x.]

  49. Forward-backward ideas [Figure: trellis with states “name” and “nonName” at three positions; edges labeled a–h.]

  50. CRF learning – from Sha & Pereira
