
Hidden Markov Models (HMMs) for Information Extraction

Presentation Transcript


  1. Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454

  2. Administrivia Group meetings next week. Feel free to revise proposals through the weekend.

  3. What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… IE output:
  NAME              TITLE     ORGANIZATION
  Bill Gates        CEO       Microsoft
  Bill Veghte       VP        Microsoft
  Richard Stallman  founder   Free Soft..
  Slides from Cohen & McCallum

  4. Landscape of IE Techniques: Models A spectrum of models, each illustrated on the sentence “Abraham Lincoln was born in Kentucky.”: Lexicons (member? Alabama, Alaska, … Wisconsin, Wyoming); Classify Pre-segmented Candidates (a classifier asks “which class?”); Sliding Window (classify each window, trying alternate window sizes); Boundary Models (BEGIN/END classifiers); Finite State Machines (most likely state sequence?); Context Free Grammars (most likely parse? NNP NNP V V P NP, with PP, VP, NP, S). Each model can capture words, formatting, or both. Slides from Cohen & McCallum

  5. Finite State Models Generative directed models: Naïve Bayes (single label) → HMMs (sequence) → general directed graphs. Conditional counterparts: Logistic Regression (single label) → Linear-chain CRFs (sequence) → General CRFs (general graphs).

  6. Warning Graphical models add another set of arcs between nodes, where the arcs mean something completely different, which is confusing. Skip for 454. New slides are too abstract.

  7. Recap: Naïve Bayes Classifier Hidden state y (Spam?), with causal (probabilistic) dependencies P(xi | y=spam), P(xi | y≠spam) to the observable Boolean random variables x1 x2 x3 (Nigeria? Widow? CSE 454?).

  8. Recap: Naïve Bayes Assumption: features are independent given the label. Generative classifier: model the joint distribution p(x,y). Inference. Learning: counting. Can we use it for IE directly? Labels of neighboring words are dependent! The article appeared in the Seattle Times. city? Need to consider the sequence! (features: length, capitalization, suffix)

  9. Hidden Markov Models Generative Sequence Model. 2 assumptions make the joint distribution tractable: 1. Each state depends only on its immediate predecessor. 2. Each observation depends only on the current state. Finite state model: a state sequence y1 y2 … y8 (labels such as person, location, other) generates an observation sequence x1 x2 … x8 (“Yesterday Pedro …”). The graphical model has transition arcs between states and emission arcs from states to observations.

  10. Hidden Markov Models Generative Sequence Model. Model parameters: start state probabilities, transition probabilities, observation probabilities. (Same finite-state picture as the previous slide: a state sequence y1 … y8 over labels like person, location, other, emitting observations x1 … x8.)

  11. HMM Formally Set of states: {Si}. Set of possible observations: {xi}. Probability of initial state: πi = P(y1 = Si). Transition probabilities: aij = P(yt+1 = Sj | yt = Si). Emission probabilities: bi(x) = P(xt = x | yt = Si).

  12. Example: Dishonest Casino Casino has two dice: Fair die P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. Loaded die P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2. Casino player switches dice approx once every 20 turns. Game: You bet $1. You roll (always with a fair die). Casino player rolls (maybe with fair die, maybe with loaded die). Highest number wins $2. Slides from Serafim Batzoglou

  13. The dishonest casino model Two states, FAIR and LOADED; P(stay in the same state) = 0.95, P(switch) = 0.05. Emissions: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2. Slides from Serafim Batzoglou
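The casino model above can be written down as plain parameter arrays; a minimal sketch in Python/NumPy (the state ordering and the uniform start distribution are my assumptions, not stated on the slide):

```python
import numpy as np

# The dishonest-casino HMM as parameter arrays.
# States: 0 = FAIR, 1 = LOADED.  Observations: die faces 1..6 (indices 0..5).
start = np.array([0.5, 0.5])              # assumed uniform start (not on the slide)
trans = np.array([[0.95, 0.05],           # P(next state | FAIR)
                  [0.05, 0.95]])          # P(next state | LOADED)
emit  = np.array([[1/6] * 6,              # fair die: every face 1/6
                  [1/10] * 5 + [1/2]])    # loaded die: a 6 comes up half the time

# Sanity checks: every row must be a probability distribution.
assert np.allclose(trans.sum(axis=1), 1.0)
assert np.allclose(emit.sum(axis=1), 1.0)
```

The same three arrays (start, transition, emission) are exactly the parameters listed on slide 10.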

  14. IE with Hidden Markov Models Given a sequence of observations: Yesterday Pedro Domingos spoke this example sentence. and a trained HMM (with person name, location name, and background states), find the most likely state sequence (Viterbi): Yesterday Pedro Domingos spoke this example sentence. Any words generated by the designated “person name” state are extracted as a person name: Person name: Pedro Domingos Slide by Cohen & McCallum

  15. IE with Hidden Markov Models For sparse extraction tasks: Separate HMM for each type of target. Each HMM should: model the entire document; consist of target and non-target states; not necessarily be fully connected. Slide by Okan Basegmez

  16. Or … Combined HMM Example – Research Paper Headers Slide by Okan Basegmez

  17. HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”] Task: Named Entity Extraction, with states for Person, Org, five other name classes, and Other, plus start-of-sentence and end-of-sentence. Train on ~500k words of newswire text. Results (F1 by case and language): Mixed English 93%; Upper English 91%; Mixed Spanish 90%. Slide adapted from Cohen & McCallum

  18. Finite State Model vs. Path The same model viewed two ways: as a finite-state diagram (Person, Org, five other name classes, Other, with start-of-sentence and end-of-sentence states) and as a path, a state sequence y1 y2 y3 … over the observation sequence x1 x2 x3 …

  19. Question #1 – Evaluation GIVEN A sequence of observations x1 x2 x3 x4 … xN A trained HMM θ = (π, A, B) QUESTION How likely is this sequence, given our HMM? P(x | θ) Why do we care? We need it during learning, to choose among competing models!

  20. Question #2 - Decoding GIVEN A sequence of observations x1 x2 x3 x4 … xN A trained HMM θ = (π, A, B) QUESTION How do we choose the corresponding parse (state sequence) y1 y2 y3 y4 … yN, which “best” explains x1 x2 x3 x4 … xN? There are several reasonable optimality criteria: single optimal sequence, average statistics for individual states, …

  21. A parse of a sequence Given a sequence x = x1 … xN, a parse of x is a sequence of states y = y1, …, yN (e.g. person, other, location, drawn as one path through a trellis with K states per time step). Slide by Serafim Batzoglou

  22. Question #3 - Learning GIVEN A sequence of observations x1 x2 x3 x4 … xN QUESTION How do we learn the model parameters θ = (π, A, B) which maximize P(x | θ)?

  23. Three Questions Evaluation Forward algorithm (Could also go other direction) Decoding Viterbi algorithm Learning Baum-Welch Algorithm (aka “forward-backward”) A kind of EM (expectation maximization)

  24. A Solution to #1: Evaluation Given observations x = x1 … xN and HMM θ, what is P(x)? Enumerate every possible state sequence y = y1 … yN. Multiply the probability of x given that particular y by the probability of that particular y, and sum over all possible state sequences. Cost: about 2N multiplications per sequence, and K^N state sequences for K states. Even for a small HMM with K = 10 and N = 10, that is 10 billion sequences!
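The brute-force enumeration described above can be written directly; a hypothetical sketch (function and variable names are mine), exponential in the sequence length:

```python
import itertools
import numpy as np

def brute_force_evaluate(x, start, trans, emit):
    """P(x): sum of P(x, y) over every state sequence y -- K^N sequences."""
    K, N = trans.shape[0], len(x)
    total = 0.0
    for y in itertools.product(range(K), repeat=N):
        p = start[y[0]] * emit[y[0], x[0]]              # P(y1) * P(x1 | y1)
        for t in range(1, N):
            p *= trans[y[t-1], y[t]] * emit[y[t], x[t]]  # P(yt | yt-1) * P(xt | yt)
        total += p
    return total

# Tiny two-state toy model: for a length-1 sequence, P(x) is a weighted average.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit  = np.array([[0.5, 0.5], [0.1, 0.9]])
p = brute_force_evaluate([0], start, trans, emit)        # 0.6*0.5 + 0.4*0.1 = 0.34
```

This is only usable for toy sizes; the dynamic-programming solution on the next slides computes the same sum in polynomial time.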

  25. Solution to #1: Evaluation Use Dynamic Programming: Define the forward variable: the probability that at time t the state is Si and the partial observation sequence x = x1 … xt has been emitted.

  26. Forward Variable αt(i) Prob that the state at time t was Si and the partial observation sequence x = x1 … xt has been seen. (Pictured as a trellis: states 1 … K, e.g. person, other, location, against observations x1 x2 x3 … xt.)

  27. Forward Variable αt(i) Prob that the state at time t was Si and the partial observation sequence x = x1 … xt has been seen. (Same trellis, with the node for state Si at time t highlighted.)

  28. Solution to #1: Evaluation Use Dynamic Programming: Cache and reuse inner sums. Define forward variables αt(i): the probability that at time t the state is yt = Si and the partial observation sequence x = x1 … xt has been emitted.

  29. The Forward Algorithm

  30. The Forward Algorithm INITIALIZATION α1(i) = πi bi(x1) INDUCTION αt+1(j) = [Σi αt(i) aij] bj(xt+1) TERMINATION P(x) = Σi αN(i) Time: O(K²N) Space: O(KN) K = |S| #states N length of sequence
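The three steps above translate almost line for line into NumPy; a sketch (function and variable names are mine):

```python
import numpy as np

def forward(x, start, trans, emit):
    """Forward algorithm: alpha[t, i] = P(x1..xt, y_t = S_i).
    Returns the alpha table and P(x).  O(K^2 N) time, O(K N) space."""
    N, K = len(x), trans.shape[0]
    alpha = np.zeros((N, K))
    alpha[0] = start * emit[:, x[0]]                     # initialization
    for t in range(1, N):                                # induction
        alpha[t] = (alpha[t-1] @ trans) * emit[:, x[t]]
    return alpha, alpha[-1].sum()                        # termination

# Toy two-state model; for a length-1 sequence P(x) is just a weighted average.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit  = np.array([[0.5, 0.5], [0.1, 0.9]])
_, p = forward([0], start, trans, emit)                  # 0.6*0.5 + 0.4*0.1 = 0.34
```

Each induction step is one matrix-vector product, which is where the O(K²) per-step cost comes from.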

  31. The Backward Algorithm
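The backward pass mirrors the forward one, running the recursion from the end of the sequence and accumulating suffix probabilities; a sketch under the same notation (names are mine):

```python
import numpy as np

def backward(x, trans, emit):
    """Backward algorithm: beta[t, i] = P(x_{t+1}..x_N | y_t = S_i)."""
    N, K = len(x), trans.shape[0]
    beta = np.ones((N, K))                               # base case: beta[N-1] = 1
    for t in range(N - 2, -1, -1):
        beta[t] = trans @ (emit[:, x[t+1]] * beta[t+1])
    return beta

# P(x) can be recovered from the first column: sum_i pi_i * b_i(x1) * beta[0, i].
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit  = np.array([[0.5, 0.5], [0.1, 0.9]])
x = [0, 1]
beta = backward(x, trans, emit)
p = (start * emit[:, x[0]] * beta[0]).sum()
```

Forward and backward variables together are what Baum-Welch (slide 23's "forward-backward") uses to compute the expected state and transition counts.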

  32. Three Questions Evaluation Forward algorithm (Could also go other direction) Decoding Viterbi algorithm Learning Baum-Welch Algorithm (aka “forward-backward”) A kind of EM (expectation maximization)

  33. Need new slide! Looks like this but deltas

  34. Solution to #2 - Decoding Given x = x1 … xN and HMM θ, what is the “best” parse y1 … yN? Several optimality criteria: 1. States which are individually most likely. 2. Single best state sequence: we want to find the sequence y1 … yN such that P(x,y) is maximized: y* = argmaxy P(x, y) Again, we can use dynamic programming!

  35. The Viterbi Algorithm DEFINE δt(i) = max over y1 … yt−1 of P(y1 … yt−1, x1 … xt, yt = Si) INITIALIZATION δ1(i) = πi bi(x1) INDUCTION δt+1(j) = [maxi δt(i) aij] bj(xt+1), recording a backpointer to the maximizing i TERMINATION P(x, y*) = maxi δN(i) Backtracking through the pointers recovers the state sequence y*

  36. The Viterbi Algorithm Fill the trellis of δt(i) values column by column, taking a max over predecessor states at each node. Time: O(K²N) Space: O(KN) Linear in the length of the sequence. Remember: δt(i) = probability of the most likely state sequence for x1 … xt ending in state Si Slides from Serafim Batzoglou
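The Viterbi recursion is the forward algorithm with the sum replaced by a max plus backpointers; a sketch (names are mine), tried on the dishonest-casino model from slide 13:

```python
import numpy as np

def viterbi(x, start, trans, emit):
    """Most likely state sequence y* = argmax_y P(x, y), via the delta recursion."""
    N, K = len(x), trans.shape[0]
    delta = np.zeros((N, K))                 # delta[t, i]: best prob of a path ending in S_i
    back = np.zeros((N, K), dtype=int)       # backpointer to the maximizing predecessor
    delta[0] = start * emit[:, x[0]]
    for t in range(1, N):
        scores = delta[t-1][:, None] * trans # scores[i, j]: end in i at t-1, move to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, x[t]]
    y = [int(delta[-1].argmax())]            # backtrack from the best final state
    for t in range(N - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1], float(delta[-1].max())

# Dishonest casino: a run of sixes is best explained by the LOADED state (index 1).
start = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
emit  = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
path, _ = viterbi([5] * 6, start, trans, emit)           # die face 6 -> index 5
```

A varied roll such as 1 2 3 4 5 6 instead decodes as all FAIR, since the loaded die makes the low faces unlikely and switching costs 0.05.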

  37. The Viterbi Algorithm Pedro Domingos

  38. Three Questions Evaluation Forward algorithm (Could also go other direction) Decoding Viterbi algorithm Learning Baum-Welch Algorithm (aka “forward-backward”) A kind of EM (expectation maximization)

  39. Solution to #3 - Learning Given x1 … xN, how do we learn θ = (π, A, B) to maximize P(x)? Unfortunately, there is no known way to analytically find a global maximum θ* such that θ* = argmax P(x | θ) But it is possible to find a local maximum; given an initial model θ, we can always find a model θ’ such that P(x | θ’) ≥ P(x | θ)

  40. Chicken & Egg Problem If we knew the actual sequence of states, it would be easy to learn the transition and emission probabilities. But we can’t observe states, so we don’t! If we knew the transition & emission probabilities, then it’d be easy to estimate the sequence of states (Viterbi). But we don’t know them! Slide by Daniel S. Weld

  41. Simplest Version Mixture of two distributions. Know: form of distribution & variance. Just need the mean of each distribution. .01 .03 .05 .07 .09 Slide by Daniel S. Weld

  42. Input Looks Like .01 .03 .05 .07 .09 Slide by Daniel S. Weld

  43. We Want to Predict ? .01 .03 .05 .07 .09 Slide by Daniel S. Weld

  44. Chicken & Egg Note that coloring instances would be easy if we knew the Gaussians… .01 .03 .05 .07 .09 Slide by Daniel S. Weld

  45. Chicken & Egg And finding the Gaussians would be easy if we knew the coloring. .01 .03 .05 .07 .09 Slide by Daniel S. Weld

  46. Expectation Maximization (EM) Pretend we do know the parameters. Initialize randomly: set μ1 = ?; μ2 = ? .01 .03 .05 .07 .09 Slide by Daniel S. Weld

  47. Expectation Maximization (EM) Pretend we do know the parameters. Initialize randomly. [E step] Compute the probability of each instance having each possible value of the hidden variable. .01 .03 .05 .07 .09 Slide by Daniel S. Weld

  48. Expectation Maximization (EM) Pretend we do know the parameters. Initialize randomly. [E step] Compute the probability of each instance having each possible value of the hidden variable. .01 .03 .05 .07 .09 Slide by Daniel S. Weld

  49. Expectation Maximization (EM) Pretend we do know the parameters. Initialize randomly. [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. .01 .03 .05 .07 .09 Slide by Daniel S. Weld
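The E/M loop on these slides can be sketched for an equal-weight mixture of two 1-D Gaussians with known, shared variance; the data points, initial means, and σ below are illustrative choices of mine, not values from the slides:

```python
import numpy as np

def em_two_gaussians(xs, mu1, mu2, sigma, iters=25):
    """EM for a 50/50 mixture of two 1-D Gaussians with known, shared sigma."""
    xs = np.asarray(xs, dtype=float)
    for _ in range(iters):
        # E step: responsibility of component 1 for each instance.
        # (The normalizing constant 1/(sigma*sqrt(2*pi)) cancels in the ratio.)
        d1 = np.exp(-0.5 * ((xs - mu1) / sigma) ** 2)
        d2 = np.exp(-0.5 * ((xs - mu2) / sigma) ** 2)
        r = d1 / (d1 + d2)
        # M step: each instance counts fractionally toward both means.
        mu1 = (r * xs).sum() / r.sum()
        mu2 = ((1 - r) * xs).sum() / (1 - r).sum()
    return mu1, mu2

# Points clustered near .03 and .07 on the slides' axis; illustrative values.
xs = [0.029, 0.030, 0.031, 0.069, 0.070, 0.071]
mu1, mu2 = em_two_gaussians(xs, mu1=0.02, mu2=0.08, sigma=0.01)
```

Baum-Welch for HMMs has the same shape: the E step uses forward-backward to get expected state/transition counts, and the M step re-normalizes them into new parameters.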

  50. ML Mean of Single Gaussian μML = argminμ Σi (xi − μ)² .01 .03 .05 .07 .09 Slide by Daniel S. Weld
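Setting the derivative of this squared-error objective to zero recovers the sample mean, which is why the M step above is a (weighted) average:

```latex
\mu_{ML} = \arg\min_{\mu} \sum_i (x_i - \mu)^2
\quad\Longrightarrow\quad
\frac{d}{d\mu} \sum_i (x_i - \mu)^2 = -2 \sum_i (x_i - \mu) = 0
\quad\Longrightarrow\quad
\mu_{ML} = \frac{1}{n} \sum_i x_i
```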
