This document provides a comprehensive overview of Hidden Markov Models (HMMs), focusing on their mathematical foundations and practical applications. It includes detailed explanations of state transition and emission probabilities, inference methods, and model fit calculations. Key questions addressed include how to compute the likelihood of observations, infer hidden states, and estimate model probabilities. The document references the approaches detailed in Manning and Schütze (2000), emphasizing efficient algorithms for practical computation in natural language processing contexts.
Hidden Markov Models
Richard Golden (following the approach of Chapter 9 of Manning and Schütze, 2000)
REVISION DATE: April 15 (Tuesday), 2003
VMM (Visible Markov Model)
Figure: two-state transition diagram with start state S0 and states S1, S2; transition probabilities a11=0.7, a12=0.3, a21=0.5, a22=0.5.
HMM Notation
• State Sequence Variables: X1, …, XT+1
• Output Sequence Variables: O1, …, OT
• Set of Hidden States: S1, …, SN
• Output Alphabet: K1, …, KM
• Initial State Probabilities (π1, …, πN): πi = p(X1 = Si), i = 1, …, N
• State Transition Probabilities (aij), i, j ∈ {1, …, N}: aij = p(Xt+1 = Sj | Xt = Si), t = 1, …, T
• Emission Probabilities (bkm), k ∈ {1, …, N}, m ∈ {1, …, M}: bkm = p(Ot = Km | Xt = Sk), i.e., the probability that hidden state Sk emits symbol Km, t = 1, …, T
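To make the notation concrete, here is a minimal sketch (Python/NumPy; the variable names are mine, not from the slides) encoding the two-state example used throughout this deck:

```python
import numpy as np

# Example HMM from the slides: states S1, S2; output alphabet K1, K2, K3.
pi = np.array([1.0, 0.0])            # initial state probabilities (pi_1 = 1, pi_2 = 0)
A = np.array([[0.7, 0.3],            # a11, a12
              [0.5, 0.5]])           # a21, a22
B = np.array([[0.6, 0.1, 0.3],       # b11, b12, b13  (emissions from S1)
              [0.1, 0.7, 0.2]])      # b21, b22, b23  (emissions from S2)
obs = [2, 1, 0]                      # observation sequence K3, K2, K1 (0-based symbol indices)
```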
HMM State-Emission Representation
Figure: two hidden states S1 and S2 (start state S0) with initial probabilities π1=1, π2=0; transition probabilities a11=0.7, a12=0.3, a21=0.5, a22=0.5; and emission probabilities b11=0.6, b12=0.1, b13=0.3 from S1 and b21=0.1, b22=0.7, b23=0.2 from S2 over the output alphabet K1, K2, K3. Here each emission arrow comes off a state.
Arc-Emission Representation • Note that sometimes a Hidden Markov Model is represented by having the emission arrows come off the arcs • In this situation you would have many more emission arrows because there are many more arcs… • But the transition and emission probabilities are the same…it just takes longer to draw on your PowerPoint presentation (self-conscious presentation)
Fundamental Questions for HMMs
• MODEL FIT
• How can we compute the likelihood of the observations and hidden states given known emission and transition probabilities? Compute: p("Dog"/NOUN, "is"/VERB, "Good"/ADJ | {aij}, {bkm})
• How can we compute the likelihood of the observations given known emission and transition probabilities? Compute: p("Dog", "is", "Good" | {aij}, {bkm})
Fundamental Questions for HMMs
• INFERENCE
• How can we infer the sequence of hidden states given the observations and the known emission and transition probabilities?
• Maximize: p("Dog"/?, "is"/?, "Good"/? | {aij}, {bkm}) with respect to the unknown labels
Fundamental Questions for HMMs
• LEARNING
• How can we estimate the emission and transition probabilities given observations, assuming that hidden states are observable during the learning process?
• How can we estimate the emission and transition probabilities given observations only?
Direct Calculation of Model Fit (note use of "Markov" Assumptions), Part 1
Follows directly from the definition of a conditional probability: p(o, x) = p(o | x) p(x)
EXAMPLE: p("Dog"/NOUN, "is"/VERB, "Good"/ADJ | {aij}, {bkm}) = p("Dog", "is", "Good" | NOUN, VERB, ADJ, {aij}, {bkm}) × p(NOUN, VERB, ADJ | {aij}, {bkm})
Direct Calculation of Likelihood of Labeled Observations (note use of "Markov" Assumptions), Part 2
Under the Markov assumptions, the likelihood of a labeled sequence factors as
p(O1, …, OT, X1, …, XT | {aij}, {bkm}) = πX1 b(X1, O1) aX1,X2 b(X2, O2) ⋯ aXT-1,XT b(XT, OT),
where b(Xt, Ot) is the probability that state Xt emits the observed symbol Ot.
EXAMPLE: Compute p("Dog"/NOUN, "is"/VERB, "good"/ADJ | {aij}, {bkm})
Graphical Algorithm Representation of Direct Calculation of Likelihood of Observations and Hidden States (not hard!)
Figure: the same two-state HMM as above (π1=1, π2=0; transitions a11=0.7, a12=0.3, a21=0.5, a22=0.5; emissions b11=0.6, b12=0.1, b13=0.3, b21=0.1, b22=0.7, b23=0.2).
(Note that "Good" here is the name of the dog, so it is a NOUN!)
The likelihood of a particular "labeled" sequence of observations (e.g., p("Dog"/NOUN, "is"/VERB, "Good"/NOUN | {aij}, {bkm})) may be computed with the "direct calculation" method using the following simple graphical algorithm. Specifically, p(K3/S1, K2/S2, K1/S1 | {aij}, {bkm}) = π1 b13 a12 b22 a21 b11.
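The same calculation in code, as a sketch that reuses the pi, A, B arrays defined earlier (0-based indices, so B[0, 2] is b13):

```python
# Direct calculation of p(K3/S1, K2/S2, K1/S1 | {aij}, {bkm}) = pi_1 * b13 * a12 * b22 * a21 * b11
likelihood = pi[0] * B[0, 2] * A[0, 1] * B[1, 1] * A[1, 0] * B[0, 0]
print(likelihood)  # 1 * 0.3 * 0.3 * 0.7 * 0.5 * 0.6 = 0.0189
```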
Extension to the case where the likelihood of the observations given the parameters is needed (e.g., p("Dog", "is", "good" | {aij}, {bkm})).
KILLER EQUATION!!!!! Sum the labeled likelihood over every possible hidden-state sequence:
p(O1, …, OT | {aij}, {bkm}) = Σ over all X1, …, XT of πX1 b(X1, O1) aX1,X2 b(X2, O2) ⋯ aXT-1,XT b(XT, OT)
Efficiency of Calculations is Important (e.g., Model-Fit)
• Assume 1 multiplication per microsecond.
• Assume an N=1000 word vocabulary and a T=7 word sentence.
• (2T+1)·N^(T+1) multiplications by "direct calculation" yields (2(7)+1)·(1000)^(7+1), which is about 475,000 million years of computer time!!!
• 2N^2·T multiplications using the "forward method" is about 14 seconds of computer time!!!
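A quick sanity check of this arithmetic in Python (assuming, as above, one multiplication per microsecond):

```python
# Cost comparison from the slide above.
N, T = 1000, 7
direct_mults = (2 * T + 1) * N ** (T + 1)      # direct calculation over all N**T labelings
forward_mults = 2 * N ** 2 * T                 # forward method

seconds_per_year = 365.25 * 24 * 3600
print(direct_mults * 1e-6 / seconds_per_year)  # about 4.8e11 years ("475,000 million years")
print(forward_mults * 1e-6)                    # about 14 seconds
```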
Forward, Backward, and Viterbi Calculations • Forward calculation methods are thus very useful. • Forward, Backward, and Viterbi Calculations will now be discussed.
Forward Calculations – Overview
Figure: the forward trellis for the three-word example, unrolled over TIME 2, TIME 3, and TIME 4, with states S1 and S2 at each time (start state S0, π1=1, π2=0), transition probabilities a11=0.7, a12=0.3, a21=0.5, a22=0.5, and emissions over K1, K2, K3.
Forward Calculations – Time 2 (1 word example)
Figure: the first trellis step from S0 into S1 and S2 at TIME 2, where the first word K3 is emitted (b13=0.3 from S1, b23=0.2 from S2).
NOTE: α1(2) + α2(2) is the likelihood of the observation/word "K3" in this "1 word example".
Forward Calculations – Time 3 (2 word example)
Figure: the trellis extended to TIME 3 to account for the second word; α1(3) and α2(3) are computed by the emit-and-jump recursion.
Forward Calculations – Time 4 (3 word example)
Figure: the full trellis through TIME 4, covering all three observed words.
Forward Calculation of Likelihood Function ("emit and jump")
αi(1) = πi, i = 1, …, N
αj(t+1) = Σi αi(t) b(Si, Ot) aij ("emit Ot, then jump"), t = 1, …, T
p(O1, …, OT | {aij}, {bkm}) = Σi αi(T+1)
where b(Si, Ot) is the probability that state Si emits the observed symbol Ot.
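A minimal sketch of this recursion in Python/NumPy, reusing the pi, A, B, obs arrays defined earlier (the function name is mine):

```python
def forward_likelihood(pi, A, B, obs):
    """p(O1..OT) by the forward ("emit and jump") recursion: O(N^2 T) multiplications."""
    alpha = pi.copy()                      # alpha_i(1) = pi_i
    for o in obs:
        alpha = (alpha * B[:, o]) @ A      # emit symbol o from each state, then jump
    return float(alpha.sum())              # sum_i alpha_i(T+1)

# With the example parameters and obs = [2, 1, 0] (K3, K2, K1) this gives 0.0315.
```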
Backward Calculations – Overview
Figure: the same three-word trellis (TIME 2 through TIME 4, states S1 and S2, start state S0), now traversed from right to left.
Backward Calculations – Time 4
Figure: the rightmost trellis column at TIME 4, with emission probabilities b11=0.6 and b21=0.1 for the final word K1.
Backward Calculations – Time 3
Figure: the backward values at TIME 3, computed from those at TIME 4.
Backward Calculations – Time 2
Figure: the backward values at TIME 2, computed from TIME 3 and TIME 4 (emissions b12=0.1, b22=0.7, b13=0.3, b23=0.2; transitions a11=0.7, a12=0.3, a21=0.5, a22=0.5).
NOTE: β1(2) + β2(2) is the likelihood of the observation/word sequence "K2, K1" in this "2 word example".
Backward Calculations – Time 1
Figure: the backward pass completed back to the start state S0 (initial probabilities π1=1, π2=0), covering the full trellis.
Backward Calculation of Likelihood Function ("EMIT AND JUMP")
βi(T+1) = 1, i = 1, …, N
βi(t) = b(Si, Ot) Σj aij βj(t+1) ("emit Ot, then jump"), t = T, …, 1
p(O1, …, OT | {aij}, {bkm}) = Σi πi βi(1)
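A matching sketch of the backward recursion (again reusing pi, A, B, obs; the function name is mine):

```python
import numpy as np

def backward_likelihood(pi, A, B, obs):
    """p(O1..OT) by the backward recursion, mirroring the forward 'emit and jump' pass."""
    beta = np.ones(len(pi))                # beta_i(T+1) = 1
    for o in reversed(obs):
        beta = B[:, o] * (A @ beta)        # emit symbol o from state i, then jump
    return float(pi @ beta)                # sum_i pi_i * beta_i(1)

# With the example parameters and obs = [2, 1, 0] this also gives 0.0315,
# matching the forward calculation.
```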
You get the same answer going forward or backward!!
Figure: the forward table and the backward table for the example, side by side.
The Forward-Backward Method
• Note that the forward method computes αi(t) = p(O1, …, Ot-1, Xt = Si | {aij}, {bkm}).
• Note that the backward method computes (t > 1): βi(t) = p(Ot, …, OT | Xt = Si, {aij}, {bkm}).
• We can do the forward-backward method, which computes p(O1, …, OT) using the formula (using any choice of t = 1, …, T+1!): p(O1, …, OT) = Σi αi(t) βi(t).
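A sketch of this identity in code, reusing pi, A, B, obs from earlier (names are mine): the quantity Σi αi(t) βi(t) comes out the same at every time slice t.

```python
import numpy as np

def forward_backward_likelihoods(pi, A, B, obs):
    """Return sum_i alpha_i(t)*beta_i(t) for t = 1..T+1; every entry should equal p(O)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T + 1, N)); beta = np.zeros((T + 1, N))
    alpha[0] = pi                                      # alpha_i(1) = pi_i
    beta[T] = 1.0                                      # beta_i(T+1) = 1
    for t in range(T):
        alpha[t + 1] = (alpha[t] * B[:, obs[t]]) @ A   # emit word t+1, then jump
    for t in reversed(range(T)):
        beta[t] = B[:, obs[t]] * (A @ beta[t + 1])     # emit word t+1, then jump
    return [float(alpha[t] @ beta[t]) for t in range(T + 1)]

# With the example parameters and obs = [2, 1, 0], every entry is 0.0315.
```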
Example Forward-Backward Calculation!
Figure: the forward and backward tables for the example, combined as in the formula above.
Solution to Problem 1
• The "hard part" of the 1st Problem was to find the likelihood of the observations for an HMM.
• We can now do this using either the forward, backward, or forward-backward method.
Solution to Problem 2: Viterbi Algorithm (Computing "Most Probable" Labeling)
• Consider the direct calculation of labeled observations.
• Previously we summed these likelihoods together across all possible labelings to solve the first problem, which was to compute the likelihood of the observations given the parameters (the hard part of HMM Question 1!).
• We solved this problem using the forward or backward method.
• Now we want to compute all possible labelings and their respective likelihoods and pick the labeling whose likelihood is the largest!
EXAMPLE: Compute p("Dog"/NOUN, "is"/VERB, "good"/ADJ | {aij}, {bkm})
Efficiency of Calculations is Important (e.g., Most Likely Labeling Problem)
• Just as in the forward-backward calculations, we can solve the problem of computing the likelihood of every possible one of the N^T labelings efficiently.
• Instead of millions of years of computing time we can solve the problem in several seconds!!
Viterbi Algorithm – Overview (same setup as forward algorithm)
Figure: the same three-word trellis used for the forward algorithm (TIME 2 through TIME 4, states S1 and S2, start state S0).
Forward Calculations – Time 2 (1 word example)
Figure: the first Viterbi trellis step into TIME 2 (π1=1, π2=0; emissions b13=0.3, b23=0.2 for the first word K3).
Backtracking – Time 2 (1 word example)
Figure: backtracking from the best state at TIME 2 in the 1 word example.
Forward Calculations – (2 word example)
Figure: the Viterbi trellis extended to TIME 3 for the 2 word example.
BACKTRACKING – (2 word example)
Figure: backtracking through the TIME 3 and TIME 2 columns to recover the most probable 2 word labeling.
Forward Calculations – Time 4 (3 word example)
Figure: the full Viterbi trellis through TIME 4 for the 3 word example.
Backtracking to Obtain Labeling for 3 word case
Figure: backtracking from the best state at TIME 4 through TIME 3 and TIME 2 to recover the most probable labeling of all three words.
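A compact sketch of the Viterbi pass with backtracking illustrated above (Python/NumPy, reusing pi, A, B, obs; this version attaches each word's emission to the state assigned to that word, which recovers the same most probable labeling as the slides):

```python
def viterbi(pi, A, B, obs):
    """Most probable hidden-state labeling of obs, with its likelihood."""
    delta = pi * B[:, obs[0]]                    # best path probability after the first word
    backpointers = []
    for o in obs[1:]:
        scores = delta[:, None] * A              # scores[i, j]: best path ending in i, then jumping to j
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, o]     # best path into j, emitting the next word
    state = int(delta.argmax())                  # best final state
    path = [state]
    for bp in reversed(backpointers):            # backtracking
        state = int(bp[state])
        path.append(state)
    return list(reversed(path)), float(delta.max())

# With the example parameters and obs = [2, 1, 0] (K3, K2, K1), this returns the labeling
# [0, 1, 0] (S1, S2, S1) with probability 0.0189, the same value as the direct calculation
# of p(K3/S1, K2/S2, K1/S1) on the earlier slide.
```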
Third Fundamental Question: Parameter Estimation
• Make an initial guess for {aij} and {bkm}.
• Compute the probability that one hidden state follows another, given {aij}, {bkm}, and the sequence of observations (computed using the forward-backward algorithm).
• Compute the probability of an observed symbol given a hidden state, given {aij}, {bkm}, and the sequence of observations (computed using the forward-backward algorithm).
• Use these computed probabilities to make an improved guess for {aij} and {bkm}.
• Repeat this process until convergence.
• It can be shown that this algorithm does in fact converge to the correct choice for {aij} and {bkm}, assuming that the initial guess was close enough.
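This iterative procedure is the Baum-Welch (EM) algorithm. Below is a minimal single-sequence sketch of one re-estimation step, using the emit-and-jump forward and backward passes from earlier (names are mine; a practical implementation would add scaling or log-space arithmetic to avoid underflow, and would loop until convergence):

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) re-estimation step for a single observation sequence."""
    T, N = len(obs), len(pi)
    # Forward and backward passes ("emit and jump" convention):
    # alpha[t][i] = p(O1..Ot, X_{t+1} = Si), beta[t][i] = p(O_{t+1}..OT | X_{t+1} = Si)
    alpha = np.zeros((T + 1, N)); beta = np.zeros((T + 1, N))
    alpha[0] = pi
    for t in range(T):
        alpha[t + 1] = (alpha[t] * B[:, obs[t]]) @ A
    beta[T] = 1.0
    for t in reversed(range(T)):
        beta[t] = B[:, obs[t]] * (A @ beta[t + 1])
    likelihood = float(alpha[T].sum())

    # gamma[t][i] = p(state Si emits word t+1 | O); xi[t][i][j] = p(Si then Sj at words t+1, t+2 | O)
    gamma = alpha * beta / likelihood
    xi = np.zeros((T, N, N))
    for t in range(T):
        xi[t] = (alpha[t, :, None] * B[:, obs[t], None] * A * beta[t + 1][None, :]) / likelihood

    # Re-estimated parameters (improved guess).
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:T].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[:T][np.array(obs) == k].sum(axis=0) / gamma[:T].sum(axis=0)
    return new_pi, new_A, new_B, likelihood
```

Each call returns an improved guess for the parameters together with the likelihood of the observations under the current guess; repeating the step until the likelihood stops improving implements the procedure described in the bullets above.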