Hidden Markov Models

Presentation Transcript

### Hidden Markov Models

Richard Golden

(following approach of Chapter 9 of Manning and Schutze, 2000)

REVISION DATE: April 15 (Tuesday), 2003

[Slide diagram: two-state visible Markov model with start state S0 and states S1, S2; transition probabilities a11=0.7, a12=0.3, a21=0.5, a22=0.5.]

VMM (Visible Markov Model)

HMM Notation
• State Sequence Variables: X1, …, XT+1
• Output Sequence Variables: O1, …, OT
• Set of Hidden States: {S1, …, SN}
• Output Alphabet: {K1, …, KM}
• Initial State Probabilities (π1, …, πN): πi = p(X1 = Si), i = 1, …, N
• State Transition Probabilities (aij), i, j ∈ {1, …, N}: aij = p(Xt+1 = Sj | Xt = Si), t = 1, …, T
• Emission Probabilities (bik), i ∈ {1, …, N}, k ∈ {1, …, M}: bik = p(Ot = Kk | Xt = Si), t = 1, …, T
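The notation above can be made concrete with a small sketch. The probability values are taken from the example diagrams used throughout these slides; the numpy array layout is an assumption for illustration:

```python
import numpy as np

# Two hidden states S1, S2 and a three-symbol output alphabet K1, K2, K3,
# with the probabilities from the example HMM in these slides.
pi = np.array([1.0, 0.0])             # pi[i] = p(X1 = S_{i+1})
A  = np.array([[0.7, 0.3],            # A[i, j] = a_ij = p(X_{t+1} = S_j | X_t = S_i)
               [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3],       # B[i, k] = b_ik = p(O_t = K_k | X_t = S_i)
               [0.1, 0.7, 0.2]])

# Each row is a probability distribution, so rows must sum to 1.
assert np.allclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```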


HMM State-Emission Representation
• Note that sometimes a Hidden Markov Model is drawn with the emission arrows coming off the arcs rather than the states
• In that case there are many more emission arrows, because there are many more arcs…
• The transition and emission probabilities are the same either way; it just takes longer to draw on your PowerPoint presentation (self-conscious presentation)

[Slide diagram: the two-state HMM with π1=1, π2=0; transitions a11=0.7, a12=0.3, a21=0.5, a22=0.5; emissions from S1: b11=0.6, b12=0.1, b13=0.3, and from S2: b21=0.1, b22=0.7, b23=0.2.]

Arc-Emission Representation
• (Same model as above, with the emission probabilities attached to the arcs instead of the states.)
Fundamental Questions for HMMs
• MODEL FIT
• How can we compute the likelihood of observations and hidden states given known emission and transition probabilities?

• How can we compute the likelihood of observations given known emission and transition probabilities? p("Dog", "is", "Good" | {aij}, {bkm})
Fundamental Questions for HMMs
• INFERENCE
• How can we infer the sequence of hidden states given the observations and the known emission and transition probabilities?
• Maximize:
• p("Dog"/?, "is"/?, "Good"/? | {aij}, {bkm}) with respect to the unknown labels
Fundamental Questions for HMMs
• LEARNING
• How can we estimate the emission and transition probabilities given observations, assuming that hidden states are observable during the learning process?
• How can we estimate the emission and transition probabilities given observations only?
Direct Calculation of Model Fit (note use of "Markov" Assumptions) Part 1

Follows directly from the definition of a conditional probability: p(O, X) = p(O | X) · p(X). The Markov assumptions then factor the two pieces: p(X) = πx1 · ax1x2 · … · axTxT+1 and p(O | X) = bx1o1 · bx2o2 · … · bxToT.

Direct Calculation of Likelihood of Labeled Observations (note use of "Markov" Assumptions) Part 2

[Slide diagram: the two-state HMM again (π1=1, π2=0; a11=0.7, a12=0.3, a21=0.5, a22=0.5; b11=0.6, b12=0.1, b13=0.3, b21=0.1, b22=0.7, b23=0.2), with outputs K1, K2, K3.]

Graphical Algorithm Representation of Direct Calculation of Likelihood of Observations and Hidden States (not hard!)

Note that "good" is the name of the dog, so it is a noun!

The likelihood of a particular "labeled" sequence of observations (e.g., p("Dog"/NOUN, "is"/VERB, "Good"/NOUN | {aij}, {bkm})) may be computed with the "direct calculation" method using the following simple graphical algorithm.

Specifically, p(K3/S1, K2/S2, K1/S1 | {aij}, {bkm}) = π1 · b13 · a12 · b22 · a21 · b11
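As a quick check, this product can be multiplied out directly; a minimal sketch using the probability values from the slide diagrams:

```python
# Direct calculation of the labeled-sequence likelihood from this slide:
# p(K3/S1, K2/S2, K1/S1 | {aij}, {bkm}) = pi1 * b13 * a12 * b22 * a21 * b11
pi1 = 1.0
a12, a21 = 0.3, 0.5
b13, b22, b11 = 0.3, 0.7, 0.6

p = pi1 * b13 * a12 * b22 * a21 * b11
print(p)   # ≈ 0.0189
```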

Extension to the case where the likelihood of the observations given the parameters is needed (e.g., p("Dog", "is", "good" | {aij}, {bik}))

KILLER EQUATION!!!!! p(O | {aij}, {bik}) = ΣX πx1 · bx1o1 · ax1x2 · bx2o2 · … · bxToT, a sum over all N^T possible hidden-state sequences X.

Efficiency of Calculations is Important (e.g., Model-Fit)
• Assume 1 multiplication per microsecond
• Assume an N = 1000 word vocabulary and a T = 7 word sentence.
• (2T+1) · N^(T+1) multiplications by "direct calculation" yields (2·7+1) · 1000^(7+1) ≈ 1.5×10^25 multiplications, about 475,000 million years of computer time!!!
• 2N²T multiplications using the "forward method" is about 14 seconds of computer time!!!
Forward, Backward, and Viterbi Calculations
• Forward calculation methods are thus very useful.
• Forward, Backward, and Viterbi Calculations will now be discussed.
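The forward recursion can be sketched in code for the example model. This assumes a state-emission convention and 0-based time indexing (rather than the slides' "Time 2" start); the array layout is an illustration, not the slides' own code:

```python
import numpy as np

# Example HMM from these slides
pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3], [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])

def forward(obs):
    """alpha[t, i] = joint likelihood of the first t+1 observations and
    being in state S_{i+1} at step t; obs holds 0-based symbol indices."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]                  # initialize: pi_i * b_{i,o1}
    for t in range(1, len(obs)):
        # alpha_j(t) = (sum_i alpha_i(t-1) * a_ij) * b_{j,o_t}
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# Likelihood of "K3, K2, K1" = sum of the final alphas
alpha = forward([2, 1, 0])
print(alpha[-1].sum())   # ≈ 0.0315
```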

Forward Calculations – Overview

[Slide: trellis diagram unrolling the HMM over Times 2–4, with states S1, S2 at each time, transitions a11=0.7, a12=0.3, a21=0.5, a22=0.5, and emissions into outputs K1, K2, K3.]

Forward Calculations – Time 2 (1 word example)

NOTE: α1(2) + α2(2) is the likelihood of the observation/word "K3" in this "1 word example"

[Slide: Time-2 trellis for the one-word example, using π1, π2, the transition probabilities, and emissions b13=0.3, b23=0.2.]

Forward Calculations – Time 3 (2 word example)

[Slide: Time-3 trellis showing α1(3) computed from the Time-2 values via the transition probabilities and emissions b11=0.6, b12=0.1.]

Forward Calculations – Time 4 (3 word example)

[Slide: full trellis over Times 2–4 for the three-word example, with all transition and emission probabilities shown.]

Backward Calculations – Overview

[Slide: the same trellis over Times 2–4, now traversed right to left for the backward probabilities.]

Backward Calculations – Time 4

[Slide: Time-4 trellis fragment with states S1, S2 and emissions b11=0.6, b21=0.1.]

Backward Calculations – Time 3

[Slide: Time-3 trellis fragment with states S1, S2 and emissions b11=0.6, b21=0.1.]

Backward Calculations – Time 2

NOTE: β1(2) + β2(2) is the likelihood of the observation/word sequence "K2, K1" in this "2 word example"

[Slide: Times 2–3 trellis for the backward pass, with transitions a11=0.7, a12=0.3, a21=0.5, a22=0.5 and emissions b12=0.1, b13=0.3, b22=0.7, b23=0.2.]

Backward Calculations – Time 1

[Slide: full trellis over Times 2–4 for the backward pass, with all transition and emission probabilities shown.]
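The backward recursion can be sketched the same way as the forward one (same assumed state-emission convention and array layout as before):

```python
import numpy as np

# Example HMM from these slides
pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3], [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])

def backward(obs):
    """beta[t, i] = likelihood of the observations after step t,
    given state S_{i+1} at step t."""
    beta = np.ones((len(obs), len(pi)))           # beta_i(T) = 1
    for t in range(len(obs) - 2, -1, -1):
        # beta_i(t) = sum_j a_ij * b_{j,o_{t+1}} * beta_j(t+1)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Likelihood of "K2, K1": weight beta_i at the first step by pi_i * b_{i,o1}
obs = [1, 0]
beta = backward(obs)
print((pi * B[:, obs[0]] * beta[0]).sum())   # ≈ 0.045
```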

The Forward-Backward Method
• Note the forward method computes αi(t), the joint likelihood of the observations up to time t and being in state Si at time t
• Note the backward method computes (t>1) βi(t), the likelihood of the observations after time t given state Si at time t
• We can do the forward-backward method, which computes p(K1, …, KT) using the formula p(K1, …, KT) = Σi αi(t) · βi(t) (using any choice of t = 1, …, T+1!)
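The "any choice of t" claim can be checked numerically; a sketch under the same assumed conventions and array layout as the earlier blocks:

```python
import numpy as np

pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3], [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])

def forward(obs):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs):
    beta = np.ones((len(obs), len(pi)))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

obs = [2, 1, 0]                           # "K3, K2, K1"
alpha, beta = forward(obs), backward(obs)
# sum_i alpha_i(t) * beta_i(t) gives the same likelihood at every t
for t in range(len(obs)):
    print((alpha[t] * beta[t]).sum())     # ≈ 0.0315 at each t
```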
Solution to Problem 1
• The "hard part" of the 1st Problem was to find the likelihood of the observations for an HMM
• We can now do this using either the forward, backward, or forward-backward method.
Solution to Problem 2: Viterbi Algorithm (Computing "Most Probable" Labeling)
• Consider direct calculation of labeled observations
• Previously we summed these likelihoods together across all possible labelings to solve the first problem, which was to compute the likelihood of the observations given the parameters (Hard part of HMM Question 1!).
• We solved this problem using the forward or backward method.
• Now we want to compute all possible labelings and their respective likelihoods and pick the labeling which is the largest!
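A sketch of the Viterbi recursion with backtracking for the example model (same assumed conventions as the earlier blocks; states are reported 0-based, so 0 = S1 and 1 = S2):

```python
import numpy as np

pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3], [0.5, 0.5]])
B  = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])

def viterbi(obs):
    """Return the most probable state path and its likelihood."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))              # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)     # best predecessor of each state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]              # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

path, p = viterbi([2, 1, 0])              # "K3, K2, K1"
print(path, p)                            # [0, 1, 0], i.e. S1, S2, S1; p ≈ 0.0189
```

The recovered path S1, S2, S1 with likelihood 0.0189 matches the labeled-sequence product computed earlier by the direct method.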

Efficiency of Calculations is Important (e.g., Most Likely Labeling Problem)
• Just as in the forward-backward calculations, we can solve the problem of computing the likelihood of every one of the N^T possible labelings efficiently
• Instead of millions of years of computing time we can solve the problem in several seconds!!

[Slide: trellis over Times 2–4 for the most-likely-labeling (Viterbi) calculation, with the same transition and emission probabilities as before.]

Forward Calculations – Time 2 (1 word example)

[Slide: Time-2 trellis for the Viterbi pass, with π1=1, π2=0 and emissions b13=0.3, b23=0.2.]

Backtracking – Time 2 (1 word example)

[Slide: the same Time-2 trellis, with the best predecessor of each state marked for backtracking.]

Forward Calculations – (2 word example)

[Slide: Times 2–3 trellis for the two-word Viterbi calculation.]

BACKTRACKING – (2 word example)

[Slide: the two-word trellis with the best path traced back.]

Forward Calculations – Time 4 (3 word example)

[Slide: full trellis over Times 2–4 for the three-word Viterbi calculation, with all transition and emission probabilities shown.]

Backtracking to Obtain Labeling for 3 word case

[Slide: the three-word trellis with the most probable labeling traced back from Time 4.]

Third Fundamental Question: Parameter Estimation
• Make an initial guess for {aij} and {bkm}
• Compute the probability that one hidden state follows another, given {aij}, {bkm}, and the sequence of observations (computed using the forward-backward algorithm)
• Compute the probability of an observed state given a hidden state, given {aij}, {bkm}, and the sequence of observations (computed using the forward-backward algorithm)
• Use these computed probabilities to make an improved guess for {aij} and {bkm}
• Repeat this process until convergence
• It can be shown that this algorithm does in fact converge to the correct choice for {aij} and {bkm}, assuming that the initial guess was close enough…
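The re-estimation loop above can be sketched as a single EM (Baum-Welch) step. This is a minimal illustration under the same assumed conventions as the earlier blocks, with a made-up initial guess and toy observation sequence, not the full algorithm with convergence checks:

```python
import numpy as np

def forward(obs, pi, A, B):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs, pi, A, B):
    beta = np.ones((len(obs), len(pi)))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One re-estimation of pi, A, B from forward-backward probabilities."""
    N, M = len(pi), B.shape[1]
    alpha, beta = forward(obs, pi, A, B), backward(obs, pi, A, B)
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood     # gamma[t, i] = p(X_t = S_i | O)
    # xi[t, i, j] = p(X_t = S_i, X_{t+1} = S_j | O)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):                    # expected emissions of each symbol
        mask = np.array(obs) == k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B

pi = np.array([0.6, 0.4])                 # a deliberately rough initial guess
A  = np.array([[0.5, 0.5], [0.5, 0.5]])
B  = np.array([[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]])
obs = [2, 1, 0, 0, 1, 2]                  # a toy observation sequence
pi, A, B = baum_welch_step(obs, pi, A, B)
print(A.sum(axis=1), B.sum(axis=1))       # re-estimated rows still sum to 1
```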