Hidden Markov Models

Chris Brew

The Ohio State University


Introduction

  • Dynamic Programming

  • Markov models as effective tools for language modelling

  • How to solve three classic problems

    • Calculate the probability of a corpus given a model

    • Guess the sequence of states passed through

    • Adapt the model to the corpus

  • Generalization of word-confetti


Edit Distance

  • You have a text editor that can:

    • Insert a character

    • Delete a character

    • Substitute one character for another

  • The edit distance between two sequences $x_1 \dots x_n$ and $y_1 \dots y_m$ is the smallest number of elementary operations that will transform $x_1 \dots x_n$ into $y_1 \dots y_m$.


Algorithm for edit distance

  • Fill up a rectangular array of intermediate results starting at the bottom left and working up to the top right.

  • This is time efficient, because it avoids backtracking.

  • It can be made space efficient, because not all the entries in the array are relevant to the best path
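One common way to get the space saving is sketched below, as an assumption about what the slide has in mind. Each cell depends only on its own row and the row before it, so two rows are enough when only the distance (not the alignment) is needed. The name `edit_distance_two_rows` is illustrative, and unit costs are assumed.

```python
def edit_distance_two_rows(string1, string2):
    """Unit-cost edit distance that keeps only two rows of the table."""
    # Row 0: turning the empty prefix of string1 into each prefix of string2.
    previous = list(range(len(string2) + 1))
    for i, ch1 in enumerate(string1, start=1):
        current = [i]  # deleting the first i characters of string1
        for j, ch2 in enumerate(string2, start=1):
            subst = 0 if ch1 == ch2 else 1
            current.append(min(previous[j - 1] + subst,  # match or substitute
                               current[j - 1] + 1,       # insert
                               previous[j] + 1))         # delete
        previous = current
    return previous[-1]


print(edit_distance_two_rows("kitten", "sitting"))  # 3
```

The memory use is proportional to the length of `string2` rather than to the product of the two lengths.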


Initialization

[Figure: edit-distance grid with one axis labelled S, P, O, H, S and the other labelled C, H, E, A, P]


Initialization

  • def sdist(string1, string2):
        delCost = 1.0
        insCost = 1.0
        substCost = 1.0
        m = len(string1)
        n = len(string2)
        d[0,0] = 0.0
        …

  • This code is not a complete program; it still needs imports and so on.


The borders

[Figure: edit-distance grid with one axis labelled S, P, O, H, S and the other labelled C, H, E, A, P]


The borders

  • We fill in the first row, adding entries with indices (1,0) through (m,0)

    • …for i in range(m): d[i+1,0] = d[i,0] + delCost …

  • We fill in the first column, adding entries with indices (0,1) through (0,n)

    • …for j in range(n): d[0,j+1] = d[0,j] + insCost …


Recursion

[Figure: edit-distance grid with one axis labelled S, P, O, H, S and the other labelled C, H, E, A, P]


Recursion

for i in range(m):
    for j in range(n):
        if string1[i] == string2[j]:
            subst = 0
        else:
            subst = substCost
        d[i+1,j+1] = min(d[i,j] + subst,
                         d[i+1,j] + insCost,
                         d[i,j+1] + delCost)


Wrapup

  • At the end, the total distance is in the cell at (m,n).

  • This version says that there is no charge for matching a letter against itself, but that it costs one penalty point to match against anything else. It would be easy to vary this if we thought, for example, that it was less bad to confuse some letter pairs than to confuse others.
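The fragments from the preceding slides can be assembled into one runnable function, sketched below. The nested-list table, the `subst_cost` callback and the `cheap_mn` cost table are assumptions added for illustration and are not part of the original slides. The test strings SHOPS and CHEAP are a guess at the slides' example, based on the letters that survive from the grid figures.

```python
def sdist(string1, string2, subst_cost=None, del_cost=1.0, ins_cost=1.0):
    """Edit distance; the answer ends up in the cell at (m, n)."""
    if subst_cost is None:
        # Default from the slides: free match, one point for any mismatch.
        subst_cost = lambda a, b: 0.0 if a == b else 1.0
    m, n = len(string1), len(string2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):                      # border: delete everything
        d[i + 1][0] = d[i][0] + del_cost
    for j in range(n):                      # border: insert everything
        d[0][j + 1] = d[0][j] + ins_cost
    for i in range(m):
        for j in range(n):
            d[i + 1][j + 1] = min(
                d[i][j] + subst_cost(string1[i], string2[j]),
                d[i + 1][j] + ins_cost,
                d[i][j + 1] + del_cost,
            )
    return d[m][n]


def cheap_mn(a, b):
    """Hypothetical cost table: confusing m and n is only half as bad."""
    if a == b:
        return 0.0
    return 0.5 if {a, b} == {"m", "n"} else 1.0


print(sdist("SHOPS", "CHEAP"))                    # 4.0
print(sdist("man", "nan", subst_cost=cheap_mn))   # 0.5
```

Passing a different `subst_cost` is one way to express that some letter pairs are less bad to confuse than others.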


Dynamic Programming

  • Like many other algorithms, DP is efficient because it systematically records intermediate results.

  • There are actually exponentially many paths through the matrix, but only a polynomial amount of effort is needed to fill it out.

  • If you’re clever, no need to fill all the cells
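To make the gap between paths and cells concrete, the sketch below counts the monotone paths through an (m+1) by (n+1) table, assuming each step moves one cell right, one cell up, or one cell diagonally (the three moves used in the edit-distance recursion). The function name `count_paths` is illustrative. The count is itself computed by dynamic programming.

```python
def count_paths(m, n):
    """Number of paths from cell (0, 0) to cell (m, n) using right, up and diagonal steps."""
    # There is exactly one path along each border.
    paths = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1] + paths[i - 1][j - 1]
    return paths[m][n]


for size in (5, 10, 20):
    cells = (size + 1) ** 2
    print(size, cells, count_paths(size, size))
```

For two five-letter strings there are 36 cells to fill but 1683 distinct paths, and the path count grows far faster than the cell count as the strings get longer.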


Topics

  • The noisy channel model

  • Markov models

  • Hidden Markov models

  • What is part-of-speech tagging?

  • Three problems solved

    • Probability estimation (problem 1)

    • Viterbi algorithm (problem 2)

    • Forward-Backward algorithm (problem 3)


The noisy channel model

  • Incomplete information

[Figure: noisy channel diagram with the labels "Words + Parts-of-speech", "Noisy Channel" and "Words only"]


Markov Models

[Figure: state diagram with the states "the", "dogs" and "bit"]

  • States and transitions (with probabilities)


Matrix form of Markov models

  • Transition matrix (A):

            The    Dogs   Bit
    The     0.01   0.46   0.53
    Dogs    0.05   0.15   0.80
    Bit     0.77   0.32   0.01

  • Start with initial probabilities p(0):

    The     0.7
    Dogs    0.2
    Bit     0.1


Using Markov models

  • Choose initial state from p(0). Say it was “the”.

  • Choose transition from “the” row of A. If we choose “dogs”, that has probability 0.46.

  • But we can get to “dogs” from other places too:
    p(1)[“dogs”] = p(0)[“the”]*0.46 + p(0)[“dogs”]*0.15 + p(0)[“bit”]*0.32

  • After $n$ time steps, $p(n) = p(0)\,A^{n}$, treating $p$ as a row vector.


Using Markov models II

  • If we want the whole of p(1) we can do it efficiently by multiplying the row vector p(0) by the matrix A.

  • We can do the same to get p(2) from p(1).

  • After $n$ time steps, $p(n) = p(0)\,A^{n}$, treating $p$ as a row vector.

  • Best path and string probability also not hard.
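The calculations on these two slides can be sketched in plain Python using the numbers from the matrix slide. Two caveats, both assumptions of this sketch: the "bit" row as transcribed sums to 1.10, so the helper `normalize_rows` rescales it before taking many steps, and all function names are illustrative.

```python
STATES = ["the", "dogs", "bit"]

# Transition probabilities as transcribed on the slide: A[source][destination].
A = {
    "the":  {"the": 0.01, "dogs": 0.46, "bit": 0.53},
    "dogs": {"the": 0.05, "dogs": 0.15, "bit": 0.80},
    "bit":  {"the": 0.77, "dogs": 0.32, "bit": 0.01},  # sums to 1.10 as given
}
P0 = {"the": 0.7, "dogs": 0.2, "bit": 0.1}


def normalize_rows(matrix):
    """Rescale every row to sum to 1 (an assumption, used to repair the 'bit' row)."""
    return {src: {dst: p / sum(row.values()) for dst, p in row.items()}
            for src, row in matrix.items()}


def step(p, matrix):
    """One time step: p_next[dst] = sum over src of p[src] * matrix[src][dst]."""
    return {dst: sum(p[src] * matrix[src][dst] for src in STATES) for dst in STATES}


def distribution_after(n, p, matrix):
    """Apply `step` n times, i.e. multiply the row vector by the matrix n times."""
    for _ in range(n):
        p = step(p, matrix)
    return p


def string_probability(words, p, matrix):
    """Probability of one specific (fully visible) state sequence."""
    prob = p[words[0]]
    for src, dst in zip(words, words[1:]):
        prob *= matrix[src][dst]
    return prob


print(round(step(P0, A)["dogs"], 3))                                  # 0.384
print(round(string_probability(["the", "dogs", "bit"], P0, A), 4))    # 0.2576
print(round(sum(distribution_after(10, P0, normalize_rows(A)).values()), 6))  # 1.0
```

Because the state is visible at every step, the string probability is a single product with no summation over alternatives.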


Hidden Markov Models

  • Now you don’t know the state sequence

[Figure: HMM with hidden states det, n and vb emitting the words "the", "a", "these", "dogs", "chased", "bit" and "cats"; "dogs" and "bit" each appear twice]


Matrix form of HMMs

  • Transition matrix (A):

            DET    N      VB
    DET     0.01   0.89   0.10
    N       0.30   0.20   0.50
    VB      0.67   0.23   0.10

  • Emission matrix (B):

            Dogs   Bit    The    …
    DET     0.0    0.0    1.0
    N       0.2    0.1    0.0
    VB      0.1    0.6    0.0

  • Start with initial probabilities p(0):

    DET     0.7
    N       0.2
    VB      0.1


Using Hidden Markov models

  • Generation:

    • Draw from p(0)

    • Choose transition from relevant row of A

    • Choose emission from relevant row of B

    • After $n$ time steps the state distribution is $p(n) = p(0)\,A^{n}$, treating $p$ as a row vector

  • Easy because state stays known.

  • If one wanted, one could generate all possible strings, annotating with probability.
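The generation procedure can be sketched as a small sampler over the matrices from the previous slide. The emission table on the slide is truncated by "…", so this sketch assigns the missing probability mass to a hypothetical `<other>` token. That token, the function names and the fixed random seed are all assumptions.

```python
import random

STATES = ["DET", "N", "VB"]
OTHER = "<other>"  # hypothetical stand-in for the words hidden behind "…"

P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {
    "DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
    "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
    "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10},
}
B = {
    "DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
    "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
    "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0},
}


def draw(dist, rng, leftover=None):
    """Sample a key of `dist`; unassigned mass goes to `leftover` (or the last key)."""
    r = rng.random()
    cumulative = 0.0
    for key, prob in dist.items():
        cumulative += prob
        if r < cumulative:
            return key
    return leftover if leftover is not None else key


def generate(length, rng):
    """Return (state, word) pairs; the generator knows its state at every step."""
    state = draw(P0, rng)
    sequence = []
    for _ in range(length):
        word = draw(B[state], rng, leftover=OTHER)  # emit from the current state
        sequence.append((state, word))
        state = draw(A[state], rng)                 # then move to the next state
    return sequence


for state, word in generate(6, random.Random(0)):
    print(state, word)
```

Dropping the first element of each pair gives what an observer actually sees: the words without the states.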


State sequences

  • All you see is the output:

    • “The bit dogs …”

  • But you can’t tell which of

    • DET N VB …

    • DET VB N …

    • DET N N …

    • DET VB VB …

  • Each of these has different probabilities.

  • Don’t know which state you are in.


The three problems

  • Probability estimation

    • Given a sequence of observations O and a model M, find P(O|M).

  • Best path estimation

    • Given a sequence of observations O and a model M, find a sequence of states I which maximizes P(O,I|M).


The third problem

  • Training

    • Adjust the model parameters so that P(O|M) is as large as possible for the given O. This is a hard problem because there are so many adjustable parameters that could vary.


Probability estimation

  • Easy in principle. Form joint probability of state sequences and observations P(O,I|M). Marginalize out I.

  • But this involves sum over exponentially many paths.

  • Efficient algorithm uses idea that probability of state at time t+1 is easy to get from knowledge of all states at time t.
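Both routes to P(O|M) can be sketched side by side: a brute-force sum over every state sequence, and the forward recursion that passes along one number per state per time step. The sketch uses the matrices from the "Matrix form of HMMs" slide and folds each word's emission into the forward value at that word's own time step. This is a common textbook bookkeeping and an assumption here, since it differs slightly from the convention these slides describe later. Function names are illustrative.

```python
from itertools import product

STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {
    "DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
    "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
    "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10},
}
B = {
    "DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
    "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
    "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0},
}


def joint_probability(obs, states):
    """P(O, I | M) for one particular state sequence I."""
    prob = P0[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return prob


def brute_force_probability(obs):
    """Marginalize out I by enumerating all len(STATES) ** len(obs) paths."""
    return sum(joint_probability(obs, seq)
               for seq in product(STATES, repeat=len(obs)))


def forward_probability(obs):
    """Same quantity, using only the per-state totals from the previous time step."""
    alpha = {s: P0[s] * B[s][obs[0]] for s in STATES}
    for word in obs[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in STATES) * B[j][word]
                 for j in STATES}
    return sum(alpha.values())


sentence = ["the", "dogs", "bit"]
print(round(brute_force_probability(sentence), 6))  # 0.040453
print(round(forward_probability(sentence), 6))      # 0.040453
```

The brute-force version does work proportional to 3^T for a sentence of T words, while the forward version does work proportional to 9T.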


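Probability estimation

  • Getting the next time step

[Figure: one trellis step. States DET(t), VB(t) and N(t) emit "bit" with probabilities $b_{det}(\text{bit})$, $b_{vb}(\text{bit})$ and $b_{n}(\text{bit})$, and feed state i(t+1) through transitions $a_{det,i}$, $a_{vb,i}$ and $a_{n,i}$; state i emits "dogs" with probability $b_i(\text{dogs})$]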


Event 1

  • Arrive in state j at time step t. (big event)


Event 2

  • Generate word k from state j


Event 3

  • Transition from state j to state i


Event 4

  • Continue from j to the end of the string (big event)


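Best path

[Figure: one trellis step. States DET(t), VB(t) and N(t) emit "bit" with probabilities $b_{det}(\text{bit})$, $b_{vb}(\text{bit})$ and $b_{n}(\text{bit})$, and feed state i(t+1) through transitions $a_{det,i}$, $a_{vb,i}$ and $a_{n,i}$; state i emits "dogs" with probability $b_i(\text{dogs})$]

  • Maximize, not sum: at each state keep only the best incoming path instead of adding up all of them.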
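A minimal Viterbi sketch makes the "maximize, not sum" change concrete. It uses the matrices from the "Matrix form of HMMs" slide, keeps a back-pointer for each state so that the best path can be read off at the end, and folds each word's emission into the value at that word's time step (an assumption about bookkeeping). The function name is illustrative.

```python
STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {
    "DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
    "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
    "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10},
}
B = {
    "DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
    "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
    "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0},
}


def viterbi(obs):
    """Return (best state sequence I, P(O, I | M)) for the observations `obs`."""
    delta = {s: P0[s] * B[s][obs[0]] for s in STATES}
    backpointers = []
    for word in obs[1:]:
        new_delta, pointers = {}, {}
        for j in STATES:
            # Where the forward algorithm sums over predecessors, take the max.
            best_prev = max(STATES, key=lambda i: delta[i] * A[i][j])
            pointers[j] = best_prev
            new_delta[j] = delta[best_prev] * A[best_prev][j] * B[j][word]
        delta = new_delta
        backpointers.append(pointers)
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):  # walk the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


path, prob = viterbi(["the", "dogs", "bit"])
print(path)  # ['DET', 'N', 'VB']
```

For this sentence the single best path accounts for most of the total probability of the sentence, but the two numbers are not the same quantity.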


Backward probabilities

  • Counterpart of forward probs

[Figure: trellis over three time steps. States DET(t), VB(t) and N(t) emit "bit", state i(t+1) emits "dogs", and states det(t+2), vb(t+2) and n(t+2) emit "cat"; transitions $a_{det,i}$, $a_{vb,i}$ and $a_{n,i}$ lead into i, and transitions $a_{i,det}$, $a_{i,vb}$ and $a_{i,n}$ lead out of it]


Forward and Backward

  • Note that our notation is not quite the same as that in M&S p334. Ours is a state-emission HMM, theirs is an arc-emission HMM. See the note on p338 for more details.

  • We assume that $\alpha_i(t)$ includes the probability of generating words up to but not including the one in the state just reached. $\beta_i(t)$ therefore starts by generating this word.


State probabilities

  • $\alpha_i(t)\,\beta_i(t)$ is p(in state i at time t, all words)

  • Sum over all states k of $\alpha_k(t)\,\beta_k(t)$ is p(sentence)

  • p(in state i at time t) is $\alpha_i(t)\,\beta_i(t) \,/\, \sum_k \alpha_k(t)\,\beta_k(t)$

  • p(in state i) is the average over all time ticks of p(in state i at time t)
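These identities can be checked numerically with a short forward-backward sketch over the matrices from the "Matrix form of HMMs" slide. The sketch assumes the common bookkeeping in which the forward value at time t already includes the word at t and the backward value does not. The emission is attributed differently from the convention described on the previous slide, but the product of forward and backward values means the same thing either way. Function names are illustrative.

```python
STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {
    "DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
    "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
    "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10},
}
B = {
    "DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
    "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
    "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0},
}


def forward(obs):
    """alpha[t][i] = p(words up to and including t, in state i at time t)."""
    alpha = [{s: P0[s] * B[s][obs[0]] for s in STATES}]
    for word in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in STATES) * B[j][word]
                      for j in STATES})
    return alpha


def backward(obs):
    """beta[t][i] = p(words after t | in state i at time t)."""
    beta = [{s: 1.0 for s in STATES}]
    for word in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j][word] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta


def state_posteriors(obs):
    """p(in state i at time t | all words), for every t."""
    alpha, beta = forward(obs), backward(obs)
    p_sentence = sum(alpha[-1].values())
    return [{s: alpha[t][s] * beta[t][s] / p_sentence for s in STATES}
            for t in range(len(obs))]


for word, row in zip(["the", "dogs", "bit"], state_posteriors(["the", "dogs", "bit"])):
    print(word, {s: round(p, 3) for s, p in row.items()})
```

At every time step the posteriors sum to one, and the unnormalized products sum to the same sentence probability regardless of which time step is chosen.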


Training

  • Uses forward and backward probabilities

  • Starts from an initial guess

  • Improves the initial guess using data

  • Stops at a (locally) best model

  • Specialization of the EM algorithm


Factorizing the path

  • Consider p(in state i at time t and in state j at time t+1 | Model, Observations)

  • We could see this as two things

    • Get to i while generating words up to t

    • Get from t to end of corpus while generating remaining words.


Factorizing the path 2

  • Consider p(in state i at time t and in state j at time t+1 | Model, Observations)

  • We could see this as four things

    • Get to i while generating words up to t

    • Generate word from i

    • Make correct transition from i to j

    • Get from t+1 to end of corpus while generating remaining words.

  • The merit of this is that we can use the current model for the inside bit.


Factorizing the path 3

  • Consider p(in state i at time t and in state j at time t+1 | Model, Observations)

  • We could see this as four things

    • Get to i while generating words up to t

    • Repeat ad lib

      • Generate word from current state

      • Make a transition that generates the word that we saw

    • Get from t+k to end of corpus while generating remaining words.

  • If we wanted, the model for the inside bit could be a bit more complicated than we assumed above. Research topic.


Expected transition counts 2

  • We have these things already

    • Forward prob: $\alpha_i(t)$

    • Transition prob: $a_{ij}$

    • Emission prob: $b_j(\text{word})$

    • Backward prob: $\beta_j(t+1)$


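Expected Transition counts 3

[Figure: trellis highlighting one transition $a_{i,j}$ between consecutive time steps, with incoming transitions $a_{det,i}$, $a_{vb,i}$ and $a_{n,i}$, emissions $b_i(\text{bit})$ and $b_j(\text{dogs})$, and onward transitions $a_{i,det}$, $a_{i,vb}$ and $a_{i,n}$ to states that emit "cat"]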


Estimated transition probabilities

  • $\alpha_i(t)\,a_{ij}\,b_j(\text{word})\,\beta_j(t+1)$ is count(in state i at time t, in state j at time t+1, all words)

  • p(in state i at time t, in state j at time t+1) is $\alpha_i(t)\,a_{ij}\,b_j(\text{word})\,\beta_j(t+1) \,/\, \sum_k \alpha_i(t)\,a_{ik}\,b_k(\text{word})\,\beta_k(t+1)$

  • Sum over all time ticks to get expected transition counts.

  • Derive new probabilities from these counts.
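The recipe on this slide can be sketched as follows, using the matrices from the "Matrix form of HMMs" slide. Two assumptions are made: the forward value at time t includes the word at t while the backward value does not, and each product is divided by p(sentence), which is the usual Baum-Welch normalization, so that summing over time ticks gives expected counts. Function names are illustrative.

```python
STATES = ["DET", "N", "VB"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {
    "DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
    "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
    "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10},
}
B = {
    "DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
    "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
    "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0},
}


def forward(obs):
    alpha = [{s: P0[s] * B[s][obs[0]] for s in STATES}]
    for word in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in STATES) * B[j][word]
                      for j in STATES})
    return alpha


def backward(obs):
    beta = [{s: 1.0 for s in STATES}]
    for word in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j][word] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta


def expected_transition_counts(obs):
    """Sum over time ticks of p(state i at t, state j at t+1 | all words)."""
    alpha, beta = forward(obs), backward(obs)
    p_sentence = sum(alpha[-1].values())
    counts = {i: {j: 0.0 for j in STATES} for i in STATES}
    for t in range(len(obs) - 1):
        for i in STATES:
            for j in STATES:
                counts[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                 * beta[t + 1][j]) / p_sentence
    return counts


def reestimate_transitions(counts):
    """Turn expected counts into new transition probabilities, row by row."""
    new_A = {}
    for i in STATES:
        total = sum(counts[i].values())
        # Keep the old row if state i was never (expected to be) left.
        new_A[i] = ({j: counts[i][j] / total for j in STATES}
                    if total > 0 else dict(A[i]))
    return new_A


counts = expected_transition_counts(["the", "dogs", "bit"])
print({i: {j: round(p, 3) for j, p in row.items()}
       for i, row in reestimate_transitions(counts).items()})
```

A sentence of T words contains T-1 transitions, so the expected counts over all state pairs add up to T-1.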


Estimated emission probabilities

  • Calculate the expected number of times in state j at places where a particular word occurred.

  • Divide by the expected number of times in state j.

  • The average over time ticks is the new emission probability.


Re-estimation (for everybody)

  • Recall that we guessed the initial parameters. Replace initial parameters with new ones derived as above.

  • These will be better than the originals because:

    • The data ensures that we only consider paths which can generate the words that we did see in the corpus

    • Paths which fit the data well get taken frequently, bad paths infrequently


Re-estimation (the details)

  • Baum et al. show that this will always converge to a local maximum.

  • An instance of Dempster, Laird and Rubin’s EM algorithm.

  • For a modern review of EM see: ftp://ftp.cs.utoronto.ca/pub/radford/emk.pdf
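The convergence behaviour can be observed on a toy example. The sketch below repeats the re-estimation step a few times, starting from the matrices on the "Matrix form of HMMs" slide, and records log P(O|M) before each update. The five-word corpus, the restriction to a three-word vocabulary, the number of iterations and all function names are assumptions made for illustration.

```python
import math

STATES = ["DET", "N", "VB"]
VOCAB = ["dogs", "bit", "the"]
P0 = {"DET": 0.7, "N": 0.2, "VB": 0.1}
A = {
    "DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
    "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
    "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10},
}
B = {
    "DET": {"dogs": 0.0, "bit": 0.0, "the": 1.0},
    "N":   {"dogs": 0.2, "bit": 0.1, "the": 0.0},
    "VB":  {"dogs": 0.1, "bit": 0.6, "the": 0.0},
}


def forward(obs, p0, a, b):
    alpha = [{s: p0[s] * b[s][obs[0]] for s in STATES}]
    for word in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * a[i][j] for i in STATES) * b[j][word]
                      for j in STATES})
    return alpha


def backward(obs, a, b):
    beta = [{s: 1.0 for s in STATES}]
    for word in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(a[i][j] * b[j][word] * nxt[j] for j in STATES)
                        for i in STATES})
    return beta


def reestimate(obs, p0, a, b):
    """One EM step. Returns (new_p0, new_a, new_b, log P(obs | old model))."""
    alpha, beta = forward(obs, p0, a, b), backward(obs, a, b)
    p_obs = sum(alpha[-1].values())
    steps = len(obs)
    gamma = [{i: alpha[t][i] * beta[t][i] / p_obs for i in STATES}
             for t in range(steps)]
    new_a, new_b = {}, {}
    for i in STATES:
        # Expected transition counts out of state i.
        counts = {j: sum(alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j]
                         for t in range(steps - 1)) / p_obs for j in STATES}
        total = sum(counts.values())
        new_a[i] = ({j: counts[j] / total for j in STATES}
                    if total > 0 else dict(a[i]))
        # Expected visits to state i, split by the word seen at each visit.
        visits = sum(g[i] for g in gamma)
        new_b[i] = ({w: sum(g[i] for g, o in zip(gamma, obs) if o == w) / visits
                     for w in VOCAB} if visits > 0 else dict(b[i]))
    return dict(gamma[0]), new_a, new_b, math.log(p_obs)


corpus = "the dogs bit the dogs".split()
model = (P0, A, B)
log_likelihoods = []
for _ in range(6):
    new_p0, new_a, new_b, log_likelihood = reestimate(corpus, *model)
    log_likelihoods.append(log_likelihood)
    model = (new_p0, new_a, new_b)
print([round(value, 4) for value in log_likelihoods])
```

The recorded log-likelihoods never decrease from one iteration to the next, which is the property the convergence result guarantees. Which local maximum is reached depends on the initial guess.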


Summary

  • Three problems solved

  • Simple model based on finite-state technology

  • Sensitive to a limited range of context information

  • Re-estimation as an instance of the EM algorithm


Where to get more information

  • Maryland implementation in C

  • My implementation in Python

  • Matlab code by Zoubin Ghahramani

  • Manning and Schütze ch 9.

  • Charniak chapters 3 and 4

  • http://www.georgetown.edu/cball/ling361/tagging_overview.html

