HMM (I)


### HMM (I)

LING 570

Fei Xia

Week 7: 11/5-11/7/07

HMM
• Definition and properties of HMM
• Two types of HMM
• Three basic questions in HMM

### Definition of HMM

Hidden Markov Models
• There are n states s1, …, sn in an HMM, and the states are connected.
• The output symbols are produced by the states or edges in HMM.
• An observation O=(o1, …, oT) is a sequence of output symbols.
• Given an observation, we want to recover the hidden state sequence.
• An example: POS tagging
• States are POS tags
• Output symbols are words
• Given an observation (i.e., a sentence), we want to discover the tag sequence.
Same observation, different state sequences:

[Figure: the sentence “time flies like an arrow” tagged with two different state sequences, e.g., time/N flies/N like/V an/DT arrow/N vs. time/N flies/V like/P an/DT arrow/N]

Two types of HMMs
• State-emission HMM (Moore machine): the output symbol is produced by the states:
• by the from-state, or
• by the to-state
• Arc-emission HMM (Mealy machine): the output symbol is produced by the edges; i.e., by the (from-state, to-state) pairs.

### PFA recap

Formal definition of PFA

A PFA is a tuple (Q, Σ, δ, I, F, P):

• Q: a finite set of N states
• Σ: a finite set of input symbols
• I: Q → R+ (initial-state probabilities)
• F: Q → R+ (final-state probabilities)
• δ ⊆ Q × Σ × Q: the transition relation between states
• P: δ → R+ (transition probabilities)

Constraints on the functions:

Σ_q I(q) = 1; for each state q: F(q) + Σ_{a,q′} P(q, a, q′) = 1

Probability of a string x: the sum, over all paths that start in some state, consume x, and stop, of I(start state) × the product of the transition probabilities along the path × F(end state).

An example of PFA

[Figure: two states; arc q0 →(a : 1.0)→ q1 and self-loop q1 →(b : 0.8)→ q1]

I(q0) = 1.0, I(q1) = 0.0

F(q0) = 0, F(q1) = 0.2

P(ab^n) = I(q0) × P(q0, ab^n, q1) × F(q1) = 1.0 × 1.0 × 0.8^n × 0.2 = 0.2 × 0.8^n
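This computation can be checked mechanically. Below is a minimal sketch (the function name `pfa_prob` is mine, not from the slides) that scores a string under the example PFA with a forward pass over states:

```python
def pfa_prob(string):
    """Probability of a string under the example PFA: P(ab^n) = 1.0 * 0.8^n * 0.2."""
    I = {'q0': 1.0, 'q1': 0.0}   # initial-state probabilities
    F = {'q0': 0.0, 'q1': 0.2}   # final-state probabilities
    # transition probabilities P(from_state, symbol, to_state)
    P = {('q0', 'a', 'q1'): 1.0, ('q1', 'b', 'q1'): 0.8}
    # forward pass: probability of reaching each state after consuming a prefix
    probs = dict(I)
    for sym in string:
        nxt = {q: 0.0 for q in probs}
        for (q, a, q2), p in P.items():
            if a == sym:
                nxt[q2] += probs[q] * p
        probs = nxt
    # stop in any state, weighted by its final-state probability
    return sum(probs[q] * F[q] for q in probs)
```

For example, `pfa_prob("ab")` gives 0.2 × 0.8 = 0.16, matching the formula above with n = 1.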

### Arc-emission HMM

Definition of arc-emission HMM
• An HMM is a tuple (S, Σ, π, A, B):
• A set of states S = {s1, s2, …, sN}
• A set of output symbols Σ = {w1, …, wM}
• Initial state probabilities π = {πi}
• Transition prob: A = {aij}, where aij = P(Xt+1 = sj | Xt = si)
• Emission prob: B = {bijk}, where bijk = P(Ot = wk | Xt = si, Xt+1 = sj)

Constraints in an arc-emission HMM:

Σ_i πi = 1; for each i: Σ_j aij = 1; for each (i, j): Σ_k bijk = 1

For any integer n and any HMM, the probabilities of all output sequences of length n sum to one.

An example: HMM structure

[Figure: states s1, s2, …, sN connected by arcs; each arc is labeled with the output symbols (w1, …, w5) it can emit]

Same kinds of parameters, but the emission probabilities depend on both states: P(wk | si, sj)

# of parameters: O(N²M + N²)

A path in an arc-emission HMM

[Figure: a path X1 → X2 → … → Xn+1, where the arc from Xt to Xt+1 emits ot]
• State sequence: X1,n+1
• Output sequence: O1,n
• Output sequence: O1,n
PFA vs. Arc-emission HMM

A PFA is a tuple (Q, Σ, δ, I, F, P):

• Q: a finite set of N states
• Σ: a finite set of input symbols
• I: Q → R+ (initial-state probabilities)
• F: Q → R+ (final-state probabilities)
• δ: the transition relation between states
• P: δ → R+ (transition probabilities)

An HMM is a tuple (S, Σ, π, A, B):

• A set of states S = {s1, s2, …, sN}
• A set of output symbols Σ = {w1, …, wM}
• Initial state probabilities π = {πi}
• Transition prob: A = {aij}
• Emission prob: B = {bijk}

### State-emission HMM

Definition of state-emission HMM
• An HMM is a tuple (S, Σ, π, A, B):
• A set of states S = {s1, s2, …, sN}
• A set of output symbols Σ = {w1, …, wM}
• Initial state probabilities π = {πi}
• Transition prob: A = {aij}
• Emission prob: B = {bjk}, where bjk = P(Ot = wk | Xt = sj)
• We use si and wk to refer to what is in an HMM structure.
• We use Xi and Oi to refer to what is in a particular HMM path and its output.

Constraints in a state-emission HMM:

Σ_i πi = 1; for each i: Σ_j aij = 1; for each j: Σ_k bjk = 1

For any integer n and any HMM, the probabilities of all output sequences of length n sum to one.

An example: the HMM structure

[Figure: states s1, s2, …, sN connected by arcs; each state is labeled with the output symbols (w1, …, w5) it can emit]

• Two kinds of parameters:
• Transition probability: P(sj | si)
• Emission probability: P(wk | si)
• # of parameters: O(NM + N²)

A path in a state-emission HMM

Output symbols generated by the from-states:
[Figure: a path X1 → X2 → … → Xn, where each state Xt emits ot]
• State sequence: X1,n
• Output sequence: O1,n

Output symbols generated by the to-states:
[Figure: a path X1 → X2 → … → Xn+1, where each state Xt+1 emits ot]
• State sequence: X1,n+1
• Output sequence: O1,n


Arc-emission vs. state-emission

Properties of HMM
• Markov assumption (limited horizon): P(Xt+1 | X1, …, Xt) = P(Xt+1 | Xt)
• Stationary distribution (time invariance): the probabilities do not change over time: P(Xt+1 = sj | Xt = si) is the same for every t
• The states are hidden because we know the structure of the machine (i.e., S and Σ), but we don’t know which state sequence generated a particular output.
Are the two types of HMMs equivalent?
• For each state-emission HMM1, there is an arc-emission HMM2, such that for any sequence O, P(O|HMM1)=P(O|HMM2).
• The reverse is also true.
• How to prove that?
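One direction of the proof is constructive: given a state-emission HMM whose symbols are emitted by the to-states, build an arc-emission HMM with the same π and A and with b_ijk = b_jk; every path then gets the same probability in both models. A brute-force sketch under that construction (the tiny HMM and all names are illustrative, not from the slides):

```python
import itertools

# A hypothetical tiny state-emission HMM (to-state emission).
states = [0, 1]
symbols = ['x', 'y']
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(s_j | s_i)
B_state = [[0.5, 0.5], [0.1, 0.9]]   # B_state[j][k] = P(w_k | s_j)

# Arc-emission HMM obtained by ignoring the from-state: b_ijk = b_jk
B_arc = [[[B_state[j][k] for k in range(2)] for j in range(2)] for i in range(2)]

def prob_state_emission(obs):
    # sum over all state sequences X_1..X_{n+1}; o_t is emitted by the to-state X_{t+1}
    total = 0.0
    for path in itertools.product(states, repeat=len(obs) + 1):
        p = pi[path[0]]
        for t, o in enumerate(obs):
            k = symbols.index(o)
            p *= A[path[t]][path[t + 1]] * B_state[path[t + 1]][k]
        total += p
    return total

def prob_arc_emission(obs):
    # same sum, but o_t is emitted by the arc (X_t, X_{t+1})
    total = 0.0
    for path in itertools.product(states, repeat=len(obs) + 1):
        p = pi[path[0]]
        for t, o in enumerate(obs):
            k = symbols.index(o)
            p *= A[path[t]][path[t + 1]] * B_arc[path[t]][path[t + 1]][k]
        total += p
    return total
```

Checking `prob_state_emission(obs) == prob_arc_emission(obs)` for every short observation sequence confirms the construction path by path.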
Applications of HMM
• N-gram POS tagging
• Bigram tagger: oi is a word, and si is a POS tag.
• Other tagging problems:
• Word segmentation
• Chunking
• NE tagging
• Punctuation prediction
• Other applications: ASR, ….

### Three HMM questions

Three fundamental questions for HMMs
• Training an HMM: given a set of observation sequences, learn its distribution, i.e., learn the transition and emission probabilities
• HMM as a parser: finding the best state sequence for a given observation
• HMM as an LM: computing the probability of a given observation
Training an HMM: estimating the probabilities
• Supervised learning:
• The state sequences in the training data are known
• ML estimation
• Unsupervised learning:
• The state sequences in the training data are unknown
• forward-backward algorithm
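In the supervised case, ML estimation reduces to relative-frequency counts over the tagged training data. A minimal sketch (function and variable names are mine; smoothing is omitted):

```python
from collections import defaultdict

def estimate(tagged_sentences):
    """Supervised ML estimation for a bigram tagger.
    Each training example is a list of (word, tag) pairs; BOS is a dummy start state."""
    trans_count = defaultdict(lambda: defaultdict(int))
    emit_count = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:
        prev = 'BOS'
        for word, tag in sent:
            trans_count[prev][tag] += 1   # count tag bigrams
            emit_count[tag][word] += 1    # count (tag, word) pairs
            prev = tag
    # relative-frequency (ML) estimates: count / row total
    trans = {i: {j: c / sum(d.values()) for j, c in d.items()}
             for i, d in trans_count.items()}
    emit = {j: {w: c / sum(d.values()) for w, c in d.items()}
            for j, d in emit_count.items()}
    return trans, emit
```

For instance, two training sentences in which "time" is always N and "flies" is V once and N once yield P(V | N) = P(N | N) = 0.5 and P(time | N) = 2/3.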

### HMM as a parser

[Figure: a path X1 → X2 → … → XT → XT+1 producing o1 o2 … oT]

HMM as a parser: finding the best state sequence
• Given the observation O1,T = o1…oT, find the state sequence X1,T+1 = X1 … XT+1 that maximizes P(X1,T+1 | O1,T).

⇒ the Viterbi algorithm

“time flies like an arrow”

\emission
N time 0.1
V time 0.1
N flies 0.1
V flies 0.2
V like 0.2
P like 0.1
DT an 0.3
N arrow 0.1

\init
BOS 1.0

\transition
BOS N 0.5
BOS DT 0.4
BOS V 0.1
DT N 1.0
N N 0.2
N V 0.7
N P 0.1
V DT 0.4
V N 0.4
V P 0.1
V V 0.1
P DT 0.6
P N 0.4

Finding all the paths: to build the trellis

time flies like an arrow

[Figure: trellis with BOS at position 0 and the candidate tags {N, V, P, DT} at each following position]

Finding all the paths (cont)

time flies like an arrow

[Figure: the same trellis with a {N, V, P, DT} column at every position and all transitions between adjacent columns drawn]

Viterbi algorithm

The probability of the best path that produces O1,t-1 while ending up in state sj:

δj(t) = max over X1,t-1 of P(X1,t-1, O1,t-1, Xt = sj)

Initialization: δj(1) = πj

Induction: δj(t+1) = max_i δi(t) · aij · bjk, where ot = wk

⇒ Modify it to allow ε-emission.

Viterbi algorithm: calculating δj(t)

# N is the number of states in the HMM structure
# observ is the observation O, and leng is the length of observ

Initialize viterbi[0..leng][0..N-1] to 0

for each state j
    viterbi[0][j] = π[j]
    back-pointer[0][j] = -1    # dummy

for (t = 0; t < leng; t++)
    for (j = 0; j < N; j++)
        k = observ[t]    # the symbol at time t
        viterbi[t+1][j] = max_i viterbi[t][i] · aij · bjk
        back-pointer[t+1][j] = arg max_i viterbi[t][i] · aij · bjk

Viterbi algorithm: retrieving the best path

# find the best path
best_final_state = arg max_j viterbi[leng][j]
j = best_final_state
push(arr, j)
for (t = leng; t > 0; t--)
    i = back-pointer[t][j]
    push(arr, i)
    j = i
return reverse(arr)
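The pseudocode above can be turned into a small working decoder. The sketch below is not the assignment solution (the structure and names are mine); it uses the example probability tables from the earlier slides, stores only nonzero entries, and sums log probabilities as suggested under implementation issues:

```python
import math

# Example probabilities from the "time flies like an arrow" slides.
init = {'BOS': 1.0}
trans = {'BOS': {'N': 0.5, 'DT': 0.4, 'V': 0.1},
         'DT':  {'N': 1.0},
         'N':   {'N': 0.2, 'V': 0.7, 'P': 0.1},
         'V':   {'DT': 0.4, 'N': 0.4, 'P': 0.1, 'V': 0.1},
         'P':   {'DT': 0.6, 'N': 0.4}}
emit = {'N':  {'time': 0.1, 'flies': 0.1, 'arrow': 0.1},
        'V':  {'time': 0.1, 'flies': 0.2, 'like': 0.2},
        'P':  {'like': 0.1},
        'DT': {'an': 0.3}}

def viterbi(words):
    # delta[t][j]: log-prob of the best path producing o_1..o_t and ending in state j
    delta = [{s: math.log(p) for s, p in init.items()}]
    back = [{}]
    for w in words:
        col, bp = {}, {}
        for j, ew in emit.items():
            if w not in ew:
                continue                  # state j cannot emit w
            cands = [(delta[-1][i] + math.log(trans[i][j]) + math.log(ew[w]), i)
                     for i in delta[-1] if j in trans.get(i, {})]
            if cands:
                col[j], bp[j] = max(cands)
        delta.append(col)
        back.append(bp)
    # follow back-pointers from the best final state
    j = max(delta[-1], key=delta[-1].get)
    path = [j]
    for t in range(len(words), 0, -1):
        j = back[t][j]
        path.append(j)
    return list(reversed(path))[1:]       # drop the dummy BOS state
```

Calling `viterbi("time flies like an arrow".split())` returns a maximum-probability tag sequence of the form N ? V DT N; under these numbers the second word is a genuine tie between N and V, so either back-pointer choice is correct.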

Hw7 and Hw8
• Hw7: write an HMM “class”:
• Output HMM
• Hw8: implement the algorithms for two HMM tasks:
• HMM as parser: Viterbi algorithm
• HMM as LM: the prob of an observation
Implementation issue: storing HMM

Approach #1: store strings directly

• πi: pi {state_str}
• aij: a {from_state_str} {to_state_str}
• bjk: b {state_str} {symbol}

Approach #2: map strings to indices

• state2idx{state_str} = state_idx
• symbol2idx{symbol_str} = symbol_idx
• πi: pi[state_idx] = prob
• aij: a[from_state_idx][to_state_idx] = prob
• bjk: b[state_idx][symbol_idx] = prob
• idx2state[state_idx] = state_str
• idx2symbol[symbol_idx] = symbol_str
Storing HMM: sparse matrix
• Dense arrays: aij: a[i][j] = prob; bjk: b[j][k] = prob
• Sparse, keyed by row: aij: a[i] = “j1 p1 j2 p2 …”; bjk: b[j] = “k1 p1 k2 p2 …”
• Sparse, keyed by column: aij: a[j] = “i1 p1 i2 p2 …”; bjk: b[k] = “j1 p1 j2 p2 …”
Other implementation issues
• Indices start from 0 in programs, but often start from 1 in algorithm descriptions.
• In practice, sums of log probabilities replace products of probabilities to avoid underflow.
• Check the constraints and print a warning if they are not met.
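A sketch combining sparse, log-space storage with the constraint check from the last bullet (the function name and input format are mine, not part of the assignment spec):

```python
import math
import sys

def load_trans(triples, tol=1e-6):
    """triples: iterable of (from_state, to_state, prob) entries.
    Returns a nested dict of log probabilities; only nonzero entries are stored."""
    a = {}
    for i, j, p in triples:
        a.setdefault(i, {})[j] = math.log(p)   # store logprob, sparse by row
    # constraint check: outgoing probabilities of each state should sum to 1
    for i, row in a.items():
        total = sum(math.exp(lp) for lp in row.values())
        if abs(total - 1.0) > tol:
            print(f"warning: transition probs from {i} sum to {total}", file=sys.stderr)
    return a
```

A Viterbi or forward pass over this structure then adds the stored log probabilities instead of multiplying probabilities.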

### HMM as LM

HMM as an LM: computing P(o1, …, oT)

1st try:

- enumerate all possible state sequences
- add up the probabilities of all the paths

This enumerates O(N^T) paths, which is too expensive; forward probabilities compute the same sum efficiently.

Forward probabilities
• Forward probability: the probability of producing O1,t-1 while ending up in state si:

αi(t) = P(O1,t-1, Xt = si)

Calculating forward probability

Initialization: αj(1) = πj

Induction: αj(t+1) = Σ_i αi(t) · aij · bjk, where ot = wk

Then P(O1,T) = Σ_j αj(T+1).
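The induction can be implemented directly. A sketch of the forward computation for the same example tagger (the probability tables are from the slides; the code structure and names are mine):

```python
# Example probabilities from the "time flies like an arrow" slides.
init = {'BOS': 1.0}
trans = {'BOS': {'N': 0.5, 'DT': 0.4, 'V': 0.1},
         'DT':  {'N': 1.0},
         'N':   {'N': 0.2, 'V': 0.7, 'P': 0.1},
         'V':   {'DT': 0.4, 'N': 0.4, 'P': 0.1, 'V': 0.1},
         'P':   {'DT': 0.6, 'N': 0.4}}
emit = {'N':  {'time': 0.1, 'flies': 0.1, 'arrow': 0.1},
        'V':  {'time': 0.1, 'flies': 0.2, 'like': 0.2},
        'P':  {'like': 0.1},
        'DT': {'an': 0.3}}

def forward_prob(words):
    # alpha[j] = P(o_1..o_{t-1}, X_t = j); initialization: alpha_j(1) = pi_j
    alpha = dict(init)
    for w in words:
        nxt = {}
        for j, ew in emit.items():
            if w not in ew:
                continue
            total = sum(p * trans[i].get(j, 0.0) for i, p in alpha.items() if i in trans)
            if total > 0.0:
                nxt[j] = total * ew[w]    # sum_i alpha_i(t) * a_ij * b_jk
        alpha = nxt
    return sum(alpha.values())            # P(O_1,T) = sum_j alpha_j(T+1)
```

Unlike Viterbi, this sums rather than maximizes over predecessor states, so `forward_prob(['time'])` adds both the N path (0.5 × 0.1) and the V path (0.1 × 0.1), giving 0.06.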

Summary
• Definition: hidden states, output symbols
• Properties: Markov assumption
• Applications: POS-tagging, etc.
• Three basic questions in HMM
• Find the probability of an observation: forward probability
• Find the best sequence: Viterbi algorithm
• Estimate probability: MLE
• Bigram POS tagger: decoding with Viterbi algorithm