1 / 123

# CS 6243 Machine Learning - PowerPoint PPT Presentation

CS 6243 Machine Learning. Markov Chain and Hidden Markov Models. Outline. Background on probability Hidden Markov models Algorithms Applications. Probability Basics. Definition (informal)

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' CS 6243 Machine Learning' - dara

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### CS 6243 Machine Learning

Markov Chain and Hidden Markov Models

• Background on probability

• Hidden Markov models

• Algorithms

• Applications

• Definition (informal)

• Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed

• A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment

• The sample space S of a random experiment is the set of all possible outcomes

• All probabilities between 0 and 1

• If A, B are mutually exclusive:

• P(A  B) = P(A) + P(B)

• Thus: P(not(A)) = P(Ac) = 1 – P(A)

S

B

A

• The joint probability of two events A and B P(AB), or simply P(A, B) is the probability that event A and B occur at the same time.

• The conditional probability of P(A|B) is the probability that A occurs given B occurred.

P(A | B) = P(A  B) / P(B)

<=> P(A  B) = P(A | B) P(B)

<=> P(A  B) = P(B|A) P(A)

• Roll a die

• If I tell you the number is less than 4

• What is the probability of an even number?

• P(d = even | d < 4) = P(d = even  d < 4) / P(d < 4)

• P(d = 2) / P(d = 1, 2, or 3) = (1/6) / (3/6) = 1/3

• A and B are independent iff:

• Therefore, if A and B are independent:

These two constraints are logically equivalent

• Are P(d = even) and P(d < 4) independent?

• P(d = even and d < 4) = 1/6

• P(d = even) = ½

• P(d < 4) = ½

• ½ * ½ > 1/6

• If your die actually has 8 faces, will P(d = even) and P(d < 5) be independent?

• Are P(even in first roll) and P(even in second roll) independent?

• Playing card, are the suit and rank independent?

• Let B1, B2, …, BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S.

• An event A can be represented as:

• Since B1, B2, …, BN are mutually exclusive, then

• P(A) = P(A  B1) + P(A B2) + … + P(ABN)

• And therefore

• P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + … + P(A|BN)*P(BN)

• = i P(A | Bi) * P(Bi)

Marginalization

Exhaustive conditionalization

• A loaded die:

• P(6) = 0.5

• P(1) = … = P(5) = 0.1

• Prob of even number?

P(even)

= P(even | d < 6) * P (d<6) +

P(even | d = 6) * P (d=6)

= 2/5 * 0.5 + 1 * 0.5

= 0.7

• A box of dice:

• 99% fair

• P(6) = 0.5.

• P(1) = … = P(5) = 0.1

• Randomly pick a die and roll, P(6)?

• P(6) = P(6 | F) * P(F) + P(6 | L) * P(L)

• 1/6 * 0.99 + 0.5 * 0.01 = 0.17

• P(A  B) = P(B) * P(A | B) = P(A) * P(B | A)

Conditional probability

(likelihood)

P

(

A

|

B

)

P

(

B

)

Prior of B

=>

=

P

(

B

|

A

)

P

(

A

)

Posterior probability

Prior of A (Normalizing constant)

This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics

Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition

• Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj?

• P(Bj | A) = P(A | Bj) * P(Bj) / P(A)

= P(A | Bj) * P(Bj) / jP(A | Bj)*P(Bj)

Posterior probability

Prior of Bj

Likelihood

Normalizing constant

(theorem of total probabilities)

Bj: different models / hypotheses

In the observation of A, should you choose a model that maximizes P(Bj | A) or P(A | Bj)? Depending on how much you know about Bj !

• A test for a rare disease claims that it will report positive for 99.5% of people with disease, and negative 99.9% of time for those without.

• The disease is present in the population at 1 in 100,000

• What is P(disease | positive test)?

• P(D|+) = P(+|D)P(D)/P(+) = 0.01

• What is P(disease | negative test)?

• P(D|-) = P(-|D)P(D)/P(-) = 5e-8

• We’ve talked about the boxes of casinos: 99% fair, 1% loaded (50% at six)

• We said if we randomly pick a die and roll, we have 17% of chance to get a six

• If we get 3 six in a row, what’s the chance that the die is loaded?

• How about 5 six in a row?

• P(loaded | 666)

= P(666 | loaded) * P(loaded) / P(666)

= 0.53 * 0.01 / (0.53 * 0.01 + (1/6)3 * 0.99)

= 0.21

• P(loaded | 66666)

= P(66666 | loaded) * P(loaded) / P(66666)

= 0.55 * 0.01 / (0.55 * 0.01 + (1/6)5 * 0.99)

= 0.71

• Assume nature generates a type of DNA sequence as follows:

• A box of dice, each with four faces: {A,C,G,T}

• Select a die suitable for the type of DNA

• Roll it, append the symbol to a string.

• Repeat 3, until all symbols have been generated.

• Given a string say X=“GATTCCAA…” and two dice

• M1 has the distribution of pA=pC=pG=pT=0.25.

• M2 has the distribution: pA=pT=0.20, pC=pG=0.30

• What is the probability of the sequence being generated by M1 or M2?

• X = GATTCCAA

• P(X | M1) = P(x1,x2,…,xn | M1)

= i=1..nP(xi|M1)

= 0.258 = 1.53e-5

• P(X | M2) = P(x1,x2,…,xn | M2)

= i=1..nP(xi|M2)

= 0.25 0.33 = 8.64e-6

P(X|M1) / P(X|M2) =  P(xi|M1)/P(xi|M2) = (0.25/0.2)5 (0.25/0.3)3

LLR =  log(P(xi|M1)/P(xi|M2))

= nASA + nCSC + nGSG + nTST

= 5 * log(1.25) + 3 * log(0.833) = 0.57

Si = log (P(i | M1) / P(i | M2)), i = A, C, G, T

Log likelihood ratio (LLR)

• Take the prior probabilities of M1 and M2 into consideration if known

Log (P(M1|X) / P(M2|X))

= LLR + log(P(M1)) – log(P(M2))

= nASA + nCSC + nGSG + nTST + log(P(M1)) – log(P(M2))

• If P(M1) ~ P(M2), results will be similar to LLR test

Markov models for DNA sequences criterion

We have assumed independence of nucleotides in different positions - unrealistic in biology

Example: CpG islands criterion

• CpG - 2 adjacent nucleotides, same strand (not base-pair; “p” stands for the phosphodiester bond of the DNA backbone)

• In mammal promoter regions, CpG is more frequent than other regions of genome

• often mark gene-rich regions

CpG islands criterion

• CpG Islands

• More CpG than elsewhere

• More C & G than elsewhere, too

• Typical length: a few 100s to few 1000s bp

• Questions

• Is a short sequence (say, 200 bp) a CpG island or not?

• Given a long sequence (say, 10-100kb), find CpG islands?

Markov models criterion

• A sequence of random variables is a k-th order Markov chain if, for all i, ithvalue is independent of all but the previous k values:

• First order (k=1):

• Second order:

• 0th order: (independence)

A 1 criterionst order Markov model for CpG islands

• Essentially a finite state automaton (FSA)

• Transitions are probabilistic (instead of deterministic)

• 4 states: A, C, G, T

• 16 transitions: ast = P(xi = t | xi-1 = s)

• Begin/End states

Probability of a sequence criterion

• What’s the probability of ACGGCTA in this model?

P(A) * P(C|A) * P(G|C) … P(A|T)

= aBA aAC aCG …aTA

• Equivalent: follow the path in the automaton, and multiply the transition probabilities on the path

Training criterion

• Estimate the parameters of the model

• CpG+ model: Count the transition frequencies from known CpG islands

• CpG- model: Also count the transition frequencies from sequences without CpG islands

• ast = #(s→t) / #(s →)

a+st

a-st

Discrimination / Classification criterion

• Given a sequence, is it CpG island or not?

• Log likelihood ratio (LLR)

βCG = log2(a+CG/a -CG) = log2(0.274/0.078) = 1.812

βBA = log2(a+ A/a - A) = log2(0.591/1.047) = -0.825

Example criterion

• X = ACGGCGACGTCG

• S(X) = βBA + βAC +βCG +βGG +βGC +βCG +βGA + βAC +βCG +βGT +βTC +βCG

= βBA + 2βAC +4βCG +βGG +βGC +βGA +βGT +βTC

= -0.825+ 2*.419 + 4*1.812+.313 +.461 - .624 - .730 + .573

= 7.25

CpG island scores criterion

Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

Questions criterion

• Q1: given a short sequence, is it more likely from CpG+ model or CpG- model?

• Q2: Given a long sequence, where are the CpG islands (if any)?

• Approach 1: score (e.g.) 100 bp windows

• Pro: simple

• Con: arbitrary, fixed length, inflexible

• Approach 2: combine +/- models.

Combined model criterion

• Given a long sequence, predict which state each position is in. (states are hidden: Hidden Markov model)

Hidden Markov Model (HMM) criterion

• Introduced in the 70’s for speech recognition

• Have been shown to be good models for biosequences

• Alignment

• Gene prediction

• Protein domain analysis

• An observed sequence data that can be modeled by a Markov chain

• State path unknown

• Model parameter known or unknown

• Observed data: emission sequences X = (x1x2…xn)

• Hidden data: state sequences Π = (π1π2…πn)

Hidden Markov model (HMM) criterion

Definition: A hidden Markov model (HMM) is a five-tuple

• Alphabet = { b1, b2, …, bM }

• Set of statesQ = { 1, ..., K }

• Transition probabilities between any two states

aij = transition prob from state i to state j

ai1 + … + aiK = 1, for all states i = 1…K

• Start probabilitiesa0i

a01 + … + a0K = 1

• Emission probabilities within each state

ek(b) = P( xi = b | i = k)

ek(b1) + … + ek(bM) = 1, for all states k = 1…K

1

2

K

HMM for the Dishonest Casino criterion

A casino has two dice:

• Fair die

P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

P(1) = P(2) = P(3) = P(4) = P(5) = 1/10

P(6) = 1/2

Casino player switches back and forth between fair and loaded die once in a while

The dishonest casino model criterion

aFF = 0.95

aFL = 0.05

aLL = 0.95

Fair

eF(1) = 1/6

eF(2) = 1/6

eF(3) = 1/6

eF(4) = 1/6

eF(5) = 1/6

eF(6) = 1/6

eL(1) = 1/10

eL(2) = 1/10

eL(3) = 1/10

eL(4) = 1/10

eL(5) = 1/10

eL(6) = 1/2

aLF = 0.05

Transition probability

Emission probability

Simple scenario criterion

• You don’t know the probabilities

• The casino player lets you observe which die he/she uses every time

• The “state” of each roll is known

• Training (parameter estimation)

• How often the casino player switches dice?

• How “loaded” is the loaded die?

• Simply count the frequency that each face appeared and the frequency of die switching

• May add pseudo-counts if number of observations is small

More complex scenarios criterion

• The “state” of each roll is unknown:

• You are given the results of a series of rolls

• You don’t know which number is generated by which die

• You may or may not know the parameters

• How “loaded” is the loaded die

• How frequently the casino player switches dice

The three main questions on HMMs criterion

• Decoding

GIVEN a HMM M, and a sequence x,

FIND the sequence  of states that maximizes P (x,  | M )

• Evaluation

GIVEN a HMM M, and a sequence x,

FIND P ( x | M ) [or P(x)for simplicity]

• Learning

GIVEN a HMM M with unspecified transition/emission probs.,

and a sequence x,

FIND parameters  = (ei(.), aij) that maximize P (x | )

Sometimes written as P (x, ) for simplicity.

Question # 1 – Decoding criterion

GIVEN

A HMM with parameters. And a sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION

What portion of the sequence was generated with the fair die, and what portion with the loaded die?

This is the DECODING question in HMMs

1 criterion

1

1

1

2

2

2

2

K

K

K

K

A parse of a sequence

Given a sequence x = x1……xN, and a HMM with k states,

A parse of x is a sequence of states  = 1, ……, N

1

2

2

K

x1

x2

x3

xK

1 criterion

1

1

1

2

2

2

2

K

K

K

K

Probability of a parse

1

Given a sequence x = x1……xN

and a parse  = 1, ……, N

To find how likely is the parse:

(given our HMM)

P(x, ) = P(x1, …, xN, 1, ……, N)

= P(xN, N | N-1) P(xN-1, N-1 | N-2)……P(x2, 2 | 1) P(x1, 1)

= P(xN | N)P(N | N-1) ……P(x2 | 2)P(2 | 1)P(x1 | 1)P(1)

=a01 a12……aN-1Ne1(x1)……eN(xN)

2

2

K

x1

x2

x3

xK

Example criterion

0.05

• What’s the probability of

X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?

0.95

0.95

Fair

P(1|F) = 1/6

P(2|F) = 1/6

P(3|F) = 1/6

P(4|F) = 1/6

P(5|F) = 1/6

P(6|F) = 1/6

P(1|L) = 1/10

P(2|L) = 1/10

P(3|L) = 1/10

P(4|L) = 1/10

P(5|L) = 1/10

P(6|L) = 1/2

0.05

Example criterion

• What’s the probability of

X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?

P = ½ * P(1 | F) P(Fi+1 | Fi) …P(5 | F) P(Li+1 | Fi) P(6|L) P(Li+1 | Li) …P(4 | F)

= ½ x 0.957 0.052 x (1/6)6 x (1/10)2 x (1/2)2 = 5 x 10-11

0.05

0.05

Decoding criterion

• Parse (path) is unknown. What to do?

• Alternative algorithms:

• Most probable single path (Viterbi algorithm)

• Sequence of most probable states (Forward-backward algorithm)

The Viterbi algorithm criterion

• Goal: to find

• Is equivalent to find

The Viterbi algorithm criterion

L

L

L

L

L

L

L

L

L

L

• Find a path with the following objective:

• Maximize the product of transition and emission probabilities Maximize the sum of log probabilities

B

F

F

F

F

F

F

F

F

F

F

P(s|L) = 1/10, for s in [1..5]P(6|L) = 1/2

P(s|F) = 1/6, for s in [1..6]

Edge weight

(symbol independent)

Node weight

(depend on symbols in seq)

w criterionLF

wFF

wFF = log (aFF)

rF(xi+1) = log (eF(xi+1))

The Viterbi algorithm

x1 x2 x3 … xi xi+1 … xn-1 xn

VF (i+1) = rF (xi+1) + max VF (i) + wFF

VL (i) + wLF

L

L

L

L

L

L

L

B

E

F

F

F

F

F

F

F

Weight for the best parse of (x1…xi+1), with xi+1 emitted by state F

VL (i+1) = rL (xi+1) + max VF (i) + wFL

VL (i) + wLL

Weight for the best parse of (x1…xi+1), with xi+1 emitted by state L

Recursion from FSA directly criterion

wLL=-0.05

wFF=-0.05

WFL=-3.00

aLL=0.95

aFF=0.95

aFL=0.05

Fair

Fair

WLF=-3.00

aLF=0.05

rL(6) = -0.7

rL(s) = -2.3

(s = 1…5)

P(s|F) = 1/6

s = 1…6

P(6|L) = ½

P(s|L) = 1/10

(s = 1...5)

rF(s) = -1.8

s = 1...6

VF (i+1) = rF (xi+1) + max {VL (i) + WLF VF (i) + WFF}

VL (i+1) = rL (xi+1) + max {VL (i) + WLL VF (i) + WFL}

PF (i+1) = eF (xi+1) max {PL (i) aLF PF (i) aFF}

PL (i+1) = eL (xi+1) max {PL (i) aLL PF (i) aFL}

1 criterion

2

l

K

k

In general: more states / symbols

• Alphabet  = { b1, b2, …, bM }

• Set of states Q = { 1, ..., K }

• States are completely connected.

• K2 transitions probabilities (some may be 0)

• Each state has M transition probabilities (some may be 0)

xi xi+1

1

1

2

2

……

l

k

……

K

K

The Viterbi Algorithm criterion

x1 x2 x3 … … xi+1……… …………………………xN

Similar to “aligning” a set of states to a sequence

Time: O(K2N)

Space: O(KN)

State 1

2

l

Vl(i+1)

K

The Viterbi Algorithm (in log space) criterion

Input: x = x1……xN

Initialization:

V0(0) = 0 (zero in subscript is the start state.)

Vl(0) = -inf, for all l > 0 (0 in parenthesis is the imaginary first position)

Iteration:

for each i

for each l

Vl(i) = rl(xi)+maxk (wkl+ Vk(i-1)) // rj(xi) = log(ej(xi)), wkj = log(akj)

Ptrl(i) = argmaxk (wkl+ Vk(i-1))

end

end

Termination:

Prob(x, *) = exp{maxk Vk(N)}

Traceback:

N* = argmaxk Vk(N)

i-1* = Ptri (i)

Input: x = x1……xN

Initialization:

P0(0) = 1 (zero in subscript is the start state.)

Pl(0) = 0, for all l > 0 (0 in parenthesis is the imaginary first position)

Iteration:

for each i

for each l

Pl(i) = el(xi)maxk (akl Pk(i-1))

Ptrl(i) = argmaxk (akl Pk(i-1))

end

end

Termination:

Prob(x, *) = maxk Pk(N)

Traceback:

N* = argmaxk Pk(N)

i-1* = Ptri (i)

CpG islands criterion

• Data: 41 human sequences, including 48 CpG islands of about 1kbp each

• Viterbi:

• Found 46 of 48

• plus 121 “false positives”

• Post-processing:

• merge within 500bp

• discard < 500

• Found 46/48

• 67 false positive

Problems with Viterbi decoding criterion

• Most probable path not necessarily the only interesting one

• Single optimal vs multiple sub-optimal

• What if there are many sub-optimal paths with slightly lower probabilities?

• Global optimal vs local optimal

• What’s best globally may not be the best for each individual

Example criterion

• The dishonest casino

• Say x = 12341623162616364616234161221341

• Most probable path:  = FF……F

• However: marked letters more likely to be L than unmarked letters

• Another way to interpret the problem

• With Viterbi, every position is assigned a single label

• Confidence level for each assignment?

Posterior decoding criterion

• Viterbi finds the path with the highest probability

• We want to know

k = 1

• In order to do posterior decoding, we need to know P(x) and P(i = k, x), since

• Computing P(x) and P(x,i=k) is called the evaluation problem

• The solution: Forward-backward algorithm

Probability of a sequence criterion

• P(X | M): prob that X can be generated by M

• Sometimes simply written as P(X)

• May be written as P(X | M, θ) or P(X | θ) to emphasize that we are looking for θ to optimize the likelihood (discussed later in learning)

• Not equal to the probability of a path P(X, )

• Many possible paths can generate X. Each with a probability

• P(X) = P(X, ) = P(X | ) P()

• How to compute without summing over all possible paths (exponential of them)?

• Dynamic programming

The forward algorithm criterion

• Define fk(i) = P(x1…xi, i=k)

• Implicitly: sum over all possible paths for x1…xi-1

xi

k

The forward algorithm criterion

k

xi

The forward algorithm criterion

We can compute fk(i) for all k, i, using dynamic programming!

Initialization:

f0(0) = 1

fk(0) = 0, for all k > 0

Iteration:

fk(i) = ek(xi) jfj(i-1) ajk

Termination:

Prob(x) = kfk(N)

Relation between Forward and Viterbi criterion

VITERBI (in prob space)

Initialization:

P0(0) = 1

Pk(0) = 0, for all k > 0

Iteration:

Pk(i) = ek(xi) maxjPj(i-1) ajk

Termination:

Prob(x, *) = maxkPk(N)

FORWARD

Initialization:

f0(0) = 1

fk(0) = 0, for all k > 0

Iteration:

fk(i) = ek(xi) j fj(i-1) ajk

Termination:

Prob(x) = k fk(N)

Posterior decoding criterion

• Viterbi finds the path with the highest probability

• We want to know

k = 1

• In order to do posterior decoding, we need to know P(x) and P(i = k, x), since

Need to know how to compute this

Have just shown how to compute this

x criterioni

k

The backward algorithm criterion

• Define bk(i) = P(xi+1…xn | i=k)

• Implicitly: sum over all possible paths for xi…xn

xi

k

k criterion

xi

1

This does not include the emission probability ofxi

The forward-backward algorithm criterion

• Compute fk(i) for each state k and position i

• Compute bk(i), for each state k and position i

• Compute P(x) = kfk(N)

• Compute P(i=k | x) = fk(i) * bk(i) / P(x)

The prob of criterionx, with the constraint that xi was generated by state k

Sequence

state

x

Forward probabilities

Backward probabilities

/ P(X)

Space: O(KN)

Time: O(K2N)

P(i=k | x)

What’s criterionP(i=k | x) good for?

• For each position, you can assign a probability (in [0, 1]) to the states that the system might be in at that point – confidence level

• Assign each symbol to the most-likely state according to this probability rather than the state on the most-probable path – posterior decoding

^i = argmaxkP(i = k | x)

If P(fair) > 0.5, the roll is more likely to be generated by a fair die than a loaded die

In this example, Viterbi predicts that all rolls were from the fair die.

CpG islands again criterion

• Data: 41 human sequences, including 48 CpG islands of about 1kbp each

• Viterbi: Post-process:

• Found 46 of 48 46/48

• plus 121 “false positives” 67 false pos

• Posterior Decoding:

• same 2 false negatives 46/48

• plus 236 false positives 83 false pos

Post-process: merge within 500; discard < 500

What if a new genome comes? criterion

We just sequenced the porcupine genome

We know CpG islands play the same role in this genome

However, we have not many known CpG islands for porcupines

We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model?

- LEARNING

Learning criterion

• When the state path is known

• We’ve already done that

• Estimate parameters from labeled data (known CpG and non-CpG)

• “Supervised” learning

• When the state path is unknown

• Estimate parameters without labeled data

• “unsupervised” learning

Basic idea criterion

• Estimate our “best guess” on the model parameters θ

• Use θ to predict the unknown labels

• Re-estimate a new set of θ

• Repeat 2 & 3

Multiple ways

Viterbi training criterion

• Estimate our “best guess” on the model parameters θ

• Find the Viterbi path using current θ

• Re-estimate a new set of θbased on the Viterbi path

• Count transitions/emissions on those paths, getting new θ

• Repeat 2 & 3 until converge

Baum-Welch training criterion

• Estimate our “best guess” on the model parameters θ

• Find P(i=k | x,θ) using forward-backward algorithm

• Re-estimate a new set of θbased on all possible paths

• For example, according to Viterbi, pos i is in state k and pos (i+1) is in state l

• This contributes 1 count towards the frequency that transition k l is used

• In Baum-Welch, pos i has some prob in state k and pos (i+1) has some prob in state l. This transition is counted only partially, according to the prob of this transition

• Repeat 2 & 3 until converge

i

i+1

k

l

Estimated # of criterionkl transition

Viterbi vs Baum-Welch training criterion

• Viterbi training

• Returns a single path

• Each position labeled with a fixed state

• Each transition counts one

• Each emission also counts one

• Baum-Welch training

• Does not return a single path

• Considers the prob that each transition is used and the prob that a symbol is generated by a certain state

• They only contribute partial counts

Viterbi vs Baum-Welch training criterion

• Both guaranteed to converges

• Baum-Welch improves the likelihood of the data in each iteration: P(X)

• True EM (expectation-maximization)

• Viterbi improves the probability of the most probable path in each iteration: P(X, *)

• EM-like

Expectation-maximization (EM) criterion

• Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm, a widely used technique in statistics for learning parameters from unlabeled data

• E-step: compute the expectation (e.g. prob for each pos to be in a certain state)

• M-step: maximum-likelihood parameter estimation

• Recall: clustering

HMM summary criterion

• Viterbi – best single path

• Forward – sum over all paths

• Backward – similar

• Baum-Welch – training via EM and forward-backward

• Viterbi training – another “EM”, but Viterbi-based

Modular design of HMM criterion

• HMM can be designed modularly

• Each modular has own begin / end states (silent, i.e. no emission)

• Each module communicates with other modules only through begin/end states

C+ criterion

G+

A+

T+

B+

E+

HMM modules and non-HMM modules can be mixed

B-

E-

A-

T-

C-

G-

HMM applications criterion

• Gene finding

• Character recognition

• Speech recognition: a good tutorial on course website

• Machine translation

• Many others

Word recognition example(1). criterion

Amherst

a

0.5

0.03

b

A

c

0.005

0.31

z

• Typed word recognition, assume all characters are separated.

• Character recognizer outputs probability of the image being particular character, P(image|character).

Hidden state Observation

http://www.cedar.buffalo.edu/~govind/cs661

Word recognition example(2). criterion

• Hidden states of HMM = characters.

• Observations = typed images of characters segmented from the image . Note that there is an infinite number of observations

• Observation probabilities = character recognizer scores.

• Transition probabilities will be defined differently in two subsequent models.

http://www.cedar.buffalo.edu/~govind/cs661

Word recognition example(3). criterion

m

a

m

r

t

h

e

s

Buffalo

b

u

a

o

f

f

l

A

• If lexicon is given, we can construct separate HMM models for each lexicon word.

Amherst

0.03

0.5

0.4

0.6

• Here recognition of word image is equivalent to the problem of evaluating few HMM models.

• This is an application of Evaluation problem.

http://www.cedar.buffalo.edu/~govind/cs661

Word recognition example(4). criterion

a

b

v

f

o

m

r

t

h

e

s

• We can construct a single HMM for all words.

• Hidden states = all characters in the alphabet.

• Transition probabilities and initial probabilities are calculated from language model.

• Observations and observation probabilities are as before.

• Here we have to determine the best sequence of hidden states, the one that most likely produced word image.

• This is an application of Decoding problem.

http://www.cedar.buffalo.edu/~govind/cs661

A

• The structure of hidden states is chosen.

• Observations are feature vectors extracted from vertical slices.

• Probabilistic mapping from hidden state to feature vectors: 1. use mixture of Gaussian models

• 2. Quantize feature vector space.

http://www.cedar.buffalo.edu/~govind/cs661

s1

s2

s3

A

• HMM for character ‘A’ :

• Transition probabilities: {aij}=

• Observation probabilities: {bjk}=

 .8 .2 0 

 0 .8 .2 

 0 0 1 

 .9 .1 0 

 .1 .8 .1 

 .9 .1 0 

• The structure of hidden states:

• Observation = number of islands in the vertical slice.

• HMM for character ‘B’ :

• Transition probabilities: {aij}=

• Observation probabilities: {bjk}=

 .8 .2 0 

 0 .8 .2 

 0 0 1 

B

 .9 .1 0 

 0 .2 .8 

 .6 .4 0 

http://www.cedar.buffalo.edu/~govind/cs661

• Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed:

• { 1, 3, 2, 1}

• What HMM is more likely to generate this observation sequence , HMM for ‘A’ or HMM for ‘B’ ?

http://www.cedar.buffalo.edu/~govind/cs661

• HMM for character ‘B’:

• HMM for character ‘A’:

Hidden state sequence

Hidden state sequence

Transition probabilities

Transition probabilities

Observation probabilities

Observation probabilities

s1 s1 s2s3

s1 s1 s2s3

.8  .2  .2  .9  0  .8  .9 = 0

.8  .2  .2  .9  0  .2  .6 = 0

s1 s2 s2s3

s1 s2 s2s3

.2  .8  .2  .9  .8  .2  .6 = 0.0027648

.2  .8  .2  .9  .1  .8  .9 = 0.0020736

s1 s2 s3s3

s1 s2 s3s3

.2  .2  1  .9  .8  .4  .6 = 0.006912

.2  .2  1  .9  .1  .1  .9 = 0.000324

Total = 0.0023976

Total = 0.0096768

Consider likelihood of generating given observation for each possible sequence of hidden states:

http://www.cedar.buffalo.edu/~govind/cs661

HMM for gene finding criterion

• Foundation for most gene finders

• Include many knowledge-based fine-tunes and GHMM extensions

• We’ll only discuss basic ideas

Gene structure criterion

intron1

intron2

DNA

exon2

exon3

Intergenic

exon1

5’

3’

transcription

Pre-mRNA

splicing

Mature mRNA

translation

Exon: coding

Intron: non-coding

Intergenic: non-coding

protein

Transcription criterion

(where genetic information is stored)

• DNA-RNA pair:

• A=U, C=G

• T=A, G=C

(for making mRNA)

Coding strand: 5’-ACGTAGACGTATAGAGCCTAG-3’

Template strand: 3’-TGCATCTGCATATCTCGGATC-5’

mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.

Translation criterion

• The sequence of codons is translated to a sequence of amino acids

• Gene: -GCT TGT TTA CGA ATT-

• mRNA: -GCUUGUUUACGAAUU -

• Peptide: - Ala - Cys - Leu - Arg - Ile –

• Start codon: AUG

• Also code Met

• Stop codon: UGA, UAA, UAG

The Genetic Code criterion

Third

letter

GATCGGTCGAGCGTAAGCTAGCTAG criterion

ATCGATGATCGATCGGCCATATATC

ACTAGAGCTAGAATCGATAATCGAT

CGATATAGCTATAGCTATAGCCTAT

Finding genes

Human

Fugu

worm

E.coli

As the coding/non-coding length ratio decreases, exon prediction becomes more complex

Gene prediction in prokaryote criterion

• Finding long ORFs (open reading frame)

• An ORF may not contain stop codons

• Average ORF length = 64/3

• Expect 300bp ORF per 36kbp

• Actual ORF length ~ 1000bp

• Codon biases

• Some triplets are used more frequently than others

• Codon third position biases

exon1 criterion

exon3

Intergenic

exon2

intron1

intron2

5’

3’

Promoter

ATG

STOP

Poly-A

Splicing acceptor: AG

Splicing donor: GT

5’-UTR

3’-UTR

HMM for eukaryote gene finding

• Basic idea is the same: the distributions of nucleotides is different in exon and other regions

• Alone won’t work very well

• More signals are needed

• How to combine all the signal together?

Simplest model criterion

• Exon length may not be exact multiple of 3

• Basically have to triple the number of states to remember the excess number of bases in the previous state

intron

Intergenic

exon

4 emission probability

64 triplets emission probabilities

4 emission probability

Actually more accurate at the di-amino-acid level, i.e. 2 codons. Many methods use 5th-order Markov model for all regions

More detailed model criterion

Single exon

Init

exon

intron

Intergenic

Term

exon

Internal

Exon

Sub-models criterion

CDS: coding sequence

• START, STOP are PWMs

• Including start and stop codons and surrounding bases

5’-UTR

START

CDS

Init exon

3’-UTR

CDS

STOP

Term exon

Sub-model for intron criterion

• Sequence logos: an informative display of PWMs

• Within each column, relative height represents probability

• Height of each column reflects “information content”

Splice donor

Intron

body

Splice

acceptor

Intron

Duration modeling criterion

• For any sub-path, the probability consists of two components

• The product of emission probabilities

• Depend on symbols and state path

• The product of transition probabilities

• Depend on state path

Duration modeling criterion

• Model a stretch of DNA for which the distribution does not change for a certain length

• The simplest model implies that

P(length = L) = pL-1(1-p)

• i.e., length follows geometric distribution

• Not always appropriate

P

Duration: the number of times that a state is used consecutively without visiting other states

p

s

1-p

L

Duration models criterion

P

s

s

s

s

1-p

Min, then geometric

P

P

P

P

s

s

s

s

1-p

1-p

1-p

1-p

Negative binominal

Explicit duration modeling criterion

Exon

Intron

Intergenic

P(A | I) = 0.3

P(C | I) = 0.2

P(G | I) = 0.2

P(T | I) = 0.3

Generalized HMM. Often used in gene finders

P

L

Empirical intron length distribution

Explicit duration modeling criterion

x1 x2 x3 ………………………………………..xN

• For each position j and each state i

• Need to consider the transition from all previous positions

• Time: O(N2K2)

• N can be 108

1

2

Pk(i)

K

Speedup GHMM criterion

• Restrict maximum duration length to be L

• O(LNK2)

• However, intergenic and intron can be quite long

• L can be 105

• Compromise: explicit duration for exons only, geometric for all other states

• Pre-compute all possible starting points of ORFs

• For init exon: ATG

• For internal/terminal exon: splice donor signal (GT)

GeneScan model criterion

Approaches to gene finding criterion

• Homology

• BLAST, BLAT, etc.

• Ab initio

• Genscan, Glimmer, Fgenesh, GeneMark, etc.

• Each one has been tuned towards certain organisms

• Hybrids

• Twinscan, SLAM

• Use pair-HMM, or pre-compute score for potential coding regions based on alignment

• None are perfect, never used alone in practice

Current status criterion

• More accurate on internal exons

• Determining boundaries of init and term exons is hard

• Biased towards multiple-exon genes

• Alternative splicing is hard

• Non-coding RNA is hard

• State of the Art: criterion

• predictions ~ 60% similar to real proteins

• ~80% if database similarity used

• lab verification still needed, still expensive

HMM wrap up criterion

• We’ve talked about

• Probability, mainly Bayes Theorem

• Markov models

• Hidden Markov models

• HMM parameter estimation given state path

• Decoding given HMM and parameters

• Viterbi

• F-B

• Learning

• Baum-Welch (Expectation-Maximization)

• Viterbi

HMM wrap up criterion

• We’ve also talked about

• Extension to gHMMs

• gHMM for gene finding

• We did not talk about

• Higher-order Markov models

• How to escape from local optima in learning