CS 6243 Machine Learning. Markov Chain and Hidden Markov Models. Outline. Background on probability Hidden Markov models Algorithms Applications. Probability Basics. Definition (informal)
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Markov Chain and Hidden Markov Models
S
B
A
P(A  B) = P(A B) / P(B)
<=> P(A B) = P(A  B) P(B)
<=> P(A B) = P(BA) P(A)
These two constraints are logically equivalent
Marginalization
Exhaustive conditionalization
P(even)
= P(even  d < 6) * P (d<6) +
P(even  d = 6) * P (d=6)
= 2/5 * 0.5 + 1 * 0.5
= 0.7
Conditional probability
(likelihood)
P
(
A

B
)
P
(
B
)
Prior of B
=>
=
P
(
B

A
)
P
(
A
)
Posterior probability
Prior of A (Normalizing constant)
This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics
Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition
= P(A  Bj) * P(Bj) / jP(A  Bj)*P(Bj)
Posterior probability
Prior of Bj
Likelihood
Normalizing constant
(theorem of total probabilities)
Bj: different models / hypotheses
In the observation of A, should you choose a model that maximizes P(Bj  A) or P(A  Bj)? Depending on how much you know about Bj !
= P(666  loaded) * P(loaded) / P(666)
= 0.53 * 0.01 / (0.53 * 0.01 + (1/6)3 * 0.99)
= 0.21
= P(66666  loaded) * P(loaded) / P(66666)
= 0.55 * 0.01 / (0.55 * 0.01 + (1/6)5 * 0.99)
= 0.71
= i=1..nP(xiM1)
= 0.258 = 1.53e5
= i=1..nP(xiM2)
= 0.25 0.33 = 8.64e6
P(XM1) / P(XM2) = P(xiM1)/P(xiM2) = (0.25/0.2)5 (0.25/0.3)3
LLR = log(P(xiM1)/P(xiM2))
= nASA + nCSC + nGSG + nTST
= 5 * log(1.25) + 3 * log(0.833) = 0.57
Si = log (P(i  M1) / P(i  M2)), i = A, C, G, T
Log likelihood ratio (LLR)
Log (P(M1X) / P(M2X))
= LLR + log(P(M1)) – log(P(M2))
= nASA + nCSC + nGSG + nTST + log(P(M1)) – log(P(M2))
We have assumed independence of nucleotides in different positions  unrealistic in biology
P(A) * P(CA) * P(GC) … P(AT)
= aBA aAC aCG …aTA
a+st
ast
βCG = log2(a+CG/a CG) = log2(0.274/0.078) = 1.812
βBA = log2(a+ A/a  A) = log2(0.591/1.047) = 0.825
= βBA + 2βAC +4βCG +βGG +βGC +βGA +βGT +βTC
= 0.825+ 2*.419 + 4*1.812+.313 +.461  .624  .730 + .573
= 7.25
Figure 3.2 (Durbin book) The histogram of lengthnormalized scores for all the sequences. CpG islands are shown with dark grey and nonCpG with light grey.
Definition: A hidden Markov model (HMM) is a fivetuple
aij = transition prob from state i to state j
ai1 + … + aiK = 1, for all states i = 1…K
a01 + … + a0K = 1
ek(b) = P( xi = b  i = k)
ek(b1) + … + ek(bM) = 1, for all states k = 1…K
1
2
K
…
A casino has two dice:
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
P(6) = 1/2
Casino player switches back and forth between fair and loaded die once in a while
aFF = 0.95
aFL = 0.05
aLL = 0.95
Fair
LOADED
eF(1) = 1/6
eF(2) = 1/6
eF(3) = 1/6
eF(4) = 1/6
eF(5) = 1/6
eF(6) = 1/6
eL(1) = 1/10
eL(2) = 1/10
eL(3) = 1/10
eL(4) = 1/10
eL(5) = 1/10
eL(6) = 1/2
aLF = 0.05
Transition probability
Emission probability
GIVEN a HMM M, and a sequence x,
FIND the sequence of states that maximizes P (x,  M )
GIVEN a HMM M, and a sequence x,
FIND P ( x  M ) [or P(x)for simplicity]
GIVEN a HMM M with unspecified transition/emission probs.,
and a sequence x,
FIND parameters = (ei(.), aij) that maximize P (x  )
Sometimes written as P (x, ) for simplicity.
GIVEN
A HMM with parameters. And a sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
What portion of the sequence was generated with the fair die, and what portion with the loaded die?
This is the DECODING question in HMMs
1 criterion
1
1
1
…
2
2
2
2
…
…
…
…
…
K
K
K
K
…
A parse of a sequenceGiven a sequence x = x1……xN, and a HMM with k states,
A parse of x is a sequence of states = 1, ……, N
1
2
2
K
x1
x2
x3
xK
1 criterion
1
1
1
…
2
2
2
2
…
…
…
…
…
K
K
K
K
…
Probability of a parse1
Given a sequence x = x1……xN
and a parse = 1, ……, N
To find how likely is the parse:
(given our HMM)
P(x, ) = P(x1, …, xN, 1, ……, N)
= P(xN, N  N1) P(xN1, N1  N2)……P(x2, 2  1) P(x1, 1)
= P(xN  N)P(N  N1) ……P(x2  2)P(2  1)P(x1  1)P(1)
=a01 a12……aN1Ne1(x1)……eN(xN)
2
2
K
x1
x2
x3
xK
0.05
= Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair
X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?
0.95
0.95
Fair
LOADED
P(1F) = 1/6
P(2F) = 1/6
P(3F) = 1/6
P(4F) = 1/6
P(5F) = 1/6
P(6F) = 1/6
P(1L) = 1/10
P(2L) = 1/10
P(3L) = 1/10
P(4L) = 1/10
P(5L) = 1/10
P(6L) = 1/2
0.05
= Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair
X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?
P = ½ * P(1  F) P(Fi+1  Fi) …P(5  F) P(Li+1  Fi) P(6L) P(Li+1  Li) …P(4  F)
= ½ x 0.957 0.052 x (1/6)6 x (1/10)2 x (1/2)2 = 5 x 1011
0.05
0.05
L
L
L
L
L
L
L
L
L
L
B
F
F
F
F
F
F
F
F
F
F
P(sL) = 1/10, for s in [1..5]P(6L) = 1/2
P(sF) = 1/6, for s in [1..6]
Edge weight
(symbol independent)
Node weight
(depend on symbols in seq)
w criterionLF
wFF
wFF = log (aFF)
rF(xi+1) = log (eF(xi+1))
The Viterbi algorithmx1 x2 x3 … xi xi+1 … xn1 xn
VF (i+1) = rF (xi+1) + max VF (i) + wFF
VL (i) + wLF
…
…
L
L
L
L
L
L
L
B
E
…
…
F
F
F
F
F
F
F
Weight for the best parse of (x1…xi+1), with xi+1 emitted by state F
VL (i+1) = rL (xi+1) + max VF (i) + wFL
VL (i) + wLL
Weight for the best parse of (x1…xi+1), with xi+1 emitted by state L
wLL=0.05
wFF=0.05
WFL=3.00
aLL=0.95
aFF=0.95
aFL=0.05
Fair
LOADED
Fair
LOADED
WLF=3.00
aLF=0.05
rL(6) = 0.7
rL(s) = 2.3
(s = 1…5)
P(sF) = 1/6
s = 1…6
P(6L) = ½
P(sL) = 1/10
(s = 1...5)
rF(s) = 1.8
s = 1...6
VF (i+1) = rF (xi+1) + max {VL (i) + WLF VF (i) + WFF}
VL (i+1) = rL (xi+1) + max {VL (i) + WLL VF (i) + WFL}
PF (i+1) = eF (xi+1) max {PL (i) aLF PF (i) aFF}
PL (i+1) = eL (xi+1) max {PL (i) aLL PF (i) aFL}
1 criterion
2
l
K
…
k
In general: more states / symbolsxi xi+1
1
1
2
2
…
……
l
k
……
…
K
K
x1 x2 x3 … … xi+1……… …………………………xN
Similar to “aligning” a set of states to a sequence
Time: O(K2N)
Space: O(KN)
State 1
2
l
Vl(i+1)
K
Input: x = x1……xN
Initialization:
V0(0) = 0 (zero in subscript is the start state.)
Vl(0) = inf, for all l > 0 (0 in parenthesis is the imaginary first position)
Iteration:
for each i
for each l
Vl(i) = rl(xi)+maxk (wkl+ Vk(i1)) // rj(xi) = log(ej(xi)), wkj = log(akj)
Ptrl(i) = argmaxk (wkl+ Vk(i1))
end
end
Termination:
Prob(x, *) = exp{maxk Vk(N)}
Traceback:
N* = argmaxk Vk(N)
i1* = Ptri (i)
Input: x = x1……xN
Initialization:
P0(0) = 1 (zero in subscript is the start state.)
Pl(0) = 0, for all l > 0 (0 in parenthesis is the imaginary first position)
Iteration:
for each i
for each l
Pl(i) = el(xi)maxk (akl Pk(i1))
Ptrl(i) = argmaxk (akl Pk(i1))
end
end
Termination:
Prob(x, *) = maxk Pk(N)
Traceback:
N* = argmaxk Pk(N)
i1* = Ptri (i)
k = 1
xi
k
We can compute fk(i) for all k, i, using dynamic programming!
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
fk(i) = ek(xi) jfj(i1) ajk
Termination:
Prob(x) = kfk(N)
VITERBI (in prob space)
Initialization:
P0(0) = 1
Pk(0) = 0, for all k > 0
Iteration:
Pk(i) = ek(xi) maxjPj(i1) ajk
Termination:
Prob(x, *) = maxkPk(N)
FORWARD
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
fk(i) = ek(xi) j fj(i1) ajk
Termination:
Prob(x) = k fk(N)
k = 1
Need to know how to compute this
Have just shown how to compute this
x criterioni
k
xi
k
The prob of criterionx, with the constraint that xi was generated by state k
Sequence
state
x
Forward probabilities
Backward probabilities
/ P(X)
Space: O(KN)
Time: O(K2N)
P(i=k  x)
^i = argmaxkP(i = k  x)
If P(fair) > 0.5, the roll is more likely to be generated by a fair die than a loaded die
In this example, Viterbi predicts that all rolls were from the fair die.
Postprocess: merge within 500; discard < 500
We just sequenced the porcupine genome
We know CpG islands play the same role in this genome
However, we have not many known CpG islands for porcupines
We suspect the frequency and characteristics of CpG islands are quite different in porcupines
How do we adjust the parameters in our model?
 LEARNING
Multiple ways
Estimated # of criterionkl transition
Amherst
a
0.5
0.03
b
A
c
0.005
0.31
z
Hidden state Observation
http://www.cedar.buffalo.edu/~govind/cs661
http://www.cedar.buffalo.edu/~govind/cs661
m
a
m
r
t
h
e
s
Buffalo
b
u
a
o
f
f
l
A
Amherst
0.03
0.5
0.4
0.6
http://www.cedar.buffalo.edu/~govind/cs661
a
b
v
f
o
m
r
t
h
e
s
http://www.cedar.buffalo.edu/~govind/cs661
A
http://www.cedar.buffalo.edu/~govind/cs661
s1
s2
s3
A
.8 .2 0
0 .8 .2
0 0 1
.9 .1 0
.1 .8 .1
.9 .1 0
.8 .2 0
0 .8 .2
0 0 1
B
.9 .1 0
0 .2 .8
.6 .4 0
http://www.cedar.buffalo.edu/~govind/cs661
http://www.cedar.buffalo.edu/~govind/cs661
Hidden state sequence
Hidden state sequence
Transition probabilities
Transition probabilities
Observation probabilities
Observation probabilities
s1 s1 s2s3
s1 s1 s2s3
.8 .2 .2 .9 0 .8 .9 = 0
.8 .2 .2 .9 0 .2 .6 = 0
s1 s2 s2s3
s1 s2 s2s3
.2 .8 .2 .9 .8 .2 .6 = 0.0027648
.2 .8 .2 .9 .1 .8 .9 = 0.0020736
s1 s2 s3s3
s1 s2 s3s3
.2 .2 1 .9 .8 .4 .6 = 0.006912
.2 .2 1 .9 .1 .1 .9 = 0.000324
Total = 0.0023976
Total = 0.0096768
Consider likelihood of generating given observation for each possible sequence of hidden states:
http://www.cedar.buffalo.edu/~govind/cs661
intron1
intron2
DNA
exon2
exon3
Intergenic
exon1
5’
3’
transcription
PremRNA
splicing
Mature mRNA
translation
Exon: coding
Intron: noncoding
Intergenic: noncoding
protein
(where genetic information is stored)
(for making mRNA)
Coding strand: 5’ACGTAGACGTATAGAGCCTAG3’
Template strand: 3’TGCATCTGCATATCTCGGATC5’
mRNA: 5’ACGUAGACGUAUAGAGCCUAG3’
Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.
GATCGGTCGAGCGTAAGCTAGCTAG criterion
ATCGATGATCGATCGGCCATATATC
ACTAGAGCTAGAATCGATAATCGAT
CGATATAGCTATAGCTATAGCCTAT
Finding genesHuman
Fugu
worm
E.coli
As the coding/noncoding length ratio decreases, exon prediction becomes more complex
exon1 criterion
exon3
Intergenic
exon2
intron1
intron2
5’
3’
Promoter
ATG
STOP
PolyA
Splicing acceptor: AG
Splicing donor: GT
5’UTR
3’UTR
HMM for eukaryote gene findingintron
Intergenic
exon
4 emission probability
64 triplets emission probabilities
4 emission probability
Actually more accurate at the diaminoacid level, i.e. 2 codons. Many methods use 5thorder Markov model for all regions
CDS: coding sequence
5’UTR
START
CDS
Init exon
3’UTR
CDS
STOP
Term exon
Splice donor
Intron
body
Splice
acceptor
Intron
P(length = L) = pL1(1p)
P
Duration: the number of times that a state is used consecutively without visiting other states
p
s
1p
L
P
s
s
s
s
1p
Min, then geometric
P
P
P
P
s
s
s
s
1p
1p
1p
1p
Negative binominal
Exon
Intron
Intergenic
P(A  I) = 0.3
P(C  I) = 0.2
P(G  I) = 0.2
P(T  I) = 0.3
Generalized HMM. Often used in gene finders
P
L
Empirical intron length distribution
x1 x2 x3 ………………………………………..xN
1
2
Pk(i)
K