250 likes | 332 Views
Hidden Markov Models, Conditional Random Fields and Kalman Filter. Rao Kotagiri 2013 Origins for these slides are from many sources with my changes. The purpose of this talk is to give you basic understanding these mathematical tools.
E N D
Hidden Markov Models, Conditional Random Fields and Kalman Filter Rao Kotagiri 2013 Origins for these slides are from many sources with my changes. The purpose of this talk is to give you basic understanding these mathematical tools. These techniques are universal and can be used in many contexts and applications.
1 2 2 1 1 1 1 … 2 2 2 2 … K … … … … x1 K K K K x2 x3 xN … Hidden Markov Models
Hidden Markov Models o.05 o.95 o.95 • Alphabet = { b1, b2, …, bM } • Set of states Q = { 1, ..., K } • Transition probabilities • aij = transition prob from state i to state j • Emission probabilities • ei(b) = P( x = b | = i) Markov Property: P(t+1 = k | “whatever happened so far”) = P(t+1 = k | 1, 2, …, t, x1, x2, …, xt) = P(t+1 = k | t) Given a sequence x = x1……xN, A parse of x is a sequence of states = 1, ……, N FAIR LOADED o.05 34261455666464126643146141 FAIR LOADED FAIR
1 1 1 1 … 2 2 2 2 … … … … … K K K K … Likelihood of a Parse Simply, multiply all the orange arrows! (transition probs and emission probs) 1 2 2 K x1 x2 x3 xN
1 1 1 1 … 2 2 2 2 … … … … … K K K K … Likelihood of a parse A compact way to write a01 a12……aN-1N e1(x1)……eN(xN) Number all parameters aij and ei(b); n params Example: a0F : 1; a0L : 2; aF,F : 3, aF,L : 4 ,.., aL,L : 6, eF(1)= 7… eL(6) = 18 Then, count in x and the # of times each parameter j = 1, …, n occurs F(j, x, ) = # parameter j occurs in (x, ) (call F(.,.,.) the feature counts) Then, P(x, ) = j=1…n jF(j, x, ) = = exp[j=1…nlog(j)F(j, x, )] 1 Given a sequence x = x1……xN and a parse = 1, ……, N, To find how likely is the parse: (given our HMM) P(x, ) = P(x1, …, xN, 1, ……, N) = a01 a12……aN-1Ne1(x1)……eN(xN) 2 2 K x1 x2 x3 xN
Hidden Markov Models o.05 o.95 o.95 FAIR LOADED o.05
Example: the dishonest casino Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4 Then, what is the likelihood of = F, F, …, F? (say initial probs a0Fair = ½, aoLoaded = ½) ½ P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ (1/6)10 (0.95)9 = .00000000521158647211 ~= 0.5 10-9
Example: the dishonest casino • So, the likelihood the die is fair in this run • is just 0.521 10-9 • x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4 • OK, but what is the likelihood of = L, L, …, L? ½ P(1 | Loaded) P(Loaded, Loaded) … P(4 | Loaded) = ½ (1/10)9 (1/2)1 (0.95)9 = .00000000015756235243 ~= 0.16 10-9 Therefore, it somewhat more likely that all the rolls are done with the fair die, than that they are all done with the loaded die
Example: the dishonest casino Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6 Now, what is the likelihood = F, F, …, F? ½ (1/6)10 (0.95)9 = 0.5 10-9, same as before What is the likelihood = L, L, …, L? ½ (1/10)4 (1/2)6 (0.95)9 = .00000049238235134735 ~= 0.5 10-7 So, it is 100 times more likely the die is loaded
The three main questions on HMMs • Evaluation GIVEN a HMM M, and a sequence x, FIND Prob[ x | M ] • Decoding GIVEN a HMM M, and a sequence x, FIND the sequence of states that maximizes P[ x, | M ] • Learning GIVEN a HMM M, with unspecified transition/emission probs., and a sequence x, FIND parameters = (ei(.), aij) that maximize P[ x | ]
Notational Conventions The model M is: architecture (#states, etc), parameters = aij, ei(.) P[x | M] is the same with P[ x | ], and P[ x ], when the architecture, and the parameters, respectively, are implied P[ x, | M ], P[ x, | ] and P[ x, ] are the same when the architecture, and the parameters, are implied LEARNING: we write P[ x | ] to emphasize that we are seeking the * that maximizes P[ x | ]
Conditional Random Fields A brief description of a relatively new kind of graphical model
1 1 1 1 … 2 2 2 2 … … … … … K K K K … Let’s look at an HMM again 1 Why are HMMs convenient to use? • Because we can do dynamic programming with them! • “Best” state sequence for 1…i interacts with “best” sequence for i+1…N using K2 arrows Vl(i+1) = el(i+1) maxk Vk(i) akl = maxk( Vk(i) + [ e(l, i+1) + a(k, l) ] ) (where e(.,.) and a(.,.) are logs) • Total likelihood of all state sequences for 1…i+1 can be calculated from total likelihood for 1…i by only summing up K2 arrows 2 2 K x1 x2 x3 xN
1 1 1 1 … 2 2 2 2 … … … … … K K K K … Let’s look at an HMM again 1 • Some shortcomings of HMMs • Can’t model state duration • Solution: explicit duration models (Semi-Markov HMMs) • Unfortunately, state i cannot “look” at any letter other than xi! • Strong independence assumption: P(i | x1…xi-1, 1…i-1) = P(i | i-1) 2 2 K x1 x2 x3 xN
1 1 1 1 … 2 2 2 2 … … … … … K K K K … Let’s look at an HMM again 1 • Another way to put this, features used in objective function P(x, ): • akl, ek(b), where b • At position i: all K2akl features, and all K el(xi) features play a role • “Given that prev. state is k, current state is l, how much is current score?” • Vl(i) = Vk(i – 1) + (a(k, l) + e(l, i)) = Vk(i – 1) + g(k, l, xi) • Let’s generalize g!!! Vk(i – 1) + g(k, l, x, i) 2 2 K x1 x2 x3 xN
“Features” that depend on many pos. in x i-1 i • What do we put in g(k, l, x, i)? • The “higher” g(k, l, x, i), the more we like going from k to l at position i • Richer models using this additional power • Examples • Casino player looks at previous 100 pos’ns; if > 50 6s, he likes to go to Fair g(Loaded, Fair, x, i) += 1[xi-100, …, xi-1 has > 50 6s] wDON’T_GET_CAUGHT • Genes are close to CpG islands; for any state k, g(k, exon, x, i) += 1[xi-1000, …, xi+1000 has > 1/16 CpG] wCG_RICH_REGION x7 x8 x9 x10 x1 x2 x3 x4 x5 x6
“Features” that depend on many pos. in x x7 x8 x9 x10 x1 x2 x3 x4 x5 x6 Conditional Random Fields—Features • Define a set of features that you think are important • All features should be functions of current state, previous state, x, and position i • Example: • Old features: transition kl, emission b from state k • Plus new features: prev 100 letters have 50 6s • Number the features 1…n: f1(k, l, x, i), …, fn(k, l, x, i) • features are indicator true/false variables • Find appropriate weights w1,…, wn for when each feature is true • weights are the parameters of the model • Let’s assume for now each feature has a weight wj • Then, g(k, l, x, i) = j=1…nfj(k, l, x, i) wj
“Features” that depend on many pos. in x x7 x8 x9 x10 x1 x2 x3 x4 x5 x6 Define Vk(i): Optimal score of “parsing” x1…xi and ending in state k Then, assuming Vk(i) is optimal for every k at position i, it follows that Vl(i+1) = maxk [Vk(i) + g(k, l, x, i+1)] Why? Even though at position i+1 we “look” at arbitrary positions in x, we are only “affected” by the choice of ending state k Therefore, Viterbi algorithm again finds optimal (highest scoring) parse for x1…xN
1 2 3 4 5 6 … x1 x2 x3 x4 x5 x6 1 2 3 4 5 6 … x1 x2 x3 x4 x5 x6 “Features” that depend on many pos. in x • Score of a parse depends on all of x at each position • Can still do Viterbi because state i only “looks” at prev. state i-1 and the constant sequence x HMM CRF
How many parameters are there, in general? • Arbitrarily many parameters! • For example, let fj(k, l, x, i) depend on xi-5, xi-4, …, xi+5 • Then, we would have up to K | |11 parameters! • Advantage: powerful, expressive model • Example: “if there are more than 50 6’s in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 6’s, this is evidence we are in Fair state” • Interpretation: casino player is afraid to be caught, so switches to Fair when he sees too many 6’s • Example: “if there are any CG-rich regions in the vicinity (window of 2000 pos) then favor predicting lots of genes in this region” • Question: how do we train these parameters?
Conditional Training • Hidden Markov Model training: • Given training sequence x, “true” parse • Maximize P(x, ) • Disadvantage: • P(x, ) = P( | x)P(x) Quantity we care about so as to get a good parse Quantity we don’t care so much about because x is always given
Conditional Training P(x, ) = P( | x)P(x) P( | x) = P(x, ) / P(x) Recall F(j, x, ) = # times feature fj occurs in (x, ) = i=1…N fj(k, l, x, i) ; count fj in x, In HMMs, let’s denote by wj the weight of jth feature: wj = log(akl) or log(ek(b)) Then, HMM: P(x, ) =exp[j=1…n wj F(j, x, )] CRF: Score(x, ) =exp[j=1…n wj F(j, x, )]
Conditional Training In HMMs, P( | x) = P(x, ) / P(x) P(x, ) =exp[j=1…n wjF(j, x, )] P(x) = exp[j=1…n wjF(j, x, )]=: Z(x) Then, in CRF we can do the same to normalize Score(x, ) into a prob. PCRF( | x) = exp[j=1…n wjF(j, x, )]/ Z(x) QUESTION: Why is this a probability???
Conditional Training • We need to be given a set of sequences x and “true” parses • Calculate Z(x) by a sum-of-paths algorithm similar to HMM • We can then easily calculate P( | x) • Calculate partial derivative of P( | x) w.r.t. each parameter wj (not covered—akin to forward/backward) • Update each parameter with gradient descent! • Continue until convergence to optimal set of weights P( | x) = exp[j=1…n wjF(j, x, )]/ Z(x) is convex!!!
Conditional Random Fields—Summary • Ability to incorporate complicated non-local feature sets • Do away with some independence assumptions of HMMs • Parsing is still equally efficient • Conditional training • Train parameters that are best for parsing, not modeling • Need labeled examples—sequences x and“true” parses (Can train on unlabeled sequences, however it is unreasonable to train too many parameters this way) • Training is significantly slower—many iterations of forward/backward