Rao Kotagiri 2013 Origins for these slides are from many sources with my changes.

Hidden Markov Models, Conditional Random Fields and Kalman Filter Rao Kotagiri 2013 Origins for these slides are from many sources with my changes. The purpose of this talk is to give you basic understanding these mathematical tools. These techniques are universal and can be used in many contexts and applications.

1 2 2 1 1 1 1 … 2 2 2 2 … K … … … … x1 K K K K x2 x3 xN … Hidden Markov Models

Hidden Markov Models o.05 o.95 o.95 • Alphabet  = { b1, b2, …, bM } • Set of states Q = { 1, ..., K } • Transition probabilities • aij = transition prob from state i to state j • Emission probabilities • ei(b) = P( x = b |  = i) Markov Property: P(t+1 = k | “whatever happened so far”) = P(t+1 = k | 1, 2, …, t, x1, x2, …, xt) = P(t+1 = k | t) Given a sequence x = x1……xN, A parse of x is a sequence of states  = 1, ……, N FAIR LOADED o.05 34261455666464126643146141 FAIR LOADED FAIR

1 1 1 1 … 2 2 2 2 … … … … … K K K K … Likelihood of a Parse Simply, multiply all the orange arrows! (transition probs and emission probs) 1 2 2 K x1 x2 x3 xN

1 1 1 1 … 2 2 2 2 … … … … … K K K K … Likelihood of a parse A compact way to write a01 a12……aN-1N e1(x1)……eN(xN) Number all parameters aij and ei(b); n params Example: a0F : 1; a0L : 2; aF,F : 3, aF,L : 4 ,.., aL,L : 6, eF(1)= 7… eL(6) = 18 Then, count in x and  the # of times each parameter j = 1, …, n occurs F(j, x, ) = # parameter j occurs in (x, ) (call F(.,.,.) the feature counts) Then, P(x, ) = j=1…n jF(j, x, ) = = exp[j=1…nlog(j)F(j, x, )] 1 Given a sequence x = x1……xN and a parse  = 1, ……, N, To find how likely is the parse: (given our HMM) P(x, ) = P(x1, …, xN, 1, ……, N) = a01 a12……aN-1Ne1(x1)……eN(xN) 2 2 K x1 x2 x3 xN

Hidden Markov Models o.05 o.95 o.95 FAIR LOADED o.05

Example: the dishonest casino Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4 Then, what is the likelihood of  = F, F, …, F? (say initial probs a0Fair = ½, aoLoaded = ½) ½  P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½  (1/6)10  (0.95)9 = .00000000521158647211 ~= 0.5  10-9

Example: the dishonest casino • So, the likelihood the die is fair in this run • is just 0.521  10-9 • x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4 • OK, but what is the likelihood of  = L, L, …, L? ½  P(1 | Loaded) P(Loaded, Loaded) … P(4 | Loaded) = ½  (1/10)9  (1/2)1 (0.95)9 = .00000000015756235243 ~= 0.16  10-9 Therefore, it somewhat more likely that all the rolls are done with the fair die, than that they are all done with the loaded die

Example: the dishonest casino Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6 Now, what is the likelihood  = F, F, …, F? ½  (1/6)10  (0.95)9 = 0.5  10-9, same as before What is the likelihood  = L, L, …, L? ½  (1/10)4  (1/2)6 (0.95)9 = .00000049238235134735 ~= 0.5  10-7 So, it is 100 times more likely the die is loaded

The three main questions on HMMs • Evaluation GIVEN a HMM M, and a sequence x, FIND Prob[ x | M ] • Decoding GIVEN a HMM M, and a sequence x, FIND the sequence  of states that maximizes P[ x,  | M ] • Learning GIVEN a HMM M, with unspecified transition/emission probs., and a sequence x, FIND parameters  = (ei(.), aij) that maximize P[ x |  ]

Notational Conventions The model M is: architecture (#states, etc), parameters  = aij, ei(.) P[x | M] is the same with P[ x |  ], and P[ x ], when the architecture, and the parameters, respectively, are implied P[ x,  | M ], P[ x,  |  ] and P[ x,  ] are the same when the architecture, and the parameters, are implied LEARNING: we write P[ x |  ] to emphasize that we are seeking the * that maximizes P[ x |  ]

Conditional Random Fields A brief description of a relatively new kind of graphical model

1 1 1 1 … 2 2 2 2 … … … … … K K K K … Let’s look at an HMM again 1 Why are HMMs convenient to use? • Because we can do dynamic programming with them! • “Best” state sequence for 1…i interacts with “best” sequence for i+1…N using K2 arrows Vl(i+1) = el(i+1) maxk Vk(i) akl = maxk( Vk(i) + [ e(l, i+1) + a(k, l) ] ) (where e(.,.) and a(.,.) are logs) • Total likelihood of all state sequences for 1…i+1 can be calculated from total likelihood for 1…i by only summing up K2 arrows 2 2 K x1 x2 x3 xN

1 1 1 1 … 2 2 2 2 … … … … … K K K K … Let’s look at an HMM again 1 • Some shortcomings of HMMs • Can’t model state duration • Solution: explicit duration models (Semi-Markov HMMs) • Unfortunately, state i cannot “look” at any letter other than xi! • Strong independence assumption: P(i | x1…xi-1, 1…i-1) = P(i | i-1) 2 2 K x1 x2 x3 xN

1 1 1 1 … 2 2 2 2 … … … … … K K K K … Let’s look at an HMM again 1 • Another way to put this, features used in objective function P(x, ): • akl, ek(b), where b   • At position i: all K2akl features, and all K el(xi) features play a role • “Given that prev. state is k, current state is l, how much is current score?” • Vl(i) = Vk(i – 1) + (a(k, l) + e(l, i)) = Vk(i – 1) + g(k, l, xi) • Let’s generalize g!!! Vk(i – 1) + g(k, l, x, i) 2 2 K x1 x2 x3 xN

“Features” that depend on many pos. in x i-1 i • What do we put in g(k, l, x, i)? • The “higher” g(k, l, x, i), the more we like going from k to l at position i • Richer models using this additional power • Examples • Casino player looks at previous 100 pos’ns; if > 50 6s, he likes to go to Fair g(Loaded, Fair, x, i) += 1[xi-100, …, xi-1 has > 50 6s]  wDON’T_GET_CAUGHT • Genes are close to CpG islands; for any state k, g(k, exon, x, i) += 1[xi-1000, …, xi+1000 has > 1/16 CpG]  wCG_RICH_REGION x7 x8 x9 x10 x1 x2 x3 x4 x5 x6

“Features” that depend on many pos. in x x7 x8 x9 x10 x1 x2 x3 x4 x5 x6 Conditional Random Fields—Features • Define a set of features that you think are important • All features should be functions of current state, previous state, x, and position i • Example: • Old features: transition kl, emission b from state k • Plus new features: prev 100 letters have 50 6s • Number the features 1…n: f1(k, l, x, i), …, fn(k, l, x, i) • features are indicator true/false variables • Find appropriate weights w1,…, wn for when each feature is true • weights are the parameters of the model • Let’s assume for now each feature has a weight wj • Then, g(k, l, x, i) = j=1…nfj(k, l, x, i)  wj

“Features” that depend on many pos. in x x7 x8 x9 x10 x1 x2 x3 x4 x5 x6 Define Vk(i): Optimal score of “parsing” x1…xi and ending in state k Then, assuming Vk(i) is optimal for every k at position i, it follows that Vl(i+1) = maxk [Vk(i) + g(k, l, x, i+1)] Why? Even though at position i+1 we “look” at arbitrary positions in x, we are only “affected” by the choice of ending state k Therefore, Viterbi algorithm again finds optimal (highest scoring) parse for x1…xN

1 2 3 4 5 6 … x1 x2 x3 x4 x5 x6 1 2 3 4 5 6 … x1 x2 x3 x4 x5 x6 “Features” that depend on many pos. in x • Score of a parse depends on all of x at each position • Can still do Viterbi because state i only “looks” at prev. state i-1 and the constant sequence x HMM CRF

How many parameters are there, in general? • Arbitrarily many parameters! • For example, let fj(k, l, x, i) depend on xi-5, xi-4, …, xi+5 • Then, we would have up to K  |  |11 parameters! • Advantage: powerful, expressive model • Example: “if there are more than 50 6’s in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 6’s, this is evidence we are in Fair state” • Interpretation: casino player is afraid to be caught, so switches to Fair when he sees too many 6’s • Example: “if there are any CG-rich regions in the vicinity (window of 2000 pos) then favor predicting lots of genes in this region” • Question: how do we train these parameters?

Conditional Training • Hidden Markov Model training: • Given training sequence x, “true” parse  • Maximize P(x, ) • Disadvantage: • P(x, ) = P( | x)P(x) Quantity we care about so as to get a good parse Quantity we don’t care so much about because x is always given

Conditional Training P(x, ) = P( | x)P(x) P( | x) = P(x, ) / P(x) Recall F(j, x, ) = # times feature fj occurs in (x, ) = i=1…N fj(k, l, x, i) ; count fj in x,  In HMMs, let’s denote by wj the weight of jth feature: wj = log(akl) or log(ek(b)) Then, HMM: P(x, ) =exp[j=1…n wj  F(j, x, )] CRF: Score(x, ) =exp[j=1…n wj  F(j, x, )]

Conditional Training In HMMs, P( | x) = P(x, ) / P(x) P(x, ) =exp[j=1…n wjF(j, x, )] P(x) = exp[j=1…n wjF(j, x, )]=: Z(x) Then, in CRF we can do the same to normalize Score(x, ) into a prob. PCRF( | x) = exp[j=1…n wjF(j, x, )]/ Z(x) QUESTION: Why is this a probability???

Conditional Training • We need to be given a set of sequences x and “true” parses  • Calculate Z(x) by a sum-of-paths algorithm similar to HMM • We can then easily calculate P( | x) • Calculate partial derivative of P( | x) w.r.t. each parameter wj (not covered—akin to forward/backward) • Update each parameter with gradient descent! • Continue until convergence to optimal set of weights P( | x) = exp[j=1…n wjF(j, x, )]/ Z(x) is convex!!!

Conditional Random Fields—Summary • Ability to incorporate complicated non-local feature sets • Do away with some independence assumptions of HMMs • Parsing is still equally efficient • Conditional training • Train parameters that are best for parsing, not modeling • Need labeled examples—sequences x and“true” parses  (Can train on unlabeled sequences, however it is unreasonable to train too many parameters this way) • Training is significantly slower—many iterations of forward/backward

Rao Kotagiri 2013 Origins for these slides are from many sources with my changes.

Rao Kotagiri 2013 Origins for these slides are from many sources with my changes.

Presentation Transcript

the slides from my 2005 SVPCA talk are available

Where are these from

Tips for using these slides

Download these slides from: people.csail.mit.edu / seneff

Introduction to Data Mining (these slides are based on a variety of sources)

About these Slides

My slides:

My Slides

Using These Slides

These Are My Shapes

What are the origins of all these items?

These Are My Software Installations

Photos are from different sources

These slides are excerpts from the Seminole Audubon Society

With help from these Sources: Microsoft PowerPoint2003 rickriorden

These slides:

These slides:

Using these slides

Outline for these slides

How to Use these Slides These slides are from the new PowerPoint template.

Parts of these slides are based on

These slides: