
Fast Temporal State-Splitting for HMM Model Selection and Learning


Presentation Transcript


  1. Fast Temporal State-Splitting for HMM Model Selection and Learning Sajid Siddiqi, Geoffrey Gordon, Andrew Moore

  2. (Figure: an observation sequence x plotted over time t)

  3. (Figure: x vs. t) How many kinds of observations (x)?

  4. (Figure: x vs. t) How many kinds of observations (x)? 3

  5. (Figure: x vs. t) How many kinds of observations (x)? 3 How many kinds of transitions (xt+1 | xt)?

  6. (Figure: x vs. t) How many kinds of observations (x)? 3 How many kinds of transitions (xt+1 | xt)? 4

  7. (Figure: x vs. t) We say that this sequence ‘exhibits four states under the first-order Markov assumption’. Our goal is to discover the number of such states (and their parameter settings) in sequential data, and to do so efficiently. How many kinds of observations (x)? 3 How many kinds of transitions (xt → xt+1)? 4

  8. Definitions An HMM is a 3-tuple λ = {A, B, π}, where A : N×N transition matrix; B : N×M observation probability matrix; π : N×1 prior probability vector. |λ| : number of states in the HMM, i.e. N. T : number of observations in the sequence. qt : the state the HMM is in at time t

  9. HMMs as DBNs (Figure: dynamic Bayesian network with hidden state nodes q0 … q4 and observation nodes O0 … O4)

  10. Notation: Transition Model (Figure: the q0 … q4 chain with its transition probability table) Each of these probability tables is identical

  11. Notation: Observation Model (Figure: each observation Ot depends on the corresponding state qt)

  12. HMMs as DBNs (Figure: the dynamic Bayesian network from slide 9, repeated)

  13. HMMs as DBNs / HMMs as FSAs (Figure: the same HMM drawn both as a dynamic Bayesian network and as a finite-state automaton with states S1 … S4)

  14. Operations on HMMs Problem 1: Evaluation Given an HMM and an observation sequence, what is the likelihood of this sequence? Problem 2: Most Probable Path Given an HMM and an observation sequence, what is the most probable path through state space? Problem 3: Learning HMM parameters Given an observation sequence and a fixed number of states, what is an HMM that is likely to have produced this string of observations? Problem 4: Learning the number of states Given an observation sequence, what is an HMM (of any size) that is likely to have produced this string of observations?

  15. Operations on HMMs

  16. Path Inference • Viterbi Algorithm for calculating argmaxQ P(O, Q | λ)

  17. Path Inference • Viterbi Algorithm for calculating argmaxQ P(O, Q | λ) Running time: O(TN²) Yields a globally optimal path through hidden state space, associating each timestep with exactly one HMM state.
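The slides quote the O(TN²) running time but not the recurrence itself. Here is a minimal sketch of the standard Viterbi recursion in Python/NumPy; the array names (log_pi, log_A, log_B_obs) are my own, not from the talk:

import numpy as np

def viterbi(log_pi, log_A, log_B_obs):
    """Most probable state path for an HMM in log space.

    log_pi    : (N,)   log prior over states
    log_A     : (N, N) log transition matrix, log_A[i, j] = log P(q_t+1 = j | q_t = i)
    log_B_obs : (T, N) log observation likelihoods, log_B_obs[t, j] = log P(O_t | q_t = j)
    Returns the log-probability of the best path and the path itself.
    """
    T, N = log_B_obs.shape
    delta = np.empty((T, N))             # best log-prob of any path ending in state j at time t
    psi = np.zeros((T, N), dtype=int)    # backpointers
    delta[0] = log_pi + log_B_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # (N, N): from-state i to to-state j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B_obs[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):       # backtrack along the stored pointers
        path[t] = psi[t + 1, path[t + 1]]
    return delta[-1, path[-1]], path

Each timestep touches every (from-state, to-state) pair once, which is where the O(TN²) cost comes from.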

  18. Parameter Learning I • Viterbi Training (≈ K-means for sequences)

  19. Parameter Learning I • Viterbi Training (≈ K-means for sequences) Q*s+1 = argmaxQ P(O, Q | λs) (Viterbi algorithm); λs+1 = argmaxλ P(O, Q*s+1 | λ) Running time: O(TN²) per iteration Models the posterior belief as a δ-function per timestep in the sequence. Performs well on data with easily distinguishable states.
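A compact sketch of this hard-assignment EM loop for a discrete-observation HMM, reusing the viterbi() function sketched above; the discrete emission model and the small smoothing constants are illustrative assumptions, not from the slides:

import numpy as np

def viterbi_training(O, pi, A, B, n_iters=20):
    """Viterbi Training (hard EM), analogous to k-means.

    O : (T,) integer observation symbols; pi, A, B as on the definitions slide.
    Each iteration: (1) decode the Viterbi path under the current parameters,
                    (2) re-estimate pi, A, B from those hard state assignments.
    """
    N, M = B.shape
    for _ in range(n_iters):
        log_B_obs = np.log(B[:, O].T + 1e-12)                  # (T, N)
        _, path = viterbi(np.log(pi + 1e-12), np.log(A + 1e-12), log_B_obs)
        # M-step on the delta-function posterior given by the path
        A_counts = np.full((N, N), 1e-3)                       # light smoothing to avoid zeros
        B_counts = np.full((N, M), 1e-3)
        for t in range(len(O) - 1):
            A_counts[path[t], path[t + 1]] += 1
        for t, o in enumerate(O):
            B_counts[path[t], o] += 1
        A = A_counts / A_counts.sum(axis=1, keepdims=True)
        B = B_counts / B_counts.sum(axis=1, keepdims=True)
        pi = np.bincount([path[0]], minlength=N) + 1e-3
        pi = pi / pi.sum()
    return pi, A, B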

  20. Parameter Learning II • Baum-Welch (≈ GMM for sequences) • Iterate the following two steps until convergence: • Calculate the expected complete log-likelihood given λs • Obtain updated model parameters λs+1 by maximizing this log-likelihood

  21. Parameter Learning II • Baum-Welch (≈ GMM for sequences) • Iterate the following two steps until convergence: • Calculate the expected complete log-likelihood given λs • Obtain updated model parameters λs+1 by maximizing this log-likelihood Obj(λ, λs) = EQ[log P(O, Q | λ) | O, λs]; λs+1 = argmaxλ Obj(λ, λs) Running time: O(TN²) per iteration, but with a larger constant Models the full posterior belief over hidden states per timestep. Effectively models sequences with overlapping states at the cost of extra computation.
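For comparison with Viterbi Training, here is a sketch of one Baum-Welch iteration for a discrete-observation HMM, using a standard scaled forward-backward pass; the emission model and variable names are assumptions, not the talk's notation:

import numpy as np

def baum_welch_step(O, pi, A, B):
    """One Baum-Welch (EM) iteration.

    E-step: forward-backward gives the posterior over states per timestep (gamma)
    and over transitions per timestep (xi).
    M-step: re-estimate pi, A, B from these expected counts.
    O is an integer NumPy array of observation symbols.
    """
    T = len(O)
    N, M = B.shape
    B_obs = B[:, O].T                                   # (T, N): B_obs[t, j] = P(O_t | q_t = j)
    alpha = np.empty((T, N)); beta = np.empty((T, N)); c = np.empty(T)
    alpha[0] = pi * B_obs[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                               # scaled forward pass
        alpha[t] = (alpha[t - 1] @ A) * B_obs[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                      # scaled backward pass
        beta[t] = (A @ (B_obs[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                                # gamma[t, i] = P(q_t = i | O, lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B_obs[1:] * beta[1:])[:, None, :]) / c[1:, None, None]
    pi_new = gamma[0] / gamma[0].sum()
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for o in range(M):
        B_new[:, o] = gamma[O == o].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    log_lik = np.log(c).sum()                           # log P(O | lambda)
    return pi_new, A_new, B_new, log_lik

The forward and backward passes are each O(TN²) per iteration, which is the "larger constant" relative to Viterbi Training mentioned on the slide.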

  22. HMM Model Selection • Distinction between model search and actual selection step • We can search the spaces of HMMs with different N using parameter learning, and perform selection using a criterion like BIC.

  23. HMM Model Selection • Distinction between model search and actual selection step • We can search the spaces of HMMs with different N using parameter learning, and perform selection using a criterion like BIC. Running time: O(Tn²) to compute the likelihood for BIC

  24. HMM Model Selection I • for n = 1 … Nmax • Initialize n-state HMM randomly • Learn model parameters • Calculate BIC score • If best so far, store model • if larger model not chosen, stop

  25. HMM Model Selection I • for n = 1 … Nmax • Initialize n-state HMM randomly • Learn model parameters • Calculate BIC score • If best so far, store model • if larger model not chosen, stop Running time: O(Tn²) per iteration Drawback: Local minima in parameter optimization
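A sketch of how this BIC-based search might look for a discrete-observation HMM. The free-parameter count shown is one common choice and the learn_hmm helper is an assumed stand-in (random initialization plus EM); neither is necessarily the paper's exact formulation:

import numpy as np

def bic_score(log_likelihood, n_states, n_symbols, T):
    """BIC for a discrete-observation HMM (higher is better with this sign convention).
    Assumed parameter count: transitions N(N-1), observations N(M-1), prior N-1."""
    n_params = n_states * (n_states - 1) + n_states * (n_symbols - 1) + (n_states - 1)
    return log_likelihood - 0.5 * n_params * np.log(T)

def select_n_by_bic(O, n_max, n_symbols, learn_hmm):
    """Model Selection I as listed on the slide.  learn_hmm(O, n) is an assumed helper
    that randomly initializes an n-state HMM, runs EM, and returns (params, log_likelihood)."""
    best = None
    for n in range(1, n_max + 1):
        params, ll = learn_hmm(O, n)                    # O(T n^2) per EM iteration
        score = bic_score(ll, n, n_symbols, len(O))
        if best is None or score > best[0]:
            best = (score, n, params)                   # best so far: store model
        else:
            break                                       # larger model not chosen: stop
    return best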

  26. HMM Model Selection II • for n = 1 … Nmax • for i = 1 … NumTries • Initialize n-state HMM randomly • Learn model parameters • Calculate BIC score • If best so far, store model • if larger model not chosen, stop

  27. HMM Model Selection II • for n = 1 … Nmax • for i = 1 … NumTries • Initialize n-state HMM randomly • Learn model parameters • Calculate BIC score • If best so far, store model • if larger model not chosen, stop Running time: O(NumTries × Tn²) per iteration Evaluates NumTries candidate models for each n to overcome local minima. However: expensive, and still prone to local minima, especially for large N

  28. Idea: Binary state splits* to generate candidate models • To split state s into s1 and s2, • Create λ′ such that λ′\s = λ\s • Initialize λ′s1 and λ′s2 based on λs and on parameter constraints Notation: λs : HMM parameters related to state s; λ\s : HMM parameters not related to state s * first proposed in Ostendorf and Singer, 1997

  29. Idea: Binary state splits* to generate candidate models • To split state s into s1 and s2, • Create λ′ such that λ′\s = λ\s • Initialize λ′s1 and λ′s2 based on λs and on parameter constraints • This is an effective heuristic for avoiding local minima Notation: λs : HMM parameters related to state s; λ\s : HMM parameters not related to state s * first proposed in Ostendorf and Singer, 1997
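One plausible way to initialize such a split for a discrete-observation HMM is sketched below. The details (equal sharing of incoming probability mass, a small random perturbation of the children's emissions) are illustrative assumptions; the actual constraints follow Ostendorf and Singer (1997):

import numpy as np

def split_state(pi, A, B, s, eps=0.01, rng=None):
    """Candidate model with state s split into two children: s1 keeps index s,
    s2 becomes a new last state.  Parameters not involving s are copied unchanged
    (lambda'_{\\s} = lambda_{\\s}); the children start as near-copies of s so that
    EM can pull them apart.  Sketch only, for a discrete-observation HMM."""
    rng = np.random.default_rng() if rng is None else rng
    N, M = B.shape
    pi2 = np.append(pi, 0.0)
    pi2[[s, -1]] = pi[s] / 2                      # split prior mass between the children
    A2 = np.zeros((N + 1, N + 1))
    A2[:N, :N] = A
    A2[:N, s] = A[:, s] / 2                       # incoming probability shared equally
    A2[:N, -1] = A[:, s] / 2
    A2[-1, :N] = A[s, :]                          # s2 starts with s's outgoing distribution
    A2[[s, -1]] /= A2[[s, -1]].sum(axis=1, keepdims=True)
    B2 = np.vstack([B, B[s]])
    B2[[s, -1]] += eps * rng.random((2, M))       # tiny perturbation to break symmetry
    B2 /= B2.sum(axis=1, keepdims=True)
    return pi2, A2, B2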

  30. Overall algorithm

  31. Overall algorithm Start with a small number of states EM (B.W. or V.T.) Binary state splits* followed by EM BIC on training set Stop when bigger HMM is not selected

  32. Overall algorithm Start with a small number of states EM (B.W. or V.T.) Binary state splits followed by EM BIC on training set Stop when bigger HMM is not selected What is ‘efficient’? Want this loop to be at most O(TN²)

  33. HMM Model Selection III • Initialize n0-state HMM randomly • for n = n0 … Nmax • Learn model parameters • for i = 1 … n • Split state i, learn model parameters • Calculate BIC score • If best so far, store model • if larger model not chosen, stop

  34. HMM Model Selection III • Initialize n0-state HMM randomly • for n = n0 … Nmax • Learn model parameters • for i = 1 … n • Split state i, learn model parameters • Calculate BIC score • If best so far, store model • if larger model not chosen, stop O(Tn²)

  35. HMM Model Selection III • Initialize n0-state HMM randomly • for n = n0 … Nmax • Learn model parameters • for i = 1 … n • Split state i, learn model parameters • Calculate BIC score • If best so far, store model • if larger model not chosen, stop O(Tn²) Running time: O(Tn³) per iteration of outer loop More effective at avoiding local minima than previous approaches. However, scales poorly because of the n³ term.

  36. Fast Candidate Generation

  37. Fast Candidate Generation • Only consider timesteps owned by s in the Viterbi path • Only allow parameters of the split states to vary • Merge parameters and store as candidate

  38. OptimizeSplitParams I: Split-State Viterbi Training (SSVT) Iterate until convergence:

  39. Constrained Viterbi Splitting state s into s1, s2. We calculate the new path assignments using a fast ‘constrained’ Viterbi algorithm over only those timesteps owned by s in Q*, constraining them to belong to s1 or s2.

  40. The Viterbi path is denoted by Q*. Suppose we split state N into s1, s2. (Figure: the current Viterbi path over the sequence)

  41. (Figure: the timesteps owned by state N are marked with ‘?’)

  42.–43. (Figure: those timesteps are reassigned to s1 or s2 while the rest of the path is unchanged)

  44. OptimizeSplitParams I: Split-State Viterbi Training (SSVT) Iterate until convergence: Running time: O(|Ts| n) per iteration When splitting state s, assumes the rest of the HMM parameters (λ\s) and the rest of the Viterbi path (Q*\Ts) are both fixed
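A greatly simplified sketch of the reassignment step: a greedy pass rather than the paper's constrained Viterbi dynamic program, and with names that are my own. It is meant only to show why the work is proportional to |Ts| rather than to the whole sequence:

import numpy as np

def reassign_owned_timesteps(path, T_s, s1, s2, log_A, log_B_obs):
    """Reassign only the timesteps T_s owned by the split state, each constrained to
    child s1 or s2; every other timestep and every parameter outside the split stays fixed.

    path       : base Viterbi path (NumPy int array), not modified in place
    T_s        : timesteps t with path[t] == s (the split state)
    log_A, log_B_obs : log-parameters of the candidate (split) model
    """
    owned = set(int(t) for t in T_s)
    new_path = path.copy()
    for t in sorted(owned):
        scores = []
        for child in (s1, s2):
            score = log_B_obs[t, child]
            if t > 0:                                   # link from the (possibly reassigned) predecessor
                score += log_A[new_path[t - 1], child]
            if t + 1 < len(path) and (t + 1) not in owned:
                score += log_A[child, path[t + 1]]      # link into the fixed successor
            scores.append(score)
        new_path[t] = (s1, s2)[int(np.argmax(scores))]
    return new_path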

  45. Fast approximate BIC Compute once for base model: O(Tn²) Update optimistically* for candidate model: O(|Ts|) * first proposed in Stolcke and Omohundro, 1994
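A hedged sketch of what such an optimistic update could look like; the delta term and the parameter count are assumptions, not the paper's exact bookkeeping:

import numpy as np

def approx_bic_for_split(base_log_lik, delta_log_lik_Ts, n, n_symbols, T):
    """Optimistic BIC for a split candidate, in the spirit of Stolcke & Omohundro (1994):
    reuse the base model's log-likelihood (computed once, O(T n^2)) and add only the
    likelihood change contributed by the |Ts| re-optimized timesteps, then apply the
    penalty for the (n+1)-state model.  delta_log_lik_Ts is an assumed per-candidate
    quantity produced during the constrained optimization."""
    n_new = n + 1
    n_params = n_new * (n_new - 1) + n_new * (n_symbols - 1) + (n_new - 1)
    return (base_log_lik + delta_log_lik_Ts) - 0.5 * n_params * np.log(T)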

  46. HMM Model Selection IV • Initialize n0-state HMM randomly • for n = n0 … Nmax • Learn model parameters • for i = 1 … n • Split state i, optimize by constrained EM • Calculate approximate BIC score • If best so far, store model • if larger model not chosen, stop

  47. HMM Model Selection IV • Initialize n0-state HMM randomly • for n = n0 … Nmax • Learn model parameters • for i = 1 … n • Split state i, optimize by constrained EM • Calculate approximate BIC score • If best so far, store model • if larger model not chosen, stop O(Tn) Running time: O(Tn²) per iteration of outer loop!
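Putting the pieces together, the overall loop might be organized as in the structural sketch below. All the callables are assumed stand-ins for operations described on earlier slides, not the paper's actual interfaces:

def fast_state_splitting(O, n0, n_max, init_hmm, learn_em, split_and_score, exact_bic):
    """Structural sketch of Model Selection IV.  init_hmm(n) builds a random n-state HMM,
    learn_em(model, O) runs full EM (Baum-Welch or Viterbi Training), split_and_score(model, O, s)
    returns (candidate, approximate BIC) for state s split and refined by constrained EM,
    and exact_bic(model, O) scores a model exactly.  Higher BIC is better here."""
    best = learn_em(init_hmm(n0), O)
    best_score = exact_bic(best, O)
    for n in range(n0, n_max):
        # One candidate per current state; each split touches only its owned timesteps,
        # so the whole inner loop is O(T n) and an outer-loop iteration stays O(T n^2).
        candidates = [split_and_score(best, O, s) for s in range(n)]
        cand, cand_score = max(candidates, key=lambda c: c[1])
        if cand_score <= best_score:            # larger model not chosen: stop
            break
        best = learn_em(cand, O)
        best_score = exact_bic(best, O)
    return best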
