Hidden Markov Models - Training

# Hidden Markov Models - Training - PowerPoint PPT Presentation

##### Presentation Transcript

1. Hidden Markov Models - Training

2. Parameter Estimation • How do we specify an HMM? • Design the structure: states, transitions, etc. • Assign the parameter values: the transition probabilities $a_{kl}$ and the emission probabilities $e_k(b)$

3. Parameter Estimation • Suppose we have a set of sequences that we want the HMM to fit well (i.e., we want the HMM to generate these sequences with high probability). • These sequences are called training sequences. Let them be $x^1, x^2, \ldots, x^n$. • Assume they are independent, so that $\log P(x^1, \ldots, x^n \mid \Theta) = \sum_{j=1}^{n} \log P(x^j \mid \Theta)$, where $\Theta$ represents the entire current set of parameter values in the model. • The goal is to choose $\Theta$ to maximize $\sum_{j} \log P(x^j \mid \Theta)$.
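
The objective on this slide can be evaluated directly with the forward algorithm. The following is a minimal sketch, not part of the original slides: the two-state coin model, the initial-state distribution `init`, and the function names are all assumptions chosen for illustration, and no scaling is applied, so it is only suitable for short sequences.

```python
import math

def forward_likelihood(x, states, init, a, e):
    """P(x | Theta) via the forward algorithm (no end state, no scaling)."""
    # f[k] holds the forward variable f_k(i) for the current position i
    f = {k: init[k] * e[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {l: e[l][sym] * sum(f[k] * a[k][l] for k in states) for l in states}
    return sum(f.values())

def total_log_likelihood(seqs, states, init, a, e):
    """Sum of log P(x^j | Theta) over independent training sequences."""
    return sum(math.log(forward_likelihood(x, states, init, a, e)) for x in seqs)

# Hypothetical toy model: a fair coin (F) and a biased coin (B)
states = ["F", "B"]
init = {"F": 0.5, "B": 0.5}                      # assumed start distribution
a = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.2, "B": 0.8}}
e = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}
seqs = ["HHTH", "TTHT", "HHHH"]

print(total_log_likelihood(seqs, states, init, a, e))
```

Training then amounts to searching for the `a` and `e` tables that make this sum as large as possible.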

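4. Parameter Estimation When The Paths of States Are Known • Count the number of times each particular transition or emission is used in the set of training sequences; call these counts $A_{kl}$ and $E_k(b)$. • The maximum likelihood estimators for $a_{kl}$ and $e_k(b)$ are given by Formula (1): $a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$ and $e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$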

5. Parameter Estimation When The Paths of States Are Known • Problem: the estimates are vulnerable to overfitting if there is insufficient training data. • For example, if there is a state k that is never used in the set of example sequences, then the estimation equations are undefined for that state, because both the numerator and the denominator are zero. • Solution: add predetermined pseudocounts to $A_{kl}$ and $E_k(b)$.

6. Parameter Estimation When The Paths of States Are Known • $A_{kl}$ = number of transitions from k to l in the training data + $r_{kl}$ • $E_k(b)$ = number of emissions of symbol b from state k in the training data + $r_k(b)$ • The pseudocounts should reflect our prior biases about the probability values. • Small pseudocount values indicate weak prior knowledge, and larger values indicate more definite prior knowledge.
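
The counting procedure with pseudocounts can be sketched as follows. This is an illustrative sketch rather than code from the slides: the function name, the use of a single pseudocount value for all transitions (`r_trans`) and all emissions (`r_emit`), and the toy data are assumptions.

```python
def estimate_from_paths(seqs, paths, states, alphabet, r_trans=1.0, r_emit=1.0):
    """Estimate a_kl and e_k(b) from sequences whose state paths are known."""
    # Start every count at its pseudocount (r_kl and r_k(b) in the slides)
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit for b in alphabet} for k in states}
    for x, pi in zip(seqs, paths):
        for i, (sym, k) in enumerate(zip(x, pi)):
            E[k][sym] += 1                      # state k emitted sym
            if i + 1 < len(pi):
                A[k][pi[i + 1]] += 1            # transition k -> next state
    # Formula 1: normalise each row of counts into probabilities
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e

# Hypothetical labelled example: one sequence with its known state path
a, e = estimate_from_paths(["HHTH"], ["FFBB"], ["F", "B"], ["H", "T"])
print(a["F"], e["F"])
```

With both pseudocounts set to 0, a state that never appears in `paths` has an all-zero row, and the normalisation step fails with a division by zero, which is the undefined case described on the previous slide.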

7. Parameter Estimation When The Paths of States Are Unknown • When the paths are unknown for the training sequences, the previous method cannot be applied. Instead, we use iterative procedures to train the model: • Baum-Welch Training • Viterbi Training

8. Iterative process • Solution: Iterative process • Assign a set of initial values to the parameters $a_{kl}$ and $e_k(b)$. • Repeat until some stopping criterion is reached: • Find the most probable state path for each training sequence based on the current $a_{kl}$ and $e_k(b)$. • Treat the most probable paths as the actual paths, count $A_{kl}$ and $E_k(b)$ along them, and use Formula 1 to derive new values for $a_{kl}$ and $e_k(b)$.

9. Iterative process • The overall log likelihood of the model is increased by the iteration, and hence the process will converge to a local maximum. • Unfortunately, there are usually many local maxima, and the starting values strongly determine which local maximum the process ends up in. • Thus, we may need to try several different starting points.

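10. Baum-Welch • The Baum-Welch algorithm calculates $A_{kl}$ and $E_k(b)$ as the expected number of times each transition or emission is used, given the training sequences. • The probability that transition k→l is used at position i in sequence x is (Formula 2): $P(\pi_i = k, \pi_{i+1} = l \mid x, \Theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$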

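11. Baum-Welch • From this we can derive the expected number of times that transition k→l is used by summing over all positions and all training sequences (Formula 3): $A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1)$, where $f_k^j(i)$ is the forward variable and $b_l^j(i+1)$ is the backward variable, both calculated for sequence j. • Similarly, we can calculate the expected number of times that letter b is emitted from state k (Formula 4): $E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x^j_i = b\}} f_k^j(i)\, b_k^j(i)$, where the inner sum is only over those positions i where symbol b is emitted.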
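
The expected counts for a single sequence can be sketched in plain Python. This is an illustration under stated assumptions, not the slides' own code: the model has no explicit begin or end state, so an assumed initial distribution `init` replaces the begin-state transitions and the backward variable is set to 1 at the last position; the toy model and the function names are hypothetical, and no scaling is applied.

```python
def forward(x, states, init, a, e):
    """f[i][k] = P(x_1..x_i, pi_i = k), with 0-based position i."""
    f = [{k: init[k] * e[k][x[0]] for k in states}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({l: e[l][sym] * sum(prev[k] * a[k][l] for k in states)
                  for l in states})
    return f

def backward(x, states, a, e):
    """b[i][k] = P(x_{i+1}..x_L | pi_i = k); equals 1 at the last position."""
    b = [{k: 1.0 for k in states}]
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in states)
                     for k in states})
    return b

def expected_counts(x, states, alphabet, init, a, e):
    """Contribution of one sequence to A_kl (Formula 3) and E_k(b) (Formula 4)."""
    f, b = forward(x, states, init, a, e), backward(x, states, a, e)
    px = sum(f[-1].values())                     # P(x)
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {s: 0.0 for s in alphabet} for k in states}
    for i, sym in enumerate(x):
        for k in states:
            E[k][sym] += f[i][k] * b[i][k] / px
            if i + 1 < len(x):
                for l in states:
                    A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
    return A, E, px

# Hypothetical toy model: fair coin (F) and biased coin (B)
states, alphabet = ["F", "B"], ["H", "T"]
init = {"F": 0.5, "B": 0.5}
a = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.2, "B": 0.8}}
e = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}
A, E, px = expected_counts("HHTHH", states, alphabet, init, a, e)
print(A, E, px)
```

A useful sanity check is that the expected emission counts sum to the sequence length and the expected transition counts sum to the length minus one, since every position emits exactly one symbol and every adjacent pair uses exactly one transition.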

12. Baum-Welch • Having calculated these expectations, the new model parameters can be calculated using Formula (1). • Based on the new parameters, we can iteratively obtain newer values of the A's and E's as before. • The process converges in a continuous-valued space, and so will never in fact reach a maximum exactly. • A stopping criterion is therefore needed: 1) stop when the change in total log likelihood is sufficiently small, or 2) normalize the log likelihood by the number of sequences n, and possibly also by the sequence lengths, then stop when the change in the average log likelihood per residue is sufficiently small.

13. Baum-Welch • Algorithm: Baum-Welch
    • Initialization: pick arbitrary model parameters.
    • Recurrence:
        • Set all the A and E variables to their pseudocount values r (or to 0).
        • For each sequence j = 1..n:
            • Calculate $f_k(i)$ for sequence j using the forward algorithm.
            • Calculate $b_k(i)$ for sequence j using the backward algorithm.
            • Add the contribution of sequence j to A (using Formula 3) and E (using Formula 4).
        • Calculate new model parameters (using Formula 1).
        • Calculate the new log likelihood of the model.
    • Termination: stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded.
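
The algorithm above can be sketched in plain Python as follows. Several details are assumptions rather than part of the slides: the initial-state distribution `init` is held fixed instead of being re-estimated, there is no end state, no scaling is applied (so only short sequences are safe), and the toy data and starting parameters are hypothetical. The log likelihood recorded in each pass is that of the parameters used for the forward pass, which is a by-product of the computation.

```python
import math

def forward(x, S, init, a, e):
    f = [{k: init[k] * e[k][x[0]] for k in S}]
    for sym in x[1:]:
        f.append({l: e[l][sym] * sum(f[-1][k] * a[k][l] for k in S) for l in S})
    return f

def backward(x, S, a, e):
    b = [{k: 1.0 for k in S}]
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * b[0][l] for l in S) for k in S})
    return b

def baum_welch(seqs, S, alphabet, init, a, e, r=0.0, tol=1e-6, max_iter=200):
    history = []                                  # log likelihood per iteration
    for _ in range(max_iter):
        # Reset expected counts to the pseudocount r (or 0)
        A = {k: {l: r for l in S} for k in S}
        E = {k: {s: r for s in alphabet} for k in S}
        loglik = 0.0
        for x in seqs:
            f, b = forward(x, S, init, a, e), backward(x, S, a, e)
            px = sum(f[-1].values())
            loglik += math.log(px)
            for i, sym in enumerate(x):
                for k in S:
                    E[k][sym] += f[i][k] * b[i][k] / px            # Formula 4
                    if i + 1 < len(x):
                        for l in S:                                # Formula 3
                            A[k][l] += (f[i][k] * a[k][l] * e[l][x[i + 1]]
                                        * b[i + 1][l] / px)
        history.append(loglik)
        # Termination: change in log likelihood below the threshold
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        # Formula 1: new parameters from the expected counts
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in S} for k in S}
        e = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in S}
    return a, e, history

# Hypothetical training data and starting point
S, alphabet = ["F", "B"], ["H", "T"]
init = {"F": 0.5, "B": 0.5}
a0 = {"F": {"F": 0.7, "B": 0.3}, "B": {"F": 0.4, "B": 0.6}}
e0 = {"F": {"H": 0.6, "T": 0.4}, "B": {"H": 0.3, "T": 0.7}}
seqs = ["HHHHTTTT", "HHHTHTTT", "TTTTHHHH", "HTHHHHTT"]
a_hat, e_hat, history = baum_welch(seqs, S, alphabet, init, a0, e0)
print(history[0], history[-1])
```

With `r=0.0` the recorded log likelihood should not decrease from one iteration to the next; rerunning from different `a0` and `e0` is one way to probe for different local maxima.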

14. Baum-Welch • The Baum-Welch algorithm is a special case of the EM (expectation-maximization) algorithm, a very powerful approach for probabilistic parameter estimation.

15. Viterbi Training • An alternative to Baum-Welch training. • The most probable paths for the training sequences are derived using the Viterbi algorithm, and these are used in the re-estimation process. • The process is iterated again each time new parameter values are obtained. • It converges precisely, because the assignment of paths is a discrete process. • Stopping criterion: none of the paths change. At this point, the parameter estimates will not change either, since they are determined completely by the paths.
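
A minimal sketch of Viterbi training follows, written under assumptions that are not in the slides: the initial-state distribution `init` is fixed, there is no end state, a single pseudocount `r` is added to every count, and the toy data and starting parameters are hypothetical. A `max_iter` guard is included in addition to the "no path changes" criterion.

```python
import math

def viterbi(x, S, init, a, e):
    """Most probable state path for x, computed in log space."""
    lg = lambda p: math.log(p) if p > 0 else float("-inf")
    v = [{k: lg(init[k]) + lg(e[k][x[0]]) for k in S}]
    back = []                                     # back-pointers per position
    for sym in x[1:]:
        row, ptr = {}, {}
        for l in S:
            best = max(S, key=lambda k: v[-1][k] + lg(a[k][l]))
            ptr[l] = best
            row[l] = v[-1][best] + lg(a[best][l]) + lg(e[l][sym])
        v.append(row)
        back.append(ptr)
    state = max(S, key=lambda k: v[-1][k])
    path = [state]
    for ptr in reversed(back):                    # trace back from the end
        state = ptr[state]
        path.append(state)
    return path[::-1]

def viterbi_training(seqs, S, alphabet, init, a, e, r=1.0, max_iter=100):
    paths = None
    for _ in range(max_iter):
        new_paths = [viterbi(x, S, init, a, e) for x in seqs]
        if new_paths == paths:                    # stopping criterion
            break
        paths = new_paths
        # Count transitions and emissions along the paths, plus pseudocounts
        A = {k: {l: r for l in S} for k in S}
        E = {k: {s: r for s in alphabet} for k in S}
        for x, pi in zip(seqs, paths):
            for i, (sym, k) in enumerate(zip(x, pi)):
                E[k][sym] += 1
                if i + 1 < len(pi):
                    A[k][pi[i + 1]] += 1
        # Formula 1: re-estimate the parameters from the counts
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in S} for k in S}
        e = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in S}
    return a, e, paths

# Hypothetical training data and starting point
S, alphabet = ["F", "B"], ["H", "T"]
init = {"F": 0.5, "B": 0.5}
a0 = {"F": {"F": 0.7, "B": 0.3}, "B": {"F": 0.4, "B": 0.6}}
e0 = {"F": {"H": 0.6, "T": 0.4}, "B": {"H": 0.3, "T": 0.7}}
seqs = ["HHHHTTTT", "HHHTHTTT", "TTTTHHHH", "HTHHHHTT"]
a_hat, e_hat, paths = viterbi_training(seqs, S, alphabet, init, a0, e0)
print(paths)
```

Once the loop stops because no path changed, decoding the training sequences again with the returned parameters reproduces the same paths, which is the fixed point described on the slide.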

16. Viterbi Training • Difference between Baum-Welch and Viterbi training: • Total probability (summed over all paths) vs. the single best path. • Baum-Welch maximizes $\log P(x^1, \ldots, x^n \mid \Theta)$, while Viterbi training maximizes the contribution to the likelihood from the most probable paths for all the sequences, $\log P(x^1, \ldots, x^n \mid \Theta, \pi^*(x^1), \ldots, \pi^*(x^n))$. • Thus, Viterbi training generally performs less well than Baum-Welch. • But it is widely used when the primary purpose of the HMM is to decode via Viterbi alignment.