
Hidden Markov Models Part 2: Algorithms



  1. Hidden Markov Models Part 2: Algorithms CSE 4309 – Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

  2. Hidden Markov Model • An HMM consists of: • A set of states $\{1, \dots, K\}$. • An initial state probability function $\pi_k = p(z_1 = k)$. • A state transition matrix $A$, of size $K \times K$, where $A_{j,k} = p(z_{n+1} = k \mid z_n = j)$. • Observation probability functions, also called emission probabilities, defined as: $\phi_k(x) = p(x_n = x \mid z_n = k)$.

  3. The Basic HMM Problems • We will next see algorithms for the four problems that we usually want to solve when using HMMs. • Problem 1: learn an HMM using training data. This involves: • Learning the initial state probabilities $\pi_k$. • Learning the state transition matrix $A$. • Learning the observation probability functions $\phi_k$. • Problems 2, 3, 4 assume we have already learned an HMM. • Problem 2: given a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \lambda)$: the probability of $X$ given the model $\lambda = (\pi, A, \phi)$. • Problem 3: given an observation sequence $X$, find the hidden state sequence $Z = (z_1, \dots, z_N)$ maximizing $p(Z \mid X, \lambda)$. • Problem 4: given an observation sequence $X$, compute, for any $n$ and $k$, the probability $p(z_n = k \mid X, \lambda)$.

  4. The Basic HMM Problems • Problem 1: learn an HMM using training data. • Problem 2: given a sequence of observations $X$, compute $p(X \mid \lambda)$: the probability of $X$ given the model. • Problem 3: given an observation sequence $X$, find the hidden state sequence $Z$ maximizing $p(Z \mid X, \lambda)$. • Problem 4: given an observation sequence $X$, compute, for any $n$ and $k$, the probability $p(z_n = k \mid X, \lambda)$. • We will first look at algorithms for problems 2, 3, 4. • Last, we will look at the standard algorithm for learning an HMM using training data.

  5. Probability of Observations • Problem 2: given an HMM model $\lambda = (\pi, A, \phi)$, and a sequence of observations $X = (x_1, \dots, x_N)$, compute $p(X \mid \lambda)$: the probability of $X$ given the model. • Inputs: • $\lambda$: a trained HMM, with probability functions $\pi$, $A$, and $\phi$ specified. • A sequence of observations $X = (x_1, \dots, x_N)$. • Output: $p(X \mid \lambda)$.

  6. Probability of Observations • Problem 2: given a sequence of observations $X$, compute $p(X \mid \lambda)$. • Why do we care about this problem? • What would be an example application?

  7. Probability of Observations • Problem 2: given a sequence of observations $X$, compute $p(X \mid \lambda)$. • Why do we care about this problem? • We need to compute $p(X \mid \lambda)$ to classify $X$. • Suppose we have multiple models $\lambda_1, \dots, \lambda_C$. • For example, we can have one model for digit 2, one model for digit 3, one model for digit 4… • A Bayesian classifier classifies $X$ by finding the model $\lambda_c$ that maximizes $p(\lambda_c \mid X)$. • Using Bayes rule, we must maximize $p(X \mid \lambda_c)\, p(\lambda_c)$. • Therefore, we need to compute $p(X \mid \lambda_c)$ for each model $\lambda_c$.
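
Below is a minimal sketch of this classification scheme, assuming we already have a function that returns the log-likelihood $\log p(X \mid \lambda_c)$ for a trained HMM (for example, via the forward algorithm described later in this deck). The function name `hmm_log_likelihood`, the list of models, and the priors are illustrative placeholders, not part of the original slides.

```python
# Hypothetical sketch: Bayesian classification of an observation sequence X
# using one trained HMM per class. Assumes some function hmm_log_likelihood
# (e.g., a log-space forward algorithm) that returns log p(X | model).
import math

def classify(X, models, priors, hmm_log_likelihood):
    """Return the index c maximizing log p(X | model_c) + log p(model_c)."""
    best_c, best_score = None, -math.inf
    for c, (model, prior) in enumerate(zip(models, priors)):
        score = hmm_log_likelihood(model, X) + math.log(prior)
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Working in log space avoids numerical underflow when the observation sequences are long.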

  8. The Sum Rule • The sum rule, which we saw earlier in the course, states that: $p(X) = \sum_{Y} p(X, Y)$, where the sum ranges over all possible values of $Y$. • We can use the sum rule to compute $p(X \mid \lambda)$ as follows: $p(X \mid \lambda) = \sum_{Z} p(X, Z \mid \lambda)$. • According to this formula, we can compute $p(X \mid \lambda)$ by summing $p(X, Z \mid \lambda)$ over all possible state sequences $Z$.

  9. The Sum Rule • According to this formula, we can compute $p(X \mid \lambda)$ by summing $p(X, Z \mid \lambda)$ over all possible state sequences $Z$. • An HMM with parameters $\lambda = (\pi, A, \phi)$ specifies a joint distribution function $p(X, Z \mid \lambda)$ as: $p(X, Z \mid \lambda) = \pi_{z_1}\, \phi_{z_1}(x_1) \prod_{n=2}^{N} A_{z_{n-1}, z_n}\, \phi_{z_n}(x_n)$. • Therefore: $p(X \mid \lambda) = \sum_{Z} \pi_{z_1}\, \phi_{z_1}(x_1) \prod_{n=2}^{N} A_{z_{n-1}, z_n}\, \phi_{z_n}(x_n)$.
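
To make the joint-probability formula concrete, here is a brute-force sketch that computes $p(X \mid \lambda)$ by summing $p(X, Z \mid \lambda)$ over every state sequence $Z$. The representation (lists `pi`, `A`, `phi`, with discrete observations given as symbol indices) is an assumption made for illustration; the exponential cost of this sum is exactly the problem the forward algorithm solves.

```python
# Brute-force computation of p(X): sum the joint probability p(X, Z) over all
# K^N state sequences Z. Only practical for tiny N and K.
from itertools import product

def joint_probability(pi, A, phi, X, Z):
    """p(X, Z) = pi[z_1] phi_{z_1}(x_1) * prod_{n>=2} A[z_{n-1}][z_n] phi_{z_n}(x_n)."""
    p = pi[Z[0]] * phi[Z[0]][X[0]]
    for n in range(1, len(X)):
        p *= A[Z[n - 1]][Z[n]] * phi[Z[n]][X[n]]
    return p

def probability_of_observations_brute_force(pi, A, phi, X):
    """p(X) = sum over all state sequences Z of p(X, Z)."""
    K = len(pi)
    return sum(joint_probability(pi, A, phi, X, Z)
               for Z in product(range(K), repeat=len(X)))
```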

  10. The Sum Rule • According to this formula, we can compute $p(X \mid \lambda)$ by summing over all possible state sequences $Z$. • What would be the time complexity of doing this computation directly? • It would be linear in the number of all possible state sequences $Z$, which is $K^N$: exponential in $N$ (the length of $X$). • Luckily, there is a polynomial-time algorithm for computing $p(X \mid \lambda)$ that uses dynamic programming.

  11. Dynamic Programming for $p(X \mid \lambda)$ • To compute $p(X \mid \lambda)$, we use a dynamic programming algorithm that is called the forward algorithm. • We define a 2D array of problems, of size $N \times K$. • $N$: length of the observation sequence $X$. • $K$: number of states in the HMM. • Problem $(n, k)$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$.

  12. The Forward Algorithm - Initialization • Problem $(n, k)$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • Problem $(1, k)$: • Compute $\alpha[1][k] = p(x_1, z_1 = k)$.

  13. The Forward Algorithm - Initialization • Problem $(n, k)$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • Problem $(1, k)$: • Compute $\alpha[1][k] = p(x_1, z_1 = k)$. • Solution: • ???

  14. The Forward Algorithm - Initialization • Problem $(n, k)$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • Problem $(1, k)$: • Compute $\alpha[1][k] = p(x_1, z_1 = k)$. • Solution: $\alpha[1][k] = p(z_1 = k)\, p(x_1 \mid z_1 = k) = \pi_k\, \phi_k(x_1)$.

  15. The Forward Algorithm - Main Loop • Problem $(n, k)$, for $n \ge 2$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • Solution: $\alpha[n][k] = \sum_{j=1}^{K} p(x_1, \dots, x_n, z_{n-1} = j, z_n = k)$ (by the sum rule).

  16. The Forward Algorithm - Main Loop • Problem $(n, k)$, for $n \ge 2$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • Solution: $\alpha[n][k] = \sum_{j=1}^{K} p(x_1, \dots, x_{n-1}, z_{n-1} = j)\; p(z_n = k \mid z_{n-1} = j)\; p(x_n \mid z_n = k)$.

  17. The Forward Algorithm - Main Loop • Problem $(n, k)$, for $n \ge 2$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • Solution: $\alpha[n][k] = \sum_{j=1}^{K} \alpha[n-1][j]\; A_{j,k}\; \phi_k(x_n)$.

  18. The Forward Algorithm - Main Loop • Problem $(n, k)$, for $n \ge 2$: • Compute $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • Solution: $\alpha[n][k] = \phi_k(x_n) \sum_{j=1}^{K} \alpha[n-1][j]\, A_{j,k}$. • Thus, $\alpha[N][k] = p(x_1, \dots, x_N, z_N = k)$. • $p(X \mid \lambda)$ is easy to compute once all $\alpha[N][k]$ values have been computed.

  19. The Forward Algorithm • Our aim was to compute $p(X \mid \lambda)$. • Using the dynamic programming algorithm we just described, we can compute $\alpha[N][k]$ for every state $k$. • How can we use those values to compute $p(X \mid \lambda)$? • We use the sum rule: $p(X \mid \lambda) = \sum_{k=1}^{K} p(x_1, \dots, x_N, z_N = k) = \sum_{k=1}^{K} \alpha[N][k]$.
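
Here is a minimal sketch of the forward algorithm just described, using the same assumed discrete-observation representation (`pi`, `A`, `phi`) as the earlier brute-force sketch, with 0-based indexing:

```python
# Forward algorithm: alpha[0][k] = pi[k] * phi[k][x_0], and
# alpha[n][k] = phi[k][x_n] * sum_j alpha[n-1][j] * A[j][k].
def forward(pi, A, phi, X):
    """Return (alpha, p_of_X), where alpha[n][k] is the joint probability of
    the observations up to position n and the state at position n being k."""
    N, K = len(X), len(pi)
    alpha = [[0.0] * K for _ in range(N)]
    for k in range(K):                       # initialization
        alpha[0][k] = pi[k] * phi[k][X[0]]
    for n in range(1, N):                    # main loop
        for k in range(K):
            alpha[n][k] = phi[k][X[n]] * sum(alpha[n - 1][j] * A[j][k]
                                             for j in range(K))
    return alpha, sum(alpha[N - 1])          # p(X) = sum_k alpha[N-1][k]
```

The cost is $O(NK^2)$, compared with $O(NK^N)$ for the brute-force sum; a practical implementation would also scale the $\alpha$ values or work in log space to avoid numerical underflow.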

  20. Problem 3: Find the Best $Z$ • Problem 3: • Inputs: a trained HMM $\lambda$, and an observation sequence $X = (x_1, \dots, x_N)$ • Output: the state sequence $Z$ maximizing $p(Z \mid X, \lambda)$. • Why do we care? • Sometimes, we want to use HMMs for classification. • In those cases, we do not really care about maximizing $p(Z \mid X, \lambda)$; we just want to find, among our trained models, the one that maximizes $p(X \mid \lambda)$. • Sometimes, we want to use HMMs to figure out the most likely values of the hidden states given the observations. • In the tree ring example, our goal was to figure out the average temperature for each year. • Those average temperatures were the hidden states. • Solving problem 3 provides the most likely sequence of average temperatures.

  21. Problem 3: Find the Best $Z$ • Problem 3: • Input: an observation sequence $X = (x_1, \dots, x_N)$ • Output: the state sequence $Z$ maximizing $p(Z \mid X)$. • We want to maximize $p(Z \mid X)$. • Using the definition of conditional probabilities: $p(Z \mid X) = \frac{p(Z, X)}{p(X)}$. • Since $X$ is known, $p(X)$ will be the same over all possible state sequences $Z$. • Therefore, to find the $Z$ that maximizes $p(Z \mid X)$, it suffices to find the $Z$ that maximizes $p(Z, X)$.

  22. Problem 3: Find the Best $Z$ • Problem 3: • Input: an observation sequence $X = (x_1, \dots, x_N)$ • Output: the state sequence $Z$ maximizing $p(Z \mid X)$, which is the same as maximizing $p(Z, X)$. • Solution: (again) dynamic programming. • Problem $(n, k)$: Compute $H[n][k]$ and $V[n][k]$, where: $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k)$, and $H[n][k]$ is the sequence $(z_1, \dots, z_{n-1})$ attaining that maximum. • Note that: • We optimize over all possible values $z_1, \dots, z_{n-1}$. • However, $z_n$ is constrained to be equal to $k$. • Values $x_1, \dots, x_n$ are known, and thus they are fixed.

  23. Problem 3: Find the Best $Z$ • Problem 3: • Input: an observation sequence $X = (x_1, \dots, x_N)$ • Output: the state sequence $Z$ maximizing $p(Z \mid X)$, which is the same as maximizing $p(Z, X)$. • Solution: (again) dynamic programming. • Problem $(n, k)$: Compute $H[n][k]$ and $V[n][k]$, where: • $V[n][k]$ is the joint probability of: • the first $n$ observations $x_1, \dots, x_n$ • the sequence of states we stored in $H[n][k]$, followed by $z_n = k$

  24. Problem 3: Find the Best $Z$ • Problem 3: • Input: an observation sequence $X = (x_1, \dots, x_N)$ • Output: the state sequence $Z$ maximizing $p(Z \mid X)$, which is the same as maximizing $p(Z, X)$. • Solution: (again) dynamic programming. • Problem $(n, k)$: Compute $H[n][k]$ and $V[n][k]$, where: • $V[n][k]$ is the joint probability of: • the first $n$ observations $x_1, \dots, x_n$ • the sequence of states we stored in $H[n][k]$, followed by $z_n = k$ • $H[n][k]$ stores a sequence of state values. • $V[n][k]$ stores a number (a probability).

  25. The Viterbi Algorithm • Problem $(n, k)$: Compute $H[n][k]$ and $V[n][k]$, where: $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k)$, and $H[n][k]$ is the maximizing sequence $(z_1, \dots, z_{n-1})$. • We use a dynamic programming algorithm to compute $H[n][k]$ and $V[n][k]$ for all $n, k$ such that $1 \le n \le N$ and $1 \le k \le K$. • This dynamic programming algorithm is called the Viterbi algorithm.

  26. The Viterbi Algorithm - Initialization • Problem $(n, k)$: Compute $H[n][k]$ and $V[n][k]$, where: $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k)$, and $H[n][k]$ is the maximizing sequence $(z_1, \dots, z_{n-1})$. • What is $H[1][k]$? • It is the empty sequence. • We must maximize over the previous states $z_1, \dots, z_{n-1}$, but $n = 1$, so there are no previous states. • What is $V[1][k]$? • $V[1][k] = p(x_1, z_1 = k) = \pi_k\, \phi_k(x_1)$.

  27. The Viterbi Algorithm – Main Loop • Problem $(n, k)$: Compute $H[n][k]$ and $V[n][k]$, where: $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k)$, and $H[n][k]$ is the maximizing sequence $(z_1, \dots, z_{n-1})$. • How do we find $H[n][k]$? • $H[n][k] = (Z', j)$ for some state $j$ and some sequence $Z'$ of length $n - 2$. • The last element of $H[n][k]$ is some state $j$. • Therefore, $H[n][k]$ is formed by appending some state $j$ to some sequence $Z'$.

  28. The Viterbi Algorithm – Main Loop • Problem $(n, k)$: Compute $H[n][k]$ and $V[n][k]$, where: $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k)$, and $H[n][k]$ is the maximizing sequence $(z_1, \dots, z_{n-1})$. • $H[n][k] = (Z', j)$ for some state $j$ and some sequence $Z'$. • We can prove that $Z' = H[n-1][j]$, for that $j$. • The proof is very similar to the proof that DTW finds the optimal alignment. • If $H[n-1][j]$ is better than $Z'$, then $(H[n-1][j], j)$ is better than $(Z', j) = H[n][k]$, which is a contradiction.

  29. The Viterbi Algorithm – Main Loop • Problem $(n, k)$: Compute $H[n][k]$ and $V[n][k]$, where: $V[n][k] = \max_{z_1, \dots, z_{n-1}} p(x_1, \dots, x_n, z_1, \dots, z_{n-1}, z_n = k)$, and $H[n][k]$ is the maximizing sequence $(z_1, \dots, z_{n-1})$. • $H[n][k] = (H[n-1][j], j)$ for some state $j$. • Let's define $j^*$ as: $j^* = \operatorname{argmax}_{j} V[n-1][j]\, A_{j,k}$. • Then: • $H[n][k] = (H[n-1][j^*], j^*)$. • $V[n][k] = V[n-1][j^*]\, A_{j^*,k}\, \phi_k(x_n)$.

  30. The Viterbi Algorithm – Output • Our goal is to find the $Z$ maximizing $p(Z \mid X)$, which is the same as finding the $Z$ maximizing $p(Z, X)$. • The Viterbi algorithm computes $H[N][k]$ and $V[N][k]$, where: $V[N][k] = \max_{z_1, \dots, z_{N-1}} p(x_1, \dots, x_N, z_1, \dots, z_{N-1}, z_N = k)$, and $H[N][k]$ is the maximizing sequence $(z_1, \dots, z_{N-1})$. • Let's define $k^*$ as: $k^* = \operatorname{argmax}_{k} V[N][k]$. • Then, the $Z$ maximizing $p(Z, X)$ is $(H[N][k^*], k^*)$.
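
The following is a sketch of the Viterbi algorithm as described above, with the same assumed representation as the earlier sketches. To stay close to the slides, `H[n][k]` stores the best sequence of previous states and `V[n][k]` stores the corresponding joint probability; a practical implementation would store backpointers instead of whole sequences and would work in log space.

```python
# Viterbi algorithm: V[0][k] = pi[k] * phi[k][x_0], H[0][k] = empty sequence;
# V[n][k] = V[n-1][j*] * A[j*][k] * phi[k][x_n], H[n][k] = H[n-1][j*] + [j*],
# where j* = argmax_j V[n-1][j] * A[j][k].
def viterbi(pi, A, phi, X):
    """Return the state sequence Z maximizing p(Z, X), and that probability."""
    N, K = len(X), len(pi)
    V = [[0.0] * K for _ in range(N)]
    H = [[[] for _ in range(K)] for _ in range(N)]
    for k in range(K):                                   # initialization
        V[0][k] = pi[k] * phi[k][X[0]]
    for n in range(1, N):                                # main loop
        for k in range(K):
            j_star = max(range(K), key=lambda j: V[n - 1][j] * A[j][k])
            V[n][k] = V[n - 1][j_star] * A[j_star][k] * phi[k][X[n]]
            H[n][k] = H[n - 1][j_star] + [j_star]
    k_star = max(range(K), key=lambda k: V[N - 1][k])    # output
    return H[N - 1][k_star] + [k_star], V[N - 1][k_star]
```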

  31. State Probabilities at Specific Times • Problem 4: • Inputs: a trained HMM $\lambda$, and an observation sequence $X = (x_1, \dots, x_N)$ • Output: an array $\gamma$, of size $N \times K$, where $\gamma[n][k] = p(z_n = k \mid X)$. • In words: given an observation sequence $X$, we want to compute, for any moment in time $n$ and any state $k$, the probability that the hidden state at that moment is $k$.

  32. State Probabilities at Specific Times • Problem 4: • Inputs: a trained HMM $\lambda$, and an observation sequence $X = (x_1, \dots, x_N)$ • Output: an array $\gamma$, of size $N \times K$, where $\gamma[n][k] = p(z_n = k \mid X)$. • We have seen, in our solution for problem 2, that the forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • We can also define another array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k)$. • In words, $\beta[n][k]$ is the probability of all the observations after moment $n$, given that the state at moment $n$ is $k$.

  33. State Probabilities at Specific Times • Problem 4: • Inputs: a trained HMM $\lambda$, and an observation sequence $X = (x_1, \dots, x_N)$ • Output: an array $\gamma$, of size $N \times K$, where $\gamma[n][k] = p(z_n = k \mid X)$. • We have seen, in our solution for problem 2, that the forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • We can also define another array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k)$. • Then, $\alpha[n][k]\, \beta[n][k] = p(x_1, \dots, x_n, z_n = k)\; p(x_{n+1}, \dots, x_N \mid z_n = k) = p(X, z_n = k)$. • Therefore: $\gamma[n][k] = p(z_n = k \mid X) = \frac{\alpha[n][k]\, \beta[n][k]}{p(X)}$.

  34. State Probabilities at Specific Times • The forward algorithm computes an array $\alpha$, of size $N \times K$, where $\alpha[n][k] = p(x_1, \dots, x_n, z_n = k)$. • We define another array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k)$. • Then, $\gamma[n][k] = \frac{\alpha[n][k]\, \beta[n][k]}{p(X)}$. • In the above equation: • $\alpha[n][k]$ and $p(X)$ are computed by the forward algorithm. • We need to compute $\beta[n][k]$. • We compute $\beta$ in a way very similar to the forward algorithm, but going backwards this time. • The resulting combination of the forward and backward algorithms is called (not surprisingly) the forward-backward algorithm.

  35. The Backward Algorithm • We want to compute the values of an array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k)$. • Again, we will use dynamic programming. • Problem $(n, k)$ is to compute value $\beta[n][k]$. • However, this time we start from the end of the observations. • First, compute $\beta[N][k]$. • In this case, the observation sequence $x_{N+1}, \dots, x_N$ is empty, because $n = N$. • What is the probability of an empty sequence?

  36. Backward Algorithm - Initialization • We want to compute the values of an array $\beta$, of size $N \times K$, where $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k)$. • Again, we will use dynamic programming. • Problem $(n, k)$ is to compute value $\beta[n][k]$. • However, this time we start from the end of the observations. • First, compute $\beta[N][k]$. • In this case, the observation sequence $x_{N+1}, \dots, x_N$ is empty, because $n = N$. • What is the probability of an empty sequence? • The probability of an empty sequence is 1, so $\beta[N][k] = 1$ for every state $k$.

  37. Backward Algorithm - Initialization • Next: compute $\beta[N-1][k] = p(x_N \mid z_{N-1} = k)$.

  38. Backward Algorithm - Initialization • Next: compute $\beta[N-1][k] = p(x_N \mid z_{N-1} = k) = \sum_{j=1}^{K} p(x_N, z_N = j \mid z_{N-1} = k) = \sum_{j=1}^{K} A_{k,j}\, \phi_j(x_N)$.

  39. Backward Algorithm – Main Loop • Next: compute $\beta[n][k]$, for $n < N - 1$: $\beta[n][k] = p(x_{n+1}, \dots, x_N \mid z_n = k) = \sum_{j=1}^{K} p(x_{n+1}, \dots, x_N, z_{n+1} = j \mid z_n = k)$.

  40. Backward Algorithm – Main Loop • Next: compute $\beta[n][k]$, for $n < N - 1$: $\beta[n][k] = \sum_{j=1}^{K} p(x_{n+1}, \dots, x_N, z_{n+1} = j \mid z_n = k) = \sum_{j=1}^{K} A_{k,j}\, \phi_j(x_{n+1})\, \beta[n+1][j]$. • We will take a closer look at the last step…

  41. Backward Algorithm – Main Loop • $\beta[n+1][j] = p(x_{n+2}, \dots, x_N \mid z_{n+1} = j)$, by definition of $\beta$.

  42. Backward Algorithm – Main Loop • By application of the chain rule: $p(x_{n+1}, \dots, x_N, z_{n+1} = j \mid z_n = k) = p(z_{n+1} = j \mid z_n = k)\; p(x_{n+1} \mid z_{n+1} = j, z_n = k)\; p(x_{n+2}, \dots, x_N \mid x_{n+1}, z_{n+1} = j, z_n = k)$, which can also be written as $A_{k,j}\; p(x_{n+1} \mid z_{n+1} = j, z_n = k)\; p(x_{n+2}, \dots, x_N \mid x_{n+1}, z_{n+1} = j, z_n = k)$.

  43. Backward Algorithm – Main Loop • $p(x_{n+1} \mid z_{n+1} = j, z_n = k) = p(x_{n+1} \mid z_{n+1} = j) = \phi_j(x_{n+1})$ (by definition of $\phi_j$, since $x_{n+1}$ depends only on $z_{n+1}$).

  44. Backward Algorithm – Main Loop • $p(x_{n+2}, \dots, x_N \mid x_{n+1}, z_{n+1} = j, z_n = k) = p(x_{n+2}, \dots, x_N \mid z_{n+1} = j) = \beta[n+1][j]$ ($x_{n+2}, \dots, x_N$ are conditionally independent of $x_{n+1}$ and $z_n$ given $z_{n+1}$).

  45. Backward Algorithm – Main Loop • In the previous slides, we showed that $p(x_{n+1}, \dots, x_N, z_{n+1} = j \mid z_n = k) = A_{k,j}\, \phi_j(x_{n+1})\, p(x_{n+2}, \dots, x_N \mid z_{n+1} = j)$. • Note that $p(x_{n+2}, \dots, x_N \mid z_{n+1} = j)$ is $\beta[n+1][j]$. • Consequently, we obtain the recurrence: $\beta[n][k] = \sum_{j=1}^{K} A_{k,j}\, \phi_j(x_{n+1})\, \beta[n+1][j]$.

  46. Backward Algorithm – Main Loop • Thus, we can easily compute all $\beta[n][k]$ values using this recurrence. • The key thing is to proceed in decreasing order of $n$, since the $\beta[n][k]$ values depend on the $\beta[n+1][j]$ values.
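
A minimal sketch of the backward algorithm, filling the $\beta$ values in decreasing order of $n$ using the recurrence above (same assumed representation and 0-based indexing as the earlier sketches):

```python
# Backward algorithm: beta[N-1][k] = 1, and
# beta[n][k] = sum_j A[k][j] * phi[j][x_{n+1}] * beta[n+1][j].
def backward(A, phi, X):
    """Return beta, where beta[n][k] is the probability of the observations
    after position n, given that the state at position n is k."""
    N, K = len(X), len(A)
    beta = [[1.0] * K for _ in range(N)]     # initialization: last row is all 1s
    for n in range(N - 2, -1, -1):           # decreasing order of n
        for k in range(K):
            beta[n][k] = sum(A[k][j] * phi[j][X[n + 1]] * beta[n + 1][j]
                             for j in range(K))
    return beta
```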

  47. The Forward-Backward Algorithm • Inputs: • A trained HMM $\lambda$. • An observation sequence $X = (x_1, \dots, x_N)$ • Output: • An array $\gamma$, of size $N \times K$, where $\gamma[n][k] = p(z_n = k \mid X)$. • Algorithm: • Use the forward algorithm to compute $\alpha[n][k]$ and $p(X)$. • Use the backward algorithm to compute $\beta[n][k]$. • For $n = 1$ to $N$: • For $k = 1$ to $K$: • $\gamma[n][k] = \frac{\alpha[n][k]\, \beta[n][k]}{p(X)}$
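
A sketch of this combination, assuming the `forward()` and `backward()` sketches given earlier in this transcript are in scope:

```python
# Forward-backward: gamma[n][k] = alpha[n][k] * beta[n][k] / p(X).
def forward_backward(pi, A, phi, X):
    """Return gamma, where gamma[n][k] is the probability that the state at
    position n is k, given the whole observation sequence X."""
    alpha, p_of_X = forward(pi, A, phi, X)
    beta = backward(A, phi, X)
    N, K = len(X), len(pi)
    return [[alpha[n][k] * beta[n][k] / p_of_X for k in range(K)]
            for n in range(N)]
```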

  48. Problem 1: Training an HMM • Goal: Estimate parameters $\lambda = (\pi, A, \phi)$. • The training data can consist of multiple observation sequences $X_1, X_2, \dots, X_M$. • We denote the length of training sequence $X_m$ as $N_m$. • We denote the elements of each observation sequence $X_m$ as: $X_m = (x_{m,1}, x_{m,2}, \dots, x_{m,N_m})$. • Before we start training, we need to decide on: • The number of states. • The transitions that we will allow.

  49. Problem 1: Training an HMM • While we are given $X_1, \dots, X_M$, we are not given the corresponding hidden state sequences $Z_1, \dots, Z_M$. • We denote the elements of each hidden state sequence $Z_m$ as: $Z_m = (z_{m,1}, z_{m,2}, \dots, z_{m,N_m})$. • The training algorithm is called the Baum-Welch algorithm. • It is an Expectation-Maximization algorithm, similar to the algorithm we saw for learning Gaussian mixtures.

  50. Expectation-Maximization • When we wanted to learn a mixture of Gaussians, we had the following problem: • If we knew the probability of each object belonging to each Gaussian, we could estimate the parameters of each Gaussian. • If we knew the parameters of each Gaussian, we could estimate the probability of each object belonging to each Gaussian. • However, we know neither of these pieces of information. • The EM algorithm resolved this problem using: • An initialization of Gaussian parameters to some random or non-random values. • A main loop where: • The current values of the Gaussian parameters are used to estimate new weights of membership of every training object to every Gaussian. • The current estimated membership weights are used to estimate new parameters (mean and covariance matrix) for each Gaussian.
