
CS479/679 Pattern Recognition Spring 2013 – Dr. George Bebis


Presentation Transcript


  1. Hidden Markov Models (HMMs) Chapter 3 (Duda et al.) – Section 3.10 (Warning: this section has lots of typos) CS479/679 Pattern Recognition, Spring 2013 – Dr. George Bebis

  2. Sequential vs Temporal Patterns • Sequential patterns: • The order of data points is irrelevant. • Temporal patterns: • The order of data points is important (i.e., time series). • Data can be represented by a number of states. • States at time t are influenced directly by states in previous time steps (i.e., correlated).

  3. Hidden Markov Models (HMMs) • HMMs are appropriate for problems that have an inherent temporality. • Speech recognition • Gesture recognition • Human activity recognition

  4. First-Order Markov Models • Represented by a graph where every node corresponds to a state ωi. • The graph can be fully-connected with self-loops.

  5. First-Order Markov Models (cont’d) • Links between nodes ωi and ωj are associated with a transition probability: P(ω(t+1)=ωj / ω(t)=ωi) = αij, which is the probability of going to state ωj at time t+1 given that the state at time t was ωi (first-order model).

  6. First-Order Markov Models (cont’d) • Markov models are fully described by their transition probabilities αij. • The following constraints should be satisfied: αij ≥ 0 and Σj αij = 1 for every state ωi.
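
A minimal sketch of these constraints in code, using a small hypothetical transition matrix (the size and values are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical 2-state transition matrix; A[i, j] = P(w(t+1)=w_j / w(t)=w_i).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Constraints: every entry is non-negative and every row sums to 1,
# i.e. the outgoing transition probabilities of each state form a distribution.
assert np.all(A >= 0)
assert np.allclose(A.sum(axis=1), 1.0)
```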

  7. Example: Weather Prediction Model • Assume three weather states: • ω1: Precipitation (rain, snow, hail, etc.) • ω2: Cloudy • ω3: Sunny (Figure: three-state transition diagram and transition matrix over ω1, ω2, ω3.)

  8. Computing the probability P(ωT) of a sequence of states ωT • Given a sequence of states ωT = (ω(1), ω(2),..., ω(T)), the probability that the model generated ωT is equal to the product of the corresponding transition probabilities: P(ωT) = Πt=1..T P(ω(t) / ω(t-1)), where P(ω(1) / ω(0)) = P(ω(1)) is the prior probability of the first state.

  9. Example: Weather Prediction Model (cont’d) • What is the probability that the weather for eight consecutive days is: “sunny-sunny-sunny-rainy-rainy-sunny-cloudy-sunny”? ω8 = ω3 ω3 ω3 ω1 ω1 ω3 ω2 ω3, so P(ω8) = P(ω3) P(ω3/ω3) P(ω3/ω3) P(ω1/ω3) P(ω1/ω1) P(ω3/ω1) P(ω2/ω3) P(ω3/ω2) = 1.536 × 10^-4
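
A short sketch that reproduces this number. The transition values below are an assumption (the slide’s matrix is not legible in the transcript); they are the classic weather-model values that give 1.536 × 10^-4 when the first state is taken as sunny with prior probability 1:

```python
import numpy as np

# Assumed transition matrix (rows/cols: rain=0, cloudy=1, sunny=2);
# A[i, j] = P(w(t+1)=j / w(t)=i). Values are illustrative, chosen to match the result above.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

rain, cloudy, sunny = 0, 1, 2
seq = [sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny]

prob = 1.0                       # prior P(w(1)=sunny) assumed to be 1
for prev, curr in zip(seq[:-1], seq[1:]):
    prob *= A[prev, curr]        # multiply the transition probabilities along the sequence

print(prob)                      # ~1.536e-04 (up to floating-point rounding)
```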

  10. Limitations of Markov models • In Markov models, each state is uniquely associated with an observable event. • Once an observation is made, the state of the system is trivially retrieved. • Such systems are not of practical use for most applications.

  11. Hidden States and Observations • Assume that each state can generate a number of outputs (i.e., observations) according to some probability distribution. • Each observation can potentially be generated at any state. • State sequence is not directly observable (i.e., hidden) but can be approximated from observation sequence.

  12. First-order HMMs • Augment the Markov model such that when it is in state ω(t) it also emits some symbol v(t) (visible state) among a set of possible symbols. • We have access to the visible states v(t) only, while the ω(t) are unobservable.

  13. Example: Weather Prediction Model (cont’d) • Observations: v1: temperature, v2: humidity, etc.

  14. Observation Probabilities • When the model is in state ωj at time t, the probability of emitting a visible state vk at that time is denoted as: P(v(t)=vk / ω(t)=ωj) = bjk, where Σk bjk = 1 for every state ωj (observation probabilities). • For every sequence of hidden states, there is an associated sequence of visible states: ωT=(ω(1), ω(2),..., ω(T)) → VT=(v(1), v(2),..., v(T))

  15. Absorbing State ω0 • Given a state sequence and its corresponding observation sequence: ωT=(ω(1), ω(2),..., ω(T)) → VT=(v(1), v(2),..., v(T)), we assume that ω(T)=ω0 is some absorbing state, which uniquely emits the symbol v(T)=v0. • Once it enters the absorbing state, the system cannot escape from it.

  16. HMM Formalism • An HMM is defined by {Ω, V, P, A, B}: • Ω : {ω1… ωn } are the possible states • V : {v1…vm } are the possible observations • P = {pi} are the prior state probabilities • A = {aij} are the state transition probabilities • B = {bik} are the observation state probabilities
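
As a rough illustration, the {Ω, V, P, A, B} formalism maps naturally onto a small container of arrays; the class name, field names, and numbers below are illustrative assumptions, not part of the slides:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:
    priors: np.ndarray  # P = {p_i},  shape (n,):   prior state probabilities
    trans: np.ndarray   # A = {a_ij}, shape (n, n): state transition probabilities
    emit: np.ndarray    # B = {b_ik}, shape (n, m): observation probabilities

# A hypothetical 2-state, 2-symbol model.
model = HMM(priors=np.array([0.6, 0.4]),
            trans=np.array([[0.7, 0.3],
                            [0.4, 0.6]]),
            emit=np.array([[0.9, 0.1],
                           [0.2, 0.8]]))
```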

  17. Some Terminology • Causal: the probabilities depend only upon previous states. • Ergodic: given some starting state, every one of the states has a non-zero probability of occurring. • A “left-right” HMM, in contrast, is not ergodic: transitions move only forward, from a state to itself or to later states.

  18. Coin toss example • You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening on the other side. • On the other side of the barrier is another person who is performing a coin (or multiple-coin) toss experiment. • The other person will tell you only the result of the experiment, not how it was obtained. e.g., VT = HHTHTTHH...T = v(1), v(2), ..., v(T)

  19. Coin toss example (cont’d) • Problem: derive an HMM model to explain the observed sequence of heads and tails. • The coins represent the hidden states, since we do not know which coin was tossed each time. • The outcome of each toss represents an observation. • A “likely” sequence of coins (state sequence) may be inferred from the observations. • The state sequence might not be unique in general.

  20. Coin toss example: 1-fair-coin model • There are 2 states, each associated with either heads (state 1) or tails (state 2). • The observation sequence uniquely defines the states (i.e., the states are not hidden). (Figure: model diagram with observation probabilities.)

  21. Coin toss example: 2-fair-coins model • There are 2 states, each associated with a coin; a third coin is used to decide which of the fair coins to flip. • Neither state is uniquely associated with either heads or tails. (Figure: model diagram with observation probabilities.)

  22. Coin toss example: 2-biased-coins model • There are 2 states, each associated with a biased coin; a third coin is used to decide which of the biased coins to flip. • Neither state is uniquely associated with either heads or tails. (Figure: model diagram with observation probabilities.)

  23. Coin toss example: 3-biased-coins model • There are 3 states, each associated with a biased coin; we decide which coin to flip in some way (e.g., using other coins). • Neither state is uniquely associated with either heads or tails. (Figure: model diagram with observation probabilities.)

  24. Which model is best? • Since the states are not observable, the best we can do is to select the model θ that best explains the observations: maxθ P(VT / θ) • Longer observation sequences typically make it easier to select the best model.

  25. Classification Using HMMs • Given an observation sequence VT and a set of possible models θ, choose the model with the highest probability P(θ / VT). • Bayes rule: P(θ / VT) = P(VT / θ) P(θ) / P(VT)
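
A minimal sketch of this classification step, assuming the per-model likelihoods P(VT / θ) have already been computed (e.g., with the Forward algorithm described below); all numbers are hypothetical:

```python
import numpy as np

# Hypothetical likelihoods P(V^T / theta) for three candidate HMMs,
# and prior model probabilities P(theta).
likelihoods = np.array([2.1e-7, 8.4e-7, 3.0e-8])
model_priors = np.array([1/3, 1/3, 1/3])

# Bayes rule: P(theta / V^T) is proportional to P(V^T / theta) * P(theta);
# the shared denominator P(V^T) does not affect the argmax.
posteriors = likelihoods * model_priors
posteriors /= posteriors.sum()

best_model = int(np.argmax(posteriors))
print(best_model, posteriors)
```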

  26. Three basic HMM problems • Evaluation • Determine the probability P(VT) that a particular sequence of visible states VT was generated by a given model (i.e., Forward/Backward algorithm). • Decoding • Given a sequence of visible states VT, determine the most likely sequence of hidden states ωT that led to those observations (i.e., using Viterbi algorithm). • Learning • Given a set of visible observations, determine aij and bjk (i.e., using EM algorithm - Baum-Welch algorithm).

  27. Evaluation • The probability that a model produces VT can be computed using the theorem of total probability: P(VT) = Σr=1..rmax P(VT / ωrT) P(ωrT), where ωrT = (ω(1), ω(2),..., ω(T)) is a possible state sequence and rmax is the maximum number of state sequences. • For a model with c states ω1, ω2,..., ωc, rmax = c^T

  28. Evaluation (cont’d) • We can rewrite each term as follows: P(VT / ωrT) = Πt=1..T P(v(t) / ω(t)) and P(ωrT) = Πt=1..T P(ω(t) / ω(t-1)). • Combining the two equations we have: P(VT) = Σr=1..rmax Πt=1..T P(v(t) / ω(t)) P(ω(t) / ω(t-1))

  29. Evaluation (cont’d) • Given aij and bjk, it is straightforward to compute P(VT). • What is the computational complexity? O(T·rmax) = O(T·c^T)
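
A brute-force sketch of this evaluation, enumerating all c^T state sequences exactly as in the sum above (function and argument names are illustrative); it is only practical for tiny T and c, which is why the Forward algorithm on the next slides matters:

```python
import numpy as np
from itertools import product

def evaluate_brute_force(obs, priors, trans, emit):
    """P(V^T) by summing P(V^T / w_r^T) P(w_r^T) over all c^T state sequences."""
    c, T = trans.shape[0], len(obs)
    total = 0.0
    for path in product(range(c), repeat=T):          # every possible state sequence
        p = priors[path[0]] * emit[path[0], obs[0]]
        for t in range(1, T):
            p *= trans[path[t - 1], path[t]] * emit[path[t], obs[t]]
        total += p
    return total
```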

  30. Recursive computation of P(VT) (HMM Forward) (Figure: trellis over time steps 1..T showing hidden states ω(1),..., ω(T), a transition ωi → ωj between t and t+1, and the emitted symbols v(1),..., v(T).)

  31. Recursive computation of P(VT) (HMM Forward) (cont’d) • Define αj(t) = P(v(1),..., v(t), ω(t)=ωj). Using marginalization over the state at t-1: αj(t) = [Σi αi(t-1) aij] bjk, where vk = v(t).

  32. Recursive computation of P(VT) (HMM Forward) (cont’d) (Figure: forward recursion illustrated on the trellis, terminating in the absorbing state ω0.)

  33. Recursive computation of P(VT) (HMM Forward) (cont’d) • Algorithm: for t = 1,..., T, for j = 1 to c do, compute αj(t) from the αi(t-1) (if t = T, only j = 0 is needed, i.e., corresponding to state ω(T) = ω0); then P(VT) = α0(T). • What is the computational complexity in this case? O(T·c^2)
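
A compact sketch of the Forward algorithm under the usual formulation without an absorbing state (it sums α over all final states instead of reading off α0(T)); names are illustrative:

```python
import numpy as np

def forward(obs, priors, trans, emit):
    """Return the forward variables alpha[t, j] = P(v(1..t), w(t)=j) and P(V^T)."""
    T, c = len(obs), trans.shape[0]
    alpha = np.zeros((T, c))
    alpha[0] = priors * emit[:, obs[0]]                    # initialization at t = 1
    for t in range(1, T):
        # Recursion: alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_j,v(t)  -- O(c^2) per step.
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    return alpha, alpha[-1].sum()                          # P(V^T) without an absorbing state
```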

  34. Example (Figure: example HMM with states ω0, ω1, ω2, ω3 and its transition and observation probability matrices.)

  35. Example (cont’d) • Observation sequence: VT = v1 v3 v2 v0. (Figure: forward pass on the trellis; the αj(1) values follow from the initial state, then similarly for t = 2, 3, 4, giving P(VT).)

  36. Recursive computation of P(VT) (HMM Backward) • Define βi(t) = P(v(t+1),..., v(T) / ω(t)=ωi). (Figure: trellis showing βi(t) computed from the βj(t+1) over transitions ωi → ωj.)

  37. Recursive computation of P(VT) (HMM Backward) (cont’d) • Recursion: βi(t) = Σj aij bjk βj(t+1), where vk = v(t+1). (Figure: backward recursion on the trellis between ω(t) and ω(t+1).)

  38. Recursive computation of P(VT) (HMM Backward) (cont’d)
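
A matching sketch of the Backward pass, again in the formulation without an absorbing state (β is initialized to 1 at the last step); P(VT) computed this way should agree with the Forward result:

```python
import numpy as np

def backward(obs, priors, trans, emit):
    """Return beta[t, i] = P(v(t+1..T) / w(t)=i) and P(V^T) recovered at t = 1."""
    T, c = len(obs), trans.shape[0]
    beta = np.zeros((T, c))
    beta[-1] = 1.0                                          # initialization at t = T
    for t in range(T - 2, -1, -1):
        # Recursion: beta_i(t) = sum_j a_ij * b_j,v(t+1) * beta_j(t+1)
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    return beta, float(np.sum(priors * emit[:, obs[0]] * beta[0]))
```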

  39. Decoding • Find the most probable sequence of hidden states. • Use an optimality criterion - different optimality criteria lead to different solutions. • Algorithm 1: choose the states ω(t) which are individually most likely.

  40. Decoding – Algorithm 1

  41. Decoding (cont’d) • Algorithm 2: at each time step t, find the state that has the highest probability αi(t) (i.e., use forward algorithm with minor changes).

  42. Decoding – Algorithm 2

  43. Decoding – Algorithm 2 (cont’d)

  44. Decoding – Algorithm 2 (cont’d) • There is no guarantee that the path is a valid one. • The path might imply a transition that is not allowed by the model. (Figure: example path over t = 0,..., 4 containing a transition that is not allowed since a32 = 0.)
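
A small sketch of Algorithm 2: run the Forward recursion and, at each time step, pick the state with the largest αi(t). As the slide notes, nothing forces consecutive choices to be connected by an allowed transition, so the resulting path can be invalid. Names are illustrative:

```python
import numpy as np

def decode_per_step(obs, priors, trans, emit):
    """At each t, return argmax_i alpha_i(t); the path may use transitions with a_ij = 0."""
    T, c = len(obs), trans.shape[0]
    alpha = np.zeros((T, c))
    alpha[0] = priors * emit[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]   # forward recursion
    return alpha.argmax(axis=1)                               # best state per time step
```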

  45. Decoding (cont’d) • Algorithm 3: find the single best sequence ωT by maximizing P(ωT / VT). • This is the most widely used algorithm, known as the Viterbi algorithm.

  46. Decoding – Algorithm 3 maximize: P(ωT/VT)

  47. Decoding – Algorithm 3 (cont’d) recursion (similar to the Forward algorithm, except that it uses maximization over previous states instead of summation)
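
A sketch of the Viterbi algorithm matching this description: the Forward recursion with the sum over predecessors replaced by a max, plus back-pointers to recover the single best state sequence (names are illustrative):

```python
import numpy as np

def viterbi(obs, priors, trans, emit):
    """Most likely state sequence for the observations, via max instead of sum."""
    T, c = len(obs), trans.shape[0]
    delta = np.zeros((T, c))            # delta[t, j]: best path probability ending in j at t
    psi = np.zeros((T, c), dtype=int)   # psi[t, j]: best predecessor of j at time t
    delta[0] = priors * emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans       # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, obs[t]]
    path = [int(delta[-1].argmax())]                 # best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))           # follow back-pointers
    return path[::-1]
```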

  48. Learning • Determine the transition and emission probabilities aij and bjk from a set of training examples (i.e., observation sequences V1T, V2T,..., VnT). • There is no known way to find the ML solution analytically. • It would be easy if we knew the hidden states. • Hidden-variable problem: use the EM algorithm!

  49. Learning (cont’d) • EM algorithm: update aij and bjk iteratively to better explain the observed training sequences V: V1T, V2T,..., VnT. • Expectation step: compute p(ωT / V, θ). • Maximization step: θ^(t+1) = argmaxθ E[log p(ωT, VT / θ) / VT, θ^t]

  50. Learning (cont’d) • Updating transition/emission probabilities: âij = (expected number of transitions from ωi to ωj) / (expected number of transitions out of ωi), and b̂jk = (expected number of times in ωj emitting vk) / (expected number of times in ωj), where the expectations are taken under p(ωT / V, θ).
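
A sketch of one Baum-Welch (EM) iteration for a single observation sequence, under the same no-absorbing-state formulation used in the Forward/Backward sketches above; variable names are illustrative:

```python
import numpy as np

def baum_welch_step(obs, priors, trans, emit):
    """One EM iteration: re-estimate a_ij and b_jk from the forward/backward variables."""
    obs = np.asarray(obs)
    T, c = len(obs), trans.shape[0]
    # E-step: forward and backward passes.
    alpha = np.zeros((T, c)); beta = np.zeros((T, c))
    alpha[0] = priors * emit[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood                 # gamma[t, i] = P(w(t)=i / V^T)
    # xi[t, i, j] = P(w(t)=i, w(t+1)=j / V^T)
    xi = (alpha[:-1, :, None] * trans[None] *
          (emit[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    # M-step: expected counts, normalized.
    new_trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.zeros_like(emit)
    for k in range(emit.shape[1]):
        new_emit[:, k] = gamma[obs == k].sum(axis=0)  # time steps where v_k was emitted
    new_emit /= gamma.sum(axis=0)[:, None]
    return new_trans, new_emit
```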
