Temporal Probabilistic Models


  1. Temporal Probabilistic Models

  2. Motivation • Observing a stream of data • Monitoring (of people, computer systems, etc.) • Surveillance, tracking • Finance & economics • Science • Questions: • Modeling & forecasting • Unobserved variables

  3. Agenda • Markov models • Hidden Markov Models • Four HMM inference tasks • Most often used: • Filtering • Most likely explanation • Applications to NLP

  4. Time Series Modeling • Time occurs in steps $t = 0, 1, 2, \dots$ • A time step can be seconds, days, years, etc. • State variable $X_t$, $t = 0, 1, 2, \dots$ • For partially observed problems, we see observations $O_t$, $t = 1, 2, \dots$, and do not see the $X$'s • The $X$'s are hidden variables (aka latent variables)

  5. Modeling Time • Arrow of time • Causality? Bayesian networks to the rescue • [Figure: Causes → Effects]

  6. Probabilistic Modeling • For now, assume the fully observable case • What parents? • [Figure: nodes $X_0, X_1, X_2, X_3$]

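  7. Ex: Probabilistic state machine • Random variable $X_t$: state at time t • $\mathrm{Val}(X_t) = \{1, \dots, n\}$ • Transition model: $P(X_{t+1} = i \mid X_t = j) = p_{ij}$ • [Figure: state-transition graph over states x1,…,x8 with edge labels p11, p12, p15, p23, p26, p27, p33, p34, p44, p46, p53, p57, p65, p75, p78, p88]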

  8. Markov Assumption • Assume $X_{t+k}$ is conditionally independent of all $X_i$ for $i < t$, given $X_t, \dots, X_{t+k-1}$: $P(X_{t+k} \mid X_0, \dots, X_{t+k-1}) = P(X_{t+k} \mid X_t, \dots, X_{t+k-1})$ • k-th order Markov chain • [Figure: chains over $X_0, X_1, X_2, X_3$ of order 0, order 1, order 2, and order 3]

  9. 1st order Markov Chain • MCs of order k > 1 can be converted into a 1st order MC [left as exercise] • So w.l.o.g., “MC” refers to a 1st order MC • [Figure: chain $X_0 \to X_1 \to X_2 \to X_3$]

  10. Inference in MC • What independence relationships can we read from the BN? • Observe $X_1$ ⇒ $X_0$ is independent of $X_2, X_3, \dots$ • $P(X_t \mid X_{t-1})$ is known as the transition model • [Figure: chain $X_0 \to X_1 \to X_2 \to X_3$]

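  11. Inference in MC • Prediction: what is the probability of a future state? • $P(X_t) = \sum_{x_0, \dots, x_{t-1}} P(x_0, \dots, x_{t-1}, X_t) = \sum_{x_0, \dots, x_{t-1}} P(x_0) \prod_{i=1}^{t} P(x_i \mid x_{i-1}) = \sum_{x_{t-1}} P(X_t \mid x_{t-1})\, P(x_{t-1})$, with $x_t = X_t$ in the product [the last form is the incremental approach] • “Blurs” over time, and approaches the stationary distribution as t grows • Limited prediction power • Rate of blurring known as mixing time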
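
The incremental form of the prediction sum can be sketched in a few lines of Python. This is only an illustration: the two-state chain, the names `step` and `predict`, and the row convention `T[i][j] = P(X_t = j | X_{t-1} = i)` are assumptions made for the example, not part of the slides.

```python
def step(dist, T):
    """One prediction step: P(X_t = j) = sum_i P(X_t = j | X_{t-1} = i) P(X_{t-1} = i)."""
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


def predict(prior, T, t):
    """Apply the incremental update t times, starting from P(X_0)."""
    dist = list(prior)
    for _ in range(t):
        dist = step(dist, T)
    return dist


# Hypothetical two-state chain (assumed numbers, rows sum to 1).
T = [[0.9, 0.1],
     [0.5, 0.5]]
prior = [1.0, 0.0]  # we start out certain that X_0 is state 0

for t in (0, 1, 2, 5, 20):
    print(t, [round(p, 4) for p in predict(prior, T, t)])
```

For this particular chain the printed distributions drift toward (5/6, 1/6), which is the stationary distribution of the assumed `T`; the initial certainty is "blurred" away after a few dozen steps.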

  12. How does the Markov assumption affect the choice of state? • Suppose we’re tracking a point (x,y) in 2D • What if the point is… • A momentumless particle subject to thermal vibration? • A particle with velocity? • A particle with intent, like a person?

  13. How does the Markov assumption affect the choice of state? • Suppose the point is the position of our robot, and we observe velocity and intent • What if: • Terrain conditions affect speed? • Battery level affects speed? • Position is noisy, e.g. GPS?

  14. Is the Markov assumption appropriate for: • A car on a slippery road? • Sales of toothpaste? • The stock market?

  15. History Dependence • In Markov models, the state must be chosen so that the future is independent of history given the current state • Often this requires adding variables that cannot be directly observed

  16. Partial Observability • Hidden Markov Model (HMM) • Hidden state variables $X_0, X_1, X_2, X_3$ • Observed variables $O_1, O_2, O_3$ • $P(O_t \mid X_t)$ is called the observation model (or sensor model)

  17. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation • [Figure: HMM with hidden states $X_0, \dots, X_3$ and observations $O_1, O_2, O_3$]

  18. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation • [Figure: HMM with hidden states $X_0, X_1, X_2$ and observations $O_1, O_2$; query variable marked]

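  19. Filtering • Name comes from signal processing • $P(X_t \mid o_{1:t}) = \sum_{x_{t-1}} P(x_{t-1} \mid o_{1:t-1})\, P(X_t \mid x_{t-1}, o_t)$ • $P(X_t \mid X_{t-1}, o_t) = P(o_t \mid X_{t-1}, X_t)\, P(X_t \mid X_{t-1}) / P(o_t \mid X_{t-1}) = \alpha\, P(o_t \mid X_t)\, P(X_t \mid X_{t-1})$ • [Figure: HMM with hidden states $X_0, X_1, X_2$ and observations $O_1, O_2$; query variable marked]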

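  20. Filtering • $P(X_t \mid o_{1:t}) = \alpha \sum_{x_{t-1}} P(x_{t-1} \mid o_{1:t-1})\, P(o_t \mid X_t)\, P(X_t \mid x_{t-1})$, where $\alpha$ is a normalizing constant • Forward recursion • If we keep track of $P(X_t \mid o_{1:t})$, each update takes O(1) time with respect to t, for all t! • [Figure: HMM with hidden states $X_0, X_1, X_2$ and observations $O_1, O_2$; query variable marked]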
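
The forward recursion can be sketched as follows. The model is a small umbrella-style toy HMM with assumed numbers (state 0 = rain, state 1 = no rain; observation 0 = umbrella, 1 = no umbrella); the names `filter_step` and `forward_filter` and the conventions `T[i][j] = P(X_t = j | X_{t-1} = i)` and `E[j][o] = P(o | X_t = j)` are choices made for this illustration.

```python
def normalize(v):
    """Scale a non-negative vector so it sums to 1 (this is the alpha factor)."""
    z = sum(v)
    return [x / z for x in v]


def filter_step(belief, obs, T, E):
    """Map P(X_{t-1} | o_{1:t-1}) to P(X_t | o_{1:t}) for one new observation."""
    n = len(belief)
    # Sum over x_{t-1}: P(x_{t-1} | o_{1:t-1}) P(X_t | x_{t-1})
    predicted = [sum(belief[i] * T[i][j] for i in range(n)) for j in range(n)]
    # Weight by the observation model P(o_t | X_t), then normalize.
    return normalize([E[j][obs] * predicted[j] for j in range(n)])


def forward_filter(prior, observations, T, E):
    """Return the filtered beliefs P(X_t | o_{1:t}) for t = 1..len(observations)."""
    belief = list(prior)
    history = []
    for o in observations:
        belief = filter_step(belief, o, T, E)  # constant work per time step
        history.append(belief)
    return history


# Assumed toy model.
T = [[0.7, 0.3],
     [0.3, 0.7]]
E = [[0.9, 0.1],
     [0.2, 0.8]]
prior = [0.5, 0.5]

history = forward_filter(prior, [0, 0], T, E)
for t, b in enumerate(history, start=1):
    print(t, [round(p, 4) for p in b])
```

Only the latest belief is carried from one step to the next, which is why the cost of an update does not grow with t.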

  21. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation • [Figure: HMM with hidden states $X_0, \dots, X_3$ and observations $O_1, O_2, O_3$; query variable marked]

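  22. Prediction • $P(X_{t+k} \mid o_{1:t})$ • 2 steps: $P(X_t \mid o_{1:t})$, then $P(X_{t+k} \mid X_t)$ • Filter, then predict as with a standard MC • [Figure: HMM with hidden states $X_0, \dots, X_3$ and observations $O_1, O_2, O_3$; query variable marked]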
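
A minimal sketch of the two-step recipe, filter first and then push the belief through the transition model k times. The toy numbers, the function names, and the conventions `T[i][j] = P(X_t = j | X_{t-1} = i)` and `E[j][o] = P(o | X_t = j)` are assumptions for illustration only.

```python
def propagate(dist, T):
    """One Markov-chain prediction step with no new evidence."""
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


def filter_then_predict(prior, observations, T, E, k):
    """Return P(X_{t+k} | o_{1:t}) where t = len(observations)."""
    belief = list(prior)
    # Step 1: filtering, P(X_t | o_{1:t}).
    for o in observations:
        predicted = propagate(belief, T)
        unnorm = [E[j][o] * predicted[j] for j in range(len(predicted))]
        z = sum(unnorm)
        belief = [u / z for u in unnorm]
    # Step 2: k evidence-free steps, exactly as in a plain Markov chain.
    for _ in range(k):
        belief = propagate(belief, T)
    return belief


# Assumed toy model: state 0 = rain, observation 0 = umbrella.
T = [[0.7, 0.3],
     [0.3, 0.7]]
E = [[0.9, 0.1],
     [0.2, 0.8]]
prior = [0.5, 0.5]

for k in (0, 1, 5, 50):
    dist = filter_then_predict(prior, [0, 0], T, E, k)
    print(k, [round(p, 4) for p in dist])
```

As k grows, the prediction forgets the evidence and settles at the stationary distribution of the assumed chain, which is (0.5, 0.5) here.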

  23. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation • [Figure: HMM with hidden states $X_0, \dots, X_3$ and observations $O_1, O_2, O_3$; query variable marked]

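  24. Smoothing • $P(X_k \mid o_{1:t})$ for $k < t$ • $P(X_k \mid o_{1:k}, o_{k+1:t}) = P(o_{k+1:t} \mid X_k, o_{1:k})\, P(X_k \mid o_{1:k}) / P(o_{k+1:t} \mid o_{1:k}) = \alpha\, P(o_{k+1:t} \mid X_k)\, P(X_k \mid o_{1:k})$ • The factor $P(X_k \mid o_{1:k})$ comes from standard filtering to time k • [Figure: HMM with hidden states $X_0, \dots, X_3$ and observations $O_1, O_2, O_3$; query variable marked]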

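  25. Smoothing • Computing $P(o_{k+1:t} \mid X_k)$ • $P(o_{k+1:t} \mid X_k) = \sum_{x_{k+1}} P(o_{k+1:t} \mid X_k, x_{k+1})\, P(x_{k+1} \mid X_k) = \sum_{x_{k+1}} P(o_{k+1:t} \mid x_{k+1})\, P(x_{k+1} \mid X_k) = \sum_{x_{k+1}} P(o_{k+2:t} \mid x_{k+1})\, P(o_{k+1} \mid x_{k+1})\, P(x_{k+1} \mid X_k)$ • Backward recursion: the factor $P(o_{k+2:t} \mid x_{k+1})$ is the same quantity one step later • [Figure: HMM with hidden states $X_0, \dots, X_3$ and observations $O_1, O_2, O_3$. Given prior states, what’s the probability of this sequence?]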
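
Combining the forward (filtering) messages with the backward recursion gives the smoothed estimates. The sketch below is one way to write this for a discrete HMM; the toy model, the function names, and the conventions `T[i][j] = P(X_t = j | X_{t-1} = i)` and `E[j][o] = P(o | X_t = j)` are assumptions for illustration.

```python
def forward(prior, obs, T, E):
    """f[k] = P(X_k | o_{1:k}) for k = 0..t, with f[0] equal to the prior."""
    n = len(prior)
    f = [list(prior)]
    for o in obs:
        prev = f[-1]
        unnorm = [E[j][o] * sum(prev[i] * T[i][j] for i in range(n)) for j in range(n)]
        z = sum(unnorm)
        f.append([u / z for u in unnorm])
    return f


def backward(obs, T, E, n):
    """b[k][i] = P(o_{k+1:t} | X_k = i), with b[t][i] = 1 (no evidence left)."""
    t = len(obs)
    b = [[1.0] * n for _ in range(t + 1)]
    for k in range(t - 1, -1, -1):
        o = obs[k]  # obs is 0-indexed, so obs[k] is o_{k+1} in slide notation
        b[k] = [sum(T[i][j] * E[j][o] * b[k + 1][j] for j in range(n)) for i in range(n)]
    return b


def smooth(prior, obs, T, E):
    """Return P(X_k | o_{1:t}) for every k = 0..t."""
    f = forward(prior, obs, T, E)
    b = backward(obs, T, E, len(prior))
    result = []
    for fk, bk in zip(f, b):
        unnorm = [x * y for x, y in zip(fk, bk)]
        z = sum(unnorm)  # the alpha normalizer
        result.append([u / z for u in unnorm])
    return result


# Assumed toy model: state 0 = rain, observation 0 = umbrella.
T = [[0.7, 0.3],
     [0.3, 0.7]]
E = [[0.9, 0.1],
     [0.2, 0.8]]
prior = [0.5, 0.5]

smoothed = smooth(prior, [0, 0], T, E)
for k, dist in enumerate(smoothed):
    print(k, [round(p, 4) for p in dist])
```

With two umbrella observations, the smoothed estimate for $X_1$ is higher than the filtered one, because the second observation also counts as evidence about the earlier state.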

  26. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation • Query returns a path through state space $x_0, \dots, x_3$ • [Figure: HMM with hidden states $X_0, \dots, X_3$ and observations $O_1, O_2, O_3$]

  27. Most Likely Explanation • Find: $\arg\max_{x_0, \dots, x_t} P(x_0, \dots, x_t \mid o_1, \dots, o_t)$ • The whole sequence $x_0, \dots, x_t$ is under consideration • Many practical applications: • Protein sequence alignment • Intelligent user interfaces • Communication codes

  28. Viterbi Algorithm • Observation: $P(x_{0:t} \mid o_{1:t}) = P(x_{0:t-1}, x_t \mid o_{1:t}) = \frac{1}{Z}\, P(x_t \mid x_{t-1})\, P(o_t \mid x_t)\, P(x_{0:t-1} \mid o_{1:t-1})$ • If we knew what $x_{t-1}^*$ and $P(x_{0:t-1}^* \mid o_{1:t-1})$ were, then we could determine $x_{0:t}^*$ by maximizing over possible assignments to $x_t$ • Suggests a dynamic programming algorithm to determine this recursively

  29. Viterbi Algorithm for discrete X • Store two arrays V[i,k], p[i,k] • V[i,k]: the max probability of any sequence ending in $X_i = k$ • $V[i,k] = \max_{x_{0:i-1}} P(x_{0:i-1}, X_i = k \mid o_{1:i})$

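  35. Viterbi Algorithm for discrete X • Store two arrays V[i,k], p[i,k] • V[i,k]: the max probability of any sequence ending in $X_i = k$ • $V[i,k] = \max_{x_{0:i-1}} P(x_{0:i-1}, X_i = k \mid o_{1:i})$ • Up to the normalizing constant 1/Z: $P(x_{0:i-1}, X_i = k \mid o_{1:i}) \propto P(X_i = k \mid x_{i-1})\, P(o_i \mid X_i = k)\, P(x_{0:i-1} \mid o_{1:i-1})$ • $\max_{x_{0:i-1}} P(x_{0:i-1}, X_i = k \mid o_{1:i}) \propto P(o_i \mid X_i = k) \max_{x_{0:i-1}} P(X_i = k \mid x_{i-1})\, P(x_{0:i-1} \mid o_{1:i-1}) = P(o_i \mid X_i = k) \max_{x_{i-1}} P(X_i = k \mid x_{i-1}) \max_{x_{0:i-2}} P(x_{0:i-1} \mid o_{1:i-1})$ • $\max_{x_{i-1}} P(X_i = k \mid x_{i-1}) \max_{x_{0:i-2}} P(x_{0:i-1} \mid o_{1:i-1}) = \max_{x_{i-1}} P(X_i = k \mid x_{i-1}) \max_{x_{0:i-2}} P(x_{0:i-2}, X_{i-1} = x_{i-1} \mid o_{1:i-1}) = \max_{x_{i-1}} P(X_i = k \mid x_{i-1})\, V[i-1, x_{i-1}]$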

  36. Viterbi Algorithm for discrete X • Store two arrays V[i,k], p[i,k] • V[i,k]: the max probability of any sequence ending in $X_i = k$ • $V[i,k] = \max_{x_{0:i-1}} P(x_{0:i-1}, X_i = k \mid o_{1:i})$ • Recursive computation of V[i,k]: • $V[0,k] \leftarrow P(x_0 = k)$ (base case) • $V[i,k] \leftarrow P(o_i \mid X_i = k) \max_{x_{i-1}} P(X_i = k \mid x_{i-1})\, V[i-1, x_{i-1}]$ • Define p[i,k] to give the predecessor of $X_i = k$ that resulted in the value of V[i,k] (i.e., the arg max)

  37. Viterbi Algorithm for Discrete X • Forward pass: calculate V[i,k] and p[i,k] from i = 0 to t • $V[0,k] \leftarrow P(x_0 = k)$ (base case) • $V[i,k] \leftarrow P(o_i \mid X_i = k) \max_{x_{i-1}} P(X_i = k \mid x_{i-1})\, V[i-1, x_{i-1}]$ • $p[i,k] \leftarrow \arg\max_{x_{i-1}} P(X_i = k \mid x_{i-1})\, V[i-1, x_{i-1}]$ • Backward pass: extract the most likely path using p • $x_t \leftarrow \arg\max_k V[t,k]$ • $x_{i-1} \leftarrow p[i, x_i]$ • The result $x_0, \dots, x_t$ is the most likely explanation
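
The forward and backward passes above can be sketched directly. The toy model, the function name `viterbi`, and the conventions `T[j][k] = P(X_i = k | X_{i-1} = j)` and `E[k][o] = P(o | X_i = k)` are assumptions for illustration. Because the recursion drops the normalizer, the returned score is the unnormalized value $P(x_{0:t}, o_{1:t})$ of the best path, which has the same arg max as the posterior.

```python
def viterbi(prior, obs, T, E):
    """Return (most likely state path x_0..x_t, unnormalized score of that path)."""
    n = len(prior)
    V = [list(prior)]   # V[0][k] = P(x_0 = k), the base case
    p = [[None] * n]    # p[0] is unused: x_0 has no predecessor
    # Forward pass: fill V[i][k] and p[i][k] for i = 1..t.
    for o in obs:
        prev = V[-1]
        row, back = [], []
        for k in range(n):
            best = max(range(n), key=lambda j: T[j][k] * prev[j])
            row.append(E[k][o] * T[best][k] * prev[best])
            back.append(best)
        V.append(row)
        p.append(back)
    # Backward pass: start at the best final state and follow the predecessors.
    t = len(obs)
    path = [0] * (t + 1)
    path[t] = max(range(n), key=lambda k: V[t][k])
    for i in range(t, 0, -1):
        path[i - 1] = p[i][path[i]]
    return path, V[t][path[t]]


# Assumed toy model: state 0 = rain, 1 = no rain; observation 0 = umbrella, 1 = none.
T = [[0.7, 0.3],
     [0.3, 0.7]]
E = [[0.9, 0.1],
     [0.2, 0.8]]
prior = [0.5, 0.5]

path, score = viterbi(prior, [0, 0, 1, 0, 0], T, E)
print(path, score)
```

For this observation sequence the most likely path is rain on every day except the one without an umbrella, i.e. `[0, 0, 0, 1, 0, 0]` including $x_0$. The two nested loops over states inside the loop over time steps are where the quadratic dependence on the number of states comes from.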

  38. In practice • Complexity: $O(t\,|\mathrm{Val}(X)|^2)$ • Long or rare observation sequences ⇒ log-probabilities are more numerically stable
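
To illustrate the numerical point, here is a hedged sketch of the same recursion carried out with sums of log-probabilities instead of products. The toy model and the names `safe_log` and `viterbi_log` are assumptions for the example; only the standard-library `math` module is used.

```python
import math


def safe_log(x):
    """log(x), with log(0) mapped to -inf so impossible events stay impossible."""
    return math.log(x) if x > 0 else float("-inf")


def viterbi_log(prior, obs, T, E):
    """Viterbi in log space: returns (path x_0..x_t, log of its unnormalized score)."""
    n = len(prior)
    V = [safe_log(q) for q in prior]
    back = []
    for o in obs:
        # Products of probabilities become sums of logs; max is unchanged.
        ptr = [max(range(n), key=lambda j: V[j] + safe_log(T[j][k])) for k in range(n)]
        V = [safe_log(E[k][o]) + V[ptr[k]] + safe_log(T[ptr[k]][k]) for k in range(n)]
        back.append(ptr)
    x = max(range(n), key=lambda k: V[k])
    best = V[x]
    path = [x]
    for ptr in reversed(back):
        x = ptr[x]
        path.append(x)
    path.reverse()
    return path, best


# Assumed toy model: state 0 = rain, observation 0 = umbrella.
T = [[0.7, 0.3],
     [0.3, 0.7]]
E = [[0.9, 0.1],
     [0.2, 0.8]]
prior = [0.5, 0.5]

long_path, long_score = viterbi_log(prior, [0] * 3000, T, E)
print(long_score)             # a finite, large negative number
print(0.5 * 0.63 ** 3000)     # the same quantity as a plain product underflows
```

In this example the best path multiplies a factor of 0.63 per step, so after 3000 steps the plain product is far below the smallest representable double and evaluates to zero, while the log score remains an ordinary finite number.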

  39. Applications of HMMs in NLP • Speech recognition • Hidden phones (e.g., ah eh ee th r) • Observed, noisy acoustic features (produced by signal processing)

  40. Phone Observation Models • Model defined to be robust over variations in accent, speed, pitch, noise • [Figure: Phone_t → Features_t; signal processing produces the features, e.g. (24, 13, 3, 59)]

  41. Phone Transition Models • Good models will capture (among other things): • Pronunciation of words • Subphone structure • Coarticulation effects • Triphone models = order 3 Markov chain • [Figure: Phone_t → Phone_t+1, with Phone_t → Features_t]

  42. Word Segmentation • Words run together when pronounced • Unigrams $P(w_i)$ • Bigrams $P(w_i \mid w_{i-1})$ • Trigrams $P(w_i \mid w_{i-1}, w_{i-2})$ • Random 20-word samples from R&N using N-gram models: • “Logical are as confusion a may right tries agent goal the was diesel more object then information-gathering search is” • “Planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic tac toe a predicate” • “Planning and scheduling are integrated the success of naïve bayes model is just a possible prior source by that time”
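
As a rough sketch of how such samples can be produced, the snippet below estimates bigram probabilities $P(w_i \mid w_{i-1})$ by counting and then samples a word sequence from them. The tiny corpus and the names `train_bigrams` and `sample` are made up for illustration; no smoothing is applied, so unseen bigrams get probability zero.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Maximum-likelihood estimate of P(w_i | w_{i-1}) from adjacent-word counts."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}


def sample(model, start, length, rng):
    """Sample up to `length` words, stopping early at a word with no known successor."""
    words = [start]
    while len(words) < length and words[-1] in model:
        successors = model[words[-1]]
        nxt = rng.choices(list(successors), weights=list(successors.values()))[0]
        words.append(nxt)
    return words


# Hypothetical toy corpus (not the R&N text used on the slide).
corpus = "the agent plans the search and the agent acts".split()
model = train_bigrams(corpus)

print(model["the"])
print(" ".join(sample(model, "the", 8, random.Random(0))))
```

A trigram model follows the same pattern with pairs of preceding words as keys, which is one way to see why higher-order samples read more like the source text.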

  43. Tricks to improve recognition • Narrow the # of variables • Digits, yes/no, phone tree • Training with real user data • Real story: “Yes ma’am”

  44. Next Class • Continue R&N 15
