
CS b553: Algorithms for Optimization and Learning


trinh


Presentation Transcript


  1. CS b553: Algorithms for Optimization and Learning Temporal sequences: Hidden Markov Models and Dynamic Bayesian Networks

  2. Motivation • Observing a stream of data • Monitoring (of people, computer systems, etc) • Surveillance, tracking • Finance & economics • Science • Questions: • Modeling & forecasting • Unobserved variables

  3. Time Series Modeling • Time occurs in steps $t = 0, 1, 2, \ldots$ • A time step can be seconds, days, years, etc. • State variable $X_t$, $t = 0, 1, 2, \ldots$ • For partially observed problems, we see observations $O_t$, $t = 1, 2, \ldots$, and do not see the $X$’s • The $X$’s are hidden variables (aka latent variables)

  4. Modeling Time • Arrow of time • Causality => Bayesian networks are natural models of time series [Figure: causes → effects]

  5. Probabilistic Modeling • For now, assume fully observable case • What parents? [Figure: nodes $X_0$, $X_1$, $X_2$, $X_3$]

  6. Markov Assumption • Assume $X_{t+k}$ is independent of all $X_i$ for $i < t$, given $X_t, \ldots, X_{t+k-1}$: $P(X_{t+k} \mid X_0, \ldots, X_{t+k-1}) = P(X_{t+k} \mid X_t, \ldots, X_{t+k-1})$ • $k$-th order Markov chain [Figure: chains over $X_0, \ldots, X_3$ of order 0, order 1, order 2, and order 3]

  7. 1st order Markov Chain • MCs of order $k > 1$ can be converted into a 1st order MC on the variable $Y_t = \{X_t, \ldots, X_{t+k-1}\}$ • So w.l.o.g., “MC” refers to a 1st order MC [Figure: chain over $X_0, \ldots, X_4$ regrouped into variables $Y_0, \ldots, Y_3$]
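
The conversion to a 1st order chain can be sketched for $k = 2$. The table `p2` and its numbers are invented for illustration; the only point is that $Y_t = (X_t, X_{t+1})$ gets a first-order transition model in which consecutive pairs must overlap.

```python
from itertools import product

# Hypothetical second-order model on a binary variable (numbers are made up):
# p2[(a, b)][c] = P(X_{t+2} = c | X_t = a, X_{t+1} = b)
p2 = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.4, 0.6],
    (1, 0): [0.7, 0.3],
    (1, 1): [0.2, 0.8],
}

def first_order_on_pairs(p2, values=(0, 1)):
    """Return trans[y][y_next] = P(Y_{t+1} = y_next | Y_t = y), Y_t = (X_t, X_{t+1})."""
    trans = {}
    for y in product(values, repeat=2):
        a, b = y
        row = {}
        for y_next in product(values, repeat=2):
            b_next, c = y_next
            # Y_{t+1} = (X_{t+1}, X_{t+2}) must repeat the shared variable X_{t+1}.
            row[y_next] = p2[(a, b)][c] if b_next == b else 0.0
        trans[y] = row
    return trans

pair_chain = first_order_on_pairs(p2)
```

Each row of `pair_chain` has only two nonzero entries, so the larger state space does not add free parameters.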

  8. Inference in MC • What independence relationships can we read from the BN? • Observing $X_1$ makes $X_0$ independent of $X_2, X_3, \ldots$ • $P(X_t \mid X_{t-1})$ is known as the transition model [Figure: chain $X_0, X_1, X_2, X_3$ with $X_1$ observed]

  9. Inference in MC • Prediction: what is the probability of a future state? • $P(X_t) = \sum_{x_0, \ldots, x_{t-1}} P(X_0, \ldots, X_t) = \sum_{x_0, \ldots, x_{t-1}} P(X_0) \prod_{i=1}^{t} P(X_i \mid X_{i-1}) = \sum_{x_{t-1}} P(X_t \mid X_{t-1}) P(X_{t-1})$ • Approach: maintain a belief state $b_t(X) = P(X_t)$, use the above equation to advance to $b_{t+1}(X)$ • Equivalent to the VE algorithm in sequential order [Recursive approach]

  10. Belief state evolution • $P(X_t) = \sum_{x_{t-1}} P(X_t \mid X_{t-1}) P(X_{t-1})$ • “Blurs” over time, and (typically) approaches a stationary distribution as $t$ grows • Limited prediction power • The time scale of the blurring is known as the mixing time

  11. Stationary distributions • For discrete variables $\mathrm{Val}(X) = \{1, \ldots, n\}$: • Transition matrix $T_{ij} = P(X_t = i \mid X_{t-1} = j)$ • Belief $b_t(X)$ is just a vector $b_{t,i} = P(X_t = i)$ • Belief update equation: $b_{t+1} = T b_t$ • A stationary distribution $b$ is one in which $b = T b$ • => $b$ is an eigenvector of $T$ with eigenvalue 1 • => $b$ is in the null space of $(T - I)$
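
The belief update and the stationary distribution can be illustrated with a small sketch. The two-state matrix `T` is an invented example, and repeated application of the update (power iteration) is just one way to find a fixed point; it is assumed here that the chain mixes, which holds for this matrix.

```python
# Column-stochastic transition matrix, T[i][j] = P(X_t = i | X_{t-1} = j).
# The numbers are illustrative, not taken from the lecture.
T = [[0.9, 0.5],
     [0.1, 0.5]]

def advance(T, b):
    """One belief update b_{t+1} = T b_t."""
    n = len(b)
    return [sum(T[i][j] * b[j] for j in range(n)) for i in range(n)]

def stationary(T, b0, tol=1e-12, max_iters=10_000):
    """Repeat b <- T b until the belief stops changing (power iteration)."""
    b = list(b0)
    for _ in range(max_iters):
        nxt = advance(T, b)
        if max(abs(x - y) for x, y in zip(nxt, b)) < tol:
            return nxt
        b = nxt
    return b

b_inf = stationary(T, [1.0, 0.0])
```

For this matrix, solving $b = T b$ by hand gives $b = (5/6, 1/6)$, which `b_inf` approaches regardless of the starting belief.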

  12. History Dependence • In Markov models, the state must be chosen so that the future is independent of history given the current state • Often this requires adding variables that cannot be directly observed [Figure: word-prediction example around “the bare” (fragments: “minimum”, “essentials”, “market”, “wipes himself with the rabbit”); image captions “Are these people walking toward you or away from you?” and “What comes next?”]

  13. Partial Observability • Hidden Markov Model (HMM) • Hidden state variables: $X_0, X_1, X_2, X_3$ • Observed variables: $O_1, O_2, O_3$ • $P(O_t \mid X_t)$ is called the observation model (or sensor model)

  14. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation [Figure: HMM with states $X_0, \ldots, X_3$ and observations $O_1, O_2, O_3$]

  15. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation [Figure: HMM with states $X_0, X_1, X_2$ and observations $O_1, O_2$; query variable marked]

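  16. Filtering • Name comes from signal processing • $P(X_t \mid o_{1:t}) = \alpha\, P(o_t \mid X_t)\, P(X_t \mid o_{1:t-1})$ by Bayes' rule, since $o_t$ is independent of $o_{1:t-1}$ given $X_t$ • $P(X_t \mid o_{1:t-1}) = \sum_{x_{t-1}} P(X_t \mid x_{t-1})\, P(x_{t-1} \mid o_{1:t-1})$ • $\alpha = 1 / P(o_t \mid o_{1:t-1})$ is a normalizing constant [Figure: HMM with states $X_0, X_1, X_2$ and observations $O_1, O_2$; query variable marked]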

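  17. Filtering • $P(X_t \mid o_{1:t}) = \alpha \sum_{x_{t-1}} P(x_{t-1} \mid o_{1:t-1})\, P(o_t \mid X_t)\, P(X_t \mid x_{t-1})$ • Forward recursion • If we keep track of the belief state $b_t(X) = P(X_t \mid o_{1:t})$ => $O(|\mathrm{Val}(X)|^2)$ updates for each $t$! [Figure: HMM with states $X_0, X_1, X_2$ and observations $O_1, O_2$; query variable marked]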

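  18. Predict-Update interpretation • Given old belief state $b_{t-1}(X)$ • Predict: first compute the MC update $b_t'(X_t) = P(X_t \mid o_{1:t-1}) = \sum_x b_{t-1}(x)\, P(X_t \mid X_{t-1} = x)$ • Update: re-weight to account for observation probabilities, then normalize: • $b_t(x) = \alpha\, b_t'(x)\, P(o_t \mid X_t = x)$ [Figure: HMM with states $X_0, X_1, X_2$ and observations $O_1, O_2$; query variable marked]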
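
The predict and update steps can be sketched as follows. The two-state model, the function names, and the table layout (`T[i][j]`, `obs_model[i][o]`) are assumptions made for this example, not something the slides specify.

```python
def predict(T, b):
    """Predict step: b'_t(i) = sum_j P(X_t = i | X_{t-1} = j) * b_{t-1}(j)."""
    n = len(b)
    return [sum(T[i][j] * b[j] for j in range(n)) for i in range(n)]

def update(obs_model, b_pred, o):
    """Update step: re-weight by P(o | X_t = i), then normalize."""
    w = [obs_model[i][o] * b_pred[i] for i in range(len(b_pred))]
    z = sum(w)  # equals P(o_t | o_{1:t-1}); zero if o is impossible under the model
    return [x / z for x in w]

def filter_hmm(T, obs_model, b0, observations):
    """Forward recursion: returns P(X_t | o_{1:t}) after the last observation."""
    b = list(b0)
    for o in observations:
        b = update(obs_model, predict(T, b), o)
    return b

# Illustrative two-state model (made-up numbers).
T = [[0.7, 0.3],
     [0.3, 0.7]]           # T[i][j] = P(X_t = i | X_{t-1} = j)
obs_model = [[0.9, 0.1],   # obs_model[i][o] = P(O_t = o | X_t = i)
             [0.2, 0.8]]

belief = filter_hmm(T, obs_model, [0.5, 0.5], [0, 0])
```

Each time step costs one pass over all state pairs, which is the $O(|\mathrm{Val}(X)|^2)$ update cost. With these numbers, two observations of `0` raise the belief in state 0 from 0.5 to roughly 0.88.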

  19. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation [Figure: HMM with states $X_0, \ldots, X_3$ and observations $O_1, O_2, O_3$; query variable marked]

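  20. Prediction • $P(X_{t+k} \mid o_{1:t})$ • 2 steps: $P(X_t \mid o_{1:t})$, then $P(X_{t+k} \mid X_t)$ • Filter to time $t$, then predict as with a standard MC: $P(X_{t+k} \mid o_{1:t}) = \sum_{x_t} P(X_{t+k} \mid x_t)\, P(x_t \mid o_{1:t})$ [Figure: HMM with states $X_0, \ldots, X_3$ and observations $O_1, O_2, O_3$; query variable marked]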
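
A minimal sketch of the second step: starting from an already filtered belief, apply the Markov chain update $k$ times with no observation re-weighting. The matrix and the starting belief are invented for illustration.

```python
def predict_k(T, b, k):
    """Advance a filtered belief k steps ahead using only the transition model."""
    n = len(b)
    for _ in range(k):
        b = [sum(T[i][j] * b[j] for j in range(n)) for i in range(n)]
    return b

# T[i][j] = P(X_{t+1} = i | X_t = j); numbers are made up.
T = [[0.7, 0.3],
     [0.3, 0.7]]

filtered = [0.9, 0.1]               # assumed output of filtering up to time t
one_step = predict_k(T, filtered, 1)
far_ahead = predict_k(T, filtered, 50)
```

Here `one_step` is about `[0.66, 0.34]`, while `far_ahead` is essentially the stationary distribution `[0.5, 0.5]`: predictions lose information as $k$ grows.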

  21. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation [Figure: HMM with states $X_0, \ldots, X_3$ and observations $O_1, O_2, O_3$; query variable marked]

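  22. Smoothing • $P(X_k \mid o_{1:t})$ for $k < t$ • $P(X_k \mid o_{1:k}, o_{k+1:t}) = P(o_{k+1:t} \mid X_k, o_{1:k})\, P(X_k \mid o_{1:k}) / P(o_{k+1:t} \mid o_{1:k}) = \alpha\, P(o_{k+1:t} \mid X_k)\, P(X_k \mid o_{1:k})$ • The factor $P(X_k \mid o_{1:k})$ is standard filtering to time $k$ [Figure: HMM with states $X_0, \ldots, X_3$ and observations $O_1, O_2, O_3$; query variable marked]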

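  23. Smoothing • Computing $P(o_{k+1:t} \mid X_k)$ by backward recursion • $P(o_{k+1:t} \mid X_k) = \sum_{x_{k+1}} P(o_{k+1:t} \mid X_k, x_{k+1})\, P(x_{k+1} \mid X_k) = \sum_{x_{k+1}} P(o_{k+1:t} \mid x_{k+1})\, P(x_{k+1} \mid X_k) = \sum_{x_{k+1}} P(o_{k+2:t} \mid x_{k+1})\, P(o_{k+1} \mid x_{k+1})\, P(x_{k+1} \mid X_k)$ [Figure: HMM with states $X_0, \ldots, X_3$ and observations $O_1, O_2, O_3$; annotation: “Given prior states, what’s the probability of this sequence?”]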
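
The backward recursion combines with the forward (filtering) pass to give smoothed estimates. The sketch below assumes a small discrete HMM with invented numbers; the function names and the 0-based list layout are choices made for this example.

```python
def normalize(w):
    z = sum(w)
    return [x / z for x in w]

def forward_messages(T, obs_model, b0, observations):
    """f[k] = P(X_{k+1} | o_{1:k+1}) for each observation (0-based k)."""
    n, b, out = len(b0), list(b0), []
    for o in observations:
        pred = [sum(T[i][j] * b[j] for j in range(n)) for i in range(n)]
        b = normalize([obs_model[i][o] * pred[i] for i in range(n)])
        out.append(b)
    return out

def backward_messages(T, obs_model, observations):
    """beta[k][i] = P(o_{k+2:t} | X_{k+1} = i) (0-based k), computed from the end."""
    n = len(T)
    beta = [1.0] * n  # no evidence remains after the last time step
    out = [beta]
    for o in reversed(observations[1:]):
        # Sum over the next state j: beta_next(j) * P(o | j) * P(j | i).
        beta = [sum(beta[j] * obs_model[j][o] * T[j][i] for j in range(n))
                for i in range(n)]
        out.append(beta)
    out.reverse()
    return out

def smooth(T, obs_model, b0, observations):
    """Smoothed beliefs P(X_k | o_{1:t}) for k = 1..t."""
    f = forward_messages(T, obs_model, b0, observations)
    beta = backward_messages(T, obs_model, observations)
    return [normalize([fi * bi for fi, bi in zip(fk, bk)])
            for fk, bk in zip(f, beta)]

# Illustrative two-state model (made-up numbers).
T = [[0.7, 0.3],
     [0.3, 0.7]]           # T[i][j] = P(X_t = i | X_{t-1} = j)
obs_model = [[0.9, 0.1],   # obs_model[i][o] = P(O_t = o | X_t = i)
             [0.2, 0.8]]

smoothed = smooth(T, obs_model, [0.5, 0.5], [0, 0])
```

With these numbers the filtered belief in state 0 at time 1 is about 0.82, and the later observation raises the smoothed value to about 0.88.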

  24. Interpretation • Filtering/prediction: • Equivalent to forward variable elimination / belief propagation • Smoothing: • Equivalent to forward VE/BP up to query variable, then backward VE/BP from last observation back to query variable • Running BP to completion gives the smoothed estimates for all variables (forward-backward algorithm)

  25. Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation • Subject of next lecture • Query returns a path through state space $x_0, \ldots, x_3$ [Figure: HMM with states $X_0, \ldots, X_3$ and observations $O_1, O_2, O_3$]

  26. Applications of HMMs in NLP • Speech recognition • Hidden phones (e.g., ah eh ee th r) • Observed, noisy acoustic features (produced by signal processing)

  27. Phone Observation Models • Model defined to be robust over variations in accent, speed, pitch, noise [Figure: $\text{Phone}_t$ and $\text{Features}_t$; signal processing produces the features, e.g. Features(24,13,3,59)]

  28. Phone Transition Models • Good models will capture (among other things): • Pronunciation of words • Subphone structure • Coarticulation effects • Triphone models = order 3 Markov chain [Figure: $\text{Phone}_t$, $\text{Phone}_{t+1}$, and $\text{Features}_t$]

  29. Word Segmentation • Words run together when pronounced • Unigrams $P(w_i)$ • Bigrams $P(w_i \mid w_{i-1})$ • Trigrams $P(w_i \mid w_{i-1}, w_{i-2})$ • Random 20 word samples from R&N using N-gram models: • “Logical are as confusion a may right tries agent goal the was diesel more object then information-gathering search is” • “Planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic tac toe a predicate” • “Planning and scheduling are integrated the success of naïve bayes model is just a possible prior source by that time”
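
Sampling from a bigram model $P(w_i \mid w_{i-1})$ can be sketched as below. The toy corpus is invented for the example; the samples on the slide were drawn from the text of R&N, which is not reproduced here.

```python
import random
from collections import defaultdict

def train_bigrams(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return counts

def sample(counts, start, length, rng):
    """Draw up to `length` words, each conditioned on the previous word only."""
    words = [start]
    for _ in range(length - 1):
        followers = counts.get(words[-1])
        if not followers:  # the last word was never seen with a successor
            break
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return words

corpus = "the agent plans the agent acts the plan works".split()  # toy corpus
counts = train_bigrams(corpus)
text = sample(counts, "the", 6, random.Random(0))
```

Normalizing each row of `counts` gives the maximum-likelihood estimate of $P(w_i \mid w_{i-1})$; `rng.choices` does that normalization implicitly through its weights.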

  30. What about models with many variables? • Say $X$ has $n$ binary variables, $O$ has $m$ binary variables • Naively, a distribution over $X_t$ may be intractable to represent ($2^n$ entries) • Transition models $P(X_t \mid X_{t-1})$ require $2^{2n}$ entries • Observation models $P(O_t \mid X_t)$ require $2^{n+m}$ entries • Is there a better way?
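
To make the growth concrete, the table sizes can be computed for a few values of $n$ and $m$. The helper name and the chosen values are illustrative only.

```python
def table_sizes(n, m):
    """Entries needed for full tables over n binary state and m binary observation variables."""
    return {
        "belief": 2 ** n,             # P(X_t)
        "transition": 2 ** (2 * n),   # P(X_t | X_{t-1})
        "observation": 2 ** (n + m),  # P(O_t | X_t)
    }

small = table_sizes(10, 5)
large = table_sizes(30, 10)
```

With 10 state variables the transition table already has about a million entries; with 30 it has $2^{60}$, far beyond what can be stored explicitly.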

  31. Example: Failure detection • Consider a battery meter sensor • Battery = true level of battery • BMeter = sensor reading • Transient failures: send garbage at time t • Persistent failures: send garbage forever

  32. Example: Failure detection • Consider a battery meter sensor • Battery = true level of battery • BMeter = sensor reading • Transient failures: send garbage at time t • 5555500555… • Persistent failures: sensor is broken • 5555500000…

  33. Dynamic Bayesian Network • Template model relates variables on prior time step to the next time step (2-TBN) • “Unrolling” the template for all $t$ gives the ground Bayesian network • $\text{BMeter}_t \sim N(\text{Battery}_t, \sigma)$ [Figure: 2-TBN with nodes $\text{Battery}_{t-1}$, $\text{Battery}_t$, $\text{BMeter}_t$]

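  34. Dynamic Bayesian Network • $\text{BMeter}_t \sim N(\text{Battery}_t, \sigma)$ • Transient failure model: $P(\text{BMeter}_t = 0 \mid \text{Battery}_t = 5) = 0.03$ [Figure: 2-TBN with nodes $\text{Battery}_{t-1}$, $\text{Battery}_t$, $\text{BMeter}_t$]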

  35. Results on Transient Failure • Meter reads 55555005555… [Figure: $E(\text{Battery}_t)$ over time, with and without the transient failure model; the point where the transient failure occurs is marked]

  36. Results on Persistent Failure • Meter reads 5555500000… [Figure: $E(\text{Battery}_t)$ over time with the transient model; the point where the persistent failure occurs is marked]

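  37. Persistent Failure Model • $\text{BMeter}_t \sim N(\text{Battery}_t, \sigma)$ • $P(\text{BMeter}_t = 0 \mid \text{Battery}_t = 5) = 0.03$ • $P(\text{BMeter}_t = 0 \mid \text{Broken}_t) = 1$ [Figure: 2-TBN with nodes $\text{Broken}_{t-1}$, $\text{Broken}_t$, $\text{Battery}_{t-1}$, $\text{Battery}_t$, $\text{BMeter}_t$]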
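
The effect of the Broken variable can be explored with a simplified, fully discrete filter. Only the values 0.03 and $P(\text{BMeter}_t = 0 \mid \text{Broken}_t) = 1$ come from the slides. Everything else is an assumption made for this sketch: battery levels 0 to 5, a noise-free meter in place of the Gaussian, the jump probability `P_JUMP`, and the break probability passed as `p_break`.

```python
LEVELS = range(6)    # assumed discrete battery levels 0..5
P_JUMP = 1e-5        # assumed: chance of jumping to each other level per step
P_TRANSIENT = 0.03   # from the slide: P(BMeter_t = 0 | Battery_t = 5)

def p_battery(b_next, b):
    return 1 - 5 * P_JUMP if b_next == b else P_JUMP

def p_reading(r, b, broken):
    if broken or b == 0:
        return 1.0 if r == 0 else 0.0  # a broken meter always reads 0
    if r == b:
        return 1 - P_TRANSIENT
    return P_TRANSIENT if r == 0 else 0.0

def filter_battery(readings, p_break):
    """Exact filtering over (battery, broken); p_break = 0 gives the transient-only model."""
    belief = {(b, k): 0.0 for b in LEVELS for k in (False, True)}
    belief[(5, False)] = 1.0  # assumed start: full battery, working meter
    for r in readings:
        new = {}
        for b2 in LEVELS:
            for k2 in (False, True):
                pred = 0.0
                for (b1, k1), p in belief.items():
                    if k1:  # once broken, the meter stays broken
                        p_k = 1.0 if k2 else 0.0
                    else:
                        p_k = p_break if k2 else 1 - p_break
                    pred += p * p_battery(b2, b1) * p_k
                new[(b2, k2)] = pred * p_reading(r, b2, k2)
        z = sum(new.values())
        belief = {s: w / z for s, w in new.items()}
    return belief

def expected_battery(belief):
    return sum(b * p for (b, _), p in belief.items())

def p_broken(belief):
    return sum(p for (_, k), p in belief.items() if k)

readings = [5] * 5 + [0] * 5
transient_only = filter_battery(readings, p_break=0.0)
with_broken = filter_battery(readings, p_break=0.001)
```

Under these assumptions the transient-only model ends up believing the battery is empty after five zero readings, while the model with Broken attributes the zeros to the sensor and keeps the expected battery level near 5.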

  38. Results on Persistent Failure • Meter reads 5555500000… [Figure: $E(\text{Battery}_t)$ over time, with the persistent failure model and with the transient model; the point where the persistent failure occurs is marked]

  39. How to perform inference on DBN? • Exact inference on “unrolled” BN • E.g. Variable Elimination • Typical order: eliminate sequential time steps so that the network isn’t actually constructed • Unrolling is done only implicitly [Figure: unrolled DBN with nodes $Br_0$ to $Br_4$, $Ba_0$ to $Ba_4$, and $BM_1$ to $BM_4$]

  40. Entanglement Problem • After $n$ time steps, all $n$ variables in the belief state become dependent! • Unless 2-TBN can be partitioned into disjoint subsets (rare) • Lost sparsity structure

  41. Approximate inference in DBNs • Limited history updates • Assumed factorization of belief state • Particle filtering

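  42. Independent Factorization • Idea: assume belief state $P(X_t)$ factors across individual attributes: $P(X_t) = P(X_{1,t}) \cdots P(X_{n,t})$ • Filtering: only maintain factored distributions $P(X_{1,t} \mid O_{1:t}), \ldots, P(X_{n,t} \mid O_{1:t})$ • Filtering update: $P(X_{k,t} \mid O_{1:t}) = \alpha \sum_{x_{t-1}} P(X_{k,t}, O_t \mid x_{t-1})\, P(x_{t-1} \mid O_{1:t-1})$, with $P(x_{t-1} \mid O_{1:t-1})$ taken as the product of the stored factors = marginal probability query over 2-TBN [Figure: 2-TBN with nodes $X_{1,t-1}, \ldots, X_{n,t-1}$, $X_{1,t}, \ldots, X_{n,t}$, and $O_{1,t}, \ldots, O_{m,t}$]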
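
What the factorization assumption discards can be seen by projecting a joint belief onto the product of its marginals. The joint below is an invented example over two binary attributes; the helper names are hypothetical.

```python
from itertools import product

def marginals(joint, n):
    """Per-attribute marginals of a joint over n binary attributes."""
    m = [[0.0, 0.0] for _ in range(n)]
    for assignment, p in joint.items():
        for k, v in enumerate(assignment):
            m[k][v] += p
    return m

def product_of_marginals(m):
    """Factored approximation P(X_1) * ... * P(X_n) as a full joint table."""
    approx = {}
    for assignment in product((0, 1), repeat=len(m)):
        p = 1.0
        for k, v in enumerate(assignment):
            p *= m[k][v]
        approx[assignment] = p
    return approx

# Strongly correlated joint belief (made-up numbers).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
m = marginals(joint, 2)
approx = product_of_marginals(m)
```

Both marginals are uniform here, so the factored belief assigns 0.25 to every assignment and the correlation between the two attributes is lost. That is the price paid for storing $n$ small tables in place of one table with $2^n$ entries.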

  43. Next time • Viterbi algorithm • Read K&F 13.2 for some context • Kalman and particle filtering • Read K&F 15.3-4
