
Presentation Transcript


  1. A Review of Hidden Markov Models for Context-Based Classification
  ICML'01 Workshop on Temporal and Spatial Learning, Williams College, June 28th 2001
  Padhraic Smyth, Information and Computer Science, University of California, Irvine
  www.datalab.uci.edu

  2. Outline
  • Context in classification
  • Brief review of hidden Markov models
  • Hidden Markov models for classification
  • Simulation results: how useful is context? (with Dasha Chudova, UCI)

  3. Historical Note
  • "Classification in context" was well studied in pattern recognition in the 1960s and 1970s
  • e.g., recursive Markov-based algorithms were proposed before hidden Markov algorithms and models were fully understood
  • Applications in:
    • OCR for word-level recognition
    • remote-sensing pixel classification

  4. Papers of Note
  Raviv, J., "Decision-making in Markov chains applied to the problem of pattern recognition," IEEE Transactions on Information Theory, 13(4), 1967.
  Hanson, Riseman, and Fisher, "Context in word recognition," Pattern Recognition, 1976.
  Toussaint, G., "The use of context in pattern recognition," Pattern Recognition, 10, 1978.
  Mohn, Hjort, and Storvik, "A simulation study of some contextual classification methods for remotely sensed data," IEEE Transactions on Geoscience and Remote Sensing, 25(6), 1987.

  5. Context-Based Classification Problems
  • Medical diagnosis
    • classification of a patient's state over time
  • Fraud detection
    • detection of stolen credit cards
  • Electronic nose
    • detection of landmines
  • Remote sensing
    • classification of pixels into ground cover

  6. Modeling Context
  • Common Theme = Context
  • class labels (and features) are "persistent" in time/space

  7. Modeling Context
  • Common Theme = Context
  • class labels (and features) are "persistent" in time/space
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]

  8. Feature Windows
  • Predict Ct using a window, e.g., f(Xt, Xt-1, Xt-2)
  • e.g., the NETtalk application
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
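As a concrete illustration of the windowed approach (not code from the talk), here is a minimal numpy sketch that builds the windowed feature matrix used to predict Ct from f(Xt, Xt-1, Xt-2); the helper name make_windows and the zero-padding at the start of the sequence are my own choices.

```python
import numpy as np

def make_windows(X, width=3):
    """Stack each feature vector with its width-1 predecessors so that a
    classifier can predict C_t from f(X_t, X_{t-1}, X_{t-2})."""
    T, d = X.shape
    padded = np.vstack([np.zeros((width - 1, d)), X])   # zero-pad the start
    # column order is oldest ... newest: [X_{t-width+1}, ..., X_t]
    return np.hstack([padded[i:i + T] for i in range(width)])

# Example: 100 time steps of 4-dimensional features -> windowed shape (100, 12)
X_win = make_windows(np.random.randn(100, 4), width=3)
```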

  9. Alternative: Probabilistic Modeling
  • e.g., assume p(Ct | history) = p(Ct | Ct-1)
  • first order Markov assumption on the classes
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]

  10. Brief review of hidden Markov models (HMMs)

  11. Graphical Models
  • Basic Idea: p(U) <=> an annotated graph
    • Let U be a set of random variables of interest
    • 1-1 mapping from U to nodes in a graph
    • the graph encodes the "independence structure" of the model
    • numerical specifications of p(U) are stored locally at the nodes

  12. Acyclic Directed Graphical Models (aka belief/Bayesian networks)
  [Figure: three-node directed graph with A and B as parents of C]
  p(A,B,C) = p(C|A,B) p(A) p(B)
  In general, p(X1, X2, ..., XN) = ∏ p(Xi | parents(Xi))
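To make the factorization concrete, here is a minimal numpy sketch (my own illustrative numbers, not from the slides) that assembles the joint p(A,B,C) for binary variables from the factors p(A), p(B), and p(C|A,B):

```python
import numpy as np

pA = np.array([0.7, 0.3])                 # p(A)
pB = np.array([0.6, 0.4])                 # p(B)
pC_given_AB = np.array([[[0.9, 0.1],      # p(C | A=0, B=0)
                         [0.5, 0.5]],     # p(C | A=0, B=1)
                        [[0.4, 0.6],      # p(C | A=1, B=0)
                         [0.2, 0.8]]])    # p(C | A=1, B=1)

# p(A,B,C) = p(C|A,B) p(A) p(B), stored as a (2, 2, 2) array indexed [a, b, c]
joint = pC_given_AB * pA[:, None, None] * pB[None, :, None]
assert np.isclose(joint.sum(), 1.0)       # a valid joint distribution
```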

  13. Undirected Graphical Models (UGs)
  p(X1, X2, ..., XN) = (1/Z) ∏ potential(clique i)
  • Undirected edges reflect correlational dependencies
    • e.g., particles in physical systems, pixels in an image
  • Also known as Markov random fields, Boltzmann machines, etc.
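A matching sketch for the undirected case, assuming a three-node chain A - B - C with pairwise cliques {A,B} and {B,C}; the potential values are arbitrary nonnegative numbers chosen for illustration, and Z is the normalization (partition function) that turns the product of potentials into a probability distribution:

```python
import numpy as np

psi_AB = np.array([[2.0, 1.0],            # potential on clique {A, B}
                   [1.0, 3.0]])
psi_BC = np.array([[1.5, 0.5],            # potential on clique {B, C}
                   [0.5, 1.5]])

# Unnormalized joint: product of clique potentials, indexed [a, b, c]
unnorm = psi_AB[:, :, None] * psi_BC[None, :, :]
Z = unnorm.sum()                          # partition function
joint = unnorm / Z                        # p(A,B,C) = (1/Z) * prod_i potential(clique_i)
```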

  14. Examples of 3-way Graphical Models
  Markov chain [Figure: A → B → C]: p(A,B,C) = p(C|B) p(B|A) p(A)

  15. Examples of 3-way Graphical Models
  Markov chain [Figure: A → B → C]: p(A,B,C) = p(C|B) p(B|A) p(A)
  Independent causes [Figure: A → C ← B]: p(A,B,C) = p(C|A,B) p(A) p(B)

  16. Hidden Markov Graphical Model
  • Assumption 1: p(Ct | history) = p(Ct | Ct-1)
    • first order Markov assumption on the classes
  • Assumption 2: p(Xt | history, Ct) = p(Xt | Ct)
    • Xt only depends on the current class Ct

  17. Hidden Markov Graphical Model
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
  Notes:
  • all temporal dependence is modeled through the class variable C
  • this is the simplest possible model
  • avoids modeling p(X | other X's)
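The two assumptions pin down the model's parameters: an initial class distribution, a class transition matrix, and a class-conditional density for the features. Below is a minimal sketch with one-dimensional Gaussian emissions; the class and field names are mine, not from the talk.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianHMM:
    pi: np.ndarray      # pi[k]   = p(C_1 = k)
    A: np.ndarray       # A[j, k] = p(C_t = k | C_{t-1} = j)      (Assumption 1)
    mu: np.ndarray      # mu[k], sigma[k] parameterize p(X_t | C_t = k)
    sigma: np.ndarray   #                                          (Assumption 2)

    def emission_probs(self, x):
        """p(x_t | C_t = k) for every t and k; returns a (T, m) array."""
        x = np.asarray(x, dtype=float)[:, None]
        return np.exp(-0.5 * ((x - self.mu) / self.sigma) ** 2) / (
            self.sigma * np.sqrt(2.0 * np.pi))
```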

  18. Generalizations of HMMs
  [Figure: hidden weather states C1 ... CT with observed atmospheric measurements A1 ... AT and observed spatial rainfall R1 ... RT]
  • Hidden state model relating atmospheric measurements to local rainfall
  • "Weather state" couples multiple variables in time and space (Hughes and Guttorp, 1996)
  • Graphical models = a language for spatio-temporal modeling

  19. Exact Probability Propagation (PP) Algorithms
  • Basic PP algorithm: Pearl, 1988; Lauritzen and Spiegelhalter, 1988
  • Assume the graph has no loops
  • Declare 1 node (any node) to be a root
  • Schedule two phases of message-passing:
    • nodes pass messages up to the root
    • messages are distributed back to the leaves
  • (if there are loops, convert the loopy graph to an equivalent tree)

  20. Properties of the PP Algorithm
  • Exact
    • p(node | all data) is recoverable at each node
    • i.e., we get the exact posterior from local message-passing
    • modification: MPE = most likely instantiation of all nodes jointly
  • Efficient
    • complexity: exponential in the size of the largest clique
    • brute force: exponential in all variables

  21. Hidden Markov Graphical Model
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]

  22. PP Algorithm for a HMM
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
  Let CT be the root

  23. PP Algorithm for a HMM
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
  Let CT be the root
  Absorb evidence from the X's (which are fixed)

  24. PP Algorithm for a HMM
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
  Let CT be the root
  Absorb evidence from the X's (which are fixed)
  Forward pass: pass evidence forward from C1

  25. PP Algorithm for a HMM
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
  Let CT be the root
  Absorb evidence from the X's (which are fixed)
  Forward pass: pass evidence forward from C1
  Backward pass: pass evidence backward from CT
  (This is the celebrated "forward-backward" algorithm for HMMs)
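A compact numpy sketch of these steps (my own implementation, with per-step rescaling added to avoid numerical underflow). Here pi is the initial class distribution, A the class transition matrix, and B the (T, m) array of emission likelihoods p(x_t | C_t = k), e.g. from the GaussianHMM sketch above.

```python
import numpy as np

def forward_backward(pi, A, B):
    """Posterior p(C_t = k | x_1 ... x_T) via the forward-backward algorithm.
    pi: (m,) initial class probabilities, A: (m, m) transition matrix,
    B:  (T, m) emission likelihoods p(x_t | C_t = k), already evaluated."""
    T, m = B.shape
    alpha = np.zeros((T, m))
    beta = np.ones((T, m))
    scale = np.zeros(T)

    # Forward pass: absorb evidence and push it forward from C_1
    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass: push evidence backward from C_T
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / scale[t + 1]

    posterior = alpha * beta
    return posterior / posterior.sum(axis=1, keepdims=True)
```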

  26. Comments on F-B Algorithm
  • Complexity = O(T m²)
  • Has been reinvented several times
    • e.g., the BCJR algorithm for error-correcting codes
  • Real-time recursive version:
    • run the algorithm forward to the current time t
    • can propagate backwards to "revise" history

  27. HMMs and Classification

  28. Forward-Backward Algorithm
  • Classification
    • the algorithm produces p(Ct | all other data) at each node
    • to minimize 0-1 loss, choose the most likely class at each t
  • Most likely class sequence?
    • not the same as the sequence of most likely classes
    • can be found instead with Viterbi / dynamic programming: replace the sums in F-B with "max" (see the sketch below)
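A sketch of that max-product variant (my own implementation, done in log space for numerical stability); it takes the same pi, A, B arrays as the forward-backward sketch above and returns the jointly most likely class sequence.

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely class sequence argmax_{c_1..c_T} p(c, x): the forward
    recursion with sums replaced by max, plus a backtracking pass."""
    T, m = B.shape
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[0]
    back = np.zeros((T, m), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + logA          # cand[i, j]: best score ending in j via i
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + logB[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):            # backtrack the best predecessors
        path[t] = back[t + 1, path[t + 1]]
    return path
```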

  29. Supervised HMM Learning
  • Use your favorite classifier to learn p(C|X)
    • i.e., ignore the temporal aspect of the problem (temporarily)
  • Now estimate p(Ct | Ct-1) from labeled training data
  • We have a fully operational HMM
    • no need to use EM for learning if class labels are provided (i.e., do "supervised HMM learning")
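A sketch of that recipe under simplifying assumptions of my own: scalar features, a single labeled sequence, and a class-conditional Gaussian standing in for "your favorite classifier". The transition matrix is estimated by counting transitions between consecutive labels.

```python
import numpy as np

def supervised_hmm_fit(C, X, m):
    """Estimate HMM parameters from one labeled sequence (no EM needed).
    C: (T,) integer class labels, X: (T,) scalar features, m: number of classes."""
    C, X = np.asarray(C), np.asarray(X, dtype=float)
    pi = np.bincount(C, minlength=m) / len(C)        # marginal class frequencies
    A = np.ones((m, m))                              # add-one smoothing on transition counts
    for prev, nxt in zip(C[:-1], C[1:]):
        A[prev, nxt] += 1
    A /= A.sum(axis=1, keepdims=True)                # rows of A are p(C_t | C_{t-1} = j)
    mu = np.array([X[C == k].mean() for k in range(m)])
    sigma = np.array([X[C == k].std(ddof=1) for k in range(m)])
    return pi, A, mu, sigma
```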

  30. Fault Diagnosis Application (Smyth, Pattern Recognition, 1994)
  [Figure: hidden fault classes C1 ... CT and observed features X1 ... XT over time]
  • Fault detection in 34m antenna systems
  • Classes: {normal, short-circuit, tacho problem, ...}
  • Features: AR coefficients measured every 2 seconds
  • Classes are persistent over time

  31. Approach and Results
  • Classifiers
    • Gaussian model and neural network
    • trained on labeled "instantaneous window" data
  • Markov component
    • transition probabilities estimated from MTBF data
  • Results
    • the discriminative neural net was much better than the Gaussian model
    • the Markov component reduced the error rate (all false alarms) from 2% to 0%

  32. Classification with and without the Markov Context
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
  We will compare what happens when
  (a) we just make decisions based on p(Ct | Xt) ("ignore context")
  (b) we use the full Markov context (i.e., use forward-backward to "integrate" temporal information)

  33. Simulation Experiments

  34. Systematic Simulations
  [Figure: observed features X1 ... XT and hidden classes C1 ... CT over time]
  Simulation setup:
  1. Two Gaussian classes, with means 0 and 1 => vary the "separation" via the sigma of the Gaussians
  2. Markov dependence: A = [p 1-p ; 1-p p]; vary p (the self-transition probability) = "strength of context"
  Look at the Bayes error with and without context (a simulation sketch follows).
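Below is a sketch of this setup (my own code, not the experiments from the talk): it generates a two-class symmetric Markov chain with self-transition probability p, draws Gaussian features with means 0 and 1 and common sigma, and compares the error of the context-free rule based on p(Ct | Xt) alone against the forward-backward posterior, reusing the forward_backward sketch given after slide 25.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(p, sigma, T=10_000):
    """Return (error ignoring context, error using context) for one run."""
    A = np.array([[p, 1 - p],
                  [1 - p, p]])
    C = np.zeros(T, dtype=int)
    C[0] = rng.integers(2)
    for t in range(1, T):
        C[t] = C[t - 1] if rng.random() < p else 1 - C[t - 1]
    X = rng.normal(loc=C.astype(float), scale=sigma)

    # Emission likelihoods p(x_t | C_t = k) for class means 0 and 1
    # (the shared Gaussian normalizing constant cancels, so it is omitted)
    mus = np.array([0.0, 1.0])
    B = np.exp(-0.5 * ((X[:, None] - mus) / sigma) ** 2)

    ignore_context = (X > 0.5).astype(int)        # decide from p(C_t | x_t) alone
    # uses forward_backward from the sketch after slide 25
    use_context = forward_backward(np.array([0.5, 0.5]), A, B).argmax(axis=1)
    return (ignore_context != C).mean(), (use_context != C).mean()

# e.g. simulate(p=0.95, sigma=1.0) -> (error without context, error with context)
```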

  35. In summary...
  • Context reduces error
    • greater Markov dependence => greater reduction
  • The reduction is dramatic for p > 0.9
    • e.g., even with minimal Gaussian separation, the Bayes error can be reduced to zero!

  36. Approximate Methods
  • Forward-only
    • necessary in many applications (see the filtering sketch below)
  • "Two nearest-neighbors"
    • only use information from C(t-1) and C(t+1)
  • How suboptimal are these methods?
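For the forward-only case, here is a sketch of the filtering recursion (my own code, same pi/A/B conventions as the forward-backward sketch above): it returns p(C_t | x_1 ... x_t), i.e. the forward pass without the backward "revision" of history.

```python
import numpy as np

def forward_only(pi, A, B):
    """Filtering estimate p(C_t = k | x_1 ... x_t) for each t, shape (T, m)."""
    T, m = B.shape
    alpha = np.zeros((T, m))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]   # predict with A, then absorb x_t
        alpha[t] /= alpha[t].sum()             # renormalize at each step
    return alpha
```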
