Padhraic Smyth Information and Computer Science University of California, Irvine datalab.uci

## A Review of Hidden Markov Models for Context-Based Classification

ICML'01 Workshop on Temporal and Spatial Learning, Williams College, June 28th, 2001

Padhraic Smyth, Information and Computer Science, University of California, Irvine, www.datalab.uci.edu

## Outline

- Context in classification
- Brief review of hidden Markov models
- Hidden Markov models for classification
- Simulation results: how useful is context? (with Dasha Chudova, UCI)

## Historical Note

- "Classification in context" was well studied in pattern recognition in the 1960s and 70s
  - e.g., recursive Markov-based algorithms were proposed before hidden Markov algorithms and models were fully understood
- Applications in:
  - OCR for word-level recognition
  - remote-sensing pixel classification

## Papers of Note

- Raviv, J., "Decision-making in Markov chains applied to the problem of pattern recognition," IEEE Info Theory, 3(4), 1967.
- Hanson, Riseman, and Fisher, "Context in word recognition," Pattern Recognition, 1976.
- Toussaint, G., "The use of context in pattern recognition," Pattern Recognition, 10, 1978.
- Mohn, Hjort, and Storvik, "A simulation study of some contextual classification methods for remotely sensed data," IEEE Trans. Geo. Rem. Sens., 25(6), 1987.

## Context-Based Classification Problems

- Medical diagnosis: classification of a patient's state over time
- Fraud detection: detection of stolen credit cards
- Electronic nose: detection of landmines
- Remote sensing: classification of pixels into ground cover

## Modeling Context

- Common theme = context: class labels (and features) are "persistent" in time/space

[Diagram: observed features X1, X2, ..., XT over time, each emitted by the corresponding hidden class in the chain C1 → C2 → ... → CT]

## Feature Windows

- Predict Ct using a window of features, e.g., f(Xt, Xt-1, Xt-2)
- e.g., the NETtalk application

## Alternative: Probabilistic Modeling

- E.g., assume p(Ct | history) = p(Ct | Ct-1)
  - a first-order Markov assumption on the classes

## Graphical Models

- Basic idea: p(U) <=> an annotated graph
  - let U be a set of random variables of interest
  - 1-1 mapping from U to nodes in a graph
  - the graph encodes the "independence structure" of the model
  - numerical specifications of p(U) are stored locally at the nodes

## Acyclic Directed Graphical Models (aka Belief/Bayesian Networks)

- Example: for a graph with edges A → C and B → C, p(A,B,C) = p(C|A,B) p(A) p(B)
- In general, p(X1, X2, ..., XN) = ∏ p(Xi | parents(Xi))

## Undirected Graphical Models (UGs)

- p(X1, X2, ..., XN) = ∏ potential(clique i)
- Undirected edges reflect correlational dependencies
  - e.g., particles in physical systems, pixels in an image
- Also known as Markov random fields, Boltzmann machines, etc.

## Examples of 3-Way Graphical Models

- Markov chain A → B → C: p(A,B,C) = p(C|B) p(B|A) p(A)
- Independent causes A → C ← B: p(A,B,C) = p(C|A,B) p(A) p(B)

## Hidden Markov Graphical Model

- Assumption 1: p(Ct | history) = p(Ct | Ct-1)
  - first-order Markov assumption on the classes
- Assumption 2: p(Xt | history, Ct) = p(Xt | Ct)
  - Xt depends only on the current class Ct
- Notes:
  - all temporal dependence is modeled through the class variable C
  - this is the simplest possible model
  - it avoids modeling p(X | other X's)

## Generalizations of HMMs

- Hidden state model relating atmospheric measurements to local rainfall: hidden states C1, ..., CT, with observed atmospheric measurements A1, ..., AT and observed spatial rainfall R1, ..., RT
- The "weather state" couples multiple variables in time and space (Hughes and Guttorp, 1996)
- Graphical models = a language for spatio-temporal modeling

## Exact Probability Propagation (PP) Algorithms

- Basic PP algorithm (Pearl, 1988; Lauritzen and Spiegelhalter, 1988):
  - assume the graph has no loops
  - declare one node (any node) to be the root
  - schedule two phases of message-passing:
    - nodes pass messages up to the root
    - messages are distributed back to the leaves
  - (if there are loops, convert the loopy graph to an equivalent tree)

## Properties of the PP Algorithm

- Exact:
  - p(node | all data) is recoverable at each node, i.e., we get exact posteriors from local message-passing
  - modification: MPE = most likely instantiation of all nodes jointly
- Efficient:
  - complexity: exponential in the size of the largest clique
  - brute force: exponential in all variables

## PP Algorithm for an HMM

- Let CT be the root
- Absorb evidence from the X's (which are fixed)
- Forward pass: pass evidence forward from C1
- Backward pass: pass evidence backward from CT
- (This is the celebrated "forward-backward" algorithm for HMMs)

## Comments on the F-B Algorithm

- Complexity = O(T m^2) for T time steps and m classes
- Has been reinvented several times, e.g., the BCJR algorithm for error-correcting codes
- Real-time recursive version:
  - run the algorithm forward to the current time t
  - can propagate backwards to "revise" history

## Forward-Backward Algorithm

- Classification:
  - the algorithm produces p(Ct | all other data) at each node
  - to minimize 0-1 loss, choose the most likely class at each t
- Most likely class sequence?
  - not the same as the sequence of most likely classes
  - found instead with Viterbi/dynamic programming: replace the sums in F-B with "max"

## Supervised HMM Learning

- Use your favorite classifier to learn p(C|X), i.e., ignore the temporal aspect of the problem (temporarily)
- Then estimate p(Ct | Ct-1) from labeled training data
- The result is a fully operational HMM: no need to use EM for learning if class labels are provided (i.e., "supervised HMM learning")

## Fault Diagnosis Application (Smyth, Pattern Recognition, 1994)

- Fault detection in 34m antenna systems
- Classes: {normal, short-circuit, tachometer problem, ...}
- Features: AR coefficients measured every 2 seconds
- Classes are persistent over time

## Approach and Results

- Classifiers:
  - a Gaussian model and a neural network
  - trained on labeled "instantaneous window" data
- Markov component:
  - transition probabilities estimated from MTBF data
- Results:
  - the discriminative neural net was much better than the Gaussian model
  - the Markov component reduced the error rate (all false alarms) from 2% to 0%

## Classification With and Without the Markov Context

We compare what happens when (a) we make decisions based on p(Ct | Xt) alone ("ignore context"), and (b) we use the full Markov context, i.e., use forward-backward to "integrate" temporal information.

## Systematic Simulations

Simulation setup:

1. Two Gaussian classes, with means 0 and 1; vary the "separation" via the sigma of the Gaussians.
2. Markov dependence A = [p, 1-p; 1-p, p]; vary p (the self-transition probability) = the "strength of context".

Look at the Bayes error with and without context.

## In Summary...

- Context reduces error: greater Markov dependence => greater reduction
- The reduction is dramatic for p > 0.9
  - e.g., even with minimal Gaussian separation, the Bayes error can be reduced to zero!

## Approximate Methods

- Forward-only: necessary in many applications
- "Two nearest neighbors": only use information from C(t-1) and C(t+1)
- How suboptimal are these methods?
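The simulation setup above can be sketched in code. The following is a minimal Python illustration, not the talk's actual experiment: two Gaussian classes with means 0 and 1, a symmetric transition matrix A = [[p, 1-p], [1-p, p]], and a comparison of per-time-step classification from p(Ct | Xt) alone against forward-backward smoothing. The particular values of T, p, and sigma are illustrative assumptions.

```python
# Sketch: compare "ignore context" vs. forward-backward smoothing
# for a 2-state HMM with Gaussian emissions (means 0 and 1, shared sigma).
import numpy as np

rng = np.random.default_rng(0)

def simulate(T, p, sigma):
    """Sample a class sequence from the 2-state chain and its Gaussian features."""
    A = np.array([[p, 1 - p], [1 - p, p]])
    means = np.array([0.0, 1.0])
    C = np.empty(T, dtype=int)
    C[0] = rng.integers(2)
    for t in range(1, T):
        C[t] = rng.choice(2, p=A[C[t - 1]])
    X = rng.normal(means[C], sigma)
    return C, X

def forward_backward(X, p, sigma):
    """Smoothed posteriors p(C_t | X_1..X_T) for the 2-state Gaussian HMM."""
    T = len(X)
    A = np.array([[p, 1 - p], [1 - p, p]])
    means = np.array([0.0, 1.0])
    # Emission likelihoods p(X_t | C_t = k), up to a constant factor.
    B = np.exp(-0.5 * ((X[:, None] - means[None, :]) / sigma) ** 2)
    alpha = np.empty((T, 2)); beta = np.empty((T, 2))
    alpha[0] = 0.5 * B[0]; alpha[0] /= alpha[0].sum()   # uniform prior on C_1
    for t in range(1, T):                               # forward pass
        alpha[t] = B[t] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()                      # rescale for stability
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                      # backward pass
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

T, p, sigma = 2000, 0.95, 1.0                 # illustrative settings
C, X = simulate(T, p, sigma)
no_context = (X > 0.5).astype(int)            # MAP from p(C_t | X_t) alone
with_context = forward_backward(X, p, sigma).argmax(axis=1)
print("error, no context:  ", np.mean(no_context != C))
print("error, with context:", np.mean(with_context != C))
```

With strong self-transitions (p close to 1) and poorly separated classes, the smoothed error is typically well below the context-free error, matching the trend reported in the simulations.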