Padhraic Smyth Information and Computer Science University of California, Irvine datalab.uci

A Review of Hidden Markov Models for Context-Based Classification • ICML’01 Workshop on Temporal and Spatial Learning, Williams College, June 28th 2001 • Padhraic Smyth, Information and Computer Science, University of California, Irvine • www.datalab.uci.edu

Outline • Context in classification • Brief review of hidden Markov models • Hidden Markov models for classification • Simulation results: how useful is context? • (with Dasha Chudova, UCI)

Historical Note • “Classification in Context” was well-studied in pattern recognition in the 1960s and 1970s • e.g., recursive Markov-based algorithms were proposed before hidden Markov algorithms and models were fully understood • Applications in • OCR for word-level recognition • remote-sensing pixel classification

Papers of Note • Raviv, J., “Decision-making in Markov chains applied to the problem of pattern recognition,” IEEE Info Theory, 3(4), 1967 • Hanson, Riseman, and Fisher, “Context in word recognition,” Pattern Recognition, 1976 • Toussaint, G., “The use of context in pattern recognition,” Pattern Recognition, 10, 1978 • Mohn, Hjort, and Storvik, “A simulation study of some contextual classification methods for remotely sensed data,” IEEE Trans Geo. Rem. Sens., 25(6), 1987

Context-Based Classification Problems • Medical Diagnosis • classification of a patient’s state over time • Fraud Detection • detection of stolen credit card • Electronic Nose • detection of landmines • Remote Sensing • classification of pixels into ground cover

Modeling Context • Common Theme = Context • class labels (and features) are “persistent” in time/space • [Figure: observed features X1, X2, X3, ..., XT paired with hidden classes C1, C2, C3, ..., CT along a time axis]

Feature Windows • Predict Ct using a window, e.g., f(Xt, Xt-1, Xt-2) • e.g., NETtalk application • [Figure: observed features X1, X2, X3, ..., XT paired with hidden classes C1, C2, C3, ..., CT along a time axis]
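As a rough illustration of the windowing idea, the sketch below builds the inputs (Xt, Xt-1, Xt-2) that a windowed classifier f would see. The function name `window_features` and the toy feature values are hypothetical, chosen only for this example; any classifier could then be trained on the resulting tuples.

```python
def window_features(xs, width=3):
    """For each t >= width - 1, return the tuple (x[t], x[t-1], ..., x[t-width+1])."""
    return [tuple(xs[t - k] for k in range(width))
            for t in range(width - 1, len(xs))]

# Toy scalar feature sequence (illustrative values only).
xs = [0.1, 0.4, 0.9, 1.2, 0.8]
windows = window_features(xs)
for w in windows:
    print(w)
```

Note that the first width - 1 time steps have no full window, so a windowed classifier either skips them or needs some padding convention.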

Alternative: Probabilistic Modeling • E.g., assume p(Ct | history) = p(Ct | Ct-1) • first order Markov assumption on the classes • [Figure: observed features X1, X2, X3, ..., XT paired with hidden classes C1, C2, C3, ..., CT along a time axis]

Brief review of hidden Markov models (HMMs)

Graphical Models • Basic Idea: p(U) <=> an annotated graph • Let U be a set of random variables of interest • 1-1 mapping from U to nodes in a graph • graph encodes “independence structure” of model • numerical specifications of p(U) are stored locally at the nodes

Acyclic Directed Graphical Models (aka belief/Bayesian networks) • [Figure: nodes A and B, each with a directed edge into C] • p(A,B,C) = p(C|A,B) p(A) p(B) • In general, $p(X_1, X_2, \ldots, X_N) = \prod_{i=1}^{N} p(X_i \mid \mathrm{parents}(X_i))$

p(X1, X2,....XN) =potential(clique i) Undirected Graphical Models (UGs) • Undirected edges reflect correlational dependencies • e.g., particles in physical systems, pixels in an image • Also known as Markov random fields, Boltzmann machines, etc

Examples of 3-way Graphical Models • Markov chain (A → B → C): p(A,B,C) = p(C|B) p(B|A) p(A) • Independent causes (A → C ← B): p(A,B,C) = p(C|A,B) p(A) p(B)

Hidden Markov Graphical Model • Assumption 1: • p(Ct | history) = p(Ct | Ct-1) • first order Markov assumption on the classes • Assumption 2: • p(Xt | history, Ct ) = p(Xt | Ct) • Xt only depends on current class Ct
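Taken together, the two assumptions imply a factorization of the joint distribution over all classes and features. The worked equation below spells this out; the initial distribution p(C_1) is not named on the slide and is introduced here as an assumption needed to start the chain.

```latex
p(C_1, \ldots, C_T, X_1, \ldots, X_T)
  = p(C_1) \, \prod_{t=2}^{T} p(C_t \mid C_{t-1}) \, \prod_{t=1}^{T} p(X_t \mid C_t)
```

The first product comes from Assumption 1 and the second from Assumption 2.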

Hidden Markov Graphical Model • [Figure: observed features X1, X2, X3, ..., XT paired with hidden classes C1, C2, C3, ..., CT along a time axis] • Notes: • all temporal dependence is modeled through the class variable C • this is the simplest possible model • avoids modeling p(X | other X’s)

Generalizations of HMMs • [Figure: hidden states C1, C2, C3, ..., CT linked to observed spatial rainfall R1, R2, R3, ..., RT and observed atmospheric measurements A1, A2, A3, ..., AT] • Hidden state model relating atmospheric measurements to local rainfall • “Weather state” couples multiple variables in time and space (Hughes and Guttorp, 1996) • Graphical models = language for spatio-temporal modeling

Exact Probability Propagation (PP) Algorithms • Basic PP Algorithm • Pearl, 1988; Lauritzen and Spiegelhalter, 1988 • Assume the graph has no loops • Declare 1 node (any node) to be a root • Schedule two phases of message-passing • nodes pass messages up to the root • messages are distributed back to the leaves • (if loops, convert loopy graph to an equivalent tree)

Properties of the PP Algorithm • Exact • p(node|all data) is recoverable at each node • i.e., we get exact posterior from local message-passing • modification: MPE = most likely instantiation of all nodes jointly • Efficient • Complexity: exponential in size of largest clique • Brute force: exponential in all variables

PP Algorithm for a HMM • [Figure: observed features X1, X2, X3, ..., XT paired with hidden classes C1, C2, C3, ..., CT] • Let CT be the root • Absorb evidence from X’s (which are fixed) • Forward pass: pass evidence forward from C1 • Backward pass: pass evidence backward from CT • (This is the celebrated “forward-backward” algorithm for HMMs)

Comments on F-B Algorithm • Complexity = O(T m^2) for T time steps and m classes • Has been reinvented several times • e.g., BCJR algorithm for error-correcting codes • Real-time recursive version • run algorithm forward to current time t • can propagate backwards to “revise” history
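The forward and backward passes can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the talk: the function name `forward_backward`, the argument layout, and the toy numbers are all assumptions. Each time step costs O(m^2) work, which gives the O(T m^2) total.

```python
def forward_backward(init, trans, lik):
    """Return p(C_t = i | X_1..X_T) for every t, for a discrete-state HMM.

    init[i]     : p(C_1 = i)
    trans[i][j] : p(C_t = j | C_{t-1} = i)
    lik[t][i]   : p(X_t | C_t = i), i.e. evidence already absorbed from the X's
    """
    T, m = len(lik), len(init)

    # Forward pass: alpha[t][i] is proportional to p(C_t = i, X_1..X_t).
    alpha = []
    for t in range(T):
        if t == 0:
            a = [init[i] * lik[0][i] for i in range(m)]
        else:
            a = [lik[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(m))
                 for j in range(m)]
        s = sum(a)
        alpha.append([v / s for v in a])  # rescale each step to avoid underflow

    # Backward pass: beta[t][i] is proportional to p(X_{t+1}..X_T | C_t = i).
    beta = [[1.0] * m for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b = [sum(trans[i][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(m))
             for i in range(m)]
        s = sum(b)
        beta[t] = [v / s for v in b]

    # Combine the two messages and normalize at each node.
    post = []
    for t in range(T):
        g = [alpha[t][i] * beta[t][i] for i in range(m)]
        s = sum(g)
        post.append([v / s for v in g])
    return post

# Toy two-class example: a "sticky" chain and weak local evidence.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
lik = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.6, 0.4]]
post = forward_backward(init, trans, lik)
for t, row in enumerate(post):
    print(t, [round(v, 3) for v in row])
```

In this toy example the local evidence at the third time step favors class 1, but the surrounding context pulls the posterior back toward class 0.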

HMMs and Classification

Forward-Backward Algorithm • Classification • Algorithm produces p(Ct|all other data) at each node • to minimize 0-1 loss • choose most likely class at each t • Most likely class sequence? • Not the same as the sequence of most likely classes • can be found instead with Viterbi/dynamic programming • replace sums in F-B with “max”
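A minimal sketch of the Viterbi recursion follows, obtained by replacing the sums of the forward pass with a max and keeping back-pointers. The function name `viterbi` and the toy numbers are assumptions for illustration, and the code assumes every probability is strictly positive so that the logarithms are defined.

```python
import math

def viterbi(init, trans, lik):
    """Return the single most likely class sequence C_1..C_T.

    init[i]     : p(C_1 = i)
    trans[i][j] : p(C_t = j | C_{t-1} = i)
    lik[t][i]   : p(X_t | C_t = i)
    Assumes all probabilities are > 0 (log domain is used for stability).
    """
    T, m = len(lik), len(init)
    # delta[i]: log-probability of the best path that ends in class i at time t.
    delta = [math.log(init[i]) + math.log(lik[0][i]) for i in range(m)]
    back = []
    for t in range(1, T):
        new_delta, ptr = [], []
        for j in range(m):
            best_i = max(range(m), key=lambda i: delta[i] + math.log(trans[i][j]))
            ptr.append(best_i)
            new_delta.append(delta[best_i] + math.log(trans[best_i][j])
                             + math.log(lik[t][j]))
        delta = new_delta
        back.append(ptr)

    # Trace the back-pointers from the best final class.
    state = max(range(m), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
lik = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.6, 0.4]]
path = viterbi(init, trans, lik)
print(path)
```

The output is a jointly optimal sequence, whereas taking the argmax of the forward-backward posteriors optimizes each time step separately.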

Supervised HMM learning • Use your favorite classifier to learn p(C|X) • i.e., ignore temporal aspect of problem (temporarily) • Now, estimate p(Ct | Ct-1) from labeled training data • We have a fully operational HMM • no need to use EM for learning if class labels are provided (i.e., do “supervised HMM learning”)
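The two supervised steps can be sketched as follows. Transition probabilities are estimated by counting labeled class pairs; the pseudo-count smoothing is an assumption added here, not something the slide specifies. The second helper reflects one common way (again an assumption) to plug a classifier's p(C|X) into an HMM: by Bayes' rule, p(X|C) is proportional to p(C|X) / p(C), and the forward-backward recursions only need the likelihoods up to a constant per time step. Both function names are hypothetical.

```python
def estimate_transitions(label_seqs, m, pseudo=1.0):
    """Estimate p(C_t = j | C_{t-1} = i) by counting consecutive label pairs.

    label_seqs : list of labeled class sequences (classes coded 0..m-1)
    pseudo     : pseudo-count added to every cell (assumed smoothing choice);
                 with pseudo=0.0 a class that never occurs causes a division by zero
    """
    counts = [[pseudo] * m for _ in range(m)]
    for seq in label_seqs:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return [[c / sum(row) for c in row] for row in counts]

def scaled_likelihoods(class_posteriors, priors):
    """Turn classifier outputs p(C | X_t) into values proportional to p(X_t | C)."""
    return [[q / pr for q, pr in zip(row, priors)] for row in class_posteriors]

labels = [[0, 0, 0, 1, 1, 1, 0, 0]]
A_hat = estimate_transitions(labels, m=2, pseudo=0.0)
print(A_hat)

scaled = scaled_likelihoods([[0.8, 0.2]], priors=[0.5, 0.5])
print(scaled)
```

The scaled likelihoods can be passed wherever p(Xt | Ct) is expected, because any factor that is constant across classes at a given time step cancels when the posteriors are normalized.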

Fault Diagnosis Application (Smyth, Pattern Recognition, 1994) • [Figure: observed features X1, X2, X3, ..., XT paired with hidden fault classes C1, C2, C3, ..., CT] • Fault Detection in 34m Antenna Systems • Classes: {normal, short-circuit, tacho problem, ..} • Features: AR coefficients measured every 2 seconds • Classes are persistent over time

Approach and Results • Classifiers • Gaussian model and neural network • trained on labeled “instantaneous window” data • Markov component • transition probabilities estimated from MTBF data • Results • discriminative neural net much better than Gaussian • Markov component reduced the error rate (all false alarms) from 2% to 0%.

Classification with and without the Markov context • [Figure: observed features X1, X2, X3, ..., XT paired with hidden classes C1, C2, C3, ..., CT] • We will compare what happens when • (a) we just make decisions based on p(Ct | Xt) (“ignore context”) • (b) we use the full Markov context (i.e., use forward-backward to “integrate” temporal information)

Simulation Experiments

Systematic Simulations • [Figure: observed features X1, X2, X3, ..., XT paired with hidden classes C1, C2, C3, ..., CT] • Simulation Setup • 1. Two Gaussian classes, at mean 0 and mean 1 => vary “separation” = sigma of the Gaussians • 2. Markov dependence: A = [p 1-p ; 1-p p] • vary p (self-transition) = “strength of context” • Look at Bayes error with and without context
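A small Monte Carlo version of this setup can be sketched as below. It is not the code behind the talk's results: the sequence length, the seed, the values p = 0.95 and sigma = 1.0, and all function names are assumptions for illustration. It estimates the empirical error rate of the per-sample rule and of the rule based on forward-backward posteriors on one simulated sequence.

```python
import math
import random

def simulate(T, p, sigma, rng):
    """Sample hidden classes and Gaussian features (class means 0 and 1)."""
    c = rng.randrange(2)  # uniform start, the stationary distribution of A
    classes, features = [], []
    for _ in range(T):
        classes.append(c)
        features.append(rng.gauss(float(c), sigma))
        if rng.random() > p:  # leave the current class with probability 1 - p
            c = 1 - c
    return classes, features

def likelihoods(x, sigma):
    """Values proportional to p(x | class) for both classes (max rescaled to 1)."""
    logs = [-(x - mu) ** 2 / (2.0 * sigma ** 2) for mu in (0.0, 1.0)]
    top = max(logs)
    return [math.exp(v - top) for v in logs]

def posteriors(xs, p, sigma):
    """Forward-backward posteriors p(C_t | X_1..X_T) for A = [p 1-p ; 1-p p]."""
    A = [[p, 1.0 - p], [1.0 - p, p]]
    lik = [likelihoods(x, sigma) for x in xs]
    T = len(xs)
    alpha, prev = [], [0.5, 0.5]
    for t in range(T):
        pred = [0.5, 0.5] if t == 0 else [
            sum(prev[i] * A[i][j] for i in range(2)) for j in range(2)]
        a = [pred[j] * lik[t][j] for j in range(2)]
        s = sum(a)
        prev = [v / s for v in a]
        alpha.append(prev)
    beta, post = [1.0, 1.0], [None] * T
    for t in range(T - 1, -1, -1):
        g = [alpha[t][i] * beta[i] for i in range(2)]
        s = sum(g)
        post[t] = [v / s for v in g]
        # Message passed back to time t - 1.
        b = [sum(A[i][j] * lik[t][j] * beta[j] for j in range(2)) for i in range(2)]
        s = sum(b)
        beta = [v / s for v in b]
    return post

def error_rate(decisions, truth):
    return sum(d != c for d, c in zip(decisions, truth)) / len(truth)

rng = random.Random(0)
p, sigma = 0.95, 1.0
classes, features = simulate(5000, p, sigma, rng)

# (a) Ignore context: with equal priors and variances, threshold at the midpoint.
no_context = [1 if x > 0.5 else 0 for x in features]
# (b) Use context: pick the most probable class under the full posterior.
with_context = [0 if q[0] >= q[1] else 1 for q in posteriors(features, p, sigma)]

err_no_context = error_rate(no_context, classes)
err_context = error_rate(with_context, classes)
print("error without context:", err_no_context)
print("error with context:   ", err_context)
```

Repeating the run over a grid of p and sigma values would trace out the kind of comparison the setup describes, although these are empirical error rates on finite samples rather than exact Bayes errors.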

In summary…. • Context reduces error • greater Markov dependence => greater reduction • Reduction is dramatic for p>0.9 • e.g., even with minimal Gaussian separation, Bayes error can be reduced to zero!!

Approximate Methods • Forward-Only: • necessary in many applications • “Two nearest-neighbors” • only use information from C(t-1) and C(t+1) • How suboptimal are these methods?
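The forward-only method can be sketched as the forward pass on its own, which yields the filtered posterior p(Ct | X1..Xt) using only data observed so far. The function name `forward_filter` and the toy numbers are assumptions for illustration.

```python
def forward_filter(init, trans, lik):
    """Return p(C_t = i | X_1..X_t) for every t (no backward pass).

    init[i]     : p(C_1 = i)
    trans[i][j] : p(C_t = j | C_{t-1} = i)
    lik[t][i]   : p(X_t | C_t = i)
    """
    m = len(init)
    filtered, prev = [], None
    for t, row in enumerate(lik):
        if t == 0:
            pred = list(init)
        else:
            # One-step prediction from the previous filtered estimate.
            pred = [sum(prev[i] * trans[i][j] for i in range(m)) for j in range(m)]
        a = [pred[j] * row[j] for j in range(m)]
        s = sum(a)
        prev = [v / s for v in a]
        filtered.append(prev)
    return filtered

init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
lik = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.6, 0.4]]
filtered = forward_filter(init, trans, lik)
for t, row in enumerate(filtered):
    print(t, [round(v, 3) for v in row])
```

Because each estimate depends only on the past, it can be updated online as each new Xt arrives, at the price of ignoring whatever future observations would have revealed.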
