Hidden Markov Models

Ting-Yu Mu Wooyoung Lee Hidden Markov Models

Markov Model Hidden Markov Model Evaluation problem Decoding Problem Learning Problem Index

Markov Model • Set of states: • Process moves from one state to another generating a sequence of states : • Markov chain property: probability of each subsequent state depends only on what was the previous state: • To define Markov model, the following probabilities have to be specified: • transition probabilities • and initial probabilities

Example of Markov Model 0.3 0.7 Rain Dry 0.2 0.8 • Two states : ‘Rain’ and ‘Dry’. • Transition probabilities: P(‘Rain’|‘Rain’)=0.3 , P(‘Dry’|‘Rain’)=0.7 , P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8 • Initial probabilities: say P(‘Rain’)=0.4 , P(‘Dry’)=0.6 .

By Markov chain property, probability of state sequence can be found by the formula: • Suppose we want to calculate a probability of a sequence of states in example, {‘Dry’,’Dry’,’Rain’,Rain’}. P({‘Dry’,’Dry’,’Rain’,Rain’} ) = P(‘Rain’|’Rain’) P(‘Rain’|’Dry’) P(‘Dry’|’Dry’) P(‘Dry’)= = 0.3*0.2*0.8*0.6 Calculation of sequence probability

Hidden Markov models. • Set of states: • Process moves from one state to another generating a sequence of states : • Markov chain property: probability of each subsequent state depends only on what was the previous state: • States are not visible, but each state randomly generates one of M observations (or visible states) • To define hidden Markov model, the following probabilities have to be specified: matrix of transition probabilities A=(aij), aij= P(si| sj) , matrix of observation probabilities B=(bi (vm)), bi(vm)= P(vm| si) and a vector of initial probabilities =(i), i= P(si) . Model is represented by M=(A, B, ).

Example of Hidden Markov Model 0.3 0.7 Low High 0.2 0.8 Rain 0.6 0.6 0.4 0.4 Dry

Example of Hidden Markov Model • Two states : ‘Low’ and ‘High’ atmospheric pressure. • Two observations : ‘Rain’ and ‘Dry’. • Transition probabilities: P(‘Low’|‘Low’)=0.3 , • P(‘High’|‘Low’)=0.7 , • P(‘Low’|‘High’)=0.2, • P(‘High’|‘High’)=0.8 • Observation probabilities : P(‘Rain’|‘Low’)=0.6 , • P(‘Dry’|‘Low’)=0.4 , • P(‘Rain’|‘High’)=0.4 , • P(‘Dry’|‘High’)=0.3 . • Initial probabilities: say P(‘Low’)=0.4 , P(‘High’)=0.6 .

Calculation of observation sequence probability • Suppose we want to calculate a probability of a sequence of observations in our example, {‘Dry’,’Rain’}. • Consider all possible hidden state sequences: • P({‘Dry’,’Rain’} ) = P({‘Dry’,’Rain’} , {‘Low’,’Low’}) + P({‘Dry’,’Rain’} , {‘Low’,’High’}) + P({‘Dry’,’Rain’} , {‘High’,’Low’}) + P({‘Dry’,’Rain’} , {‘High’,’High’}) • where first term is : • P({‘Dry’,’Rain’} , {‘Low’,’Low’})= • P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’}) = • P(‘Dry’|’Low’)P(‘Rain’|’Low’) P(‘Low’)P(‘Low’|’Low) • = 0.4*0.6*0.4*0.3

Main issues using HMMs • Evaluation problem • Given the HMM M=(A, B, ) and the observation sequence O=o1 o2 ... oK, calculate the probability that model M has generated sequence O. • Decoding problem • Given the HMM M=(A, B, ) and the observation sequence O=o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence O. • Learning problem • Given some training observation sequences O=o1 o2 ... oK and general structure of HMM (numbers of hidden and visible states), determine HMM parameters M=(A, B, ) that best fit training data. • O=o1...oKdenotes a sequence of observations ok{v1,…,vM}.

Evaluation Problem. • Given the HMM M=(A, B, ) and the observation sequence O=o1 o2 ... oK, calculate the probability that model M has generated sequence O. • Trying to find probability of observations O=o1 o2 ... oK by means of considering all hidden state sequences (as was done in example) is impractical: • NK hidden state sequences - exponential complexity. • Use Forward-Backward HMM algorithms for efficient calculations. • Define the forward variable k(i) as the joint probability of the partial observation sequence o1 o2 ... ok and that the hidden state at time k is si: k(i)= P(o1 o2 ... ok , qk=si)

Trellis representation of an HMM s1 sN s2 si sN s2 si s1 s1 s2 sj s1 s2 si sN sN o1 okok+1 oK = Observations a1j a2j aij aNj Time= 1 k k+1 K

Forward recursion for HMM • Initialization: • 1(i)= P(o1 , q1=si) = ibi (o1) , 1<=i<=N. • Forward recursion: • k+1(i)= P(o1 o2 ... ok+1 , qk+1=sj) = • iP(o1 o2 ... ok+1 , qk=si, qk+1=sj) = • iP(o1 o2 ... ok , qk=si) aijbj(ok+1 ) = • [i k(i) aij ] bj(ok+1 ) , 1<=j<=N, 1<=k<=K-1. • Termination: • P(o1 o2 ... oK) = iP(o1 o2 ... oK , qK=si) = i K(i) • Complexity : • N2K operations.

Backward recursion for HMM • Define the forward variable k(i) as the joint probability of the partial observation sequence ok+1 ok+2 ... oKgiven that the hidden state at time k is si: k(i)= P(ok+1 ok+2 ... oK|qk=si) • Initialization: • K(i)= 1 , 1<=i<=N. • Backward recursion: • k(j)= P(ok+1 ok+2 ... oK|qk=sj) = • iP(ok+1 ok+2 ... oK , qk+1=si|qk=sj) = • iP(ok+2 ok+3 ... oK|qk+1=si) aji bi (ok+1 ) = • i k+1(i) aji bi (ok+1 ) , 1<=j<=N, 1<=k<=K-1. • Termination: • P(o1 o2 ... oK) = iP(o1 o2 ... oK , q1=si) = • iP(o1 o2 ... oK|q1=si) P(q1=si) = i 1(i) bi (o1) i

Decoding problem • Given the HMM M=(A, B, ) and the observation sequence O=o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence. • We want to find the state sequence Q= q1…qK which maximizes P(Q | o1 o2 ... oK) , or equivalently P(Q , o1 o2 ... oK) . • Brute force consideration of all paths takes exponential time. • Use efficient Viterbi algorithm instead. • Define variable k(i) as the maximum probability of producing observation sequence o1 o2 ... ok when moving along any hidden state sequence q1… qk-1 and getting into qk= si . • k(i) = max P(q1… qk-1 ,qk= si,o1 o2 ... ok) • where max is taken over all possible paths q1… qk-1 .

Viterbi algorithm (1) s1 si sN sj qk-1 qk a1j aij aNj • General idea: • if best path ending in qk= sjgoes through qk-1= sithen it should coincide with best path ending in qk-1= si. • k(i) = max P(q1… qk-1 ,qk= sj,o1 o2 ... ok) = • maxi [ aijbj(ok ) max P(q1… qk-1= si,o1 o2 ... ok-1) ] • To backtrack best path keep info that predecessor of sj was si.

Viterbi algorithm (2) • Initialization: • 1(i) = max P(q1= si,o1) = ibi (o1) , 1<=i<=N. • Forward recursion: • k(j) = max P(q1… qk-1 ,qk= sj,o1 o2 ... ok) = • maxi [ aij bj (ok ) max P(q1… qk-1= si,o1 o2 ... ok-1) ]= • maxi [ aij bj (ok ) k-1(i) ] , 1<=j<=N, 2<=k<=K. • Termination: choose best path ending at time K • maxi [ K(i) ] • Backtrack best path. This algorithm is similar to the forward recursion of evaluation problem, with  replaced by max and additional backtracking.

Learning problem (1) • Given some training observation sequences O=o1 o2 ... oK and general structure of HMM (numbers of hidden and visible states), determine HMM parameters M=(A, B, ) that best fit training data, that is maximizes P(O |M) . • There is no algorithm producing optimal parameter values. • Use iterative expectation-maximization algorithm to find local maximum of • P(O |M) - Baum-Welch algorithm.

Learning problem (2) Number of times observation vm occurs in state si Number of times in state si bi(vm )= P(vm| si)= • If training data has information about sequence of hidden states, then use maximum likelihood estimation of parameters: • aij= P(si| sj) = Number of transitions from state sj to state si Number of transitions out of state sj

Baum-Welch algorithm aij= P(si| sj) = Expected number of transitions from statesjto statesi Expected number of transitions out of statesj Expected number of times observationvmoccurs in statesi Expected number of times in statesi bi(vm )= P(vm| si)= i = P(si) = Expected frequency in statesiat time k=1. General idea:

Baum-Welch algorithm:expectation step(1) • Define variable k(i,j) as the probability of being in state si at time k and in state sj at time k+1, given the observation sequence o1 o2 ... oK. • k(i,j)= P(qk= si,qk+1= sj|o1 o2 ... oK) P(qk= si,qk+1= sj,o1 o2 ... ok) P(o1 o2 ... ok) k(i,j)= = P(qk= si,o1 o2 ... ok) aij bj (ok+1 ) P(ok+2 ... oK |qk+1= sj) P(o1 o2 ... ok) = k(i)aij bj (ok+1 )k+1(j) i jk(i)aij bj (ok+1 )k+1(j)

Baum-Welch algorithm: expectation step(2) k(i)k(i) i k(i)k(i) P(qk= si,o1 o2 ... ok) P(o1 o2 ... ok) k(i)= = • Define variable k(i) as the probability of being in state si at time k, given the observation sequence o1 o2 ... oK . • k(i)= P(qk= si|o1 o2 ... oK)

Baum-Welch algorithm: maximization step k k(i,j) k k(i) = Expected number of transitions from statesjto statesi Expected number of transitions out of statesj aij = k k(i,j) k,ok= vmk(i) Expected number of times observation vmoccurs in state si Expected number of times in state si bi(vm )= = i= (Expected frequency in statesiat time k=1) =1(i).

An HMM-Based Threshold Model Approach for Gesture Recognition Pattern Analysis and Machine Intelligence

Baum-Welch algorithm: expectation step(3) • We calculated k(i,j) = P(qk= si,qk+1= sj|o1 o2 ... oK) • and k(i)= P(qk= si|o1 o2 ... oK) • Expected number of transitions from state si to state sj = • = k k(i,j) • Expected number of transitions out of state si = k k(i) • Expected number of times observation vm occurs in state si = • = k k(i) , k is such that ok= vm • Expected frequency in state siat time k=1 : 1(i) .

Human Gesture: • A space of motion expressed by the body • Hand Gesture: • Most expressive and frequently used • Reason for research: • The interaction using hand gesture as the alternative form of human-computer interface • Replace the traditional input devices • XBOX360: Kinect Introduction

Pattern Spotting: • Locating meaningful patterns from a stream of input signal • Critical to locate: • Start point/end point of a gesture pattern • Difficult tasks due to: • Segmentation ambiguity • Spatiotemporal variability Introduction

One of the difficulties: • Not knowing the gesture boundaries: • Reference patterns need to be matched with all possible segments of input signal Introduction

Another difficulties: • Same gesture varies dynamically in: • Shape of the gesture • Duration of the gesture • An ideal recognizer: • Can extract gesture segments from the continuous input signal • Allowing a wide range of spatiotemporal variability Introduction

Hidden Markov Models (HMM): • A useful tool for modeling the spatiotemporal variability of gestures • Has limitation in representing non-gesture patterns: • Impossible to obtain the set of non-gesture training patterns • The number of meaningless motion can be: infinite • The paper introduced the Threshold model to solve above issue Introduction

Gesture spotting procedure: • The gesture in this paper: • The spatiotemporal sequence of feature vectors consist of the direction of hand movement Introduction

The paper chose the 10 most common used browsing command of PowerPoint Research background

For each gesture: • Designed a model using left-right HMM utilizing the temporal characteristics of gesture signals • The number of states in the model is determined based on the complexity of the gesture shape • The models are trained using: • Baum-Welch re-estimation algorithm Research background

The basics of Hidden Markov Model: • Rich in mathematical structures • A collection of states connected by transitions: • Each transition has a pair of probabilities: • Transition probability – the probability for taking the transition • Output probability – the conditional probability of producing an output symbol • Has elegant and efficient algorithms: • Baum-Welch algorithm • Viterbi search algorithm Research background

The formal characterization of HMM: • – A set of N states. The state at time t denoted as • – A set of M distinct output symbols. The output at time t denoted as • A = – An N * N matrix for the state transition probability: , • B = – An N * M matrix for the output symbol probability, is the probability of producing at time t in state Research background

The formal characterization of HMM: • – The initial state distribution, where is the probability that the state is the initial state • A, B and are probabilistic, they must satisfy the following constrains: Research background

The Threshold Model: • For correct gesture spotting, the likelihood of a gesture model for a given pattern should be distinct (high) enough • The value of the likelihood yields by the model is used as a threshold • A gesture is recognized only if the likelihood of the best gesture model is higher than threshold model Spotting with threshold model

The internal segmentation property of HMM: • Each self-transitioned state represents a segmental pattern of a target gesture • That outgoing transitions represent a sequential progression of the segments in a gesture • The fully connected model was constructed: Spotting with threshold model

In the fully connected model: • Each state can be reached by all other states in a single transition • Output and self-transition probabilities in the model are kept as in the gesture models • All outgoing transition probabilities are equally assigned as (excluding the start and final states) • A candidate gesture is found when a specific gesture model rises above the threshold model Spotting with threshold model

Spotting with threshold model

[1] Pattern Recognition. http://en.wikipedia.org/wiki/Pattern_recognition [2] Gesture Recognition. http://en.wikipedia.org/wiki/Gesture_recognition [3] Pointing device gesture. http://en.wikipedia.org/wiki/Gesture-based_interface [4] J. Davis and M. Shah, “Visual Gesture Recognition,” IEEE Proc.Visualization and Image Signal Processing, vol. 141, no. 2, pp. 101-106, Apr. 1994 [5] B. Phil, “Hidden Markov Models,” http://digital.cs.usu.edu/~cyan/CS7960/hmm-tutorial.pdf References

[6]Forward-Backward Algorithm: http://en.wikipedia.org/wiki/Forward%E2%80%93backward_algorithm [7]Baum-Welsh Algorithmhttp://jedlik.phy.bme.hu/~gerjanos/HMM/node11.html http://www.indiana.edu/~iulg/moss/hmmcalculations.pdf [8]Viterbi Algorithm: http://www.indiana.edu/~iulg/moss/hmmcalculations.pdf References

Hidden Markov Models