Learning and Recognizing Human Dynamics in Video Sequences Christoph Bregler

Learning and Recognizing Human Dynamics in Video Sequences • Christoph Bregler • Alvina Goh • Reading group: 07/06/06

Motivation • Seeing lights attached to the joints of an actor, humans were able to distinguish human gaits, dance styles, stair climbing, or even the gender and identity. • This paper attempts to find the right balance of supplied structure and learned parameters. • Guiding principles: • no early commitment to specific hypotheses • higher level hypothesis should be able to disambiguate lower level estimates • low computation and representation costs • mid and higher level models should be learnable

Motivation • Human motion is represented at many levels of abstraction. • This paper describes a way of combining cues from the lowest level to the highest level in order to do activity recognition. • By suggesting the idea of representing motion data by movemes (like phonemes in speech recognition), it is possible to compose a complex activity (word) out of simple movemes.

Probabilistic Compositional Framework • Low-level primitives: areas of coherent motion • An image region belonging to a rigid body segment is one coherent motion • Mid-level categories: simple movements, represented by linear dynamical systems • High-level complex gestures: a sequence of simple movements, represented by Hidden Markov Models as successive phases of simple movements

Probabilistic Compositional Framework • Each dynamical model corresponds to the emission probability of a state of a hidden Markov model. • Temporal sequences of blob tracks are grouped into linear stochastic dynamical models. • Each blob is represented by a probability distribution over coherent motion (rigid/affine), color (HSV values), and spatial support regions. • At each pixel, the spatio-temporal image gradients and the color value are represented as random variables.

Probabilistic Compositional Framework • Example of one leg during a walk cycle • One coherent blob for the upper leg, another for the lower leg • One dynamical system when the leg has ground support, another when it swings above ground; state space: translation and angular velocities • One cyclic HMM with 2 states • Given a sequence of images $I_1, I_2, \dots, I_t$, we need to find the corresponding blob estimates, linear dynamical systems, and HMMs for a set of different gaits • Classify using the posterior probability $P(HMM_i \mid I_1, I_2, \dots, I_t)$, i.e., the HMM with the highest score is the most likely complex gesture performed in the image sequence

1st and 2nd Levels • Each blob is represented by a probability distribution over coherent motion (rigid/affine), color (HSV values), and spatial support regions. • At each pixel, the spatio-temporal image gradients and the color value are represented as random variables.

Classification of Pixels into Blobs • For each pixel location (x,y), we need to estimate the label S(x,y), which indicates which blob the pixel belongs to (assuming there are K blobs). • For each of the K blobs, we need to estimate the motion, color, and spatial distribution. • To estimate the labels $S(x,y) \in \{1,2,\dots,K\}$ and the model parameters for motion, color, and spatial support simultaneously, EM is used.

Representation of Mixture of Blobs • The set of blob hypotheses for a given image frame I(t) is represented as a mixture of multivariate Gaussians $\theta(t)$ • Each $\theta_k(t)$ contains the parameters for coherent motion and color, and the center of mass and second moments of each blob. A background class with uniform distribution is also defined. • The likelihood of an image frame I(t) conditional on a mixture-of-blobs hypothesis is $P(I(t) \mid \theta(t))$; this is the cost function we want to maximize. It includes a spatial proximity prior for blob k.

Before we can maximize the cost • We need to model $P(I(t,x,y) \mid \theta_k(t))$. This term is defined using the spatio-temporal image gradient (motion) and color values. • Optical flow constraint: $\nabla I(t,x,y) \cdot v(x,y) + I_t(t,x,y) = 0$, where $v = (v_x, v_y)$ is the flow given by the motion parameters • How do we model the pdf for optical flow? This is done with a zero-mean Gaussian distribution as described in the paper E. Simoncelli, E. Adelson, and D. Heeger, "Probability distributions of optical flow," in Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, pp. 310-315, 1991. • This defines the motion likelihood, which we use for $P(I(t,x,y) \mid \theta_k(t))$
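The motion term can be illustrated with a minimal sketch. It assumes a simplified reading of the Gaussian flow model: the residual of the gradient constraint at one pixel is scored under a zero-mean Gaussian. The function name and the noise level `sigma` are hypothetical, and the cited paper's full model is richer than this.

```python
import math


def flow_log_likelihood(ix, iy, it, vx, vy, sigma=1.0):
    """Log-density of the gradient-constraint residual under a zero-mean Gaussian.

    ix, iy, it: spatial and temporal image derivatives at one pixel.
    vx, vy: flow predicted at that pixel by a blob's motion model.
    sigma: assumed noise level (illustrative, not taken from the slides).
    """
    residual = ix * vx + iy * vy + it
    return -0.5 * (residual / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


# A pixel whose derivatives are consistent with a flow of (2, 0):
good = flow_log_likelihood(1.0, 0.0, -2.0, vx=2.0, vy=0.0)
bad = flow_log_likelihood(1.0, 0.0, -2.0, vx=0.0, vy=0.0)
```

Here `good` exceeds `bad`, so this pixel would favor a blob whose motion model predicts the flow (2, 0).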

Expectation step • Estimation of the support layer for each blob, which is the posterior probability $S_k(t,x,y) = P(S(x,y) = k \mid I(t,x,y), \theta(t)) \propto w_k \, P(x,y \mid \theta_k(t)) \, P(I(t,x,y) \mid \theta_k(t))$ • Note that we are calculating the expected membership.
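As a rough sketch of this step, the expected membership of each pixel can be computed by normalizing weight times likelihood over the blobs. The function name and data layout are hypothetical; each entry of `likelihoods` is assumed to already combine the spatial and image terms for that blob and pixel.

```python
def e_step(weights, likelihoods):
    """Return support[k][p], the posterior probability that pixel p belongs to blob k.

    weights[k]: mixture weight of blob k.
    likelihoods[k][p]: likelihood of pixel p under blob k (assumed precomputed).
    """
    n_blobs = len(weights)
    n_pixels = len(likelihoods[0])
    support = [[0.0] * n_pixels for _ in range(n_blobs)]
    for p in range(n_pixels):
        joint = [weights[k] * likelihoods[k][p] for k in range(n_blobs)]
        total = sum(joint)
        for k in range(n_blobs):
            # Fall back to a uniform membership if no blob explains the pixel.
            support[k][p] = joint[k] / total if total > 0 else 1.0 / n_blobs
    return support


support = e_step([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]])
```

Each column of `support` sums to one, which is what makes it an expected (soft) membership rather than a hard label.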

Maximization step • Seek to maximize the expected log-likelihood. This is equivalent to minimizing three terms, labeled (8), (9), and (10). • Minimizing (8) subject to the constraint $\sum_k w_k = 1$ is equivalent to assigning $w_k = \frac{\sum_{x,y} S_k(t,x,y)}{\sum_k \sum_{x,y} S_k(t,x,y)}$ • Minimizing (9) is equivalent to computing the weighted means and covariances for the support layer • Minimizing (10) is done by extending the Lucas-Kanade motion estimation described in the paper "Good Features to Track" by Shi and Tomasi, CVPR 1994
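The weight update and the weighted spatial moments can be sketched as follows. This is an illustration under assumed data layouts (a support map stored as `support[k][p]` and a list of pixel coordinates); the function names are hypothetical, and the motion update (the Lucas-Kanade extension) is not covered here.

```python
def update_weights(support):
    """Mixture weights: each blob's total support divided by the total over all blobs."""
    totals = [sum(row) for row in support]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def weighted_moments(support_k, coords):
    """Support-weighted center of mass and 2x2 covariance of pixel coordinates."""
    total = sum(support_k)
    mx = sum(s * x for s, (x, _) in zip(support_k, coords)) / total
    my = sum(s * y for s, (_, y) in zip(support_k, coords)) / total
    cxx = sum(s * (x - mx) ** 2 for s, (x, _) in zip(support_k, coords)) / total
    cyy = sum(s * (y - my) ** 2 for s, (_, y) in zip(support_k, coords)) / total
    cxy = sum(s * (x - mx) * (y - my) for s, (x, y) in zip(support_k, coords)) / total
    return (mx, my), ((cxx, cxy), (cxy, cyy))


coords = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
support = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
weights = update_weights(support)
mean0, cov0 = weighted_moments(support[0], coords)
```

With this toy support map the two blobs get equal weight, and the first blob is centered at (1, 0) with all of its spread along x.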

A side note • Black ink = high probability • The support map has high probability for a motion model at regions with high gradients, since these can be uniquely matched to specific motion models. At non-textured regions, equal probability is assigned to several motion models. • This approach can be viewed as an edge-based tracker at regions with high edge gradients, and a region-based tracker at regions with high texture.

Considering Past Estimates • Since EM converges only to a local maximum, it is important to initialize the starting point intelligently. • Given past estimates of the blob parameters $\theta(t-1)$, a Kalman filter is used to predict the mean and covariance of $\theta(t)$, where $\theta(t)$ is the state of the filter. • The EM starting point is the predicted Kalman state.
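A minimal sketch of the prediction step, assuming a constant-velocity model for a single blob coordinate. The slides do not specify the filter's dynamics, so the transition matrix F = [[1, 1], [0, 1]], the `process_noise` parameter, and the function name are all illustrative assumptions.

```python
def kalman_predict(mean, cov, process_noise=0.0):
    """Predict the next (position, velocity) mean and 2x2 covariance.

    Implements mean' = F mean and cov' = F cov F^T + q I for
    F = [[1, 1], [0, 1]], written out by hand for the 2x2 case.
    """
    pos, vel = mean
    (p00, p01), (p10, p11) = cov
    pred_mean = (pos + vel, vel)
    pred_cov = (
        (p00 + p01 + p10 + p11 + process_noise, p01 + p11),
        (p10 + p11, p11 + process_noise),
    )
    return pred_mean, pred_cov


pred_mean, pred_cov = kalman_predict((5.0, 2.0), ((1.0, 0.0), (0.0, 1.0)))
```

The predicted mean would then serve as the EM starting point for the next frame, and the predicted covariance indicates how far the new estimate may plausibly drift from it.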

3rd and 4th Levels • Each dynamical model corresponds to the emission probability of a state of a hidden Markov model. • Temporal sequences of blob tracks are grouped into linear stochastic dynamical models.

Classification of Blobs into Dynamical Systems • This is similar to the lower levels, where we introduced the hidden variables $S_k(t,x,y)$ indicating the probability that a pixel belongs to blob k. • We now introduce the variable $D_m(t,k)$, which groups a sequence of blobs $\theta_k(t), \theta_k(t-1), \dots, \theta_k(t-d)$ into a dynamical system m. • To do so, we assume the following discrete 2nd-order stochastic dynamical system (moveme): $Q(t) = A^0_m Q(t-1) + A^1_m Q(t-2) + B_m w$ • The state variable Q(t) is the motion estimate of the specific blob $\theta_k(t)$, w is system noise, and $C_m = B_m B_m^T$ is the system covariance
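To make the moveme model concrete, here is a sketch for a scalar state. It assumes the second-order form Q(t) = a0 Q(t-1) + a1 Q(t-2) + b w with unit-variance Gaussian noise w, so the one-step prediction error has variance b squared. The function name and the scalar restriction are simplifications for illustration.

```python
import math


def moveme_log_likelihood(q, a0, a1, b):
    """Sum of Gaussian log-densities of the one-step prediction errors of q."""
    var = b * b
    total = 0.0
    for t in range(2, len(q)):
        err = q[t] - (a0 * q[t - 1] + a1 * q[t - 2])
        total += -0.5 * err * err / var - 0.5 * math.log(2.0 * math.pi * var)
    return total


track = [0.0, 1.0, 2.0, 3.0, 4.0]                              # steadily increasing state
steady = moveme_log_likelihood(track, a0=2.0, a1=-1.0, b=0.5)  # constant-increment model
still = moveme_log_likelihood(track, a0=1.0, a1=0.0, b=0.5)    # constant-value model
```

The constant-increment model predicts this track exactly, so `steady` is larger than `still`; comparing such scores is one way a blob track could be assigned to the dynamical system that fits it best.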

Classification of Complex Gestures • Hidden Markov Models are used to represent complex gestures composed of simple dynamical systems. Each state of the HMM corresponds to the validity of one dynamical system. The emission probabilities are represented by the stochastic dynamical systems. • We want to compute the globally best segmentation across time. This is done using dynamic programming. • Estimate the probability that $HMM_i$ fits a track: $P(D_m(t,k))$ is the probability that dynamical system m fits blob k at time t, and $tr_{n,m}$ is the HMM transition probability between states n and m. • Compare across all the complex-category models $HMM_i$ and an outlier model $HMM_0$, then classify.
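The dynamic-programming segmentation can be sketched with a standard Viterbi recursion in the log domain. This is a generic illustration, not the exact recursion from the slides: `log_emit[t][m]` stands in for the log-probability that dynamical system m fits the track at time t, and the function name and toy numbers are hypothetical.

```python
import math


def best_segmentation(log_emit, log_trans, log_init):
    """Return (best log-score, most likely state path) for one HMM.

    log_emit[t][m]: log-probability that state m explains frame t.
    log_trans[n][m]: log transition probability from state n to state m.
    log_init[m]: log prior of starting in state m.
    """
    n_states = len(log_init)
    score = [log_init[m] + log_emit[0][m] for m in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for m in range(n_states):
            best_n = max(range(n_states), key=lambda n: score[n] + log_trans[n][m])
            pointers.append(best_n)
            new_score.append(score[best_n] + log_trans[best_n][m] + log_emit[t][m])
        score = new_score
        back.append(pointers)
    last = max(range(n_states), key=lambda m: score[m])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


log = math.log
emit = [[log(0.9), log(0.1)], [log(0.9), log(0.1)],
        [log(0.1), log(0.9)], [log(0.1), log(0.9)]]
trans = [[log(0.8), log(0.2)], [log(0.2), log(0.8)]]
init = [log(0.5), log(0.5)]
best_score, best_path = best_segmentation(emit, trans, init)
```

Running this for every gesture HMM and for the outlier model, then picking the model with the highest score, corresponds to the comparison described in the last bullet.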

Hybrid Dynamical Models • We need to estimate the system parameters $\phi_m = [A^0_m, A^1_m, B_m]$ of each dynamical model and the entries $tr_{m,n}$ of the HMM transition probability matrix. • However, if we are only given the motion trajectories Q(1), Q(2), ..., Q(T), we do not know the partition into subsequences. If we knew the partition, calculating the system parameters would be easy. • Proceed in an EM manner by maximizing the log-likelihood of a set of M dynamical systems and the corresponding HMM

Expectation step • Estimation of the partition of the training set: find the probability $D_m(t)$ that training example Q(t) was generated by dynamical system m, i.e. $P(D_m(t) \mid Q(1), \dots, Q(T), \phi_1, \dots, \phi_M, TR_{HMM})$. • This is computed with dynamic programming, with linear complexity.

Maximization step • Seek to maximize the expected log-likelihood for each model m. This is done by solving a linear equation. • The new estimate of the HMM transition probabilities is also computed with EM.
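For a scalar state, the linear equation reduces to 2x2 weighted normal equations. The sketch below assumes the second-order form Q(t) ≈ a0 Q(t-1) + a1 Q(t-2) and uses the partition probabilities as per-sample weights. It estimates only the two regression coefficients and leaves out the noise term; the function name is hypothetical.

```python
def fit_moveme(q, weights):
    """Weighted least-squares estimate of (a0, a1) in q[t] ~ a0*q[t-1] + a1*q[t-2].

    weights[t]: probability that sample t was generated by this model.
    """
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(q)):
        w = weights[t]
        x1, x2, y = q[t - 1], q[t - 2], q[t]
        s11 += w * x1 * x1
        s12 += w * x1 * x2
        s22 += w * x2 * x2
        r1 += w * x1 * y
        r2 += w * x2 * y
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("normal equations are singular for this data")
    # Solve the 2x2 system by Cramer's rule.
    a0 = (r1 * s22 - r2 * s12) / det
    a1 = (s11 * r2 - s12 * r1) / det
    return a0, a1


track = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
a0, a1 = fit_moveme(track, [1.0] * len(track))
```

On this steadily increasing track the fit recovers a0 = 2 and a1 = -1, the coefficients of a constant-increment model. In the full EM loop the weights would come from the expectation step rather than being set to one.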

Experiments: Training and Validation of Gait Models • 33 sequences of 5 subjects • Running, Walking, Skipping • Sequences start at different phases • 4 dynamical models per gait • Uniform partition where each model is assigned 1/4

Experiments: Recognition of Gaits • Apply the learned dynamical models and HMMs to unseen data • Outlier model with 1 state and a constant-velocity dynamical model • The HMM with the highest likelihood gives the final gait classification

Conclusion • Decomposes the domain and incorporates different levels of abstraction using mixture models, EM, recursive Kalman and Markov estimation. • How much data is needed to build the recognizer? • How much computational time is required?
