
Learning the Appearance and Motion of People in Video (The Science of Silly Walks)

This research aims to infer 3D human motion from 2D image properties without manual intervention, addressing challenges such as unknown, cluttered environments, self-occlusion, low contrast, and ambiguous matches. The goal is to represent uncertainty, model non-linear dynamics, exploit image cues robustly, integrate information over time, and combine multiple image cues to accurately estimate human motion.


Presentation Transcript


  1. Learning the Appearance and Motion of People in Video (The Science of Silly Walks). Hedvig Sidenbladh, Defense Research Institute, Stockholm, Sweden (http://www.nada.kth.se/~hedvig); Michael J. Black, Department of Computer Science, Brown University (http://www.cs.brown.edu/~black)

  2. Collaborators David Fleet, Xerox PARC; Nancy Pollard, Brown University; Dirk Ormoneit and Trevor Hastie, Dept. of Statistics, Stanford University; Allan Jepson, University of Toronto

  3. The (Silly) Problem Unsolved without manual intervention.

  4. Inferring 3D Human Motion * Infer 3D human motion from 2D image properties. * No special clothing. * Monocular, grayscale sequences (archival data). * Unknown, cluttered environment. * Incremental estimation.

  5. Why is it Hard? Singularities in viewing direction. Unusual viewpoints. Self-occlusion. Low contrast. Ambiguous matches.

  6. Clothing and Lighting

  7. Large Motions Limbs move rapidly with respect to their width. Non-linear dynamics. Motion blur.

  8. Ambiguities Where is the leg? Which leg is in front?

  9. Ambiguities Accidental alignment

  10. Ambiguities Occlusion Whose legs are whose?

  11. Requirements 1. Represent uncertainty and multiple hypotheses. 2. Model non-linear dynamics of the body. 3. Exploit image cues in a robust fashion. 4. Integrate information over time. 5. Combine multiple image cues.

  12. Simple Body Model * Limbs are truncated cones. * Parameter vector φ of joint angles and angular velocities.

  13. Inference/Issues Bayesian formulation: p(model | cues) = p(cues | model) p(model) / p(cues). 1. Need a constraining likelihood model that is also invariant to variations in human appearance. 2. Need a prior model of how people move. 3. Need an effective way to explore the model space (very high dimensional) and represent ambiguities.

  14. What Image Cues? Pixels? Temporal differences? Background differences? Edges? Color? Silhouettes? Optical flow?

  15. Brightness Constancy I(x, t+1) = I(x+u, t) + η. Image motion of the foreground as a function of the 3D motion of the body. Problem: no fixed model of appearance (drift).
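The brightness constancy assumption can be illustrated with a minimal NumPy sketch: warp the second frame back by the hypothesized displacement u and compare it to the first frame. This is an illustration with integer shifts only (no sub-pixel interpolation), not the paper's implementation.

```python
import numpy as np

def brightness_residual(img_t, img_t1, u):
    """Residual I(x, t) - I(x+u, t+1) for an integer displacement u = (du, dv).
    Under brightness constancy the residual is zero up to noise.
    Integer shifts only; a real tracker would interpolate sub-pixel motion."""
    du, dv = u
    warped = np.roll(np.roll(img_t1, -du, axis=0), -dv, axis=1)
    return img_t - warped

# A toy pattern translated by (1, 2) pixels between frames.
rng = np.random.default_rng(0)
frame_t = rng.random((32, 32))
frame_t1 = np.roll(np.roll(frame_t, 1, axis=0), 2, axis=1)

res = brightness_residual(frame_t, frame_t1, (1, 2))
```

With the correct displacement the residual vanishes; a wrong hypothesis leaves large residuals, which is what drives the motion estimate.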

  16. State of the Art. Bregler and Malik ‘98 * Brightness constancy cue, insensitive to appearance. * Full-body tracking required multiple cameras. * Single hypothesis (MAP estimate).

  17. State of the Art. Cham and Rehg ‘99 * Single camera, multiple hypotheses. * 2D templates (solves drift but is view dependent): I(x, t) = I(x+u, 0) + η

  18. Edges as a Cue? * Probabilistic model? * Under/over-segmentation, thresholds, …

  19. State of the Art. Deutscher, North, Bascle, & Blake ‘99 * Multiple cameras. * Simplified clothing, lighting, and background.

  20. What do people look like? Changing background Varying shadows Occlusion Deforming clothing Low contrast limb boundaries What do non-people look like?

  21. Key Idea #1 (Rigorous Likelihood) 1. Use the 3D model to predict the location of limb boundaries (not necessarily features) in the scene. 2. Compute various filter responses steered to the predicted orientation of the limb. 3. Compute the likelihood of the filter responses using a statistical model learned from examples.

  22. Natural Image Statistics * Statistics of image derivatives are non-Gaussian. * Consistent across scale. Ruderman; Lee, Mumford & Huang; Portilla & Simoncelli; Olshausen & Field; Xu, Wu & Mumford; …

  23. Statistics of Edges The statistics of filter responses, F, on edges, p_on(F), differ from the background statistics, p_off(F). The likelihood ratio, p_on/p_off, can be used for edge detection and road following (Geman & Jedynak; Konishi, Yuille, & Coughlan). What about the object-specific statistics of limbs? * An edge may be present or not.
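The p_on/p_off idea can be sketched with learned histograms: look up a filter response in the two densities and take the log ratio. The histogram values below are toy numbers for illustration, not the distributions learned in the paper (which come from hand-marked limb boundaries).

```python
import numpy as np

def log_likelihood_ratio(f, bin_edges, p_on, p_off, eps=1e-8):
    """log p_on(F)/p_off(F) for a filter response f, given histogram
    densities learned from labeled 'on-edge' and 'off-edge' pixels.
    Positive values favor the presence of an edge."""
    i = np.clip(np.digitize(f, bin_edges) - 1, 0, len(p_on) - 1)
    return np.log(p_on[i] + eps) - np.log(p_off[i] + eps)

# Toy learned distributions: edges tend to produce larger responses.
bin_edges = np.linspace(0.0, 1.0, 11)  # 10 bins on [0, 1]
p_on  = np.array([.01, .01, .02, .04, .07, .10, .15, .20, .20, .20])
p_off = np.array([.30, .25, .15, .10, .08, .05, .03, .02, .01, .01])

strong = log_likelihood_ratio(0.85, bin_edges, p_on, p_off)  # likely edge
weak   = log_likelihood_ratio(0.05, bin_edges, p_on, p_off)  # likely background
```

Summing such log ratios over the boundary pixels a pose hypothesis predicts gives an edge-based score for that pose.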

  24. Object-Specific Statistics

  25. Edge Filters Normalized derivatives of Gaussians (Lindeberg; Granlund and Knutsson; Perona; Freeman & Adelson; …). Edge filter response steered to the limb orientation: filter responses steered to the arm orientation.
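First-derivative-of-Gaussian filters are steerable: the response at any orientation θ is cos(θ)·I_x + sin(θ)·I_y. A minimal sketch (plain NumPy convolution, no contrast normalization, which the slides add separately):

```python
import numpy as np

def gaussian_deriv_kernels(sigma=1.0):
    """x- and y-derivative-of-Gaussian kernels (radius 3*sigma)."""
    r = int(3 * sigma)
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return -x / sigma ** 2 * g, -y / sigma ** 2 * g

def conv2_same(img, k):
    """Plain 2D convolution with zero padding, same-size output."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img.astype(float), ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[kh - 1 - i, kw - 1 - j] * padded[i:i + img.shape[0],
                                                      j:j + img.shape[1]]
    return out

def steered_response(img, theta, sigma=1.0):
    """Derivative-of-Gaussian response steered to orientation theta:
    cos(theta) * I_x + sin(theta) * I_y (first-order steerability)."""
    gx, gy = gaussian_deriv_kernels(sigma)
    return (np.cos(theta) * conv2_same(img, gx)
            + np.sin(theta) * conv2_same(img, gy))

# A vertical step edge responds strongly at theta = 0 (horizontal
# derivative) and not at all at theta = pi/2 in the interior.
img = np.zeros((16, 16))
img[:, 8:] = 1.0
r0  = steered_response(img, 0.0)[8, 8]
r90 = steered_response(img, np.pi / 2)[8, 8]
```

Steering to the limb orientation predicted by the 3D model is what makes the filter responses object-specific rather than generic edge detection.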

  26. Distribution of Edge Filter Responses: p_on(F) and p_off(F)

  27. Contrast Normalization? Lee, Mumford & Huang

  28. Contrast Normalization * Maximize the difference between the distributions, e.g. the Bhattacharyya distance:
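The Bhattacharyya distance between two discrete distributions is D = −ln Σ_i √(p_i q_i); it is zero for identical distributions and grows as they separate. A minimal sketch of how it could score a candidate normalization (the distributions below are toy values):

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """D = -ln sum_i sqrt(p_i * q_i) between two discrete distributions.
    Zero when p == q; larger values mean better-separated on/off-edge
    histograms, i.e. a more discriminative contrast normalization."""
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    return -np.log(bc)

p      = np.array([0.7, 0.2, 0.1])
q_same = p.copy()
q_far  = np.array([0.1, 0.2, 0.7])

d_same = bhattacharyya_distance(p, q_same)  # identical -> distance 0
d_far  = bhattacharyya_distance(p, q_far)   # separated -> distance > 0
```

Choosing the normalization that maximizes this distance makes the p_on/p_off likelihood ratio as informative as possible.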

  29. Local Contrast Normalization

  30. Ridge Features Scale-specific.

  31. Ridge Filters Relationship between limb diameter in image and scale of maximum ridge filter response.

  32. Ridge Thigh Statistics

  33. Brightness Constancy What are the statistics of the brightness variation I(x, t) - I(x+u, t+1)? Variation due to clothing, self-shadowing, etc.

  34. Brightness Constancy * Well fit by a t-distribution or Cauchy distribution (heavy tails). * Related to robust statistics.
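The practical effect of heavy tails: under a Student-t model a large residual (a clothing fold, a self-shadow) costs far less than under a Gaussian, so a single outlier cannot dominate the likelihood. A sketch with illustrative sigma/nu values:

```python
import numpy as np

def neg_log_t(r, sigma=1.0, nu=3.0):
    """Negative log of an (unnormalized) Student-t density. The log
    grows only logarithmically in |r|, so outliers are down-weighted.
    sigma and nu here are illustrative, not the paper's fitted values."""
    return 0.5 * (nu + 1) * np.log(1.0 + (r / sigma) ** 2 / nu)

def neg_log_gauss(r, sigma=1.0):
    """Gaussian negative log-density (quadratic penalty), for contrast."""
    return 0.5 * (r / sigma) ** 2

# An outlier residual of 10 sigma: the robust cost is much smaller.
rho_t = neg_log_t(10.0)
rho_g = neg_log_gauss(10.0)
```

This is the connection to robust statistics noted on the slide: the t negative log-density acts as a robust rho-function.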

  35. Key Idea #2 (Explain the Image) p(image | foreground, background) Generic, unknown background Foreground person See also MacCormick and Isard, ICCV ’01. Foreground should explain what the background can’t.

  36. Likelihood Steered edge filter responses crude assumption: filter responses independent across scale.
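Under the slide's (admittedly crude) independence assumption, the joint log-likelihood of a pose is just a sum of per-response log p_on/p_off terms over scales and predicted boundary locations. A sketch with hypothetical structure; the per-scale log-density functions below are toy Gaussians, not the learned distributions:

```python
import numpy as np

def pose_log_likelihood(responses, log_p_on, log_p_off):
    """Score a pose hypothesis from steered filter responses.

    responses: dict mapping scale -> array of responses at the limb
    boundaries predicted by the 3D model. Assuming independence across
    scale (as the slide states), the joint log-likelihood ratio is a
    sum of per-response log p_on - log p_off terms."""
    total = 0.0
    for scale, f in responses.items():
        total += np.sum(log_p_on[scale](f) - log_p_off[scale](f))
    return total

# Toy per-scale log-densities: edges respond near 1, background near 0.
log_p_on  = {s: (lambda f: -0.5 * (f - 1.0) ** 2) for s in (1, 2)}
log_p_off = {s: (lambda f: -0.5 * f ** 2) for s in (1, 2)}

on_edge  = {1: np.array([0.9, 1.1]), 2: np.array([1.0])}   # good pose
off_edge = {1: np.array([0.1, 0.0]), 2: np.array([-0.1])}  # bad pose

score_on  = pose_log_likelihood(on_edge, log_p_on, log_p_off)
score_off = pose_log_likelihood(off_edge, log_p_on, log_p_off)
```

A pose whose predicted boundaries land on real limb edges scores higher than one that lands on background, which is exactly the signal the particle filter weights exploit.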

  37. Inference/Issues Bayesian formulation: p(model | cues) = p(cues | model) p(model) / p(cues). 1. Need a constraining likelihood model that is also invariant to variations in human appearance. 2. Need a prior model of how people move.

  38. Learning Human Motion (figure: joint angles over time, from M. Gleicher) * Constrain the posterior to likely and valid poses/motions. * Model the variability. * 3D motion-capture data: a database with multiple actors and a variety of motions.

  39. Key Idea #3 (Trade learning for search.) Problem: * insufficient data to learn a prior probabilistic model of human motion. Alternative: * the data represents all we know * replace representation and learning with search. (challenge: search has to be fast)

  40. Texture Synthesis (figure: synthetic texture generated from a “database” image, Efros & Freeman ’01) * De Bonet & Viola, Efros & Leung, Efros & Freeman, Pasztor & Freeman, Hertzmann et al., … * Image(s) as an implicit probabilistic model.

  41. Implicit Probabilistic Model Key idea: probabilistic search (log time) of this tree approximates sampling from p(stored sequence | generated sequence).

  42. Synthesis * Colors indicate different training sequences. * For graphics, we need editability, constraints (ground contact, pose, interpenetration), key frames, style, …

  43. Tracking * Efficiently generate samples (image data will sort out which are good). * Temperature parameter controls randomness of tree search.
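The temperature idea can be sketched with a brute-force stand-in for the paper's log-time tree search: score stored motions by distance to the query pose and sample with a softmax whose temperature controls randomness. All names and the 1-D "pose" representation below are illustrative.

```python
import numpy as np

def sample_continuation(db, query, temperature, rng):
    """Pick a stored pose matching `query`, sampling via a softmax over
    negative distances. temperature -> 0 gives the greedy nearest
    neighbour; a large temperature approaches a uniform random choice.
    Brute force here; the paper uses a tree search over the database."""
    d = np.linalg.norm(db - query, axis=1)
    logits = -d / max(temperature, 1e-12)
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(db), p=p)

# Toy database of stored poses (1-D for illustration).
db = np.array([[0.0], [1.0], [5.0]])
rng = np.random.default_rng(0)
greedy_pick = sample_continuation(db, np.array([0.1]),
                                  temperature=1e-6, rng=rng)
```

A higher temperature yields more diverse samples for the particle filter to test against the image; the image likelihood then "sorts out which are good," as the slide says.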

  44. Bayesian Formulation Posterior over model parameters given an image sequence: the likelihood of observing the image given the model parameters, times the temporal model (prior), integrated against the posterior from the previous time instant.
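The propagation rule described on this slide can be written out as follows (notation assumed: φ_t the body-model parameters at time t, I_t the image at time t, and the arrowed vectors the full histories up to t):

```latex
p(\phi_t \mid \vec{I}_t)
  \;\propto\;
  \underbrace{p(I_t \mid \phi_t)}_{\text{likelihood}}
  \int
  \underbrace{p(\phi_t \mid \phi_{t-1})}_{\text{temporal prior}}\;
  \underbrace{p(\phi_{t-1} \mid \vec{I}_{t-1})}_{\text{previous posterior}}\;
  d\phi_{t-1}
```

The particle filter of the following slides approximates this integral with a weighted sample set.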

  45. What does the posterior look like? Arm model: shoulder, 3 dof; elbow, 1 dof. The elbow bends.

  46. Inference/Issues Bayesian formulation: p(model | cues) = p(cues | model) p(model) / p(cues). 1. Need a constraining likelihood model that is also invariant to variations in human appearance. 2. Need a prior model of how people move. 3. Need an effective way to explore the model space (very high dimensional) and represent ambiguities.

  47. Key Idea #4 (Represent Ambiguity) * Represent a multi-modal posterior probability distribution over model parameters - sampled representation - each sample is a pose and its probability - predict over time using a particle filtering approach. Samples from a distribution over 3D poses.

  48. Particle Filtering * large literature (Gordon et al ‘93, Isard & Blake ‘96,…) * non-Gaussian posterior approximated by N discrete samples * explicitly represent the ambiguities * exploit stochastic sampling for tracking

  49. Particle Filter (diagram: posterior at t-1 → sample → temporal dynamics → likelihood → normalize → posterior at t)
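The sample/propagate/normalize cycle of the diagram can be written as a minimal generic particle filter. This is a 1-D toy (random-walk dynamics, Gaussian-shaped likelihood), not the paper's 3D body tracker; all distributions below are illustrative.

```python
import numpy as np

def particle_filter_step(particles, weights, dynamics, likelihood, rng):
    """One predict/update cycle: resample from the previous posterior,
    diffuse the particles through the temporal model, then reweight by
    the image likelihood and normalize."""
    # Resample according to the previous weights ('sample' step).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    # Propagate through the temporal dynamics (prediction).
    particles = dynamics(particles, rng)
    # Reweight by the likelihood and normalize ('normalize' step).
    weights = likelihood(particles)
    weights /= weights.sum()
    return particles, weights

rng = np.random.default_rng(0)
n = 500
particles = rng.normal(0.0, 5.0, n)      # broad initial belief
weights = np.full(n, 1.0 / n)

true_state = 2.0
dynamics = lambda x, rng: x + rng.normal(0.0, 0.3, x.shape)  # random walk
likelihood = lambda x: np.exp(-0.5 * (x - true_state) ** 2)  # toy observation

for _ in range(10):
    particles, weights = particle_filter_step(
        particles, weights, dynamics, likelihood, rng)

estimate = np.sum(weights * particles)   # posterior mean
```

Because the posterior is a weighted sample set rather than a single estimate, multiple modes (the ambiguities of slides 8-10) can survive until later evidence resolves them.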

  50. Particle Filter Isard & Blake ‘96
