
Learning and Vision for Multimodal Conversational Interfaces. Trevor Darrell, Vision Interface Group, MIT CSAIL Lab.

Natural Interfaces. Conversation would improve many interactions. Currently, conversational interfaces break down in most situations with more than one user, or with references to real-world objects. What is missing is visual context.


Presentation Transcript


    1. Learning and Vision for Multimodal Conversational Interfaces

    3. Visual Context for Conversation Who is there? (presence, identity) Which person said that? (audiovisual grouping) Where are they? (location) What are they looking / pointing at? (pose, gaze) What are they doing? (activity)

    4. Learning Visual conversational context cues are hard to model analytically, so learning methods are appropriate. Different techniques suit different cues, levels of representation, and input modes... (at least for now…)

    5. Today Speaker segregation using audio-visual mutual information discard background sounds separate multiple conversational streams Head pose detection and tracking with multi-view appearance models attention agreement Articulated pose tracking by learning model constraints, or example-based inference… gesture “body language”

    6. Today Speaker segregation using audio-visual mutual information discard background sounds separate multiple conversational streams Head pose detection and tracking with multi-view appearance models attention agreement Articulated pose tracking by learning model constraints, or example-based inference… gesture “body language”

    7. Is that you talking?

    8. Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)?

    9. Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)? Model-free?

    10. Audio-visual synchrony Yes, by learning a model of audio-visual synchrony. Three approaches: pixel-wise correlation with video [Hershey and Movellan]; correlation of an optimal projection [Slaney and Covell]; non-parametric mutual information analysis on an optimal projection [Fisher et al.]
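
A minimal sketch of the first, pixel-wise approach, in the spirit of Hershey and Movellan but not their implementation. It assumes a grayscale video array `video` of shape (T, H, W) and a per-frame audio energy track `audio` of length T (hypothetical names); each pixel is scored by the squared correlation of its intensity over time with the audio energy.

```python
import numpy as np

def pixelwise_av_correlation(video, audio):
    """Score each pixel by its squared correlation with the audio track.

    video : (T, H, W) grayscale frames
    audio : (T,) per-frame audio energy
    Returns an (H, W) map; large values suggest pixels synchronized with the audio.
    """
    T = video.shape[0]
    v = video.reshape(T, -1).astype(np.float64)
    a = audio.astype(np.float64)

    v -= v.mean(axis=0)               # center each pixel's time series
    a -= a.mean()

    cov = a @ v / T                   # covariance of audio with every pixel
    denom = np.sqrt((a @ a / T) * (v * v).mean(axis=0)) + 1e-12
    rho = cov / denom                 # Pearson correlation per pixel
    return (rho ** 2).reshape(video.shape[1:])
```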

    11. Audio-based Image localization E.g., locate visual sources given audio information:

    12. Audio-based Image localization Image variance (ignoring audio) will find all motion in the sequence:

    13. Audio-based Image localization Estimate mutual information between audio and video:

    15. Canonical correlation projection Different from Hershey and Movellan in that it asks what combination of audio and video data produces the best correlation, rather than treating each pixel independently. However, the performance depends on both the training and testing data sizes. Perhaps the best contribution was showing that MFCC and LPC representations were much better than audio power or a spectrogram with this technique.
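
A rough sketch of the canonical correlation idea, in the spirit of Slaney and Covell but not their implementation: assuming per-frame audio features `A` (e.g., MFCCs) and video features `V` have already been extracted (hypothetical names), the projections that maximize audio-video correlation come from an SVD of the whitened cross-covariance.

```python
import numpy as np

def cca_first_pair(A, V, reg=1e-6):
    """First canonical correlation between audio features A (T, da)
    and video features V (T, dv), via whitening + SVD.
    Returns (correlation, audio_direction, video_direction)."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    T = A.shape[0]

    def whiten(X):
        C = X.T @ X / T + reg * np.eye(X.shape[1])     # regularized covariance
        vals, vecs = np.linalg.eigh(C)
        W = vecs @ np.diag(vals ** -0.5) @ vecs.T      # C^{-1/2}
        return X @ W, W

    Aw, Wa = whiten(A)
    Vw, Wv = whiten(V)
    U, s, Vt = np.linalg.svd(Aw.T @ Vw / T)            # whitened cross-covariance
    return s[0], Wa @ U[:, 0], Wv @ Vt[0]
```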

    16. Non-parametric Mutual Information Match audio to video using adaptive feature basis Exploit joint statistics of image and audio signal Efficient non-parametric density estimation
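
A simplified stand-in for the idea on this slide, assuming the audio and video streams have already been projected onto 1-D features `fa` and `fv` (hypothetical names): mutual information is estimated non-parametrically from a joint histogram rather than from a Gaussian model. The actual work of Fisher et al. uses kernel density estimates and learns the projections jointly; this is only an illustrative sketch.

```python
import numpy as np

def histogram_mutual_information(fa, fv, bins=16):
    """Non-parametric MI estimate (in nats) between two 1-D signals."""
    joint, _, _ = np.histogram2d(fa, fv, bins=bins)
    pxy = joint / joint.sum()                    # joint distribution
    px = pxy.sum(axis=1, keepdims=True)          # marginal of the audio feature
    py = pxy.sum(axis=0, keepdims=True)          # marginal of the video feature
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```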

    17. Maximally Informative Subspace Key difference is that we don't have labels for the training data: we have to learn statistics for both modalities, learning projections that reveal "simple structure".

    18. Audio-visual synchrony detection Compute the similarity matrix for 8 subjects (example MI values: 0.68, 0.61, 0.19, 0.20):
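
A short usage sketch of the slide-18 experiment, reusing the histogram MI estimator above: every subject's audio feature track is scored against every subject's video feature track, and matched (on-diagonal) pairs should score highest. Subject data and feature extraction are placeholders.

```python
import numpy as np

def mi_similarity_matrix(audio_feats, video_feats):
    """audio_feats, video_feats: per-subject lists of 1-D feature tracks."""
    n = len(audio_feats)
    return np.array([[histogram_mutual_information(audio_feats[i], video_feats[j])
                      for j in range(n)]
                     for i in range(n)])
```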

    19. Today Speaker segregation using audio-visual mutual information discard background sounds separate multiple conversational streams Head pose detection and tracking with multi-view appearance models attention agreement Articulated pose tracking by learning model constraints, or example-based inference… gesture “body language”

    20. Head pose tracking

    21. Lots of Work on Face Pose Tracking… Cylindrical approx. [LaCascia & Sclaroff] 3D Mesh approx. [Essa] 3D Morphable model [Blanz & Vetter] Multi-view keyframes from 3D model [Vachetti et al.] View-based eigenspaces [Srinivasan & Boyer] [Pentland et al.] … Online methods are still hard to initialize; offline methods constrain the pose space.

    22. Pose Estimation

    23. User Dependent Keyframes

    24. User-Independent Prior Model

    25. 3D View-based Eigenspaces

    26. View-based Eigenspaces The images show the mean face plus the first 3 eigenvectors.

    27. 3D View-based Eigenspaces Per-view PCA; we also keep basis images for the depth channel.

    28. 3D View-based Eigenspaces Three ways to combine intensity and depth: (1) independent SVD on each channel; (2) concatenate I and Z, then SVD; (3) SVD on I, then transfer the weights to the depth channel. Informally, we found (3) to work best so far; we are still exploring this topic. We keep depth basis images rather than a separate depth eigenspace, because variation in depth is not independent of variation in intensity: intensity variation is more relevant for matching identity, and depth variation shows up along with intensity variation.
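
A minimal sketch of one plausible reading of option (3), assuming registered intensity crops `I` and aligned depth crops `Z` of shape (N, h, w) for a single view (hypothetical names): PCA is computed on intensity only, and the resulting per-example weights are used to fit matching depth basis images.

```python
import numpy as np

def per_view_eigenspace(I, Z, k=10):
    """Build an intensity eigenspace for one view plus matching depth basis images.

    I, Z : (N, h, w) registered intensity and depth crops for this view
    Returns (mean_I, basis_I, mean_Z, basis_Z) with k basis images each.
    """
    N, h, w = I.shape
    Xi = I.reshape(N, -1).astype(np.float64)
    Xz = Z.reshape(N, -1).astype(np.float64)

    mean_i, mean_z = Xi.mean(axis=0), Xz.mean(axis=0)
    Xi -= mean_i
    Xz -= mean_z

    # PCA (SVD) on the intensity channel only.
    U, s, Vt = np.linalg.svd(Xi, full_matrices=False)
    basis_i = Vt[:k]                      # (k, h*w) intensity eigen-images
    coeffs = Xi @ basis_i.T               # (N, k) projection weights

    # Transfer the intensity weights to depth: fit depth basis images that
    # best explain the depth data from the same coefficients.
    basis_z, *_ = np.linalg.lstsq(coeffs, Xz, rcond=None)   # (k, h*w)

    return (mean_i.reshape(h, w), basis_i.reshape(k, h, w),
            mean_z.reshape(h, w), basis_z.reshape(k, h, w))
```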

    29. Reconstruction (where the subwindow comes from).

    30. Reconstruction Lambda = var(I) / var(Z).

    31. Reconstruction All views and depth; the equation for reconstruction.
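
A small follow-up sketch continuing the hypothetical names from the eigenspace sketch above: a probe subwindow is projected onto one view's intensity eigenspace, both intensity and depth are reconstructed, and the depth residual is weighted by lambda = var(I)/var(Z) as on slide 30. The combined score is an assumption for illustration, not the original equation.

```python
import numpy as np

def reconstruct_and_score(patch_i, patch_z, mean_i, basis_i, mean_z, basis_z):
    """Reconstruct a probe subwindow from one view's eigenspace and score the fit."""
    Bi = basis_i.reshape(len(basis_i), -1)
    Bz = basis_z.reshape(len(basis_z), -1)

    c = (patch_i - mean_i).reshape(-1) @ Bi.T           # coefficients from intensity
    rec_i = mean_i + (c @ Bi).reshape(mean_i.shape)
    rec_z = mean_z + (c @ Bz).reshape(mean_z.shape)

    lam = np.var(patch_i) / (np.var(patch_z) + 1e-12)   # intensity/depth balance
    score = np.sum((patch_i - rec_i) ** 2) + lam * np.sum((patch_z - rec_z) ** 2)
    return rec_i, rec_z, score
```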

    32. Pose Estimation Add depth images. The motion model is constant (identity matrix), i.e., a random walk.

    33. Pose Estimation Add depth images. The motion model is constant (identity matrix), i.e., a random walk; the deltas are pose-change measurements.
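
A minimal sketch of the filtering idea on slides 32-33, assuming the 6-DOF pose is tracked with a random-walk motion model (identity state transition) and that registration against keyframes yields delta-pose measurements. Matrix names follow generic Kalman-filter notation, not the slides'.

```python
import numpy as np

class RandomWalkPoseFilter:
    """Kalman filter over a 6-DOF pose with an identity (random walk) motion model."""

    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(6)            # [tx, ty, tz, rx, ry, rz]
        self.P = np.eye(6)              # state covariance
        self.Q = q * np.eye(6)          # process noise (random-walk step size)
        self.R = r * np.eye(6)          # measurement noise

    def predict(self):
        # Identity state transition: the pose is expected to stay put.
        self.P = self.P + self.Q

    def update(self, delta_pose):
        """delta_pose: measured pose change relative to the current estimate."""
        z = self.x + delta_pose         # turn the delta into an absolute measurement
        S = self.P + self.R
        K = self.P @ np.linalg.inv(S)   # Kalman gain (measurement matrix = identity)
        self.x = self.x + K @ (z - self.x)
        self.P = (np.eye(6) - K) @ self.P
```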

    34. Experiments Image sequences from stereo cameras. Prior model: 14 subjects in 28 orientations. Ground truth from an Inertia Cube sensor. Compare with the OSU pose estimator [Srinivasan & Boyer '02], using the same training set for the eigenspaces; it learns an interpolation between correlation coefficients to estimate pose.

    35. Results Subtract the ground truth; merge Rx, Ry, Rz.

    36. Exploiting cascades for speed But the correlation search step is very slow! Using a cascade detection paradigm [Viola, Jones], many patterns can be quickly rejected. Set the false negative rate to be very low (e.g. 1%) per stage; each stage may have a low hit rate (30-40%), but the overall architecture is efficient and accurate. Multi-view cascade detection gives a coarse initial pose estimate. In general, simple classifiers are more efficient but also weaker. We could define a computational risk hierarchy (in analogy with structural risk minimization): a nested set of classifier classes. The training process is reminiscent of boosting, in that previous classifiers reweight the examples used to train subsequent classifiers, but the goal is different: instead of minimizing errors, minimize false positives.
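
A schematic sketch of the rejection-cascade idea in the spirit of Viola and Jones: each stage is a cheap classifier whose threshold is tuned to almost never reject a true face, so most non-face windows are discarded after only a stage or two of computation. The stage classifiers and thresholds are placeholders.

```python
def cascade_detect(window, stages):
    """stages: list of (score_fn, threshold) pairs, cheapest first.
    Each threshold is tuned so the stage keeps ~99% of true positives,
    even if it only rejects 30-40% of negatives. A window survives
    only if it passes every stage."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False        # rejected early: no further computation spent
    return True                 # passed all stages: candidate detection
```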

    37. Pose aware interfaces The interface agent responds to the user's gaze: the agent should know when it is being attended to (turn-taking pragmatics; eventually, anaphora + object reference). Prototype smart-room interface "sam"; early experiments with the face tracker on a meeting room table…

    38. SAM

    40. Head nod detection Track the 6-DOF motion of head nod and shake gestures; experiment with a simple motion energy ratio test. Initial results are promising.
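
A plausible sketch of a motion-energy ratio test, assuming a short buffer of recent pitch and yaw angular velocities from the 6-DOF head tracker (hypothetical inputs and thresholds): nods show dominant pitch energy, shakes dominant yaw energy.

```python
import numpy as np

def nod_or_shake(pitch_vel, yaw_vel, energy_thresh=1.0, ratio_thresh=2.0):
    """pitch_vel, yaw_vel: recent angular-velocity samples (e.g., the last ~1 s).
    Returns 'nod', 'shake', or None based on a simple motion-energy ratio test."""
    e_pitch = float(np.sum(np.square(pitch_vel)))
    e_yaw = float(np.sum(np.square(yaw_vel)))
    if e_pitch + e_yaw < energy_thresh:
        return None                         # head essentially still
    if e_pitch > ratio_thresh * e_yaw:
        return "nod"                        # vertical rotation dominates
    if e_yaw > ratio_thresh * e_pitch:
        return "shake"                      # horizontal rotation dominates
    return None
```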

    41. Today Speaker segregation using audio-visual mutual information discard background sounds separate multiple conversational streams Head pose detection and tracking with multi-view appearance models attention agreement Articulated pose tracking by learning model constraints, or example-based inference… gesture “body language”

    42. Articulated pose sensing

    43. Learning Articulated Tracking Model-based approach works for 3-D data and pure articulation constraints… Need to learn joint limits and other behavioral constraints (with a classic model-based tracker) Without direct 3-D data, example-based techniques are most promising…

    44. Model-based Approach

    45. Model-based Approach

    46. Model-based Approach

    47. ICP with articulated motion constraint Minimize the distance between the 3-D data and the 3-D articulated model. Apply ICP to each object in the articulated model to find a motion (twist) d_k = (w, t) with covariance L_k for each limb. Enforce joint constraints: find a set of motions d_k' close to the original motions that satisfy the joint constraints. Pure articulation can be expressed as a linear projection on the stacked rigid motions.
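
A sketch of the constraint-enforcement step under an assumed formulation: the per-limb twists are stacked into one vector d with block-diagonal covariance, and pure articulation is written as a linear constraint A d = 0 (notation assumed here, not the slide's). The corrected motion d' is the covariance-weighted projection of d onto the constraint's null space.

```python
import numpy as np
from scipy.linalg import block_diag

def enforce_articulation(twists, covariances, A):
    """Project per-limb rigid motions onto the articulated-motion subspace.

    twists      : list of 6-vectors d_k = (w, t), one per limb, from ICP
    covariances : list of 6x6 covariance matrices L_k, one per limb
    A           : constraint matrix expressing pure articulation as A d = 0
    Returns the stacked motion d' closest to d in the covariance-weighted
    (Mahalanobis) sense that satisfies the constraint.
    """
    d = np.concatenate(twists)
    Sigma = block_diag(*covariances)
    lam = np.linalg.solve(A @ Sigma @ A.T, A @ d)   # Lagrange multipliers
    return d - Sigma @ A.T @ lam
```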

    48. Non-linear constraints Limitations of pure articulation constraints: they cannot capture the limits on the range of motion of human joints, nor behavioral limits on body pose. Learning approach: learn a discriminative model of valid / invalid pose and train an SVM for use as a Lagrangian constraint. Valid body poses are extracted from mocap data (150,000 poses); invalid body poses are generated randomly. Cross-validation classification error rates are around 0.061%. Three key parts and decision functions: body tracker, gesture detection, and gesture classification; gesture classification and body tracking are computationally expensive.
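
A rough sketch of the valid/invalid pose classifier, assuming joint-angle vectors from mocap as the positive class and random vectors drawn over the observed joint ranges as the negative class; the data loading, kernel choice, and sampling scheme are illustrative assumptions, not the published setup.

```python
import numpy as np
from sklearn.svm import SVC

def train_pose_validity_svm(valid_poses, n_invalid=None, seed=0):
    """valid_poses: (N, D) array of joint-angle vectors from mocap.
    Random poses drawn uniformly over the per-joint ranges serve as the
    'invalid' class; an RBF-kernel SVM separates the two."""
    rng = np.random.default_rng(seed)
    n_invalid = n_invalid or len(valid_poses)
    lo, hi = valid_poses.min(axis=0), valid_poses.max(axis=0)
    invalid = rng.uniform(lo, hi, size=(n_invalid, valid_poses.shape[1]))

    X = np.vstack([valid_poses, invalid])
    y = np.concatenate([np.ones(len(valid_poses)), -np.ones(n_invalid)])

    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X, y)
    return clf   # clf.decision_function(pose) can serve as a soft validity constraint
```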

    49. Video

    50. Multimodal gestures Three key parts and decision functions: body tracker, gesture detection, and gesture classification; gesture classification and body tracking are computationally expensive.

    51. Learning pose without 3-D observations The model-based approach is difficult with more impoverished observations, e.g., contour or edge features. Example-based learning approach: generate a corpus of training data with a model (Poser); find nearest neighbors using fast hashing techniques (LSH); optionally use local regression on the nearest neighbors. With segmented contours: shape context features and bipartite graph matching via the Earth Mover's Distance. With unsegmented edge features: feature selection using a paired classification problem, extending LSH to "Parameter Sensitive Hashing".
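
A compact sketch of the example-based lookup: a corpus of (feature, pose) pairs is hashed with random-hyperplane LSH, and a query retrieves approximate nearest neighbors whose poses can then be averaged or fed to local regression. Feature extraction is assumed done elsewhere, and a single hash table is used for brevity (real systems use several).

```python
import numpy as np
from collections import defaultdict

class LSHPoseLookup:
    """Random-hyperplane LSH over image features, mapping to stored poses."""

    def __init__(self, features, poses, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, features.shape[1]))
        self.features, self.poses = features, poses
        self.buckets = defaultdict(list)
        for i, key in enumerate(self._keys(features)):
            self.buckets[key].append(i)

    def _keys(self, X):
        bits = (X @ self.planes.T) > 0                 # sign of each projection
        return [tuple(row) for row in bits]

    def query(self, feature, k=5):
        """Return up to k approximate nearest neighbors' poses."""
        idx = self.buckets.get(self._keys(feature[None, :])[0], [])
        if not idx:
            return np.empty((0, self.poses.shape[1]))
        d = np.linalg.norm(self.features[idx] - feature, axis=1)
        best = np.array(idx)[np.argsort(d)[:k]]
        return self.poses[best]
```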

    52. Parameter sensitive hashing When an explicit feature (shape context) is not available, feature selection is needed. Features for an optimal distance can be found by training a classifier on an equivalence task. LSH + classifier-based feature selection = PSH, i.e., hashing functions sensitive to distance in a parameter space, not feature space. "Parameter Sensitive Hashing" [Shakhnarovich et al.]
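
A minimal sketch of the feature-selection step behind this idea: candidate single-feature threshold (stump) hash bits are scored on pairs of examples labeled by whether their poses are close in parameter space, and only bits that tend to agree on close pairs and disagree on far pairs are kept. The thresholds and the pair-labeling rule are assumptions for illustration, not the published training procedure.

```python
import numpy as np

def select_psh_bits(features, poses, n_pairs=2000, pose_eps=0.2, n_keep=32, seed=0):
    """Pick stump hash bits h(x) = [x[j] > t] that respect parameter-space proximity.

    features : (N, D) image feature vectors
    poses    : (N, P) pose parameter vectors
    Returns a list of (feature_index, threshold) for the selected bits.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(features), n_pairs)
    j = rng.integers(0, len(features), n_pairs)
    close = np.linalg.norm(poses[i] - poses[j], axis=1) < pose_eps   # pair labels

    scores = []
    for dim in range(features.shape[1]):
        t = np.median(features[:, dim])              # one candidate threshold per dim
        agree = (features[i, dim] > t) == (features[j, dim] > t)
        # A good parameter-sensitive bit agrees on close pairs, disagrees on far ones.
        acc = np.mean(agree == close)
        scores.append((acc, dim, t))

    scores.sort(reverse=True)
    return [(dim, t) for _, dim, t in scores[:n_keep]]
```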

    53. Parameter sensitive hashing

    54. Saturday Workshop

    55. Schedule

    56. Today Learning methods are critical for robust estimation of synchrony, pose and other conversational context cues: Speaker segregation using audiovisual mutual information Head pose estimation using multi-view manifolds and detection cascade trees Real-time articulated tracking from stereo data with SVM-based joint constraints Monocular tracking using example-based inference with fast nearest neighbor methods

    57. Acknowledgements Greg Shakhnarovich Kristen Grauman Neal Checka David Demirdjian Theresa Ko John Fisher Louis-Philippe Morency Mike Siracusa …
