
Learning and Vision for Multimodal Conversational Interfaces. Trevor Darrell, Vision Interface Group, MIT CSAIL Lab.

Natural Interfaces. Conversation would improve many interactions. Currently, conversational interfaces break down in most situations with more than one user, or with references to real-world objects. What is missing is visual context.


Presentation Transcript


    1. Learning and Vision for Multimodal Conversational Interfaces

    3. Visual Context for Conversation Who is there? (presence, identity) Which person said that? (audiovisual grouping) Where are they? (location) What are they looking / pointing at? (pose, gaze) What are they doing? (activity)

    4. Learning Visual conversational context cues are hard to model analytically, so learning methods are appropriate. Different techniques suit different cues, levels of representation, and input modes... (at least for now…)

    5. Today Speaker segregation using audio-visual mutual information discard background sounds separate multiple conversational streams Head pose detection and tracking with multi-view appearance models attention agreement Articulated pose tracking by learning model constraints, or example-based inference… gesture “body language”

    6. Today Speaker segregation using audio-visual mutual information discard background sounds separate multiple conversational streams Head pose detection and tracking with multi-view appearance models attention agreement Articulated pose tracking by learning model constraints, or example-based inference… gesture “body language”

    7. Is that you talking?

    8. Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)?

    9. Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)? Model-free?

    10. Audio-visual synchrony Yes, by learning a model of audio-visual synchrony. Three approaches: pixel-wise correlation with video [Hershey and Movellan]; correlation of an optimal projection [Slaney and Covell]; non-parametric mutual information analysis on an optimal projection [Fisher et al.]
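
A minimal sketch of the first, pixel-wise approach, in the spirit of Hershey and Movellan but not their implementation. It assumes a grayscale video array `video` of shape (T, H, W) and a per-frame audio energy track `audio` of length T (hypothetical names); each pixel is scored by the squared correlation of its intensity over time with the audio energy.

```python
import numpy as np

def pixelwise_av_correlation(video, audio):
    """Score each pixel by its squared correlation with the audio track.

    video : (T, H, W) grayscale frames
    audio : (T,) per-frame audio energy
    Returns an (H, W) map; large values suggest pixels synchronized with the audio.
    """
    T = video.shape[0]
    v = video.reshape(T, -1).astype(np.float64)
    a = audio.astype(np.float64)

    v -= v.mean(axis=0)               # center each pixel's time series
    a -= a.mean()

    cov = a @ v / T                   # covariance of audio with every pixel
    denom = np.sqrt((a @ a / T) * (v * v).mean(axis=0)) + 1e-12
    rho = cov / denom                 # Pearson correlation per pixel
    return (rho ** 2).reshape(video.shape[1:])
```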

    11. Audio-based Image localization E.g., locate visual sources given audio information:

    12. Audio-based Image localization Image variance (ignoring audio) will find all motion in the sequence:

    13. Audio-based Image localization Estimate mutual information between audio and video:

    15. Canonical correlation projection Different from Hershey and Movellan in that it asks what combination of audio and video data produces the best correlation, rather than treating each pixel independently. However, the performance depends on both the training and testing data sizes. Perhaps the best contribution was showing that MFCC and LPC representations were much better than audio power or a spectrogram with this technique.
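
A rough sketch of the canonical correlation idea, in the spirit of Slaney and Covell but not their implementation: assuming per-frame audio features `A` (e.g., MFCCs) and video features `V` have already been extracted (hypothetical names), the projections that maximize audio-video correlation come from an SVD of the whitened cross-covariance.

```python
import numpy as np

def cca_first_pair(A, V, reg=1e-6):
    """First canonical correlation between audio features A (T, da)
    and video features V (T, dv), via whitening + SVD.
    Returns (correlation, audio_direction, video_direction)."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    T = A.shape[0]

    def whiten(X):
        C = X.T @ X / T + reg * np.eye(X.shape[1])     # regularized covariance
        vals, vecs = np.linalg.eigh(C)
        W = vecs @ np.diag(vals ** -0.5) @ vecs.T      # C^{-1/2}
        return X @ W, W

    Aw, Wa = whiten(A)
    Vw, Wv = whiten(V)
    U, s, Vt = np.linalg.svd(Aw.T @ Vw / T)            # whitened cross-covariance
    return s[0], Wa @ U[:, 0], Wv @ Vt[0]
```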

    16. Non-parametric Mutual Information Match audio to video using adaptive feature basis Exploit joint statistics of image and audio signal Efficient non-parametric density estimation
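
A simplified stand-in for the idea on this slide, assuming the audio and video streams have already been projected onto 1-D features `fa` and `fv` (hypothetical names): mutual information is estimated non-parametrically from a joint histogram rather than from a Gaussian model. The actual work of Fisher et al. uses kernel density estimates and learns the projections jointly; this is only an illustrative sketch.

```python
import numpy as np

def histogram_mutual_information(fa, fv, bins=16):
    """Non-parametric MI estimate (in nats) between two 1-D signals."""
    joint, _, _ = np.histogram2d(fa, fv, bins=bins)
    pxy = joint / joint.sum()                    # joint distribution
    px = pxy.sum(axis=1, keepdims=True)          # marginal of the audio feature
    py = pxy.sum(axis=0, keepdims=True)          # marginal of the video feature
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```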

    17. Maximally Informative Subspace Key difference is that we don't have labels for the training data: we have to learn statistics for both modalities, learning projections that reveal "simple structure".

    18. Audio-visual synchrony detection Compute the similarity matrix for 8 subjects (example MI values: 0.68, 0.61, 0.19, 0.20):
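
A short usage sketch of the slide-18 experiment, reusing the histogram MI estimator above: every subject's audio feature track is scored against every subject's video feature track, and matched (on-diagonal) pairs should score highest. Subject data and feature extraction are placeholders.

```python
import numpy as np

def mi_similarity_matrix(audio_feats, video_feats):
    """audio_feats, video_feats: per-subject lists of 1-D feature tracks."""
    n = len(audio_feats)
    return np.array([[histogram_mutual_information(audio_feats[i], video_feats[j])
                      for j in range(n)]
                     for i in range(n)])
```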

    19. Today Speaker segregation using audio-visual mutual information discard background sounds separate multiple conversational streams Head pose detection and tracking with multi-view appearance models attention agreement Articulated pose tracking by learning model constraints, or example-based inference… gesture “body language”

    20. Head pose tracking

    21. Lots of Work on Face Pose Tracking… Cylindrical approx. [LaCascia & Sclaroff] 3D Mesh approx. [Essa] 3D Morphable model [Blanz & Vetter] Multi-view keyframes from 3D model [Vachetti et al.] View-based eigenspaces [Srinivasan & Boyer] [Pentland et al.] … Online methods are still hard to initialize; offline methods constrain the pose space.

    22. Pose Estimation

    23. User Dependent Keyframes

    24. User-Independent Prior Model

    25. 3D View-based Eigenspaces

    26. View-based Eigenspaces The images show the mean face plus the first 3 eigenvectors.

    27. 3D View-based Eigenspaces Per-view PCA; we also keep basis images for the depth channel.

    28. 3D View-based Eigenspaces Three ways to combine intensity and depth: (1) independent SVD on each channel; (2) concatenate I and Z, then SVD; (3) SVD on I, then transfer the weights to the depth channel. Informally, we found (3) to work best so far; we are still exploring this topic. We keep depth basis images rather than a separate depth eigenspace, because variation in depth is not independent of variation in intensity: intensity variation is more relevant for matching identity, and depth variation shows up along with intensity variation.
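
A minimal sketch of one plausible reading of option (3), assuming registered intensity crops `I` and aligned depth crops `Z` of shape (N, h, w) for a single view (hypothetical names): PCA is computed on intensity only, and the resulting per-example weights are used to fit matching depth basis images.

```python
import numpy as np

def per_view_eigenspace(I, Z, k=10):
    """Build an intensity eigenspace for one view plus matching depth basis images.

    I, Z : (N, h, w) registered intensity and depth crops for this view
    Returns (mean_I, basis_I, mean_Z, basis_Z) with k basis images each.
    """
    N, h, w = I.shape
    Xi = I.reshape(N, -1).astype(np.float64)
    Xz = Z.reshape(N, -1).astype(np.float64)

    mean_i, mean_z = Xi.mean(axis=0), Xz.mean(axis=0)
    Xi -= mean_i
    Xz -= mean_z

    # PCA (SVD) on the intensity channel only.
    U, s, Vt = np.linalg.svd(Xi, full_matrices=False)
    basis_i = Vt[:k]                      # (k, h*w) intensity eigen-images
    coeffs = Xi @ basis_i.T               # (N, k) projection weights

    # Transfer the intensity weights to depth: fit depth basis images that
    # best explain the depth data from the same coefficients.
    basis_z, *_ = np.linalg.lstsq(coeffs, Xz, rcond=None)   # (k, h*w)

    return (mean_i.reshape(h, w), basis_i.reshape(k, h, w),
            mean_z.reshape(h, w), basis_z.reshape(k, h, w))
```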

    29. Reconstruction (where the subwindow comes from).

    30. Reconstruction Lambda = var(I) / var(Z).

    31. Reconstruction All views and depth; the equation for reconstruction.
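
A small follow-up sketch continuing the hypothetical names from the eigenspace sketch above: a probe subwindow is projected onto one view's intensity eigenspace, both intensity and depth are reconstructed, and the depth residual is weighted by lambda = var(I)/var(Z) as on slide 30. The combined score is an assumption for illustration, not the original equation.

```python
import numpy as np

def reconstruct_and_score(patch_i, patch_z, mean_i, basis_i, mean_z, basis_z):
    """Reconstruct a probe subwindow from one view's eigenspace and score the fit."""
    Bi = basis_i.reshape(len(basis_i), -1)
    Bz = basis_z.reshape(len(basis_z), -1)

    c = (patch_i - mean_i).reshape(-1) @ Bi.T           # coefficients from intensity
    rec_i = mean_i + (c @ Bi).reshape(mean_i.shape)
    rec_z = mean_z + (c @ Bz).reshape(mean_z.shape)

    lam = np.var(patch_i) / (np.var(patch_z) + 1e-12)   # intensity/depth balance
    score = np.sum((patch_i - rec_i) ** 2) + lam * np.sum((patch_z - rec_z) ** 2)
    return rec_i, rec_z, score
```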

    32. Pose Estimation Add depth images. The motion model is constant (identity matrix), i.e., a random walk.

    33. Pose Estimation Add depth images. The motion model is constant (identity matrix), i.e., a random walk; the deltas are pose-change measurements.
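
A minimal sketch of the filtering idea on slides 32-33, assuming the 6-DOF pose is tracked with a random-walk motion model (identity state transition) and that registration against keyframes yields delta-pose measurements. Matrix names follow generic Kalman-filter notation, not the slides'.

```python
import numpy as np

class RandomWalkPoseFilter:
    """Kalman filter over a 6-DOF pose with an identity (random walk) motion model."""

    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(6)            # [tx, ty, tz, rx, ry, rz]
        self.P = np.eye(6)              # state covariance
        self.Q = q * np.eye(6)          # process noise (random-walk step size)
        self.R = r * np.eye(6)          # measurement noise

    def predict(self):
        # Identity state transition: the pose is expected to stay put.
        self.P = self.P + self.Q

    def update(self, delta_pose):
        """delta_pose: measured pose change relative to the current estimate."""
        z = self.x + delta_pose         # turn the delta into an absolute measurement
        S = self.P + self.R
        K = self.P @ np.linalg.inv(S)   # Kalman gain (measurement matrix = identity)
        self.x = self.x + K @ (z - self.x)
        self.P = (np.eye(6) - K) @ self.P
```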

    34. Experiments Image sequences from stereo cameras. Prior model: 14 subjects in 28 orientations. Ground truth from an Inertia Cube sensor. Compare with the OSU pose estimator [Srinivasan & Boyer '02], using the same training set for the eigenspaces; it learns an interpolation between correlation coefficients to estimate pose.

    35. Results Subtract the ground truth; merge Rx, Ry, Rz.

    36. Exploiting cascades for speed But the correlation search step is very slow! Using a cascade detection paradigm [Viola, Jones], many patterns can be quickly rejected. Set the false negative rate to be very low (e.g. 1%) per stage; each stage may have a low hit rate (30-40%), but the overall architecture is efficient and accurate. Multi-view cascade detection gives a coarse initial pose estimate. In general, simple classifiers are more efficient but also weaker. We could define a computational risk hierarchy (in analogy with structural risk minimization): a nested set of classifier classes. The training process is reminiscent of boosting, in that previous classifiers reweight the examples used to train subsequent classifiers, but the goal is different: instead of minimizing errors, minimize false positives.
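
A schematic sketch of the rejection-cascade idea in the spirit of Viola and Jones: each stage is a cheap classifier whose threshold is tuned to almost never reject a true face, so most non-face windows are discarded after only a stage or two of computation. The stage classifiers and thresholds are placeholders.

```python
def cascade_detect(window, stages):
    """stages: list of (score_fn, threshold) pairs, cheapest first.
    Each threshold is tuned so the stage keeps ~99% of true positives,
    even if it only rejects 30-40% of negatives. A window survives
    only if it passes every stage."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False        # rejected early: no further computation spent
    return True                 # passed all stages: candidate detection
```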

    37. Pose aware interfaces The interface agent responds to the user's gaze: the agent should know when it is being attended to (turn-taking pragmatics; eventually, anaphora + object reference). Prototype smart-room interface "sam"; early experiments with the face tracker on a meeting room table…

    38. SAM

    40. Head nod detection Track the 6-DOF motion of head nod and shake gestures; experiment with a simple motion energy ratio test. Initial results are promising.
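
A plausible sketch of a motion-energy ratio test, assuming a short buffer of recent pitch and yaw angular velocities from the 6-DOF head tracker (hypothetical inputs and thresholds): nods show dominant pitch energy, shakes dominant yaw energy.

```python
import numpy as np

def nod_or_shake(pitch_vel, yaw_vel, energy_thresh=1.0, ratio_thresh=2.0):
    """pitch_vel, yaw_vel: recent angular-velocity samples (e.g., the last ~1 s).
    Returns 'nod', 'shake', or None based on a simple motion-energy ratio test."""
    e_pitch = float(np.sum(np.square(pitch_vel)))
    e_yaw = float(np.sum(np.square(yaw_vel)))
    if e_pitch + e_yaw < energy_thresh:
        return None                         # head essentially still
    if e_pitch > ratio_thresh * e_yaw:
        return "nod"                        # vertical rotation dominates
    if e_yaw > ratio_thresh * e_pitch:
        return "shake"                      # horizontal rotation dominates
    return None
```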

    41. Today Speaker segregation using audio-visual mutual information discard background sounds separate multiple conversational streams Head pose detection and tracking with multi-view appearance models attention agreement Articulated pose tracking by learning model constraints, or example-based inference… gesture “body language”

    42. Articulated pose sensing

    43. Learning Articulated Tracking Model-based approach works for 3-D data and pure articulation constraints… Need to learn joint limits and other behavioral constraints (with a classic model-based tracker) Without direct 3-D data, example-based techniques are most promising…

    44. Model-based Approach

    45. Model-based Approach

    46. Model-based Approach

    47. ICP with articulated motion constraint Minimize the distance between the 3-D data and the 3-D articulated model. Apply ICP to each object in the articulated model to find a motion (twist) d_k = (w, t) with covariance L_k for each limb. Enforce joint constraints: find a set of motions d_k' close to the original motions that satisfy the joint constraints. Pure articulation can be expressed as a linear projection on the stacked rigid motions.
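
A sketch of the constraint-enforcement step under an assumed formulation: the per-limb twists are stacked into one vector d with block-diagonal covariance, and pure articulation is written as a linear constraint A d = 0 (notation assumed here, not the slide's). The corrected motion d' is the covariance-weighted projection of d onto the constraint's null space.

```python
import numpy as np
from scipy.linalg import block_diag

def enforce_articulation(twists, covariances, A):
    """Project per-limb rigid motions onto the articulated-motion subspace.

    twists      : list of 6-vectors d_k = (w, t), one per limb, from ICP
    covariances : list of 6x6 covariance matrices L_k, one per limb
    A           : constraint matrix expressing pure articulation as A d = 0
    Returns the stacked motion d' closest to d in the covariance-weighted
    (Mahalanobis) sense that satisfies the constraint.
    """
    d = np.concatenate(twists)
    Sigma = block_diag(*covariances)
    lam = np.linalg.solve(A @ Sigma @ A.T, A @ d)   # Lagrange multipliers
    return d - Sigma @ A.T @ lam
```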

    48. Non-linear constraints Limitations of pure articulation constraints: they cannot capture the limits on the range of motion of human joints, nor behavioral limits on body pose. Learning approach: learn a discriminative model of valid / invalid pose and train an SVM for use as a Lagrangian constraint. Valid body poses are extracted from mocap data (150,000 poses); invalid body poses are generated randomly. Cross-validation classification error rates are around 0.061%. Three key parts and decision functions: body tracker, gesture detection, and gesture classification; gesture classification and body tracking are computationally expensive.
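
A rough sketch of the valid/invalid pose classifier, assuming joint-angle vectors from mocap as the positive class and random vectors drawn over the observed joint ranges as the negative class; the data loading, kernel choice, and sampling scheme are illustrative assumptions, not the published setup.

```python
import numpy as np
from sklearn.svm import SVC

def train_pose_validity_svm(valid_poses, n_invalid=None, seed=0):
    """valid_poses: (N, D) array of joint-angle vectors from mocap.
    Random poses drawn uniformly over the per-joint ranges serve as the
    'invalid' class; an RBF-kernel SVM separates the two."""
    rng = np.random.default_rng(seed)
    n_invalid = n_invalid or len(valid_poses)
    lo, hi = valid_poses.min(axis=0), valid_poses.max(axis=0)
    invalid = rng.uniform(lo, hi, size=(n_invalid, valid_poses.shape[1]))

    X = np.vstack([valid_poses, invalid])
    y = np.concatenate([np.ones(len(valid_poses)), -np.ones(n_invalid)])

    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X, y)
    return clf   # clf.decision_function(pose) can serve as a soft validity constraint
```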

    49. Video

    50. Multimodal gestures Three key parts and decision functions: body tracker, gesture detection, and gesture classification; gesture classification and body tracking are computationally expensive.

    51. Learning pose without 3-D observations The model-based approach is difficult with more impoverished observations, e.g., contour or edge features. Example-based learning approach: generate a corpus of training data with a model (Poser); find nearest neighbors using fast hashing techniques (LSH); optionally use local regression on the nearest neighbors. With segmented contours: shape context features and bipartite graph matching via the Earth Mover's Distance. With unsegmented edge features: feature selection using a paired classification problem, extending LSH to "Parameter Sensitive Hashing".
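
A compact sketch of the example-based lookup: a corpus of (feature, pose) pairs is hashed with random-hyperplane LSH, and a query retrieves approximate nearest neighbors whose poses can then be averaged or fed to local regression. Feature extraction is assumed done elsewhere, and a single hash table is used for brevity (real systems use several).

```python
import numpy as np
from collections import defaultdict

class LSHPoseLookup:
    """Random-hyperplane LSH over image features, mapping to stored poses."""

    def __init__(self, features, poses, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, features.shape[1]))
        self.features, self.poses = features, poses
        self.buckets = defaultdict(list)
        for i, key in enumerate(self._keys(features)):
            self.buckets[key].append(i)

    def _keys(self, X):
        bits = (X @ self.planes.T) > 0                 # sign of each projection
        return [tuple(row) for row in bits]

    def query(self, feature, k=5):
        """Return up to k approximate nearest neighbors' poses."""
        idx = self.buckets.get(self._keys(feature[None, :])[0], [])
        if not idx:
            return np.empty((0, self.poses.shape[1]))
        d = np.linalg.norm(self.features[idx] - feature, axis=1)
        best = np.array(idx)[np.argsort(d)[:k]]
        return self.poses[best]
```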

    52. Parameter sensitive hashing When an explicit feature (shape context) is not available, feature selection is needed. Features for an optimal distance can be found by training a classifier on an equivalence task. LSH + classifier-based feature selection = PSH, i.e., hashing functions sensitive to distance in a parameter space, not feature space. "Parameter Sensitive Hashing" [Shakhnarovich et al.]
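
A minimal sketch of the feature-selection step behind this idea: candidate single-feature threshold (stump) hash bits are scored on pairs of examples labeled by whether their poses are close in parameter space, and only bits that tend to agree on close pairs and disagree on far pairs are kept. The thresholds and the pair-labeling rule are assumptions for illustration, not the published training procedure.

```python
import numpy as np

def select_psh_bits(features, poses, n_pairs=2000, pose_eps=0.2, n_keep=32, seed=0):
    """Pick stump hash bits h(x) = [x[j] > t] that respect parameter-space proximity.

    features : (N, D) image feature vectors
    poses    : (N, P) pose parameter vectors
    Returns a list of (feature_index, threshold) for the selected bits.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(features), n_pairs)
    j = rng.integers(0, len(features), n_pairs)
    close = np.linalg.norm(poses[i] - poses[j], axis=1) < pose_eps   # pair labels

    scores = []
    for dim in range(features.shape[1]):
        t = np.median(features[:, dim])              # one candidate threshold per dim
        agree = (features[i, dim] > t) == (features[j, dim] > t)
        # A good parameter-sensitive bit agrees on close pairs, disagrees on far ones.
        acc = np.mean(agree == close)
        scores.append((acc, dim, t))

    scores.sort(reverse=True)
    return [(dim, t) for _, dim, t in scores[:n_keep]]
```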

    53. Parameter sensitive hashing

    54. Saturday Workshop

    55. Schedule

    56. Today Learning methods are critical for robust estimation of synchrony, pose and other conversational context cues: Speaker segregation using audiovisual mutual information Head pose estimation using multi-view manifolds and detection cascade trees Real-time articulated tracking from stereo data with SVM-based joint constraints Monocular tracking using example-based inference with fast nearest neighbor methods

    57. Acknowledgements Greg Shakhnarovich Kristen Grauman Neal Checka David Demirdjian Theresa Ko John Fisher Louis-Philippe Morency Mike Siracusa …
