
CALO VISUAL INTERFACE RESEARCH PROGRESS


Presentation Transcript


  1. CALO VISUAL INTERFACE RESEARCH PROGRESS David Demirdjian, Trevor Darrell, MIT CSAIL

  2. pTablet (or pLaptop!) Goal: visual cues to conversation or interaction state: • presence • attention • turn-taking • agreement and grounding gestures • emotion and expression cues • visual speech features

  3. Functional Capabilities Help CALO infer: • whether the user is still participating in a conversation or interaction • whether the user is focused on the interface or listening to another person • when the user is speaking, plus further features pertaining to visual speech • non-verbal cues indicating whether the user is confirming understanding of, or agreement with, the current topic or question • whether the user is confused or irritated, both for meeting understanding and for the CALO UI…

  4. Machine Learning Research Challenges Focusing on learning methods that capture personalized interaction: • Articulatory models of visual speech • Sample-based methods for body tracking • Hidden-state conditional random fields • Context-based gesture recognition (Not all are yet in the deployed demo…)

  5. Articulatory models of visual speech: • Traditional models of visual speech presume synchronous units based on visemes, the visual correlate of phonemes. • Audiovisual speech production is often asynchronous. • We model speech with a series of loosely coupled streams of articulatory features. (See Saenko and Darrell, ICMI 2004, and Saenko et al., ICCV 2005, for more information.)
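To give a feel for the "loosely coupled streams" idea, here is a minimal sketch (not the models from the cited papers): two hypothetical articulatory feature streams are decoded by independent small HMMs, so each stream may change state on its own schedule rather than synchronously. The stream names, transition matrices, and observation likelihoods below are illustrative assumptions.

```python
# Minimal sketch of loosely coupled articulatory streams: each stream
# gets its own HMM and is decoded independently, so the streams need
# not change state at the same frame.  All parameters are placeholders.
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """log_obs: (T, S) per-frame log-likelihoods; returns the best state path."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # rows = previous state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two hypothetical binary streams (lip opening, lip rounding) with
# "sticky" transitions favoring staying in the same state.
log_init = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))

rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 2))                  # stand-in visual features
streams = {
    "lip_opening":  np.stack([-(frames[:, 0]) ** 2, -(frames[:, 0] - 1) ** 2], axis=1),
    "lip_rounding": np.stack([-(frames[:, 1]) ** 2, -(frames[:, 1] - 1) ** 2], axis=1),
}

# Decode each stream on its own; no frame-level synchrony is enforced.
paths = {name: viterbi(log_init, log_trans, obs) for name, obs in streams.items()}
print(paths["lip_opening"][:10], paths["lip_rounding"][:10])
```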

  6. Sample-based methods for body tracking • Tracking human bodies requires exploration of a high-dimensional state space. • Estimated posteriors are often sharp and multimodal. • New tracking techniques based on a novel approximate nearest-neighbor hashing method offer comprehensive pose coverage and optimally integrate information over time. • These techniques are suitable for real-time markerless motion capture, and for tracking the human body to infer attention and gesture. (See Demirdjian et al., ICCV 2005, and Taycher et al., CVPR 2006, for more information.)
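Below is a rough, hypothetical sketch of the approximate nearest-neighbor retrieval step that sample-based tracking relies on, using random-hyperplane locality-sensitive hashing to look up pose exemplars. The descriptor size, pose dimensionality, and exemplar database are placeholders; the method in the cited papers is more sophisticated and also integrates information over time.

```python
# Rough sketch of exemplar-based pose lookup with locality-sensitive
# hashing (random hyperplanes).  Feature/pose dimensions and the
# database are made up for the example.
import numpy as np

rng = np.random.default_rng(0)

D_FEAT, D_POSE, N_DB, N_BITS = 64, 30, 10_000, 16

db_features = rng.normal(size=(N_DB, D_FEAT))    # image descriptors of exemplars
db_poses = rng.normal(size=(N_DB, D_POSE))       # corresponding body poses

hyperplanes = rng.normal(size=(N_BITS, D_FEAT))  # one random hyperplane per hash bit

def hash_key(x):
    """Sign of each random projection gives one bit of the hash key."""
    bits = (hyperplanes @ x) > 0
    return bits.tobytes()

# Build the hash table: key -> indices of exemplars sharing that key.
table = {}
for i, f in enumerate(db_features):
    table.setdefault(hash_key(f), []).append(i)

def query_pose(feature):
    """Return candidate poses whose features collide with the query."""
    idx = table.get(hash_key(feature), [])
    if not idx:                                  # fall back to exact search on a miss
        idx = [int(np.argmin(np.linalg.norm(db_features - feature, axis=1)))]
    return db_poses[idx]

candidates = query_pose(rng.normal(size=D_FEAT))
print(candidates.shape)                          # a small set of plausible pose hypotheses
```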

  7. Hidden-state conditional random fields Discriminative techniques are efficient and accurate, and learn to represent only the portion of the state necessary for a specific task. Conditional random fields are effective at recognizing visual gestures, but lack the ability of generative models to capture gesture substructure through hidden state. We have developed a hidden-state conditional random field formulation. (See Wang et al., CVPR 2006.)

  8. Hidden Conditional Random Fields for Head Gesture Recognition 3 classes – Nods, Shakes, Junk
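To make the slide 7-8 formulation concrete, here is a minimal, hypothetical sketch of HCRF-style inference over the three head-gesture classes: the score of each class is a sum over hidden-state sequences of per-frame and transition potentials, computed with a forward recursion. In the real method the parameters are learned by maximizing conditional likelihood (Wang et al., CVPR 2006); here they are random placeholders, and the feature and hidden-state sizes are assumptions.

```python
# Minimal sketch of HCRF-style inference for 3 head-gesture classes
# (nod / shake / junk).  Parameters below are random placeholders.
import numpy as np
from scipy.special import logsumexp

CLASSES = ["nod", "shake", "junk"]
N_HIDDEN, D_FEAT = 4, 6

rng = np.random.default_rng(0)
W = rng.normal(size=(N_HIDDEN, D_FEAT))                  # hidden-state / observation weights
B = rng.normal(size=(len(CLASSES), N_HIDDEN))            # class / hidden-state compatibility
T = rng.normal(size=(len(CLASSES), N_HIDDEN, N_HIDDEN))  # class-specific transitions

def class_score(x, y):
    """log-sum over hidden-state sequences of the HCRF potential for class y."""
    node = x @ W.T + B[y]                                # (T, N_HIDDEN) per-frame potentials
    alpha = node[0]
    for t in range(1, len(x)):
        alpha = node[t] + logsumexp(alpha[:, None] + T[y], axis=0)
    return logsumexp(alpha)

def classify(x):
    scores = np.array([class_score(x, y) for y in range(len(CLASSES))])
    return CLASSES[int(scores.argmax())]                 # normalizer is shared across classes

sequence = rng.normal(size=(20, D_FEAT))                 # stand-in head-pose features
print(classify(sequence))
```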

  9. Context-based gesture recognition Recognition of a user's gestures should be done in the context of the current interaction. Visual recognition can be augmented with context cues from the interaction state: • conversational dialog with an embodied agent • interaction with a conventional windows-and-mouse interface. (See Morency, Sidner and Darrell, ICMI 2005, and Morency and Darrell, IUI 2006.)
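As a toy illustration of how interaction context can bias visual recognition (a simplified stand-in, not the models of the cited papers), the sketch below multiplies visual gesture likelihoods by a context prior keyed on the dialog state. All numbers, class names, and context labels are assumptions.

```python
# Toy fusion of a visual gesture likelihood with a context prior derived
# from the interaction state (e.g., the dialog manager just asked a
# yes/no question).  Values are illustrative only.
import numpy as np

CLASSES = ["nod", "shake", "other"]

# Context priors: after a yes/no question, nods and shakes are more likely.
CONTEXT_PRIORS = {
    "yes_no_question": np.array([0.45, 0.45, 0.10]),
    "no_question":     np.array([0.15, 0.15, 0.70]),
}

def fuse(visual_likelihoods, context):
    """Posterior over gestures = normalized (visual likelihood * context prior)."""
    post = np.asarray(visual_likelihoods) * CONTEXT_PRIORS[context]
    return post / post.sum()

# Ambiguous visual evidence: a weak nod could also be incidental motion.
visual = [0.40, 0.15, 0.45]
print(dict(zip(CLASSES, fuse(visual, "yes_no_question").round(3))))
print(dict(zip(CLASSES, fuse(visual, "no_question").round(3))))
```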

  10. User Adaptive Agreement Recognition • Person’s idiolect • User agreement from recognized speech and head gestures • multimodal co-training Challenges: • Asynchrony between modalities • “Missing data” problem
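The co-training bullet can be made concrete with a small, hypothetical sketch: two views (speech-derived and head-gesture-derived features) each train a classifier on a small labeled pool, and together they pseudo-label the unlabeled examples about which they are most confident. Everything below (features, data, the 10-example batch size) is a synthetic placeholder, and the asynchrony and missing-data challenges from the slide are not addressed.

```python
# Hedged sketch of multimodal co-training for agreement recognition over
# two views: speech features and head-gesture features (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n):
    y = rng.integers(0, 2, size=n)                       # 1 = agrees, 0 = disagrees
    speech = y[:, None] + rng.normal(scale=1.0, size=(n, 3))
    gesture = y[:, None] + rng.normal(scale=1.0, size=(n, 2))
    return speech, gesture, y

Xs_l, Xg_l, y_l = make_data(40)                          # small labeled set
Xs_u, Xg_u, _ = make_data(400)                           # large unlabeled set

for _ in range(5):                                       # a few co-training rounds
    clf_s = LogisticRegression().fit(Xs_l, y_l)
    clf_g = LogisticRegression().fit(Xg_l, y_l)

    # Each view nominates the unlabeled examples it is most confident about;
    # the nominated examples are pseudo-labeled and added to the labeled pool.
    conf_s = np.abs(clf_s.predict_proba(Xs_u)[:, 1] - 0.5)
    conf_g = np.abs(clf_g.predict_proba(Xg_u)[:, 1] - 0.5)
    pick = np.unique(np.concatenate([conf_s.argsort()[-10:], conf_g.argsort()[-10:]]))

    pseudo = ((clf_s.predict_proba(Xs_u[pick])[:, 1]
               + clf_g.predict_proba(Xg_u[pick])[:, 1]) / 2 > 0.5).astype(int)
    Xs_l = np.vstack([Xs_l, Xs_u[pick]])
    Xg_l = np.vstack([Xg_l, Xg_u[pick]])
    y_l = np.concatenate([y_l, pseudo])
    Xs_u = np.delete(Xs_u, pick, axis=0)
    Xg_u = np.delete(Xg_u, pick, axis=0)

print("labeled pool grew to", len(y_l), "examples")
```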

  11. Status • New pTablet functionalities: • Face/gaze tracking • Head gesture recognition (nod/shake) + Gaze • Lip/Mouth motion detection • User enrollment/recognition (ongoing work) • A/V Integration: • Audio-visual sync./calibration • Meeting visualization/understanding

  12. pTablet system [block diagram]: pTablet camera → VTracker (user model, frontal view; 6D pose) and Head Gesture Recognition (with speech audio) → OAA messages: person ID, head pose, gesture, lips moving

  13. Speaking activity detection • Face tracking as: Rigid pose

  14. Speaking activity detection • Face tracking as: Rigid pose + Non-rigid facial deformations

  15. Speaking activity detection Speaking activity ~ high motion energy in the mouth/lips region • weak assumption (e.g. a hand moving in front of the mouth will also trigger speaking activity detection) • but complements audio-based speaker detection well
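This cue is simple enough to sketch directly: compute frame-difference energy inside a mouth region and threshold it. The ROI coordinates, threshold, and smoothing window below are assumptions for illustration; in the real system the mouth region would come from the face tracker of slides 13-14.

```python
# Minimal sketch of the mouth-region motion-energy cue for speaking
# activity detection: frame differencing inside a mouth ROI, thresholded.
import numpy as np

def mouth_motion_energy(prev_frame, frame, mouth_roi):
    """Mean absolute frame difference inside the mouth region."""
    x, y, w, h = mouth_roi
    a = prev_frame[y:y + h, x:x + w].astype(np.float32)
    b = frame[y:y + h, x:x + w].astype(np.float32)
    return float(np.abs(b - a).mean())

def is_speaking(energies, threshold=4.0, window=5):
    """Smooth the most recent energies and compare against a threshold."""
    recent = np.asarray(energies[-window:])
    return bool(recent.mean() > threshold)

# Usage with synthetic grayscale frames; a real system would feed
# consecutive camera frames and a tracked mouth bounding box.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 255, size=(240, 320), dtype=np.uint8) for _ in range(6)]
roi = (140, 160, 60, 30)          # hypothetical mouth box: x, y, width, height
energies = [mouth_motion_energy(frames[i - 1], frames[i], roi) for i in range(1, 6)]
print(is_speaking(energies))
```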

  16. Speaking activity detection

  17. User enrollment/recognition Idea: • At startup, the user is automatically identified and logged in by the pTablet. • If the user is not recognized or is misrecognized, they log in manually. • Face recognition is based on a feature-set matching algorithm (The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, Grauman and Darrell, ICCV 2005).
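For intuition about the feature-set matching behind the visual login, here is a heavily simplified, hypothetical pyramid-match-style similarity: feature matches found in fine histogram bins count more than matches only found in coarse bins. The 2-D features, level count, and toy "enrolled user" sets are placeholders; the actual pyramid match kernel of Grauman and Darrell is more general and efficient.

```python
# Simplified sketch of a pyramid-match-style similarity between two SETS
# of local image features (toy data, dense joint-bin histograms).
import numpy as np

def pyramid_match(set_a, set_b, levels=4, lo=0.0, hi=1.0):
    """Weighted count of feature matches found at increasingly coarse bins."""
    d = set_a.shape[1]
    score, prev_matches = 0.0, 0.0
    for i in range(levels):
        n_bins = 2 ** (levels - 1 - i)               # finest bins first
        edges = np.linspace(lo, hi, n_bins + 1)
        # Quantize each feature to a joint bin index across all dimensions.
        qa = np.stack([np.digitize(set_a[:, j], edges) for j in range(d)], axis=1)
        qb = np.stack([np.digitize(set_b[:, j], edges) for j in range(d)], axis=1)
        ha, hb = {}, {}
        for cell in map(tuple, qa):
            ha[cell] = ha.get(cell, 0) + 1
        for cell in map(tuple, qb):
            hb[cell] = hb.get(cell, 0) + 1
        matches = 0.0
        for cell, count in hb.items():
            matches += min(count, ha.get(cell, 0))   # histogram intersection
        score += (1.0 / 2 ** i) * (matches - prev_matches)  # weight NEW matches
        prev_matches = matches
    return score

rng = np.random.default_rng(0)
enrolled = rng.random(size=(50, 2))                  # stored user's feature set
same_user = enrolled + rng.normal(scale=0.02, size=enrolled.shape)
other_user = rng.random(size=(50, 2))
print(pyramid_match(enrolled, same_user), pyramid_match(enrolled, other_user))
```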

  18. Audio-Visual Calibration • Temporal calibration: aligning audio with visual data. How? By aligning lip-motion energy in images with audio energy. • Geometric calibration: estimate camera location/orientation in the world coordinate system.
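The temporal part is essentially a 1-D alignment problem, so a small sketch may help: cross-correlate the lip-motion energy signal with the audio energy signal over candidate lags and keep the best one. The synthetic signals and the 7-frame delay below are assumptions purely for illustration.

```python
# Sketch of temporal A/V calibration: find the audio/video offset that
# best aligns lip-motion energy with audio energy via cross-correlation.
import numpy as np

def estimate_offset(lip_energy, audio_energy, max_lag):
    """Lag (in frames) by which audio trails video, chosen to maximize
    the normalized cross-correlation of the two energy signals."""
    a = (lip_energy - lip_energy.mean()) / (lip_energy.std() + 1e-9)
    b = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-9)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]     # pair lip[t] with audio[t + lag]
        else:
            x, y = a[-lag:], b[:len(b) + lag]    # audio leads video
        n = min(len(x), len(y))
        score = float(np.dot(x[:n], y[:n]) / n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Synthetic test: audio energy is the lip-motion signal delayed by 7 frames.
rng = np.random.default_rng(0)
lip = np.abs(rng.normal(size=300))
audio = np.roll(lip, 7) + rng.normal(scale=0.1, size=300)
print(estimate_offset(lip, audio, max_lag=20))   # expected: about 7
```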

  19. Audio-Visual Integration [diagram: CAMEO and pTablets]

  20. Calibration • Alternative approach to estimate the position/orientation of the pTablets, with or without a global view (e.g. from CAMEO) Idea: use discourse information (e.g. who is talking to whom, a dialog between two people) and local head pose to find the location of the pTablets…

  21. A/V Integration • AVIntegrator: • same functionalities as Year 2 (e.g. includes activity recognition, etc…) • modified to accept calibration data estimated externally

  22. Integration and activity estimation • A/V integration • Activity estimation: • Who's in the room? • Who is looking at whom? • Who is talking to whom? • …
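One of these queries, "who is looking at whom", can be sketched as simple geometry once each person's track is in a common coordinate frame: compare each person's gaze direction with the bearing to every other person. The positions, gaze vectors, and 15-degree threshold below are illustrative assumptions.

```python
# Toy sketch of "who is looking at whom" from head positions and head-pose
# (gaze) directions in a shared 2-D coordinate frame.
import numpy as np

PEOPLE = {
    "alice": {"pos": np.array([0.0, 0.0]), "gaze": np.array([1.0, 0.1])},
    "bob":   {"pos": np.array([2.0, 0.0]), "gaze": np.array([-1.0, 0.0])},
    "carol": {"pos": np.array([1.0, 2.0]), "gaze": np.array([0.0, 1.0])},
}

def looking_at(name, max_angle_deg=15.0):
    """Return the person closest to the gaze ray of `name`, if within the threshold."""
    me = PEOPLE[name]
    gaze = me["gaze"] / np.linalg.norm(me["gaze"])
    best, best_angle = None, max_angle_deg
    for other, o in PEOPLE.items():
        if other == name:
            continue
        to_other = o["pos"] - me["pos"]
        to_other = to_other / np.linalg.norm(to_other)
        angle = np.degrees(np.arccos(np.clip(gaze @ to_other, -1.0, 1.0)))
        if angle < best_angle:
            best, best_angle = other, angle
    return best

for person in PEOPLE:
    print(person, "->", looking_at(person))
```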

  23. A/V Integrator system [block diagram]: pTablet VTrackers and the CAMEO VTracker feed the A/V Integrator, together with calibration information and discourse/dialog and speech recognition input (e.g. current speaker); the integrator emits OAA/MOKB messages: user list, speaker, agrees/disagrees, who speaks to whom.

  24. A/V Integration

  25. Demonstration? • Real-time meeting understanding • Use of the pTablet suite for interaction with a personal CALO, e.g.: • use of head pose/lip motion for speaking activity detection • yes/no answers by head nods/shakes • visual login
