
Exploiting video information for Meeting Structuring



  1. Exploiting video information for Meeting Structuring

  2. Agenda • Introduction • Feature set extension • Video features processing • Video features integration • Preliminary results • Conclusions

  3. Meeting Structuring (1) • Goal: recognise events which involve one or more communicative modalities: Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard • Working environment: the “IDIAP framework”: 69 five-minute meetings with 4 participants; 30 transcribed meetings; scripted meeting structure

  4. Meeting Structuring (2) • 3 audio-derived feature families: Speaker Turns, Prosodic Features, Lexical Features • [Pipeline diagram: Mic. Array → Beam-forming → Speaker Turns; Lapel Mic. → Pitch baseline / Energy / Rate of Speech → Prosody; ASR → Transcription → Lexical features → Monologue/Dialogue (M/D) discrimination]

  5. Meeting Structuring (3) • Dynamic Bayesian Network based models (using GMTK, Bilmes et al.) • Multi-stream processing (parallel stream processing) • “Counter structure” (state duration modelling) • 3 feature families: Prosodic features (S1), Speaker Turns (S2), Lexical features (S3) • Leave-one-out cross-validation over the 30 annotated meetings • [DBN diagram: counter structure (C, E, A nodes) over the parallel state streams S with their observations Y, unrolled over time]

  6. Feature set extension (1) • Multi-party meetings are multi-modal communicative processes • Our features cover only two modalities: audio (prosodic features & speaker turns) and lexical content (lexical monologue/dialogue discriminator) • Exploiting video content is the next step!

  7. Feature set extension (2) • Goal: improve the recognition of “Note taking”, “Presentation” and “Whiteboard”: the three most confused symbols, and three meeting actions which heavily involve body/hand movements • Approach: extract low-level video features and leave their interpretation to high-level specialised models

  8. Feature set extension (3) • We need motion features for the hands and head-torso regions • Constraints: the system must be simple; reliable against “environmental” changes (lighting, backgrounds, …); open to further extensions/modifications • Initial assumptions: meeting video content is quite “static”; participants occupy only a few spatial regions and tend to stay there; the meeting room configuration (camera positions, seats, furniture, …) is fixed

  9. Video feature extraction (1) • Motion analysis is performed using Kanade-Lucas-Tomasi (KLT) feature tracking… • …and by partitioning the resulting trajectories according to their position in the scene • Four spatial regions for each scene: Head 1 / 2, Hands 1 / 2

  10. KLT (1) • Assumption: the brightness of every point of a (slowly) moving or static object does not change between images taken at nearby time instants (Taylor series approximated to the 1st derivative) • Optical flow constraint equation: $\nabla I \cdot \mathbf{v} + I_t = I_x u + I_y v + I_t = 0$, where $I_t$ represents how fast the intensity is changing with time, $\nabla I = (I_x, I_y)$ is the brightness gradient, and $\mathbf{v} = (u, v)$ is the moving object's speed • This is one equation in two unknowns, hence it admits more than one solution (worked example below)
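
To make the under-determination concrete, here is a minimal worked example; the derivative values are made up for illustration and do not come from the slides:

```latex
% Suppose at some pixel the measured derivatives are I_x = 1, I_y = 0,
% I_t = -2 (illustrative values only). The constraint
%   I_x u + I_y v + I_t = 0
% becomes u - 2 = 0, i.e. u = 2, while v is left completely free:
\[
  1 \cdot u + 0 \cdot v - 2 = 0 \;\Longrightarrow\; u = 2,\quad v \in \mathbb{R},
\]
% so every velocity (2, v) satisfies the constraint -- the aperture
% problem, which motivates the windowed least-squares solve that follows.
```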

  11. KLT (2) • $x_i$ are neighbour points of $x$, assumed to share the same constant velocity $\mathbf{v}$ • Minimizing the weighted least-square error: $\epsilon = \sum_i w_i \left[ \nabla I(x_i) \cdot \mathbf{v} + I_t(x_i) \right]^2$ • In two dimensions the system has the form $G\mathbf{v} = \mathbf{b}$, with $G = \sum_i w_i\, \nabla I(x_i)\, \nabla I(x_i)^T$ and $\mathbf{b} = -\sum_i w_i\, I_t(x_i)\, \nabla I(x_i)$ • If $G$ is invertible, the solution is $\mathbf{v} = G^{-1}\mathbf{b}$ (sketch below)
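
A minimal NumPy sketch of this windowed least-squares solve; the per-pixel derivative arrays `Ix`, `Iy`, `It` and the weights `w` are assumed inputs, not part of the original slides:

```python
import numpy as np

def lk_velocity(Ix, Iy, It, w):
    """Solve G v = b for one tracking window (Lucas-Kanade).

    Ix, Iy, It: flattened spatial/temporal image derivatives over the window
    w:          per-pixel weights (e.g. a Gaussian over the window)
    """
    # G = sum_i w_i * grad(I)(x_i) grad(I)(x_i)^T  -- the 2x2 structure tensor
    G = np.array([[np.sum(w * Ix * Ix), np.sum(w * Ix * Iy)],
                  [np.sum(w * Ix * Iy), np.sum(w * Iy * Iy)]])
    # b = -sum_i w_i * It(x_i) * grad(I)(x_i)
    b = -np.array([np.sum(w * It * Ix), np.sum(w * It * Iy)])
    # v = G^{-1} b, valid only when G is well-conditioned (next slide)
    return np.linalg.solve(G, b)
```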

  12. KLT (3) • A good feature is one that can be tracked well… (Tomasi et al.): if $\lambda_1, \lambda_2$ are the eigenvalues of $G$, the system is well-conditioned when both eigenvalues are large (high texture content) but in the same range • …and even better if it is part of a human body: pixels with a higher probability of being skin are preferred (sketch below)
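
A hedged sketch of this feature-goodness test, reusing the structure tensor `G` from the previous sketch; the threshold `tau`, the eigenvalue-ratio bound, and the 0.5 skin-probability cut are illustrative assumptions, not values from the slides:

```python
import numpy as np

def is_good_feature(G, skin_prob, tau=1e-2, max_ratio=10.0):
    """Shi-Tomasi-style test: both eigenvalues of G large and in the
    same range, with a preference for pixels likely to be skin."""
    lam_min, lam_max = np.linalg.eigvalsh(G)  # eigenvalues, ascending
    well_conditioned = (lam_min > tau and
                        lam_max / max(lam_min, 1e-12) < max_ratio)
    return well_conditioned and skin_prob > 0.5
```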

  13. KLT (4) • KLT feature tracking consists of 3 steps: select n good features; track the selected n features; replace lost features • We decided to track n = 100 features in a square (7x7) window (a loop sketch follows below)
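
A minimal sketch of the 3-step loop using OpenCV's KLT implementation (the slides do not say which implementation was used); the input file name and the detector parameters are assumptions:

```python
import cv2
import numpy as np

N_FEATURES, WIN = 100, (7, 7)

cap = cv2.VideoCapture("meeting.avi")          # hypothetical input file
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# 1. select n good features
pts = cv2.goodFeaturesToTrack(prev, N_FEATURES, 0.01, 5)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # 2. track the selected features in a 7x7 window
    pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None,
                                              winSize=WIN)
    pts = pts[status.ravel() == 1]
    # 3. replace lost features so n stays at 100
    missing = N_FEATURES - len(pts)
    if missing > 0:
        fresh = cv2.goodFeaturesToTrack(gray, missing, 0.01, 5)
        if fresh is not None:
            pts = np.vstack([pts.reshape(-1, 1, 2), fresh])
    prev = gray
```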

  14. Skin modelling • Color-based approach in the (Cr, Cb) chromatic subspace • Skin samples taken from unused meetings • Initial experiments made using a single Gaussian • Now: a 3-component Gaussian Mixture Model (sketch below)
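
A minimal sketch of the 3-component (Cr, Cb) skin model using scikit-learn (the slides do not name a library); `skin_pixels_bgr`, an (N, 1, 3) uint8 array of skin samples from held-out meetings, is assumed to be given:

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def crcb(bgr):
    """Project BGR pixels into the (Cr, Cb) chromatic subspace."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    return ycrcb[..., 1:3].reshape(-1, 2).astype(np.float64)  # drop luma Y

# Fit on skin samples taken from unused meetings
skin_model = GaussianMixture(n_components=3).fit(crcb(skin_pixels_bgr))

def skin_log_likelihood(frame_bgr):
    """Per-pixel log-likelihood of skin under the GMM."""
    scores = skin_model.score_samples(crcb(frame_bgr))
    return scores.reshape(frame_bgr.shape[:2])
```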

  15. Video feature extraction (2) • Structure of the implemented system: Video → Skin Detection (using the skin model) → KLT tracking (100 features / frame) → Trajectory Structure (100 trajectories / frame)

  16. Video feature extraction (3) • Trajectory Structure: remove long and quasi-static trajectories; define 4 partitions (regions): 2 x heads (H1, H2) and 2 x hands (Ha1, Ha2); classify trajectories into the 4 regions; define 2 additional fixed regions (L, R); evaluate the average motion per region (sketch below)
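
A hedged sketch of the trajectory filtering, region assignment, and per-region average motion; the rectangle coordinates and the length/motion thresholds are made-up placeholders, not values from the slides:

```python
import numpy as np

# Hypothetical fixed regions for one camera view: (x0, y0, x1, y1)
REGIONS = {"H1": (50, 20, 170, 120),   "H2": (190, 20, 310, 120),
           "Ha1": (30, 130, 180, 240), "Ha2": (180, 130, 330, 240)}

def region_of(point):
    """Return the name of the region containing point, or None."""
    x, y = point
    for name, (x0, y0, x1, y1) in REGIONS.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None

def average_motion(trajectories):
    """trajectories: list of (T, 2) arrays of tracked point positions."""
    motion = {name: [] for name in REGIONS}
    for traj in trajectories:
        if len(traj) < 2:
            continue
        steps = np.diff(traj, axis=0)
        # discard long, quasi-static trajectories
        if len(traj) > 50 and np.linalg.norm(steps, axis=1).mean() < 0.5:
            continue
        name = region_of(traj[-1])       # classify by current position
        if name is not None:
            motion[name].append(steps.mean(axis=0))
    # one averaged motion vector per region (averaging reduces noise)
    return {n: np.mean(v, axis=0) if v else np.zeros(2)
            for n, v in motion.items()}
```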

  17. Video feature extraction (4) • [Four example frames illustrating the extraction steps]

  18. Video feature extraction (5) • Taking motion vectors averaged over many trajectories helps reduce noise • For each scene 4 motion vectors, one per region (H1, H2, Ha1, Ha2), are estimated (soon to be enhanced with 2 more regions/vectors, L and R, in order to detect whether someone is entering or leaving the scene) • Open issues: loss of tracking for fast-moving objects (to be accounted for during the tracking); assumption of a fixed scene structure; delayed/offline processing

  19. Integration • [Extended DBN diagram: the counter structure (C, E, A nodes) over four parallel state streams S1-S4 (Speaker turns, Prosodic features, Lexical features, Video features) with their observations Y1-Y4, unrolled over time] • Goal: extend the multi-stream model with a new video stream • It is possible that the extended model will be intractable due to the increased state space • In this case: state-space reduction through a multi-time-scale approach will be attempted; early integration of Speaker turns + Lexical features will be investigated

  20. Preliminary results • Before proceeding with the proposed integration we need to: compare video performance against the other feature families; validate the extracted video features • Video features alone perform quite poorly, but they seem to be helpful when evaluated together with Speaker Turns • Configurations compared: (Speaker Turns) + (Prosody + Lexical Features) vs. (Speaker Turns) + (Video Features)

  21. Summary • Extraction of video features through: a skin-detector-enhanced KLT feature tracker; segmentation of trajectories into 4/6 spatial regions (a simple and fast approach, but with some open problems) • Validation of Motion Vectors as a video feature • Integration into the existing framework (work in progress)
