
modeling individual and group actions in meetings with layered HMMs






Presentation Transcript


  1. modeling individual and group actions in meetings with layered HMMs. Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud. IDIAP Research Institute, Martigny, Switzerland

  2. meetings as sequences of actions
  • human interaction
    • similar/complementary roles
    • individuals constrained by the group
  • agenda: prior sequence
    • discussion points
    • presentations
    • decisions to be made
  • minutes: posterior sequence
    • key phases
    • summarized discussions
    • decisions made

  3. the goal: recognizing sequences of meeting actions
  [timeline figure, showing several parallel "meeting views" of the same meeting: Phase (Discussion, Presentation, Group Discussion), Topic (Weather, Budget), Group Interest Level (High, Neutral, High), Group Task (Information Sharing, Decision Making)]
  group-level actions = meeting actions

  4. our work: two-layer HMMs
  • decompose the recognition problem
  • both layers use HMMs
  • individual action layer (I-HMM): various models
  • group action layer (G-HMM)
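The decomposition can be sketched with a toy discrete-observation HMM. Everything below is illustrative, not the paper's actual models or feature spaces: each per-person I-HMM (one per individual action) scores that person's observation window, and the resulting per-action scores are what gets passed up to the G-HMM layer.

```python
import math

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.
    pi: initial state probabilities, A: transition matrix, B: emission matrix."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * A[p][s] for p in range(n)) * B[s][o]
                 for s in range(n)]
    return math.log(sum(alpha))

def i_layer_outputs(person_obs, i_models):
    """Soft decision for one person: normalized per-action scores.
    i_models: {action_name: (pi, A, B)} - one I-HMM per individual action.
    The concatenation of these vectors over all persons would form the
    G-HMM observation in the two-layer scheme."""
    ll = {a: forward_log_likelihood(person_obs, *m) for a, m in i_models.items()}
    mx = max(ll.values())
    w = {a: math.exp(v - mx) for a, v in ll.items()}  # stable exponentiation
    z = sum(w.values())
    return {a: w[a] / z for a in w}
```

For example, with a single-state "speaking" model that mostly emits symbol 1 and an "idle" model that mostly emits symbol 0, a window dominated by 1s yields a soft output concentrated on "speaking".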

  5. our work in detail
  • definition of meeting actions
  • audio-visual observations
  • action recognition
  • results
  references:
  • D. Zhang et al., "Modeling Individual and Group Actions in Meetings with Layered HMMs", IEEE CVPR Workshop on Event Mining, 2004.
  • I. McCowan et al., ICASSP 2003; PAMI 2005.
  • N. Oliver et al., ICMI 2002.

  6. 1. defining meeting actions
  • multiple parallel views
    • tech-based: what can we recognize?
    • application-based: respond to user needs
    • psychology-based: coding schemes from social psychology
  • actions in a set: A = { A1, A2, A3, A4, …, AN }
    • consistent: one view, answering one question
    • mutually exclusive
    • exhaustive
    • each view is a set of actions

  7. multi-modal turn-taking
  • describes the group discussion state
  • group actions A = { 'discussion', 'monologue' (x4), 'white-board', 'presentation', 'note-taking', 'monologue + note-taking' (x4), 'white-board + note-taking', 'presentation + note-taking' }
  • individual actions I = { 'speaking', 'writing', 'idle' }
  • actions are multi-modal in nature
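For intuition only: the paper infers group actions with a G-HMM, not with rules, but a hypothetical rule-of-thumb shows how the two action sets relate. The mapping below is an assumption for illustration, not the paper's method.

```python
I_ACTIONS = {"speaking", "writing", "idle"}

def naive_group_action(frame, presentation=False, whiteboard=False):
    """Illustrative heuristic (NOT the paper's HMM inference): map one
    frame of per-person individual actions to a plausible group action.
    frame: {person_id: action}, with action drawn from I_ACTIONS."""
    speakers = [p for p, a in frame.items() if a == "speaking"]
    writers = [p for p, a in frame.items() if a == "writing"]
    note = " + note-taking" if writers else ""
    if presentation:
        return "presentation" + note
    if whiteboard:
        return "white-board" + note
    if len(speakers) >= 2:
        return "discussion"
    if len(speakers) == 1:
        return "monologue{}".format(speakers[0]) + note
    return "note-taking" if writers else "idle"
```

So one speaker with others writing maps to a "monologue + note-taking" variant, while two or more simultaneous speakers map to "discussion".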

  8. example
  [figure: per-person timelines of individual actions (S = speaking, W = writing) for Persons 1-4, together with presentation and whiteboard usage, and the resulting group action sequence: Monologue1 + Note-taking, Discussion, Presentation + Note-taking, Whiteboard + Note-taking]

  9. 2. audio-visual observations
  • audio: 12 channels, 48 kHz
    • 4 lapel microphones
    • 1 microphone array
  • video: 3 CCTV cameras
  • all synchronized

  10. multimodal feature extraction: audio
  • microphone array
    • speech activity (SRP-PHAT)
    • seats
    • presentation/whiteboard area
    • speech/silence segmentation
  • lapel microphones
    • speech pitch
    • speech energy
    • speaking rate
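The paper derives speech activity from SRP-PHAT on the microphone array; as a much simpler stand-in, speech/silence segmentation can be sketched as an energy threshold followed by run-length smoothing. The threshold and minimum-run values here are illustrative assumptions.

```python
def speech_silence(energies, threshold, min_run=3):
    """Label frames speech/silence by short-time energy, then smooth by
    absorbing runs shorter than min_run frames into the preceding label
    (a hedged sketch, not the paper's SRP-PHAT-based segmentation)."""
    labels = ["speech" if e > threshold else "silence" for e in energies]
    out, i = [], 0
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        run = labels[i:j]
        if len(run) < min_run and out:
            run = [out[-1]] * len(run)  # absorb short run into previous label
        out.extend(run)
        i = j
    return out
```

The smoothing step matters in practice: raw per-frame thresholding produces one-frame flips that would otherwise fragment the segmentation.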

  11. multimodal feature extraction: video
  • head + hands blobs
    • skin colour models (GMM)
    • head position
    • hands position + features (eccentricity, size, orientation)
    • head + hands blob motion
  • moving blobs from background subtraction
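A minimal sketch of the background-subtraction step on grayscale frames represented as nested lists; the skin-colour GMM and blob tracking are omitted, and the `blob_motion` feature is an illustrative assumption (fraction of moving pixels inside a blob's bounding box as a crude motion measure).

```python
def moving_mask(frame, background, threshold):
    """Per-pixel moving mask by background subtraction: a pixel is
    'moving' when it differs from the background by more than threshold."""
    return [[abs(p - b) > threshold for p, b in zip(row, brow)]
            for row, brow in zip(frame, background)]

def blob_motion(mask, box):
    """Fraction of moving pixels inside a blob bounding box
    (x0, y0, x1, y1), exclusive of x1/y1 - a crude motion feature."""
    x0, y0, x1, y1 = box
    cells = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(cells) / len(cells)
```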

  12. 3. recognition with two-layer HMM
  • each layer trained independently
  • trained as in ASR (Torch)
  • simultaneous segmentation and recognition
  compared with a single-layer HMM:
  • smaller observation spaces
  • I-HMM trained with much more data
  • G-HMM less sensitive to feature variations
  • combinations can be explored

  13. models for I-HMM
  • early integration
    • all observations concatenated
    • captures correlation between streams
    • frame-synchronous streams
  • multi-stream (Dupont, TMM 2000)
    • one HMM per stream (audio or visual), trained independently
    • decoding: weighted likelihoods combined at each frame
    • allows little inter-stream asynchrony
    • used in multi-band and audio-visual ASR
  • asynchronous (Bengio, NIPS 2002)
    • audio and visual streams share a single state sequence
    • states emit on one or both streams, given a sync variable
    • models inter-stream asynchrony
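The multi-stream decoding rule described above, a per-frame weighted combination of per-stream log-likelihoods, can be sketched as follows (stream weights are assumed to sum to 1):

```python
def combine_streams(frame_logliks, weights):
    """Multi-stream combination at one frame.
    frame_logliks: list of {state: log-likelihood} dicts, one per stream
    (e.g. audio and visual); weights: one weight per stream.
    Returns the combined {state: log-likelihood} for this frame."""
    states = frame_logliks[0].keys()
    return {s: sum(w * d[s] for w, d in zip(weights, frame_logliks))
            for s in states}
```

With equal weights, a state that is mediocre under both streams can still beat a state that is strong under one stream but weak under the other, which is exactly the robustness argument for multi-stream modeling.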

  14. linking the two layers
  • hard decision: the i-action model with the highest probability outputs 1; all other models output 0
  • soft decision: outputs a probability for each individual action model
  example (from audio-visual features): HD: (1, 0, 0); SD: (0.9, 0.05, 0.05)
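The hard/soft linking can be sketched directly; the action names below are the individual actions from slide 7, and the fixed (sorted) ordering of the output vector is an assumption for illustration:

```python
def link_outputs(action_probs, hard=False):
    """Turn one person's I-HMM action probabilities into the feature
    vector passed up to the G-HMM: one-hot under a hard decision,
    the normalized probabilities themselves under a soft decision."""
    actions = sorted(action_probs)  # fixed order, e.g. idle/speaking/writing
    if hard:
        best = max(actions, key=lambda a: action_probs[a])
        return [1.0 if a == best else 0.0 for a in actions]
    z = sum(action_probs.values())
    return [action_probs[a] / z for a in actions]
```

The soft decision preserves the second-best hypotheses (0.9, 0.05, 0.05) that the hard decision (1, 0, 0) throws away, which is consistent with the slide-18 observation that soft decisions perform slightly better.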

  15. 4. experiments: data + setup
  • 59 meetings (30/29 train/test)
  • four people, five minutes each
  • scripts
    • schedule of actions
    • natural behavior
  • features: 5 frames/second
  • data: mmm.idiap.ch

  16. performance measures
  • individual actions: frame error rate (FER)
  • group actions: action error rate (AER)
    AER = (Subs + Del + Ins) / Total actions
    • Subs: number of substituted actions
    • Del: number of deleted actions
    • Ins: number of inserted actions
    • Total actions: number of target actions
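Both measures can be computed directly; a minimal sketch, assuming AER follows the usual ASR word-error-rate convention (edit-distance alignment of the recognized action sequence against the target sequence, normalized by the number of target actions):

```python
def action_error_rate(ref, hyp):
    """AER = (substitutions + deletions + insertions) / len(ref),
    via a standard edit-distance alignment (as in ASR scoring)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[m][n] / m

def frame_error_rate(ref_frames, hyp_frames):
    """FER: fraction of frames whose hypothesized individual action
    differs from the reference action."""
    wrong = sum(r != h for r, h in zip(ref_frames, hyp_frames))
    return wrong / len(ref_frames)
```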

  17. results: individual actions
  [chart: comparison of visual-only, audio-only, and audio-visual models; 43000 frames; (0.8, 0.2); (0.2-2.2 s)]
  • asynchronous effects between modalities
  • accuracy: speaking 96.6%, writing 90.8%, idle 81.5%

  18. results: group actions
  • multi-modality outperforms single modalities
  • two-layer HMM outperforms single-layer HMM for audio-only, visual-only and audio-visual
  • best model: A-HMM (asynchronous); 8% improvement, significant at the 96% level
  • soft decision slightly better than hard decision

  19. action-based meeting structuring

  20. conclusions
  • structuring meetings as sequences of meeting actions
  • layered HMMs successful for recognition
  • turn-taking patterns: useful for browsing
  • public dataset, standard evaluation procedures
  • open issues
    • less training data (unsupervised; ACM MM 2004)
    • other relevant actions (interest-level; ICASSP 2005)
    • other features (words, emotions)
    • efficient models for many interacting streams

  21. Linking Two Layers (1)

  22. Linking Two Layers (2): Normalization
  Please refer to: D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.
