Unsupervised and weakly-supervised discovery of events in video (and audio)

Unsupervised and weakly-supervised discovery of events in video (and audio). Fernando De la Torre. A dream. Outline. Introduction CMU-Multimodal Activity database Unsupervised discovery of video events Aligned Cluster Analysis (ACA) Weakly-supervised discovery of video events

Presentation Transcript


  1. Unsupervised and weakly-supervised discovery of events in video (and audio) • Fernando De la Torre

  2. A dream

  3. Outline • Introduction • CMU-Multimodal Activity database • Unsupervised discovery of video events • Aligned Cluster Analysis (ACA) • Weakly-supervised discovery of video events • Detection-Segmentation SVMs • Conclusions

  4. Quality of life technologies (QLoT)

  5. Multimodal data collection • 40 subjects, 5 recipes • www.kitchen.cs.cmu.edu

  6. Multimodal data collection • 40 subjects, 5 recipes • www.kitchen.cs.cmu.edu

  7. Anomalous dataset

  8. Time series analysis • Anomaly detection formulated as detecting outliers in multimodal time series. • Supervised • Unsupervised • Semi-supervised or weakly supervised

  9. Time series analysis • Anomaly detection formulated as detecting outliers in multimodal time series. • Supervised • Unsupervised • Semi-supervised or weakly supervised

  10. Unsupervised discovery of events in video

  11. Motivation • Mining facial expression for one subject

  12. Motivation • Mining facial expression for one subject • Summarization • Visualization • Indexing

  13. Motivation • Mining facial expression for one subject [figure labels: looking forward, sleeping, waking up, smiling, looking up] • Summarization • Visualization • Indexing

  14. Motivation • Mining facial expression of one subject • Summarization • Embedding • Indexing

  15. Motivation • Mining facial expression for one subject • Summarization • Embedding • Indexing

  16. Related work in time series • Change point detection (e.g. Page ‘54, Stephens ‘94, Lai ‘95, Ge and Smyth ‘00, Steyvers & Brown ‘05, Murphy et al. ‘07, Harchaoui et al. ‘08) • Segmental HMMs (e.g. Ge and Smyth ‘00, Kohlmorgen et al. ‘01, Ding & Fan ‘07) • Mixtures of HMMs (e.g. Fine et al. ‘98, Murphy & Paskin ‘01, Oliver et al. ‘02, Alon et al. ‘03) • Switching LDS (e.g. Pavlovic et al. ‘00, Oh et al. ‘08, Turaga et al. ‘09) • Hierarchical Dirichlet Process (e.g. Beal et al. ‘02, Fox et al. ‘08) • Aligned Cluster Analysis (ACA)

  17. Summarization with ACA

  18. Kernel k-means and spectral clustering (Ding et al. ‘02, Dhillon et al. ‘04, Zass and Shashua ‘05, De la Torre ‘06) • [Figure: ten numbered samples plotted on x and y axes]
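
A minimal sketch of kernel k-means, added for illustration and not taken from the slides. It assumes a precomputed kernel matrix `K` and a caller-supplied initial labelling; the function name `kernel_kmeans` and its parameters are hypothetical. Distances to cluster means are computed from kernel entries only, which is what lets the same procedure run on any kernel between samples (or, later, between segments).

```python
import numpy as np

def kernel_kmeans(K, init_labels, n_iter=50):
    """Cluster n samples given only their n x n kernel matrix K.

    init_labels: initial integer cluster label per sample (an assumption of
    this sketch; any initialisation scheme could be used instead).
    """
    labels = np.asarray(init_labels).copy()
    n = K.shape[0]
    k = int(labels.max()) + 1
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # an empty cluster stays at infinite distance
            # ||phi(x_i) - m_c||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels

# Toy usage with a linear kernel on two well-separated 1-D groups.
X = np.array([[-5.1], [-5.0], [-4.9], [4.9], [5.0], [5.1]])
labels = kernel_kmeans(X @ X.T, init_labels=[0, 1, 0, 1, 0, 1])
```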

  19. Problem formulation for ACA • Labels (G) • Start and end of the segments (h) • Dynamic Time Alignment Kernel (Shimodaira et al. ‘01)

  20. Problem formulation for ACA • Dynamic Time Alignment Kernel (Shimodaira et al. ‘01) • [Equation residue: terms m_c and X over the segment [S_i, S_{i+1})]

  21. Matrix formulation for ACA • Example: 23 frames, 3 clusters • [Figure: matrices labelled clusters × segments and segments × samples] • Dynamic Time Alignment Kernel (Shimodaira et al. ‘01)
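
A minimal sketch of a dynamic time alignment kernel between two segments of different lengths, added for illustration. It follows the commonly cited recursion (diagonal steps weighted 2, horizontal and vertical steps weighted 1, normalised by the sum of the two lengths); the Gaussian frame kernel, the bandwidth `sigma`, and the function names are assumptions of this sketch, not details confirmed by the slides.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Frame-level Gaussian kernel between rows of X (n_x, d) and Y (n_y, d)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def dtak(X, Y, sigma=1.0):
    """Similarity of two segments under the best monotone frame alignment."""
    kappa = gaussian_kernel(X, Y, sigma)
    nx, ny = kappa.shape
    # u[i, j]: best accumulated similarity aligning the first i frames of X
    # with the first j frames of Y.
    u = np.full((nx + 1, ny + 1), -np.inf)
    u[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            k_ij = kappa[i - 1, j - 1]
            u[i, j] = max(u[i - 1, j] + k_ij,            # advance in X only
                          u[i - 1, j - 1] + 2.0 * k_ij,  # advance in both
                          u[i, j - 1] + k_ij)            # advance in Y only
    return u[nx, ny] / (nx + ny)

# A segment and a version of it played at half speed align perfectly.
fast = np.array([[0.0], [1.0], [2.0]])
slow = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
same = dtak(fast, slow)
different = dtak(fast, slow + 10.0)
```

Because the normalised score depends on the best alignment and not on segment length, segments of unequal duration can be compared inside a kernel clustering procedure.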

  22. Facial image features • Active Appearance Models (Baker and Matthews ‘04) • Image features [figure labels: appearance, shape, upper face, lower face]

  23. Unsupervised facial event discovery

  24. Facial event discovery across subjects • Cohn-Kanade: 30 people and five different expressions (surprise, joy, sadness, fear, anger)

  25. Facial event discovery across subjects • Cohn-Kanade: 30 people and five different expressions (surprise, joy, sadness, fear, anger) • 10 sets of 30 people

  26. Honey bee dance (Oh et al. ‘08) • Three behaviors: 1) waggling, 2) turning left, 3) turning right

  27. Clustering human motion

  28. Weakly supervised discovery of events in images and video

  29. Spot the differences!

  30. What distinguishes these images?

  31. Classification of time series

  32. Similarity of these problems? • Global statistics are not distinctive enough! • Better understanding of the discriminative regions or events

  33. Image as a bag of ‘regions’ • Positive image: at least one positive region • Negative image: all regions negative
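
A minimal sketch of the bag-of-regions labelling rule, added for illustration. It assumes a linear scoring model with weights `w` and bias `b` (both hypothetical here) applied to one feature vector per candidate region: the image score is the score of its best region, so an image is positive when at least one region scores above zero and negative when all regions score below.

```python
import numpy as np

def bag_score(region_features, w, b=0.0):
    """Return (score, index) of the highest-scoring region in one image."""
    scores = region_features @ w + b
    best = int(np.argmax(scores))
    return float(scores[best]), best

# Three candidate regions described by 2-D features (toy values).
regions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w = np.array([2.0, -1.0])
score, best_region = bag_score(regions, w)
is_positive = score > 0.0  # positive if at least one region is positive
```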

  34. Support vector machines (SVMs)

  35. Learning formulation • Standard SVM • [Figure: example scores -1, -3, -2, 3, -1, 0.5] • (Andrews et al. ‘03, Felzenszwalb et al. ‘08)

  36. Optimization • 1) Search over all possible subwindows: 100 ms/image (480×640 pixels) (Lampert et al., CVPR ‘08) • 2) [figure of scores; step text not recoverable] • 3) SVM with QP

  37. Discriminative patterns in time series • We name it: k-segmentation • At most k disjoint intervals • 10 ms/sequence (15,000 frames) • Efficient search: global optimum guaranteed!

  38. Representation of signals • Training data → compute frame-level feature vectors → clustering → visual dictionary → IDs of visual words
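
A minimal sketch of this representation pipeline, added for illustration: frame-level feature vectors are clustered into a dictionary, each frame is replaced by the ID of its nearest dictionary entry, and a span of frames is summarised by a histogram of those IDs. Plain k-means with an evenly spaced deterministic initialisation is an assumption of this sketch, as are all function names; the slides do not say which clustering method was used.

```python
import numpy as np

def assign_words(frames, centers):
    """ID of the nearest dictionary entry for every frame."""
    d = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def build_dictionary(frames, n_words, n_iter=20):
    """Plain k-means (Lloyd's algorithm) over frame-level feature vectors."""
    # Deterministic init: n_words frames spread evenly over the training data.
    idx = np.linspace(0, len(frames) - 1, n_words).astype(int)
    centers = frames[idx].astype(float)
    for _ in range(n_iter):
        ids = assign_words(frames, centers)
        for c in range(n_words):
            if np.any(ids == c):
                centers[c] = frames[ids == c].mean(axis=0)
    return centers

def word_histogram(ids, n_words):
    """Count how often each visual word occurs in a span of frames."""
    return np.bincount(ids, minlength=n_words)

# Toy 1-D frame features that fall into two obvious groups.
frames = np.array([[0.0], [0.1], [10.0], [10.1], [0.2], [9.9]])
dictionary = build_dictionary(frames, n_words=2)
ids = assign_words(frames, dictionary)
hist = word_histogram(ids, n_words=2)
```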

  39. K-segmentation • Original signal → IDs of visual words → histogram of visual words • We need: [expression missing]

  40. What is [symbol missing]? • Original signal (x) → IDs of visual words • SVM parameters • Consider m-segmentation: m-segmentation → (m+1)-segmentation • Situation 1: [missing] • Situation 2: [missing]
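
A minimal sketch of the search problem behind k-segmentation, added for illustration: choose at most k disjoint intervals that maximise a total score. It assumes each frame carries a scalar score (for a linear model on a histogram of visual words, this would be the weight of that frame's word, which is an inference and not stated on the slides). The sketch uses a standard dynamic program as a stand-in; it is not the incremental m-segmentation to (m+1)-segmentation procedure the slides refer to, whose details are not recoverable from the transcript.

```python
def k_segmentation(frame_scores, k):
    """Return (best_total, intervals) using at most k disjoint [start, end) intervals."""
    n = len(frame_scores)
    neg = float("-inf")
    # best[j][t]: best total over frames [0, t) with at most j intervals.
    # end[j][t]:  same, but frame t-1 must be the last frame of an interval.
    best = [[0.0] * (n + 1) for _ in range(k + 1)]
    end = [[neg] * (n + 1) for _ in range(k + 1)]
    extend = [[False] * (n + 1) for _ in range(k + 1)]   # interval continues from t-1
    use_end = [[False] * (n + 1) for _ in range(k + 1)]  # best[j][t] selects frame t-1
    for j in range(1, k + 1):
        for t in range(1, n + 1):
            s = frame_scores[t - 1]
            extend[j][t] = end[j][t - 1] > best[j - 1][t - 1]
            end[j][t] = s + max(end[j][t - 1], best[j - 1][t - 1])
            use_end[j][t] = end[j][t] > best[j][t - 1]
            best[j][t] = max(best[j][t - 1], end[j][t])
    # Walk the stored choices backwards to recover the intervals.
    intervals = []
    j, t = k, n
    while j > 0 and t > 0:
        if not use_end[j][t]:
            t -= 1
            continue
        stop = t
        while extend[j][t]:
            t -= 1
        intervals.append((t - 1, stop))
        t -= 1
        j -= 1
    intervals.reverse()
    return best[k][n], intervals

total, intervals = k_segmentation([1, -2, 3, 4, -5, 2], k=2)
```

The dynamic program runs in O(kn) time and returns a global optimum for this additive score, which matches the kind of guarantee the slides claim for their own search.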

  41. Experiment 1 – glasses vs. no-glasses • 624 images, 20 people under different expression/pose • 8 people training (126 sunglasses, 128 no glasses), 12 testing (185 sunglasses and 185 no glasses)

  42. Localization result

  43. Experiment 2 – car vs. no car • 400 images: half contain cars and the other half do not. • Each image: 10,000 SIFT descriptors and a vocabulary of 1,000 visual words.

  44. Localization result

  45. Bad localization cases

  46. Classification performance • [Figure: comparison of discriminative regions, whole image, and human labels] • Our method outperforms SVM with human labels!

  47. Experiment 3 – synthetic data • [Figure panels: positive class, negative class, result, accuracy] • k: maximum number of disjoint intervals.

  48. Experiment 4 – mouse activity • Mouse activities: drinking, eating, exploring, grooming, sleeping

  49. Result – F1 scores

  50. Conclusions • CMU Multimodal Activity database • Unsupervised discovery of events in time series • Aligned Cluster Analysis for summarization, indexing and visualization of time series • Code online (www.humansensing.cs.cmu.edu) • Open problems: automatic selection of the number of clusters • Weakly-supervised discovery of events in time series • DS-SVM • Novel and efficient algorithm for time series • Outperforms methods trained with human-labeled data • Kernel methods: a fundamental framework for multimodal data fusion.