Unsupervised and weakly-supervised discovery of events in video (and audio)

Unsupervised and weakly-supervised discovery of events in video (and audio) Fernando De la Torre

A dream

Outline • Introduction • CMU-Multimodal Activity database • Unsupervised discovery of video events • Aligned Cluster Analysis (ACA) • Weakly-supervised discovery of video events • Detection-Segmentation SVMs • Conclusions

Quality of life technologies (QLoT)

Multimodal data collection • 40 subjects, 5 recipes • www.kitchen.cs.cmu.edu

Anomalous dataset

Time series analysis • Anomaly detection formulated as detecting outliers in multimodal time series. • Supervised • Unsupervised • Semi-supervised or weakly supervised

Unsupervised discovery of events in video

Motivation • Mining facial expression for one subject

Motivation • Mining facial expression for one subject • Summarization • Visualization • Indexing

Motivation • Mining facial expression for one subject (example events: looking forward, sleeping, waking up, smiling, looking up) • Summarization • Visualization • Indexing

Motivation • Mining facial expression of one subject • Summarization • Embedding • Indexing

Related work in time series • Change point detection (e.g. Page ‘54, Stephens ‘94, Lai ‘95, Ge and Smyth ‘00, Steyvers & Brown ‘05, Murphy et al. ‘07, Harchaoui et al. ‘08) • Segmental HMMs (e.g. Ge and Smyth ‘00, Kohlmorgen et al. ‘01, Ding & Fan ‘07) • Mixtures of HMMs (e.g. Fine et al. ‘98, Murphy & Paskin ‘01, Oliver et al. ‘02, Alon et al. ‘03) • Switching LDS (e.g. Pavlovic et al. ‘00, Oh et al. ‘08, Turaga et al. ‘09) • Hierarchical Dirichlet Process (e.g. Beal et al. ‘02, Fox et al. ‘08) • Aligned Cluster Analysis (ACA)

Summarization with ACA

Kernel k-means and spectral clustering (Ding et al. ‘02, Dhillon et al. ‘04, Zass and Shashua ‘05, De la Torre ‘06) • [Figure: ten numbered samples plotted on x and y axes]
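Since this slide only names the technique, a minimal sketch of kernel k-means may help. It assumes the data are available only through an n x n kernel (Gram) matrix K and that an initial label vector is supplied; the function name and parameters are illustrative and not taken from the talk.

```python
import numpy as np

def kernel_kmeans(K, init_labels, n_iter=50):
    """Cluster n samples using only their n x n kernel (Gram) matrix K.

    init_labels is a hypothetical starting assignment with values 0..C-1.
    """
    labels = np.asarray(init_labels).copy()
    n = K.shape[0]
    n_clusters = int(labels.max()) + 1
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # an empty cluster keeps infinite distance
            # ||phi(x_i) - m_c||^2 = K_ii - 2 mean_j K_ij + mean_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels
```

With a linear kernel (K = X X^T) this reduces to ordinary k-means; other kernels change only how K is built, not the loop.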

Problem formulation for ACA • Labels (G) • Start and end of the segments (h) • Dynamic Time Alignment Kernel (Shimodaira et al. ‘01)

Problem formulation for ACA • Dynamic Time Alignment Kernel (Shimodaira et al. ‘01) • [Figure: segments X[Si, Si+1) compared with cluster means mc]

Matrix formulation for ACA • Example: 23 frames, 3 clusters • [Figure: matrices with axes labeled clusters, segments, samples] • Dynamic Time Alignment Kernel (Shimodaira et al. ‘01)
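The ACA slides rely on the Dynamic Time Alignment Kernel to compare segments of different lengths. The sketch below uses the commonly cited DTAK recursion (diagonal steps weighted 2, horizontal and vertical steps weighted 1, normalized by the sum of the two lengths) with a Gaussian frame kernel; the frame kernel, the `sigma` parameter, and the function names are assumptions, and the exact variant used in the talk may differ.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Frame-level kernel matrix between rows of X (n x d) and Y (m x d)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def dtak(X, Y, sigma=1.0):
    """Similarity of two sequences of possibly different lengths."""
    k = gaussian_kernel(X, Y, sigma)
    n, m = k.shape
    # U[i, j]: best cumulative similarity aligning X[:i] with Y[:j]
    U = np.full((n + 1, m + 1), -np.inf)
    U[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            kij = k[i - 1, j - 1]
            U[i, j] = max(U[i - 1, j] + kij,            # vertical step
                          U[i - 1, j - 1] + 2.0 * kij,  # diagonal step
                          U[i, j - 1] + kij)            # horizontal step
    return U[n, m] / (n + m)
```

Every alignment path has total step weight n + m, so with a frame kernel bounded by 1 the result lies in (0, 1], and a sequence compared with a time-warped copy of itself scores 1.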

Facial image features • Active Appearance Models (Baker and Matthews ‘04): shape and appearance • Image features: upper face, lower face

Unsupervised facial event discovery

Facial event discovery across subjects • Cohn-Kanade: 30 people and five different expressions (surprise, joy, sadness, fear, anger)

Facial event discovery across subjects • Cohn-Kanade: 30 people and five different expressions (surprise, joy, sadness, fear, anger) • 10 sets of 30 people

Honey bee dance (Oh et al. ‘08) Three behaviors: 1-waggling 2-turning left 3-turning right

Clustering human motion

Weakly supervised discovery of events in images and video

Spot the differences!

What distinguishes these images?

Classification of time series

Similarity of these problems? • Global statistics are not distinctive enough! • Better understanding of the discriminative regions or events

Image as a bag of ‘regions’ • Positive image: at least one positive region • Negative image: all regions negative
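The bag-of-regions rule above (a positive image has at least one positive region, a negative image has none) can be sketched as a max over region scores. This assumes a linear scoring function with weights `w` and bias `b`; those names and the two helper functions are illustrative only.

```python
import numpy as np

def bag_score(region_features, w, b):
    """Score an image as the maximum linear score over its regions.

    region_features: (n_regions x d) array, one row per candidate region.
    Returns the best score and the index of the region that achieved it.
    """
    scores = region_features @ w + b
    best = int(np.argmax(scores))
    return float(scores[best]), best

def bag_label(region_features, w, b):
    """+1 if at least one region scores positive, otherwise -1."""
    score, _ = bag_score(region_features, w, b)
    return 1 if score > 0 else -1
```

The index returned by `bag_score` is what gives localization for free: it points at the region that made the image positive.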

Support vector machines (SVMs)

Learning formulation • Standard SVM (Andrews et al. ‘03, Felzenszwalb et al. ‘08)

Optimization • 1) Search over all possible subwindows: 100 ms/image (480×640 pixels) (Lampert et al., CVPR ‘08) • 2) [Figure: example region scores] • 3) SVM with QP

Discriminative patterns in time series • We name it: k-segmentation • At most k disjoint intervals • 10 ms/sequence (15,000 frames) • Efficient search: global optimum guaranteed!
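One way to see why a global optimum over "at most k disjoint intervals" is reachable efficiently is a dynamic program over per-frame scores. The sketch below is a standard O(nk) formulation, not necessarily the algorithm from the talk; it assumes each frame already has a scalar score (for example, the linear SVM weight of that frame's visual word, so that an interval's score is the sum of its frame scores).

```python
def k_segmentation(frame_scores, k):
    """Best total score using at most k disjoint intervals.

    Returns (total, intervals) with intervals as half-open (start, end).
    """
    n = len(frame_scores)
    neg_inf = float("-inf")
    # closed[j][t]: best over the first t frames with at most j intervals
    # opened[j][t]: same, but the j-th interval must end at frame t-1
    closed = [[0.0] * (n + 1) for _ in range(k + 1)]
    opened = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    for j in range(1, k + 1):
        for t in range(1, n + 1):
            opened[j][t] = frame_scores[t - 1] + max(opened[j][t - 1],
                                                     closed[j - 1][t - 1])
            closed[j][t] = max(closed[j][t - 1], opened[j][t])
    # Backtrack to recover the chosen intervals
    intervals = []
    j, t = k, n
    while j > 0 and t > 0:
        if closed[j][t] == closed[j][t - 1]:
            t -= 1  # frame t-1 is not part of any interval
            continue
        end = t
        # extend left while the interval was continued rather than started
        while opened[j][t] == frame_scores[t - 1] + opened[j][t - 1]:
            t -= 1
        intervals.append((t - 1, end))
        t -= 1
        j -= 1
    intervals.reverse()
    return closed[k][n], intervals
```

For example, `k_segmentation([1, 2, -5, 3, -1, 4, -10, 2], 2)` selects frames 0-1 and 3-5 and skips the strongly negative frames between and after them.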

Representation of signals • Training data → compute frame-level feature vectors → clustering → visual dictionary → IDs of visual words
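The pipeline on this slide can be sketched in two small steps: map each frame-level feature vector to its nearest dictionary entry, then count word IDs over any span of frames. The dictionary is assumed to come from some clustering of training features (for example k-means); the function names are illustrative.

```python
import numpy as np

def assign_words(features, dictionary):
    """Map each frame feature (row) to the ID of its nearest dictionary entry."""
    d2 = ((features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def word_histogram(word_ids, vocab_size):
    """Count how often each visual word occurs in a span of frames."""
    return np.bincount(word_ids, minlength=vocab_size)
```

A usage example: `word_histogram(assign_words(features, dictionary)[10:50], len(dictionary))` describes frames 10 to 49 as a single fixed-length vector, regardless of how long the span is.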

K-segmentation • Original signal → IDs of visual words → histogram of visual words • We need:

What is ? • IDs of visual words • Original signal (x) • SVM parameters • Consider m-segmentation: m-segmentation → (m+1)-segmentation • Situation 1: • Situation 2:

Experiment 1 – glasses vs. no glasses • 624 images, 20 people under different expressions/poses • 8 people for training (126 sunglasses, 128 no glasses), 12 for testing (185 sunglasses, 185 no glasses)

Localization result

Experiment 2 – car vs. no car • 400 images: half contain cars and the other half do not. • Each image: 10,000 SIFT descriptors and a vocabulary of 1,000 visual words.

Localization result

Bad localization cases

Classification performance • Compared: discriminative regions, whole image, human labels • Our method outperforms SVM with human labels!

Experiment 3 – synthetic data • [Figure: positive class, negative class] • Result: accuracy • k: maximum number of disjoint intervals.

Experiment 4 – mouse activity • Mouse activities: drinking, eating, exploring, grooming, sleeping

Result – F1 scores
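For reference, the F1 score reported here is the harmonic mean of precision and recall. A minimal sketch for one class follows, assuming frame-level true and predicted labels; the function name and the `positive` parameter are illustrative.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class from paired true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct detections of this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```

For a multi-class task such as the mouse activities, the same function can be called once per activity, for example with `positive="eating"`.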

Conclusions • CMU Multimodal Activity database • Unsupervised discovery of events in time series • Aligned Cluster Analysis for summarization, indexing and visualization of time series • Code online (www.humansensing.cs.cmu.edu) • Open problems: automatic selection of the number of clusters • Weakly-supervised discovery of events in time series • DS-SVM • Novel and efficient algorithm for time series • Outperforms methods that use human-labeled data • Kernel methods: a fundamental framework for multimodal data fusion.
