
Indexing of video sequences: a generic approach for handling multiple specific features


Presentation Transcript


  1. Indexing of video sequences: a generic approach for handling multiple specific features Nicolas Moënne-Loccoz, Eric Bruno and Stéphane Marchand-Maillet Viper group Computer Vision and Multimedia Lab University of Geneva, CH 23/4/2006 – Dagstuhl seminar – Dagstuhl, DE

  2. Outline
  • Content-based video indexing
  • Event-based (specific) video indexing
    • Bags of trajectories
    • Interactive classification
    • Results
  • Interactive (generic) indexing process
    • Multimodal fusion
    • Dissimilarity representation
    • Some results

  3. Content-based video indexing
  Multimodal content abstraction:
  • From the raw signal, infer high-level properties
  • Raw signal = (audio, video, text, metadata, …)ⁿ
  • High-level properties
    • Semantic labels at various levels (event, object, story, …)
    • Data characteristics (up, down, slow, exciting, …)
  • Essentially 2 strategies
    • Supervised classification (e.g. activity recognition)
    • Interactive retrieval (e.g. query-by-example)

  4. Event characterization
  • Event: long-term spatio-temporal object [Irani 99]
  • Issues:
    • Object localization
    • Object supporting region
  • Assumptions made (e.g. in motion-history approaches):
    • Single event
    • Static camera

  5. Ideal vs. practical
  • Ideal case: 1 event (walking) and no camera motion → well-defined signature
  • Real case: crowd (possibly with camera motion) → problems, essentially due to temporal projection

  6. Unconstrained event modeling
  • Create a robust event representation
    • In terms of scale
    • In terms of lighting conditions
    • In terms of contents
    • …
  • Create a sparse event representation
    • Concentrate on salient content
  • Use the bag-of-trajectories strategy to index events

  7. Robust, sparse event representation
  • Based on local features in the spatial domain
  • Based on the notion of saliency
    • Entropy, cornerness, frequency
  • For "every" video frame Ft: local features Wt = {wt}
    • wt = (position, orientation, scale) = (vt, θt, st)

  8. Temporal handling
  • Match features frame by frame
    • Best bipartite match between feature sets Wt and Wt+δt
    • Greedy match (approximation)
    • Hungarian algorithm
  • "Motion field" for frame Ft
  • Trajectory from t to t+k: z[t,t+k] = {wt, wt+1, …, wt+k}
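The greedy approximation of the bipartite match can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_match` and the distance threshold `max_dist` are assumptions, and only feature positions are compared (the slides match full descriptors).

```python
import numpy as np

def greedy_match(W_t, W_t1, max_dist=20.0):
    """Greedily match feature positions between consecutive frames.

    W_t, W_t1: (N, 2) and (M, 2) arrays of feature positions.
    Returns (i, j) index pairs; each feature is used at most once.
    """
    # Pairwise Euclidean distances between the two feature sets.
    d = np.linalg.norm(W_t[:, None, :] - W_t1[None, :, :], axis=2)
    pairs, used_i, used_j = [], set(), set()
    # Visit candidate pairs in order of increasing distance.
    for idx in np.argsort(d, axis=None):
        i, j = np.unravel_index(idx, d.shape)
        if i in used_i or j in used_j or d[i, j] > max_dist:
            continue
        pairs.append((int(i), int(j)))
        used_i.add(i); used_j.add(j)
    return pairs
```

The exact Hungarian solution mentioned on the slide can be obtained instead with `scipy.optimize.linear_sum_assignment` on the same distance matrix.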

  9. Temporal persistency
  • Sufficient for rough global motion model estimation
    • Affine model estimation
    • Global motion (camera motion) compensation
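A least-squares sketch of the affine global-motion step, assuming matched point positions from the trajectories (function names are illustrative; the slides do not specify the estimator, and a robust variant such as RANSAC would be used in practice):

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine model dst ≈ src @ A.T + b from matched points.

    src, dst: (N, 2) arrays of matched positions, N >= 3 non-collinear.
    Returns (A, b): a 2x2 matrix and a 2-vector.
    """
    # Design matrix [x, y, 1]; solve one column per output coordinate.
    X = np.hstack([src, np.ones((src.shape[0], 1))])
    P, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return P[:2].T, P[2]

def residual_motion(src, dst, A, b):
    """Motion left after global (camera) motion compensation."""
    return dst - (src @ A.T + b)
```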

  10. Event representation
  • Issues:
    • Variable number of features
    • Variable size of trajectories
  • Bag-of-features representation
  • Trajectory quantization
    • Polar coordinates of the motion vector
    • Normalised scale parameter
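One way to realize the polar-coordinate quantization of a motion vector into a discrete symbol (the bin counts and magnitude thresholds below are assumptions, not values from the talk, and the scale parameter is omitted for brevity):

```python
import numpy as np

def quantize_motion(vec, n_angle=8, mag_edges=(0.5, 2.0, 8.0)):
    """Quantize one motion vector into a discrete symbol via polar coordinates.

    Angle -> one of n_angle sectors; magnitude -> one of len(mag_edges)+1
    bins.  Returns a single integer symbol (angle_bin * n_mag_bins + mag_bin).
    """
    dx, dy = vec
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx) % (2 * np.pi)        # angle in [0, 2π)
    a_bin = int(ang / (2 * np.pi) * n_angle) % n_angle
    m_bin = int(np.searchsorted(mag_edges, mag))  # magnitude bin by thresholds
    return a_bin * (len(mag_edges) + 1) + m_bin
```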

  11. Multiscale histograms of trajectories
  • Multiscale histograms [Chen05]
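A plausible reading of the multiscale histogram, sketched as a temporal pyramid over the quantized trajectory symbols. The exact construction in [Chen05] may differ; segment counts and normalisation here are assumptions.

```python
import numpy as np

def multiscale_histogram(symbols, n_symbols, levels=3):
    """Concatenate symbol histograms over a temporal pyramid of a trajectory.

    Level l splits the symbol sequence into 2**l equal segments; each
    segment contributes one normalised histogram over n_symbols bins.
    """
    symbols = np.asarray(symbols)
    parts = []
    for l in range(levels):
        for seg in np.array_split(symbols, 2 ** l):
            h = np.bincount(seg, minlength=n_symbols).astype(float)
            parts.append(h / max(h.sum(), 1.0))
    return np.concatenate(parts)
```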

  12. Learning events
  • State-of-the-art classifier: SVM
  • Histogram-compliant kernel function (formula shown on slide)
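The kernel formula did not survive transcription. A common histogram-compliant choice, shown here purely as an assumption, is the histogram-intersection kernel:

```python
import numpy as np

def hist_intersection_kernel(H1, H2):
    """Histogram-intersection Gram matrix.

    K[i, j] = sum_k min(H1[i, k], H2[j, k]) for histogram rows H1, H2.
    """
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)
```

Such a precomputed Gram matrix can be handed to an SVM, e.g. scikit-learn's `SVC(kernel="precomputed")`.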

  13. Results
  • LFT: Local feature trajectories
  • Corpora
    • Laptev corpus for specific event detection
    • CAVIAR corpus for specific event detection
    • TRECVid corpus for generic event classification
  • Baselines
    • HMH: Motion histograms – Hu moments [Bobick 2001], SVM with RBF kernel
    • MGH: Histograms of multiscale spatio-temporal gradients [Irani99], SVM

  14. Corpus I
  • I. Laptev's human activity datasets
  • > 2000 videos (25 persons, 5 scenarios, 6 activities)

  15. Examples: Running, Jogging, Handclapping

  16. Results I

  17. Corpus II
  • CAVIAR (Context Aware Vision using Image-based Active Recognition) shop-monitor sequences
  • Multiple occurrences (5 events)

  18. Results II

  19. Corpus III
  • TRECVid news broadcast corpus
  • > 2000 shots (6 events)

  20. Results III

  21. From abstraction to indexing
  • Robust temporal content representation based on salient content
  • This strategy leads to event classification
    • May be a soft indicator for these events
  • How to combine this information with other features?
    • Multimodal fusion…
    • … at query time → interactive multimodal fusion

  22. Content-based video indexing (outline repeated)
  Multimodal content abstraction:
  • From the raw signal, infer high-level properties
  • Raw signal = (audio, video, text, metadata, …)ⁿ
  • High-level properties
    • Semantic labels at various levels (event, object, story, …)
    • Data characteristics (up, down, slow, exciting, …)
  • Essentially 2 strategies
    • Supervised classification (e.g. activity recognition)
    • Interactive retrieval (e.g. query-by-example)

  23. External video indexing
  • Hundreds of thousands of high-dimensional descriptors associated with various modalities
  • Video segments need to be compared to each other according to their features
  • The index consists of dissimilarity matrices computed off-line
    • Fast retrieval
    • Homogeneity of the index whatever the features used
    • But initial feature values are lost!
    • How to efficiently store updatable matrices?
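The off-line index computation amounts to a pairwise distance matrix per feature. A minimal sketch, assuming Euclidean distance (the talk also uses Euclidean distances between histograms on slide 29):

```python
import numpy as np

def dissimilarity_matrix(F):
    """All-pairs Euclidean distances between the N rows of F, as an N x N matrix.

    Uses the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y to avoid an
    explicit double loop; clamps tiny negative values from round-off.
    """
    sq = np.sum(F ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * F @ F.T, 0.0)
    return np.sqrt(D2)
```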

  24. Interactive multimodal retrieval
  • Query-by-example (QBE) paradigm associated with relevance feedback
    • Set of positive & negative examples
  • Multiple feature spaces available through dissimilarity matrices
    • Set of M feature-space distances
  • Constraint: real-time interactions
    • Dissimilarity-based learning

  25. Dissimilarity space
  • Pair-wise dissimilarities replace features
  • If R = {p1, …, pN} is the representation set, the dissimilarity space maps each item z to the vector [d(z, p1), …, d(z, pN)]
  • If R = S+ (the positive examples):
    • Low-dimensional space (size N)
    • 1+x to 1+1 classification
  [Figure: a 2-D feature space (x1, x2) mapped to the dissimilarity space (d+1, d+2)]
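With the matrices already precomputed off-line, the mapping into the dissimilarity space is just column selection; a small sketch (the function name is illustrative):

```python
import numpy as np

def to_dissimilarity_space(D, rep_idx):
    """Represent each item by its distances to the representation set R.

    D: (N, N) precomputed dissimilarity matrix; rep_idx: indices of R.
    Item i becomes the row [d(i, p) for p in R], shape (N, len(R)).
    """
    return np.asarray(D)[:, rep_idx]
```

Taking `rep_idx` to be the indices of the positive examples S+ gives the low-dimensional space described on the slide.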

  26. Kernel Discriminant Analysis
  • Relevance feedback: the user gives
    • S+: positive examples pi+ → di+
    • S−: negative examples pi− → di−
  • Estimate a ranking function that places positives at the top and pushes negatives to the end
  • The solution is an expansion of kernel functions centered on the training vectors
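One way to obtain such a kernel expansion, shown as a hedged sketch rather than the authors' method: regularized least-squares kernel regression with ±1 targets, which is closely related to the kernel Fisher discriminant. All names and the RBF/regularization parameters below are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF Gram matrix between row vectors of X and Y."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * d2)

def fit_ranking(X_train, y, gamma=1.0, reg=1e-3):
    """Fit f(z) = sum_i alpha_i k(z, x_i) against targets y in {+1, -1}."""
    K = rbf_kernel(X_train, X_train, gamma)
    return np.linalg.solve(K + reg * np.eye(len(y)), y)

def score(Z, X_train, alpha, gamma=1.0):
    """Ranking scores for candidates Z; higher means more relevant."""
    return rbf_kernel(Z, X_train, gamma) @ alpha
```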

  27. Multimodal analysis
  • Dissimilarities are known for M features
  • Multiple dissimilarity spaces
    • Concatenation
    • Multimodal RBF kernel
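A sketch of one natural reading of the multimodal RBF kernel: a product of per-modality RBF kernels, which equals exp(-Σₘ γₘ dₘ²) and thus behaves like a single RBF kernel over the concatenated, per-modality-scaled dissimilarity spaces. The per-modality widths are assumptions.

```python
import numpy as np

def multimodal_rbf(dists, gammas):
    """Combine M per-modality dissimilarities into one kernel value.

    dists:  d_m(x, y) for each modality m; gammas: per-modality widths.
    Product of RBF factors == exp(-sum_m gamma_m * d_m**2).
    """
    return np.prod([np.exp(-g * d ** 2) for g, d in zip(gammas, dists)])
```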

  28. Example
  [Figure: query results plotted in the dissimilarity space (d(z, p1+), d(z, p2+))]

  29. Evaluation
  • TRECVid 2003 corpus: around 120 hrs of annotated videos
  • 37,000 shots indexed by low-level features
    • Global color histogram
    • Global motion histogram (MPEG motion vectors)
    • ASR histogram (word occurrences and co-occurrences)
  • Euclidean distance between histograms

  30. Adding modalities
  Average precision (100 instances of the query "Basketball")

  31. Varying the size of the training set
  Average precision (100 instances of the query "Basketball")

  32. Cont'd
  Average precision (100 instances of the query "Basketball")

  33. Conclusion
  • Some specific features may be designed as an instantiation of domain/expert knowledge
    • Robustness should be preserved
    • They may not solve all interesting problems
  • They should be part of a more generic framework
    • Dissimilarity representation for homogeneous features
    • Interactive learning for online relevance feedback

  34. Vicode 2.0

  35.

  36. Summary
  • Dissimilarity spaces allow tractable computation on large collections represented in high-dimensional spaces
    • Multimodal dissimilarity (MD) space
    • Real-time user interactions
  • KFD seems to be able to learn in the MD space, but there are problems with kernel selection and tuning
    • Combination of kernels (e.g. linear with RBF)?
    • Interactive learning of kernel parameters?

  37. Interfaces
  • Relevance feedback acquisition for temporal audiovisual data?
  • Visualization for multimodal temporal data?
  • Visualization of the collection at several levels (frame, shot, story, …)?

  38. Visual QBE of video shots
  • Interface for efficient RF interaction

  39. Visual exploration of retrieval results
  • Multimodal visualization (audio, ASR, motion)…

  40. Visual exploration of video documents
  • Document visualization at multiple granularities (frame, atom, shot, story, …)
