
Visual Event Recognition in Videos by Learning from Web Data

Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶ († Nanyang Technological University, Singapore; ¶ Kodak Research Labs, Rochester, NY, USA)


Presentation Transcript


  1. Visual Event Recognition in Videos by Learning from Web Data • Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶ • † Nanyang Technological University, Singapore • ¶ Kodak Research Labs, Rochester, NY, USA

  2. Outline • Overview of the Event Recognition System • Similarity between Videos • Aligned Space-Time Pyramid Matching • Cross-Domain Problem • Adaptive Multiple Kernel Learning • Experiments • Conclusion

  3. Overview • GOAL: Recognize consumer videos • Large intra-class variability; limited labeled videos • Example events: wedding, sports, picnic

  4. Overview • GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube) • [Figure: consumer videos and a large number of web videos for the events wedding, sports, and picnic]

  5. Overview • Flowchart of the system • [Figure: flowchart with video database, test video, classifier, and output]

  6. Similarity between Videos • Pyramid matching methods • Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1] • Unaligned space-time pyramid matching, I. Laptev [2] • [Figure: pyramid partitions along the time axis, the space axes, and the space-time axes]

  7. Similarity between Videos • Aligned Space-Time Pyramid Matching • Each video is divided into $8^l$ non-overlapped space-time volumes, where $l = 0, \ldots, L-1$ is the pyramid level • Greater variability • Two-step approach • Distances between space-time volumes: computed with existing methods such as the bag-of-words model, I. Laptev [2]

  8. Similarity between Videos • Aligned Space-Time Pyramid Matching • [Figure: distance between two videos at Level 1]
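The space-time partition can be illustrated with a short sketch. It assumes that each of the three axes (width, height, time) is halved once per pyramid level, so that level l has 8^l volumes; the function name `space_time_volumes` and the video dimensions are hypothetical and not taken from the presentation.

```python
from itertools import product


def space_time_volumes(width, height, frames, level):
    """Return (x0, x1, y0, y1, t0, t1) bounds of every volume at one pyramid level.

    Assumption: every axis is split into 2**level equal parts, which gives
    8**level non-overlapped volumes in total.
    """
    parts = 2 ** level

    def cuts(size):
        # Integer boundaries that cover [0, size) without gaps or overlap.
        return [(i * size // parts, (i + 1) * size // parts) for i in range(parts)]

    return [
        (x0, x1, y0, y1, t0, t1)
        for (t0, t1), (y0, y1), (x0, x1) in product(cuts(frames), cuts(height), cuts(width))
    ]


# Hypothetical 320x240 video with 100 frames.
level0 = space_time_volumes(320, 240, 100, level=0)
level1 = space_time_volumes(320, 240, 100, level=1)
print(len(level0), len(level1))
```

At level 0 the single volume is the whole video, which is why aligned and unaligned matching can only differ from Level 1 onward.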

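  9. Similarity between Videos • Integer-flow Earth Mover’s Distance (EMD), Y. Rubner [3] • Distance: $D(V_i, V_j) = \frac{\sum_{r}\sum_{c} F_{rc} D_{rc}}{\sum_{r}\sum_{c} F_{rc}}$, where $F = \arg\min_{f} \sum_{r}\sum_{c} f_{rc} D_{rc}$ s.t. $\sum_{c} f_{rc} = 1\ \forall r$ and $\sum_{r} f_{rc} = 1\ \forall c$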

  10. Similarity between Videos • Integer-flow Earth Mover’s Distance (EMD), Y. Rubner [3] • Distance between two videos obtained from the optimal flow, subject to the flow constraints
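When both videos have the same number of volumes and every volume carries unit weight, the EMD flow problem reduces to a one-to-one matching between volumes. The sketch below illustrates that special case by brute force over all permutations; it is an assumption-laden toy, not the authors' implementation. The function name `integer_flow_emd` and the distance matrix are hypothetical, and a practical version would use an assignment solver such as the Hungarian algorithm instead of enumeration.

```python
from itertools import permutations


def integer_flow_emd(dist):
    """Match each row volume to exactly one column volume at minimum total cost.

    dist[r][c] is the distance between volume r of one video and volume c of
    the other (square matrix). Returns (mean matched distance, best matching).
    Brute force: only suitable for a handful of volumes.
    """
    n = len(dist)
    best_cost, best_match = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(dist[r][perm[r]] for r in range(n))
        if cost < best_cost:
            best_cost, best_match = cost, perm
    # The total flow equals n, so dividing by n normalizes the distance.
    return best_cost / n, best_match


# Toy distances between three volumes of two videos.
dist = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
emd, match = integer_flow_emd(dist)
print(emd, match)
```

Here volume 0 is matched to volume 1, volume 1 to volume 0, and volume 2 to itself, which is the kind of alignment a fixed, unaligned comparison of corresponding volumes would miss.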

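  11. Cross-Domain Problem • Data distribution mismatch between consumer videos and web videos • Consumer videos: naturally captured • Web videos: edited and selected • Maximum Mean Discrepancy (MMD), K. M. Borgwardt [4]: $\mathrm{DIST}_k(\mathcal{D}^A, \mathcal{D}^T) = \left\| \frac{1}{n_A}\sum_{i=1}^{n_A} \varphi(\mathbf{x}_i^A) - \frac{1}{n_T}\sum_{i=1}^{n_T} \varphi(\mathbf{x}_i^T) \right\|_{\mathcal{H}}$, where $\mathbf{x}_i^A$ and $\mathbf{x}_i^T$ are samples from the auxiliary (web) and target (consumer) domains, and $\varphi$ is the feature map induced by the kernel $k$.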
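The MMD criterion can be estimated from samples using kernel evaluations alone, because the squared norm of the difference of feature means expands into three averages of kernel values. The following is a minimal sketch assuming a Gaussian kernel on toy 2-D feature vectors; the function names and data are illustrative and do not come from the presentation.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel on plain tuples of floats."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def squared_mmd(aux, tgt, kernel=gaussian_kernel):
    """Biased empirical estimate of the squared MMD between two sample sets."""

    def mean_k(xs, ys):
        return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

    return mean_k(aux, aux) + mean_k(tgt, tgt) - 2.0 * mean_k(aux, tgt)


# Toy stand-ins for web (auxiliary) and consumer (target) video features.
web = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
consumer = [(1.0, 1.1), (0.9, 1.0)]

print(squared_mmd(web, consumer))  # clearly positive: the sets are far apart
print(squared_mmd(web, web))       # zero: identical sample sets
```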

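  12. Cross-Domain Problem • Suppose there are $P$ pre-learned classifiers $\{f_p(\mathbf{x})\}_{p=1}^{P}$ • Each $f_p$ is learned by SVM with the labeled training data from both domains • Proposed target decision function: $f^T(\mathbf{x}) = \sum_{p=1}^{P} \beta_p f_p(\mathbf{x}) + \Delta f(\mathbf{x})$, where the first term carries the prior information, $\beta_p$ is the linear combination coefficient, and $\Delta f(\mathbf{x})$ is the perturbation function.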
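The structure of the target decision function, a weighted sum of pre-learned classifiers plus a perturbation term, can be shown in a few lines. This is only a sketch of the form described on the slide: the toy classifiers, coefficients, and perturbation below are hypothetical placeholders, whereas in the presentation the coefficients and the perturbation are learned.

```python
def target_decision(x, prelearned, betas, perturbation):
    """Weighted sum of pre-learned classifier scores plus a perturbation term."""
    prior = sum(beta * f(x) for beta, f in zip(betas, prelearned))
    return prior + perturbation(x)


# Hypothetical pre-learned classifiers acting on a 2-D feature tuple.
prelearned = [lambda x: x[0] - 0.5, lambda x: x[1] - 0.5]
betas = [0.5, 0.5]  # illustrative linear combination coefficients


def perturbation(x):
    # Illustrative correction toward the target domain.
    return 0.1 * (x[0] + x[1]) - 0.1


score = target_decision((1.0, 1.0), prelearned, betas, perturbation)
print(score)
```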

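  13. Cross-Domain Problem • Motivated by Multiple Kernel Learning (MKL) (F. Bach [5]), the perturbation function is $\Delta f(\mathbf{x}) = \sum_{m=1}^{M} d_m \mathbf{w}_m^\top \varphi_m(\mathbf{x}) + b$ • MKL: $k = \sum_{m=1}^{M} d_m k_m$, where $d_m \ge 0$ and $\sum_{m=1}^{M} d_m = 1$ • MMD: $\mathrm{DIST}_k^2(\mathcal{D}^A, \mathcal{D}^T) = \Omega(\mathbf{d}) = \mathbf{h}^\top \mathbf{d}$, where $\mathbf{d} = [d_1, \ldots, d_M]^\top$ and $h_m$ is the squared MMD under the base kernel $k_m$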
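Because the combined kernel is a weighted sum of base kernels, the squared MMD under the combined kernel is the same weighted sum of the squared MMDs under each base kernel, so it is linear in the kernel coefficients. The sketch below checks this numerically; the two base kernels, the coefficients, and the toy data are assumptions chosen for illustration only.

```python
import math


def sqdist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def squared_mmd(kernel, aux, tgt):
    """Biased empirical squared MMD computed from kernel evaluations."""

    def mean_k(xs, ys):
        return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

    return mean_k(aux, aux) + mean_k(tgt, tgt) - 2.0 * mean_k(aux, tgt)


def combine(kernels, d):
    """Kernel formed as the weighted sum of base kernels with coefficients d."""
    return lambda x, y: sum(dm * k(x, y) for dm, k in zip(d, kernels))


# Two illustrative base kernels and non-negative coefficients summing to one.
base_kernels = [
    lambda x, y: math.exp(-sqdist(x, y)),
    lambda x, y: 1.0 / (sqdist(x, y) + 1.0),
]
d = [0.3, 0.7]

aux = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
tgt = [(1.0, 1.1), (0.9, 1.0)]

h = [squared_mmd(k, aux, tgt) for k in base_kernels]   # per-kernel squared MMD
direct = squared_mmd(combine(base_kernels, d), aux, tgt)
linear = sum(dm * hm for dm, hm in zip(d, h))
print(direct, linear)
```

The two printed values agree up to floating-point rounding, which is what allows the MMD term to be written as a simple function of the coefficient vector.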

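  14. Cross-Domain Problem • Adaptive Multiple Kernel Learning (A-MKL): $\min_{\mathbf{d}} G(\mathbf{d}) = \tfrac{1}{2}\Omega^2(\mathbf{d}) + \theta J(\mathbf{d})$, where the first term is the MMD term ($\Omega(\mathbf{d})$ is the squared MMD between the two domains under the combined kernel), $J(\mathbf{d})$ is the structural risk functional of the target decision function, and $\theta > 0$ balances the two terms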

  15. Cross-Domain Problem • Dual form of the structural risk functional $J(\mathbf{d})$ • A-MKL algorithm • Iteratively solve for the linear combination coefficients $\mathbf{d}$ and the dual variables in the dual form of $J(\mathbf{d})$.

  16. Cross-Domain Problem • Feature Replication (FR), H. Daumé III [6] • Augment features • Domain Transfer SVM (DTSVM), L. Duan [7] • No prior information • Adaptive SVM (A-SVM), J. Yang [8] • The linear combination coefficient is pre-defined • The perturbation function is modeled by SVM

  17. Experiments • Data set • 195 consumer videos and 906 web videos, collected by ourselves and from the Kodak Consumer Video Benchmark Data Set • 6 events: “wedding”, “birthday”, “picnic”, “parade”, “show” and “sports” • Training data: 3 videos per event from the consumer videos, plus all web videos • Test data: the remaining consumer videos

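  18. Experiments • Two types of features • Space-time (ST) feature, Laptev et al. [2] • SIFT feature, Lowe [9] • Four types of base kernels, defined on the distance $D$ between two videos • Gaussian: $k = \exp(-\gamma D^2)$ • Laplacian: $k = \exp(-\sqrt{\gamma}\, D)$ • Inverse Square Distance: $k = \frac{1}{\gamma D^2 + 1}$ • Inverse Distance: $k = \frac{1}{\sqrt{\gamma}\, D + 1}$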
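The four base kernel types each turn a distance between two videos into a similarity. The sketch below assumes a common parameterization with a single kernel parameter `gamma` (using `gamma` for the squared-distance kernels and its square root for the others); that parameterization and the function name `base_kernels` are assumptions for illustration, not values confirmed by the transcript.

```python
import math


def base_kernels(distance, gamma):
    """Similarity values of the four base kernel types for one pairwise distance.

    Assumed forms: Gaussian and Inverse Square Distance (ISD) use gamma * D**2,
    Laplacian and Inverse Distance (ID) use sqrt(gamma) * D.
    """
    root = math.sqrt(gamma)
    return {
        "gaussian": math.exp(-gamma * distance ** 2),
        "laplacian": math.exp(-root * distance),
        "isd": 1.0 / (gamma * distance ** 2 + 1.0),
        "id": 1.0 / (root * distance + 1.0),
    }


same = base_kernels(distance=0.0, gamma=0.25)   # identical videos
apart = base_kernels(distance=2.0, gamma=0.25)  # videos at distance 2
print(same)
print(apart)
```

All four kernels equal 1 at zero distance and decay as the distance grows; they differ in how quickly they decay, which is one reason to combine them rather than pick one.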

  19. Experiments • Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM) • ASTPM is better than USTPM at Level 1 • [Figure: results for unaligned vs. aligned matching]

  20. Experiments • 80 base kernels in total: 2 pyramid levels, 2 types of features, 5 kernel parameters and 4 types of kernels • Average classifiers at each pyramid level $l$ ($l = 0, 1$) • SIFT features: average of 20 base classifiers learned by SVM • ST features: average of 20 base classifiers learned by SVM • Pre-learned classifiers: 4 average classifiers
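The kernel count on this slide follows from a simple product: 2 levels, 2 feature types, 4 kernel types, and 5 parameter values give 80 base kernels, and grouping them by level and feature type yields 4 groups of 20. The sketch below enumerates that grid and shows one way to average a group of base classifiers. The parameter indices are placeholders because the transcript does not give the actual values, and the helper `average_classifier` is hypothetical.

```python
from collections import defaultdict
from itertools import product

levels = [0, 1]
features = ["SIFT", "ST"]
kernel_types = ["gaussian", "laplacian", "isd", "id"]
param_ids = range(5)  # placeholder indices; actual parameter values are not given

configs = list(product(levels, features, kernel_types, param_ids))

# Group base kernels by (level, feature): one average classifier per group.
groups = defaultdict(list)
for level, feature, kernel_type, param_id in configs:
    groups[(level, feature)].append((kernel_type, param_id))


def average_classifier(base_classifiers):
    """Average the decision values of a list of base classifiers."""
    return lambda x: sum(f(x) for f in base_classifiers) / len(base_classifiers)


print(len(configs), len(groups), {key: len(val) for key, val in groups.items()})
```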

  21. Experiments • Comparisons of cross-domain learning methods • (a) SIFT features • (b) ST features • (c) SIFT features and ST features • “parade”: 75.7% (A-MKL) vs. 62.2% (FR)

  22. Experiments • Comparisons of cross-domain learning methods • Relative improvements of A-MKL over other methods • SVM_T: 36.9% • SVM_AT: 8.6% • Feature Replication (FR) [6]: 7.6% • Adaptive SVM (A-SVM) [8]: 49.6% • Domain Transfer SVM (DTSVM) [7]: 9.9% • MKL-based methods • Better fuse SIFT features and ST features • Handle noise in the loose labels

  23. Conclusion • We propose a new event recognition framework for consumer videos by leveraging a large number of loosely labeled web videos. • We develop a new aligned space-time pyramid matching method. • We present a new cross-domain learning method A-MKL which handles the mismatch between the data distributions of the consumer video domain and the web video domain.

  24. References [1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008. [2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. [3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth mover’s distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000. [4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.

  25. References [5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML, 2004. [6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007. [7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009. [8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007. [9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

  26. Thank you!
