
SUPER: Towards Real-time Event Recognition in Internet Videos

Speeded Up Event Recognition. SUPER: Towards Real-time Event Recognition in Internet Videos. Yu-Gang Jiang, School of Computer Science, Fudan University, Shanghai, China, ygj@fudan.edu.cn. ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, Jun. 2012.


Presentation Transcript


  1. Speeded Up Event Recognition • SUPER: Towards Real-time Event Recognition in Internet Videos • Yu-Gang Jiang, School of Computer Science, Fudan University, Shanghai, China • ygj@fudan.edu.cn • ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, Jun. 2012.

  2. The Problem • Recognize high-level events in videos • We’re particularly interested in Internet consumer videos • Applications • Video Search • Personal Video Collection Management • Smart Advertising • Intelligence Analysis • …

  3. Our Objective Improve Efficiency Maintain Accuracy

  4. The Baseline Recognition Framework • Best-performing approach in the TRECVID-2010 Multimedia Event Detection (MED) task • Feature extraction: SIFT, spatial-temporal interest points, MFCC audio feature • Classifier: χ2 kernel SVM • Fusion: late average fusion • Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.

  5. Three Audio-Visual Features… • SIFT (visual): D. Lowe, IJCV ’04 • STIP (visual): I. Laptev, IJCV ’05 • MFCC (audio) • [Figure: audio framing, labelled 16 ms and 16 ms]

  6. Bag-of-words Representation • SIFT / STIP / MFCC words • Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007) • [Figure: Bag-of-SIFT] • Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans. on Audio, Speech, and Language Processing, 2010.
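
The soft-weighting step named on this slide can be sketched as follows. This is a minimal illustration, not the code behind the talk: each descriptor votes for its `top_n` nearest codewords, and the vote for the word at rank i is halved at every rank. The function name, the toy 2-D descriptors, the default `top_n=4`, and the similarity `1 / (1 + distance)` are all assumptions made for the example.

```python
import math

def soft_weighted_bow(descriptors, codebook, top_n=4):
    """Build a soft-weighted bag-of-words histogram.

    Each descriptor votes for its top_n nearest codewords; the vote for
    the codeword at rank i (1-based) is similarity / 2**(i - 1).
    """
    hist = [0.0] * len(codebook)
    for d in descriptors:
        # Distance from this descriptor to every codeword, nearest first.
        ranked = sorted((math.dist(d, w), k) for k, w in enumerate(codebook))
        for rank, (dist, k) in enumerate(ranked[:top_n]):
            sim = 1.0 / (1.0 + dist)  # assumed similarity, for illustration only
            hist[k] += sim / (2 ** rank)
    return hist

# Toy usage: three 2-D codewords, one descriptor sitting on the first codeword.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
hist = soft_weighted_bow([(0.0, 0.0)], codebook, top_n=2)
```

Compared with hard assignment, a descriptor that falls between two codewords contributes to both, which makes the histogram less sensitive to quantization boundaries.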

  7. Baseline Speed… • 4 factors affect speed: feature, classifier, fusion, frame sampling • Feature extraction: SIFT 82.0 s; spatial-temporal interest points 916.8 s; MFCC audio feature 2.36 s • Classifier (χ2 kernel SVM): ~2.00 s • Late average fusion: <<1 s • Total: 1003 seconds per video! • Feature efficiency is measured in seconds needed to process an 80-second video sequence (for SIFT: 0.5 fps). Classification time is measured by classifying a video using the classifiers of all 20 categories.
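
As a quick check on the slide's total, the per-stage figures can be added up. The mapping of numbers to stages below is an assumption: it is read off the flattened figure and chosen so that the parts add up to the stated total of about 1003 seconds. The late-fusion cost ("<<1") is left out as negligible.

```python
# Per-stage times in seconds for one 80-second video, as read from slide 7.
# The stage labels are an assumed mapping of the flattened figure.
baseline_seconds = {
    "SIFT": 82.0,
    "STIP": 916.8,
    "MFCC": 2.36,
    "chi2_svm_classification": 2.00,  # the slide gives "~2.00"
}

total = sum(baseline_seconds.values())
stip_share = baseline_seconds["STIP"] / total
print(f"total = {total:.2f} s, STIP share = {stip_share:.1%}")
```

Under this reading, spatio-temporal interest point extraction accounts for roughly nine tenths of the baseline cost, which is why the choice of features dominates the later slides.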

  8. Dataset: Columbia Consumer Videos (CCV) • 20 categories: Basketball, Non-music Performance, Skiing, Dog, Wedding Reception, Baseball, Swimming, Bird, Wedding Ceremony, Parade, Soccer, Biking, Graduation, Wedding Dance, Beach, Playground, Cat, Birthday Celebration, Music Performance, Ice Skating • Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.

  9. Feature Options • (Sparse) SIFT • STIP • MFCC • Dense SIFT (DIFT) • Dense SURF (DURF) • Self-Similarities (SSIM) • Color Moments (CM) • GIST • LBP • TINY • Uijlings, Smeulders, Scha, Real-time bag of words, approximately, in ACM CIVR 2009. • Suggested feature combinations:

  10. Classifier Kernels • Chi Square Kernel • Histogram Intersection Kernel (HI) • Fast HI Kernel (fastHI) • Maji, Berg, Malik, Classification Using Intersection Kernel Support Vector Machines is Efficient, in CVPR 2008.
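
The three kernel options can be compared in a small sketch. The χ2 form used here, exp(-γ Σ (x_i - y_i)² / (x_i + y_i)), is one common variant and the value of `gamma` is an assumption. `FastHIDecision` illustrates the exact sorted-table idea from Maji et al.: because the intersection kernel is a sum over dimensions, the SVM decision value can be evaluated with one binary search per dimension instead of one kernel call per support vector. All class and function names are hypothetical.

```python
import math
from bisect import bisect_left

def chi2_kernel(x, y, gamma=1.0):
    """exp(-gamma * sum((x_i - y_i)^2 / (x_i + y_i))), skipping empty bins."""
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * d)

def hi_kernel(x, y):
    """Histogram intersection: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))

def naive_hi_decision(x, support_vectors, coefs, bias=0.0):
    """Direct evaluation: one kernel call per support vector."""
    return sum(c * hi_kernel(x, sv) for sv, c in zip(support_vectors, coefs)) + bias

class FastHIDecision:
    """Per-dimension sorted tables; each dimension costs one binary search."""

    def __init__(self, support_vectors, coefs, bias=0.0):
        self.bias = bias
        self.tables = []
        for i in range(len(support_vectors[0])):
            pairs = sorted((sv[i], c) for sv, c in zip(support_vectors, coefs))
            values = [v for v, _ in pairs]
            prefix_cv = [0.0]  # prefix_cv[j]: sum of c * v over the j smallest values
            prefix_c = [0.0]   # prefix_c[j]: sum of c over the j smallest values
            for v, c in pairs:
                prefix_cv.append(prefix_cv[-1] + c * v)
                prefix_c.append(prefix_c[-1] + c)
            self.tables.append((values, prefix_cv, prefix_c))

    def __call__(self, x):
        total = self.bias
        for xi, (values, prefix_cv, prefix_c) in zip(x, self.tables):
            j = bisect_left(values, xi)  # values[:j] < xi <= values[j:]
            # min(xi, v) equals v for the first j entries and xi for the rest.
            total += prefix_cv[j] + xi * (prefix_c[-1] - prefix_c[j])
        return total

# Toy usage: three support vectors with signed coefficients (alpha * label).
svs = [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.1, 0.1, 0.8]]
coefs = [0.7, -1.2, 0.5]
fast = FastHIDecision(svs, coefs, bias=0.1)
x = [0.3, 0.3, 0.4]
assert abs(fast(x) - naive_hi_decision(x, svs, coefs, bias=0.1)) < 1e-12
```

The tables are built once per classifier, so the saving grows with the number of support vectors. The χ2 kernel has no such per-dimension decomposition after the exponential, which is one reason it is slower at test time.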

  11. Multi-modality Fusion • Early Fusion: feature concatenation • Kernel Fusion: Kf = K1 + K2 + … • Late Fusion: fusion of classification scores • [Figure panels: MFCC, DURF, SSIM, CM, GIST, LBP; MFCC, DURF]
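
The three fusion strategies differ only in where the modalities are combined. The sketch below uses plain Python lists and hypothetical function names. It assumes that late fusion means unweighted score averaging, as in the baseline on slide 4, and that kernel fusion is the unweighted sum written on the slide.

```python
def early_fusion(feature_vectors):
    """Concatenate per-modality feature vectors into one long vector."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused

def kernel_fusion(kernel_matrices):
    """Kf = K1 + K2 + ...: element-wise sum of equally shaped kernel matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*kernel_matrices)]

def late_average_fusion(scores_per_modality):
    """Average the per-class scores of separately trained classifiers."""
    m = len(scores_per_modality)
    return [sum(col) / m for col in zip(*scores_per_modality)]

# Toy usage with two modalities.
fused_vec = early_fusion([[1, 2], [3]])
fused_kernel = kernel_fusion([[[1, 0], [0, 1]], [[2, 1], [1, 2]]])
fused_scores = late_average_fusion([[1, 3], [3, 5]])
```

Early and kernel fusion need a single classifier per event, while late fusion trains and evaluates one classifier per modality, so the choice affects speed as well as accuracy.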

  12. Frame Sampling • DURF: uniformly sampling 16 frames per video seems sufficient. • K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, in CVPR 2008.
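
Uniform sampling of a fixed number of frames can be sketched as below. The slides do not say how the 16 positions are chosen. Taking the centre of each of 16 equal-length segments is an assumption, as are the function name and the 30 fps frame rate in the usage example.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Pick up to num_samples frame indices spread evenly over a video."""
    if num_frames <= 0:
        return []
    k = min(num_samples, num_frames)
    # Take the centre of each of k equal-length segments.
    return [int((i + 0.5) * num_frames / k) for i in range(k)]

# Toy usage: an 80-second video at an assumed 30 fps has 2400 frames.
indices = uniform_frame_indices(2400)
```

Visual features are then extracted only at these indices, so extraction cost no longer grows with video length.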

  13. Frame Sampling • MFCC: sampling audio frames is always harmful.

  14. Summary • Feature: Dense SURF (DURF), MFCC, plus some global features • Classifier: Fast HI kernel SVM • Fusion: Early • Frame Selection: Audio - No; Visual - Yes • 220-fold speed-up!
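
To put the speed-up in perspective, the implied per-video time can be worked out. This assumes the 220-fold figure is measured against the 1003-second baseline of slide 7, which the slides do not state explicitly.

```python
baseline_seconds = 1003.0  # slide 7: total time per 80-second video
speedup = 220.0            # slide 14
video_seconds = 80.0       # slide 7: length of the timed video

implied_seconds = baseline_seconds / speedup
realtime_factor = video_seconds / implied_seconds
print(f"{implied_seconds:.2f} s per video, {realtime_factor:.1f}x faster than playback")
```

Under that assumption, an 80-second clip would be processed in under five seconds, well within real time.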

  15. Demo…

  16. email: ygj@fudan.edu.cn Thank you!
