
UCF VIRAT Efforts



Presentation Transcript


  1. UCF VIRAT Efforts Bag of Video-Words Video Representation

  2. Outline • Bag of Video-words approach for video representation • Feature detection • Feature quantization • Histogram-based video descriptor generation • Preliminary experimental results on aerial videos • Discussion on ways to improve the performance

  3. Bag of video-words approach (I) • Motion feature detection (interest point detector)

  4. Bag of video-words approach (II) • Feature quantization: codebook generation (video-words A, B, C)
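The codebook step can be sketched as plain k-means over local descriptors pooled from all training clips. A minimal NumPy illustration (the function name and the farthest-point seeding are my own choices, not from the slides):

```python
import numpy as np

def build_codebook(features, k, iters=20, seed=0):
    """Cluster local motion features into k 'video-words' with plain k-means.
    `features` is an (N, D) array of descriptors pooled from training clips."""
    rng = np.random.default_rng(seed)
    # greedy farthest-point seeding avoids collapsed initial centers
    centers = [features[rng.integers(len(features))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # assign each feature to its nearest center (Euclidean distance)
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned features
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers
```

Each row of the returned array is one video-word; cluster count k is the codebook size the later slides vary.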

  5. Bag of video-words approach (III) • Histogram-based video descriptor generation
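Given a codebook, the histogram descriptor for a clip is just a word-count vector. A minimal sketch (helper name is hypothetical):

```python
import numpy as np

def video_descriptor(clip_features, codebook):
    """Quantize each local feature of a clip to its nearest video-word, then
    count word occurrences into an L1-normalized histogram."""
    d = np.linalg.norm(clip_features[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)                      # nearest-word index per feature
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                      # normalize so clips of any length compare
```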

  6. Similarity Metrics • Histogram Intersection • Chi-square distance
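Both metrics can be written directly for L1-normalized histograms (helper names are my own):

```python
import numpy as np

def hist_intersection(h1, h2):
    """Similarity in [0, 1] for L1-normalized histograms: sum of bin-wise minima."""
    return np.minimum(h1, h2).sum()

def chi_square_dist(h1, h2, eps=1e-10):
    """Chi-square distance: 0 for identical histograms, larger = more dissimilar.
    `eps` guards against division by zero in empty bins."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

Intersection is a similarity (bigger = closer); chi-square is a distance (smaller = closer), which matters when plugging them into classifiers later.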

  7. Classifiers • Bayesian Classifier • K-Nearest Neighbors (KNN) • Support Vector Machines (SVM) • Histogram Intersection Kernel • Chi-square Kernel • RBF (Radial Basis Function) Kernel
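Of the classifiers listed, k-NN is the easiest to sketch; here is a toy version that uses histogram intersection as the similarity (a sketch under my own naming, not the authors' implementation):

```python
import numpy as np

def knn_predict(train_hists, train_labels, test_hist, k=3):
    """Classify one video histogram by majority vote among its k most
    similar training histograms (histogram-intersection similarity)."""
    sims = np.minimum(train_hists, test_hist).sum(axis=1)
    nearest = np.argsort(sims)[-k:]              # indices of the k most similar clips
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)      # majority vote
```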

  8. Experiments on Aerial Videos • Dataset • Blimp with an HD camera on a gimbal • 11 actions: digging, gesturing, picking up, throwing, kicking, carrying object, walking, standing, running, entering vehicle, exiting vehicle

  9. Clipping &amp; Cropping Actions • An optimal bounding box is chosen so that the object of interest stays in view across all frames (start frame to end frame)

  10. Feature Detection for Video Clips • 200 features • Example clips: digging, kicking, throwing, walking

  11. Classification Results (I) • “kicking” (22 clips) vs. “non-kicking” (22 clips)

  12. Classification Results (II)

  13. Classification Results (III) • “Digging”, “Kicking”, “Walking”, “Throwing” (25 clips × 4) • Similarity matrix (histogram intersection) over digging, kicking, throwing, walking

  14. Classification Results (IV) • Average accuracy with different codebook sizes • Confusion table for a codebook size of 300

  15. Misclassified examples (I) • Misclassified “walking” as “kicking”

  16. Misclassified examples (II) • Misclassified “digging” as “walking”

  17. Misclassified examples (III) • Misclassified “walking” as “throwing”

  18. How to improve the performance? • Low-level features • Stable motion features • Different motion features • Different motion feature sampling • Hybrid of motion and static features • Video-word generation • Unsupervised method • Hierarchical k-means (David Nister et al., CVPR 2006) • Supervised methods • Random forests (Bill Triggs et al., NIPS 2007) • “Visual Bits” (Rong Jin et al., CVPR 2008) • Classifiers • SVM kernels: histogram intersection vs. chi-square distance • Multiple kernels

  19. Stable motion features • Motion compensation • Video clipping and cropping (start frame to end frame)

  20. Different Low-level Features • Flattened gradient vector (magnitude) • Histogram of Gradients (direction) • Histogram of Optical Flow • Combination of all feature types
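A minimal histogram-of-gradient-orientations descriptor for a single patch, as a rough illustration of the second bullet (assumes a grayscale patch; this is a sketch, not the exact descriptor used in the experiments):

```python
import numpy as np

def hog_patch(patch, n_bins=8):
    """Histogram of gradient orientations for one image patch, with each
    pixel's vote weighted by its gradient magnitude (HOG-style)."""
    gy, gx = np.gradient(patch.astype(float))     # row and column derivatives
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    s = hist.sum()
    return hist / s if s > 0 else hist
```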

  21. Feature sampling • Feature detection: Gabor filter or 3D Harris corner detection • Random sampling • Grid-based sampling • Bill Triggs et al., Sampling Strategies for Bag-of-Features Image Classification, ECCV 2006

  22. Hybrid of Motion and Static Features (I) • Multiple-frame features (spatiotemporal, motion) • 3D Harris • Capture the local spatiotemporal information around the interest points • Single-frame features (spatial, static) • 2D Harris detector • MSER (Maximally Stable Extremal Regions) detector • Perform action recognition from a sequence of instantaneous postures or poses • Overcome the shortcoming of multiple-frame features, which require relatively stable camera motion • Hybrid of motion and static features • Represent a video by the combination of multiple-frame and single-frame features
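One simple way to combine the two streams is weighted concatenation of the motion and static histograms; the slides do not specify the fusion rule, so the `alpha` weighting below is purely my assumption:

```python
import numpy as np

def hybrid_descriptor(motion_hist, static_hist, alpha=0.5):
    """Concatenate motion and static bag-of-words histograms into one
    descriptor, weighting each stream (alpha is an assumed parameter)."""
    return np.concatenate([alpha * np.asarray(motion_hist),
                           (1 - alpha) * np.asarray(static_hist)])
```

With both inputs L1-normalized, the result still sums to 1, so the same histogram kernels apply unchanged.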

  23. Hybrid of Motion and Static Features (II) • Examples of 2D Harris and MSER features

  24. Hybrid of Motion and Static Features (III) • Experiments on three action datasets • KTH, 6 action categories, 600 videos • UCF sports, 10 action categories, about 200 videos • YouTube videos, 11 action categories, about 1,100 videos

  25. KTH dataset • Waving, Clapping, Boxing, Walking, Running, Jogging

  26. Experimental results on KTH dataset • Recognition using Motion (left), Static (middle), and Hybrid (right) features • Average accuracies: 92.66%, 87.65%, 82.96%

  27. Results on UCF sports dataset • The average accuracies for the static, motion, and static+motion strategies are 74.5%, 79.6%, and 84.5%, respectively

  28. YouTube Video Dataset (I) • Golf Swinging, Diving, Cycling, Juggling, Riding

  29. YouTube Video Dataset (II) • Tennis Swinging, Swinging, Basketball Shooting, Trampoline Jumping, Volleyball Spiking

  30. Results on YouTube dataset • The average accuracies for motion, static, and hybrid features are 65.4%, 63.1%, and 71.2%, respectively

  31. Hierarchical K-Means (I) • Traditional k-means • Slow when generating a large codebook • Less discriminative for large codebooks • Hierarchical k-means • Builds a tree over the training features • Child nodes are the clusters obtained by applying k-means to the parent node • Treating each node as a “word” makes the tree a hierarchical codebook • D. Nister, Scalable Recognition with a Vocabulary Tree, CVPR 2006

  32. Hierarchical K-Means (II) • Advantages • The tree also defines the quantization of the features, so indexing and quantization are integrated in one structure • Much more efficient when generating a large codebook • The word (node) frequency can be weighted by the inverse document frequency • Generates more discriminative words than flat k-means • Larger codebooks generally yield better performance
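The vocabulary-tree idea above can be sketched as recursive k-means: split the features, then split each child cluster again. A toy NumPy version (structure and names are my own; Nister's system adds TF-IDF weighting and fast lookup on top of this):

```python
import numpy as np

def vocab_tree(features, branch=2, depth=2, seed=0):
    """Build a small vocabulary tree by recursively applying k-means.
    Every node is a dict with a 'center' and a list of 'children';
    leaves (and, in the full method, internal nodes too) act as words."""
    rng = np.random.default_rng(seed)

    def kmeans(X, k, iters=10):
        c = [X[rng.integers(len(X))]]
        for _ in range(k - 1):                    # farthest-point seeding
            d = np.min([np.linalg.norm(X - ci, axis=1) for ci in c], axis=0)
            c.append(X[d.argmax()])
        c = np.array(c, dtype=float)
        for _ in range(iters):
            lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(1)
            for j in range(k):
                if np.any(lab == j):
                    c[j] = X[lab == j].mean(0)
        lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(1)
        return c, lab

    def build(X, d):
        node = {"center": X.mean(0), "children": []}
        if d == 0 or len(X) < branch:
            return node
        _, lab = kmeans(X, branch)
        for j in range(branch):
            if np.any(lab == j):
                node["children"].append(build(X[lab == j], d - 1))
        return node

    return build(np.asarray(features, dtype=float), depth)
```

Quantizing a new feature then costs only `branch × depth` distance computations instead of one per word, which is why large codebooks become tractable.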

  33. Random Forests (I) • K-means-based quantization methods • Unsupervised • Suffer from the high dimensionality of the features • Single-tree-based methods • Each path through the tree typically accesses only a few of the feature dimensions • Fail to deal with the variance of the feature dimensions • Fast, but performance is not even as good as k-means • Random forests • Build an ensemble of trees • Each tree node is split by checking a randomly selected subset of the feature dimensions • All trees are built using video or image labels (supervised method) • Instead of taking the trees as an ensemble of classifiers, treat all the leaves of all the trees as “words” • The generated “words” are more meaningful and discriminative, since they carry class-category information
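The leaves-as-words mapping can be illustrated with a fern-style simplification, where each tree tests one random dimension per level; note the real method chooses splits using class labels, which this sketch deliberately omits:

```python
import numpy as np

def build_random_trees(features, n_trees=5, depth=3, seed=0):
    """Grow an ensemble of randomized fern-style trees: each tree is a list
    of (dimension, threshold) tests shared by every level of its paths."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        dims = rng.integers(features.shape[1], size=depth)
        lo = features[:, dims].min(0)
        hi = features[:, dims].max(0)
        thr = rng.uniform(lo, hi)                 # random split thresholds
        trees.append((dims, thr))
    return trees

def leaf_words(x, trees):
    """Map one feature vector to a 'word' (global leaf index) per tree."""
    words, offset = [], 0
    for dims, thr in trees:
        bits = (x[dims] > thr).astype(int)        # outcome of each test
        leaf = int("".join(map(str, bits)), 2)    # leaf id within this tree
        words.append(offset + leaf)
        offset += 2 ** len(dims)                  # leaves of later trees get new ids
    return words
```

A clip descriptor is then the normalized histogram of these leaf indices over all its local features, exactly as with a k-means codebook.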

  34. Random Forests (II)

  35. “Visual Bits” (I) • Both k-means and random forests • Treat all features equally when generating the codebook • Use hard assignment (each feature is assigned to exactly one “word”) • “Visual Bits” • Rong Jin et al., Unifying Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008 • Trains a visual codebook for each category, overcoming the shortcomings of hard assignment • Integrates classification and codebook generation, so it can select the relevant features by weighting them

  36. “Visual Bits” (II)

  37. Classifiers • Kernel SVM • Histogram intersection • Chi-square distance • Multiple kernels • Fuse different types of features • Fuse different distance metrics
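Kernel fusion can be as simple as a convex combination of Gram matrices, one per feature type or distance metric; a sketch (full multiple-kernel learning would optimize the weights rather than fix them):

```python
import numpy as np

def fused_kernel(kernels, weights):
    """Combine several precomputed Gram matrices into one kernel by a
    convex weighted sum (weights are normalized to sum to 1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))
```

A convex combination of valid kernels is itself a valid kernel, so the fused matrix can be fed straight into a kernel SVM.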

  38. The end… • Thank you!
