
Cross-View Action Recognition via View Knowledge Transfer








1. Cross-View Action Recognition via View Knowledge Transfer
Jingen Liu¹, Mubarak Shah², Benjamin Kuipers¹, Silvio Savarese¹
¹ Department of EECS, University of Michigan, Ann Arbor, MI, USA
² Department of EECS, University of Central Florida, Orlando, FL, USA
• IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011

2. Cross-View Action Recognition
• View 1: labeled examples are available to train an action classifier F1
• View 2: NO training examples are available, e.g., none for "checking watch"
• Question: how can the knowledge from view 1 be used to recognize unseen actions in view 2?
[Figure: both views use a low-level feature representation; the classifier trained on view-1 "checking watch" examples must recognize "checking watch" in view 2.]

3. Cross-View Action Recognition
• Can classifier F1 be used directly to recognize actions in view 2?
• No! Performance decreases dramatically
• Motion appearance looks very different across views
[Figure: the same pipeline as the previous slide; the view-1 classifier is applied unchanged to view-2 videos.]

4. Analogy to Text Analysis
• Cross-lingual text categorization/retrieval [Bel et al. 2004; Pirkola 1998]
• Translate documents into a common language
• E.g., an interlingua, as used in machine translation [Hutchins et al. 1992]
• Underlying assumption: a word-by-word translation (e.g., a bilingual dictionary) is available
[Figure: documents in Chinese and in French are translated into common languages or an interlingua.]

5. Our Proposal
• An "action view interlingua"
• Treat each viewpoint as a language and construct a visual vocabulary for it
• Model an action by a Bag-of-Visual-Words (BoVW) histogram
• Translate the two BoVW models into an "action view interlingua"
[Figure: videos of view 1 and view 2 are quantized with vocabularies V1 and V2 into visual-word histograms, which are mapped into a shared action view interlingua.]

6. Previous Work
• Geometry-based approaches
  • Geometric measurements of body joints
  • C. Rao et al. IJCV 2002; V. Parameswaran et al. IJCV 2006; etc.
  • Require stable body-joint detection and tracking
• 3D-reconstruction-based approaches
  • D. Weinland et al. ICCV 2007; P. Yan et al. CVPR 2008; F. Lv et al. ICCV 2007; D. Gavrila et al. CVPR 1996; R. Li et al. ICCV 2007; etc.
  • Require strict alignment between views
  • Reconstruction is computationally expensive
• Temporal self-similarity matrix [Junejo et al. ECCV 2008]
  • No knowledge transfer
  • Poor performance on the top view

7. Previous Work
• Transfer-based approaches
  • Farhadi et al. ECCV 2008
    • Requires frame-level feature-to-feature correspondences
    • The mapping is provided by a trained predictor
    • The mapping works in one direction only
  • Farhadi et al. ICCV 2009
    • Abstracts discriminative aspects
    • Trains a hash mapping
    • No explicit model transfer

8. Our Contributions
• Advantages of our approach
  • More flexible: no geometric constraints, no body-joint detection and tracking, no 3D reconstruction
  • No strict temporal alignment required
  • Bidirectional mapping rather than a one-directional one
  • No supervision needed for bilingual-word discovery
  • Transferred multi-view knowledge is fused with the Locally Weighted Ensemble method
[Figure: features of the first and second views exchange information in both directions.]

9. Our Framework
• Phase I: discovery of bilingual words
  • Given N pairs of unlabeled videos captured from the two views
  • Learn two view-dependent visual vocabularies V1 and V2
  • Discover bilingual words by bipartite graph partitioning
[Figure: training videos of the two views are quantized with vocabularies V1 and V2 into BoVW models MS and MT (the two halves of the training data matrix M); a bipartite graph between the two vocabularies is partitioned into bilingual words.]


11. Our Framework
• Phase II: cross-view recognition of novel actions
  • Source view: training videos → Bag-of-Visual-Words → Bag-of-Bilingual-Words → learn the action model and train the classifier
  • Target view: testing videos → Bag-of-Visual-Words → Bag-of-Bilingual-Words → test the classifier
  • The discovered bilingual words bridge the two views (a code sketch follows below)
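
The slides leave the Phase II classifier and data structures unspecified, so here is a minimal Python sketch of one plausible reading: collapse each BoVW histogram into a bag-of-bilingual-words feature using the word-to-bilingual-word assignments from Phase I, train on the source view, and test on the target view. The names M_src, M_tgt, row_labels, col_labels, K, and the linear SVM are illustrative assumptions, not details fixed by the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def to_bilingual(H, word_labels, n_bilingual):
    """Collapse a BoVW matrix H (n_words x n_videos) into
    bag-of-bilingual-words features: the histogram bins of all visual
    words assigned to the same bilingual word are summed together."""
    B = np.zeros((n_bilingual, H.shape[1]))
    for w, b in enumerate(word_labels):
        B[b] += H[w]
    return B.T  # one feature row per video

# Hypothetical usage, given the outputs of Phase I (row_labels and
# col_labels map view-1 and view-2 words to K bilingual words):
# clf = LinearSVC().fit(to_bilingual(M_src, row_labels, K), y_src)
# y_pred = clf.predict(to_bilingual(M_tgt, col_labels, K))
```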

12. Low-level Action Representation
• Acquiring the training matrix M (rows: visual words; columns: video examples) — a sketch follows below
  • Feature detector: extract 3D (spatio-temporal) cuboids from the videos of both views
  • Feature clustering: group the cuboid descriptors into visual words, forming the visual vocabulary
  • Bag-of-Visual-Words (BoVW) model: represent each video as a histogram over the visual words
[Figure: cuboids extracted from view-1 and view-2 videos; descriptors cluster into visual words A, B, …; each video becomes a visual-word histogram.]
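
As a concrete illustration of this step, the sketch below assembles M from precomputed cuboid descriptors. The k-means clustering and the vocabulary size of 1000 are assumptions for illustration; the slides do not fix the clustering method or vocabulary size.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_training_matrix(descriptors_per_video, vocab_size=1000, seed=0):
    """descriptors_per_video: list of (n_i, d) arrays of 3D-cuboid
    descriptors, one array per training video.
    Returns M (vocab_size x n_videos), where column j is the
    L1-normalized visual-word histogram of video j, plus the vocabulary."""
    all_desc = np.vstack(descriptors_per_video)
    vocab = KMeans(n_clusters=vocab_size, n_init=10,
                   random_state=seed).fit(all_desc)
    M = np.zeros((vocab_size, len(descriptors_per_video)))
    for j, desc in enumerate(descriptors_per_video):
        counts = np.bincount(vocab.predict(desc), minlength=vocab_size)
        M[:, j] = counts / max(counts.sum(), 1)
    return M, vocab
```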

13. Bipartite Graph Modeling
• Build a bipartite graph between the two views: X = visual words of the source view, Y = visual words of the target view
• Edge-weight matrix W = [[0, S], [Sᵀ, 0]], where S is a similarity matrix between the two vocabularies
• Generating S: in the column space of M, each entry S(i, j) can be estimated from how words i and j co-occur over the shared video examples (a sketch follows below)
[Figure: the visual words of the two views form the two sides of the bipartite graph, connected by edges weighted by W.]
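
The slides do not spell out the similarity measure, so this sketch uses cosine similarity between word occurrence profiles (the rows of the two views' histogram matrices over the N shared video pairs) as one plausible choice, and then assembles the bipartite edge-weight matrix W from S.

```python
import numpy as np

def bipartite_weights(M1, M2, eps=1e-12):
    """M1 (n1 x N) and M2 (n2 x N): BoVW histograms of the two views
    over the same N video pairs. S[i, j] compares the occurrence
    profile of view-1 word i with that of view-2 word j (cosine
    similarity is an assumed choice); W is the standard bipartite
    adjacency matrix [[0, S], [S.T, 0]]."""
    X = M1 / (np.linalg.norm(M1, axis=1, keepdims=True) + eps)
    Y = M2 / (np.linalg.norm(M2, axis=1, keepdims=True) + eps)
    S = X @ Y.T
    n1, n2 = S.shape
    W = np.block([[np.zeros((n1, n1)), S],
                  [S.T, np.zeros((n2, n2))]])
    return S, W
```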

14. Bipartite Graph Bi-Partitioning
• Bipartite graph partitioning follows:
  • [1] H. Zha, X. He, C. Ding, H. Simon & M. Gu, CIKM 2001
  • [2] I. S. Dhillon, SIGKDD 2001
• Example: before partitioning, the graph connects view-1 words {1, …, 5} with view-2 words {a, …, e}; after partitioning, the two clusters (1, 2, 3; a, b) and (4, 5; c, d, e) yield two bilingual words (a sketch follows below)
[Figure: the bipartite graph before (A) and after (B) partitioning.]
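
Dhillon's SIGKDD 2001 algorithm cited above is implemented in scikit-learn as SpectralCoclustering, so the partitioning step can be sketched directly. The number of bilingual words is a free parameter here, not a value taken from the slides.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

def discover_bilingual_words(S, n_bilingual=60, seed=0):
    """Co-cluster the n1 x n2 similarity matrix S [Dhillon, SIGKDD 2001].
    Each bicluster groups view-1 words with the view-2 words they
    co-occur with across the shared videos: one bilingual word each."""
    model = SpectralCoclustering(n_clusters=n_bilingual, random_state=seed)
    model.fit(S + 1e-12)  # keep every row/column sum strictly positive
    return model.row_labels_, model.column_labels_
```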

15. IXMAS Data Set
• IXMAS videos: 11 actions performed by 10 actors, captured from 5 views (cameras C0–C4)
• Example actions: check-watch, scratch-head, wave-hand, kick, pick-up, sit-down

16. Data Partition
• The IXMAS action classes are split into two disjoint sets, Y and Z, and the views are split into a source view and a target view
[Figure: the class split (check-watch, scratch-head, wave-hand, pick-up, kick, sit-down, …) across the source and target views.]

17. Data Partition
• Bilingual words are learned from the Z classes of the two views; the classifier is trained on the source-view Z classes and tested on the target-view Z classes
[Figure: the corresponding source/target data split.]

18. Data Partition
• Bilingual words are learned from both the Z and Y classes, while training and testing still use only the Z classes
[Figure: the corresponding source/target data split.]

19. Results on View Knowledge Transfer
• "W/O" and "W/" denote the results without and with view knowledge transfer via the bag of bilingual words, respectively
• Average accuracy: "W/O" = 10.9%, "W/" = 67.4%
[Table: accuracy for each (training view, testing view) pair.]


24. Performance Comparison
• Low-level features: spatio-temporal (ST) cuboids + shape-flow features [D. Tran et al. ECCV 2008]
• Columns "A": A. Farhadi et al. ECCV 2008
• Columns "B": I. N. Junejo et al. ECCV 2008
• Columns "C": A. Farhadi et al. ICCV 2009
[Table: per view pair, accuracy of our method against methods A, B, and C.]


27. Transferred Knowledge Fusion
• One target view vs. n−1 source views
• Each source view has its own action classifier
• How do we fuse their outputs into a final decision?
• Locally Weighted Ensemble strategy [Gao et al. SIGKDD 2008] (a sketch follows below)
[Figure: the decision boundaries of the source-1 and source-2 classifiers over regions R of the target space are combined by the fusion step.]
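
Below is a much-simplified sketch of the Locally Weighted Ensemble idea: each source-view classifier receives a per-test-example weight that reflects how well its predicted labels agree with the local, unsupervised structure of the target data. The neighbourhood-agreement weighting is an illustrative approximation of the graph-based weights in Gao et al., not their exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def locally_weighted_ensemble(probas, X_target, n_clusters=11, k=10):
    """probas: list of (n, C) class-probability matrices, one per
    source-view classifier, evaluated on the n target videos X_target.
    A classifier is trusted more at sample i when its labels split i's
    neighbourhood the same way an unsupervised clustering of the
    target data does."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(X_target)
    _, idx = NearestNeighbors(n_neighbors=k).fit(X_target).kneighbors(X_target)
    weights = np.zeros((len(probas), X_target.shape[0]))
    for m, P in enumerate(probas):
        pred = P.argmax(axis=1)
        for i, neigh in enumerate(idx):
            same_pred = pred[neigh] == pred[i]
            same_clus = clusters[neigh] == clusters[i]
            # local agreement between the model's decision regions and
            # the target-side clustering structure
            weights[m, i] = (same_pred == same_clus).mean()
    weights /= weights.sum(axis=0, keepdims=True) + 1e-12
    fused = sum(w[:, None] * P for w, P in zip(weights, probas))
    return fused.argmax(axis=1)
```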

28. Knowledge Fusion Results
• Each column denotes a testing (target) view; the remaining four views serve as source views
[Table: fusion accuracy per target view.]


30. Detailed Recognition Rate
[Table: detailed per-action recognition rates.]

31. Summary
• Create an "action view interlingua" for cross-view action recognition
• Bilingual words serve as a bridge for view knowledge transfer
• Knowledge transferred from multiple source views is fused with the Locally Weighted Ensemble method
• Our approach achieves state-of-the-art performance

32. Thank You!
• Acknowledgements: UMich Intelligent Robotics Lab, UMich Computer Vision Lab, UCF Computer Vision Lab, NSF
