
Tsz-Ho Yu

Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests. Tsz-Ho Yu, T-K Kim, Danhang Tang.


Presentation Transcript


  1. Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests. Tsz-Ho Yu, T-K Kim, Danhang Tang. Sponsored by

  2. Motivation • Multiple cameras with inverse kinematics [Bissacco et al. CVPR2007] [Yao et al. IJCV2012] [Sigal IJCV2011] • Specialized hardware (e.g. structured light sensor, TOF camera) [Shotton et al. CVPR’11] [Baak et al. ICCV2011] [Ye et al. CVPR2011] [Sun et al. CVPR2012] • Learning-based (regression) [Navaratnam et al. BMVC2006] [Andriluka et al. CVPR2010]

  3. Motivation • Discriminative approaches (RF) have achieved great success in human body pose estimation. • Efficient: real-time • Accurate: frame-based, does not rely on tracking • Require a large dataset to cover many poses • Train on synthetic data, test on real data • Do not exploit kinematic constraints • Examples: Shotton et al. CVPR’11, Girshick et al. ICCV’11, Sun et al. CVPR’12

  4. Challenges for Hand? • Viewpoint changes and self-occlusions • Discrepancy between synthetic and real data is larger than for the human body • Labeling is difficult and tedious!

  5. Our method • Viewpoint changes and self-occlusions → Hierarchical Hybrid Forest • Discrepancy between synthetic and real data is larger than for the human body → Transductive Learning • Labeling is difficult and tedious! → Semi-supervised Learning

  6. Existing Approaches • Generative approaches • Model-fitting • No training is required • Slow • Need initialisation and tracking • Oikonomidis et al. ICCV2011; Motion capture Ballan et al. ECCV 2012; De La Gorce et al. PAMI2010; Hamer et al. ICCV2009 • Discriminative approaches • Similar solutions to human body pose estimation • Performance on real data remains challenging • Xu and Cheng ICCV 2013; Stenger et al. IVC 2007; Keskin et al. ECCV2012; Wang et al. SIGGRAPH2009

  7. Our method • Viewpoint changes and self-occlusions → Hierarchical Hybrid Forest • Discrepancy between synthetic and real data is larger than for the human body • Labeling is difficult and tedious!

  8. Hierarchical Hybrid Forest Viewpoint Classification: Q_a • STR forest: • Q_a – Viewpoint classification quality (Information gain) Q_apv = α Q_a + (1-α) β Q_p + (1-α)(1-β) Q_v

  9. Hierarchical Hybrid Forest Viewpoint Classification: Q_a Finger Joint Classification: Q_p • STR forest: • Q_a – Viewpoint classification quality (Information gain) • Q_p – Joint label classification quality (Information gain) Q_apv = α Q_a + (1-α) β Q_p + (1-α)(1-β) Q_v

  10. Hierarchical Hybrid Forest Viewpoint Classification: Q_a Finger Joint Classification: Q_p Pose Regression: Q_v • STR forest: • Q_a – Viewpoint classification quality (Information gain) • Q_p – Joint label classification quality (Information gain) • Q_v – Compactness of voting vectors (Determinant of covariance trace) Q_apv = α Q_a + (1-α) β Q_p + (1-α)(1-β) Q_v

  11. Hierarchical Hybrid Forest Viewpoint Classification: Q_a Finger Joint Classification: Q_p Pose Regression: Q_v • STR forest: • Q_a – Viewpoint classification quality (Information gain) • Q_p – Joint label classification quality (Information gain) • Q_v – Compactness of voting vectors (Determinant of covariance trace) • (α, β) – Margin measures of viewpoint labels and joint labels Q_apv = α Q_a + (1-α) β Q_p + (1-α)(1-β) Q_v
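
The combined quality function on slides 8 to 11 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`entropy`, `info_gain`, `cov_trace`, `compactness`, `combined_quality`) are hypothetical, the compactness score is assumed to be an inverse of the size-weighted covariance trace of the vote vectors, and α and β are simply passed in rather than derived from margin measures.

```python
import math
from collections import Counter
from statistics import pvariance


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(parent, left, right):
    """Information gain of splitting `parent` labels into `left` and `right`."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


def cov_trace(vectors):
    """Trace of the covariance matrix, i.e. the sum of per-dimension variances."""
    if len(vectors) < 2:
        return 0.0
    return sum(pvariance(dim) for dim in zip(*vectors))


def compactness(left_votes, right_votes):
    """Assumed form of Q_v: 1 / (1 + size-weighted covariance trace of the children)."""
    n = len(left_votes) + len(right_votes)
    spread = ((len(left_votes) / n) * cov_trace(left_votes)
              + (len(right_votes) / n) * cov_trace(right_votes))
    return 1.0 / (1.0 + spread)


def combined_quality(q_a, q_p, q_v, alpha, beta):
    """Q_apv = alpha*Q_a + (1-alpha)*beta*Q_p + (1-alpha)*(1-beta)*Q_v."""
    return alpha * q_a + (1 - alpha) * beta * q_p + (1 - alpha) * (1 - beta) * q_v
```

With α = 1 the split is scored purely on viewpoint classification; with α = 0 and β = 1 purely on joint classification; with α = 0 and β = 0 purely on regression compactness. For any α and β in [0, 1] the three weights sum to 1.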

  12. Our method • Viewpoint changes and self-occlusions • Discrepancy between synthetic and real data is larger than for the human body → Transductive Learning • Labeling is difficult and tedious! → Semi-supervised Learning

  13. Transductive learning Source space (Synthetic data S) Target space (Realistic data R) • Training data D = {R_l, R_u, S}, where R_l is labeled and R_u is unlabeled • Synthetic data S: • Generated from an articulated hand model. All labeled. • Realistic data R: • Captured from a Primesense depth sensor • A small part of R, R_l, is labeled manually (the rest is the unlabeled set R_u)

  14. Transductive learning Source space (Synthetic data S) Target space (Realistic data R) • Training data D = {R_l, R_u, S}: • Realistic data R: • Captured from Kinect • A small part of R, R_l, is labeled manually (the rest is the unlabeled set R_u) • Synthetic data S: • Generated from an articulated hand model, where |S| >> |R|

  15. Transductive learning Source space (Synthetic data S) Target space (Realistic data R) • Training data D = {R_l, R_u, S}: • Similar data points in R_l and S are paired (a split function that separates a pair incurs a penalty)
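
The pairing idea on this slide can be illustrated with a small Python sketch. The slide does not give the exact form of the penalty, so the score below is an assumption: it is the fraction of (real, synthetic) pairs that a candidate split keeps on the same side, which drops as more pairs are separated. The names `pair_preservation` and `goes_left` are hypothetical.

```python
def pair_preservation(pairs, goes_left):
    """Fraction of (real, synthetic) pairs kept together by a candidate split.

    `pairs` is a list of (real_sample, synthetic_sample) tuples and
    `goes_left` is the split function, returning True for the left child.
    A pair counts as preserved when both members go to the same child.
    """
    if not pairs:
        return 1.0  # nothing to separate, so no penalty
    kept = sum(1 for real, synth in pairs if goes_left(real) == goes_left(synth))
    return kept / len(pairs)


# Toy example: samples are scalar feature responses, the split is a threshold.
pairs = [(0.20, 0.25), (0.40, 0.70), (0.90, 0.80)]
score = pair_preservation(pairs, lambda x: x < 0.5)
print(score)  # the second pair straddles the threshold, the other two do not
```

A split that keeps every pair together scores 1.0, so a term of this kind favours split functions under which paired real and synthetic samples follow the same path through the tree.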

  16. Semi-supervised learning Source space (Synthetic data S) Target space (Realistic data R) • Training data D = {R_l, R_u, S}: • Similar data points in R_l and S are paired (a split function that separates a pair incurs a penalty) • Introduce a semi-supervised term to make use of unlabeled real data when evaluating a split function

  17. Kinematic refinement

  18. Experiment settings • Training data: • Synthetic data (337.5K images) • Real data (81K images, <1.2K labeled) • Evaluation data: • Three different testing sequences • Sequence A: single viewpoint (450 frames) • Sequence B: multiple viewpoints, with slow hand movements (1000 frames) • Sequence C: multiple viewpoints, with fast hand movements (240 frames)
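
To put these dataset sizes in proportion, the short Python snippet below works through the arithmetic. It uses only the figures on the slide; the variable names are illustrative, and since the slide gives the labeled count only as an upper bound (<1.2K), the labeled fraction is an upper bound too.

```python
synthetic_images = 337_500   # 337.5K synthetic training images
real_images = 81_000         # 81K real training images
labeled_real_max = 1_200     # fewer than 1.2K real images are labeled
test_frames = {"A": 450, "B": 1000, "C": 240}

# Upper bound on the share of real training images that carry labels.
labeled_fraction_bound = labeled_real_max / real_images
# How many synthetic images there are per real image.
synthetic_to_real = synthetic_images / real_images
# Total number of evaluation frames across the three sequences.
total_test_frames = sum(test_frames.values())

print(f"labeled share of real data: under {labeled_fraction_bound:.2%}")
print(f"synthetic images per real image: {synthetic_to_real:.2f}")
print(f"evaluation frames in total: {total_test_frames}")
```

Under 1.5% of the real images are labeled, which is the setting the semi-supervised term is meant to address.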

  19. Self-comparison experiment • Self-comparison (Sequence A): • This graph shows the joint classification accuracy on Sequence A. • Realistic and synthetic baselines produced similar accuracies. • Using the transductive term is better than simply augmenting real data with synthetic data. • All terms together achieve the best results.

  20. Multi-view experiments • Multi-view experiment (Sequence C):

  21. Conclusion • A 3D hand pose estimation algorithm • STR forest: Semi-supervised and transductive regression forest • A data-driven refinement scheme to rectify the shortcomings of the STR forest • Real-time (25Hz on an Intel i7 PC without CPU/GPU optimisation) • Works better than the state of the art • Makes use of unlabelled data, requires less manual annotation • More accurate in real scenarios

  22. Video demo

  23. Thank you! http://www.iis.ee.ic.ac.uk/icvl
