
2D Human Pose Estimation in TV Shows






Presentation Transcript


  1. 2D Human Pose Estimation in TV Shows. Vittorio Ferrari, Manuel Marin, Andrew Zisserman. Dagstuhl Seminar, July 2008

  2. Goal: detect people and estimate 2D pose in images/video
  • Pose = spatial configuration of body parts
  • Estimation = localize body parts in the image
  • Desired scope: TV shows and feature films; focus on upper-body
  • Body parts = fixed aspect-ratio rectangles for head, torso, upper and lower arms
  Ramanan and Forsyth, Mori et al., Sigal and Black, Felzenszwalb and Huttenlocher, Bergtholdt et al.

  3. The search space of pictorial structures is large
  • Body parts = fixed aspect-ratio rectangles for head, torso, upper and lower arms = 4 parameters each
  • Search space: 4P dims (a scale factor per part), or 3P+1 dims (a single scale factor); P = 6 for upper-body
  • 720x405 image = ~10^45 configurations!
  • Kinematic constraints reduce the space to valid body configurations (~10^28); exploiting the model's independencies gives efficiency (~10^12)
  Ramanan and Forsyth, Mori et al., Sigal and Black, Felzenszwalb and Huttenlocher, Bergtholdt et al.
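The order of magnitude above can be checked with a back-of-the-envelope count. The discretization below (24 orientations, 10 scale levels per part) is an illustrative assumption, not the slide's own, so the exponent comes out slightly different from the quoted 10^45:

```python
import math

# Rough count of the pictorial-structure search space for an upper body.
W, H = 720, 405          # image size from the slide
ORIENTATIONS = 24        # assumed angular discretization
SCALES = 10              # assumed per-part scale levels
P = 6                    # parts: head, torso, two upper and two lower arms

states_per_part = W * H * ORIENTATIONS * SCALES   # 4 parameters per part
total = states_per_part ** P                      # parts treated independently

print(f"~10^{math.log10(total):.0f} configurations")  # ~10^47 here
```

A coarser discretization brings this down to the slide's 10^45; either way, exhaustive enumeration is hopeless, which is what motivates the kinematic constraints and search-space reduction that follow.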

  4. The challenge of unconstrained imagery varying illumination and low contrast; moving camera and background; multiple people; scale changes; extensive clutter; any clothing

  5. The challenge of unconstrained imagery: extremely difficult when nothing is known about appearance, pose, or location

  6. Approach overview (Ramanan NIPS 2006 + new additions)
  • Reduce search space: human detection, foreground highlighting
  • Estimate pose in every frame independently: image parsing
  • Estimate pose over multiple frames jointly: learn appearance models over multiple frames, spatio-temporal parsing

  7. Single frame

  8. Search space reduction by human detection
  Idea: get approximate location and scale with a detector generic over pose and appearance
  • Building an upper-body detector
  • based on Dalal and Triggs CVPR 2005
  • train set = 96 frames x 12 perturbations (no Buffy)
  Benefits for pose estimation:
  + fixes scale of body parts
  + sets bounds on x,y locations
  + detects also back views
  + fast
  - little info about pose (arms)
  (figures: train and test examples; detected and enlarged windows)
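The benefits above amount to turning one detection window into hard constraints on the pose search. A minimal sketch of that conversion, where the enlargement factor and the reference width are illustrative assumptions rather than the paper's exact values:

```python
def bounds_from_detection(x, y, w, h, enlarge=1.5):
    """Turn an upper-body detection (x, y, w, h) into pose-search constraints.

    The detector fixes the person's scale and bounds the x,y locations of
    all body parts; the arms are only loosely constrained, as the slide notes.
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    W, H = w * enlarge, h * enlarge          # enlarged search window
    scale = w / 100.0                        # assumed 100 px reference width

    return {
        "x_range": (cx - W / 2.0, cx + W / 2.0),
        "y_range": (cy - H / 2.0, cy + H / 2.0),
        "part_scale": scale,
    }

b = bounds_from_detection(200, 100, 120, 120)
print(b["x_range"], b["part_scale"])   # (170.0, 350.0) 1.2
```

Fixing the scale removes one search dimension per part outright, which is where most of the 10x-100x speedup claimed later comes from.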

  9. Code available! Search space reduction by human detection: upper-body detection and temporal association. www.robots.ox.ac.uk/~vgg/software/UpperBody/index.html

  10. Search space reduction by foreground highlighting
  Idea: exploit knowledge about the structure of the search area to initialize Grabcut (Rother et al. 2004)
  • Initialization
  • learn fg/bg models from regions where the person is likely present/absent
  • clamp central strip to fg
  • don't clamp bg (arms can be anywhere)
  Benefits for pose estimation:
  + further reduces clutter
  + conservative (no loss 95.5% of the time)
  + needs no knowledge of the background
  + allows for a moving background
  (figure: initialization and output)
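The initialization above can be written as a trimap over the detection window. This sketch uses OpenCV's label convention (0/1 = definite bg/fg, 2/3 = probable bg/fg, as consumed by `cv2.grabCut`); the region fractions are illustrative assumptions:

```python
import numpy as np

def init_trimap(img_h, img_w, det):
    """Build a GrabCut initialization from an upper-body detection.

    det = (x, y, w, h). Nothing is clamped to definite background,
    since the arms can be anywhere; only a central head/torso strip
    is clamped to definite foreground.
    """
    x, y, w, h = det
    trimap = np.full((img_h, img_w), 2, np.uint8)   # probable background
    trimap[y:y + h, x:x + w] = 3                    # person likely present
    sx = x + w // 3                                 # central vertical strip
    trimap[y:y + h, sx:sx + w // 3] = 1             # clamp to foreground
    return trimap

t = init_trimap(405, 720, (300, 100, 120, 160))
```

The trimap would then be passed to `cv2.grabCut(img, t, None, bgModel, fgModel, 5, cv2.GC_INIT_WITH_MASK)`; everything labelled foreground in the result becomes the reduced search area.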


  12. Pose estimation by image parsing (Ramanan 06)
  Goal: estimate the posterior of the part configuration, P(L|I) ∝ Π Ψ(li, lj) · Π Φ(li)
  Ψ = spatial prior (kinematic constraints); Φ = image evidence (given edge/appearance models)
  • Algorithm
  • 1. inference with Φ = edges
  • 2. learn appearance models of body parts and background
  • 3. inference with Φ = edges + appearance
  Advantages of space reduction:
  + much more robust
  + much faster (10x-100x)
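Because the kinematic structure is a tree, the posterior marginals needed for parsing come from exact sum-product message passing. A toy stand-in on a three-part chain (torso, upper arm, lower arm) with made-up discrete states and random potentials, just to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 5                                   # discrete states per part (toy size)
phi = rng.random((3, S)) + 0.1          # image evidence Phi per part/state
psi = rng.random((S, S)) + 0.1          # spatial prior Psi between linked parts

# Upward messages along the chain: lower arm -> upper arm -> torso.
m_lower_to_upper = psi @ phi[2]
m_upper_to_torso = psi @ (phi[1] * m_lower_to_upper)

# Posterior marginal over the torso's state.
post_torso = phi[0] * m_upper_to_torso
post_torso /= post_torso.sum()
print(post_torso)
```

The real model does the same thing over 6 parts with (x, y, orientation) states per part, which is why shrinking the state space with detection and foreground highlighting matters so much.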

  13. Pose estimation by image parsing (figures: learned appearance models; edge parse vs. edge + appearance parse)

  14. Pose estimation by image parsing without foreground highlighting (figures: edge parse vs. edge + appearance parse)

  15. Failure of direct pose estimation (Ramanan NIPS 2006 unaided)

  16. Multiple frames

  17. Transferring appearance models across frames
  Idea: refine the parsing of difficult frames based on appearance models from confident ones (exploit continuity of appearance)
  Algorithm:
  1. select frames with low entropy of P(L|I)
  2. integrate their appearance models by minimum KL-divergence
  3. re-parse every frame using the integrated appearance models
  Advantages:
  + improves the parse in difficult frames
  + better than Ramanan CVPR 2005: richer and more robust models; needs no specific pose at any point in the video
  (figure: lowest-entropy frames and a higher-entropy frame)
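Steps 1 and 2 can be sketched with toy posteriors and colour histograms. Averaging stands in for the minimum-KL integration here, since the mean is the distribution q minimizing the summed KL(pi || q) over the selected frames; the posteriors and histograms below are made-up data:

```python
import numpy as np

def entropy(p):
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Toy posteriors over part locations for 4 frames.
posteriors = [np.array([.97, .01, .01, .01]),   # confident frame
              np.array([.25, .25, .25, .25]),   # maximally uncertain frame
              np.array([.90, .05, .03, .02]),
              np.array([.40, .30, .20, .10])]

# Toy per-frame colour histograms (the appearance models).
hists = np.random.default_rng(1).random((4, 8))
hists /= hists.sum(1, keepdims=True)

order = np.argsort([entropy(p) for p in posteriors])
selected = order[:2]                     # keep the 2 lowest-entropy frames
integrated = hists[selected].mean(0)     # integrated appearance model
```

The integrated model is then plugged back into step 3 of parsing (Φ = edges + appearance) for every frame, including the uncertain ones.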


  19. Transferring appearance models across frames (figure: parse of a higher-entropy frame)

  20. Transferring appearance models across frames (figure: re-parse with the integrated appearance models)

  21. The repulsive model
  Idea: extend the kinematic tree with edges preferring non-overlapping left/right arms
  Model:
  - add repulsive edges (a repulsive prior, alongside the kinematic constraints)
  - inference with Loopy Belief Propagation (the extra edges make the graph loopy)
  Advantage: + less double-counting
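A repulsive edge is just a pairwise potential that decays as the two arms' boxes overlap. A minimal sketch, with the strength beta as an illustrative assumption and boxes given as (x, y, w, h):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def repulsive(left_arm, right_arm, beta=5.0):
    """Pairwise potential penalising overlapping left/right arm hypotheses."""
    return np.exp(-beta * iou(left_arm, right_arm))

print(repulsive((0, 0, 10, 30), (0, 0, 10, 30)))   # identical boxes: ~0.007
print(repulsive((0, 0, 10, 30), (50, 0, 10, 30)))  # disjoint boxes: 1.0
```

During loopy BP this potential multiplies into the messages between the two arm chains, so configurations in which both arms explain the same image region get down-weighted rather than forbidden outright.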


  23. Recap: image parsing, learning appearance models over multiple frames, spatio-temporal parsing
  After the repulsive model: transferring appearance models exploits continuity of appearance; now exploit continuity of position over time

  24. Spatio-temporal parsing
  Idea: extend the model to a spatio-temporal block over time
  Model:
  - add temporal edges = temporal prior (continuity constraints), on top of the spatial prior (kinematic constraints) and the image evidence (given edge+app models)
  - inference with Loopy Belief Propagation

  25. Spatio-temporal parsing: the full model combines the spatial prior (kinematic constraints), the temporal prior (continuity constraints), the repulsive prior (anti double-counting), and the image evidence (given edge+app models)
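The temporal edges carry a continuity potential linking the same part's position in adjacent frames. A minimal sketch using a Gaussian on displacement, with sigma (in pixels) as an illustrative assumption:

```python
import numpy as np

def temporal_prior(pos_t, pos_t1, sigma=10.0):
    """Continuity potential between a part's positions in adjacent frames.

    Near 1 for small motion, near 0 for implausibly large jumps.
    """
    d2 = float(np.sum((np.asarray(pos_t) - np.asarray(pos_t1)) ** 2))
    return np.exp(-d2 / (2 * sigma ** 2))

print(temporal_prior((100, 50), (102, 51)))  # small motion: close to 1
print(temporal_prior((100, 50), (180, 90)))  # large jump: close to 0
```

Stacking these potentials over a block of frames is what turns the per-frame chain model into the loopy spatio-temporal graph solved with loopy BP above.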

  26. Integrated spatio-temporal parsing reduces uncertainty (figure: without vs. with spatio-temporal parsing)

  27. Full-body pose estimation in easier conditions: Weizmann action dataset (Blank et al. ICCV 05)

  28. Evaluation dataset
  • >70000 frames over 4 episodes of Buffy the Vampire Slayer (>1000 shots)
  • uncontrolled and extremely challenging: low contrast, scale changes, moving camera and background, extensive clutter, any clothing, any pose
  • figure overlays = final output, after spatio-temporal inference

  29. Example estimated poses

  30. Data available! Quantitative evaluation
  Ground-truth: 69 shots x 4 frames = 276 frames x 6 parts = 1656 body parts (sticks)
  The upper-body detector fires on 243 frames (88%, 1458 body parts)
  http://www.robots.ox.ac.uk/~vgg/data/stickmen/index.html

  31. Quantitative evaluation: pose estimation accuracy
  • 62.6% with repulsive model (with spatio-temporal parsing)
  • 59.4% with appearance transfer
  • 57.9% with foreground highlighting (best single-frame)
  • 41.2% with detection
  • 9.6% Ramanan NIPS 2006 unaided
  Conclusions:
  + both reduction techniques improve results
  + small improvement by appearance transfer … method good also on static images
  + repulsive model brings us further
  = spatio-temporal parsing has mostly aesthetic impact
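The accuracy figures above are computed against the stick annotations from slide 30. A sketch of the usual stick-correctness criterion (an estimated part counts as correct when both endpoints lie within half the ground-truth stick's length of the corresponding ground-truth endpoints); the threshold value and data are illustrative:

```python
import numpy as np

def stick_correct(est, gt, thresh=0.5):
    """True if both endpoints of the estimated stick lie within
    thresh * (ground-truth stick length) of the ground-truth endpoints."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    length = np.linalg.norm(gt[1] - gt[0])
    return bool(np.linalg.norm(est[0] - gt[0]) <= thresh * length and
                np.linalg.norm(est[1] - gt[1]) <= thresh * length)

gt = [(0, 0), (0, 40)]                        # ground-truth stick, 40 px long
print(stick_correct([(3, 2), (1, 38)], gt))   # True: both endpoints within 20 px
print(stick_correct([(30, 0), (30, 40)], gt)) # False: endpoints 30 px off
```

Accuracy is then the fraction of correct sticks over all annotated parts on the frames where the detector fired.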

  32. Video !
