
Accurate and Efficient Gesture Spotting via Pruning and Subgesture Reasoning

Jonathan Alon, Vassilis Athitsos, and Stan Sclaroff, Computer Science Department, Boston University.


Presentation Transcript


  1. Accurate and Efficient Gesture Spotting via Pruning and Subgesture Reasoning. Jonathan Alon, Vassilis Athitsos, and Stan Sclaroff, Computer Science Department, Boston University.

  2. Gesture Recognition Applications • Human Computer Interaction • Sign Language Analysis • Video Annotation • UAV Guidance • Command spotting to control: • computer applications [Lee&Kim 99, Zhu et al 02] • TV and video games [Freeman et al 96, 99] • robots [Triesch 97]

  3. Classification of Gesture Recognition Problems • Isolated (easier) vs. continuous (harder: requires both spotting and recognition).

  4. Gesture Spotting Problem • Given a vocabulary of gestures, locate the start and end frame of a gesture within a long video stream (and recognize the gesture). Example frames: 334, 403, 733, 836; segments labeled "2" gesture, non-gesture, "5" gesture.

  5. Overview • Objective: propose an efficient and accurate gesture spotting and recognition system that enables more natural human-computer interaction. • Approach: • a pruning method that views pruning as a classification (learning) problem • a subgesture reasoning process that models the fact that a gesture may resemble part of a longer gesture. • Experiments: an order-of-magnitude speedup and an 18% improvement in accuracy.

  6. Gesture Spotting Framework • Indirect approach: spotting is intertwined with recognition. • Pipeline: video stream → hand detection + feature extraction → feature vector → temporal matching (gesture models + pruning classifiers) → matching costs → spotting (spotting rules + subgesture table) → gesture id, start and end frames. • Temporal matching: Continuous Dynamic Programming (CDP) [Oka 98]. • Spotting: [Morguet&Lang 98, Lee&Kim 99].

  7. Hand Detection and Feature Extraction • Hand detection: based on color and motion. • Feature: (x, y) hand centroid. [figures: input frame, skin likelihood, frame differencing, "hand likelihood", detected hand]

  8. Temporal Matching: Continuous Dynamic Programming (CDP) [figure: dynamic programming grid, model time i vs. input time j]

  9. Temporal Matching: Continuous Dynamic Programming (CDP) [animation frame of the same grid]

  10. Temporal Matching: Continuous Dynamic Programming (CDP) • Local cost: d(i,j) = L2(Mi, Qj), the Euclidean distance between model feature Mi and input feature Qj.

  11. Temporal Matching: Continuous Dynamic Programming (CDP) • Cumulative cost: D(i,j) = d(i,j) + min{D(i-1,j), D(i,j-1), D(i-1,j-1)}.
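The recurrence on slide 11 can be sketched in a few lines of code. This is a minimal NumPy sketch, not the authors' implementation; it assumes 2-D (x, y) centroid features and lets a warping path start at any input frame, as CDP requires:

```python
import numpy as np

def cdp_match(model, query):
    """Continuous Dynamic Programming sketch: fill the cumulative-cost
    table D for matching a model sequence against an input sequence."""
    m, n = len(model), len(query)
    INF = float("inf")
    D = np.full((m, n), INF)
    for j in range(n):
        for i in range(m):
            # local cost d(i,j) = L2 distance between features
            d = np.linalg.norm(np.asarray(model[i]) - np.asarray(query[j]))
            if i == 0:
                # a gesture may start at any input frame, so row 0 has
                # no cumulative prefix cost
                D[i, j] = d
            else:
                best_prev = min(D[i - 1, j],
                                D[i, j - 1] if j > 0 else INF,
                                D[i - 1, j - 1] if j > 0 else INF)
                D[i, j] = d + best_prev
    return D  # D[m-1, j] is the best complete-path cost ending at frame j
```

A perfectly matching suffix drives the last-row cost to zero at the frame where the gesture ends, which is exactly the signal the spotting stage thresholds.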

  12. Temporal Matching: Continuous Dynamic Programming (CDP) [figure: a complete warping path W ending at cumulative cost D(m,j)]

  13. Temporal Matching: Continuous Dynamic Programming (CDP) • Comparing candidate end frames: if D(m,j2) < D(m,j1), then j2 is the better candidate end point.

  14. Spotting: Detection of candidate gesture end point • A candidate end point is detected at input frame j when the matching cost Dg(mg,j) drops below the detection threshold.
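The detection rule on slide 14 amounts to thresholding the complete-path cost at each frame. A one-function sketch (function name hypothetical, not from the paper):

```python
def candidate_end_points(end_costs, threshold):
    """Return the input frames j whose complete-path matching cost
    D_g(m_g, j) falls below the detection threshold."""
    return [j for j, cost in enumerate(end_costs) if cost < threshold]
```

For example, with per-frame end costs [5.0, 1.2, 0.8, 3.0] and threshold 2.0, frames 1 and 2 become candidate end points; subsequent spotting rules then decide which candidate to accept.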

  15. Why Pruning? • Search time for the best matching model increases linearly with the number of gesture models; this can be too expensive for: • systems with large gesture vocabularies • real-time applications. • Efficient search methods [Gao et al 00]: fast match, N-best search, A*, … • Beam search: • maintains promising hypotheses whose matching costs lie within a "beam width" of the current best hypothesis's cost • requires ad hoc setting of the "beam width".

  16. Pruning: Novel Viewpoint • Pruning is a classification problem, so we can use any classifier, e.g., • based on cumulative cost. • based on observation cost. • based on transition cost. • Classifiers can be learned from training data, instead of manually specifying “beam width”. • Pruning is decoupled from recognition.

  17. Pruning: Motivating Example • If input feature j is too far from model feature i (d(i,j) > τi) then all paths going through cell (i,j) should be pruned. • For example, the start point of digit “5” is far from the start point of digit “2” both in terms of position and direction.

  18. How to Prune? Input (digit "6") vs. Model (digit "6") • Classifier learning objective: • maximize pruning (white cell area), subject to • minimizing the expectation of pruning the optimal path (red). • Legend: • white: pruned cells • black: visited cells • red: optimal path.

  19. Learning to prune: example classifier • Match every positive gesture example Mp with model M. • For every model feature Mi record all features Mpj that match it (using DTW). • Let τi be the maximum distance d(Mi, Mpj) over all example features matched to Mi. • The pruning classifier for model feature Mi is: Ci(Qj) = +1 (keep) if d(Mi, Qj) ≤ τi, and -1 (prune) otherwise.
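The threshold-learning step on slide 19 can be sketched as follows. This is a hedged sketch, not the authors' code: `dtw_align` (returning matched (model index, example index) pairs) and `dist` are assumed helper functions supplied by the caller.

```python
import numpy as np

def learn_pruning_thresholds(model, examples, dtw_align, dist):
    """For each model feature M_i, set tau_i to the largest local
    distance seen between M_i and any training-example feature that
    DTW matched to it."""
    taus = np.zeros(len(model))
    for example in examples:
        for i, j in dtw_align(model, example):
            taus[i] = max(taus[i], dist(model[i], example[j]))
    return taus

def pruning_classifier(taus, model, dist):
    """C_i(Q_j) = +1 (keep) if d(i,j) <= tau_i, else -1 (prune)."""
    def C(i, q):
        return 1 if dist(model[i], q) <= taus[i] else -1
    return C
```

Because each τi is the maximum over the training data, no optimal path of a training example is ever pruned, which is the learning objective stated on slide 18.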

  20. CDP with Pruning (CDPP) • Sparse vector representation: only the unpruned (black) cells of each column are stored in memory. [figure: columns Qj-1 and Qj over model features M1…Mm]

  21. CDP with Pruning (CDPP) • Evaluate the pruning classifier C1(Qj) for the first model feature.

  22. CDP with Pruning (CDPP) • C1(Qj) = +1: keep the cell.

  23. CDP with Pruning (CDPP) • Evaluate C2(Qj).

  24. CDP with Pruning (CDPP) • C2(Qj) = +1: keep the cell.

  25. CDP with Pruning (CDPP) • Evaluate C3(Qj).

  26. CDP with Pruning (CDPP) • C3(Qj) = -1: prune the cell and skip to the next cell that has an unpruned neighbor.
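The column update walked through on slides 20-26 can be sketched with a dict as the sparse column: only unpruned cells are stored, and a cell above row 0 survives only if it passes its classifier and has at least one unpruned neighbor. A hedged sketch under those assumptions (names hypothetical):

```python
def cdpp_column(model, q, prev_col, dist, taus):
    """One CDPP column update for input feature q. prev_col maps
    model index i -> D(i, j-1) for the unpruned cells of the previous
    column; the returned dict plays the same role for column j."""
    INF = float("inf")
    col = {}
    for i in range(len(model)):
        d = dist(model[i], q)
        if d > taus[i]:
            continue  # C_i(Q_j) = -1: prune cell (i, j)
        if i == 0:
            col[i] = d  # a path may start at row 0 of any column
        else:
            best_prev = min(prev_col.get(i, INF),      # D(i, j-1)
                            prev_col.get(i - 1, INF),  # D(i-1, j-1)
                            col.get(i - 1, INF))       # D(i-1, j)
            if best_prev < INF:  # cell needs an unpruned neighbor
                col[i] = d + best_prev
    return col
```

Iterating this over the input stream gives the same cumulative costs as full CDP on the surviving cells, while never visiting pruned ones.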

  27. Spotting • INPUT: • matching costs in current frame j • current candidate gesture list (matching cost, duration) • optional: frame index of last detected gesture, response time. • OUTPUT (via the spotting rules): • detected gesture and gesture end point, OR • new candidate gesture list.

  28. Nested Gestures • Which gesture to recognize? 1 or 9? 5 or 8? 7 or 3?

  29. Nested Gestures • Which gesture to recognize: 5 or 8? [figure: matching cost vs. input time j]

  30. Nested Gestures • Which gesture to recognize: 5 or 8? • Solution: a subgesture table. • If a gesture is firing and at least one of its supergestures is also firing, wait; otherwise, recognize it. • In particular, if a firing gesture has no supergestures, recognize it immediately.
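The subgesture-table rule above can be sketched as a simple competition among the currently firing models. A minimal sketch (names hypothetical): `supergestures` maps each gesture to the set of longer gestures that contain it, i.e. the subgesture table.

```python
def resolve_firing(firing, supergestures):
    """Apply the subgesture rule: recognize a firing gesture only if
    none of its supergestures is also firing; otherwise keep waiting."""
    recognized, waiting = [], []
    for g in firing:
        if supergestures.get(g, set()) & set(firing):
            waiting.append(g)    # a supergesture is firing too: wait
        else:
            recognized.append(g)
    return recognized, waiting
```

With the digit vocabulary, if "5" is a subgesture of "8" and both fire, "5" waits while "8" is recognized, which is exactly the ambiguity slide 29 illustrates.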

  31. Spotting Algorithm (1) Update the candidate gesture list: • Find all firing models. • Conduct subgesture competitions among the firing models. • Find the best firing model. • For every candidate, perform overlapping and subgesture tests with respect to the best firing model. • Remove a candidate if it fails any test. • Add the best firing model if it passes all tests.

  32. Spotting Algorithm (2) Spot the candidate gesture if any of the following holds: • all of its active supergesture models started after the candidate's end frame j*. • all current active paths started after the candidate's end frame j*. • a specified number of frames has elapsed since the candidate's end frame j*.
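The three firing conditions on slide 32 translate directly into a small decision function. This is a sketch under assumed inputs (all names hypothetical): the start frames of active supergesture paths, the start frames of all active paths, and a response-time bound in frames.

```python
def should_spot(candidate_end, super_path_starts, active_path_starts,
                current_frame, max_wait):
    """Spot the candidate gesture if any of the slide-32 conditions
    holds for its end frame j* = candidate_end."""
    j_star = candidate_end
    if all(s > j_star for s in super_path_starts):
        return True  # every active supergesture path began after j*
    if all(s > j_star for s in active_path_starts):
        return True  # every active path began after j*
    if current_frame - j_star >= max_wait:
        return True  # response-time bound reached
    return False
```

Note that `all()` over an empty list is true, so a candidate with no active supergesture paths is spotted immediately, matching the "no supergestures" case on slide 30.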

  33. Experiments • Models: 2 users × 10 digits × 3 examples per digit. • Test: 2 users × 3 long sequences × 10 digits. • Sequence length: input 1000-1500 frames; digit 30-90 frames. [example sequence video]

  34. Results • Accuracy: CDP = Continuous Dynamic Programming; CDPP = CDP with Pruning; CDPPS = CDP with Pruning and Subgesture Reasoning. • Speedup: CDPP is 10 times faster than CDP.

  35. Conclusions • Pruning is a classification problem. • CDPP is an order of magnitude faster than CDP, and 7% more accurate. • Reasoning about nested gestures improves recognition accuracy: CDPPS improves accuracy by an additional 12%. • Both pruning and subgesture reasoning can be applied to other dynamic models (e.g., HMMs).

  36. Thank you

  37. Ongoing Work • Learn • Pruning classifiers using cross-validation. • Subgesture table. • Gesture verifiers. • Compare pruning method to Beam Search. • Handle multiple candidate hand hypotheses. • Apply methods to automatic sign language transcription.

  38. Towards Automatic Annotation of American Sign Language • Additional challenges: • Users are not cooperative: fast gesture speeds; variation between users. • Significant variation in hand shape and appearance. • Different types of gestures: finger spelling, one- vs. two-handed.

  39. Gesture Spotting: Related Work • Direct approach [Kang et al. 04, Kahol et al. 04] • Spotting precedes recognition. • Compute low-level motion parameters, such as velocity, acceleration, and trajectory curvature. • Look for abrupt changes (zero-crossings) in those parameters to find candidate gesture boundaries. • Indirect approach [Morguet&Lang 98, Lee&Kim 99] • Spotting is intertwined with recognition. • Compute input-to-model matching costs. • Look for low costs to detect candidate gesture end points. (The gesture start point can be found by backtracking the optimal dynamic programming path.)

  40. Approach: Continuous Dynamic Programming (CDP) [Oka 98] • d(i,j): distance between model feature Mi and input feature Qj. • D(i,j): cumulative distance between model M(1:i) and input subsequence Q(j':j). • The warping path is continuous and monotonic. [figure: "0", "2", and "9" models matched against the input]

  41. Approach: Continuous Dynamic Programming (CDP) • d(i,j): distance between model feature Mi and input feature Qj. • D(i,j): cumulative distance between model M(1:i) and input subsequence Q(j':j). • The optimal warping path (continuous and monotonic) determines the gesture start and end. [figure: "0", "2", "9" models, accept threshold, model time i vs. input time j]

  42. Conclusions • CDPP is an order of magnitude faster than CDP, and 7% more accurate. • CDPPS improves accuracy by an additional 12%. • Both pruning and subgesture reasoning can be applied to Hidden Markov Models (HMMs). • Future Work: • Learn: • subgesture table • gesture transition classifiers and subsequence classifiers • gesture verifiers. • Apply methods to spot signs in American Sign Language (ASL) sequences (e.g., utterances, stories, and dialogs).

  43. Gesture Types (Channels) Body Gesture Head Gesture Hand Gesture

  44. Gesture Spotting: Related Work • Indirect approach [Morguet&Lang 98, Lee&Kim 99] • Spotting is intertwined with recognition. 0. Detect hands and extract features. • Compute input to models matching costs. • Look for low cost to detect candidate gesture end point.

  45. Pruning: Motivation • Detection and Tracking • Where (in the image is the gesture performed)? • Spotting • When (does the gesture start and end)? • Recognition • What (gesture)? • Search complexity can be high ! • | Where | * | When | * | What |

  46. Gesture End Point Detection and Gesture Recognition • The algorithm is invoked for every input frame j, and consists of two steps: • Update the current list of candidate gesture models. • Apply a set of spotting rules to decide whether or not a gesture was spotted, and if yes decide which gesture model.

  47. End Point Detection Definitions • Paths • Complete Path W(M1:m, Qj’:j): a legal warping path matching the input subsequence Qj’:j with the complete model M1:m. • Partial Path W(M1:i, Qj’:j): a legal warping path matching the input subsequence Qj’:j with part of the model M1:m. • Active Path: a partial path that has not been pruned.

  48. End Point Detection Definitions • Models • Active Model g: a model that has a complete path ending at the current input frame j. • Firing Model g: an active model with a cost below the detection acceptance threshold. • Subgesture Relationship: a gesture g1 is a subgesture of gesture g2 if it is properly contained in g2. In this case, g2 is a supergesture of g1.

  49. Spotting Rules (1) • Zhu et al. 02 (spotting rules) • Based on Baudel&Beaudouin-Lafon's interaction model. • A moving hand appears in the sequence. • The moving hand is the dominant moving object. • The movement of the hand follows a three-stage process: preparation, stroke, and retraction [Kendon]. • The duration of the stroke T is bounded, T1≤T≤T2, for a given sampling rate.

  50. Spotting Rules (3) • Lee&Kim 99 (End-point detection):
