
Vision-Based Retrieval of Dynamic Hand Gestures


Presentation Transcript


  1. Vision-Based Retrieval of Dynamic Hand Gestures Thesis Proposal by Jonathan Alon Thesis Committee: Stan Sclaroff, Margrit Betke, George Kollios, and Trevor Darrell

  2. Example Application

  3. Isolated Gesture Recognition • A query gesture Q. • A database of gesture examples Mg and their class labels Cg, 1 ≤ g ≤ N. • Problem: predict the class label CQ both accurately and efficiently. [Figure: query Q with unknown label CQ = ?, and database examples M1 (C1 = ‘CAR’), M2 (C2 = ‘BUY’), M3 (C3 = ‘CAR’), M4 (C4 = ‘BUY’).]

  4. Research Goals • Problem: predict the class label CQ accurately and efficiently. • Accurately: design a distance measure D such that similarity in input space under D => similarity in class space. For example, a small D(Q, M3) => CQ = C3 = ‘CAR’, while a large D(Q, M4) => CQ ≠ C4 = ‘BUY’. • Efficiently: do better than brute force, which computes D(Q, Mg) for all g, 1 ≤ g ≤ N.

  5. Example Hand Gesture Data American Sign Language “Video Gestures”

  6. Related Work (ASL Recognition) • Hand segmentation: • Previous: higher-level recognition models assume perfect segmentation, and methods are either • too simple [Starner&Pentland 95, Vogler&Metaxas 99, Yang&Ahuja 02] or • too complicated [Cui&Weng 95, Ong&Bowden 04]. • Proposed: a more sophisticated distance measure will enable simple hand segmentation, and • more general backgrounds, textured clothes, and hand occlusions. • Vocabulary size: • Previous (vision-based): tens. • Proposed: hundreds. • Data: • Previous: usually the researcher is the signer [Starner&Pentland 95, Cui&Weng 95]. • Proposed: native signers. Fast gesture speeds. More realistic gesture variations.

  7. Proposed Methods (1) • Accurately: propose a Dynamic Space-Time Warping (DSTW) algorithm that can accommodate multiple hypotheses about the hand location in every frame of the query gesture sequence. • DSTW will enable a simple and efficient multiple candidate hand detection algorithm.

  8. Proposed Methods (2) 2. Efficiently: use a filtering method, which consists of two steps: • Filter step: compute D’(Q,Mg) for all g, 1 ≤ g ≤ N, based on a fast but approximate distance D’. Retain the P most promising gesture examples. • Refine step: compute D(Q,Mh) for all h, 1 ≤ h ≤ P, based on the slow but exact distance D. Predict CQ based on the class labels of the Nearest Neighbors (NN).
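The two-step scheme above can be sketched as follows. This is a minimal sketch, not the thesis implementation: `approx_dist` and `exact_dist` are hypothetical stand-ins for D’ and D, supplied by the caller.

```python
import numpy as np

def filter_and_refine(query, examples, labels, approx_dist, exact_dist, P=10):
    """Filter-and-refine nearest-neighbor classification (sketch).

    Filter: rank all N examples by the fast approximate distance D'
    and keep the P most promising ones.
    Refine: re-rank those P by the slow exact distance D and predict
    the class label of the nearest neighbor.
    """
    # Filter step: N cheap distance computations.
    approx = np.array([approx_dist(query, m) for m in examples])
    candidates = np.argsort(approx)[:P]
    # Refine step: only P expensive distance computations.
    exact = [(exact_dist(query, examples[g]), g) for g in candidates]
    _, best = min(exact)
    return labels[best]
```

The overall prediction is correct whenever the true nearest neighbor under D survives the filter step, which is why D’ only needs to be a good ranking proxy for D.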

  9. Outline • Introduction • Motivation • Research Goals • Related Work • Proposed Methods • System Overview • Multiple Candidate Hand Detection • Feature Extraction and Processing • Dynamic Space-Time Warping (DSTW) • Approximate Matching via Prototypes • Feasibility Study • Thesis Roadmap • Conclusion

  10. Isolated Gesture Recognition: System Diagram [Diagram: the query gesture sequence goes through multiple candidate hand detection, yielding multiple candidate hand subimages; feature extraction and processing produces the query features Q. A video database of isolated gestures yields the database features Mg. The filter step (approximate matching using D’) produces candidate matches; the refine step (exact matching using D) produces the best matches, which feed retrieval results and browsing.]

  11. Contributions [Same system diagram as slide 10.]

  12. System Diagram [Same system diagram as slide 10.]

  13. Multiple Candidate Hand Detection (1) • Key observation: the gesturing hand cannot be reliably and unambiguously detected, regardless of the visual features used for detection. • However, the gesturing hand is consistently among the top K candidates identified by, e.g., skin detection (K = 15 in this example). [Figure: input frame and candidate hand regions.]
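A top-K candidate detector of the kind described might be sketched as below. This is a sketch under assumptions not stated in the slides: the skin classifier is taken as given (a precomputed binary mask), and candidates are ranked by connected-component size using `scipy.ndimage`.

```python
import numpy as np
from scipy import ndimage

def top_k_candidates(skin_mask, K=15):
    """Return centroids of the K largest skin-colored regions.

    skin_mask: 2-D boolean array marking skin-colored pixels.
    The gesturing hand is assumed to be among the K candidates,
    even though it is rarely the single best-scoring one.
    """
    labeled, n = ndimage.label(skin_mask)
    if n == 0:
        return []
    # Pixel count of each connected component, labels 1..n.
    sizes = ndimage.sum(skin_mask, labeled, range(1, n + 1))
    order = np.argsort(sizes)[::-1][:K]  # largest regions first
    return ndimage.center_of_mass(skin_mask, labeled, (order + 1).tolist())
```

Keeping K candidates per frame instead of one defers the final detection decision to the matching stage, which is exactly what DSTW exploits.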

  14. Multiple Candidate Hand Detection (2) [Figure: input sequence.]

  15. Isolated Gesture Recognition: System Diagram [Same system diagram as slide 10.]

  16. Feature Extraction (1) [Figure: input gesture sequence mapped to a multi-dimensional time series.]

  17. Feature Extraction (2) • Feature requirements: • Low resolution hand image => coarse shape features. • Hand localization is not accurate => use histograms. • Features: • Position: hand centroid. • Velocity: optical flow. • Motion: optical flow direction histograms [Ardizzone and LaCascia 97] • Texture: edge orientation histograms [Roth&Freeman 95] • Shape: parameters of ellipse fit to hand [Starner 95] • Color: used for detection; not useful for recognition.
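As an illustration of one of the descriptors listed above, an edge orientation histogram in the spirit of [Roth&Freeman 95] might be computed like this. The bin count and the gradient operator are choices of this sketch, not specified in the slides.

```python
import numpy as np

def edge_orientation_histogram(gray, bins=8):
    """Histogram of gradient orientations, weighted by edge strength.

    Coarse by design: histograms tolerate inaccurate hand
    localization and low-resolution hand images.
    """
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % np.pi  # orientations in [0, pi)
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, np.pi),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Normalizing the histogram makes the descriptor insensitive to the size of the hand region, matching the requirement that hand localization need not be accurate.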

  18. System Diagram [Same system diagram as slide 10.]

  19. Dynamic Time Warping (DTW) Recognition • Given a query sequence Q and a database sequence M, DTW computes the optimal alignment (or warping path) W and matching cost D. • However, DTW assumes that a single feature vector (e.g., the 2D position of the hand) can be reliably extracted from each query frame. [Figure: alignment between query frames and model frames, with local distance DG(Mi, Qj).]

  20. DTW Math (1): Distance between feature vectors • Mi, Qj are F-dimensional vectors. • The distance measure between two feature vectors can be the Euclidean distance: • DG can be more general. For example, (weighted) Lp norm.

  21. DTW Math (2): Distance between (sub)sequences • Initialization • Iteration • Termination
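The equations on this slide were images in the original deck. A standard reconstruction consistent with the surrounding slides (model M of length m, query Q of length n, F-dimensional feature vectors, local distance DG) is:

```latex
D_G(M_i, Q_j) = \lVert M_i - Q_j \rVert_2 = \sqrt{\textstyle\sum_{f=1}^{F} (M_{i,f} - Q_{j,f})^2}

\begin{aligned}
&\text{Initialization:} && D(1,1) = D_G(M_1, Q_1)\\
&\text{Iteration:} && D(i,j) = D_G(M_i, Q_j) + \min\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\}\\
&\text{Termination:} && D(M,Q) = D(m,n)
\end{aligned}
```

This is the textbook DTW recurrence; the slide's original formulas may have differed in boundary conventions.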

  22. Dynamic Space-Time Warping (DSTW) Recognition • DSTW can accommodate multiple candidate feature vectors at every time step. • DSTW simultaneously localizes the gesturing hand in every frame of the query sequence and recognizes the gesture. [Figure: DSTW alignment between M and Q, with K candidate feature vectors per query frame.]

  23. DSTW Math • Initialization • Iteration • Termination
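A minimal DSTW dynamic program consistent with the description on slide 22 might look as follows. This is a sketch: the Euclidean local cost, the unconstrained transitions in the candidate index k, and the absence of a warping band are assumptions of this sketch, and the thesis' exact formulation may differ.

```python
import numpy as np

def dstw(model, query_candidates):
    """Dynamic Space-Time Warping (sketch).

    model: (m, F) array, one feature vector per model frame.
    query_candidates: (n, K, F) array, K candidate feature vectors
    per query frame.  Returns the minimal matching cost over all
    warping paths and all per-frame candidate choices.
    """
    m, _ = model.shape
    n, K, _ = query_candidates.shape
    D = np.full((m + 1, n + 1, K), np.inf)
    D[0, 0, :] = 0.0  # paths must start by matching (1, 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Local cost of matching model frame i to each candidate k.
            diff = query_candidates[j - 1] - model[i - 1]   # (K, F)
            cost = np.linalg.norm(diff, axis=1)             # (K,)
            # Best predecessor cell, minimized over candidate index k'.
            best_prev = min(D[i - 1, j].min(),
                            D[i, j - 1].min(),
                            D[i - 1, j - 1].min())
            D[i, j] = cost + best_prev
    return D[m, n].min()
```

With K = 1 this reduces exactly to ordinary DTW; the K-fold cost increase per cell matches the O(K·F·L²) complexity given on slide 28.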

  24. Translation-Invariance (1) 2.1. The user may gesture in any part of the image. Solution: • Run K separate DSTW processes Pk in parallel; each Pk subtracts the position of the kth candidate in the first frame from all candidates in subsequent frames. • Select the Pk with the best matching score.

  25. Translation-Invariance (2) 2.2. False matches occur frequently when only the position feature is used. For example, notice how spurious detections on the face in the query sequence falsely match model digit 1. Solution: include velocity in the feature vector. [Figure: query digit 1 and model digit 1 at frames 1, 24, and 36.]

  26. Translation-Invariance (3) 2.1. The user may gesture in any part of the image. Solution: • Use centroid of face detector’s bounding box.

  27. Scale-Invariance • Use an image pyramid. • Compare size of face bounding box. (Face detector internally uses image pyramid).

  28. Complexity • F – number of features • L – average sequence length • K – number of hand candidates ------------------------------------------------------------------ DTW: O(F·L²) DSTW: O(K·F·L²) DSTW with translation invariance: O(K²·F·L²)

  29. System Diagram [Same system diagram as slide 10.]

  30. Approximate Distance D’: Motivation • Lipschitz embeddings and BoostMap are embedding methods that represent each object by its vector of distances to a set of d prototypes. • Distances between objects can then be computed efficiently in the embedded space (requiring only O(d) operations). • The same idea can be applied to time series; however, • the distance representation loses all information about the alignment.
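The prototype-distance representation described above can be sketched directly. This is the generic Lipschitz-style embedding, not the alignment-preserving variant proposed on the following slides.

```python
import numpy as np

def embed(x, prototypes, dist):
    """Represent x by its vector of distances to the d prototypes."""
    return np.array([dist(x, r) for r in prototypes])

def approx_dist(ex, ey):
    """Compare two objects in the embedded space: O(d) work."""
    return np.abs(ex - ey).max()
```

The L-infinity comparison is a deliberate choice: when `dist` is a metric, the triangle inequality gives |dist(x,r) − dist(y,r)| ≤ dist(x,y) for every prototype r, so the approximate distance never exceeds the exact one.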

  31. Approximate Distance D’: Alignment via Prototypes [Figure: alignment between model M and prototype R1.]

  32. Approximate Distance D’: Alignment via Prototypes [Figure: alignments of model M and query Q with prototype R.]

  33. Approximate Distance D’: Alignment via Prototypes [Figure: alignments of M and Q with prototype R, and the resulting approximate alignment between M and Q.]

  34. Justifying the Approximation • Why does it work? Two properties: • If the query and prototype are identical, then the approximate distance and the exact distance are identical. • If the query and database object are identical, then the approximate distance is 0, and the database object will be retrieved as Nearest Neighbor. • More information…


  36. Prototype Selection • Approach: Sequential Forward Search (SFS): • Select the first prototype R1 that minimizes classification error. • For i = 2 to d: select the next prototype Ri that, together with the prototypes selected so far {R1,…,Ri-1}, gives the lowest classification error.

  37. Prototype Selection • Approach: Sequential Forward Search (SFS): • Select the first prototype R1 that minimizes classification error. • For i = 2 to d: select the next prototype Ri that, together with the prototypes selected so far {R1,…,Ri-1}, gives the lowest classification error. • Can do Sequential Backward Search (SBS) by removing the worst prototype at every step. • Can give weights to individual prototypes or individual features.
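Sequential Forward Search as described can be sketched generically. Here `eval_error` is a hypothetical caller-supplied function returning the validation classification error for a set of prototypes; the slides do not specify how that error is measured.

```python
def sequential_forward_search(candidates, eval_error, d):
    """Greedy prototype selection (sketch).

    candidates: candidate prototype identifiers.
    eval_error: function(prototype_list) -> classification error.
    d: number of prototypes to select.
    """
    selected = []
    remaining = list(candidates)
    for _ in range(d):
        # Pick the candidate that, added to the current set,
        # gives the lowest classification error.
        best = min(remaining, key=lambda r: eval_error(selected + [r]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Greedy forward selection is not guaranteed to find the globally optimal prototype set, which is why the slide also mentions backward search and per-prototype weighting as refinements.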

  38. Filter and Refine • Offline: 0. Select the prototypes Ri. • Embed all database gestures: E(Mg). • Online: • Embed the query: E(Q). • Filter: compute the approximate distance D’(Q,Mg) between the query and all database gestures in the embedded space. • Retain the P nearest neighbors as candidate matches. • Refine: re-rank the P candidates based on the exact distance D.

  39. Complexity • F = 3: number of features • L = 50: average sequence length • N = 10,000: number of database sequences • d = 10: number of prototypes • P = 10: number of retrieved database sequences --------------------------------------------------------- Brute force = O(N·F·L²): compute N exact DDTW distances --------------------------------------------------------- Filter step = O(d·F·L² + N·d·F·L): compute d exact DTW alignments W + N approximate D’DTW distances Refine step = O(P·F·L²): compute P exact DDTW distances --------------------------------------------------------- Speedup condition: N > d + N·d/L + P
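Plugging the values listed on this slide into these expressions makes the savings concrete:

```latex
\begin{aligned}
\text{Brute force:}\quad & N \cdot F \cdot L^2 = 10{,}000 \cdot 3 \cdot 50^2 = 7.5 \times 10^{7}\\
\text{Filter:}\quad & d \cdot F \cdot L^2 + N \cdot d \cdot F \cdot L = 7.5 \times 10^{4} + 1.5 \times 10^{7}\\
\text{Refine:}\quad & P \cdot F \cdot L^2 = 7.5 \times 10^{4}
\end{aligned}
```

So filter-and-refine costs roughly 1.5 × 10⁷ operations against 7.5 × 10⁷ for brute force, about a fivefold reduction, dominated by the N·d·F·L term of the filter step.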

  40. Reducing Complexity Filter step = O(d·F·L² + N·d·F·L). The second term is expensive; this is a well-known NN shortcoming. Proposed solutions: • Feature selection: reduce the d·F·L factor by using fewer features. • Condensing: reduce the number of objects, N.

  41. Feasibility Study • Exact distance DDSTW • Application: recognition of “video digits”. • Compare DTW vs. DSTW accuracy. • Verify that translation-invariance works. • What is the right K? Use cross-validation. • Approximate distance D’DTW • Application: recognition of UNIPEN digits. • Measure accuracy vs. time tradeoff of approximate DTW vs. BoostMap and CSDTW. • Recognition of NIST digits, using approximate shape context distance.

  42. Video Digit Recognition Experiment • 3 users, 10 digits, 3 examples per digit. • DSTW without translation invariance • Features: Position and velocity (x,y,u,v) • Performance measure: classification accuracy (%) • 11.1%-21.1% increase in classification accuracy.

  43. UNIPEN Digit Recognition Experiment • 15,953 digit samples. • Features: position and angle (x, y, theta). • Performance measure: classification error (%) vs. number of exact distance computations. • Comparing the query against the entire database gives 1.90% error using 10,630 DDTW computations. • CSDTW gives 2.90% using 150 DDTW computations. • At a test error of 2.80%, the method is about twice as fast as BoostMap and about ten times as fast as CSDTW.

  44. Conclusions: DSTW • Pros: • Hand detection is not merely a bottom-up procedure. • Recognition can be achieved even in the presence of multiple “distractors”, and of overlaps between the gesturing hand and the face or the other hand. • Recognition is translation-invariant. • For real-time performance, hand detection can afford to use more efficient features with higher false positive rates, relying on DSTW’s capability to handle multiple candidates to reject many false detections. • DSTW provides a general method for matching time series that can accommodate multiple candidate feature vectors at each time step. • Cons: • Space and time complexity increase by a factor of K for translation-dependent recognition, and by a factor of K² for translation-invariant recognition.

  45. Conclusions: Approximate Alignment via Prototypes • Pros: • Approximate alignment via prototypes is fast. • It provides a general method for efficiently approximating distance measures that are based on expensive alignment methods (e.g., the shape context distance). • The number of points in the two objects does not have to be equal. • The more expensive the exact alignment method, the greater the benefit from approximation. • Cons: • The filter step cannot guarantee the absence of false dismissals. • Every point in one object has to be matched with at least one point from the other object, • which excludes approximating the Longest Common Subsequence (LCS) similarity measure.

  46. Gesture Spotting

  47. Isolated Gesture Recognition vs. Gesture Spotting Whole Matching vs. Subsequence Matching [Figure: whole matching of a query Q against database gestures M1–M4, vs. subsequence matching of Q within a long sequence M.]

  48. Gesture Spotting: Research Agenda • Indirect temporal segmentation (segmentation by recognition): implement a brute-force search using a sliding window. • Now we do not know the hand locations in the database sequence M. • Extend DSTW to include a 4th spatial axis. Alternatively, • assume a cooperative user who marks hand locations in the query. • Direct temporal segmentation: are there hand motion features that can predict gesture boundaries? • How to combine gesture boundary estimates from the direct and indirect approaches?

  49. Thesis Roadmap • Data Collection and annotation: • Isolated gesture recognition. • Gesture spotting. • Algorithms: • Hand features. • Approximate DSTW, or alternative indexing method(s). • Temporal segmentation. • Implement demos.

  50. Thank You!
