
Vision-Based Retrieval of Dynamic Hand Gestures


Presentation Transcript


  1. Vision-Based Retrieval of Dynamic Hand Gestures Thesis Proposal by Jonathan Alon Thesis Committee: Stan Sclaroff, Margrit Betke, George Kollios, and Trevor Darrell

  2. Example Application

  3. Isolated Gesture Recognition • A query gesture Q. • A database of gesture examples Mg and their class labels Cg, 1 ≤ g ≤ N. • Problem: predict the class label CQ both accurately and efficiently. [Figure: query Q with unknown label CQ = ?, and database examples M1 (C1 = ‘CAR’), M2 (C2 = ‘BUY’), M3 (C3 = ‘CAR’), M4 (C4 = ‘BUY’).]

  4. Research Goals • Problem: predict the class label CQ accurately and efficiently. • Accurately: design a distance measure D such that similarity in input space under D => similarity in class space. For example, a small D(Q, M3) => CQ = C3 = ‘CAR’, while a large D(Q, M4) => CQ ≠ C4 = ‘BUY’. • Efficiently: do better than brute force, which computes D(Q, Mg) for all g, 1 ≤ g ≤ N.

  5. Example Hand Gesture Data American Sign Language “Video Gestures”

  6. Related Work (ASL Recognition) • Hand segmentation: • Previous: higher-level recognition models assume perfect segmentation, and methods are either • too simple [Starner&Pentland 95, Vogler&Metaxas 99, Yang&Ahuja 02] or • too complicated [Cui&Weng 95, Ong&Bowden 04]. • Proposed: a more sophisticated distance measure will enable simple hand segmentation, and • more general backgrounds, textured clothes, and hand occlusions. • Vocabulary size: • Previous (vision-based): tens. • Proposed: hundreds. • Data: • Previous: usually the researcher is the signer [Starner&Pentland 95, Cui&Weng 95]. • Proposed: native signers. Fast gesture speeds. More realistic gesture variations.

  7. Proposed Methods (1) • Accurately: propose a Dynamic Space-Time Warping (DSTW) algorithm that can accommodate multiple hypotheses about the hand location in every frame of the query gesture sequence. • DSTW will enable a simple and efficient multiple candidate hand detection algorithm.

  8. Proposed Methods (2) 2. Efficiently: use a filtering method, which consists of two steps: • Filter step: compute D’(Q,Mg) for all g, 1 ≤ g ≤ N, based on a fast but approximate distance D’. Retain the P most promising gesture examples. • Refine step: compute D(Q,Mh) for all h, 1 ≤ h ≤ P, based on the slow but exact distance D. Predict CQ based on the class labels of the Nearest Neighbors (NN).
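The two-step scheme above can be sketched as follows. This is a minimal sketch, not the thesis implementation: `approx_dist` and `exact_dist` are hypothetical stand-ins for D’ and D, supplied by the caller.

```python
import numpy as np

def filter_and_refine(query, examples, labels, approx_dist, exact_dist, P=10):
    """Filter-and-refine nearest-neighbor classification (sketch).

    Filter: rank all N examples by the fast approximate distance D'
    and keep the P most promising ones.
    Refine: re-rank those P by the slow exact distance D and predict
    the class label of the nearest neighbor.
    """
    # Filter step: N cheap distance computations.
    approx = np.array([approx_dist(query, m) for m in examples])
    candidates = np.argsort(approx)[:P]
    # Refine step: only P expensive distance computations.
    exact = [(exact_dist(query, examples[g]), g) for g in candidates]
    _, best = min(exact)
    return labels[best]
```

The overall prediction is correct whenever the true nearest neighbor under D survives the filter step, which is why D’ only needs to be a good ranking proxy for D.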

  9. Outline • Introduction • Motivation • Research Goals • Related Work • Proposed Methods • System Overview • Multiple Candidate Hand Detection • Feature Extraction and Processing • Dynamic Space-Time Warping (DSTW) • Approximate Matching via Prototypes • Feasibility Study • Thesis Roadmap • Conclusion

  10. Isolated Gesture Recognition: System Diagram [Diagram: the query gesture sequence goes through multiple candidate hand detection, yielding multiple candidate hand subimages; feature extraction and processing produces the query features Q. A video database of isolated gestures yields the database features Mg. The filter step (approximate matching using D’) produces candidate matches; the refine step (exact matching using D) produces the best matches, which feed retrieval results and browsing.]

  11. Contributions [Same system diagram as slide 10.]

  12. System Diagram [Same system diagram as slide 10.]

  13. Multiple Candidate Hand Detection (1) • Key observation: the gesturing hand cannot be reliably and unambiguously detected, regardless of the visual features used for detection. • However, the gesturing hand is consistently among the top K candidates identified by, e.g., skin detection (K = 15 in this example). [Figure: input frame and candidate hand regions.]
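A top-K candidate detector of the kind described might be sketched as below. This is a sketch under assumptions not stated in the slides: the skin classifier is taken as given (a precomputed binary mask), and candidates are ranked by connected-component size using `scipy.ndimage`.

```python
import numpy as np
from scipy import ndimage

def top_k_candidates(skin_mask, K=15):
    """Return centroids of the K largest skin-colored regions.

    skin_mask: 2-D boolean array marking skin-colored pixels.
    The gesturing hand is assumed to be among the K candidates,
    even though it is rarely the single best-scoring one.
    """
    labeled, n = ndimage.label(skin_mask)
    if n == 0:
        return []
    # Pixel count of each connected component, labels 1..n.
    sizes = ndimage.sum(skin_mask, labeled, range(1, n + 1))
    order = np.argsort(sizes)[::-1][:K]  # largest regions first
    return ndimage.center_of_mass(skin_mask, labeled, (order + 1).tolist())
```

Keeping K candidates per frame instead of one defers the final detection decision to the matching stage, which is exactly what DSTW exploits.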

  14. Multiple Candidate Hand Detection (2) [Figure: input sequence.]

  15. Isolated Gesture Recognition: System Diagram [Same system diagram as slide 10.]

  16. Feature Extraction (1) [Figure: input gesture sequence mapped to a multi-dimensional time series.]

  17. Feature Extraction (2) • Feature requirements: • Low resolution hand image => coarse shape features. • Hand localization is not accurate => use histograms. • Features: • Position: hand centroid. • Velocity: optical flow. • Motion: optical flow direction histograms [Ardizzone and LaCascia 97] • Texture: edge orientation histograms [Roth&Freeman 95] • Shape: parameters of ellipse fit to hand [Starner 95] • Color: used for detection; not useful for recognition.
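As an illustration of one of the descriptors listed above, an edge orientation histogram in the spirit of [Roth&Freeman 95] might be computed like this. The bin count and the gradient operator are choices of this sketch, not specified in the slides.

```python
import numpy as np

def edge_orientation_histogram(gray, bins=8):
    """Histogram of gradient orientations, weighted by edge strength.

    Coarse by design: histograms tolerate inaccurate hand
    localization and low-resolution hand images.
    """
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % np.pi  # orientations in [0, pi)
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, np.pi),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Normalizing the histogram makes the descriptor insensitive to the size of the hand region, matching the requirement that hand localization need not be accurate.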

  18. System Diagram [Same system diagram as slide 10.]

  19. Dynamic Time Warping (DTW) Recognition • Given a query sequence Q and a database sequence M, DTW computes the optimal alignment (or warping path) W and matching cost D. • However, DTW assumes that a single feature vector (e.g., the 2D position of the hand) can be reliably extracted from each query frame. [Figure: alignment between query frames and model frames, with local distance DG(Mi, Qj).]

  20. DTW Math (1): Distance between feature vectors • Mi, Qj are F-dimensional vectors. • The distance measure between two feature vectors can be the Euclidean distance: • DG can be more general. For example, (weighted) Lp norm.

  21. DTW Math (2): Distance between (sub)sequences • Initialization • Iteration • Termination
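The equations on this slide were images in the original deck. A standard reconstruction consistent with the surrounding slides (model M of length m, query Q of length n, F-dimensional feature vectors, local distance DG) is:

```latex
D_G(M_i, Q_j) = \lVert M_i - Q_j \rVert_2 = \sqrt{\textstyle\sum_{f=1}^{F} (M_{i,f} - Q_{j,f})^2}

\begin{aligned}
&\text{Initialization:} && D(1,1) = D_G(M_1, Q_1)\\
&\text{Iteration:} && D(i,j) = D_G(M_i, Q_j) + \min\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\}\\
&\text{Termination:} && D(M,Q) = D(m,n)
\end{aligned}
```

This is the textbook DTW recurrence; the slide's original formulas may have differed in boundary conventions.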

  22. Dynamic Space-Time Warping (DSTW) Recognition • DSTW can accommodate multiple candidate feature vectors at every time step. • DSTW simultaneously localizes the gesturing hand in every frame of the query sequence and recognizes the gesture. [Figure: DSTW alignment between M and Q, with K candidate feature vectors per query frame.]

  23. DSTW Math • Initialization • Iteration • Termination
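A minimal DSTW dynamic program consistent with the description on slide 22 might look as follows. This is a sketch: the Euclidean local cost, the unconstrained transitions in the candidate index k, and the absence of a warping band are assumptions of this sketch, and the thesis' exact formulation may differ.

```python
import numpy as np

def dstw(model, query_candidates):
    """Dynamic Space-Time Warping (sketch).

    model: (m, F) array, one feature vector per model frame.
    query_candidates: (n, K, F) array, K candidate feature vectors
    per query frame.  Returns the minimal matching cost over all
    warping paths and all per-frame candidate choices.
    """
    m, _ = model.shape
    n, K, _ = query_candidates.shape
    D = np.full((m + 1, n + 1, K), np.inf)
    D[0, 0, :] = 0.0  # paths must start by matching (1, 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Local cost of matching model frame i to each candidate k.
            diff = query_candidates[j - 1] - model[i - 1]   # (K, F)
            cost = np.linalg.norm(diff, axis=1)             # (K,)
            # Best predecessor cell, minimized over candidate index k'.
            best_prev = min(D[i - 1, j].min(),
                            D[i, j - 1].min(),
                            D[i - 1, j - 1].min())
            D[i, j] = cost + best_prev
    return D[m, n].min()
```

With K = 1 this reduces exactly to ordinary DTW; the K-fold cost increase per cell matches the O(K·F·L²) complexity given on slide 28.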

  24. Translation-Invariance (1) 2.1. The user may gesture in any part of the image. Solution: • Run K separate DSTW processes Pk in parallel; each Pk subtracts the position of the kth candidate in the first frame from all candidates in subsequent frames. • Select the Pk with the best matching score.

  25. Translation-Invariance (2) 2.2. False matches occur frequently when only the position feature is used. For example, notice how spurious detections on the face in the query sequence falsely match model digit 1. Solution: include velocity in the feature vector. [Figure: query digit 1 and model digit 1 at frames 1, 24, and 36.]

  26. Translation-Invariance (3) 2.1. The user may gesture in any part of the image. Solution: • Use centroid of face detector’s bounding box.

  27. Scale-Invariance • Use an image pyramid. • Compare size of face bounding box. (Face detector internally uses image pyramid).

  28. Complexity • F – number of features • L – average sequence length • K – number of hand candidates ------------------------------------------------------------------ DTW: O(F·L²) DSTW: O(K·F·L²) DSTW with translation invariance: O(K²·F·L²)

  29. System Diagram [Same system diagram as slide 10.]

  30. Approximate Distance D’: Motivation • Lipschitz embeddings and BoostMap are embedding methods that represent each object by its vector of distances to a set of d prototypes. • Distances between objects can then be computed efficiently in the embedded space (requiring only O(d) operations). • The same idea can be applied to time series; however, • the distance representation loses all information about the alignment.
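The prototype-distance representation described above can be sketched directly. This is the generic Lipschitz-style embedding, not the alignment-preserving variant proposed on the following slides.

```python
import numpy as np

def embed(x, prototypes, dist):
    """Represent x by its vector of distances to the d prototypes."""
    return np.array([dist(x, r) for r in prototypes])

def approx_dist(ex, ey):
    """Compare two objects in the embedded space: O(d) work."""
    return np.abs(ex - ey).max()
```

The L-infinity comparison is a deliberate choice: when `dist` is a metric, the triangle inequality gives |dist(x,r) − dist(y,r)| ≤ dist(x,y) for every prototype r, so the approximate distance never exceeds the exact one.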

  31. Approximate Distance D’: Alignment via Prototypes [Figure: alignment between model M and prototype R1.]

  32. Approximate Distance D’: Alignment via Prototypes [Figure: alignments of model M and query Q with prototype R.]

  33. Approximate Distance D’: Alignment via Prototypes [Figure: alignments of M and Q with prototype R, and the resulting approximate alignment between M and Q.]

  34. Justifying the Approximation • Why does it work? Two properties: • If the query and prototype are identical, then the approximate distance and the exact distance are identical. • If the query and database object are identical, then the approximate distance is 0, and the database object will be retrieved as Nearest Neighbor. • More information…


  36. Prototype Selection • Approach: Sequential Forward Search (SFS): • Select the first prototype R1 that minimizes classification error. • For i = 2 to d: select the next prototype Ri that, together with the prototypes selected so far {R1,…,Ri-1}, gives the lowest classification error.

  37. Prototype Selection • Approach: Sequential Forward Search (SFS): • Select the first prototype R1 that minimizes classification error. • For i = 2 to d: select the next prototype Ri that, together with the prototypes selected so far {R1,…,Ri-1}, gives the lowest classification error. • Can do Sequential Backward Search (SBS) by removing the worst prototype at every step. • Can give weights to individual prototypes or individual features.
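Sequential Forward Search as described can be sketched generically. Here `eval_error` is a hypothetical caller-supplied function returning the validation classification error for a set of prototypes; the slides do not specify how that error is measured.

```python
def sequential_forward_search(candidates, eval_error, d):
    """Greedy prototype selection (sketch).

    candidates: candidate prototype identifiers.
    eval_error: function(prototype_list) -> classification error.
    d: number of prototypes to select.
    """
    selected = []
    remaining = list(candidates)
    for _ in range(d):
        # Pick the candidate that, added to the current set,
        # gives the lowest classification error.
        best = min(remaining, key=lambda r: eval_error(selected + [r]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Greedy forward selection is not guaranteed to find the globally optimal prototype set, which is why the slide also mentions backward search and per-prototype weighting as refinements.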

  38. Filter and Refine • Offline: 0. Select the prototypes Ri. • Embed all database gestures: E(Mg). • Online: • Embed the query: E(Q). • Filter: compute the approximate distance D’(Q,Mg) between the query and all database gestures in the embedded space. • Retain the P nearest neighbors as candidate matches. • Refine: re-rank the P candidates based on the exact distance D.

  39. Complexity • F = 3: number of features • L = 50: average sequence length • N = 10,000: number of database sequences • d = 10: number of prototypes • P = 10: number of retrieved database sequences --------------------------------------------------------- Brute force = O(N·F·L²): compute N exact DDTW distances --------------------------------------------------------- Filter step = O(d·F·L² + N·d·F·L): compute d exact DTW alignments W + N approximate D’DTW distances Refine step = O(P·F·L²): compute P exact DDTW distances --------------------------------------------------------- Speedup condition: N > d + N·d/L + P
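Plugging the values listed on this slide into these expressions makes the savings concrete:

```latex
\begin{aligned}
\text{Brute force:}\quad & N \cdot F \cdot L^2 = 10{,}000 \cdot 3 \cdot 50^2 = 7.5 \times 10^{7}\\
\text{Filter:}\quad & d \cdot F \cdot L^2 + N \cdot d \cdot F \cdot L = 7.5 \times 10^{4} + 1.5 \times 10^{7}\\
\text{Refine:}\quad & P \cdot F \cdot L^2 = 7.5 \times 10^{4}
\end{aligned}
```

So filter-and-refine costs roughly 1.5 × 10⁷ operations against 7.5 × 10⁷ for brute force, about a fivefold reduction, dominated by the N·d·F·L term of the filter step.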

  40. Reducing Complexity Filter step = O(d·F·L² + N·d·F·L). The second term is expensive; this is a well-known NN shortcoming. Proposed solutions: • Feature selection: reduce the d·F·L factor by using fewer features. • Condensing: reduce the number of objects, N.

  41. Feasibility Study • Exact distance DDSTW • Application: recognition of “video digits”. • Compare DTW vs. DSTW accuracy. • Verify that translation-invariance works. • What is the right K? Use cross-validation. • Approximate distance D’DTW • Application: recognition of UNIPEN digits. • Measure accuracy vs. time tradeoff of approximate DTW vs. BoostMap and CSDTW. • Recognition of NIST digits, using approximate shape context distance.

  42. Video Digit Recognition Experiment • 3 users, 10 digits, 3 examples per digit. • DSTW without translation invariance • Features: Position and velocity (x,y,u,v) • Performance measure: classification accuracy (%) • 11.1%-21.1% increase in classification accuracy.

  43. UNIPEN Digit Recognition Experiment • 15,953 digit samples. • Features: position and angle (x, y, theta). • Performance measure: classification error (%) vs. number of exact distance computations. • Comparing the query against the entire database gives 1.90% error using 10,630 DDTW computations. • CSDTW gives 2.90% using 150 DDTW computations. • At a test error of 2.80%, the method is about twice as fast as BoostMap and about ten times as fast as CSDTW.

  44. Conclusions: DSTW • Pros: • Hand detection is not merely a bottom-up procedure. • Recognition can be achieved even in the presence of multiple “distractors”, and of overlaps between the gesturing hand and the face or the other hand. • Recognition is translation-invariant. • For real-time performance, hand detection can afford to use more efficient features with higher false positive rates, relying on DSTW’s capability to handle multiple candidates to reject many false detections. • DSTW provides a general method for matching time series that can accommodate multiple candidate feature vectors at each time step. • Cons: • Space and time complexity increase by a factor of K for translation-dependent recognition, and by a factor of K² for translation-invariant recognition.

  45. Conclusions: Approximate Alignment via Prototypes • Pros: • Approximate alignment via prototypes is fast. • It provides a general method for efficiently approximating distance measures that are based on expensive alignment methods (e.g., the shape context distance). • The number of points in the two objects does not have to be equal. • The more expensive the exact alignment method, the greater the benefit from approximation. • Cons: • The filter step cannot guarantee the absence of false dismissals. • Every point in one object has to be matched with at least one point from the other object, • which excludes approximating the Longest Common Subsequence (LCS) similarity measure.

  46. Gesture Spotting

  47. Isolated Gesture Recognition vs. Gesture Spotting Whole Matching vs. Subsequence Matching [Figure: whole matching of a query Q against database gestures M1–M4, vs. subsequence matching of Q within a long sequence M.]

  48. Gesture Spotting: Research Agenda • Indirect temporal segmentation (segmentation by recognition): implement a brute-force search using a sliding window. • Now we do not know the hand locations in the database sequence M. • Extend DSTW to include a 4th spatial axis. Alternatively, • assume a cooperative user who marks hand locations in the query. • Direct temporal segmentation: are there hand motion features that can predict gesture boundaries? • How to combine gesture boundary estimates from the direct and indirect approaches?

  49. Thesis Roadmap • Data Collection and annotation: • Isolated gesture recognition. • Gesture spotting. • Algorithms: • Hand features. • Approximate DSTW, or alternative indexing method(s). • Temporal segmentation. • Implement demos.

  50. Thank You!
