
Accurate and Efficient Gesture Spotting via Pruning and Subgesture Reasoning

Jonathan Alon, Vassilis Athitsos, and Stan Sclaroff, Computer Science Department, Boston University.


Presentation Transcript


  1. Accurate and Efficient Gesture Spotting via Pruning and Subgesture Reasoning. Jonathan Alon, Vassilis Athitsos, and Stan Sclaroff, Computer Science Department, Boston University.

  2. Gesture Recognition Applications • Human Computer Interaction • Sign Language Analysis • Video Annotation • UAV Guidance • Command spotting to control: • computer applications [Lee&Kim 99, Zhu et al 02] • TV and video games [Freeman et al 96, 99] • robots [Triesch 97]

  3. Classification of Gesture Recognition Problems • Isolated (easier) vs. continuous (harder: requires both spotting and recognition).

  4. Gesture Spotting Problem • Given a vocabulary of gestures, locate the start and end frame of a gesture within a long video stream (and recognize the gesture). Example frames: 334, 403, 733, 836; segments labeled "2" gesture, non-gesture, "5" gesture.

  5. Overview • Objective: propose an efficient and accurate gesture spotting and recognition system that enables more natural human-computer interaction. • Approach: • a pruning method that views pruning as a classification (learning) problem • a subgesture reasoning process that models the fact that a gesture may resemble part of a longer gesture. • Experiments: an order-of-magnitude speedup and an 18% improvement in accuracy.

  6. Gesture Spotting Framework • Indirect approach: spotting is intertwined with recognition. • Pipeline: video stream → hand detection + feature extraction → feature vector → temporal matching (gesture models + pruning classifiers) → matching costs → spotting (spotting rules + subgesture table) → gesture id, start and end frames. • Temporal matching: Continuous Dynamic Programming (CDP) [Oka 98]. • Spotting: [Morguet&Lang 98, Lee&Kim 99].

  7. Hand Detection and Feature Extraction • Hand detection: based on color and motion. • Feature: (x, y) hand centroid. [figures: input frame, skin likelihood, frame differencing, "hand likelihood", detected hand]

  8. Temporal Matching: Continuous Dynamic Programming (CDP) [figure: dynamic programming grid, model time i vs. input time j]

  9. Temporal Matching: Continuous Dynamic Programming (CDP) [animation frame of the same grid]

  10. Temporal Matching: Continuous Dynamic Programming (CDP) • Local cost: d(i,j) = L2(Mi, Qj), the Euclidean distance between model feature Mi and input feature Qj.

  11. Temporal Matching: Continuous Dynamic Programming (CDP) • Cumulative cost: D(i,j) = d(i,j) + min{D(i-1,j), D(i,j-1), D(i-1,j-1)}.
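The recurrence on slide 11 can be sketched in a few lines of code. This is a minimal NumPy sketch, not the authors' implementation; it assumes 2-D (x, y) centroid features and lets a warping path start at any input frame, as CDP requires:

```python
import numpy as np

def cdp_match(model, query):
    """Continuous Dynamic Programming sketch: fill the cumulative-cost
    table D for matching a model sequence against an input sequence."""
    m, n = len(model), len(query)
    INF = float("inf")
    D = np.full((m, n), INF)
    for j in range(n):
        for i in range(m):
            # local cost d(i,j) = L2 distance between features
            d = np.linalg.norm(np.asarray(model[i]) - np.asarray(query[j]))
            if i == 0:
                # a gesture may start at any input frame, so row 0 has
                # no cumulative prefix cost
                D[i, j] = d
            else:
                best_prev = min(D[i - 1, j],
                                D[i, j - 1] if j > 0 else INF,
                                D[i - 1, j - 1] if j > 0 else INF)
                D[i, j] = d + best_prev
    return D  # D[m-1, j] is the best complete-path cost ending at frame j
```

A perfectly matching suffix drives the last-row cost to zero at the frame where the gesture ends, which is exactly the signal the spotting stage thresholds.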

  12. Temporal Matching: Continuous Dynamic Programming (CDP) [figure: a complete warping path W ending at cumulative cost D(m,j)]

  13. Temporal Matching: Continuous Dynamic Programming (CDP) • Comparing candidate end frames: if D(m,j2) < D(m,j1), then j2 is the better candidate end point.

  14. Spotting: Detection of candidate gesture end point • A candidate end point is detected at input frame j when the matching cost Dg(mg,j) drops below the detection threshold.
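The detection rule on slide 14 amounts to thresholding the complete-path cost at each frame. A one-function sketch (function name hypothetical, not from the paper):

```python
def candidate_end_points(end_costs, threshold):
    """Return the input frames j whose complete-path matching cost
    D_g(m_g, j) falls below the detection threshold."""
    return [j for j, cost in enumerate(end_costs) if cost < threshold]
```

For example, with per-frame end costs [5.0, 1.2, 0.8, 3.0] and threshold 2.0, frames 1 and 2 become candidate end points; subsequent spotting rules then decide which candidate to accept.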

  15. Why Pruning? • Search time for the best matching model increases linearly with the number of gesture models; this can be too expensive for: • systems with large gesture vocabularies • real-time applications. • Efficient search methods [Gao et al 00]: fast match, N-best search, A*, … • Beam search: • maintains promising hypotheses whose matching costs lie within a "beam width" of the current best hypothesis's cost • requires ad hoc setting of the "beam width".

  16. Pruning: Novel Viewpoint • Pruning is a classification problem, so we can use any classifier, e.g., • based on cumulative cost. • based on observation cost. • based on transition cost. • Classifiers can be learned from training data, instead of manually specifying “beam width”. • Pruning is decoupled from recognition.

  17. Pruning: Motivating Example • If input feature j is too far from model feature i (d(i,j) > τi) then all paths going through cell (i,j) should be pruned. • For example, the start point of digit “5” is far from the start point of digit “2” both in terms of position and direction.

  18. How to Prune? Input (digit "6") vs. Model (digit "6") • Classifier learning objective: • maximize pruning (white cell area), subject to • minimizing the expectation of pruning the optimal path (red). • Legend: • white: pruned cells • black: visited cells • red: optimal path.

  19. Learning to prune: example classifier • Match every positive gesture example Mp with model M. • For every model feature Mi record all features Mpj that match it (using DTW). • Let τi be the maximum distance d(Mi, Mpj) over all example features matched to Mi. • The pruning classifier for model feature Mi is: Ci(Qj) = +1 (keep) if d(Mi, Qj) ≤ τi, and -1 (prune) otherwise.
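The threshold-learning step on slide 19 can be sketched as follows. This is a hedged sketch, not the authors' code: `dtw_align` (returning matched (model index, example index) pairs) and `dist` are assumed helper functions supplied by the caller.

```python
import numpy as np

def learn_pruning_thresholds(model, examples, dtw_align, dist):
    """For each model feature M_i, set tau_i to the largest local
    distance seen between M_i and any training-example feature that
    DTW matched to it."""
    taus = np.zeros(len(model))
    for example in examples:
        for i, j in dtw_align(model, example):
            taus[i] = max(taus[i], dist(model[i], example[j]))
    return taus

def pruning_classifier(taus, model, dist):
    """C_i(Q_j) = +1 (keep) if d(i,j) <= tau_i, else -1 (prune)."""
    def C(i, q):
        return 1 if dist(model[i], q) <= taus[i] else -1
    return C
```

Because each τi is the maximum over the training data, no optimal path of a training example is ever pruned, which is the learning objective stated on slide 18.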

  20. CDP with Pruning (CDPP) • Sparse vector representation: only the unpruned (black) cells of each column are stored in memory. [figure: columns Qj-1 and Qj over model features M1…Mm]

  21. CDP with Pruning (CDPP) • Evaluate the pruning classifier C1(Qj) for the first model feature.

  22. CDP with Pruning (CDPP) • C1(Qj) = +1: keep the cell.

  23. CDP with Pruning (CDPP) • Evaluate C2(Qj).

  24. CDP with Pruning (CDPP) • C2(Qj) = +1: keep the cell.

  25. CDP with Pruning (CDPP) • Evaluate C3(Qj).

  26. CDP with Pruning (CDPP) • C3(Qj) = -1: prune the cell and skip to the next cell that has an unpruned neighbor.
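The column update walked through on slides 20-26 can be sketched with a dict as the sparse column: only unpruned cells are stored, and a cell above row 0 survives only if it passes its classifier and has at least one unpruned neighbor. A hedged sketch under those assumptions (names hypothetical):

```python
def cdpp_column(model, q, prev_col, dist, taus):
    """One CDPP column update for input feature q. prev_col maps
    model index i -> D(i, j-1) for the unpruned cells of the previous
    column; the returned dict plays the same role for column j."""
    INF = float("inf")
    col = {}
    for i in range(len(model)):
        d = dist(model[i], q)
        if d > taus[i]:
            continue  # C_i(Q_j) = -1: prune cell (i, j)
        if i == 0:
            col[i] = d  # a path may start at row 0 of any column
        else:
            best_prev = min(prev_col.get(i, INF),      # D(i, j-1)
                            prev_col.get(i - 1, INF),  # D(i-1, j-1)
                            col.get(i - 1, INF))       # D(i-1, j)
            if best_prev < INF:  # cell needs an unpruned neighbor
                col[i] = d + best_prev
    return col
```

Iterating this over the input stream gives the same cumulative costs as full CDP on the surviving cells, while never visiting pruned ones.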

  27. Spotting • INPUT: • matching costs in current frame j • current candidate gesture list (matching cost, duration) • optional: frame index of last detected gesture, response time. • OUTPUT (via the spotting rules): • detected gesture and gesture end point, OR • new candidate gesture list.

  28. Nested Gestures • Which gesture to recognize? 1 or 9? 5 or 8? 7 or 3?

  29. Nested Gestures • Which gesture to recognize: 5 or 8? [figure: matching cost vs. input time j]

  30. Nested Gestures • Which gesture to recognize: 5 or 8? • Solution: a subgesture table. • If a gesture is firing and at least one of its supergestures is also firing, wait; otherwise, recognize it. • In particular, if a firing gesture has no supergestures, recognize it immediately.
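The subgesture-table rule above can be sketched as a simple competition among the currently firing models. A minimal sketch (names hypothetical): `supergestures` maps each gesture to the set of longer gestures that contain it, i.e. the subgesture table.

```python
def resolve_firing(firing, supergestures):
    """Apply the subgesture rule: recognize a firing gesture only if
    none of its supergestures is also firing; otherwise keep waiting."""
    recognized, waiting = [], []
    for g in firing:
        if supergestures.get(g, set()) & set(firing):
            waiting.append(g)    # a supergesture is firing too: wait
        else:
            recognized.append(g)
    return recognized, waiting
```

With the digit vocabulary, if "5" is a subgesture of "8" and both fire, "5" waits while "8" is recognized, which is exactly the ambiguity slide 29 illustrates.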

  31. Spotting Algorithm (1) Update the candidate gesture list: • Find all firing models. • Conduct subgesture competitions among the firing models. • Find the best firing model. • For every candidate, perform overlapping and subgesture tests with respect to the best firing model. • Remove a candidate if it fails any test. • Add the best firing model if it passes all tests.

  32. Spotting Algorithm (2) Spot the candidate gesture if any of the following holds: • all of its active supergesture models started after the candidate's end frame j*. • all current active paths started after the candidate's end frame j*. • a specified number of frames has elapsed since the candidate's end frame j*.
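The three firing conditions on slide 32 translate directly into a small decision function. This is a sketch under assumed inputs (all names hypothetical): the start frames of active supergesture paths, the start frames of all active paths, and a response-time bound in frames.

```python
def should_spot(candidate_end, super_path_starts, active_path_starts,
                current_frame, max_wait):
    """Spot the candidate gesture if any of the slide-32 conditions
    holds for its end frame j* = candidate_end."""
    j_star = candidate_end
    if all(s > j_star for s in super_path_starts):
        return True  # every active supergesture path began after j*
    if all(s > j_star for s in active_path_starts):
        return True  # every active path began after j*
    if current_frame - j_star >= max_wait:
        return True  # response-time bound reached
    return False
```

Note that `all()` over an empty list is true, so a candidate with no active supergesture paths is spotted immediately, matching the "no supergestures" case on slide 30.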

  33. Experiments • Models: 2 users × 10 digits × 3 examples per digit. • Test: 2 users × 3 long sequences × 10 digits. • Sequence length: input 1000-1500 frames; digit 30-90 frames. [example sequence video]

  34. Results • Accuracy: CDP = Continuous Dynamic Programming; CDPP = CDP with Pruning; CDPPS = CDP with Pruning and Subgesture Reasoning. • Speedup: CDPP is 10 times faster than CDP.

  35. Conclusions • Pruning is a classification problem. • CDPP is an order of magnitude faster than CDP, and 7% more accurate. • Reasoning about nested gestures improves recognition accuracy: CDPPS improves accuracy by an additional 12%. • Both pruning and subgesture reasoning can be applied to other dynamic models (e.g., HMMs).

  36. Thank you

  37. Ongoing Work • Learn • Pruning classifiers using cross-validation. • Subgesture table. • Gesture verifiers. • Compare pruning method to Beam Search. • Handle multiple candidate hand hypotheses. • Apply methods to automatic sign language transcription.

  38. Towards Automatic Annotation of American Sign Language • Additional challenges: • Users are not cooperative: fast gesture speeds; variation between users. • Significant variation in hand shape and appearance. • Different types of gestures: finger spelling, one- vs. two-handed.

  39. Gesture Spotting: Related Work • Direct approach [Kang et al. 04, Kahol et al. 04] • Spotting precedes recognition. • Compute low-level motion parameters, such as velocity, acceleration, and trajectory curvature. • Look for abrupt changes (zero-crossings) in those parameters to find candidate gesture boundaries. • Indirect approach [Morguet&Lang 98, Lee&Kim 99] • Spotting is intertwined with recognition. • Compute input-to-model matching costs. • Look for low costs to detect candidate gesture end points. (The gesture start point can be found by backtracking the optimal dynamic programming path.)

  40. Approach: Continuous Dynamic Programming (CDP) [Oka 98] • d(i,j): distance between model feature Mi and input feature Qj. • D(i,j): cumulative distance between model M(1:i) and input subsequence Q(j':j). • The warping path is continuous and monotonic. [figure: "0", "2", and "9" models matched against the input]

  41. Approach: Continuous Dynamic Programming (CDP) • d(i,j): distance between model feature Mi and input feature Qj. • D(i,j): cumulative distance between model M(1:i) and input subsequence Q(j':j). • The optimal warping path (continuous and monotonic) determines the gesture start and end. [figure: "0", "2", "9" models, accept threshold, model time i vs. input time j]

  42. Conclusions • CDPP is an order of magnitude faster than CDP, and 7% more accurate. • CDPPS improves accuracy by an additional 12%. • Both pruning and subgesture reasoning can be applied to Hidden Markov Models (HMMs). • Future Work: • Learn: • subgesture table • gesture transition classifiers and subsequence classifiers • gesture verifiers. • Apply methods to spot signs in American Sign Language (ASL) sequences (e.g., utterances, stories, and dialogs).

  43. Gesture Types (Channels) Body Gesture Head Gesture Hand Gesture

  44. Gesture Spotting: Related Work • Indirect approach [Morguet&Lang 98, Lee&Kim 99] • Spotting is intertwined with recognition. 0. Detect hands and extract features. • Compute input to models matching costs. • Look for low cost to detect candidate gesture end point.

  45. Pruning: Motivation • Detection and Tracking • Where (in the image is the gesture performed)? • Spotting • When (does the gesture start and end)? • Recognition • What (gesture)? • Search complexity can be high ! • | Where | * | When | * | What |

  46. Gesture End Point Detection and Gesture Recognition • The algorithm is invoked for every input frame j, and consists of two steps: • Update the current list of candidate gesture models. • Apply a set of spotting rules to decide whether or not a gesture was spotted, and if yes decide which gesture model.

  47. End Point Detection Definitions • Paths • Complete Path W(M1:m, Qj’:j): a legal warping path matching the input subsequence Qj’:j with the complete model M1:m. • Partial Path W(M1:i, Qj’:j): a legal warping path matching the input subsequence Qj’:j with part of the model M1:m. • Active Path: a partial path that has not been pruned.

  48. End Point Detection Definitions • Models • Active Model g: a model that has a complete path ending at the current input frame j. • Firing Model g: an active model with a cost below the detection acceptance threshold. • Subgesture Relationship: a gesture g1 is a subgesture of gesture g2 if it is properly contained in g2. In this case, g2 is a supergesture of g1.

  49. Spotting Rules (1) • Zhu et al. 02 (spotting rules) • Based on Baudel&Beaudouin-Lafon's interaction model. • A moving hand appears in the sequence. • The moving hand is the dominant moving object. • The movement of the hand follows a three-stage process: preparation, stroke, and retraction [Kendon]. • The duration of the stroke T is bounded, T1≤T≤T2, for a given sampling rate.

  50. Spotting Rules (3) • Lee&Kim 99 (End-point detection):
