
Dealing with Acoustic Noise Part 3: Video

This lecture covers visual speech features, lip and face tracking, object tracking, computational complexity, AdaBoost, and the AVICAR corpus, and discusses techniques for dealing with visual noise.


Presentation Transcript


  1. Dealing with Acoustic Noise Part 3: Video Mark Hasegawa-Johnson University of Illinois Lectures at CLSP WS06 July 20, 2006

  2. Audio-Visual Speech Recognition: WS00 and WS06 • Visual speech features • DCT of lip rectangle • Active Appearance Models • Feature normalization • Mean and variance normalization • MLLR, fiducial-point LR and logLR • LDA and PCA • Audio – Video fusion • Two-stream HMM • Product HMM & Coupled HMM • Streams based on constriction states

  3. Face & Lip Tracking

  4. Object Tracking: Multi-Resolution (Neti et al., 2000) • Computational complexity of lip tracking: • In a 480x720 image, there are 1.2×10¹¹ candidate lip rectangles • Multi-resolution lip tracking: • Train lip detectors at each resolution, i.e., the corners of the lip rectangle must be integer multiples of R_i • Beam search: • Keep the N best candidates at resolution R_i • At resolution R_{i+1}, consider only the candidates within ±R_i/2 of a candidate from the N-best list at resolution R_i • Tune R_1, R_2, … and N to trade off accuracy against computational complexity
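
A minimal sketch of this coarse-to-fine beam search in Python. The grid steps, beam width, and the stand-in `score_rect` detector are illustrative assumptions, not the WS06 implementation:

```python
import itertools
import numpy as np

def multires_lip_search(image, resolutions=(32, 16, 8), beam=10):
    """Coarse-to-fine beam search over candidate lip rectangles (x0, y0, x1, y1)."""
    h, w = image.shape[:2]
    step = resolutions[0]
    # Exhaustive enumeration is affordable only on the coarsest grid.
    cands = [(x0, y0, x1, y1)
             for x0, x1 in itertools.combinations(range(0, w + 1, step), 2)
             for y0, y1 in itertools.combinations(range(0, h + 1, step), 2)]
    for i, step in enumerate(resolutions):
        if i > 0:
            half = resolutions[i - 1] // 2
            # Perturb each corner of each survivor by +/- half the previous
            # grid step, quantized to the current (finer) grid.
            offsets = range(-half, half + 1, step)
            cands = {(min(max(x0 + a, 0), w), min(max(y0 + b, 0), h),
                      min(max(x1 + c, 0), w), min(max(y1 + d, 0), h))
                     for (x0, y0, x1, y1) in cands
                     for a, b, c, d in itertools.product(offsets, repeat=4)}
        # Keep the N best candidates at this resolution (the beam).
        cands = sorted(cands, key=lambda r: score_rect(image, r), reverse=True)[:beam]
    return cands[0]

def score_rect(image, rect):
    """Stand-in for a resolution-specific lip detector; returns a scalar score."""
    x0, y0, x1, y1 = rect
    if x1 <= x0 or y1 <= y0:
        return float("-inf")
    return float(image[y0:y1, x0:x1].mean())
```

Each refinement level scores only N·81 perturbed rectangles rather than revisiting the 1.2×10¹¹ exhaustive candidates.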

  5. Object Tracking: Fast Features (Viola and Jones, 2001)
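
The "fast features" are Haar-like rectangle sums that Viola and Jones evaluate in constant time from an integral image. A self-contained sketch (the two-rectangle contrast feature and the example coordinates are illustrative):

```python
import numpy as np

def integral_image(img):
    """Cumulative sum so that ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Pixel sum over a w-by-h rectangle: four lookups regardless of size."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, x, y, w, h):
    """Left-minus-right two-rectangle contrast feature (Haar-like)."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)

# Example: evaluate one feature at a fixed location of a random image.
img = np.random.rand(480, 720)
ii = integral_image(img)
print(two_rect_feature(ii, x=100, y=200, w=12, h=8))
```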

  6. Object Tracking: AdaBoost (Schapire, 1999) • Each Viola-Jones feature defines a “weak classifier”: h_{a,b,c,d,i}(x) = 1 if f_i(a,b,c,d) > threshold, else h_{a,b,c,d,i}(x) = 0 • Start with equal weights for all M training tokens: w_m(1) = 1/M, 1 ≤ m ≤ M • For each learning iteration t: • Find the (a,b,c,d,i) that minimizes the weighted training error ε_t • Set w_m(t+1) = w_m(t)·(1−ε_t)/ε_t if token m was incorrectly classified, else w_m(t+1) = w_m(t); then renormalize so Σ_m w_m(t+1) = 1 • α_t = log((1−ε_t)/ε_t) • The final “strong classifier” is H(x) = 1 iff Σ_t α_t h_t(x) > ½ Σ_t α_t
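
A sketch of the boosting loop on this slide, with generic weak-classifier callables standing in for the (a, b, c, d, i) rectangle features:

```python
import numpy as np

def adaboost(X, y, stumps, T=50):
    """Discrete AdaBoost. X: (M, D) features; y: labels in {0, 1};
    stumps: list of callables h(X) -> {0, 1} vectors (the weak classifiers)."""
    M = len(y)
    w = np.full(M, 1.0 / M)            # equal initial weights, w_m(1) = 1/M
    chosen, alphas = [], []
    for t in range(T):
        # Pick the weak classifier minimizing the weighted training error.
        errs = [np.sum(w * (h(X) != y)) for h in stumps]
        best = int(np.argmin(errs))
        eps = max(errs[best], 1e-12)   # guard against a perfect stump
        if eps >= 0.5:
            break                      # no weak classifier beats chance
        alpha = np.log((1 - eps) / eps)
        # Up-weight misclassified tokens by (1 - eps)/eps, then renormalize.
        miss = stumps[best](X) != y
        w = w * np.where(miss, (1 - eps) / eps, 1.0)
        w /= w.sum()
        chosen.append(stumps[best])
        alphas.append(alpha)
    def strong(Xq):
        """H(x) = 1 iff the weighted vote exceeds half the total weight."""
        votes = sum(a * h(Xq) for a, h in zip(alphas, chosen))
        return (votes > 0.5 * sum(alphas)).astype(int)
    return strong
```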

  7. AdaBoost in a Bayesian Context • p(M_D(x) | C_i) is well approximated by a Gaussian • C_i = 0: object absent • C_i = 1: object present • M_D(x): the real-valued AdaBoost output (its defining equation appeared as a figure on the slide) • The probability distribution of the face center (x,y) and log(width, height) is well modeled by Gaussians • The probability distribution of the lip center (x,y) and size (w,h) relative to the face (normalized to the range [−1,1]) is compact and unimodal • Find a lip rectangle and face rectangle that jointly maximize the product of probabilities
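
A sketch of the Gaussian score model: fit p(M_D(x) | C_i) for each class from training windows, then convert a new detector output to a posterior by Bayes' rule (the equal class priors are an assumption):

```python
import numpy as np
from scipy.stats import norm

def fit_score_model(scores_absent, scores_present):
    """Fit p(M_D(x) | C_i) for C_i in {0, 1} as univariate Gaussians."""
    g0 = norm(np.mean(scores_absent), np.std(scores_absent))
    g1 = norm(np.mean(scores_present), np.std(scores_present))
    return g0, g1

def posterior_present(score, g0, g1, prior_present=0.5):
    """P(C = 1 | M_D(x)) by Bayes' rule."""
    l0 = g0.pdf(score) * (1 - prior_present)
    l1 = g1.pdf(score) * prior_present
    return l1 / (l0 + l1)
```

The face and lip rectangles are then chosen jointly, multiplying such appearance probabilities by the Gaussian geometry terms described on the slide.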

  8. Pixel-Based Features

  9. Pixel-Based Features: Dimension

  10. Geometry Features can be useful for AVSR (Chu and Huang, 2002, using features of Tsuhan Chen) (slide figure: visual feature extraction)

  11. Combining Geometry + Pixels: AAM (Neti et al., WS00)

  12. Constellation Models (Koch) • Each patch is recognized by a likelihood p(pixels) • Relative geometries are controlled by a geometry PDF • Advantages: • Good object detection accuracy • Provides information about object components • Disadvantage: computational complexity
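
A sketch of a constellation score under a common Gaussian-geometry assumption: the total log-probability sums per-patch appearance log-likelihoods with a geometry log-PDF over the parts' relative positions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def constellation_score(patch_loglikes, positions, geom_mean, geom_cov):
    """patch_loglikes: log p(pixels | part k) for each of K parts;
    positions: (K, 2) part centers, normalized to the object box;
    geometry term: one Gaussian over the stacked relative positions."""
    geometry = multivariate_normal(mean=geom_mean, cov=geom_cov)
    return float(np.sum(patch_loglikes) + geometry.logpdf(positions.ravel()))
```

Detection maximizes this score over candidate placements of every part, which is the source of the computational cost noted above.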

  13. AVICAR “Constellation” • Four face rectangles provide information about face location, width, and height (useful for normalization) • Positions of the lip rectangles within the four face rectangles provide information about head angle (useful for normalization) • Lip height and width provide information about whether the mouth is open or closed (useful for speech recognition) • The DCT of the pixels within all four lip rectangles gives information about teeth and tongue (useful for speech recognition)
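
A sketch of the DCT pixel features for one lip rectangle: resize to a fixed analysis size, take a 2-D DCT, and keep the lowest-frequency coefficients (the 16×16 size and the coefficient count are illustrative, not the AVICAR settings):

```python
import numpy as np
from scipy.fft import dctn

def lip_dct_features(lip_patch, size=(16, 16), n_coef=30):
    """lip_patch: grayscale lip-rectangle pixels, any shape.
    Returns the n_coef lowest-frequency 2-D DCT coefficients,
    ordered by frequency index i + j."""
    # Nearest-neighbor resize to a fixed analysis size.
    ys = np.linspace(0, lip_patch.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, lip_patch.shape[1] - 1, size[1]).astype(int)
    patch = lip_patch[np.ix_(ys, xs)]
    coef = dctn(patch, norm="ortho")
    order = np.argsort([i + j for i in range(size[0]) for j in range(size[1])])
    return coef.ravel()[order][:n_coef]
```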

  14. Visual Noise • Lighting Variability • Physical model • Variance normalization • Head-Pose Variability • Physical model • Linear and log-linear regression • Dimensionality Reduction • Linear discriminant analysis • Within-condition PCA • Facial Feature Variability • MLLR

  15. Lighting Variability • Physical model (isotropic reflection): the measured (r,g,b) of a moving fleshpoint is the product of its direction-independent reflectance (γ_r(t), γ_g(t), γ_b(t)) and its lighting (λ_r, λ_g, λ_b) • Solution: variance normalization
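
Because the lighting model is multiplicative, dividing each feature dimension by its standard deviation over the utterance cancels a constant lighting gain; a minimal sketch:

```python
import numpy as np

def mean_variance_normalize(features, eps=1e-8):
    """features: (T, D) visual feature trajectory for one utterance.
    Dividing by the per-dimension standard deviation cancels a constant
    multiplicative lighting gain; subtracting the mean removes any offset."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)
```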

  16. Lighting Variability • Variance normalization is useful even if the lip rectangle is marked by high-contrast lighting… • …but time-varying high-contrast lighting would fool it

  17. Head-Pose Variability • If the head is an ellipse, its measured width w_F(t) and height h_F(t) are functions of roll ρ, yaw ψ, pitch φ, true height ħ_F, and true width w̄_F, according to… • …which can usefully be approximated as…

  18. Linear Regression • The additive random part of the lip width, w_L(t) = w_1 + ħ_L cos ψ(t) sin ρ(t), is proportional to the similar additive variation in the head width, w_F(t) = w_{F1} + ħ_F cos ψ(t) sin ρ(t), so we can eliminate it by orthogonalizing w_L(t) to w_F(t).

  19. Log-Linear Regression • The multiplicative random part of the lip width, w_L(t) = w̄_L cos ψ(t) cos ρ(t), is proportional to the similar multiplicative variation in the head width, w_F(t) = w̄_F cos ψ(t) cos ρ(t), so we can eliminate it by orthogonalizing log w_L(t) to log w_F(t).
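
A sketch of both compensations from slides 18 and 19: regress the lip width on the face width and keep the residual, in the linear domain for the additive part and in the log domain for the multiplicative part:

```python
import numpy as np

def orthogonalize(y, x):
    """Return the residual of y after least-squares regression on [x, 1]."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def pose_compensate(w_lip, w_face):
    """w_lip, w_face: per-frame lip and face widths for one utterance."""
    lin = orthogonalize(w_lip, w_face)                      # additive pose term
    loglin = orthogonalize(np.log(w_lip), np.log(w_face))   # multiplicative term
    return lin, loglin
```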

  20. Facial Feature Variability • … tends to result in large changes in the feature mean (e.g., different talkers have different average lip-rectangle sizes) • Changes in the class-dependent feature mean can be compensated by MLLR
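
A minimal sketch of MLLR mean adaptation, simplified to a single global regression class and identity covariances, in which the maximum-likelihood transform reduces to weighted least squares on the extended means:

```python
import numpy as np

def mllr_global_transform(obs, means, gammas):
    """obs: (T, D) adaptation frames; means: (M, D) Gaussian means;
    gammas: (T, M) state-occupancy posteriors from a forced alignment.
    Solves min_W sum_{t,m} gamma[t,m] * ||obs[t] - W @ [means[m]; 1]||^2."""
    M = means.shape[0]
    D = obs.shape[1]
    ext = np.hstack([means, np.ones((M, 1))])   # extended means, (M, D+1)
    G = np.zeros((D + 1, D + 1))
    K = np.zeros((D + 1, D))
    for t in range(obs.shape[0]):
        for m in range(M):
            g = gammas[t, m]
            if g > 0:
                G += g * np.outer(ext[m], ext[m])
                K += g * np.outer(ext[m], obs[t])
    W = np.linalg.solve(G, K).T                 # (D, D+1) affine transform
    return ext @ W.T                            # adapted means, (M, D)
```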

  21. WER Results from AVICAR • LR = linear regression • LLR = log-linear regression • Model = model-based head-pose compensation • 13+d+dd = 13 static features (plus deltas and delta-deltas) • 39 = 39 static features • All systems include mean and variance normalization and MLLR

  22. Audio-Visual Asynchrony • For example, the tongue touches the teeth before the acoustic speech onset in the word “three,” and the lips are already rounded in anticipation of the /r/.

  23. Audio-Visual Asynchrony: Coupled HMM is a typical Phoneme-Viseme Model
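
A sketch of the coupled-HMM idea: audio and video state chains evolve jointly in a product state space, each chain's transition conditioned on both previous states; the asynchrony cap and the toy start condition are illustrative assumptions, not the WS06 model:

```python
import numpy as np

def coupled_hmm_viterbi(log_a, log_v, trans_a, trans_v, max_async=2):
    """log_a: (T, Na) audio state log-likelihoods; log_v: (T, Nv) video.
    trans_a[i, j, k]: log P(audio k | audio i, video j); trans_v likewise.
    Viterbi over the product state (audio, video), disallowing pairs of
    state indices that differ by more than max_async."""
    T, Na = log_a.shape
    Nv = log_v.shape[1]
    NEG = -1e30
    delta = np.full((Na, Nv), NEG)
    delta[0, 0] = log_a[0, 0] + log_v[0, 0]      # both chains start in state 0
    for t in range(1, T):
        new = np.full((Na, Nv), NEG)
        for i in range(Na):
            for j in range(Nv):
                if delta[i, j] <= NEG:
                    continue
                for k in range(Na):
                    for l in range(Nv):
                        if abs(k - l) > max_async:   # asynchrony constraint
                            continue
                        s = (delta[i, j] + trans_a[i, j, k] + trans_v[i, j, l]
                             + log_a[t, k] + log_v[t, l])
                        if s > new[k, l]:
                            new[k, l] = s
        delta = new
    return delta.max()
```

The product state space makes decoding O(T·(Na·Nv)²), which is why the asynchrony constraint matters in practice.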

  24. Asynchrony in Gestural Phonology • Gestural score for “three” (articulatory tiers vs. time): • Lips: round → spread • Tongue: dental critical → retroflex narrow → palatal narrow • Glottis: unvoiced → voiced

  25. Modeling Asynchrony Using Constriction State Variables • Dynamic Bayesian network unrolled over frames t and t+1, with hidden variables Word_t, Glottis_t, Tongue_t, and Lips_t and observation variables Audio_t and Video_t

  26. Summary • Video feature extraction: it works! • Audiovisual fusion using GMTK: • Partha has the phoneme-viseme model working • Articulatory feature model is in progress
