Database and Visual Front End

Presentation Transcript

  1. Database and Visual Front End Makis Potamianos

  2. Active Appearance Model Visual Features Iain Matthews

  3. Acknowledgments • Cootes, Edwards, Taylor, Manchester • Sclaroff, Boston

  4. AAM Overview (diagram: landmark points define the shape; the region of interest is warped to a reference frame to give the appearance; shape and appearance are combined into a single model)
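The diagram maps onto the standard AAM construction of Cootes et al.: a linear (PCA) model of the landmark shape, a linear model of the shape-normalised appearance, and a combined model over both. A minimal sketch, assuming training shapes and warped textures are already extracted; the arrays and the pca helper are illustrative stand-ins, not the workshop code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for real training data: flattened landmark coordinates
# (shape) and images warped to the reference shape (appearance).
shapes = rng.normal(size=(400, 2 * 68))    # (n_images, 2 * n_landmarks)
textures = rng.normal(size=(400, 5000))    # (n_images, n_pixels)

def pca(X, var_keep=0.95):
    """Mean and principal modes of X, keeping var_keep of the variance."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cum, var_keep)) + 1
    return mean, Vt[:k]

# Independent linear models of shape and of shape-normalised appearance.
shape_mean, shape_modes = pca(shapes)
tex_mean, tex_modes = pca(textures)

# The combined model couples them: encode each example in both models,
# concatenate (scaling shape units to balance pixel units), PCA again.
b_s = (shapes - shape_mean) @ shape_modes.T
b_t = (textures - tex_mean) @ tex_modes.T
w = np.sqrt(b_t.var() / b_s.var())         # crude unit-balancing weight
comb_mean, comb_modes = pca(np.hstack([w * b_s, b_t]))
```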

  5. Relationship to DCT Features (diagram: Face Detector → DCT features vs. AAM Tracker → AAM features) • External feature detector vs. model-based learned tracking • ROI ‘box’ vs. explicit shape + appearance modeling
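For contrast with the AAM pipeline, a minimal sketch of a DCT-based visual front end like the baseline the slide names: keep the low-order 2-D DCT coefficients of the detector's grayscale ROI as the feature vector. The square low-frequency block used here is an assumption about the exact coefficient selection.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(roi, n_coeffs=24):
    """Low-order 2-D DCT coefficients of a grayscale mouth ROI."""
    c = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    k = int(np.ceil(np.sqrt(n_coeffs)))   # small low-frequency block
    return c[:k, :k].ravel()[:n_coeffs]

feats = dct_features(np.random.rand(32, 32))   # 24-dimensional vector
```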

  6. Training Data • 4072 hand-labelled images = 2m 13s (out of 50h)

  7. Final Model (figure: model modes varied from -3 to +3 standard deviations about the mean)

  8. Fitting Algorithm (diagram: warp the image under the model to the reference frame, take the difference from the current model projection, weight the error to obtain a predicted update dc, and iterate until convergence; c is the full model parameter vector)
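A minimal sketch of that fitting loop, assuming a regression matrix R mapping texture residuals to parameter updates has been learned offline, as in the standard AAM search; warp_to_reference and render are hypothetical helpers for sampling the image under the model and rendering the model's appearance.

```python
import numpy as np

def fit_aam(image, c, R, warp_to_reference, render, n_iters=30):
    """Refine model parameters c by the learned linear update rule.

    R: (n_params, n_pixels) regression matrix mapping the texture
    residual onto a predicted parameter update (trained offline).
    """
    err = np.inf
    for _ in range(n_iters):
        sampled = warp_to_reference(image, c)   # image under the model
        residual = sampled - render(c)          # difference from projection
        dc = R @ residual                       # predicted update
        # Damped line search: keep the first step that lowers the error.
        for step in (1.0, 0.5, 0.25):
            c_try = c - step * dc
            e = np.sum((warp_to_reference(image, c_try) - render(c_try)) ** 2)
            if e < err:
                c, err = c_try, e
                break
        else:
            break                               # no improvement: stop
    return c, err
```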

  9. Tracking Results • Worst sequence: mean mean-square error = 548.87 • Best sequence: mean mean-square error = 89.11

  10. Tracking Results • Full-face AAM tracker on a subset of the VVAV database • 4,952 sequences • 1,119,256 images @ 30fps = 10h 22m • Mean of per-sentence mean MSE = 254.21 • Tracking rate (m2p decode) ≈ 4 fps • Beard-area and lips-only models will not track • Do these regions lack the sharp texture gradients needed to locate the model?

  11. Features • Use AAM full-face features directly (86-dimensional)

  12. Audio Lattice Rescoring Results • Lattice random path = 78.14% • DCT with LM = 51.08% • DCT no LM = 61.06%

  13. Audio Lattice Rescoring Results • AAM vs. DCT vs. Noise

  14. Tracking Error Analysis • AAM performance vs. tracking error

  15. Analysis and Future Work • Models are under-trained • Little more than face detection on 2 minutes of training data

  16. Analysis and Future Work (reprojection diagram) • Models are under-trained • Little more than face detection on 2 minutes of training data • Project the face through a more compact model • Retain only useful articulation information?

  17. Analysis and Future Work (reprojection diagram) • Models are under-trained • Little more than face detection on 2 minutes of training data • Project the face through a more compact model • Retain only useful articulation information? • Improve the reference shape • Minimal information loss through the warping?
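A sketch of the reprojection idea in those bullets: encode each frame with the trained linear model, keep only the dominant modes, and decode again, so low-variance detail is discarded and (ideally) the articulation-bearing variation is retained. Names and shapes are illustrative.

```python
import numpy as np

def reproject(frames, mean, modes, n_keep):
    """Encode frames in a truncated linear model and decode again.

    frames: (n_frames, dim); modes: (n_modes, dim) PCA modes sorted by
    variance. Keeping only the first n_keep modes discards low-variance
    detail while preserving the dominant modes of variation.
    """
    codes = (frames - mean) @ modes[:n_keep].T   # encode compactly
    return mean + codes @ modes[:n_keep]         # decode back
```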

  18. Asynchronous Stream Modelling Juergen Luettin

  19. The Recognition Problem • M: word (phoneme) sequence • M*: most likely word sequence • O_A: acoustic observation sequence • O_V: visual observation sequence
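The decision rule these symbols imply is the usual MAP formulation (the slide's own equation did not survive extraction, so this is a reconstruction):

```latex
M^{*} \;=\; \arg\max_{M} P(M \mid O_A, O_V)
      \;=\; \arg\max_{M} P(O_A, O_V \mid M)\, P(M)
```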

  20. Integration at the Feature Level • Assumption: conditional dependence between modalities • Integration at the feature level
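A minimal sketch of feature-level (early) integration under that assumption: the audio and visual features for each frame are concatenated into one observation vector and modelled jointly, so dependence between the modalities can be captured. The dimensions and synchronisation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-frame features, assumed already rate-synchronised (e.g. the
# visual stream upsampled to the audio frame rate).
audio_feats = rng.normal(size=(300, 39))    # (n_frames, audio_dim)
visual_feats = rng.normal(size=(300, 86))   # (n_frames, AAM feature dim)

# Early integration: one joint observation vector per frame, modelled
# by a single HMM, so cross-modal dependence is preserved.
av_feats = np.concatenate([audio_feats, visual_feats], axis=1)  # (300, 125)
```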

  21. Integration at the Decision Level • Assumption: conditional independence between modalities • Integration at the unit level
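A sketch of decision-level (late) integration: each modality is scored by its own model and the unit-level log-likelihoods are combined, here log-linearly; the weight lam is an illustrative stand-in for a tuned stream weight.

```python
import numpy as np

def combine_scores(loglik_audio, loglik_video, lam=0.7):
    """Log-linear combination of independently decoded stream scores.

    loglik_*: (n_hypotheses,) log-likelihoods of the same candidate
    units from separate audio-only and video-only models.
    """
    return lam * loglik_audio + (1.0 - lam) * loglik_video

scores = combine_scores(np.array([-120.0, -135.2]),
                        np.array([-88.4, -80.1]))
best = int(np.argmax(scores))   # index of the winning hypothesis
```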

  22. Multiple Synchronous Streams • Assumption: conditional independence • Integration at the state level • Two streams in each state: X: state sequence, a_ij: transition probability from state i to state j, b_j: observation probability density, c_jm: m-th mixture weight of multivariate Gaussian N
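Written out, the state emission probability consistent with the symbols the slide defines takes the standard two-stream form below; the exponents are the global stream weights of slide 26. This is a reconstruction, since the slide's equation image did not survive extraction.

```latex
b_j\big(o_t^{A}, o_t^{V}\big) \;=\; \prod_{s \in \{A, V\}}
  \left[ \sum_{m=1}^{M_s} c_{jm}^{s}\,
         \mathcal{N}\big(o_t^{s};\, \mu_{jm}^{s}, \Sigma_{jm}^{s}\big)
  \right]^{\lambda_s}
```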

  23. Multiple Asynchronous Streams • Assumption: conditional independence • Integration at the unit level • Decoding: individual best state sequences for audio and video
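One way to write the decoding criterion described in the last bullet (a hedged reconstruction: within each unit the two streams are Viterbi-decoded over their own state sequences, then recombined at the unit boundary):

```latex
P(O_A, O_V \mid M) \;\approx\;
  \Big[ \max_{X^{A}} P\big(O_A, X^{A} \mid M\big) \Big]^{\lambda_A}
  \Big[ \max_{X^{V}} P\big(O_V, X^{V} \mid M\big) \Big]^{\lambda_V}
```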

  24. Composite HMM Definition (figure: composite state space, states numbered 1-9, formed as the product of the two stream topologies) • Speech-noise decomposition (Varga & Moore, 1993) • Audio-visual decomposition (Dupont & Luettin, 1998)
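A minimal sketch of how such a composite state space can be enumerated: pairing a 3-state audio chain with a 3-state video chain gives the nine product states in the figure, and letting each stream advance independently is what permits within-unit asynchrony. Purely illustrative.

```python
from itertools import product

n_audio, n_video = 3, 3   # states per stream within one unit

# Composite state = (audio state, video state): a 3x3 grid yields the
# nine states of the figure. Each stream may hold or advance one state
# independently of the other.
states = list(product(range(n_audio), range(n_video)))
transitions = {
    (a, v): [(a2, v2) for a2, v2 in states
             if a2 in (a, a + 1) and v2 in (v, v + 1)]
    for a, v in states
}
```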

  25. Stream Clustering

  26. AVSR System • 3-state HMMs with 12 mixture components; 7-state HMMs for the composite model • Context-dependent phone models (plus silence and short pause), tree-based state clustering • Cross-word context-dependent decoding, using lattices computed at IBM • Trigram language model • Global stream weights in multi-stream models, estimated on a held-out set

  27. Speaker-independent word recognition

  28. Conclusions • The AV 2-stream asynchronous model beats the other models in noisy conditions • Future directions: • Transition matrices: context dependence, pruning transitions with low probability, cross-unit asynchrony • Stream weights: model-based, discriminative • Clustering: taking stream-tying into account

  29. Phone Dependent Weighting Dimitra Vergyri

  30. Weight Estimation Hervé Glotin