Audio-Visual Speech Recognition (AVSR)

Presentation Transcript


Speech is essentially bimodal: acoustic and visual cues

Audio-Visual Speech Recognition (AVSR)

  • Speech is essentially bimodal.

    • Acoustic and visual cues.

      H. McGurk and J. MacDonald, "Hearing lips and seeing voices", Nature, pp. 746-748, December 1976.

      T. Chen and R. Rao, "Audio-visual integration in multimodal communication", Proceedings of the IEEE, Special Issue on Multimedia Signal Processing, vol. 86, pp. 837-852, May 1998.

      D.G. Stork and M.E. Hennecke, editors, "Speechreading by Humans and Machines", Springer, Berlin, Germany, 1996.

  • Through their integration (fusion), the aim is:

    • To increase robustness and performance.

      There are too many papers to cover, so let's look at a few, most of them from the point of view of integration.


Speech is essentially bimodal: acoustic and visual cues

Audio-Visual Speech Recognition (AVSR)

  • Tutorials.

    G. Potamianos, C. Neti, G. Gravier, A. Garg and A.W. Senior, "Recent advances in the automatic recognition of audio-visual speech", Proceedings of the IEEE, vol. 91, no. 9, pp. 1306-1326, September 2003.

    G. Potamianos, C. Neti, J. Luettin and I. Matthews, "Audio-visual automatic speech recognition: An overview", in G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds., Issues in Visual and Audio-visual Speech Processing, Chapter 10, MIT Press, 2004.

  • Real Conditions.

    G. Potamianos and C. Neti, "Audio-visual speech recognition in challenging environments", in Proc. European Conference on Speech Technology, pp. 1293-1296, 2003.

    G. Potamianos, C. Neti, J. Huang, J.H. Connell, S. Chu, V. Libal, E. Marcheret, N. Haas and J. Jiang, "Towards practical deployment of audio-visual speech recognition", ICASSP'04, vol. 3, pp. 777-780, Montreal, Canada, 2004.


Speech is essentially bimodal: acoustic and visual cues

Audio-Visual Speech Recognition (AVSR)

  • The increase in robustness and performance rests on two facts:

    • The visual modality is largely unaffected by most losses of acoustic quality.

    • The visual and acoustic modalities work in a complementary manner.

      B. Dodd and R. Campbell, eds., "Hearing by Eye: The Psychology of Lipreading", Lawrence Erlbaum Associates, London, England, 1987.

    • But, if the integration is not done well: catastrophic fusion.

      J.R. Movellan and P. Mineiro, "Modularity and catastrophic fusion: A Bayesian approach with applications to audio-visual speech recognition", Tech. Rep. 97.01, Department of Cognitive Science, UCSD, San Diego, CA, 1997.


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration (Fusion)

  • Early Integration (EI):

    • At the feature level: concatenate the features (see the sketch after this slide).

    • But the features are not synchronous (VOT)!

      S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000.

      C.C. Chibelushi, J.S. Mason and F. Deravi, "Integration of acoustic and visual speech for speaker recognition", Eurospeech'93, Berlin, pp. 157-160, September 1993.

voice onset time (VOT)
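As a concrete illustration of feature-level fusion, here is a minimal Python sketch (not from the cited papers): it assumes 39-dimensional MFCC audio features at 100 frames/s and 32-dimensional lip-ROI features at 25 frames/s, upsamples the visual stream by nearest-neighbour repetition, and concatenates the two per frame. The function name and all dimensions are illustrative.

```python
# Early integration (EI) sketch: align and concatenate audio and visual features.
# Assumes audio features (e.g., MFCCs) at 100 frames/s and visual lip features
# (e.g., DCT of the mouth ROI) at 25 frames/s; names and rates are illustrative.
import numpy as np

def early_integration(audio_feats, visual_feats):
    """Concatenate per-frame audio and visual features into one feature stream.

    audio_feats  : (Ta, Da) array, e.g. 100 fps MFCCs
    visual_feats : (Tv, Dv) array, e.g. 25 fps lip-ROI features
    returns      : (Ta, Da + Dv) array at the audio frame rate
    """
    Ta = audio_feats.shape[0]
    Tv = visual_feats.shape[0]
    # Nearest-neighbour upsampling of the visual stream to the audio frame rate.
    idx = np.minimum((np.arange(Ta) * Tv) // Ta, Tv - 1)
    visual_upsampled = visual_feats[idx]
    return np.concatenate([audio_feats, visual_upsampled], axis=1)

# Example: 300 audio frames of 39-dim MFCCs, 75 video frames of 32-dim lip features.
av = early_integration(np.random.randn(300, 39), np.random.randn(75, 32))
print(av.shape)  # (300, 71)
```

Note that this naive rate alignment ignores the natural audio-visual asynchrony (e.g., VOT) raised in the bullet above, which is exactly the weakness of early integration.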


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration (Fusion)

  • Late Integration (LI):

    • At the decision level: combine the scores (see the sketch after this slide).

    • All temporal information is lost!

      A. Adjoudani and C. Benoit, "Audio-visual speech recognition compared across two architectures", Eurospeech'95, Madrid, Spain, pp. 1563-1566, September 1995.

      S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000.

      M. Heckmann, F. Berthommier and K. Kroschel, "Noise adaptive stream weighting in audio-visual speech recognition", EURASIP Journal on Applied Signal Processing, vol. 1, pp. 1260-1273, November 2002.
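A minimal decision-level sketch (illustrative, not taken from the cited systems): each stream produces a log-likelihood per word hypothesis, and the two are combined with a reliability weight `lam`. In noise-adaptive stream weighting this weight would be tied to an estimate of the acoustic conditions rather than fixed.

```python
# Late integration (LI) sketch: combine per-stream recognition scores at the
# decision level with a reliability weight lambda (illustrative values).
import numpy as np

def late_integration(audio_loglik, visual_loglik, lam=0.7):
    """Combine audio and visual log-likelihoods per word hypothesis.

    audio_loglik, visual_loglik : dict mapping word -> log-likelihood
    lam                         : audio stream weight in [0, 1]
    returns                     : (best word, combined scores)
    """
    combined = {w: lam * audio_loglik[w] + (1.0 - lam) * visual_loglik[w]
                for w in audio_loglik}
    best = max(combined, key=combined.get)
    return best, combined

# Example: the visual score disambiguates two acoustically similar words.
audio  = {"bat": -10.2, "pat": -10.1}
visual = {"bat": -4.0,  "pat": -9.5}
print(late_integration(audio, visual))  # ('bat', ...)
```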


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration (Fusion)

  • Middle Integration (MI) allows:

    • Specific word or sub-word models.

    • Synchronous continuous speech recognition.

      J. Luettin, G. Potamianos and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition", ICASSP'01, vol. 1, pp. 169-172, Salt Lake City, USA, May 2001.

      G. Potamianos, J. Luettin and C. Neti, "Hierarchical discriminant features for audio-visual LVCSR", ICASSP'01, vol. 1, pp. 165-168, Salt Lake City, USA, May 2001.


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration, Dynamic Bayesian Networks

  • Multistream HMM

    • State synchrony

    • Weighting of the observations (see the sketch after this slide).

      A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, "Dynamic Bayesian Networks for audio-visual speech recognition", EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.

      G. Potamianos, C. Neti, J. Luettin and I. Matthews, "Audio-visual automatic speech recognition: An overview", in G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds., Issues in Visual and Audio-visual Speech Processing, Chapter 10, MIT Press, 2004.

[Figure: multistream HMM graphical model unrolled over time steps t = 0, 1, 2, ..., T]
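The state-synchronous multistream HMM scores a joint observation by raising each stream's emission likelihood to a stream weight, i.e. log b_j(o_a, o_v) = λ_a log b_j^a(o_a) + λ_v log b_j^v(o_v). Below is a minimal sketch of that emission score, assuming diagonal-covariance Gaussian emissions; the state layout, the weights and all parameter values are illustrative.

```python
# Multistream HMM sketch: state-synchronous emission score where each stream's
# log-likelihood is scaled by a stream weight (lambda_a + lambda_v = 1 here).
# Diagonal-covariance Gaussian emissions are assumed for simplicity.
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def multistream_emission(o_a, o_v, state, lam_a=0.7, lam_v=0.3):
    """Stream-weighted log observation probability for one HMM state.

    state : dict with per-stream Gaussian parameters (illustrative structure)
    """
    log_ba = log_gauss_diag(o_a, state["mean_a"], state["var_a"])
    log_bv = log_gauss_diag(o_v, state["mean_v"], state["var_v"])
    return lam_a * log_ba + lam_v * log_bv

# Example with random parameters for a 39-dim audio and 32-dim visual frame.
rng = np.random.default_rng(0)
state = {"mean_a": rng.normal(size=39), "var_a": np.ones(39),
         "mean_v": rng.normal(size=32), "var_v": np.ones(32)}
print(multistream_emission(rng.normal(size=39), rng.normal(size=32), state))
```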


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration, Dynamic Bayesian Networks

  • Product HMM

    • Asynchrony between the streams (see the sketch after this slide).

    • Too many parameters.

      (I am not sure about this graphical representation.)

      G. Gravier, G. Potamianos and C. Neti, "Asynchrony modeling for audio-visual speech recognition", in Human Language Technology Conference, 2002.

[Figure: product HMM graphical model unrolled over time steps t = 0, 1, 2, ..., T]
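A minimal sketch of how a composite (product) state space can be built from two single-stream HMMs: composite states are pairs (i_a, i_v), transitions are taken as the product of the per-stream transitions, and the allowed asynchrony between the two stream states is bounded. The asynchrony bound and the renormalization after pruning are illustrative assumptions, not the exact construction of the cited paper; the growth of the composite state space hints at why such models need many parameters.

```python
# Product HMM sketch: composite states (i_a, i_v) built from two single-stream
# left-to-right HMMs, keeping the per-stream transitions and bounding how far
# the audio and visual states may drift apart (the asynchrony constraint and
# renormalization here are illustrative choices).
import numpy as np
from itertools import product

def product_hmm_transitions(A_a, A_v, max_async=1):
    """Transition matrix over composite states with bounded asynchrony."""
    Na, Nv = A_a.shape[0], A_v.shape[0]
    states = [(i, j) for i, j in product(range(Na), range(Nv))
              if abs(i - j) <= max_async]
    A = np.zeros((len(states), len(states)))
    for s, (ia, iv) in enumerate(states):
        for t, (ja, jv) in enumerate(states):
            A[s, t] = A_a[ia, ja] * A_v[iv, jv]
        if A[s].sum() > 0:                       # renormalize after pruning
            A[s] /= A[s].sum()
    return states, A

# Example: two 3-state left-to-right HMMs.
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
states, A_prod = product_hmm_transitions(A, A)
print(len(states), A_prod.shape)  # 7 composite states for max_async=1
```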


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration, Dynamic Bayesian Networks

  • Factorial HMM

    • Transition probabilities are independent for each stream (see the sketch after this slide).

      Z. Ghahramani and M.I. Jordan, "Factorial hidden Markov models", in Proc. Advances in Neural Information Processing Systems, vol. 8, pp. 472-478, 1996.

[Figure: factorial HMM graphical model unrolled over time steps t = 0, 1, 2, ..., T]
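A generative sketch of a factorial HMM with two chains: each chain follows its own transition matrix independently, and the observation depends on the states of both chains (here through an additive combination of per-chain means, as in the standard FHMM formulation). All parameter values are illustrative.

```python
# Factorial HMM sketch: two Markov chains evolve with independent transition
# probabilities; each observation depends on the states of both chains.
# The observation mean is the sum of per-chain contributions; all values are
# illustrative.
import numpy as np

def sample_fhmm(A_a, A_v, mu_a, mu_v, T, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    qa, qv = 0, 0                       # start both chains in state 0
    obs, states = [], []
    for _ in range(T):
        states.append((qa, qv))
        obs.append(mu_a[qa] + mu_v[qv] + rng.normal(scale=sigma))
        qa = rng.choice(len(A_a), p=A_a[qa])   # chain transitions are independent
        qv = rng.choice(len(A_v), p=A_v[qv])
    return np.array(obs), states

A = np.array([[0.9, 0.1], [0.1, 0.9]])
obs, states = sample_fhmm(A, A, mu_a=np.array([0.0, 1.0]),
                          mu_v=np.array([0.0, 10.0]), T=5)
print(states, np.round(obs, 2))
```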


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration, Dynamic Bayesian Networks

  • Coupled HMM (1/2)

    • The backbone chains depend on each other (see the sketch after this slide).

      M. Brand, N. Oliver and A. Pentland, "Coupled hidden Markov models for complex action recognition", in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 994-999, 1997.

      S. Chu and T. Huang, "Audio-visual speech modeling using coupled hidden Markov models", ICASSP'02, pp. 2009-2012, 2002.

[Figure: coupled HMM graphical model unrolled over time steps t = 0, 1, 2, ..., T]
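A minimal sketch of the coupling: each backbone chain's next state is drawn from a transition distribution conditioned on the previous states of both chains, P(q_a_t | q_a_{t-1}, q_v_{t-1}) and P(q_v_t | q_v_{t-1}, q_a_{t-1}). The transition tensors and their values are illustrative; emissions are omitted to keep the sketch short.

```python
# Coupled HMM sketch: each chain's next state depends on the previous states of
# BOTH chains, i.e. P(qa_t | qa_{t-1}, qv_{t-1}) and P(qv_t | qv_{t-1}, qa_{t-1}).
# Transition tensors and values are illustrative.
import numpy as np

def sample_chmm(Ta, Tv, T, seed=0):
    """Forward-sample the two coupled state chains.

    Ta[i, j, k] = P(qa_t = k | qa_{t-1} = i, qv_{t-1} = j), and Tv likewise
    with the roles of the chains swapped.
    """
    rng = np.random.default_rng(seed)
    qa, qv = 0, 0
    path = [(qa, qv)]
    for _ in range(T - 1):
        qa_next = rng.choice(Ta.shape[2], p=Ta[qa, qv])
        qv_next = rng.choice(Tv.shape[2], p=Tv[qv, qa])
        qa, qv = qa_next, qv_next
        path.append((qa, qv))
    return path

# Example: 2-state chains that tend to stay put when both chains agree.
Ta = np.array([[[0.9, 0.1], [0.5, 0.5]],
               [[0.5, 0.5], [0.1, 0.9]]])
print(sample_chmm(Ta, Ta, T=6))
```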


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration, Dynamic Bayesian Networks

  • Coupled HMM (2/2)

    A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, "Dynamic Bayesian Networks for audio-visual speech recognition", EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.

    A. Subramanya, S. Gurbuz, E. Patterson and J.N. Gowdy, "Audiovisual speech integration using coupled hidden Markov models for continuous speech recognition", ICASSP'03, 2003.


Speech is essentially bimodal: acoustic and visual cues

AVSR Integration, Dynamic Bayesian Networks

  • Implicit Modeling

    J.N. Gowdy, A. Subramanya, C. Bartels and J. Bilmes, "DBN based multi-stream models for audio-visual speech recognition", ICASSP'04, Montreal, Canada, 2004.

    X. Lei, G. Ji, T. Ng, J. Bilmes and M. Ostendorf, "DBN based multi-stream models for Mandarin toneme recognition", ICASSP'05, Philadelphia, USA, 2005.

