AudioVisual-SpeechRecognition (AVSR)
Download
1 / 12

Speech is bimodal essentially. Acoustic and Visual cues. - PowerPoint PPT Presentation


  • 69 Views
  • Uploaded on
  • Presentation posted in: General

AudioVisual-SpeechRecognition (AVSR). Speech is bimodal essentially. Acoustic and Visual cues. H. McGurk and J. MacDonald, ''Hearing lips and seeing voices'', Nature, pp. 746-748, December 1976.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha

Download Presentation

Speech is bimodal essentially. Acoustic and Visual cues.

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


AudioVisual-SpeechRecognition (AVSR)

  • Speech is bimodal essentially.

    • Acoustic and Visual cues.

      H. McGurk and J. MacDonald, ''Hearing lips and seeing voices'', Nature, pp. 746-748, December 1976.

      T. Chen and R. Rao, ''Audio-visual integration in multimodal communication'', Proceedings of the IEEE, Special issue in Multimedia Signal Processing, vol. 86, pp. 837-852, May 1998.

      D.B. Stork and M.E. Hennecke editors, ''Speechreading by Hummans and Machines. Springer, Berlin Germany, 1996.

  • Through their integration (fusion), the aim is:

    • To increase the robustness and performance.

      There are too many papers, so lets see a few... but from the point of view of integration most of them.


AudioVisual-SpeechRecognition (AVSR)

  • Tutorials.

    G. Potamianos, C. Neti, G. Gravier, A. Garg and A.W. Senior. Recent advances in the authomatic recognition of audio-visual speech. ''Proceedings of the IEEE, vil. 91(9), pp. 1306-1326, September 2003.

    G. Potamianos, C. Neti, J. Luettin and I. Mattews, ''Audio-visual automatic speech recognition: An overview'' In G. Bailly, E. Vatikiotis-Bateson and P. Perrier, edts. Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press, 2004.

  • Real Conditions.

    G. Potamianos and C. Neti, ''Audio-visual speech recognition in challenging environments'', In proc. European conference on Speech Technology, pp. 1293-1296, 2003.

    G. Potamianos, C. Neti, J. Huang, J.H. Connell, S. Chu, V. Libal, E. Marcheret, N. Haas and J. Jiang, ''Towards practical deployement of audio-visual speech recognition'', ICASSP'04, vol. 3, pp. 777-780, Montreal Canada, 2004.


AudioVisual-SpeechRecognition (AVSR)

  • Increase Robustness and Performance Based on the fact:

    • Visual modality is independent to most of the lost of acoustic quality.

    • Visual and Acoustic modalities work in a complementary manner.

      B. Dodd and R. Campbell, eds, ''Hearing by Eye: The psychology of Lipreading''. London, England. Laurence Erlbaum Associates Ltd., 1987.

    • But, if the integrations is not well done: Catastrofic fusion.

      J.R. Movellan and P. Mineiro, ''Modularity and catastrophic fusion: A bayesian approach with applications to audio-visual speech recognition'', Tech. Rep. 97.01, Departement of Cognitive Science, UCSD, San Diego, CA, 1997.


AVSR Integration (Fusion)

  • Early Integration (EI):

    • In the feature level, concatenate the features.

    • But features are not synchronous (VOT)!

      S. Dupont and J. Luettin, ''Audio-visual speech modeling for continuos speech recognition'', IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000.

      C. C. Chibelushi, J.S. Mason and F. Deravi, ''Integration of acoustic and visual speech for speaker recognition'', Eurospeech'93, Berlin,pp.157-160, September 1993.

voice onset time (VOT)


AVSR Integration (Fusion)

  • Late Integration (LI):

    • In the decision level, combine the scores.

    • Lost of all temporal information!

      A. Adjoudani and C. Benoit, ''Audio-visual speech recognition compared acroos two architectures'', Eurospeech'95, Madrid Spain, pp. 1563-1566, September 1995.

      S. Dupont and J. Luettin, ''Audio-visual speech modeling for continuos speech recognition'', IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000.

      M. Heckmann, F. Berthommier and K. Kroschel, ''Noise adaptive stream weighting in audio-visual speech recognition'', EUROASIP Journal of Applied Signal Processing, vol. 1, pp. 1260-1273, November 2002.


AVSR Integration (Fusion)

  • Middle Integration (MI) allows:

    • Specific word or sub-word models.

    • Synchronous continuous speech recognition.

      J. Luettin, G. Potamianos and C. Neti, ''Asynchronous stream modeling for large vocabulary audio-visual speech recognition'', ICASSP'01, vol. 1, pp. 169-172, Salt Lake City USA, May 2001.

      G. Potamianos, J. Luettin and C. Neti, '' Hierarchical discriminant features for audio-visual LVCSR'', ICASSP'01, vol. 1, pp. 165-168, Salt Lake City USA, May 2001.


AVSR Integration, Dynamic Bayesian Networks

  • Multistream HMM

    • State synchrony

    • Weighting the

      observations

      A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, ''Dynamic Bayesian Networks for audio-visual speech recognition'',EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.

      G. Potamianos, C. Neti, J. Luettin and I. Mattews, ''Audio-visual automatic speech recognition: An overview'' In G. Bailly, E. Vatikiotis-Bateson and P. Perrier, edts. Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press, 2004.

t=T

t=0

t=1

t=2


AVSR Integration, Dynamic Bayesian Networks

  • Product HMM

    • Asynchrony between

      the streams

    • Too many parameters

      I am not sure about

      this graphical representation

      G. Gravier, G. Potamianos and C. Neti, ''Asynchrony modeling for audio-visual speech recognition'', In Human Language Technology Conference, 2002.

t=T

t=1

t=0

t=2


AVSR Integration, Dynamic Bayesian Networks

  • Factorial HMM

    • Transition probabilities

      are independents for

      each stream.

      Z. Ghahramani and M.I. Jordan, ''Factorial hidden markov models'', In Proc. Advances in Neural Information Processing Systems, vol. 8 pp. 472-478, 1985.

t=T

t=1

t=0

t=2


AVSR Integration, Dynamic Bayesian Networks

  • Coupled HMM (1/2)

    • The backbones

      have a dependence.

      M. Brand, N. Oliver and A. Pentland, ''Coupled hidden markov models for complex action recognition'', In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 994-999, 1997.

      S. Chu and T. Huang, ''Audio-visual speech modeling using coupled hidden markov models'', ICASSP'02, pp. 2009-2012, 2002.

t=T

t=0

t=1

t=2


AVSR Integration, Dynamic Bayesian Networks

  • Coupled HMM (2/2)

    A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, ''Dynamic Bayesian Networks for audio-visual speech recognition'',EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.

    A. Subramanya, S. Gurbuz, E. Patterson, and J.N. Gowdy, ''Audiovisual speech integration using coupled hidden markov models for continous speech recognition'', ICASSP'03, 2003.


AVSR Integration, Dynamic Bayesian Networks

  • Implicite Modeling

    J.N. Gowdy, A. Subramanaya, C. Bartels and Jeff Bilmes, ''DBN based Multi-stream models for audio-visula speech recognition'', ICASSP'04, Montreal Canada, 2004.

    X. Lei, G. Ji, T. Ng, J. Bilmes and M. Ostendorf, ''DBN based Multi-stream for Mandarin Toneme Recognition'', ICASSP'05, Filadelphie USA, 2005.


ad
  • Login