Speech is essentially bimodal: acoustic and visual cues.

Presentation Transcript
slide1

AudioVisual-SpeechRecognition (AVSR)

  • Speech is essentially bimodal:
    • Acoustic and visual cues.

H. McGurk and J. MacDonald, ''Hearing lips and seeing voices'', Nature, pp. 746-748, December 1976.

T. Chen and R. Rao, ''Audio-visual integration in multimodal communication'', Proceedings of the IEEE, Special issue in Multimedia Signal Processing, vol. 86, pp. 837-852, May 1998.

D.B. Stork and M.E. Hennecke, editors, ''Speechreading by Humans and Machines''. Springer, Berlin, Germany, 1996.

  • Through their integration (fusion), the aim is:
    • To increase the robustness and performance.

There are too many papers to cover them all, so let's look at a few, most of them from the point of view of integration.

slide2

AudioVisual-SpeechRecognition (AVSR)

  • Tutorials.

G. Potamianos, C. Neti, G. Gravier, A. Garg and A.W. Senior, ''Recent advances in the automatic recognition of audio-visual speech'', Proceedings of the IEEE, vol. 91(9), pp. 1306-1326, September 2003.

G. Potamianos, C. Neti, J. Luettin and I. Matthews, ''Audio-visual automatic speech recognition: An overview'', in G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds., Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press, 2004.

  • Real Conditions.

G. Potamianos and C. Neti, ''Audio-visual speech recognition in challenging environments'', in Proc. European Conference on Speech Technology, pp. 1293-1296, 2003.

G. Potamianos, C. Neti, J. Huang, J.H. Connell, S. Chu, V. Libal, E. Marcheret, N. Haas and J. Jiang, ''Towards practical deployment of audio-visual speech recognition'', ICASSP'04, vol. 3, pp. 777-780, Montreal, Canada, 2004.

slide3

AudioVisual-SpeechRecognition (AVSR)

  • Increased robustness and performance rest on two facts:
    • The visual modality is unaffected by most sources of acoustic degradation.
    • The visual and acoustic modalities complement each other.

B. Dodd and R. Campbell, eds., ''Hearing by Eye: The Psychology of Lipreading''. London, England. Lawrence Erlbaum Associates Ltd., 1987.

    • But if the integration is not done well: catastrophic fusion (the combined system can perform worse than a single modality alone).

J.R. Movellan and P. Mineiro, ''Modularity and catastrophic fusion: A Bayesian approach with applications to audio-visual speech recognition'', Tech. Rep. 97.01, Department of Cognitive Science, UCSD, San Diego, CA, 1997.

slide4

AVSR Integration (Fusion)

  • Early Integration (EI):
    • At the feature level: concatenate the audio and visual features.
    • But the features are not synchronous (VOT)!

S. Dupont and J. Luettin, ''Audio-visual speech modeling for continuous speech recognition'', IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000.

C.C. Chibelushi, J.S. Mason and F. Deravi, ''Integration of acoustic and visual speech for speaker recognition'', Eurospeech'93, Berlin, pp. 157-160, September 1993.

VOT: voice onset time.
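A minimal sketch of early integration as described on this slide: the visual features are brought to the audio frame rate and the two vectors are concatenated frame by frame. The function names, the linear interpolation, the 4:1 frame-rate ratio and the toy data are assumptions made for illustration, not taken from the cited papers. Note that matching frame rates does not remove the audio-visual asynchrony the slide warns about.

```python
def upsample_linear(frames, factor):
    """Linearly interpolate a list of feature vectors by an integer factor."""
    out = []
    for i in range(len(frames) - 1):
        cur, nxt = frames[i], frames[i + 1]
        for k in range(factor):
            w = k / factor
            out.append([(1 - w) * c + w * n for c, n in zip(cur, nxt)])
    out.append(list(frames[-1]))  # keep the final frame
    return out


def early_integration(audio, visual, factor):
    """Concatenate audio and upsampled visual features frame by frame."""
    visual_up = upsample_linear(visual, factor)
    n = min(len(audio), len(visual_up))  # truncate to the common length
    return [audio[t] + visual_up[t] for t in range(n)]


# Hypothetical toy data: 2-dim audio frames at 4x the visual rate, 1-dim visual frames.
audio = [[float(t), float(t) * 0.5] for t in range(9)]
visual = [[0.0], [4.0], [8.0]]
fused = early_integration(audio, visual, factor=4)
```

Each fused frame here has three dimensions (two audio, one visual), and a single recognizer would then be trained on the fused vectors.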

slide5

AVSR Integration (Fusion)

  • Late Integration (LI):
    • At the decision level: combine the scores of the separate audio and visual recognizers.
    • Loss of all temporal information!

A. Adjoudani and C. Benoit, ''Audio-visual speech recognition compared across two architectures'', Eurospeech'95, Madrid, Spain, pp. 1563-1566, September 1995.

S. Dupont and J. Luettin, ''Audio-visual speech modeling for continuous speech recognition'', IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000.

M. Heckmann, F. Berthommier and K. Kroschel, ''Noise adaptive stream weighting in audio-visual speech recognition'', EURASIP Journal on Applied Signal Processing, vol. 1, pp. 1260-1273, November 2002.
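Late integration can be sketched as a weighted combination of per-class scores from two independent recognizers. The snippet below is an illustrative assumption, not the method of any cited paper: the scores are hypothetical log-likelihoods for three candidate syllables, and `audio_weight` is a hand-set stream weight that one would lower as the acoustic noise increases.

```python
import math


def late_integration(audio_loglik, visual_loglik, audio_weight):
    """Weighted sum of per-class log-likelihoods; returns the best class and all scores."""
    combined = {
        label: audio_weight * audio_loglik[label]
        + (1.0 - audio_weight) * visual_loglik[label]
        for label in audio_loglik
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical utterance-level scores from an audio-only and a visual-only recognizer.
audio_loglik = {"ba": math.log(0.5), "ga": math.log(0.3), "da": math.log(0.2)}
visual_loglik = {"ba": math.log(0.1), "ga": math.log(0.6), "da": math.log(0.3)}

clean_choice, _ = late_integration(audio_loglik, visual_loglik, audio_weight=0.9)
noisy_choice, _ = late_integration(audio_loglik, visual_loglik, audio_weight=0.2)
```

With a high audio weight the audio recognizer's preference wins; with a low one the visual preference wins. Because only utterance-level scores are combined, nothing about the timing between the streams survives, which is the loss the slide points out.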

slide6

AVSR Integration (Fusion)

  • Middle Integration (MI) allows:
    • Specific word or sub-word models.
    • Synchronous continuous speech recognition.

J. Luettin, G. Potamianos and C. Neti, ''Asynchronous stream modeling for large vocabulary audio-visual speech recognition'', ICASSP'01, vol. 1, pp. 169-172, Salt Lake City, USA, May 2001.

G. Potamianos, J. Luettin and C. Neti, ''Hierarchical discriminant features for audio-visual LVCSR'', ICASSP'01, vol. 1, pp. 165-168, Salt Lake City, USA, May 2001.

slide7

AVSR Integration, Dynamic Bayesian Networks

  • Multistream HMM
    • State synchrony
    • Weighting the observations

A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, ''Dynamic Bayesian Networks for audio-visual speech recognition'', EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.

G. Potamianos, C. Neti, J. Luettin and I. Matthews, ''Audio-visual automatic speech recognition: An overview'', in G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds., Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press, 2004.

[Figure: multistream HMM drawn as a DBN over time slices t=0, t=1, t=2, ..., t=T]
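As a rough sketch of the multistream HMM idea on this slide: both streams share one hidden state sequence (state synchrony), and each state's observation score is a weighted sum of the per-stream log-likelihoods. The function name, the argument layout and the toy numbers are assumptions for illustration; in practice the per-stream log-likelihoods would come from trained models such as Gaussian mixtures.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def multistream_forward(log_init, log_trans, audio_ll, visual_ll, w_audio, w_visual):
    """Forward log-score of a state-synchronous two-stream HMM.

    audio_ll[t][j] and visual_ll[t][j] are the per-stream log-likelihoods
    of frame t in state j; the weights scale them before they are added.
    """
    n_states = len(log_init)

    def emit(t, j):
        return w_audio * audio_ll[t][j] + w_visual * visual_ll[t][j]

    alpha = [log_init[j] + emit(0, j) for j in range(n_states)]
    for t in range(1, len(audio_ll)):
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n_states)]) + emit(t, j)
            for j in range(n_states)
        ]
    return logsumexp(alpha)


# Hypothetical 2-state, 2-frame example.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.5), math.log(0.5)], [math.log(0.5), math.log(0.5)]]
audio_ll = [[math.log(0.2), math.log(0.4)], [math.log(0.3), math.log(0.1)]]
visual_ll = [[math.log(0.5), math.log(0.5)], [math.log(0.5), math.log(0.5)]]

audio_only = multistream_forward(log_init, log_trans, audio_ll, visual_ll, 1.0, 0.0)
both = multistream_forward(log_init, log_trans, audio_ll, visual_ll, 1.0, 1.0)
```

Setting the visual weight to zero reduces the model to an ordinary audio-only HMM, which gives a simple sanity check on the weighting.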

slide8

AVSR Integration, Dynamic Bayesian Networks

  • Product HMM
    • Asynchrony between the streams
    • Too many parameters

(I am not sure about this graphical representation.)

G. Gravier, G. Potamianos and C. Neti, ''Asynchrony modeling for audio-visual speech recognition'', In Human Language Technology Conference, 2002.

[Figure: product HMM drawn as a DBN over time slices t=0, t=1, t=2, ..., t=T]
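The "too many parameters" point can be made concrete by counting composite states. In a product HMM each composite state pairs an audio state with a visual state, so the state space grows as the product of the two chain lengths. The sketch below enumerates those pairs; the `max_async` limit on how far the streams may drift apart is an assumption added for illustration.

```python
from itertools import product


def composite_states(n_audio, n_visual, max_async=None):
    """List the (audio_state, visual_state) pairs of a product HMM.

    If max_async is given, keep only pairs whose state indices differ
    by at most that amount (a hypothetical asynchrony limit).
    """
    states = []
    for a, v in product(range(n_audio), range(n_visual)):
        if max_async is None or abs(a - v) <= max_async:
            states.append((a, v))
    return states


full = composite_states(3, 3)                  # unrestricted asynchrony
limited = composite_states(3, 3, max_async=1)  # streams at most one state apart
synchronous = composite_states(3, 3, max_async=0)
```

For three states per stream this gives 9 composite states without a limit, 7 with a one-state limit and 3 when the streams are forced to be synchronous; every composite state needs its own transition entries, which is where the parameter count grows.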

slide9

AVSR Integration, Dynamic Bayesian Networks

  • Factorial HMM
    • Transition probabilities are independent for each stream.

Z. Ghahramani and M.I. Jordan, ''Factorial hidden Markov models'', in Proc. Advances in Neural Information Processing Systems, vol. 8, pp. 472-478, 1996.

[Figure: factorial HMM drawn as a DBN over time slices t=0, t=1, t=2, ..., t=T]
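Independent transitions per stream mean that the joint transition probability over an (audio, visual) state pair factorizes into one term per chain. The following sketch builds that joint table from two small transition matrices; the matrices and the function name are hypothetical.

```python
def joint_transition(trans_audio, trans_visual):
    """Joint transition table over (audio, visual) pairs for independent chains."""
    n_a, n_v = len(trans_audio), len(trans_visual)
    joint = {}
    for a in range(n_a):
        for v in range(n_v):
            joint[(a, v)] = {
                (a2, v2): trans_audio[a][a2] * trans_visual[v][v2]
                for a2 in range(n_a)
                for v2 in range(n_v)
            }
    return joint


# Hypothetical left-to-right transition matrices, one per stream.
trans_audio = [[0.75, 0.25], [0.0, 1.0]]
trans_visual = [[0.5, 0.5], [0.0, 1.0]]
joint = joint_transition(trans_audio, trans_visual)
```

Only the two small matrices have to be estimated; the joint table is derived from them, which keeps the number of transition parameters well below that of an unconstrained model over state pairs.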

slide10

AVSR Integration, Dynamic Bayesian Networks

  • Coupled HMM (1/2)
    • The backbones (the two hidden state chains) depend on each other.

M. Brand, N. Oliver and A. Pentland, ''Coupled hidden Markov models for complex action recognition'', in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 994-999, 1997.

S. Chu and T. Huang, ''Audio-visual speech modeling using coupled hidden Markov models'', ICASSP'02, pp. 2009-2012, 2002.

[Figure: coupled HMM drawn as a DBN over time slices t=0, t=1, t=2, ..., t=T]
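One way to read the dependence between the backbones: in a two-chain coupled HMM, the next state of each chain is conditioned on the previous states of both chains. The sketch below encodes that with two conditional probability tables keyed by the previous (audio, visual) pair. All numbers and names are hypothetical and only illustrate the structure.

```python
def coupled_transition(audio_cpt, visual_cpt, prev, nxt):
    """P(next pair | previous pair) in a two-chain coupled HMM.

    Each chain's next state depends on the previous states of BOTH chains.
    """
    a2, v2 = nxt
    return audio_cpt[prev][a2] * visual_cpt[prev][v2]


# Hypothetical tables: key = previous (audio, visual) states,
# value = distribution over that chain's next state.
audio_cpt = {
    (0, 0): [0.75, 0.25],
    (0, 1): [0.5, 0.5],  # visual chain is ahead, so audio is more likely to advance
    (1, 0): [0.0, 1.0],
    (1, 1): [0.0, 1.0],
}
visual_cpt = {
    (0, 0): [0.75, 0.25],
    (0, 1): [0.0, 1.0],
    (1, 0): [0.5, 0.5],  # audio chain is ahead, so visual is more likely to advance
    (1, 1): [0.0, 1.0],
}

p_catch_up = coupled_transition(audio_cpt, visual_cpt, (0, 1), (1, 1))
p_both_advance = coupled_transition(audio_cpt, visual_cpt, (0, 0), (1, 1))
```

Unlike the factorial case, the audio chain's probability of advancing changes with the visual chain's previous state, so the streams can drift apart but are pulled back toward each other.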

slide11

AVSR Integration, Dynamic Bayesian Networks

  • Coupled HMM (2/2)

A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, ''Dynamic Bayesian Networks for audio-visual speech recognition'', EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.

A. Subramanya, S. Gurbuz, E. Patterson and J.N. Gowdy, ''Audiovisual speech integration using coupled hidden Markov models for continuous speech recognition'', ICASSP'03, 2003.

slide12

AVSR Integration, Dynamic Bayesian Networks

  • Implicit Modeling

J.N. Gowdy, A. Subramanya, C. Bartels and J. Bilmes, ''DBN based Multi-stream models for audio-visual speech recognition'', ICASSP'04, Montreal, Canada, 2004.

X. Lei, G. Ji, T. Ng, J. Bilmes and M. Ostendorf, ''DBN based Multi-stream for Mandarin Toneme Recognition'', ICASSP'05, Philadelphia, USA, 2005.