
An Introduction to Speech Perception



  1. An Introduction to Speech Perception Ph.D. student: Li Yujia (Rain) Supervisor: Prof. Tan Lee Jan. 28, 2005 CUHK-EE-DSPSTL

  2. Contents • Basic Knowledge • Speech Perception • Perception Theories • Speech Perception versus Music Perception • Applications

  3. Basic Knowledge • Three levels of speech • Segments vs. supra-segments • Basic acoustic features • Auditory components of human speech perception • Basic methodology of perception research

  4. Three levels of speech • Linguistic level (speaker): defines the rules • Acoustic level: realization as speech • Perceptual level (listener): interpretation

  5. Segments vs. Supra-segments

  6. Basic acoustic features • waveform (fundamental period 1/f0) • spectrogram (formants)
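
As a concrete illustration of these features, below is a minimal Python sketch (my own addition, not part of the original slides) that reads a waveform, computes a spectrogram in which the formants appear as energy bands, and estimates the fundamental period 1/f0 by autocorrelation. The file name and all analysis parameters are illustrative assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("vowel.wav")                  # hypothetical mono recording
x = x.astype(np.float64) / (np.max(np.abs(x)) + 1e-12)

# Spectrogram: formants show up as dark horizontal bands of energy.
f, t, S = spectrogram(x, fs=fs, nperseg=int(0.025 * fs), noverlap=int(0.015 * fs))

# Rough estimate of the fundamental period 1/f0 from one 40 ms frame,
# searching a plausible speech f0 range of 60-400 Hz.
frame = x[:int(0.040 * fs)]
ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
lo, hi = int(fs / 400), int(fs / 60)
period = lo + np.argmax(ac[lo:hi])                 # fundamental period in samples
print("1/f0 ~ %.1f ms, f0 ~ %.1f Hz" % (1000.0 * period / fs, fs / period))
```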

  7. Auditory components of human speech perception • The peripheral auditory organs – ear (signal processing) • The auditory nervous system – brain (interpretation of semantics and prosody)

  8. Basic methodology of perception research • Stimuli: synthesized speech • Testing: human listening tests • Results are affected by • Intrinsic factors: attributes of the speech sounds themselves • Extrinsic factors: resulting from the experimental conditions

  9. Speech Perception • Perception of vowels • Perception of consonants • Perception of prosody

  10. Perception of vowels (1) • Vowel sounds are perceptually specified by their formant frequencies. Spectrogram of an /i/ vowel with the first and second formants labeled.
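
To make the link between vowels and formants concrete, here is a hedged sketch, not the authors' method, of estimating the first two formants of a vowel frame with standard LPC analysis in numpy/scipy. The file name, frame length, and LPC order are assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

fs, x = wavfile.read("vowel_i.wav")                 # hypothetical recording of /i/
x = x.astype(np.float64)
n = int(0.030 * fs)
frame = x[:n] * np.hamming(n)                       # one 30 ms analysis frame
frame = lfilter([1.0, -0.97], [1.0], frame)         # pre-emphasis

order = int(fs / 1000) + 2                          # common rule of thumb for LPC order
r = np.correlate(frame, frame, mode="full")[n - 1:]
a = solve_toeplitz(r[:order], r[1:order + 1])       # autocorrelation-method LPC coefficients
roots = np.roots(np.concatenate(([1.0], -a)))       # roots of A(z) = 1 - sum(a_k z^-k)
roots = roots[np.imag(roots) > 0.01]                # keep one root per conjugate pair
freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
freqs = freqs[freqs > 90.0]                         # drop near-DC roots
print("F1, F2 estimates:", freqs[:2].round(0), "Hz")
```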

  11. Perception of vowels (2) • Evidence • From production: vowel identity determines tongue position and vocal tract shape, which determine the formant frequencies. • From perception: synthesized speech containing only the first two formants is heard as different vowel sounds. • From physiology: “There is some evidence that the human auditory nerve already reacts directly to formant frequencies.” (Delgutte, 1980)

  12. Perception of consonants (1) • In perception, many consonants depend on the neighboring vowels; the perception of stop consonants depends largely on the rapidly changing formant transitions into the vowel's steady state. Schematic of the first two formant frequency patterns for a /di/ syllable, showing the transition and the steady state.

  13. Perception of consonants (2) Schematic representations of the first two formant frequency patterns for /d/ in front of different vowels. • Lack of acoustic invariance: there is nothing constant in the spectrographic representation (the visual representation of speech) to explain the perception of a particular consonant. • Locus theory: the second formant frequency transitions all seem to point toward the same frequency, which is called the locus.

  14. Perception of consonants (3) • What is the basic unit for speech perception? • Because we cannot isolate stop consonants from vowels in perception, researchers began to think of speech as encoded (vowels and consonants are squeezed together), perhaps in syllable-sized units. • Speech can be presented at a faster rate (30 phonemes per second) than other sounds and still remain perceptually intelligible.

  15. Perception of prosody (1) • The perception of prosody has been described as dependent on the “melody of speech”: the fluctuations in pitch, rhythm, and stress (Monrad-Krohn, 1947). • The related acoustic features are f0, duration, and intensity.
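
As an illustration of how these three correlates can be measured, the following sketch (my own assumption, not from the slides) extracts a crude f0 contour, an intensity contour, and the overall duration from a recorded utterance; the file name and frame parameters are illustrative.

```python
import numpy as np
from scipy.io import wavfile

fs, x = wavfile.read("utterance.wav")               # hypothetical utterance
x = x.astype(np.float64) / (np.max(np.abs(x)) + 1e-12)
hop, win = int(0.010 * fs), int(0.030 * fs)

f0_track, intensity_db = [], []
for start in range(0, len(x) - win, hop):
    frame = x[start:start + win]
    rms = np.sqrt(np.mean(frame ** 2))
    intensity_db.append(20.0 * np.log10(rms + 1e-12))      # intensity contour
    ac = np.correlate(frame, frame, mode="full")[win - 1:]
    lo, hi = int(fs / 400), int(fs / 60)                    # plausible f0 range 60-400 Hz
    f0_track.append(fs / (lo + np.argmax(ac[lo:hi])))       # crude f0 contour

duration_s = len(x) / fs                                    # utterance duration
print("duration %.2f s, median f0 %.0f Hz, mean intensity %.1f dB"
      % (duration_s, np.median(f0_track), np.mean(intensity_db)))
```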

  16. Perception of prosody (2) • Perception of prosody is more complex because of: • The relatively vague definition. • The nonlinear relation between perception and the acoustic features (doubling f0 does not double the pitch; doubling duration does not double the stress). • Perception over a longer time span, in a relative sense (the degree of contrast between the values of the acoustic variables over a number of syllables). • A perceived attribute of prosody may be related to several acoustic features (f0 is the most powerful cue to stress, followed by duration and intensity).
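
A small worked example of the nonlinearity mentioned above: using the common mel approximation of perceived pitch, doubling f0 clearly does not double the pitch. This is one standard psychoacoustic scale chosen here for illustration, not a scale claimed by the slides.

```python
import math

def hz_to_mel(f_hz):
    # O'Shaughnessy's mel formula, a common approximation of perceived pitch
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for f in (100.0, 200.0, 1000.0, 2000.0):
    print(f, round(hz_to_mel(f), 1))
# 100 Hz -> ~150 mel and 200 Hz -> ~283 mel: doubling f0 falls short of doubling
# the pitch, and the shortfall grows at higher frequencies
# (1000 Hz -> ~1000 mel, 2000 Hz -> ~1522 mel).
```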

  17. Perception of prosody (3) • Research is relatively sparse. • The target of our research will be: • Going from acoustics to perception, determine how one or several acoustic features contribute to the perceived naturalness. • Improve the naturalness of synthesized speech in an effective way.

  18. Perception Theories • Masking • Categorical perception • Motor theory • Analysis-by-synthesis • Bottom-up versus top-down

  19. Masking • Frequency masking • One sound cannot be perceived if another sound close in frequency has a high enough level. • Temporal masking • A sound cannot be perceived if it is too close in time to another sound. • Pre-masking tends to last about 5 ms; post-masking can last from 50 to 300 ms.
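
A toy sketch of the temporal masking windows quoted above (my own illustration; the 10 dB level gap and the 200 ms post-masking value are arbitrary choices within the stated range):

```python
PRE_MS, POST_MS = 5.0, 200.0          # pre-masking ~5 ms; post-masking picked from 50-300 ms

def temporally_masked(probe_ms, masker_ms, level_gap_db):
    """True if a probe this much quieter than the masker falls inside its masking window."""
    if level_gap_db < 10.0:           # illustrative threshold: similar levels -> no masking
        return False
    dt = probe_ms - masker_ms
    return (-PRE_MS <= dt < 0) or (0 <= dt <= POST_MS)

print(temporally_masked(103.0, 100.0, 30.0))   # True: 3 ms after a much louder masker
print(temporally_masked(450.0, 100.0, 30.0))   # False: outside the post-masking window
```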

  20. Categorical perception (1) • Voice onset time (VOT) (Lisker and Abramson, 1964) • Voiced versus voiceless (whether the vocal folds vibrate, e.g. /z/ and /s/) • The difference between voiced and voiceless stop consonants (e.g. /b/ and /p/; /d/ and /t/; /g/ and /k/) is actually one of the relative timing of the onset of vocal fold vibration. • This timing difference is referred to as voice onset time (VOT).

  21. Categorical perception (2) • Voice onset time (VOT) • Voiced stop consonants have a relatively short VOT, whereas voiceless stop consonants have a longer VOT. Illustrations of the VOT measure for a /b/ and for a /p/.
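
To make the VOT measure concrete, here is a rough sketch, not the procedure used in the cited studies, that estimates VOT from a stop-plus-vowel token as the time from the release burst (first energetic frame) to the onset of voicing (first clearly periodic frame). The file name and all thresholds are assumptions.

```python
import numpy as np
from scipy.io import wavfile

fs, x = wavfile.read("pa_token.wav")               # hypothetical /pa/ recording
x = x.astype(np.float64) / (np.max(np.abs(x)) + 1e-12)
hop, win = int(0.002 * fs), int(0.025 * fs)

burst_t, voice_t = None, None
for start in range(0, len(x) - win, hop):
    frame = x[start:start + win]
    energy = np.sqrt(np.mean(frame ** 2))
    if burst_t is None and energy > 0.05:          # first energetic frame ~ release burst
        burst_t = start / fs
    if burst_t is not None:
        ac = np.correlate(frame, frame, mode="full")[win - 1:]
        lo, hi = int(fs / 400), int(fs / 60)
        periodicity = np.max(ac[lo:hi]) / (ac[0] + 1e-12)
        if periodicity > 0.5 and energy > 0.05:    # first clearly periodic frame ~ voicing onset
            voice_t = start / fs
            break

if burst_t is not None and voice_t is not None:
    print("VOT ~ %.0f ms" % ((voice_t - burst_t) * 1000.0))
```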

  22. Categorical perception (3) • VOT categories • From production: VOT productions of a single normal adult speaker of American English for words beginning with /d/ and /t/. • From perception: identification functions of a single listener for a VOT continuum from /d/ to /t/ in approximately 11 ms steps; each stimulus is presented 10 times in random order.

  23. Categorical perception (4) • Categorical perception • The insensitivity to differences within a category, combined with keen sensitivity to cross-category differences, is referred to as categorical perception. • It is characteristic of certain speech sound distinctions, and it is generally not found for nonspeech sounds (Cutting, 1972). • It represents one of the human perceptual mechanisms for coping rapidly with a tremendous amount of variation (nonessential variation within a category is ignored).
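
The abrupt category boundary seen in identification functions can be summarized by fitting a logistic curve to the response proportions. The sketch below uses made-up /d/-vs-/t/ responses along an 11 ms VOT continuum purely for illustration; it is not data from the slides.

```python
import numpy as np
from scipy.optimize import curve_fit

vot_ms = np.array([0, 11, 22, 33, 44, 55, 66], dtype=float)   # ~11 ms steps
p_t = np.array([0.0, 0.0, 0.1, 0.5, 0.9, 1.0, 1.0])           # hypothetical proportion of /t/ responses

def logistic(v, boundary, slope):
    return 1.0 / (1.0 + np.exp(-slope * (v - boundary)))

(boundary, slope), _ = curve_fit(logistic, vot_ms, p_t, p0=[30.0, 0.2])
print("category boundary ~ %.1f ms VOT, slope %.2f" % (boundary, slope))
# A steep slope means an abrupt switch between categories: within-category VOT
# steps barely change the responses, while the cross-boundary step flips them.
```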

  24. Motor theory (1) • Motor commands: • The neural messages that the brain sends to set the articulators in motion to produce speech. • Motivation: • When a stop consonant is produced in various vowel contexts, because of the lack of acoustic invariance, there must be constant motor commands to the articulators to produce the same consonant.

  25. Motor theory (2) • Original theory: • “Though we cannot exclude the possibility that a purely auditory decoder exists, we find it more plausible to assume that speech is perceived by processes that are also involved in its production” (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967).

  26. Motor theory (3) • Weak version: • Speech production offers important cues about speech perception which can be used by listeners. • Strong version: • Speech production forms the basis for speech perception.

  27. Analysis-by-synthesis • Listeners are hypothesized to decode the acoustic signal by internally generating matching signals. The signal that provides the best match is the one “perceived” by the listener.
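
A toy sketch of the analysis-by-synthesis idea (my own illustration, with made-up formant targets and a trivial sinusoidal generator): candidate signals are generated internally, and the candidate whose synthesis best matches the observed signal is the one "perceived".

```python
import numpy as np

fs, dur = 8000, 0.2
t = np.arange(int(fs * dur)) / fs

def synthesize(f1, f2):
    """Hypothetical internal generator: two formant-like sinusoids."""
    return np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

# Candidate vowels with illustrative (F1, F2) targets.
candidates = {"/i/": (300, 2300), "/a/": (700, 1200), "/u/": (350, 800)}

# Incoming signal: an /i/-like token corrupted by noise.
observed = synthesize(*candidates["/i/"]) + 0.3 * np.random.randn(len(t))

# Analysis-by-synthesis: generate each candidate internally, keep the best match.
best = min(candidates,
           key=lambda v: np.mean((synthesize(*candidates[v]) - observed) ** 2))
print("perceived as", best)
```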

  28. Bottom-up versus top-down (1) • Bottom-up: • Use the acoustic information to discover what is being uttered. • Top-down: • Use linguistic information (context and knowledge of the language) to predict what is being uttered.

  29. Bottom-up versus top-down (2) • Bottom-up information is important at the beginning of an utterance, while top-down information becomes primary as more syllables of a sentence are uttered. • The role of top-down information is supported by the fact that good organization and prosody speed up the understanding of speech.
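
One common way to picture the combination of the two information sources is Bayes' rule: a bottom-up acoustic likelihood is weighed against a top-down linguistic prior from the preceding context. The numbers below are invented for illustration and are not from the slides.

```python
import numpy as np

words = ["two", "too", "blue"]
# Bottom-up: hypothetical acoustic likelihoods P(signal | word); "two" and "too"
# sound alike, so the acoustics alone cannot separate them.
acoustic = np.array([0.40, 0.40, 0.20])
# Top-down: hypothetical language-model prior P(word | "the sky is ...").
prior = np.array([0.05, 0.05, 0.90])

posterior = acoustic * prior
posterior /= posterior.sum()
print(dict(zip(words, posterior.round(3))))   # top-down context pulls the decision to "blue"
```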

  30. Speech Perception versus Music Perception • Physical differences in perception • Categorical perception in speech; continuous perception in music • We can discriminate about 1200 different pitches in music, but we can only absolutely identify about 7 (Liberman, 1967). • For certain sound differences relevant to speech, listeners can only discriminate accurately about as many sounds as they can identify.

  31. Applications • Speech recognition • Speech synthesis • Speaker recognition • Hearing aids

  32. Summary • Speech perception • vowel, consonant, prosody • Perception theories • Masking, categorical perception, motor theory, analysis-by-synthesis, bottom-up and top-down • Speech vs. music perception

  33. Conclusions • What we know about speech perception is still very limited, especially about prosody perception. • A better understanding of speech perception will greatly help speech technology.

  34. References • Jack Ryalls, 1996. A Basic Introduction to Speech Perception. San Diego, Calif.: Singular Pub. Group. • Gloria J. Borden, Katherine S. Harris, Lawrence J. Raphael, 2003. “Speech perception”, chapter 6 in Speech Science Primer: Physiology, Acoustics, and Perception of Speech. Philadelphia: Lippincott Williams & Wilkins. • Raymond D. Kent, 1997. “Speech perception”, chapter 10 in The Speech Sciences. San Diego: Singular Pub. Group. • Richard B. Ivry and Lynn C., 1998. “Speech perception and language”, chapter 6 in The Two Sides of Perception. Cambridge, Mass.: MIT Press. • J. M. Pickett, 1999. The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston: Allyn and Bacon. • Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 2001. “Spoken language structure”, chapter 2 in Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, N.J.: Prentice Hall PTR. • J. Liu, 2001. Tonal Behavior in Some Tone Languages. Ph.D. Dissertation, City University of Hong Kong. • Chu Min, Lu Shinan, Si Hongyan, He Lin, Guan Dinghua, 1996. “The control of juncture and prosody in Chinese TTS system”, in Proceedings of ICSLP 1996, vol. 1, pp. 725-728. • Pagel, V., Carbonell, N., Laprie, Y., 1996. “A new method for speech delexicalization, and its application to the perception of French prosody”, in Proceedings of ICSLP 1996, vol. 2, pp. 821-824. • Heuft, B., Portele, T., 1996. “Synthesizing prosody: a prominence-based approach”, in Proceedings of ICSLP 1996, vol. 3, pp. 1361-1364. • Vainio, M., Jarvikivi, J., Werner, S., Volk, N., Valikangas, J., 2002. “Effect of prosodic naturalness on segmental acceptability in synthetic speech”, in Proceedings of the 2002 IEEE Workshop on Speech Synthesis, pp. 143-146. • Yong-Ju Lee, Sook-Hyang Lee, 1996. “On phonetic characteristics of pause in the Korean read speech”, in Proceedings of ICSLP 1996, vol. 1, pp. 118-120. • House, D., 1996. “Differential perception of tonal contours through the syllable”, in Proceedings of ICSLP 1996, vol. 4, pp. 2048-2051.
