An Introduction to Speech Perception
Ph.D. student: Li Yujia, Rain
Supervisor: Prof. Tan Lee
Jan. 28, 2005
CUHK-EE-DSPSTL
Contents
• Basic Knowledge
• Speech Perception
• Perception Theories
• Speech Perception versus Music Perception
• Applications
Basic Knowledge
• Three levels of speech
• Segments vs. supra-segments
• Basic acoustic features
• Auditory components of human speech perception
• Basic methodology of perception research
Three levels of speech
• Linguistic level – the speaker defines the rules
• Acoustic level – the speech is physically realized
• Perceptual level – the listener interprets the signal
Segments vs. Supra-segments
Basic acoustic features
[Figure: speech waveform with the fundamental period 1/f0 marked, and the corresponding spectrogram with formants labeled]
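To make these features concrete, here is a minimal Python sketch of computing a wide-band spectrogram and a rough f0 estimate; it assumes a mono signal `x` at 16 kHz and uses generic numpy/scipy calls, not any code from the talk.

    # Minimal sketch: spectrogram plus a rough f0 estimate for a speech frame.
    # Assumes a mono numpy array `x` sampled at `fs` Hz; parameters are illustrative.
    import numpy as np
    from scipy.signal import spectrogram

    def analyze(x, fs=16000):
        # Wide-band spectrogram: a short (~5 ms) window makes formants visible.
        f, t, S = spectrogram(x, fs=fs, window='hamming',
                              nperseg=int(0.005 * fs), noverlap=int(0.0025 * fs))

        # Rough f0 estimate on a 40 ms frame via autocorrelation:
        # the fundamental period 1/f0 shows up as the first strong peak.
        n = int(0.040 * fs)
        frame = x[:n] * np.hamming(n)
        r = np.correlate(frame, frame, mode='full')[n - 1:]
        lo, hi = int(fs / 400), int(fs / 60)      # search the 60-400 Hz pitch range
        f0 = fs / (lo + np.argmax(r[lo:hi]))
        return f, t, S, f0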
Auditory components of human speech perception
• The peripheral auditory organs – the ear (signal processing)
• The auditory nervous system – the brain (interpretation, e.g. of semantics and prosody)
Basic methodology of perception research
• Stimuli: synthesized speech
• Testing: human listening experiments
• Results are affected by
  • Intrinsic factors: attributable to the speech sounds themselves
  • Extrinsic factors: arising from the experimental conditions
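As a toy illustration of how listener responses from such a test might be tabulated (not the lab's actual tooling), a short Python sketch with invented stimuli:

    # Illustrative sketch: turn identification-test responses into a confusion matrix.
    from collections import Counter

    def confusion_matrix(trials, categories):
        """trials: list of (presented, responded) label pairs."""
        counts = Counter(trials)
        return [[counts[(p, r)] for r in categories] for p in categories]

    # Example: a listener labels synthesized /b/ and /p/ stimuli.
    trials = [('b', 'b'), ('b', 'b'), ('b', 'p'), ('p', 'p'), ('p', 'p'), ('p', 'b')]
    print(confusion_matrix(trials, ['b', 'p']))   # rows: presented, columns: responded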
Speech Perception
• Perception of vowels
• Perception of consonants
• Perception of prosody
Perception of vowels (1)
• Vowel sounds are perceptually specified by their formant frequencies.
[Figure: spectrogram of an /i/ vowel with the first and second formants labeled]
Perception of vowels (2)
• Evidence
  • From production: the vowel determines the tongue position, which shapes the vocal tract and thus the formant frequencies.
  • From perception: speech synthesized from only the first two formants still yields distinct vowel sounds.
  • From physiology: "There is some evidence that the human auditory nerve already reacts directly to formant frequencies." (Delgutte, 1980)
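To show how formants can actually be read from a signal, here is a rough sketch of formant estimation by linear prediction (LPC); the frame length, LPC order, and thresholds are assumptions for illustration, not values from the talk.

    # Rough sketch: estimate formant frequencies of a voiced frame via LPC.
    # Assumes a ~30 ms voiced frame at 16 kHz; order and thresholds are illustrative.
    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_formants(frame, fs=16000, order=12):
        frame = frame * np.hamming(len(frame))
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        # Autocorrelation method: solve the Yule-Walker equations for LPC coefficients.
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        poly = np.concatenate(([1.0], -a))        # A(z) = 1 - sum(a_k z^-k)
        roots = np.roots(poly)
        roots = roots[np.imag(roots) > 0]         # keep one root of each conjugate pair
        freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
        return freqs[freqs > 90]                  # lowest entries approximate F1, F2, ...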
Perception of consonants (1)
• The perception of many consonants depends on the neighboring vowels; the perception of stop consonants in particular depends on the rapidly changing formant transitions.
[Figure: schematic of the first two formant frequencies for a /di/ syllable, showing the transition followed by the steady state]
Perception of consonants (2)
• Lack of acoustic invariance: there is nothing constant in the spectrographic (visual) representation of speech that explains the perception of a particular consonant.
• Locus theory: the second formant transitions all seem to point toward the same frequency, which is called the locus.
[Figure: schematic first two formant patterns for /d/ in front of different vowels]
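The locus idea can be illustrated with a toy Python sketch that generates F2 transitions from a common locus toward different vowel targets; the locus value (roughly the classic ~1800 Hz figure for alveolars) and the vowel F2 targets are illustrative assumptions.

    # Toy sketch of locus theory: F2 transitions for /d/ before different vowels
    # all point back toward one "locus" frequency. All numbers are illustrative.
    import numpy as np

    locus = 1800.0                                     # assumed F2 locus for /d/ (Hz)
    vowel_f2 = {'i': 2300.0, 'a': 1200.0, 'u': 900.0}  # rough steady-state F2 targets

    t = np.linspace(0.0, 0.05, 50)                     # 50 ms transition
    for vowel, target in vowel_f2.items():
        # Exponential glide from the locus toward the vowel's steady-state F2.
        f2 = target + (locus - target) * np.exp(-t / 0.015)
        print(vowel, round(f2[0]), '->', round(f2[-1]))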
Perception of consonants (3)
• What is the basic unit of speech perception?
• Because stop consonants cannot be perceptually isolated from vowels, researchers began to think of speech as encoded (vowels and consonants are squeezed together), perhaps in syllable-sized units.
• Speech can be presented at a faster rate (up to about 30 phonemes per second) than other sounds and still remain intelligible.
Perception of prosody (1)
• The perception of prosody has been described as depending on the "melody of speech": the fluctuations in pitch, rhythm, and stress (Monrad-Krohn, 1947).
• The related acoustic features are f0, duration, and intensity.
Perception of prosody (2)
• Perception of prosody is more complex because:
  • The definition of prosody is relatively vague.
  • The perception of prosody is nonlinear in the acoustic features (doubling f0 does not double the pitch; doubling duration does not double the stress).
  • It is perceived over a longer time span, in a relative sense (the degree of contrast between the values of the acoustic variables over a number of syllables).
  • A perceived attribute of prosody may be related to several acoustic features (f0 is the most powerful cue to stress, followed by duration and intensity).
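A small sketch of the f0-to-pitch nonlinearity mentioned above: doubling f0 adds a fixed musical interval (one octave, 12 semitones) rather than doubling the perceived pitch value.

    # Pitch interval relative to a reference frequency, in semitones (a log scale).
    import numpy as np

    def semitones(f0_hz, ref_hz=100.0):
        return 12.0 * np.log2(f0_hz / ref_hz)

    print(semitones(100.0), semitones(200.0), semitones(400.0))   # 0, 12, 24 semitones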
Perception of prosody (3)
• Research on prosody perception is relatively sparse.
• The targets of our research will be:
  • To determine, going from acoustics to perception, how one or several acoustic features contribute to perceived naturalness.
  • To improve the naturalness of synthesized speech in an effective way.
Perception Theories
• Masking
• Categorical perception
• Motor theory
• Analysis-by-synthesis
• Bottom-up versus top-down
Masking
• Frequency masking
  • One sound cannot be perceived if another sound close in frequency has a high enough level.
• Temporal masking
  • A sound cannot be perceived if it is too close in time to another, louder sound.
  • Pre-masking tends to last about 5 ms; post-masking can last from 50 to 300 ms.
[Figure: masker A and masked sound B, with pre-masking (~5 ms before A) and post-masking (50-300 ms after A)]
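A minimal sketch of simultaneous (frequency) masking, using Zwicker's Bark-scale approximation; the spreading slopes and the offset below the masker level are rough textbook-style assumptions, not values from the talk.

    # Illustrative sketch: how much a masker raises the detection threshold of a
    # nearby probe tone. Slopes and offset are rough, assumed values.
    import numpy as np

    def bark(f_hz):
        # Zwicker's approximation of the Bark (critical-band) scale.
        return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

    def masked_threshold(probe_hz, masker_hz, masker_db):
        dz = bark(probe_hz) - bark(masker_hz)
        # Triangular spreading: steeper below the masker than above (assumed slopes).
        slope = 27.0 if dz < 0 else 12.0
        return masker_db - 10.0 - slope * abs(dz)

    # A 60 dB masker at 1 kHz: a 1.1 kHz probe must exceed this level to be heard.
    print(masked_threshold(1100.0, 1000.0, 60.0))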
Categorical perception (1)
• Voice onset time (VOT) (Lisker and Abramson, 1964)
  • Voiced versus voiceless: whether the vocal folds vibrate (e.g. /z/ versus /s/).
  • The difference between voiced and voiceless stop consonants (e.g. /b/ and /p/, /d/ and /t/, /g/ and /k/) is actually one of the relative timing of the onset of vocal fold vibration.
  • This timing difference is referred to as voice onset time (VOT).
Categorical perception (2)
• Voice onset time (VOT)
  • Voiced stop consonants have a relatively short VOT, whereas voiceless stop consonants have a longer VOT.
[Figure: waveforms showing the VOT measure for a /b/ and for a /p/]
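A rough sketch of how VOT might be measured automatically from a waveform: find the release burst as a sudden energy rise, then find the voicing onset as the first clearly periodic frame. The thresholds and window sizes are illustrative guesses, not a validated procedure.

    # Rough sketch: VOT = time from the burst (energy rise) to the onset of voicing
    # (periodicity). All thresholds are illustrative assumptions.
    import numpy as np

    def measure_vot(x, fs=16000):
        hop = int(0.002 * fs)                          # analyse every 2 ms
        short, long = int(0.005 * fs), int(0.030 * fs)

        energy = np.array([np.sum(x[i:i + short] ** 2)
                           for i in range(0, len(x) - long, hop)])
        burst = int(np.argmax(energy > 0.1 * energy.max()))   # first strong energy rise

        for i in range(burst, len(energy)):
            frame = x[i * hop:i * hop + long]
            frame = frame - np.mean(frame)
            r = np.correlate(frame, frame, mode='full')[long - 1:]
            lo, hi = int(fs / 400), int(fs / 60)               # pitch lags for 60-400 Hz
            if r[0] > 0 and np.max(r[lo:hi]) > 0.5 * r[0]:     # periodic -> voicing onset
                return (i - burst) * hop / fs                  # VOT in seconds
        return None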
Categorical perception (3)
• VOT categories
  • From production: VOT productions of a single normal adult speaker of American English for words beginning with /d/ and /t/.
  • From perception: identification functions of a single listener for a VOT continuum from /d/ to /t/ in approximately 11 ms steps, with each stimulus presented 10 times in random order.
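An identification function like the one above is often summarized by fitting a logistic curve; the sketch below does this for invented /d/-/t/ response proportions along a VOT continuum. The steep boundary is the signature of categorical perception.

    # Sketch: fit a logistic identification function to hypothetical responses.
    # Data values are invented for illustration.
    import numpy as np
    from scipy.optimize import curve_fit

    vot_ms = np.array([0, 11, 22, 33, 44, 55, 66], dtype=float)   # stimulus continuum
    prop_t = np.array([0.0, 0.0, 0.1, 0.5, 0.9, 1.0, 1.0])        # proportion "t" responses

    def logistic(v, boundary, slope):
        return 1.0 / (1.0 + np.exp(-slope * (v - boundary)))

    (boundary, slope), _ = curve_fit(logistic, vot_ms, prop_t, p0=[30.0, 0.2])
    print(f"category boundary ~{boundary:.1f} ms VOT, slope {slope:.2f}")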
Categorical perception (4)
• Categorical perception
  • The insensitivity to differences within a category, combined with keen sensitivity to cross-category differences, is referred to as categorical perception.
  • It is characteristic of certain speech sound distinctions and is generally not found for nonspeech sounds (Cutting, 1972).
  • It represents one of the perceptual mechanisms by which humans cope with a tremendous amount of variation rapidly (nonessential variation within a category is ignored).
Motor theory (1)
• Motor commands:
  • The neural messages that the brain sends to set the articulators in motion to produce speech.
• Motivation:
  • When a stop consonant is produced in various vowel contexts, there is no acoustic invariance, yet there must be constant motor commands to the articulators to produce the same consonant.
Motor theory (2)
• Original theory:
  • "Though we cannot exclude the possibility that a purely auditory decoder exists, we find it more plausible to assume that speech is perceived by processes that are also involved in its production" (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967).
Motor theory (3)
• Weak version:
  • Speech production offers important cues about speech perception which can be used by listeners.
• Strong version:
  • Speech production forms the basis for speech perception.
Analysis-by-synthesis
• Listeners are hypothesized to decode the acoustic signal by internally generating matching signals. The signal that provides the best match is the one "perceived" by the listener.
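A toy Python sketch of this idea: compare the incoming spectrum against internally generated candidates and "perceive" the best match. The candidate set, the stand-in synthesizer, and the distance measure are all placeholders for illustration.

    # Toy analysis-by-synthesis loop with a stand-in "internal production model".
    import numpy as np

    def synthesize(hypothesis):
        # Placeholder production model: an expected spectral shape with two formant peaks.
        f1, f2 = hypothesis
        freqs = np.linspace(0, 4000, 256)
        return np.exp(-((freqs - f1) / 150.0) ** 2) + np.exp(-((freqs - f2) / 200.0) ** 2)

    def perceive(observed_spectrum, candidates):
        # Pick the hypothesis whose synthesized spectrum best matches the observation.
        return min(candidates, key=lambda h: np.sum((synthesize(candidates[h]) -
                                                     observed_spectrum) ** 2))

    candidates = {'/i/': (300, 2300), '/a/': (700, 1200), '/u/': (350, 900)}
    observed = synthesize((320, 2250))            # an /i/-like input
    print(perceive(observed, candidates))         # -> '/i/'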
Bottom-up versus top-down (1)
• Bottom-up:
  • Use the acoustic information to discover what is being uttered.
• Top-down:
  • Use linguistic knowledge to predict or constrain what is being uttered.
Bottom-up versus top-down (2)
• Bottom-up information is most important at the beginning of an utterance, while top-down information becomes primary as more syllables of the sentence are uttered.
• The role of top-down information is supported by the observation that good organization and prosody speed up the understanding of speech.
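One way this combination is commonly realized in speech recognizers is to weight a bottom-up acoustic score by a top-down language-model prior; the sketch below uses invented numbers purely for illustration.

    # Toy sketch: acoustic (bottom-up) log-likelihoods weighted by a language-model
    # (top-down) log-prior. All numbers are invented for illustration.
    acoustic_loglik = {'wreck a nice beach': -42.0, 'recognize speech': -43.5}
    lm_logprob      = {'wreck a nice beach': -12.0, 'recognize speech': -6.0}

    lm_weight = 1.0   # how much the top-down prior counts relative to the acoustics
    scores = {h: acoustic_loglik[h] + lm_weight * lm_logprob[h] for h in acoustic_loglik}
    print(max(scores, key=scores.get))   # the top-down prior favours "recognize speech"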
Speech Perception versus Music Perception
• Physical differences in perception
  • Categorical perception in speech; continuous perception in music.
  • In music we can discriminate about 1,200 different pitches, but we can only absolutely identify about 7 (Liberman, 1967).
  • For certain sound differences relevant to speech, listeners can only discriminate accurately about as many sounds as they can identify.
Applications
• Speech recognition
• Speech synthesis
• Speaker recognition
• Hearing aids
Summary
• Speech perception
  • Vowels, consonants, prosody
• Perception theories
  • Masking, categorical perception, motor theory, analysis-by-synthesis, bottom-up and top-down
• Speech vs. music perception
Conclusions
• What we know about speech perception is still very limited, especially for the perception of prosody.
• A better understanding of speech perception will greatly benefit speech technology.
References
• Jack Ryalls, 1996. A Basic Introduction to Speech Perception. San Diego, Calif.: Singular Publishing Group.
• Gloria J. Borden, Katherine S. Harris, Lawrence J. Raphael, 2003. "Speech perception", chapter 6 in Speech Science Primer: Physiology, Acoustics, and Perception of Speech. Philadelphia: Lippincott Williams & Wilkins.
• Raymond D. Kent, 1997. "Speech perception", chapter 10 in The Speech Sciences. San Diego: Singular Publishing Group.
• Richard B. Ivry and Lynn C., 1998. "Speech perception and language", chapter 6 in The Two Sides of Perception. Cambridge, Mass.: MIT Press.
• J. M. Pickett, 1999. The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston: Allyn and Bacon.
• Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 2001. "Spoken language structure", chapter 2 in Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, N.J.: Prentice Hall PTR.
• J. Liu, 2001. Tonal Behavior in Some Tone Languages. Ph.D. dissertation, City University of Hong Kong.
• Chu Min, Lu Shinan, Si Hongyan, He Lin, Guan Dinghua, 1996. "The control of juncture and prosody in Chinese TTS system", in Proceedings of ICSLP 1996, volume 1, pp. 725-728.
• Pagel, V., Carbonell, N., Laprie, Y., 1996. "A new method for speech delexicalization, and its application to the perception of French prosody", in Proceedings of ICSLP 1996, volume 2, pp. 821-824.
• Heuft, B., Portele, T., 1996. "Synthesizing prosody: a prominence-based approach", in Proceedings of ICSLP 1996, volume 3, pp. 1361-1364.
• Vainio, M., Jarvikivi, J., Werner, S., Volk, N., Valikangas, J., 2002. "Effect of prosodic naturalness on segmental acceptability in synthetic speech", in Proceedings of the 2002 IEEE Workshop on Speech Synthesis, pp. 143-146.
• Yong-Ju Lee, Sook-Hyang Lee, 1996. "On phonetic characteristics of pause in the Korean read speech", in Proceedings of ICSLP 1996, volume 1, pp. 118-120.
• House, D., 1996. "Differential perception of tonal contours through the syllable", in Proceedings of ICSLP 1996, volume 4, pp. 2048-2051.