An Introduction to Speech Perception

An Introduction to Speech Perception. Ph.D. student: Li Yujia, Rain. Supervisor: Prof. Tan Lee. Jan. 28, 2005. Contents: Basic Knowledge; Speech Perception; Perception Theories; Speech Perception versus Music Perception; Applications.


An Introduction to Speech Perception

Ph.D. student: Li Yujia, Rain

Supervisor: Prof. Tan Lee

Jan. 28, 2005

CUHK-EE-DSPSTL

Contents
  • Basic Knowledge
  • Speech Perception
  • Perception Theories
  • Speech Perception versus Music Perception
  • Applications

Basic Knowledge
  • Three levels of speech
  • Segments vs. supra-segments
  • Basic acoustic features
  • Auditory components of human speech perception
  • Basic methodology of perception research

Three levels of speech

  • Linguistic level (speaker): define rules
  • Acoustic level: speech realization
  • Perceptual level (listener): interpretation

Segments vs. Supra-segments

Basic acoustic features

[Figure: speech waveform and spectrogram, with labels for the pitch period 1/f0 and the formants]
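The label 1/f0 on the waveform marks the pitch period: the fundamental frequency f0 is the reciprocal of the time after which the waveform repeats. As a rough illustration (not taken from the slides), the sketch below estimates f0 by finding the lag at which a signal best correlates with itself. The function name, the search range of 60 to 400 Hz, and the synthetic test tone are all assumptions made for the example.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate f0 (Hz) as sample_rate / lag, where lag maximizes the autocorrelation."""
    lag_min = int(sample_rate / f0_max)   # shortest period considered
    lag_max = int(sample_rate / f0_min)   # longest period considered
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic "voiced" signal: a 125 Hz fundamental plus a weaker second harmonic.
sample_rate = 8000
true_f0 = 125.0
signal = [
    math.sin(2 * math.pi * true_f0 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 2 * true_f0 * n / sample_rate)
    for n in range(800)
]
estimated_f0 = estimate_f0(signal, sample_rate)
print("estimated f0 (Hz):", estimated_f0)
print("pitch period 1/f0 (ms):", 1000.0 / estimated_f0)
```

Real speech is only quasi-periodic, so practical pitch trackers add windowing, normalization, and voicing decisions on top of this basic idea.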

Auditory components of human speech perception
  • The peripheral auditory organs (ear): signal processing
  • The auditory nervous system (brain): interpretation of semantics and prosody

Basic methodology of perception research
  • Stimuli: synthesized speech
  • Testing: by human listening
  • Results are affected by
    • Intrinsic factors: attributes of the speech sounds themselves
    • Extrinsic factors: effects resulting from the experimental conditions

Speech Perception
  • Perception of vowels
  • Perception of consonants
  • Perception of prosody

Perception of vowels (1)
  • Vowel sounds are perceptually specified by their formant frequencies.

Spectrogram of an /i/ vowel with first and second formant labeled.

Perception of vowels (2)

  • Evidence
    • From production:
      • Vowel → tongue position → vocal tract shape → formant frequencies.
    • From perception:
      • In synthesized speech, changing the first two formants produces a different vowel sound.
    • From physics:
      • “There is some evidence that the human auditory nerve already reacts directly to formant frequencies.” (Delgutte, 1980)
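The claim that the first two formants specify a vowel can be sketched as a nearest-neighbor lookup in the F1/F2 plane. This is only an illustration under stated assumptions: the three reference vowels and their formant values are approximate textbook-style averages for adult male speakers, and the function name is hypothetical. Real listeners also normalize for speaker and context.

```python
import math

# Assumed reference values (F1, F2) in Hz; approximate averages, for illustration only.
VOWEL_FORMANTS = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def nearest_vowel(f1_hz, f2_hz):
    """Return the reference vowel whose (F1, F2) point is closest to the input."""
    return min(
        VOWEL_FORMANTS,
        key=lambda vowel: math.dist((f1_hz, f2_hz), VOWEL_FORMANTS[vowel]),
    )

print(nearest_vowel(300.0, 2200.0))  # low F1 with high F2
print(nearest_vowel(700.0, 1100.0))  # high F1 with mid F2
print(nearest_vowel(320.0, 900.0))   # low F1 with low F2
```

A plain Euclidean distance in Hz is the simplest choice; a perceptually motivated frequency scale would weight F1 and F2 differences differently.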

Perception of consonants (1)
  • The perception of many consonants depends on the adjacent vowel; stop consonants in particular are cued largely by rapidly changing formant transitions.

Schematic of the first two formant frequency patterns for a /di/ syllable, showing the transition followed by the steady state.

Perception of consonants (2)

Schematic representations of first two formant frequency patterns for /d/ in front of different vowels

  • Lack of acoustic invariance: the lack of something constant in the spectrographic representation (visual representation of speech) to explain the perception of a particular consonant.
  • Locus theory: the second formant frequency transitions all seem to be pointing toward the same frequency which is called locus.
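The locus idea can be illustrated with a small sketch: every second-formant (F2) transition for /d/ is drawn as a straight line that starts at one shared frequency and ends at the F2 of the following vowel. The locus value of 1800 Hz, the vowel F2 targets, and the function name are assumptions chosen for illustration; in real speech the transition only points toward the locus and does not necessarily start there.

```python
D_LOCUS_HZ = 1800.0  # assumed F2 locus for /d/, illustrative value

def f2_track(vowel_f2_hz, n_points=6, locus_hz=D_LOCUS_HZ):
    """Return F2 values moving linearly from the locus to the vowel's steady-state F2."""
    step = (vowel_f2_hz - locus_hz) / (n_points - 1)
    return [locus_hz + i * step for i in range(n_points)]

# Assumed steady-state F2 targets (Hz) for three vowels.
vowel_f2 = {"i": 2290.0, "a": 1090.0, "u": 870.0}

for vowel, target in vowel_f2.items():
    track = f2_track(target)
    direction = "rising" if track[-1] > track[0] else "falling"
    print(f"/d{vowel}/ F2 transition is {direction}:", [round(v) for v in track])
```

The transitions differ in direction and extent from vowel to vowel (the lack of acoustic invariance), yet they all share the same starting point (the locus).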

Perception of consonants (3)

  • What is the basic unit for speech perception?
    • Because stop consonants cannot be isolated from vowels in perception, researchers began to think of speech as encoded (vowels and consonants are squeezed together), perhaps in syllable-sized units.
  • Speech can be presented at a faster rate (30 phonemes per second) than other sounds and still remain intelligible.

Perception of prosody (1)
  • The perception of prosody has been described as dependent on the “melody of speech”: the fluctuations in pitch, rhythm, and stress (Monrad-Krohn, 1947).
  • The related acoustic features are f0, duration, and intensity (energy).

Perception of prosody (2)

  • Perception of prosody is more complex:
    • The definition of prosody is relatively vague.
    • Perceived prosody is not linearly related to the acoustic features (double f0 ≠ double pitch; double duration ≠ double stress).
    • Prosody is perceived over a long time span and in a relative sense (the degree of contrast between the values of the acoustic variables over a number of syllables).
    • A perceived attribute of prosody may be related to several acoustic features (f0 is the most powerful cue to stress, followed by duration and intensity).
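The statement "double f0 ≠ double pitch" can be made concrete with two common frequency-to-pitch mappings. Neither appears in the slides; both are assumptions brought in for illustration: the semitone scale (logarithmic, 12 semitones per doubling) and one widely used mel-scale approximation, mel = 2595 · log10(1 + f/700).

```python
import math

def hz_to_semitones(f_hz, ref_hz):
    """Musical interval between f_hz and ref_hz, in semitones (12 per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)

def hz_to_mel(f_hz):
    """One common mel-scale approximation (assumed here, not from the slides)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

low_f0, high_f0 = 100.0, 200.0  # f0 doubled
interval = hz_to_semitones(high_f0, low_f0)
mel_ratio = hz_to_mel(high_f0) / hz_to_mel(low_f0)

print("interval in semitones:", interval)
print("ratio of mel values:", mel_ratio)
```

Under the mel approximation, doubling f0 from 100 Hz to 200 Hz raises the mel value by a factor smaller than two, and on the semitone scale every doubling adds the same 12 semitones regardless of the starting frequency.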

Perception of prosody (3)

  • Research is relatively sparse.
  • The targets of our research will be:
    • To work from acoustics to perception, determining how one or several acoustic features contribute to perceived naturalness.
    • To improve the naturalness of synthesized speech in an effective way.

Perception Theories
  • Masking
  • Categorical perception
  • Motor theory
  • Analysis-by-synthesis
  • Bottom-up versus top-down

Masking
  • Frequency masking
    • One sound cannot be perceived if another sound close in frequency has a high enough level.
  • Temporal masking
    • A sound cannot be perceived if it is too close in time to another sound.
    • Pre-masking tends to last 5 ms; post-masking can last from 50 to 300 ms.

[Figure: timing diagrams for two sounds A and B, showing pre-masking (about 5 ms) and post-masking (50 to 300 ms)]
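The temporal-masking windows above can be sketched as a simple timing check. This is a deliberately crude model under stated assumptions: it ignores sound levels and frequency content, treats any overlap with the masker as masked, and uses the lower end of the 50 to 300 ms post-masking range by default. The function and parameter names are hypothetical.

```python
PRE_MASKING_MS = 5.0  # pre-masking window from the slide

def is_temporally_masked(target_onset_ms, masker_onset_ms, masker_offset_ms,
                         post_window_ms=50.0):
    """Return True if the target falls inside the masker's pre- or post-masking window.

    Assumes the masker is loud enough to mask; levels are not modeled.
    """
    if target_onset_ms < masker_onset_ms:
        # Target comes first: masked only if the masker follows within about 5 ms.
        return masker_onset_ms - target_onset_ms <= PRE_MASKING_MS
    if target_onset_ms > masker_offset_ms:
        # Target comes after: masked if it falls within the post-masking window.
        return target_onset_ms - masker_offset_ms <= post_window_ms
    return True  # target overlaps the masker in time

# Masker occupies 100 ms to 200 ms.
print(is_temporally_masked(97.0, 100.0, 200.0))    # 3 ms before the masker
print(is_temporally_masked(90.0, 100.0, 200.0))    # 10 ms before the masker
print(is_temporally_masked(230.0, 100.0, 200.0))   # 30 ms after the masker
print(is_temporally_masked(260.0, 100.0, 200.0, post_window_ms=300.0))
```

The asymmetry between the two windows is the point of the sketch: a sound is far more vulnerable when it follows a masker than when it precedes one.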

Categorical perception (1)
  • Voice onset time (VOT) (Lisker and Abramson, 1964)
    • Voiced versus voiceless: whether or not the vocal folds vibrate (e.g. /z/ versus /s/).
    • The difference between voiced and voiceless stop consonants (e.g. /b/ and /p/; /d/ and /t/; /g/ and /k/) is actually one of the relative timing of the onset of vocal fold vibration.
    • This timing difference is referred to as voice onset time (VOT).

Categorical perception (2)

  • Voice onset time (VOT)
    • Voiced stop consonants have a relatively short VOT, whereas voiceless stop consonants have a longer VOT.

[Figure: VOT measure for a /b/]

[Figure: VOT measure for a /p/]

Categorical perception (3)

  • VOT categories
    • From production: VOT productions of a single normal adult speaker of American English for words beginning with /d/ and /t/.
    • From perception: identification functions of a single listener for a VOT continuum from /d/ to /t/ in approximately 11 ms steps. Each stimulus is presented 10 times in random order.

Categorical perception (4)

  • Categorical Perception
    • The insensitivity to differences within a category, but keen sensitivity to cross-category differences, is referred to as categorical perception.
    • It’s characteristic of certain speech sound distinctions, and it’s generally not found for nonspeech sounds (Cutting, 1972).
    • It is one of the human perceptual mechanisms for coping rapidly with a tremendous amount of variation (nonessential variation within a category is ignored).
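Categorical perception along a VOT continuum can be sketched with a steep identification function and a discrimination prediction derived from it. Everything numeric here is an assumption for illustration: the logistic shape, the 35 ms boundary, the slope, and the rule that predicted ABX accuracy equals 0.5 + 0.5 · (p1 − p2)², which follows the classic idea that listeners discriminate only as well as they label.

```python
import math

def p_voiceless(vot_ms, boundary_ms=35.0, slope_per_ms=0.5):
    """Assumed probability of labeling a stimulus /t/ rather than /d/."""
    return 1.0 / (1.0 + math.exp(-slope_per_ms * (vot_ms - boundary_ms)))

def predicted_abx(vot_a_ms, vot_b_ms):
    """Predicted proportion correct in ABX discrimination from labeling alone."""
    difference = p_voiceless(vot_a_ms) - p_voiceless(vot_b_ms)
    return 0.5 + 0.5 * difference ** 2

continuum = [i * 11 for i in range(7)]  # 0 to 66 ms in 11 ms steps

for vot in continuum:
    print(f"VOT {vot:2d} ms: P(/t/) = {p_voiceless(vot):.3f}")

for a, b in zip(continuum, continuum[1:]):
    print(f"pair {a:2d}/{b:2d} ms: predicted ABX = {predicted_abx(a, b):.3f}")
```

Pairs that fall on the same side of the boundary are predicted to be discriminated at about chance, while the pair that straddles the boundary is predicted to be discriminated well, even though every pair differs by the same 11 ms.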

Motor theory (1)
  • Motor commands:
    • The neural message that the brain sends to set the articulators in motion to produce speech.
  • Motivation:
    • When a stop consonant is produced in various vowel contexts, its acoustic form is not invariant; the argument is that there must instead be constant motor commands to the articulators that produce the same consonant.

Motor theory (2)

  • Original theory:
    • “Though we cannot exclude the possibility that a purely auditory decoder exists, we find it more plausible to assume that speech is perceived by processes that are also involved in its production” (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967).

Motor theory (3)

()

  • Weak version:
    • Speech production offers important cues about speech perception which can be used by listeners.
  • Strong version:
    • Speech production forms the basis for speech perception.

Analysis-by-synthesis

Listeners are hypothesized to decode the acoustic signal by internally generating matching signals.

The signal that provides the best match is the one “perceived” by the listener.
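The analysis-by-synthesis loop can be sketched as follows: for each candidate, generate an internal signal, compare it with the incoming signal, and report the candidate with the smallest mismatch. This is a toy model under stated assumptions: the "synthesizer" is just a sum of two sinusoids at assumed formant frequencies, the mismatch is a squared error on the raw samples, and all names are hypothetical.

```python
import math

# Assumed candidate vowels with illustrative (F1, F2) values in Hz.
CANDIDATES = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def synthesize(formants, n_samples=160, sample_rate=8000):
    """Toy internal synthesizer: one sinusoid per formant frequency."""
    return [
        sum(math.sin(2 * math.pi * f * n / sample_rate) for f in formants)
        for n in range(n_samples)
    ]

def perceive(signal):
    """Return the candidate whose synthesized signal best matches the input."""
    def mismatch(label):
        guess = synthesize(CANDIDATES[label], n_samples=len(signal))
        return sum((s - g) ** 2 for s, g in zip(signal, guess))
    return min(CANDIDATES, key=mismatch)

# Incoming signal: a quieter, slightly perturbed version of the "a" pattern.
heard = [0.8 * s + 0.05 * math.sin(n) for n, s in enumerate(synthesize(CANDIDATES["a"]))]
print("perceived:", perceive(heard))
```

An exhaustive search over three candidates is only practical in a toy; the theory itself leaves open how the set of internal hypotheses is generated and narrowed.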

Bottom-up versus top-down (1)
  • Bottom-up:
    • Use the acoustic information to discover what is being uttered.
  • Top-down:
    • Use linguistic information to work out what is being uttered.

Bottom-up versus top-down (2)

  • Bottom-up information is important at the beginning of an utterance, while top-down information becomes primary as more syllables of a sentence are uttered.
  • The role of top-down information is supported by the observation that good organization and prosody speed up the understanding of speech.
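One way to picture how bottom-up and top-down information combine is to multiply an acoustic score for each candidate word by a context-based expectation and renormalize. This Bayes-style formulation is an assumption introduced for illustration, not a model proposed in the slides; the words, the numbers, and the function name are all hypothetical.

```python
def combine(acoustic_likelihood, linguistic_prior):
    """Combine bottom-up scores with top-down expectations and renormalize.

    Assumes at least one candidate has a nonzero combined score.
    """
    scores = {
        word: acoustic_likelihood[word] * linguistic_prior.get(word, 0.0)
        for word in acoustic_likelihood
    }
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}

# Bottom-up: the acoustics alone cannot decide between the two words.
acoustic = {"beach": 0.5, "peach": 0.5}
# Top-down: hypothetical expectations after hearing "I ate a ripe ...".
context = {"beach": 0.05, "peach": 0.95}

posterior = combine(acoustic, context)
print(posterior)
```

With no preceding context the prior would be flat and the acoustic evidence would dominate, which matches the observation that bottom-up information matters most at the start of an utterance.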

Speech Perception versus Music Perception
  • Physical difference in perception
  • Categorical perception in speech; continuous perception in music
    • We can discriminate about 1200 different pitches in music, but we can absolutely identify only about 7 (Liberman, 1967).
    • For certain sound differences relevant to speech, listeners can accurately discriminate only about as many sounds as they can identify.

[Figure: one panel for speech and one for music]

Applications
  • Speech recognition
  • Speech synthesis
  • Speaker recognition
  • Hearing aids

Summary
  • Speech perception
    • vowel, consonant, prosody
  • Perception theories
    • Masking, categorical perception, motor theory, analysis-by-synthesis, bottom-up and top-down
  • Speech vs. music perception

Conclusions
  • What is known about speech perception is still very limited, especially for prosody perception.
  • A better understanding of speech perception will greatly benefit speech technology.
