Human and machine performance in speech processing

Human and Machine Performance in Speech Processing. Louis C.W. Pols, Institute of Phonetic Sciences / ACLC, University of Amsterdam, The Netherlands. (Apologies: this presentation resembles keynote at ICPhS’99, San Francisco, CA.) IFA, Herengracht 338, Amsterdam. Welcome. Heraeus-Seminar

Presentation Transcript
Human and Machine Performance in Speech Processing

Louis C.W. Pols

Institute of Phonetic Sciences / ACLC

University of Amsterdam, The Netherlands

(Apologies: this presentation resembles keynote at ICPhS’99, San Francisco, CA)

Herengracht 338




“Speech Recognition and Speech Understanding”

April 3-5, 2000, Physikzentrum Bad Honnef, Germany

Overview

  • Phonetics and speech technology

  • Do recognizers need ‘intelligent ears’?

  • What is knowledge?

  • How good is human/machine speech recognition?

  • How good is synthetic speech?

  • Pre-processor characteristics

  • Useful (phonetic) knowledge

  • Computational phonetics

  • Discussion/conclusions

Phonetics and Speech Technology

Machine performance more difficult, if ...

  • test condition deviates from training condition, because of:

    • nativeness and age of speakers

    • size and content of vocabulary

    • speaking style, emotion, rate

    • microphone, background noise, reverberation, communication channel

    • nonavailability of certain features

  • however, machines never get tired, bored or distracted

Do recognizers need intelligent ears?

  • intelligent ears → front-end pre-processor

  • only if it improves performance

  • humans are generally better speech processors than machines; perhaps system developers can learn from human behavior

  • robustness at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)

What is knowledge?

  • phonetic knowledge

  • probabilistic knowledge from databases

  • fixed set of features vs. adaptable set

  • trading relations, selectivity

  • knowledge of the world, expectation

  • global vs. detailed

    see video

    (with permission from Interbrew Nederland NV)

Video is a metaphor for:

  • from global to detail (world → Europe → Holland → North Sea coast → Scheveningen → beach → young lady → drinking Dommelsch beer)

  • sound → speech → speaker → English → utterance

  • ‘recognize speech’ or ‘wreck a nice beach’

  • zoom in on whatever information is available

  • make intelligent interpretation, given context

  • beware of distractors!

Human auditory sensitivity

  • stationary vs. dynamic signals

  • simple vs. spectrally complex

  • detection threshold

  • just noticeable differences

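Detection thresholds and jnd

[Table: detection thresholds and just noticeable differences for simple, stationary signals and for single-formant-like periodic signals. Values surviving from the original slide: 3 - 5%, 1.5 Hz, 20 - 40%.]

Table 3 in Proc. ICPhS’99 paper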

DL for short speech-like transitions

[Figure: DL for short speech-like transitions; one label survives: "longer trans."]

Adopted from van Wieringen & Pols (Acta Acustica ’98)

How good is human / machine speech recognition?

  • machine SR surprisingly good for certain tasks

  • machine SR could be better for many others

    • robustness, outliers

  • what are the limits of human performance?

    • in noise

    • for degraded speech

    • missing information (trading)

Human word intelligibility vs. noise

humans start to have some trouble

recognizers have trouble!

Adopted from Steeneken (1992)

Robustness to degraded speech

  • speech = time-modulated signal in frequency bands

  • relatively insensitive to (spectral) distortions

    • prerequisite for digital hearing aid

    • modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz

  • temporal smearing of envelope modulation

    • ca. 4 Hz max. in modulation spectrum → syllable

    • LP>4 Hz and HP<8 Hz little effect on intelligibility

  • spectral envelope smearing

    • for BW>1/3 oct masked SRT starts to degrade

      (for references, see paper in Proc. ICPhS’99)
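
The view of speech as a time-modulated signal with a modulation peak near the syllable rate can be illustrated with a toy example. The sketch below is not the analysis used in the studies referred to on this slide. It assumes a synthetic 500 Hz carrier that is amplitude-modulated at 4 Hz, extracts a crude envelope by rectifying and averaging over 10 ms frames, and evaluates a direct DFT of that envelope at a few candidate modulation frequencies. All function names and parameter values are hypothetical.

```python
import math


def envelope(signal, sample_rate, frame_ms=10.0):
    """Crude envelope: mean absolute value per non-overlapping frame."""
    frame_len = int(sample_rate * frame_ms / 1000.0)
    env = [
        sum(abs(x) for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]
    env_rate = sample_rate / frame_len  # envelope samples per second
    return env, env_rate


def modulation_spectrum(env, env_rate, freqs):
    """DFT magnitude of the mean-removed envelope at the given frequencies (Hz)."""
    mean = sum(env) / len(env)
    centered = [e - mean for e in env]
    spectrum = {}
    for f in freqs:
        re = sum(c * math.cos(2 * math.pi * f * n / env_rate)
                 for n, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * f * n / env_rate)
                 for n, c in enumerate(centered))
        spectrum[f] = math.hypot(re, im) / len(centered)
    return spectrum


def dominant_modulation(signal, sample_rate, freqs):
    """Return the candidate modulation frequency with the largest magnitude."""
    env, env_rate = envelope(signal, sample_rate)
    spec = modulation_spectrum(env, env_rate, freqs)
    return max(spec, key=spec.get)


# Assumed test signal: 2 s of a 500 Hz tone with 4 Hz amplitude modulation.
SAMPLE_RATE = 8000
test_signal = [
    (1.0 + 0.8 * math.sin(2 * math.pi * 4.0 * t / SAMPLE_RATE))
    * math.sin(2 * math.pi * 500.0 * t / SAMPLE_RATE)
    for t in range(2 * SAMPLE_RATE)
]
candidate_freqs = [1.0, 2.0, 4.0, 8.0, 16.0]
peak = dominant_modulation(test_signal, SAMPLE_RATE, candidate_freqs)
print("dominant modulation frequency (Hz):", peak)
```

For real speech one would normally compute the envelope per frequency band and use a proper filter bank; the single-band frame average here only keeps the example short.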

Robustness to degraded speech and missing information

  • partly reversed speech (Saberi & Perrott, Nature, 4/99)

    • fixed duration segments time reversed or shifted in time

    • perfect sentence intelligibility up to 50 ms

      (demo: every 50 ms reversed; original)

    • low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum

    • syllable as information unit? (S. Greenberg)

  • gap and click restoration (Warren)

  • gating experiments
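
The segment-wise time reversal mentioned for the partly reversed speech can be sketched in a few lines. This is a minimal illustration, not the stimulus-generation procedure of the cited study: the function name, the handling of a trailing partial segment (it is simply reversed as well), and the toy signal are assumptions.

```python
def reverse_segments(samples, sample_rate, segment_ms):
    """Time-reverse each consecutive fixed-duration segment of a signal."""
    seg_len = max(1, int(sample_rate * segment_ms / 1000.0))
    out = []
    for start in range(0, len(samples), seg_len):
        # Each segment keeps its position; only its internal order flips.
        out.extend(reversed(samples[start:start + seg_len]))
    return out


# Toy signal: ten samples at an assumed 1000 Hz rate, 4 ms segments.
toy = list(range(10))
locally_reversed = reverse_segments(toy, sample_rate=1000, segment_ms=4)
print(locally_reversed)  # [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

With audio sampled at, say, 16 kHz, a 50 ms segment would correspond to 800 samples; applying the function twice with the same settings restores the original order.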

How good is synthetic speech?
(not the main theme of this seminar, but synthesis and dialogue still receive attention)

  • good enough for certain applications

  • could be better in most others

  • evaluation: application-specific

  • or multi-tier required

  • interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998

Workshop evaluation procedure

  • participants as native listeners

  • DARPA-type procedures in data preparations

  • balanced listening design

  • no detailed results made public

  • 3 text types

    • newspaper sentences

    • semantically unpredictable sentences

    • telephone directory entries

  • 42 systems in 8 languages tested

Some global results

  • it worked, but with many practical problems

    (for demo see

  • this seems the way to proceed and to expand

  • global rating (poor to excellent)

    • text analysis, prosody & signal processing

  • and/or more detailed scores

  • transcriptions subjectively judged

    • major/minor/no problems per entry

  • web site access of several systems


Phonetic knowledge to improve speech synthesis

(supposing concatenative synthesis)

  • control emotion, style, voice characteristics

  • perceptual implications of

    • parameterization (LPC, PSOLA)

    • discontinuities (spectral, temporal, prosody)

  • improve naturalness (prosody!)

  • active adaptation to other conditions

    • hyper/hypo, noise, comm. channel, listener impairment

  • systematic evaluation

Desired pre-processor characteristics in Automatic Speech Recognition

  • basic sensitivity for stationary and dynamic sounds

  • robustness to degraded speech

    • rather insensitive to spectral and temporal smearing

  • robustness to noise and reverberation

  • filter characteristics

    • are BP, PLP, MFCC, RASTA, TRAPS good enough?

    • lateral inhibition (spectral sharpening); dynamics

  • what can be neglected?

    • non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.

Caricature of present-day speech recognizer

  • trained with a variety of speech input

    • much global information, no interrelations

  • monaural, uni-modal input

  • pitch extractor generally not operational

  • performs well on average behavior

    • does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)

  • neglects lots of useful (phonetic) information

  • heavily relies on language model

Useful (phonetic) knowledge neglected so far

  • pitch information

  • (systematic) durational variability

  • spectral reduction/coarticulation (other than multiphone)

  • intelligent selection from multiple features

  • quick adaptation to speaker, style & channel

  • communicative expectations

  • multi-modality

  • binaural hearing

Useful information: durational variability

Adopted from Wang (1998)


overall average=95 ms

normal rate=95

primary stress=104

word final=136

utterance final=186

Adopted from Wang (1998)

Useful information: V and C reduction, coarticulation

  • spectral variability is not random but, at least partly, speaker-, style-, and context-specific

  • read - spontaneous; stressed - unstressed

  • not just for vowels, but also for consonants

    • duration

    • spectral balance

    • intervocalic sound energy difference

    • F2 slope difference

    • locus equation
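
A locus equation is conventionally a straight-line fit of F2 at the consonant-vowel boundary against F2 at the vowel midpoint, for one consonant across vowel contexts. The sketch below shows such a least-squares fit. The formant values are invented for illustration and are not measurements from this presentation; the function name is hypothetical.

```python
def fit_locus_equation(f2_vowel, f2_onset):
    """Least-squares fit of F2_onset = slope * F2_vowel + intercept (Hz)."""
    n = len(f2_vowel)
    mean_x = sum(f2_vowel) / n
    mean_y = sum(f2_onset) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_vowel)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(f2_vowel, f2_onset))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented illustration values (Hz), exactly linear by construction.
f2_mid = [800.0, 1200.0, 1600.0, 2000.0, 2400.0]
f2_on = [0.5 * v + 900.0 for v in f2_mid]
slope, intercept = fit_locus_equation(f2_mid, f2_on)
print(f"slope = {slope:.2f}, intercept = {intercept:.0f} Hz")
```

A slope near 1 means that F2 at the consonant boundary follows the vowel closely (strong coarticulation), while a slope near 0 means the consonant keeps a nearly fixed F2 locus.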

[Figure, two panels: mean consonant duration (C-duration) and mean error rate for C identification (C error rate)]

791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker)

C-identification by 22 Dutch subjects

Adopted from van Son & Pols (Eurospeech’97)

Other useful information:

  • pronunciation variation (ESCA workshop)

  • acoustic attributes of prominence (B. Streefkerk)

  • speech efficiency (post-doc project R. v. Son)

  • confidence measure

  • units in speech recognition

    • rather than PLU, perhaps syllables (S. Greenberg)

  • quick adaptation

  • prosody-driven recognition / understanding

  • multiple features

Speech efficiency

  • speech is most efficient if it contains only the information needed to understand it:

    “Speech is the missing information” (Lindblom, JASA ‘96)

  • less information needed for more predictable things:

    • shorter duration and more spectral reduction for high-frequency syllables and words

    • C-confusion correlates with acoustic factors (duration, CoG) and with information content (syllable/word frequency): I(x) = -log2(Prob(x)), in bits

      (see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
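
The information measure I(x) = -log2(Prob(x)) can be made concrete with a short calculation. The sketch below estimates Prob(x) as a relative frequency from counts; the counts are hypothetical and chosen only so that the results come out as whole bits. They are not taken from the corpus used in the cited paper.

```python
import math


def information_content(prob):
    """I(x) = -log2(Prob(x)), in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def information_from_counts(counts):
    """Estimate Prob(x) as relative frequency and return bits per item."""
    total = sum(counts.values())
    return {item: information_content(c / total) for item, c in counts.items()}


# Hypothetical syllable counts (illustration only).
toy_counts = {"de": 512, "ver": 256, "ing": 128, "schrift": 128}
bits = information_from_counts(toy_counts)
for syllable, value in bits.items():
    print(f"{syllable}: {value:.1f} bits")
```

The frequent item carries 1 bit and the rare ones 3 bits, which matches the point of the slide: more predictable items carry less information and can be realized with shorter duration and more reduction.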

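Correlation between consonant confusion and 4 measures indicated

[Figure: correlations between consonant confusion and the four measures. Material: Dutch male speaker; 20 min. of read/spontaneous (R/S) speech; 12k syllables; 8k words; 791 VCV segments (R/S), of which 308 lexically stressed (+) and 483 unstressed (–); C identification by 22 subjects. Significance levels marked in the figure: p ≤ 0.01 and p ≤ 0.001.]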

Adopted from van Son et al. (Proc. ICSLP’98)

Computational Phonetics
(first suggested by R. Moore, ICPhS’95 Stockholm)

  • duration modeling

  • optimal unit selection (as in concatenative synthesis)

  • pronunciation variation modeling (SpeCom Nov. ‘99)

  • vowel reduction models

  • computational prosody

  • information measures for confusion

  • speech efficiency models

  • modulation transfer function for speech

Discussion / Conclusions

  • speech technology needs further improvement for certain tasks (flexibility, robustness)

  • phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that

  • phonetics and speech / language technology should work together more closely, for their mutual benefit

  • this Heraeus-seminar is a possible platform for that discussion