From last time …




From last time …

ASR System Architecture (block diagram): Speech Signal → Signal Processing → Cepstrum → Probability Estimator → Probabilities (“z” = 0.81, “th” = 0.15, “t” = 0.03) → Decoder → Recognized Words (“zero”, “three”, “two”). The Decoder also draws on the Pronunciation Lexicon and the Grammar.


A Few Points about Human Speech Recognition

(See Chapter 18 for much more on this)


Human Speech Recognition

  • Experiments dating from 1918 dealing with noise and reduced bandwidth (Fletcher)

  • Statistics of CVC perception

  • Comparisons between human and machine speech recognition

  • A few thoughts


The Ear


The Cochlea


Assessing Recognition Accuracy

  • Intelligibility

  • Articulation - Fletcher experiments

    • CVC, VC, CV, syllables in carrier sentences

    • Tests over different SNR, bands

    • Example: “The first group is ‘mav’” (forced choice between “mav” and “nav”)

    • Used sharp lowpass and/or highpass filtering. For equal energy, the crossover is 450 Hz; for equal articulation, it is 1550 Hz.


Results

  • S = v·c² (CVC syllable accuracy S in terms of vowel accuracy v and consonant accuracy c)

  • Articulation Index (the original “AI”)

  • Error independence between bands

    • Articulatory band ~ 1 mm along basilar membrane

    • 20 filters between 300 and 8000 Hz

    • A single zero error band -> no error!

    • Robustness to a range of problems

    • AI = (1/K) ∑_k (SNR_k / 30), where each band SNR_k (in dB) saturates at 0 and 30
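The two formulas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the band SNRs are taken to be in dB, and the sample values are hypothetical rather than measurements from the Fletcher experiments. The slide's setup would use K = 20 bands; five are used here to keep the arithmetic easy to follow.

```python
def articulation_index(band_snrs_db, ceiling_db=30.0):
    """Mean of per-band SNRs (dB), each clipped to [0, ceiling_db] and scaled to [0, 1]."""
    if not band_snrs_db:
        raise ValueError("need at least one band")
    clipped = [min(max(snr, 0.0), ceiling_db) for snr in band_snrs_db]
    return sum(snr / ceiling_db for snr in clipped) / len(clipped)


def cvc_syllable_accuracy(consonant_acc, vowel_acc):
    """S = v * c^2: a CVC syllable is correct only if both consonants and the vowel are."""
    return vowel_acc * consonant_acc ** 2


# Hypothetical band SNRs in dB (illustrative values only).
example_snrs = [35.0, 30.0, 15.0, 0.0, -5.0]
# Clipped and scaled: 1, 1, 0.5, 0, 0, so the mean is 0.5.
example_ai = articulation_index(example_snrs)

# Hypothetical consonant and vowel accuracies.
example_s = cvc_syllable_accuracy(consonant_acc=0.9, vowel_acc=0.95)
```

Clipping before averaging is what "saturates at 0 and 30" amounts to here: a band with 35 dB SNR contributes no more than one with 30 dB, and a band below 0 dB contributes nothing.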


AI Additivity

  • s(a,b) = phone accuracy for band from a to b, a<b<c

  • (1-s(a,c)) = (1-s(a,b))(1-s(b,c))

  • log10(1-s(a,c)) = log10(1-s(a,b)) + log10(1-s(b,c))

  • AI(s) = log10(1-s) / log10(1-s_max)

  • AI(s(a,c)) = AI(s(a,b)) + AI(s(b,c))
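The chain of equalities above can be checked numerically. The sketch below assumes a value for s_max (the slides do not give one) and uses hypothetical per-band accuracies; the function names are invented for this illustration.

```python
import math

S_MAX = 0.985  # assumed maximum phone accuracy; not stated on the slides


def band_union_accuracy(s_ab, s_bc):
    """Accuracy for band a..c when errors in a..b and b..c are independent."""
    return 1.0 - (1.0 - s_ab) * (1.0 - s_bc)


def ai_from_accuracy(s, s_max=S_MAX):
    """AI(s) = log10(1 - s) / log10(1 - s_max)."""
    return math.log10(1.0 - s) / math.log10(1.0 - s_max)


# Hypothetical phone accuracies for two adjacent bands.
s_ab, s_bc = 0.60, 0.75
s_ac = band_union_accuracy(s_ab, s_bc)  # 1 - 0.40 * 0.25 = 0.90

# Errors multiply, so their logarithms add, and AI is additive across bands.
ai_combined = ai_from_accuracy(s_ac)
ai_sum = ai_from_accuracy(s_ab) + ai_from_accuracy(s_bc)
```

The choice of s_max only rescales AI so that AI(s_max) = 1. Additivity holds for any s_max strictly between 0 and 1, because the denominator is the same on both sides.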


Jont Allen Interpretation: The Big Idea

  • Humans don’t use frame-like spectral templates

  • Instead, partial recognition in bands

  • Combined for phonetic (syllabic?) recognition

  • Important for 3 reasons:

    • Based on decades of listening experiments

    • Based on a theoretical structure that matched the results

    • Different from what ASR systems do


Questions about AI

  • Based on phones - the right unit for fluent speech?

  • Lost correlation between distant bands?

  • Lippmann experiments, disjoint bands

    • Signal above 8 kHz helps a lot in combination with signal below 800 Hz


Human SR vs ASR: Quantitative Comparisons

  • Lippmann compilation (see book): typically about a factor of 10 difference in word error rate (WER)

  • Hasn’t changed too much since his study

  • Keep in mind this caveat: “human” scores are ideal; under sustained real conditions people don’t pay perfect attention (especially after lunch)
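For readers who want to see what a WER figure measures, here is a minimal sketch of the usual definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example strings are illustrative assumptions, not part of the Lippmann study.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
example_wer = word_error_rate("zero three two", "zero tree two")
```

A "factor of 10" gap then means that, on the same test material, the machine's value of this ratio is roughly ten times the human listeners' value.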


Human SR vs ASR: Quantitative Comparisons (2)

Word error rates for the 5000-word Wall Street Journal read-speech task with additive automotive noise (old numbers; ASR would be a bit better now)


Human SR vs ASR: Qualitative Comparisons

  • Signal processing

  • Subword recognition

  • Temporal integration

  • Higher level information


Human SR vs ASR: Signal Processing

  • Many maps vs one

  • Sampled across time-frequency vs sampled in time

  • Some hearing-based signal processing already in ASR


Human SR vs ASR: Subword Recognition

  • Knowing what is important (from the maps)

  • Combining it optimally


Human SR vs ASR: Temporal Integration

  • Using or ignoring duration (e.g., voice onset time, VOT)

  • Compensating for rapid speech

  • Incorporating multiple time scales


Human SR vs ASR: Higher Levels

  • Syntax

  • Semantics

  • Pragmatics

  • Getting the gist

  • Dialog to learn more


Human SR vs ASR: Conclusions

  • When we pay attention, human SR much better than ASR

  • Some aspects of human models going into ASR

  • Probably much more to do, when we learn how to do it right

