A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition

Dr. Eric Fosler-Lussier

Presentation for CIS 788


Overview

  • Our task: moving from “read speech recognition” to recognizing spontaneous conversational speech

  • Two basic approaches for modeling pronunciation variation

    • Encoding linguistic knowledge to pre-specify possible alternative pronunciations of words

    • Deriving alternatives directly from a pronunciation corpus.

  • Purposes of this tutorial

    • Explain basic linguistic concepts in phonetics and phonology

    • Outline several pronunciation modeling strategies

    • Summarize promising recent research directions.



Pronunciations & Pronunciation Modeling

  • Why sub-word units?

    • Data sparseness at word level

    • Intermediate level allows extensible vocabulary

  • Why phone(me)s?

    • Available dictionaries/orthographies assume this unit

    • Research suggests humans use this unit

    • Phone inventory more manageable than syllables, etc. (in e.g., English)



Statistical Underpinnings for Pronunciation Modeling

  • In the whole-word approach, we could find the most likely utterance (word-string) M* given the perceived signal:

    M* = argmax_M P(M | X)



Statistical Underpinnings for Pronunciation Modeling

  • With independence assumptions, we can use the following approximation:

    M* = argmax_M P(M | X) ≈ argmax_M [ max_Q P_A(X | Q) P_Q(Q | M) ] P_L(M)



Statistical Underpinnings for Pronunciation Modeling

  • PA(X|Q): the acoustic model

    • Maps continuous sounds (vectors) to discrete phones (states)

    • Analogous to “categorical perception” in human hearing

  • PQ(Q|M): the pronunciation model

    • Probability of phone states given words

    • Also includes context-dependence & duration models

  • PL(M): the language model

    • The prior probability of word sequences
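The three-way factorization above can be sketched as a toy decoder. All probability values and the candidate hypotheses ("writer" vs. "write her") below are made-up illustration data, not numbers from the slides; a real system searches an enormous hypothesis space rather than a hand-enumerated dictionary.

```python
import math

# Toy illustration of M* = argmax_M max_Q P_A(X|Q) * P_Q(Q|M) * P_L(M),
# scored in log space. All probabilities are hypothetical.

# language model: prior log-prob of each candidate word string M
log_P_L = {"write her": math.log(0.02), "writer": math.log(0.05)}

# pronunciation model: log P(Q|M) for each phone sequence Q
log_P_Q = {
    ("writer", "r ay dx axr"): math.log(0.7),    # flapped variant
    ("writer", "r ay t axr"): math.log(0.3),     # canonical variant
    ("write her", "r ay t hh er"): math.log(0.8),
    ("write her", "r ay dx axr"): math.log(0.2),
}

# acoustic model: log P(X|Q) for the (fixed) observed signal X
log_P_A = {"r ay dx axr": math.log(0.6),
           "r ay t axr": math.log(0.1),
           "r ay t hh er": math.log(0.05)}

def decode():
    best, best_score = None, float("-inf")
    for (m, q), lpq in log_P_Q.items():
        score = log_P_A[q] + lpq + log_P_L[m]  # the three models in sequence
        if score > best_score:
            best, best_score = m, score
    return best

print(decode())   # -> "writer"
```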



Statistical Underpinnings for Pronunciation Modeling

The three models working in sequence:



Linguistic Formalisms & Pronunciation Variation

  • Phones & Phonemes

  • (Articulatory) Features

  • Phonological Rules

  • Finite State Transducers



Linguistic Formalisms & Pronunciation Variation

  • Phones & Phonemes

    • Phones: Types of (uttered) segments

      • E.g., [p]: unaspirated voiceless labial stop, as in speak [spik]

      • vs. [pʰ]: aspirated voiceless labial stop, as in peak [pʰik]

    • Phonemes: Mental abstractions of phones

      • /p/ in speak = /p/ in peak to naïve speakers

    • ARPABET: between phones & phonemes

    • SAMPAbet: closer to phones, but not perfect…


SAMPA for American English

Selected Consonants (ARPAbet in parentheses)

  tS   chin      tSIn      (ch)
  dZ   gin       dZIn      (jh)
  T    thin      TIn       (th)
  D    this      DIs       (dh)
  Z    measure   "mEZ@`    (zh)
  N    thing     TIN       (ng)
  j    yacht     jAt       (y)
  4    butter    bV4@`     (dx)

Selected Vowels (ARPAbet in parentheses)

  {    pat       p{t       (ae)
  A    pot       pAt       (aa)
  V    cut       kVt       (uh) !
  U    put       pUt       (uh) !
  aI   rise      raIz      (ay)
  3`   furs      f3`z      (er)
  @    allow     @laU      (ax)
  @`   corner    kOrn@`    (axr)



Linguistic Formalisms & Pronunciation Variation

  • (Articulatory) Features

    • Describe where (place) and how (manner) a sound is made, and whether it is voiced.

    • Typical features (dimensions) for vowels include height, backness, & roundness

  • (Acoustic) Features

    • Vowel features actually correlate better with formants than with actual tongue position



From Hume-O’Haire & Winters (2001)



Linguistic Formalisms & Pronunciation Variation

  • Phonological Rules

    • Used to classify, explain, and predict phonetic alternations in related words: write (t) vs. writer (dx)

    • May also be useful for capturing differences in speech mode (e.g., dialect, register, rate)

    • Example: flapping in American English
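The flapping example can be sketched as a rewrite over ARPAbet phone strings. The rule here is a simplification (it ignores stress, which conditions real flapping), and the vowel set is an illustrative subset, not the rule as given on the slides.

```python
import re

# Simplified sketch of American English flapping over ARPAbet strings:
# /t/ surfaces as the flap [dx] between two vowels. Stress context is
# deliberately omitted here; a faithful rule would require it.
VOWELS = r"(?:aa|ae|ah|ax|axr|ay|eh|er|ey|ih|iy|ow|uh|uw)"

def apply_flapping(phones: str) -> str:
    # rewrite "V t V" as "V dx V"
    return re.sub(rf"({VOWELS}) t ({VOWELS})", r"\1 dx \2", phones)

print(apply_flapping("r ay t axr"))   # writer -> "r ay dx axr"
print(apply_flapping("t ih p"))       # no flapping word-initially
```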



Linguistic Formalisms & Pronunciation Variation

  • Finite State Transducers

    • (Same example transducer as on Tuesday)



Linguistic Formalisms & Pronunciation Variation

  • Useful properties of FSTs

    • Invertible

      (thus usable in both production & recognition)

    • Learnable (Oncina, Garcia, & Vidal 1993, Gildea & Jurafsky 1996)

    • Composable

    • Compatible with HMMs
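Two of these properties, nondeterministic transduction and invertibility, can be sketched with a toy transducer represented as a list of arcs. The class and the flapping arcs below are illustrative, not the transducer from the earlier slide.

```python
# Minimal sketch of an FST as (state, input, output, next_state) arcs.

class FST:
    def __init__(self, arcs, start, finals):
        self.arcs, self.start, self.finals = arcs, start, finals

    def transduce(self, symbols):
        """Return all output strings for an input string (nondeterministic)."""
        paths = [(self.start, [])]
        for sym in symbols:
            paths = [(nxt, out + [o])
                     for (state, out) in paths
                     for (s, i, o, nxt) in self.arcs
                     if s == state and i == sym]
        return [" ".join(out) for state, out in paths if state in self.finals]

    def invert(self):
        """Swap input and output labels -- usable for both production and recognition."""
        return FST([(s, o, i, n) for (s, i, o, n) in self.arcs],
                   self.start, self.finals)

# toy flapping transducer: 't' maps to 'dx' right after a vowel
arcs = [(0, "ay", "ay", 1), (0, "t", "t", 0),
        (1, "t", "dx", 0), (1, "ay", "ay", 1)]
fst = FST(arcs, start=0, finals={0, 1})
print(fst.transduce(["ay", "t"]))            # ['ay dx']
print(fst.invert().transduce(["ay", "dx"]))  # ['ay t']
```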



ASR Models: Predicting Variation in Pronunciations

  • Knowledge-Based Approaches

    • Hand-Crafted Dictionaries

    • Letter to Sound Rules

    • Phonological Rules

  • Data-Driven Approaches

    • Baseform Learning

    • Learning Pronunciation Rules



ASR Models: Predicting Variation in Pronunciations

  • Hand-Crafted Dictionaries

    • E.g., CMUdict, Pronlex for American English

    • The most readily available starting point

    • Limitations:

      • Generally only one or two pronunciations per word

      • Does not reflect fast speech, multi-word context

      • May not contain e.g., proper names, acronyms

      • Time-consuming to build for new languages
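Loading such a dictionary is straightforward; a minimal sketch, using a few hard-coded entries in CMUdict's plain-text format (a real system would read the full cmudict file):

```python
# Sketch of a CMUdict-format lookup. The RAW entries are sample lines in
# the real format: WORD  PH OH NE S, with variants marked WORD(2), etc.

RAW = """\
TOMATO  T AH0 M EY1 T OW2
TOMATO(2)  T AH0 M AA1 T OW2
SPEECH  S P IY1 CH
"""

def load_dict(raw: str) -> dict:
    pron = {}
    for line in raw.splitlines():
        word, phones = line.split(None, 1)
        word = word.split("(")[0]          # strip variant markers like (2)
        pron.setdefault(word, []).append(phones.split())
    return pron

d = load_dict(RAW)
print(len(d["TOMATO"]))   # 2 -- most words get only one or two variants
```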



ASR Models: Predicting Variation in Pronunciations

  • Letter to Sound Rules

    • In English, used to supplement dictionaries

    • In e.g., Spanish, may be enough by themselves

    • Can be learned (e.g. by DTs, ANNs)

    • Hard-to-catch Exceptions:

      • Compound-words, acronyms, etc.

      • Loan words, foreign words

      • Proper names (Brands, people, places)
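For a language like Spanish, where the slides note letter-to-sound rules may suffice by themselves, such rules can be sketched as ordered grapheme-to-phone rewrites. The rule set below is tiny and illustrative (and assumes a seseo variety); it is nowhere near a complete Spanish L2S system.

```python
import re

# Minimal sketch of ordered letter-to-sound rules for Spanish.
# Each rule: (grapheme, regex for the following context, output phone).
RULES = [
    ("ch", "", "tS"),
    ("c", "[ei]", "s"),   # seseo variety assumed
    ("c", "", "k"),
    ("qu", "", "k"),
    ("ll", "", "j"),
    ("h", "", ""),        # silent
]

def to_phones(word: str) -> str:
    out, i = [], 0
    while i < len(word):
        for graph, ctx, phone in RULES:   # first matching rule wins
            if word.startswith(graph, i) and re.match(ctx, word[i + len(graph):]):
                if phone:
                    out.append(phone)
                i += len(graph)
                break
        else:
            out.append(word[i])           # default: letter maps to itself
            i += 1
    return " ".join(out)

print(to_phones("chica"))   # 'tS i k a'
print(to_phones("hola"))    # 'o l a' -- silent h
```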



ASR Models: Predicting Variation in Pronunciations

  • Phonological Rules

    • Useful for modeling e.g., fast speech, likely non-canonical pronunciations

    • Can provide basis for speaker-adaptation

    • Limitations:

      • Requires labeled corpus to learn rule probabilities

      • May over-generalize, creating spurious homophones

      • (Pruning minimizes this)



Examples of Fast-Speech Rules



ASR Models: Predicting Variation in Pronunciations

  • Automatic Baseform Learning

    1) Use ASR with “dummy” dictionary to find “surface” phone sequences of an utterance

    2) Find canonical pronunciation of utterance (e.g., by forced-Viterbi)

    3) Align these two (w/ dynamic programming)

    4) Record “surface pronunciations” of words
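Step 3 above can be sketched as a standard edit-distance alignment with traceback; the phone strings in the example are illustrative.

```python
# Sketch of aligning a canonical phone string with a recognized surface
# string by dynamic programming (Levenshtein with traceback).

def align(canonical, surface):
    n, m = len(canonical), len(surface)
    D = [[0] * (m + 1) for _ in range(n + 1)]        # edit-distance table
    for i in range(n + 1): D[i][0] = i
    for j in range(m + 1): D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i-1][j-1] + (canonical[i-1] != surface[j-1])
            D[i][j] = min(sub, D[i-1][j] + 1, D[i][j-1] + 1)
    # traceback to recover (canonical, surface) pairs; '-' marks a gap
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i and j and D[i][j] == D[i-1][j-1] + (canonical[i-1] != surface[j-1]):
            pairs.append((canonical[i-1], surface[j-1])); i, j = i - 1, j - 1
        elif i and D[i][j] == D[i-1][j] + 1:
            pairs.append((canonical[i-1], "-")); i -= 1
        else:
            pairs.append(("-", surface[j-1])); j -= 1
    return pairs[::-1]

# "writer": canonical t surfaces as the flap dx
print(align("r ay t axr".split(), "r ay dx axr".split()))
```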



ASR Models: Predicting Variation in Pronunciations

  • Limitations of Baseform Learning

    • Limited to single-word learning

    • Ignores multi-word phrases, cross word-boundary effects (e.g., Did you → “didja”)

    • Misses generalizations across words (e.g., learns flapping separately for each word)



ASR Models: Predicting Variation in Pronunciations

  • Learning Pronunciation Rules

    • Each word has a canonical pronunciation c1 c2 …cj…cn.

    • Each phone cj in a word can be pronounced by some sj.

    • Set of surface pronunciations S: {Si = si1, …, sin}

    • Taking the canonical tri-phone and the last surface phone into account, the probability of a given Si can be estimated:

      P(Si) ≈ ∏j P(sij | cj−1, cj, cj+1, si(j−1))
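Such conditional probabilities can be estimated by counting over aligned (canonical, surface) phone pairs. The aligned word tokens below are hypothetical illustration data.

```python
from collections import Counter

# Sketch of estimating P(s_j | c_{j-1}, c_j, c_{j+1}, s_{j-1}) by counting.
# Each item: a list of (canonical, surface) pairs for one word token.
alignments = [
    [("r", "r"), ("ay", "ay"), ("t", "dx"), ("axr", "axr")],   # writer
    [("l", "l"), ("ey", "ey"), ("t", "dx"), ("axr", "axr")],   # later
    [("s", "s"), ("t", "t"), ("aa", "aa"), ("p", "p")],        # stop
]

context_counts, joint_counts = Counter(), Counter()
for pairs in alignments:
    for j, (c, s) in enumerate(pairs):
        prev_c = pairs[j-1][0] if j > 0 else "#"       # '#' = word boundary
        next_c = pairs[j+1][0] if j + 1 < len(pairs) else "#"
        prev_s = pairs[j-1][1] if j > 0 else "#"
        ctx = (prev_c, c, next_c, prev_s)
        context_counts[ctx] += 1
        joint_counts[ctx + (s,)] += 1

def p_surface(ctx, s):
    return joint_counts[ctx + (s,)] / context_counts[ctx]

# P(dx | ay, t, axr, ay): t always flaps in this context in our toy data
print(p_surface(("ay", "t", "axr", "ay"), "dx"))
```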



ASR Models: Predicting Variation in Pronunciations

  • (Machine) Learning Pronunciation Rules

    • Typical ML techniques apply: CART, ANNs, etc.

    • Using features (pre-specified or learned) helps

    • Brill-type rules (e.g., Yang & Martens 2000):

      • A → B / C ___ D, with P(B | A, C, D)  (positive rule)

      • A → not-B / C ___ D, with 1 − P(B | A, C, D)  (negative rule)

        (Note: equivalent to Two-level rule types 1 & 4)



ASR Models: Predicting Variation in Pronunciations

  • Pruning Learned Rules & Pronunciations

    • Vary # of allowed pronunciations by word-frequency

      E.g., f (count(w)) = k log(count(w))

    • Use probability threshold for candidate pronunciations

      • Absolute cutoff

      • “Relmax” (relative to maximum) cutoff

    • Use acoustic confidence C(pj,wi) as measure
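The first two pruning ideas can be sketched together: cap the number of variants by word frequency, then drop candidates below a "relmax" fraction of the best pronunciation's probability. The constants and pronunciation probabilities below are illustrative.

```python
import math

# Sketch of frequency-based capping plus a relmax probability cutoff.
# k and relmax are illustrative tuning constants.

def max_variants(count, k=1.5):
    # allow roughly k * log(count) pronunciations for a word seen `count` times
    return max(1, int(k * math.log(count)))

def prune(prons, count, relmax=0.1):
    """prons: {pronunciation: probability}; keep the top-N above the cutoff."""
    best = max(prons.values())
    kept = {p: pr for p, pr in prons.items() if pr >= relmax * best}
    top = sorted(kept, key=kept.get, reverse=True)[:max_variants(count)]
    return {p: kept[p] for p in top}

prons = {"r ay t axr": 0.55, "r ay dx axr": 0.40, "r ay d axr": 0.05}
print(prune(prons, count=200))   # the 0.05 variant falls below the cutoff
```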



Online Transformation-Based Pronunciation Modeling

  • In theory, a dynamic dictionary could halve error rates

    • Using an “oracle dictionary” for each utterance in Switchboard reduces error by 43%

    • Using e.g., multi-word context, hidden speaking-mode states may capture some of this.

    • Actual results less dramatic, of course!



Online Transformation-Based Pronunciation Modeling



Five Problems Yet to Be Solved

  • Confusability and Discriminability

  • Hard Decisions

  • Consistency

  • Information Structure

  • Moving Beyond Phones as Basic Units



Five Problems Yet to Be Solved

  • Confusability and Discriminability

    • New pronunciations can create homophones not only with other words, but with parts of words.

    • Few exact metrics exist to measure confusion



Five Problems Yet to Be Solved

  • Hard Decisions

    • Forced-Viterbi throws away good but “second-best” representations.

    • N-best lists would avoid this (Mokbel and Jouvet), but are problematic for large vocabularies

    • DTs also introduce hard decisions and data-splitting



Five Problems Yet to Be Solved

  • Consistency

    • Current ASR works word-by-word w/o picking up on long-term patterns (e.g., stretches of fast speech, consistent patterns like dialect, speaker)

    • Hidden speech-mode variable helps, but data is perhaps too sparse for dialect-dependent states.



Five Problems Yet to Be Solved

  • Information Structure

    • Language is about the message!

    • Hence, not all words are pronounced equal

    • Confounding variables:

      • Prosody & intonation (emphasis, de-accenting)

      • Position of word in utterance (beginning or end)

      • Given vs. new information; Topic/focus, etc.

      • First-time use vs. repetitions of a word



Five Problems Yet to Be Solved

  • Moving Beyond Phones as Basic Units

    • Other types of units

      • “Fenones”

      • Hybrid phones [x+y] for /x/ → /y/ rules

    • Detecting (changes in) distinctive features

      • E.g., [ax] → {[+voicing, +nasality], [+voicing, +nasality, +back], [+voicing, +back], …}

      • (cf. Autosegmental & Non-linear phonology?)


Conclusions

  • An ideal model would:

    • Be dynamic and adaptive in dictionary use

    • Integrate knowledge of previously heard pronunciation patterns from that speaker

    • Incorporate higher-level factors (e.g., speaking rate, semantics of the message) to predict changes from the canonical pronunciation

    • (Perhaps) operate on a sub-phonetic level, too.

