A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition

Dr. Eric Fosler-Lussier

Presentation for CiS 788


Overview

  • Our task: moving from “read speech recognition” to recognizing spontaneous conversational speech

  • Two basic approaches for modeling pronunciation variation

    • Encoding linguistic knowledge to pre-specify possible alternative pronunciations of words

    • Deriving alternatives directly from a pronunciation corpus.

  • Purposes of this tutorial

    • Explain basic linguistic concepts in phonetics and phonology

    • Outline several pronunciation modeling strategies

    • Summarize promising recent research directions.


Pronunciations & Pronunciation Modeling

  • Why sub-word units?

    • Data sparseness at word level

    • Intermediate level allows extensible vocabulary

  • Why phone(me)s?

    • Available dictionaries/orthographies assume this unit

    • Research suggests humans use this unit

    • Phone inventory more manageable than syllables, etc. (e.g., in English)


Statistical Underpinnings for Pronunciation Modeling

  • In the whole-word approach, we could find the most likely utterance (word-string) M* given the perceived signal:

    $$M^* = \arg\max_M P(M \mid X)$$


Statistical Underpinnings for Pronunciation Modeling

  • With independence assumptions, we can use the following approximation:

  • $$\arg\max_M P(M \mid X) \approx \arg\max_M \sum_Q P_A(X \mid Q)\, P_Q(Q \mid M)\, P_L(M)$$


Statistical Underpinnings for Pronunciation Modeling

  • P_A(X|Q): the acoustic model

    • Maps continuous sound vectors to discrete phone states

    • Analogous to “categorical perception” in human hearing

  • P_Q(Q|M): the pronunciation model

    • Probability of phone states given words

    • Also includes context-dependence & duration models

  • P_L(M): the language model

    • The prior probability of word sequences
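How the three models combine can be sketched in a few lines of Python. This is a toy illustration only: the `Hypothesis` type, the function names, and every probability below are invented for the example, and a real recognizer searches a huge hypothesis space rather than scoring a short list. Each hypothesis pairs a word sequence M with one phone sequence Q, and the score is the sum of the three log-probabilities.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Hypothesis(NamedTuple):
    """One candidate (M, Q) pair with its three model scores (all hypothetical)."""
    words: Tuple[str, ...]    # M: the word sequence
    phones: Tuple[str, ...]   # Q: one phone sequence for those words
    p_acoustic: float         # P_A(X | Q)
    p_pronunciation: float    # P_Q(Q | M)
    p_language: float         # P_L(M)


def log_score(h: Hypothesis) -> float:
    """log P_A(X|Q) + log P_Q(Q|M) + log P_L(M)."""
    return (math.log(h.p_acoustic)
            + math.log(h.p_pronunciation)
            + math.log(h.p_language))


def decode(hypotheses: Sequence[Hypothesis]) -> Hypothesis:
    """Return the best-scoring hypothesis (best single Q per M)."""
    return max(hypotheses, key=log_score)


# Invented numbers, for illustration only.
hypotheses = [
    Hypothesis(("did", "you"), ("d", "ih", "d", "y", "uw"), 0.2, 0.3, 0.5),
    Hypothesis(("did", "you"), ("d", "ih", "jh", "ax"), 0.1, 0.7, 0.5),
    Hypothesis(("did", "dew"), ("d", "ih", "d", "d", "uw"), 0.3, 0.9, 0.1),
]
best = decode(hypotheses)
print(best.words, best.phones)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied; taking the single best Q rather than summing over all Q is a further simplification made in this sketch.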


Statistical Underpinnings for Pronunciation Modeling

The three models working in sequence:


Linguistic Formalisms & Pronunciation Variation

  • Phones & Phonemes

  • (Articulatory) Features

  • Phonological Rules

  • Finite State Transducers


Linguistic Formalisms & Pronunciation Variation

  • Phones & Phonemes

    • Phones: Types of (uttered) segments

      • E.g., [p] unaspirated voiceless labial stop, as in speak [spik]

      • Vs. [pʰ] aspirated voiceless labial stop, as in peak [pʰik]

    • Phonemes: Mental abstractions of phones

      • /p/ in speak = /p/ in peak to naïve speakers

    • ARPABET: between phones & phonemes

    • SAMPAbet: closer to phones, but not perfect…


SAMPA for American English

Selected Consonants

SAMPA  Example  Transcription  ARPAbet
tS     chin     tSIn           ch
dZ     gin      dZIn           jh
T      thin     TIn            th
D      this     DIs            dh
Z      measure  "mEZ@`         zh
N      thing    TIN            ng
j      yacht    jAt            y
4      butter   bV4@`          dx

Selected Vowels

SAMPA  Example  Transcription  ARPAbet
{      pat      p{t            ae
A      pot      pAt            aa
V      cut      kVt            ah !
U      put      pUt            uh !
aI     rise     raIz           ay
3`     furs     f3`z           er
@      allow    @laU           ax
@`     corner   kOrn@`         axr


Linguistic Formalisms & Pronunciation Variation

  • (Articulatory) Features

    • Describe where (place) and how (manner) a sound is made, and whether it is voiced.

    • Typical features (dimensions) for vowels include height, backness, & roundness

  • (Acoustic) Features

    • Vowel features actually correlate better with formants than with actual tongue position



Linguistic Formalisms & Pronunciation Variation

  • Phonological Rules

    • Used to classify, explain, and predict phonetic alternations in related words: write (t) vs. writer (dx)

    • May also be useful for capturing differences in speech mode (e.g., dialect, register, rate)

    • Example: flapping in American English
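A deliberately simplified version of the flapping rule can be written as a context-dependent rewrite over ARPAbet symbols. The sketch below assumes that /t/ or /d/ becomes the flap [dx] whenever it stands between two vowels. The actual rule also depends on stress (the following vowel is typically unstressed), which is ignored here, and the `VOWELS` set and `flap` function are names made up for this example.

```python
# ARPAbet vowel symbols used by this sketch (assumed inventory, no stress marks).
VOWELS = {
    "aa", "ae", "ah", "ao", "aw", "ax", "axr", "ay", "eh",
    "er", "ey", "ih", "iy", "ow", "oy", "uh", "uw",
}


def flap(phones):
    """Rewrite t/d as dx between two vowels (simplified: stress is ignored)."""
    phones = list(phones)
    output = list(phones)
    for i in range(1, len(phones) - 1):
        between_vowels = phones[i - 1] in VOWELS and phones[i + 1] in VOWELS
        if phones[i] in ("t", "d") and between_vowels:
            output[i] = "dx"
    return output


print(flap(["r", "ay", "t"]))        # write: the final t has no following vowel
print(flap(["r", "ay", "t", "er"]))  # writer: t sits between ay and er
```

Under this simplification the rule leaves "write" unchanged and turns the /t/ of "writer" into [dx], which matches the write (t) vs. writer (dx) alternation mentioned above.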


Linguistic Formalisms & Pronunciation Variation

  • Finite State Transducers

    • (Same example transducer as on Tuesday)


Linguistic Formalisms & Pronunciation Variation

  • Useful properties of FSTs

    • Invertible

      (thus usable in both production & recognition)

    • Learnable (Oncina, Garcia, & Vidal 1993, Gildea & Jurafsky 1996)

    • Composable

    • Compatible with HMMs
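Invertibility and composability can be illustrated without any FST toolkit by treating a transducer as the finite relation it accepts, i.e., a set of (input, output) pairs. This is only a stand-in: real transducers encode such relations with states and arcs and can represent infinite relations. The toy lexicon and rule relation below are invented for the example.

```python
def invert(relation):
    """Swap input and output, turning a production mapping into a recognition one."""
    return {(out, inp) for inp, out in relation}


def compose(first, second):
    """Chain two relations: feed every output of `first` into `second`."""
    return {(a, c) for a, b in first for b2, c in second if b == b2}


# Hypothetical toy relations: word -> canonical phones, canonical -> surface phones.
lexicon = {("writer", "r ay t er")}
flapping = {("r ay t er", "r ay t er"), ("r ay t er", "r ay dx er")}

word_to_surface = compose(lexicon, flapping)   # production direction
surface_to_word = invert(word_to_surface)      # recognition direction
print(sorted(word_to_surface))
print(sorted(surface_to_word))
```

Composing the lexicon with the rule relation yields a single word-to-surface mapping, and inverting it gives the direction a recognizer needs, from surface phones back to words.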


ASR Models: Predicting Variation in Pronunciations

  • Knowledge-Based Approaches

    • Hand-Crafted Dictionaries

    • Letter to Sound Rules

    • Phonological Rules

  • Data-Driven Approaches

    • Baseform Learning

    • Learning Pronunciation Rules


ASR Models: Predicting Variation in Pronunciations

  • Hand-Crafted Dictionaries

    • E.g., CMUdict, Pronlex for American English

    • The most readily available starting point

    • Limitations:

      • Generally only one or two pronunciations per word

      • Do not reflect fast speech or multi-word context

      • May not contain, e.g., proper names or acronyms

      • Time-consuming to build for new languages


ASR Models: Predicting Variation in Pronunciations

  • Letter to Sound Rules

    • In English, used to supplement dictionaries

    • In e.g., Spanish, may be enough by themselves

    • Can be learned, e.g., by decision trees (DTs) or artificial neural networks (ANNs)

    • Hard-to-catch Exceptions:

      • Compound-words, acronyms, etc.

      • Loan words, foreign words

      • Proper names (Brands, people, places)


ASR Models: Predicting Variation in Pronunciations

  • Phonological Rules

    • Useful for modeling e.g., fast speech, likely non-canonical pronunciations

    • Can provide basis for speaker-adaptation

    • Limitations:

      • Require a labeled corpus to learn rule probabilities

      • May over-generalize, creating spurious homophones

      • (Pruning minimizes this)



ASR Models: Predicting Variation in Pronunciations

  • Automatic Baseform Learning

    1) Use ASR with “dummy” dictionary to find “surface” phone sequences of an utterance

    2) Find canonical pronunciation of utterance (e.g., by forced-Viterbi)

    3) Align these two (w/ dynamic programming)

    4) Record “surface pronunciations” of words
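Step 3 can be sketched as a standard edit-distance alignment between the canonical and the surface phone sequences. The version below assumes unit costs for substitution, insertion, and deletion. Practical systems often weight substitutions by phonetic similarity, which is not shown, and the function name `align` is made up for this example.

```python
def align(canonical, surface):
    """Align two phone sequences by dynamic programming (unit edit costs).

    Returns (canonical_phone, surface_phone) pairs; None marks a deletion
    (surface side) or an insertion (canonical side).
    """
    n, m = len(canonical), len(surface)
    # cost[i][j] = edit distance between canonical[:i] and surface[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(canonical[i - 1] != surface[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,  # match / substitute
                             cost[i - 1][j] + 1,             # delete canonical phone
                             cost[i][j - 1] + 1)             # insert surface phone

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + int(
                canonical[i - 1] != surface[j - 1]):
            pairs.append((canonical[i - 1], surface[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))
            i -= 1
        else:
            pairs.append((None, surface[j - 1]))
            j -= 1
    return pairs[::-1]


print(align(["r", "ay", "t", "er"], ["r", "ay", "dx", "er"]))
print(align(["ae", "n", "d"], ["ae", "n"]))
```

Once the phones are aligned and the word boundaries of the canonical sequence are known, step 4 amounts to collecting, for each word, the surface phones aligned to its canonical phones.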


ASR Models: Predicting Variation in Pronunciations

  • Limitations of Baseform Learning

    • Limited to single-word learning

    • Ignores multi-word phrases, cross word-boundary effects (e.g., Did you → “didja”)

    • Misses generalizations across words (e.g., learns flapping separately for each word)


ASR Models: Predicting Variation in Pronunciations

  • Learning Pronunciation Rules

    • Each word has a canonical pronunciation c_1 c_2 … c_j … c_n.

    • Each phone c_j in a word can be pronounced as some surface phone s_j.

    • Set of surface pronunciations S: {S_i = s_{i1}, …, s_{in}}

    • Taking canonical tri-phone and last surface phone into account, the probability of a given Si can be estimated:
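One form consistent with this description, offered as an assumption rather than as the exact equation from the slide, conditions each surface phone on the canonical triphone around it and on the preceding surface phone. Here C denotes the canonical pronunciation c_1 … c_n, s_{ij} is the j-th phone of surface pronunciation S_i, and c_0, c_{n+1}, and s_{i0} are taken to be word-boundary symbols.

```latex
P(S_i \mid C) \approx \prod_{j=1}^{n} P\left(s_{ij} \mid c_{j-1},\, c_j,\, c_{j+1},\, s_{i,j-1}\right)
```

Each factor can then be estimated from counts over an aligned corpus of canonical and surface phones.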


ASR Models: Predicting Variation in Pronunciations

  • (Machine) Learning Pronunciation Rules

    • Typical ML techniques apply: CART, ANNs, etc.

    • Using features (pre-specified or learned) helps

    • Brill-type rules (e.g., Yang & Martens 2000):

      • A  B // C __ D with P(B|A,C,D) positive rule

      • A  not B // C __ D with 1 - P(B|A,C,D) neg. rule

        (Note: equivalent to Two-level rule types 1 & 4)


ASR Models: Predicting Variation in Pronunciations

  • Pruning Learned Rules & Pronunciations

    • Vary # of allowed pronunciations by word-frequency

      E.g., f (count(w)) = k log(count(w))

    • Use probability threshold for candidate pronunciations

      • Absolute cutoff

      • “Relmax” (relative to maximum) cutoff

    • Use acoustic confidence C(pj,wi) as measure
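The first two pruning strategies can be combined in a short sketch. The function names, the default k, and the floor of one pronunciation per word are assumptions made for this example, and the probabilities are invented. Pruning by acoustic confidence is not shown.

```python
import math


def max_pronunciations(count, k=1.0):
    """f(count(w)) = k * log(count(w)), floored at 1 (the floor is an assumption)."""
    if count <= 1:
        return 1
    return max(1, int(k * math.log(count)))


def prune(pronunciations, count, k=1.0, absolute=None, relmax=None):
    """Keep the most probable pronunciations of one word.

    pronunciations: dict mapping a pronunciation string to its probability
    count:          corpus frequency of the word
    absolute:       drop pronunciations with probability below this value
    relmax:         drop pronunciations below relmax * (highest probability)
    """
    if not pronunciations:
        return {}
    ranked = sorted(pronunciations.items(), key=lambda item: item[1], reverse=True)
    best = ranked[0][1]
    kept = [(pron, prob) for pron, prob in ranked
            if (absolute is None or prob >= absolute)
            and (relmax is None or prob >= relmax * best)]
    return dict(kept[:max_pronunciations(count, k)])


# Invented probabilities for the word "and".
prons = {"ae n d": 0.6, "ax n": 0.3, "n": 0.1}
print(prune(prons, count=100, relmax=0.4))    # relative-to-maximum cutoff
print(prune(prons, count=100, absolute=0.5))  # absolute cutoff
print(prune(prons, count=3))                  # rare word: few variants allowed
```

A relative cutoff adapts to how peaked each word's distribution is, whereas an absolute cutoff applies the same threshold to every word.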


Online Transformation-Based Pronunciation Modeling

  • In theory, a dynamic dictionary could halve error-rates

    • Using an “oracle dictionary” for each utterance in Switchboard reduces error by 43%

    • Using, e.g., multi-word context or hidden speaking-mode states may capture some of this gain.

    • Actual results less dramatic, of course!



Five Problems Yet to Be Solved

  • Confusability and Discriminability

  • Hard Decisions

  • Consistency

  • Information Structure

  • Moving Beyond Phones as Basic Units


Five Problems Yet to Be Solved

  • Confusability and Discriminability

    • New pronunciations can create homophones not only with other words, but with parts of words.

    • Few exact metrics exist to measure confusion


Five Problems Yet to Be Solved

  • Hard Decisions

    • Forced-Viterbi throws away good, but “second-best” representations.

    • N-best would avoid this (Mokbel and Jouvet), but is problematic for large-vocabulary tasks

    • DTs also introduce hard decisions and data-splitting


Five Problems Yet to Be Solved

  • Consistency

    • Current ASR works word-by-word w/o picking up on long-term patterns (e.g., stretches of fast speech, consistent patterns like dialect, speaker)

    • Hidden speech-mode variable helps, but data is perhaps too sparse for dialect-dependent states.


Five Problems Yet to Be Solved

  • Information Structure

    • Language is about the message!

    • Hence, not all words are pronounced equal

    • Confounding variables:

      • Prosody & intonation (emphasis, de-accenting)

      • Position of word in utterance (beginning or end)

      • Given vs. new information; Topic/focus, etc.

      • First-time use vs. repetitions of a word


Five Problems Yet to Be Solved

  • Moving Beyond Phones as Basic Units

    • Other types of units

      • “Fenones”

      • Hybrid phones [x+y] for /x/ → /y/ rules

    • Detecting (changes in) distinctive features

      • E.g., [ax] → {[+voicing,+nasality], [+voicing,+nasality,+back], [+voicing,+back], …}

      • (cf. Autosegmental & Non-linear phonology?)


Conclusions

  • An ideal model would:

    • Be dynamic and adaptive in dictionary use

    • Integrate knowledge of previously heard pronunciation patterns from that speaker

    • Incorporate higher-level factors (e.g., speaking rate, semantics of the message) to predict changes from the canonical pronunciation

    • (Perhaps) operate on a sub-phonetic level, too.