A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition

Dr. Eric Fosler-Lussier

Presentation for CiS 788

Overview
  • Our task: moving from “read speech recognition” to recognizing spontaneous conversational speech
  • Two basic approaches for modeling pronunciation variation
    • Encoding linguistic knowledge to pre-specify possible alternative pronunciations of words
    • Deriving alternatives directly from a pronunciation corpus.
  • Purposes of this tutorial
    • Explain basic linguistic concepts in phonetics and phonology
    • Outline several pronunciation modeling strategies
    • Summarize promising recent research directions.
Pronunciations & Pronunciation Modeling
  • Why sub-word units?
    • Data sparseness at word level
    • Intermediate level allows extensible vocabulary
  • Why phone(me)s?
    • Available dictionaries/orthographies assume this unit
    • Research suggests humans use this unit
    • Phone inventory more manageable than syllables, etc. (in e.g., English)
Statistical Underpinnings for Pronunciation Modeling
  • In the whole-word approach, we could find the most likely utterance (word-string) M* given the perceived signal:

M* = argmax_M P(M|X) = argmax_M P(X|M) P(M)

(by Bayes’ rule, dropping the constant P(X))

Statistical Underpinnings for Pronunciation Modeling
  • With independence assumptions, we can use the following approximation:

M* = argmax_M [ max_Q PA(X|Q) · PQ(Q|M) ] · PL(M)

where Q ranges over phone-state sequences.
Statistical Underpinnings for Pronunciation Modeling
  • PA(X|Q): the acoustic model
    • continuous sound (vector)s to discrete phone (state)s
    • Analogous to “categorical perception” in human hearing
  • PQ(Q|M): the pronunciation model
    • Probability of phone states given words
    • Also includes context-dependence & duration models
  • PL(M): the language model
    • The prior probability of word sequences
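
To make the decomposition concrete, below is a minimal Python sketch that scores a few candidate word strings under toy PA, PQ, and PL tables and picks the argmax in the log domain. Every word, phone string, and probability in it is hypothetical.

    import math

    # Toy log-probability tables; all words, phones, and numbers are hypothetical.
    # PL(M): prior over candidate word strings.
    lm_logprob = {("the", "cat"): math.log(0.6), ("the", "cap"): math.log(0.4)}

    # PQ(Q|M): phone-state sequences allowed for each word string.
    pron_logprob = {
        ("the", "cat"): {("dh", "ax", "k", "ae", "t"): math.log(0.9),
                         ("dh", "ax", "k", "ae", "dx"): math.log(0.1)},
        ("the", "cap"): {("dh", "ax", "k", "ae", "p"): math.log(1.0)},
    }

    # PA(X|Q): acoustic score of one fixed observation X under a phone string.
    def acoustic_logprob(phones):
        per_phone = {"t": -1.0, "dx": -0.5, "p": -3.0}   # hypothetical scores
        return sum(per_phone.get(ph, -0.2) for ph in phones)

    # M* = argmax_M [ max_Q PA(X|Q) PQ(Q|M) ] PL(M); products become log sums:
    best_score, best_words = max(
        (acoustic_logprob(q) + pq + lm_logprob[m], m)
        for m, prons in pron_logprob.items()
        for q, pq in prons.items()
    )
    print(best_words, best_score)   # ('the', 'cat') under these toy numbers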
Statistical Underpinnings for Pronunciation Modeling

[Figure: the acoustic, pronunciation, and language models working in sequence]

Linguistic Formalisms & Pronunciation Variation
  • Phones & Phonemes
  • (Articulatory) Features
  • Phonological Rules
  • Finite State Transducers
Linguistic Formalisms & Pronunciation Variation
  • Phones & Phonemes
    • Phones: Types of (uttered) segments
      • E.g., [p]: unaspirated voiceless labial stop, as in speak [spik]
      • vs. [ph]: aspirated voiceless labial stop, as in peak [phik]
    • Phonemes: Mental abstractions of phones
      • /p/ in speak = /p/ in peak to naïve speakers
    • ARPABET: between phones & phonemes
    • SAMPAbet: closer to phones, but not perfect…
SAMPA for American English

Selected Consonants (SAMPA, example, transcription, ARPA):

    tS   chin     tSIn      (ch)
    dZ   gin      dZIn      (jh)
    T    thin     TIn       (th)
    D    this     DIs       (dh)
    Z    measure  "mEZ@`    (zh)
    N    thing    TIN       (ng)
    j    yacht    jAt       (y)
    4    butter   bV4@`     (dx)

Selected Vowels (SAMPA, example, transcription, ARPA):

    {    pat      p{t       (ae)
    A    pot      pAt       (aa)
    V    cut      kVt       (uh) !
    U    put      pUt       (uh) !
    aI   rise     raIz      (ay)
    3`   furs     f3`z      (er)
    @    allow    @laU      (ax)
    @`   corner   kOrn@`    (axr)

(The “!” flags that SAMPA distinguishes V and U while both are listed with ARPA (uh) here.)
Linguistic Formalisms & Pronunciation Variation
  • (Articulatory) Features
    • Describe where (place) and how (manner) a sound is made, and whether it is voiced.
    • Typical features (dimensions) for vowels include height, backness, & roundness
  • (Acoustic) Features
    • Vowel features actually correlate better with formants than with actual tongue position
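
As a quick illustration (a sketch using standard textbook feature values, not taken from the slides), feature bundles can be represented as small dictionaries, and phones that differ in a single feature are natural candidates for confusion and variation:

    # Articulatory feature bundles as dictionaries (textbook values).
    consonants = {
        "p": {"place": "labial",   "manner": "stop",      "voiced": False},
        "b": {"place": "labial",   "manner": "stop",      "voiced": True},
        "s": {"place": "alveolar", "manner": "fricative", "voiced": False},
        "n": {"place": "alveolar", "manner": "nasal",     "voiced": True},
    }
    vowels = {
        "i": {"height": "high", "backness": "front", "round": False},
        "u": {"height": "high", "backness": "back",  "round": True},
        "A": {"height": "low",  "backness": "back",  "round": False},
    }

    def feature_distance(a, b, table):
        # Number of feature dimensions on which two phones disagree.
        return sum(table[a][k] != table[b][k] for k in table[a])

    print(feature_distance("p", "b", consonants))   # 1: they differ only in voicing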
Linguistic Formalisms & Pronunciation Variation
  • Phonological Rules
    • Used to classify, explain, and predict phonetic alternations in related words: write (t) vs. writer (dx)
    • May also be useful for capturing differences in speech mode (e.g., dialect, register, rate)
    • Example: flapping in American English
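
A minimal sketch of such a rule in Python, operating on space-separated ARPABET-style phone strings. The stress-marking convention (a trailing digit, 1 = stressed, 0 = unstressed) and the example words are assumptions for illustration:

    import re

    # Flapping, roughly: t -> dx / stressed vowel __ unstressed vowel.
    FLAP = re.compile(r"(?<=1 )t(?= \w+0)")

    def apply_flapping(pron):
        return FLAP.sub("dx", pron)

    print(apply_flapping("w ay1 t er0"))   # writer: 'w ay1 dx er0'
    print(apply_flapping("w ay1 t"))       # write: unchanged, no following vowel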
Linguistic Formalisms & Pronunciation Variation
  • Finite State Transducers
    • [Figure: same example transducer as shown on Tuesday]
Linguistic Formalisms & Pronunciation Variation
  • Useful properties of FSTs
    • Invertible

(thus usable in both production & recognition)

    • Learnable (Oncina, Garcia, & Vidal 1993, Gildea & Jurafsky 1996)
    • Composable
    • Compatible with HMMs
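
The invertibility property is easy to see with a toy transducer, sketched below as a plain dictionary of deterministic arcs. The single state, the labels, and the context-free rewrite are hypothetical simplifications; real systems use weighted FST toolkits.

    # One-state toy transducer: {(state, input_label): (output_label, next_state)}.
    flap_fst = {
        (0, "t"):  ("dx", 0),   # rewrite t as dx (context ignored in this toy)
        (0, "b"):  ("b", 0),
        (0, "ae"): ("ae", 0),
        (0, "er"): ("er", 0),
    }

    def transduce(fst, symbols, start=0):
        out, state = [], start
        for s in symbols:
            label, state = fst[(state, s)]
            out.append(label)
        return out

    def invert(fst):
        # Swap input and output labels: a canonical-to-surface generator
        # becomes a surface-to-canonical recognizer, and vice versa.
        return {(st, out): (inp, nxt) for (st, inp), (out, nxt) in fst.items()}

    print(transduce(flap_fst, ["b", "ae", "t", "er"]))           # ['b','ae','dx','er']
    print(transduce(invert(flap_fst), ["b", "ae", "dx", "er"]))  # ['b','ae','t','er']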
ASR Models: Predicting Variation in Pronunciations
  • Knowledge-Based Approaches
    • Hand-Crafted Dictionaries
    • Letter to Sound Rules
    • Phonological Rules
  • Data-Driven Approaches
    • Baseform Learning
    • Learning Pronunciation Rules
ASR Models: Predicting Variation in Pronunciations
  • Hand-Crafted Dictionaries
    • E.g., CMUdict, Pronlex for American English
    • The most readily available starting point
    • Limitations:
      • Generally only one or two pronunciations per word
      • Does not reflect fast speech, multi-word context
      • May not contain e.g., proper names, acronyms
      • Time-consuming to build for new languages
ASR Models: Predicting Variation in Pronunciations
  • Letter to Sound Rules
    • In English, used to supplement dictionaries
    • In some languages (e.g., Spanish), they may be enough by themselves
    • Can be learned (e.g. by DTs, ANNs)
    • Hard-to-catch Exceptions:
      • Compound-words, acronyms, etc.
      • Loan words, foreign words
      • Proper names (Brands, people, places)
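
Below is a minimal sketch of learning letter-to-sound rules with a decision tree over letter windows, using scikit-learn. The four aligned training words and the "_" silent-letter mark are hypothetical toy data; real systems train on aligned dictionaries with tens of thousands of entries.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier

    def windows(word, size=1):
        # One feature dict per letter: the letter plus its neighbors.
        padded = "#" * size + word + "#" * size
        for i in range(len(word)):
            yield {f"l{j}": padded[i + size + j] for j in range(-size, size + 1)}

    # Aligned (word, one phone per letter) toy data; "_" marks a silent letter.
    data = [("cat", ["k", "ae", "t"]), ("city", ["s", "ih", "t", "iy"]),
            ("cot", ["k", "aa", "t"]), ("ice", ["ay", "s", "_"])]

    X = [w for word, _ in data for w in windows(word)]
    y = [p for _, phones in data for p in phones]

    vec = DictVectorizer()
    tree = DecisionTreeClassifier().fit(vec.fit_transform(X), y)

    # 'c' before 'i' should come out /s/, elsewhere /k/, given this sample:
    print(tree.predict(vec.transform(list(windows("cit")))))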
ASR Models: Predicting Variation in Pronunciations
  • Phonological Rules
    • Useful for modeling e.g., fast speech, likely non-canonical pronunciations
    • Can provide basis for speaker-adaptation
    • Limitations:
      • Requires labeled corpus to learn rule probabilities
      • May over-generalize, creating spurious homophones
      • (Pruning minimizes this)
ASR Models: Predicting Variation in Pronunciations
  • Automatic Baseform Learning

1) Use ASR with “dummy” dictionary to find “surface” phone sequences of an utterance

2) Find canonical pronunciation of utterance (e.g., by forced-Viterbi)

3) Align these two (w/ dynamic programming)

4) Record “surface pronunciations” of words
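
Step 3 is ordinary edit-distance alignment. Here is a minimal sketch with unit substitution/indel costs and a backtrace; the reduced pronunciation of "probably" in the example is hypothetical:

    def align(canon, surface, sub=1, indel=1):
        n, m = len(canon), len(surface)
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1): cost[i][0] = i * indel
        for j in range(1, m + 1): cost[0][j] = j * indel
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = 0 if canon[i-1] == surface[j-1] else sub
                cost[i][j] = min(cost[i-1][j-1] + match,
                                 cost[i-1][j] + indel,   # canonical phone deleted
                                 cost[i][j-1] + indel)   # surface phone inserted
        pairs, i, j = [], n, m                           # backtrace
        while i > 0 or j > 0:
            match = sub if i == 0 or j == 0 or canon[i-1] != surface[j-1] else 0
            if i > 0 and j > 0 and cost[i][j] == cost[i-1][j-1] + match:
                pairs.append((canon[i-1], surface[j-1])); i -= 1; j -= 1
            elif i > 0 and cost[i][j] == cost[i-1][j] + indel:
                pairs.append((canon[i-1], "-")); i -= 1
            else:
                pairs.append(("-", surface[j-1])); j -= 1
        return pairs[::-1]

    # "probably" reduced to something like "probly":
    print(align("p r aa b ax b l iy".split(), "p r aa b l iy".split()))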

ASR Models: Predicting Variation in Pronunciations
  • Limitations of Baseform Learning
    • Limited to single-word learning
    • Ignores multi-word phrases, cross word-boundary effects (e.g., “did you” → “didja”)
    • Misses generalizations across words (e.g., learns flapping separately for each word)
ASR Models: Predicting Variation in Pronunciations
  • Learning Pronunciation Rules
    • Each word has a canonical pronunciation c1 c2 … cj … cn
    • Each canonical phone cj can be realized as some surface phone sij
    • Set of surface pronunciations S: {Si = si1, …, sin}
    • Taking the canonical tri-phone and the last surface phone into account, the probability of a given Si can be estimated:

P(Si) ≈ ∏j P(sij | cj-1, cj, cj+1, si(j-1))
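
A minimal sketch of estimating these conditional probabilities by relative frequency over an aligned corpus. The two aligned pronunciations of "writer" below are hypothetical, and "#" pads the context at word edges:

    from collections import Counter

    aligned = [   # (canonical phones, surface phones), equal length after alignment
        (["w", "ay", "t", "er"], ["w", "ay", "dx", "er"]),
        (["w", "ay", "t", "er"], ["w", "ay", "t", "er"]),
    ]

    num, den = Counter(), Counter()
    for canon, surf in aligned:
        prev_s = "#"
        for j, (c, s) in enumerate(zip(canon, surf)):
            left = canon[j - 1] if j > 0 else "#"
            right = canon[j + 1] if j + 1 < len(canon) else "#"
            ctx = (left, c, right, prev_s)   # canonical tri-phone + last surface phone
            num[(ctx, s)] += 1
            den[ctx] += 1
            prev_s = s

    def prob(s, ctx):
        return num[(ctx, s)] / den[ctx]

    print(prob("dx", ("ay", "t", "er", "ay")))   # 0.5 in this toy corpus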
ASR Models: Predicting Variation in Pronunciations
  • (Machine) Learning Pronunciation Rules
    • Typical ML techniques apply: CART, ANNs, etc.
    • Using features (pre-specified or learned) helps
    • Brill-type rules (e.g., Yang & Martens 2000):
      • A → B / C __ D with P(B|A,C,D) (positive rule)
      • A → not B / C __ D with 1 - P(B|A,C,D) (negative rule)

(Note: equivalent to Two-level rule types 1 & 4)
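
A minimal sketch of applying such rules, in which a negative rule marks a position as blocked before any positive rule can rewrite it. The contexts, probabilities, and the "#" word-boundary symbol are hypothetical:

    import random

    rules = [
        # (A, B, left context C, right context D, P(B|A,C,D), positive?)
        ("t", "dx", "#", "er", 1.0, False),   # negative: never flap word-initial t
        ("t", "dx", "ay", "er", 0.9, True),   # positive: flap t between ay and er
    ]

    def apply_rules(phones, rng=random.Random(0)):
        out, blocked = list(phones), set()
        for a, b, c, d, p, positive in rules:
            for j, ph in enumerate(out):
                left = out[j - 1] if j > 0 else "#"
                right = out[j + 1] if j + 1 < len(out) else "#"
                if ph == a and left == c and right == d and j not in blocked:
                    if positive:
                        if rng.random() < p:
                            out[j] = b
                    else:
                        blocked.add(j)   # negative rule blocks later rewrites here
        return out

    print(apply_rules(["w", "ay", "t", "er"]))   # ['w', 'ay', 'dx', 'er']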

ASR Models: Predicting Variation in Pronunciations
  • Pruning Learned Rules & Pronunciations
    • Vary # of allowed pronunciations by word-frequency

E.g., f(count(w)) = k log(count(w))

    • Use probability threshold for candidate pronunciations
      • Absolute cutoff
      • “Relmax” (relative to maximum) cutoff
    • Use acoustic confidence C(pj,wi) as measure
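
A minimal sketch combining the first three heuristics: a log-frequency budget on the number of variants, an absolute probability cutoff, and a relmax cutoff. All thresholds and the example distribution are hypothetical:

    import math

    def prune(prons, word_count, k=2.0, abs_cutoff=0.05, relmax=0.1):
        # prons: {pronunciation: probability}
        max_p = max(prons.values())
        kept = {p: q for p, q in prons.items()
                if q >= abs_cutoff and q >= relmax * max_p}
        budget = max(1, int(k * math.log(word_count)))   # f(count(w)) = k log(count(w))
        ranked = sorted(kept.items(), key=lambda kv: -kv[1])
        return dict(ranked[:budget])

    prons = {"dh ax": 0.70, "dh ah": 0.20, "dh iy": 0.08, "d ax": 0.02}
    print(prune(prons, word_count=50))   # frequent word: keeps three variants
    print(prune(prons, word_count=3))    # rare word: keeps only the top two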
Online Transformation-Based Pronunciation Modeling
  • In theory, a dynamic dictionary could halve error-rates
    • Using an “oracle dictionary” for each utterance in Switchboard reduces error by 43%
    • Using e.g., multi-word context, hidden speaking-mode states may capture some of this.
    • Actual results less dramatic, of course!
Five Problems Yet to Be Solved
  • Confusability and Discriminability
  • Hard Decisions
  • Consistency
  • Information Structure
  • Moving Beyond Phones as Basic Units
Five Problems Yet to Be Solved
  • Confusability and Discriminability
    • New pronunciations can create homophones not only with other words, but with parts of words.
    • Few exact metrics exist to measure confusion
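
A crude whole-word version of such a metric can be sketched: count pronunciation strings that collide across dictionary entries after expansion. Part-of-word overlaps would additionally need substring matching and are omitted; the expanded entries below are hypothetical:

    from collections import defaultdict

    expanded = {   # hypothetical dictionary after adding learned variants
        "writer": ["w ay t er", "w ay dx er"],
        "rider":  ["r ay d er", "r ay dx er"],
        "latter": ["l ae t er", "l ae dx er"],
        "ladder": ["l ae d er", "l ae dx er"],
    }

    by_pron = defaultdict(set)
    for word, prons in expanded.items():
        for p in prons:
            by_pron[p].add(word)

    homophones = {p: ws for p, ws in by_pron.items() if len(ws) > 1}
    print(homophones)   # {'l ae dx er': {'latter', 'ladder'}}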
Five Problems Yet to Be Solved
  • Hard Decisions
    • Forced-Viterbi alignment throws away good but “second-best” representations.
    • N-best lists would avoid this (Mokbel and Jouvet), but are problematic for large vocabularies
    • DTs also introduce hard decisions and data-splitting
Five Problems Yet to Be Solved
  • Consistency
    • Current ASR works word-by-word w/o picking up on long-term patterns (e.g., stretches of fast speech, consistent patterns like dialect, speaker)
    • Hidden speech-mode variable helps, but data is perhaps too sparse for dialect-dependent states.
Five Problems Yet to Be Solved
  • Information Structure
    • Language is about the message!
    • Hence, not all words are pronounced equal
    • Confounding variables:
      • Prosody & intonation (emphasis, de-accenting)
      • Position of word in utterance (beginning or end)
      • Given vs. new information; Topic/focus, etc.
      • First-time use vs. repetitions of a word
Five Problems Yet to Be Solved
  • Moving Beyond Phones as Basic Units
    • Other types of units
      • “Fenones”
      • Hybrid phones [x+y] for /x/ → /y/ rules
    • Detecting (changes in) distinctive features
      • E.g., [ax] → {[+voicing,+nasality], [+voicing,+nasality,+back], [+voicing,+back], …}
      • (cf. Autosegmental & Non-linear phonology?)
Conclusions
  • An ideal model would:
    • Be dynamic and adaptive in dictionary use
    • Integrate knowledge of previously heard pronunciation patterns from that speaker
    • Incorporate higher-level factors (e.g., speaking rate, semantics of the message) to predict changes from the canonical pronunciation
    • (Perhaps) operate on a sub-phonetic level, too.