
Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks

Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks. Karen Livescu JHU Workshop Planning Meeting April 16, 2004 Joint work with Jim Glass. Preview. The problem of pronunciation variation for automatic speech recognition (ASR)


Presentation Transcript


  1. Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks • Karen Livescu • JHU Workshop Planning Meeting, April 16, 2004 • Joint work with Jim Glass

  2. Preview • The problem of pronunciation variation for automatic speech recognition (ASR) • Traditional methods: phone-based pronunciation modeling • Proposed approach: pronunciation modeling via multiple sequences of linguistic features • A natural framework: dynamic Bayesian networks (DBNs) • A feature-based pronunciation model using DBNs • Proof-of-concept experiments • Ongoing/future work • Integration with SVM feature classifiers

  3. The problem of pronunciation variation • Conversation from the Switchboard speech database: • “neither one of them”: • “decided”: • “never really”: • “probably”: • Noted as an obstacle for ASR (e.g., [McAllester et al. 1998])

  4. The problem of pronunciation variation (2) • More acute in casual/conversational than in read speech • Transcribed pronunciations of “probably”, each followed by the number listed beside it on the slide:
       p r aa b iy        2
       p r ay             1
       p r aw l uh        1
       p r ah b iy        1
       p r aa lg iy       1
       p r aa b uw        1
       p ow ih            1
       p aa iy            1
       p aa b uh b l iy   1
       p aa ah iy         1

  5. Preview • The problem of pronunciation variation for automatic speech recognition (ASR) • Traditional methods: phone-based pronunciation modeling • Proposed approach: pronunciation modeling via multiple sequences of linguistic features • A natural framework: dynamic Bayesian networks (DBNs) • A feature-based pronunciation model using DBNs • Proof-of-concept experiments • Ongoing/future work • Integration with SVM feature classifiers

  6. Traditional solution: phone-based pronunciation modeling • Transformation rules are typically of the form p1 → p2 / p3 __ p4 (where pi may be null) • E.g. the [p] insertion rule: Ø → p / m __ {non-labial} • Example: warmth, dictionary / w ao r m th / → [ w ao r m p th ] • Rules are derived from • Linguistic knowledge (e.g. [Hazen et al. 2002]) • Data (e.g. [Riley & Ljolje 1996]) • Powerful, but: • Sparse data issues • Increased inter-word confusability • Some pronunciation changes not well described • Limited success in recognition experiments
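The [p] insertion rule on this slide can be sketched in a few lines of Python. This is only an illustration of how a context-dependent rewrite rule applies to a phone string: the function name `insert_p_after_m` and the contents of the `LABIALS` set are assumptions made for the example, not taken from the talk, and the sketch assumes the rule does not fire when [m] is word-final.

```python
# Assumed set of labial phones; the slide only says "{non-labial}".
LABIALS = {"p", "b", "m", "f", "v", "w"}

def insert_p_after_m(phones):
    """Apply the rule  Ø -> p / m __ {non-labial}  to a list of phone symbols."""
    out = []
    for i, phone in enumerate(phones):
        out.append(phone)
        following = phones[i + 1] if i + 1 < len(phones) else None
        # Insert [p] only when [m] is followed by a non-labial phone.
        if phone == "m" and following is not None and following not in LABIALS:
            out.append("p")
    return out

baseform = "w ao r m th".split()
print(insert_p_after_m(baseform))  # ['w', 'ao', 'r', 'm', 'p', 'th']
```

Each such rule adds alternative pronunciations to the dictionary, which is one way the inter-word confusability listed above can grow.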

  7. Preview • The problem of pronunciation variation for automatic speech recognition (ASR) • Traditional methods: phone-based pronunciation modeling • Proposed approach: pronunciation modeling via multiple sequences of linguistic features • A natural framework: dynamic Bayesian networks (DBNs) • A feature-based pronunciation model using DBNs • Proof-of-concept experiments • Ongoing/future work • Integration with SVM feature classifiers

  8. A feature-based approach • Speech can alternatively be described using sub-phonetic features • (This feature set based on articulatory phonology [Browman & Goldstein 1990]) • Figure labels: TB-LOC, TT-LOC, TB-OPEN, VELUM, TT-OPEN, LIP-OP, VOICING

  9. Feature-based pronunciation modeling • instruments [ih_n s ch em ih_n n s] • wants [w aa_n t s] -- Phone deletion?? • several [s eh r v ax l] -- Exchange of two phones??? • everybody [eh r uw ay] • warmth [ w ao r m p th ]: lips & velum desynchronize • Feature table from the slide (labeled “dictionary”):
       voicing       V    V    V    V    !V
       velum         Clo  Clo  Clo  Op   Clo
       lip opening   Nar  Mid  Mid  Clo  Mid
       ...           ...  ...  ...  ...  ...

  10. Related work • Much work on classifying features: • [King et al. 1998] • [Kirchhoff 2002] • [Chang, Greenberg, & Wester 2001] • [Juneja & Espy-Wilson 2003] • [Omar & Hasegawa-Johnson 2002] • [Niyogi & Burges 2002] • Less work on “non-phonetic” relationship between words and features • [Deng et al. 1997], [Richardson & Bilmes 2000]: “fully-connected” state space via hidden Markov model • [Kirchhoff 1996]: features independent, except for synchronization at syllable boundaries • [Carson-Berndsen 1998]: bottom-up, constraint-based approach • Goal: Develop a general feature-based pronunciation model • Capable of using known independence assumptions • Without overly strong assumptions

  11. Approach: Main Ideas ([HLT/NAACL-2004]) • Begin with usual assumption: Each word has one or more underlying pronunciations, given by a dictionary • Surface (actual) feature values can stray from underlying values via: • Substitution – modeled by confusion matrices P(s|u) • Asynchrony • Assign index (counter) to each feature, and allow index values to differ • Apply constraints on the difference between the mean indices of feature subsets • Natural to implement using graphical models, in particular dynamic Bayesian networks (DBNs) • Dictionary table for warmth from the slide:
       index         0    1    2    3    4
       voicing       V    V    V    V    !V
       velum         Off  Off  Off  On   Off
       lip opening   Nar  Mid  Mid  Clo  Mid
       ...           ...  ...  ...  ...  ...
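The asynchrony idea (one index per feature, with a constraint on the difference between the mean indices of feature subsets) might be sketched as follows. The function names and the hard threshold `max_async` are assumptions made for illustration; the talk does not say whether the constraint is a hard limit or a probability distribution over degrees of asynchrony.

```python
def mean(values):
    return sum(values) / len(values)

def async_degree(indices, subset_a, subset_b):
    """Absolute difference between the mean indices of two feature subsets."""
    mean_a = mean([indices[f] for f in subset_a])
    mean_b = mean([indices[f] for f in subset_b])
    return abs(mean_a - mean_b)

def is_allowed(indices, subset_a, subset_b, max_async=1):
    """Hypothetical hard constraint: subsets may drift apart by at most max_async."""
    return async_degree(indices, subset_a, subset_b) <= max_async

# Each feature keeps its own position (index) in the word's dictionary entry.
# Here the lips lag one position behind voicing and velum.
indices = {"voicing": 3, "velum": 3, "lip_opening": 2}
print(async_degree(indices, ["lip_opening"], ["voicing", "velum"]))
print(is_allowed(indices, ["lip_opening"], ["voicing", "velum"]))
```

When all indices are equal, the features are fully synchronous and the surface form follows the dictionary entry position by position, apart from substitutions.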

  12. Aside: Dynamic Bayesian networks • Bayesian network (BN): Directed-graph representation of a distribution over a set of variables • Graph node ↔ variable + its distribution given parents • Graph edge ↔ “dependency” • Dynamic Bayesian network (DBN): BN with a repeating structure • Example: HMM • Uniform algorithms for (among other things) • Finding the most likely values of a subset of the variables, given the rest (analogous to Viterbi algorithm for HMMs) • Learning model parameters via EM • Figure labels: BN example with nodes “speaking rate”, “# questions”, “lunchtime”; HMM example with nodes S and O in frame i-1 and frame i
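Since the slide uses the HMM as its example DBN and mentions the Viterbi algorithm, a minimal Viterbi decoder may help make "finding the most likely values of the hidden variables" concrete. This is a generic textbook sketch, not the inference procedure used in the talk; the two-state model and all probabilities below are made up for the example.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path and its joint probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        layer = {}
        for s in states:
            prob, arg = max((prev[r][0] * trans_p[r][s], r) for r in states)
            layer[s] = (prob * emit_p[s][o], arg)
        best.append(layer)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for layer in reversed(best[1:]):
        path.append(layer[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Toy two-state HMM (all numbers are illustrative assumptions).
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}

path, prob = viterbi(["x", "y", "y"], states, start_p, trans_p, emit_p)
print(path, prob)
```

In a DBN with several hidden variables per frame, the same max-product idea applies, but over joint settings of the variables rather than a single state.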

  13. Preview • The problem of pronunciation variation for automatic speech recognition (ASR) • Traditional methods: phone-based pronunciation modeling • Proposed approach: pronunciation modeling via multiple sequences of linguistic features • A natural framework: dynamic Bayesian networks (DBNs) • A feature-based pronunciation model using DBNs • Proof-of-concept experiments • Ongoing/future work • Integration with SVM feature classifiers

  14. Approach: A DBN-based Model • Example DBN using 3 features: • (Simplified to show important properties! Implemented model has additional variables.) • Figure annotation: “encodes baseform pronunciations” • Table from the slide:
              CLO  CRI  NAR  N-M  MID  …
       CLO    .7   .2   .1   0    0    …
       CRI    0    .7   .2   .1   0    …
       NAR    0    0    .7   .2   .1   …
       …      …    …    …    …    …    …
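As a rough illustration of how a table like the one on this slide could be used, the sketch below treats it as a substitution (confusion) matrix P(s|u). It assumes that rows are underlying values and columns are surface values, that substitutions are independent across frames, and that the elided "…" entries are zero for the three rows shown; none of this is stated on the slide.

```python
SURFACE_VALUES = ["CLO", "CRI", "NAR", "N-M", "MID"]

# Rows copied from the slide; interpreting them as P(surface | underlying) is an assumption.
P_S_GIVEN_U = {
    "CLO": [0.7, 0.2, 0.1, 0.0, 0.0],
    "CRI": [0.0, 0.7, 0.2, 0.1, 0.0],
    "NAR": [0.0, 0.0, 0.7, 0.2, 0.1],
}

def substitution_prob(underlying, surface):
    """Probability of a surface value sequence given the underlying sequence,
    assuming each frame is substituted independently."""
    prob = 1.0
    for u, s in zip(underlying, surface):
        prob *= P_S_GIVEN_U[u][SURFACE_VALUES.index(s)]
    return prob

# A closure realized slightly more open in the second frame.
print(substitution_prob(["CLO", "CLO", "NAR"], ["CLO", "CRI", "NAR"]))
```

The banded shape of the table means a value is most likely realized as itself and otherwise only as a slightly more open neighbor.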

  15. Approach: A DBN-based Model (2) • “Unrolled” DBN: . . . • Parameter learning via Expectation Maximization (EM) • Training data • Articulatory databases • Detailed phonetic transcriptions

  16. Preview • The problem of pronunciation variation for automatic speech recognition (ASR) • Traditional methods: phone-based pronunciation modeling • Proposed approach: pronunciation modeling via multiple sequences of linguistic features • A natural framework: dynamic Bayesian networks (DBNs) • A feature-based pronunciation model using DBNs • Proof-of-concept experiments • Ongoing/future work • Integration with SVM feature classifiers

  17. A proof-of-concept experiment • Task: classify an isolated word from the Switchboard corpus, given a detailed phonetic transcription (from ICSI Berkeley, [Greenberg et al. 1996]) • Convert transcription into feature vectors Si, one per 10 ms • For each word w in a 3k+ word vocabulary, compute P(w|Si) • Output w* = arg max_w P(w|Si) • Used GMTK [Bilmes & Zweig 2002] for inference and EM parameter training • Note: ICSI transcription is somewhere between phones and features; not ideal, but as good as we have
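The decision rule w* = arg max_w P(w|Si) can be illustrated with a small sketch. It assumes that some model has already produced a log joint score log P(Si, w) for each vocabulary word; the function name `classify` and the three scores are hypothetical, and this does not use or imitate the GMTK interface.

```python
import math

def classify(log_joint):
    """Given word -> log P(S, w), return (best word, posterior P(w | S) per word)."""
    top = max(log_joint.values())
    # Subtract the maximum before exponentiating for numerical stability.
    unnorm = {w: math.exp(v - top) for w, v in log_joint.items()}
    total = sum(unnorm.values())
    posterior = {w: v / total for w, v in unnorm.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

# Hypothetical scores for one transcribed token.
scores = {"probably": -12.0, "problem": -15.5, "property": -17.0}
best, posterior = classify(scores)
print(best, round(posterior[best], 3))
```

Because normalization does not change the ranking, the arg max could be taken directly over the joint scores; the posterior is computed here only to match the P(w|Si) notation on the slide.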

  18. Results (development set)
       Model                                  Error rate (%)   Failure rate (%)
       Baseforms only (1.7 prons/word)        63.6             61.2
       + phonological rules (4 prons/word)    50.3             47.9
       synchronous feature-based              35.2             24.8
       asynchronous feature-based             29.7             16.4
       asynch. + segmental constraint         32.7             19.4
       asynch. + segmental constraint + EM    27.8             19.4
     • What didn’t work? • Some deletions ([ax], [t]) • Vowel retroflexion • Alveolar + [y] → palatal • (Cross-word effects) • (Speech/transcription errors…) • When did asynchrony matter? • Vowel nasalization & rounding • Nasal + stop → nasal • Some schwa deletions • instruments → [ih_n s ch em ih_n n s] • everybody → [eh r uw ay]

  19. Sample Viterbi path everybody [ eh r uw ay ]

  20. Ongoing/future work • Trainable synchrony constraints ([ICSLP 2004?]) • Context-dependent distributions for underlying (Ui) and surface (Si) feature values • Extension to more complex tasks (multi-word sequences, larger vocabularies) • Implementation in a complete recognizer (cf. [Eurospeech 2003]) • Articulatory databases for parameter learning/testing • Can we use such a model to learn something about speech?

  21. Integration with feature classifier outputs • Use (hard) classifier decisions as observations for Si • Convert classifier scores to posterior probabilities and use as “soft evidence” for Si • Landmark-based classifier outputs to DBN Si’s: • Convert landmark-based features to one feature vector/frame • (Possibly) convert from SVM feature set to DBN feature set • Figure label: (rest of model)
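One way to picture the "soft evidence" option is sketched below. The slide does not say how scores would be converted to posteriors; the softmax used here is simply one common choice, and both function names, the scores, and the model belief are hypothetical. Soft evidence is applied by weighting the model's belief over the values of Si with the classifier-derived probabilities and renormalizing.

```python
import math

def softmax(scores):
    """Map raw classifier scores per feature value to probabilities (assumed conversion)."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def apply_soft_evidence(model_belief, evidence):
    """Weight the model's distribution over S_i by the evidence and renormalize."""
    unnorm = {k: model_belief[k] * evidence.get(k, 0.0) for k in model_belief}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

# Hypothetical classifier scores for one frame of the lip-opening feature.
evidence = softmax({"CLO": 2.0, "NAR": 0.0, "MID": -1.0})
# Hypothetical belief from the rest of the model before seeing the classifier output.
belief = {"CLO": 0.2, "NAR": 0.5, "MID": 0.3}
print(apply_soft_evidence(belief, evidence))
```

With hard decisions, the evidence vector would instead put all its mass on a single value, which removes every other value of Si from consideration in that frame.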

  22. Acknowledgment • Jeff Bilmes, U. Washington

  23. Thank you!

  24. GRAVEYARD

  25. Background: Continuous Speech Recognition • Given waveform with acoustic features A, find most likely word string: [equation not preserved in transcript] • Assuming U* much more likely than all other U: [equation not preserved in transcript] • Equation annotations on the slide: Bayes’ Rule; acoustic model; pronunciation model; language model; possible pronunciations (typically phone strings)
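The equations on this slide did not survive in the transcript. The standard decomposition that matches the slide's annotations is sketched below as a reconstruction, not a copy of the original. It assumes W denotes the word string, U a possible pronunciation, and that the acoustics depend on W only through U; the symbol W is an assumption, since the transcript preserves only A, U and U*.

```latex
\begin{align}
W^{*} &= \arg\max_{W} P(W \mid A) \\
      &= \arg\max_{W} P(A \mid W)\, P(W)
         && \text{(Bayes' rule; } P(A) \text{ does not depend on } W\text{)} \\
      &= \arg\max_{W} \sum_{U}
         \underbrace{P(A \mid U)}_{\text{acoustic model}}\,
         \underbrace{P(U \mid W)}_{\text{pronunciation model}}\,
         \underbrace{P(W)}_{\text{language model}} \\
      &\approx \arg\max_{W} \max_{U} P(A \mid U)\, P(U \mid W)\, P(W)
         && \text{(if } U^{*} \text{ dominates the sum)}
\end{align}
```

The last step replaces the sum over pronunciations with its largest term, which is what the slide's assumption about U* permits.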

  26. Example: “warmth” → “warmpth” • Phone-based view: Brain: Give me a []! Lips, tongue, velum, glottis: Right on it, sir! • (Articulatory) feature-based view: Brain: Give me a []! Velum, glottis: Right on it, sir! Lips: Huh? Tongue: Umm…yeah, OK.

  27. Graphical models for hidden feature modeling • Most ASR approaches use hidden Markov models (HMMs) and/or finite-state transducers (FSTs) • Efficient and powerful, but limited • Only one state variable per time frame • Graphical models (GMs) allow for • Arbitrary numbers of variables and dependencies • Standard algorithms over large classes of models • Straightforward mapping between feature-based models and GMs • Potentially large reduction in number of parameters • GMs for ASR: • Zweig (e.g. PhD thesis, 1998), Bilmes (e.g. PhD thesis, 1999), Stephenson (e.g. Eurospeech 2001) • Feature-based ASR with GMs suggested by Zweig, but not previously investigated

  28. Background • Brief intro to ASR • Words written in terms of sub-word units, acoustic models compute probability of acoustic (spectral) features given sub-word units or vice versa • Pronunciation model: mapping between words and strings of sub-word units

  29. Possible solution? • Allow every pronunciation in some large database • Unreliable probability estimation due to sparse data • Unseen words • Increased confusability

  30. Phone-based pronunciation modeling (2) • Generalize across words • But: • Data still sparse • Still increased confusability • Some pronunciation changes not well described by phonetic rules • Limited gains in speech recognition experiments

  31. Approach • Begin with usual assumption that each word has one or more “target” pronunciations, given by the dictionary • Model the evolution of multiple feature streams, allowing for: • Feature changes on a frame-by-frame basis • Feature desynchronization • Control of asynchrony—more “synchronous” feature configurations are preferable • Dynamic Bayesian networks (DBNs): Efficient parameterization and computation when state can be factored
