Underspecified feature models for pronunciation variation in ASR

Underspecified feature models for pronunciation variation in ASR Eric Fosler-Lussier The Ohio State University Speech & Language Technologies Lab ITRW - Speech Recognition & Intrinsic Variation 20 May 2006

Introduction Why features? Role of transcription Approaches Vision Fill in the blanks • 3, 6, __, 12, 15, __, 21, 24 • A B C __ E F __ H • You’re going to Toulouse? Drink a bottle of _____ for me! • What’s the red object? We’re very good atfilling in the blankswhen we havecontext! Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Filling in the blanks: missing data • Missing data approaches have been used to integrate over noisy acoustics Wang & Hu 06 Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Decode this! (brackets indicate options) s iy n y {ah,ax,axr,er} {l,r} {eh,ih,iy} s er ch {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d} Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Decode this! (brackets indicate options) s iy n y {ah,ax,axr,er} senior {l,r} {eh,ih,iy} s er ch research {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}associate Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Decode this! (brackets indicate options) s iy n y {ah,ax,axr,er}senior {l,r} {eh,ih,iy} s er ch research {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}associate dictionary pronunciation Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Decode this! (brackets indicate options) s iy ny {ah,ax,axr,er}senior {l,r} {eh,ih,iy}s er chresearch {ah,ax}s ow{s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}associate dictionary pronunciation as marked by transcribers (Buckeye Corpus of Speech) Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision What do these tasks have in common? • Recovering from erroneous information? • Context plays a big role in helping “clean up” Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision What do these tasks have in common? • Recovering from erroneous information? • Context plays a big role in helping “clean up” • Recovering from incomplete information! • We should be treating pronunciation variation as a missing data problem • Integrate over “missing” phonological features • How much information do you need to decode words? • Particularly taking into account the context of the word, syllabic context of phones, etc… • Information theory problem Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Outline • Problems with phonetic representations of variation • Potential advantages of phonological features • Re-examining the role of phonetic transcription • Phonological feature approaches to ASR • Feature attribute detection • Feature combination methods • Learning to (dis-)trust features • A challenge for the future Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision “The Case Against The Phoneme”Homage to Ostendorf (ASRU 99) • Four major indications that phonetic modeling of variation is not appropriate: Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision “The Case Against The Phoneme”Homage to Ostendorf (ASRU 99) • Four major indications that phonetic modeling of variation is not appropriate: • Lack of progress on spontaneous speech WER • McAllaster et al (98): 50% improvement possible • Finke & Waibel (97): 6% WER reduction Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision “The Case Against The Phoneme”Homage to Ostendorf (ASRU 99) • Four major indications that phonetic modeling of variation is not appropriate: • Lack of progress on spontaneous speech WER • Independence of decisions in phone-based models • When pronunciation variation is modeled on phone-by-phone level, unusual baseforms are often created • Word-based learning fails to generalize across words Riley et al 98 Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision “The Case Against The Phoneme”Homage to Ostendorf (ASRU 99) • Four major indications that phonetic modeling of variation is not appropriate: • Lack of progress on spontaneous speech WER • Independence of decisions in phone-based models • Lack of granularity • Triphone contexts mean a symbolic change in phone can affect 9 HMM states (min 90 msec) • Much variation is already handled by triphone context Saraçlar et al 00 Jurafsky et al 01 Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision “The Case Against The Phoneme”Homage to Ostendorf (ASRU 99) • Four major indications that phonetic modeling of variation is not appropriate: • Lack of progress on spontaneous speech WER • Independence of decisions in phone-based models • Lack of granularity • Difficulty in transcription • Phonetic transcription is expensive and time consuming • Many decisions difficult to make for transcribers Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Using phonological features • Finer granularity • Some phonological changes don’t result in canonical phones for a language • English: uw can sometimes be fronted (toot) • Common enough: TIMIT introduced a special phone (ux) • Symbol change loses all commonality between phones (uw->ux) • Handling odd phonological effects • Phone deletions: many “deletions” really leave small traces of coarticulation on neighboring segments • E.g. vowel nasalization with nasal deletion • Features may provide basis for cross-lingual recognition • International Phonetic Alphabet Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Issues with phonological features • Interlingua: “high vowels in English are not the same as high vowels in Japanese” • Richard Wright, lunch Wednesday, ICASSP 2006 • Concept of “independent directions” false • Correlation of feature values • Distances no longer euclidean among feature dimensions • Dealing with feature spreading • Even more difficulty in transcription • (but: Karen Livescu’s group, JHU workshop 2006) • Articulatory vs. acoustic features • No two definitions are exactly the same (see Richard’s talk) Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Phonetic transcription • There have been a number of efforts to transcribe speech phonetically • American English • TIMIT (4 hr read speech) • Switchboard (4 hr spontaneous speech) • Buckeye Corpus (40 hr spontaneous speech) http://buckeyecorpus.osu.edu • ASR researchers have found it difficult to utilize phonetic transcriptions directly Riley et al 99 Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision ASR & Phonetic Transcription • Saraclar & Khudanpur (04) examined the means of acoustic models where canonical phone /x/ was transcribed as [y] over all pairs x:y • Compared means of x:y to x:x, y:y • Data showed that x:y means often fell between x:x and y:y, sometimes closer to x:x • Another view: data from Buckeye Corpus • /ae/ is sometimes transcribed as [eh] • Examined 80 vowels from one speaker • Formant frequencies from center of vowel Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

higher than eh mixed ae/eh ae territory opposite side of ae from eh Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Can you trust transcription? • Perceptual marking ≠ acoustic measurement • Can’t take transcription at face value • What are the transcribers are trying to tell us? • This phone doesn’t sound like a canonical phone • Perhaps we can look at commonalities across canonical/transcribed phone • ae:eh -> front vowel (& not high?) • Phonological features may help us represent transcription differences. Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision single dimensioncommon manner, voicingvariants morecommon than place Variation in single-phone changes • Compared canonical vs. transcribed consonants with single-phone substitutions in Switchboard, Buckeye • Differences in manner, place, voicing counted Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Recent approaches to feature modeling in ASR • Since 90’s there has been increased interest in phonological feature modeling • Deng et al (92 ff), Kirchhoff (96 ff) • Current directions of research • Approaches for detecting phonological features from data • Methods of combining phonological features • Knowing when to ignore information Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Feature detection methods • Frame-level decisions • Most common: artificial neural network methods • Input: various flavors of spectral/cepstral representations • Output: estimating posterior P(feature|acoustics) on a per-frame level • Recent competitor: support vector machines • Typically used for binary decision problems • Segmental-level decisions: integrate over time • HMM detectors • Hybrid ANN/Dynamic Bayesian Network Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Binary vs. n-ary features • Features can either be described as binary or n-ary if they can contrast • Binary: /t/ : +stop -fricative … • N-ary: /t/ : manner=stop • No real conclusion on whether which is better • Binary more matched to SVM learning • N-ary allows for discrimination among classes • Should a segment be allowed to be +stop +fricative? • Anecdotally (our lab) we find n-ary features slightly better Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Hierarchical representations • Phonological features are not truly independent • Chang et al (01): Place prediction improves if manner is known • ANN predicts P(place=x|manner=y,X) vs P(place=x|X) • Suggests need for hierarchical detectors • Rajamanohar & Fosler-Lussier (05): Cascading errors make chained decisions worse • Better to jointly model P(place=x,manner=y|X), or even derive P(place=x|X) from phone probabilities • Frankel et al (04): Hierarchy can be integrated as additional dependencies in DBN Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Combining features into higher-level structures • Once you have (frame-level) estimates of phonological features, need to combine • Temporal integration: Markov structures • Phonetic spatial integration: combining into higher-level units (phones, syllables, words) • Differences in methodologies: • spatial first, then temporal • joint/factored spatio-temporal integration • phone-level temporal integration with spatial rescoring Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Combining features into higher-level structures • Tandem ANN/HMM Systems • ANN feature posterior estimates are used as replacements for MFCCs for Mixture of Gaussians HMM system • We find decorrelation of features (via PCA) necessary to keep models well conditioned • Lattice rescoring with Landmarks • Maximum entropy models for local word discrimination • SVMs used as local features for MaxEnt model. • Dynamic Bayesian Models • Model asynchrony as a hidden variable • SVM outputs used as observations of features Launay et al 02 Hasegawa-Johnson et al 05 Livescu 05 Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Combining features intohigher-level structures • Conditional random fields • CRFs jointly model spatio-temporal integration • Probability expressed in terms of indicator functions s (state), t (transition) • Usually binary in NLP applications • Frame-level ANN posteriors are bounded • Probabilities can serve as observation feature functions • sstop(/t/,x,i)=P(manner=stop|xi) Morris & Fosler-Lussier 06 Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Conditional Random Fields • CRFs make no independence assumptions about input • Posteriors can be used directly without decorrelation • Can combine features, phones, … • No assumption of temporal independence • Entire label sequence is modeled jointly • Monophone feature CRF phone recog. similar to triphone HMM • Learning parameters (,) determines importance of feature/phone relationships • Implicit model of partial phonological underspecification • Slow to train Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Underspecification • All of these models learn what phonological information is important in higher-level processing • Ignoring “canonical” feature definitions for phone is a form of underspecification • Traditional underspecification: some features are undefined for a particular phone • Weighted models: partial underspecification • When can you ignore phonetic information? • Crucially, when it doesn’t help you disambiguate between word hypotheses Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Underspecification • Example: unstressed syllables tend to show more phonetic variation than stressed syllables • Experiment: reduce phonetic representation for unstressed syllables to manner class • Allowing recognizer to choose best representation (phone/manner) during training (WSJ0): • Minor degradation for clean speech (9.9 vs. 9.1 WER) • Larger improvement in 10dB car noise (15.8 vs 13.0 WER) • Moral: we don’t need to have exact phonetic representation to decode words • But we may need to integrate more higher-level knowledge Fosler-Lussier et al 05 Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Vision for the Future • Acoustic-phonetic variation is difficult • Still significant cause of errors in ASR • Underspecified models give a new way of looking at the problem • Rather than the “change x to y” model • Challenge for the field: • Current techniques for accent modeling, intrinsic pronunciation variation separate • Can we build a model that handles both? Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision Conclusions • We have come quite a distance since 1999 • New methods for phonological feature detection • New methods for feature integration • New ways of thinking about variation: underspecification • Still have a long way to go • Integrating more knowledge sources • Stress, prosody, word confusability • Solving the pronunciation adaptation problem in a general way Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Fin Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Introduction Why features? Role of transcription Approaches Vision An example feature grid CLASS: VOICED: CMANNER: CPLACE: VHEIGHT: VFRONTNESS:VROUND:VTENSE: g ow t uw w aa sh ix ng t ax n go to washington Fosler-Lussier / Underspecified Feature ModelsITRW Speech Recognition and Intrinsic Variation

Underspecified feature models for pronunciation variation in ASR