Articulatory Feature-Based Speech Recognition
JHU WS06 Planning Meeting
Karen Livescu, April 23, 2006


Presentation Transcript


  1. Articulatory Feature-Based Speech Recognition, JHU WS06 Planning Meeting. Karen Livescu, April 23, 2006. (Title slide shown over the feature-based DBN diagram with word, ind, U, sync, and S variables.)

  2. Project Participants • Team members: Karen Livescu (MIT), Arthur Kantor (UIUC), Ozgur Cetin (ICSI Berkeley), Partha Lal (Edinburgh), Mark Hasegawa-Johnson (UIUC), Lisa Yung (JHU), Simon King (Edinburgh), Ari Bezman (Dartmouth), Nash Borges (DoD, JHU), Stephen Dawson-Haggerty (Harvard), Chris Bartels (UW), Bronwyn Woods (Swarthmore) • Satellite members/advisors: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Ghinwa Choueiter (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Erik McDermott (NTT), Daryush Mehta (MIT), Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT)

  3. Schedule

  4. This meeting is for: • Agreeing on motivations and goals • Asking questions • Making suggestions • Questioning assumptions • Discussing ideas • Developing plans • Dividing up tasks for the next 1-2 months

  5. Why this project? • Why articulatory feature-based ASR? • Improved modeling of co-articulatory pronunciation phenomena • Application to audio-visual and multilingual ASR • Evidence of improved ASR performance with feature-based observation models in noise [Kirchhoff et al. 2002], for hyperarticulated speech [Soltau et al. 2002] • Savings in training data • Compatible with more recent theories of phonology (autosegmental phonology, articulatory phonology) • Why now? • A number of sites working on complementary aspects of this idea, e.g. • U. Edinburgh (King et al.) • UIUC (Hasegawa-Johnson et al.) • MIT (Livescu, Saenko, Glass, Darrell) • Recently developed tools (e.g. GMTK) for systematic exploration of the model space

  6. Why this project? (part deux) • Many have argued for replacing the single phone stream with multiple sub-phonetic feature streams (Rose et al. ‘95, Ostendorf ‘99, ‘00, Nock ‘00, ‘02, Niyogi et al. ‘99 (for AVSR)) • Many have worked on parts of the problem • AF classification/recognition (Kirchhoff, King, Frankel, Wester, Richmond, Hasegawa-Johnson, Borys, Metze, Fosler-Lussier, Greenberg, Chang, Saenko, ...) • Pronunciation modeling (Livescu & Glass, Bates) • Many have combined AF classifiers with phone-based recognizers (Kirchhoff, King, Metze, Soltau, ...) • Some have built HMMs by combining AF states into product states (Deng et al., Richardson and Bilmes) • Only very recently, work has begun on end-to-end recognition with multiple streams of AF states (Hasegawa-Johnson et al. ‘04, Livescu ‘05) • No prior work on AF-based models for AVSR • Time for a systematic study

  7. A (partial) taxonomy of design issues (shown as a decision-tree figure in the slide) • Design axes: factored state (multistream structure)? factored observation model? observation model type (GM, SVM, NN); context-dependent (CD) or not; state asynchrony (none, soft asynchrony within a unit, soft asynchrony within a word, cross-word, free within a unit, coupled state transitions) • Example systems placed in the taxonomy: [Deng ’97, Richardson ’00] (FHMMs), [Kirchhoff ’96, Wester et al. ‘04], [Metze ’02], [Kirchhoff ’02], [Juneja ’04], [WS04], [Livescu ‘04], [Livescu ’05]; CHMMs and several other cells are marked “???” (unexplored) • (Not to mention choice of feature sets... same in hidden structure and observation model?)

  8. What are the goals of this project? • Building complete AF-based recognizers and understanding the design issues involved? • Improving the state of the art on a standard ASR task? • Obtaining improved recognition in some domain? • Obtaining improved automatic articulatory feature (AF) transcription of speech? • Comparing different types of • Classifiers (NNs, SVMs, others?) • Pronunciation models • Recognition architectures (hybrid DBN/classifier vs. fully-generative, Gaussian mixture-based) • Analyzing articulatory phenomena • Dependence on context, speaker, speaking rate, speaking style, ... • Effects of articulatory reduction/asynchrony on recognition accuracy • Developing a “meta-toolkit” for AF-based recognition?


  10. A tentative plan • Prior to workshop: • Selection of feature sets (done) • Selection of corpora for audio-only (done?) and audio-visual tasks • Baseline phone-based and feature-based results on selected data • Trained AF classifiers, with outputs on selected data (no classifier work to be done during workshop) • During workshop: Investigate 3 main types of recognizers • Fully-generative (Gaussian mixture-based) models for audio tasks • Hybrid DBN/classifier models for audio tasks • Fully-generative models for audio-visual tasks • Workshop, 1st half: • Compare AF classifiers • Develop/compare pronunciation models using AF-transcribed data • Build complete recognizers with basic pronunciation model • Workshop, 2nd half: Integrate most successful pronunciation and observation models from 1st half

  11. Status report, in brief... • Feature sets & phone ↔ canonical feature value mappings determined (Karen, Simon, Mark, Eric) • Feature set for pronunciation modeling may still change a bit (Karen) • Manual feature transcriptions underway (Karen, Xuemin, Lisa) • Video front-end processing (Kate, Mark) and audio-visual baseline recognizers (Kate, Karen) underway • GMTK infrastructure • State tying tool underway (Simon, Jeff) • Updated parallel training/decoding scripts underway (Karen) • Training of NN feature classifiers about to start (Joe, Simon) • Maybe won’t use SVM feature classifiers? (Mark, Karen, Simon) • SVitchboard baseline GMTK recognizers unchanged  (Karen)

  12. DBNs for ASR

  13. Graphical models for automatic speech recognition • Most common ASR models (HMMs, FSTs) have some drawbacks • Strong independence assumptions • Single state variable per time frame • May want to model more complex structure • Multiple processes (audio + video, speech + noise, multiple streams of acoustic features, articulatory features) • Dependencies between these processes or between acoustic observations • Graphical models provide: • General algorithms for large class of models • No need to write new code for each new model • A “language” with which to talk about statistical models

  14. Graphical models (GMs) • Represent probability distributions via graphs • directed, e.g. Bayesian networks (BNs) • undirected, e.g. Markov random fields (MRFs) • combination, e.g. chain graphs • Node ↔ variable • Lack of edge ↔ independence property • Example with four nodes A, B, C, D: as a BN, p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c); as an MRF, p(a,b,c,d) ∝ ψA,B(a,b) ψB,C,D(b,c,d)

  15. Bayesian networks (BNs) • Definition: • Directed acyclic graph (DAG) with one-to-one correspondence between nodes and variables X1, X2, ..., XN • Node Xi with parents pa(Xi) has a “local” probability function p(Xi | pa(Xi)) • Joint probability of all of the variables = product of the local probabilities: p(x1, ..., xN) = ∏i p(xi | pa(xi)) • The graph specifies the factorization only; for a complete description of the distribution, also need • Implementation: the form of the local probabilities (Gaussian, table, ...) • Parameters of the local probabilities (means, covariances, table entries, ...) • Running example (nodes A, B, C, D): p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c) • (Terminology courtesy of Jeff Bilmes)
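
To make "joint = product of local probabilities" concrete, here is a minimal Python sketch of the four-node example above; the table entries are invented for illustration and are not from the slides.

```python
# Minimal sketch (illustrative numbers only, not from the slides) of the
# four-node example: A -> B -> C, with D depending on both B and C.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}              # p(b | a)
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}              # p(c | b)
p_d_given_bc = {(0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.5, 1: 0.5},
                (1, 0): {0: 0.5, 1: 0.5},   (1, 1): {0: 0.1, 1: 0.9}}  # p(d | b, c)

def joint(a, b, c, d):
    """Joint probability as the product of the local probabilities."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] * p_d_given_bc[(b, c)][d]

# Sanity check: the joint sums to 1 over all assignments of (a, b, c, d).
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(round(joint(1, 0, 1, 1), 6), round(total, 6))   # -> 0.012 1.0
```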

  16. Dynamic Bayesian networks (DBNs) • BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times • Useful for modeling time series (e.g. speech!) • Figure: the A/B/C/D network unrolled over frames i-1, i, and i+1

  17. Representing an HMM as a DBN • Figure: a 3-state left-to-right HMM (self-loops with probability .7, .8, and 1; transitions 1→2 with .3 and 2→3 with .2) alongside the equivalent DBN unrolled over frames i-1, i, i+1, with state variables qi-1, qi, qi+1 and observation variables obsi-1, obsi, obsi+1 • The DBN's local probabilities are the transition table P(qi | qi-1) (row q=1: .7 .3 0; row q=2: 0 .8 .2; row q=3: 0 0 1) and the observation model P(obsi | qi) • Legend: node = variable / state; edge = allowed dependency / allowed transition
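
As a rough illustration of how this DBN's local probabilities combine over frames, the sketch below runs a forward pass using the transition table from the slide; the per-frame observation likelihoods are dummy values standing in for P(obsi | qi), and the start-in-state-1 convention is my assumption.

```python
import numpy as np

# Transition table P(q_i | q_{i-1}) from the slide (rows = previous state):
# 1 -> {1: .7, 2: .3}, 2 -> {2: .8, 3: .2}, 3 -> {3: 1.0}
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])

def sequence_likelihood(obs_lik):
    """Forward pass over the unrolled DBN.
    obs_lik[t, q] stands in for P(obs_t | q_t = q); assumes the HMM starts in state 1."""
    alpha = np.zeros_like(obs_lik)
    alpha[0] = np.array([1.0, 0.0, 0.0]) * obs_lik[0]     # initial-state assumption
    for t in range(1, len(obs_lik)):
        alpha[t] = (alpha[t - 1] @ A) * obs_lik[t]        # propagate, then weight by obs
    return alpha[-1].sum()                                # total likelihood of the sequence

# Dummy per-frame observation likelihoods (three frames, three states).
print(sequence_likelihood(np.array([[0.9, 0.1, 0.1],
                                    [0.2, 0.8, 0.1],
                                    [0.1, 0.3, 0.9]])))
```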

  18. Inference • Definition: • Computation of the probability of one subset of the variables given another subset • Inference is a subroutine of: • Viterbi decoding: q* = argmax_q p(q | obs) • Maximum-likelihood estimation of the parameters of the local probabilities: θ* = argmax_θ p(obs | θ)
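
Since Viterbi decoding is one of the uses of inference named here, a compact log-space sketch of it follows; the max-product recursion itself is standard, but the dense-matrix NumPy formulation is just one convenient choice and is not how GMTK implements it.

```python
import numpy as np

def viterbi(log_A, log_init, log_obs):
    """Most likely state sequence q* = argmax_q p(q | obs), computed in log space.
    log_A[i, j] = log P(q_t = j | q_{t-1} = i); log_obs[t, j] = log P(obs_t | q_t = j)."""
    T, N = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)          # best predecessor for each state j
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # trace the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny two-state usage example (all numbers made up).
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_init = np.log([0.5, 0.5])
log_obs = np.log([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])
print(viterbi(log_A, log_init, log_obs))         # -> [0, 1, 1]
```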

  19. Whole-word HMM-based recognizer example • The same structure repeats from frame 0 through the last frame, with these variables (name: values): word: {“one”, “two”, ...}; word transition: {0,1}; sub-word state: {0,1,2,...,7}; state transition: {0,1}; observation • This recognizer uses • A bigram language model • Whole-word HMMs with 8 states per word • Viterbi decoding amounts to: • Find the most likely values of the word variables, given the observations • Read off the value of word in those frames where word transition = 1 (see the sketch below)
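
The final read-off step can be written out in a few lines; the frame-level values below are hypothetical Viterbi outputs invented for illustration.

```python
def read_off_words(word_values, word_transitions):
    """Recover the word sequence from frame-level Viterbi values:
    emit the current word in every frame where word_transition == 1."""
    return [w for w, t in zip(word_values, word_transitions) if t == 1]

# Hypothetical frame-level Viterbi output for six frames of "two one".
print(read_off_words(["two", "two", "two", "one", "one", "one"],
                     [0, 0, 1, 0, 0, 1]))        # -> ['two', 'one']
```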

  20. Phone HMM-based recognizer example • The same structure repeats from frame 0 through the last frame, with these variables (name: values): word: {“one”, “two”, ...}; word transition: {0,1}; sub-word index: {0,1,2,...}; state transition: {0,1}; phone state: {w1, w2, w3, s1, s2, s3, ...}; observation • This recognizer uses • A bigram language model • Context-independent 3-state phone HMMs

  21. Training vs. decoding DBNs • Why do we need different structures for training and testing? Isn’t training just the same as testing but with more of the variables observed? • Not always! • Often, during training we have only partial information about some of the variables, e.g. the word sequence but not which frame goes with which word

  22. Whole-word HMM-based training structure • Training structure when only the word transcript is known • The same structure repeats from frame 0 through the last frame, with these variables (name: values): wd counter: {0, 1, ..., MaxUttLength-1}; end of utterance: {0,1}; word: {“one”, “two”, ...}; word transition: {0,1}; sub-word state: {0,1,2,...,7}; state transition: {0,1}; observation • wd counter is copied from the previous frame if the previous frame's word transition = 0; else, wd counter is incremented by 1 • end of utterance = 1 if wd counter = number of words in the utterance and word transition = 1 (a sketch of this counting logic follows below) • Note: the newest versions of GMTK allow for direct use of training lattices instead of this setup
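
Here is a loose Python paraphrase of that counting constraint; in GMTK it is expressed as deterministic CPTs over wd counter and end of utterance, and the exact indexing conventions (when the counter increments, 0- vs. 1-based counts) are simplified here.

```python
def accepts_transcript(word_transitions, num_words):
    """Loose paraphrase of the training-time counting constraint:
    count completed words via the word_transition flags and require that the
    utterance ends exactly when the count reaches the transcript length."""
    wd_counter = 0
    for transition in word_transitions:
        if transition == 1:
            wd_counter += 1                      # a word ends in this frame
    end_of_utterance = (wd_counter == num_words and word_transitions[-1] == 1)
    return end_of_utterance

print(accepts_transcript([0, 0, 1, 0, 0, 1], num_words=2))   # -> True
print(accepts_transcript([0, 0, 1, 0, 0, 1], num_words=3))   # -> False
```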

  23. FEATURE-BASED PRONUNCIATION MODELING

  24. Examples of pronunciation variation [From data of Greenberg et al. ‘96] • Each word is shown with its baseform pronunciation and its observed surface (actual) pronunciations, with counts: • probably / p r aa b ax b l iy /: (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy • don’t / d ow n t /: (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (2) n ax, (2) d ax n, (1) ax, (1) n uw, ... • sense / s eh n s /: (1) s eh n t s, (1) s ih t s • everybody / eh v r iy b ah d iy /: (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy

  25. Pronunciation variation in automatic speech recognition Automatic speech recognition (ASR) is strongly affected by pronunciation variation • Words produced non-canonically are more likely to be mis-recognized [Fosler-Lussier 1999] • Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. 1996]

  26. Pronunciation vs. observation modeling • Recognition ≡ w* = argmax_w P(w|a) = argmax_w P(w) Σ_q P(q|w) P(a|q) • Language model P(w), e.g. w = “makes sense...” • Pronunciation model P(q|w), e.g. the state sequence [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ] • Observation model P(a|q) over the acoustics a (spectrogram shown in the slide)

  27. Phone-based pronunciation modeling • Example: sense / s eh n s / → [ s eh n t s ] via a [t]-insertion rule applied to the dictionary baseform • Variants generated by transformation rules of the form u1 → s2 / u3 _ u4 • E.g. Ø → t / n _ s • Rules are derived from • Linguistic knowledge (e.g. [Hazen et al. 2002]) • Data (e.g. [Riley & Ljolje 1996]) • Some issues • Low coverage of conversational pronunciations • Sparse data • Partial changes not well described [Saraclar et al. 2003] • Increased inter-word confusability • So far, only small gains in recognition performance
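
To show how a rule like Ø → t / n _ s expands a baseform into a surface variant, here is a minimal, hypothetical sketch hard-coded for this one rule; it is not the rule machinery used in any of the cited systems.

```python
def apply_t_insertion(phones):
    """Apply the single rule 0 -> t / n _ s: insert [t] between [n] and [s].
    Hard-coded for this one rule; a real system would use a general rule
    compiler or finite-state transducer."""
    out = []
    for i, ph in enumerate(phones):
        out.append(ph)
        if ph == "n" and i + 1 < len(phones) and phones[i + 1] == "s":
            out.append("t")
    return out

print(apply_t_insertion(["s", "eh", "n", "s"]))   # -> ['s', 'eh', 'n', 't', 's']
```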

  28. Sub-phonetic features • Two candidate feature sets: the feature set used in the International Phonetic Alphabet (IPA), and a feature set based on articulatory phonology [Browman & Goldstein 1992] • The articulatory-phonology set is defined by the vocal-tract variables shown in the slide's diagram: LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS

  29. Feature-based modeling: sense → [s eh n t s] • Phone-based view: • Brain: Give me an [s]! • Lips, tongue, velum, glottis: Right away! • (Articulatory) feature-based view: • Brain: Give me an [s]! • Velum, glottis: Right away! • Tongue: Umm… OK. • Lips: Huh?

  30. Revisiting the examples as feature values (the slide shows one feature-value table per pronunciation) • Dictionary form, phones s eh n s: GLO open, critical, open; VEL closed, open, closed; TB mid/uvular, mid/palatal, mid/uvular; TT critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar • Surface variant #1, phones s eh n t s: GLO open, critical, open; VEL closed, open, closed; TB mid/uvular, mid/palatal, mid/uvular; TT critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar • Surface variant #2, phones s ih t s (with an “n” marked on the vowel region in the original table): GLO open, critical, open; VEL closed, open, closed; TB mid/uvular, mid-nar/palatal, mid/uvular; TT critical/alveolar, mid-nar/alveolar, closed/alveolar, critical/alveolar

  31. Approach: main ideas [Livescu & Glass, HLT-NAACL ‘04] • Inspired by autosegmental and articulatory phonology, but simplified and expressed in probabilistic terms • Start from the baseform dictionary entry, e.g. “everybody”: index 0 1 2 3 ..., phone eh v r iy ..., GLOT V V V V ..., VEL Off Off Off Off ..., LIP-OPEN Wide Crit Wide Wide ... • Add asynchrony: each feature stream gets its own index into the baseform (ind VOI, ind VEL, ind LIP-OPEN in the figure), so the streams may move through the word at different rates • Add feature substitutions: the surface value of a feature may differ from its underlying value (in the figure, underlying LIP-OPEN W W W W C C C C W W W W is realized as surface W W N N N C C C W W W W)

  32. Dynamic Bayesian network representation • Figure: the per-frame DBN with variables word, ind1..ind3, U1..U3, sync1,2, sync2,3, and S1..S3, with the baseform table (index, phone, GLO, LIP-OP, ...) feeding the underlying values • Example baseform entries: index 0 1 2 3, phone eh v r iy, GLO crit crit crit crit, LIP-OP wide crit wide {.5 nar, .5 wide} • Example probability table over constriction degrees CL/C/N/M/O (rows CL: .7 .2 .1 0 0; C: 0 .7 .2 .1 0; N: 0 0 .7 .2 .1; ...) • Example index and value streams: ind GLOT 0 0 0 0 1 1 1 2 2 2 2 2; ind LIP-OPEN 0 0 0 0 1 1 1 1 1 2 2 2; U LIP-OPEN W W W W C C C C C W W W; S LIP-OPEN W W N N N C C C C W W W

  33. Independence assumptions (annotated on the DBN) • Underlying feature values are conditionally independent given the word and the per-stream state indices • Surface feature values are conditionally independent given the underlying values • Feature timing is dependent only through the synchrony constraints
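
Written out, these assumptions correspond roughly to the factorization below; this is a schematic reading of the slides, not the exact DBN (per-frame unrolling, the sync check variables, and transition bookkeeping are all omitted).

```latex
% Schematic factorization implied by the three assumptions above (my reading of the
% slides; frame indices and the deterministic sync/transition variables are omitted).
\[
P(S, U, \mathrm{ind} \mid \mathrm{word}) \;\propto\;
  \prod_{k} P(\mathrm{ind}_k \mid \mathrm{word}) \,
  \prod_{k} P(U_k \mid \mathrm{word}, \mathrm{ind}_k) \,
  \prod_{k} P(S_k \mid U_k) \,
  \prod_{\text{adjacent } (j,k)} P\bigl(\lvert \mathrm{ind}_j - \mathrm{ind}_k \rvert\bigr)
\]
```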

  34. Different feature sets for pronunciation & observation models • The model's surface pronunciation-feature variables (SGLOT, SLIP-OPEN, STT-OPEN, STB-OPEN, connected to the rest of the model) are converted by a deterministic mapping to the observation-model feature set (e.g. glottis state, degree of constriction), each with its own observation variable (obsglo, obsdg1 in the figure)

  35. Model components Phonemic baseform dictionary Phone-to-feature mapping Soft synchrony constraints P(|indA-indB| = a) Feature substitution probabilities P(s|u) Transition probabilities • In each frame, the probability of transitioning to the next state in the word
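
These components can be combined into a per-frame score; the sketch below does so for a toy two-stream case, with every probability value invented for illustration (the real tables are learned or specified from linguistic knowledge).

```python
# Toy two-stream illustration of the listed components (all numbers invented).
# Soft synchrony constraint: distribution over the index difference between streams.
p_async = {0: 0.7, 1: 0.2, 2: 0.1}                       # P(|ind_A - ind_B| = a)

# Feature substitution probabilities P(s | u) for one feature (e.g. LIP-OPEN).
p_subst = {"wide":     {"wide": 0.9, "narrow": 0.1},
           "critical": {"critical": 0.8, "narrow": 0.2}}

def frame_score(ind_a, ind_b, underlying, surface):
    """Probability contribution of one frame in this toy two-stream model."""
    return p_async.get(abs(ind_a - ind_b), 0.0) * p_subst[underlying][surface]

print(frame_score(ind_a=3, ind_b=2, underlying="wide", surface="narrow"))  # 0.2 * 0.1
```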

  36. Experiments: lexical access from manual transcriptions • Figure: the feature-based DBN, annotated to show what is given (the transcribed surface feature values) and what is to be inferred (the word)

  37. Lexical access experiments [Livescu & Glass, HLT-NAACL’04 & ICSLP ‘04] • Input: a manual, aligned phonetic transcription for one word [data from Greenberg et al. 1996], e.g. [ s ahn t s ], converted into feature values • For each word in the vocabulary, compute P(w|s) • Output: most likely word, w* = argmax_w P(w|s) • Example ranking: 1. cents (-143.24), 2. sent (-159.95), 3. tents (-186.18), ...
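
The decision rule w* = argmax_w P(w|s) is simply a ranking over the vocabulary; the sketch below mocks it up with a placeholder log_score callable standing in for DBN inference, reusing the example scores from the slide.

```python
def rank_words(vocabulary, surface_features, log_score):
    """Rank vocabulary words by log P(w | s). `log_score` is a placeholder callable
    standing in for DBN inference over the pronunciation model."""
    scored = [(w, log_score(w, surface_features)) for w in vocabulary]
    return sorted(scored, key=lambda ws: ws[1], reverse=True)

# Hypothetical scores mimicking the ranking shown on the slide.
fake_scores = {"cents": -143.24, "sent": -159.95, "tents": -186.18}
ranking = rank_words(["sent", "tents", "cents"], surface_features=None,
                     log_score=lambda w, s: fake_scores[w])
print(ranking[0])   # -> ('cents', -143.24)
```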

  38. Lexical access: selected results [Livescu ‘05] • (Figure: results chart comparing the feature-based model with phone-based models)

  39. Analysis • Viterbi alignments (e.g. “instruments” → [ ih s ch em ih n s ], aligned per stream: phone, underlying, and surface values for VEL and TT-LOC) • Learned parameter values (e.g. a learned probability table indexed by integer values 0-4: row 0: .7 .2 .1 0 0; row 1: 0 .7 .2 .1 0; row 2: 0 0 .7 .2 .1; ...)

  40. To do • Context-dependent feature substitution probabilities • Cross-word asynchrony • Different ways of modeling asynchrony • Using async variables for feature subsets • Using single async variable that evaluates the goodness of the entire “synchrony state” • No async variables, just condition each stream’s state transition on other stream states (coupled HMM-style model) • Analysis of data • How much asynchrony/substitution can occur? • How do these relate to speaker, dialect, context, ...? • How do model scores relate to human perceptual judgments? • What is the best way to measure pronunciation model quality?

  41. SVitchboard 1 baseline recognizers • 100-word task from SVitchboard 1 [King et al. 2005] • Model: feature streams grouped into lip features, tongue features, and glottis/velum, linked by asynchrony variables (asyncLT with checkSyncLT, asyncTG with checkSyncTG) • Results: (table shown in the slide)

  42. FEATURE TRANSCRIPTIONS

  43. “Ground-truth” articulatory feature transcriptions • A suggestion from ASRU meeting with Mark, Simon, Katrin, Eric, Jeff • Being carried out at MIT with the help of Xuemin and Lisa • Thanks also to Janet, Stefanie, Edward, Daryush, Jim, Nancy for discussions

  44. “Ground-truth” articulatory feature transcriptions • Why? • No good existing reference data to test feature classifiers • Can be used to work on observation modeling and pronunciation modeling separately • (Also risky, though! Like doing phonetic recognition + phonetic pronunciation modeling separately) • Alternatives: Articulatory measurement data (MOCHA, X-ray microbeam), Switchboard Transcription Project data • Plan • Manually transcribe 50-100 utterances as test data for classifiers • Force-align a much larger set using classifiers + word transcripts + “permissive” articulatory model • (Assuming classifiers are good enough... Or even if not?) • Larger set will serve as “ground truth” for pronunciation modeling work • Issues • Feature set • Transcription interface & procedure • What is ground truth, really?

  45. Evolution of transcription project • ASRU meeting, Nov.-Dec.: “We should generate some manual feature transcriptions” • Dec.: Discussions (Karen, Mark, Simon, Eric) of feature set and the possibility of having some of the Buckeye corpus transcribed at the feature level • Jan.: Buckeye data transcription not possible; transcription effort & feature set discussion shifts to MIT/Switchboard data • Jan.-early Mar. • Transcribers located: Xuemin Chi (grad student in Speech Communication group), Lisa Lavoie (phonetician) • Karen, Xuemin, Lisa, and others meet (semi-)weekly • Iterate practice transcriptions, refinements to transcription interface & feature set • Late Mar.-Apr. • Feature set settled • Improving transcriber speed and consistency • Settled on 2-pass transcription strategy: For every set of 8-15 utterances • 1st pass: Transcribe • 2nd pass: Compare with the other transcriber’s labels and, possibly, make corrections


  47. Initial feature set • Started out with feature set based on [Chang, Wester, & Greenberg 2005] • Issues • Some articulatory configurations can’t be annotated • Double articulations • Aspiration • Lateral approximants • Rhoticized vowels • Not enough resolution in vowel space • Vowel regions very hard to transcribe

  48. Current feature set • (For more info, see “Feature transcriptions” link on the wiki)

  49. Alternative to “vowel” feature • A two-dimensional vowel chart with a front-back axis (FRONT, MID-FRONT, MID, MID-BACK, BACK) and a height axis (VERY HIGH, HIGH, MID-HIGH, MID, MID-LOW, LOW) • Vowel groups placed on the chart: {iy, ux}; {ey1}; {ih, ay2, ey2, oy2}; {eh}; {aw1, ae}; {ix}; {ax, ah, er, axr, el, em, en}; {uh, aw2, ow2}; {aa, ay1}; {ao, oy1}; {ow1}; {uw}

