
Speech Recognition Models of the Interdependence Among Prosody, Syntax, and Segmental Acoustics

This research explores the interdependence among prosody, syntax, and segmental acoustics in speech recognition models. It discusses prosodic tags as hidden mode variables, acoustic and language models, and the use of factored models to address data sparsity. The study focuses on prosody-dependent allophones, pitch, duration, and syntactic correlates of prosody.



Presentation Transcript


  1. Speech Recognition Models of the Interdependence Among Prosody, Syntax, and Segmental Acoustics
  Mark Hasegawa-Johnson, Jennifer Cole, Chilin Shih, Ken Chen, Aaron Cohen, Sandra Chavarria, Heejin Kim, Taejin Yoon, Sarah Borys, and Jeung-Yoon Choi

  2. Outline
  • Prosodic tags as “hidden mode” variables
  • Acoustic models
    • Factored prosody-dependent allophones
    • Knowledge-based factoring: pitch & duration
    • Allophone clustering: spectral envelope
  • Language models
    • Factored syntactic-prosodic N-gram
    • Syntactic correlates of prosody

  3. Prosodic tags as “hidden speaking mode” variables (inspired by Ostendorf et al., 1996; Stolcke et al., 1999)
  W* = argmax_W max_{Q,A,B,S,P} p(X,Y|Q,A,B) p(Q,A,B|W,S,P) p(W,S,P)
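
  As a rough illustration, this maximization can be read as a search over candidate word strings in which the prosodic and syntactic variables are hidden and maximized out. The sketch below assumes the three factors are available as callables returning log-probabilities; all names here are placeholders, not the system described in the talk.

```python
# Minimal sketch of hidden-mode decoding (all scorers are hypothetical
# placeholders returning log-probabilities, not the talk's actual models).
def decode(X, Y, hypotheses, acoustic_score, prosody_score, prior):
    """Return argmax_W max_{Q,A,B,S,P} of the factored log-probability."""
    best_w, best_score = None, float("-inf")
    for W, hidden_settings in hypotheses:
        # hidden_settings enumerates candidate (Q, A, B, S, P) assignments for W
        for Q, A, B, S, P in hidden_settings:
            score = (acoustic_score(X, Y, Q, A, B)      # log p(X,Y|Q,A,B)
                     + prosody_score(Q, A, B, W, S, P)  # log p(Q,A,B|W,S,P)
                     + prior(W, S, P))                  # log p(W,S,P)
            if score > best_score:
                best_w, best_score = W, score
    return best_w
```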

  4. “Toneless ToBI” Prosodic Transcription
  • Tagged transcription: Wanted*% chief* justice* of the Massachusetts* supreme court*%
    • % marks an intonational phrase (IP) boundary
    • * marks a pitch-accented word
  • Lexicon: each word has four entries, e.g. wanted, wanted*, wanted%, wanted*%
    • IP boundary applies to phones in the rhyme of the final syllable: wanted% → w aa n t ax% d%
    • Accent applies to phones in the lexically stressed syllable: wanted* → w* aa* n* t ax d
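
  This four-way lexicon expansion is mechanical once each pronunciation is annotated with its stressed syllable and final-syllable rhyme. The sketch below assumes such an annotation; expand_entry and its argument format are illustrative, not the authors' lexicon code.

```python
# Sketch of "Toneless ToBI" lexicon expansion. A pronunciation is given as a
# list of syllables (each a list of phones), the index of the lexically
# stressed syllable, and the index of the first rhyme phone in the final
# syllable. All names are hypothetical.
def expand_entry(word, syllables, stressed, rhyme_start):
    def tagged(accent, boundary):
        phones = []
        for i, syl in enumerate(syllables):
            for j, ph in enumerate(syl):
                tag = ""
                if accent and i == stressed:
                    tag += "*"        # pitch accent marks the stressed syllable
                if boundary and i == len(syllables) - 1 and j >= rhyme_start:
                    tag += "%"        # IP boundary marks the final-syllable rhyme
                phones.append(ph + tag)
        return phones

    return {word: tagged(False, False),
            word + "*": tagged(True, False),
            word + "%": tagged(False, True),
            word + "*%": tagged(True, True)}

# expand_entry("wanted", [["w","aa","n"], ["t","ax","d"]], stressed=0, rhyme_start=1)
# yields wanted% -> w aa n t ax% d% and wanted* -> w* aa* n* t ax d, as above.
```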

  5. The problem: Data sparsity
  • Boston Radio News corpus
    • 7 talkers; professional radio announcers
    • 24,944 words prosodically transcribed
    • Insufficient data to train triphones:
      • Hierarchically clustered states: HERest fails to converge (insufficient data)
      • Fixed number of triphones (3 per monophone): WER increases (monophone: 25.1%, triphone: 36.2%)
  • Switchboard
    • Many talkers; conversational telephone speech
    • About 1,700 words with full prosodic transcription
    • Insufficient to train an HMM, but sufficient for testing

  6. Proposed solution: Factored models
  • Factored acoustic model: p(X,Y|Q,A,B) = Π_i p(d_i|q_i,b_i) Π_t p(x_t|q_i) p(y_t|q_i,a_i)
    • prosody-dependent allophone q_i
    • pitch accent type a_i ∈ {Accented, Unaccented}
    • intonational phrase position b_i ∈ {Final, Nonfinal}
  • Factored language model: p(W,P,S) = p(W) p(S|W) p(P|S)
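
  The factored acoustic likelihood decomposes segment by segment: duration depends on phrase position, the spectral stream on the allophone alone, and the pitch stream on allophone plus accent. A minimal log-domain sketch, with all model callables as placeholders:

```python
# Sketch of the factored acoustic likelihood. segments is a list of
# (q, a, b, frames) tuples, where frames is a list of (x_t, y_t) pairs;
# log_p_dur, log_p_x, log_p_y are assumed callables (placeholder names).
def factored_log_likelihood(segments, log_p_dur, log_p_x, log_p_y):
    """log p(X,Y|Q,A,B) = sum_i log p(d_i|q_i,b_i)
                        + sum_t [ log p(x_t|q_i) + log p(y_t|q_i,a_i) ]"""
    total = 0.0
    for q, a, b, frames in segments:
        total += log_p_dur(len(frames), q, b)   # duration: depends on phrase position b
        for x_t, y_t in frames:
            total += log_p_x(x_t, q)            # spectral stream: allophone only
            total += log_p_y(y_t, q, a)         # pitch stream: depends on accent a
    return total
```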

  7. Acoustic factor #1: Are the MFCCs prosody-dependent?
  [Figure: two decision trees for the phone /N/. The clustered-triphone tree splits on phonetic context (“R Vowel?”, “L Stop?”), yielding N-VOW, STOP+N, and N; the prosody-dependent allophone tree splits on right context and accent (“R Vowel?”, “Pitch Accent?”), yielding N-VOW, N, and N*.]
  • Clustered triphones: WER 36.2%
  • Prosody-dependent allophones: WER 25.4%
  • BUT: WER of the baseline monophone system = 25.1%

  8. Prosody-dependent allophones: ASR clustering matches EPG
  • Electropalatography (EPG) classes from Fougeron & Keating (1997):
    • Strengthened
    • Lengthened
    • Neutral

  9. Acoustic factor #2: Pitch
  [Figure: dynamic Bayesian network. Each phoneme state Q(t) generates an MFCC frame; Q(t) together with the accent variable A(t) (“Accented?”) generates a transformed pitch observation G(F0), computed from the raw pitch stream F0(t−2), …, F0(t+2).]

  10. Acoustic-prosodic observations: Y(t) = ANN(log f0(t−5), …, log f0(t+5))
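
  In other words, the acoustic-prosodic observation at frame t is a neural-network transform of a window of log-F0 values around t. A minimal sketch, assuming a single hidden layer; the window radius and network shape here are illustrative assumptions, not the trained model from the talk:

```python
import numpy as np

# Sketch: Y(t) = ANN(log f0(t-5), ..., log f0(t+5)). W1, b1, W2, b2 are the
# (hypothetical) weights of a tiny one-hidden-layer network; log_f0 is a
# 1-D array of log-pitch values.
def ann_pitch_feature(log_f0, t, W1, b1, W2, b2, radius=5):
    window = log_f0[t - radius : t + radius + 1]   # log f0(t-5) ... log f0(t+5)
    hidden = np.tanh(W1 @ window + b1)             # hidden layer
    return W2 @ hidden + b2                        # acoustic-prosodic feature Y(t)
```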

  11. Acoustic factor #3: Duration
  • Normalized phoneme duration is highly correlated with phrase position
  • Solution: semi-Markov model (a.k.a. HMM with explicit duration distributions, EDHMM)
  P(x(1),…,x(T) | q_1,…,q_N) = Σ_{d_1,…,d_N} p(d_1|q_1) ⋯ p(d_N|q_N) p(x(1)…x(d_1)|q_1) p(x(d_1+1)…x(d_1+d_2)|q_2) ⋯
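
  The sum over all duration assignments d_1 + … + d_N = T can be computed with a simple dynamic program. A sketch, assuming the duration and segment-likelihood models are given as log-domain callables (placeholder names, not the authors' code):

```python
import numpy as np

# Sketch of the EDHMM likelihood for a known state sequence q_1..q_N.
# log_p_dur(d, q) = log p(d|q); log_p_seg(s, t, q) = log p(x(s+1)..x(t)|q)
# (0-indexed frames s..t-1). max_dur caps the per-segment duration.
def edhmm_log_likelihood(T, states, log_p_dur, log_p_seg, max_dur):
    N = len(states)
    # alpha[i, t] = log P(first i segments explain frames 0..t-1)
    alpha = np.full((N + 1, T + 1), float("-inf"))
    alpha[0, 0] = 0.0
    for i, q in enumerate(states):
        for t in range(1, T + 1):
            scores = [alpha[i, t - d] + log_p_dur(d, q) + log_p_seg(t - d, t, q)
                      for d in range(1, min(max_dur, t) + 1)]
            alpha[i + 1, t] = np.logaddexp.reduce(scores)  # sum over durations d_i
    return alpha[N, T]
```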

  12. Phrase-final vs. non-final durations learned by the EDHMM
  [Figure: learned duration distributions for /AA/ (phrase-medial vs. phrase-final) and /CH/ (phrase-medial vs. phrase-final).]

  13. A factored language model
  • Prosodically tagged words: cats* climb trees*%
  • Unfactored: prosody and word string jointly modeled: p( trees*% | cats* climb )
  • Factored:
    • Prosody depends on syntax: p( w*% | N V N, w* w )
    • Syntax depends on words: p( N V N | cats climb trees )
  [Figure: graphical models. Unfactored: a chain over joint states (p_{i−1}, w_{i−1}) → (p_i, w_i). Factored: words w_i generate syntactic tags s_i, which generate prosodic tags p_i.]
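
  Put together, the factored score is the product p(W) p(S|W) p(P|S). A minimal log-domain sketch, with all three component models passed in as placeholder callables (none of these names come from the talk):

```python
# Sketch of the factored language-model score p(W,P,S) = p(W) p(S|W) p(P|S).
# log_p_words, log_p_syntax, log_p_prosody are assumed log-probability models.
def factored_lm_logprob(words, syn_tags, pros_tags,
                        log_p_words, log_p_syntax, log_p_prosody):
    return (log_p_words(words)                    # e.g. p(cats climb trees)
            + log_p_syntax(syn_tags, words)       # e.g. p(N V N | cats climb trees)
            + log_p_prosody(pros_tags, syn_tags)) # e.g. p(* _ *% | N V N)
```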

  14. Result: Syntactic mediation of prosody reduces perplexity and WER
  • Factored model: reduces perplexity by 35%, reduces WER by 4%
  • Syntactic tags:
    • For pitch accent: POS is sufficient
    • For IP boundaries: parse information is useful if available

  15. Syntactic factors: POS, Syntactic phrase boundary depth

  16. Results: Word Error Rate (Radio News Corpus)

  17. Results: Pitch Accent Error Rate

  18. Results: Intonational Phrase Boundary Error Rate

  19. Conclusions
  • Learn from sparse data: factor the model
    • F0 stream: depends on pitch accent
    • Duration PDF: depends on phrase position
    • POS: predicts pitch accent
    • Syntactic phrase boundary depth: predicts intonational phrase boundaries
  • Word error rate: reduced 12%, but only if both syntactic and acoustic dependencies are modeled
  • Accent detection error:
    • 17% same corpus, words known
    • 21% different corpus or words unknown
  • Boundary detection error:
    • 7% same corpus, words known
    • 15% different corpus or words unknown

  20. Future work: Switchboard
  • Different statistics (p(accent) = 0.32 vs. 0.55)
  • Different phenomena (disfluency)

  21. A Bayesian network view of a speech utterance
  • X: acoustic-phonetic observations
  • Y: acoustic-prosodic observations
  • Q: phonemes
  • H: phone-level prosodic tags
  • W: words
  • P: word-level prosodic tags
  • S: syntax
  • M: message
  [Figure: the network is layered: frame level (X, Y), segmental level (Q, H), word level (W, P), then syntax S and message M.]

  22. Prosody modeled in our system
  • Two binary tag variables (Toneless ToBI):
    • the pitch accent (*)
    • the intonational phrase boundary (%)
  • Both are highly correlated with acoustics and syntax:
    • Pitch accents: pitch excursions (H*, L*); encode syntactic information (e.g. the content/function word distinction)
    • IP boundaries: preboundary lengthening, boundary tones, pauses, etc.; highly correlated with syntactic phrase boundaries

  23. Prosody-dependent speech recognition framework
  [Figure: generative chain M → S → (W,P) → (Q,H) → (X,Y).]
  • Advantages:
    • A natural extension of prosody-independent ASR (PI-ASR)
    • Allows convenient integration of useful linguistic knowledge at different levels
    • Flexible

  24. Prosody-dependent pronunciation modeling: p(Q_i|w_i) => p(Q_i,H_i|w_i,p_i)
  • Phrasal pitch accent affects phones in the lexically stressed syllable: above → ax b ah v; above* → ax b* ah* v*
  • IP boundary affects phones in the phrase-final rhyme: above% → ax b ah% v%; above*% → ax b* ah*% v*%

  25. Prosody-dependent acoustic modeling
  • Prosody-dependent allophone models Λ(q) => Λ(q,h):
    • acoustic-phonetic observation PDF: b(X|q) => b(X|q,h)
    • duration PMF: d(q) => d(q,h)
    • acoustic-prosodic observation PDF: f(Y|q,h)
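
  One common way to bootstrap such models is to clone each monophone parameter set into one copy per prosodic context and then retrain. A dict-based sketch of that initialization step, under the assumption that each model Λ(q) is stored as a parameter dictionary (placeholder representation, not the actual HMM toolkit format):

```python
# Sketch: initialize prosody-dependent allophone models Lambda(q,h) by
# cloning the base monophone model Lambda(q) for each prosodic variant.
def make_prosody_dependent(monophones):
    tags = ["", "*", "%", "*%"]   # unaccented/accented x nonfinal/final variants
    return {phone + tag: dict(params)   # copy Lambda(q) as the seed for Lambda(q,h)
            for phone, params in monophones.items()
            for tag in tags}
```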

  26. Acoustic-phonetic observations (MFCC + energy, MGHMM)
  • Reduction of cross-entropy on held-out data: a three-way distinction of allophones
    • “Strengthened”: phrase-initial accented, phrase-initial unaccented, phrase-medial accented
    • “Lengthened”: phrase-final accented, phrase-final unaccented
    • “Neutral”: phrase-medial unaccented
  • Reduction of word error rate: all allophone models (prosodic or triphone) underperform a monophone-based recognizer

  27. How prosody improves word recognition
  • Discriminant function, prosody-independent:
  F(W_T;O) = E_{W_T,O}{ log p(W_T|O) } = − E_{W_T,O}{ log ( Σ_i h_i ) },
  h_i = [ p(O|W_i) p(W_i) ] / [ p(O|W_T) p(W_T) ]

  28. How prosody improves word recognition
  • Discriminant function, prosody-dependent:
  F_P(W_T;O) = E_{W_T,O}{ log p′(W_T|O) } = − E_{W_T,O}{ log ( Σ_i h_i′ ) },
  h_i′ = [ p(O|W_i,P_i) p(W_i,P_i) ] / [ p(O|W_T,P_T) p(W_T,P_T) ]

  29. How prosody improves word recognition
  • Acoustically likely prosody must be unlikely to co-occur with an acoustically likely incorrect word string, most of the time:
  F_P(W_T;O) > F(W_T;O) IFF
  Σ_i [ p(O|W_i,P_i) p(W_i,P_i) ] / [ p(O|W_T,P_T) p(W_T,P_T) ] < Σ_i [ p(O|W_i) p(W_i) ] / [ p(O|W_T) p(W_T) ]
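
  The inequality can be checked numerically for a given set of competitor hypotheses. A toy sketch, with all likelihoods supplied as plain numbers (the function name and input format are illustrative only):

```python
# Sketch of the discriminant comparison: prosody helps on this utterance
# iff sum_i h_i' < sum_i h_i, i.e. F_P(W_T;O) > F(W_T;O).
def prosody_helps(competitors, target, competitors_p, target_p):
    """competitors:   list of (p(O|W_i), p(W_i)) pairs
    target:        (p(O|W_T), p(W_T))
    competitors_p: list of (p(O|W_i,P_i), p(W_i,P_i)) pairs
    target_p:      (p(O|W_T,P_T), p(W_T,P_T))"""
    sum_h = sum(ac * lm for ac, lm in competitors) / (target[0] * target[1])
    sum_h_p = sum(ac * lm for ac, lm in competitors_p) / (target_p[0] * target_p[1])
    return sum_h_p < sum_h
```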

  30. The Corpus
  • The Boston University Radio News Corpus
    • Stories read by 7 professional radio announcers
    • 5k-word vocabulary
    • 25k word tokens
    • 3 hours of clean speech
    • No disfluencies
    • Expressive and well-behaved prosody
  • 85% of utterances selected randomly for training, 5% for development testing, and the remaining 10% for testing
  • Small by ASR standards, but the largest ToBI-transcribed English corpus
