
Automatic Speech Recognition




  1. Automatic Speech Recognition (ILVB-2006 Tutorial)

  2. The Noisy Channel Model
  • Automatic speech recognition (ASR) is a process by which an acoustic speech signal is converted into a set of words [Rabiner et al., 1993]
  • The noisy channel model [Lee et al., 1996]
  • Acoustic input is considered a noisy version of a source sentence
  [Diagram: source sentence → noisy channel → noisy sentence → decoder → guess at original sentence; example sentence: 버스 정류장이 어디에 있나요? "Where is the bus stop?"]

  3. The Noisy Channel Model
  • What is the most likely sentence out of all sentences in the language L given some acoustic input O?
  • Treat acoustic input O as a sequence of individual observations: O = o1, o2, o3, …, ot
  • Define a sentence as a sequence of words: W = w1, w2, w3, …, wn
  • Bayes' rule: Ŵ = argmax_{W∈L} P(W|O) = argmax_{W∈L} P(O|W) P(W) / P(O)
  • Golden rule: Ŵ = argmax_{W∈L} P(O|W) P(W), since P(O) is the same for every candidate sentence
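The argmax above can be sketched in a few lines. This is a toy illustration, not a real decoder: the sentences and their log-scores are hypothetical stand-ins for what the acoustic and language models would supply.

```python
# Hypothetical toy scores for illustration; a real recognizer sums
# log-likelihoods over frames (acoustic model) and words (language model).
candidates = {
    # sentence: (log P(O|W) from acoustic model, log P(W) from language model)
    "where is the bus stop": (-120.0, -8.5),
    "where is the bus top":  (-118.0, -14.2),
    "wear is the bus stop":  (-121.0, -13.0),
}

def decode(candidates):
    """Pick argmax_W P(O|W) P(W); P(O) is constant over W and can be dropped."""
    return max(candidates, key=lambda w: sum(candidates[w]))

print(decode(candidates))  # the acoustically second-best sentence wins on P(W)
```

Note how the language model overrules the slightly better acoustic score of "where is the bus top": the product P(O|W)·P(W) is what is maximized, not either factor alone.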

  4. Speech Recognition Architecture Meets Noisy Channel
  [Diagram: speech signals → feature extraction → decoding → word sequence. Decoding searches a network built (network construction) from three knowledge sources: an acoustic model (HMM estimation from a speech DB), a pronunciation model (G2P), and a language model (LM estimation from text corpora). Example input/output: 버스 정류장이 어디에 있나요? "Where is the bus stop?"]

  5. Feature Extraction
  • Mel-frequency cepstral coefficients (MFCC) are a popular choice [Paliwal, 1992]
  • Frame size: 25 ms / frame rate: 10 ms
  • 39 features per 10 ms frame
  • Absolute: log frame energy (1) and MFCCs (12)
  • Delta: first-order derivatives of the 13 absolute coefficients
  • Delta-delta: second-order derivatives of the 13 absolute coefficients
  [Diagram: overlapping 25 ms frames a1, a2, a3, … taken every 10 ms; pipeline: x(n) → pre-emphasis / Hamming window → FFT (fast Fourier transform) → mel-scale filter bank → log|·| → DCT (discrete cosine transform) → MFCC (12-dimensional)]
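The pipeline above can be sketched in NumPy. The 25 ms / 10 ms framing and the 12 cepstra plus log energy come from the slide; the filter-bank size (26), FFT length (512), and 0.97 pre-emphasis factor are common defaults assumed here, not stated in the slide.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filt=26, n_fft=512, n_ceps=12):
    """Minimal MFCC sketch: pre-emphasis, Hamming frames, power spectrum,
    mel filter bank, log, DCT. Illustrative only, not production-grade."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
    flen, hop = sr * frame_ms // 1000, sr * hop_ms // 1000
    n_frames = 1 + (len(sig) - flen) // hop
    frames = np.stack([sig[i * hop:i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)                            # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft       # power spectrum
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = np.floor((n_fft + 1) * inv(np.linspace(0, mel(sr / 2), n_filt + 2)) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for j in range(n_filt):
        l, c, r = edges[j], edges[j + 1], edges[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    ceps = dct(logmel, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]  # 12 MFCCs
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)          # log frame energy
    return np.hstack([energy[:, None], ceps])                     # 13 absolute features

feats = mfcc(np.random.randn(16000))   # 1 s of noise at 16 kHz
print(feats.shape)                     # (frames, 13); deltas and delta-deltas would make 39
```

Appending first- and second-order derivatives of these 13 columns would yield the 39 features per frame mentioned above.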

  6. Acoustic Model
  • Provides P(O|Q) = P(features|phone)
  • Modeling units [Bahl et al., 1986]
  • Context-independent: phoneme
  • Context-dependent: diphone, triphone, quinphone
  • pL-p+pR: triphone with left context pL and right context pR
  • Typical acoustic model [Juang et al., 1986]
  • Continuous-density hidden Markov model
  • Output distribution: Gaussian mixture
  • HMM topology: 3-state left-to-right model for each phone, 1-state model for silence or pause
  [Diagram: state output density b_j(x) backed by a codebook]
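The state output density b_j(x) of a continuous-density HMM can be sketched as a diagonal-covariance Gaussian mixture. The mixture weights, means, and variances below are toy values for illustration, not trained parameters.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log b_j(x) for one HMM state: a diagonal-covariance Gaussian mixture,
    as in a typical continuous-density acoustic model."""
    x = np.asarray(x, dtype=float)
    # Per-component log N(x; mu, diag(var)), summed over feature dimensions.
    comp = -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (x - means) ** 2 / variances, axis=1)
    # Numerically stable log-sum-exp over weighted components.
    scored = comp + np.log(weights)
    m = np.max(scored)
    return m + np.log(np.sum(np.exp(scored - m)))

# Toy 2-component mixture over 2-dimensional features.
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
print(gmm_loglik([0.5, 0.5], weights, means, variances))
```

A frame near the component means scores much higher than a distant outlier, which is exactly what the decoder exploits when comparing phone hypotheses.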

  7. Pronunciation Model
  • Provides P(Q|W) = P(phone|word)
  • Word lexicon [Hazen et al., 2002]
  • Maps legal phone sequences into words according to phonotactic rules
  • G2P (grapheme-to-phoneme): generates a word lexicon automatically
  • Several words may have multiple pronunciations
  • Example: tomato
  • P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1
  • P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4
  [Diagram: pronunciation network [t] → [ow] (0.2) or [ah] (0.8) → [m] → [ey] (0.5) or [aa] (0.5) → [t] → [ow]]
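A word lexicon with multiple pronunciations can be represented as a simple mapping from words to weighted phone sequences. This sketch uses the "tomato" example from the slide; the flat dictionary stands in for the pronunciation network a real system would use.

```python
# Pronunciation lexicon sketch: each word maps to phone sequences
# with probabilities P(Q|W), using the slide's "tomato" example.
lexicon = {
    "tomato": {
        ("t", "ow", "m", "ey", "t", "ow"): 0.1,
        ("t", "ow", "m", "aa", "t", "ow"): 0.1,
        ("t", "ah", "m", "ey", "t", "ow"): 0.4,
        ("t", "ah", "m", "aa", "t", "ow"): 0.4,
    },
}

def best_pronunciation(word):
    """Return the most probable phone sequence for a word."""
    prons = lexicon[word]
    return max(prons, key=prons.get)

print(best_pronunciation("tomato"))  # one of the [ah]-initial variants
```

In decoding, all four variants would typically be kept in the search network with their probabilities, rather than committing to the single best one up front.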

  8. Training
  • Training process [Lee et al., 1996]
  • Network for training
  [Diagram: training loop — speech DB → feature extraction → Baum-Welch re-estimation → converged? If no, re-estimate; if yes, output the HMM. Model hierarchy: sentence HMM (e.g., ONE TWO THREE, ONE TWO, ONE THREE) → word HMM (ONE) → phone HMMs (W, AH, N), each phone a 3-state (1-2-3) left-to-right model]
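Baum-Welch re-estimation is built on the forward-backward algorithm; the forward pass alone already computes the data likelihood P(O|HMM) that each re-estimation iteration drives upward. A minimal sketch for a toy discrete-observation HMM (real acoustic models use Gaussian-mixture emissions and log-domain arithmetic):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(O | HMM). pi: initial state probs,
    A[i, j]: transition probs, B[j, o]: prob of emitting symbol o in state j."""
    alpha = pi * B[:, obs[0]]            # initialize with first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate and absorb next observation
    return alpha.sum()

# Toy 2-state left-to-right HMM over a binary observation alphabet.
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])              # left-to-right topology: no going back
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward(pi, A, B, [0, 0, 1]))
```

Baum-Welch then uses these forward (and the symmetric backward) quantities to compute state-occupancy statistics and update A and B until the likelihood converges, as in the diagram's loop.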

  9. Language Model
  • Provides P(W), the probability of the sentence [Beaujard et al., 1999]
  • We saw this was also used in the decoding process as the probability of transitioning from one word to another
  • Word sequence: W = w1, w2, w3, …, wn
  • The problem is that we cannot reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language
  • n-gram language model
  • n-gram language models use the previous n-1 words to represent the history
  • Bigrams are easily incorporated in a Viterbi search
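Estimating a bigram model from counts can be sketched as follows. The three-sentence corpus is a toy example; real systems train on large text corpora and add smoothing for unseen word pairs, which is omitted here.

```python
from collections import Counter

def train_bigram(corpus):
    """Maximum-likelihood bigram estimates P(w2|w1) = c(w1, w2) / c(w1).
    No smoothing: unseen pairs simply have no entry."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks[:-1])                    # count each history word
        bi.update(zip(toks[:-1], toks[1:]))      # count adjacent pairs
    return {pair: c / uni[pair[0]] for pair, c in bi.items()}

corpus = ["the bus stops here", "the bus departs", "a bus stops"]
lm = train_bigram(corpus)
print(lm[("the", "bus")])    # "the" occurs twice, both times followed by "bus"
print(lm[("bus", "stops")])  # "bus" occurs three times, twice followed by "stops"
```

The resulting P(w2|w1) table is exactly what the decoder applies at word transitions in the Viterbi search.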

  10. Language Model
  • Examples:
  • Finite-state network (FSN)
  • Context-free grammar (CFG): $time = 세시|네시 ("three o'clock"|"four o'clock"); $city = 서울|부산|대구|대전 (Seoul|Busan|Daegu|Daejeon); $trans = 기차|버스 ("train"|"bus"); $sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans
  • Bigram: P(에서|서울) = 0.2, P(세시|에서) = 0.5, P(출발|세시) = 1.0, P(하는|출발) = 0.5, P(출발|서울) = 0.5, P(도착|대구) = 0.9, …
  [Diagram: FSN over the same vocabulary — city names 서울/부산/대구/대전, particle 에서 ("from"), times 세시/네시, 출발 ("departure") / 도착 ("arrival"), 하는, and 기차/버스 ("train"/"bus")]

  11. Network Construction
  • Expanding every word to the state level, we get a search network [Demuynck et al., 1997]
  • The search network combines the acoustic model, pronunciation model, and language model
  [Diagram: Korean digit network between start and end nodes — words 일 "one" (phones I L), 이 "two" (I), 삼 "three" (S A M), 사 "four" (S A), each expanded into phone-level HMM states; intra-word transitions connect states within a word, and between-word transitions carry the language model probability P(word|x)]

  12. Decoding
  • Find Ŵ = argmax_W P(O|W) P(W)
  • Viterbi search: dynamic programming
  • Token-passing algorithm [Young et al., 1989]
  • Initialize all states with a token carrying a null history and the likelihood that the state is a start state
  • For each frame a_k:
  • For each token t in state s, with probability P(t) and history H:
  • For each state r: add a new token to r with probability P(t) · P_{s,r} · P_r(a_k), and history s.H
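The token-passing loop above can be sketched for a toy discrete HMM. Each state keeps only its best token (log score plus state history), which is exactly the Viterbi dynamic-programming recursion; real decoders pass tokens through the phone/word network instead of a bare state lattice.

```python
import math

def viterbi(pi, A, B, obs):
    """Viterbi decoding via token passing: tokens are (log score, history),
    and each state retains only the best token arriving at each frame."""
    tokens = {s: (math.log(pi[s] * B[s][obs[0]]), [s])
              for s in range(len(pi)) if pi[s] * B[s][obs[0]] > 0}
    for o in obs[1:]:
        new = {}
        for s, (score, hist) in tokens.items():
            for r in range(len(A)):
                if A[s][r] > 0 and B[r][o] > 0:
                    cand = score + math.log(A[s][r]) + math.log(B[r][o])
                    if r not in new or cand > new[r][0]:
                        new[r] = (cand, hist + [r])   # token carries its history
        tokens = new
    return max(tokens.values())[1]                    # best final token's path

pi = [1.0, 0.0]
A = [[0.6, 0.4], [0.0, 1.0]]
B = [[0.8, 0.2], [0.3, 0.7]]
print(viterbi(pi, A, B, [0, 1, 1]))
```

Working in log probabilities avoids the numerical underflow that multiplying many small probabilities would cause over long utterances.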

  13. Decoding
  • Pruning [Young et al., 1996]
  • The entire search space for Viterbi search is much too large
  • The solution is to prune tokens on paths whose score is too low
  • Typical methods:
  • Histogram pruning: keep at most n total hypotheses
  • Beam pruning: keep only hypotheses whose score is within a fixed margin of the best score
  • N-best hypotheses and word graphs
  • Keep multiple tokens and return the n best paths/scores
  • Can produce a packed word graph (lattice)
  • Multiple-pass decoding
  • Perform multiple passes, applying successively more fine-grained language models
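Beam and histogram pruning can be combined in one small function over a frame's active tokens. The token set, beam width, and histogram limit below are illustrative values, not tuned thresholds.

```python
def prune(tokens, beam=10.0, histogram=3):
    """Prune a {state: log_score} token set: first drop scores more than
    `beam` below the best (beam pruning), then keep at most `histogram`
    hypotheses (histogram pruning)."""
    best = max(tokens.values())
    kept = {s: v for s, v in tokens.items() if v >= best - beam}
    top = sorted(kept, key=kept.get, reverse=True)[:histogram]
    return {s: kept[s] for s in top}

tokens = {"a": -5.0, "b": -7.0, "c": -30.0, "d": -6.0, "e": -9.0}
print(prune(tokens))  # "c" falls outside the beam; "e" loses to the histogram cap
```

Pruning makes the search inexact: a path that is temporarily poor but would later win can be discarded, which is the usual trade-off between decoding speed and search error.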

  14. Large-Vocabulary Continuous Speech Recognition (LVCSR)
  • Decoding continuous speech over a large vocabulary
  • Computationally complex because of the huge potential search space
  • Weighted finite-state transducers (WFST) [Mohri et al., 2002]
  • Efficient in time and space
  • Dynamic decoding
  • On-demand network construction
  • Much lower memory requirements
  [Diagram: cascade of WFSTs — state:HMM, HMM:phone, phone:word, word:sentence — combined and optimized into the search network]
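The core WFST operation behind combining those layers is composition. A minimal sketch over epsilon-free transducers, with edges as (src, input, output, weight, dst) tuples and tropical (additive log-cost) weights; the two one-arc transducers are hypothetical fragments, and real toolkits such as OpenFst additionally handle epsilon transitions, determinization, and minimization.

```python
def compose(t1, t2):
    """Minimal WFST composition sketch (no epsilon transitions):
    an arc survives when the first transducer's output symbol matches
    the second's input symbol; tropical weights add."""
    edges = []
    for (s1, i1, o1, w1, d1) in t1:
        for (s2, i2, o2, w2, d2) in t2:
            if o1 == i2:  # output of the first feeds input of the second
                edges.append(((s1, s2), i1, o2, w1 + w2, (d1, d2)))
    return edges

# Hypothetical fragments: a phone-to-word arc composed with a word-level LM arc.
phone_to_word = [(0, "b", "bus", 0.5, 1)]
word_lm = [(0, "bus", "bus", 1.2, 1)]
print(compose(phone_to_word, word_lm))
```

Composing state:HMM with HMM:phone, phone:word, and word:sentence in turn yields a single transducer from frames to sentences, which dynamic decoders expand on demand instead of materializing in full.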

  15. References (1/2)
  • L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proceedings of ICASSP, pp. 49–52.
  • C. Beaujard and M. Jardino, 1999. Language modeling based on automatic word concatenations, Proceedings of the 6th European Conference on Speech Communication and Technology, vol. 4, pp. 1563–1566.
  • K. Demuynck, J. Duchateau, and D. V. Compernolle, 1997. A static lexicon network representation for cross-word context-dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 143–146.
  • T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp. 99–104.
  • M. Mohri, F. Pereira, and M. Riley, 2002. Weighted finite-state transducers in speech recognition, Computer Speech and Language, vol. 16, no. 1, pp. 69–88.

  16. References (2/2)
  • B. H. Juang, S. E. Levinson, and M. M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307–309.
  • C. H. Lee, F. K. Soong, and K. K. Paliwal, 1996. Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer Academic Publishers.
  • K. K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing, vol. 2, pp. 157–173.
  • L. R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286.
  • L. R. Rabiner and B. H. Juang, 1993. Fundamentals of Speech Recognition, Prentice-Hall.
  • S. J. Young, N. H. Russell, and J. H. S. Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems, Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
  • S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK Book, Entropic Cambridge Research Lab., Cambridge, UK.
