
Automatic Speech Recognition Introduction

This reading covers the introduction to automatic speech recognition (ASR) using Hidden Markov Models (HMMs) and Dynamic Bayesian Networks (DBNs). It explores the components of an ASR system, feature calculation, acoustic modeling, language modeling, and robust speech recognition techniques.

ecannon

Presentation Transcript


  1. Automatic Speech Recognition: Introduction • Readings: Jurafsky & Martin 7.1-2; HLT Survey Chapter 1

  2. The Human Dialogue System

  3. The Human Dialogue System

  4. Computer Dialogue Systems [Diagram components: Audition, Automatic Speech Recognition, Natural Language Understanding, Dialogue Management, Planning, Natural Language Generation, Text-to-speech; arc labels: signal, words, logical form]

  5. Computer Dialogue Systems [Same diagram with abbreviated labels: Audition, ASR, NLU, Dialogue Mgmt., Planning, NLG, Text-to-speech; arc labels: signal, words, logical form]

  6. Parameters of ASR Capabilities • Different types of tasks with different difficulties • Speaking mode (isolated words / continuous speech) • Speaking style (read / spontaneous) • Enrollment (speaker-independent / speaker-dependent) • Vocabulary (small: < 20 words / large: > 20k words) • Language model (finite state / context sensitive) • Perplexity (small: < 10 / large: > 100) • Signal-to-noise ratio (high: > 30 dB / low: < 10 dB) • Transducer (high-quality microphone / telephone)

  7. The Noisy Channel Model [Diagram: message, noisy channel, message] • Message + Channel = Signal • Decoding model: find $\text{Message}^* = \arg\max_{\text{Message}} P(\text{Message} \mid \text{Signal})$ • But how do we represent each of these things?

  8. ASR using HMMs • Try to solve P(Message|Signal) by breaking the problem up into separate components • Most common method: Hidden Markov Models • Assume that a message is composed of words • Assume that words are composed of sub-word parts (phones) • Assume that phones have some sort of acoustic realization • Use probabilistic models for matching acoustics to phones to words

  9. HMMs: The Traditional View [Diagram: the words "go home" as a Markov model backbone composed of the phones g, o, h, o, m (hidden because we don’t know the correspondences), linked to the acoustic observations x0 … x9] • Each line represents a probability estimate (more later)

  10. HMMs: The Traditional View [Same diagram: phones g, o, h, o, m over the acoustic observations x0 … x9] • Even with the same word hypothesis, there can be different alignments • Also, we have to search over all word hypotheses

  11. HMMs as Dynamic Bayesian Networks [Diagram: "go home" as a Markov model backbone composed of phones, with states q0=g, q1=o, q2=o, q3=o, q4=h, q5=o, q6=o, q7=o, q8=m, q9=m over the acoustic observations x0 … x9]

  12. HMMs as Dynamic Bayesian Networks [Same diagram: states q0=g, q1=o, q2=o, q3=o, q4=h, q5=o, q6=o, q7=o, q8=m, q9=m over x0 … x9] • ASR: What is the best assignment to q0 … q9 given x0 … x9?

  13. Hidden Markov Models & DBNs DBN representation Markov Model representation

  14. Parts of an ASR System • Feature Calculation • Acoustic Modeling (phone labels: k @) • Pronunciation Modeling (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …) • Language Modeling (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …) • SEARCH • Output: The cat chased the dog

  15. Parts of an ASR System • Feature Calculation: produces acoustics ($x_t$) • Acoustic Modeling: maps acoustics to phones (k @) • Pronunciation Modeling: maps phones to words (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …) • Language Modeling: strings words together (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …)

  16. Feature calculation

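  17. Feature calculation [Spectrogram: frequency vs. time] • Find the energy at each time step in each frequency channel

  18. Feature calculation [Spectrogram: frequency vs. time] • Take the inverse Discrete Fourier Transform of the log energies (in practice often a discrete cosine transform) to decorrelate the frequency channels, yielding cepstral coefficients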

  19. Feature calculation Input: -0.1 0.3 1.4 -1.2 2.3 2.6 … 0.2 0.1 1.2 -1.2 4.4 2.2 … 0.2 0.0 1.2 -1.2 4.4 2.2 … -6.1 -2.1 3.1 2.4 1.0 2.2 … Output: …
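The feature-calculation steps on slides 17 to 19 can be sketched roughly as follows. This is a simplified illustration rather than the front end used in the slides: it skips the mel filterbank that real systems usually apply, uses raw FFT bins as the "frequency channels", and uses a DCT-II as the decorrelating transform. The function name `cepstral_features` and the frame sizes (400 samples with a 160-sample hop, i.e. 25 ms / 10 ms at an assumed 16 kHz sampling rate) are assumptions made for the example.

```python
import numpy as np

def cepstral_features(signal, frame_len=400, hop=160, n_ceps=13):
    """Frame the signal, take log spectral energies, then a DCT-II to decorrelate.

    Assumes len(signal) >= frame_len. All parameter defaults are illustrative.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Energy at each time step (row) in each frequency channel (column).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_energy = np.log(power + 1e-10)  # small constant avoids log(0)
    # DCT-II basis: entry (k, n) is cos(pi * k * (n + 0.5) / N).
    n_bins = log_energy.shape[1]
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_bins)[None, :]
    basis = np.cos(np.pi * k * (n + 0.5) / n_bins)
    return log_energy @ basis.T  # shape: (n_frames, n_ceps)
```

Each row of the result is one feature vector $x_t$ of the kind shown on slide 19.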

  20. Robust Speech Recognition • Different schemes have been developed for dealing with noise and reverberation • Additive noise: reduce effects of particular frequencies • Convolutional noise: remove effects of linear filters (cepstral mean subtraction)
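Cepstral mean subtraction, mentioned on slide 20, is short enough to sketch. A fixed linear filter multiplies the spectrum, so it shows up as a roughly constant offset in the log-spectral (cepstral) domain; subtracting the per-utterance mean of each coefficient removes that offset. The function name and the assumption that features arrive as a (frames × coefficients) array are choices made for this example.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient.

    cepstra: array of shape (n_frames, n_coefficients) for one utterance.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Two utterances whose cepstra differ only by a constant per-coefficient offset map to the same normalized features.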

  21. Now what? -0.1 0.3 1.4 -1.2 2.3 2.6 … 0.2 0.1 1.2 -1.2 4.4 2.2 … 0.2 0.0 1.2 -1.2 4.4 2.2 … -6.1 -2.1 3.1 2.4 1.0 2.2 … ??? That you …

  22. Machine Learning! -0.1 0.3 1.4 -1.2 2.3 2.6 … 0.2 0.1 1.2 -1.2 4.4 2.2 … 0.2 0.0 1.2 -1.2 4.4 2.2 … -6.1 -2.1 3.1 2.4 1.0 2.2 … Pattern recognition That you … with HMMs

  23. Hidden Markov Models (again!) • $P(\text{state}_{t+1} \mid \text{state}_t)$: Pronunciation/Language models • $P(\text{acoustics}_t \mid \text{state}_t)$: Acoustic Model

  24. Acoustic Model [Figure: feature vectors with phone labels: dh: -0.1 0.3 1.4 -1.2 2.3 2.6 …; a: 0.2 0.1 1.2 -1.2 4.4 2.2 …; a: 0.2 0.0 1.2 -1.2 4.4 2.2 …; t: -6.1 -2.1 3.1 2.4 1.0 2.2 …] • Assume that you can label each vector with a phonetic label • Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks) • $\mathcal{N}_a(\mu, \Sigma)$ models $P(X \mid \text{state}=a)$
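The "collect the examples of each phone and build a Gaussian" step on slide 24 might look like the sketch below. It assumes diagonal covariances and a variance floor to keep the example short; the slide does not say which covariance structure is intended, and the function names are hypothetical.

```python
import numpy as np

def fit_phone_gaussians(vectors, labels, var_floor=1e-3):
    """Return {phone: (mean, variance)}, one diagonal Gaussian per phone label."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    models = {}
    for phone in np.unique(labels):
        frames = vectors[labels == phone]  # all examples of this phone
        variance = np.maximum(frames.var(axis=0), var_floor)
        models[str(phone)] = (frames.mean(axis=0), variance)
    return models

def log_likelihood(x, model):
    """log P(x | state) under a diagonal-covariance Gaussian (mean, variance)."""
    mean, var = model
    x = np.asarray(x, dtype=float)
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))
```

Given labeled frames, `log_likelihood(x, models["a"])` plays the role of $\log P(X \mid \text{state}=a)$ in the decoder.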

  25. Building up the Markov Model • Start with a model for each phone • Typically, we use 3 states per phone to give a minimum duration constraint, but ignore that here… [Diagram: a chain of states for the phone "a", with arcs labeled with the transition probabilities p and 1-p]
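As a brief aside on why the transition probability matters: a single state with a self-loop implies a geometric distribution over how many frames the model stays in that state. Writing the self-loop probability as $s$ (a symbol introduced here, since the extracted diagram does not make clear which arc carries p and which carries 1-p), the duration $d$ in frames satisfies:

```latex
P(d) = s^{\,d-1}(1 - s), \qquad d = 1, 2, \dots
\qquad
\mathbb{E}[d] = \sum_{d=1}^{\infty} d\, s^{\,d-1}(1 - s) = \frac{1}{1 - s}
```

Chaining three such states per phone, each of which must consume at least one frame, is what yields the minimum duration of three frames mentioned on the slide.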

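  26. Building up the Markov Model • Pronunciation model gives connections between phones and words [Diagram: phone states dh, a, t with arcs labeled $p_{dh}$, $p_a$, $p_t$ and $1-p_{dh}$, $1-p_a$, $1-p_t$] • Multiple pronunciations [Diagram: alternative phone paths over the phones t, ow, ey, m, ah]

  27. Building up the Markov Model • Language model gives connections between words (e.g., bigram grammar) [Diagram: the word "that" (dh a t) branches to "he" (h iy) with probability p(he|that) and to "you" (y uw) with probability p(you|that)]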
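Bigram probabilities such as p(you|that) are commonly estimated by relative frequency from text. The sketch below shows that estimate; it is an illustration only, the slides do not say how their bigram values were obtained, and the function name is hypothetical. No smoothing is applied, so unseen word pairs simply have no entry.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Maximum-likelihood estimates p(w2 | w1) = count(w1 w2) / count(w1 as history)."""
    history, pairs = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            history[w1] += 1
            pairs[(w1, w2)] += 1
    return {(w1, w2): n / history[w1] for (w1, w2), n in pairs.items()}
```

The resulting dictionary supplies the word-to-word transition weights that connect the per-word phone models.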

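  28. ASR as Bayesian Inference [Diagram: word network with p(he|that) and p(you|that) transitions; states q1w1, q2w1, q3w1 emit the observations x1, x2, x3]
      $\arg\max_W P(W \mid X)$
      $= \arg\max_W P(X \mid W)\,P(W) / P(X)$
      $= \arg\max_W P(X \mid W)\,P(W)$
      $= \arg\max_W \sum_Q P(X, Q \mid W)\,P(W)$
      $\approx \arg\max_W \max_Q P(X, Q \mid W)\,P(W)$
      $\approx \arg\max_W \max_Q P(X \mid Q)\,P(Q \mid W)\,P(W)$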

  29. ASR Probability Models • Three probability models • P(X|Q): acoustic model • P(Q|W): duration/transition/pronunciation model • P(W): language model • language/pronunciation models inferred from prior knowledge • Other models learned from data (how?)

  30. Parts of an ASR System • Feature Calculation • Acoustic Modeling, $P(X \mid Q)$ (phone labels: k @) • Pronunciation Modeling, $P(Q \mid W)$ (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …) • Language Modeling, $P(W)$ (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …) • SEARCH • Output: The cat chased the dog

  31. EM for ASR: The Forward-Backward Algorithm • Determine “state occupancy” probabilities • I.e. assign each data vector to a state • Calculate new transition probabilities, new means & standard deviations (emission probabilities) using assignments
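The "state occupancy" computation on slide 31 can be sketched with a scaled forward-backward pass. This is a minimal illustration for a single observation sequence, assuming the emission likelihoods $P(x_t \mid \text{state})$ have already been evaluated; the function name and argument layout are hypothetical. Note that the assignment of vectors to states is soft: each frame gets a probability for every state.

```python
import numpy as np

def state_occupancy(init, trans, emit_lik):
    """Forward-backward: gamma[t, j] = P(state_t = j | x_0 .. x_{T-1}).

    init:     (S,)   initial state probabilities
    trans:    (S, S) trans[i, j] = P(state_{t+1} = j | state_t = i)
    emit_lik: (T, S) emit_lik[t, j] = P(x_t | state_t = j)
    """
    init, trans, emit_lik = (np.asarray(a, dtype=float) for a in (init, trans, emit_lik))
    T, S = emit_lik.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    scale = np.zeros(T)
    # Forward pass, rescaled at each step to avoid numerical underflow.
    alpha[0] = init * emit_lik[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit_lik[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # Backward pass, reusing the forward scale factors.
    for t in range(T - 2, -1, -1):
        beta[t] = (trans @ (emit_lik[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

In the re-estimation step, new means and variances for a state are averages of the data vectors weighted by that state's column of `gamma`.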

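  32. ASR as Bayesian Inference [Diagram: word network with p(he|that) and p(you|that) transitions; states q1w1, q2w1, q3w1 emit the observations x1, x2, x3]
      $\arg\max_W P(W \mid X)$
      $= \arg\max_W P(X \mid W)\,P(W) / P(X)$
      $= \arg\max_W P(X \mid W)\,P(W)$
      $= \arg\max_W \sum_Q P(X, Q \mid W)\,P(W)$
      $\approx \arg\max_W \max_Q P(X, Q \mid W)\,P(W)$
      $\approx \arg\max_W \max_Q P(X \mid Q)\,P(Q \mid W)\,P(W)$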

  33. Search • When trying to find $W^* = \arg\max_W P(W \mid X)$, we need to look at (in theory) • All possible word sequences W • All possible segmentations/alignments of W and X • Generally, this is done by searching the space of W • Viterbi search: dynamic programming approach that looks for the most likely path • A* search: alternative method that keeps a stack of hypotheses around • If |W| is large, pruning becomes important
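The Viterbi search named on slide 33 can be sketched for a small state space as below. This is a bare dynamic-programming illustration in the log domain, with no pruning and no word-level bookkeeping, so it is far simpler than a real decoder; the function name and array layout are assumptions for the example.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log probability.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log P(state_{t+1} = j | state_t = i)
    log_emit:  (T, S) log P(x_t | state_t = j)
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    backptr = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        # cand[i, j]: best score ending in state i at t-1, then moving i -> j.
        cand = score[t - 1][:, None] + log_trans
        backptr[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    # Trace back from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score[-1].max())
```

The `max` over predecessor states is the $\max_Q$ approximation from the Bayesian-inference slides: only the single best alignment is scored instead of summing over all of them.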

  34. How to train an ASR system • Have a speech corpus at hand • Should have word (and preferably phone) transcriptions • Divide into training, development, and test sets • Develop models of prior knowledge • Pronunciation dictionary • Grammar • Train acoustic models • Possibly realigning corpus phonetically

  35. How to train an ASR system • Test on your development data (baseline) • **Think real hard • Figure out some neat new modification • Retrain system component • Test on your development data • Lather, rinse, repeat ** • Then, at the end of the project, test on the test data.

  36. Judging the quality of a system • Usually, ASR performance is judged by the word error rate: ErrorRate = 100 * (Subs + Ins + Dels) / Nwords
      REF: I  WANT  TO   GO  HOME  ***
      REC: *  WANT  TWO  GO  HOME  NOW
      SC:  D  C     S    C   C     I
      100 * (1 S + 1 I + 1 D) / 5 = 60%
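The word error rate on slide 36 is normally computed from a minimum edit-distance alignment between the reference and recognized word strings. A minimal sketch follows; the function name is hypothetical, and it returns only the total error count, not the separate substitution, insertion, and deletion counts.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent: 100 * (Subs + Ins + Dels) / Nwords."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: minimum number of edits turning ref[:i] into hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(substitute, delete, insert)
    return 100.0 * dist[-1][-1] / len(ref)

print(word_error_rate("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))  # 60.0
```

The printed value matches the slide's example: one deletion, one substitution, and one insertion against five reference words.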

  37. Judging the quality of a system • Usually, ASR performance is judged by the word error rate • This assumes that all errors are equal • Also, a bit of a mismatch between optimization criterion and error measurement • Other (task specific) measures sometimes used • Task completion • Concept error rate
