
  1. T-61.184 Informaatiotekniikan erikoiskurssi IV: HMMs and Speech Recognition, based on chapter 7 of D. Jurafsky, J. Martin: Speech and Language Processing. Jaakko Peltonen, October 31, 2001

  2. Contents
  • speech recognition architecture
  • HMM, Viterbi, A*
  • speech acoustics & features
  • computing acoustic probabilities
  • speech synthesis

  3. Speech Recognition Architecture
  [Figure: recognition pipeline: Speech Waveform → Feature Extraction (Signal Processing) → Spectral Feature Vectors → Phone Likelihood Estimation (Gaussians or Neural Networks) → Phone Likelihoods P(o|q) → Decoding (Viterbi or Stack Decoder), which combines the phone likelihoods with an N-gram Grammar and an HMM Lexicon to produce Words]
  • Application: LVCSR
  • Large vocabulary: dictionary size 5000 – 60000
  • Continuous speech (words not separated)
  • Speaker-independent

  4. Noisy Channel Model revisited
  • acoustic input considered a noisy version of a source sentence
  • decoding: find the sentence that most probably generated the input
  • problems:
    - metric for selecting the best match?
    - efficient algorithm for finding the best match?

  5. Bayes revisited
  • acoustic input O: symbol sequence
  • sentence W: string of words
  • best match metric: probability P(W|O)
  • Bayes' rule: Ŵ = argmax_W P(W|O) = argmax_W P(O|W) P(W)
    - P(O|W): observation likelihood → acoustic model
    - P(W): prior probability → language model
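As a minimal illustration of this decomposition (a sketch only; acoustic_logprob and lm_logprob are hypothetical scoring callbacks standing in for the acoustic and language models, not anything from the slides), the best sentence is the one that maximizes the sum of the two log-probabilities:

    import math

    def best_sentence(observations, candidate_sentences, acoustic_logprob, lm_logprob):
        """Pick argmax_W P(O|W) P(W), computed in the log domain for numerical stability."""
        best, best_score = None, -math.inf
        for words in candidate_sentences:
            # hypothetical callbacks: log P(O|W) from the acoustic model, log P(W) from the language model
            score = acoustic_logprob(observations, words) + lm_logprob(words)
            if score > best_score:
                best, best_score = words, score
        return best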

  6. Hidden Markov Models (HMMs)
  • previously, Markov chains were used to model pronunciation
  • forward algorithm → phone sequence likelihood
  • real input is not symbolic: spectral features
  • input symbols do not correspond to machine states
  • HMM definition:
    - state set Q
    - observation symbols O ≠ Q
    - transition probabilities A
    - observation likelihoods B, not limited to 1 and 0
    - start and end state(s)
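A minimal sketch of these components as a plain Python record (the field names are illustrative assumptions, not notation from the book):

    from dataclasses import dataclass, field

    @dataclass
    class HMM:
        states: list                            # state set Q, e.g. ["start", "n", "iy", "d", "end"]
        A: dict = field(default_factory=dict)   # transition probabilities: A[(i, j)] = p(next state j | state i)
        B: dict = field(default_factory=dict)   # observation likelihoods: B[j][o] = p(observation o | state j)
        start: str = "start"                    # non-emitting start state
        end: str = "end"                        # non-emitting end state

Unlike the earlier Markov chain, B is a full probability table per state rather than being restricted to 0/1 values.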

  7. HMMs, continued
  [Figure: word model for "need": states start0 → n1 → iy2 → d3 → end4, with transition probabilities a01, a12, a23, a34, self-loops a11, a22, a33, a skip transition a24, and observation likelihoods bj(ot) linking the states to the observation sequence o1 o2 o3 o4 o5 o6 …]

  8. The Viterbi Algorithm
  • word boundaries unknown → segmentation
    [ay d ih s hh er d s ah m th ih ng ax b aw …] "I just heard something about…"
  • assumption: dynamic programming invariant
    - if the ultimate best path for o includes state qi, it includes the best path up to and including qi
  • does not work for all grammars

  9. Viterbi, continued
  [Figure: subphone observation likelihoods b(ax,aw)left, b(ax,aw)middle, b(ax,aw)right]

  function VITERBI(observations of len T, state-graph) returns best-path
    num-states ← NUM-OF-STATES(state-graph)
    Create a path probability matrix viterbi[num-states+2, T+2]
    viterbi[0,0] ← 1.0
    for each time step t from 0 to T do
      for each state s from 0 to num-states do
        for each transition s' from s specified by state-graph
          new-score ← viterbi[s,t] * a[s,s'] * bs'(ot)
          if ((viterbi[s',t+1] = 0) || (new-score > viterbi[s',t+1])) then
            viterbi[s',t+1] ← new-score
            back-pointer[s',t+1] ← s
    Backtrace from the highest probability state in the final column of viterbi[] and return path

  • single automaton → combine single-word networks, add word transition probabilities = bigram probabilities
  • states correspond to subphones & context
  • beam search
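A runnable Python counterpart of the pseudocode above, assuming discrete observation symbols and probability tables passed in as dictionaries (a sketch, not the book's implementation; the transition into the final end state is omitted for brevity):

    def viterbi(observations, states, a, b, start):
        """Most probable state path for a discrete-output HMM.

        a[(s, s2)] : transition probability from state s to s2
        b[s][o]    : likelihood of observation o in state s
        """
        T = len(observations)
        # score[t][s] = probability of the best path that ends in state s after t observations
        score = [{start: 1.0}] + [dict() for _ in range(T)]
        back = [dict() for _ in range(T + 1)]
        for t, o in enumerate(observations):
            for s, prev in score[t].items():
                for s2 in states:
                    new_score = prev * a.get((s, s2), 0.0) * b.get(s2, {}).get(o, 0.0)
                    if new_score > score[t + 1].get(s2, 0.0):
                        score[t + 1][s2] = new_score
                        back[t + 1][s2] = s
        # backtrace from the highest-probability state in the final column
        state = max(score[T], key=score[T].get)
        path = [state]
        for t in range(T, 0, -1):
            state = back[t][state]
            path.append(state)
        return list(reversed(path))

In a real recognizer the probabilities are kept in the log domain and the search is pruned with a beam, as the last bullet notes.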

  10. Other Decoders
  • Viterbi has problems:
    - computes the most probable state sequence, not word sequence
    - cannot be used with all language models (only bigrams)
  • Solution 1: multiple-pass decoding
    - N-best Viterbi: return the N best sentences, sort with a more complex model
    - word lattice: return a directed word graph + word observation likelihoods → refine with a more complex model

  11. A* Decoder
  • Viterbi uses an approximation of the forward algorithm: max instead of sum
  • A* uses the complete forward algorithm → correct observation likelihoods, use any language model
  • 'best-first' search of the word sequence tree:
    - priority queue of scored paths to extend
  • Algorithm:
    1. select the highest-priority path (pop queue)
    2. create possible extensions (if none, stop)
    3. calculate scores for the extended paths (from the forward algorithm and the language model)
    4. add the scored paths to the queue
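A compact sketch of this loop with Python's heapq module (which pops the smallest element, so scores are negated); extend_path and score are hypothetical callbacks standing in for the forward-algorithm scoring and the language model:

    import heapq

    def a_star_decode(initial_path, extend_path, score):
        """Best-first search over word-sequence prefixes.

        extend_path(path) -> list of one-word extensions (empty when the path covers the utterance)
        score(path)       -> A* evaluation f*(p) = g(p) + h*(p); higher is better
        """
        queue = [(-score(initial_path), initial_path)]
        while queue:
            _, path = heapq.heappop(queue)          # 1. select the highest-priority path
            extensions = extend_path(path)          # 2. create possible extensions
            if not extensions:
                return path                         # no extensions: utterance complete, stop
            for new_path in extensions:             # 3.-4. score extensions and add them to the queue
                heapq.heappush(queue, (-score(new_path), new_path))
        return None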

  12. A* Decoder, continued
  [Figure: best-first search tree over word sequences: partial hypotheses such as "if" (30), "Alice" (40), "Every" (25), "In" (4), "walls" (2), "(none)" (1), and extensions such as "music" (32), "muscle" (31), "messy" (25), "was" (29), "wants" (24); path scores combine forward probabilities, e.g. p(acoustic|if) and p(acoustic|music), with language model probabilities, e.g. p(if|START) and p(music|if)]

  13. A* Decoder, continued
  • the score of a word string w is not P(y|w)P(w) (y is the acoustic string)
    - reason: a path prefix would then have a higher score than its extensions
  • score: A* evaluation function f*(p) = g(p) + h*(p)
    - g(p): score from the start to the end of the current string
    - h*(p): estimated score of the best extension to the end of the utterance

  14. Acoustic Processing of Speech
  • wave characteristics: frequency → pitch, amplitude → loudness
  • visible information: vowel/consonant, voicing, length, fricatives, stop closure
  • spectral features: Fourier spectrum / LPC spectrum: peaks characteristic of different sounds → formants
  • spectrogram: changes over time
  • digitization: sampling, quantization
  • processing → cepstral features / PLP features
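A rough numpy sketch of the digitization-to-spectrum step (frame the sampled signal, window it, take log FFT magnitudes); the 25 ms / 10 ms framing values are common choices assumed here, not taken from the slides:

    import numpy as np

    def log_spectrogram(signal, sample_rate, frame_ms=25, shift_ms=10):
        """Return a (num_frames, num_bins) array of log magnitude spectra for a 1-D numpy signal."""
        frame_len = int(sample_rate * frame_ms / 1000)
        shift = int(sample_rate * shift_ms / 1000)
        window = np.hamming(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, shift):
            frame = signal[start:start + frame_len] * window
            magnitude = np.abs(np.fft.rfft(frame))
            frames.append(np.log(magnitude + 1e-10))   # small constant avoids log(0)
        return np.array(frames)

Cepstral or PLP features would apply further steps (mel/Bark filtering, a cosine transform or LPC analysis) on top of such spectra.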

  15. Computing Acoustic Probabilities
  • simple way: vector quantization (cluster feature vectors & count cluster occurrences)
  • continuous approach: calculate a probability density function (pdf) over observations
  • Gaussian pdf: trained with the forward-backward algorithm
  • Gaussian mixtures, parameter tying
  • multi-layer perceptron (MLP) pdf: trained with error back-propagation
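A sketch of the continuous approach with a single diagonal-covariance Gaussian per state (numpy assumed); a Gaussian mixture would combine several such components, weighted by mixture weights, via log-sum-exp:

    import numpy as np

    def diag_gaussian_loglik(x, mean, var):
        """log N(x; mean, diag(var)) for one feature vector x."""
        x, mean, var = np.asarray(x), np.asarray(mean), np.asarray(var)
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)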

  16. Training a Speech Recognizer
  • evaluation metric: word error rate
    1. compute the minimum edit distance between the hypothesized and correct string
    2. divide the number of errors (substitutions + insertions + deletions) by the number of words in the correct string
  • e.g. correct: "I went to a party", hypothesis: "Eye went two a bar tea"
    3 substitutions, 1 insertion → word error rate 80%
  • state of the art: word error rate 20% on natural-speech tasks
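A sketch of the word error rate computation via minimum edit distance over words (equal cost 1 for substitutions, insertions, and deletions):

    def word_error_rate(reference, hypothesis):
        """WER = minimum word edit distance / number of words in the reference."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                               # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                               # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    # word_error_rate("I went to a party", "Eye went two a bar tea") -> 0.8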

  17. Embedded Training
  • models to be trained:
    - language model: p(wi|wi-1, wi-2)
    - observation likelihoods: bj(ot)
    - transition probabilities: aij
    - pronunciation lexicon: HMM state graph
  • training data:
    - corpus of speech wavefiles + word transcriptions
    - large text corpus for language model training
    - smaller corpus of phonetically labeled speech
  • N-gram language model: trained as in Chapter 6
  • HMM lexicon structure: built by hand
    - PRONLEX, CMUdict: "off-the-shelf" pronunciation dictionaries

  18. Embedded Training, continued
  • HMM parameters:
    - initial estimate: equal transition probabilities, observation probabilities bootstrapped (labeled speech → label for each frame → initial Gaussian means / variances)
    - MLP systems: forced Viterbi alignment (features & correct words given → best state labels for each input → retrain MLP)
    - Gaussian systems: forward-backward algorithm (compute forward & backward probabilities, re-estimate a and b); correct words known → prune model
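For reference, the forward pass that forward-backward re-estimation builds on, in the same dictionary notation as the Viterbi sketch earlier (sum over paths instead of max); the backward pass is symmetric, and real implementations scale or use log probabilities to avoid underflow:

    def forward(observations, states, a, b, start):
        """alpha[t][s] = P(o_1..o_t, state s at time t); also returns the total observation likelihood."""
        alpha = [{start: 1.0}] + [dict() for _ in observations]
        for t, o in enumerate(observations):
            for s, prob in alpha[t].items():
                for s2 in states:
                    step = prob * a.get((s, s2), 0.0) * b.get(s2, {}).get(o, 0.0)
                    alpha[t + 1][s2] = alpha[t + 1].get(s2, 0.0) + step
        return sum(alpha[-1].values()), alpha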

  19. Speech Synthesis
  • text-to-speech (TTS) system: output is a phone sequence with durations and an F0 pitch contour
  • waveform concatenation: based on a recorded speech database, segmented into short units
  • simplest: 1 unit / phone, join units & smooth edges
  • triphone models: too many combinations → diphones used
  • diphones start/end midway through a phone for stability
  • does not model pitch & duration changes (prosody)

  20. Speech Synthesis, continued
  • use signal processing to change prosody
  • LPC model separates pitch from the spectral envelope
    - to modify pitch: generate pulses at the desired pitch, re-excite the LPC coefficients → modified wave
    - to modify duration: contract / expand coefficient frames
  • TD-PSOLA: frames centered around pitchmarks
    - to change pitch: move pitchmarks closer together / further apart
    - to change duration: duplicate / leave out frames
    - recombine: overlap and add frames
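A heavily simplified numpy sketch of the duration side of this idea: split the signal into overlapping windowed frames, repeat or skip frames, and overlap-add them back. Real TD-PSOLA centres the frames on detected pitchmarks; the evenly spaced frames here are an assumption made only to keep the example short:

    import numpy as np

    def stretch_duration(signal, factor, frame_len=400, hop=200):
        """Lengthen (factor > 1) or shorten (factor < 1) a 1-D numpy signal by repeating / skipping frames."""
        window = np.hanning(frame_len)
        n_in = max(1, (len(signal) - frame_len) // hop + 1)
        n_out = max(1, int(round(n_in * factor)))
        out = np.zeros(frame_len + (n_out - 1) * hop)
        for k in range(n_out):
            src = min(int(round(k / factor)), n_in - 1)        # pick which input frame to reuse
            frame = signal[src * hop: src * hop + frame_len]
            out[k * hop: k * hop + len(frame)] += window[:len(frame)] * frame
        return out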

  21. Speech Synthesis, continued
  • problems with speech synthesis:
    - 1 example / diphone is insufficient
    - signal processing → distortion
    - subtle effects not modeled
  • unit selection: collect several examples per unit with different pitch / duration / linguistic situation
  • selection method (F0 contour with 3 values/phone, large unit corpus):
    1. find candidates (closest phone, duration & F0), rank them by target cost (closeness)
    2. measure the join quality of neighbouring candidates, rank joins by concatenation cost
    - pick the best unit set → more natural speech
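The selection step can be sketched as a Viterbi-style dynamic program that minimizes the summed target and concatenation costs; target_cost and concat_cost are hypothetical cost functions standing in for the closeness and join-quality measures named above, and units are assumed to be hashable (e.g. integer indices into the corpus):

    def select_units(candidates, target_cost, concat_cost):
        """candidates[i] is the list of recorded units considered for target position i.

        Returns the unit sequence minimizing the sum of target costs plus join costs.
        """
        # best[i][c] = (cost of the best sequence ending in candidate c at position i, predecessor)
        best = [{c: (target_cost(0, c), None) for c in candidates[0]}]
        for i in range(1, len(candidates)):
            best.append({})
            for c in candidates[i]:
                prev, prev_cost = min(
                    ((p, best[i - 1][p][0] + concat_cost(p, c)) for p in candidates[i - 1]),
                    key=lambda pair: pair[1])
                best[i][c] = (prev_cost + target_cost(i, c), prev)
        # backtrace from the cheapest final candidate
        unit = min(best[-1], key=lambda c: best[-1][c][0])
        path = [unit]
        for i in range(len(candidates) - 1, 0, -1):
            unit = best[i][unit][1]
            path.append(unit)
        return list(reversed(path))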

  22. Human Speech Recognition
  • PLP analysis inspired by the human auditory system
  • lexical access has common properties:
    - frequency
    - parallelism
    - neighborhood effects
    - cue-based processing (phoneme restoration): formant structure, timing, voicing, lexical cues, word association, repetition priming
  • differences:
    - time-course: human processing is on-line
    - other cues: prosody

  23. Exercises
  1. Hand-simulate the Viterbi algorithm: use the automaton in Figure 7.8 on the input [aa n n ax n iy d]. What is the most probable string of words?
  2. Suggest two functions for use in A* decoding. What criteria should the function satisfy for the search to work (i.e. to return the best path)?
