
Automatic speech recognition


Presentation Transcript


  1. Speech recognition: Automatic speech recognition. Contents: ASR systems, ASR applications, ASR courses. Presented by Kalle Palomäki. Teaching material: Kalle Palomäki & Mikko Kurimo

  2. Speech recognition: About Kalle • Background: acoustics and audio, auditory brain measurements, hearing models, noise-robust ASR • PhD 2005 at TKK • Research experience at: • Department of Signal Processing and Acoustics • Department of Information and Computer Science, Aalto • University of Sheffield, Speech and Hearing group • Team leader of the noise-robust ASR team, Academy Research Fellow • Current research themes • Hearing-inspired missing-data approach to noise-robust ASR • Sound separation and feature extraction

  3. Speech recognition: Goals of today. Learn what methods are used for automatic speech recognition (ASR). Learn about typical ASR applications and the factors that affect ASR performance. Definition: automatic speech recognition (ASR) = transformation of speech audio into text.

  4. Speech recognition: Orientation • What are the main challenges faced in automatic speech recognition? • Try to think of the three most important ones with a partner

  5. ASR tasks and solutions • Speaking environment and microphone • Office, headset or close-talking • Telephone speech, mobile • Noise, outside, microphone far away • Style of speaking • Speaker modeling

  6. ASR tasks and solutions • Speaking environment and microphone • Style of speaking • Isolated words • Connected words, small vocabulary • Word spotting in fluent speech • Continuous speech, large vocabulary • Spontaneous speech, ungrammatical • Speaker modeling

  7. ASR tasks and solutions • Speaking environment and microphone • Style of speaking • Speaker modelling • Speaker-dependent models • Speaker-independent, average speaker models • Speaker adaptation

  8. Automatic speech recognition: Large-vocabulary continuous speech recognition (LVCSR) is a complex pattern recognition system that utilizes many probabilistic models at different hierarchical levels to transform speech to text. [Pipeline: speech signal → feature extraction → acoustic modeling + language modeling → decoder → recognized text]

  9. Speech recognition: What is speech recognition? Find the most likely word sequence given the acoustic signal and statistical models! The acoustic model defines the sound units independent of speaker and recording conditions. The language model defines words and how likely they occur together. The lexicon (vocabulary) defines the word set and how the words are formed from sound units.

  10. Speech recognition: What is speech recognition? Find the most likely word sequence given the acoustic observations and statistical models.

  11. Speech recognition: What is speech recognition? After applying Bayes' rule: find the most likely word sequence given the observations and the models, where the acoustic model and the language model supply the two probability terms (see the formula below).
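The slide's decomposition corresponds to the standard ASR equation (reconstructed here; O denotes the sequence of acoustic observations and W a candidate word sequence):

```latex
W^{*} = \arg\max_{W} P(W \mid O)
      = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
      = \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

P(O) can be dropped from the maximization because it does not depend on W.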

  12. Preprocessing & features: extract the essential information from the signal; describe the signal by compact feature vectors computed from short time intervals. [Pipeline: speech signal → feature extraction → acoustic modeling + language modeling → decoder → recognized text]

  13. MFCC feature extraction chain: audio signal s_t(n) → |DFT{s_t(n)}| magnitude spectrogram S_t,f → mel spectrogram S_t,j (auditory frequency resolution) → log{S_t,j} (compression) → discrete cosine transformation (de-correlation) → mel-frequency cepstral coefficients (MFCC).
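A minimal sketch of this chain in Python, assuming librosa and scipy are available; the file name, frame length, number of mel bands, and number of coefficients are illustrative choices, not values from the slide:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

# Load an audio file (path and sample rate are illustrative assumptions).
y, sr = librosa.load("speech.wav", sr=16000)

# |DFT{s_t(n)}|: magnitude spectrogram S_{t,f} from short-time frames (25 ms / 10 ms).
S = np.abs(librosa.stft(y, n_fft=400, hop_length=160))

# Auditory frequency resolution: mel filterbank -> mel spectrogram S_{t,j}.
mel_fb = librosa.filters.mel(sr=sr, n_fft=400, n_mels=40)
S_mel = mel_fb @ S

# Compression: log{S_{t,j}}.
log_mel = np.log(S_mel + 1e-10)

# De-correlation: discrete cosine transform -> keep the first 13 MFCCs.
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13]
```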

  14. Acoustic modeling: find basic speech units and their models in the feature space; given the features, compute model probabilities. [Pipeline: speech signal → feature extraction → acoustic modeling + language modeling → decoder → recognized text]

  15. Speech recognition: Phonemes, the basic units of language. Written language: letter; spoken language: phoneme. Wikipedia: “The smallest contrastive linguistic unit which may bring about a change of meaning.” There are different writing systems, e.g. IPA (International Phonetic Alphabet). The phoneme sets differ depending on the language.

  16. Speech recognition: IPA symbols for US English (chart shown on slide).

  17. Speech recognition: 1-dimensional Gaussian mixture model (picture by B. Pellom).

  18. Speech recognition: Gaussian mixture model (GMM) (picture by B. Pellom).

  19.–21. Training a GMM classifier: the collected data is aligned to frame-level phoneme labels, e.g. _ k k k k k ae ae ae ae _ _ _ _ t t t t t _ _ _ _ _ _ _ for the phoneme sequence /_/ /k/ /ae/ /t/ /_/, and one GMM is trained per phoneme from its aligned frames (figures on slides).

  22. Testing the GMM classifier: for a test frame the phoneme GMMs give posterior probabilities over the phoneme classes that sum to 1 (e.g. 0.05, 0.4, 0.05, 0.5); the frame-labelled sequence _ k k k k k ae ae ae ae _ _ _ _ t t t t t _ _ _ _ _ _ _ is shown on the slide (a sketch follows below).
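A minimal sketch of this train/test procedure with scikit-learn. The feature vectors, aligned labels, and mixture sizes here are illustrative placeholders (random data is used only so the sketch runs end to end); in practice the labels come from a forced alignment of the training speech:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

phones = ["_", "k", "ae", "t"]

# Placeholder data: real feats would be MFCC vectors, real labels a forced alignment.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 13))
labels = rng.choice(phones, size=2000)

# Training: fit one GMM per phoneme on the frames aligned to that phoneme.
gmms = {p: GaussianMixture(n_components=4, covariance_type="diag",
                           random_state=0).fit(feats[labels == p])
        for p in phones}

# Testing: per-frame log-likelihoods from each phoneme GMM, converted to
# posteriors that sum to 1 (equal phoneme priors assumed here).
def frame_posteriors(x):
    loglik = np.array([gmms[p].score_samples(x[None, :])[0] for p in phones])
    post = np.exp(loglik - loglik.max())      # subtract max for numerical stability
    return post / post.sum()

print(frame_posteriors(feats[0]))             # four probabilities summing to 1
```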

  23. Speech recognition: How to model a sequence of phonemes (or GMMs)? (picture by B. Pellom)

  24. Speech recognition: Hidden Markov model (HMM) with 3 states. The diagram on the slide shows the state transitions and the observation probabilities b(o1), b(o2), b(o3) of the acoustic observations o1, o2, o3, each given by a state-specific GMM (GMM1, GMM2, GMM3).

  25. Speech recognition: Hidden Markov model (HMM), 1-state transitions. Observation sequence O = {o1, o2, o3}; observation probability sequence B = {b(o1), b(o2), b(o3)}. The probability of staying in the state for the three observations and then leaving is P = b(o1)·a11 · b(o2)·a11 · b(o3)·a_exit, where a11 is the self-transition probability (a sketch of this computation follows).
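A small Python sketch of this path probability for a single-state HMM; the numbers in the example are illustrative, not taken from the slide:

```python
import numpy as np

def one_state_hmm_likelihood(obs_probs, a_self, a_exit):
    """P = b(o1)*a_self * b(o2)*a_self * ... * b(oT)*a_exit for a 1-state HMM."""
    T = len(obs_probs)
    return np.prod(obs_probs) * a_self ** (T - 1) * a_exit

# Example with three observations (illustrative values).
print(one_state_hmm_likelihood([0.5, 0.4, 0.9], a_self=0.8, a_exit=0.2))
```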

  26. Realistic scenario: the frame-level GMM posteriors over the phoneme classes /_/, /k/, /ae/, /t/ still sum to 1 (e.g. 0.05, 0.4, 0.05, 0.5), but the per-frame best guesses form a noisy label sequence (with insertions such as ow and repeated k/t) instead of the clean /_/ k ae t /_/ pattern (figure on slide).

  27. Combining the models: 1-state phoneme HMMs for /_/, /k/, /ae/, /t/ with self- and exit-transition probabilities (values such as 0.8/0.2, 0.79/0.21, 0.9/0.1 shown on the slide) are combined with the noisy frame-level GMM outputs to recover the phoneme sequence (figure on slide).

  28. Exercise 1: Calculate the likelihood of the phoneme sequence /k/ /ow/, as for the word “cow”. The observation probabilities, the temporal alignment, and a set of 1-state phoneme HMMs (/_/, /ae/, /k/, /ow/, /t/) are shown on the slide.

  29. Solution: the likelihood of the phoneme sequence /k/ /ow/, as for the word “cow”, is read off the observation probabilities, the temporal alignment, and the 1-state phoneme HMMs shown on the slide: 0.4 · 0.2 · 0.5 · 0.92 · 0.4 · 0.92 · 0.5 = 0.006771 (checked numerically below).
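The same product can be checked quickly in Python (the factors are the observation and transition probabilities given on the slide):

```python
import numpy as np

# Product of the observation and transition probabilities along the alignment.
print(np.prod([0.4, 0.2, 0.5, 0.92, 0.4, 0.92, 0.5]))   # ≈ 0.006771
```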

  30. Context-dependent HMMs: triphone HMMs for /_/, /k/, /ae/, /t/, /_/, i.e. each phoneme is modelled in its left and right context (e.g. _-k+ae, k-ae+t, ae-t+_; diagram on slide).

  31. More on HMMs Lecture 12-Feb, “Sentence level processing” by Oskar Kohonen Exercise 6, “Hidden Markov Models”

  32. Language modeling: gives a prior probability for any word (or phoneme) sequence; defines basic language units (e.g. words); learns statistical models from large text collections. [Pipeline: speech signal → feature extraction → acoustic modeling + language modeling → decoder → recognized text]

  33. N-gram language model • Stochastic model of the relations between words • Which words often occur close to each other? • The model predicts the probability distribution of the next word given the previous ones • Estimated from large text corpora, i.e. millions of words • Smoothing and pruning are required to learn compact long-span models from sparse training data • More information in the lecture on 26 Feb, “Statistical language models” by Mikko Kurimo

  34. Speech recognition: N-gram models • Trigram = 3-gram: word occurrence depends only on the immediate context • A conditional probability of a word given its context (picture by B. Pellom)

  35. Speech recognition: Estimation of an N-gram model • Bigram example: P(wi | wj) = c(wj, wi) / c(wj), e.g. P(“stew” | “eggplant”) = c(“eggplant stew”) / c(“eggplant”) • This is the maximum likelihood estimate of the probability of wi given wj • c(wj, wi) is the count of wj and wi occurring together, c(wj) the count of wj • Works well only for frequent bigrams
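A minimal maximum-likelihood bigram estimator in Python; the toy corpus and the queried bigrams are illustrative (the real counts come from the Berkeley restaurant corpus used on the next slides):

```python
from collections import Counter

corpus = "i want eggplant stew i want chinese food i want to eat".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_next, w_prev):
    """Maximum-likelihood estimate P(w_next | w_prev) = c(w_prev, w_next) / c(w_prev)."""
    return bigram[(w_prev, w_next)] / unigram[w_prev]

print(p_bigram("want", "i"))         # 3 / 3 = 1.0 in this toy corpus
print(p_bigram("eggplant", "want"))  # 1 / 3 ≈ 0.33
```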

  36.–39. Data from the Berkeley restaurant corpus (Jurafsky & Martin, 2000, “Speech and Language Processing”). The unigram and bigram count tables are shown on the slides; the task is to calculate the missing bigram probabilities, e.g. 1087 / 3437 = .32, 6 / 1215 = .0049, 3 / 3256 = .00092.

  40. Speech recognition: On N-gram sparsity • For Shakespeare’s complete works the vocabulary size (word form types) is 29 066 • The total number of words is 884 647 • This makes the number of possible bigrams 29 066² ≈ 844 million • Under 300 000 are found in the writings • Conclusion: even a learned bigram model would be very sparse

  41. Morphemes as language units • In many languages words are not suitable as basic units for the language models • Inflections, prefixes, suffixes and compound words • Finnish has these issues • The best units carry meaning (e.g. just letters or syllables are not good) • -> Morpheme or “statistical morph”, e.g. tietä+isi+mme+kö+hän ≈ “would we really know” • http://www.cis.hut.fi/projects/speech/

  42. Speech recognition: Lexicon for sub-word units? Better coverage, few or no OOVs, even for new words. Phonemes, syllables, morphemes, or stem+endings? Possible segmentations: un + re + late + d + ness, unrelate + d + ness, unrelated + ness. How to split and rebuild words? (A toy sketch follows below.)
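A toy greedy longest-match segmentation against a hand-made sub-word lexicon, only to illustrate the idea of splitting and rebuilding words; the unit set is made up and this is not the statistical-morph algorithm referred to on the previous slide:

```python
def segment(word, units):
    """Greedy longest-match split of a word into sub-word units; '+' marks the joins."""
    parts = []
    while word:
        match = next((u for u in sorted(units, key=len, reverse=True)
                      if word.startswith(u)), None)
        if match is None:                 # fall back to a single letter
            match = word[0]
        parts.append(match)
        word = word[len(match):]
    return " + ".join(parts)

units = {"un", "re", "late", "lated", "relate", "d", "ness"}
print(segment("unrelatedness", units))    # un + relate + d + ness
```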

  43. More about language models Lecture 26-Feb “Statistical language models” by Mikko Kurimo Exercise 3. N-gram language models

  44. Decoding: join the acoustic and language probabilities; find the most likely sentence hypothesis by pruning and choosing the best. Has a significant effect on recognition speed and accuracy. [Pipeline: speech signal → feature extraction → acoustic modeling + language modeling → decoder → recognized text]

  45. Speech recognition: What is speech recognition? After applying Bayes' rule: find the most likely word sequence given the observations and the models, where the acoustic model and the language model supply the two probability terms.

  46. Speech recognition: Decoding. The task is to find the most probable word sequence, given the models and the acoustic observations. Viterbi search: find the most probable state sequence, an efficient exact search over all state sequences by dynamic programming and recursion. For large-vocabulary continuous speech recognition (LVCSR) the search space must be pruned and optimized (a minimal sketch follows below).
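A minimal Viterbi sketch over a small discrete HMM in Python; the state set, transition, observation, and initial probabilities in the usage example are made-up illustrative values, not from the lecture:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most probable state sequence by dynamic programming.
    log_A[i, j]: log transition prob i -> j; log_B[t, j]: log observation prob of
    frame t under state j; log_pi[j]: log initial prob of state j."""
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)       # best log score ending in each state
    back = np.zeros((T, N), dtype=int)     # backpointers to the best predecessor
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A      # (prev state) x (current state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: 2 states, 3 frames (probabilities are illustrative).
A = np.log([[0.8, 0.2], [0.3, 0.7]])
B = np.log([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
pi = np.log([0.6, 0.4])
print(viterbi(A, B, pi))    # [0, 1, 1]
```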

  47. Speech recognition: N-best lists • Easy to apply long-span LMs for rescoring • The differences between hypotheses are small • Not a very compact representation • Tokens can be decoded into a lattice or word graph structure that shows all good options (picture by B. Pellom)

  48. Speech recognition: Word graph representation (picture by B. Pellom).

  49. Speech recognition: Automatic speech recognition. Contents today: ASR systems today, ASR applications, ASR courses.

  50. Speech recognition: Typical applications: user interface by speech, dictation, speech translation, audio information retrieval.
