
Automatic Speech Recognition (ASR)




  1. Automatic Speech Recognition (ASR) HISTORY, ARCHITECTURE, COMMON APPLICATIONS AND THE MARKETPLACE Omar Khalil Gómez – Università di Pisa

  2. What is ASR? • Spoken language understanding is a difficult task • “I will become a pirate” vs “I will become a pilot” • ASR “addresses” this task computationally • Mapping from an acoustic signal to a string of words • Automatic speech understanding (ASU) is the goal • Understand the sentence rather than just knowing the words • Other related fields • Speech synthesis, text-to-speech

  3. ASR then and… tomorrow? Origins and future • Why should I need ASR? • First electric implements (1800s) • Can we emulate human behaviour? • Strong AI • Commercial applications in telecommunications • Defensive purposes

  4. History of Automatic Speech Recognition • From speech production to the acoustic-language model

  5. History of ASR: From Speech Production Models to Spectral Representations • First attempts to mimic human speech communication • The interest was in creating a speaking machine • In 1773 Kratzenstein succeeded in producing vowel sounds with tubes and pipes • In 1791 Kempelen in Vienna constructed an “Acoustic-Mechanical Speech Machine” • In the mid-1800s Charles Wheatstone built a version of von Kempelen's speaking machine • In the first half of the 20th century, workers at Bell Laboratories found relationships between a given speech spectrum and its sound characteristics • The distribution of power of a speech sound across frequency is the main concept used to model speech • In the 1930s Homer Dudley (Bell Labs) developed a speech synthesizer called the VODER based on that research • Speech pioneers like Harvey Fletcher and Homer Dudley firmly established the importance of the signal spectrum for reliable identification of the phonetic nature of a speech sound

  6. History of ASR: Early Automatic Speech Recognizers • Early attempts to design systems for automatic speech recognition were mostly guided by the theory of acoustic-phonetics • Analyze phonetic elements of speech: how are they acoustically realized? • Relation between place/manner of articulation and the digitized speech • First advances: • Good results in digit recognition (1952) • Recognition of continuous speech with vowels and numbers (isolated word detection) (1960s) • First uses of statistical syntax at the phoneme level (1960s) • But these models didn't take into account the temporal non-uniformity of speech events • In the 1970s dynamic programming (Viterbi) arrived

  7. History of ASR: Technology Drivers since the 1970s (I) • Tom Martin developed the first ASR system, used in a few applications: • FedEx • DARPA • Harpy: recognized speech using a vocabulary of 1,011 words • Phone template matching • The speech recognition language is represented by a connected network • Syntactic production rules • Word boundary rules • Hearsay • Generated hypotheses given information provided by parallel sources • HWIM • Phonological rules -> phoneme recognition accuracy

  8. History of ASR: Technology Drivers since the 1970s (II) • IBM's Tangora • Speaker-dependent system for a voice-activated typewriter • Structure of the language model represented by statistical and syntactic rules: the n-gram • Claude Shannon's word games strongly validated the power of the n-gram • AT&T Bell Labs • Speaker-independent applications for automated telecommunication services • Strong work on acoustic variability and the acoustic model • This led to the creation of speech clustering algorithms for sound reference patterns • Keyword spotting also used for training • These two approaches had a profound influence on the evolution of human-speech communications • Then the quick development of statistical methods in the 1980s caused a certain degree of convergence in system design

  9. History of ASR: Technology Directions in the 1980s and 1990s • Speech recognition shifted in methodology • From the template-based approach • To a rigorous statistical modeling framework (HMM) • The application of the HMM became the preferred method in the mid-1980s • Other systems like ANNs were used • Not good because of the temporal variation of speech • In the 1990s the problem was transformed into an optimization problem • Kernel-based methods such as support vector machines • Real applications emerged in the 1990s • Individual research programs all over the world • Open-source software, APIs • …

  10. History of ASR: Overview

  11. The variable dimensions of ASR

  12. Large-vocabulary continuous speech recognition (LVCSR)

  13. Architecture of an ASR system • Designing the acoustic-language model

  14. Architecture of an ASR system: The Noisy Channel model • Noisy channel metaphor • Know how the channel distorts the source • Then use this knowledge to compute the most likely string over the language which best fits the input • Best fits the input?? -> A metric for similarity • Over the whole language?? -> Efficient search

  15. Architecture of an ASR system • Pick the sentence that best matches the noisy input • Bayesian inference and HMMs • Each state of the HMM is a type of phone • The connections impose constraints given the lexicon • Compute the probabilities of transitions in time • The search for that sentence must be efficient • Viterbi decoding algorithm for HMMs

  16. Architecture of an ASR system: Bayesian inference • What is the most likely sentence out of all sentences in the language L given some acoustic input O? • Acoustic input as a sequence of individual “symbols” or “observations” • Sentence as a string of words • Bayesian inference to address this problem: Ŵ = argmax over W in L of P(O|W) · P(W) • Likelihood P(O|W): computed by the acoustic model • Prior probability P(W): computed by the language model
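The argmax over candidate sentences can be sketched in a few lines. This is a toy illustration with made-up log-probabilities (not a trained model): the acoustic model slightly prefers "pirate", but the language-model prior tips the decision the other way.

```python
# Toy Bayesian decoder: pick the sentence W maximizing
# log P(O|W) + log P(W), i.e. acoustic likelihood plus LM prior.
# All scores below are illustrative, not from real models.
def decode(candidates):
    """candidates: list of (sentence, log_acoustic, log_prior)."""
    return max(candidates, key=lambda c: c[1] + c[2])[0]

candidates = [
    ("i will become a pirate", -42.0, -9.1),  # acoustically close
    ("i will become a pilot",  -42.3, -7.5),  # more probable a priori
]
print(decode(candidates))  # -> i will become a pilot
```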

  17. Architecture of an ASR system

  18. Architecture of an ASR system: The HMM • Feature extraction • The acoustic waveform is sampled in frames • Each time window is represented with a vector of features • Gaussian model to compute p(o|q) • q: a state of the HMM • o: observation, or vector of features • This produces a vector of probabilities for each frame • Each component will be the probability that a given phone or subphone corresponds to these features • The HMM -> phonetic dictionary or lexicon • N-gram representation • Use the Viterbi algorithm

  19. The acoustic model • Feature extraction and likelihood calculation

  20. The acoustic model • Likelihood -> Acoustic Model (AM) • Extract features of the sounds • The sound is processed and we get a nice representation -> MFCCs • Gaussian mixture model to compute the likelihood of the representation for a phone (word) • Compute p(o|q): how well a phone or subphone corresponds to a state q in our HMM

  21. Extracting features • Transform the input waveform into a sequence of acoustic feature vectors (MFCC) • Each vector represents the information in a small time window of the signal • Mel frequency cepstral coefficients are common in speech recognition • Based on the idea of the cepstrum • The first step is to convert the analog representation into a digital signal • Sampling: measure the amplitude at a particular time (sampling rate) • Quantization: represent and store the samples • We are then ready to extract MFCC features

  22. Extracting features: Pre-emphasis • Input: waveform • Output: the waveform with the high frequencies boosted • Reason: high frequencies carry a lot of information
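Pre-emphasis is usually a first-order high-pass filter. A minimal sketch, assuming the common coefficient alpha = 0.97 (the slide does not give a value):

```python
# Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high
# frequencies. alpha = 0.97 is a typical choice, assumed here.
def pre_emphasis(signal, alpha=0.97):
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

samples = [0.0, 0.5, 1.0, 1.0, 0.5]
print(pre_emphasis(samples))
```

Note how slowly varying stretches (e.g. the two 1.0 samples) are flattened toward zero while rapid changes survive.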

  23. Extracting features: Windowing • Input: waveform with boosted high frequencies • Output: framed waveform • Reason: • The waveform changes very quickly • Properties are not constant through time
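Framing plus a tapering window can be sketched as follows. The frame size and shift (200 and 80 samples, i.e. 25 ms and 10 ms at 8 kHz) and the Hamming window are typical assumptions, not values from the slides:

```python
import math

# Slice the signal into overlapping frames, applying a Hamming
# window to each so frame edges taper smoothly to small values.
def hamming(n, size):
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1))

def frames(signal, size=200, shift=80):
    out = []
    for start in range(0, len(signal) - size + 1, shift):
        out.append([signal[start + n] * hamming(n, size)
                    for n in range(size)])
    return out

signal = [1.0] * 400
print(len(frames(signal)))  # -> 3 overlapping frames
```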

  24. Extracting features: Discrete Fourier transform • Input: windowed signal • Output: for each of N discrete frequency bands we get the sound pressure • Reason: get new information • Amount of energy related to the frequency • Vowels
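The energy-per-band idea can be shown with the textbook O(N²) DFT formula (real systems use an FFT). The four-sample tone below is an illustrative input:

```python
import cmath

# DFT of one windowed frame: X[k] = sum_n x[n] * e^(-2*pi*i*k*n/N).
# The squared magnitude |X[k]|^2 is the energy in frequency band k.
def dft(frame):
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

frame = [1.0, 0.0, -1.0, 0.0]  # a tone at 1/4 of the sampling rate
energy = [abs(x) ** 2 for x in dft(frame)]
print(energy)  # energy peaks in bins 1 and 3 (the tone and its mirror)
```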

  25. Extracting features: Mel filterbank and log • Input: information about the amount of energy for each frequency • Output: log(warped frequencies) • Warping with the mel scale • The log makes the data easier to interpret • Reason: the interesting frequency bands lie in an interval
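The mel warping itself is a simple formula; the conventional hz-to-mel mapping below is a standard choice, assumed rather than taken from the slides:

```python
import math

# Mel-scale warping: mel = 2595 * log10(1 + f/700).
# The scale is roughly linear below 1 kHz and logarithmic above,
# matching how human hearing resolves frequency.
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

for f in (100, 1000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f)), "mel")
```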

  26. Extracting features: Inverse Discrete Fourier Transform • Input: information about the amount of energy per frequency in the interesting intervals, or spectrum • Output: cepstrum, the spectrum of the log of the spectrum (first 12 cepstral values) • Reason: more information • Useful processing advantages • Improves phone recognition • Separates the vocal tract filter from the pitch -> consonants

  27. Extracting features: Deltas and energy • Input: cepstral form • Output: deltas for each 12-value cepstral vector in a window, and the energy of the window • Reason: • Energy is useful to detect stops, and then syllables and phones • Delta -> velocity: represents changes between windows (energy) • Double delta -> acceleration: change between frames in the corresponding delta feature

  28. Extracting features: MFCC • 12 cepstral coefficients • 12 delta cepstral coefficients • 12 double delta cepstral coefficients • 1 energy coefficient • 1 delta energy coefficient • 1 double delta energy coefficient • = 39 MFCC features
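Assembling the 39-dimensional vector from the pieces above can be sketched as follows. The two-point difference used for the deltas is a simplification; real systems use a regression over several frames:

```python
# Per-frame 39-dim MFCC vector: 13 base values (12 cepstra + 1
# energy), plus their deltas and double deltas.
def delta(frames):
    # simple frame-to-frame difference (first frame repeated)
    return [[c - p for p, c in zip(prev, cur)]
            for prev, cur in zip([frames[0]] + frames[:-1], frames)]

def mfcc_39(base):
    """base: list of 13-dim frames (12 cepstra + energy)."""
    d = delta(base)       # velocity
    dd = delta(d)         # acceleration
    return [b + x + y for b, x, y in zip(base, d, dd)]

base = [[float(i)] * 13 for i in range(3)]  # toy frames
print(len(mfcc_39(base)[0]))  # -> 39
```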

  29. Acoustic likelihoods: different approaches • We have to compute the likelihood of these feature vectors given an HMM state • Given q and o, get p(o|q) • For part-of-speech tagging each observation is a discrete symbol • For speech recognition we deal with vectors -> discretize?? • Same problem when decoding and training • We need to get the matrix B and then change the training algorithm • Different approaches • Vector quantization • Gaussian PDFs • ANN, SVM, kernel methods

  30. Acoustic likelihoods: Vector quantization • Useful pedagogical step • Not used in reality • Clusterize the feature space • Get prototype vectors • Compute distances with a metric • Euclidean • Mahalanobis • Train with an algorithm • kNN • k-means • Get the most probable symbol given an observation, b_j(o_t)
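The quantization step itself is just nearest-prototype lookup. A minimal sketch with a toy hand-written codebook (real prototypes would come from k-means) and Euclidean distance:

```python
# Vector quantization: map each observation vector to the index
# of its nearest codebook (prototype) vector.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def quantize(obs, codebook):
    return min(range(len(codebook)),
               key=lambda i: euclidean(obs, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # toy prototypes
print(quantize([0.9, 1.2], codebook))  # -> 1 (closest prototype)
```

With the observation discretized to a symbol, the familiar discrete B matrix can be used directly.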

  31. Acoustic likelihoods: Gaussian PDFs • Speech is simply not a categorical, symbolic process • We must compute the observation probabilities directly on the feature vectors • Probability density function over the feature space • Univariate Gaussians • Simplest use of a Gaussian probability estimator • Probability: area under the curve = 1 • One Gaussian tells us how probable it is that the value of a feature was generated by an HMM state

  32. Acoustic likelihoods: Gaussian PDFs • Multivariate Gaussians • From a single cepstral feature to a 39-dimension vector -> new dimensions • Use a Gaussian for each feature, assuming the distribution • Gaussian mixture models • A particular cepstral value might have a non-normal distribution • Weighted mixture of multivariate Gaussians • Trained with the Baum-Welch algorithm
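The mixture likelihood can be sketched with diagonal covariances (a common simplification); all weights, means, and variances below are toy values, not trained parameters:

```python
import math

# GMM observation likelihood b_j(o): a weighted sum of diagonal
# multivariate Gaussians (per-dimension product of 1-D Gaussians).
def gauss(x, mean, var):
    return (math.exp(-(x - mean) ** 2 / (2 * var))
            / math.sqrt(2 * math.pi * var))

def gmm_likelihood(o, weights, means, variances):
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        comp = 1.0
        for x, m, v in zip(o, mu, var):
            comp *= gauss(x, m, v)
        total += w * comp
    return total

p = gmm_likelihood([0.1, -0.2],
                   weights=[0.6, 0.4],
                   means=[[0.0, 0.0], [1.0, 1.0]],
                   variances=[[1.0, 1.0], [1.0, 1.0]])
print(p)
```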

  33. Acoustic likelihoods: Probabilities and distance functions • Log probability is much easier to work with than probability • Multiplying many probabilities results in very small numbers -> underflow • Working with the log of a number is much easier • Computational speed, because we are adding instead of multiplying
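The underflow problem is easy to demonstrate: a product of many small probabilities collapses to exactly 0.0 in floating point, while the equivalent sum of log-probabilities stays well-behaved.

```python
import math

probs = [1e-5] * 100  # 100 frames, each with probability 1e-5

product = 1.0
for p in probs:
    product *= p
print(product)  # -> 0.0 (1e-500 underflows a 64-bit float)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # -> about -1151.3, still perfectly usable
```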

  34. The language model • N-gram and lexicon

  35. The language model • Prior -> The Language Model (LM) • How likely a string of words is to be a real English sentence • N-gram approach • We can see the HMM as a net given the lexicon… • List of words, pronunciation dictionaries • According to basic phones • Phonetic resources on the web for several languages • Useful in other fields • More than pronunciation • Stress level • Morphological information • Part-of-speech information • … • …and the N-gram

  36. The language model: Pronunciation lexicon

  37. The language model: Pronunciation lexicon

  38. The language model: HMM and Lexicon • Sequences of HMM states concatenated • Left-to-right HMM • Simple ASR tasks can use a direct representation • For LVCSR we need more granularity because of the changes between frames • A phone can last up to a second; at a 10 ms frame shift that is up to 100 frames per phone (each one different)

  39. The language model: The N-gram • Assign a probability to a sentence • 3-grams or 4-grams • Depending on the application • Depending on the vocabulary size • Working with text we want to know the probability of a word given some history • Working with speech we want to know the probability of a phone given some history • Chain-rule probability • The length of the history is N • Features also valid for speech recognition
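The chain-rule idea can be sketched with a bigram (2-gram) model trained by counting on a tiny toy corpus, which echoes the pirate/pilot example from the start of the deck:

```python
from collections import Counter

# Bigram chain rule: P(w1..wn) ~= product of P(w_i | w_{i-1}),
# with P(w_i | w_{i-1}) estimated from raw counts. The corpus
# below is a toy example, not real training data.
corpus = ("i will become a pilot . i will become a pirate . "
          "i will become a pilot .").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(sentence):
    words = sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(bigram_prob("i will become a pilot"))   # -> 0.666...
print(bigram_prob("i will become a pirate"))  # -> 0.333...
```

"pilot" appears twice after "a" in the corpus and "pirate" only once, so the model prefers the first sentence.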

  40. Decoding and searching • Putting it all together

  41. Decoding a sentence: joint probabilities • We have to combine all the probability estimators to solve the decoding problem • Produce the most probable string of words • Modifications are needed in our Bayesian inference • Incorrect independence assumptions • We are underestimating the probability of each subphone • Reweight the probabilities by adding a language model scaling factor • Reweighting requires one more change: • P(W) has a side effect as a penalty for inserting words • A sentence with N words takes the word-insertion penalty N times in its language model probability • The more words in the sentence, the more times this penalty is taken and the less probable the sentence will be

  42. Decoding a sentence: Again the HMM • Difficult tasks • HMM • Set of states • Matrix A of transition probabilities • A set B of observation likelihoods • Suppose A and B are trained • Viterbi algorithm to search efficiently

  43. Searching: The Viterbi algorithm • Possible combinations for the word “five” • Viterbi trellis • Represents the probability that the HMM is in state j after seeing the first t observations and passing through the most likely state sequence • One cell per state

  44. Searching: The Viterbi algorithm
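The trellis computation can be sketched over a toy left-to-right HMM for the word "five" (states f, ay, v). All transition and observation numbers are illustrative, not from a trained model:

```python
# Viterbi over a toy HMM. trellis[t][s] holds the probability of
# the best state sequence ending in s after t observations;
# backpointers recover that sequence at the end.
def viterbi(states, start, trans, obs_lik):
    # obs_lik[t][s] = P(o_t | state s), from the acoustic model
    v = [{s: start[s] * obs_lik[0][s] for s in states}]
    back = []
    for t in range(1, len(obs_lik)):
        col, ptr = {}, {}
        for s in states:
            prev = max(v[-1], key=lambda r: v[-1][r] * trans[r][s])
            col[s] = v[-1][prev] * trans[prev][s] * obs_lik[t][s]
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    path = [max(v[-1], key=v[-1].get)]   # best final state
    for ptr in reversed(back):           # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["f", "ay", "v"]
start = {"f": 1.0, "ay": 0.0, "v": 0.0}
trans = {"f": {"f": 0.5, "ay": 0.5, "v": 0.0},
         "ay": {"f": 0.0, "ay": 0.5, "v": 0.5},
         "v": {"f": 0.0, "ay": 0.0, "v": 1.0}}
obs_lik = [{"f": 0.9, "ay": 0.1, "v": 0.1},
           {"f": 0.2, "ay": 0.8, "v": 0.1},
           {"f": 0.1, "ay": 0.2, "v": 0.9}]
print(viterbi(states, start, trans, obs_lik))  # -> ['f', 'ay', 'v']
```

Note the left-to-right structure: zero transition probabilities forbid moving backwards through the word.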

  45. Training • Embedded and Viterbi training

  46. Training: Embedded Training • How is an HMM-based speech recognizer trained? • Simplest: hand-labeled isolated words • Train A and B separately • Phones hand-segmented • Just train by counting in the training set • Too expensive and slow • A better way -> train each phone HMM embedded in an entire sentence • Anyway, hand phone segmentation does play some role • Transcription and wave file to train on • Baum-Welch algorithm

  47. Evaluation • Word error rate and McNemar test

  48. Evaluation: Error Rate • Standard metric: word error rate • Difference between the predicted string and the expected string -> minimum edit distance for WER • Sentence error rate • Minimum edit distance divided by the length of the reference • A free scoring script is available from the National Institute of Standards and Technology • Confusion matrices • Useful for testing whether a change to a system helps
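WER as edit distance over the reference length can be sketched directly, again reusing the deck's pirate/pilot pair:

```python
# WER = (substitutions + deletions + insertions) / reference length,
# computed via word-level minimum edit distance.
def edit_distance(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1]

def wer(ref, hyp):
    ref, hyp = ref.split(), hyp.split()
    return edit_distance(ref, hyp) / len(ref)

print(wer("i will become a pilot",
          "i will become a pirate"))  # -> 0.2 (1 error / 5 words)
```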

  49. Evaluation: McNemar Test • MAPSSWE or McNemar test • Looks at the differences between the number of word errors of the two systems • Averaged across a number of segments

  50. Applications of ASR • Principal commercialized applications
