
Introduction to Automatic Speech Recognition



  1. Introduction to Automatic Speech Recognition

  2. Outline • Define the problem • What is speech? • Feature Selection • Models • Early methods • Modern statistical models • Current State of ASR • Future Work

  3. The ASR Problem • There is no single ASR problem • The problem depends on many factors • Microphone: Close-mic, throat-mic, microphone array, audio-visual • Sources: band-limited, background noise, reverberation • Speaker: speaker dependent, speaker independent • Language: open/closed vocabulary, vocabulary size, read/spontaneous speech • Output: Transcription, speaker id, keywords

  4. Performance Evaluation • Accuracy • Percentage of tokens correctly recognized • Error Rate • Complement of accuracy (not its inverse); word error rate counts substitutions, deletions, and insertions, so it can exceed 100% • Token Type • Phones • Words • Sentences • Semantics?
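
Word error rate, the standard metric, is computed as a word-level edit distance between the reference and the hypothesis. A minimal Python sketch (the function name and test strings are illustrative, not from the slides):

```python
# Hypothetical helper: word error rate as word-level edit distance.
def wer(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[-1][-1] / len(ref)

print(wer("the cat sat", "the bat sat down"))  # 2 errors / 3 words ≈ 0.67
```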

  5. What is Speech? • Analog signal produced by humans • You can think about the speech signal being decomposed into the source and filter • The source is the vocal folds in voiced speech • The filter is the vocal tract and articulators

  6. Speech Production

  7. Speech Production

  8. Speech Production

  9. Speech Visualization

  10. Speech Visualization

  11. Speech Visualization

  12. Feature Selection • As in any data-driven task, the data must be represented in some format • Cepstral features have been found to perform well • The cepstrum is the spectrum of the log spectrum, loosely the "frequency of the frequencies" • Mel-frequency cepstral coefficients (MFCC) are the most common variety
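
A minimal feature-extraction sketch; librosa, the file name, and the 13-coefficient setup are assumptions of this example, not something the slide prescribes:

```python
# Sketch: MFCC extraction with librosa (our library choice, not the slide's).
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
# 13 coefficients per short analysis frame is a common ASR configuration
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)
```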

  13. Where do we stand? • Defined the multiple problems associated with ASR • Described how speech is produced • Illustrated how speech can be represented in an ASR system • Now that we have the data, how do we recognize the speech?

  14. Radio Rex • First known attempt at speech recognition • A toy dog from 1922 • Worked by responding to acoustic energy around 500 Hz (roughly the vowel in "Rex")

  15. Actual speech recognition systems • Originally thought to be a relatively simple task requiring a few years of concerted effort • 1969, J. R. Pierce's "Whither Speech Recognition?" is published • A DARPA project ran from 1971-1976 in response to the statements in the Pierce article • We can examine a few general systems

  16. Template-Based ASR • Originally only worked for isolated words • Performs best when training and testing conditions match • For each word we want to recognize, we store a template or example based on actual data • Each test utterance is checked against the templates to find the best match • Uses the Dynamic Time Warping (DTW) algorithm

  17. Dynamic Time Warping • Create a similarity matrix for the two utterances • Use dynamic programming to find the lowest cost path
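
A minimal sketch of that procedure (Euclidean frame distances and the toy inputs are assumptions of this example):

```python
# Sketch: DTW as a cost matrix plus dynamic programming, as on the slide.
import numpy as np

def dtw_distance(a, b):
    """a, b: (frames, features) arrays; return the lowest-cost alignment."""
    n, m = len(a), len(b)
    # Pairwise frame distances -- the slide's similarity matrix
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m]

# Toy usage: two "utterances" of different lengths, 13 features per frame
print(dtw_distance(np.random.rand(40, 13), np.random.rand(55, 13)))
```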

  18. Hearsay-II • One of the systems developed during the DARPA program • A blackboard-based system utilizing symbolic problem solvers • Each problem solver was called a knowledge source (KS) • A complex scheduler was used to decide when each KS should be called

  19. Hearsay-II

  20. DARPA Results • The Hearsay-II system performed much better than the two other similar competing systems • However, only one system met the performance goals of the project • The Harpy system was also a CMU-built system • In many ways it was a predecessor to the modern statistical systems

  21. Modern Statistical ASR

  22. Modern Statistical ASR

  23. Acoustic Model • For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes • Two methods are commonly used • Multilayer perceptron (MLP) estimates the posterior probability of a class given the data • Gaussian Mixture Model (GMM) gives the likelihood of the data given a class
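
A hedged sketch of the GMM option using scikit-learn; the library, the toy training data, and the 8-component setup are assumptions of this example (diagonal covariances mirror classic ASR practice):

```python
# Sketch: one GMM per phone class gives p(frame | class), per the slide.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy "training frames": 200 13-dim MFCC-like vectors per class
train = {ph: rng.normal(loc=i, size=(200, 13)) for i, ph in enumerate("abc")}

gmms = {ph: GaussianMixture(n_components=8, covariance_type="diag",
                            random_state=0).fit(X)
        for ph, X in train.items()}

frame = rng.normal(loc=1.0, size=(1, 13))
loglik = {ph: g.score_samples(frame)[0] for ph, g in gmms.items()}
print(max(loglik, key=loglik.get))  # most likely class for this frame
```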

  24. Gaussian Distribution

  25. Pronunciation Model • While the pronunciation model can be very complex, it is typically just a dictionary • The dictionary contains the valid pronunciations for each word • Examples: • Cat: k ae t • Dog: d ao g • Fox: f aa k s
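
Since the basic version really is just a lookup table, here is a toy sketch built on the slide's entries (the helper function is illustrative):

```python
# The slide's dictionary, as plain Python data
lexicon = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
    "fox": ["f", "aa", "k", "s"],
}

def to_phones(words):
    """Expand a word sequence into the phone sequence the HMM will model."""
    return [ph for w in words for ph in lexicon[w]]

print(to_phones(["cat", "dog"]))  # ['k', 'ae', 't', 'd', 'ao', 'g']
```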

  26. Language Model • Now we need some way of representing the likelihood of any given word sequence • Many methods exist, but n-grams are the most common • N-gram models are trained by simply counting the occurrences of words in a training set

  27. N-grams • A unigram is the probability of a word in isolation • A bigram is the probability of a word given the previous word • Higher-order n-grams continue in a similar fashion • A backoff probability is used for any unseen data
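
A toy sketch of bigram training by counting, with a crude unigram backoff (the backoff weight alpha is illustrative; real systems use smoothed estimates such as Katz backoff):

```python
# Sketch: bigram estimation by counting, as the slides describe.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev, alpha=0.4):
    if (prev, w) in bigrams:
        return bigrams[(prev, w)] / unigrams[prev]
    return alpha * unigrams[w] / len(corpus)  # back off to the unigram

print(p_bigram("cat", "the"))  # seen pair: 2/3
print(p_bigram("mat", "cat"))  # unseen pair: backed-off unigram estimate
```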

  28. How do we put it together? • We now have models to represent the three parts of our equation • We need a framework to join these models together • The standard framework used is the Hidden Markov Model (HMM)
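
The "equation" presumably refers to the standard noisy-channel decomposition, which appears only on the figure slides; reconstructed here as an assumption:

```latex
% Assumed reconstruction: pick the word sequence W maximizing the posterior
% given acoustics X; P(X) is constant over W and drops out of the argmax.
\[
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \sum_{Q}
        \underbrace{P(X \mid Q)}_{\text{acoustic model}}\,
        \underbrace{P(Q \mid W)}_{\text{pronunciation model}}\,
        \underbrace{P(W)}_{\text{language model}}
\]
```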

  29. Markov Model • A state model using the Markov property • The Markov property states that the future depends only on the present state • Models the likelihood of transitions between states • Given the model, we can determine the likelihood of any sequence of states
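
A toy Markov chain makes this concrete (the states and probabilities are invented for illustration):

```python
# Sketch: likelihood of a state sequence under a Markov chain.
import numpy as np

states = ["sun", "rain"]
A = np.array([[0.8, 0.2],    # P(next state | current state)
              [0.4, 0.6]])
pi = np.array([0.7, 0.3])    # initial state probabilities

def sequence_likelihood(seq):
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]    # Markov property: depends only on prev state
    return p

print(sequence_likelihood(["sun", "sun", "rain"]))  # 0.7 * 0.8 * 0.2 = 0.112
```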

  30. Hidden Markov Model • Similar to a Markov model except the states are hidden • We now have observations tied to the individual states • We no longer know the exact state sequence given the data • Allows for the modeling of an underlying unobservable process

  31. HMMs for ASR • First we build an HMM for each phone • Next we combine the phone models based on the pronunciation model to create word level models • Finally, the word level models are combined based on the language model • We now have a giant network with potentially thousands or even millions of states

  32. Decoding • Decoding happens in the same way as the previous example • For each time frame we need to maintain two pieces of information • The likelihood of being at any state • The previous state for every state
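
A compact Viterbi sketch that maintains exactly those two quantities per frame (the array and function names are ours, not the slides'):

```python
# Sketch: Viterbi decoding over per-frame state log-likelihoods.
import numpy as np

def viterbi(log_obs, log_A, log_pi):
    """log_obs: (T, S) per-frame state log-likelihoods;
    log_A: (S, S) log transition matrix; log_pi: (S,) log initial probs."""
    T, S = log_obs.shape
    delta = np.empty((T, S))            # best log-likelihood at each state
    back = np.empty((T, S), dtype=int)  # previous state for every state
    delta[0] = log_pi + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # (prev state, current state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    # Trace the backpointers from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: 3 states, 10 frames of random log-likelihoods
T, S = 10, 3
print(viterbi(np.log(np.random.rand(T, S)),
              np.log(np.full((S, S), 1.0 / S)),
              np.log(np.full(S, 1.0 / S))))
```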

  33. State of the Art • What works well • Constrained vocabulary systems • Systems adapted to a given speaker • Systems in anechoic environments without background noise • Systems expecting read speech • What doesn't work • Large unconstrained vocabulary • Noisy environments • Conversational speech

  34. Future Work • Better representations of audio based on humans • Better representation of acoustic elements based on articulatory phonology • Segmental models that do not rely on the simple frame-based approach

  35. Resources • Hidden Markov Model Toolkit (HTK) • http://htk.eng.cam.ac.uk/ • CHIME (a freely available dataset) • http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html • Machine Learning Lectures • http://www.stanford.edu/class/cs229/ • http://www.youtube.com/watch?v=UzxYlbK2c7E
