Speech Recognition

chuong

Presentation Transcript


  1. Speech Recognition

  2. Components of a Recognition System

  3. Frontend • Feature extractor

  4. Frontend • Feature extractor • Mel-Frequency Cepstral Coefficients (MFCCs) Feature vectors
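
The "Mel" in MFCC refers to a perceptual frequency scale. A minimal sketch of the Hz-to-mel mapping and of evenly spaced filter centers on that scale follows. The formula is one common variant, and the frequency range (0 to 8000 Hz) and filter count (10) are assumptions chosen for illustration, not values from the slides.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert frequency in Hz to mels (one widely used variant of the formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(low_hz: float, high_hz: float, num_filters: int) -> list:
    """Return filter center frequencies (Hz) spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# Assumed range and filter count, for illustration only.
centers = mel_center_frequencies(0.0, 8000.0, 10)
for center in centers:
    print(f"{center:8.1f} Hz")
```

The centers are packed densely at low frequencies and spread out at high frequencies, which is the property the mel scale contributes to the feature vectors.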

  5. Hidden Markov Models (HMMs) • Acoustic Observations

  6. Hidden Markov Models (HMMs) • Acoustic Observations • Hidden States

  7. Hidden Markov Models (HMMs) • Acoustic Observations • Hidden States • Acoustic Observation likelihoods

  8. Hidden Markov Models (HMMs) “Six”

  9. Hidden Markov Models (HMMs)

  10. Acoustic Model • Constructs the HMMs of phones • Produces observation likelihoods

  11. Acoustic Model • Constructs the HMMs for units of speech • Produces observation likelihoods • Sampling rate is critical! • WSJ vs. WSJ_8k

  12. Acoustic Model • Constructs the HMMs for units of speech • Produces observation likelihoods • Sampling rate is critical! • WSJ vs. WSJ_8k • TIDIGITS, RM1, AN4, HUB4

  13. Language Model • Word likelihoods

  14. Language Model • ARPA format example:
      1-grams:
      -3.7839 board -0.1552
      -2.5998 bottom -0.3207
      -3.7839 bunch -0.2174
      2-grams:
      -0.7782 as the -0.2717
      -0.4771 at all 0.0000
      -0.7782 at the -0.2915
      3-grams:
      -2.4450 in the lowest
      -0.5211 in the middle
      -2.4450 in the on
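
In ARPA files, the first number on each line is a base-10 log probability and the optional trailing number is a base-10 log backoff weight. The sketch below parses the entries from the slide and converts one of them to a plain probability. The function names are hypothetical, and the parser handles only section headers and entry lines as they appear in the slide, not a complete ARPA file.

```python
ARPA_SNIPPET = """\
1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
"""


def parse_arpa_entries(text: str) -> dict:
    """Return {ngram tuple: (log10 probability, log10 backoff weight or None)}."""
    entries = {}
    order = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("-grams:"):
            # Section header such as "2-grams:" (real files write "\2-grams:").
            order = int(line.lstrip("\\").split("-")[0])
            continue
        fields = line.split()
        log_prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The highest-order n-grams carry no backoff weight.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        entries[words] = (log_prob, backoff)
    return entries


def probability(entries: dict, ngram: tuple) -> float:
    """Convert the stored log10 value of an n-gram into a probability."""
    return 10.0 ** entries[ngram][0]


entries = parse_arpa_entries(ARPA_SNIPPET)
print(probability(entries, ("at", "all")))  # P(all | at), roughly one third
```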

  15. Grammar
      public <basicCmd> = <startPolite> <command> <endPolite>;
      public <startPolite> = (please | kindly | could you) *;
      public <endPolite> = [ please | thanks | thank you ];
      <command> = <action> <object>;
      <action> = (open | close | delete | move);
      <object> = [the | a] (window | file | menu);
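
To see which utterances the grammar above accepts, its rules can be hand-translated into a regular expression. This is only an illustrative sketch: the translation and the helper name `matches_basic_cmd` are assumptions, and a recognizer would compile the grammar into a search graph rather than a regex.

```python
import re

# Each pattern mirrors one rule of the grammar on the slide.
START_POLITE = r"(?:(?:please|kindly|could you)\s+)*"   # ( ... ) * : zero or more
ACTION = r"(?:open|close|delete|move)"
OBJECT = r"(?:(?:the|a)\s+)?(?:window|file|menu)"       # [the | a] is optional
END_POLITE = r"(?:\s+(?:please|thanks|thank you))?"     # [ ... ] is optional

BASIC_CMD = re.compile(rf"{START_POLITE}{ACTION}\s+{OBJECT}{END_POLITE}")


def matches_basic_cmd(utterance: str) -> bool:
    """Return True if the utterance is accepted by the <basicCmd> rule."""
    normalized = " ".join(utterance.lower().split())
    return BASIC_CMD.fullmatch(normalized) is not None


for text in ["please open the window", "could you kindly close file thanks", "open"]:
    print(text, "->", matches_basic_cmd(text))
```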

  16. Dictionary • Maps words to phoneme sequences

  17. Dictionary • Example from cmudict.06d
      POULTICE   P OW L T AH S
      POULTICES  P OW L T AH S IH Z
      POULTON    P AW L T AH N
      POULTRY    P OW L T R IY
      POUNCE     P AW N S
      POUNCED    P AW N S T
      POUNCEY    P AW N S IY
      POUNCING   P AW N S IH NG
      POUNCY     P UW NG K IY
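
A minimal sketch of loading entries in this layout (word, then whitespace-separated phones) into a lookup table. The function name is hypothetical, and the sketch ignores features of real dictionary files such as comment lines and alternate-pronunciation markers.

```python
DICT_SNIPPET = """\
POULTICE P OW L T AH S
POULTICES P OW L T AH S IH Z
POULTON P AW L T AH N
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T
POUNCEY P AW N S IY
POUNCING P AW N S IH NG
POUNCY P UW NG K IY
"""


def load_pronunciations(text: str) -> dict:
    """Map each word to a list of pronunciations, each a list of phones."""
    lexicon = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        word, phones = fields[0], fields[1:]
        lexicon.setdefault(word, []).append(phones)
    return lexicon


lexicon = load_pronunciations(DICT_SNIPPET)
print(lexicon["POUNCING"])
```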

  18. Linguist • Constructs the search graph of HMMs from: • Acoustic model • Statistical Language model ~or~ • Grammar • Dictionary

  19-20. Search Graph

  21. Search Graph • Can be statically or dynamically constructed

  22. Linguist Types • FlatLinguist

  23. Linguist Types • FlatLinguist • DynamicFlatLinguist

  24. Linguist Types • FlatLinguist • DynamicFlatLinguist • LexTreeLinguist
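
As the name suggests, a LexTreeLinguist organizes the vocabulary as a lexical (prefix) tree over phones, so words that begin with the same phones share graph nodes. The sketch below shows only that general idea with four dictionary entries from slide 17; the data layout and function names are assumptions and do not reflect Sphinx4's actual classes.

```python
LEXICON = {
    "POUNCE": ["P", "AW", "N", "S"],
    "POUNCED": ["P", "AW", "N", "S", "T"],
    "POUNCEY": ["P", "AW", "N", "S", "IY"],
    "POUNCING": ["P", "AW", "N", "S", "IH", "NG"],
}


def build_lex_tree(lexicon: dict) -> dict:
    """Build a prefix tree keyed by phone; words are stored where they end."""
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(phone, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def count_nodes(node: dict) -> int:
    """Count phone nodes below the given node (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


tree = build_lex_tree(LEXICON)
flat_size = sum(len(phones) for phones in LEXICON.values())
print("flat phone nodes:", flat_size)        # one chain per word
print("tree phone nodes:", count_nodes(tree))  # shared prefixes merged
```

For these four words the flat layout needs 20 phone nodes while the tree needs 8, because the prefix P AW N S is stored once.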

  25. Decoder • Maps feature vectors to search graph

  26. Search Manager • Searches the graph for the “best fit”

  27. Search Manager • Searches the graph for the “best fit” • P(sequence of feature vectors | word/phone) • a.k.a. P(O|W): “how likely is the input to have been generated by the word”
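
P(O|W) can be computed for a word's HMM with the forward algorithm, which sums over all hidden state sequences. The sketch below uses two toy two-state models with discrete observation symbols; the state names, symbols, and probabilities are invented for illustration, and real acoustic models score continuous feature vectors instead.

```python
def forward_likelihood(observations, states, start_prob, trans_prob, emit_prob):
    """Return P(O | model) by summing over all hidden state sequences."""
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = {s: start_prob[s] * emit_prob[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_prob[s][obs]
            * sum(alpha[p] * trans_prob[p].get(s, 0.0) for p in states)
            for s in states
        }
    return sum(alpha.values())


# Toy left-to-right topology shared by both hypothetical word models.
STATES = ["A", "B"]
START = {"A": 1.0, "B": 0.0}
TRANS = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}

# Hypothetical emission tables for two competing words.
EMIT_WORD_1 = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
EMIT_WORD_2 = {"A": {"x": 0.1, "y": 0.9}, "B": {"x": 0.8, "y": 0.2}}

observed = ["x", "y"]
p_word_1 = forward_likelihood(observed, STATES, START, TRANS, EMIT_WORD_1)
p_word_2 = forward_likelihood(observed, STATES, START, TRANS, EMIT_WORD_2)
print("P(O | word_1) =", p_word_1)
print("P(O | word_2) =", p_word_2)
```

Here the input is more likely to have been generated by word_1, so a search based on P(O|W) alone would prefer it.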

  28. f ay ay ay ay v v v v v
      f f ay ay ay ay v v v v
      f f f ay ay ay ay v v v
      f f f f ay ay ay ay v v
      f f f f ay ay ay ay ay v
      f f f f f ay ay ay ay v
      f f f f f f ay ay ay v
      …
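
The rows above are some of the ways the phones f, ay, v can be stretched over ten frames. A small sketch that enumerates every such alignment, assuming each phone covers at least one frame and the phones stay in order (the function name is hypothetical):

```python
from itertools import combinations


def enumerate_alignments(phones, num_frames):
    """List every way to assign ordered phones to frames, one or more frames each."""
    alignments = []
    # Choose the frame indices at which the second, third, ... phone begins.
    for boundaries in combinations(range(1, num_frames), len(phones) - 1):
        starts = (0,) + boundaries + (num_frames,)
        frames = []
        for phone, begin, end in zip(phones, starts, starts[1:]):
            frames.extend([phone] * (end - begin))
        alignments.append(frames)
    return alignments


alignments = enumerate_alignments(["f", "ay", "v"], 10)
print(len(alignments))  # 36 alignments for 3 phones over 10 frames
print(" ".join(alignments[0]))
```

Even this tiny case has 36 candidates, which is why the search does not score each alignment separately.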

  29. Viterbi Algorithm [Figure: time axis with observations O1, O2, O3]
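
A minimal sketch of the Viterbi algorithm over three observations, in the log domain. The three-state model for f, ay, v, the discrete observation symbols, and all probabilities are invented for illustration; they are not taken from the slides.

```python
import math


def viterbi(observations, states, start_prob, trans_prob, emit_prob):
    """Return (best state path, its log probability) for the observations."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[t][s]: best log probability of any path ending in state s at time t.
    scores = [{s: log(start_prob[s]) + log(emit_prob[s][observations[0]])
               for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: prev[p] + log(trans_prob[p].get(s, 0.0)))
            column[s] = (prev[best_prev]
                         + log(trans_prob[best_prev].get(s, 0.0))
                         + log(emit_prob[s][obs]))
            pointers[s] = best_prev
        scores.append(column)
        backpointers.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[-1][last]


STATES = ["f", "ay", "v"]
START = {"f": 1.0, "ay": 0.0, "v": 0.0}
TRANS = {"f": {"f": 0.5, "ay": 0.5}, "ay": {"ay": 0.5, "v": 0.5}, "v": {"v": 1.0}}
EMIT = {
    "f": {"o_f": 0.8, "o_ay": 0.1, "o_v": 0.1},
    "ay": {"o_f": 0.1, "o_ay": 0.8, "o_v": 0.1},
    "v": {"o_f": 0.1, "o_ay": 0.1, "o_v": 0.8},
}

best_path, best_log_prob = viterbi(["o_f", "o_ay", "o_v"], STATES, START, TRANS, EMIT)
print(best_path, math.exp(best_log_prob))
```

Unlike a sum over all alignments, Viterbi keeps only the best-scoring predecessor for each state at each time step, so the work grows linearly with the number of observations.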

  30. Pruner • Uses algorithms to weed out low scoring paths during decoding
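
The slide does not name the pruning algorithms. One common approach is beam pruning, sketched below under that assumption: drop any path whose log score falls too far below the current best, then cap the number of survivors. The function name, parameter names, and scores are hypothetical.

```python
def prune(hypotheses, relative_beam, max_active):
    """Keep paths within relative_beam of the best score, at most max_active of them.

    hypotheses is a list of (label, log_score) pairs; higher scores are better.
    """
    if not hypotheses:
        return []
    best = max(score for _, score in hypotheses)
    survivors = [(label, score) for label, score in hypotheses
                 if score >= best - relative_beam]
    survivors.sort(key=lambda item: item[1], reverse=True)
    return survivors[:max_active]


# Invented log scores for four competing partial paths.
active = [("five", -10.0), ("fine", -12.5), ("nine", -30.0), ("fire", -11.0)]
kept = prune(active, relative_beam=5.0, max_active=2)
print(kept)
```

A wider beam keeps more paths and lowers the risk of discarding the eventual winner, at the cost of more computation per frame.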

  31. Result • Words!

  32. Word Error Rate • Most common metric • Measures the number of modifications needed to transform the recognized sentence into the reference sentence

  33. Word Error Rate • Reference: “This is a reference sentence.” • Result: “This is neuroscience.”

  34. Word Error Rate • Reference: “This is a reference sentence.” • Result: “This is neuroscience.” • Requires 2 deletions, 1 substitution

  35. Word Error Rate • Reference: “This is a reference sentence.” • Result: “This is neuroscience.”

  36. Word Error Rate • Reference: “This is a reference sentence.” • Result: “This is neuroscience.” • D S D
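
The example above can be checked with a word-level edit distance. In the standard definition, the count of substitutions, deletions, and insertions is divided by the number of words in the reference. The sketch below assumes lowercasing and punctuation stripping before comparison; the function names are hypothetical.

```python
import string


def tokenize(sentence):
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


def count_word_edits(reference, hypothesis):
    """Return the minimum number of word substitutions, deletions, and insertions."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # dist[i][j]: minimum edits needed to turn hyp[:j] into ref[:i].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[-1][-1]


def word_error_rate(reference, hypothesis):
    """Return edits divided by the number of reference words."""
    num_ref_words = len(tokenize(reference))
    if num_ref_words == 0:
        raise ValueError("reference must contain at least one word")
    return count_word_edits(reference, hypothesis) / num_ref_words


edits = count_word_edits("This is a reference sentence.", "This is neuroscience.")
wer = word_error_rate("This is a reference sentence.", "This is neuroscience.")
print(edits, wer)  # 3 edits over 5 reference words
```

The three edits match the D S D alignment on the slide, giving a word error rate of 3/5 = 60%.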

  37-47. Sphinx4 Implementation

  48. Where Speech Recognition Works • Limited vocabulary, multiple speakers

  49. Where Speech Recognition Works • Limited vocabulary, multiple speakers • Extensive vocabulary, single speaker

  50. Where Speech Recognition Works * If the audio input is noisy, multiply the expected error rate by 2.
