Speech Recognition

Components of a Recognition System
Frontend

  • Feature extractor

  • Mel-Frequency Cepstral Coefficients (MFCCs) (see the short illustration below)

  • Output: feature vectors
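A hedged illustration of what the frontend produces, using the third-party librosa library rather than Sphinx4's own Java frontend (the file name and parameter values are illustrative): each short analysis window of audio is reduced to a vector of MFCCs, and the sequence of those vectors is what the rest of the recognizer consumes.

    # Illustrative only: Sphinx4 computes MFCCs internally in Java.
    import librosa

    signal, sr = librosa.load("six.wav", sr=16000)             # hypothetical input file
    feats = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                 n_fft=int(0.025 * sr),        # ~25 ms analysis window
                                 hop_length=int(0.010 * sr))   # ~10 ms frame shift
    print(feats.shape)  # (13, n_frames): one 13-dimensional feature vector per frame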


Hidden Markov Models (HMMs)

  • Acoustic Observations

  • Hidden States

  • Acoustic Observation likelihoods

  • Example word: “Six” (a toy sketch follows)
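A minimal, plain-Python sketch (not Sphinx4 code) of the pieces an HMM ties together for the word “six” (phones S IH K S, per cmudict); every probability below is invented purely for illustration:

    # Toy HMM for "six": hidden states, transition probabilities, and a hook for
    # the acoustic observation likelihoods that the acoustic model would supply.
    hmm_six = {
        # one hidden state per phone occurrence (numbered so the two S's stay distinct)
        "states": ["S1", "IH", "K", "S2"],
        # P(next state | current state); self-loops let a phone span several frames
        "transitions": {
            ("S1", "S1"): 0.6, ("S1", "IH"): 0.4,
            ("IH", "IH"): 0.7, ("IH", "K"): 0.3,
            ("K", "K"): 0.5,  ("K", "S2"): 0.5,
            ("S2", "S2"): 0.6, ("S2", "exit"): 0.4,
        },
    }

    def observation_likelihood(state, feature_vector):
        """P(acoustic observation | hidden state): in a real system this comes from
        the acoustic model (e.g. a Gaussian mixture score); placeholder here."""
        raise NotImplementedError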



Acoustic Model

  • Constructs the HMMs for units of speech (phones)

  • Produces observation likelihoods

  • Sampling rate is critical! (a quick check is sketched below)

    • WSJ (16 kHz) vs. WSJ_8k (8 kHz)

  • Other models: TIDIGITS, RM1, AN4, HUB4
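A minimal sketch of why the sampling-rate bullet matters, assuming a WAV input file (the file name is hypothetical); WSJ-style models are typically trained on 16 kHz audio while the _8k variants expect 8 kHz, and a mismatch silently degrades accuracy:

    import wave

    # Check the audio's sampling rate against what the acoustic model expects
    # (use 8000 for an 8 kHz model such as WSJ_8k).
    with wave.open("input.wav", "rb") as w:
        rate = w.getframerate()

    expected = 16000
    if rate != expected:
        raise ValueError(f"acoustic model expects {expected} Hz audio, got {rate} Hz")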


Language Model

  • Word likelihoods

  • ARPA format example:

    \1-grams:
    -3.7839 board -0.1552
    -2.5998 bottom -0.3207
    -3.7839 bunch -0.2174

    \2-grams:
    -0.7782 as the -0.2717
    -0.4771 at all 0.0000
    -0.7782 at the -0.2915

    \3-grams:
    -2.4450 in the lowest
    -0.5211 in the middle
    -2.4450 in the on
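Read literally, each line holds a base-10 log probability, the n-gram itself, and a backoff weight (the trigram lines carry no trailing weight because there is no higher order to back off from). A quick sanity check of the bigram line “-0.7782 at the -0.2915”:

    p_the_given_at = 10 ** -0.7782      # P(the | at) ≈ 0.167
    backoff_at_the = 10 ** -0.2915      # weight used when a needed "at the ..." trigram is absent
    print(round(p_the_given_at, 3))     # 0.167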


Grammar

public <basicCmd> = <startPolite> <command> <endPolite>;

public <startPolite> = (please | kindly | could you ) *;

public <endPolite> = [ please | thanks | thank you ];

<command> = <action> <object>;

<action> = (open | close | delete | move);

<object> = [the | a] (window | file | menu);
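For orientation, these rules use JSGF-style (Java Speech Grammar Format) notation: parentheses group alternatives, * means “repeat zero or more times,” and square brackets mark optional items. Under this grammar, utterances such as “please open the window thank you” or simply “delete a file” are accepted.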


Dictionary

  • Maps words to phoneme sequences

  • Example from cmudict.06d:

    POULTICE P OW L T AH S
    POULTICES P OW L T AH S IH Z
    POULTON P AW L T AH N
    POULTRY P OW L T R IY
    POUNCE P AW N S
    POUNCED P AW N S T
    POUNCEY P AW N S IY
    POUNCING P AW N S IH NG
    POUNCY P UW NG K IY
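A small, hedged sketch of how such a dictionary can be consumed: load the file into a word-to-phoneme-sequence map (the parsing assumes the usual one-entry-per-line “WORD PH1 PH2 …” layout; the file path is illustrative):

    def load_dictionary(path="cmudict.06d"):
        """Map each word to its phoneme sequence."""
        lookup = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if not parts or parts[0].startswith(";;;"):   # skip blanks and comments
                    continue
                word, phones = parts[0], parts[1:]
                lookup[word] = phones
        return lookup

    # load_dictionary()["POULTRY"]  ->  ['P', 'OW', 'L', 'T', 'R', 'IY']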


Linguist

  • Constructs the search graph of HMMs from:

    • Acoustic model

    • Statistical Language model ~or~ Grammar

    • Dictionary




Search Graph

  • Can be statically or dynamically constructed


Linguist Types

  • FlatLinguist

  • DynamicFlatLinguist

  • LexTreeLinguist


Decoder

  • Maps feature vectors to search graph


Search Manager

  • Searches the graph for the “best fit”

  • P(sequence of feature vectors | word/phone), a.k.a. P(O|W): “how likely is the input to have been generated by the word”
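Together with the word likelihoods P(W) supplied by the language model, the search amounts to finding the word sequence W that maximizes P(O|W) · P(W).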


  • Possible alignments of ten acoustic frames to the phones f – ay – v (“five”); the search must consider every way each phone can stretch across the frames:

    F ay ay ay ay v v v v v
    F f ay ay ay ay v v v v
    F f f ay ay ay ay v v v
    F f f f ay ay ay ay v v
    F f f f ay ay ay ay ay v
    F f f f f ay ay ay ay v
    F f f f f f ay ay ay v


Viterbi Algorithm

[Trellis diagram: candidate HMM states plotted against the observations O1, O2, O3 over time; a runnable sketch of the algorithm follows.]
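A compact, self-contained Viterbi sketch in plain Python (toy probabilities over the phones of “five”, not Sphinx4 code): it fills a trellis of best path scores per state per observation, then traces back the single best state sequence.

    import math

    def viterbi(states, start, trans, emit_seq):
        """Return the most likely hidden-state sequence for the observations."""
        # best[t][s] = log probability of the best path that ends in state s at time t
        best = [{s: math.log(start[s]) + math.log(emit_seq[0][s]) for s in states}]
        back = [{}]
        for t in range(1, len(emit_seq)):
            best.append({})
            back.append({})
            for s in states:
                prev, score = max(
                    ((p, best[t - 1][p] + math.log(trans[p].get(s, 1e-12))) for p in states),
                    key=lambda x: x[1],
                )
                best[t][s] = score + math.log(emit_seq[t][s])
                back[t][s] = prev
        # trace back the single best path from the best final state
        state = max(best[-1], key=best[-1].get)
        path = [state]
        for t in range(len(emit_seq) - 1, 0, -1):
            state = back[t][state]
            path.append(state)
        return list(reversed(path))

    # Toy numbers for the phones of "five"; emit_seq[t][s] plays the role of P(O_t | s),
    # which the acoustic model would really provide.
    states = ["f", "ay", "v"]
    start = {"f": 0.8, "ay": 0.1, "v": 0.1}
    trans = {"f": {"f": 0.5, "ay": 0.5}, "ay": {"ay": 0.6, "v": 0.4}, "v": {"v": 1.0}}
    emit_seq = [
        {"f": 0.7, "ay": 0.2, "v": 0.1},   # O1
        {"f": 0.3, "ay": 0.6, "v": 0.1},   # O2
        {"f": 0.1, "ay": 0.3, "v": 0.6},   # O3
    ]
    print(viterbi(states, start, trans, emit_seq))   # ['f', 'ay', 'v']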


Pruner

  • Uses pruning algorithms to weed out low-scoring paths during decoding (sketched below)
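A rough sketch of the two pruning rules named later in Appendix II/III (absolute and relative beam pruning); the function and variable names are mine, and the defaults simply mirror the property values shown in Appendix III:

    def prune(hypotheses, absolute_beam_width=5000, relative_beam_width=1e-120):
        """hypotheses: (path, probability) pairs active at the current frame."""
        if not hypotheses:
            return []
        survivors = sorted(hypotheses, key=lambda h: h[1], reverse=True)
        survivors = survivors[:absolute_beam_width]        # absolute beam: keep only the N best
        best_score = survivors[0][1]
        # relative beam: drop anything too far below the best-scoring path
        return [h for h in survivors if h[1] >= best_score * relative_beam_width]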


Result

  • Words!


Word Error Rate

  • Most common metric

  • Measures the number of modifications (substitutions, deletions, insertions) needed to transform the recognized sentence into the reference sentence

  • Reference: “This is a reference sentence.”

  • Result: “This is neuroscience.”

  • Requires 2 deletions, 1 substitution (alignment: D S D), giving WER = 3 / 5 = 60% (a small sketch of the computation follows)
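A minimal, self-contained sketch (plain Python, my own helper name) that reproduces the 60% figure above via word-level edit distance:

    def word_error_rate(reference, hypothesis):
        """Edit distance over words, divided by the number of reference words."""
        ref, hyp = reference.lower().split(), hypothesis.lower().split()
        # d[i][j] = minimum edits turning the first i reference words
        # into the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                              # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                              # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution / match
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("this is a reference sentence", "this is neuroscience"))  # 0.6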


Where Speech Recognition Works

  • Limited Vocab Multi-Speaker

  • Extensive Vocab Single Speaker

  • If the audio input is noisy, multiply the expected error rate by 2

  • Other variables:

    • Continuous vs. Isolated

    • Conversational vs. Read

    • Dialect



Appendix I: Viterbi Algorithm

[Trellis diagrams stepping through observations O1, O2, O3 over time. Each arc entering a state at O2 is scored as transition probability × observation likelihood, e.g. P(ay | f) · P(O2 | ay) for moving from f to ay, or P(f | f) · P(O2 | f) for staying in f; the running path score multiplies in the earlier terms, e.g. P(O1) · P(ay | f) · P(O2 | ay).]


Appendix II: FAQs

  • Common Sphinx4 FAQs can be found online: http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html

  • What follows are some less frequently asked questions


Appendix II: FAQs

  • Q. Is a search graph created for every recognition result, or is one created for the whole recognition app?

  • A. This depends on which Linguist is used. The FlatLinguist generates the entire search graph up front and holds it in memory, so it is only useful for small-vocabulary recognition tasks. The LexTreeLinguist generates search states dynamically, allowing it to handle very large vocabularies.


Appendix II: FAQs

  • Q. How does the Viterbi algorithm save computation over exhaustive search?

  • A. The Viterbi algorithm saves memory and computation by reusing subproblems already solved within the larger solution. In this way, probability calculations that repeat in different paths through the search graph are not recalculated multiple times.

  • Viterbi cost: on the order of n^2 – n^3

  • Exhaustive search cost: on the order of 2^n – 3^n


Appendix II: FAQs

  • Q. Does the linguist use a grammar to construct the search graph if it is available?

  • A. Yes, a grammar graph is created


Appendix II: FAQs

  • Q. What algorithm does the Pruner use?

  • A. Sphinx4 uses absolute and relative beam pruning


Appendix III: Configuration Parameters

  • Absolute Beam Width – # of active search paths

    <property name="absoluteBeamWidth" value="5000"/>

  • Relative Beam Width – probability threshold

    <property name="relativeBeamWidth" value="1E-120"/>

  • Word Insertion Probability – word break likelihood

    <property name="wordInsertionProbability" value="0.7"/>

  • Language Weight – boosts language model scores

    <property name="languageWeight" value="10.5"/>

  • Silence Insertion Probability – likelihood of inserting silence

    <property name="silenceInsertionProbability" value=".1"/>

  • Filler Insertion Probability – likelihood of inserting filler words

    <property name="fillerInsertionProbability" value="1E-10"/>


Appendix IV: Python Note

  • To call a Java example from Python:

    import subprocess

    subprocess.call(["java", "-mx1000m", "-jar",
                     "/Users/Username/sphinx4/bin/Transcriber.jar"])


References

  • Speech and Language Processing, 2nd Ed. Daniel Jurafsky and James Martin. Pearson, 2009.

  • Artificial Intelligence, 6th Ed. George Luger. Addison Wesley, 2009.

  • Sphinx Whitepaper: http://cmusphinx.sourceforge.net/sphinx4/#whitepaper

  • Sphinx Forum: https://sourceforge.net/projects/cmusphinx/forums

