


Speech Recognition

Part 3

Back-end processing



Speech recognition simplified block diagram

[Block diagram: Speech Capture → Feature Extraction → Pattern Matching → Process Results → Text, with Training producing the Models used by Pattern Matching]



Building a phone model

  • Annotate the speech input

  • Split the input and create feature vectors for each segment
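
To make the slide's two steps concrete, here is a minimal sketch of turning one annotated segment into feature vectors. It assumes MFCC features and the third-party librosa library; the file name, timings and frame settings are illustrative, not from the slides.

```python
# Sketch: turn one annotated phone segment into feature vectors.
# Assumes MFCC features and the third-party librosa library; the
# file name, timings and frame settings below are illustrative.
import librosa

def segment_to_feature_vectors(wav_path, start_s, end_s, n_mfcc=13):
    """Return one MFCC feature vector per 10 ms frame of the
    annotated segment [start_s, end_s] of the recording."""
    y, sr = librosa.load(wav_path, sr=16000, offset=start_s,
                         duration=end_s - start_s)
    # hop_length=160 samples = 10 ms at 16 kHz; one column per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=160)
    return mfcc.T  # shape: (n_frames, n_mfcc)

# e.g. the feature vectors for a /th/ annotated at 0.10-0.18 s:
# fvs = segment_to_feature_vectors("three.wav", 0.10, 0.18)
```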



Praat

  • Semi-automatic annotation

    • http://www.fon.hum.uva.nl/praat/



Hidden Markov Models

  • Probability-based state machine

  • Transition probability and output probability
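
A minimal sketch of these two ingredients for one phone model, with illustrative (untrained) numbers: a left-to-right transition matrix and a per-state output distribution.

```python
import numpy as np

# Sketch of a 3-state left-to-right phone HMM; the numbers are
# illustrative, not trained values.
A = np.array([[0.6, 0.4, 0.0],   # transition probabilities:
              [0.0, 0.7, 0.3],   # self-loops let a state absorb a
              [0.0, 0.0, 1.0]])  # variable number of frames
# Output probabilities: here one Gaussian (mean, variance) per state
# over a 1-D feature; real systems use mixtures over MFCC vectors.
B = [(0.0, 1.0), (2.0, 0.5), (4.0, 1.5)]
```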



One HMM per phone (monophone)

  • 45 phones in British English + silence ($)

th → r → iy → h# (the word “three” followed by the end-of-utterance marker h#)



One HMM per two phones (biphone)

  • Associate with left or right phone

  • Up to 45 x 46 + $ = 2,071 models

$-th or th+r
th-r or r+iy
r-iy or iy+h#
iy-h# or h#+$



One HMM per three phones (triphone)

  • Associate with left and right phone

  • Up to 45 × 46 × 46 = 95,220 triphone models, plus silence ($)

$-th+r
th-r+iy
r-iy+h#
iy-h#+$
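
To make the naming scheme concrete, here is a small sketch (not from the slides) that expands a phone sequence into these triphone labels, using $ as the silence context:

```python
def to_triphones(phones, sil="$"):
    """Expand a phone sequence into left-centre-right triphone labels,
    e.g. ['th', 'r', 'iy', 'h#'] -> ['$-th+r', 'th-r+iy', ...]."""
    padded = [sil] + phones + [sil]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["th", "r", "iy", "h#"]))
# ['$-th+r', 'th-r+iy', 'r-iy+h#', 'iy-h#+$']
```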



Training HMMs

  • HMMs are presented with the feature-vector (FV) sequence of each triphone and “learn” the sound to be recognised using the Baum–Welch algorithm

Feature vectors are stepped past and presented to the model
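
In practice, Baum–Welch re-estimation comes built into off-the-shelf HMM toolkits. A minimal sketch using the third-party hmmlearn library (an assumption; the slides do not name a toolkit), with random data standing in for real feature vectors:

```python
# Sketch: training one phone model with Baum-Welch re-estimation,
# using the third-party hmmlearn library (the slides do not name a
# toolkit). Random data stands in for real MFCC feature vectors.
import numpy as np
from hmmlearn import hmm

X = np.random.randn(300, 13)   # stacked frames from all examples
lengths = [30] * 10            # ten 30-frame training examples

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
model.fit(X, lengths)          # Baum-Welch (EM) re-estimation
```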


Speech recognition

  • Millions of utterances from different people are used to train the models

  • Feature vectors from each phoneme are presented to the HMM in training mode

  • HMM states model temporal variability, e.g. one feature-vector “sound” may last longer than another, so the HMM may stay in that state for longer

  • What is the probability of the current FV being in state T?

  • What is the probability of the transition from state T to state T, T+1, T+2? (The forward–backward procedure answers both; see the sketch after this list.)

  • After many samples, state and transition probabilities stop improving

  • Not all models need to be created as not all triphone combinations are sensible for a particular language
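
The two probability questions above are answered during training by the forward–backward procedure. Below is a minimal sketch of the forward pass, assuming the per-frame log output probabilities (log_b) have already been computed from the feature vectors; it is an illustration, not a full trainer.

```python
import numpy as np

def forward(A, pi, log_b):
    """Forward pass: log_alpha[n, s] = log P(frames 0..n, state s at n).
    A: (S, S) transition matrix, pi: (S,) initial state probabilities,
    log_b: (N, S) log output probability of each frame under each state."""
    N, S = log_b.shape
    with np.errstate(divide="ignore"):        # log(0) -> -inf is fine here
        log_A, log_pi = np.log(A), np.log(pi)
    log_alpha = np.empty((N, S))
    log_alpha[0] = log_pi + log_b[0]
    for n in range(1, N):
        for s in range(S):
            # sum over predecessor states, done in log space for stability
            log_alpha[n, s] = np.logaddexp.reduce(
                log_alpha[n - 1] + log_A[:, s]) + log_b[n, s]
    return log_alpha  # logaddexp.reduce of the last row = log P(Y | model)
```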



Language Model

  • Determines the likelihood of word sequences by analysing lots of text from newspapers and other popular textual sources

  • Typically use trigrams – the probability of a word given the two words that precede it

  • Trigrams for the whole ASR vocabulary (if enough training data is available) are stored for look-up to determine probabilities
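
A minimal sketch of trigram estimation by counting, with toy sentences; real systems also smooth these estimates for unseen trigrams, which is omitted here:

```python
from collections import Counter

# Sketch: maximum-likelihood trigram estimation from a text corpus.
trigram_counts, bigram_counts = Counter(), Counter()

def train(sentences):
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            trigram_counts[(a, b, c)] += 1
            bigram_counts[(a, b)] += 1

def p_trigram(a, b, c):
    """P(c | a, b) = Count(a b c) / Count(a b)."""
    return trigram_counts[(a, b, c)] / max(bigram_counts[(a, b)], 1)

train([["the", "cat", "sat"], ["the", "cat", "ran"]])
print(p_trigram("the", "cat", "sat"))  # 0.5
```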



Trigram Probability
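
The slide's equation is missing from the transcript; from the description on the previous slide it is presumably the maximum-likelihood estimate:

```latex
P(w_n \mid w_{n-2}, w_{n-1}) \approx
  \frac{\mathrm{Count}(w_{n-2}\,w_{n-1}\,w_n)}{\mathrm{Count}(w_{n-2}\,w_{n-1})}
```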



Recognition problem

  • With the phone-based HMMs trained, consider the recognition problem:

    An utterance consists of a sequence of words, W = w1, w2, …, wn, and the Large Vocabulary Recognition (LVR) system needs to find the most probable word sequence Wmax given the observed acoustic signal Y. In other words:
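
The equation itself did not survive the transcript; in symbols, the goal is presumably:

```latex
W_{\max} = \arg\max_{W} P(W \mid Y)
```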



Bayes’ rule

  • Need to maximise the probability of W given utterance Y – too complex to compute directly

  • Rewrite using Bayes’ rule to create two more solvable problems
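
Spelled out (the slide's own equation is missing from the transcript), Bayes’ rule gives:

```latex
P(W \mid Y) = \frac{P(Y \mid W)\,P(W)}{P(Y)}
\qquad\Rightarrow\qquad
W_{\max} = \arg\max_{W} P(Y \mid W)\,P(W)
```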


Speech recognition

  • Maximise the probability P(Y|W) P(W)

    can ignore P(Y) as it is independent of W

  • P(W) is independent of Y so can be found using written text to form a language model

  • P(Y|W): for a given word (or words), what is the probability that the current series of speech vectors is that word? This comes from the acoustic models of phonemes concatenated into words



Recognition goal
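
[Slide equation missing from the transcript; presumably the decision rule just derived: Wmax = argmax over W of P(Y|W) P(W), with P(Y|W) supplied by the acoustic models and P(W) by the language model]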



Typical recogniser
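
[Slide diagram missing from the transcript; the slides that follow cover its likely components: feature extraction, phone-level HMM acoustic models, a dictionary (lexicon), a language model and a decoder]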



Dictionary

  • Often called a lexicon. Contains the pronunciations of the vocabulary at a phonetic level

  • There may be more than one pronunciation stored for the same word if it is said differently in different contexts (“The end” vs “Thee end”) or dialects (“Man-chest-er” vs “Mon-chest-oh”)

  • There may be one entry for more than one word, e.g. “red” and “read” (the language model is needed to sort this out)
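
A minimal sketch of such a lexicon as a word-to-pronunciations map; the phone symbols are illustrative (ARPAbet-style), not taken from the slides:

```python
# Sketch of a phonetic lexicon: word -> list of pronunciations.
lexicon = {
    "the":  [["dh", "ah"], ["dh", "iy"]],   # "the end" vs "thee end"
    "red":  [["r", "eh", "d"]],
    "read": [["r", "eh", "d"],              # past tense: same as "red";
             ["r", "iy", "d"]],             # the language model must
}                                           # resolve the ambiguity
```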



Decoding

  • Complex, processor intensive, memory intensive

  • Consider simplest method (but most costly):

    • Find the start of the utterance (after a quiet bit) and hypothesise all possible start words from the current potential words available from the HMMs, such as “I”, “eye”, “high”, etc.

    • Determine the probability that the input matches the HMMs for each of these candidates



Phoneme recognition

  • Feature vectors are presented to all models. Each HMM generates a feature vector, which is compared to the current input feature vector, and the output probabilities determine the most likely match. Which phonemes are tested is determined by the language model and lexicon.

[Example scores for the current feature vectors: P(th) = 0.51, P(iy) = 0.35, P(r) = 0.05, P(p) = 0.03, P(k) = 0.015, P(h#) = 0.001]


Speech recognition

  • Concatenate phonemes to make words

[Search tree with branch probabilities: th (0.01), r (0.5), iy (0.065), h# (0.001), p (0.03), k (0.015)]


Speech recognition

  • Probabilities multiplied to get total score

  • Need some way of reducing complexity

  • Obviously too many combinations

    • th + r + iy + h# is OK

    • th + z + z + d is not

  • Tree is pruned by discarding least likely paths based on probability and lexical rules
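
A minimal sketch of this kind of pruning. Decoders sum log probabilities rather than multiplying raw ones (which underflow quickly); the beam width and path cap below are illustrative parameters, not from the slides:

```python
import math

def beam_prune(hypotheses, beam=10.0, max_active=1000):
    """Keep only paths whose log score is within `beam` of the best,
    capped at `max_active` paths. hypotheses: [(phone_seq, log_score)]."""
    best = max(score for _, score in hypotheses)
    kept = [(seq, score) for seq, score in hypotheses
            if score >= best - beam]
    kept.sort(key=lambda h: h[1], reverse=True)
    return kept[:max_active]

# Scores are summed log probabilities of each branch in the tree:
hyps = [(["th", "r"], math.log(0.01) + math.log(0.5)),
        (["th", "z"], math.log(0.01) + math.log(1e-6))]
print(beam_prune(hyps, beam=8.0))  # the th+z path falls outside the beam
```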


Speech recognition

  • Check the validity of phoneme sequences using the lexicon

  • Continue to build the tree until an end of utterance is detected (silence, or the user may say “full stop” as a command)

  • From the language model, check the probability of the current possible words based on previous two words



Probability tree

  • A probability tree of possible word combinations is built and the best paths are calculated

  • The tree can become very large, so depending on processing power and memory, the least likely paths are dropped and the tree is pruned



Sentence decomposition

