
Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures

Author: Eduardo Lleida, Richard C. Rose. Reporter: 陳燦輝.



  1. Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures. Author: Eduardo Lleida, Richard C. Rose. Reporter: 陳燦輝

  2. Reference • [1] Eduardo Lleida and Richard C. Rose, “Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures,” IEEE Trans. Speech and Audio Processing, 2000. • [2] J. K. Chan and F. K. Soong, “An N-best candidates-based discriminative training for speech recognition applications,” Computer Speech and Language, 1995. • [3] W. Chou, B. H. Juang, and C. H. Lee, “Segmental GPD training of HMM based speech recognizer,” ICASSP, 1992. • [4] B. H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Trans. Signal Processing, 1992.

  3. Outline • Introduction to Utterance Verification (UV) • Utterance Verification Paradigms • Utterance Verification Procedures • Confidence Measures • Likelihood Ratio-based Training • Experimental results • Summary and Conclusions

  4. Introduction to Utterance Verification Utterance Verification Paradigms

  5. Introduction to Utterance Verification (cont) Utterance Verification Paradigms • Some problems of UV • The observation vectors Y might be associated with a hypothesized word that is embedded in a string of words. • The lack of a language model.

  6. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Two-Pass Procedure : Fig. 1. Two-pass utterance verification where a word string and associated segmentation boundaries that are hypothesized by a maximum likelihood CSR decoder are verified in a second stage using a likelihood ratio test.
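The two-pass procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-frame scorer, the model callables, and the threshold `tau` are placeholders standing in for real HMM likelihood computation.

```python
def two_pass_verify(frames, segmentation, target_models, alt_models, tau):
    """Second stage of two-pass UV: for each word hypothesized by the
    maximum likelihood CSR decoder, score its segment under the target
    model and the alternate model, and accept the word only when the
    log likelihood ratio exceeds the threshold tau."""
    def log_likelihood(model, segment):
        # Placeholder scorer: sum of per-frame log-likelihoods.  A real
        # system would use the HMM forward/Viterbi score here.
        return sum(model(f) for f in segment)

    decisions = []
    for word, (start, end) in segmentation:
        segment = frames[start:end]
        llr = (log_likelihood(target_models[word], segment)
               - log_likelihood(alt_models[word], segment))
        decisions.append((word, llr, llr > tau))
    return decisions
```

Each decision tuple carries the word, its log likelihood ratio, and the accept/reject outcome, mirroring how the second stage re-scores the fixed segmentation produced by the first pass.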

  7. Introduction to Utterance Verification (cont) Utterance Verification Procedures • One-Pass Procedure : Fig. 2. One-pass utterance verification where the optimum decoded string is that which directly maximizes a likelihood ratio criterion

  8. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Likelihood Ratio Decoder

  9. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Likelihood Ratio Decoder

  10. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Likelihood Ratio Decoder There are two issues that must be addressed if LR decoding is to be applied to actual speech recognition tasks: 1. computational complexity; 2. the definition of the alternate hypothesis model.
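A one-pass LR decoder can be sketched as a Viterbi search whose local score is itself a likelihood ratio. This toy left-to-right chain is an assumption for illustration; the paper's decoder operates on full HMM networks.

```python
def lr_viterbi(frames, n_states, log_trans, target_pdf, alt_pdf):
    """Frame-synchronous decoding whose local score is a state-level log
    likelihood ratio, log p(y|target) - log p(y|alternate).  Coupling the
    target and alternate models state-by-state keeps the search space the
    same size as ordinary Viterbi decoding, one way of controlling the
    computational complexity mentioned above."""
    NEG = float("-inf")
    # delta[t][s]: best accumulated LR score ending in state s at frame t
    delta = [[NEG] * n_states for _ in frames]
    delta[0][0] = target_pdf(0, frames[0]) - alt_pdf(0, frames[0])
    for t in range(1, len(frames)):
        for s in range(n_states):
            best = max(delta[t - 1][sp] + log_trans.get((sp, s), NEG)
                       for sp in range(n_states))
            if best == NEG:
                continue  # state s is unreachable at frame t
            local = target_pdf(s, frames[t]) - alt_pdf(s, frames[t])
            delta[t][s] = best + local
    return max(delta[-1])
```

The string that maximizes this accumulated ratio is the one-pass decoding result, rather than the maximum likelihood string re-scored afterwards.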

  11. Introduction to Utterance Verification (cont) Utterance Verification Procedures • computation complexity Fig. 3. A possible three-dimensional HMM search space.

  12. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Computational complexity • Unit level constraint: the target model and the alternate model must occupy their unit initial states and their unit final states at the same time instants, each model following its own state sequence within the unit.

  13. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Computational complexity • State level constraint: the target model and the alternate model are synchronized state by state, so the local score at each frame is a state-level likelihood ratio.

  14. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Definition of alternative Models The alternative hypothesis model has two roles in UV: 1. to reduce the effect of sources of variability; 2. to represent more specifically the incorrectly decoded hypotheses that are frequently confused with a given lexical item.

  15. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Definition of alternative Models • The alternate model must somehow “cover” the entire space of out-of-vocabulary lexical units. • If OOV utterances that are easily confused with vocabulary words are to be detected, the alternate model must provide a more detailed representation of the utterances that are likely to be decoded as false alarms for individual vocabulary words.

  16. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Definition of alternative Model

  17. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Confidence measures It was suggested that modeling errors may result in extreme values in local likelihood ratios, which may exert undue influence at the word or phrase level. In order to minimize these effects, we investigated several word level likelihood ratio based confidence measures that can be computed as a non-uniform combination of sub-word level confidence measures.
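One plausible instance of such a non-uniform combination is sketched below. This is an assumption for illustration, not the paper's exact measures W1–W4: squashing each sub-word ratio through a sigmoid is one way to keep extreme local values from dominating the word score.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_confidence(subword_llrs, durations):
    """Word-level confidence from sub-word log likelihood ratios (a
    sketch): normalize each sub-word LLR by its duration, squash it
    through a sigmoid so extreme local values are bounded, then average
    over the sub-words of the word."""
    squashed = [sigmoid(llr / dur) for llr, dur in zip(subword_llrs, durations)]
    return sum(squashed) / len(squashed)
```

Because each sub-word contributes at most 1 regardless of how extreme its ratio is, a single badly modeled segment cannot by itself drive the word-level decision.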

  18. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Confidence measures

  19. Introduction to Utterance Verification (cont) Utterance Verification Procedures • Confidence measures

  20. Likelihood Ratio-based Training • The goal of the training procedure is to increase the average value of the likelihood ratio for correct hypotheses and to decrease it for false alarms. • LR based training is a discriminative training algorithm that is based on a cost function which approximates a log likelihood ratio.

  21. Likelihood Ratio-based Training (cont) • A distance measure underlies the cost function.

  22. Likelihood Ratio-based Training (cont)

  23. Likelihood Ratio-based Training (cont) • Imposters with scores greater than the threshold and targets with scores lower than the threshold tend to increase the average cost function. Therefore, if we minimize this function, we can reduce the misclassification between targets and imposters.
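A cost of this shape can be written down as a smooth approximation of the 0–1 verification loss, in the spirit of the MCE/GPD training of [3], [4]. This is a sketch: the smoothness parameter `alpha` and threshold `tau` are illustrative, not the paper's values.

```python
import math

def lr_cost(score, is_target, alpha=1.0, tau=0.0):
    """Sigmoid-smoothed verification loss: a target pays a cost near 1
    when its score falls below tau, an imposter pays a cost near 1 when
    its score rises above tau.  Minimizing the average of this cost over
    the training data therefore reduces the misclassification between
    the target and imposter populations."""
    if is_target:
        return 1.0 / (1.0 + math.exp(alpha * (score - tau)))
    return 1.0 / (1.0 + math.exp(-alpha * (score - tau)))
```

Unlike the hard 0–1 loss, this cost is differentiable in the score, which is what makes gradient-based parameter updates possible in the next slides.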

  24. Likelihood Ratio-based Training (cont)

  25. Likelihood Ratio-based Training (cont)

  26. Likelihood Ratio-based Training (cont)

  27. Likelihood Ratio-based Training (cont)

  28. Likelihood Ratio-based Training (cont)

  29. Likelihood Ratio-based Training (cont)

  30. Likelihood Ratio-based Training (cont) • The complete likelihood ratio based training procedure: • Train initial ML HMMs (target and alternate models) for each unit. • For each iteration over the training database: • Obtain the hypothesized sub-word unit string and segmentation using the LR decoder. • Align the decoded sub-word units as correct or false alarms to obtain the indicator function. • Update the gradient of the expected cost. • Update the model parameters as in (17).
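The loop above can be illustrated on a deliberately tiny model. In this sketch the "model" is a single scalar bias added to every score, an assumption made so the gradient step is transparent; a real system updates the HMM means and variances of the target and alternate models instead.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lr_train(examples, theta=0.0, step=0.5, iterations=200):
    """Toy instance of the likelihood-ratio training loop: each example
    is (score, is_target), and we descend the gradient of the average
    sigmoid cost so that targets end up scoring above zero and imposters
    below it."""
    for _ in range(iterations):
        grad = 0.0
        for score, is_target in examples:
            d = score + theta
            if is_target:
                p = sigmoid(-d)            # cost paid by a target
                grad += -p * (1.0 - p)     # d(cost)/d(theta)
            else:
                p = sigmoid(d)             # cost paid by an imposter
                grad += p * (1.0 - p)
        theta -= step * grad / len(examples)
    return theta
```

The alignment step of the real procedure corresponds here to the `is_target` labels: decoded units marked correct play the role of targets, those marked false alarms the role of imposters.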

  31. Experimental results • Speech corpora : movie locator task • In a trial of the system over the public switched telephone network, the service was configured to accept approximately 105 theater names, 135 city names, and between 75 and 100 current movie titles. • A corpus of 4777 spontaneous spoken utterances from the trial was used in our evaluation.

  32. Experimental results (cont) • A total of 3025 sentences were used for training acoustic models and 1752 utterances were used for testing. • The sub-word models used in the recognizer consisted of 43 context independent units. • Recognition was performed using a finite state grammar built from the specification of the service, with a lexicon of 570 different words.

  33. Experimental results (cont) • The total number of words in the test set was 4864, of which 134 were OOV. • Recognition performance of 94% word accuracy was obtained on the “in-grammar” utterances. • The feature set used for recognition included 12 mel-cepstrum, 12 delta mel-cepstrum, 12 delta-delta mel-cepstrum, energy, delta energy, and delta-delta energy coefficients, and cepstral mean normalization was applied.

  34. Experimental results (cont) • A single “background” alternate HMM model containing three states with 32 mixtures per state was used. • A separate “imposter” alternate HMM model was trained for each sub-word unit. These models contained three states with eight mixtures per state.

  35. Experimental results (cont) • Performance is described both in terms of the receiver operating characteristic curves (ROC) and curves displaying type I + type II error plotted against the decision threshold setting.
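The type I + type II error curve can be computed directly from verification scores. A minimal sketch, assuming per-word confidence scores are available for correctly decoded words (targets) and false alarms (imposters):

```python
def error_curve(target_scores, imposter_scores, thresholds):
    """Type I + type II error as a function of the decision threshold:
    type I is the fraction of correct hypotheses rejected (score at or
    below the threshold), type II the fraction of imposters accepted
    (score above it)."""
    curve = []
    for t in thresholds:
        type1 = sum(s <= t for s in target_scores) / len(target_scores)
        type2 = sum(s > t for s in imposter_scores) / len(imposter_scores)
        curve.append((t, type1 + type2))
    return curve

def min_error_point(curve):
    """Threshold achieving the minimum type I + type II error."""
    return min(curve, key=lambda point: point[1])
```

Sweeping the threshold over the score range also yields the operating points of the ROC curve, since each threshold fixes one (false alarm, detection) pair.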

  36. Experimental results (cont) • Experiment 1 : Comparison of UV Measures Fig. 4. ROC curves comparing the performance of confidence measures W1(w) (dashed line) and W2(w) (solid line) (left figure), and W3(w) (dashed line) and W4(w) (solid line) (right figure).

  37. Experimental results (cont) • Experiment 1 : Comparison of UV Measures • It appears from the error plot in Fig. 5 that W4 is less sensitive to the setting of the confidence threshold. • In the remaining simulations, W4 will be used. Fig. 5. Type I + type II error comparing the performance of confidence measures W3(w) (dashed line) and W4(w) (solid line).

  38. Experimental results (cont) • Experiment 2 : Investigation of LR Training and UV strategies TABLE I. Utterance verification performance: type I + type II minimum error rate for the one-pass (OP) and the two-pass (TP) utterance verification procedures (bN = number of mixtures for the background model, iN = number of mixtures for the imposter model). Fig. 6. Likelihood ratio training: ROC curves for initial models (dash-dot line), one iteration (dashed line), and two iterations (solid line). The * points mark the minimum type I + type II error.

  39. Experimental results (cont) • Experiment 2 :Investigation of LR Training and UV strategies Fig. 7. One-pass versus two-pass UV comparison with the b32.i8 configuration and two iterations of the likelihood ratio training.

  40. Experimental results (cont) • Experiment 3 : whether or not the LR training procedures actually improved speech recognition performance TABLE II. Speech recognition performance given in terms of word accuracy without using utterance verification, and utterance verification performance given as the sum of type I and type II error.

  41. Experimental results (cont) • Experiment 4 : UV performance measured over in-grammar and out-of-grammar utterances, respectively. Fig. 8. In-grammar and out-of-grammar sentences. Initial models: dash-dot line; one iteration: dashed line; two iterations: solid line.

  42. Summary and Conclusions • The one-pass decoding procedure improved UV performance over the two-pass approach. • Likelihood ratio training and decoding have also been successfully applied to other tasks, including speaker dependent voice label recognition. • Further research should involve the investigation of decoding and training paradigms for UV that incorporate additional, non-acoustic sources of knowledge.
