Utterance Verification for Spontaneous Mandarin Speech Keyword Spotting Liu Xin, BinXi Wang

Utterance Verification for Spontaneous Mandarin Speech Keyword Spotting
Liu Xin, BinXi Wang
Presenter: Kai-Wun Shih
No. 306, P.O. Box 1001, ZhengZhou, 450002, Henan, P.R. China
yrjhx@263.net, bingxiwang@263.net

Outline
• Introduction
• Feature Extraction and Acoustic Modeling
• Keyword Recognition
• Keyword Verification And Confidence Measures
• Experiments and Results
• Conclusions

Introduction (1/2)
• Utterance verification is an important technology in the design of user-friendly speech recognition systems.
• Recognizers equipped with a keyword spotting capability let users speak naturally, without having to follow a rigid speaking format.

Introduction (2/2)
• Keyword spotting systems introduce a filler model to enhance keyword detection and to absorb out-of-vocabulary events.
• To reduce the false alarm rate, in this paper we incorporate two-level utterance verification, applied after a conventional Viterbi search has detected and segmented the speech into keyword hypotheses.

Feature Extraction and Acoustic Modeling (1/3)

Feature Extraction and Acoustic Modeling (2/3)
• Considering that Chinese is a monosyllabic language, we choose the syllable as the basic recognition unit.
• Except for the background silence unit, each syllable is modeled by a six-state left-to-right hidden Markov model (HMM).
• Each state is characterized by a Gaussian mixture state observation density.
• Training each syllable model consists of estimating the means, covariances, and mixture weights for each state using maximum likelihood (ML) estimation.
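As a rough illustration of the six-state left-to-right topology described on this slide, the sketch below builds a transition matrix in which each state can only repeat itself or advance to the next state. The function name and the self-loop probability of 0.6 are assumptions made for illustration; the slides do not give the actual transition values or say whether state skips are allowed.

```python
def left_to_right_transitions(n_states=6, self_loop=0.6):
    """Build an n_states x n_states left-to-right transition matrix.

    Hypothetical helper: the self_loop value is an illustrative
    assumption, not a parameter taken from the slides.
    """
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i == n_states - 1:
            # Last state: only a self-loop inside this matrix; leaving
            # the model is handled by the decoder, not shown here.
            matrix[i][i] = 1.0
        else:
            matrix[i][i] = self_loop            # stay in the same state
            matrix[i][i + 1] = 1.0 - self_loop  # advance by one state
    return matrix
```

All entries below the diagonal are zero, which is what makes the model left-to-right: once a state has been left, it cannot be revisited.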

Feature Extraction and Acoustic Modeling (3/3)
• For each syllable model, an anti-syllable model was also trained.
• In general, the anti-syllable model for a given syllable should be trained on the data of all syllables except that syllable.
• Aside from the syllable and anti-syllable models, we also introduced a general acoustic filler model trained on non-keyword speech data, and a silence model trained on the non-speech segments of the signal.
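The data selection rule for anti-syllable models can be sketched as a simple partition of labeled training segments. This is only a minimal illustration: the function name, the (label, features) tuple layout, and the syllable labels in the usage note are hypothetical, and the slides do not describe how the segments were actually stored or labeled.

```python
def split_training_data(segments, target_syllable):
    """Partition labeled segments for one syllable / anti-syllable pair.

    segments: list of (syllable_label, features) tuples (assumed layout).
    Returns (syllable_data, anti_syllable_data).
    """
    # Data for the syllable model: segments labeled with the target.
    syllable_data = [feat for label, feat in segments
                     if label == target_syllable]
    # Data for the anti-syllable model: every other syllable's segments.
    anti_syllable_data = [feat for label, feat in segments
                          if label != target_syllable]
    return syllable_data, anti_syllable_data
```

For example, with segments labeled "bei", "jing", and "shang", calling `split_training_data(segments, "bei")` returns the "bei" segments for the syllable model and the "jing" and "shang" segments for its anti-syllable model.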

Keyword Recognition (1/2)

Keyword Recognition (2/2)

Keyword Verification And Confidence Measures (1/9)

Keyword Verification And Confidence Measures (7/9)
• The third one focuses on the less confident syllables rather than averaging over all the subwords.
• In order to find the less confident syllables, we normalize the log likelihood ratio of every syllable, assuming a Gaussian distribution. The normalization uses the mean and the variance of the log likelihood ratio for syllable class n.
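The exact formula is not recoverable from this slide, so the following is only a sketch under an assumption: that the normalization is the usual Gaussian standardization, (LLR - mean) / sqrt(variance), computed with per-syllable statistics. Taking the minimum normalized score as the "least confident" syllable is likewise an assumption, and the function names, syllable labels, and numbers are hypothetical.

```python
import math


def normalized_llr(llr, mean, variance):
    """Standardize one syllable's log likelihood ratio (assumed z-score form)."""
    return (llr - mean) / math.sqrt(variance)


def least_confident(syllable_llrs, stats):
    """Return (normalized_score, syllable) for the least confident syllable.

    syllable_llrs: list of (syllable, llr) pairs for one keyword hypothesis.
    stats: dict mapping syllable -> (mean, variance) of its LLR.
    """
    scored = [(normalized_llr(llr, *stats[syl]), syl)
              for syl, llr in syllable_llrs]
    # The lowest normalized score marks the weakest syllable.
    return min(scored)
```

Normalizing first makes scores comparable across syllable classes, so a syllable whose raw ratio is typically low is not flagged merely because of its class.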

Keyword Verification And Confidence Measures (9/9)
• The fourth confidence measure uses the sigmoid function. This form is used as a loss function for training with the minimum error rate criterion. Two parameters control the slope and the range of the sigmoid function.
For every confidence measure, a specific threshold is set. If the value of the measure is below the threshold, the candidate is discarded from the word lattice.
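A minimal sketch of a sigmoid-based confidence score followed by the thresholding step is given below. The parameterization is an assumption: alpha is taken as the slope and beta as an offset that shifts the operating range, which may differ from the symbols and form used in the original slides. The function names and the candidate scores in the usage note are hypothetical.

```python
import math


def sigmoid_confidence(llr, alpha=1.0, beta=0.0):
    """Map a log likelihood ratio to (0, 1) with a sigmoid.

    alpha (slope) and beta (offset) are assumed parameter roles.
    """
    return 1.0 / (1.0 + math.exp(-alpha * (llr - beta)))


def filter_candidates(candidates, threshold, alpha=1.0, beta=0.0):
    """Keep only candidates whose confidence reaches the threshold.

    candidates: list of (keyword, llr) pairs from the recognition stage.
    Candidates scoring below the threshold are discarded.
    """
    return [(word, llr) for word, llr in candidates
            if sigmoid_confidence(llr, alpha, beta) >= threshold]
```

With a threshold of 0.5 and the default parameters, a candidate with a ratio of 2.0 is kept while one with a ratio of -2.0 is discarded. Raising the threshold lowers the false alarm rate at the cost of rejecting more true keywords.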

Experiments and Results
• In this system, 20 city names were selected as the keywords. The system was trained on a continuous telephone-speech database composed of short spontaneous speech, syllables, words and sentences. This database was recorded from 70 speakers (50 males, 20 females). For testing, we also recorded 205 utterances corresponding to the 20 city names, spoken by a different group of 20 speakers (15 males, 5 females).
Table 1: Performance with several confidence measures

Conclusions
• The spotting system adopts a two-stage strategy, with recognition followed by verification, and the basic unit of the system is the syllable. In the second stage, a keyword verification function is evaluated with four different confidence measures.
• Experimental results show that utterance verification with the third confidence measure outperforms the baseline system.

END
