
Discriminative MLE training using a product of Gaussian Likelihoods


  1. Discriminative MLE training using a product of Gaussian Likelihoods T. Nagarajan and Douglas O’Shaughnessy INRS-EMT, University of Quebec Montreal, Canada ICSLP 06

  2. Outline • Introduction • Model selection methods • Bayesian information criterion • Product of Gaussian likelihoods • Experimental setup • PoG-based model selection • Performance analysis • Conclusions

  3. References • [SAP 2000] Padmanabhan and Bahl, “Model Complexity Adaptation Using a Discriminant Measure” • [ICASSP 03] Airey and Gales, “Product of Gaussians and Multiple Stream Systems”

  4. Introduction • Defining the structure of an HMM is one of the major issues in acoustic modeling • The number of states is usually chosen based on either the number of acoustic variations one may expect across the utterance or the length of the utterance • The number of mixtures per state can be chosen based on the amount of training data available • Proportional to the number of data samples (PD) • Alternative criteria for model selection • Bayesian Information Criterion (also referred to as MDL, Minimum Description Length) • Discriminative model complexity adaptation (MAC)

  5. Introduction • The major difference between the proposed technique and others lies in the fact that the complexity of a model is adjusted by considering only the training examples of the other classes, not their corresponding models • In this technique, the focus is on how well a given model can discriminate the training data of different classes

  6. Model selection methods • The focus is on choosing a proper topology for the models to be generated, especially the number of mixtures per state • We compare the performance of the proposed technique with that of the conventional BIC-based technique • For this work, acoustic modeling is carried out using the MLE algorithm only • These systems are implemented using the HTK toolkit

  7. Bayesian Information Criterion • The number of components of a model is chosen by maximizing an objective function • This objective is essentially the likelihood of the training examples of a model, penalized by the number of components in that model and the number of training examples • α is an additional penalty factor used to control the complexities of the resultant models
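The slide's equation itself is not reproduced in the transcript. As a hedged reconstruction, the standard BIC objective for a class model λi, with the extra penalty factor α mentioned above, can be written as follows (the symbols Xi, Ni and #(λi) are my notation, not the slides'):

\[
\mathrm{BIC}(\lambda_i) \;=\; \log p\big(X_i \mid \hat{\lambda}_i\big) \;-\; \alpha\,\frac{\#(\lambda_i)}{2}\,\log N_i
\]

where Xi is the set of training examples of class Ci, \(\hat{\lambda}_i\) the MLE-trained model, #(λi) the number of free parameters (which grows with the number of mixture components), and Ni the number of training examples; the number of mixtures is chosen to maximize this score.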

  8. Bayesian Information Criterion • Disadvantage • It does not consider information about other classes • This may lead to an increased error rate, especially when other classes are highly competitive and closely resemble the class being modeled

  9. Product of Gaussian likelihoods • Figure 1(b) is possible only when the model λj is well trained • During training of the model, the acoustic likelihoods of all the utterances of class Cj should be maximized as far as possible • To maximize the likelihood on the training data, the estimation procedure often tries to make the variances of all the mixture components very small • But this often provides poor matches to independent test data

  10. Product of Gaussian likelihoods • Thus, it is always better to reduce the overlap between the likelihood Gaussians (say, Nii and Nij) of utterances of different classes (say, Ci and Cj) for a given model (λi) • We assume that two Gaussians overlap with each other if either of the following conditions is met: • their means coincide (or nearly coincide), irrespective of their corresponding variances • either variance is wide enough that the two Gaussians overlap considerably • To quantify the amount of overlap between two Gaussians, we could use error bounds, such as the Chernoff or Bhattacharyya bounds
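For reference (this equation is not in the slides, but it is the standard form of the bound mentioned above), the Bhattacharyya distance between two one-dimensional Gaussians N(μ1, σ1²) and N(μ2, σ2²), and the corresponding bound on the Bayes error, are:

\[
D_B \;=\; \frac{(\mu_1-\mu_2)^2}{4(\sigma_1^2+\sigma_2^2)} \;+\; \frac{1}{2}\,\ln\frac{\sigma_1^2+\sigma_2^2}{2\sigma_1\sigma_2},
\qquad
P_{\mathrm{error}} \;\le\; \sqrt{P(C_i)\,P(C_j)}\;e^{-D_B}
\]

Both terms depend directly on the variances, which is exactly the sensitivity the next slide objects to.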

  11. Product of Gaussian likelihoods • However, these bounds are sensitive to the variances of the Gaussians • Here, we use a similar logic of a “product of Gaussians” for estimating the amount of overlap between two Gaussians

  12. Product of Gaussian likelihoods • In order to quantify the amount of overlap between two different Gaussians, we define the following ratio

  13. Product of Gaussian likelihoods • However, for this case we expect the overlap O to be equal to 1 • The resultant ON is used as a measure to estimate the amount of overlap between two Gaussians
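The ratio itself does not survive in the transcript. Purely as an illustrative sketch, the function below computes one common product-of-Gaussians overlap between two one-dimensional likelihood Gaussians (e.g., Nii and Nij), normalized so that two identical Gaussians give exactly 1; the paper's exact definitions of O and ON may differ, and the function name pog_overlap and the normalization choice are assumptions.

```python
import numpy as np

def pog_overlap(mu1, var1, mu2, var2):
    """Normalized product-of-Gaussians overlap between two 1-D Gaussians.

    Returns a value in (0, 1]; it equals 1 only when the two Gaussians are
    identical, and decreases as their means separate or their variances diverge.
    """
    # Integral of the product of the two densities:
    #   int N(x; mu1, var1) N(x; mu2, var2) dx = N(mu1; mu2, var1 + var2)
    cross = (np.exp(-0.5 * (mu1 - mu2) ** 2 / (var1 + var2))
             / np.sqrt(2.0 * np.pi * (var1 + var2)))
    # Self-products, used to normalize the measure to 1 for identical Gaussians
    self1 = 1.0 / (2.0 * np.sqrt(np.pi * var1))
    self2 = 1.0 / (2.0 * np.sqrt(np.pi * var2))
    return cross / np.sqrt(self1 * self2)

# Well-separated likelihood Gaussians give a small overlap; identical ones give 1.
print(pog_overlap(0.0, 1.0, 3.0, 1.0))   # ~0.105
print(pog_overlap(0.0, 1.0, 0.0, 1.0))   # 1.0
```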

  14. Experimental setup • The TIMIT corpus is considered for both training and testing • The word lexicon is created only with the test words • For the words that are common to the training and test data, pronunciation variations are taken from the transcriptions provided with the corpus • For the rest of the test words, only one transcription is considered • Syllables are extracted using the phonetic segmentation boundaries • Only the 200 syllables that have more than 50 examples each are considered • The rest are replaced by their corresponding phonemes • 200 syllable models and 46 monophone models are initialized

  15. Experimental setup • For the initialized models • the number of states is fixed based on the number of phonemes in a given sub-word unit • the number of mixtures per state is initially set to one • For the re-estimation of model parameters • a standard Viterbi alignment procedure is used • the number of mixtures per state for each model is then increased up to 30, in steps of 1, by a conventional mixture-splitting procedure (sketched below) • each time, the model parameters are re-estimated twice
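A minimal sketch of one conventional mixture-splitting step, assuming diagonal covariances: the heaviest component is duplicated, its weight halved, and the two copies' means perturbed by ±0.2 standard deviations. The 0.2 factor and the function name split_heaviest_component are assumptions typical of HTK-style mixture-up operations, not details taken from the paper; each split would be followed by re-estimation of the model parameters.

```python
import numpy as np

def split_heaviest_component(weights, means, variances, perturb=0.2):
    """One mixture-splitting step for a diagonal-covariance GMM state.

    The component with the largest weight is duplicated, its weight is halved,
    and the means of the two copies are offset by +/- perturb standard deviations.
    """
    k = int(np.argmax(weights))
    offset = perturb * np.sqrt(variances[k])

    new_weights = np.append(weights, weights[k] / 2.0)
    new_weights[k] /= 2.0
    new_means = np.vstack([means, means[k] + offset])
    new_means[k] -= offset
    new_vars = np.vstack([variances, variances[k]])
    return new_weights, new_means, new_vars

# Example: grow a 1-component state to 2 components (2-dimensional features).
w = np.array([1.0])
mu = np.array([[0.0, 0.0]])
var = np.array([[1.0, 4.0]])
w, mu, var = split_heaviest_component(w, mu, var)
print(w)    # [0.5 0.5]
print(mu)   # means offset by +/- 0.2 standard deviations per dimension
```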

  16. PoG-based model selection

  17. Performance analysis • Since the model size seems to grow uncontrollably, we fixed the maximum number of mixtures per state at 30

  18. Conclusions • In conventional techniques for model optimization, the topology of a model is optimized either without considering other classes at all, or by considering only a subset of competing models • In this work, by contrast, we consider whether a given model can discriminate the training utterances of other classes from its own

  19. Discriminative model complexity adaptation • The discriminant measure is a two-dimensional vector
