


  1. Improved Speech Recognition using Acoustic and Lexical Correlates of Pitch Accent in an N-Best Rescoring Framework S. Ananthakrishnan and S. Narayanan, Department of EE, University of Southern California ICASSP 2007 Reporter: Shih-Hung Liu 2007/05/14

  2. Outline • Introduction • Data Corpus and Baseline ASR • Prosody model • Acoustic-prosodic model • De-lexicalized prosody sequence model • Lexical prosody model • Experimental results • Conclusions

  3. Introduction • Most statistical speech recognition systems make use of segment-level features, derived mainly from spectral envelope characteristics of the signal, but ignore supra-segmental cues that carry additional information likely to be useful for speech recognition • These cues, which constitute the prosody of the utterance and occur at the syllable, word and utterance level, are closely related to the lexical and syntactic organization of the utterance • In this paper, we explore the use of acoustic and lexical correlates of a subset of these cues in order to improve recognition performance on a read-speech corpus

  4. Data Corpus • The Boston University Radio News Corpus (BU-RNC) consists of about 3 hours of read speech with 6 speakers (3 female, 3 male). • We use this corpus because it contains prosodic annotations in the form of ToBI-style labels for pitch accents, phrase boundaries and lexical break indices

  5. Baseline ASR • We used the University of Colorado SONIC continuous speech recognizer to develop the baseline ASR • We adapted context-dependent triphone acoustic models from the Wall Street Journal task with data from the training partitions of the BU-RNC using the tree-based MAPLR algorithm • We used PMVDR (Perceptual Minimum Variance Distortionless Response) features derived from the acoustic signal to train these models • A standard back-off trigram language model with Kneser-Ney smoothing was trained with a mixture of text from the WSJ, HUB-4 and BU datasets

  6. Prosody model • We augment the standard ASR decoding equation to include prosodic information, introducing the pitch-accent sequence P and the acoustic-prosodic observation stream Ap alongside the spectral observations

  7. Prosody model • Based on conditional independence assumptions Acoustic-prosodic model Lexical prosody model De-lexicalized prosody sequence model
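The equations on slides 6 and 7 are images and were not preserved in this transcript. A plausible reconstruction of the augmented decoding rule and its factorization, consistent with the three component models named above (the exact form on the slides may differ):

```latex
% Augmented decoding rule (slide 6): include the pitch-accent
% sequence P and acoustic-prosodic observations A_p alongside the
% spectral observations A.
(\hat{W}, \hat{P}) = \operatorname*{arg\,max}_{W, P} \; p(W, P \mid A, A_p)

% Factorization under the slide's conditional independence
% assumptions, with the three prosody components named on slide 7:
p(W, P \mid A, A_p) \propto
    \underbrace{p(A \mid W)\, p(W)}_{\text{baseline ASR}}\;
    \underbrace{p(A_p \mid P)}_{\text{acoustic-prosodic}}\;
    \underbrace{p(P)}_{\text{de-lexicalized sequence}}\;
    \underbrace{p(P \mid W)}_{\text{lexical prosody}}
```

In rescoring, these terms are typically combined log-linearly rather than as a strict joint probability.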

  8. Acoustic-prosodic model • The acoustic-prosodic features that make up Ap include: • 1. F0: F0-range features (max-min, max-avg, avg-min) • 2. Energy: within-syllable energy range features (max-min, avg-min) • 3. Timing: syllable nucleus duration • These features were normalized to minimize effects of speaker- or nucleus-specific variation • The model is trained as a feedforward neural network (MLP) with 8 input nodes, 25 hidden nodes and 2 output nodes with softmax activation, with outputs interpreted as posterior probabilities
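A minimal sketch of the accent classifier described above: an MLP with 8 inputs, 25 hidden units, and a 2-way softmax whose outputs are read as P(unaccented) and P(accented). The weights below are random placeholders, not the trained model, and the slide names six F0/energy/duration features for eight input nodes, so the remaining inputs are left unspecified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters: in practice these come from training the MLP
# on syllables labeled accented / unaccented.
W1 = rng.normal(scale=0.1, size=(8, 25))   # input -> hidden
b1 = np.zeros(25)
W2 = rng.normal(scale=0.1, size=(25, 2))   # hidden -> output
b2 = np.zeros(2)

def accent_posterior(features):
    """Map an 8-dim normalized acoustic-prosodic feature vector for one
    syllable to [P(unaccented), P(accented)] via a 1-hidden-layer MLP."""
    h = np.tanh(features @ W1 + b1)          # hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

# Feature order is illustrative: F0 max-min, F0 max-avg, F0 avg-min,
# energy max-min, energy avg-min, nucleus duration, plus two
# unspecified normalized features.
p = accent_posterior(np.zeros(8))
```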

  9. De-lexicalized prosody sequence model • The term p(P) establishes constraints on the sequence of pitch accent events P • Since P has a binary vocabulary, it was robustly estimated from small amounts of training data • We modeled this component as a 4-gram back-off language model with pitch accent labels obtained from the training data
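The sequence model above can be sketched as a count-based 4-gram over binary accent labels. This simplified version uses raw maximum-likelihood counts with a flat fallback for unseen contexts, rather than the back-off smoothing the slide mentions.

```python
from collections import defaultdict

def train_accent_ngrams(sequences, n=4):
    """Count n-grams over binary pitch-accent label sequences
    (1 = accented, 0 = unaccented)."""
    counts = defaultdict(int)
    context_counts = defaultdict(int)
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            ctx = tuple(padded[i - n + 1:i])
            counts[ctx + (padded[i],)] += 1
            context_counts[ctx] += 1
    return counts, context_counts

def accent_prob(label, context, counts, context_counts):
    """MLE p(label | previous 3 labels); flat 0.5 for unseen contexts
    (a real system would back off to lower-order n-grams instead)."""
    ctx = tuple(context)
    if context_counts[ctx] == 0:
        return 0.5
    return counts[ctx + (label,)] / context_counts[ctx]

# Toy training data: alternating and sparse accent patterns.
counts, ctx = train_accent_ngrams([[1, 0, 1, 0, 1, 0], [1, 0, 0, 1]])
p = accent_prob(1, [0, 1, 0], counts, ctx)
```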

  10. Lexical prosody model • Since we built prosody models at the syllable level, we first decomposed the sequence of words into the corresponding sequence of syllables S using the syllabifier • The pronunciation lexicon also supplies canonical stress labels, and we have previously shown that these canonical stress labels exhibit high correlation with pitch accents • This provided us with another stream of features L. The lexical prosody sequence model then becomes p(P|W) = p(P|S,L)
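A sketch of how the two lexical streams might be derived: expand each word into syllables with canonical stress marks, yielding the syllable sequence S and stress stream L that condition p(P|S,L). The lexicon entries and syllabifications below are hypothetical stand-ins; the paper's syllabifier and lexicon are not reproduced here.

```python
# Hypothetical pronunciation lexicon: word -> [(syllable, stress)],
# with 1 marking canonical primary stress.
LEXICON = {
    "boston": [("bos", 1), ("ton", 0)],
    "radio":  [("ra", 1), ("di", 0), ("o", 0)],
    "news":   [("news", 1)],
}

def lexical_streams(words):
    """Expand a word sequence W into the syllable sequence S and the
    canonical stress stream L used to condition p(P | S, L)."""
    syllables, stress = [], []
    for w in words:
        for syl, s in LEXICON[w.lower()]:
            syllables.append(syl)
            stress.append(s)
    return syllables, stress

S, L = lexical_streams(["Boston", "Radio", "News"])
```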

  11. Experimental results

  12. Conclusions • In this paper, we presented an N-best re-ranking scheme using a prosody model that was decoupled from the main ASR system • The re-ranking method achieved a modest but significant relative WER reduction of 1.3% compared to the baseline recognition system
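The decoupled re-ranking scheme can be sketched as a weighted log-linear combination of the recognizer score and a prosody-model score over each N-best hypothesis. The interpolation weight and the toy scores below are illustrative, not the paper's values; in practice the weight would be tuned on held-out data.

```python
def rescore_nbest(hypotheses, prosody_score, weight=0.1):
    """Re-rank an N-best list of (words, asr_log_score) pairs by adding
    a weighted prosody-model log score. `prosody_score` stands in for
    the combined acoustic-prosodic / sequence / lexical prosody models."""
    rescored = [
        (asr_logp + weight * prosody_score(words), words)
        for words, asr_logp in hypotheses
    ]
    rescored.sort(reverse=True)          # best combined score first
    return [words for _, words in rescored]

# Toy example: the prosody model prefers the second hypothesis strongly
# enough to flip the recognizer's original ranking.
nbest = [("the bostin news", -10.0), ("the boston news", -10.5)]
best = rescore_nbest(nbest, lambda w: 0.0 if "bostin" in w else 10.0)
```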

  13. Maximum Entropy Confidence Estimation for Speech Recognition C. White, JHU J. Droppo, A. Acero, J. Odell, Microsoft Research ICASSP 2007 Reporter: Shih-Hung Liu 2007/05/14

  14. Outline • Introduction • Baseline System • Data set • Observation Selection • GMM Baseline • Maximum Entropy System • A Simple ME System • Improved Results with Binning • Quadratic Observation Vector • Incorporating Augmented Features • Experiments • Conclusions

  15. Introduction • For many automatic speech recognition (ASR) applications, it is useful to predict the likelihood that the recognized string contains an error • The standard confidence estimation design consists of a classifier that predicts the probability of error using several observations taken from the recognition lattice emitted by the ASR engine • If a rich lattice is available, it can be renormalized to provide a good confidence estimate

  16. Introduction • The first improvement provides significant gains in overall accuracy, as well as good generalization behavior. This is accomplished with the introduction of a maximum entropy classifier • The second improvement allows the system to provide good confidence estimates, even when a rich recognition lattice is not available • The solution presented here is to produce alternate features designed to contain information similar to what has been pruned from the lattice

  17. Baseline System • Our goal was to build a system that generates good confidence estimates. This means that it should work transparently across a variety of recognition grammars. • It should be robust to duration, speaker, channel, and other irrelevant factors • We merged existing data to construct a new corpus. • It contains over 250,000 utterances pulled from source corpora covering different acoustic channels, additive noise, and accents

  18. Observation Selection • The slide tabulates the candidate observations, with lattice features denoted with an *, augmented-set features with a **, features used in the ‘Unq’ case marked with a ‘U’, and those in the ‘Alt’ case with an ‘A’

  19. GMM Baseline • The baseline system consists of two GMMs, one that models correctly recognized utterances (c) and one that models incorrectly recognized utterances (i) • Both c and i models use a full covariance matrix and have been trained using the expectation maximization (EM) algorithm
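The GMM baseline turns the two class-conditional models into a confidence score via Bayes' rule: P(correct | x) from the densities of the "correct" and "incorrect" models. In this sketch each EM-trained mixture is reduced to a single full-covariance Gaussian, and the parameters are illustrative, not the paper's.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log density of a full-covariance Gaussian (one component stands
    in for each class-conditional GMM here)."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def confidence(x, params_c, params_i, prior_c=0.5):
    """P(correct | x) from the two class-conditional models,
    computed in the log domain for numerical stability."""
    lc = gauss_logpdf(x, *params_c) + np.log(prior_c)
    li = gauss_logpdf(x, *params_i) + np.log(1 - prior_c)
    m = max(lc, li)
    pc = np.exp(lc - m)
    return pc / (pc + np.exp(li - m))

# Illustrative 2-D parameters (not from the paper).
correct = (np.array([1.0, 1.0]), np.eye(2))
incorrect = (np.array([-1.0, -1.0]), np.eye(2))
p = confidence(np.array([0.9, 1.1]), correct, incorrect)
```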

  20. Maximum Entropy System • Our model is of the form p(y|x). Here, y is a discrete random variable representing the class ‘correct’ or ‘incorrect’, and x is a vector of discrete or continuous random variables

  21. A Simple ME System (11c) • There are four feature functions created for each dimension of the observation vector • Because our trainer does not accept negative features, we create symmetric features based on whether the original observation was positive or negative • For each of these, another pair of symmetric features is created: one for the correct class, and one for the incorrect class • After adding 1 indicator feature for each class to build a truth-based prior, there are a total of 42 and 46 features for the ‘Unq’ and ‘Alt’ cases respectively
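The sign-split scheme above can be sketched as follows: each observation dimension yields a positive-part and a negative-part feature (keeping all feature values non-negative), each paired with one class, plus a per-class indicator. Across both classes this gives 4 features per dimension plus 2, matching the 42/46 counts for 10/11 dimensions; the key layout is an assumption of this sketch.

```python
def symmetric_features(x, y):
    """Build the non-negative feature dict for one observation vector x
    (a list of floats) and one class y in {"correct", "incorrect"}.
    Per class: a pos-part and neg-part feature per dimension, plus one
    class indicator; summing over both classes gives 4*D + 2 features."""
    feats = {}
    for d, v in enumerate(x):
        feats[(d, "pos", y)] = max(v, 0.0)   # nonzero only when v > 0
        feats[(d, "neg", y)] = max(-v, 0.0)  # nonzero only when v < 0
    feats[("prior", y)] = 1.0                # class indicator / prior
    return feats

f = symmetric_features([0.5, -2.0], "correct")
```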

  22. Improved Results with Binning (11b) • This system uses the base set of 10 and 11 observation dimensions. But, instead of using features that are linear functions of the observations, it creates a set of histogram-based binary features • As a result, they allow the model to take advantage of nonlinear relationships in the data • These features are created by sorting each of the observation dimensions by value and creating bins based on a uniform-occupancy partitioning • With a maximum of 100 bins (chosen experimentally) and a minimum occupancy of 100, there were 2246 and 1984 binary MaxEnt feature functions for ‘Alt’ and ‘Unq’ respectively
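The uniform-occupancy binning step can be sketched as cutting each sorted observation dimension at equal-occupancy quantiles, capped at the maximum bin count; each training point then fires exactly one binary feature per dimension. The exact edge-placement rule is an assumption of this sketch.

```python
import numpy as np

def uniform_occupancy_bins(values, max_bins=100, min_occupancy=100):
    """Cut one observation dimension into bins holding roughly equal
    numbers of training points, with at most max_bins bins and at
    least min_occupancy points per bin. Returns the interior edges."""
    v = np.sort(np.asarray(values, dtype=float))
    n_bins = max(1, min(max_bins, len(v) // min_occupancy))
    # Interior bin edges at uniform-occupancy quantiles.
    return np.quantile(v, np.linspace(0, 1, n_bins + 1)[1:-1])

def bin_feature(x, edges):
    """Index of the single binary histogram feature that fires for x."""
    return int(np.searchsorted(edges, x))

# 1000 training values with min occupancy 100 -> 10 bins, 9 edges.
edges = uniform_occupancy_bins(np.arange(1000))
```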

  23. Quadratic Observation Vector (121b) • This system attempts to mimic the full covariance aspect of the GMM system • Instead of the base set of 10 and 11 observation dimensions, it uses the outer product consisting of 100 and 121 dimensions • After binning, with minimum and maximum occupancy set as above, there were 26,414 and 21,556 features in the two systems
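The quadratic expansion above replaces the base observation vector with its flattened outer product, so D base dimensions become D*D (10 → 100, 11 → 121), letting linear MaxEnt features capture pairwise, covariance-like interactions before binning.

```python
import numpy as np

def quadratic_observations(x):
    """Flattened outer product x xᵀ of the base observation vector:
    every pairwise product x_i * x_j becomes its own dimension."""
    x = np.asarray(x, dtype=float)
    return np.outer(x, x).ravel()

q = quadratic_observations(np.ones(11))   # 11 dims -> 121 dims
```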

  24. Incorporating Augmented Features (11+b) • This system augments the original feature set with additional lattice-based observations • Most of the lattices generated by our engine on this test have a very small depth, with only 1 or 2 alternates • This system has 14 observation dimensions for both cases, producing approximately 2800 features after binning as above

  25. Experiments

  26. Experiments

  27. Conclusions • This paper describes how a maximum entropy model can be used to generate confidence scores for a speech recognition engine on an array of grammars • Results on an evaluation set of 25,991 examples that span 280 grammars demonstrate that the methods of observation selection, feature generation, and model training in this paper provide a significant improvement over a standard baseline
