
Presentation Transcript


  1. IMPROVED VARIABLE PRESELECTION LIST LENGTH ESTIMATION USING NNs IN A LARGE VOCABULARY TELEPHONE SPEECH RECOGNITION SYSTEM

J. Macías-Guarasa, J. Ferreiros, J. Colás, A. Gallardo-Antolín and J.M. Pardo
Grupo de Tecnología del Habla, Universidad Politécnica de Madrid, Spain
ICSLP'2000, Beijing (China)

SUMMARY

• Previous work on this topic (EUROSPEECH'99):
  • Flexible large vocabulary (up to 10000 words)
  • Speaker-independent, isolated-word, telephone speech
  • Two-stage, bottom-up strategy
  • Neural networks as a novel approach to estimating the preselection list length
  • Post-processing methods applied to the neural network output for the final estimation
• In this paper:
  • Increased parameter inventory
  • New validation scheme
  • Extensive testing, with improved results

MOTIVATION

• The verification module is computationally expensive
• Idea: reduce the preselection list length (PLL)
• Difficult, especially with low acoustic detail
• Estimate a different PLL for every word
• Methods: parametric, non-parametric, ...
• To keep in mind:
  • Computational demands do not depend linearly on the PLL
  • Final savings must take both modules into account
  • Only average estimations are possible

SYSTEM ARCHITECTURE

[Block diagram: Speech → Preprocessing & VQ processes (VQ codebooks) → Phonetic String Build-Up → Lexical Access (dictionary; produces the preselection list, phonetic string indexes and alignment costs) → Detailed Matching / Verification Module (HMMs, durations) → Text. A List Length Estimator takes the input parameters and passes the estimated list length to the Hypothesis Generator. Example for a vocabulary of 600 words: listLength(0) = 100, listLength(1) = 200, listLength(2) = 300, listLength(3) = 400, listLength(4) = 500, listLength(5) = 600.]

NN-BASED PLL

• Output coding:
  • Each output neuron corresponds to a different list-length segment
  • Problem: inhomogeneous number of activations per output
  • Solution: train the segment length distribution (Table 1 and Figure 2)

NN DESIGN

• Traditional MLP with one hidden layer
• Final topology: 8 inputs - 5 hidden - 10 outputs
• Trained with backpropagation: enough data is available
• Input parameters:
  • Direct parameters
  • Derived parameters: direct parameters, normalised
  • Lexical access statistical parameters, calculated over the distribution of lexical access costs
• Input coding: max/min normalisation, with/without clipping, single and multiple neurons, linear/nonlinear mapping, etc.

PARAMETER SELECTION

• Selected according to results in a discrimination task (1st position vs. the rest)
• Best absolute results: multiple input neurons, nonlinear mapping
• No significant differences with a single input neuron
• 8 final parameters selected

POST-PROCESSING OF THE NN ESTIMATION

• The network output is post-processed to increase robustness
• Two alternatives:
  • The winner output neuron decides (WINNER)
  • Linear combination of normalised activations (SUMMA):
    estimatedLength = Σi neuronLength(i) × normAct(i)
    where neuronLength(i) is the upper limit of neuron i (Table 1) and normAct(i) is the normalised activation of that neuron
• Additionally, a fixed (-FX) or proportional (-PP) threshold can be added (trained to achieve a certain IER)
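Read together, the NN DESIGN and POST-PROCESSING boxes describe a complete estimator. The sketch below, in Python with numpy, is an illustrative reconstruction, not the authors' code: the segment boundaries in SEGMENT_UPPER, the random weights, and the function names are all hypothetical, since the trained segment length distribution (Table 1) is not preserved in this transcript.

```python
# Minimal sketch of the poster's 8-5-10 MLP and its two
# post-processing rules. SEGMENT_UPPER and the weights are
# hypothetical placeholders, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Upper list-length limit assigned to each of the 10 output neurons
# (vocabulary of 10000 words; the last value reflects the poster's
# observation that the last neuron needs a length > 5000).
SEGMENT_UPPER = np.array(
    [250, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000, 10000],
    dtype=float,
)

def mlp_forward(x, w1, b1, w2, b2):
    """One hidden layer (8 -> 5 -> 10), the poster's final topology."""
    h = np.tanh(x @ w1 + b1)          # 5 hidden units
    z = h @ w2 + b2                   # 10 output units
    e = np.exp(z - z.max())           # softmax-style normalisation
    return e / e.sum()

def winner(act):
    """WINNER: the most active output neuron decides the segment."""
    return SEGMENT_UPPER[int(np.argmax(act))]

def summa(act, fixed=0.0, prop=0.0):
    """SUMMA: estimatedLength = sum_i neuronLength(i) * normAct(i),
    optionally enlarged by a fixed (-FX) or proportional (-PP)
    threshold, trained in the paper to reach a target IER."""
    norm_act = act / act.sum()
    est = float(SEGMENT_UPPER @ norm_act)
    return min(est * (1.0 + prop) + fixed, SEGMENT_UPPER[-1])

# Untrained random weights, only so the sketch runs end to end;
# the real network was trained with backpropagation.
w1, b1 = rng.normal(size=(8, 5)), np.zeros(5)
w2, b2 = rng.normal(size=(5, 10)), np.zeros(10)
x = rng.normal(size=8)                # the 8 selected input parameters
act = mlp_forward(x, w1, b1, w2, b2)
print(winner(act), summa(act, fixed=200.0))
```

With trained weights and the trained segment boundaries in place of the placeholders, winner() corresponds to the WINNER rule and summa() to the SUMMA rule with its -FX/-PP variants.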
EXPERIMENTAL SETUP

• VESTEL database, a realistic telephone speech corpus:
  • PRNOK5TR: 5810 utterances in the training set
  • PERFDV: 2502 utterances in testing set 1 (vocabulary dependent)
  • PEIV1000: 1434 utterances in testing set 2 (vocabulary independent)
• Vocabulary composed of 10000 words
• Experimental alternatives:
  • Different output distribution codings
  • Different preselection list length estimation methods
  • Different post-processing and optimisation methods for this estimation

BASELINE SYSTEM

• Target: 2% inclusion error rate (IER) + 90% pruning
• SCHMM: 23+2 automatically clustered, context-independent, phoneme-like units
• Fixed-length preselection lists

FINAL NN-BASED SYSTEM

• MLP: 8 inputs - 5 hidden - 10 outputs
• Standard input coding and normalisation
• Nonlinear output coding
• Additional control parameters:
  • Segment length assigned to the last output neuron (0, 2500, 5000 or 10000) → G (during training), B (during test)
  • Objective inclusion rate in the threshold estimation process (98%, 98.5%, 99% and 99.5%) → T

EVALUATION STRATEGY

• How to compare the systems?
  • The NN system yields a single point in the (average effort × inclusion rate) space
  • The fixed-length system generates a full inclusion-rate curve
• Alternatives:
  • Use the single point:
    • If it improves on both axes: OK
    • If not: ??? Sensitivity? Spuriousness?
  • Extend the analysis (a toy interpolation sketch follows at the end of this transcript):
    • Use the estimated thresholds to build an artificial inclusion-rate histogram around the area of interest (96.5% - 99%)
    • Compare each point in this range with the fixed-list-length inclusion-rate curve
    • Combine the comparisons on both axes:
      • Inclusion-rate improvement at the same average effort
      • Average-effort reduction at the same inclusion rate

BEST EXPERIMENT

[Figure not preserved in this transcript]

RESULTS (I)

• Quantitative improvements:
  • Typical: 7-8% for PERFDV, 18-28% for PRNOK and PEIV1000
  • Maximum: 10% for PERFDV, 32% for PRNOK and PEIV1000
• Statistical confidence:
  • We do not have enough data to prove that the results are statistically significant
  • All confidence bands overlap: the database is small and the inclusion rates are very high
  • But: the improvements hold over a wide range of operating points

RESULTS (II)

• WINNER method: lacks precision in discrimination
• SUMMA method: good results!
  • SUMMA plus FX improves on the fixed-length system in almost all cases
  • High values for the last neuron length are needed (>5000)
• Relative improvements (for the 10 best experiments, selected on training-set results): [table not preserved in this transcript]

CONCLUSIONS AND FUTURE WORK

• NNs are a suitable strategy for variable preselection list length estimation
• Improvements (both in IER and average effort):
  • Typical: 7-8% for PERFDV, 18-28% for PRNOK and PEIV1000
  • Maximum: 10% for PERFDV, 32% for PRNOK and PEIV1000
• Future work:
  • Bigger databases :-)
  • Further exploiting the parameter inventory
  • Extensibility to other architectures
  • What happens if a single output neuron is used?
  • Application to word confidence estimation tasks
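As a companion to the EVALUATION STRATEGY box, here is a toy version of the two combined comparisons. The fixed-length curve values and the NN operating point below are invented for illustration; the real curves come from the PRNOK5TR / PERFDV / PEIV1000 runs and are not preserved in this transcript.

```python
# Toy sketch: the NN system gives one (average effort, inclusion rate)
# point, the fixed-length system gives a curve, and the two axes are
# compared by linearly interpolating that curve. All numbers invented.
import numpy as np

# Fixed-length system: preselection list length vs. inclusion rate.
fixed_len  = np.array([500.0, 1000.0, 2000.0, 3000.0, 5000.0])
fixed_incl = np.array([0.955, 0.970, 0.980, 0.985, 0.990])

def effort_reduction(nn_effort, nn_incl):
    """Average-effort reduction (%) vs. the fixed list length that
    reaches the same inclusion rate."""
    same_incl_len = np.interp(nn_incl, fixed_incl, fixed_len)
    return 100.0 * (same_incl_len - nn_effort) / same_incl_len

def inclusion_gain(nn_effort, nn_incl):
    """Absolute inclusion-rate gain at the same average effort."""
    return nn_incl - np.interp(nn_effort, fixed_len, fixed_incl)

# Hypothetical NN operating point: average list length 1500 at 98%.
print(f"effort reduction: {effort_reduction(1500.0, 0.980):.1f}%")
print(f"inclusion gain:   {inclusion_gain(1500.0, 0.980):+.4f}")
```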
