Broadcast News Training Experiments

Anand Venkataraman, Dimitra Vergyri, Wen Wang, Ramana Rao Gadde, Martin Graciarena, Horacio Franco, Jing Zheng and Andreas Stolcke
Speech Technology & Research Laboratory
SRI International, Menlo Park, CA
EARS STT Workshop

Goals
• Assess the effect of TDT-4 data (not previously used) on the SRI BN system.
• Explore alternatives for the use of closed-caption (CC) transcripts for acoustic and LM training.
• Specifically, investigate an algorithm for “repairing” inaccuracies in CC transcripts.
• Run an initial test of the voicing feature front end (originally developed for CTS) on BN.

Talk Overview
• BN training on TDT-4 CC data
• Generation of raw transcripts
• Waveform segmentation
• Transcript repair with FlexAlign
• FlexAlign output for LM training
• Effect of amount of training data
• Comparison with CUED TDT-4 transcripts
• Ongoing effort on voicing features for BN acoustic modeling

TDT-4 Training: Generation of Waveform Segments and Reference Transcripts
• References were assumed to be delimited by <TEXT> and </TEXT> in the LDC transcripts.
• The speech signal was cut using the time marks extracted from the <DOC> tags surrounding the TEXT elements.
• Long waveforms were identified and recut at progressively shorter pauses until all waveforms were 30 s or shorter.
• Forced alignment used PTM acoustic models that did not require speaker-level normalization.
• The alignment was a “flexible” forced alignment (see next slide).
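The recutting step can be sketched as follows. This is only an illustration under stated assumptions: pauses are taken to be available as (start, end) times from an earlier alignment, the function name `recut` is hypothetical, and cutting at the midpoint of the longest remaining pause is one plausible reading of "progressively shorter pauses", not a detail confirmed by the slide.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def recut(seg: Segment, pauses: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Split seg at its longest internal pause until every piece is <= max_len."""
    start, end = seg
    if end - start <= max_len:
        return [seg]
    # Pauses lying strictly inside the segment are the candidate cut points.
    inside = [p for p in pauses if p[0] > start and p[1] < end]
    if not inside:
        return [seg]  # no pause left to cut at; the long segment is kept as is
    longest = max(inside, key=lambda p: p[1] - p[0])
    cut = (longest[0] + longest[1]) / 2.0  # assumption: cut in the middle of the pause
    return recut((start, cut), pauses, max_len) + recut((cut, end), pauses, max_len)

# Toy example with made-up pause times.
pieces = recut((0.0, 70.0), [(20.0, 21.0), (40.0, 40.2), (55.0, 55.1)])
```

Because the longest pause is tried first and shorter ones only when a piece is still too long, cuts land at the most reliable boundaries first.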

FlexAlign
• Special lattices were generated for each segment.
• Each word was preceded by an optional pause and an optional nonlexical word model.
• The goal was to simultaneously delete noisy or mistranscribed text and insert disfluencies.

Optional Nonlexical Word
Transition probabilities were approximated by the unigram relative frequencies in the 96/97 BN acoustic training corpus.
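A minimal sketch of such a lattice, assuming a simple arc-list representation. The labels `<pause>` and `<eps>`, the function names, and the toy counts are all hypothetical; the slides do not specify the actual lattice format, how the pause arcs were weighted, or how deletion of mistranscribed words was modeled, so those parts are left out or unweighted here.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str, float]  # (from_state, to_state, label, probability)

def nonlexical_probs(counts: Dict[str, int], total_tokens: int) -> Dict[str, float]:
    """Unigram relative frequency of each nonlexical token in a training corpus."""
    return {tok: c / total_tokens for tok, c in counts.items()}

def build_lattice(words: List[str], nonlex: Dict[str, float]) -> List[Arc]:
    """Per word: s -(pause|eps)-> s+1 -(nonlexical|eps)-> s+2 -(word)-> s+3."""
    arcs: List[Arc] = []
    skip = 1.0 - sum(nonlex.values())  # probability of inserting no nonlexical token
    state = 0
    for word in words:
        # Optional pause; left unweighted (1.0) because the slides give no value.
        arcs.append((state, state + 1, "<pause>", 1.0))
        arcs.append((state, state + 1, "<eps>", 1.0))
        # Optional nonlexical word, weighted by its unigram relative frequency.
        for tok, p in nonlex.items():
            arcs.append((state + 1, state + 2, tok, p))
        arcs.append((state + 1, state + 2, "<eps>", skip))
        # The transcript word itself.
        arcs.append((state + 2, state + 3, word, 1.0))
        state += 3
    return arcs

# Toy counts for illustration only; not taken from the 96/97 BN corpus.
probs = nonlexical_probs({"uh": 50, "um": 50}, total_tokens=10000)
lattice = build_lattice(["hello", "world"], probs)
```

A recognizer decoding over such a graph can choose, word by word, whether a pause or a filler such as "uh" precedes the transcribed word.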

Training Procedure
• The final reference transcripts were the output of the recognizer on the FlexAlign lattices.
• WER with respect to the original CC transcripts: 5.0% (Sub 0.4, Ins 4.4, Del 0.3)
• Standard acoustic models were built using Viterbi training on these transcripts.
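For readers unfamiliar with the Sub/Ins/Del breakdown, the sketch below counts the three error types with a standard minimum-edit-distance alignment. It is a generic illustration, not the scoring tool used for the numbers above; the function name `wer_counts` is hypothetical.

```python
from typing import List, Tuple

def wer_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, insertions, deletions) of a minimum-edit alignment."""
    # cost[i][j] = (total_errors, subs, ins, dels) for ref[:i] versus hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # all reference words deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # all hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, n, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, n, d)          # match, no new error
            else:
                diag = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = cost[i][j - 1]
            ins = (t + 1, s, n + 1, d)       # extra word in hypothesis
            t, s, n, d = cost[i - 1][j]
            dele = (t + 1, s, n, d + 1)      # reference word missing
            cost[i][j] = min(diag, ins, dele)
    _, subs, ins_n, dels = cost[len(ref)][len(hyp)]
    return subs, ins_n, dels

# Reference = original CC text, hypothesis = repaired transcript (toy strings).
subs, ins_n, dels = wer_counts("the cat sat".split(), "the uh cat sat".split())
wer = (subs + ins_n + dels) / 3  # errors divided by the number of reference words
```

With the CC transcript as reference, a disfluency added by the repair step counts as an insertion, as in the toy example.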

Does FlexAlignment Help LM Training?
“Subset”: a random selection of the original CC references, chosen to match the token count of the FlexAlign transcripts.
Note: the only disfluency in the test data was “uh”.
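One way to build a token-matched “Subset” is sketched below. It assumes that selection is done per sentence and that tokens are whitespace-separated; the slide specifies neither, and the function name is hypothetical.

```python
import random
from typing import List

def sample_to_token_count(sentences: List[str], target_tokens: int, seed: int = 0) -> List[str]:
    """Randomly pick sentences until their token total reaches target_tokens."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    order = list(sentences)
    rng.shuffle(order)
    subset: List[str] = []
    total = 0
    for sent in order:
        if total >= target_tokens:
            break
        subset.append(sent)
        total += len(sent.split())  # assumption: whitespace tokenization
    return subset

# Toy data: ten three-token sentences, target of nine tokens.
cc_refs = [f"w{i} x y" for i in range(10)]
subset = sample_to_token_count(cc_refs, target_tokens=9)
```

Matching token counts keeps the LM comparison from being confounded by the amount of training text.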

An Accidental Experiment
What happens if we train on only a subset of the data? Is the performance proportionately worse?

Comparison with CUED TDT-4 Training Transcripts
• CUED TDT-4 transcripts were generated by an STT system with a biased LM (trained on TDT-4 CC).
• CUED transcripts were generated from CU word time information and SRI waveform segments.
• CUED transcripts sometimes have “holes” in them where our waveform segments span more than one of the CUED waveforms (probably due to ad removal).
• WER with respect to CC transcriptions:
  Originals: 18.2% (Sub 7.7, Ins 3.2, Del 7.2)
  Flex-align: 19.5% (Sub 10.1, Ins 3.8, Del 5.6)
• A fairer comparison ought to use CUED transcripts with CUED segments for training the acoustic models, so take results with a grain of salt!

Results of First Decoding Pass

Multi-pass System Results
• The multi-pass system used a new decoding strategy (described in a later talk).
• But: MFC instead of PLP, and no SAT normalization in training (to save time).

Voicing Features
• Test the voicing features developed for the CTS system on BN STT (cf. Martigny talk).
• On CTS, they gave a 2% relative error reduction across stages.
• Use the peak of the autocorrelation and the entropy of the higher-order cepstrum.
• Use a window of 5 frames of the two voicing features.
• Juxtapose MFCC plus deltas and double deltas with the window of voicing features.
• Apply dimensionality reduction with HLDA. The final feature vector has 39 dimensions.
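The feature assembly described above can be sketched as follows. Several details are assumptions: the cepstral vector is taken to have 39 dimensions (13 coefficients with deltas and double deltas), edge frames are repeated at utterance boundaries, and the HLDA transform is treated as an already-estimated projection matrix. Estimating HLDA itself is outside this sketch, and all function names are hypothetical.

```python
from typing import List

Frame = List[float]

def stack_voicing(voicing: List[Frame], t: int, window: int = 5) -> Frame:
    """Concatenate the voicing features of `window` frames centred on frame t."""
    half = window // 2
    out: Frame = []
    for k in range(t - half, t + half + 1):
        k = min(max(k, 0), len(voicing) - 1)  # assumption: repeat edge frames
        out.extend(voicing[k])
    return out

def project(vec: Frame, matrix: List[Frame]) -> Frame:
    """Apply a linear projection (one matrix row per output dimension)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def frame_features(cepstra: List[Frame], voicing: List[Frame], t: int,
                   hlda: List[Frame]) -> Frame:
    """Juxtapose cepstral and windowed voicing features, then reduce dimension."""
    return project(cepstra[t] + stack_voicing(voicing, t), hlda)

# Toy input: 6 frames, 39 cepstral dims and 2 voicing dims per frame.
cepstra = [[float(i)] * 39 for i in range(6)]
voicing = [[0.1 * i, 0.2 * i] for i in range(6)]
# Placeholder 39 x 49 matrix that simply keeps the first 39 dimensions.
hlda = [[1.0 if j == i else 0.0 for j in range(49)] for i in range(39)]
features = frame_features(cepstra, voicing, 2, hlda)
```

Under these assumptions the juxtaposed vector has 39 + 5 × 2 = 49 dimensions before the projection brings it back to 39.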

Voicing Features Results
• TDT-4 devtest set (results on first pass)
• Used parameters equivalent to those optimized for the CTS system.
• Need to investigate (reoptimize) front-end (FE) parameters for the higher bandwidth (BW).
• It is not clear what effect background music might have on voicing features in BN.
• Possible software issues.
• With higher-BW features, voicing features may be more redundant.

Summary
• Developed a CC transcript “repair” algorithm based on flexible alignment.
• Training on “repaired” TDT-4 transcripts gives 8.8% (1st pass) to 6.2% (multi-pass) relative improvement over Hub-4 training.
• Accidental result: leaving out 1/3 of the new data reduces the improvement only marginally.
• Transcript “repair” is not suitable for LM training (yet).
• No improvement from voicing features (yet); need to investigate parameters.

Future Work
• Redo the comparison with alternative transcripts more carefully.
• Investigate data filtering (e.g., based on reject word occurrences in FlexAlign output).
• Add the rest of the data!
• Further investigate the use of voicing features.
