
Broadcast News Training Experiments

Anand Venkataraman, Dimitra Vergyri,

Wen Wang, Ramana Rao Gadde,

Martin Graciarena, Horacio Franco

Jing Zheng and Andreas Stolcke

Speech Technology & Research Laboratory

SRI International, Menlo Park, CA

EARS STT Workshop

Goals

  • Assess the effect of TDT-4 data (not previously used) on the SRI BN system

  • Explore alternatives for the use of closed-caption (CC) transcripts for acoustic and LM training

  • Specifically, investigate an algorithm for “repairing” inaccuracies in CC transcripts

  • Initial test of the voicing feature front end on BN (originally developed for CTS)

Talk Overview

  • BN training on TDT-4 CC data

    • Generation of raw transcripts

    • Waveform segmentation

    • Transcript Repair with FlexAlign

    • FlexAlign output for LM training

    • Effect of amount of training data

    • Comparison with CUED TDT-4 transcripts

  • Ongoing effort on voicing features for BN acoustic modeling

TDT-4 Training: Generation of Waveform Segments and Reference Transcripts

  • References were assumed to be delimited by <TEXT> and </TEXT> in the LDC transcripts.

  • The speech signal was cut using the time marks extracted from the <DOC> tags surrounding the TEXT elements.

  • Long waveforms were identified and recut at progressively shorter pauses until all waveforms were 30s or shorter.

  • For forced alignment, used PTM acoustic models that did not require speaker-level normalizations.

  • Used “flexible” forced alignment (see next slide).
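The recutting step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes pause locations are already available as (start, end) times, and the pause-length thresholds and the function name `recut` are hypothetical. Only the 30 s limit comes from the slide.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def recut(segment: Segment,
          pauses: List[Segment],
          max_len: float = 30.0,
          thresholds: Tuple[float, ...] = (1.0, 0.5, 0.25, 0.1)) -> List[Segment]:
    """Split a long segment at pause midpoints, trying long pauses first.

    `thresholds` (minimum pause lengths, in seconds) are assumed values.
    Pieces that contain no usable pause may still exceed `max_len`.
    """
    pieces = [segment]
    for min_pause in thresholds:
        next_pieces = []
        for start, end in pieces:
            if end - start <= max_len:
                next_pieces.append((start, end))  # already short enough
                continue
            # Cut at the midpoint of every sufficiently long pause
            # that lies strictly inside this piece.
            cuts = sorted((p0 + p1) / 2 for p0, p1 in pauses
                          if p1 - p0 >= min_pause and start < p0 and p1 < end)
            bounds = [start] + cuts + [end]
            next_pieces.extend(zip(bounds[:-1], bounds[1:]))
        pieces = next_pieces
        if all(e - s <= max_len for s, e in pieces):
            break
    return pieces


pieces = recut((0.0, 70.0), [(20.0, 21.2), (45.0, 45.3), (58.0, 58.6)])
print(pieces)
```

Trying long pauses first keeps cuts at the most natural boundaries and only falls back to shorter pauses for pieces that are still too long.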

Flexible Alignment

  • Special lattices were generated for each segment.

  • Each word was preceded by an optional pause and an optional nonlexical word model.

  • Goal was to simultaneously delete noisy or mistranscribed text and insert disfluencies.
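One way to picture such a lattice is as a small graph with skip arcs. The sketch below is an assumption-laden illustration: the arc representation, the labels `<pause>` and `<reject>`, and the function names are hypothetical. The slide also does not say how word deletion is realized, so the optional skip arc over each word (`allow_word_skip`) is an assumption made purely for illustration.

```python
from typing import List, Tuple

Arc = Tuple[int, int, str]  # (from_node, to_node, label)
EPS = "<eps>"               # epsilon label: arc consumes no token


def build_flex_lattice(words: List[str],
                       nonlexical: List[str],
                       pause: str = "<pause>",
                       allow_word_skip: bool = True) -> Tuple[List[Arc], int]:
    """Build a linear lattice: [pause] [nonlexical] word, per transcript word."""
    arcs: List[Arc] = []
    node = 0
    for word in words:
        a, b, c = node + 1, node + 2, node + 3
        arcs += [(node, a, pause), (node, a, EPS)]   # optional pause
        arcs += [(a, b, nl) for nl in nonlexical]    # optional nonlexical word
        arcs.append((a, b, EPS))
        arcs.append((b, c, word))                    # the transcript word
        if allow_word_skip:                          # ASSUMPTION: deletion via skip arc
            arcs.append((b, c, EPS))
        node = c
    return arcs, node


def accepts(arcs: List[Arc], final: int, tokens: List[str]) -> bool:
    """Check whether a token sequence is a complete path through the lattice."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for src, dst, lab in arcs:
                if src == s and lab == EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({0})
    for tok in tokens:
        current = closure({dst for src, dst, lab in arcs
                           if src in current and lab == tok})
        if not current:
            return False
    return final in current


arcs, final = build_flex_lattice("the president said".split(), ["uh", "<reject>"])
print(accepts(arcs, final, "the uh president said".split()))
```

A recognizer decoding over a lattice of this shape can keep the caption word, drop it, or insert a pause or filler before it, which matches the stated goal of deleting bad text and inserting disfluencies in one pass.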

Optional Nonlexical Word

Transition probabilities were approximated by the unigram relative frequencies in the 96/97 BN acoustic training corpus.
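As a rough illustration of that approximation, the probability of entering each nonlexical word can be estimated as its count divided by the total number of tokens in a training transcript. The function name, the `<skip>` key, and the toy token list are hypothetical; the actual corpus statistics are not given in the slides.

```python
from collections import Counter
from typing import Dict, Iterable, List


def nonlexical_transition_probs(corpus_tokens: List[str],
                                nonlexical: Iterable[str]) -> Dict[str, float]:
    """Unigram relative frequency of each nonlexical word, plus the skip mass."""
    nonlexical = list(nonlexical)
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    probs = {w: counts[w] / total for w in nonlexical}
    # Remaining probability mass goes to taking no nonlexical word at all.
    probs["<skip>"] = 1.0 - sum(probs.values())
    return probs


tokens = "the uh president uh said um today the".split()  # toy data
print(nonlexical_transition_probs(tokens, ["uh", "um"]))
```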

Training Procedure

  • Final references were the output of the recognizer on the FlexAlign lattices.

  • WER with respect to the original CC transcripts:

    5.0% (Sub 0.4, Ins 4.4, Del 0.3)

  • Standard acoustic models were built using Viterbi training on these transcripts.
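A WER breakdown of the kind quoted above (substitutions, insertions, deletions) is conventionally computed from a minimum-edit-distance alignment. The sketch below is a generic version of that computation, not the scoring tool used by the authors; it treats the CC transcript as the reference and the FlexAlign output as the hypothesis, so an inserted "uh" counts as an insertion.

```python
from typing import Dict, List


def wer_breakdown(ref: List[str], hyp: List[str]) -> Dict[str, float]:
    """Return substitution/insertion/deletion counts and WER (percent of ref words)."""
    if not ref:
        raise ValueError("reference must not be empty")
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, subs, ins, dels) for aligning ref[:i] with hyp[:j]
    cost = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, 0, i)          # delete all reference words
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, j, 0)          # insert all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, a, d = cost[i - 1][j - 1]
            diag = (e, s, a, d) if ref[i - 1] == hyp[j - 1] else (e + 1, s + 1, a, d)
            e, s, a, d = cost[i][j - 1]
            ins = (e + 1, s, a + 1, d)
            e, s, a, d = cost[i - 1][j]
            dele = (e + 1, s, a, d + 1)
            # Tuples compare on total edits first, so the edit count is minimal.
            cost[i][j] = min(diag, ins, dele)
    edits, subs, inss, dels = cost[n][m]
    return {"sub": subs, "ins": inss, "del": dels, "wer": 100.0 * edits / n}


print(wer_breakdown("the president said".split(), "the uh president said".split()))
```

When several alignments have the same total number of edits, the split between substitutions and insertion/deletion pairs depends on tie-breaking, so individual counts can differ slightly between scoring tools.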

Does FlexAlignment Help LM Training?

“Subset”: Random selection of original CC references to match token count of FlexAlign transcripts.

Note: The only disfluency in the test data was “uh”.
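The token-matched “Subset” condition could be built along the following lines. This is a sketch under assumptions: sampling is done at the sentence level, tokens are whitespace-separated, and the function name and seed are hypothetical; the slides do not describe the actual selection procedure beyond it being random.

```python
import random
from typing import List, Tuple


def sample_to_token_budget(sentences: List[str],
                           target_tokens: int,
                           seed: int = 0) -> Tuple[List[str], int]:
    """Randomly pick sentences until the token budget is (nearly) used up."""
    rng = random.Random(seed)      # fixed seed keeps the selection reproducible
    order = list(sentences)
    rng.shuffle(order)
    subset, total = [], 0
    for sent in order:
        n = len(sent.split())
        if total + n > target_tokens:
            continue               # would overshoot; try a shorter sentence
        subset.append(sent)
        total += n
    return subset, total


cc_refs = ["a b c", "d e", "f", "g h i j"]  # toy stand-in for CC references
subset, total = sample_to_token_budget(cc_refs, target_tokens=5)
print(subset, total)
```

Matching token counts this way lets the LM comparison isolate the effect of the transcript content from the effect of simply having more or less text.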

An Accidental Experiment

What happens if we train on only a subset of the data? Is the performance proportionately worse?

Comparison with CUED TDT-4 Training Transcripts

  • CUED TDT-4 transcripts were generated by an STT system with a biased LM (trained on TDT-4 CC).

  • CUED transcripts were generated from CU word time information and SRI waveform segments.

  • CUED transcripts sometimes have “holes” in them where our waveform segments span more than one of the CUED waveforms (probably due to ad removal).

  • WER with respect to CC transcriptions:

    Originals: 18.2% (Sub 7.7, Ins 3.2, Del 7.2)

    FlexAlign: 19.5% (Sub 10.1, Ins 3.8, Del 5.6)

  • A fairer comparison ought to use CUED transcripts with CUED segments for training the acoustic models, so take results with a grain of salt!

Multi-pass System Results

  • Multi-pass system used new decoding strategy (described in later talk).

  • But: MFC instead of PLP, and no SAT normalization in training (to save time).

Voicing Features

  • Test voicing features, developed for the CTS system, on BN STT (cf. Martigny talk)

    • At that time, we obtained a 2% relative error reduction across stages

  • Use peak of autocorrelation and entropy of higher-order cepstrum

  • Use a window of 5 frames of the two voicing features

  • Concatenate MFCC plus deltas and double deltas with the window of voicing features

  • Apply dimensionality reduction with HLDA. Final feature vector has 39 dimensions
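The feature assembly described above can be sketched as a stacking step followed by a linear projection. Several details here are assumptions, not taken from the slides: the base cepstral vector is taken to be 39-dimensional (13 cepstra plus deltas and double deltas), edge frames are replicated at utterance boundaries, and the projection matrix is a placeholder standing in for an HLDA transform estimated elsewhere.

```python
from typing import List

Frame = List[float]


def stack_voicing(cepstra: List[Frame], voicing: List[Frame],
                  context: int = 2) -> List[Frame]:
    """Append a (2*context+1)-frame window of voicing features to each cepstral frame."""
    num_frames = len(cepstra)
    stacked = []
    for t in range(num_frames):
        window: Frame = []
        for k in range(t - context, t + context + 1):
            k = min(max(k, 0), num_frames - 1)   # ASSUMPTION: replicate edge frames
            window.extend(voicing[k])
        stacked.append(list(cepstra[t]) + window)
    return stacked


def project(frames: List[Frame], matrix: List[Frame]) -> List[Frame]:
    """Apply a linear transform (e.g. an HLDA matrix) to every frame."""
    return [[sum(m * x for m, x in zip(row, frame)) for row in matrix]
            for frame in frames]


# Toy data: 4 frames, 39 cepstral dims (assumed), 2 voicing features per frame.
cepstra = [[float(t)] * 39 for t in range(4)]
voicing = [[float(t), 10.0 + t] for t in range(4)]
stacked = stack_voicing(cepstra, voicing)       # 39 + 5 * 2 = 49 dims per frame
# Placeholder 39 x 49 matrix that just keeps the first 39 dims; NOT a real HLDA transform.
placeholder = [[1.0 if j == i else 0.0 for j in range(49)] for i in range(39)]
reduced = project(stacked, placeholder)
print(len(stacked[0]), len(reduced[0]))
```

Under these assumptions the stacked vector has 49 dimensions, and the projection brings it back to the 39 dimensions stated on the slide.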

Voicing Features Results

  • TDT-4 devtest set (results on first pass)

  • Used parameters equivalent to those optimized for the CTS system

  • Need to investigate (reoptimize) FE parameters for higher BW

  • It is not clear what effect background music might have on voicing features in BN

    • Possible software issues

    • With higher BW features, voicing features may be more redundant.

Summary

  • Developed CC transcript “repair” algorithm based on flexible alignment.

  • Training on “repaired” TDT-4 transcripts gives 8.8% (1st pass) to 6.2% (multi-pass) relative improvement over Hub-4 training.

  • Accidental result: leaving out 1/3 of new data reduces improvement only marginally.

  • Transcript “repair” not suitable for LM training (yet).

  • No improvement from voicing features (yet); need to investigate parameters.
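For readers less familiar with the convention, a relative improvement is the WER reduction expressed as a percentage of the baseline WER. The absolute WERs behind the figures above are not preserved in this transcript, so the numbers in the example are made up purely to show the arithmetic.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, in percent of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, NOT from the talk: 20.0% baseline, 18.0% after adding data.
print(relative_reduction(20.0, 18.0))
```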

Future Work

  • Redo comparison with alternative transcripts more carefully.

  • Investigate data filtering (e.g., based on reject word occurrences in FlexAlign output).

  • Add the rest of the data!

  • Further investigate the use of voicing features.
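The data-filtering idea could look something like the sketch below, which drops segments whose FlexAlign output contains too many reject words. Everything specific here is an assumption: the `<reject>` token name, the 10% threshold, and the segment representation are hypothetical, since the slides only name the idea as future work.

```python
from typing import List, Tuple

Segment = Tuple[str, List[str]]  # (segment_id, FlexAlign output words)


def filter_segments(segments: List[Segment],
                    reject_token: str = "<reject>",
                    max_reject_rate: float = 0.1) -> List[Segment]:
    """Keep segments whose fraction of reject words is at most max_reject_rate."""
    kept = []
    for seg_id, words in segments:
        if not words:
            continue  # nothing to train on
        rate = words.count(reject_token) / len(words)
        if rate <= max_reject_rate:
            kept.append((seg_id, words))
    return kept


segments = [
    ("seg1", "the president said today".split()),
    ("seg2", "<reject> <reject> weather <reject>".split()),
    ("seg3", []),
]
print([seg_id for seg_id, _ in filter_segments(segments)])
```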
