Speech recognition in MUMIS

Mirjam Wester, Judith Kessens & Helmer Strik

Intro
• Objective: automatic speech recognition of football commentaries
• SPEX transcribed two matches for two languages (Dutch and English):
  • England - Germany (Eng-Dld)
  • Yugoslavia - The Netherlands (Yug-Ned)
• Commentaries and stadium noise are mixed

Data Conversion
• SPEX transcription:
  • text grid:
    • orthographic transcription
    • chunk alignment; chunk = a segment of speech of about 2 to 3 seconds
  • CD with one large wav file
• Split according to chunk alignments
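The splitting step can be illustrated with a minimal sketch. It assumes the chunk alignments have already been read from the text grid into a list of (start, end) times in seconds; the function name `split_wav` and the output file pattern are hypothetical and not part of the MUMIS tooling.

```python
import wave

def split_wav(src_path, chunks, out_pattern="chunk_{:04d}.wav"):
    """Write one WAV file per (start_s, end_s) interval and return the paths."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start_s, end_s) in enumerate(chunks):
            first = int(round(start_s * rate))          # first frame of the chunk
            count = int(round(end_s * rate)) - first    # number of frames to copy
            src.setpos(first)
            frames = src.readframes(count)
            path = out_pattern.format(i)
            with wave.open(path, "wb") as dst:
                dst.setparams(params)    # same channels, sample width and rate
                dst.writeframes(frames)  # frame count in the header is fixed on close
            paths.append(path)
    return paths
```

Reading each interval with `setpos` and `readframes` avoids loading the whole recording into memory, which matters when the source is one large file.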

Examples of data
• Yug-Ned Dutch
• Yug-Ned English
• Eng-Dld Dutch
• Eng-Dld English

Statistics
• English matches have two commentators, Dutch only one.
• Overlapping segments have been disregarded.

Training
Dutch:
• Yug-Ned ¾ of CD (19 min speech)
• France Telecom Noise Reduction (FTNR)
English:
• Yug-Ned ¾ of CD (28 min speech)
• FTNR
For more information on the France Telecom Noise Reduction tool, see: B. Noé, J. Sienel, D. Jouvet, L. Mauuary, L. Boves, J. de Veth & F. de Wet, “Noise Reduction for Noise Robust Feature Extraction for Distributed Speech Recognition”, in Proc. of Eurospeech ’01.

Test
Dutch:
• Yug-Ned ¼ of CD
• 626 chunks, 1577 words
• lexicon and language model based on complete Yug-Ned match
English:
• Yug-Ned ¼ of CD
• 636 chunks, 2641 words
• lexicon and language model based on complete Yug-Ned match

SNR before and after FTNR tool
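How the SNR values were measured is not stated on the slide. As a rough illustration only, the sketch below estimates SNR in dB from a segment containing commentary plus stadium noise and a segment containing noise only, assuming speech and noise are uncorrelated. The function names are hypothetical.

```python
import math

def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech_plus_noise, noise_only):
    """Estimate SNR in dB as 10 * log10(P_speech / P_noise).

    Assumption: speech and noise are uncorrelated, so the speech power is
    approximated by subtracting the noise power from the mixed power.
    """
    p_noise = mean_power(noise_only)
    if p_noise == 0:
        raise ValueError("noise segment has zero power")
    p_speech = max(mean_power(speech_plus_noise) - p_noise, 1e-12)
    return 10.0 * math.log10(p_speech / p_noise)
```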

WER results for Yug-Ned before and after FTNR
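For reference, word error rate (WER) is conventionally the number of word substitutions, deletions and insertions divided by the number of reference words. The sketch below computes it with a standard edit-distance alignment; it is not the scoring tool used for these experiments, and the function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))        # distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]                           # i deletions against an empty hypothesis
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ref_word != hyp_word))) # substitution or match
        prev = cur
    return prev[-1], len(ref)

def wer(pairs):
    """WER in percent over (reference, hypothesis) pairs, pooled over all chunks."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words
```

Pooling errors and reference words over all chunks, rather than averaging per-chunk rates, keeps very short chunks from dominating the result.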

Dutch – Polyphone
• Data is phonetically rich sentences
• Phone models were trained on:
  • Polyphone all speakers
  • Polyphone male speakers
  • Polyphone male speakers + MUMIS noise
• Polyphone as bootstrap for segmentation of MUMIS material
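The slide does not say how MUMIS noise was added to the Polyphone recordings. One common approach, shown here purely as an assumption, is to scale a noise recording so that the mixture reaches a chosen SNR and then add it sample by sample. The function name `mix_at_snr` and the target SNR parameter are hypothetical.

```python
import math

def mix_at_snr(clean, noise, target_snr_db):
    """Add noise to clean speech so the mixture has the requested SNR in dB."""
    # Repeat the noise so it covers the whole clean signal, then trim it.
    reps = -(-len(clean) // len(noise))     # ceiling division
    noise = (list(noise) * reps)[:len(clean)]
    p_clean = sum(c * c for c in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # Choose the gain so that p_clean / (gain**2 * p_noise) == 10**(snr / 10).
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]
```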

Polyphone models (Dutch): Yug-Ned test set

Cross tests (Dutch & English)
• train on ¾ Yug-Ned, test on ¼ Eng-Dld
• train on ¾ Eng-Dld, test on ¼ Yug-Ned

MUMIS models (Dutch): Yug-Ned test and Eng-Dld test

MUMIS models (English): Yug-Ned test and Eng-Dld test

MUMIS models (Dutch+English): Yug-Ned test and Eng-Dld test

Function words vs. content words (table columns: word type, Dutch data, English data)
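A breakdown by word type needs some rule for deciding which tokens count as function words. The sketch below uses a small hand-written word list, which is a hypothetical stand-in and not the list used for this analysis, and reports the share of function-word and content-word tokens in a set of transcriptions.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = {"the", "a", "to", "of", "and", "is", "in",
                  "de", "het", "een", "en", "van", "naar"}

def word_type_share(transcripts, function_words=FUNCTION_WORDS):
    """Return the percentage of function-word and content-word tokens."""
    counts = Counter()
    for line in transcripts:
        for token in line.lower().split():
            kind = "function" if token in function_words else "content"
            counts[kind] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: 100.0 * counts[kind] / total for kind in ("function", "content")}
```

The same classification could be applied to the words involved in recognition errors to see which word type accounts for most of them.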

SNR vs. WER (1)

SNR vs. WER (2)

Discussion
• WERs are high
• Noise?
  • FTNR leads to higher SNR, but WERs do not improve substantially
• Not enough training data?
  • Polyphone for training/bootstrapping does not lead to lower WERs than training on MUMIS data
  • Noisifying Polyphone with MUMIS noise gives encouraging results

Discussion continued
• Function words comprise ± 50% of the data and cause a great deal of the errors
• Names are recognized very well
• Function words not necessary for information extraction (?)

Future work
• Steps to noise-robust speech recognition:
  • model/speaker adaptation
  • combinations of noisified Polyphone models and FTNR
• Other issues:
  • transcription of more data
  • English, Dutch and German
  • preference: specific games? radio? TV?
  • generic football-specific language model
  • confidence measures?

Future work continued
Questions:
• What type of output from ASR is needed?
  • word-graph
  • n-best list
  • top of the list
  • word spotting? only content words?
• For research purposes: is it possible to obtain data that has not been mixed (noise + commentary)?
