
Speech recognition in MUMIS

Speech recognition in MUMIS. Mirjam Wester, Judith Kessens & Helmer Strik. Objective: automatic speech recognition of football commentaries. SPEX transcribed two matches in two languages (Dutch and English): England - Germany (Eng-Dld) and Yugoslavia - The Netherlands (Yug-Ned).


Presentation Transcript


  1. Speech recognition in MUMIS Mirjam Wester, Judith Kessens & Helmer Strik

  2. Intro • Objective: Automatic speech recognition of football commentaries • SPEX transcribed two matches in two languages (Dutch and English): • England - Germany (Eng-Dld) and • Yugoslavia - The Netherlands (Yug-Ned) • Commentaries and stadium noise are mixed

  3. Data Conversion • SPEX transcription: • text grid: • orthographic transcription • chunk alignment; chunk = a segment of speech of about 2 to 3 seconds • CD with one large wav file • Split according to chunk alignments
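The splitting step on this slide can be sketched as follows. This is a minimal illustration, not the tooling actually used in MUMIS: it assumes the chunk alignments have already been read from the text grid into `(start_s, end_s)` pairs in seconds, and the names `chunk_to_frames` and `split_wav` are hypothetical.

```python
import os
import wave


def chunk_to_frames(start_s, end_s, framerate, n_frames):
    """Convert a (start, end) time in seconds to a frame range clamped to the file."""
    first = max(0, int(round(start_s * framerate)))
    last = min(n_frames, int(round(end_s * framerate)))
    return first, max(first, last)


def split_wav(src_path, chunks, out_dir):
    """Write one wav file per (start_s, end_s) chunk and return the output paths.

    Assumption: `chunks` comes from the chunk alignment in the transcription;
    parsing the text grid itself is not shown here.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        for i, (start_s, end_s) in enumerate(chunks):
            first, last = chunk_to_frames(
                start_s, end_s, params.framerate, params.nframes
            )
            src.setpos(first)                      # seek to the chunk start
            frames = src.readframes(last - first)  # read exactly one chunk
            out_path = os.path.join(out_dir, f"chunk_{i:04d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)              # frame count is fixed up on close
                dst.writeframes(frames)
            paths.append(out_path)
    return paths
```

A call such as `split_wav("match.wav", [(0.0, 2.4), (2.4, 5.1)], "chunks")` would write `chunks/chunk_0000.wav` and `chunks/chunk_0001.wav`; the file name and times here are made up for illustration.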

  4. Examples of data • Yug-Ned Dutch • Yug-Ned English • Eng-Dld Dutch • Eng-Dld English

  5. Statistics English matches have two commentators, Dutch only one. Overlapping segments have been disregarded.

  6. Training Dutch: • Yug-Ned ¾ of CD (19 min speech) • France Telecom Noise Reduction (FTNR) English: • Yug-Ned ¾ of CD (28 min speech) • FTNR For more information on the France Telecom Noise Reduction tool, see: B. Noé, J. Sienel, D. Jouvet, L. Mauuary, L. Boves, J. de Veth & F. de Wet, "Noise Reduction for Noise Robust Feature Extraction for Distributed Speech Recognition", in Proc. of Eurospeech '01.

  7. Test Dutch: • Yug-Ned ¼ of CD • 626 chunks, 1577 words • lexicon and language model based on complete Yug-Ned match English: • Yug-Ned ¼ of CD • 636 chunks, 2641 words • lexicon and language model based on complete Yug-Ned match

  8. SNR before and after FTNR tool
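The slides do not say how SNR was measured. One common way to estimate it, shown here only as a sketch under that assumption, is to compare the mean power of a speech segment (which still contains noise) with the mean power of a noise-only segment. The function names are hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_plus_noise, noise_only):
    """Estimate SNR in dB from a noisy-speech segment and a noise-only segment.

    Assumption: the noise level in the noise-only segment is representative
    of the noise underneath the speech.
    """
    noise = mean_power(noise_only)
    if noise == 0:
        return math.inf
    # Subtract the noise power to approximate the power of the speech alone.
    signal = max(mean_power(speech_plus_noise) - noise, 1e-12)
    return 10.0 * math.log10(signal / noise)
```

Because the commentary and stadium noise are mixed in this data, any such estimate depends on finding stretches without speech, which is itself an approximation.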

  9. WER results for Yug-Ned before and after FTNR
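For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word substitutions, deletions and insertions divided by the number of reference words. The sketch below computes it with a standard edit-distance alignment; it is not the scoring tool used for these results, and the example sentences are invented.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1], len(ref)


def wer(pairs):
    """WER in percent over (reference, hypothesis) pairs, pooled across chunks."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return 100.0 * errors / words if words else 0.0
```

Pooling errors over all chunks before dividing, rather than averaging per-chunk rates, keeps very short chunks from dominating the figure.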

  10. Dutch – Polyphone • Data is phonetically rich sentences • Phone models were trained on: • Polyphone all speakers • Polyphone male speakers • Polyphone male speakers + MUMIS noise • Polyphone as bootstrap for segmentation of MUMIS material

  11. Polyphone models (Dutch) Yug-Ned test set

  12. Cross tests (Dutch & English) Cross-tests: • train on ¾ Yug-Ned test on ¼ Eng-Dld • train on ¾ Eng-Dld test on ¼ Yug-Ned

  13. MUMIS models (Dutch) Yug-Ned test Eng-Dld test

  14. MUMIS models (English) Yug-Ned test Eng-Dld test

  15. MUMIS models (Dutch+English) Yug-Ned test Eng-Dld test

  16. Function words vs content words word type Dutch data English data

  17. SNR vs. WER (1)

  18. SNR vs. WER (2)

  19. Discussion • WERs are high • Noise? • FTNR leads to lower SNR, but WERs do not improve substantially • Not enough training data? • Polyphone for training/bootstrapping does not lead to lower WERs than training on MUMIS data • Noisifying Polyphone with MUMIS gives encouraging results

  20. Discussion continued • Function words comprise ± 50% of the data, and cause a great deal of the errors • Names are recognized very well • Function words not necessary for information extraction (?)
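The share of function words in a transcription can be measured along the following lines. This is only a sketch: the word list is a small hypothetical English sample, not the list used in the study, and the example sentence is invented.

```python
# Hypothetical, deliberately short function-word list for illustration only.
FUNCTION_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "by",
}


def function_word_share(transcripts, function_words=FUNCTION_WORDS):
    """Percentage of tokens in the transcripts that are function words."""
    tokens = [t.lower() for line in transcripts for t in line.split()]
    if not tokens:
        return 0.0
    n_function = sum(t in function_words for t in tokens)
    return 100.0 * n_function / len(tokens)
```

For the single sentence "Kluivert shoots and it is in the net", five of the eight tokens are in the sample list, giving 62.5%.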

  21. Future work • Steps to noise-robust speech recognition: • model/speaker adaptation • combinations of noisified Polyphone models and FTNR • Other issues: • transcription of more data • English, Dutch and German • preference specific games? radio? TV? • generic football-specific language model • confidence measures?

  22. Future work continued Questions: • What type of output from ASR is needed? • word-graph • n-best list • top of the list • word spotting? only content words? • For research purposes: is it possible to obtain data that has not been mixed (noise + commentary)?
