1 / 26

Sebastian leidig, Tim Schlippe , Tanja Schultz

Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation: A Case Study on our German IT Corpus. Sebastian leidig, Tim Schlippe , Tanja Schultz. Motivation. F rom Microsoft's German website www.microsoft.de :

zion
Download Presentation

Sebastian leidig, Tim Schlippe , Tanja Schultz

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation: A Case Study on our German IT Corpus Sebastian leidig, Tim Schlippe, Tanja Schultz

  2. Motivation Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios • FromMicrosoft's German websitewww.microsoft.de: • “Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” • “Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.”

  3. Motivation Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios • With the globalization words from other languages come into a language without assimilation to the phonetic system of the new language • To economically build up lexical resources with automatic or semi-automatic methods detect and treat them separately

  4. Overview Input features combination Output grapheme perplexity g2p confidence word list voting word1 hunspell lookup (native) word2 decision tree classification word3 hunspell lookup (English) word4 word5 SVM Wiktionary lookup word6 Google hit count Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  5. Outline Motivation and Overview Test Sets Single Features Combinations Summary and Future Work Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  6. Test Sets - Domains • German IT website • www.microsoft.de • 4.6k unique words • German general news • www.spiegel.de • 6.6k unique words • Afrikaans • NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013) • 9.4k unique words Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  7. Test Sets - Domains • Different word categories: • Abbreviations: • e.g. UV, CIA, … • Other foreign words • Compound words • e.g. Français, Niveau, … • Tag for • “English”: • e.g. Software, Brain, … • Foreign hybrids • Compound words • e.g. Schadsoftware, … • Grammatically adapted words • e.g. downloaden, … • Decisions based on • Agreement of annotators • In doubt: duden.de • . Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  8. Foreign words in different test sets Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  9. Single Features – Design Criteria • Features trained on commonly available resources • Word lists, Pronunciation dictionaries, Spellchecker dictionaries, Wiktionary, Google • Thresholds without supervised training • Comparison between English and native models • New approaches Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  10. Grapheme Perplexity Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  11. Grapheme Perplexity Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  12. Grapheme-to-Phoneme Confidence Phonetisaurusconfidencescores (costs) Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  13. Grapheme-to-Phoneme Confidence Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  14. Hunspell Lookup spellchecker dictionary English: Hunspell-en word list classification word1 Hunspell word2 dictionary lookup classification word3 derive word forms word4 2 featuresperformedbest spellchecker dictionary German: Hunspell-de word list classification word1 Hunspell word2 dictionary lookup classification word3 derive word forms word4 Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  15. Hunspell Lookup spellchecker dictionary English: Hunspell-en word list classification word1 Hunspell word2 dictionary lookup classification word3 derive word forms word4 spellchecker dictionary German: Hunspell-de word list classification word1 Hunspell word2 dictionary lookup classification word3 derive word forms word4 Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  16. Wiktionary Lookup Check crowdsourced information from matrix language Wiktionary Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  17. Google Hit Count Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Based on Alex B. (2008) “Automatic Detection of English Inclusion in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh

  18. Google Hit Count Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios Based on Alex B. (2008) “Automatic Detection of English Inclusion in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh

  19. Result: Single Features Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  20. Grapheme-to-Phoneme Confidence Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  21. Result: Single Features Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios  On Spiegel-de test set: Higher ratio of words classified as English are wrong

  22. Result: Combination Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  23. Challenges Performance after filteringdifficultwords (oracle) Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  24. Conclusion and Future Work • Features based on available sources • New approaches: • G2P confidence • Wiktionary • Further features: • Part-of-speech (POS) • Context, trigger words • Capitalization • Translate and compare Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  25. благодари́м за внима́ние! Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

  26. Result: Combination 2.2k pronunciations from German IT corpus PER to reference pronunciations Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

More Related