Can knowledge of ethnic origin classes improve pronunciation accuracy of foreign proper names?

Ariadna Font Llitjos
Advisor: Dr. Alan Black

Introduction and motivation

Our hypothesis is that knowing the origin of an unknown word may allow more specific rules to be applied ([Black et al., 1998] and [Church, 2000]). In some cases, we even expect our system to outperform native English speakers, since what we are after is the educated pronunciation of foreign proper names in American English. One such case is Chinese names: few native English speakers know how to pronounce them, but there are very concrete English rules for pronouncing such names. If we added this information to our system, it would pronounce those names in the Americanized, educated way, achieving higher pronunciation accuracy than the average American speaker.

Previous experiments on large lists of words and pronunciations have shown that when one lexicon has more foreign words than another (CMUDICT vs. OALD in [Black et al., 1998]), this has quite an impact on speech synthesis accuracy. Such experiments report a difference of 16.76% in word accuracy due to foreign words, a large proportion of which are likely to be proper names, since they are harder to predict without any higher-level information. Therefore, there is clearly room for improvement in this domain.

Baselines

For the baseline, I looked up CMUdict with stress (version 0.4) to extract the pronunciations of a list of 56,000 foreign proper names. Every tenth word in the lexicon was held out for testing, and the remaining 90% was used as training data. Based on the techniques described in [Black et al., 1998] and used in [Chotimongkol and Black, 2000], I used decision trees to predict phones based on letters and their context. In English, letters map to epsilon, a phone or occasionally two phones. The following three mappings illustrate this (epsilon is shown as _):

(a) Monongahela   m ax n oa1 ng g ax hh ey1 l ax
(b) Pittsburgh    p ih1 t _ s b _ er g _
(c) exchange      ih k-s ch _ ey1 n jh _

The CART trained on proper names only turned out to have a 3.15% higher word accuracy than the CART trained on the whole CMUdict (see table).

Multilingual Data collection

The next step was to collect data to build the letter language models (LLMs) for as many different languages as possible, so that I could build an effective language identifier and add the relevant features to the CART. For that, I post-processed all the corpora for the languages in the European Corpus Initiative Multilingual Corpus I (size: 255 thousand to 11 billion words): English, French, German, Spanish, Croatian, Czech, Danish, Dutch, Estonian, Hebrew, Italian, Malaysian, Norwegian, Portuguese, Serbian, Slovenian, Swedish and Turkish. I also built corpora of proper nouns only for a few more languages, by crawling the web automatically (Corpusbuilder [Ghani, Jones and Mladenic, 2001]) and manually (size: 500 to 6198 names): Catalan, Chinese, Japanese, Korean, Polish, Thai, Tamil and other-Indian (-Tamil).

Building Letter Language Models

Once I had collected all this multilingual data, I built a trigram LLM for each language. I used Laplace smoothing, which only made a significant difference for the proper names corpora, since I didn't have enough data to reliably estimate all the trigrams. I then implemented a language identifier which, given a word (or a document), uses the LLMs to determine which language it belongs to, with a certain probability.

To build the CART, we decided to add the following features to each name: the most probable language (best-lang), its probability (higher-prob), the second most probable language (2nd-best-lang), its probability (2nd-higher-prob) and the difference between the two probabilities (prob-difference):

(zysk ((best-lang slovenian.train) (higher-prob 0.18471) (2nd-best-lang czech.train) (2nd-higher-prob 0.18428) (prob-difference 0.00043)))

The resulting CART (probCART) therefore had a richer parameter space, since it used all the previous features (previous and next phones) as well as the language-of-origin features.

Preliminary results

We built the probabilistic CART (probCART) using a stop value of 5, which had proven optimal for CMUdict in previous experiments [Black et al., 1998]. However, since the parameter space was richer (the tree had more features to split on), we suspected a data fragmentation problem: there wasn't enough data at the leaves to give reliable estimates. We therefore also built CARTs using a stop value of 8. The word accuracies for all the CARTs are given in the table; the result represents a 7.64% increase in word accuracy over the proper names baseline.

Some references

- Black, A., Lenzo, K. and Pagel, V. (1998). Issues in Building General Letter to Sound Rules. 3rd ESCA Speech Synthesis Workshop, pp. 77-80, Jenolan Caves, Australia.
- Chotimongkol, A. and Black, A. (2000). Statistically trained orthographic to sound models for Thai. Beijing, October 2000.
- Church, K. (2000). Stress Assignment in Letter to Sound Rules for Speech Synthesis (Technical Memorandum). AT&T Labs - Research. November 27, 2000.
- Ghani, R., Jones, R. and Mladenic, D. (2001). Building Minority Language Corpora by Learning to Generate Web Search Queries. Technical Report CMU-CALD-01-100. http://www.cs.cmu.edu/~TextLearning/corpusbuilder/
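
Illustrative sketch (not the original implementation)

The trigram letter language models, the Laplace smoothing and the language-of-origin features described under Building Letter Language Models could be sketched roughly as follows. This is only an illustration under stated assumptions: the names BOUNDARY, LetterLM and language_features, the toy name lists, and the choice to normalise scores into probabilities under a uniform prior over languages are all hypothetical and are not taken from the poster. Only the feature names (best-lang, higher-prob, 2nd-best-lang, 2nd-higher-prob, prob-difference) follow the example shown there.

```python
import math
from collections import Counter

BOUNDARY = "#"  # hypothetical padding symbol for word start and end


def trigrams(word):
    """Yield the letter trigrams of a word, padded with boundary symbols."""
    padded = BOUNDARY * 2 + word.lower() + BOUNDARY
    for i in range(len(padded) - 2):
        yield padded[i:i + 3]


class LetterLM:
    """Trigram letter language model with add-one (Laplace) smoothing."""

    def __init__(self, words):
        self.tri = Counter()   # counts of each letter trigram
        self.bi = Counter()    # counts of each two-letter history
        self.alphabet = {BOUNDARY}
        for w in words:
            self.alphabet.update(w.lower())
            for t in trigrams(w):
                self.tri[t] += 1
                self.bi[t[:2]] += 1

    def log_prob(self, word, vocab_size):
        """Smoothed log probability of a word under this model."""
        total = 0.0
        for t in trigrams(word):
            # Add-one smoothing: unseen trigrams still get non-zero mass.
            total += math.log((self.tri[t] + 1) / (self.bi[t[:2]] + vocab_size))
        return total


def language_features(word, models):
    """Return the five language-of-origin features for one name.

    Assumption: scores are normalised into probabilities under a uniform
    prior over languages. At least two models are required.
    """
    alphabet = set(word.lower())
    for m in models.values():
        alphabet |= m.alphabet
    v = len(alphabet)  # shared symbol inventory, so models are comparable
    logps = {lang: m.log_prob(word, v) for lang, m in models.items()}
    top = max(logps.values())
    weights = {lang: math.exp(lp - top) for lang, lp in logps.items()}
    z = sum(weights.values())
    ranked = sorted(((w / z, lang) for lang, w in weights.items()), reverse=True)
    (p1, l1), (p2, l2) = ranked[0], ranked[1]
    return {
        "best-lang": l1,
        "higher-prob": p1,
        "2nd-best-lang": l2,
        "2nd-higher-prob": p2,
        "prob-difference": p1 - p2,
    }


# Toy training lists (made up for illustration, far too small for real use).
models = {
    "polish": LetterLM(["nowak", "kowalski", "zysk", "wisniewski"]),
    "italian": LetterLM(["rossi", "bianchi", "romano", "ricci"]),
}
feats = language_features("kowalczyk", models)
print(feats)
```

In a setup like the one described, the returned dictionary would be attached to each name as extra features for training the CART. With realistic corpora the probability mass is spread over many languages, so the gap between the two best candidates can be very small, as in the zysk example.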
