
A Joint Source-Channel Model for Machine Transliteration



Presentation Transcript


  1. A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore 119613 {hli,sujian,mzhang}@i2r.a-star.edu.sg ACL 2004

  2. Abstract • This paper presents a new framework that allows direct orthographical mapping (DOM) between two different languages, through a joint source-channel model, also called the n-gram TM model. • The orthographic alignment process is automated: the aligned transliteration units are derived from a bilingual dictionary. • A state-of-the-art machine learning algorithm is applied.

  3. Research Question • Out-of-vocabulary names. • Transliterating English names into Chinese is not straightforward. • For example, /Lafontant/ is transliterated into 拉丰唐 (La-Feng-Tang) while /Constant/ becomes 康斯坦特 (Kang-Si-Tan-Te), where the syllable /-tant/ in the two names is transliterated differently depending on the name's language of origin.

  4. Research Question • For a given English name E as the observed channel output, one seeks a posteriori the most likely Chinese transliteration C that maximizes P(C|E). By Bayes' rule: C* = argmax_C P(C|E) = argmax_C P(E|C) P(C). • P(E|C): the probability of transliterating C to E through a noisy channel, also called the transformation rules. • P(C): an n-gram language model over Chinese. • Likewise, in C2E back-transliteration: E* = argmax_E P(C|E) P(E). • P(E): an n-gram language model over English.
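The noisy-channel decoding above can be sketched in a few lines; the candidate list and the probability tables (`channel`, `prior`) are made-up illustrative values, not numbers from the paper:

```python
# Hypothetical model scores (illustrative only):
# channel[(E, C)] ~ P(E|C), prior[C] ~ P(C) from a language model.
channel = {("Lafontant", "拉丰唐"): 0.020, ("Lafontant", "康斯坦特"): 0.001}
prior = {"拉丰唐": 0.30, "康斯坦特": 0.25}

def decode(english_name, candidates):
    """Pick the Chinese transliteration C maximizing P(E|C) * P(C)."""
    return max(candidates,
               key=lambda c: channel.get((english_name, c), 0.0) * prior.get(c, 0.0))

print(decode("Lafontant", ["拉丰唐", "康斯坦特"]))  # → 拉丰唐
```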

  5. Research Question • Phoneme-based approaches map E to C through an intermediate phonemic representation P: P(C|E) = P(C|P) P(P|E) and P(E|C) = P(E|P) P(P|C), where P is the phoneme sequence. • For example: Chinese characters -> Pinyin -> IPA; English characters -> IPA. • Orthographic alignment is complicated by one-to-many mappings between units.
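A minimal sketch of the phoneme-pivot computation, marginalizing over the intermediate phonemic representations; the phoneme inventory and every probability below are hypothetical illustrative values:

```python
# Illustrative only: P(C|E) approximated by summing over phonemic pivots,
# P(C|E) ≈ Σ_P P(C|P) * P(P|E).
p_phoneme_given_e = {"tant": {"/tang/": 0.6, "/tænt/": 0.4}}
p_c_given_phoneme = {"/tang/": {"唐": 0.7}, "/tænt/": {"坦特": 0.8}}

def p_c_given_e(e_unit, c_unit):
    """Marginalize over intermediate phoneme hypotheses for one unit."""
    return sum(pp * p_c_given_phoneme.get(ph, {}).get(c_unit, 0.0)
               for ph, pp in p_phoneme_given_e.get(e_unit, {}).items())

print(round(p_c_given_e("tant", "唐"), 2))  # 0.6 * 0.7 → 0.42
```

The two chained conditional tables are exactly where errors compound; DOM avoids this pivot entirely.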

  6. Research Question • (Slide figure: an English name aligned with its Chinese transliteration.)

  7. Machine Transliteration - Review • Phoneme-based techniques • Transformation-based learning algorithms (Meng et al., 2001; Jung et al., 2000; Virga & Khudanpur, 2003) • Finite-state transducers that implement transformation rules (Knight & Graehl, 1998) • Web-based techniques (SIGIR 2004): • "Using the Web for Automated Translation Extraction" (Ying Zhang, Phil Vines) • "Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations" (Wai Lam, Ruizhang Huang, Pik-Shan Cheung) • "Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval" (Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien)

  8. Joint Source-Channel Model • Suppose there exists an alignment γ with E = e_1 e_2 … e_K and C = c_1 c_2 … c_K. • For the K aligned transliteration units, we propose to estimate P(E,C) by a joint source-channel model: P(E,C) = P(e_1,c_1, e_2,c_2, …, e_K,c_K) = Π_{k=1..K} P(<e,c>_k | <e,c>_1, …, <e,c>_{k-1}).

  9. Joint Source-Channel Model • A transliteration unit correspondence <e,c> is called a transliteration pair. • E2C transliteration can be formulated as C* = argmax_C P(E,C). • Similarly, C2E back-transliteration as E* = argmax_E P(E,C). • An n-gram transliteration model is defined as the conditional probability, or transliteration probability, of a transliteration pair <e,c>_k depending on its immediate n-1 predecessor pairs: P(<e,c>_k | <e,c>_{k-n+1}, …, <e,c>_{k-1}).
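A bigram (n = 2) instance of the joint model can be sketched as follows; the `bigram` table and all its values are illustrative assumptions, with "<s>" as a start marker:

```python
import math

# Hypothetical bigram probabilities P(<e,c>_k | <e,c>_{k-1}); illustrative only.
bigram = {
    ("<s>", ("la", "拉")): 0.10,
    (("la", "拉"), ("fon", "丰")): 0.20,
    (("fon", "丰"), ("tant", "唐")): 0.30,
}

def joint_log_prob(pairs):
    """log P(E,C) under a bigram joint source-channel model."""
    logp, prev = 0.0, "<s>"
    for pair in pairs:
        logp += math.log(bigram.get((prev, pair), 1e-9))  # tiny floor for unseen
        prev = pair
    return logp

pairs = [("la", "拉"), ("fon", "丰"), ("tant", "唐")]
print(joint_log_prob(pairs))  # log(0.1) + log(0.2) + log(0.3) ≈ -5.116
```

Decoding then searches over segmentations and pair choices for the sequence maximizing this score.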

  10. Transliteration alignment • A bilingual dictionary contains entries mapping English names to their respective Chinese transliterations. • To obtain the statistics, the bilingual dictionary needs to be aligned; this is done by a maximum likelihood approach, through the EM algorithm.
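One simplified way to realize the maximum-likelihood alignment is a hard (Viterbi-style) EM, sketched below under the assumption that each Chinese character aligns to one contiguous English substring of at most four letters; the paper's actual EM procedure is more general:

```python
import math
from collections import Counter

def best_alignment(e, c, prob):
    """Viterbi alignment: split e into len(c) contiguous substrings,
    one per Chinese character, maximizing the product of pair probabilities."""
    n = len(e)
    best = {0: (0.0, [])}  # letters of e consumed -> (log-prob, pairs so far)
    for ch in c:
        nxt = {}
        for i, (lp, pairs) in best.items():
            for k in range(i + 1, min(i + 4, n) + 1):
                pair = (e[i:k], ch)
                cand = (lp + math.log(prob.get(pair, 1e-6)), pairs + [pair])
                if k not in nxt or cand[0] > nxt[k][0]:
                    nxt[k] = cand
        best = nxt
    return best.get(n, (float("-inf"), []))[1]

def em_iteration(entries, prob):
    """One hard-EM round: align every entry, count pairs, renormalize."""
    counts = Counter()
    for e, c in entries:
        counts.update(best_alignment(e, c, prob))
    total = sum(counts.values())
    return {pair: cnt / total for pair, cnt in counts.items()}

prob = {("la", "拉"): 0.5, ("fon", "丰"): 0.5, ("tant", "唐"): 0.5}
print(best_alignment("lafontant", "拉丰唐", prob))
```

Iterating `em_iteration` from a flat start lets the pair statistics and the alignments reinforce each other until convergence.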

  11. Transliteration alignment • (Slide figure: an example alignment γ between English transliteration units and Chinese characters.)

  12. The experiments • Database: the bilingual dictionary "Chinese Transliteration of Foreign Personal Names" (Xinhua News Agency), with 37,694 unique English entries and their official Chinese transliterations. • The data is split into 13 subsets for 13-fold open tests. • 13-fold open tests: error rates show less than 1% deviation. • The EM iteration converges at the 8th round.

  13. The experiments • For a test set W composed of V names, each name has been aligned into a sequence of transliteration pair tokens. • The cross-entropy H_p(W) of a model on data W is defined as H_p(W) = -(1/W_T) log_2 P(W), where W_T is the total number of aligned transliteration pair tokens in the data W. • The perplexity: PP_p(W) = 2^{H_p(W)}.
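Given per-token log2 probabilities, the cross-entropy and perplexity definitions above reduce to a few lines; the token probabilities below are illustrative, not measured values:

```python
import math

def cross_entropy(log2_probs):
    """H_p(W) = -(1/W_T) * Σ log2 p over the W_T pair tokens."""
    return -sum(log2_probs) / len(log2_probs)

def perplexity(log2_probs):
    """PP_p(W) = 2 ** H_p(W)."""
    return 2 ** cross_entropy(log2_probs)

log2_probs = [math.log2(0.25)] * 4  # four tokens, each with probability 1/4
print(cross_entropy(log2_probs), perplexity(log2_probs))  # → 2.0 4.0
```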

  14. The experiments

  15. The experiments • In the word error report, a word is considered correct only if there is an exact match between the transliteration and the reference. • The character error rate is the sum of deletion, insertion and substitution errors.
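The character error rate can be computed with a standard Levenshtein edit distance; this sketch normalizes by the reference length, a common convention the slide does not spell out:

```python
def char_error_rate(hyp, ref):
    """(substitutions + insertions + deletions) / len(ref), via edit distance."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))  # substitution
    return d[m][n] / n

print(char_error_rate("康斯坦", "康斯坦特"))  # one missing character → 0.25
```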

  16. The experiments • 5,640 transliteration pairs are cross-mappings between 3,683 English and 374 Chinese units. • Each English unit has on average 1.53 (= 5,640/3,683) Chinese correspondences. • Each Chinese unit has on average 15.1 (= 5,640/374) English back-transliteration units. • Monolingual language perplexity is measured for Chinese and English independently using n-gram language models.

  17. The experiments

  18. Discussions

  19. Conclusions • In this paper, we propose a new framework (DOM) for transliteration; the n-gram TM is a successful realization of the DOM paradigm. • It generates probabilistic orthographic transformation rules using a data-driven approach. By skipping the intermediate phonemic interpretation, the transliteration error rate is reduced significantly. • Place and company names, e.g. /Victoria-Fall/ becomes 维多利亚瀑布 (Pinyin: Wei Duo Li Ya Pu Bu). • The framework can also be easily extended to handle such name translation.

  20. Ça va ("all is well")
