Adapting EBMT to Chinese

Adapting EBMT to Chinese. Joy (Ying Zhang), Joy@cs.cmu.edu, Jan 26, 2001. Topics: project overview; EBMT outline; Chinese language; improved segmenter; English phrase recognition and bracketing; statistical dictionary; results; ongoing and future work.


Presentation Transcript


  1. Adapting EBMT to Chinese Joy (Ying Zhang) Joy@cs.cmu.edu Jan 26, 2001

  2. Topics • Project overview • EBMT outline • Chinese language • Improved segmenter • English phrase recognition and bracketing • Statistical dictionary • Results • Ongoing and future work

  3. Project Overview • Part of Lingwear, TIDES • Adapting the existing multi-engine Pangloss MT system to Chinese-English • Quick-deploy MT system: develop MT with the smallest amount of human effort and knowledge

  4. Multi-engine MT system • There are three translation engines in the current system: • EBMT: Example-Based Machine Translation • DICTionary: provides coverage for words not otherwise covered by EBMT; it can be constructed automatically from a bilingual corpus • GLOSSaries: built from hand-crafted bilingual word/phrase glossaries

  5. EBMT outline • Concepts • An Example-Based Machine Translation (EBMT) system is given a set of sentences in the source language (from which one is translating) and their corresponding translations in the target language, and uses those examples to translate other, similar source-language sentences into the target language. The basic premise is that, if a previously translated sentence occurs again, the same translation is likely to be correct again. (Ralf Brown) • Other EBMT systems operate on parse trees, or find the most similar complete sentence and modify its translation based on the differences between the sentence to be translated and the matched example. (Ralf Brown) • Our system is a shallow EBMT system • Bilingual corpus • Indexing (using dictionary)---Matching • One of the most important issues: improving the performance of MATCHING
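
The indexing and matching steps on this slide can be illustrated with a minimal sketch. It assumes the corpus is already split into whitespace-separated words and that matching means finding exact contiguous word runs. The function names `build_index` and `longest_matches` are hypothetical, and the dictionary-based alignment of the target side is left out.

```python
from collections import defaultdict


def build_index(examples):
    """Map each source word to the (example id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sid, (source, _target) in enumerate(examples):
        for pos, word in enumerate(source.split()):
            index[word].append((sid, pos))
    return index


def longest_matches(index, examples, sentence):
    """For each start position in the input, return (start, length, example id)
    for the longest word run that also occurs contiguously in an example."""
    words = sentence.split()
    matches = []
    for start, word in enumerate(words):
        best_len, best_sid = 0, None
        for sid, pos in index.get(word, []):
            src = examples[sid][0].split()
            length = 0
            while (start + length < len(words)
                   and pos + length < len(src)
                   and words[start + length] == src[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_sid = length, sid
        if best_len:
            matches.append((start, best_len, best_sid))
    return matches


examples = [("a b c d", "A B C D"), ("x b c", "X B C")]
index = build_index(examples)
print(longest_matches(index, examples, "b c d e"))
```

Longer source-side units mean fewer, longer matches, which is why the later slides concentrate on word and phrase length.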

  6. Chinese language • Character • The unit for constructing words; almost every character has a meaning. When combined with other characters to form a word, the meaning of the word may differ from the meaning of its characters • Word: • Usually a bigram (two-character word); can also be a unigram, trigram or 4-gram; n-grams with n>4 are specific idioms (Data from FDMC 1986)

  7. Chinese language (cont.) • Problems with words • Vague definition of words • E.g. People’s Republic of China (all these words can be considered as legal words)

  8. Chinese language (cont.) • Unknown words • New words • Words unique to a certain domain, e.g. legal code

  9. Chinese language (cont.) • Segmentation • Segmenting words from the sequence of characters • LDC segmenter: uses dynamic programming and depends on a frequency dictionary • Problem with LDC segmentation • The frequency dictionary cannot cover the corpus (mis-segmentation)
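
A frequency-dictionary segmenter of this kind can be sketched as follows. This is an illustration, not the LDC code: it assumes the best segmentation is the one with the highest product of relative word frequencies, and the parameters `max_len` and `unknown` (a pseudo-count for single characters missing from the dictionary) are hypothetical.

```python
import math


def segment(text, freq, max_len=4, unknown=0.5):
    """Split text into the word sequence with the highest product of
    relative frequencies, found by dynamic programming."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (log score of the best split of text[:i], start of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                count = freq[word]
            elif len(word) == 1:
                count = unknown  # unseen single character: allowed, heavily penalized
            else:
                continue  # unseen multi-character strings are not words
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


freq = {"中国": 50, "人民": 40, "中": 10, "国": 10, "人": 10, "民": 10}
print(segment("中国人民", freq))  # ['中国', '人民']
print(segment("中国法典", freq))  # ['中国', '法', '典']
```

The second call shows the coverage problem: a word missing from the dictionary falls apart into single characters.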

  10. Chinese language (cont.) • Consequence of mis-segmentation • Match?? • The longer the word, the better the coverage for EBMT (the context is encapsulated in the word)

  11. Improved Segmenter • Basic idea: use statistical lexical acquisition to augment the segmenter's frequency dictionary • Steps: • Use a sliding window to extract repeating patterns (sequences of characters) from the corpus • Refine the patterns to construct longer words/terms
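
The two steps can be sketched roughly as below. The slide does not say how patterns are refined, so the refinement rule here (drop a pattern when a longer, equally frequent pattern contains it) is an assumption, as are the names `repeated_patterns` and `keep_longest` and all thresholds.

```python
from collections import Counter


def repeated_patterns(text, min_len=2, max_len=6, min_count=2):
    """Slide windows of every length from min_len to max_len over the text
    and keep the character sequences seen at least min_count times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


def keep_longest(patterns):
    """Assumed refinement: drop a pattern when a longer pattern contains it
    and occurs just as often, so only the longest candidate term survives."""
    kept = {}
    for p, c in patterns.items():
        subsumed = any(p != q and p in q and c == cq
                       for q, cq in patterns.items())
        if not subsumed:
            kept[p] = c
    return kept


text = "中华人民共和国成立。中华人民共和国宪法。"
print(keep_longest(repeated_patterns(text, max_len=7)))
```

The surviving patterns would then be added to the frequency dictionary as new words.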

  12. Improved Segmenter (cont.) • Assumptions: • Localization: words of the same type appear near each other more often than they would if distributed evenly across the whole corpus

  13. Improved Segmenter (cont.)

  14. Improved Segmenter • Assumption 2: if a pattern is going to appear again, it should appear within a range related to the average distance between its earlier occurrences
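
One way to read this assumption is as a search window derived from the gaps between earlier occurrences. The slide does not give the exact relation, so the linear `factor` and the function name `expected_window` below are purely hypothetical.

```python
def expected_window(positions, factor=2.0):
    """Return the (start, end) character range in which the next occurrence
    of a pattern is expected, given the sorted positions seen so far."""
    if len(positions) < 2:
        raise ValueError("need at least two occurrences to measure a distance")
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    average_gap = sum(gaps) / len(gaps)
    last = positions[-1]
    # Assumed relation: look ahead at most `factor` times the average gap.
    return last + 1, last + int(factor * average_gap)


print(expected_window([10, 30, 40]))  # (41, 70)
```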

  15. Improved Segmenter • Results: • Hard to evaluate, because of the vague definition of words • The effect of the improved segmenter can be seen in the improvement in EBMT coverage

  16. English phrase bracket • Match: • Since we increased the average length of Chinese words, we did a similar thing for English so that the Chinese and English sides of the corpus still match • Recognize English phrases and bracket the corpus (replacing blanks with underscores), e.g. the_people’s_republic_of_china (it will be treated as one word)
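
The bracketing step itself is simple once a phrase list exists. The sketch below assumes the phrases have already been recognized and are supplied as a set of lowercase strings. The greedy longest-match strategy, the lowercasing, and the name `bracket` are assumptions rather than details from the slide.

```python
def bracket(sentence, phrases, max_words=6):
    """Join each known phrase into one token by replacing blanks with
    underscores, preferring the longest phrase at every position."""
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_words, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in phrases:
                out.append(candidate.replace(" ", "_"))
                i += n
                break
        else:
            # No phrase starts here: keep the single word.
            out.append(words[i])
            i += 1
    return " ".join(out)


phrases = {"the people's republic of china", "hong kong"}
print(bracket("The People's Republic of China and Hong Kong", phrases))
# the_people's_republic_of_china and hong_kong
```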

  17. Statistical dictionary • Step 1: collapse the inflected forms of English phrases/words into one class • Algorithm: the longest common substring of two phrases should be long enough.
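
The longest-common-substring test can be sketched as below. The slide does not say what "long enough" means, so the `threshold` (shared substring as a fraction of the longer phrase) and the name `same_class` are hypothetical.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def same_class(a, b, threshold=0.7):
    """Assumed criterion: two phrases are inflected variants when the shared
    substring covers at least `threshold` of the longer phrase."""
    if not a or not b:
        return False
    return longest_common_substring(a, b) / max(len(a), len(b)) >= threshold


print(same_class("agreement", "agreements"))  # True
print(same_class("agree", "china"))           # False
```

A ratio this simple also merges unrelated short words such as "walk" and "talk", so a real system would need a stricter test, for example requiring a shared prefix.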

  18. Statistical dictionary • Step 2: build the statistical dictionary • Algorithm (with help from Benjamin) • S: source-language word • T: target-language word
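
The scoring formula from this slide did not survive in the transcript. As a stand-in only, the sketch below scores each pair (S, T) with the Dice coefficient over sentence-level co-occurrence, a common choice for this task. It is not the author's algorithm, and `build_dictionary` and `min_score` are hypothetical names.

```python
from collections import Counter
from itertools import product


def build_dictionary(pairs, min_score=0.5):
    """Build {S: [(T, score), ...]} from aligned, pre-tokenized sentence
    pairs, scoring each candidate with Dice = 2 * c(S, T) / (c(S) + c(T))."""
    src_count, tgt_count, both = Counter(), Counter(), Counter()
    for source, target in pairs:
        s_words, t_words = set(source.split()), set(target.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        both.update(product(s_words, t_words))
    dictionary = {}
    for (s, t), c in both.items():
        score = 2 * c / (src_count[s] + tgt_count[t])
        if score >= min_score:
            dictionary.setdefault(s, []).append((t, score))
    for s in dictionary:
        # Best-scoring translation candidate first.
        dictionary[s].sort(key=lambda item: -item[1])
    return dictionary


pairs = [
    ("中国 人民", "chinese people"),
    ("中国 法律", "chinese law"),
    ("人民 法院", "people's court"),
]
print(build_dictionary(pairs)["中国"][0])  # ('chinese', 1.0)
```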

  19. Statistical dictionary • Iteration • Because the improved segmenter and the phrase extraction both work monolingually, an extracted Chinese term may have no translation found for it • Use only the Chinese words and English phrases for which a translation was found to re-segment/re-bracket the corpus • Build the statistical dictionary again • Repeat this loop several times; the size of the statistical dictionary increases

  20. Results • Exp16: Baseline system • Exp15: Baseline system + improved segmenter • Exp18: Baseline system + improved segmenter + statistical dictionary • Exp14: Baseline system + improved segmenter + bracketer + statistical dictionary (3 iterations)

  21. Ongoing and future work • Feedback from the statistical dictionary to the segmenter and bracketer • Topic detection, corpus clustering • Related ongoing work: • Ralf: generalization, word clustering • Erik: relative clause detection and reordering
