1 / 18

Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms

Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms. Mosleh Al-Adhaileh and Tang Enya Kong Computer Aided Translation Unit School of Computer Sciences U niversity S cience M alaysia. I. Dan Melamed Department of Computer Science Courant Institute

ivy
Download Presentation

Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms Mosleh Al-Adhaileh and Tang Enya Kong Computer Aided Translation Unit School of Computer Sciences University Science Malaysia I. Dan Melamed Department of Computer Science Courant Institute New York University

  2. Presentation Outline Introduction Bitext Mapping and Alignment SIMR and GSA algorithms. Porting SIMR/GSA to Malay-English Bitext Data Collection Steps to adopt SIMR into Malay-English language pair Matching Predicate Axis Generator Parameter Optimization Results and Evaluation Conclusion

  3. Bitext Mapping and Alignment • Bitext-(parallel text): A text in one language and its translation in another language. • Bitext Mapping and Alignment: to describe the correspondence between the two halves of the bitext. • Bitext Mapping: is to find the corresponding points, i.e. words, text units, or segments boundaries, between its two halves • Bitext Alignment: is a segmentation of the two texts, such that the nth segment of one text corresponds to thenthsegment of the other.

  4. Machine Translation • Bilingual lexicography • Word sense disambiguation • Multilingual information retrieval • Also as practical tool for assisting translators • Bitext Mapping and Alignment are needed in order to compile this Data into a useful source of knowledge.

  5. A Bitext can form the axes of a rectangular bitext space. terminus terminus • True Points of Correspondence (TPCs) can be plotted as points in the bitext space. Y: Characters’ position in text 2 Y: Characters’ position in text 2 Main diagonal Main diagonal origin origin X: Characters’ position in text 1 X: Characters’ position in text 1 • Bitext space The point (X,Y) is TPC if token at positionXand a token at positionYare translation to each other.

  6. Real bitexts are noisy: - Fertility = A single segment in one half may correspond to zero, one, two or more segments in the other half. - crossed dependencies (distortion) = Where human translators change and rearrange material so the target output text will not flow well according to the order of the source text.

  7. Mapped Bitext Malay English Bitext SIMR Malay English SIMR TCPs TBM SIMR and GSA algorithms • Mapping • SIMR: stands for Smooth Injective Map Recognizer

  8. j i h g f e d c b a j A B C D E F G H I J K L i h g f e d c • Each cell represents the intersection of two segments, one from each half of the bitext • Segment boundaries form a grid over the bitext space • SIMR Output: the correspondence points b a A B C D E F G H I J K L • Alignment • GSA: reduces the sets of correspondence points in SIMR’s output to segment alignments • GSA: stands for Geometric Segment Alignment. • A point inside (X,y) cell indicates that some token in segment X corresponds with some token in segment y; segments X and y correspond.

  9. Malay Version: 107,161 words 13 chapters English Version: 101,790 words Malay Version: 51,802 words 8 chapters English Version: 50,170 words Malay Version: 8,281 words First 20 pages English Version: 6,974 words • Data Collection Malay-English Bitexts from Unit Terjemahan Melalui Komputer (UTMK) - USM • “The 7 Habits of Highly Effective People” • “Semantics” • “User’s Guide: Microsoft Word for Windows”

  10. Manual Alignment Validate Manual Alignment Training Data Parameter re-optimization ADOMIT Bitext Mapping Test Data Axis generator Malay Malay English English SIMR English Malay KIMD Bilingual dictionary Lexicon GSA Segment Alignment English Malay • Steps to adoptSIMRinto Malay-English language pair

  11. Computer Komputer • Cognate words System Sistem • Lexicon Bury: mengebumikan, menanam, kematian,kereta • Matching Predicate • It is a heuristic used to decide whether two given tokens might be mutual translation. Find the TPCs between the two halves of the bitext • Punctuation marks The matching Predicates were fine-tuned with stop-list words for both Malay and English languages

  12. 0 <EOS> 3 tujuh 9.5 tabiat 17.5 gambaran 26 seluruh 31 . 31.5 <EOS> • Axis Generator • For each language, an axis generator performs the mapping from tokens (the smallest semantic units) to axis position. • The position of a token (in character) is the position of its median character. “tujuh tabiat gambaran seluruh.” • Data Lemmatization • English: word stemming POS tag (Brill’s) and XTAG lexicon (contains 90000 roots, 317000 inflected forms). • Malay: root construction  rules, and lexicon (contains 10000 popular words).

  13. Parameter Value Parameters value Chain size 7 Max. point ambiguity 6 Max. angle deviation 0.14 Max. linear regression error 5 Min. Cognate length ratio 0.80 b a A O B • Parameter Optimization • We use Chapter 3, 7 and 11 from the 7habits book. All together 1245 segments. It is manually aligned at the sentence level. ADOMIT (AutomaticDetectionofOMIssions inTranslation)  Alignment Validation • Simply say: Any segment whose slope is unusually low is a likely omission.

  14. Results and Evaluation

  15. Visit http://www.cs.nyu.edu/~melamed/, the URL for important references Visit http://utmk-ultra.cs.usm.my/, the URL for Unit Terjemahan Melalui Komputer (UTMK) – USM. • Conclusion • This experiment shows that SIMR/GSA algorithms can map/align Malay-English bitexts with high accuracy as they performed on the other variety of language pairs and text genres. • These results encourage us, as a future work, to think of extending the text alignment to word alignment aiming at the identification of correspondence between linguistic units below the sentence level within a bitext. • Bitexts are becoming plentiful and available, both in private data warehouses and on publicly accessible sites on the WWW. They form a very useful source of knowledge if they were treated efficiently.

  16. Thank you….. Mosleh Al-Adhaileh and Tang Enya Kong Computer Aided Translation Unit School of Computer Sciences University Science Malaysia I. Dan Melamed Department of Computer Science Courant Institute New York University

More Related