
Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department of Information Management

A Technical Word and Term Translation Aid using Noisy Parallel Corpora across Language Groups. Pascale Fung, Kathleen McKeown. Machine Translation, 1997.


Presentation Transcript


  1. A Technical Word and Term Translation Aid using Noisy Parallel Corpora across Language Groups • Pascale Fung, Kathleen McKeown • Machine Translation, 1997 • Advisor: Dr. Hsu • Student: Sheng-Hsuan Wang • Department of Information Management

  2. Outline • Motivation • Objective • Introduction • Related work • Noisy parallel corpora across language groups • Algorithm overview • Experiments • Conclusion

  3. Motivation • Technical term translation is a difficult task • Translation quality suffers because translators are often unfamiliar with domain-specific terminology. • Such terminology is not adequately covered by printed dictionaries. • Extracting terms is especially hard when the parallel corpora are noisy. • Ex: • Hong Kong Governor / 香港總督 • Basic Law / 基本法 • Green Paper / 綠皮書

  4. Objective • This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups.

  5. 1. Introduction • Technical terms • often cannot be translated on a word-by-word basis. • The individual words of the term may have many possible translations. • Example: Governor • 總督, 主管 (top manager), 總裁 (chief), 州長 (of a state) • Hong Kong Governor / 香港總督 • Domain-specific terms • Basic Law / 基本法 • Green Paper / 綠皮書

  6. 1. Introduction • An algorithm for translating technical terms given a noisy parallel corpus as input • Notion • A word and its translation won't occur at exactly the same position in each half of the corpus • But the distances between instances of the same word will be similar across languages • Method • Find word correlations, then build technical term translations from them. • Dynamic time warping algorithm. • Reliable anchor points.

  7. 2. Related work • Sentence alignment • Segment alignment • Word and term translation • Word alignment • Phrase translation

  8. 2.1. Sentence alignment • Two main approaches • Text-based: use of lexical information (dictionary) • Use paired lexical indicators across the languages to find matching sentences. • Length-based: use of the total number of characters (words) • Make the assumption that translated sentences in the parallel corpus will be of approximately the same, or constantly related, length.

  9. 2.2. Segment alignment • Church (1993) shows that a text can be aligned by using delimiters. • Segment alignment is more appropriate for aligning noisy corpora. • The problem is finding reliable anchor points that can be used for Asian/Romance language pairs.

  10. 2.3. Word and term translation • Some algorithms used for alignment produce a small bilingual lexicon. • Some others use sentence-aligned parallel text. • Most of the following algorithms require clean, sentence-aligned parallel text input.

  11. 2.4. Word alignment • [Brown et al. 1990, Brown et al. 1993] • [Gale & Church 1991] • [Dagan et al. 1993] • [Wu & Xia 1994] • Various filtering techniques are used to improve the matching.

  12. 2.5. Phrase translation • [Kupiec 1993] • [Smadja & McKeown 1993] • [Dagan & Church 1994] • All the work described in this section assumes a clean, parallel corpus as input.

  13. 3. Noisy parallel corpora across language groups • Previous approaches lack robustness • Against structural noise in parallel corpora. • Against language pairs which don't share etymological roots. • Remaining problems • Bilingual texts which are translations of each other but are not translated sentence by sentence. • Language robustness.

  14. 3. Noisy parallel corpora across language groups • Two noisy parallel corpora • English version of the AWK manual and its Japanese translation. • Parts of the HKUST English-Chinese Bilingual Corpora.

  15. 4. Algorithm overview • Treat the domain word translation problem as a pattern matching problem • Each word shares some common features with its counterpart in the translated text. • To find the best representations of these features and the best ways to match them.

  16. Algorithm overview (flowchart) • Corpus: English, Chinese • Tag English word list • Tokenize Japanese and Chinese texts, and form a word list • Steps 1 – 4: 1. Primary lexicon 2. Anchor points for alignment 3. Align the text 4. Secondary lexicon • 5. Compile non-linear segment boundaries with high frequency word pairs • 6. Compile bilingual word lexicon • 7. Suggest a word list for each technical term to the translator

  17. 5. Extracting technical terms from English text • To find domain-specific terms, we tagged the English part of the corpus with a modified POS tagger • Extracted the noun phrases which are most likely to be technical terms. • Translations are sought only for words which are part of these terms.

  18. 6. Tokenization of Chinese and Japanese texts • Tokenization of the Chinese text is done by using a statistically augmented dictionary-based tokenizer which is able to recognize frequent domain words. • Example: 基本法/Basic Law • The Japanese text is tokenized by JUMAN without domain word augmentation.

  19. 7. A rough word pair based alignment • Treat translation as a pattern matching task. • The task is to find a representation and similarity measurement which can find word pairs to serve as anchor points.

  20. 7.1. Dynamic Recency Vectors • Governor • The word position • <2380,2390,2463,…> of length 212. • Recency vector • <10,73,102,…> • 總督 • The word position • <90,2021,2150,…> of length 254. • Recency vector • <1931,129,8,…>
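The recency vector on this slide is the list of gaps between consecutive positions of a word. A minimal sketch, using only the first three positions shown on the slide (the function name `recency_vector` is an illustrative choice, not the paper's):

```python
from typing import List


def recency_vector(positions: List[int]) -> List[int]:
    """Return the gaps between consecutive occurrences of a word.

    `positions` are the word's offsets in the text, in increasing order.
    """
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


# First three positions from the slide; the full vectors are much longer.
governor_en = [2380, 2390, 2463]
governor_zh = [90, 2021, 2150]

print(recency_vector(governor_en))  # [10, 73]
print(recency_vector(governor_zh))  # [1931, 129]
```

A word that occurs n times yields n - 1 gaps, so the recency vector is one element shorter than the position vector.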

  21. Recency vector signals • (Figure: recency vector plots for Governor.ch, Governor.en, Bill.ch, President.en)

  22. 7.2. Matching Recency Vectors • Dynamic time warping (DTW) • Takes two vectors of lengths N and M and finds an optimal path through the N by M trellis, from (1,1) to (N,M). • (Figure: trellis for 總督 / Governor)

  23. DTW algorithm • Initialization • Costs are initialized according to the recency vector values. • (Figure: trellis for 總督 / Governor)

  24. DTW algorithm • Recursion • Accumulate the cost of the DTW path. • (Figure: trellis for 總督 / Governor)

  25. DTW algorithm • Termination • The final cost of the DTW path is normalized by the length of the path. • (Figure: trellis for 總督 / Governor)

  26. DTW algorithm • Path reconstruction • Reconstruct the DTW path and obtain the points on the path. • These points are used later to find anchor points and eliminate noise. • (Figure: trellis for 總督 / Governor)
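The four DTW steps above (initialization, recursion, termination, path reconstruction) can be sketched as follows. The slide formulas did not survive in this transcript, so this is a textbook DTW under stated assumptions: the local cost is the absolute difference between recency values, and each cell may be reached from its left, lower, or diagonal neighbour. Indices are 0-based here, whereas the slides count from (1,1).

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def dtw(v1: List[float], v2: List[float]) -> Tuple[float, List[Cell]]:
    """Return (path-length-normalized cost, warping path) for two vectors."""
    n, m = len(v1), len(v2)
    if n == 0 or m == 0:
        raise ValueError("both vectors must be non-empty")

    cost = [[0.0] * m for _ in range(n)]
    back: List[List[Optional[Cell]]] = [[None] * m for _ in range(n)]

    for i in range(n):
        for j in range(m):
            local = abs(v1[i] - v2[j])  # assumed local distance
            if i == 0 and j == 0:
                cost[i][j] = local  # initialization at the start cell
                continue
            # Recursion: extend the cheapest admissible predecessor.
            candidates: List[Cell] = []
            if i > 0 and j > 0:
                candidates.append((i - 1, j - 1))
            if i > 0:
                candidates.append((i - 1, j))
            if j > 0:
                candidates.append((i, j - 1))
            prev = min(candidates, key=lambda c: cost[c[0]][c[1]])
            cost[i][j] = cost[prev[0]][prev[1]] + local
            back[i][j] = prev

    # Path reconstruction: follow back-pointers from the end cell.
    path: List[Cell] = []
    node: Optional[Cell] = (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()

    # Termination: normalize the accumulated cost by the path length.
    return cost[n - 1][m - 1] / len(path), path


score, path = dtw([10, 73, 102], [10, 73, 102])
print(score, path)
```

Identical vectors give a score of 0 along the diagonal. The more two recency vectors differ, the higher the normalized score.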

  27. DTW algorithm • For each word vector in language A, the word vector in language B which has the lowest DTW score is taken to be its translation. • We thresholded the bilingual word pairs obtained from the above stages of the algorithm and stored the more reliable pairs as our primary bilingual lexicon.
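The selection step can be sketched as below, assuming the DTW scores have already been computed for each candidate pair. The scores, the English glosses, and the threshold value are made up for illustration. The slides do not give the threshold actually used.

```python
from typing import Dict, Tuple


def primary_lexicon(
    scores: Dict[Tuple[str, str], float], threshold: float
) -> Dict[str, str]:
    """Keep, for each source word, its lowest-scoring target word,
    provided that score does not exceed `threshold` (hypothetical value)."""
    best: Dict[str, Tuple[str, float]] = {}
    for (src, tgt), score in scores.items():
        if src not in best or score < best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, score) in best.items() if score <= threshold}


# Toy DTW scores, invented for this example.
toy_scores = {
    ("Governor", "總督"): 12.0,
    ("Governor", "基本法"): 85.0,
    ("Bill", "總督"): 140.0,
    ("Bill", "基本法"): 130.0,
}
print(primary_lexicon(toy_scores, threshold=50.0))  # {'Governor': '總督'}
```

"Bill" is dropped because even its best candidate scores above the threshold, which is the sense in which only the more reliable pairs are stored.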

  28. 7.3. Statistical filters • To reduce the computational complexity, we incorporated constraints to filter the set of possible pairs • Starting point constraint, i.e., position constraint. • Length constraint, i.e., frequency constraint. • Mean/standard deviation constraint.
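A rough sketch of how the three filters could prune candidate pairs before running DTW. The slide names the constraints but not their formulas, so every test and threshold below (`max_start_gap`, `max_freq_ratio`, `max_stat_dist`) is an assumption made for illustration only.

```python
from math import hypot
from statistics import mean, pstdev
from typing import List


def recency(positions: List[int]) -> List[int]:
    """Gaps between consecutive word positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


def passes_filters(
    pos_a: List[int],
    pos_b: List[int],
    text_len_a: int,
    text_len_b: int,
    max_start_gap: float = 0.5,    # hypothetical threshold
    max_freq_ratio: float = 2.0,   # hypothetical threshold
    max_stat_dist: float = 200.0,  # hypothetical threshold
) -> bool:
    """Return True if the word pair survives all three filters."""
    if len(pos_a) < 2 or len(pos_b) < 2:
        return False  # no recency vector can be formed

    # Starting point (position) constraint: first occurrences should fall
    # at roughly the same relative place in each text.
    if abs(pos_a[0] / text_len_a - pos_b[0] / text_len_b) > max_start_gap:
        return False

    # Length (frequency) constraint: the two words should occur a
    # comparable number of times.
    longer = max(len(pos_a), len(pos_b))
    shorter = min(len(pos_a), len(pos_b))
    if longer / shorter > max_freq_ratio:
        return False

    # Mean/standard deviation constraint: the recency vectors should have
    # similar summary statistics.
    rec_a, rec_b = recency(pos_a), recency(pos_b)
    dist = hypot(mean(rec_a) - mean(rec_b), pstdev(rec_a) - pstdev(rec_b))
    return dist <= max_stat_dist


print(passes_filters([10, 50, 90], [12, 55, 95], 100, 100))
```

Only pairs that pass would then be scored with DTW, which is how such filters cut down the number of expensive comparisons.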

  29. 8. Finding anchor points and eliminating noise • Primary lexicon is used for aligning the segments in the corpus • To find anchor points on the DTW paths which divide the texts into multiple aligned segments for the secondary lexicon. • We only keep an anchor point (i,j) if it satisfies the following • (slope constraint) • (continuity constraint) • (window size constraint) • (offset constraint)

  30. 8. Finding anchor points and eliminating noise • (Figure: text alignment paths for the AWK and HKUST corpora, showing all word pairs and the path after filtering)

  31. 9. Finding bilingual word pair matches • To obtain the secondary and final bilingual word lexicon • A non-linear K segment binary vector representation for each word. • A similarity measure to compute word pair correlations.

  32. 9.1. Non-Linear K segments • The anchor points <(i1,j1),(i2,j2),…,(ik,jk)> divide a bilingual corpus into k+1 non-linear segments, where i in text1 and j in text2. • The algorithm then proceeds to obtain a secondary bilingual lexicon, considering words of both high and low frequency.
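A small sketch of how k anchor points split each text into k+1 segments. The anchor values are invented, and treating a position equal to an anchor as part of the following segment is an assumption.

```python
from bisect import bisect_right
from typing import List, Tuple


def segment_of(position: int, boundaries: List[int]) -> int:
    """Return the 0-based segment index of `position`, given sorted boundaries."""
    return bisect_right(boundaries, position)


# Hypothetical anchor points (i, j): i is a position in text1, j in text2.
anchors: List[Tuple[int, int]] = [(100, 120), (250, 260), (400, 430)]
text1_bounds = [i for i, _ in anchors]
text2_bounds = [j for _, j in anchors]

print(segment_of(50, text1_bounds))   # 0
print(segment_of(399, text1_bounds))  # 2
print(segment_of(450, text2_bounds))  # 3
```

Because the i and j boundaries differ, the same segment index covers spans of different lengths in the two texts, which is what makes the segmentation non-linear.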

  33. 9.2. Non-Linear segment binary vectors • Goal: compute the correlation between two words from the occurrences of a pair of translated words in a bilingual corpus. • Pr(ws, wt): the probability of ws and wt occurring in the same place in the corpus. • Binary vector where the i-th bit is set to 1 if both words are found in the i-th segment. • (Figure: binary vector for "governor" over K segments)

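  34. 9.2. Non-Linear segment binary vectors • (Table: contingency table of segment counts, source word present T/F against target word present T/F) • a is the number of segments in which both words occur. • If the source and target words are good translations of one another, then a should be large.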
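The contingency counts can be illustrated with one binary vector per word, following the "binary vector representation for each word" wording on slide 31. The boundaries and positions are toy values. The names b, c and d for the remaining cells are the usual convention and are assumed here, since the table itself did not survive in the transcript.

```python
from bisect import bisect_right
from typing import List, Tuple


def binary_vector(positions: List[int], boundaries: List[int]) -> List[int]:
    """Bit i is 1 if the word occurs anywhere in segment i."""
    bits = [0] * (len(boundaries) + 1)
    for p in positions:
        bits[bisect_right(boundaries, p)] = 1
    return bits


def contingency(src: List[int], tgt: List[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d): both present, source only, target only, neither."""
    a = sum(1 for s, t in zip(src, tgt) if s and t)
    b = sum(1 for s, t in zip(src, tgt) if s and not t)
    c = sum(1 for s, t in zip(src, tgt) if not s and t)
    d = sum(1 for s, t in zip(src, tgt) if not s and not t)
    return a, b, c, d


# Toy data: three anchor points give four segments per text.
src_bits = binary_vector([90, 260, 410], [100, 250, 400])
tgt_bits = binary_vector([95, 300, 450], [120, 260, 430])
print(src_bits, tgt_bits)               # [1, 0, 1, 1] [1, 0, 1, 1]
print(contingency(src_bits, tgt_bits))  # (3, 0, 0, 1)
```

Here the two words appear in exactly the same segments, so a is as large as it can be and b and c are zero.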

  35. 9.3. Binary vector correlation measure • Similarity measure, weighted mutual information
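The slide's formula is not in the transcript, so the sketch below uses the common definition of weighted mutual information, W = Pr(ws=1, wt=1) · log2( Pr(ws=1, wt=1) / (Pr(ws=1) · Pr(wt=1)) ). The probabilities are estimated from segment counts a (both words), b (source only), c (target only) and d (neither). Treat both the formula and the count names as assumptions rather than the paper's exact notation.

```python
from math import log2


def weighted_mutual_information(a: int, b: int, c: int, d: int) -> float:
    """Weighted mutual information from contingency counts over segments."""
    total = a + b + c + d
    if total == 0 or a == 0:
        return 0.0  # no co-occurrence evidence: score the pair as zero
    p_joint = a / total
    p_src = (a + b) / total
    p_tgt = (a + c) / total
    return p_joint * log2(p_joint / (p_src * p_tgt))


print(weighted_mutual_information(3, 0, 0, 1))
print(weighted_mutual_information(1, 1, 1, 1))
```

Weighting the log ratio by the joint probability favours pairs that co-occur often, so a pair seen together in a single segment does not outrank a pair seen together throughout the corpus.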

  36. 10. Word translation results

  37. 11. Term translations from word groups

  38. Term translation aid result

  39. Conclusion • A technique to align noisy parallel corpora by segments, and to extract a bilingual word lexicon from them. • The sentence alignment step is replaced with a rough segment alignment. • Works without sentence boundary information and in the presence of noise. • Highly reliable anchor points found using DTW serve as segment delimiters.

  40. Personal opinion • Valuable idea • Treat the domain word translation problem as a pattern matching problem. • Contribution • Language robustness and noisy parallel corpora. • Drawback • Too long and too complex.
