  1. A Technical Word and Term Translation Aid using Noisy Parallel Corpora across Language Groups • Pascale Fung, Kathleen McKeown • Machine Translation, 1997 • Advisor: Dr. Hsu • Student: Sheng-Hsuan Wang • Department of Information Management

  2. Outline • Motivation • Objective • Introduction • Related work • Noisy parallel corpora across language groups • Algorithm overview • Experiments • Conclusion

  3. Motivation • Technical term translation is a difficult task • Translation quality depends on domain-specific terminology • Such terminology is not adequately covered by printed dictionaries • Terms must often be mined from noisy parallel corpora • Examples: • Hong Kong Governor / 香港總督 • Basic Law / 基本法 • Green Paper / 綠皮書

  4. Objective • This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups

  5. 1. Introduction • Technical terms • often cannot be translated on a word-by-word basis • The individual words of a term may have many possible translations • Example: Governor • 總督, 主管 (top manager), 總裁 (chief), 州長 (of a state) • Hong Kong Governor – 香港總督 • Domain-specific terms • Basic Law / 基本法 • Green Paper / 綠皮書

  6. 1. Introduction • An algorithm for translating technical terms given a noisy parallel corpus as input • Notion • Similar words won't occur at the exact same position in each half of the corpus • Distances between instances of the same word will be similar across languages • Method • Find word correlations, then build technical term translations • Dynamic time warping algorithm • Reliable anchor points

  7. 2. Related work • Sentence alignment • Segment alignment • Word and term translation • Word alignment • Phrase translation

  8. 2.1. Sentence alignment • Two main approaches • Text-based: use of lexical information (dictionary) • Use paired lexical indicators across the languages to find matching sentences. • Length-based: use of the total number of characters (words) • Make the assumption that translated sentences in the parallel corpus will be of approximately the same, or constantly related, length.

  9. 2.2. Segment alignment • Church (1993) shows that a text can be aligned using delimiters • Segment alignment is more appropriate for aligning noisy corpora • The problem is finding reliable anchor points that can be used for Asian/Romance language pairs

  10. 2.3. Word and term translation • Some algorithms used for alignment produce a small bilingual lexicon. • Some others use sentence-aligned parallel text. • Most of the following algorithms require clean, sentence-aligned parallel text input.

  11. 2.4. Word alignment • [Brown et al. 1990, Brown et al. 1993] • [Gale & Church 1991] • [Dagan et al. 1993] • [Wu & Xia 1994] • Various filtering techniques are used to improve the matching.

  12. 2.5. Phrase translation • [Kupiec1993] • [Smadja & McKeown1993] • [Dagan & Church1994] • All the work described in this section assumes a clean, parallel corpus as input.

  13. 3. Noisy parallel corpora across language groups • Previous approaches lack robustness • against structural noise in parallel corpora • against language pairs that don't share etymological roots • Remaining problems • Bilingual texts that are translations of each other but are not translated sentence by sentence • Language robustness

  14. 3. Noisy parallel corpora across language groups • Two noisy parallel corpora • The English version of the AWK manual and its Japanese translation • Parts of the HKUST English-Chinese Bilingual Corpora

  15. 4. Algorithm overview • Treat the domain word translation problem as a pattern matching problem • Each word shares some common features with its counterpart in the translated text. • To find the best representations of these features and the best ways to match them.

  16. Algorithm overview • Preprocessing: tag the English text to form an English word list; tokenize the Japanese and Chinese texts to form a word list • 1. Compile a primary lexicon • 2. Find anchor points for alignment • 3. Align the text • 4. Compile a secondary lexicon • 5. Compile non-linear segment boundaries with high-frequency word pairs • 6. Compile the bilingual word lexicon • 7. Suggest a word list for each technical term to the translator

  17. 5. Extracting technical terms from English text • To find domain-specific terms, we tagged the English part of the corpus with a modified POS tagger • Extracted the noun phrases that are most likely to be technical terms • Translations are sought only for words that are part of these terms

  18. 6. Tokenization of Chinese and Japanese texts • Tokenization of the Chinese text is done by using a statistically augmented dictionary-based tokenizer which is able to recognize frequent domain words. • Example: 基本法/Basic Law • The Japanese text is tokenized by JUMAN without domain word augmentation.

  19. 7. A rough word-pair-based alignment • Treat translation as a pattern matching task • The task is to find a representation and similarity measure that can identify word pairs to serve as anchor points

  20. 7.1. Dynamic Recency Vectors • A word's recency vector holds the gaps between its successive positions in the text • Governor • Word positions: <2380, 2390, 2463, …>, length 212 • Recency vector: <10, 73, 102, …> • 總督 • Word positions: <90, 2021, 2150, …>, length 254 • Recency vector: <1931, 129, 8, …>
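The recency representation on this slide is just the sequence of gaps between successive occurrences of a word. A minimal sketch (the function name is mine, not from the paper):

```python
def recency_vector(positions):
    """Gaps between successive occurrence positions of a word."""
    return [b - a for a, b in zip(positions, positions[1:])]

# The slide's examples: Governor at 2380, 2390, 2463 gives gaps 10, 73;
# 總督 at 90, 2021, 2150 gives gaps 1931, 129.
print(recency_vector([2380, 2390, 2463]))  # [10, 73]
print(recency_vector([90, 2021, 2150]))    # [1931, 129]
```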

  21. Recency vector signals • [figure: recency vector plots for Governor.en, Governor.ch, Bill.ch and President.en]

  22. 7.2. Matching Recency Vectors • Dynamic time warping (DTW) • Takes two vectors of lengths N and M and finds an optimal path through the N-by-M trellis, starting from (1,1) and ending at (N,M) • [figure: DTW trellis for 總督 / Governor]

  23. DTW algorithm • Initialization • Costs are initialized according to the recency vector values

  24. DTW algorithm • Recursion • Accumulate the cost of the DTW path

  25. DTW algorithm • Termination • The final cost of the DTW path is normalized by the length of the path

  26. DTW algorithm • Path reconstruction • Reconstruct the DTW path and obtain the points on it • These points are later used for finding anchor points and eliminating noise

  27. DTW algorithm • For each word vector in language A, the word vector in language B with the lowest DTW score is taken to be its translation • We thresholded the bilingual word pairs obtained from the above stages and stored the more reliable pairs as our primary bilingual lexicon
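The initialization/recursion/termination steps of slides 23-25 can be sketched as one function: accumulate |v[i] − w[j]| along the cheapest monotonic path and normalize by the path length. This is an illustrative reconstruction, since the slides do not give the paper's exact cost function:

```python
def dtw_score(v, w):
    """Normalized DTW distance between two recency vectors.
    Accumulates |v[i] - w[j]| along the cheapest monotonic path
    from (1,1) to (N,M), then divides by the path length."""
    n, m = len(v), len(w)
    INF = float("inf")
    # D[i][j] = (accumulated cost, path length so far)
    D = [[(INF, 0)] * (m + 1) for _ in range(n + 1)]
    D[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(v[i - 1] - w[j - 1])
            # Cheapest of the three admissible predecessor cells
            prev_cost, prev_len = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
            D[i][j] = (prev_cost + cost, prev_len + 1)
    total, length = D[n][m]
    return total / length
```

Each candidate word pair is scored this way; for a language-A word, the language-B word with the lowest score is kept as its translation (subject to a reliability threshold).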

  28. 7.3. Statistical filters • To avoid the cost of matching all possible pairs, we incorporated constraints that filter the set of candidates • Starting-point constraint, i.e., position constraint • Length constraint, i.e., frequency constraint • Mean/standard-deviation constraint
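These three filters amount to cheap tests run before the expensive DTW match. A sketch under that reading; the function name and all thresholds are illustrative placeholders, not the paper's values, and only the mean is checked (a standard-deviation check would be analogous):

```python
from statistics import mean

def passes_filters(pos_a, pos_b, len_a, len_b,
                   start_tol=0.3, freq_ratio=2.0, gap_tol=0.5):
    """Prune a candidate word pair before DTW (illustrative thresholds)."""
    # Position constraint: first occurrences at similar relative offsets
    if abs(pos_a[0] / len_a - pos_b[0] / len_b) > start_tol:
        return False
    # Frequency constraint: occurrence counts within a fixed ratio
    fa, fb = len(pos_a), len(pos_b)
    if max(fa, fb) > freq_ratio * min(fa, fb):
        return False
    # Mean constraint on the recency (gap) vectors
    gaps_a = [b - a for a, b in zip(pos_a, pos_a[1:])]
    gaps_b = [b - a for a, b in zip(pos_b, pos_b[1:])]
    ma, mb = mean(gaps_a), mean(gaps_b)
    return abs(ma - mb) <= gap_tol * max(ma, mb)
```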

  29. 8. Finding anchor points and eliminating noise • The primary lexicon is used for aligning the segments in the corpus • Find anchor points on the DTW paths that divide the texts into multiple aligned segments for the secondary lexicon • We keep an anchor point (i,j) only if it satisfies all of the following: • slope constraint • continuity constraint • window-size constraint • offset constraint

  30. 8. Finding anchor points and eliminating noise • [figure: text alignment paths for the AWK and HKUST corpora, showing all word pairs and the path after filtering]

  31. 9. Finding bilingual word pair matches • To obtain the secondary and final bilingual word lexicon • A non-linear K segment binary vector representation for each word. • A similarity measure to compute word pair correlations.

  32. 9.1. Non-Linear K segments • The anchor points <(i1,j1), (i2,j2), …, (ik,jk)> divide a bilingual corpus into k+1 non-linear segments, where the i's are positions in text1 and the j's are positions in text2 • The algorithm then proceeds to obtain a secondary bilingual lexicon, considering words of both high and low frequency

  33. 9.2. Non-Linear segment binary vectors • Represent the co-occurrences of a pair of translated words in a bilingual corpus, i.e., compute the correlation between two words • Estimate Pr(ws, wt), the probability of the words occurring in the same place in the corpus • Binary vector where the i-th bit is set to 1 if the word is found in the i-th segment • [figure: the binary vector for governor over the K segments]
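Building this binary vector from a word's positions and the segment boundaries is straightforward; `segment_vector` is a hypothetical helper name, with k sorted anchor positions giving k+1 segments:

```python
from bisect import bisect_right

def segment_vector(positions, anchors, k):
    """Binary vector over the k+1 non-linear segments: bit i is 1
    iff the word occurs in segment i. `anchors` holds the k sorted
    segment-boundary positions for this half of the corpus."""
    vec = [0] * (k + 1)
    for p in positions:
        vec[bisect_right(anchors, p)] = 1
    return vec
```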

  34. 9.2. Non-Linear segment binary vectors • A 2×2 (T/F) contingency table over the segments counts how many segments contain both words (a), only one of them, or neither • If the source and target words are good translations of one another, then a should be large

  35. 9.3. Binary vector correlation measure • The similarity measure is a weighted mutual information score
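The slide's formula did not survive extraction. The standard weighted mutual information is W(ws, wt) = Pr(ws, wt) · log(Pr(ws, wt) / (Pr(ws) Pr(wt))), with probabilities estimated as segment-count fractions from the binary vectors. A sketch under that assumption; the paper's exact weighting may differ:

```python
from math import log2

def weighted_mi(vs, vt):
    """Weighted mutual information between two binary segment vectors.
    Probabilities are estimated as fractions of the K segments; the
    exact weighting in the paper may differ (its formula is missing)."""
    k = len(vs)
    a = sum(1 for x, y in zip(vs, vt) if x and y)  # segments with both words
    p_s, p_t, p_st = sum(vs) / k, sum(vt) / k, a / k
    if p_st == 0:
        return 0.0
    return p_st * log2(p_st / (p_s * p_t))
```

Word pairs that co-occur in many of the same segments (a large, per slide 34) score high; pairs that never share a segment score zero.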

  36. 10. Word translation results

  37. 11. Term translations from word groups

  38. Term translation aid result

  39. Conclusion • A technique to align noisy parallel corpora by segments and to extract a bilingual word lexicon from them • Substitutes the sentence-alignment step with a rough segment alignment • Works without sentence-boundary information and in the presence of noise • Highly reliable anchor points, found with DTW, serve as segment delimiters

  40. Personal opinion • Valuable idea • Treat the domain word translation problem as a pattern matching problem. • Contribution • Language robustness and noisy parallel corpora. • Drawback • Too long and too complex.