1 / 48

Using Pivot/Bridge Languages

Using Pivot/Bridge Languages. Matthias Eck. General Problem. Resources are available between languages A and B … and between languages B and C … but not C and A How to train translation models between C and A?. A. C. B. 1 st paper.

salaam
Download Presentation

Using Pivot/Bridge Languages

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using Pivot/Bridge Languages Matthias Eck

  2. General Problem • Resources are available between languages A and B… and between languages B and C… but not C and A • How to train translation models between C and A? A C B

  3. 1st paper Multipath Translation Lexicon Induction via Bridge Languages • Gideon S. Mann and David Yarowsky • NAACL 2001 • Method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages

  4. Lexicon via Cognate pairs Lexicon: • Mapping of word in source language to words in target language Here: • Lexicon is built between arbitrary languages using models of cognate pairs and cognate distance

  5. General idea dictionary cognate model Romance Family English Spanish Portuguese Italian French Romanian source bridge target

  6. Translation pairs • Cognate pairs can make up significant portion of lexicon if languages are in the same family and close

  7. Cognate string edit distance • Obvious condition for a good distance D • So we choose …as the translation for s

  8. Used distance measures • L: Levenshtein distance • Minimum sum of the costs of edit operations required to transform one string into another • Deletion, Substitution, Insertion – traditional cost 1 • S: Stochastic transducers • Probabilistic costs for each possible edit operation • H: Hidden Markov Model • Each character has separate edit operation parameters

  9. Distance Measures Variants of Levenshtein distance: • L-V: vowel substitution cost only: 0.5 • L-S/L-A: Filter probabilities obtained by S into 3 classes 0.5, 0.75, 1 • L-S: Each pair separately trained • L-A: Collectively trained for all Romance languages Limitation • Method cannot discover translation pairs with having no surface form relationship • Assumed cognate pairs: Levenshtein edit distance < 3 • Few false positives

  10. Intra Family Translation Lexicon Induction • Family: Romance languages • Available: dictionary (English/Bridge language) • General evaluation algorithm: • Select 100 word pairs from dictionary for testing • For adaptive metrics: Select hypothesized word pairs (Edit distance < 3) as cognate pairs and train on them • For each word in source language select closest word from the 100 target words

  11. Results Source Languages: • Spanish, French, Italian, Romanian Target Language: • Portuguese • 1000 word pairs in dictionary for Spanish/Portuguese • 900 for other language pairs

  12. Results • Pure Levenshtein distance works surprisingly well • S gives boost on French-Portuguese • Reason could be that Spanish-Portuguese are closer than French-Portuguese • L-S usually best

  13. Consonant-to-consonant • Consonant-to-consonant edit operations • Most probable for French – Portuguese

  14. Analysis

  15. Analysis - Example

  16. Multiple bridge languages dictionary cognate model Slavic Family English Czech Russian Ukrainian Polish Serbian source bridge target

  17. Translation Lexicon Induction Algorithm (One or more bridge languages) For each word s  S For each bridge language B Translate s → b  B t  T, Calculate D(b,t) Rank t by D(b,t) Score t using information from all bridges Select highest scored t Map s → t

  18. Results • One bridge languages, but multiple pathes

  19. Examples

  20. Different Pathways • English to Portuguese (via Romance languages) • English to Norwegian (via Germanic languages) • English to Ukrainian (via Slavic languages) • Portuguese to English (via Germanic languages, French)

  21. Results

  22. 2nd Paper Inducing Translation Lexicons via Diverse Similarity Measures and Bridge Languages • Charles Schafer and David Yarowsky • COLING 2002 • Improves results of first paper by introducing additional similarity scores between candidate translations

  23. Basic Idea • Decompose: • P(English|Serbian) = P(English|Czech) x P(Czech|Serbian) • For any language L close to Czech: • P(English|L) = P(English|Czech) x P(Czech|L) • P (Czech|L) as presented was done using similarity on cognate pairs

  24. Covered Languages Serbian Ukrainian English Slovene Czech Slovak Bulgarian Polish Punjabi Nepali Hindi Gujarati Bengali Marathi

  25. Serbian – Czech – English Czech – English dictionary: 171k word pairs Corpora:English: 192M wordsSerbian: 12M(News data from web) Gujarati – Hindi – English Hindi – English dictionary:74k word pairs Corpora:Gujarati: 2M Resources

  26. Problem with Cognate Pairs Serbian Czech English favor not correct prazan prizen grace pazen patronage blank prazdny correct empty

  27. Idea Introduce additional similarity models • Weighted Levenshtein Similarity • Context Similarity • Date distributional Similarity • Relative frequency Similarity • Burstiness Similarity and Inverse Document Frequency • Use of Additional Bridge Languages • Combine them with weighted string distance

  28. Weighted Levenshtein Similarity • 1. Iteration: Vowel cluster operations have half the cost of single consonant substitutions, insertions and deletions • dist(vowel+, vowel+) • Use highest weighted of the top 2000 to re-estimate edit weights • Some high probability substitutions:

  29. Context Similarity Compare narrow and wide contexts for candidates Context: bag of words (Narrow: radius 1/ Wide: radius 10) • Calculate Context on Source Language (Serbian) • Translate to English using current estimations • Compare with English Contexts via Cosine Similarity

  30. Context Similarity - Example Nezavisnost pravo: 2 suvereniteti: 3 deklaracije: 3 pokrajina: 4 Context in Serbian Corpus with frequencies

  31. 2 1.5 1.5 1.5 4 1.5 Context Similarity - Example Nezavisnost pravo: 2 suvereniteti: 3 deklaracije: 3 pokrajina: 4 majesty declaration justice sovereignty country ornamental Translate with Initial Lexicon

  32. 2 1.5 1.5 1.5 4 1.5 Context Similarity - Example Nezavisnost pravo: 2 suvereniteti: 3 deklaracije: 3 pokrajina: 4 majesty declaration 0 0 justice sovereignty country ornamental Independence 3 1 10 0 479 836 191 0 Freedom 681 184 104 0 21 4 141 0 expression Context of Candidates in English Corpus religion

  33. 2 1.5 1.5 1.5 4 1.5 Context Similarity - Example Nezavisnost pravo: 2 suvereniteti: 3 deklaracije: 3 pokrajina: 4 majesty declaration 0 0 justice sovereignty country ornamental COS Independence 3 1 10 0 479 836 191 0 Freedom 681 184 104 0 21 4 141 0 expression Cosine Similarity finds correct candidate (Independence) religion

  34. Date distributional Similarity • News Data: • Events are reported in parallel in multiple languages (+/- 2 days) • Construct term frequency vectors over time and compare candidates

  35. Date distributional Similarity

  36. Relative Frequencies • Word and translation are likely to have similar relative frequencies • Modest frequency variations are expected • Useful to rule out pairings with several orders of magnitude difference in relative frequency • Ratio of logs of frequencies correlates well with translational compatibility

  37. Relative Frequency Similarity • Correct translation “laud” has higher RF Score than higher ranked incorrect candidates “calibre”, “quarter” and “class”

  38. Burstiness Similarity • Define Burstiness to measure differences

  39. Burstiness Similarity • Burstiness matches better for correct translations “laud” and “praise”

  40. Combine the different measures • Weighted Levenshtein distance to get initial candidate pairs • Calculate 8 similarity measures • Weighted Levenshtein • Wide bag-of-words context similarity • Narrow bag of words context similarity • Local News date distribution similarity • All News date distribution similarity • IDF similarity • Burstiness similarity

  41. Combine the different measures • Integrate similarity measures into a single similarity function: • POS SimilarityBias in favor of compatible parts of speech (N, V, ADJ)Penalty for non-matching candidates • Sort candidates for each score in decreasing orderAssign Ranks 0,1,… and normalize by count • Scoring: Similarity models have associated weights

  42. Weight Allocation

  43. Evaluation 3 Evaluation Criteria • Exact Match Accuracy • Percentage of correct English in the top k ranks • Median Position of the per word highest ranked correct translation

  44. Results

  45. Results • Improvements with second bridge language

  46. Additional Bridge Language Work Interlingua based Statistical Machine Translation • Manuel Kauers, Stephan Vogel, Christian Fügen, Alex Waibel • ICSLP 2002 • Paper covers SMT from Text to a structured Interlingua format (IF) • Corpus English/IF is available…but we also want to translate other languages into IF? English IF

  47. Generalized problem • Assume we have translation model F to E and G to F… but we want G to E? • Decompose: • Because: E G F

  48. And just translating… • Experiments done during PF-STAR project 2003/2004 • Training data: 48k lines of BTEC data • Test data: 506 lines, Test set for CSTAR 2003 • Translating Chinese → Italian • Also via a bridge language Chinese → English → Italian

More Related