Word and Phrase Alignment

Word and Phrase Alignment. Presenters: Marta Tatu Mithun Balakrishna. Translating Collocations for Bilingual Lexicons: A Statistical Approach. Frank Smadja, Kathleen R. McKeown and Vasileios Hatzivassiloglou CL-1996. Overview – Champollion.

Presentation Transcript


  1. Word and Phrase Alignment Presenters: Marta Tatu Mithun Balakrishna

  2. Translating Collocations for Bilingual Lexicons: A Statistical Approach Frank Smadja, Kathleen R. McKeown and Vasileios Hatzivassiloglou CL-1996

  3. Overview – Champollion • Translates collocations from English into French using an aligned corpus (Hansards) • The translation is constructed incrementally, adding one word at a time • Correlation method: the Dice coefficient • Accuracy between 65% and 78%

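  4. The Similarity Measure • Dice coefficient (Dice, 1945): $\mathrm{Dice}(X,Y) = \frac{2\,p(X,Y)}{p(X) + p(Y)}$, where $p(X,Y)$, $p(X)$, and $p(Y)$ are the joint and marginal probabilities of X and Y • If the probabilities are estimated using maximum likelihood, then $\mathrm{Dice}(X,Y) = \frac{2\,f_{XY}}{f_X + f_Y}$, where $f_X$, $f_Y$, and $f_{XY}$ are the absolute frequencies of appearance of “1”s for X, for Y, and for both X and Y together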
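
A minimal Python sketch of the frequency form of the Dice coefficient may help make the definition concrete. It assumes X and Y are given as equal-length 0/1 vectors with one entry per aligned sentence pair; the function name `dice` and the toy vectors are illustrative, not taken from the paper.

```python
def dice(x, y):
    """Dice coefficient of two equal-length binary (0/1) occurrence vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    f_x = sum(x)                                     # number of 1s in X
    f_y = sum(y)                                     # number of 1s in Y
    f_xy = sum(1 for a, b in zip(x, y) if a and b)   # positions where both are 1
    if f_x + f_y == 0:
        return 0.0                                   # neither item ever occurs
    return 2.0 * f_xy / (f_x + f_y)


# Toy occurrence vectors over six aligned sentence pairs (made-up data).
english = [1, 1, 0, 1, 0, 0]
french = [1, 1, 0, 0, 0, 1]
score = dice(english, french)   # 2 * 2 / (3 + 3)
```

Note that positions where both vectors are 0 never enter the score, so only the sentences in which at least one of the two items appears affect the result.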

  5. Algorithm - Preprocessing • Source and target language sentences must be aligned (Gale and Church 1991) • List of collocations to be translated must be provided (Xtract, Smadja 1993)

  6. Algorithm 1/3 • Champollion identifies a set S of k words highly correlated with the source collocation • The target collocation is in the powerset of S • These words have a Dice measure ≥ Td (= 0.10) and appear ≥ Tf (= 5) times • Form all pairs of words from S • Evaluate the correlation between each pair and the source collocation (Dice)

  7. Algorithm 2/3 • Keep pairs that score above the threshold Td • Construct 3–word elements containing one of the highly correlated pairs plus a member of S • … • Until for some n ≤ k, no n–word scores above the threshold

  8. Algorithm 3/3 • Champollion selects the best translation among the top candidates • In case of ties, the longer collocation is preferred • Determine whether the selected translation is a single word, a flexible, or a rigid collocation, in case of multiword translations • Are the words used consistently in the same order and at the same distance?
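
The incremental search on slides 6 to 8 can be sketched roughly as follows. This is an illustration under several assumptions, not Champollion itself: each aligned sentence pair is reduced to a pair of word sets, the frequency threshold Tf is applied to the number of sentence pairs in which the source collocation and the candidate co-occur, and the final rigid/flexible word-order analysis is left out. The names `dice_for_group` and `translate` and the toy corpus are hypothetical.

```python
def dice_for_group(pairs, source, group):
    """Return (Dice score, joint frequency) of a source collocation and a target word group."""
    f_src = f_grp = f_both = 0
    for src, tgt in pairs:
        has_src = source <= src        # every source word is in the source sentence
        has_grp = group <= tgt         # every candidate word is in the target sentence
        f_src += has_src
        f_grp += has_grp
        f_both += has_src and has_grp
    total = f_src + f_grp
    return (2.0 * f_both / total if total else 0.0), f_both


def translate(pairs, source, td=0.10, tf=5):
    """Grow candidate translations one word at a time; return (best word set, score)."""
    source = frozenset(source)
    vocab = set().union(*(tgt for _, tgt in pairs)) if pairs else set()
    scored = {}
    for word in vocab:                 # build S: single words passing both thresholds
        group = frozenset([word])
        score, both = dice_for_group(pairs, source, group)
        if score >= td and both >= tf:
            scored[group] = score
    seeds = [next(iter(group)) for group in scored]
    level = dict(scored)
    while level:                       # extend surviving n-word groups to n+1 words
        bigger_level = {}
        for group in level:
            for word in seeds:
                if word in group:
                    continue
                bigger = group | {word}
                if bigger in bigger_level:
                    continue
                score, both = dice_for_group(pairs, source, bigger)
                if score >= td and both >= tf:
                    bigger_level[bigger] = score
        scored.update(bigger_level)
        level = bigger_level           # stops when no (n+1)-word group survives
    if not scored:
        return None, 0.0
    best = max(scored, key=lambda g: (scored[g], len(g)))   # ties go to the longer group
    return set(best), scored[best]


# Toy aligned corpus (made-up); tf is lowered to 2 because the corpus is tiny.
corpus = [
    ({"the", "prime", "minister", "spoke"}, {"le", "premier", "ministre", "a", "parlé"}),
    ({"the", "prime", "minister", "said"}, {"le", "premier", "ministre", "a", "dit"}),
    ({"prime", "minister", "yes"}, {"premier", "ministre", "oui"}),
    ({"the", "minister", "spoke"}, {"le", "ministre", "a", "parlé"}),
    ({"the", "house"}, {"la", "chambre"}),
]
best, best_score = translate(corpus, {"prime", "minister"}, td=0.10, tf=2)
```

On this toy corpus both "premier" alone and the pair "premier ministre" reach the top score, and the tie-break in favour of the longer candidate selects the two-word translation.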

  9. Experimental Setup • DB1 = 3.5 × 10^6 words (8 months of 1986) • DB2 = 8.5 × 10^6 words (1986 and 1987) • C1 = 300 collocations from DB1 of mid-range frequency • C2 = 300 collocations from 1987 • C3 = 300 collocations from 1988 • Three fluent bilingual speakers • Canadian French vs. continental French

  10. Results

  11. Future Work • Translating the closed class words • Tools for the target language • Separating corpus-dependent translations from general ones • Handling low frequency collocations • Analysis of the effects of thresholds • Incorporating the length of the translation into the score • Using nonparallel corpora

  12. Comments

  13. A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Pascale Fung ACL-1995

  14. Goal of the Paper • Create bilingual lexicon of nouns and proper nouns • From unaligned, noisy parallel texts of Asian/Indo-European language pairs • Pattern matching method

  15. Introduction • Previous research on sentence-aligned, parallel texts • Alignment not always practical • Unclear sentence boundaries in corpora • Noisy text segments present in only one language • Two main steps • Find small bilingual primary lexicon • Compute a better secondary lexicon from these partially aligned texts

  16. Algorithm • Tag the English half of the parallel text • Nouns and proper nouns (they have consistent translations over the entire text) • Tagged English part with a modified POS tagger • Find translations for nouns, plural nouns and proper nouns only

  17. Algorithm • Positional Difference Vectors • Correspondence between a word and its translated counterpart • In their frequency • In their positions • Correspondence need not be linear • Calculation • p – position vector of a word • V – positional difference vector • V[i-1] = p[i] – p[i-1]
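
As a small illustration of the calculation on this slide, the sketch below derives a positional difference vector from a word's position vector. It assumes positions are token offsets into one half of the corpus; the helper names and the sample positions are made up.

```python
def positions_of(word, tokens):
    """Position vector p: offsets at which `word` occurs in a token list."""
    return [i for i, token in enumerate(tokens) if token == word]


def positional_difference(positions):
    """Positional difference vector V, with V[i-1] = p[i] - p[i-1]."""
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


# Made-up position vector for one word.
p = [2, 63, 95, 110]
v = positional_difference(p)   # [61, 32, 15]
```

A word that occurs n times yields a vector of length n - 1, so two words with different frequencies produce vectors of different lengths; this is why the vectors cannot simply be compared element by element.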

  18. Algorithm

  19. Algorithm • Match pairs of positional difference vectors, giving scores • Dynamic Time Warping (Fung & McKeown, 1994) • For non-identical vectors • Trace correspondence between all points in V1 and V2 • No penalty for deletions and insertions • Statistical filters

  20. Dynamic Time Warping • Given V1 and V2, which point in V1 corresponds to which point in V2?
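
The question on this slide can be illustrated with a textbook dynamic time warping routine. The sketch below is a generic DTW with absolute difference as the local cost, where stretching one vector against the other adds no penalty beyond the local cost; the exact cost function and path constraints used in the paper are not given on the slides, so treat these choices as assumptions.

```python
def dtw(v1, v2):
    """Return (total cost, warping path) aligning two non-empty numeric vectors."""
    n, m = len(v1), len(v2)
    if n == 0 or m == 0:
        raise ValueError("both vectors must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest alignment of v1[:i] with v2[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(v1[i - 1] - v2[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],  # advance in both vectors
                                     cost[i - 1][j],      # advance in v1 only
                                     cost[i][j - 1])      # advance in v2 only
    # Backtrack from the end to recover which points correspond.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Made-up positional difference vectors of different lengths.
total, path = dtw([61, 32, 15], [60, 30, 2, 15])
```

Each `(i, j)` in the returned path says that point `i` of the first vector corresponds to point `j` of the second; a lower total cost suggests the two words are distributed more similarly through the texts.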

  21. Algorithm

  22. Algorithm • Finding anchor points and eliminating noise • Every word pair selected to run DTW • Obtain DTW score • Obtain DTW path • Plot DTW paths of all such word pairs • Keep highly reliable points and discard rest • Point (i,j) is noise if

  23. Algorithm

  24. Algorithm • Finding low frequency bilingual word pairs • Non-linear segment binary vectors • V1[i] = 1 if word occurs in ith segment • Binary vector correlation measure
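
A rough sketch of the segment binary vectors follows. It assumes the anchor points have already split both texts into the same number of corresponding segments, each given as a set of words. The slide does not say which correlation measure is used, so pointwise mutual information is swapped in here purely as a stand-in; all names and data are illustrative.

```python
import math


def segment_vector(word, segments):
    """V[i] = 1 if `word` occurs in the i-th segment, else 0."""
    return [1 if word in segment else 0 for segment in segments]


def mutual_information(v1, v2):
    """Pointwise mutual information (base 2) of two binary segment vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must cover the same segments")
    n = len(v1)
    both = sum(a & b for a, b in zip(v1, v2))   # segments containing both words
    if both == 0:
        return float("-inf")                    # the words never share a segment
    return math.log2(both * n / (sum(v1) * sum(v2)))


# Made-up segments; "mot" stands for a candidate translation of "word".
segments_a = [{"word", "x"}, {"y"}, {"word"}, {"z"}]
segments_b = [{"mot", "u"}, {"v"}, {"mot"}, {"w"}]
mi = mutual_information(segment_vector("word", segments_a),
                        segment_vector("mot", segments_b))
```

Because the vectors record only presence per segment, the comparison does not need the word counts inside a segment to match, which is what makes it usable for low-frequency words.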

  25. Results

  26. Comments

  27. Automated Dictionary Extraction for “Knowledge-Free” Example-Based Translation Ralf D. Brown TMIMT-1997

  28. Goal of the Paper • Extract a bilingual dictionary • Using an aligned bilingual corpus • Perform tests to compare the performance of PanEBMT using • Collins Spanish-English dictionary + WordNet English root/synonym list • Various automatically extracted bilingual dictionaries

  29. Introduction

  30. Extracting Bilingual Dictionary • Extracted from corpus using • Correspondence table • Threshold Schema • Correspondence Table • Two dimensional array • Indexed by source language words • Indexed by target language words • Cross-product word entries of each sentence pair are incremented
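
The correspondence table described above can be sketched as a nested dictionary of counts. The version below assumes each sentence pair is a pair of token lists and that a word pair is counted once per sentence pair, however often the words repeat; both points are assumptions, and `build_correspondence_table` is a hypothetical name.

```python
from collections import defaultdict


def build_correspondence_table(sentence_pairs):
    """table[source_word][target_word] = number of sentence pairs containing both."""
    table = defaultdict(lambda: defaultdict(int))
    for source_tokens, target_tokens in sentence_pairs:
        # Increment every entry in the cross-product of the two sentences.
        for s in set(source_tokens):
            for t in set(target_tokens):
                table[s][t] += 1
    return table


# Made-up sentence pairs.
table = build_correspondence_table([
    (["the", "house"], ["la", "maison"]),
    (["the", "car"], ["la", "voiture"]),
])
```

A sparse nested dictionary is used here instead of a literal two-dimensional array because most source/target word combinations never co-occur.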

  31. Extracting Bilingual Dictionary • Biased for language pairs with similar word orders • Threshold setting • A step function • Unreachably high for co-occurrence < MIN • Constant otherwise • A sliding scale • Start at 1.0 for co-occurrence = 1 • Slide smoothly to MIN threshold value
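
The two threshold schemes can be sketched as functions of the co-occurrence count. The slide fixes only the endpoints of the sliding scale, so the linear interpolation below, and the parameters `floor` (the minimum threshold value) and `reach` (the count at which the minimum is reached), are assumptions made for illustration.

```python
def step_threshold(cooccurrence, min_count, constant):
    """Unreachably high below min_count, constant otherwise."""
    return float("inf") if cooccurrence < min_count else constant


def sliding_threshold(cooccurrence, floor, reach):
    """Start at 1.0 for a count of 1 and slide linearly down to `floor` at `reach`."""
    if reach <= 1:
        raise ValueError("reach must be greater than 1")
    if cooccurrence < 1:
        return float("inf")
    if cooccurrence >= reach:
        return floor
    fraction = (cooccurrence - 1) / (reach - 1)   # 0.0 at count 1, 1.0 at `reach`
    return 1.0 + fraction * (floor - 1.0)


# Hypothetical settings: thresholds for a word pair seen together three times.
step_value = step_threshold(3, min_count=2, constant=0.5)
slide_value = sliding_threshold(3, floor=0.5, reach=5)
```

The sliding scale lets rarely seen pairs into the dictionary only when their evidence is nearly perfect, whereas the step function rejects them outright.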

  32. Extracting Bilingual Dictionary • Filtering • Symmetric threshold • Asymmetric threshold • Any elements of Correspondence table which fail both tests set to zero • Non-zero elements added to dictionary

  33. Extracting Bilingual Dictionary - Results

  34. Extracting Bilingual Dictionary - Errors • High-frequency, error-ridden terms • Short-list high-frequency words (all words which appear in at least 20% of source sentences) • Short-list sentence pairs containing exactly one or two high-frequency words • Results in 7 of 16 words – Zero error • Merge with results from first pass

  35. Experimental Setup • Manually created tokenization – 47 equivalence classes, 880 words and translations of each word • Two test texts • 275 UN corpus sentences : in-domain • 253 Newswire sentences : out-of-domain

  36. Results

  37. Comments

  38. Extracting Paraphrases from a Parallel Corpus Regina Barzilay and Kathleen R. McKeown ACL-2001

  39. Overview • Corpus-based unsupervised learning algorithm for paraphrase extraction • Lexical paraphrases (single and multi-word) • (refuse, say no) • Morpho-syntactic paraphrases • (king’s son, son of the king) • (start to talk, start talking) • Phrases which appear in similar contexts are paraphrases

  40. Data • Multiple English translations of literary texts written by foreign authors • Madame Bovary, Fairy Tales, Twenty Thousand Leagues Under the Sea, etc. • 11 translations

  41. Preprocessing • Sentence alignment • Translations of the same source contain a number of identical words • 42% of the words in corresponding sentences are identical (average) • Dynamic programming (Gale & Church, 1991) • 94.5% correct alignments (127 sentences) • POS tagger and chunker → NP and VP

  42. Algorithm – Bootstrapping • Co-training method: DLCoTrain (Collins & Singer, 1999) • Similar contexts surround two phrases → paraphrase • Having good paraphrase predictor contexts → new paraphrases • Analyze contexts surrounding identical words in aligned sentence pairs • Use these contexts to learn new paraphrases

  43. Feature Extraction • Paraphrase features • Lexical: tokens for each phrase in the paraphrase pair • Syntactic: POS tags • Contextual features: left and right syntactic contexts surrounding the paraphrase (POS n-grams) • Example: “tried to comfort her”: left1 = “VB1 TO2”, right1 = “PRP$3”; “tried to console her”: left2 = “VB1 TO2”, right2 = “PRP$3”
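
The contextual features can be sketched as the POS tags immediately to the left and right of a phrase. The sketch assumes a sentence is a list of (token, tag) pairs that are already tagged, uses the tags shown on the slide, and leaves out the numeric indices attached to the tags there; `context_features` is a hypothetical helper.

```python
def context_features(tagged, start, end, width=2):
    """Return (left, right) POS contexts of the phrase tagged[start:end]."""
    tags = [tag for _, tag in tagged]
    left = " ".join(tags[max(0, start - width):start])   # up to `width` tags before
    right = " ".join(tags[end:end + width])              # up to `width` tags after
    return left, right


# Hand-tagged toy sentences following the slide's example.
first = [("tried", "VB"), ("to", "TO"), ("comfort", "VB"), ("her", "PRP$")]
second = [("tried", "VB"), ("to", "TO"), ("console", "VB"), ("her", "PRP$")]
comfort_context = context_features(first, 2, 3)
console_context = context_features(second, 2, 3)
```

Both phrases receive the same left and right contexts, which is the kind of evidence the method uses to propose "comfort" and "console" as paraphrases.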

  44. Algorithm • Initialization • Identical words are the seeds (positive paraphrasing examples) • Negatives are created by pairing each word with all the other words in the sentence • Training of the context classifier • Record contexts around positive and negative paraphrases of length ≤ 3 • Identify the strong predictors based on their strength and frequency

  45. Algorithm • Keep the most frequent k = 10 contexts with a strength > 95% • Training of the paraphrasing classifier • Using the context rules extracted previously, derive new pairs of paraphrases • When no more paraphrases are discovered, stop
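
The context-selection step can be sketched as follows. It assumes the strength of a context is the fraction of its occurrences that surround positive examples, which the slides do not spell out, and it keeps the k most frequent contexts whose strength exceeds the cutoff. The function name and the counts are made up.

```python
from collections import Counter


def select_contexts(positive_contexts, negative_contexts, k=10, min_strength=0.95):
    """Return up to k (context, positive count, strength) triples, most frequent first."""
    positives = Counter(positive_contexts)
    negatives = Counter(negative_contexts)
    strong = []
    for context, count in positives.items():
        strength = count / (count + negatives[context])   # share of positive uses
        if strength > min_strength:
            strong.append((context, count, strength))
    # Most frequent first; ties broken by the context itself for a stable order.
    strong.sort(key=lambda item: (-item[1], item[0]))
    return strong[:k]


# Made-up context observations, each a (left, right) pair of POS n-grams.
seen_with_positives = [("VB TO", "PRP$")] * 20 + [("DT", "NN")] * 5
seen_with_negatives = [("VB TO", "PRP$")] * 1 + [("DT", "NN")] * 5
rules = select_contexts(seen_with_positives, seen_with_negatives)
```

In a full bootstrapping loop the selected contexts would be applied to the aligned sentences to propose new paraphrase pairs, which in turn supply new positive examples for the next round.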

  46. Results • 9483 paraphrases, 25 morpho-syntactic rules • Out of 500: 86.5% (without context), 91.6% (with context) correct paraphrases • 69% recall evaluated on 50 sentences

  47. Future Work • Extract paraphrases from comparable corpora (news reports about the same event) • Improve the context representation

  48. Comments

  49. Thank You !
