Building Lexicons

  1. Building Lexicons Jae Dong Kim Matthias Eck

  2. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  3. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  4. Definitions • Translational equivalence: a relation that holds between two expressions with the same meaning, where the two expressions are in different languages • Statistical translation models: statistical models of translational equivalence • Empirical estimation of statistical translation models is typically based on parallel texts or bitexts • Word-to-word lexicon: • A list of word pairs (source word, target word) • Bidirectional • Probabilistic word-to-word lexicon: (source word, target word, prob.)

  5. Additional Universal Property • Translation models benefit from the best of both the empiricist and rationalist traditions • Models to be proposed: • Most word tokens translate to only one word token: approximated by the one-to-one assumption (Method A) • Most text segments are not translated word for word: explicit noise model (Method B) • Different linguistic objects have statistically different behavior in translation: translation models on different word classes (Method C) • Human judgment has shown that each of the three estimation biases improves translation model accuracy over a baseline knowledge-free model

  6. Applications of Translation Models • Where word order is not important • Cross-language information retrieval • Multilingual document filtering • Computer-assisted language learning • Certain machine-assisted translation tools • Concordancing for bilingual lexicography • Corpus linguistics • “crummy” machine translation • Where word order is important • Speech transcription for translation • Bootstrapping of OCR systems for new languages • Interactive translation • Fully automatic high-quality machine translation

  7. Advantages of translation models • Compared to handcrafted models: • The possibility of better coverage • The possibility of frequent updates • More accurate information about the relative importance of different translations • [Diagram: a query Qi is translated (T) into Q' and run against an IR database; caption: “Uniform Importance?”]

  8. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  9. Models of Co-occurrence • Intuition: words that are translations of each other are more likely to appear in corresponding bitext regions than other pairs of words. • A boundary-based model assumes that both halves of the bitext have been segmented into s segments, so that segment Ui in one half of the bitext and segment Vi in the other half are mutual translations, 1 ≤ i ≤ s • Co-occurrence count by Brown et al. • Co-occurrence count by Melamed
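
The counting formulas themselves did not survive the transcript. As a minimal sketch, the following assumes the commonly cited definitions: under Brown et al.'s models every token of one half co-occurs with every token of the other half of the same segment pair (a product of frequencies), while the count consistent with Melamed's one-to-one assumption takes the minimum of the two frequencies. The toy bitext and function names are illustrative only.

```python
from collections import Counter
from itertools import product

def cooc_counts(bitext, scheme="min"):
    """Boundary-based co-occurrence counts over aligned segment pairs.

    scheme="product": each token of U_i co-occurs with each token of V_i
                      (product-of-frequencies counting, as in Brown et al.).
    scheme="min":     at most min(freq in U_i, freq in V_i) co-occurrences
                      per segment pair (counting consistent with one-to-one).
    """
    cooc = Counter()
    for seg1, seg2 in bitext:                 # both halves already segmented
        c1, c2 = Counter(seg1), Counter(seg2)
        for u, v in product(c1, c2):
            if scheme == "product":
                cooc[(u, v)] += c1[u] * c2[v]
            else:
                cooc[(u, v)] += min(c1[u], c2[v])
    return cooc

# Toy bitext: (segment in one language, its translation in the other).
bitext = [(["he", "nods"], ["il", "hoche"]),
          (["he", "sleeps"], ["il", "dort"])]
print(cooc_counts(bitext)[("he", "il")])      # -> 2
```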

  10. Nonprobabilistic Translation Lexicons (1) • Summary of non-probabilistic translation lexicon algorithms: • Choose a similarity function S between word types in L1 and word types in L2 • Compute association scores S(u,v) for a set of word type pairs (u,v) ∈ L1 × L2 that occur in the training data • Sort the word pairs in descending order of their association scores • Discard all word pairs for which S(u,v) is less than a chosen threshold. The remaining word pairs become the entries in the translation lexicon • Main difference: the choice of similarity function • Those functions are based on a model of co-occurrence with some linguistically motivated filtering
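
The four steps above amount to a generic recipe in which the similarity function is the only pluggable part; a minimal sketch (function and parameter names are illustrative, not from the slides):

```python
def build_lexicon(candidates, similarity, threshold):
    """Generic non-probabilistic lexicon extraction (steps 1-4 above).

    candidates: word-type pairs (u, v) observed in the training data
    similarity: chosen association function S(u, v)
    threshold:  pairs scoring below this value are discarded
    """
    scored = [(u, v, similarity(u, v)) for u, v in candidates]      # step 2
    scored.sort(key=lambda entry: entry[2], reverse=True)           # step 3
    return [(u, v, s) for u, v, s in scored if s >= threshold]      # step 4
```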

  11. Nonprobabilistic Translation Lexicons (2) • Problem: the independence assumption in step 2 • Models of translational equivalence that are ignorant of indirect association have “a tendency … to be confused by collocates” • If all the entries in a translation lexicon are sorted by their association scores, the direct associations will be very dense near the top of the list, and sparser towards the bottom • [Diagram: “He nods his head” / “Il hoche la tête”, contrasting a direct association (a true translation pair) with an indirect association between collocates]

  12. Nonprobabilistic Translation Lexicons (3) • The very top of the list can be over 98% correct - Gale and Church (1991) • Gleaned lexicon entries for about 61% of the word tokens in a sample of 800 English sentences • Selected only entries with a high association score • The 61% of word tokens represent only 4.5% of word types • 71.6% precision with the top 23.8% of noun-noun entries - Fung (1995) • Automatic acquisition of 6,517 lexicon entries with 86% precision from a 3.3-million-word corpus - Wu & Xia (1994) • 19% recall • Weighted precision: in {(E1,C1,0.533), (E1,C2,0.277), (E1,C3,0.190)}, if (E1,C3,0.190) is wrong, we get a precision of 0.810 • Higher than the unweighted precision
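
The weighted-precision arithmetic in the last bullets can be spelled out; a worked version, reading each entry's probability as its weight:

```latex
\text{weighted precision}
  = \frac{\sum_{\text{correct entries}} p}{\sum_{\text{all entries}} p}
  = \frac{0.533 + 0.277}{0.533 + 0.277 + 0.190}
  = \frac{0.810}{1.000}
  = 0.810
```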

  13. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  14. Decomposition of Translation Model (1) • Two-stage decomposition of the sequence-to-sequence model • First stage: • Every sequence L is just an ordered bag, and the bag B can be modeled independently of its order O

  15. Decomposition of Translation Model (2) • First Stage: • Let L1 and L2 be two sequences and let A be a one-to-one mapping between the elements of L1 and the elements of L2

  16. Decomposition of Translation Model (2) • First Stage: • Let L1 and L2 be two sequences and let A be a one-to-one mapping between the elements of L1 and the elements of L2

  17. Decomposition of Translation Model (3) • First Stage: • Bag-to-bag translation model

  18. Decomposition of Translation Model (4) • Second Stage: • From bags of words to the words that they contain • Bag pair generation process: how the word-to-word model is embedded • Generate a bag size l; l is also the assignment size • Generate l language-independent concepts C1,…,Cl • From each concept Ci, 1 ≤ i ≤ l, generate a pair of word sequences (ui, vi) from L1* x L2*, according to the translation distribution, to lexicalize the concept in the two languages. Some concepts are not lexicalized in some languages, so one of ui and vi may be empty. • Bags: B1 and B2 collect the nonempty ui and vi, respectively • An assignment: {(i1,j1),…,(il,jl)}
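
A minimal sketch of the bag pair generation process above; the three distributions are passed in as plain Python callables, and the use of the empty string for a word that is not lexicalized is an illustrative convention, not part of the model as stated:

```python
def generate_bag_pair(pr_size, pr_concept, pr_lexicalize):
    """One draw from the bag-to-bag generative process (illustration only).

    pr_size()        -> bag/assignment size l
    pr_concept()     -> a language-independent concept C_i
    pr_lexicalize(C) -> a pair (u_i, v_i); either side may be "" if the
                        concept is not lexicalized in that language
    """
    l = pr_size()                                   # step 1: bag size l
    concepts = [pr_concept() for _ in range(l)]     # step 2: concepts C_1..C_l
    pairs = [pr_lexicalize(c) for c in concepts]    # step 3: lexicalize each concept
    bag1 = [u for u, _ in pairs if u]               # B1: nonempty u_i
    bag2 = [v for _, v in pairs if v]               # B2: nonempty v_i
    return bag1, bag2
```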

  19. Decomposition of Translation Model (5) • Second Stage: • The probability of generating a pair of bags (B1,B2)

  20. Decomposition of Translation Model (5) • Second Stage: • The probability of generating a pair of bags (B1,B2) • The lexicalization probability is zero for all concepts except one • The model is symmetric, unlike the models of Brown et al.
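
The probability formula itself is missing from the transcript; a hedged reconstruction, consistent with the one-to-one assumption, the identification of each concept with a single word pair, and the symmetric trans(u,v) distribution described above:

```latex
\Pr(B_1, B_2) \;=\; \sum_{l} \Pr(l)
  \sum_{\text{assignments } A \text{ of size } l}
  \;\prod_{(u,v) \in A} \operatorname{trans}(u, v)
```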

  21. The One-to-One Assumption • ui and vi may consist of at most one word each • A pair of bags containing m and n nonempty words can be generated by a process where the bag size l is anywhere between max(m,n) and m+n • Not as restrictive as it may appear: what if we extend the notion of a word to include units containing spaces?

  22. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  23. Reestimated Seq.-to-Seq. Trans. Model (1) • Variations on the theme proposed by Brown et al. • Conditional probabilities, but they can be compared to symmetric models if the latter are normalized marginally • Only co-occurrence information • EM • When information about segment lengths is not available

  24. Reestimated Seq.-to-Seq. Trans. Model (2) • Word Order Correlation Biases • In any bitext, the positions of words relative to the true bitext map correlate with the positions of their translations • The word order correlation bias is most useful when it has high predictive power • Absolute word positions - Brown et al. (1988) • A much smaller set of relative offset parameters - Dagan, Church, and Gale (1993) • Even more efficient parameter estimation using an HMM with some additional assumptions - Vogel, Ney, and Tillman (1996)

  25. Reestimated Bag-to-Bag Trans. Models • Another bag-to-bag model by Hiemstra (1996) • The same: one-to-one assumption • The difference: empty words are allowed in only one of the two bags, the one representing the shorter sentence • Iterative Proportional Fitting Procedure (IPFP) for parameter estimation • IPFP is sensitive to initial conditions • With the most advantageous initial conditions, it is more accurate than Model 1

  26. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  27. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  28. Parameter Estimation • Methods for estimating the parameters of a symmetric word-to-word translation model from a bitext • Interested in the probability trans(u,v): the probability of jointly generating the pair of words (u,v) • trans(u,v) cannot be directly inferred: it is unknown which words were generated together • The only thing observable in the bitext is cooc(u,v) (the co-occurrence count)

  29. Definitions • Link counts: links(u,v): a hypothesis about the number of times u and v were generated together • Link token: an ordered pair of word tokens • Link type: an ordered pair of word types • links(u,v) ranges over link types • trans(u,v) can be calculated using links(u,v)

  30. Definitions (continued) • score(u,v): the chance that u and v can ever be mutual translations; similar to trans(u,v), but more convenient for estimation • The relationship between trans(u,v) and score(u,v) can be direct (depending on the model)

  31. General outline for all Methods • Initialize the score parameter to a first approximation based only on cooc(u,v) • REPEAT: • Approximate links(u,v) based on score and cooc • Calculate trans(u,v); stop if there is only little change • Reestimate score(u,v) based on links and cooc

  32. EM-Algorithm! • Initialize the score parameter to a first approximation based only on cooc(u,v) (initialization) • REPEAT: • Approximate links(u,v) based on score and cooc (E-step) • Calculate trans(u,v); stop if there is only little change (M-step) • Re-estimate score(u,v) based on links and cooc (E-step)
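
A minimal sketch of this shared outer loop; estimate_links, reestimate_score, and init_score are placeholders for the method-specific pieces (for instance, Competitive Linking supplies the link counts in Method A), and the convergence test on trans is an assumption about what "only little change" means:

```python
def estimate_translation_model(cooc, init_score, estimate_links,
                               reestimate_score, max_iter=20, tol=1e-4):
    """EM-style outer loop shared by Methods A, B, and C (sketch)."""
    score = {pair: init_score(pair, cooc) for pair in cooc}          # initialization
    trans = {}
    for _ in range(max_iter):
        links = estimate_links(score, cooc)                          # E-step: link counts
        total = sum(links.values()) or 1.0
        new_trans = {pair: n / total for pair, n in links.items()}   # M-step: trans(u,v)
        if trans:
            delta = max(abs(new_trans.get(p, 0.0) - trans.get(p, 0.0))
                        for p in set(new_trans) | set(trans))
            if delta < tol:                                          # stop if little change
                trans = new_trans
                break
        trans = new_trans
        score = reestimate_score(links, cooc)                        # E-step: new score(u,v)
    return trans
```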

  33. EM: Maximum Likelihood Approach • Find the parameters that maximize the probability of the given bitext • Assignments cannot be decomposed, due to the one-to-one assumption (compare to Brown et al. 1993) • The exact MLE approach is infeasible • An approximation to EM is necessary

  34. Maximum a Posteriori • Evaluate Expectations using the single most probable assignment only (Maximum a posteriori (MAP) assignment)

  35. Maximum a Posteriori • Evaluate Expectations using the single most probable assignment (Maximum a posteriori (MAP) assignment) • l: number of Concepts, number of produced words

  36. Maximum a Posteriori • Evaluate Expectations using the single most probable assignment (Maximum a posteriori (MAP) assignment)

  37. Maximum a Posteriori • Evaluate Expectations using the single most probable assignment (Maximum a posteriori (MAP) assignment) • l, Pr(l): constant

  38. Maximum a Posteriori • Evaluate Expectations using the single most probable assignment (Maximum a posteriori (MAP) assignment)
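
Pulling slides 34-38 together (the equations themselves were lost in the transcript): since l and Pr(l) are treated as constants and, under the one-to-one assumption, each concept generates exactly one word pair, the MAP assignment reduces, up to constants, to maximizing the product of trans values, equivalently the sum of their logarithms; this is exactly the weighted-matching objective on the next slide. A hedged reconstruction:

```latex
A^{*} \;=\; \arg\max_{A} \Pr(B_1, B_2, A)
      \;=\; \arg\max_{A} \prod_{(u,v) \in A} \operatorname{trans}(u, v)
      \;=\; \arg\max_{A} \sum_{(u,v) \in A} \log \operatorname{trans}(u, v)
```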

  39. Bipartite Graph • Represent the bitext (each aligned segment pair) as a bipartite graph: the tokens u of one half on one side, the tokens v of the other half on the other side, with edge weights log(trans(u,v)) • Find a solution for the weighted maximum matching • Still too expensive to solve • The Competitive Linking Algorithm approximates it
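
For a single segment pair the exact weighted matching is small enough to compute directly; a minimal sketch using SciPy's assignment solver with log(trans) edge weights. This illustrates the objective only: it is not the Competitive Linking approximation the slides use, it links min(m, n) token pairs, and the floor value for unseen pairs is an arbitrary assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_assignment(tokens1, tokens2, trans, floor=1e-12):
    """Exact maximum-weight bipartite matching with weights log(trans(u, v))."""
    weights = np.array([[np.log(trans.get((u, v), floor)) for v in tokens2]
                        for u in tokens1])
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return [(tokens1[i], tokens2[j]) for i, j in zip(rows, cols)]
```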

  40. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  41. Method A: Competitive Linking • Step 1: Co-occurrence counts • Use “whole” table information (all cells of the co-occurrence contingency table) • Initialize score(u,v) to G²(u,v) (a log-likelihood-ratio statistic, similar to chi-square) • Good-Turing smoothing gives improvements
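
A minimal sketch of a G² (log-likelihood-ratio) score computed from the co-occurrence table; the way the marginals are defined here (co-occurrences of u with anything, of anything with v, and the total) is an assumption about the bookkeeping, and smoothing is omitted:

```python
import math

def g_squared(cooc_uv, count_u, count_v, total):
    """Log-likelihood-ratio statistic G^2 for one word-type pair (u, v)."""
    # 2x2 contingency table: (u, v), (u, not v), (not u, v), (not u, not v)
    observed = [cooc_uv,
                count_u - cooc_uv,
                count_v - cooc_uv,
                total - count_u - count_v + cooc_uv]
    row = [count_u, count_u, total - count_u, total - count_u]
    col = [count_v, total - count_v, count_v, total - count_v]
    g2 = 0.0
    for o, r, c in zip(observed, row, col):
        expected = r * c / total        # expected count under independence
        if o > 0:
            g2 += 2.0 * o * math.log(o / expected)
    return g2
```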

  42. Step 2: Estimation of link counts • The Competitive Linking algorithm is employed • A greedy approximation of the MAP approximation • Algorithm: • Sort all score(u,v) from the highest to the lowest • For each score(u,v) in order: • Link all co-occurring token pairs (u,v) in the bitext (if u is NULL, consider all tokens of v in the bitext linked to NULL, and vice versa) • One-to-one assumption: linked words cannot be linked again; remove all linked words from the bitext
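
A minimal per-segment sketch of the Competitive Linking loop described above; NULL handling and tie-breaking are omitted, and all names are illustrative:

```python
def competitive_linking(segment1, segment2, score):
    """Greedy one-to-one linking of a single aligned segment pair.

    segment1, segment2: lists of word tokens
    score: dict mapping (u, v) word-type pairs to association scores
    Returns linked token-position pairs (i, j), best scores first.
    """
    candidates = sorted(((score[(u, v)], i, j)
                         for i, u in enumerate(segment1)
                         for j, v in enumerate(segment2)
                         if (u, v) in score), reverse=True)
    linked1, linked2, links = set(), set(), []
    for _, i, j in candidates:
        if i not in linked1 and j not in linked2:   # one-to-one: link words only once
            links.append((i, j))
            linked1.add(i)
            linked2.add(j)
    return links

# Per-sentence use (cf. slide 46): accumulate link counts over the whole bitext.
# for seg1, seg2 in bitext:
#     for i, j in competitive_linking(seg1, seg2, score):
#         links_count[(seg1[i], seg2[j])] += 1
```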

  43. Example: Competitive Linking • [Diagram: a score matrix between tokens a, b, c, d of one half (u) and the tokens of the other half (v)]

  44. Competitive Linking • [Diagram: the same matrix with cells marked X as the highest-scoring pairs are linked]

  45. Competitive Linking • [Diagram: linking continues on the remaining tokens; more cells are marked X]

  46. Competitive Linking per sentence • [Diagram: two example sentence pairs; the links found in each pair increment the corresponding link counts, e.g. “… b a …” / “… c d …” gives links(a,c)++ and links(b,d)++, and “… a b …” / “… c d e …” gives links(a,d)++ and links(b,e)++]

  47. Building Lexicons • Introduction • Previous Work • Translation Model Decomposition • Reestimated Models • Parameter Estimation • Method A • Method B • Method C • Evaluation • Conclusion

  48. Method B: • “Most texts are not translated word-for-word” • Why is that a problem with Method A? • [Diagram: an example sentence pair “… a b x …” / “… c d e f …”]

  49. Method B: • “Most texts are not translated word-for-word” • Why is that a problem with Method A? • [Diagram: Competitive Linking applied to the pair “… a b x …” / “… c d e f …”; we are forced to connect (b,d)!]

  50. Method B: • After one iteration of Method A on 300k sentences of the Hansard corpus: • links = cooc: often, probably correct • links < cooc: rare, might be correct • links << cooc: often, probably incorrect
